## Trial Prediction Exercise

### Objective

We want to build a model that predicts whether a user will become a paying user of the Lingokids service once the trial expires. Remember that the trial lasts 7 days.
We have four sources of data that can be classified into two broad categories:
1. Subscription data: this includes three different data sources
    * Information related to onboarding: all the actions that the user takes to create their profile
    * Information related to the subscription paywall: this is where the user can opt to start a trial. It can be shown at different stages: it is always shown right after registration, but it can also be shown after reaching daily limits or accessing limited features. If the user agrees to the trial, it is considered a successful subscription and an invoice is charged after 7 days
    * Information related to billing: 7 days after a successful subscription, a bill is sent to the user. This means that the user has become a paying user (the event we want to predict)
2. Activity data: contains information about the activities that the user does within the app.

This is a "sequence to class" prediction problem: we have a sequence of events and we want to know whether it will end in a specific target outcome (the user becomes a paying user).

There are many approaches to this problem. In our case we are going to aggregate all the events that happen over time for each user and engineer the features that will feed the prediction model. Engineering these features requires a lot of data wrangling and cleaning. Remember that we want to predict as soon as possible whether a user will become a paying user, so we need to produce predictions from the very beginning of the trial. This may seem to contradict aggregating over time, but it can be done with a process that listens to the stream of events and updates the aggregated metrics of each user as events arrive. At regular intervals, in a batch manner, we can then evaluate the prediction model on the updated counters. Based on those predictions, decisions can be taken (send a discount or reminder if the probability of becoming a paying user is low, extend the trial, etc.).
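
The streaming idea described above can be sketched as follows. This is only a minimal illustration and not part of the pipeline used in this notebook: the `UserCounters` class, the event dictionaries and their field names are hypothetical, and a real system would persist the counters instead of keeping them in memory.

```python
from collections import Counter, defaultdict


class UserCounters:
    """Keeps running per-user aggregates that are updated one event at a time."""

    def __init__(self):
        # user_id -> Counter of event names seen so far
        self.event_counts = defaultdict(Counter)
        # user_id -> total seconds of activity seen so far
        self.total_duration = defaultdict(float)

    def update(self, event):
        """Consume a single event (a dict) from the stream."""
        user_id = event["user_id"]
        self.event_counts[user_id][event["event_name"]] += 1
        # events without a duration contribute 0 seconds
        self.total_duration[user_id] += event.get("duration", 0.0)

    def snapshot(self, user_id):
        """Return the current feature values for one user, ready for scoring."""
        features = dict(self.event_counts[user_id])
        features["total_duration"] = self.total_duration[user_id]
        return features


# Hypothetical stream of events arriving over time
stream = [
    {"user_id": "u1", "event_name": "subscription_enter"},
    {"user_id": "u1", "event_name": "activity", "duration": 30.0},
    {"user_id": "u2", "event_name": "subscription_enter"},
    {"user_id": "u1", "event_name": "activity", "duration": 12.5},
]

counters = UserCounters()
for event in stream:
    counters.update(event)

u1_features = counters.snapshot("u1")
print(u1_features)
```

A batch job could call `snapshot` for every active trial user at regular intervals and pass the resulting features to the trained model.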

1. <a href='#onboard_data'>Onboard data</a>
1. <a href='#subscription_data'>Subscription data</a>
1. <a href='#invoice_data'>Invoice data</a>
1. <a href='#activities_data'>Activities data</a>
1. <a href='#create_dataset_and_training'>Create dataset and training</a>
1. <a href='#evaluation'>Evaluation</a>
1. <a href='#interpretability'>Interpretability</a>
1. <a href='#conclusion'>Conclusion</a>
1. <a href='#future_work'>Future work</a>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from functools import reduce

from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 2000)
pd.set_option('max_colwidth', 200000)
import json
from sklearn.impute import SimpleImputer

from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_curve, plot_roc_curve, plot_precision_recall_curve
from sklearn.model_selection import KFold
import shap

fpath_activities = 'data/activities.tsv'
fpath_invoices = 'data/invoices.tsv'
fpath_onboard_events = 'data/onboarding_events.tsv'
fpath_subscription_events = 'data/subscription_events.tsv'

### <a id='onboard_data'>Onboard events data</a>

We suspect that the onboarding events do not carry much information for this specific prediction problem, mainly because onboarding is mandatory in order to use the application. However, it could be useful to know things like the level that the parent assigns to the child and whether the signup process was successful. (TODO: we did not include other features that could characterise the user, such as the signup_provider or the age of the child.)

In [None]:
# user_id	ID for a user.
# session_id	ID for a session. Those events who don't have a session_id associated are sent from our server, normally as confirmation
# event_at	Timestamp of event. Timezone is UTC, not locale timezone of device
# event_name	Name of the event. We capture different user behaviour by using different events when they perform certain actions.
#           - onboarding_home: Start of onboarding, First screen
#           - signup_level: Sent when a user has selected the english level.
#                - level: Level selected. Three possible levels 2, 4 or 6
#                - level_name:
#           - signup_age: birthday in epoch format
#           - signup_result: Event confirms registration.
#               - child_id: id assigned for a child
#               - success: boolean indicating if registering has been successful
#               - signup_provider. Method the user has used to register
# data	JSON of relevant data captured for this event. Each event_name has different information inside data (see Data tab)
# context	JSON of relevant data related to device, locale or other useful information, not related to specific action performed by user (see Context tab)
onboard_events_df = pd.read_csv(fpath_onboard_events, sep='\t')

In [None]:
onboard_events_df.head(5)

In [None]:
# set the types for the different columns with a function that will be applied across all dataframes in this exercise
def set_dtypes_for_df(df, date_columns, categorical_columns, json_columns):
    df[date_columns] = df[date_columns].astype('datetime64[ns]')
    df[categorical_columns] = df[categorical_columns].astype("category")
    df[json_columns] = df[json_columns].astype("string")
    return df


In [None]:
onboard_events_df = set_dtypes_for_df(onboard_events_df, ['event_at'], ['event_name'], ['data','context'])

In [None]:
# Most of the users only have 4 onboarding events which makes sense since they are the ones that need to be fulfilled to start using the application
onboard_events_df.groupby('user_id').count()['session'].value_counts()

In [None]:
# Let's define a function that will count the number of events per user and pivot them as columns
def column_count_values_and_pivot(df, column_group, column_count):
    df = df.groupby([column_group, column_count]).count().reset_index(
        level=column_count).pivot(columns=column_count, values=df.columns[-2])
    df.columns = [str(column_count) + '_' + str(x) for x in df.columns.to_list()]
    return df.fillna(0)
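
To make the count-and-pivot step easier to follow, here is a small sketch on a toy event log. The user ids and rows are made up, and the sketch uses `groupby(...).size().unstack()`, which counts rows directly instead of counting the non-null values of a particular column as the function above does, so treat it as an equivalent illustration rather than the same implementation.

```python
import pandas as pd

# Toy event log; the user ids and rows are made up for illustration
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "event_name": ["onboarding_home", "signup_level", "signup_level", "onboarding_home"],
})

# Count rows per (user, event) pair, then spread the event names into columns.
# Pairs that never occur are filled with 0.
counts = (
    events.groupby(["user_id", "event_name"])
    .size()
    .unstack(fill_value=0)
)
# Prefix the new columns with the name of the pivoted column
counts.columns = ["event_name_" + str(c) for c in counts.columns]
print(counts)
```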

In [None]:
onboard_events_count_df = column_count_values_and_pivot(onboard_events_df, 'user_id', 'event_name')
onboard_events_count_df

In [None]:
# Let's also extract the info out of the jsons in the different event types
# we will use an auxiliary function to extract the information that we need in the jsons
def append_json_information(df, json_column, fields):
    df.loc[:, json_column] = df[json_column].astype("string")
    for field in fields:
        df.loc[:, field] = df[json_column].apply(lambda x: json.loads(x)[field])        
    return df
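
The JSON extraction can be illustrated on a few raw strings. This is a sketch under assumptions: the payloads below (including the `level_name` value) are invented, and `extract_fields` is a hypothetical helper. Unlike the function above, which indexes with `[field]` and therefore raises a `KeyError` when a field is absent, this variant uses `dict.get` and returns `None` for missing fields.

```python
import json

# Hypothetical raw `data` payloads, as they might appear in a TSV column
raw_rows = [
    '{"level": 4, "level_name": "intermediate"}',
    '{"level": 2}',
    '{}',
]


def extract_fields(raw_json, fields):
    """Parse one JSON string and return the requested fields (None if absent)."""
    parsed = json.loads(raw_json)
    return {field: parsed.get(field) for field in fields}


extracted = [extract_fields(row, ["level"]) for row in raw_rows]
print(extracted)
```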

In [None]:
# take the success field out of the data JSON and eliminate any duplicates by grouping the user id
onboard_success_df = append_json_information(onboard_events_df[onboard_events_df['event_name']=='signup_result'], 'data', ['success']).groupby('user_id').agg({'success' : 'first'}).rename(columns={'success' : 'onboard_success'})
# same with the english level of the child
onboard_level_df = append_json_information(onboard_events_df[onboard_events_df['event_name']=='signup_level'], 'data', ['level']).groupby('user_id').agg({'level' : 'first'}).rename(columns={'level' : 'onboard_level'})

In [None]:
# Finally, put together all the dfs to generate the features of the onboard_events
dfs = [onboard_events_count_df, onboard_level_df, onboard_success_df]
onboard_events_features = reduce(lambda left, right: left.join(right, how='outer'), dfs )

In [None]:
# Any NA's after the join?
onboard_events_features.isnull().sum()

622 users did not complete the onboarding successfully. It makes sense to ignore these users, since they never reach the 7-day trial, so we drop them.

In [None]:
onboard_events_features = onboard_events_features.dropna()

### <a id='subscription_data'>Subscription data</a>

These events relate to the paywall that is shown to the user, where they decide whether to start a trial. They contain valuable information because they show how many times the user postpones the trial, or whether they start the trial straight after the onboarding process. We will count the number of subscription_enter events that the user sees, the source of these events, and whether the subscription was successful.

In [None]:
# user_id	ID for a user.
# session_id	ID for a session. Those events who don't have a session_id associated are sent from our server, normally as confirmation
# event_at	Timestamp of event. Timezone is UTC, not locale timezone of device
# event_name	Name of the event. We capture different user behaviour by using different events when they perform certain actions.
#       - subscription_enter: When a user enters in paywall.
#           - source: Which user flow has triggered. postonboarding (shown when user finishes on boarding) or launcher
#           - child_id: The id of the child
#           - platform
#       - subscription_successful: Confirm subscription.
#           - subscription_id: id for subscription
#           - price: the price of the subscription
#           - currency:
#           - payment platform
# data	JSON of relevant data captured for this event. Each event_name has different information inside data (see Data tab)
# context	JSON of relevant data related to device, locale or other useful information, not related to specific action performed by user (see Context tab)
subscription_events_df = pd.read_csv(fpath_subscription_events, sep='\t')

In [None]:
subscription_events_df.head(5)

In [None]:
subscription_events_df = set_dtypes_for_df(subscription_events_df, ['event_at'], ['event_name'], ['data','context'])

In [None]:
# Many users have only 2 subscription events, which makes sense since they are the ones that need to be fulfilled to start using the application (subscription_enter[postonboarding] followed by subscription_successful)
subscription_events_df.groupby('user_id').count()['event_name'].value_counts()

We count how many times each user triggers each type of subscription event.

In [None]:
subscription_events_count_df = column_count_values_and_pivot(subscription_events_df, 'user_id', 'event_name')

In [None]:
subscription_events_count_df

In [None]:
# Some users have more than one subscription_successful event, which does not really make sense. We cap values greater than one at one
subscription_events_count_df['event_name_subscription_successful'].value_counts()

In [None]:
subscription_events_count_df.loc[subscription_events_count_df['event_name_subscription_successful']>1, 'event_name_subscription_successful'] = 1

In [None]:
subscription_events_count_df['event_name_subscription_successful'].value_counts()

In [None]:
# Before continuing: in some events the source field of the JSON is an empty dict, so we filter those rows out
subscription_enter_source_df = append_json_information(subscription_events_df[subscription_events_df['event_name']=='subscription_enter'], 'data', ['source'])
subscription_enter_source_df = subscription_enter_source_df[~(subscription_enter_source_df['source'] == {})]

In [None]:
subscription_enter_source_df

In [None]:
subscription_enter_source_df = column_count_values_and_pivot(subscription_enter_source_df, 'user_id', 'source')

In [None]:
subscription_enter_source_df.columns = ['subscription_enter_' + str(col) for col in subscription_enter_source_df.columns]

In [None]:
subscription_enter_source_df

In [None]:
# We will extract the subscription_id as we will need it to correlate it with invoices
subscription_id_df = append_json_information(subscription_events_df[subscription_events_df['event_name']=='subscription_successful'], 'data', ['subscription_id'])


In [None]:
# There are users with two subscriptions. Not sure if this is intended, so we are going to just keep the first time they subscribed
subscription_id_df = subscription_id_df.sort_values('event_at').drop_duplicates(subset='user_id', keep="first")
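
The "keep the first subscription per user" step can be checked on a toy table. The ids and dates below are invented for the illustration; only the `sort_values` and `drop_duplicates` pattern mirrors the cell above.

```python
import pandas as pd

# Toy subscriptions: user u1 subscribed twice, on purpose listed out of order
subs = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "subscription_id": ["s-late", "s-only", "s-early"],
    "event_at": pd.to_datetime(["2020-01-05", "2020-01-02", "2020-01-01"]),
})

# Sort chronologically so that keep="first" retains the earliest row per user
first_subs = subs.sort_values("event_at").drop_duplicates(subset="user_id", keep="first")
first_by_user = first_subs.set_index("user_id")["subscription_id"]
print(first_by_user)
```

Without the sort, `drop_duplicates` would keep whichever row happens to come first in the file, which is not necessarily the earliest subscription.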

In [None]:
subscription_id_df = subscription_id_df.set_index('user_id')[['subscription_id']]


In [None]:
dfs = [subscription_events_count_df, subscription_enter_source_df, subscription_id_df]
subscription_events_features = reduce(lambda left, right: left.join(right, how='outer'), dfs)

In [None]:
subscription_events_features

### <a id='invoice_data'>Invoice data</a>

The invoice data provides our target variable. We want to predict whether the user is going to generate an invoice; in other words, whether the user is going to become a customer of the Lingokids service.

In [None]:
invoices_df = pd.read_csv(fpath_invoices, sep='\t')

In [None]:
invoices_df

In [None]:
# There is not much preprocessing for the invoices, just converting the dates to datetime, renaming the columns and dropping duplicates
invoices_df['purchased_at'] = invoices_df['purchased_at'].astype('datetime64[ns]')
invoices_df = invoices_df.sort_values('purchased_at').drop_duplicates(subset='subscription_id', keep="first")
invoices_df.columns = ['invoices_' + str(col) for col in invoices_df.columns]


In [None]:
invoices_df

In [None]:
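# We can already merge it with the subscription data
# merge resets the index, so we save it and restore it afterwards
# (this relies on the left merge keeping the row order and row count, which holds because the invoices were deduplicated by subscription_id)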
ix = subscription_events_features.index
subscription_events_features = subscription_events_features.merge(invoices_df, left_on='subscription_id', right_on='invoices_subscription_id', how='left')
subscription_events_features = subscription_events_features.set_index(ix)
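
An alternative way of keeping the user index through a merge, shown here on toy data, is to turn the index into a column first and restore it afterwards. The frames below are invented; only the column names `subscription_id`, `invoices_subscription_id` and `invoices_purchased_at` follow the notebook.

```python
import pandas as pd

# Toy features indexed by user; u3 has no subscription id at all
features = pd.DataFrame(
    {"subscription_id": ["s1", "s2", None]},
    index=pd.Index(["u1", "u2", "u3"], name="user_id"),
)
# Toy invoices: only subscription s1 was ever billed
invoices = pd.DataFrame({
    "invoices_subscription_id": ["s1"],
    "invoices_purchased_at": pd.to_datetime(["2020-01-08"]),
})

merged = (
    features.reset_index()  # carry user_id through the merge as a normal column
    .merge(invoices, left_on="subscription_id",
           right_on="invoices_subscription_id", how="left")
    .set_index("user_id")   # restore the user index
)
print(merged)
```

This variant does not depend on the merged frame having the same number of rows as the original one, at the cost of an extra `reset_index` call.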

In [None]:
subscription_events_features

### <a id='activities_data'>Activities data</a>

The activities data is also very useful. It allows us to measure how the user interacts with the app. Do they play a lot? Do they complete a lot of activities? Where do they play from?

In [None]:
# user_id	ID for a user.
# session	ID for a session.
# event_at	Timestamp of event. Timezone is UTC, not locale timezone of device
# source	Origin where activity has been launched
# activity	Id of activity
# activity_name	Name of activity
# type	Type of activity
# subtype	Subtype of activity
# child_id	Id assigned for a child. A user may have more than one child associated
# duration	Number of seconds spent on activity
# completed	Activity has been completed? A user may exit without completing an activity
# game_completed	Same as completed but only for activities whose type is 'game'
# downloading_time	Seconds spent in download info needed
# loading_time	Seconds spent in loading the activity in the app
# replay_times	Number of times activity has been played so far
# os	JSON with information relevant to operating system of device (see Context below)
# location	JSON with information relevant to location of user (see Context below)
# timezone	Timezone. Needed if you need to translate timestamp UTC to timestamp locale
# locale	Language of device
# device	JSON with information about device (see Context below)
activities_df = pd.read_csv(fpath_activities, sep='\t')

In [None]:
activities_df.head(5)

In [None]:
activities_df = set_dtypes_for_df(activities_df, ['event_at'], ['source', 'activity', 'activity_name', 'type', 'subtype', 'completed', 'game_completed', 'timezone', 'locale'], ['os','location'])

In [None]:
activities_df.dtypes

In [None]:
# There is one row that has only NaN values
activities_df[activities_df['location'].isnull()]

In [None]:
# Eliminate this one
activities_df = activities_df[~activities_df['location'].isnull()]

In [None]:
# Get out of the jsons information that could be important like the name of the os and the location of the activity
activities_df['activity_os_name'] = append_json_information(activities_df, 'os', ['name'])['name'].astype('category')
activities_df['activity_location'] = append_json_information(activities_df, 'location', ['country'])['country'].astype('category')

In [None]:
activities_df

In [None]:
# We are now ready to aggregate all these events by user. Let's check null values first
activities_df.isnull().sum()

Missing values in game_completed and subtype are not a concern, because we won't use these columns in our prediction model: subtype has a high cardinality (curse of dimensionality) and game_completed is a duplicate of completed.

In [None]:
# nulls for source. Which ones?
activities_df[activities_df['source'].isna()]

In [None]:
# Impute them with the most common values
imp = SimpleImputer(strategy="most_frequent")
activities_df['source'] = imp.fit_transform(activities_df[['source']])[:, 0]


In [None]:
# nulls for location. Which ones?
# They are in Asia and Africa timezones, but we cannot deduce the country from that. We drop them, as the country is a very important piece of information describing user behaviour
activities_df = activities_df[~activities_df['activity_location'].isna()]

In [None]:
activities_df.isnull().sum()

In [None]:
# We are going to drop the following features: activity_name because it is a duplicate of activity, subtype because it is a subset of type,
# game_completed because it is a duplicate of completed, and os and location (JSON format) because the fields we need were already extracted. We assume that session, child_id, timezone and locale do not influence the result
columns_to_drop = ['session', 'activity_name', 'subtype', 'child_id', 'game_completed', 'os', 'location', 'timezone', 'locale', 'name', 'country']
activities_df = activities_df.drop(columns_to_drop, axis=1)

In [None]:
activities_df

In [None]:
# We start aggregating all of this information by user id
# First the numeric values
numeric_fields_activities_df = activities_df.groupby('user_id').agg({'duration': 'sum',
                                                                    'downloading_time': 'sum',
                                                                    'loading_time': 'sum',
                                                                    'replay_times': 'sum'
                                                                    })
numeric_fields_activities_df.columns = ['activities_' + col for col in numeric_fields_activities_df.columns]

In [None]:
# Then the categorical fields (location and OS name), which are supposed to be unique per user, so we keep just one
unique_fields_activities_df = activities_df.groupby('user_id').agg({'activity_location': 'first',
                                                                   'activity_os_name': 'first'
                                                                   })

In [None]:
unique_fields_activities_df

We create counters for the source of the activity. We suspect that the pastActivities source is a good predictor, as it means that a previous activity was of interest to the user. The same goes for the number of activities launched.

In [None]:
source_activity_count_df = column_count_values_and_pivot(activities_df, 'user_id', 'source')
source_activity_count_df.columns = ['activity_' + col for col in source_activity_count_df.columns]
source_activity_count_df

The type of activity could also be important. Users that experience the full range of activities might be more attracted to the app.

In [None]:
type_activity_count_df = column_count_values_and_pivot(activities_df, 'user_id', 'type')
type_activity_count_df.columns = ['activity_' + col for col in type_activity_count_df.columns]
type_activity_count_df

The number of activities completed or interrupted is also an important insight. Do they play until the end and complete activities or just get tired of it?

In [None]:
completed_activity_count_df = column_count_values_and_pivot(activities_df, 'user_id', 'completed')
completed_activity_count_df.columns = ['activity_' + col for col in completed_activity_count_df.columns]
completed_activity_count_df

In [None]:
# join all the features into the activity features
dfs = [completed_activity_count_df, type_activity_count_df, source_activity_count_df, unique_fields_activities_df, numeric_fields_activities_df]
activity_features_df = reduce(lambda left, right: left.join(right, how='outer'), dfs)

In [None]:
activity_features_df

### <a id='create_dataset_and_training'>Create Dataset and Training</a>

To create the dataset we join the activity features, the onboarding event features and the subscription event features. Remember that the subscription event features also contain the information on whether the user generated an invoice.

In [None]:
dfs = [activity_features_df, onboard_events_features, subscription_events_features ]
dataset = reduce(lambda left, right: left.join(right, how='outer'), dfs)

In [None]:
dataset

Now that we have joined the datasets, let's check the NaNs again, as the join can leave some fields with NaN values.

In [None]:
dataset.isnull().sum()

Several things to take into account when dealing with the NaNs:
- Activity counts will be filled with 0, as a NaN means that the user never played that particular activity. Same with duration, downloading time, loading time, etc.
- Subscription enter counts will be filled with the most frequent values, as we assume that the user follows the normal subscription procedure
- NaNs in the invoice columns will be turned into our target variable. A NaN means that there is no invoice and the user did not become a paying customer.
- Most of the users subscribed successfully. Two of them did not, and we will drop them
- For the users that have no onboarding data, we will impute the most common values and assume that they followed the same onboarding process as most users
- The remaining categorical features, such as the OS name, the location or the child level from the onboarding process, will be imputed with the most common values
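
The three main rules from the list above (zero for counts, most frequent value for categories, target derived from the invoice date) can be sketched on a toy frame. The rows are invented, and the sketch uses plain pandas (`fillna`, `mode`, `notna`) rather than the `SimpleImputer` used in the cells below, so it illustrates the idea and not the exact calls.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "activity_type_game": [3.0, np.nan, 1.0],                       # count feature
    "activity_os_name": ["iOS", None, "iOS"],                        # categorical feature
    "invoices_purchased_at": pd.to_datetime(["2020-01-08", None, None]),
}, index=["u1", "u2", "u3"])

# Counts: a missing value means the user never did that activity
toy["activity_type_game"] = toy["activity_type_game"].fillna(0)

# Categories: fall back to the most frequent observed value
most_frequent_os = toy["activity_os_name"].mode().iloc[0]
toy["activity_os_name"] = toy["activity_os_name"].fillna(most_frequent_os)

# Target: an invoice date exists only for users who were billed
toy["paying_customer"] = toy["invoices_purchased_at"].notna().astype(int)
print(toy)
```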

In [None]:
# Many users didn't go through the onboarding process (2714 users). Let's assume that they went through the same onboarding process as the majority of users and impute most frequent values
features_to_impute = ['event_name_onboarding_home','event_name_signup_age', 'event_name_signup_level', 'event_name_signup_result', 'onboard_level', 'onboard_success']
imp = SimpleImputer(strategy="most_frequent")
dataset[features_to_impute] = imp.fit_transform(dataset[features_to_impute])

In [None]:
# Activities counts will be filled with 0 as a NaN value means absence of playing that particular activity. Same with duration, downloading time, loading time, etc... However we must impute something into the activity os name and the location
features_to_impute = ['activity_os_name', 'activity_location']
imp = SimpleImputer(strategy="most_frequent")
dataset[features_to_impute] = imp.fit_transform(dataset[features_to_impute])

In [None]:
# drop those two users without a successful registration (apparently they just did the onboarding, and we are interested in knowing whether a registered user is going to become a paying user)
dataset = dataset[~dataset['event_name_subscription_successful'].isna()]

In [None]:
# some users do not have a subscription_id. They do not have a subscription_successful event either, so we drop them too
dataset = dataset[~dataset['subscription_id'].isna()]

In [None]:
dataset.isnull().sum()

In [None]:
features_to_impute = ['subscription_enter_source_launcher','subscription_enter_source_parents', 'subscription_enter_source_postonboarding', 'subscription_enter_source_stickeralbum', 'subscription_enter_source_upsell_download_modal']
imp = SimpleImputer(strategy="most_frequent")
dataset[features_to_impute] = imp.fit_transform(dataset[features_to_impute])

In [None]:
# let's create the output variable. If there is no purchase date for the invoice (invoices_purchased_at is NaN) then it's not a paying user
dataset['paying_customer'] = 1
dataset.loc[dataset['invoices_purchased_at'].isna(), 'paying_customer'] = 0

In [None]:
dataset

In [None]:
# And as a final step, drop those features that do not bring any value and fill the remaining NaNs (the numeric counts) with 0
features_to_drop = ['invoices_subscription_id', 'subscription_id', 'invoices_purchased_at', 'invoices_refunded_invoice']
dataset = dataset.drop(features_to_drop, axis=1)
# fillna returns a new dataframe, so the result has to be assigned back
dataset = dataset.fillna(0)

In [None]:
dataset.dtypes

In [None]:
# Set the types of the dataset
categorical_fields = ['activity_location', 'activity_os_name', 'onboard_success', 'onboard_level', 'paying_customer']
dataset[categorical_fields] = dataset[categorical_fields].astype("category")
# the onboard level is renamed to string labels so that it is not mistaken for a numeric value (this assumes the categories are ordered as 2, 4, 6)
dataset['onboard_level'] = dataset['onboard_level'].cat.rename_categories(["low", "medium", "high"])

Let's train the model. We are going to use CatBoost. Ensemble methods are a good way of reducing noise, bias and variance. Boosting is one of these methods, and CatBoost gives us an implementation that works well out of the box and outputs an interpretable model. We also do not have to deal with normalisation or correlated features, so the data preprocessing is reduced even further.

In [None]:
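from sklearn.metrics import roc_auc_score

X = dataset.drop(['paying_customer'], axis=1)
categorical_indexes = [X.columns.get_loc(c) for c in X.select_dtypes('category').columns]
Y = dataset['paying_customer']
kfold = KFold(n_splits=2)
roc_auc_scores = []
for train_index, test_index in kfold.split(X):
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
    clf = CatBoostClassifier()
    clf.fit(x_train, y_train, cat_features=categorical_indexes)
    # score each fold on its held-out half so that the cross-validation result is actually recorded
    roc_auc_scores.append(roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1]))
# note: clf, x_train/x_test and y_train/y_test from the last fold are reused in the cells below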

### <a id='evaluation'>Evaluation</a>

This is a binary classification task (the customer will pay or not). One way to evaluate it is through a ROC curve, which shows the performance of a classifier across the different probability thresholds that can be used to assign an instance to one of the two classes. We obviously want to get our predictions right (true positives), but we don't want to wrongly classify a user as a paying customer (false positives). The balance between the two depends on the use case. For this kind of use case a false positive does not have severe consequences, so we can afford a relatively high number of them. In other use cases, such as cancer detection, the cost of a wrong prediction is much higher and the threshold has to be chosen far more carefully.
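
To make the threshold sweep behind a ROC curve concrete, here is a pure-Python sketch on four made-up predictions. The functions `roc_points` and `auc` are hypothetical helpers written for this illustration; in practice a library implementation would be used.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for every threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Every distinct score is tried as a threshold, from strictest to loosest
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels (1 = became a paying user) and predicted probabilities
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

In this toy case three of the four (positive, negative) pairs are ranked correctly, which matches the area of 0.75 computed by the sketch.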

In [None]:
plot_roc_curve(clf, x_test, y_test) 

Our classifier has an AUC of 0.65. That means it is better than a random guess (an AUC of 0.5), which can already provide value. It could be improved by also considering the time dimension (did users play a lot during the first few days, or consistently during the trial?). We leave that as future work.

Precision-recall curves are also used, especially when working with unbalanced datasets (not our case, see below). In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many of the truly relevant results are returned.
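
As a small worked example of the two definitions, the sketch below computes precision and recall at two thresholds on made-up predictions. `precision_recall` is a hypothetical helper, and the values returned when a denominator is zero are an assumed convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if p == 0 and y == 1)
    # Assumed conventions for the degenerate cases with an empty denominator
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Made-up labels (1 = became a paying user) and predicted probabilities
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

strict = precision_recall(y_true, scores, threshold=0.85)
loose = precision_recall(y_true, scores, threshold=0.5)
print(strict, loose)
```

The strict threshold only flags the most confident user (perfect precision, half of the paying users found), while the loose one finds every paying user at the cost of one false positive.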

In [None]:
plot_precision_recall_curve(clf, x_test, y_test)

In [None]:
# the problem is not unbalanced
dataset['paying_customer'].value_counts()

### <a id='interpretability'>Interpretability</a>

Interpreting the model is also important. It can give us insights into which factors matter most for converting trial users into paying customers and where to focus. For this study we will use SHAP values, which work nicely with the CatBoost library (the `ShapValues` feature importance type).

In [None]:
shap_values = clf.get_feature_importance(Pool(x_train, label=y_train,cat_features=categorical_indexes),
                                                                     type="ShapValues")
# the last column holds the expected value (bias term), not a feature contribution, so we drop it
shap_values = shap_values[:,:-1]
shap.summary_plot(shap_values, x_train)

We can see the impact of the most important attributes in the chart. Red points correspond to high values of the feature. Points to the left lower the predicted probability of becoming a paying user, and points to the right raise it. Some insights:
- Surprisingly, users that are shown the paywall after onboarding are less likely to become paying customers. This could be misleading: very few users are not shown the paywall after onboarding, so the effect might just be an artefact. More data will clarify this point
- The more the user plays, the more likely they are to become a paying customer. Same with loading time
- Paywalls shown in the parents area have a bigger impact on turning a user into a paying one
- Users that play past activities are more likely to become customers.
- Onboard level also has an impact. It seems that users who put their children into the higher levels are more likely to become paying customers
- Users from some countries are more likely to pay for the app
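
To give some intuition for what a SHAP value is, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. Everything in it is an assumption made for illustration: `toy_model`, its coefficients, the feature names and the all-zero baseline are invented, and CatBoost and the shap library rely on much more efficient tree-specific algorithms with a different reference point.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            current[name] = instance[name]  # reveal this feature's real value
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical score: playing time helps, postonboarding paywall views hurt
    return 0.3 + 0.01 * x["duration_minutes"] - 0.05 * x["postonboarding_views"]


baseline = {"duration_minutes": 0, "postonboarding_views": 0}
user = {"duration_minutes": 20, "postonboarding_views": 2}

contributions = shapley_values(toy_model, user, baseline)
print(contributions)
```

A positive contribution corresponds to a point on the right of the summary plot and a negative one to a point on the left; the contributions add up to the difference between the prediction for the user and the prediction for the baseline.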

See more plots below to understand better the impact of the variables

We also include a correlation matrix. The highest correlations with the target are related to app usage: more activities completed and a longer usage duration.

In [None]:
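# work on a copy so that the target in `dataset` keeps its categorical dtype for the plots below
cor_dataset = dataset.copy()
cor_dataset['paying_customer'] = cor_dataset['paying_customer'].astype('float')
# keep the numeric columns only and compute the correlation matrix once
corr_matrix = cor_dataset.select_dtypes('number').corr()
corr_matrix[corr_matrix['paying_customer'] > 0.1]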

In [None]:
def filter_by_top_n(data, n, category_name):
    top_n_categories = data[category_name].value_counts().nlargest(n).index
    # copy to avoid modifying a slice of the original dataframe
    result_df = data[data[category_name].isin(top_n_categories)].copy()
    # drop the categories that were filtered out so they do not show up as empty bars
    result_df[category_name] = result_df[category_name].cat.remove_unused_categories()
    return result_df


def plot_top_n(data, n, category):
    return sns.countplot(x=category, data=filter_by_top_n(data, n, category), hue='paying_customer')

In [None]:
# Users of some countries are more likely to pay for the app
plot_top_n(dataset, 10, 'activity_location')

In [None]:
sns.countplot(data=dataset, x='subscription_enter_source_postonboarding', hue='paying_customer' )

In [None]:
# Clearly the duration is bigger with paying customers
sns.barplot(x="paying_customer", y="activities_duration", data=dataset)

In [None]:
sns.countplot(data=dataset, x='activity_os_name', hue='paying_customer' )

In [None]:
sns.countplot(data=dataset, x='subscription_enter_source_parents', hue='paying_customer' )

In [None]:
# zoom into the users who were shown the paywall at least once in the parents section. They have a greater conversion rate
sns.countplot(data=dataset[dataset['subscription_enter_source_parents']> 0], x='subscription_enter_source_parents', hue='paying_customer' )

In [None]:
# Children in higher levels are more likely to become paying users
sns.countplot(data=dataset, x='onboard_level', hue='paying_customer' )

In [None]:
sns.barplot(x="paying_customer", y="activity_source_pastActivities", data=dataset)

In [None]:
sns.barplot(x="paying_customer", y="event_name_onboarding_home", data=dataset)

In [None]:
sns.barplot(x="paying_customer", y="activity_type_game", data=dataset)

### <a id='conclusion'>Conclusion</a>

In most data science projects, a great deal of the time is spent on preparing the data. This was no exception: there was a lot of data massaging and wrangling involved in engineering features out of the dataset, which took most of the time of this exercise.
There are indeed several factors that influence whether the user will become a paying customer. In general, greater usage times and more activities completed make it more likely that a user becomes a paying customer. Some factors, like the location or information about the onboarding process, can also provide immediate information, without waiting for the customer to use the app further. The OS is also an important factor. Finally, some paywalls are more effective than others, especially those shown in the parents section.

### <a id='future_work'>Future Work</a>

Note that here we are aggregating the events for each customer over the whole trial. We could instead aggregate these events for day 1, day 2, etc. of the trial, as the impact of a variable on day 1 does not have to be the same as in the last days of the trial.
Another point to improve is the model. Since this is a sequence prediction problem, models like LSTMs could be more effective, since they can take the order of the events into account instead of just aggregating them without considering their sequence.
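
The per-day aggregation mentioned above could look roughly like this. It is a sketch under assumptions: the `daily_durations` helper, the trial start time and the event tuples are all hypothetical, and a day is counted as a 24-hour window starting at the moment the trial begins.

```python
from datetime import datetime

TRIAL_DAYS = 7


def daily_durations(trial_start, events):
    """Sum activity duration per day of the trial (index 0 is the first day)."""
    per_day = [0.0] * TRIAL_DAYS
    for event_at, duration in events:
        # whole 24-hour periods elapsed since the trial started
        day = (event_at - trial_start).days
        if 0 <= day < TRIAL_DAYS:
            per_day[day] += duration
    return per_day


trial_start = datetime(2020, 1, 1, 10, 0)
events = [
    (datetime(2020, 1, 1, 10, 5), 120.0),
    (datetime(2020, 1, 1, 18, 0), 60.0),
    (datetime(2020, 1, 3, 9, 0), 30.0),  # less than 48 hours after the start, so second day
    (datetime(2020, 1, 9, 9, 0), 45.0),  # more than 7 days after the start, ignored
]

per_day = daily_durations(trial_start, events)
print(per_day)
```

Each position of the resulting list could become its own feature, which would let the model weigh early and late activity differently.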