# Directing customers to subscription products through app behaviour analysis

In today's market, many companies offer a free app alongside a paid version with additional features. An example of this is YouTube Red. Since marketing is always costly, it is beneficial to know exactly whom to target with offers and promotions.

- **Market:** The target audience is customers who use the company's free product. In this case study, this refers to users who downloaded and used the free app.
<br>
<br>
- **Product:** Paid memberships often provide enhanced versions of the features already available for free, alongside new features. For example, YouTube Red allows you to leave the app while the audio from the video is still playing.
<br>
<br>
- **Goal:** The aim of the model is to predict which users will not subscribe to the paid membership, so that greater marketing effort can go into converting them into paid users. This group of people can be referred to as the 'persuadables'. The term 'persuadables' was used during the Brexit campaign by data scientists who focused on targeting voters deemed to have a Brexit voting probability of around 50% ± p%, where p% was a pre-agreed threshold (e.g. 10%). The idea was that voters hovering around 50% (on the fence about voting for Brexit) could be pushed into making a firm decision to vote for Brexit.
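
As a minimal sketch of the 'persuadables' idea, assume a fitted classifier has already produced an enrolment probability for each user. The band is then simply the set of users whose probability lies within 0.5 ± p. The probabilities, the user ids and the `select_persuadables` helper below are hypothetical and only illustrate the selection step.

```python
import pandas as pd

def select_persuadables(probabilities: pd.Series, p: float = 0.10) -> pd.Series:
    """Return the entries whose probability lies within 0.5 +/- p (inclusive)."""
    lower, upper = 0.5 - p, 0.5 + p
    return probabilities[probabilities.between(lower, upper)]

# Hypothetical enrolment probabilities, indexed by a made-up user id.
probabilities = pd.Series([0.05, 0.42, 0.55, 0.61, 0.93],
                          index=[101, 102, 103, 104, 105])

persuadables = select_persuadables(probabilities, p=0.10)
print(persuadables.index.tolist())
```

With a scikit-learn classifier, such probabilities would typically come from `predict_proba` rather than from the hard 0/1 labels returned by `predict`.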

**Data Source:** https://www.kaggle.com/biphili/customer-behavior-app-data-analysis

# Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil import parser # This is for the date and time fields

# Importing the dataset

In [None]:
ds = pd.read_csv('appdata10.csv')

# Visualising the dataset

In [None]:
ds.head()

In [None]:
# We observe that the only column with missing data is the enrolled date.

sns.heatmap(ds.isnull(), yticklabels = False, cbar = False, cmap = 'Blues' )

In [None]:
# We observe that the hour column is missing from the summary because it is stored as text rather than as a number.

ds.describe()

In [None]:
ds.info()

In [None]:
# Keep the two hour characters of the string and convert them to an integer.
ds.hour = ds.hour.str.slice(1,3).astype(int)

In [None]:
ds.head()

In [None]:
ds1 = ds.drop(columns = ['user','screen_list','enrolled_date','first_open','enrolled'])

In [None]:
ds1.head()

In [None]:
ds1.shape

In [None]:
ds1_columns = list(ds1.columns)

In [None]:
ds1_columns

In [None]:
# Select dayofweek, hour and the binary columns by position.
ds1_column_names = [ds1_columns[i] for i in [0,1,4,5,6]]

In [None]:
ds1_column_names

In [None]:
plt.suptitle('Countplot of Numerical Columns', fontsize = 20)
for i in ds1_column_names:
    plt.title(i)
    # Pass the column as x; recent seaborn versions treat a positional argument as data.
    sns.countplot(x = ds1[i])
    plt.show()

In [None]:
plt.hist(ds.age)

In [None]:
plt.hist(ds.numscreens)

# Correlation Analysis

In [None]:
sns.heatmap(ds1.corr(), annot = True)

In [None]:
# Observations/Interpretation
# The later the day of the week, the more likely users are to enrol (but the correlation is very weak, so we may not consider this).
# The earlier the hour of the day, the more likely users are to enrol.
# The younger the user, the more likely they are to enrol.
# The higher the number of screens, the more likely users are to enrol.
# Users who played the minigame are more likely to enrol.
# Users who used the premium features are less likely to enrol.
# Users who liked the app are less likely to enrol.

ds1.corrwith(ds.enrolled).plot.bar(figsize = (20,10), title = 'Correlation with response variable',
                                   fontsize = 15, rot = 45, grid = True,
                                   color = ['Blue','Green','Red','Orange','Purple','Brown','Black'])

**Below is my favourite correlation plot**

In [None]:
# Correlation Matrix

sns.set(style = 'white', font_scale = 1.3) # Builds the background

# Compute the correlation matrix
corr = ds1.corr() # Creating a 2D array of each correlation feature to each other

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype = bool) # np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True # Hides the upper triangle; the matrix is symmetric, so the lower triangle is enough

# Set up the matplotlib figure
fig, axes = plt.subplots(figsize = (12,12)) # Size of the plot
fig.suptitle("Correlation Matrix", fontsize = 40) # Title

# Generate a custom diverging colourmap

cmap = sns.diverging_palette(220, 10, as_cmap = True) # Colouring

# Draw the heatmap with the mask and correct aspect ratio

sns.heatmap(corr, mask = mask, cmap = cmap, vmax = 0.3, center = 0, 
            square = True, linewidths = 0.5, cbar_kws = {'shrink': 0.5})

# As we observe no strong correlation between any variables (linear dependence), we can conclude low multicollinearity.
# We will therefore move onto building the model.

**Note:** We wish to set a time limit within which an enrolment is counted. This will allow us to validate the model within a given timeframe on future datasets.
For instance, if we set the limit to one week, we only need to wait one week to validate the accuracy of our model. To decide on a suitable time period, we will visualise the distribution of the response times.
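
One way to compare candidate time limits is to compute the share of enrolments that fall within each cut-off. The sketch below uses a tiny made-up frame that reuses the `first_open` and `enrolled_date` column names. The timestamps and the candidate cut-offs are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: three enrolled users and one who never enrolled.
toy = pd.DataFrame({
    'first_open':    ['2013-01-01 10:00:00', '2013-01-01 12:00:00',
                      '2013-01-02 09:00:00', '2013-01-03 08:00:00'],
    'enrolled_date': ['2013-01-01 11:00:00', '2013-01-02 18:00:00',
                      '2013-01-09 09:00:00', None],
})

# pd.to_datetime turns missing values into NaT instead of raising.
first_open = pd.to_datetime(toy['first_open'])
enrolled_date = pd.to_datetime(toy['enrolled_date'])

# Hours between first open and enrolment (NaN for users who never enrolled).
hours = (enrolled_date - first_open).dt.total_seconds() / 3600
enrolled_hours = hours.dropna()

# Share of enrolments captured by each candidate time limit.
for limit in (24, 48, 168):
    share = (enrolled_hours <= limit).mean()
    print(f'{limit:>3} h: {share:.0%} of enrolments')
```

A limit that captures most enrolments while staying short keeps the waiting time for validation low.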

In [None]:
# We wish to compare the dates, which requires us to convert them to datetime objects. 
# We will be subtracting them in order to see how long it took each user to enrol.
ds.info()

In [None]:
ds['first_open'] = [parser.parse(i) for i in ds['first_open']]

In [None]:
# Parsing this column in the same way as above would raise an error because of the blanks in the dataset, so we only parse the string values.

ds['enrolled_date'] = [parser.parse(i) if isinstance(i, str) else i for i in ds['enrolled_date']]

In [None]:
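# Whole hours between first open and enrolment (NaN if the user never enrolled).
# Recent pandas versions reject .astype('timedelta64[h]'), so we divide the seconds instead.
ds['difference'] = (ds['enrolled_date'] - ds['first_open']).dt.total_seconds() // 3600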

In [None]:
# We observe a positively skewed distribution.

plt.rcParams['figure.figsize'] = (10,6)
plt.title('Distribution of time since enrolled')
plt.hist(ds['difference'].dropna(), color = 'blue')
plt.show()

In [None]:
# We observe a positively skewed distribution and that most of the enrolments happen
# within the first 10 hours. We will therefore set our time limit to two days (48 hours).

plt.rcParams['figure.figsize'] = (10,6)
plt.title('Distribution of time since enrolled')
plt.hist(ds['difference'].dropna(), color = 'blue', range = [0,100])
plt.show()

In [None]:
# We will reset the enrolled status to 0 for users who took over 48 hours to enrol.

ds.loc[ds.difference > 48, 'enrolled'] = 0

In [None]:
ds = ds.drop(columns = ['difference', 'enrolled_date', 'first_open'])

In [None]:
ds.head()

**Encoding the screen_list column**

In [None]:
# We have obtained data from an analyst which contains the top screens.
# We will incorporate this data to make things easier, as encoding every screen in the column would result in too many columns in the resulting dataset.

In [None]:
top_screens = pd.read_csv('top_screens.csv').top_screens.values

In [None]:
top_screens

In [None]:
# We wish to map the screen names from the screen_list to the screens
# mentioned in the top_screens dataset.
# Appending a trailing comma ',' means every screen name is followed by a comma,
# so the number of commas equals the number of screens.

ds['screen_list'] = ds.screen_list.astype(str) + ','

In [None]:
ds

In [None]:
# The first line checks whether the row contains a top screen. This returns a boolean (True/False),
# which .astype(int) converts to 0 or 1.
# The second line removes that top screen from the screen_list by replacing it with an empty string.
# regex = False treats each screen name as plain text rather than as a pattern.

for i in top_screens:
    ds[i] = ds.screen_list.str.contains(i, regex = False).astype(int)
    ds['screen_list'] = ds.screen_list.str.replace(i + ',', '', regex = False)

In [None]:
ds

In [None]:
# The 'other' column counts how many leftover screens remain (one comma per screen).

ds['other'] = ds.screen_list.str.count(',')
ds = ds.drop(columns = ['screen_list'])
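
Note that `str.contains` matches substrings, so depending on the order of `top_screens`, a name such as `Loan` can also match a row that only contains `Loan2`. A hedged alternative, shown here on made-up data, is to split each list into exact screen names before testing membership. The toy `screen_list` values and the short `top_screens` list are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({'screen_list': ['Loan2,Home,Splash', 'Loan,Loan2', 'Home']})
top_screens = ['Loan', 'Loan2']  # hypothetical subset of the real list

# Split each comma-separated string into a set of exact screen names.
screens = toy['screen_list'].str.split(',').apply(set)

for name in top_screens:
    # 1 if the exact screen name was visited, otherwise 0.
    toy[name] = screens.apply(lambda visited: int(name in visited))

# Count the distinct screens that are not in top_screens.
toy['other'] = screens.apply(lambda visited: len(visited - set(top_screens)))
toy = toy.drop(columns=['screen_list'])
print(toy)
```

Unlike counting commas, this variant counts each distinct leftover screen once, so repeated visits to the same screen are not counted again.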

In [None]:
# To reduce multicollinearity between screens, we will group the screens 
# into funnels.

savings_screens = ["Saving1",
                    "Saving2",
                    "Saving2Amount",
                    "Saving4",
                    "Saving5",
                    "Saving6",
                    "Saving7",
                    "Saving8",
                    "Saving9",
                    "Saving10"]
ds["SavingCount"] = ds[savings_screens].sum(axis=1)
ds = ds.drop(columns=savings_screens)

cm_screens = ["Credit1",
               "Credit2",
               "Credit3",
               "Credit3Container",
               "Credit3Dashboard"]
ds["CMCount"] = ds[cm_screens].sum(axis=1)
ds = ds.drop(columns=cm_screens)

cc_screens = ["CC1",
                "CC1Category",
                "CC3"]
ds["CCCount"] = ds[cc_screens].sum(axis=1)
ds = ds.drop(columns=cc_screens)

loan_screens = ["Loan",
               "Loan2",
               "Loan3",
               "Loan4"]
ds["LoansCount"] = ds[loan_screens].sum(axis=1)
ds = ds.drop(columns=loan_screens)

#### Saving Results ####
ds.head()
ds.describe()
ds.columns

ds.to_csv('new_appdata10.csv', index = False)
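
The same grouping can also be written as a loop over a mapping from each new count column to the screens it replaces. The sketch below uses a made-up frame and a hypothetical `funnels` dictionary with two short groups only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Loan':  [1, 0, 1],
    'Loan2': [1, 0, 0],
    'CC1':   [0, 1, 1],
    'CC3':   [0, 1, 0],
    'age':   [23, 35, 41],
})

# Hypothetical mapping: new count column -> indicator columns it replaces.
funnels = {
    'LoansCount': ['Loan', 'Loan2'],
    'CCCount':    ['CC1', 'CC3'],
}

for count_column, screens in funnels.items():
    toy[count_column] = toy[screens].sum(axis=1)  # screens visited in this funnel
    toy = toy.drop(columns=screens)               # drop the correlated indicators

print(toy)
```

Keeping the groups in one dictionary makes it easier to add or change a funnel without repeating the sum-and-drop steps.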

# Importing the new dataset

In [None]:
ds = pd.read_csv('new_appdata10.csv')

In [None]:
ds.head()

In [None]:
response = ds.enrolled
ds = ds.drop(columns = 'enrolled')

In [None]:
ds.head()

# Splitting the dataset into the training set and test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ds, response, test_size = 0.2, random_state = 0)

**Note:** At the end of the modelling process, we would like to associate each prediction with the user it came from. We therefore save the 'user' column before removing it from the features.

In [None]:
train_identifier = X_train['user']
X_train = X_train.drop(columns = 'user')
test_identifier = X_test['user']
X_test = X_test.drop(columns = 'user')

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

**Note:** A StandardScaler returns a two-dimensional NumPy array, which loses the column names and the index. The index is how we match each row of features to its user, and we would like to keep the column names for the model. We therefore convert the output of the StandardScaler back into a data frame and restore both.
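
As a compact sketch of the same idea, the column names and the index can be passed directly to the `pd.DataFrame` constructor. The `X_toy` frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up features with a non-default index, as left behind by train_test_split.
X_toy = pd.DataFrame({'age': [20, 30, 40], 'numscreens': [5, 15, 25]},
                     index=[7, 2, 9])

scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_toy),
                        columns=X_toy.columns,  # restore the column names
                        index=X_toy.index)      # restore the row labels

print(X_scaled)
```

In scikit-learn 1.2 and later, calling `scaler.set_output(transform='pandas')` before fitting is another way to get a data frame back.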

In [None]:
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train))
X_test2 = pd.DataFrame(sc_X.transform(X_test))

In [None]:
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values

In [None]:
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values

In [None]:
X_train = X_train2
X_test = X_test2

In [None]:
X_train

# Fitting the Logistic Regression to the dataset

**Note:** The l1 penalty adds the sum of the absolute values of the coefficients to the loss, which shrinks the coefficients towards zero and can set some of them to exactly zero. If one type of screen is highly correlated with the response variable, the penalty keeps that screen from ending up with an outsized coefficient in the fitted model. This is useful when working with mobile application screens, where many screen features are correlated with one another.
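
The shrinking effect can be seen on synthetic data. In the sketch below only the first feature drives the label and the remaining four are noise. The data, the random seed and the value of `C` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Only the first feature drives the label; the other four are noise.
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# liblinear is one of the solvers that supports the L1 penalty.
# A small C means strong regularisation.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05, random_state=0)
l1_model.fit(X, y)

print(np.round(l1_model.coef_, 3))
```

With strong regularisation, the coefficients of the noise features are typically pushed to zero or close to it, while the informative feature keeps a clearly non-zero weight.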

In [None]:
from sklearn.linear_model import LogisticRegression
# The default solver does not support the l1 penalty, so we use liblinear.
lr = LogisticRegression(random_state = 0, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train, y_train)

In [None]:
# Predicting the y_test results

y_pred = lr.predict(X_test)

# Model Evaluation - Confusion Matrix and K-Fold Cross Validation

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
cm1 = pd.DataFrame(cm, index = (0,1), columns = (0,1))
plt.figure(figsize = (10,7))
sns.set(font_scale = 1.4)
sns.heatmap(cm1, annot = True, fmt = 'g')
print('Accuracy Score: %0.4f' % accuracy_score(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = lr, X = X_train, y = y_train, cv = 10)
accuracy_mean = accuracies.mean()
accuracy_std = accuracies.std()

In [None]:
print(accuracy_mean)
print(accuracy_std)
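
Because `X_train` was scaled before cross-validation, the validation rows of each fold also influenced the scaler. One common way to keep the scaling inside each fold is to wrap the scaler and the classifier in a pipeline. The sketch below uses synthetic data from `make_classification`; the data and the parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is refitted on the training part of every fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', random_state=0),
)

scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(f'{scores.mean():.3f} +/- {scores.std():.3f}')
```

On a dataset of this size the difference is usually small, but the pipeline gives a cleaner estimate of how the model behaves on unseen users.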

# Formatting the final results

In [None]:
y_test

In [None]:
y_pred

In [None]:
test_identifier

In [None]:
results = pd.concat([y_test, test_identifier], axis = 1).dropna()
results['predicted_results'] = y_pred
# Assign the reordered frame back so that the next cell shows it.
results = results[['user', 'enrolled', 'predicted_results']].reset_index(drop = True)

In [None]:
results