

Translating the intuition from our data exploration of the Cogo Labs dataset into a machine learning model that predicts email open rates for new customers based on their browsing behaviour.


## Setup

Let's start by importing the packages we'll need and mounting our Google Drive as before.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We'll use the `read_csv` function to read the dataset. Be sure to take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): there are many optional arguments you may find useful when working on your project dataset.

In [4]:
df = pd.read_csv('/content/drive/My Drive/MLBA/cogo-all.tsv', delimiter='\t')

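We have to split the data into a training set and a testing set.

We create a new column called `train`, which is `True` if the instance should be included in the training set. The column is filled using the NumPy random number generator so that roughly 80% of the rows are marked `True`. Because no random seed is set, the exact split changes every time the cell is run. Once we have this column, we can filter on it to create two new dataframes.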

In [5]:
df['train'] = np.random.rand(len(df)) < 0.8

df_train = df[df.train]     # rows flagged for training
df_test = df[~df.train]     # remaining rows form the test set

print(df_train.shape, df_test.shape)

(230760, 18) (57538, 18)
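As an alternative to the random mask, scikit-learn's `train_test_split` produces an exact 80/20 split that is reproducible when `random_state` is fixed. The sketch below is only an illustration: the `toy` dataframe and its values are made up and stand in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'browser1': rng.random(1000),
    'p_open': rng.random(1000),
})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle,
# so re-running the cell gives the same split.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

print(toy_train.shape, toy_test.shape)
```

With a fixed seed, the row counts and the downstream error values stay the same between runs, which makes results easier to compare.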


We can refresh our memories of what goes on in the dataset by looking at the column names. 

In [6]:
df.columns

Index(['state', 'user_id', 'browser1', 'browser2', 'browser3', 'device_type1',
       'device_type2', 'device_type3', 'device_type4', 'activity_observations',
       'activity_days', 'activity_recency', 'activity_locations',
       'activity_ids', 'age', 'gender', 'p_open', 'train'],
      dtype='object')

## Training a linear regression model

Let's start by setting up a very simple model that only cares about which browser a user uses. We will create a list, `predictors`, to hold the column names of the predictors we want to include in our model to make indexing easier.

In [7]:
predictors = ['browser1', 'browser2', 'browser3']
X1_train = df_train[predictors]
X1_test = df_test[predictors]
y_train = df_train['p_open']
y_test = df_test['p_open']

In [None]:
y_train

0         0.000000
1         0.018100
2         0.035912
5         0.070740
8         0.006849
            ...   
288292    0.004878
288293    0.002083
288294    0.337838
288296    0.008000
288297    0.009009
Name: p_open, Length: 230396, dtype: float64

Now we can follow the same four steps as always. First, choose a model class; second, instantiate the model and set hyperparameter values; third, fit the model to your data. The fourth step, prediction, follows afterwards. Remember that after fitting we can access model attributes, in this case the coefficients and the intercept term.

In [8]:
from sklearn.linear_model import LinearRegression # 1. choose model class

model = LinearRegression()                         # 2. instantiate model
model = model.fit(X1_train, y_train)               # 3. fit model to data

model.coef_, model.intercept_

(array([0.01604796, 0.0061872 , 0.04667057]), 0.0703354127738299)

Finally, we make predictions on the training and test set and evaluate the mean squared error. Of course, there are automated functions for this, but let's do it manually so that we can make sure we understand how it works. 

First we predict on the training data:

In [9]:
y_train_fit = model.predict(X1_train)                  # 4a. predict on training data

mse_train = np.mean( (y_train - y_train_fit)**2 )   # Evaluate
print(np.sqrt(mse_train), mse_train)

0.16845445616029722 0.0283769038002615


Then on the testing data:

In [10]:
y_test_fit = model.predict(X1_test)                  # 4b. predict on testing data

mse_test = np.mean( (y_test - y_test_fit)**2 )   # Evaluate
print(np.sqrt(mse_test), mse_test) # root mean squared error and mean squared error 

0.17102062847620725 0.02924805536439691


In [None]:
# MSE train = 0.028518754475621244
# MSE test = 0.02867812563757728

In this case the training and test MSE are very similar, so it's unlikely that we overfit. Is this a "good" MSE? We don't really know yet, but the root mean squared error of about 0.17 tells us that our open-rate predictions are typically off by about 17 percentage points.
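The manual computation can be cross-checked against scikit-learn's `mean_squared_error`. The sketch below uses made-up open rates (`y_true` and `y_pred` are hypothetical, not taken from the dataset). It also computes the MSE of a constant prediction, which is one possible reference point for judging an MSE; that baseline is an addition for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted open rates (made-up numbers).
y_true = np.array([0.00, 0.10, 0.25, 0.40])
y_pred = np.array([0.05, 0.10, 0.20, 0.30])

# Manual MSE, as in the cells above, and the library equivalent.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse_sklearn)

# Baseline: always predict the mean open rate (here the mean of y_true;
# in practice one would use the mean of the training targets).
y_baseline = np.full_like(y_true, y_true.mean())
mse_baseline = mean_squared_error(y_true, y_baseline)

print(mse_manual, mse_sklearn, rmse, mse_baseline)
```

If a model's test MSE is not clearly below such a baseline, the predictors are probably adding little information.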


## KNN regression

Use the same steps as above to train a KNN regression, and compute MSE train and MSE test. How do they compare to linear regression?

In [16]:
# Using the same process as above perform a KNN regression, and compute MSE train/test
from sklearn.neighbors import KNeighborsRegressor       # 1. choose model class

n_neighbors = 5
model = KNeighborsRegressor(n_neighbors=n_neighbors)    # 2. instantiate model
model.fit(X1_train, y_train)                            # 3. fit model to data

y_train_fit = model.predict(X1_train)                   # 4a. predict on training data

mse_train = np.mean( (y_train - y_train_fit)**2 )
print(np.sqrt(mse_train), mse_train)

y_test_fit = model.predict(X1_test)                     # 4b. predict on test data
mse_test = np.mean( (y_test - y_test_fit)**2 )
print(np.sqrt(mse_test), mse_test)

0.18083482061022202 0.03270123234513118
0.18304417639399126 0.033505170511754584


## Categorical predictors and polynomial basis functions


Up to now we have only used numerical columns. Let's include a categorical predictor, `gender`, and create polynomial features for the `activity_days` feature.

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures

# all predictors 
numerical_predictors = ['browser1', 'browser2', 'browser3'] # numerical
categorical_predictors = ['gender'] # categorical
poly_predictors = ['activity_days'] # polynomial features 
all_predictors = numerical_predictors + categorical_predictors + poly_predictors

# list of the two transformations we want to do
t = [('cat', OneHotEncoder(), categorical_predictors), 
     ('poly', PolynomialFeatures(2, include_bias=False), poly_predictors)]

# instantiate ColumnTransformer with our transformations t
col_transform = ColumnTransformer(transformers=t, remainder='passthrough')

Now we can apply the transformation to our training and testing sets. Notice that we have 7 columns after transformation: 3 for the browsers; 2 for the gender one-hot encoding and 2 for `activity_days` and `activity_days`$^2$.

In [18]:
xt_train = col_transform.fit_transform(df_train[all_predictors])  # fit on the training data only
xt_test = col_transform.transform(df_test[all_predictors])        # reuse the fitted transformers
xt_train.shape

(230760, 7)
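To see where the 7 columns come from, here is a minimal sketch on a few made-up rows. The gender labels `'F'` and `'M'` and all numeric values are assumptions for illustration, and `handle_unknown='ignore'` is an optional setting not used above. The transformer is fitted on the training rows and then only applied to the test row. `ColumnTransformer` places the transformed columns first, in the order the transformers are listed, and the passthrough columns last.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Hypothetical rows; the real gender labels and value ranges may differ.
toy_train = pd.DataFrame({
    'browser1': [0.2, 0.0, 0.5, 0.1],
    'browser2': [0.1, 0.3, 0.0, 0.4],
    'browser3': [0.0, 0.1, 0.2, 0.0],
    'gender': ['F', 'M', 'M', 'F'],
    'activity_days': [1, 3, 2, 5],
})
toy_test = pd.DataFrame({
    'browser1': [0.3], 'browser2': [0.2], 'browser3': [0.1],
    'gender': ['M'], 'activity_days': [4],
})

ct = ColumnTransformer(
    transformers=[
        # handle_unknown='ignore' encodes unseen categories as all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender']),
        ('poly', PolynomialFeatures(2, include_bias=False), ['activity_days']),
    ],
    remainder='passthrough',
)

Xt_train = ct.fit_transform(toy_train)  # learn the categories on training rows
Xt_test = ct.transform(toy_test)        # apply the same encoding to test rows

print(Xt_train.shape, Xt_test.shape)
# Column order: gender one-hot (2), activity_days and its square (2), browsers (3)
print(np.round(Xt_test, 2))
```

Fitting only on the training rows guarantees that the training and test matrices share the same columns in the same order.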

We can choose and fit a model as before on this new data. 

In [19]:
model = LinearRegression(fit_intercept=True)
model.fit(xt_train, y_train)
yhat_test = model.predict(xt_test)
mse_test = np.mean( (y_test - yhat_test)**2)
print(mse_test)

0.029028577145774018
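The transformation and the model can also be bundled into a single scikit-learn `Pipeline`, so that fitting and predicting each take one call and the transformers are never refitted on test data. This is a sketch on synthetic data: the `toy` dataframe, its coefficients and the noise level are all made up, and only one browser column is used to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'browser1': rng.random(n),
    'gender': rng.choice(['F', 'M'], size=n),
    'activity_days': rng.integers(1, 30, size=n),
})
toy['p_open'] = 0.05 + 0.1 * toy['browser1'] + rng.normal(0, 0.01, size=n)

features = ['browser1', 'gender', 'activity_days']
toy_train, toy_test = toy.iloc[:160], toy.iloc[160:]

pipe = Pipeline([
    ('prep', ColumnTransformer(
        transformers=[
            ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender']),
            ('poly', PolynomialFeatures(2, include_bias=False), ['activity_days']),
        ],
        remainder='passthrough',
    )),
    ('reg', LinearRegression()),
])

pipe.fit(toy_train[features], toy_train['p_open'])  # fits transformers and model
yhat = pipe.predict(toy_test[features])             # transforms, then predicts
mse = np.mean((toy_test['p_open'] - yhat) ** 2)
print(mse)
```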
