# Lab - Regression

You explored the computational advertising dataset provided by Cogo Labs as part of the descriptive analytics exercise, and should have some intuition about which factors influence email open rates. It is now time to translate this intuition into a machine learning model that predicts email open rates for new customers based on their browsing behaviour.



## Setup

Let's start by importing the packages we'll need and mounting our Google Drive as before.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We'll use the `read_csv` function to read the dataset. Be sure to take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): there are many optional arguments you may find useful when working on your project dataset.

In [2]:
df = pd.read_csv('/content/drive/My Drive/ba476-test/data/cogo-all.tsv', delimiter='\t')

In [3]:
df.head()

Unnamed: 0,state,user_id,browser1,browser2,browser3,device_type1,device_type2,device_type3,device_type4,activity_observations,activity_days,activity_recency,activity_locations,activity_ids,age,gender,p_open
0,AK,1087,0,0,1,0,0,0,1,74,15,100,2,3,20,M,0.0
1,AK,1656,0,1,0,0,1,1,1,39,22,36,4,7,26,F,0.0181
2,AK,2071,0,1,0,0,0,0,1,9,8,27,3,2,28,M,0.035912
3,AK,2228,0,1,0,0,0,0,1,14,11,68,2,2,19,M,0.0
4,AK,2500,0,0,0,0,1,1,0,2,2,100,1,1,21,F,0.0
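
As a small illustration of those optional arguments, the sketch below reads a made-up tab-separated snippet (not the Cogo data) from a string. `sep`, `index_col`, and `usecols` are real `read_csv` parameters; the toy values and the names `toy_tsv`, `toy_df`, and `toy_small` are assumptions. Passing `index_col=0` is one way to avoid the extra `Unnamed: 0` column seen in the output above.

```python
import io
import pandas as pd

# Toy tab-separated text standing in for a file on disk (values are made up).
toy_tsv = "\tstate\tuser_id\tage\n0\tAK\t1087\t20\n1\tAK\t1656\t26\n"

# sep='\t' tells pandas the fields are tab-separated;
# index_col=0 treats the first (unnamed) column as the row index.
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep='\t', index_col=0)
print(toy_df.columns.tolist())

# usecols keeps only the listed columns, which saves memory on wide files.
toy_small = pd.read_csv(io.StringIO(toy_tsv), sep='\t', usecols=['state', 'age'])
print(toy_small.shape)
```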


We have to split the data into a training set and a testing set. `sklearn.model_selection` offers automated ways of doing this, which we will use in the future, but since this is our first time, let's do it manually.

Create a new column called `train` that is `True` about 80\% of the time, marking the instances that should be included in the training set. You can use a random number generator for this. Once we have this column, we can filter on it to create two new dataframes.

In [6]:
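# A slice bound must be an integer, so convert the 80% cut-off with int().
# Note that this takes the first 80% of rows, which is not a random sample.
train = df[:int(df.shape[0] * 0.8)]
df.shape[0]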

288298

In [14]:
np.random.rand?

In [13]:
np.random.rand()  # one uniform draw from [0, 1); it falls below 0.8 about 80% of the time

0.0035895854076011258

In [15]:
df['train'] = np.random.rand(df.shape[0]) < 0.8

df_train = df[df.train == True]
df_test = df[df.train == False]
print(df_train.shape, df_test.shape)

(230369, 18) (57929, 18)
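
For comparison, here is a minimal sketch of the automated route mentioned earlier, using `train_test_split` from `sklearn.model_selection` on a made-up toy frame (the frame and its values are assumptions, not the lab data). Unlike the random-column approach, which gives roughly 80% of the rows, `test_size=0.2` holds out exactly 20%, and `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the lab data (values are made up).
toy = pd.DataFrame({
    'browser1': np.arange(10) % 2,
    'p_open': np.linspace(0.0, 0.9, 10),
})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)
print(toy_train.shape, toy_test.shape)
```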


We can refresh our memories of what goes on in the dataset by looking at the column names.

In [None]:
df.columns

Index(['state', 'user_id', 'browser1', 'browser2', 'browser3', 'device_type1',
       'device_type2', 'device_type3', 'device_type4', 'activity_observations',
       'activity_days', 'activity_recency', 'activity_locations',
       'activity_ids', 'age', 'gender', 'p_open', 'train'],
      dtype='object')

## Training a linear regression model

Let's start by setting up a very simple model that only cares about which browser a user uses. We will create a list, `predictors`, to hold the column names of the predictors we want to include in our model, which makes indexing easier.

In [16]:
predictors = ['browser1', 'browser2', 'browser3']
X1_train = df_train[predictors]
X1_test = df_test[predictors]
y_train = df_train['p_open']
y_test = df_test['p_open']

In [18]:
y_train

0         0.000000
1         0.018100
2         0.035912
3         0.000000
4         0.000000
            ...   
288293    0.002083
288294    0.337838
288295    0.034409
288296    0.008000
288297    0.009009
Name: p_open, Length: 230369, dtype: float64

Now we can follow the same four steps as always. First, choose a model class; second, instantiate the model and set hyperparameter values; third, fit it to your data; and fourth, predict and evaluate. Remember that we can access model attributes after fitting, in this case the coefficients and the intercept term.

In [19]:
# 1. choose model class
from sklearn.linear_model import LinearRegression
# 2. instantiate model
linear_model = LinearRegression()
# 3. fit model to data
linear_model.fit( X1_train , y_train )

In [20]:
linear_model.coef_

array([0.01547922, 0.00614914, 0.04480118])
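
To make the meaning of these attributes concrete, the sketch below fits a `LinearRegression` to noise-free toy data generated from a known line (the data and the names `X_toy`, `y_toy`, and `toy_model` are assumptions for illustration). Because the data follows $y = 1 + 2x_1 + 3x_2$ exactly, the fitted `coef_` and `intercept_` should recover those numbers up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise.
X_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y_toy = 1 + 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one slope per predictor column; intercept_ holds the constant term.
print(toy_model.coef_, toy_model.intercept_)
```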

Finally, we make predictions on the training and test sets and evaluate the mean squared error. Of course, there are automated functions for this, but let's go through it step by step so that we can make sure we understand how it works.

First we predict on the training data:

In [23]:
# 4a. predict on training data, evaluate
yhat_train = linear_model.predict( X1_train )

from sklearn.metrics import mean_squared_error
print(mean_squared_error( y_train, yhat_train ))

0.028419352837391842


Then on the testing data:

In [24]:
# 4b. predict on test data, evaluate
yhat_test = linear_model.predict( X1_test )

print(mean_squared_error( y_test, yhat_test ))

0.02907472446671503
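
To see what `mean_squared_error` computes, the sketch below works out the same quantity by hand with NumPy on a few made-up values (the arrays are assumptions, not lab data). The MSE is simply the average of the squared differences between targets and predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions (made-up values, not from the lab data).
y_true = np.array([0.0, 0.1, 0.3, 0.0])
y_pred = np.array([0.05, 0.1, 0.2, 0.1])

# MSE is the average of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```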


In this case the training and test MSE are very similar, so it's unlikely that we overfit. Is this a "good" MSE? We don't really know yet, but taking the square root gives a root mean squared error of about 0.17, so our open-rate predictions are typically off by about 17 percentage points.
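
A quick check of that figure, using the two MSE values printed above. The square root of the MSE (the RMSE) is on the same scale as `p_open`, which makes it easier to interpret than the MSE itself.

```python
import numpy as np

# MSE values reported in the outputs above.
mse_train = 0.028419352837391842
mse_test = 0.02907472446671503

# Taking the square root puts the error back on the scale of p_open.
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)
print(round(rmse_train, 3), round(rmse_test, 3))
```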


## k-NN

We discussed k-NN as an example of a non-parametric estimator. In sklearn, this is implemented as [`KNeighborsRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) for regression and `KNeighborsClassifier` for classification. The main parameter we will be interested in setting is `n_neighbors`; read the documentation to learn about the other parameters.

Train and evaluate a k-NN model using the same steps as before.

In [25]:
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=10)
knn_model.fit(X1_train, y_train)

yhat_test_knn = knn_model.predict(X1_test)
print(mean_squared_error(y_test, yhat_test_knn))

0.03180310635076774
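
Since `n_neighbors` is the main knob, one natural follow-up is to try several values and compare the test MSE. The sketch below does this on synthetic one-dimensional data (the data, the candidate values of `k`, and the variable names are all assumptions for illustration); the same loop could be pointed at `X1_train` and `X1_test`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic one-dimensional data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=200)
X_tr, X_te = X_toy[:150], X_toy[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

# Fit one model per value of k and record the test MSE.
test_mse = {}
for k in [1, 5, 10, 50]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    test_mse[k] = mean_squared_error(y_te, model.predict(X_te))

for k, mse in test_mse.items():
    print(k, round(mse, 4))
```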


## Categorical predictors and polynomial basis functions

All of the features we used above contained numerical values, but what if this is not the case? Now we need a way to encode categorical values to numbers. This is typically done with one-hot encoding. For example, a column `gender` with three possible values (male, female, other) will be transformed into three binary columns, one for each possible categorical value. We will use the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for this.
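
As a minimal sketch of one-hot encoding, the example below applies `OneHotEncoder` to a made-up `gender` column with two values (the toy frame is an assumption). The encoder sorts the categories it finds and produces one binary column per category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (made-up values).
toy = pd.DataFrame({'gender': ['M', 'F', 'F', 'M']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing.
encoded = encoder.fit_transform(toy[['gender']]).toarray()

print(encoder.categories_)  # the categories found, in sorted order
print(encoded)              # one column per category
```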

Another common transformation is to use polynomial basis functions. Suppose we have predictors $X_1, X_2$. Instead of fitting the linear model $y=\beta_0 + \beta_1 X_1 + \beta_2 X_2$, we may want to fit the (still linear) model $y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 + \beta_4X_1^2 + \beta_5X_2^2.$ To do this we must add the columns $X_1\cdot X_2$, $X_1^2$ and $X_2^2$ to the data matrix. We will use [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to add these columns automatically. Notice that you can pass the maximum degree when instantiating the transformer.
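
To see the added columns, the sketch below applies `PolynomialFeatures` to a single made-up row with $X_1 = 2$ and $X_2 = 3$. With `degree=2` and `include_bias=False`, the output columns are ordered $X_1, X_2, X_1^2, X_1X_2, X_2^2$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One row with X1 = 2 and X2 = 3.
X_toy = np.array([[2.0, 3.0]])

# include_bias=False drops the constant column, since LinearRegression fits its own intercept.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_toy)
print(X_poly)
```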


The example that follows creates a `ColumnTransformer` to preprocess the five feature columns. The three browser columns are left unchanged and kept (if we did not specify `remainder='passthrough'`, they would be discarded); the `gender` column is transformed into two binary columns, one for each of the genders that appear in the data; and we create quadratic (degree 2 polynomial) features for the `activity_days` column.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures

# all predictors
numerical_predictors = ['browser1', 'browser2', 'browser3']
categorical_predictors = ['gender'] # ['gender', 'state']
poly_predictors = ['activity_days']
all_predictors = numerical_predictors + categorical_predictors + poly_predictors

# make a list of the two transformations we want to apply
t = [...]

# instantiate the ColumnTransformer with our transformations t
col_transform = ColumnTransformer(...)

Now we can apply the transformation to our training and testing sets. Notice that we have 7 columns after transformation: 3 for the browsers, 2 for the gender one-hot encoding, and 2 for `activity_days` and `activity_days`$^2$. (This count assumes `include_bias=False`; by default, `PolynomialFeatures` also adds a constant column.)

In [None]:
# ...
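
One possible way to put these pieces together is sketched below on a tiny made-up frame that reuses the lab's column names. The toy values, the step names `'onehot'` and `'poly'`, and the choice of `sparse_threshold=0` (to force a dense array) are assumptions, not the intended solution. Note that passthrough columns are appended after the transformed ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Toy rows with the same five column names as the lab (values are made up).
toy_train = pd.DataFrame({
    'browser1': [0, 1, 0, 0],
    'browser2': [1, 0, 0, 1],
    'browser3': [0, 0, 1, 0],
    'gender': ['M', 'F', 'F', 'M'],
    'activity_days': [15, 22, 8, 11],
})
toy_test = pd.DataFrame({
    'browser1': [1], 'browser2': [0], 'browser3': [0],
    'gender': ['F'], 'activity_days': [2],
})

# Each entry is (name, transformer, columns to apply it to).
t = [
    ('onehot', OneHotEncoder(), ['gender']),
    ('poly', PolynomialFeatures(degree=2, include_bias=False), ['activity_days']),
]

# remainder='passthrough' keeps the browser columns; sparse_threshold=0 forces a dense array.
col_transform = ColumnTransformer(t, remainder='passthrough', sparse_threshold=0)

# Learn the categories on the training set only, then reuse them on the test set.
Xt_train = col_transform.fit_transform(toy_train)
Xt_test = col_transform.transform(toy_test)
print(Xt_train.shape, Xt_test.shape)
```

Fitting on the training set and only calling `transform` on the test set keeps information from the test data out of the preprocessing step.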

We can choose and fit a model as before on this new data.

In [None]:
# your code here

Did the model improve? Experiment some more and see if you can improve the model by adding features, trying different transformations, or tuning the regularization parameters.

(Hint: When you're stuck, try including `state` in the categorical predictors above.)
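
If you do add `state`, one thing to watch for is a category that shows up in the test set but not in the training set, in which case `OneHotEncoder` raises an error by default. With a random 80/20 split of a large dataset this is unlikely, but the sketch below shows one way to guard against it with the `handle_unknown='ignore'` option. The toy frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the test set contains a state that never appears in training.
toy_train = pd.DataFrame({'state': ['AK', 'AL', 'AK']})
toy_test = pd.DataFrame({'state': ['WY']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(toy_train)
encoded_test = encoder.transform(toy_test).toarray()
print(encoded_test)
```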