# Linear regression and KNN



## Setup

Let's start by importing the packages we'll need and mounting our Google Drive as before.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

We'll use the `read_csv` function to read the dataset. Be sure to take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): there are many optional arguments you may find useful when working on your project dataset.

In [None]:
df = pd.read_csv('/content/drive/My Drive/MLBA/cogo-all.tsv', delimiter='\t')

We have to split the data into a training set and a test set. `sklearn.model_selection` offers automated ways of doing this, which we will use in the future, but since this is our first time, let's do it manually.

We use the NumPy random number generator to create a new column called `train`, which is `True` if the instance should be included in the training set. Each row lands in the training set with probability 0.8, so the split is roughly, not exactly, 80/20. Once we have this column, we can filter on it to create two new dataframes.

In [None]:
df['train'] = np.random.rand(len(df)) < 0.8

df_train = df[df.train]     # rows flagged for training
df_test = df[~df.train]     # remaining rows form the test set

print(df_train.shape, df_test.shape)

(230660, 18) (57638, 18)
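
For comparison, here is a minimal sketch of the automated route using `train_test_split` from `sklearn.model_selection`. The `toy` dataframe and its values are made up for illustration; only the column names echo the real dataset. Unlike the manual approach, this gives an exact 80/20 split, and `random_state` makes it reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (values are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'browser1': rng.integers(0, 2, size=100),
    'p_open': rng.random(100),
})

# Hold out 20% of the rows; random_state fixes the shuffle so the split is repeatable
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)  # (80, 2) (20, 2)
```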


We can refresh our memories of what goes on in the dataset by looking at the column names. 

In [None]:
df.columns

Index(['state', 'user_id', 'browser1', 'browser2', 'browser3', 'device_type1',
       'device_type2', 'device_type3', 'device_type4', 'activity_observations',
       'activity_days', 'activity_recency', 'activity_locations',
       'activity_ids', 'age', 'gender', 'p_open', 'train'],
      dtype='object')

## Training a linear regression model

Let's start by setting up a very simple model that only cares about which browser a user uses. We will create a list, `predictors`, to hold the column names of the predictors we want to include in our model, which makes indexing easier.

In [None]:
predictors = ['browser1', 'browser2', 'browser3']
X1_train = df_train[predictors]
X1_test = df_test[predictors]
y_train = df_train['p_open']
y_test = df_test['p_open']

Now we can follow the same four steps as always: choose a model class, instantiate the model and set hyperparameter values, fit it to your data, and predict. Remember that we can access model attributes after fitting, in this case the coefficients and the intercept term.

In [None]:
from sklearn.linear_model import LinearRegression # 1. choose model class
model = LinearRegression(fit_intercept=True)      # 2. instantiate model
model.fit(X1_train, y_train)                      # 3. fit model to data

model.coef_, model.intercept_

(array([0.01618858, 0.00640617, 0.04715025]), 0.07030346777090489)

Finally, we make predictions on the training and test sets and evaluate the mean squared error (MSE). Of course, there are automated functions for this, but let's do it manually so that we can make sure we understand how it works. Each cell prints the root mean squared error (RMSE) followed by the MSE.

First we predict on the training data:

In [None]:
y_train_fit = model.predict(X1_train)              # 4a. predict on training data

mse_train = np.mean( (y_train - y_train_fit)**2 )
print(np.sqrt(mse_train), mse_train)

0.16881952807394998 0.028500033059111182


Then on the testing data:

In [None]:
y_test_fit = model.predict(X1_test)                # 4b. predict on test data
mse_test = np.mean( (y_test - y_test_fit)**2 )
print(np.sqrt(mse_test), mse_test)

0.16956802938033952 0.028753316587931687


In this case the training and test MSE are very similar, so it's unlikely that we overfit. Is this a "good" MSE? We don't really know yet, but the RMSE tells us that our open-rate predictions are typically off by about 0.17, that is, 17 percentage points.
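
One automated option is `mean_squared_error` from `sklearn.metrics`, and one simple way to judge whether an MSE is "good" is to compare it with a baseline that always predicts the mean of the target. The sketch below uses made-up numbers rather than the dataset; in the notebook, the baseline would use the mean of `y_train`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions (made-up numbers, not from the dataset)
y_true = np.array([0.0, 0.1, 0.2, 0.5])
y_pred = np.array([0.1, 0.1, 0.3, 0.3])

# The manual formula and the sklearn helper compute the same quantity
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# Baseline: always predict the mean of the targets
y_base = np.full_like(y_true, y_true.mean())
mse_baseline = mean_squared_error(y_true, y_base)

print(mse_manual, mse_sklearn, mse_baseline)
```

A model is only useful if its MSE is clearly below that of the baseline.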


## KNN

Use the same steps as above to train a KNN regression model, and compute the training and test MSE. How do they compare to linear regression?

In [None]:
from sklearn import neighbors                           # 1. choose model class
n_neighbors = 5
model = neighbors.KNeighborsRegressor(n_neighbors)      # 2. instantiate model
model.fit(X1_train, y_train)                            # 3. fit model to data

y_train_fit = model.predict(X1_train)                   # 4a. predict on training data

mse_train = np.mean( (y_train - y_train_fit)**2 )
print(np.sqrt(mse_train), mse_train)

y_test_fit = model.predict(X1_test)                     # 4b. predict on test data
mse_test = np.mean( (y_test - y_test_fit)**2 )
print(np.sqrt(mse_test), mse_test)

## Categorical predictors and polynomial basis functions

All of the features we used above contained numerical values, but what if this is not the case? Then we need a way to encode categorical values as numbers. This is typically done with one-hot encoding. For example, a column `gender` with three possible values (male, female, other) will be transformed into three binary columns, one for each possible categorical value. We will use the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for this.
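
As a small illustration of one-hot encoding, the sketch below encodes a made-up `gender` column with three values. The toy dataframe is an assumption for the example and does not reflect the values in the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three distinct values
toy = pd.DataFrame({'gender': ['male', 'female', 'other', 'female']})

enc = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense array to inspect it
encoded = enc.fit_transform(toy[['gender']]).toarray()

print(enc.categories_)  # categories are sorted: female, male, other
print(encoded)          # one row per instance, one binary column per category
```

Each row contains exactly one 1, in the column of that row's category.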

Another common transformation is to use polynomial basis functions. Suppose we have predictors $X_1, X_2$. Instead of fitting the linear model $y=\beta_0 + \beta_1 X_1 + \beta_2 X_2$, we may want to fit the model $y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1X_2 + \beta_4X_1^2 + \beta_5X_2^2$, which is still linear in the coefficients. To do this we must add the columns $X_1\cdot X_2$, $X_1^2$ and $X_2^2$ to the data matrix. We will use [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to add these columns automatically. Notice that you can pass the maximum degree when instantiating the transformer.
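
To see which columns get added, here is a minimal sketch on a made-up two-column matrix. With `degree=2` and `include_bias=False`, the output columns are $X_1, X_2, X_1^2, X_1X_2, X_2^2$, in that order.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up instances with predictors X1 and X2
X = np.array([[2.0, 3.0],
              [1.0, 4.0]])

# include_bias=False drops the constant column, since the model fits its own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 1.  4.  1.  4. 16.]]
```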


The example that follows creates a `ColumnTransformer` to preprocess the five feature columns. The three browser columns are kept unchanged (if we did not specify `remainder='passthrough'`, they would be discarded); the `gender` column is transformed into two binary columns, one for each of the genders that appear in the data; and we create quadratic (degree 2 polynomial) features for the `activity_days` column.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures

# all predictors 
numerical_predictors = ['browser1', 'browser2', 'browser3']
categorical_predictors = ['gender'] #['gender', 'state']
poly_predictors = ['activity_days']
all_predictors = numerical_predictors + categorical_predictors + poly_predictors

# list of the two transformations we want to apply
t = [('cat', OneHotEncoder(), categorical_predictors), 
     ('poly', PolynomialFeatures(2, include_bias=False), poly_predictors)]

# instantiate the ColumnTransformer with our transformations t
col_transform = ColumnTransformer(transformers=t, remainder='passthrough')

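Now we can apply the transformation to our training and test sets. We fit the transformer on the training set only and then reuse the fitted transformer on the test set, so that both sets are encoded in exactly the same way. Notice that we have 7 columns after transformation: 3 for the browsers, 2 for the one-hot encoding of `gender`, and 2 for `activity_days` and `activity_days`$^2$.

In [None]:
xt_train = col_transform.fit_transform(df_train[all_predictors])  # fit on training data, then transform it
xt_test = col_transform.transform(df_test[all_predictors])        # reuse the fitted transformer on test data
xt_train.shape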

(230660, 7)

In [None]:
model = LinearRegression(fit_intercept=True)
model.fit(xt_train, y_train)
yhat_test = model.predict(xt_test)
mse_test = np.mean( (y_test - yhat_test)**2)
print(mse_test)

0.028583370783762215
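
The preprocessing and model steps can also be chained into a single `Pipeline` from `sklearn.pipeline`, which fits the transformers on the training data and applies the same fitted transformers when predicting. The sketch below runs on a made-up dataframe: the values, the `'F'`/`'M'` gender labels, and the step names are assumptions for illustration, and `handle_unknown='ignore'` is an optional setting that encodes categories unseen during fitting as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Made-up data that mimics a few columns of the real dataset
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'browser1': rng.integers(0, 2, size=n),
    'gender': rng.choice(['F', 'M'], size=n),
    'activity_days': rng.integers(1, 30, size=n),
    'p_open': rng.random(n),
})
toy_train, toy_test = toy.iloc[:160], toy.iloc[160:]
features = ['browser1', 'gender', 'activity_days']

# Same kind of preprocessing as above: one-hot encode gender, add a squared term
preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender']),
        ('poly', PolynomialFeatures(2, include_bias=False), ['activity_days']),
    ],
    remainder='passthrough',
)

# Chain preprocessing and regression into one estimator
pipe = Pipeline([('preprocess', preprocess), ('model', LinearRegression())])

pipe.fit(toy_train[features], toy_train['p_open'])  # transformers are fitted on training data only
yhat_test = pipe.predict(toy_test[features])        # test data is transformed, never fitted on
mse_test = np.mean((toy_test['p_open'] - yhat_test) ** 2)
print(mse_test)
```

Because the target here is random noise, the resulting MSE carries no meaning; the point is only the structure of the workflow.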
