# Stepwise feature selection

This notebook builds a machine learning model that predicts email open rates for new customers based on their browsing behaviour.



## Setup

Let's start by importing the packages we'll need and mounting our Google Drive as before.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We'll use the `read_csv` function to read the dataset. Be sure to take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): there are many optional arguments you may find useful when working on your project dataset.

In [None]:
df = pd.read_csv('/content/drive/My Drive/MLBA/cogo-all.tsv', delimiter='\t')
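As a small illustration of those optional arguments, here is a sketch that reads a made-up, tab-separated snippet from memory rather than from Drive. The sample text and its values are invented for this example; `sep`, `usecols`, `dtype` and `nrows` are standard `read_csv` arguments.

```python
import io

import pandas as pd

# A tiny, made-up TSV snippet standing in for a real file (hypothetical values).
tsv_text = (
    "user_id\tstate\tage\tp_open\n"
    "1\tCA\t34\t0.12\n"
    "2\tNY\t51\t0.30\n"
    "3\tTX\t27\t0.05\n"
)

sample = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",                             # tab-separated, like the .tsv file above
    usecols=["user_id", "age", "p_open"], # only load the columns we need
    dtype={"user_id": "int64"},           # pin a column's type explicitly
    nrows=2,                              # read just the first two data rows
)
print(sample)
```

Reading only a few rows with `nrows` is a cheap way to inspect a large file before loading all of it.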

We have to split the data into a training set and a test set. `sklearn.model_selection` offers automated ways of doing this, which we will use in the future, but since this is our first time, let's do it manually.

We create a new column called `train`, which is `True` if the instance should be included in the training set. `np.random.rand` draws one uniform random number in [0, 1) per row, so comparing it with 0.8 marks roughly 80% of the rows as training data. Because the generator is not seeded, the exact split changes from run to run. Once we have this column, we can filter on it to create two new dataframes.

In [None]:
df['train'] = np.random.rand(len(df)) < 0.8

df_train = df[df.train]      # rows flagged True go to the training set
df_test = df[~df.train]      # the remaining rows form the test set

print(df_train.shape, df_test.shape)

(230624, 18) (57674, 18)
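For comparison, the automated route mentioned above could look like the following sketch. It uses `train_test_split` from `sklearn.model_selection` on a small made-up dataframe (`demo` and its columns are placeholders, not the course data); `random_state` fixes the seed so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real dataframe (hypothetical columns and values).
demo = pd.DataFrame({"x": np.arange(100), "p_open": np.linspace(0, 1, 100)})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

print(demo_train.shape, demo_test.shape)
```

Unlike the manual approach, this gives an exact 80/20 split rather than an approximate one.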


We can refresh our memories of what goes on in the dataset by looking at the column names. 

In [None]:
df.columns

Index(['state', 'user_id', 'browser1', 'browser2', 'browser3', 'device_type1',
       'device_type2', 'device_type3', 'device_type4', 'activity_observations',
       'activity_days', 'activity_recency', 'activity_locations',
       'activity_ids', 'age', 'gender', 'p_open', 'train'],
      dtype='object')

## Training a linear regression model

Let's start by setting up a simple model that uses the browser, device type, activity and age columns. We will create a list, `predictors`, to hold the column names of the predictors we want to include in our model; this makes indexing easier.

In [None]:
predictors = ['browser1', 'browser2', 'browser3',
              'device_type2', 'device_type3', 'device_type4',
              'activity_observations', 'activity_days', 'activity_recency',
              'activity_locations', 'age']
X1_train = df_train[predictors]
X1_test = df_test[predictors]
y_train = df_train['p_open']
y_test = df_test['p_open']

Now we can follow the same four steps as always. The first three are: choose a model class, instantiate the model and set its hyperparameter values, then fit it to your data. Remember that we can access the attributes of a fitted model, in this case the coefficients and the intercept term.

In [None]:
from sklearn.linear_model import LinearRegression # 1. choose model class
model = LinearRegression(fit_intercept=True)      # 2. instantiate model
model.fit(X1_train, y_train)                      # 3. fit model to data

model.coef_, model.intercept_

(array([ 6.27978297e-03, -4.04776870e-03,  3.44039441e-02, -7.21018183e-02,
         5.48376517e-02,  9.33053464e-03, -1.70310798e-05,  9.98075096e-04,
        -6.97881390e-05, -3.45354789e-03,  1.59248749e-03]),
 0.0295734835321467)
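A raw array of coefficients is hard to read, since it only follows the column order of the training data. One way to make it easier, sketched here on made-up data (`X_demo`, `y_demo` and `demo_model` are placeholder names), is to wrap `coef_` in a pandas Series indexed by the predictor names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data with a known relationship: y = 2*a - 1*b + 0.5 (no noise).
X_demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
y_demo = 2 * X_demo["a"] - 1 * X_demo["b"] + 0.5

demo_model = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# coef_ is ordered like the columns of X, so the column names label it directly.
coef_table = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coef_table)
print("intercept:", demo_model.intercept_)
```

In the notebook the equivalent would be `pd.Series(model.coef_, index=predictors)`.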

Finally, we make predictions on the training and test sets and evaluate the mean squared error (MSE). Of course, there are automated functions for this, but let's do it manually so that we can make sure we understand how it works.

First we predict on the training data:

In [None]:
y_train_fit = model.predict(X1_train)              # 4a. predict on training data
mse_train = np.mean( (y_train - y_train_fit)**2 )
print(f"RMSE train: {np.sqrt(mse_train)}, MSE train: {mse_train}")

RMSE train: 0.16658220649393307, MSE train: 0.02774963152038736


Then on the testing data:

In [None]:
y_test_fit = model.predict(X1_test)                # 4b. predict on test data
mse_test = np.mean( (y_test - y_test_fit)**2 )
print(f"RMSE test: {np.sqrt(mse_test)}, MSE test: {mse_test}")

RMSE test: 0.16539390994272948, MSE test: 0.027355145446143713


In this case the training and test MSE are very similar, so it's unlikely that we overfit. Is this a "good" MSE? We can't really say without a point of comparison, but the RMSE tells us that our open-rate predictions are typically off by about 0.17, or roughly 17 percentage points.
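One common point of comparison is a baseline that ignores all predictors and always predicts the mean of the training target. The sketch below uses synthetic open rates drawn from a beta distribution, which is purely an assumption for illustration; in the notebook you would use `y_train.mean()` and `y_test` instead.

```python
import numpy as np

# Synthetic open rates in [0, 1]; the beta distribution is an arbitrary stand-in.
rng = np.random.default_rng(1)
y_train_demo = rng.beta(1, 8, size=1000)
y_test_demo = rng.beta(1, 8, size=250)

# Baseline: predict the training mean for every test instance.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())
rmse_baseline = np.sqrt(np.mean((y_test_demo - baseline_pred) ** 2))

print(f"Baseline RMSE: {rmse_baseline}")
```

A model is only useful to the extent that its test RMSE is clearly below this baseline.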


## Stepwise feature selection

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

sfs_forward = SequentialFeatureSelector(
    model, direction="forward"
).fit(X1_train, y_train)

predictors_fwd = sfs_forward.get_feature_names_out()
predictors_fwd

array(['browser3', 'device_type2', 'device_type3', 'activity_days', 'age'],
      dtype=object)
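To see what `SequentialFeatureSelector` is doing, here is a minimal hand-written sketch of forward selection on synthetic data. It is not scikit-learn's implementation, only the same greedy idea: at each step, add the feature that gives the best cross-validated score (R² by default for a regressor). The helper `forward_select` and the data are made up for this example. When `n_features_to_select` is left unset, recent scikit-learn versions appear to keep half of the features, which would explain 5 of 11 above; here we set it explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only f0, f3 and f5 influence the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 3 * X_demo["f0"] + 2 * X_demo["f3"] + X_demo["f5"] + rng.normal(scale=0.1, size=500)

def forward_select(estimator, X, y, n_features, cv=5):
    """Greedy forward selection (hypothetical helper, for illustration only)."""
    selected, remaining = [], list(X.columns)
    while len(selected) < n_features:
        # Score each candidate when added to the features chosen so far.
        scores = {
            col: cross_val_score(clone(estimator), X[selected + [col]], y, cv=cv).mean()
            for col in remaining
        }
        best = max(scores, key=scores.get)  # candidate with the highest mean CV score
        selected.append(best)
        remaining.remove(best)
    return selected

manual = forward_select(LinearRegression(), X_demo, y_demo, n_features=3)

# The library version, with the number of features stated explicitly.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
).fit(X_demo, y_demo)
library = list(X_demo.columns[sfs.get_support()])

print(manual, library)
```

Backward selection works the same way in reverse: it starts with all features and repeatedly drops the one whose removal hurts the score least.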

In [None]:
X1_train_fwd = df_train[predictors_fwd]
X1_test_fwd = df_test[predictors_fwd]

model.fit(X1_train_fwd, y_train)
model.intercept_, model.coef_

(0.024122763112132628,
 array([ 0.03707248, -0.07180775,  0.05088075,  0.00095493,  0.00159022]))

In [None]:
y_train_fwd_fit = model.predict(X1_train_fwd)             
mse_train_fwd = np.mean( (y_train - y_train_fwd_fit)**2 )
print(f"RMSE train: {np.sqrt(mse_train_fwd)}, MSE train: {mse_train_fwd}")

RMSE train: 0.16667267282451015, MSE train: 0.027779779866466205


In [None]:
y_test_fwd_fit = model.predict(X1_test_fwd)             
mse_test_fwd = np.mean( (y_test - y_test_fwd_fit)**2 )
print(f"RMSE test: {np.sqrt(mse_test_fwd)}, MSE test: {mse_test_fwd}")

RMSE test: 0.1654996498472957, MSE test: 0.027390134099577485


In [None]:
sfs_backward = SequentialFeatureSelector(
    model, direction="backward"
).fit(X1_train, y_train)

predictors_bkw = sfs_backward.get_feature_names_out()
predictors_bkw

array(['browser3', 'device_type2', 'device_type3', 'activity_days', 'age'],
      dtype=object)

In [None]:
X1_train_bkw = df_train[predictors_bkw]
X1_test_bkw = df_test[predictors_bkw]   # use the backward-selected predictors

model.fit(X1_train_bkw, y_train)
model.intercept_, model.coef_

(0.024122763112132628,
 array([ 0.03707248, -0.07180775,  0.05088075,  0.00095493,  0.00159022]))

In [None]:
y_train_bkw_fit = model.predict(X1_train_bkw)             
mse_train_bkw = np.mean( (y_train - y_train_bkw_fit)**2 )
print(f"RMSE train: {np.sqrt(mse_train_bkw)}, MSE train: {mse_train_bkw}")

RMSE train: 0.16667267282451015, MSE train: 0.027779779866466205


In [None]:
y_test_bkw_fit = model.predict(X1_test_bkw)             
mse_test_bkw = np.mean( (y_test - y_test_bkw_fit)**2 )
print(f"RMSE test: {np.sqrt(mse_test_bkw)}, MSE test: {mse_test_bkw}")

RMSE test: 0.1654996498472957, MSE test: 0.027390134099577485
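The evaluation cells above repeat the same few lines for each model. One way to tidy this up is a small helper that returns both metrics, collected into a table. This is a sketch on synthetic data; `evaluate`, `feature_sets` and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def evaluate(model, X, y):
    """Return RMSE and MSE of a fitted model on (X, y)."""
    mse = np.mean((y - model.predict(X)) ** 2)
    return {"rmse": np.sqrt(mse), "mse": mse}

# Synthetic data: only columns a and c influence the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = X_demo["a"] - 0.5 * X_demo["c"] + rng.normal(scale=0.2, size=200)

# Compare a model on all features with one on a selected subset.
feature_sets = {"all": ["a", "b", "c", "d"], "subset": ["a", "c"]}
rows = {}
for name, cols in feature_sets.items():
    fitted = LinearRegression().fit(X_demo[cols], y_demo)
    rows[name] = evaluate(fitted, X_demo[cols], y_demo)  # training-set metrics

results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```

On the training data, least squares with more features can never have a higher MSE than with a subset of them, which matches the slightly lower training MSE of the full model above. The comparison that matters is the one on the test set.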
