# BA 476 Lab 3: Cross-validation and intuition

Today we will look at how to do cross-validation in Python and build some intuition about what happens when some of our implicit assumptions about our data break down. Specifically, we will look at what happens when we train on datasets that are too small or non-representative, and at how to include interaction terms in our models.

## Background

We will continue using the dataset provided by Cogo Labs that we've used in previous labs. Recall that we are trying to predict customers' email open rates.

## Setup

Let's start by importing the necessary libraries and mounting Google Drive:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import the data

The dataset we will use is the same one we used for the labs on descriptive analytics; refer to the earlier
descriptions for details. Let’s load the data. Change the path below to accurately reflect the location
of the data on your Drive.

In [None]:
df = pd.read_csv('/content/drive/My Drive/ba476-test/data/cogo-all.tsv', sep='\t')

### Train and Test Sets

We start by splitting the data into a training and testing set. We've done this manually before, but today we'll use Scikit-learn's `train_test_split` function.

In [None]:
predictors = ["browser1", "browser2", "browser3", "activity_observations", "activity_days", "activity_recency", "activity_locations"]
#X = df.loc[:, df.columns != "p_open"]
X = df.loc[:, predictors]
y = df["p_open"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((230638, 7), (57660, 7), (230638,), (57660,))


Note that if you did not set `random_state` (or set it to a different value), your results may look different, since the rows are assigned to the training and test sets at random.

## Lasso Regression (recap)

We've already seen how to train a lasso regression model, inspect its coefficients and compute mean squared error.

In [None]:
# Create a lasso regression object
# Train the model
# Predict
# Evaluate
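
As a reminder of the four steps, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the value `alpha=0.01` are assumptions chosen for illustration; in the lab you would use `X_train`, `y_train`, `X_test` and `y_test` from above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption): 3 predictors, the third is irrelevant
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(500, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

# Create a lasso regression object (alpha picked arbitrarily for the demo)
lasso = Lasso(alpha=0.01)
# Train the model
lasso.fit(X_tr, y_tr)
# Predict
y_pred = lasso.predict(X_te)
# Evaluate
mse = mean_squared_error(y_te, y_pred)
print("coefficients:", lasso.coef_)
print("test MSE:", mse)
```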

## Cross-validation from scratch

Implement cross-validation (CV) for model evaluation to get a better estimate of your model's out-of-sample performance. Start with a function that returns the training and validation folds for each iteration.

In [None]:
# write a function that returns the training/validation folds
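
One possible shape for such a function is sketched below. The name `kfold_indices` and its arguments are hypothetical; it shuffles the row positions once, cuts them into `k` roughly equal folds, and yields one (training, validation) pair of index arrays per iteration.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs of row positions for k-fold CV."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # shuffle so folds are random
    folds = np.array_split(shuffled, k)     # k folds of (almost) equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Each row appears in exactly one validation fold
for train_idx, val_idx in kfold_indices(10, k=3):
    print(len(train_idx), len(val_idx))
```

Because the function returns positions rather than labels, you would use `.iloc` to select rows when `X` and `y` are pandas objects.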

### Model evaluation

Now we can do cross-validation on our lasso model. Print the MSE per iteration as well as the final estimate of performance (the average over the folds).

In [None]:
# use your function to do cv for model evaluation on a lasso model
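
The loop might look roughly like the sketch below. It reuses a hypothetical `kfold_indices` helper and synthetic data (both assumptions), fits a fresh lasso model on each training fold, and averages the validation MSEs.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

def kfold_indices(n_samples, k=5, seed=0):
    """Hypothetical helper: yield (train_idx, val_idx) for each of k folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n_samples), k)
    for i in range(k):
        yield np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i]

# Synthetic stand-in for the training data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

fold_mse = []
for i, (train_idx, val_idx) in enumerate(kfold_indices(len(X_demo), k=5)):
    # Fit on the training folds only, then score on the held-out fold
    model = Lasso(alpha=0.01).fit(X_demo[train_idx], y_demo[train_idx])
    mse = mean_squared_error(y_demo[val_idx], model.predict(X_demo[val_idx]))
    fold_mse.append(mse)
    print(f"fold {i}: MSE = {mse:.5f}")

# The CV estimate is the average of the per-fold errors
cv_mse = float(np.mean(fold_mse))
print(f"CV estimate of MSE: {cv_mse:.5f}")
```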

### Parameter tuning
We can do something similar when tuning a parameter. Remember to evaluate your model on the test set after tuning the parameter.

In [None]:
# tune the alpha parameter of the lasso model

In [None]:
# after selecting a parameter, evaluate the tuned model on the test set
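
A sketch of the whole tuning workflow, under some assumptions: the data is synthetic, the candidate `alphas` are arbitrary, and sklearn's `KFold` stands in for the hand-written fold function to keep the example short. The test set is touched only once, after the parameter has been chosen.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the lab data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(400, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

alphas = [0.001, 0.01, 0.1, 1.0]   # candidate values (arbitrary grid)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mse = {}
for alpha in alphas:
    fold_mse = []
    for train_idx, val_idx in kf.split(X_tr):
        model = Lasso(alpha=alpha).fit(X_tr[train_idx], y_tr[train_idx])
        fold_mse.append(mean_squared_error(y_tr[val_idx], model.predict(X_tr[val_idx])))
    cv_mse[alpha] = float(np.mean(fold_mse))

# Pick the alpha with the lowest CV error, refit on ALL training data
best_alpha = min(cv_mse, key=cv_mse.get)
final_model = Lasso(alpha=best_alpha).fit(X_tr, y_tr)

# Only now evaluate on the held-out test set
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
print("best alpha:", best_alpha, "test MSE:", test_mse)
```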

## Cross-validation in sklearn
We discussed how cross-validation gives us more accurate error estimates by repeatedly treating a different subset of our data as the validation set. Scikit-learn has built-in functions for cross-validation.

We will use the `cross_val_score` function, which has three required arguments when doing supervised learning: an estimator (the model), your data (`X`), and your outcomes (`y`). The optional argument `cv` lets you set the number of folds you want to use.
The `scoring` argument lets you pick one of several built-in scoring rules so that you don't have to compute the error by hand. The available scoring rules are discussed [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

In [None]:
#use cross_val_score to evaluate a linear regression model

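Notice that with `scoring="neg_mean_squared_error"` it returns the negative mean squared error (this is easy enough to negate); if you leave `scoring` unset, you get the estimator's default score instead, which is $R^2$ for regression models. Now that we have the error on every fold, we can compute our final error estimate by averaging.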

In [None]:
#
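
A minimal sketch of both steps on synthetic data (the arrays `X_demo` and `y_demo` are assumptions): request negative MSE explicitly, flip the sign, and average over the folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# One score per fold; sklearn scores follow "higher is better", hence the negative sign
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=5, scoring="neg_mean_squared_error")

fold_mse = -scores            # negate to get ordinary MSE per fold
cv_mse = fold_mse.mean()      # final CV estimate
print(fold_mse, cv_mse)
```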

### Cross-validation for parameter selection

Recall that Lasso and Ridge have a regularization parameter $\lambda$ (called `alpha` in scikit-learn) which must be tuned. One way to tune it is to call the `cross_val_score` function several times, once for each of the parameter values you are considering. You should try to implement this to make sure you understand the steps.

In [None]:
# use cross_val_score for parameter tuning

Now that we have our cross-validation estimates for each parameter, we should train a model on the entire training set using the best parameter. This is the model we will evaluate on the test set.
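
Put together, the tuning loop and the final refit could look like the sketch below. The synthetic data and the grid of `alphas` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the lab data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(400, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

alphas = [0.001, 0.01, 0.1, 1.0]   # candidate values (arbitrary grid)
cv_mse = []
for alpha in alphas:
    scores = cross_val_score(Lasso(alpha=alpha), X_tr, y_tr,
                             cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())   # average MSE across folds for this alpha

# Best parameter = lowest CV error; refit on the entire training set
best_alpha = alphas[int(np.argmin(cv_mse))]
final_model = Lasso(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
print("best alpha:", best_alpha, "test MSE:", test_mse)
```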

## LassoCV, RidgeCV

The above process can be used to tune any type of model with hyperparameters. Tuning a lasso/ridge model is very common, so it has been automated in sklearn. [`LassoCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) and [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) allow you to specify the candidate $\alpha$ values to try (for `LassoCV` you can alternatively give just the number of values) as well as the number of folds used for the cross-validation.

Repeat the tuning process above using `RidgeCV`.

In [None]:
#
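
A minimal sketch with `RidgeCV`, assuming synthetic data and an arbitrary grid of candidate values. After fitting, the selected value is available as `alpha_`, and the object itself is already refit on the full training data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(400, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]   # candidate values (arbitrary grid)

# 5-fold CV over the grid, then a refit with the best alpha
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
print("selected alpha:", ridge_cv.alpha_)

# The fitted object predicts with the selected alpha
test_mse = mean_squared_error(y_te, ridge_cv.predict(X_te))
print("test MSE:", test_mse)
```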

## Building intuition

Spend the remainder of the lab investigating what happens when some of our common assumptions fail. In particular, test what happens when:
1. you train on a training set with a very different distribution from the testing set (for example, by including only instances that satisfy a certain criterion in the training set). This will highlight the importance of random sampling.
2. the size of the training set is very small (or, more generally, how the number of instances you train on influences the quality of the model). This highlights the importance of having enough data.
3. you add predictors to a model. This should highlight that additional predictors increase the flexibility of the model.
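
To give a feel for the first two experiments, here is a sketch on synthetic data. Everything in it is an assumption made for illustration (the quadratic relationship, the cut-off `x < 1`, the sample sizes); on the lab data you would filter `df_train` on a real column and vary `train_sample_size` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)

# --- Experiment 1: non-representative training set ---
# True relationship is curved, so a line fitted on a narrow range extrapolates badly
x = rng.uniform(0, 3, size=2000)
y = x**2 + rng.normal(scale=0.1, size=2000)
X = x[:, np.newaxis]
X_test, y_test = X[:500], y[:500]
X_pool, y_pool = X[500:], y[500:]

only_small = X_pool[:, 0] < 1.0                       # biased selection rule
X_biased, y_biased = X_pool[only_small][:200], y_pool[only_small][:200]
X_random, y_random = X_pool[:200], y_pool[:200]       # pool is already in random order

mse_biased = mean_squared_error(
    y_test, LinearRegression().fit(X_biased, y_biased).predict(X_test))
mse_random = mean_squared_error(
    y_test, LinearRegression().fit(X_random, y_random).predict(X_test))
print("biased training set:", mse_biased, "random training set:", mse_random)

# --- Experiment 2: training set size ---
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_big = rng.normal(size=(3000, 5))
y_big = X_big @ beta + rng.normal(scale=1.0, size=3000)
X_te2, y_te2 = X_big[:1000], y_big[:1000]             # fixed test set

mse_by_size = {}
for n in [10, 50, 500, 2000]:
    # Train on the first n rows after the test block
    model = LinearRegression().fit(X_big[1000:1000 + n], y_big[1000:1000 + n])
    mse_by_size[n] = mean_squared_error(y_te2, model.predict(X_te2))
print(mse_by_size)
```

Both training sets in the first experiment have the same size, so any gap in test error comes from how the rows were selected rather than how many there are.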



### Data preparation
Let’s take a random subsample of the training set to speed up training. Then, when we are happy with the tuning of our algorithms, we can increase the size of the training set further.

In [None]:
X = df.loc[:, df.columns != "p_open"]
y = df["p_open"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

df_train = X_train.copy()

# Add p_open to the combined training dataframe
df_train["p_open"] = y_train

# Randomly sample 5000 rows from the training data
train_sample_size = 5000
df_train_sample = df_train.sample(n=train_sample_size, random_state=5)

df_train_sample.shape

(5000, 17)

In [None]:
# it's up to you to add more predictors as you see fit
predictors1 = ["browser1", "browser2", "browser3"]

X_train_sample_p1 = df_train_sample[predictors1]
# Convert to a NumPy array before adding an axis; indexing a Series with
# [:, np.newaxis] is deprecated and fails in recent pandas versions
y_train_sample = df_train_sample["p_open"].to_numpy()[:, np.newaxis]

X_test_p1 = X_test[predictors1]
y_train_sample.shape


(5000, 1)
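
For the third experiment, one way to add predictors is to include interaction terms. The sketch below is an illustration on synthetic data, not the lab's prescribed solution: the data-generating formula is an assumption, and `PolynomialFeatures(interaction_only=True)` is used to append all pairwise products to the original columns before fitting a linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): the outcome depends on the product of two predictors
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] + 2.0 * X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.1, size=1000))
X_tr, X_te = X_demo[:800], X_demo[800:]
y_tr, y_te = y_demo[:800], y_demo[800:]

# Model 1: main effects only
plain = LinearRegression().fit(X_tr, y_tr)

# Model 2: main effects plus all pairwise interaction terms
with_inter = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X_tr, y_tr)

# 3 original columns + 3 pairwise products
n_features = with_inter[0].n_output_features_

mse_plain = mean_squared_error(y_te, plain.predict(X_te))
mse_inter = mean_squared_error(y_te, with_inter.predict(X_te))
print(n_features, mse_plain, mse_inter)
```

The extra columns make the model more flexible, which helps here because the true relationship contains an interaction; on small training samples the same flexibility can also lead to overfitting.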