# Model Validation

![](https://www.mihaileric.com/static/model-selection-meme-bd4a6a86f615583d1a1bbc497ca4640e-67414.jpeg)

This notebook is inspired by:
[Hands on ML with Scikit-Learn and TensorFlow](https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb)

## Lesson Objectives


By the end of this lesson students will be able to: 

- Perform Linear Regression with Sklearn

- Transform target variables in linear regression

- Conduct Model Selection

- Conduct Model Assessment

## Reading in our Data

We are going to be using a dataset today that contains information about houses!  We are going to create a linear regression model to predict the prices of houses!

![](https://media3.giphy.com/media/3oeHLmEqXbVrQtaGBy/giphy.gif)


In [None]:
import os
import tarfile
import urllib.request  # standard library; no need for the Python 2 compatibility package `six`

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # the context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    
fetch_housing_data()

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

## Examining our data

Now that we have read in our data files let's take a quick look at what we have in our dataframe.

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

### Turn and Talk

In your group discuss the above histograms.  What do you notice?  What conclusions do you make?  Does anything seem odd?

## Split the data!  (aka Train-Test Split)

Before we look at anything else related to our data we will split it apart.  We want to avoid data leakage by making this split early on, so that we don't learn anything from the held-out data that could influence the decisions we make about our model.  The "test set" we create will simulate new and unseen data.  Once we create it (at the very beginning of our analysis) we will "forget about it" until the end of the analysis.

![](https://media2.giphy.com/media/l4FGyUShS5LTbbOmI/source.gif)

### Why do we split data as train/test?

__The generalization performance__ of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us __a measure of the quality of the ultimately chosen model.__

 Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data.

<img src  = img/train-test-validation.png width = 350/>

<img src=img/test_train_split.png width =450>

[Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)
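
To make the point about "repeating the labels" concrete, here is a minimal sketch. The synthetic data and the fully grown decision tree used as the "memorizing" model are assumptions chosen for illustration; they are not part of the housing analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): y is a noisy linear function of a single feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# An unconstrained tree can memorize every training label
memorizer = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = memorizer.score(X_tr, y_tr)  # scored on data the model has already seen
test_r2 = memorizer.score(X_te, y_te)   # scored on held-out data
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The training score looks perfect because the tree has memorized the noise, while the held-out score gives a more honest picture of how the model generalizes.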

####  Homemade Train-Test Split Function


In [None]:
import numpy as np

# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

#now let's split our data maintaining 20% in our test set
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

In [None]:
train_set.head()
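
One limitation of the homemade function is that it returns a different split every time it runs, so results are not reproducible. A possible variant is sketched here; the function name `split_train_test_seeded`, the `seed` parameter, and the toy dataframe are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

def split_train_test_seeded(data, test_ratio, seed=42):
    """Same idea as split_train_test, but with a seeded random generator."""
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Toy dataframe (assumption) standing in for the housing data
toy = pd.DataFrame({"x": range(10)})
train_a, test_a = split_train_test_seeded(toy, 0.2)
train_b, test_b = split_train_test_seeded(toy, 0.2)

# The same seed gives the same test rows on every call
print(test_a.index.tolist() == test_b.index.tolist())
```

This is the role that `random_state` plays in sklearn's splitting utilities.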

#### Train Test Split with Scikit-Learn

Luckily we won't need to write our own function, because sklearn provides the built-in [`train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
## import the function
from sklearn.model_selection import train_test_split

## as you can see it is a function
type(train_test_split)

In [None]:
## note that this function produces splits of the same sizes as our home-brewed function, although the selected rows are different
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), "train +", len(test_set), "test")

In [None]:
train_set.head()

### Stratified Splitting

In some cases you may want to do stratified sampling instead of random sampling when making your splits.  In this sampling method the population is divided into homogeneous groups called strata, and a proportionate number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.

![](https://faculty.elgin.edu/dkernler/statistics/ch01/images/strata-sample.gif)

In the case of our housing data we might have some expert knowledge that the median income is a very important feature in predicting median housing prices.  In this case we might want to ensure that our test set is representative of the various median income categories.  

In [None]:
## Let's bin the median_income column into 5 different bins.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].hist()
plt.draw()

In [None]:
housing["income_cat"].value_counts() / len(housing)

Now that we have our income categories we can do a stratified test-train split.  This will ensure that we have sampled a representative number of data for each income category when we make our split.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # split() yields positional indices, so select rows with .iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

strat_test_set.income_cat.value_counts() / len(strat_test_set)

NOTE: when we use the vanilla train-test split, the income distribution in the test set does not exactly align with that of the full dataset.

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
test_set.income_cat.value_counts()/len(test_set)

We can also stratify our split by using the `stratify` argument in train_test_split.

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42, stratify=housing.income_cat)
test_set.income_cat.value_counts()/len(test_set)
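
A small self-contained sketch of what `stratify` does, using a toy dataframe with an 80/20 group imbalance (the dataframe and its column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): 80 rows in group "a", 20 rows in group "b"
toy = pd.DataFrame({"value": range(100), "group": ["a"] * 80 + ["b"] * 20})

strat_train, strat_test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["group"]
)

# normalize=True returns proportions, equivalent to value_counts() / len(...)
print(toy["group"].value_counts(normalize=True))
print(strat_test["group"].value_counts(normalize=True))
```

Both printouts show the same group proportions, because the stratified split samples each group in proportion to its share of the data.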

Finally, let's compare the different splitting techniques. 

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

compare_props

In the above table we see that the stratified technique generates a test set whose income proportions are nearly identical to the overall proportions, whereas the randomly generated test set is skewed.

## Prepare the data for Machine Learning

Let's do some data cleaning and feature engineering to prepare the data for modeling.

In [None]:
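## Drop rows with missing values first (sklearn's LinearRegression cannot handle NaNs)
train_set = train_set.dropna()

housing = train_set.drop("median_house_value", axis=1) # drop labels for training set
y = train_set["median_house_value"].copy()

In [None]:
## Let's check for any remaining rows with missing values:
sample_incomplete_rows = train_set[train_set.isnull().any(axis=1)]
sample_incomplete_rows

Great! After dropping the incomplete rows, we don't have any missing values left!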

Now let's separate out the numeric columns from the categorical column (`ocean_proximity`).

In [None]:
housing_num = housing.drop('ocean_proximity', axis=1)

Finally let's preprocess the categorical input features by dummy coding them!  We will use the `OneHotEncoder` option in sklearn to do this.

In [None]:
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

In [None]:
# OneHotEncoder handles string categories directly in Scikit-Learn >= 0.20
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
housing_cat_1hot.toarray()

Arguments we might want to adjust in `OneHotEncoder`:

`drop='first'` :  this will drop the first category when creating the dummy coding, so that the dropped category becomes the reference group

`sparse_output=False` (`sparse=False` in Scikit-Learn versions before 1.2) : this will return a dense array instead of a sparse matrix

And if we want to connect the categories back to our array we can access the `categories_` attribute, which returns a list with one array of category labels per encoded feature.
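
A minimal sketch of these options on a toy column (the four-row dataframe is an assumption for illustration; `.toarray()` is used so the example does not depend on the name of the sparse argument):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (assumption) with three distinct categories
toy_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# Default behaviour: one column per category
full_encoder = OneHotEncoder()
full = full_encoder.fit_transform(toy_cat).toarray()

# drop="first": the first category (alphabetically) becomes the reference group
ref_encoder = OneHotEncoder(drop="first")
reduced = ref_encoder.fit_transform(toy_cat).toarray()

print(full_encoder.categories_[0])  # category labels, in column order
print(full.shape, reduced.shape)    # the reduced array has one column fewer
```

With `drop="first"`, a row of all zeros represents the reference category.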


Let's now join the categorical and numeric features back together in a dataframe with labels so we know which column is which!

In [None]:
X = np.c_[(housing_num, housing_cat_1hot.toarray())]

In [None]:
X.shape

In [None]:
cols = housing_num.columns.tolist() + cat_encoder.categories_[0].tolist()
housing_tr = pd.DataFrame(X, columns=cols)
housing_tr.head()

In [None]:
X_train = housing_tr
y_train = y

## Baseline Regression model

Now that we have prepared our data we can start our modeling process by running a baseline model. A baseline model is a model with NO predictors.  In regression it simply predicts the mean (or median) of the y-variable for every observation.  We can create a baseline model in sklearn using the [`DummyRegressor` object](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html).

In [None]:
from sklearn.dummy import DummyRegressor

dummy = DummyRegressor()  # by default this will use the mean

dummy.fit(X_train, y_train)

dummy.score(X_train, y_train) # the score of a regression model is the r-squared value

In [None]:
y_pred = dummy.predict(X_train) #making predicted values of y based on our x values and our model

We can also look at the RMSE of our dummy model as another way to assess the model fit.

In [None]:
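from sklearn.metrics import mean_squared_error

# RMSE is the square root of the MSE; computing it this way works across Scikit-Learn versions
# (the `squared=False` argument has been removed from recent releases)
dummy_rmse = np.sqrt(mean_squared_error(y_train, y_pred))
dummy_rmse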

The RMSE can be interpreted as the typical size of a prediction error: it is the square root of the average squared difference between the actual and predicted values. This value is in the units of the y variable, so in our case the baseline's predicted house value is typically off by about 115,619 dollars. That is a lot!  Let's see if we can do better!
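
As a small worked example of how the baseline's metrics behave, the sketch here uses five made-up target values (an assumption for illustration). For a mean-only model, the R-squared on the training data is 0 and the RMSE equals the standard deviation of y.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up target values (assumption); the features are ignored by DummyRegressor
y_toy = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)  # every prediction is the mean of y_toy

rmse = np.sqrt(mean_squared_error(y_toy, pred))
print(rmse, np.std(y_toy))           # the two values match
print(baseline.score(X_toy, y_toy))  # R-squared of the mean-only model
```

This is why the baseline is a useful yardstick: any model with real predictors should reach an R-squared above 0 and an RMSE below the spread of y.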

## Linear Regression model

Now that we know our baseline results we can move on to modeling with our set of features.  The goal here is that we want to create a model that is better than our baseline model!

In [None]:
## import necessary tools
from sklearn.linear_model import LinearRegression

## prepare(Instantiate) LinearRegression to use
lr = LinearRegression()

## coefficients are learnt and stored in "lr" at this step
lr.fit(X_train, y_train)

Unlike the `ols` function in statsmodels, the sklearn implementation does not have a summary table.  The `lr` object has all the information we need for this linear regression problem, but we have to dig a little.

In [None]:
## Check coefficients

lr.coef_

In [None]:
## check the intercept of the model

lr.intercept_
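
Since `coef_` is just an array, it helps to pair each coefficient with its feature name. A minimal sketch on toy data follows; the column names and the noise-free target are assumptions for illustration. In this notebook the analogous call would be something like `pd.Series(lr.coef_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy features (assumption) and a noise-free target with known coefficients
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"rooms": rng.normal(size=100), "income": rng.normal(size=100)})
y_toy = 2.0 * X_toy["rooms"] + 5.0 * X_toy["income"] + 1.0

toy_lr = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(toy_lr.coef_, index=X_toy.columns, name="coefficient")
print(coef_table)
print("intercept:", toy_lr.intercept_)
```

Because the toy target has no noise, the fitted coefficients recover the values used to build it (2 and 5, with an intercept of 1), up to floating point error.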

In [None]:
y_pred = lr.predict(X_train)

In [None]:
lr.score(X_train, y_train)

In [None]:
# RMSE on the training data (square root of the MSE)
lr_rmse = np.sqrt(mean_squared_error(y_train, y_pred))
lr_rmse

### Turn and Talk!

How did our linear regression model with our features do in relation to the dummy model?  Was it better?  Worse?  How do you know?

##  Test Set!

Now let's pretend this linear regression model is our "final" model (aka. we have tweaked the model and we have determined this one is the best according to our metrics).  Now let's use that model to make predictions using our test data!

The first thing we need to do is to transform our test data in the same way as we transformed our training data.

In [None]:
test_set.isna().sum()

In [None]:
test_set = test_set.dropna()

In [None]:
housing_test = test_set.drop("median_house_value", axis=1)
y_test = test_set['median_house_value'].copy()

In [None]:
housing_test_num = housing_test.drop('ocean_proximity', axis=1)
housing_test_cat = housing_test[['ocean_proximity']]
## NOTE: we call transform (not fit_transform) so the encoder fitted on the training data is reused
housing_test_cat_1hot = cat_encoder.transform(housing_test_cat)
X_test = np.c_[(housing_test_num, housing_test_cat_1hot.toarray())]
X_test = pd.DataFrame(X_test, columns=cols)
X_test
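
To see why the encoder should only be fitted on the training data, consider this sketch with toy data (the two small dataframes are assumptions for illustration). The test set happens to contain only one of the three categories seen in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): the test set is missing two of the training categories
train_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND"]})
test_cat = pd.DataFrame({"ocean_proximity": ["INLAND", "INLAND"]})

# Correct: fit on the training data, then only transform the test data
enc = OneHotEncoder().fit(train_cat)
test_ok = enc.transform(test_cat).toarray()

# Incorrect: re-fitting on the test data changes the column layout
test_wrong = OneHotEncoder().fit_transform(test_cat).toarray()

print(test_ok.shape, test_wrong.shape)
```

Only `test_ok` has the same three columns the model was trained on, so only it can be passed to the fitted model.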

Now we are ready to make predictions!  __Note:  We don't fit our model on our test data!  We just use the already-fitted model to predict!__

In [None]:
y_pred = lr.predict(X_test)

In [None]:
lr.score(X_test, y_test)

In [None]:
# RMSE on the test data (square root of the MSE)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
lr_rmse

## Model Selection

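It is important to note that there are in fact two separate goals that we might have in mind:

- __Model selection:__ estimating the performance of different models in order to choose the best one.
- __Model assessment:__ having chosen a final model, estimating its prediction error on new data.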

<img src = 'img/model_selection.png' width = 550/>

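If we have plenty of data, the best approach is to randomly divide the dataset into three parts: a training set used to fit the models, a validation set used to compare models and choose between them, and a test set used to assess the final chosen model.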

<img src = 'img/validation_ratio.png' width = 550/>
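
One way to produce such a three-way split is to call `train_test_split` twice, as in this sketch. The toy arrays and the 60/20/20 ratio are assumptions chosen for illustration; the right ratio depends on how much data you have.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Step 1: hold out 20% of all rows as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into train and validation
# (0.25 of the remaining 80% is 20% of the original data)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Models are fitted on the training part, compared on the validation part, and the test part is touched only once at the very end.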

![](./img/cross-validation.png)

![](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

[Cross-validation-sklearn](https://scikit-learn.org/stable/modules/cross_validation.html)

In [None]:
from sklearn.model_selection import cross_val_score
lr = LinearRegression()
# use the prepared training features and labels; the test set stays untouched
cross_val_score(estimator=lr, X=X_train, y=y_train, cv=5)

The above output displays the score (R-squared, the default for regressors) for each of the 5 folds.  We want these numbers to be similar across folds.  If they fluctuate a lot, then the model's performance depends heavily on which rows it was trained on, which can be a sign that we have overfit our model.
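
A common way to summarize the fold scores is by their mean and standard deviation. The sketch here uses synthetic data (an assumption for illustration) and also shows how to get an RMSE per fold through the `scoring` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption): 3 features plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Default scoring for a regressor is R-squared
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print("R^2 per fold:", np.round(r2_scores, 3))
print(f"mean = {r2_scores.mean():.3f}, std = {r2_scores.std():.3f}")

# sklearn scorers follow "higher is better", so MSE is reported as a negative number
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
rmse_scores = np.sqrt(-neg_mse)
print("RMSE per fold:", np.round(rmse_scores, 3))
```

A small standard deviation relative to the mean suggests the model behaves consistently across folds.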

There are other functions in sklearn with slightly different features for cross-validation, such as `cross_validate`:

[`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate)
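
A short sketch of `cross_validate` on synthetic data (the data is an assumption for illustration). Unlike `cross_val_score`, it returns a dictionary that includes timing information and, optionally, the training scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 2))
y_demo = 4.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

results = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5, return_train_score=True
)

print(sorted(results.keys()))
print("train R^2:", np.round(results["train_score"], 3))
print("valid R^2:", np.round(results["test_score"], 3))
```

Comparing `train_score` with `test_score` fold by fold is a quick overfitting check: a training score far above the validation score is a warning sign.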

## Your turn!!

In your group, using the cleaned_movie_data.csv run a multiple linear regression model to predict gross revenue. Start with 3 continuous variables and one categorical predictor.

- Be sure to do a train-test split of your data
- Use the training data to build and adjust your model- be sure to start with a baseline model and iterate from there
- Make sure to look at evaluation metrics as you build your model
- Once you have a "final" model, make new predictions with your test data and evaluate it!

In [None]:
# your code here

## Extra Readings

[Why train-validation-test splitting?](https://medium.com/datadriveninvestor/data-science-essentials-why-train-validation-test-data-b7f7d472dc1f)