<h1>Train Test Split</h1>

Notebook Goals

* Learn how to simulate how a model would perform on new data using train test split

<h2> What is train test split?</h2> 

A goal of supervised learning is to build a model that performs well on new data. If you had new data, you could measure the model's performance on it directly. Often you do not have any, but you can simulate the situation with a procedure such as train test split.

Here is how the procedure works: 

1. Split the dataset into two pieces: a **training set** and a **test set**. Typically, about 75% of the data goes to the training set and 25% goes to the test set.
2. Train the model on the **training set**.
3. Test the model on the **test set** and evaluate its performance.
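
The splitting step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements it: the helper name `simple_train_test_split`, its parameters, and the toy `rows` list are all hypothetical.

```python
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = round(len(rows) * test_fraction)   # size of the held-out set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(100))        # stand-in for 100 observations
train, test = simple_train_test_split(rows)
print(len(train), len(test))   # 75 25
```

Every observation ends up in exactly one of the two pieces, so the model is never evaluated on a row it was trained on.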

<h2> Import Libraries</h2>

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

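# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version
from sklearn.datasets import load_boston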

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

<h2> Load the Dataset </h2>
The Boston house-price dataset is one of the datasets that ship with scikit-learn, so it does not need to be downloaded from an external website. The code below loads the Boston dataset.

In [2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

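| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |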


<h2>  Remove or Impute Missing Values </h2>
If you want to build models with your data, null values are (almost) never allowed. Always check how many samples have missing values and in which columns.

In [3]:
# Look at the shape of the dataframe
df.shape

(506, 14)

In [4]:
# There are no missing values in the dataset
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64
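
The Boston data has no gaps, so nothing needs to be removed or imputed here. As a hedged sketch of what the two options in the heading could look like when values are missing, the toy dataframe below (made up for illustration, not taken from the dataset) is handled once with `dropna` and once with mean imputation via `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data (not the Boston dataset): one missing value in each column
toy = pd.DataFrame({"RM": [6.5, np.nan, 7.1, 6.0],
                    "LSTAT": [4.9, 9.1, np.nan, 5.3]})

print(toy.isnull().sum())          # number of missing values per column

dropped = toy.dropna()             # option 1: drop rows with any missing value
imputed = toy.fillna(toy.mean())   # option 2: fill each gap with the column mean

print(dropped.shape)               # (2, 2)
print(imputed.isnull().sum().sum())  # 0
```

Dropping rows is simple but discards data; imputing keeps every row at the cost of inserting estimated values.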

<h2> Arrange Data into Features Matrix and Target Vector </h2>
What we are predicting is the continuous column "target", which is the median value of owner-occupied homes in $1000s.

In [5]:
X = df.loc[:, ['RM', 'LSTAT', 'PTRATIO']].values

In [6]:
y = df.loc[:, 'target'].values

<h2> Train Test Split </h2>

![images](images/trainTestSplitBoston.png)
The colors in the image indicate which variable (X_train, X_test, y_train, y_test) the data from the dataframe df went to for a particular train test split (not necessarily the exact split of the code below).

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
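
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. The sketch below uses placeholder arrays (zeros, with the same 506 rows as the dataframe above) only to show the resulting shapes; the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataframe above
X_demo = np.zeros((506, 3))
y_demo = np.zeros(506)

# Default: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)
print(X_tr.shape, X_te.shape)    # (379, 3) (127, 3)

# An explicit 80/20 split
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2)
print(X_tr80.shape, X_te20.shape)
```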

<h2> Linear Regression</h2>

<b>Step 1:</b> Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes.

In [8]:
# This was already imported earlier in the notebook so commenting out
#from sklearn.linear_model import LinearRegression

<b>Step 2:</b> Make an instance of the Model

In [9]:
# Make a linear regression instance
reg = LinearRegression(fit_intercept=True)

<b>Step 3:</b> Train the model on the data, storing the information learned from the data

The model learns the relationship between X and y.

In [10]:
reg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
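
For linear regression, the "information learned" is a set of coefficients and an intercept, stored on the fitted instance as `coef_` and `intercept_`. The sketch below fits synthetic data that follows y = 2x + 1 exactly (an assumption made purely for illustration) so the learned values are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly (for illustration only)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1

demo_reg = LinearRegression().fit(X_demo, y_demo)

# What the model learned is stored on the fitted instance
print(demo_reg.coef_)        # slope, close to [2.]
print(demo_reg.intercept_)   # intercept, close to 1.0
```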

<b>Step 4:</b> Predict the values for new data, using the information the model learned during training

Predict for One Observation

In [11]:
# Input needs to be two-dimensional (reshape(1, -1) turns one observation into a single row)
reg.predict(X_train[0].reshape(1,-1))

array([21.98928961])
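
To see why the reshape is needed, it helps to look at the shapes involved. The small array below is made up for illustration; indexing a single row returns a one-dimensional array, while `reshape(1, -1)` or a slice keeps the two-dimensional layout of (observations, features).

```python
import numpy as np

X_demo = np.array([[6.5, 4.9, 15.3],
                   [6.4, 9.1, 17.8]])   # 2 observations, 3 features

one_row = X_demo[0]
print(one_row.shape)                    # (3,)   one-dimensional
print(one_row.reshape(1, -1).shape)     # (1, 3) one observation, three features
print(X_demo[0:1].shape)                # (1, 3) slicing keeps two dimensions
```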

Predict for Multiple Observations at Once

In [12]:
reg.predict(X_train[0:10])

array([21.98928961, 23.8327555 , 27.94686035, 21.74260868, 32.74726059,
       24.31528369, 11.9688125 , 18.85759571,  8.51392924, 17.82242859])

<h2> Measuring Model Performance</h2> 

By measuring model performance on the test set, you can estimate how well your model is likely to perform on new data (out-of-sample data).

In [13]:
# score returns R^2 (the coefficient of determination) on the test set
score = reg.score(X_test, y_test)
print(score)

0.7200532651395151


In [14]:
# For comparison, R^2 on the training set
score = reg.score(X_train, y_train)
print(score)

0.662877892436671
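
For regressors, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal pure-Python sketch of that formula follows; the helper name `r_squared` and the toy values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))                 # 1.0 (perfect predictions)
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))   # 0.0 (always predicting the mean)
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.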


## Common questions

<h3>How could the model perform better on the test set when it was trained on the training set?</h3>

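Linear regression is a high-bias model: no matter what the data look like, it will do its best to learn a linear relationship, so it is unlikely to overfit the training set. With a dataset this small, the test score can also come out higher simply because of which rows landed in each split. With a high-variance model like a decision tree, the result might have been different.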

In [26]:
"""
Note that we could also do a lot of other things to improve performance like add more features
"""

lin_reg = LinearRegression()
dt_reg = DecisionTreeRegressor()

lin_reg.fit(X_train, y_train)
dt_reg.fit(X_train, y_train)

# Score each model on both the training set and the test set
lin_reg_train_score = lin_reg.score(X_train, y_train)
lin_reg_test_score = lin_reg.score(X_test, y_test)
dt_reg_train_score = dt_reg.score(X_train, y_train)
dt_reg_test_score = dt_reg.score(X_test, y_test)

print(f'Linear regression train score: {lin_reg_train_score}')
print(f'Linear regression test score: {lin_reg_test_score}')
print(f'Decision tree regressor train score: {dt_reg_train_score}')
print(f'Decision tree regressor test score: {dt_reg_test_score}')

Linear regression train score: 0.662877892436671
Linear regression test score: 0.7200532651395151
Decision tree regressor train score: 1.0
Decision tree regressor test score: 0.659800937446445
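
A training score of 1.0 next to a much lower test score is the usual sign of a high-variance model memorizing its training data. One common way to rein this in is to limit the tree's depth. The sketch below compares an unrestricted tree with `max_depth=3` on synthetic data (a noisy sine curve, chosen as an assumption for illustration, not the Boston data); exact scores will depend on the data and seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (for illustration only): a noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for depth in (None, 3):   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])   # (train R^2, test R^2)
```

The unrestricted tree fits the training rows essentially perfectly, while the shallow tree gives up some training accuracy in exchange for a simpler model.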


<h3>What is random state?</h3>

The `random_state` parameter is a seed for the pseudo-random number generator that shuffles the data before it is split. Fixing it gives you the same train test split every time you run the code. This is useful for testing that your model was built correctly, and for tutorials and talks, since you get exactly the same results as the person giving the tutorial. However, a fixed seed shows you only one particular split, so it is recommended you remove it (or try several values) if you are trying to see how well the model generalizes to new data.

You can see this in the BostonSplit.ipynb notebook.
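
As a small, hedged illustration of the reproducibility point (using made-up arrays rather than the Boston data), splitting twice with the same `random_state` yields identical pieces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed: identical split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=2)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=2)
print(np.array_equal(a_test, b_test))    # True

# A different seed will usually produce a different split
c_train, c_test, _, _ = train_test_split(X_demo, y_demo, random_state=3)
print(np.array_equal(a_test, c_test))
```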

<h3>How was the train test split image made?</h3>

You can see this in the BostonSplit.ipynb notebook. In short, it is essentially pure [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) styling.