A goal of supervised learning is to build a model that performs well on new data. If I had new data, I could measure how my model performs on it. The problem is that I may not have new data yet, but I can simulate the experience with a train test split.

## What is `train_test_split`

1. Split the dataset into two pieces: a **training set** and a **testing set**. By default, `train_test_split` puts 75% of the data in the training set and 25% in the testing set.
2. Train the model on the **training set**.
3. Test the model on the **testing set** and evaluate its performance.
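
The three steps above can be sketched end to end. This is a minimal illustration on synthetic data: the data from `make_regression`, the variable names, and the `random_state` values are assumptions made for the example and are not part of the walkthrough that follows.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 200 rows, 3 features
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 1. Split into a training set and a testing set (default: 75% / 25%)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# 2. Train the model on the training set only
demo_model = LinearRegression().fit(X_tr, y_tr)

# 3. Evaluate on the held-out testing set
demo_score = demo_model.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print(demo_score)
```

The model never sees the testing rows during `fit`, which is what lets the final score stand in for performance on new data.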



## Import Libraries

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

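# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn version.
from sklearn.datasets import load_boston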

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression



## Load the Dataset
The Boston house-price dataset is one of the datasets that ships with scikit-learn (in versions before 1.2), so loading it does not require downloading a file from an external website. The code below loads the Boston dataset.

The goal is to predict house prices from features such as the number of rooms.

In [2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

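|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |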


In [3]:
# Create the features matrix 
X = df.loc[:, ['RM', 'LSTAT', 'PTRATIO']].values

In [4]:
# Create the target vector
y = df.loc[:, 'target'].values

## Train Test Split 

![images](../images/trainTestSplitBoston.png)

The colors in the image indicate which variable (`X_train`, `X_test`, `y_train`, `y_test`) each part of the DataFrame `df` goes to for a particular train test split (not necessarily the exact split produced by the code below).

In [5]:
# Default split: 75% training, 25% testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
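
To make the behavior of the split concrete, the sketch below uses a small stand-in array (hypothetical data, not the Boston features) to show the default 75/25 sizes, the `test_size` parameter, and that a fixed `random_state` reproduces the same split. All variable names here are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 2 features, and a matching target.
# Row i of X_small is [2*i, 2*i + 1] and y_small[i] is i, so pairing is easy to check.
X_small = np.arange(200).reshape(100, 2)
y_small = np.arange(100)

# Default behavior: 25% of the rows are held out for testing
a_train, a_test, b_train, b_test = train_test_split(X_small, y_small, random_state=3)
print(len(a_train), len(a_test))  # 75 25

# test_size changes the proportion; here 20% is held out
c_train, c_test, d_train, d_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=3
)
print(len(c_train), len(c_test))  # 80 20

# The same random_state gives the same split every time
a_train2, a_test2, b_train2, b_test2 = train_test_split(X_small, y_small, random_state=3)
print(np.array_equal(a_test, a_test2))  # True

# Rows of X and entries of y stay paired after shuffling
print(np.array_equal(a_train[:, 0] // 2, b_train))  # True
```

Without a `random_state`, each call shuffles differently, so the score measured on the testing set can vary from run to run.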

## Linear Regression Model

In [6]:
# Make a linear regression instance
reg = LinearRegression(fit_intercept=True)

# Train the model on the training set.
reg.fit(X_train, y_train)

## Measuring Model Performance
By measuring model performance on the test set, I can estimate how well my model is likely to perform on new data (out-of-sample data).

In [7]:
# Test the model on the testing set and evaluate the performance
# (for a regressor, score returns the coefficient of determination, R^2)
score = reg.score(X_test, y_test)
print(score)

0.7155620757319656
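
The number printed above is an R² value: for a regressor such as `LinearRegression`, `score` returns the coefficient of determination, which is 1 minus the ratio of the residual sum of squares to the total sum of squares on the data passed in. The sketch below recomputes it by hand on synthetic data (the data and variable names are assumptions for illustration) and checks that it agrees with `score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
Xs, ys = make_regression(n_samples=200, n_features=3, noise=15.0, random_state=1)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, random_state=1)
model = LinearRegression().fit(Xs_train, ys_train)

# R^2 = 1 - SS_res / SS_tot, computed on the testing set
pred = model.predict(Xs_test)
ss_res = np.sum((ys_test - pred) ** 2)            # squared prediction errors
ss_tot = np.sum((ys_test - ys_test.mean()) ** 2)  # spread around the test-set mean
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(Xs_test, ys_test)))  # True
```

A score of 1.0 means perfect predictions on the testing set, 0 means the model does no better than always predicting the test-set mean, and the value can be negative for a model that does worse than that.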


So that's it: `train_test_split` helps you simulate how well a model would perform on new data.