# Split Data into Train and Test Data

After training a model, you should evaluate its performance to estimate how accurate it will be on new data. You cannot reuse the training data for this, because the model has already seen it and the resulting score would overestimate its performance on new data. Instead, train the model on one portion of the data (about 70%) and evaluate it on the rest. These portions are called the training set and the test set, respectively. This recipe shows how to do this with scikit-learn's `train_test_split()` function.
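To make the overestimate concrete, here is a minimal sketch on synthetic data. The data, the choice of `DecisionTreeRegressor`, and the variable names are assumptions made for illustration only and are not part of the housing recipe. It fits an unconstrained decision tree and compares the R² score on the data it was trained on with the score on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=500)

# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# An unconstrained tree can memorise the training rows, noise included
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # score on data the model has seen
test_r2 = model.score(X_test, y_test)     # score on data the model has not seen
print(f"R^2 on training data: {train_r2:.3f}")
print(f"R^2 on test data:     {test_r2:.3f}")
```

The training score comes out higher than the test score because the tree has fitted the noise in the training rows. Only the test score says something about new data.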

In [1]:
# Load packages
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
%config InlineBackend.figure_format = 'retina'

In [2]:
# Upload your data as CSV and load as data frame
df = pd.read_csv('housing.csv')
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


The dataset (source: [Kaggle](https://www.kaggle.com/camnugent/california-housing-prices)) describes housing in California; each row summarises one district rather than a single house. We will now split this dataset into four parts using scikit-learn's `train_test_split()`. These four parts are: `X_train`, `X_test`, `y_train` and `y_test`.

In [3]:
# The column that we would like to predict (the model output)
TO_PREDICT = 'median_house_value'
# The proportion of the data held out as the test set
TEST_SIZE = .30

def split_in_X_y(df):
    # Separate the data frame into the model inputs (X) and the output (y)
    X = df.loc[:, df.columns != TO_PREDICT]
    y = df[TO_PREDICT]

    return X, y

X, y = split_in_X_y(df)

# Split inputs and output into train and test sets; rows are shuffled by default
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
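Because `train_test_split()` shuffles the rows randomly, each run of the cell above produces a different split. If you need the same split every time, you can pass a fixed `random_state`. The sketch below uses a small made-up data frame (`toy`, with invented values) instead of the housing data, and the seed `42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the housing data frame (values are invented)
toy = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(100, 110),
    "median_house_value": range(1000, 1010),
})
X_toy = toy.drop(columns="median_house_value")
y_toy = toy["median_house_value"]

# Two calls with the same random_state give the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(X_train_a.shape, X_test_a.shape)        # rows and columns in each part
print(X_test_a.index.equals(X_test_b.index))  # same test rows in both splits
```

Checking the shapes of the four parts is a quick way to confirm that the proportions match `test_size` and that `X` and `y` still line up row by row.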

And that's it. The data is now split into four parts. Let's save each part to its own CSV file.

In [4]:
X_train.to_csv('X_train.csv')
y_train.to_csv('y_train.csv')
X_test.to_csv('X_test.csv')
y_test.to_csv('y_test.csv')
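Note that `to_csv()` also writes the row index by default. When such a file is read back without further arguments, the index shows up as an extra `Unnamed: 0` column, as in the table printed earlier. The following sketch reads the index back with `index_col=0`; the tiny frames, the file names and the temporary directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up parts with a shuffled index, as you would get after a split
X_part = pd.DataFrame({"feature_a": [10, 20, 30]}, index=[4, 0, 2])
y_part = pd.Series([1, 2, 3], index=[4, 0, 2], name="median_house_value")

with tempfile.TemporaryDirectory() as tmp:
    X_path = Path(tmp) / "X_part.csv"
    y_path = Path(tmp) / "y_part.csv"
    X_part.to_csv(X_path)  # the index is written as the first column
    y_part.to_csv(y_path)

    # index_col=0 restores the first column as the index
    X_back = pd.read_csv(X_path, index_col=0)
    # A saved Series comes back as a one-column data frame; select that column
    y_back = pd.read_csv(y_path, index_col=0)["median_house_value"]

print(X_back.index.tolist())
print(y_back.index.tolist())
```

Keeping the original index makes it possible to match rows in `X` and `y` files, or to trace a row back to the source data.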

### Useful links

[Housing dataset](https://www.kaggle.com/camnugent/california-housing-prices)  
[Library documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)