# Splitting Data into Training and Test Sets

<img src='https://developers.google.com/machine-learning/crash-course/images/PartitionTwoSets.svg'>

When you build a model, you train it on a dataset. After training, you need to test the model on data it has not seen before.

For this, you’ll need a dataset that is different from the training set you used earlier. But during the development phase you often don’t have enough data to set aside a separately collected test set.

In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing, and to do this before you start training your model.

But the question is, how do you split the data? Splitting a dataset by hand is tedious and error-prone, and you also have to make sure the split is random, so that neither set ends up with only one kind of row.

To help us with this task, the scikit-learn library provides a function called `train_test_split`. Using it, we can easily split a dataset into training and test sets in whatever proportion we choose.
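To make the idea of a random split concrete, here is a minimal pure-Python sketch that shuffles a dataset and cuts it in two. The helper name `manual_train_test_split` and its parameters are made up for illustration. It is not how scikit-learn implements `train_test_split`, only the general idea.

```python
import random

def manual_train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) lists."""
    shuffled = list(rows)                   # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = manual_train_test_split(range(10), test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two lists, and rerunning with the same `seed` gives the same split.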

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline
sns.set(rc={'figure.figsize': [10, 10]}, font_scale=1.3)

In [9]:
df = sns.load_dataset('tips')
df

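|  | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |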


**Let’s preprocess the categorical features first**

In [11]:
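# One-hot encode the categorical columns; dtype=int keeps 0/1 values (pandas 2.0+ defaults to booleans)
df = pd.get_dummies(df, columns=['sex', 'smoker', 'day', 'time'], drop_first=True, dtype=int)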
df

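|  | total_bill | tip | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 10.34 | 1.66 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | 21.01 | 3.50 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 23.68 | 3.31 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 24.59 | 3.61 | 4 | 1 | 1 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| 240 | 27.18 | 2.00 | 2 | 1 | 0 | 0 | 1 | 0 | 1 |
| 241 | 22.67 | 2.00 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 242 | 17.82 | 1.75 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |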
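To see what `drop_first=True` does in isolation, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the tips dataset). The `time` column is declared as a categorical with the order `Lunch`, `Dinner`, so the first level, `Lunch`, is the one that gets dropped.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
toy = pd.DataFrame({
    "time": pd.Categorical(["Lunch", "Dinner", "Dinner"], categories=["Lunch", "Dinner"]),
    "size": [2, 3, 4],
})

# One dummy column per category level, minus the first level.
encoded = pd.get_dummies(toy, columns=["time"], drop_first=True, dtype=int)

print(encoded.columns.tolist())         # ['size', 'time_Dinner']
print(encoded["time_Dinner"].tolist())  # [0, 1, 1]
```

A row with `time_Dinner == 0` can only be `Lunch`, so the dropped column carried no extra information.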


First, we extract the features `x` and the label `y`.

In [12]:
x = df.drop('tip', axis=1)
y = df['tip']

In [13]:
x

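|  | total_bill | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | time_Dinner |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 2 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 10.34 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | 21.01 | 3 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 23.68 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 24.59 | 4 | 1 | 1 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 3 | 0 | 1 | 0 | 1 | 0 | 1 |
| 240 | 27.18 | 2 | 1 | 0 | 0 | 1 | 0 | 1 |
| 241 | 22.67 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 242 | 17.82 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |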


In [14]:
x.shape

(244, 8)

In [15]:
y

0      1.01
1      1.66
2      3.50
3      3.31
4      3.61
       ... 
239    5.92
240    2.00
241    2.00
242    1.75
243    3.00
Name: tip, Length: 244, dtype: float64

In [16]:
y.shape

(244,)

We now have two separate objects: `x`, which holds the independent features, and `y`, which holds the dependent variable (the `tip` column).

We’ll now split `x` into two separate sets, `x_train` and `x_test`.

Similarly, we’ll split `y` into two sets as well, `y_train` and `y_test`.

Doing this with scikit-learn is very simple. Let’s look at the code:

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
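The two keyword arguments deserve a closer look: `test_size=0.2` asks for roughly 20% of the rows to go to the test set, and `random_state=42` fixes the shuffle so the split can be repeated. The sketch below illustrates both on synthetic arrays (`X_demo` and `y_demo` are stand-ins invented for this example, not the tips data), and also checks that feature rows stay paired with their labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: row i of X_demo is [2*i, 2*i + 1], and y_demo[i] is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state give the same split.
_, test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = np.array_equal(y_test_a, y_test_b)

# Features and labels are shuffled together, so each row keeps its own label.
still_paired = np.array_equal(test_a[:, 0], 2 * y_test_a)

print(same_split, still_paired)  # True True
```

Leaving `random_state` unset is also valid, but the rows in each set will then generally change from run to run.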

In [25]:
x_train

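|  | total_bill | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | time_Dinner |
|---|---|---|---|---|---|---|---|---|
| 228 | 13.28 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 208 | 24.27 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 96 | 27.28 | 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 167 | 31.71 | 4 | 0 | 1 | 0 | 0 | 1 | 1 |
| 84 | 15.98 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 106 | 20.49 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 14 | 14.83 | 2 | 1 | 1 | 0 | 0 | 1 | 1 |
| 92 | 5.75 | 2 | 1 | 0 | 1 | 0 | 0 | 1 |
| 179 | 34.63 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |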


In [26]:
x_train.shape

(195, 8)

In [27]:
x_test

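|  | total_bill | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | time_Dinner |
|---|---|---|---|---|---|---|---|---|
| 24 | 19.82 | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 6 | 8.77 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
| 153 | 24.55 | 4 | 0 | 1 | 0 | 0 | 1 | 1 |
| 211 | 25.89 | 4 | 0 | 0 | 0 | 1 | 0 | 1 |
| 198 | 13.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 176 | 17.89 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 192 | 28.44 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 124 | 12.48 | 2 | 1 | 1 | 0 | 0 | 0 | 0 |
| 9 | 14.78 | 2 | 0 | 1 | 0 | 0 | 1 | 1 |
| 101 | 15.38 | 2 | 1 | 0 | 1 | 0 | 0 | 1 |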


In [28]:
x_test.shape

(49, 8)
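The 195/49 split follows from simple arithmetic: 244 × 0.2 = 48.8, which is not a whole number of rows. The sketch below assumes that, for a fractional `test_size`, the test count is rounded up and the remaining rows go to training, which matches the shapes shown here.

```python
import math

n_samples = 244   # rows in the tips dataset
test_size = 0.2   # fraction requested for the test set

n_test = math.ceil(n_samples * test_size)  # 48.8 rounded up
n_train = n_samples - n_test               # everything else goes to training

print(n_train, n_test)  # 195 49
```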

In [29]:
y_train

228    2.72
208    2.03
96     4.00
167    4.50
84     2.03
       ... 
106    4.06
14     3.02
92     1.00
179    3.55
102    2.50
Name: tip, Length: 195, dtype: float64

In [30]:
y_train.shape

(195,)

In [31]:
y_test

24     3.18
6      2.00
153    2.00
211    5.16
198    2.00
176    2.00
192    2.56
124    2.52
9      3.23
101    3.00
45     3.00
233    1.47
117    1.50
177    2.00
82     1.83
146    1.36
200    4.00
15     3.92
66     2.47
142    5.00
33     2.45
19     3.35
109    4.00
30     1.45
186    3.50
120    2.31
10     1.71
73     5.00
159    2.00
156    5.00
112    4.00
218    1.44
25     2.34
60     3.21
18     3.50
119    2.92
97     1.50
197    5.00
139    2.75
241    2.00
75     1.25
127    2.00
113    2.55
16     1.67
196    2.00
67     1.00
168    1.61
38     2.31
195    1.44
Name: tip, dtype: float64

In [32]:
y_test.shape

(49,)
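Before moving on to training, it can be worth confirming that the four pieces fit together. The helper below, `split_report`, is a hypothetical name introduced for this sketch. It checks that features and labels share the same index in each set, that no row appears in both sets, and what share of rows ended up in the test set. It is demonstrated on toy data rather than the tips split.

```python
import pandas as pd

def split_report(x_train, x_test, y_train, y_test):
    """Summarize basic sanity checks for a train/test split (hypothetical helper)."""
    return {
        # Features and labels must refer to the same rows in each set.
        "aligned": x_train.index.equals(y_train.index) and x_test.index.equals(y_test.index),
        # No row may appear in both the training and the test set.
        "disjoint": x_train.index.intersection(x_test.index).empty,
        # Fraction of all rows that went to the test set.
        "test_share": len(x_test) / (len(x_train) + len(x_test)),
    }

# Toy frames standing in for a real split: 4 training rows, 1 test row.
toy_x = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0, 40.0, 50.0]})
toy_y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="tip")
report = split_report(toy_x.iloc[:4], toy_x.iloc[4:], toy_y.iloc[:4], toy_y.iloc[4:])
print(report)
```

The same function can be called on `x_train`, `x_test`, `y_train` and `y_test` from the cells above.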

# Great Work!