## Splitting The Data

The last part of data processing is making the appropriate splits between
training and testing data.

Typically the data ends up split into four variables:
X_train, X_test, y_train, and y_test.

### X And Y
The first part is separating the target feature from the rest of the data.
Conventionally, the target feature is put into a variable y and the rest of the data
is put into a variable X. (yes, it’s specifically capital X and lowercase y)

X holds the independent variables (the features), and y is the dependent variable, also called the target variable.
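
As a minimal sketch of this separation, assume a small made-up DataFrame whose target column is named `label`. The data, the name `toy`, and the column names are all hypothetical and only serve as illustration:

```python
import pandas as pd

# Hypothetical toy data: two feature columns and one target column.
toy = pd.DataFrame({
    "height": [1.2, 3.4, 2.2, 4.1],
    "width": [0.5, 0.7, 0.6, 0.9],
    "label": [0, 1, 0, 1],
})

X = toy.drop(columns="label")  # everything except the target
y = toy["label"]               # the target only

print(X.shape, y.shape)
```

Dropping the target by name is one option; slicing by position with `iloc` works as well when the target is known to be the last column.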

### Train And Test
The training set is the large majority of the data, and it is what the model learns from.
Afterwards we run the held-out test data through the fitted model and observe the performance.

Usually 75-80% of the data goes to training and 20-25% to testing.
Scikit-learn provides the train_test_split function to accomplish this.

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
X, y = np.arange(10).reshape((5, 2)), range(5)
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [3]:
list(y)

[0, 1, 2, 3, 4]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 

In [5]:
X_train

array([[4, 5],
       [0, 1],
       [6, 7]])

In [6]:
X_test

array([[2, 3],
       [8, 9]])

In [7]:
y_train

[2, 0, 3]

In [8]:
y_test

[1, 4]
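
Once the four variables exist, the usual workflow is to fit on the training portion only and then score on the test portion. The sketch below illustrates that idea under a few assumptions: it loads Iris through `sklearn.datasets.load_iris` and uses `KNeighborsClassifier` purely as a stand-in model, neither of which comes from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = KNeighborsClassifier()  # stand-in model; any estimator would do
model.fit(X_train, y_train)     # learn from the training data only

# Accuracy on data the model has never seen during fitting.
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The test set is never passed to `fit`, which is what makes the resulting score an estimate of performance on unseen data.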

### Time Series
However, there are cases where the data should not be split randomly.
For example, in time series data such as stock prices,
you’re generally trying to predict the next day’s or week’s value, so the data should be split so that
the training set contains the older samples and the test set contains the newer ones.
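
One way to get such a chronological split is to turn off shuffling in train_test_split. This is a sketch that assumes the rows are already sorted from oldest to newest; the ten-point series here is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical series: 10 observations already sorted oldest to newest.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False keeps the original order, so the last 20% becomes the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # the older samples
print(y_test)   # the newer samples
```

For cross-validation on ordered data, scikit-learn also offers `TimeSeriesSplit`, which may be worth a look depending on the problem.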

### Classification
Additionally, in classification, you want to make sure the split doesn’t distort the class balance.
For example, if variable y is a binary categorical variable with values 0 and 1 you don’t want the test set
to contain all the ones and the training set to contain all the zeros.

train_test_split has a stratify parameter that deals with this. If your y variable is 25% zeros and 75%
ones, stratify=y makes sure that both the training set and the test set keep roughly that 25%/75% ratio.

In [9]:
X, y = np.arange(10).reshape((5, 2)), [1,1,1,0,0]
y

[1, 1, 1, 0, 0]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

In [11]:
y_train

[0, 1, 1]

In [12]:
y_test

[1, 0]
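
With only five samples the ratios can only be approximate. On a larger, made-up label vector the effect of stratify is easier to see. The sketch below assumes 25 zeros and 75 ones and compares the share of ones before and after the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 25 zeros and 75 ones.
y = np.array([0] * 25 + [1] * 75)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# The labels are 0/1, so the mean is the share of ones in each set.
print(y.mean(), y_train.mean(), y_test.mean())
```

All three printed values should agree, since the class counts here divide evenly between an 80-sample training set and a 20-sample test set.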

## Exercise

In [22]:
#Using the Iris dataset, split the dataset into the 4 train and test variables.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
X=df.values[:,:-1]
y=df.values[:,-1]
print(X)
print("\n")
print(y)

[[5.1 3.5 1.4 0.2]
 [4.9 3.0 1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.0 3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.0 3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.0 1.4 0.1]
 [4.3 3.0 1.1 0.1]
 [5.8 4.0 1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.0 0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.0 3.0 1.6 0.2]
 [5.0 3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.0 3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.0 1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.0 3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.0 3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.0 1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.0 3.3 1.4 0.2]
 [7.0 3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [29]:
X=df.iloc[:,:-1]
y=df.iloc[:,-1]
y

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

In [43]:
y_train

136     virginica
17         setosa
142     virginica
59     versicolor
6          setosa
          ...    
49         setosa
86     versicolor
45         setosa
60     versicolor
47         setosa
Name: species, Length: 100, dtype: object

In [32]:
y_test

133     virginica
56     versicolor
7          setosa
67     versicolor
107     virginica
57     versicolor
55     versicolor
18         setosa
66     versicolor
53     versicolor
42         setosa
14         setosa
23         setosa
39         setosa
10         setosa
132     virginica
138     virginica
69     versicolor
122     virginica
99     versicolor
147     virginica
88     versicolor
2          setosa
116     virginica
22         setosa
104     virginica
140     virginica
35         setosa
20         setosa
148     virginica
111     virginica
141     virginica
21         setosa
84     versicolor
28         setosa
41         setosa
145     virginica
96     versicolor
77     versicolor
85     versicolor
63     versicolor
93     versicolor
38         setosa
1          setosa
134     virginica
78     versicolor
108     virginica
106     virginica
51     versicolor
127     virginica
Name: species, dtype: object

Using set() and pandas' shape, make sure that no row from the training data also appears in the testing data
(this can be done with the DataFrame's index), and make sure that both sets have the same number of columns.

In [44]:
print(X_train.shape)
print(X_test.shape)

(100, 4)
(50, 4)


In [45]:
if X_train.shape[1] == X_test.shape[1]:
    idx_train = set(X_train.index)
    idx_test = set(X_test.index)
    # A non-empty intersection means some rows ended up in both sets.
    if idx_train & idx_test:
        print("Data from train is in the test set")
    else:
        print("Ready")

Ready
