## Train-Test Split

In [1]:
import pandas as pd

In [25]:
df = pd.read_csv('./age1.csv')
df.head()

   Age  Income
0   20   10000
1   30   15000
2   80   50000
3   50   30000
4   10    5000


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Age     10 non-null     int64
 1   Income  10 non-null     int64
dtypes: int64(2)
memory usage: 288.0 bytes


In [27]:
x = df[['Age']]
y = df['Income']
print(x, y, sep='\n\n')

   Age
0   20
1   30
2   80
3   50
4   10
5   45
6   30
7   55
8   60
9   70

0    10000
1    15000
2    50000
3    30000
4     5000
5    25000
6    20000
7    35000
8    40000
9    45000
Name: Income, dtype: int64


In [23]:
from sklearn.model_selection import train_test_split

<h2>X_train = Training Data</h2>
<h2>X_test = Testing Data</h2>

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, shuffle=False)

In [31]:
print(X_train,"\n")
print(X_test, "\n")
print(Y_train, "\n")
print(Y_test, "\n")

   Age
0   20
1   30
2   80
3   50
4   10
5   45
6   30 

   Age
7   55
8   60
9   70 

0    10000
1    15000
2    50000
3    30000
4     5000
5    25000
6    20000
Name: Income, dtype: int64 

7    35000
8    40000
9    45000
Name: Income, dtype: int64 
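
The output above shows 7 training rows and 3 test rows. A minimal sketch of the same split on plain Python lists, assuming the ten Age and Income values printed earlier (the variable names here are illustrative and not part of the original notebook):

```python
from sklearn.model_selection import train_test_split

# Same values as the Age and Income columns printed above
ages = [20, 30, 80, 50, 10, 45, 30, 55, 60, 70]
incomes = [10000, 15000, 50000, 30000, 5000, 25000, 20000, 35000, 40000, 45000]

# test_size=0.3 on 10 rows gives 3 test rows; shuffle=False keeps the original order
age_train, age_test, income_train, income_test = train_test_split(
    ages, incomes, test_size=0.3, shuffle=False
)

print(len(age_train), len(age_test))
print(age_test, income_test)
```

With `shuffle=False`, the test set is simply the last three rows, which matches the indices 7, 8 and 9 shown in the output.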



## Importance of Shuffle
<h2>shuffle = True</h2>
<ul>
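    <li>By default, shuffle is True.</li>
    <li>If shuffle is False, train_test_split keeps the rows in their original order: the first rows become the training data and the remaining rows become the test data.</li>
    <li>The disadvantage is that, when the data is sorted, the training set and the test set may not each contain examples from every group.</li>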
    
</ul>
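
A minimal sketch of that disadvantage, assuming a hypothetical toy dataset whose labels are sorted by class (the labels `a`, `b`, `c` and the variable names are made up for illustration):

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class: 5 of "a", then 5 of "b", then 5 of "c"
labels = ["a"] * 5 + ["b"] * 5 + ["c"] * 5
features = list(range(len(labels)))

# Without shuffling, the first 9 rows go to training and the last 6 to testing
_, _, y_train_ordered, y_test_ordered = train_test_split(
    features, labels, test_size=0.4, shuffle=False
)

train_classes = sorted(set(y_train_ordered))
test_classes = sorted(set(y_test_ordered))
print(train_classes)  # classes seen during training
print(test_classes)   # classes seen during testing
```

Here the training labels never include class `c`, and the test labels never include class `a`, so a model trained on this split would be evaluated on a class it has never seen.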

## IRIS DATASET

In [36]:
iris_df = pd.read_csv('./Iris.csv')
iris_df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [37]:
iris_x = iris_df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
iris_y = iris_df['Species']

In [40]:
iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(iris_x, iris_y, test_size=0.4, shuffle=False)

In [42]:
print(iris_x_train, "\n")
print(iris_x_test, "\n")
print(iris_y_train, "\n")
print(iris_y_test, "\n")

    SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0             5.1           3.5            1.4           0.2
1             4.9           3.0            1.4           0.2
2             4.7           3.2            1.3           0.2
3             4.6           3.1            1.5           0.2
4             5.0           3.6            1.4           0.2
..            ...           ...            ...           ...
85            6.0           3.4            4.5           1.6
86            6.7           3.1            4.7           1.5
87            6.3           2.3            4.4           1.3
88            5.6           3.0            4.1           1.3
89            5.5           2.5            4.0           1.3

[90 rows x 4 columns] 

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
90             5.5           2.6            4.4           1.2
91             6.1           3.0            4.6           1.4
92             5.8           2.6            4.0          

<h3>
    If we keep shuffle = True, every run produces a differently shuffled split.
</h3>
<h3>
    See below.
</h3>

In [44]:
iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(iris_x, iris_y, test_size=0.4, shuffle=True)

In [45]:
print(iris_x_train, "\n")
print(iris_x_test, "\n")
print(iris_y_train, "\n")
print(iris_y_test, "\n")

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
142            5.8           2.7            5.1           1.9
83             6.0           2.7            5.1           1.6
47             4.6           3.2            1.4           0.2
54             6.5           2.8            4.6           1.5
94             5.6           2.7            4.2           1.3
..             ...           ...            ...           ...
35             5.0           3.2            1.2           0.2
15             5.7           4.4            1.5           0.4
48             5.3           3.7            1.5           0.2
109            7.2           3.6            6.1           2.5
0              5.1           3.5            1.4           0.2

[90 rows x 4 columns] 

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
56             6.3           3.3            4.7           1.6
81             5.5           2.4            3.7           1.0
132            6.4           2.8            5

<h3> 
    This time we get a different selection of rows for the same 4 columns.
</h3>

In [46]:
iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(iris_x, iris_y, test_size=0.4, shuffle=True)

In [47]:
print(iris_x_train, "\n")
print(iris_x_test, "\n")
print(iris_y_train, "\n")
print(iris_y_test, "\n")

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
90             5.5           2.6            4.4           1.2
149            5.9           3.0            5.1           1.8
8              4.4           2.9            1.4           0.2
47             4.6           3.2            1.4           0.2
16             5.4           3.9            1.3           0.4
..             ...           ...            ...           ...
86             6.7           3.1            4.7           1.5
50             7.0           3.2            4.7           1.4
61             5.9           3.0            4.2           1.5
95             5.7           3.0            4.2           1.2
147            6.5           3.0            5.2           2.0

[90 rows x 4 columns] 

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
127            6.1           3.0            4.9           1.8
82             5.8           2.7            3.9           1.2
141            6.9           3.1            5

<h4>
    Every time, we get a different selection of rows for these 4 columns. This is usually undesirable,
    because we want to know exactly which data the model was trained on. If the split changes on every run, we cannot identify that data or reproduce the results.
</h4>
<h4>
    So we use another parameter, random_state, and assign it a fixed integer value (0 in the cell below).
</h4>
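
A minimal sketch of the effect of a fixed `random_state`, assuming a hypothetical list of 20 integers as a stand-in for the Iris rows (the variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 row identifiers
rows = list(range(20))

# Two separate calls with the same random_state shuffle the rows identically
train_a, test_a = train_test_split(rows, test_size=0.4, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.4, random_state=0)

same_split = (train_a == train_b) and (test_a == test_b)
print(same_split)
print(len(train_a), len(test_a))
```

Any fixed integer works; what matters is that the same value is used on every run.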

In [57]:
iris_x_train, iris_x_test, iris_y_train, iris_y_test = train_test_split(iris_x, iris_y, test_size=0.4, random_state=0)

In [58]:
print(iris_x_train, "\n")
print(iris_x_test, "\n")
print(iris_y_train, "\n")
print(iris_y_test, "\n")

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
85             6.0           3.4            4.5           1.6
30             4.8           3.1            1.6           0.2
101            5.8           2.7            5.1           1.9
94             5.6           2.7            4.2           1.3
64             5.6           2.9            3.6           1.3
..             ...           ...            ...           ...
9              4.9           3.1            1.5           0.1
103            6.3           2.9            5.6           1.8
67             5.8           2.7            4.1           1.0
117            7.7           3.8            6.7           2.2
47             4.6           3.2            1.4           0.2

[90 rows x 4 columns] 

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
114            5.8           2.8            5.1           2.4
62             6.0           2.2            4.0           1.0
33             5.5           4.2            1

<h4>
    Now our train and test data stay the same on every run.
</h4>