<a href="https://colab.research.google.com/github/Kavya-sree/machinelearningbrain_code_samples/blob/main/train_test_split_best_practices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

There are a few things to consider before splitting your data.
Let's experiment a little with scikit-learn's `train_test_split()` function.

Let's create a simple multiclass dataset.

In [1]:
import pandas as pd

data = {'Name': ['A','C','A','A','A','B','A','B','B','C'],
        'Label': [1,1,1,1,1,2,2,2,3,3]}

# Create the DataFrame
data = pd.DataFrame(data)
print(data)

  Name  Label
0    A      1
1    C      1
2    A      1
3    A      1
4    A      1
5    B      2
6    A      2
7    B      2
8    B      3
9    C      3


In [2]:
from sklearn.model_selection import train_test_split

# Random Shuffling

The first thing to check is whether the data is randomly shuffled before it is split. Keep in mind that there are cases where you shouldn't shuffle the data, for instance when it is a time series and the order of the rows carries information. If shuffling causes no such problem, shuffle the data so that any ordering in the dataset (for example, rows sorted by label) does not bias the model.
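As a minimal sketch of the time-series case, assume a hypothetical daily series (the `ts` DataFrame and its column names are made up for illustration). With `shuffle=False` the rows keep their chronological order, so the test set contains only the most recent observations:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily series, used only for illustration
ts = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "value": range(10),
})

# shuffle=False keeps the rows in their original (chronological) order,
# so the last 30% of the rows become the test set
ts_train, ts_test = train_test_split(ts, test_size=0.3, shuffle=False)

# Check that every training date precedes every test date
print(ts_train["date"].max() < ts_test["date"].min())
```

This way the model is never evaluated on observations that are older than the ones it was trained on.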

In [3]:
train, test = train_test_split(data, test_size=0.3, shuffle=False, random_state=None)
print(train)
print(test)

  Name  Label
0    A      1
1    C      1
2    A      1
3    A      1
4    A      1
5    B      2
6    A      2
  Name  Label
7    B      2
8    B      3
9    C      3


Here the data is split sequentially: the first 7 rows go to the training set and the last 3 rows go to the test set. Can you see the problem? Some feature values and labels are missing from one of the sets. For example, the label '3' is not in the training set. This is because we set `shuffle=False`. The default is `shuffle=True`, which means the data is shuffled into random order before splitting.
Let's see what happens when we set `shuffle=True`.

In [4]:
train, test = train_test_split(data, test_size=0.3, shuffle=True, random_state=None)
print(train)
print(test)

  Name  Label
6    A      2
8    B      3
1    C      1
7    B      2
3    A      1
9    C      3
0    A      1
  Name  Label
4    A      1
2    A      1
5    B      2


Now the data is shuffled. If you rerun the code, you may get different results, because we set `random_state=None`.

Let's pass an int value to `random_state` for reproducible output.

In [5]:
train, test = train_test_split(data, test_size=0.3, shuffle=True, random_state=100)
print(train)
print(test)

  Name  Label
5    B      2
4    A      1
2    A      1
0    A      1
3    A      1
9    C      3
8    B      3
  Name  Label
7    B      2
6    A      2
1    C      1


If you rerun the code multiple times with the same `random_state`, the output will always be the same. Setting this seed makes the split, and therefore your test results, reproducible.
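One way to convince yourself of this is to split the same data twice with the same seed and compare the results. The sketch below rebuilds the example DataFrame under the hypothetical name `df` so that it runs on its own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as above, rebuilt here so the snippet is self-contained
df = pd.DataFrame({'Name': ['A','C','A','A','A','B','A','B','B','C'],
                   'Label': [1,1,1,1,1,2,2,2,3,3]})

# Two independent calls with the same seed
train_a, test_a = train_test_split(df, test_size=0.3, shuffle=True, random_state=100)
train_b, test_b = train_test_split(df, test_size=0.3, shuffle=True, random_state=100)

# The same rows land in the same sets, in the same order
print(train_a.equals(train_b) and test_a.equals(test_b))
```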


# Stratified Splitting

Sometimes, shuffling the dataset before splitting is not sufficient. We also have to make sure that the split preserves the class proportions. This is especially useful for an **imbalanced dataset** like our example, where label '1' appears 5 times, label '2' appears 3 times and label '3' appears 2 times. After splitting, each class should make up roughly the same share of the training set and the test set as it does of the full dataset. This procedure is called stratified splitting.

In [6]:
X = data.iloc[:, :-1].values
y = data.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=100)

print(f"Train labels:\n{y_train}")
print(f"Test labels:\n{y_test}")

Train labels:
[2 1 1 1 1 3 3]
Test labels:
[2 2 1]


We can see that the class labels are not proportionally distributed: label '3' is missing from the test set entirely.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=100,
                                                    stratify=y)

In [8]:
print(f"Train labels:\n{y_train}")
print(f"Test labels:\n{y_test}")

Train labels:
[1 1 1 2 3 1 2]
Test labels:
[2 3 1]


Now every class appears in both the training set and the test set. With only 10 rows, the original proportions can only be approximated.
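To see proportions being preserved more clearly, here is a sketch on a larger, hypothetical label vector with the same 50/30/20 class mix as the toy example. The names `labels`, `features` and `class_proportions` are illustration-only and not part of scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical larger dataset with the same 50/30/20 class mix as the example
labels = np.array([1] * 50 + [2] * 30 + [3] * 20)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, lab_train, lab_test = train_test_split(
    features, labels, train_size=0.7, random_state=100, stratify=labels
)

def class_proportions(values):
    """Return {class: fraction of samples} for a 1-D label array."""
    classes, counts = np.unique(values, return_counts=True)
    return {int(c): float(n) / len(values) for c, n in zip(classes, counts)}

print("Full :", class_proportions(labels))
print("Train:", class_proportions(lab_train))
print("Test :", class_proportions(lab_test))
```

With `stratify=labels`, the three printed dictionaries should be nearly identical, which is the behaviour stratified splitting is meant to provide.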

# Validation splits

Instead of splitting the dataset into two sets (train and test), we can split it into three: train, validation and test.
The validation set is usually taken from the training set and used to tune hyperparameters, for example the learning rate, the batch size, the number of hidden layers in a neural network, or the best kernel for a support vector machine.

Always remember that the test set should only be brought out at the end, once the hyperparameters that give the best accuracy on the validation set have been found. The test set should not be used to make choices about hyperparameters.


How can we split the dataset into three sets using scikit-learn's `train_test_split()`? On its own, it can only split into a train set and a test set. The trick is to use `train_test_split()` twice: first split the data into train and test sets, then apply the function again to the train set to get the final train and validation sets.

In [10]:
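# First split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=100)

# Second split: 25% of the remaining 80% becomes the validation set (60/20/20 overall)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=100)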

print(f"Train labels:\n{y_train}")
print(f"Validation labels:\n{y_val}")
print(f"Test labels:\n{y_test}")

Train labels:
[1 1 3 1 3 1]
Validation labels:
[2 1]
Test labels:
[2 2]
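Note that in the output above the training set contains no label '2', because neither split was stratified. The toy dataset is too small to stratify a three-way split, so the sketch below uses synthetic data from `make_classification` and a k-nearest-neighbours classifier. Both are illustration choices, not part of the original example. It stratifies both splits, picks a hyperparameter on the validation set only, and evaluates on the test set once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)

# Step 1: hold out 20% as the test set, preserving class proportions
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)
# Step 2: 25% of the remaining 80% gives a 60/20/20 split overall
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=100, stratify=y_rest
)

# Choose the hyperparameter (number of neighbours) on the validation set only
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc = model.score(X_va, y_va)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Use the test set once, with the chosen hyperparameter
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(f"best k = {best_k}, validation accuracy = {best_val_acc:.2f}")
print(f"test accuracy = {final_model.score(X_te, y_te):.2f}")
```

The test accuracy is reported only after `best_k` has been fixed, so it remains an unbiased estimate of how the chosen model performs on unseen data.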
