
To demonstrate a train-test split, we use the Medical Cost Personal Dataset available on [Kaggle](https://www.kaggle.com/datasets/mirichoi0218/insurance?datasetId=13720&sortBy=voteCount). We do no exploratory data analysis or modelling here; the only goal is to split the data.

In [1]:
import pandas as pd

**Load dataset**

In [2]:
data = pd.read_csv('/content/insurance.csv')
data.head()

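| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |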


# train-test-split

Scikit-learn implements the train-test split in the function [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). It splits arrays or matrices into random train and test subsets.

`train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`

`*arrays`: The data to split. Inputs can be lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames, and they must all have the same length.

`test_size`: Defines the size of the test set. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples. If None, the value is set to the complement of the train size. If `train_size` is also None, it defaults to 0.25.

`train_size`: Defines the size of the train set. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, it represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
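As a minimal sketch of how the size parameters behave, the snippet below splits a small synthetic array (`X_demo` is a made-up stand-in, not the insurance data) using a float, an int, and the default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 samples, 2 features
X_demo = np.arange(200).reshape(100, 2)

# Float: 20% of the samples go to the test set
a_train, a_test = train_test_split(X_demo, test_size=0.2, random_state=0)

# Int: exactly 30 samples go to the test set
b_train, b_test = train_test_split(X_demo, test_size=30, random_state=0)

# Neither size given: the test set defaults to 25% of the samples
c_train, c_test = train_test_split(X_demo, random_state=0)

print(a_train.shape, a_test.shape)  # (80, 2) (20, 2)
print(b_train.shape, b_test.shape)  # (70, 2) (30, 2)
print(c_train.shape, c_test.shape)  # (75, 2) (25, 2)
```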

`random_state`: Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

`shuffle`: Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
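To illustrate the effect of `shuffle=False`, here is a small sketch on a synthetic ordered array (the name `ordered` is illustrative only). Without shuffling, the original order is kept and the test set is taken from the end of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic ordered data (assumption): the integers 0..9
ordered = np.arange(10)

# No shuffling: the first 7 items form the train set, the last 3 the test set
head, tail = train_test_split(ordered, test_size=3, shuffle=False)

print(head)  # [0 1 2 3 4 5 6]
print(tail)  # [7 8 9]
```

This ordered behaviour is typically what you want for data with a time order, where the test set should come after the training data.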

`stratify`: If not None, data is split in a stratified fashion, using this as the class labels.
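The sketch below shows one way stratification can be checked. It uses synthetic labels (80 "no" and 20 "yes", an assumption made for illustration) and compares the share of "yes" in each subset after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 80% "no", 20% "yes"
labels = np.array(["no"] * 80 + ["yes"] * 20)
features = np.arange(100).reshape(-1, 1)

# Passing the labels to stratify keeps the class proportions in both subsets
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=1
)

yes_train = (l_train == "yes").sum()
yes_test = (l_test == "yes").sum()
print(yes_train, len(l_train))  # 15 of 75 train labels are "yes"
print(yes_test, len(l_test))    # 5 of 25 test labels are "yes"
```

Both subsets keep the original 20% share of "yes". Stratification applies to class labels, so for a continuous target such as `charges` you would stratify on a categorical column instead.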

Before using the function, import `train_test_split`:

In [3]:
from sklearn.model_selection import train_test_split


There are two ways to perform the split:
1. Split the entire dataset by passing it directly to the `train_test_split` function.
2. Separate the dataset into input (X) and output (y) columns, then pass X and y to the `train_test_split` function.

In [4]:
# First method of splitting the entire dataset into two subsets, train and test
train, test = train_test_split(data, test_size=0.2, random_state=1)
print(train.shape, test.shape)

(1070, 7) (268, 7)


For the second, more common method, we first separate the original dataset into inputs X and output y, then apply the train-test split.


In [5]:
# prepare X and y as NumPy arrays
X = data.iloc[:, :-1].values  # independent variables: every column except the last
y = data.iloc[:, -1].values   # dependent variable: the last column (charges)
print("Shape of X:",X.shape)
print("Shape of y:",y.shape)

Shape of X: (1338, 6)
Shape of y: (1338,)


In [6]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("shape of original dataset :", data.shape)
print("shape of X_train", X_train.shape)
print("shape of y_train", y_train.shape)
print("shape of X_test", X_test.shape)
print("shape of y_test", y_test.shape)

shape of original dataset : (1338, 7)
shape of X_train (1070, 6)
shape of y_train (1070,)
shape of X_test (268, 6)
shape of y_test (268,)
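As a variation, the split can also be applied to pandas objects directly, without converting to arrays via `.values`. The sketch below rebuilds only the five rows shown in the `data.head()` output (a tiny stand-in named `toy`, not the full dataset) and suggests that column names and row labels are carried through the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in built from the five rows displayed by data.head()
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

X_df = toy.drop(columns="charges")  # inputs stay a DataFrame
y_s = toy["charges"]                # output stays a Series

X_tr, X_te, y_tr, y_te = train_test_split(X_df, y_s, test_size=0.2, random_state=1)

print(list(X_tr.columns))            # column names are preserved
print(X_tr.index.equals(y_tr.index)) # X and y rows stay aligned
```

Keeping the DataFrame form can make later steps easier to read, since features are still addressable by name.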


An important thing to note is that the split is random by default, because `shuffle=True`. To make the results reproducible across multiple function calls, set the parameter `random_state` to an int. The particular integer does not matter, as long as you use the same one every time.
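A quick way to check this reproducibility is to run the split twice with the same seed and compare the results. The sketch below does so on a synthetic array (`values` and the seeds 42 and 7 are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): the integers 0..19
values = np.arange(20)

# Same seed twice, then a different seed
_, test_a = train_test_split(values, test_size=0.25, random_state=42)
_, test_b = train_test_split(values, test_size=0.25, random_state=42)
_, test_c = train_test_split(values, test_size=0.25, random_state=7)

# Identical seeds give identical test sets
same_seed_matches = np.array_equal(test_a, test_b)
print(same_seed_matches)  # True

# A different seed will usually select different samples
print(sorted(test_a.tolist()), sorted(test_c.tolist()))
```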

After the train-test split, you can proceed with modelling.

