# What is a train-test split?

When we build a machine learning model, we use data. But we don’t want to train and test the model on the same data, because that would give us misleadingly optimistic results. The model could simply memorize the training examples and score well on them, yet fail to generalize to new data.


So we split the data into two parts:

- Training data, which we use to train the model.
- Testing data, which we use to check how well the model performs on unseen data.

This is called a train-test split. It gives us a more realistic estimate of how the model will perform on data it has never seen.
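To make the memorization problem concrete, here is a minimal sketch. It assumes a synthetic dataset from `make_classification` with deliberately noisy labels and an unpruned `DecisionTreeClassifier`. Neither is part of the example in this notebook; they are stand-ins chosen because a deep tree can memorize its training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y injects label noise,
# so no model should be able to score perfectly on genuinely new rows.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned tree keeps splitting until it fits the training rows.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on rows the model has already seen
test_acc = model.score(X_test, y_test)     # scored on held-out rows

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on test data:     {test_acc:.2f}")
```

The training score looks close to perfect only because the tree has already seen those rows. The score on the held-out rows is the more honest number.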


![image.png](attachment:image.png)
    

Now let’s see how to do this in code using scikit-learn.


In [4]:
import pandas as pd

df = pd.read_csv("binary_classification_sample.csv")

df.head()

|   | Age | Salary | Experience | Gender | Department | Education | LocationScore | Purchased |
|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 51905.183591 | 27 | Female | HR | Bachelors | 67.964728 | 0 |
| 1 | 69 | 31258.344158 | 16 | Female | Engineering | High School | 21.825389 | 0 |
| 2 | 46 | 79176.734217 | 4 | Male | HR | PhD | 94.996118 | 0 |
| 3 | 32 | 47699.953137 | 4 | Male | Engineering | High School | 78.634501 | 1 |
| 4 | 60 | 36395.191619 | 5 | Male | Marketing | High School | 8.9411 | 1 |


In [8]:
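from sklearn.model_selection import train_test_split

# Features: every column except the target
X = df.drop(columns=['Purchased'])
# Target: the label we want to predict
y = df['Purchased']

# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape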

((160, 7), (40, 7), (160,), (40,))

Here:

- X is our input features
- y is the output, or labels
- test_size=0.2 means 20% of the data will be used for testing
- random_state=42 fixes the random shuffle, so the same split is produced on every run

After this, you get four things:

- X_train, y_train: for training
- X_test, y_test: for testing
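The effect of these two parameters can be checked directly. The sketch below uses a small made-up DataFrame (the column names `feature_a`, `feature_b`, and `label` are hypothetical, not from the CSV above) and shows that a fixed `random_state` gives the same split on repeated calls, and that no row ends up in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, two numeric features, a binary label.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "label": rng.integers(0, 2, size=200),
})
X_toy = toy.drop(columns=["label"])
y_toy = toy["label"]

# Two calls with the same random_state produce the same split.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

same_split = X_te1.index.equals(X_te2.index)
# Row labels that appear in both the training and the test set (should be none).
overlap = set(X_tr1.index) & set(X_te1.index)

print(len(X_tr1), len(X_te1))
print("same test rows on both calls:", same_split)
print("rows shared by train and test:", len(overlap))
```

Changing or omitting `random_state` would shuffle the rows differently, so the sizes stay the same but the specific rows in each set change.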

In [None]:
# A quick bonus tip: if you’re working on a classification problem and want to keep
# (approximately) the same class distribution in both the train and test sets, pass stratify=y:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
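To see what `stratify` changes, here is a small sketch on made-up, imbalanced labels (180 zeros and 20 ones; the data and the names `X_imb` and `y_imb` are assumptions for illustration). It compares the share of positives in a plain split with the share in a stratified one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 180 zeros and 20 ones (10% positives).
y_imb = pd.Series([0] * 180 + [1] * 20, name="label")
X_imb = pd.DataFrame({"feature": np.arange(200)})

# Plain random split: the share of positives in the test set can drift.
_, _, _, y_test_plain = train_test_split(X_imb, y_imb, test_size=0.2, random_state=42)

# Stratified split: class proportions are preserved in both parts.
_, _, y_train_strat, y_test_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

print("positives overall:            ", y_imb.mean())
print("positives in plain test:      ", y_test_plain.mean())
print("positives in stratified test: ", y_test_strat.mean())
print("positives in stratified train:", y_train_strat.mean())
```

With stratification, both parts keep the 10% share of positives. This matters most when one class is rare, since a plain random split could leave the test set with very few examples of it.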