## Train-Test Split of a Dataset

In machine learning, Train-Test Split is a technique used to evaluate how well a model generalizes to unseen data. It involves dividing the dataset into two parts:

### 1. Training Set
- Used to train the model.
- Usually 70–80% of the entire dataset.

### 2. Test Set
- Used to evaluate the trained model’s performance.
- Usually 20–30% of the dataset.

### Why Train-Test Split?
- To detect overfitting or underfitting.
- To get an unbiased evaluation of a model's performance.
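
As a minimal sketch of how a held-out test set exposes overfitting, the example below fits an unpruned decision tree to synthetic data and compares its R² score on both splits. The synthetic data from `make_regression`, the choice of `DecisionTreeRegressor`, and all variable names are assumptions made for illustration; they are not part of the Boston walkthrough.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 500 noisy samples with 10 features.
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unpruned tree can memorize the training rows, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on data the model was fitted on
test_r2 = model.score(X_test, y_test)     # R^2 on unseen data

print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A large gap between the two scores suggests overfitting; two similarly poor scores would instead point to underfitting.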



In [1]:
import pandas as pd

In [2]:
DS = pd.read_csv("Boston.csv")
DS.head()

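|   | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |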


In [3]:
DS.shape

(506, 15)

In [4]:
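# Features: every column except the last (medv); this still includes the leftover "Unnamed: 0" row-number column.
input_data = DS.iloc[:, :-1]
# Target: medv, the column to predict.
output_data = DS["medv"]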

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
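# Hold out 25% of the rows for testing; no random_state is set, so the selected rows differ between runs.
X_train, X_test, y_train, y_test = train_test_split(input_data, output_data, test_size=0.25)

In [7]:
X_train, X_test, y_train, y_test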

(     Unnamed: 0     crim    zn  indus  chas     nox     rm   age     dis  rad  \
 95           96  0.12204   0.0   2.89     0  0.4450  6.625  57.8  3.4952    2   
 334         335  0.03738   0.0   5.19     0  0.5150  6.310  38.5  6.4584    5   
 195         196  0.01381  80.0   0.46     0  0.4220  7.875  32.0  5.6484    4   
 237         238  0.51183   0.0   6.20     0  0.5070  7.358  71.6  4.1480    8   
 170         171  1.20742   0.0  19.58     0  0.6050  5.875  94.6  2.4259    5   
 ..          ...      ...   ...    ...   ...     ...    ...   ...     ...  ...   
 217         218  0.07013   0.0  13.89     0  0.5500  6.642  85.1  3.4211    5   
 84           85  0.05059   0.0   4.49     0  0.4490  6.389  48.0  4.7794    3   
 18           19  0.80271   0.0   8.14     0  0.5380  5.456  36.6  3.7965    4   
 97           98  0.12083   0.0   2.89     0  0.4450  8.069  76.0  3.4952    2   
 279         280  0.21038  20.0   3.33     0  0.4429  6.812  32.2  4.1007    5   
 
      tax  ptr

In [8]:
y_train.shape

(379,)

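- test_size=0.25: 25% of the rows are held out for testing (127 of 506), which leaves the 379 training rows reported by y_train.shape.
- random_state: not set in the call above, so the split changes on every run. Passing a fixed value such as random_state=42 makes the split reproducible (same split every time).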
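
The effect of `random_state` and the resulting split sizes can be checked with a short sketch. It assumes a synthetic 506-row DataFrame as a stand-in for the Boston data; the column names `f1` to `f3`, the target name, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 506 rows, matching the row count of DS above.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(506, 3)), columns=["f1", "f2", "f3"])
target = pd.Series(rng.normal(size=506), name="target")

# Two independent calls with the same random_state.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    features, target, test_size=0.25, random_state=42
)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# Both calls pick exactly the same test rows, in the same order.
same_split = X_te_a.index.equals(X_te_b.index)

# The test size is rounded up (ceil(0.25 * 506) = 127); the rest is training.
split_sizes = (len(X_tr_a), len(X_te_a))

print(same_split, split_sizes)
```

Omitting `random_state` in either call would, in general, produce a different selection of rows on each run, while the split sizes stay the same.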

## Tips
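- For small datasets, consider K-fold cross-validation instead of a single split: one small test set gives a noisy performance estimate.
- Always scale/normalize data after splitting, and fit the scaler only on the training set, so that no statistics from the test set leak into training.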
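
Both tips can be sketched together. The example below is a rough illustration under stated assumptions: the data is synthetic (`make_regression`), and `StandardScaler`, `LinearRegression`, the 5-fold setting, and all variable names are illustrative choices rather than anything prescribed by this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): a small dataset of 120 samples.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Scaling tip: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)     # statistics come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and std

# K-fold tip: every row is used for testing exactly once across the 5 folds.
# Wrapping the scaler in a pipeline refits it on the training part of each fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pipeline = make_pipeline(StandardScaler(), LinearRegression())
fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print(f"mean R^2 = {fold_scores.mean():.3f}, std = {fold_scores.std():.3f}")
```

Putting the scaler inside the pipeline is a design choice that keeps the "fit on training data only" rule intact within each fold, which scaling the full dataset before cross-validation would not.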
