# Train/Test Split & Shuffling

In machine learning, we need to evaluate whether a model can **generalize** to unseen data.  
To do this, we prepare the dataset by:

1. **Splitting the data**  
   - **Training set** → used to learn model parameters.  
   - **Testing set** → kept unseen until final evaluation.  
   - Common ratios: **80/20** or **70/30** (sometimes 60/20/20 with a validation set).

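2. **Shuffling the data**  
   - Prevents bias when the rows are ordered (e.g., sorted by label or grouped by source).  
   - Makes it likely that both train and test sets reflect the overall dataset; for small or imbalanced datasets this is not guaranteed.  
   - Exception: for time-ordered data, shuffling lets the model train on later samples and test on earlier ones, so a chronological split is usually preferred.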

3. **Outcome**  
   - Training set → guides learning.  
   - Test set → provides an unbiased estimate of model performance, as long as it is not used for tuning or model selection.  

> This step ensures that our model evaluation reflects **true generalization ability**, not just memorization of the training data.
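
The 60/20/20 ratio mentioned above can be sketched with plain NumPy. This is a minimal illustration rather than a fixed recipe: the helper name `three_way_split`, its default fractions, and the seed are assumptions made for the example.

```python
import numpy as np

def three_way_split(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Return shuffled index arrays for train, validation, and test sets."""
    rng = np.random.default_rng(seed)          # seeded for reproducibility
    indices = rng.permutation(n_samples)       # shuffled 0..n_samples-1
    train_end = round(train_frac * n_samples)
    val_end = train_end + round(val_frac * n_samples)
    # Whatever remains after train and validation goes to the test set
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

train_idx, val_idx, test_idx = three_way_split(50)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 50 samples this yields 30, 10, and 10 indices. The validation set is the one used for tuning, which leaves the test set untouched until the final evaluation.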


### Train/Test Split with NumPy
Here we use NumPy to shuffle the row indices and split them into 80% training and 20% testing sets.  
Indexing `X` and `y` with the same shuffled indices keeps features and labels aligned, and each set is a random sample of the whole dataset.


In [1]:
import numpy as np

# Dummy dataset
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Shuffle indices with a seeded generator so the split is reproducible
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))

# Train/test split (80/20)
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)


Train shape: (40, 2)  Test shape: (10, 2)
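
To see why the shuffle matters, consider a hypothetical dataset whose labels are sorted (an assumption for illustration, not the dummy data above). Splitting it without shuffling places each class entirely on one side of the split.

```python
import numpy as np

# Hypothetical sorted labels: 40 samples of class 0 followed by 10 of class 1
y_sorted = np.array([0] * 40 + [1] * 10)
split = int(0.8 * len(y_sorted))

# Without shuffling: train holds only class 0, test holds only class 1
train_plain, test_plain = y_sorted[:split], y_sorted[split:]
print("No shuffle -> train classes:", np.unique(train_plain),
      "test classes:", np.unique(test_plain))

# With shuffling: both classes can appear on each side of the split
rng = np.random.default_rng(0)  # seed chosen arbitrarily
perm = rng.permutation(len(y_sorted))
train_shuf, test_shuf = y_sorted[perm[:split]], y_sorted[perm[split:]]
print("Shuffled   -> train class counts:", np.bincount(train_shuf, minlength=2),
      "test class counts:", np.bincount(test_shuf, minlength=2))
```

A model trained on the unshuffled split never sees class 1, so its test score says nothing useful about generalization.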


### Train/Test Split with Scikit-learn
Scikit-learn provides a convenient `train_test_split` function that handles both shuffling and splitting in one step.  
We use `test_size=0.2` for an 80/20 split and fix a `random_state` for reproducibility.


In [2]:
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)


Train shape: (40, 2)  Test shape: (10, 2)
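
When row order carries meaning, such as time-ordered measurements, shuffling is usually undesirable. As a sketch, `train_test_split` accepts `shuffle=False`, which keeps the first rows for training and the last rows for testing. The array `t` below is a hypothetical stand-in for timestamps, and the labels are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: t plays the role of a timestamp
t = np.arange(50)
X_time = t.reshape(-1, 1).astype(float)
y_time = t % 2  # placeholder labels

# shuffle=False preserves row order: first 80% train, last 20% test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_time, y_time, test_size=0.2, shuffle=False
)

# Every training timestamp precedes every test timestamp
print("Last train t:", X_tr.max(), " First test t:", X_te.min())
```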


### Train/Test Split with PyTorch
In PyTorch, we wrap the data into a `TensorDataset` and then use `random_split` to divide it into training and test sets.  
Finally, we create `DataLoader`s to iterate through mini-batches during training and evaluation.


In [3]:
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.arange(100).reshape(50, 2).float()
y = torch.arange(50)

# Combine X and y into a dataset
dataset = TensorDataset(X, y)

# Train/test split sizes
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Create DataLoaders (shuffle training set)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

print("Train batches:", len(train_loader), " Test batches:", len(test_loader))


Train batches: 5  Test batches: 2
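
`random_split` draws a new permutation on every run unless it is given a seeded generator. The sketch below passes `generator=torch.Generator().manual_seed(42)` (the seed value is arbitrary) and uses the `indices` attribute of the returned `Subset` objects to check that the two sets do not overlap. The `*_demo` and `*_part` names are made up for this example.

```python
import torch
from torch.utils.data import TensorDataset, random_split

X_demo = torch.arange(100).reshape(50, 2).float()
y_demo = torch.arange(50)
demo_dataset = TensorDataset(X_demo, y_demo)

# Seeded generator so the same split is produced on every run
g = torch.Generator().manual_seed(42)
train_part, test_part = random_split(demo_dataset, [40, 10], generator=g)

# Each Subset stores the positions it selected from the original dataset
train_ids = set(train_part.indices)
test_ids = set(test_part.indices)
print("Overlap:", len(train_ids & test_ids))
print("Covered:", len(train_ids | test_ids), "of", len(demo_dataset))
```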


### Real-World Example: Titanic Dataset (Train/Test Split)

In this section, we apply train/test splitting to a real dataset — the **Titanic survival dataset**.  
Steps:
1. Load the dataset from `seaborn`.
2. Select useful features (`age`, `fare`, `pclass`, `sex`) and the target (`survived`), dropping rows with missing values.
3. Encode categorical data (`sex` → 0 for male, 1 for female).
4. Perform an **80/20 split** using both:
   - **Scikit-learn** → with `train_test_split`.
   - **PyTorch** → with `TensorDataset` and `random_split`.
5. Prepare PyTorch `DataLoader`s for batching during training and evaluation.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# 1) Load Titanic dataset from seaborn
import seaborn as sns
titanic = sns.load_dataset("titanic")

# 2) Keep a few useful features plus the target (drop rows with NaN for simplicity)
# .copy() guards against pandas' SettingWithCopyWarning on the assignment below
df = titanic[["age", "fare", "pclass", "sex", "survived"]].dropna().copy()
df["sex"] = df["sex"].map({"male": 0, "female": 1})  # encode categorical

X = df.drop("survived", axis=1).values
y = df["survived"].values

print("Dataset shape:", X.shape, y.shape)

Dataset shape: (714, 4) (714,)
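
The `dropna()` and `map` steps can be checked on a tiny frame. The four rows below are hypothetical and are not taken from the Titanic data; the point is that `map` silently turns any value missing from the dictionary into `NaN`, which is worth checking before converting to arrays.

```python
import pandas as pd

# Hypothetical four-row frame, not taken from the Titanic data
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 41.0],
    "sex": ["male", "female", "female", "unknown"],
})

clean = toy.dropna().copy()  # drops the row with the missing age
clean["sex"] = clean["sex"].map({"male": 0, "female": 1})

print(clean)
# "unknown" is not in the mapping, so it becomes NaN
print("Unmapped values:", clean["sex"].isna().sum())
```

In the cleaned Titanic frame the `sex` column ends up as `int64` with no nulls, which indicates that every value was covered by the mapping.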


In [5]:
# Show first 5 rows of the cleaned dataset
print('--------------------------------------')
print("\nFirst 5 rows of the dataset:")
print(df.head())
print('--------------------------------------')

# Show info about columns (types + non-null counts)
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("\nDataset Info:")
df.info()
print('--------------------------------------')


# Check survival distribution
print("\nSurvival counts (target variable):")
print(df["survived"].value_counts())
print('--------------------------------------')


--------------------------------------

First 5 rows of the dataset:
    age     fare  pclass  sex  survived
0  22.0   7.2500       3    0         0
1  38.0  71.2833       1    1         1
2  26.0   7.9250       3    1         1
3  35.0  53.1000       1    1         1
4  35.0   8.0500       3    0         0
--------------------------------------

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       714 non-null    float64
 1   fare      714 non-null    float64
 2   pclass    714 non-null    int64  
 3   sex       714 non-null    int64  
 4   survived  714 non-null    int64  
dtypes: float64(2), int64(3)
memory usage: 33.5 KB
--------------------------------------

Survival counts (target variable):
survived
0    424
1    290
Name: count, dtype: int64
--------------------------------------


In [6]:
# Scikit-learn version
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print("Scikit-learn split:")
print("Train:", X_train.shape, " Test:", X_test.shape)

Scikit-learn split:
Train: (571, 4)  Test: (143, 4)
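
The survival classes are imbalanced (424 vs. 290), so a purely random split can shift the class ratio slightly between the two sets. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch uses synthetic labels with the same class counts rather than the real Titanic rows, so it runs without downloading the data; `X_syn` and `y_syn` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned Titanic target
y_syn = np.array([0] * 424 + [1] * 290)
X_syn = np.arange(len(y_syn)).reshape(-1, 1)  # placeholder feature

# stratify=y_syn asks for the same class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

print("Overall positive rate:", round(float(y_syn.mean()), 3))
print("Train positive rate:  ", round(float(y_tr.mean()), 3))
print("Test positive rate:   ", round(float(y_te.mean()), 3))
```

The three rates should agree closely, which makes the test score less sensitive to which rows happened to land in the test set.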


In [7]:
# PyTorch version

# Convert to tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

dataset = TensorDataset(X_tensor, y_tensor)

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print("PyTorch split:")
print("Train batches:", len(train_loader), " Test batches:", len(test_loader))


PyTorch split:
Train batches: 36  Test batches: 9
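
The batch counts printed above follow from ceiling division: with the default `drop_last=False`, a `DataLoader` keeps the final, smaller batch. The helper `num_batches` below is a hypothetical name used only to reproduce that arithmetic; it is not part of PyTorch.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Batches per epoch for a dataset of n_samples items."""
    if drop_last:
        return n_samples // batch_size       # incomplete last batch discarded
    return math.ceil(n_samples / batch_size)  # incomplete last batch kept

print(num_batches(571, 16), num_batches(143, 16))  # Titanic split: 36 9
print(num_batches(40, 8), num_batches(10, 8))      # dummy dataset split: 5 2
```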
