# Train/Test Split & Shuffling

In machine learning, we need to evaluate whether a model can **generalize** to unseen data.  
To do this, we prepare the dataset by:

1. **Splitting the data**  
   - **Training set** → used to learn model parameters.  
   - **Test set** → kept unseen until final evaluation.  
   - Common ratios: **80/20** or **70/30** (sometimes 60/20/20 with a validation set).

2. **Shuffling the data**  
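   - Prevents bias when the data is stored in a meaningful order (e.g., sorted by label).  
   - Makes it likely that both train and test sets represent the overall dataset fairly.  
   - Exception: time-ordered data is usually split chronologically rather than shuffled, so the model is never tested on data that precedes its training data.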

3. **Outcome**  
   - Training set → guides learning.  
   - Test set → provides an unbiased estimate of model performance.  

> This step ensures that our model evaluation reflects **true generalization ability**, not just memorization of the training data.
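
To see why shuffling matters, consider a dataset sorted by label. The sketch below uses hypothetical toy labels (the variable names and the seed are illustrative, not part of the original notebook) and compares an unshuffled 80/20 split with a shuffled one.

```python
import numpy as np

# Hypothetical toy labels sorted by class: 40 zeros followed by 10 ones
y_sorted = np.array([0] * 40 + [1] * 10)
split = int(0.8 * len(y_sorted))

# Without shuffling, the split simply cuts the array at the 80% mark
train_unshuffled = y_sorted[:split]
test_unshuffled = y_sorted[split:]

# With shuffling, rows are drawn from across the whole dataset
rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
perm = rng.permutation(len(y_sorted))
train_shuffled = y_sorted[perm[:split]]
test_shuffled = y_sorted[perm[split:]]

print("Unshuffled train classes:", np.unique(train_unshuffled))
print("Unshuffled test classes: ", np.unique(test_unshuffled))
print("Shuffled train class counts:", np.bincount(train_shuffled, minlength=2))
print("Shuffled test class counts: ", np.bincount(test_shuffled, minlength=2))
```

In the unshuffled case the training set contains only class 0 and the test set only class 1, so the model would be evaluated on a class it never saw.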


### Train/Test Split with NumPy
Here we use NumPy to shuffle indices and split the dataset into 80% training and 20% testing sets.  
Because the indices are shuffled before the cut, both sets are random samples of the whole dataset.


In [None]:
import numpy as np

# Dummy dataset
X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features
y = np.arange(50)                   # labels

# Shuffle indices (seeded generator so the split is reproducible)
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))

# Train/test split (80/20)
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)


Train shape: (40, 2)  Test shape: (10, 2)
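
The 60/20/20 ratio mentioned earlier can be handled the same way by cutting the shuffled indices at two points instead of one. This is a minimal sketch on the same dummy data; the names `n_train`, `n_val`, and `val_idx` and the seed are assumptions made for illustration.

```python
import numpy as np

# Same dummy dataset as in the example above
X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features
y = np.arange(50)                   # labels

rng = np.random.default_rng(42)     # seed is an arbitrary choice
indices = rng.permutation(len(X))

# Sizes of the first two parts; the test set takes whatever remains
n_train = int(0.6 * len(X))
n_val = int(0.2 * len(X))

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print("Train:", X_train.shape, " Val:", X_val.shape, " Test:", X_test.shape)
```

Slicing one permutation keeps the three sets disjoint, which would not be guaranteed if each set were sampled independently.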


### Train/Test Split with Scikit-learn
Scikit-learn provides a convenient `train_test_split` function that handles both shuffling and splitting in one step.  
We use `test_size=0.2` for an 80/20 split and fix a `random_state` for reproducibility.


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)


Train shape: (40, 2)  Test shape: (10, 2)
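
As a quick check of what `random_state` and `shuffle` do, the sketch below calls `train_test_split` twice with the same seed and once with shuffling disabled. It reuses the dummy data from this section; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Two calls with the same seed produce the same partition
_, _, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
print("Same split with same seed:", np.array_equal(y_test_a, y_test_b))

# With shuffle=False the original order is kept,
# so the test set is simply the last 20% of the rows
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)
print("Unshuffled test labels:", y_test_ordered)
```

Disabling the shuffle is the setting to reach for when the row order carries meaning, such as time-ordered data.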


### Train/Test Split with PyTorch
In PyTorch, we wrap the data in a `TensorDataset` and then use `random_split` to divide it into training and test sets.  
Finally, we create `DataLoader`s to iterate through mini-batches during training and evaluation.


In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.arange(100).reshape(50, 2).float()
y = torch.arange(50)

# Combine X and y into a dataset
dataset = TensorDataset(X, y)

# Train/test split sizes
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

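# Seeded generator so the random split is reproducible
generator = torch.Generator().manual_seed(42)
train_dataset, test_dataset = random_split(
    dataset, [train_size, test_size], generator=generator
)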

# Create DataLoaders (shuffle training set)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

print("Train batches:", len(train_loader), " Test batches:", len(test_loader))


Train batches: 5  Test batches: 2
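
The batch counts follow from the split sizes: with the default `drop_last=False`, a `DataLoader` yields `ceil(n_samples / batch_size)` batches, and the last one may be smaller. The sketch below rebuilds the same setup with a seeded generator (the seed and the name `test_batch_sizes` are assumptions for illustration) and inspects the test batches.

```python
import math

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.arange(100).reshape(50, 2).float()
y = torch.arange(50)
dataset = TensorDataset(X, y)

# Seeded generator so the split is repeatable (seed chosen arbitrarily)
generator = torch.Generator().manual_seed(0)
train_dataset, test_dataset = random_split(dataset, [40, 10], generator=generator)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Number of batches is the sample count divided by batch size, rounded up
print("Train batches:", len(train_loader), "=", math.ceil(40 / 8))
print("Test batches: ", len(test_loader), "=", math.ceil(10 / 8))

# Each batch is a (features, labels) pair; collect the test batch sizes
test_batch_sizes = [len(yb) for _, yb in test_loader]
print("Test batch sizes:", test_batch_sizes)
```

The second test batch holds only the 2 leftover samples, which matters when averaging per-batch metrics: weighting batches equally would overweight those samples.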
