# Cross-Validation (K-Fold) with Dummy Dataset

In this notebook, we explore **K-Fold Cross-Validation**, a robust technique for evaluating model performance by repeatedly training and testing on different partitions of the dataset. Instead of relying on a single train/test split, K-Fold cross-validation divides the data into *K folds of (approximately) equal size* and uses each fold once as the test set while the remaining K-1 folds serve as the training set.

**Why it matters:**
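- Reduces the variance of the performance estimate compared with a single train/test split.
- Uses every sample for both training and validation: each sample is validated on exactly once and trained on K-1 times.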
- Provides a better estimate of generalization performance.

We'll:
1. Generate a dummy dataset using `make_classification`.
2. Implement K-Fold cross-validation using `sklearn.model_selection.KFold`.
3. Train a simple model (e.g., Logistic Regression).
4. Compute and interpret average accuracy across folds.
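
To make the fold mechanics concrete before the full example, here is a minimal sketch on a toy array of ten samples. The names `X_toy`, `kfold_toy`, and `val_counts` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples; only their indices matter for this illustration.
X_toy = np.arange(10).reshape(-1, 1)

kfold_toy = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample is placed in a validation fold.
val_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kfold_toy.split(X_toy), start=1):
    val_counts[val_idx] += 1
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")

# Every sample should appear in a validation fold exactly once.
print("Validation counts per sample:", val_counts.tolist())
```

With 10 samples and 5 splits, each validation fold holds 2 samples and each training split holds the other 8.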


In [1]:
# K-Fold Cross Validation with Dummy Dataset

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Generate a dummy dataset
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42
)

# Define the model
model = LogisticRegression(max_iter=1000)

# Define K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate model using cross-validation
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print("Fold Accuracies:", scores)
print(f"Mean Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")


Fold Accuracies: [0.925 0.975 0.875 0.875 0.925]
Mean Accuracy: 0.9150 ± 0.0374
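
To show what `cross_val_score` is doing for this setup, the sketch below repeats the evaluation with an explicit loop. It assumes the same dataset, model, and `KFold` settings as the cell above; `manual_scores` and `reference_scores` are illustrative names, and this is an approximation of the library's behavior for this simple case, not its actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Same data, model, and splitter as in the cell above.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42
)
model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    # For classifiers, .score() returns accuracy on the given data.
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# Compare against the one-line helper.
reference_scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print("Manual loop matches cross_val_score:", np.allclose(manual_scores, reference_scores))
```

Because `random_state` is a fixed integer, `kfold.split` yields the same partitions on every call, which is why the two sets of scores can be compared fold by fold.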


# PyTorch Version: Manual K-Fold Cross-Validation

In this section, we manually implement **K-Fold Cross-Validation** using PyTorch to understand the full training and evaluation workflow.

**Goal:**  
Perform repeated training and validation across K folds and average the results to obtain a more reliable estimate of the model’s performance.

**Steps:**
1. Create a dummy dataset using `sklearn.datasets.make_classification`.
2. Convert it into PyTorch tensors.
3. Split the dataset using `KFold`.
4. Train a simple MLP model on each fold.
5. Evaluate and report mean accuracy across folds.


In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
import numpy as np

In [4]:
# Generate dummy dataset
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42
)

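# Normalize features
# Note: fitting the scaler on the full dataset lets validation-fold statistics
# influence the training data; a stricter setup fits it on each training split.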
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)
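
The cell above scales the whole dataset once, before any splitting. A stricter alternative, sketched below, fits the scaler on each fold's training split only and reuses those statistics for the validation split. `scale_fold`, `X_raw`, `y_raw`, and `kfold_demo` are hypothetical names introduced for this illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def scale_fold(X_train, X_val):
    """Hypothetical helper: fit the scaler on the training split only."""
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_val)

# Unscaled data with the same settings as the cell above.
X_raw, y_raw = make_classification(
    n_samples=200, n_features=5, n_informative=3,
    n_redundant=0, n_classes=2, random_state=42
)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold_demo.split(X_raw), start=1):
    X_train_s, X_val_s = scale_fold(X_raw[train_idx], X_raw[val_idx])
    # The training split has zero mean by construction; the validation
    # split is only approximately centered because its statistics were not used.
    print(f"Fold {fold}: max |train mean| = {np.abs(X_train_s.mean(axis=0)).max():.2e}, "
          f"max |val mean| = {np.abs(X_val_s.mean(axis=0)).max():.2e}")
```

In the PyTorch loop, the scaled arrays would then be converted with `torch.tensor(..., dtype=torch.float32)` inside each fold. On a small, well-behaved dataset like this one the difference in accuracy is likely minor, but the habit matters on real data.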

In [5]:
# Define a simple MLP model
class MLP(nn.Module):
    def __init__(self, input_dim=5, hidden_dim=16, output_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
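
Because the linear layers are initialized randomly, the fold accuracies can differ from run to run unless the random number generator is seeded. The sketch below illustrates the effect of `torch.manual_seed` on initialization. `make_net` is a hypothetical helper that builds the same layer stack as `MLP.net` with the default dimensions; it is an assumption for illustration, not part of the notebook.

```python
import torch
import torch.nn as nn

def make_net(seed):
    """Hypothetical helper: seed the RNG, then build the same layers as MLP.net."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))

net_a = make_net(seed=0)
net_b = make_net(seed=0)
net_c = make_net(seed=1)

# Same seed -> identical initial weights; different seed -> different weights.
same_init = all(torch.equal(p, q) for p, q in zip(net_a.parameters(), net_b.parameters()))
diff_init = any(not torch.equal(p, q) for p, q in zip(net_a.parameters(), net_c.parameters()))
print("Same seed gives identical init:", same_init)
print("Different seed gives different init:", diff_init)
```

One option in the fold loop is to call `torch.manual_seed(...)` right before constructing each model, so that repeated runs start from the same weights.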

In [6]:
# Define training function
# (full-batch training: each epoch is a single gradient step on the whole training split)
def train_model(model, optimizer, criterion, X_train, y_train, epochs=50):
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        outputs = model(X_train)
        loss = criterion(outputs, y_train)
        loss.backward()
        optimizer.step()

# Define evaluation function
def evaluate_model(model, X_val, y_val):
    model.eval()
    with torch.no_grad():
        outputs = model(X_val)
        preds = torch.argmax(outputs, dim=1)
        acc = (preds == y_val).float().mean().item()
    return acc

In [7]:
# K-Fold setup
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

# Loop over folds
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_tensor)):
    X_train, X_val = X_tensor[train_idx], X_tensor[val_idx]
    y_train, y_val = y_tensor[train_idx], y_tensor[val_idx]

    # New model for each fold
    model = MLP()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    train_model(model, optimizer, criterion, X_train, y_train)
    acc = evaluate_model(model, X_val, y_val)
    fold_accuracies.append(acc)

    print(f"Fold {fold + 1} Accuracy: {acc:.4f}")

# Report overall performance
print(f"\nMean Accuracy: {np.mean(fold_accuracies):.4f} ± {np.std(fold_accuracies):.4f}")


Fold 1 Accuracy: 0.9250
Fold 2 Accuracy: 0.9750
Fold 3 Accuracy: 0.8500
Fold 4 Accuracy: 0.8750
Fold 5 Accuracy: 0.9250

Mean Accuracy: 0.9100 ± 0.0436
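
As a rough aid for interpreting the `±` figure, the sketch below recomputes the summary from the fold accuracies printed above. Note that `np.std` defaults to the population standard deviation (`ddof=0`). The sample standard deviation (`ddof=1`) and the standard error are added here as assumptions about what a reader might also want to report; they are not computed in the original cell.

```python
import numpy as np

# Fold accuracies copied from the printed output above.
fold_accuracies = [0.9250, 0.9750, 0.8500, 0.8750, 0.9250]

mean_acc = np.mean(fold_accuracies)
std_pop = np.std(fold_accuracies)             # ddof=0, as used in the notebook
std_sample = np.std(fold_accuracies, ddof=1)  # sample standard deviation
std_err = std_sample / np.sqrt(len(fold_accuracies))

print(f"mean={mean_acc:.4f} std(ddof=0)={std_pop:.4f} "
      f"std(ddof=1)={std_sample:.4f} sem={std_err:.4f}")
# prints: mean=0.9100 std(ddof=0)=0.0436 std(ddof=1)=0.0487 sem=0.0218
```

The fold scores are not independent, since the training splits overlap heavily, so the standard error should be read as a loose indication of uncertainty and not as a formal confidence interval.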
