# Cross-Validation

Cross-validation trains and evaluates a model on several different splits of the dataset, so you can check that it learns patterns that generalize to unseen data rather than just memorizing the training data.


Imagine you're preparing for an exam. Instead of studying the whole syllabus in one go, you:

    1. Split the syllabus into parts.

    2. Study some parts and test yourself on others.

    3. Repeat this process to make sure you're well-prepared from all angles.

    That’s cross-validation!
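
The study-and-test loop above can be sketched with scikit-learn's `KFold`. This is only a minimal illustration on ten dummy samples (the names `samples`, `folds`, and `tested` are made up for the example); it shows that each sample ends up in the test part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples; the values themselves do not matter here
samples = np.arange(10)

# 5 folds without shuffling: each fold holds out 2 consecutive samples
kf = KFold(n_splits=5)

folds = []
for fold_no, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {fold_no}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Collect all held-out indices: every sample is tested exactly once
tested = np.concatenate([test_idx for _, test_idx in folds])
print("Tested indices:", sorted(tested.tolist()))
```

In a real run, a model would be fitted on each `train_idx` part and scored on the matching `test_idx` part, and the scores would then be averaged.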

### Why is it Useful?

    1. Helps detect overfitting

    2. Makes better use of limited data

    3. Provides a more robust estimate of performance than a single train/test split



## Importing Libraries

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

## Preparing a Sample Dataset

In [2]:
np.random.seed(42)
n = 100

data = pd.DataFrame({
    'Age': np.random.randint(30, 70, n),
    'Cholesterol': np.random.randint(150, 300, n),
    'SystolicBP': np.random.randint(110, 180, n),
    'BMI': np.round(np.random.normal(25, 4, n), 1),
    'Smoker': np.random.choice([0, 1], n),
    'ExerciseFreq': np.random.choice([0, 1, 2], n),  # 0: Rarely, 1: Sometimes, 2: Regularly
})

# Create a simple target variable (1 = High Risk, 0 = Low Risk)
# We'll assume risk increases with age, cholesterol, BP, BMI, and smoking
data['Risk'] = (
    (data['Age'] > 50).astype(int) +
    (data['Cholesterol'] > 240).astype(int) +
    (data['SystolicBP'] > 140).astype(int) +
    (data['BMI'] > 28).astype(int) +
    data['Smoker']
)

# Label as "High Risk" (1) if sum of risk factors >= 3
data['Risk'] = (data['Risk'] >= 3).astype(int)


## Separating Independent and Dependent Variables

In [3]:
X = data.drop('Risk', axis=1)
y = data['Risk']

### Preparing the Train and Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Feature Scaling

In [6]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Logistic Regression with K-Fold Cross-Validation


**How to Decide?**

***Use K-Fold Cross Validation when:***

    You have enough data.

    Your data is not time-based.

    You want a simple and effective estimate of model performance.

***Use Stratified K-Fold when:***

    You’re doing a classification task.

    Your classes are imbalanced (e.g., 90% class A, 10% class B).

    You want a fair distribution of labels in each fold.

***Use LOOCV when:***

    You have very few samples (for example, fewer than 100).

    You want the maximum training data per fold and can accept that it will be slow (one model fit per sample).

***Use Group K-Fold when:***

    You have data where observations are not independent (e.g., multiple rows per patient or user).

    You want to prevent the same user from being in both train and test.

***Use TimeSeriesSplit when:***

    Your data is ordered in time.

    You’re predicting the future based on the past (e.g., stock prices, weather).

    You must not shuffle the data.
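
Each of the strategies above has a matching splitter class in `sklearn.model_selection`. The sketch below runs them on a small toy dataset to show how they differ; `X_toy`, `y_toy`, and `groups` (four hypothetical patients with three rows each) are assumptions made up for illustration and are not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import (
    GroupKFold, KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit,
)

# Toy data: 12 samples, imbalanced labels, 4 hypothetical patients (3 rows each)
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 9 + [1] * 3)
groups = np.repeat([0, 1, 2, 3], 3)

splitters = {
    "KFold": KFold(n_splits=3, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "GroupKFold": GroupKFold(n_splits=4),
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3),
}

n_splits = {}
for name, splitter in splitters.items():
    # groups only matters for GroupKFold; the other splitters ignore it
    n_splits[name] = splitter.get_n_splits(X_toy, y_toy, groups)
    print(f"{name}: {n_splits[name]} splits")

# Stratification puts one of the three minority samples in every test fold
strat_counts = [
    int(y_toy[test].sum())
    for _, test in splitters["StratifiedKFold"].split(X_toy, y_toy)
]

# Group K-Fold never puts the same patient on both sides of a split
group_overlap = [
    set(groups[train]) & set(groups[test])
    for train, test in splitters["GroupKFold"].split(X_toy, y_toy, groups)
]

# TimeSeriesSplit always trains on earlier rows and tests on later ones
ts_ordered = [
    bool(train.max() < test.min())
    for train, test in splitters["TimeSeriesSplit"].split(X_toy)
]
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score`; for `GroupKFold`, the group labels are passed through its `groups` argument as well.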

In [8]:
model = LogisticRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation on scaled training data
scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='accuracy')

In [10]:
print("Cross-Validation Accuracy Scores:", scores)
print("Mean CV Accuracy:", np.mean(scores).round(3))

Cross-Validation Accuracy Scores: [0.875 0.875 0.875 0.875 0.625]
Mean CV Accuracy: 0.825
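
One caveat about the cross-validation above: the scaler was fitted on the whole training set before the folds were created, so each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler and the model in a pipeline, so that scaling is refit inside every fold. The sketch below is one possible variant, not the notebook's method; it uses stand-in data from `make_classification` (an assumption), and in this notebook the unscaled `X_train` and `y_train` would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 80 rows and 6 features, like the training split
X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Stratified folds keep the class ratio similar in every fold
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="accuracy")

print("Per-fold accuracy:", pipe_scores.round(3))
print("Mean CV accuracy:", pipe_scores.mean().round(3))
```

With only 16 samples per validation fold, as in the run above, a single fold score can swing noticeably, so the mean together with the spread across folds is more informative than any one fold.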


## Model Training

In [11]:
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

## Evaluating the Model using Classification Report

In [14]:
report = classification_report(y_test, y_pred, target_names=["Low Risk", "High Risk"])

In [16]:
print(report)

              precision    recall  f1-score   support

    Low Risk       1.00      0.82      0.90        11
   High Risk       0.82      1.00      0.90         9

    accuracy                           0.90        20
   macro avg       0.91      0.91      0.90        20
weighted avg       0.92      0.90      0.90        20



### What the Metrics Mean:

*Precision:* Out of all predicted positives, how many were actually correct?

*Recall:* Out of all actual positives, how many did we correctly predict?

*F1-Score:* The harmonic mean of precision and recall, which balances the two.

*Support:* Number of actual samples in each class.
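
To make these definitions concrete, the sketch below rebuilds label vectors that are consistent with the report above (11 Low Risk and 9 High Risk samples, with 2 Low Risk samples predicted as High Risk) and recomputes the High Risk metrics by hand. The arrays `y_true_demo` and `y_pred_demo` are reconstructed from the reported support and recall; they are not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Reconstructed labels (assumption): 0 = Low Risk, 1 = High Risk
y_true_demo = np.array([0] * 11 + [1] * 9)
y_pred_demo = np.array([0] * 9 + [1] * 2 + [1] * 9)

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted High Risk that really are High Risk
recall = tp / (tp + fn)     # share of actual High Risk that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.82 recall=1.00 f1=0.90

# scikit-learn's helpers agree with the hand computation for the positive class
sk_precision = precision_score(y_true_demo, y_pred_demo)
sk_recall = recall_score(y_true_demo, y_pred_demo)
sk_f1 = f1_score(y_true_demo, y_pred_demo)
```

These values match the High Risk row of the report: all 9 High Risk samples were found (recall 1.00), but 2 of the 11 High Risk predictions were wrong (precision 0.82).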