**Before you dive into the implementations, I highly recommend first learning the heart of each algorithm—its core idea and how it works. You can explore this through YouTube tutorials, books, or online courses. This repository is meant to complement that knowledge by showing how to translate concepts into working code.**
# Model Evaluation

This document provides an overview of essential techniques for evaluating machine learning models.

## I. Hold-out Method (Train-Test Split)

The hold-out method is the simplest approach to evaluate model performance. It involves dividing the dataset into two distinct subsets:

*   **Training Set:** Used to train the machine learning model.
*   **Test Set (Hold-out Set):** Held back and used to evaluate the trained model on unseen data.

**Process:**

1.  **Split:** Divide the dataset (e.g., 80% training, 20% testing).
2.  **Train:** Train the model on the training set.
3.  **Evaluate:** Use the trained model to make predictions on the test set.
4.  **Calculate Metrics:** Calculate evaluation metrics (e.g., accuracy, RMSE).

**Advantages:** Simple, computationally inexpensive, useful for large datasets.

**Disadvantages:** Sensitive to data split, may not be suitable for small datasets.
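
The sensitivity to the split can be seen by repeating the hold-out procedure with different random seeds. The following is a minimal sketch, assuming a small synthetic dataset and a logistic regression model chosen only for illustration; the sizes and seeds are not taken from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (illustrative sizes, an assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

holdout_scores = []
for seed in range(5):
    # Same data and model; only the random 80/20 split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

print([round(s, 3) for s in holdout_scores])
print(f"spread (max - min): {max(holdout_scores) - min(holdout_scores):.3f}")
```

The smaller the dataset, the larger this spread tends to be, which is the motivation for averaging over several splits.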

## II. Cross-Validation

Cross-validation is a more robust technique than the hold-out method. It involves partitioning the dataset into multiple subsets (folds), training the model on some folds, and evaluating it on the remaining folds. This process is repeated multiple times, and the results are averaged.

### Types of Cross-Validation

1.  **k-fold Cross-Validation**

    *   The dataset is divided into k folds of (approximately) equal size.
    *   The model is trained k times, each time using a different fold as the test set and the remaining k-1 folds as the training set.
    *   The performance is averaged across all k evaluations.
    *   Common values for k are 5 and 10.

    **Example (5-fold):**

    Data: `[Fold 1] [Fold 2] [Fold 3] [Fold 4] [Fold 5]`

    *   Round 1: Train on `[2, 3, 4, 5]`, Test on `[1]`
    *   Round 2: Train on `[1, 3, 4, 5]`, Test on `[2]`
    *   ...

2.  **Stratified k-fold Cross-Validation**

    *   Ensures that each fold has the same proportion of target classes as the original dataset.
    *   Essential for imbalanced datasets.

3.  **Leave-One-Out Cross-Validation (LOOCV)**

    *   A special case of k-fold where k equals the number of data points.
    *   Each data point is used as the test set once.
    *   Computationally expensive for large datasets.

4.  **Time Series Cross-Validation**

    *   Designed for time series data where the order of data points is important.
    *   Uses a rolling or expanding window approach to prevent data leakage.

5.  **Nested Cross-Validation**

    *   Used for both hyperparameter tuning and model evaluation.
    *   Involves an outer loop for evaluation and an inner loop for hyperparameter tuning.
    *   Provides a less biased estimate of model performance.
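
Time series cross-validation is the one variant in the list above that cannot shuffle the data. A minimal sketch of the expanding-window idea, assuming scikit-learn's `TimeSeriesSplit` and ten toy observations in time order (the data is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations in time order (toy data for illustration).
X_ts = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
ts_splits = list(tscv.split(X_ts))

for round_no, (train_idx, test_idx) in enumerate(ts_splits, start=1):
    # The training window grows each round and always ends before the test window starts.
    print(f"Round {round_no}: train={train_idx.tolist()}, test={test_idx.tolist()}")
```

Because every training index precedes every test index, the model never sees observations from the future of its test window.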

### Cross-Validation vs. Hold-out Method (Train-Test Split)

| Feature             | Hold-out Method                      | Cross-Validation                                        |
|----------------------|--------------------------------------|--------------------------------------------------------|
| Data Usage          | Single split into train and test sets | Multiple splits; each data point used for both training and testing |
| Performance Estimate | Single evaluation on the test set     | Average performance across multiple evaluations        |
| Reliability         | Can be sensitive to the specific split | More robust and reliable                               |
| Computational Cost  | Lower                                  | Higher                                                   |

### How Cross-Validation Helps

*   **Reliable Performance Estimate:** Provides a more robust estimate of how well a model will perform on unseen data.
*   **Effective Use of Data:** Especially useful for small datasets.
*   **Model Selection:** Helps compare different models or hyperparameter settings.
*   **Detects Overfitting:** Gives a better indication of how well a model generalizes, so a model that only looks good on one particular train/test split is easier to spot.

### Hyperparameter Tuning with Cross-Validation

*   **Define a Grid:** Specify a range of values for each hyperparameter.
*   **Cross-Validation for Each Combination:** Perform k-fold cross-validation for each hyperparameter combination.
*   **Select Best Hyperparameters:** Choose the combination with the best average performance.
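
The three steps above can be written out as an explicit loop. This is a minimal sketch of what a grid search does, assuming a decision tree, a small synthetic dataset, and illustrative grid values; in practice `GridSearchCV` wraps the same logic.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Step 1: define a grid (values are illustrative assumptions).
grid = {"max_depth": [3, 5, None], "min_samples_split": [2, 10]}

# Step 2: run 5-fold cross-validation for every combination.
results = {}
for max_depth, min_samples_split in product(grid["max_depth"], grid["min_samples_split"]):
    model = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_split=min_samples_split, random_state=0
    )
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[(max_depth, min_samples_split)] = scores.mean()

# Step 3: keep the combination with the best mean score.
best_params = max(results, key=results.get)
print(best_params, round(results[best_params], 4))
```

The best mean score found this way was also used to pick the hyperparameters, so it tends to be optimistic as an estimate of performance on new data.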

### Nested Cross-Validation (Detailed)

*   **Outer Loop (Evaluation):** Split data into k folds.
*   **Inner Loop (Hyperparameter Tuning - for each outer fold):**
    *   Split the outer fold's training data into subsets.
    *   Test different hyperparameter values using cross-validation on these subsets.
    *   Select the best hyperparameters.
*   **Evaluation:** Train a model with the selected hyperparameters on the entire training set of the outer fold and evaluate it on the hold-out test fold.
*   **Final Performance:** Average results from each outer fold.
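
In scikit-learn, the two loops described above can be sketched by passing a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). This is a minimal sketch, assuming a small synthetic dataset and an illustrative decision-tree grid; neither is prescribed by the text above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the grid search picks hyperparameters using only the data it is given.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_split": [2, 10]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: cross_val_score refits a clone of `search` on each outer training
# split and scores the tuned model on the corresponding held-out outer fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(f"Nested CV accuracy: {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

Calling `search.fit(X, y)` on its own would run only the inner loop; its `best_score_` is then a tuning score rather than a nested estimate.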

### Choosing the Number of Folds (k)

*   **Common values:** 5 or 10.
*   **Small datasets:** Consider LOOCV or a higher k (e.g., 10), so that each training set keeps most of the data.
*   **Large datasets:** Lower k to reduce computational cost.
*   Balance computational cost and reliability.

### Key Takeaway

Cross-validation is a powerful technique for evaluating and comparing machine learning models. It provides a more reliable estimate of performance than a simple train-test split, especially when dealing with limited data or when accurate performance estimation is crucial. Nested cross-validation goes one step further by giving a less biased evaluation after hyperparameter tuning, because the data used to score the model is never used to choose its hyperparameters.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, LeaveOneOut, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report


In [5]:
# 1. Generate a synthetic dataset (for demonstration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y
df

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,target
0,1.470848,-0.360450,-0.591602,-0.728228,0.941690,1.065964,0.017832,-0.596184,1.840712,-1.497093,...,-0.603968,2.899256,0.037567,-1.249523,0.257963,0.416628,1.408208,-1.838041,-0.833142,1
1,4.513369,-2.227103,-1.140747,2.018263,-2.238358,-0.497370,0.714550,0.938883,-2.395169,0.159837,...,1.461499,3.954171,0.309054,0.538184,-7.157865,-4.532216,-0.081800,-9.325362,0.574386,1
2,-2.355643,2.218601,-1.603269,0.873394,0.401483,0.717264,-0.859399,-1.042190,-2.175965,0.980231,...,0.544434,-2.466258,-0.470256,0.073018,-2.203531,-2.299263,-1.742761,-0.271579,-0.359285,0
3,-1.596198,-0.857427,1.772434,-0.639361,1.419409,-0.438525,0.281949,2.345145,1.006230,0.389135,...,-1.025051,-2.422975,1.579807,-0.300713,4.267120,2.893775,1.236697,6.034785,-0.045711,0
4,2.840049,-2.489600,-0.844902,-1.594362,-4.688517,0.459637,0.913607,-1.143505,1.263937,-2.040928,...,4.176424,1.341742,0.133565,1.743819,1.531188,2.269808,0.053489,-3.151109,1.603702,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,-0.952534,0.238036,0.331327,0.120452,3.539113,-0.556466,0.517210,2.324479,-0.123064,-2.143993,...,-2.233344,-1.424993,0.443893,-2.239507,1.650853,2.296884,0.438294,1.705011,-1.504149,0
996,-3.434088,-1.020016,-0.726931,-1.787934,-3.247447,1.439954,1.075627,-2.812310,2.527895,-6.569889,...,2.058341,-2.655348,0.566760,3.414485,5.391528,0.744789,-3.919345,-5.365949,0.725590,0
997,-0.015335,1.883507,3.221682,2.878762,-3.854459,-1.862864,-0.534772,-6.446245,0.976906,0.268957,...,0.306705,4.747956,-0.484956,7.601766,1.039739,-0.107586,-0.789566,0.282480,0.825853,0
998,1.285071,1.618508,-1.700678,1.051307,-2.025566,-0.375928,0.261185,1.514845,-2.452698,0.811538,...,-1.600277,3.054941,0.791745,0.982096,-2.942005,-4.929617,1.660500,-2.192116,0.906871,1


 * Uses make_classification to create a synthetic dataset, making the example self-contained.

In [18]:
# 2. Hold-out Method (single 70/30 train-test split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [4]:
# Train and evaluate Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Hold-out - Logistic Regression:")
print(classification_report(y_test, y_pred_lr))

Hold-out - Logistic Regression:
              precision    recall  f1-score   support

           0       0.88      0.81      0.84       160
           1       0.80      0.87      0.83       140

    accuracy                           0.84       300
   macro avg       0.84      0.84      0.84       300
weighted avg       0.84      0.84      0.84       300



In [6]:
# Train and evaluate Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("\nHold-out - Decision Tree:")
print(classification_report(y_test, y_pred_dt))


Hold-out - Decision Tree:
              precision    recall  f1-score   support

           0       0.88      0.76      0.81       160
           1       0.76      0.89      0.82       140

    accuracy                           0.82       300
   macro avg       0.82      0.82      0.82       300
weighted avg       0.83      0.82      0.82       300



* With a single train-test split, the two models give very similar results. Let's use cross-validation to get a more reliable comparison.

In [7]:
# 3. K-Fold Cross-Validation (k=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

lr_scores_kf = cross_val_score(lr, X, y, cv=kf, scoring='accuracy')
dt_scores_kf = cross_val_score(dt, X, y, cv=kf, scoring='accuracy')

print("\nK-Fold Cross-Validation (k=5):")
print(f"Logistic Regression: Mean Accuracy = {lr_scores_kf.mean():.4f}, Std Dev = {lr_scores_kf.std():.4f}")
print(f"Decision Tree: Mean Accuracy = {dt_scores_kf.mean():.4f}, Std Dev = {dt_scores_kf.std():.4f}")


K-Fold Cross-Validation (k=5):
Logistic Regression: Mean Accuracy = 0.8360, Std Dev = 0.0380
Decision Tree: Mean Accuracy = 0.8450, Std Dev = 0.0239


In [9]:
# 5. K-Fold with different k (k=10)
kf_10 = KFold(n_splits=10, shuffle=True, random_state=42)

lr_scores_kf_10 = cross_val_score(lr, X, y, cv=kf_10, scoring='accuracy')
dt_scores_kf_10 = cross_val_score(dt, X, y, cv=kf_10, scoring='accuracy')

print("\nK-Fold Cross-Validation (k=10):")
print(f"Logistic Regression: Mean Accuracy = {lr_scores_kf_10.mean():.4f}, Std Dev = {lr_scores_kf_10.std():.4f}")
print(f"Decision Tree: Mean Accuracy = {dt_scores_kf_10.mean():.4f}, Std Dev = {dt_scores_kf_10.std():.4f}")


K-Fold Cross-Validation (k=10):
Logistic Regression: Mean Accuracy = 0.8370, Std Dev = 0.0390
Decision Tree: Mean Accuracy = 0.8370, Std Dev = 0.0361


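* With k=5 folds, the decision tree gets a slightly higher mean accuracy and a lower standard deviation (lower is better) than logistic regression, but the difference is small, and with k=10 the two models have the same mean accuracy. Let's try stratified k-fold.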

In [8]:
# 4. Stratified K-Fold Cross-Validation (k=5)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr_scores_skf = cross_val_score(lr, X, y, cv=skf, scoring='accuracy')
dt_scores_skf = cross_val_score(dt, X, y, cv=skf, scoring='accuracy')

print("\nStratified K-Fold Cross-Validation (k=5):")
print(f"Logistic Regression: Mean Accuracy = {lr_scores_skf.mean():.4f}, Std Dev = {lr_scores_skf.std():.4f}")
print(f"Decision Tree: Mean Accuracy = {dt_scores_skf.mean():.4f}, Std Dev = {dt_scores_skf.std():.4f}")


Stratified K-Fold Cross-Validation (k=5):
Logistic Regression: Mean Accuracy = 0.8410, Std Dev = 0.0275
Decision Tree: Mean Accuracy = 0.8320, Std Dev = 0.0157
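
The dataset used here is roughly balanced, so stratification changes little. To see what it does on imbalanced data, here is a small sketch on deliberately skewed toy labels (an assumption for illustration, not the dataset used above) that counts the positive samples landing in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 negatives followed by 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count the positive samples in each test fold produced by `splitter`."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]


plain_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold:          ", plain_counts)
print("StratifiedKFold:", strat_counts)
```

With stratification, each of the five test folds receives the same share of the ten positives, whereas plain k-fold leaves the count per fold to chance.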


In [19]:
# 4. Stratified K-Fold Cross-Validation (k=10)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

lr_scores_skf = cross_val_score(lr, X, y, cv=skf, scoring='accuracy')
dt_scores_skf = cross_val_score(dt, X, y, cv=skf, scoring='accuracy')

print("\nStratified K-Fold Cross-Validation (k=10):")
print(f"Logistic Regression: Mean Accuracy = {lr_scores_skf.mean():.4f}, Std Dev = {lr_scores_skf.std():.4f}")
print(f"Decision Tree: Mean Accuracy = {dt_scores_skf.mean():.4f}, Std Dev = {dt_scores_skf.std():.4f}")


Stratified K-Fold Cross-Validation (k=10):
Logistic Regression: Mean Accuracy = 0.8390, Std Dev = 0.0359
Decision Tree: Mean Accuracy = 0.8240, Std Dev = 0.0585


* With stratified k-fold (k=10), logistic regression does slightly better than the decision tree (0.8390 vs. 0.8240 mean accuracy). Let's try LOOCV.

In [12]:
# 6. Leave-One-Out Cross-Validation (LOOCV)
loo = LeaveOneOut()

lr_scores_loo = cross_val_score(lr, X, y, cv=loo, scoring='accuracy')
dt_scores_loo = cross_val_score(dt, X, y, cv=loo, scoring='accuracy')

print("\nLeave-One-Out Cross-Validation:")
print(f"Logistic Regression: Mean Accuracy = {lr_scores_loo.mean():.4f}")  # Std omitted: each fold has one test sample, so every per-fold score is 0 or 1
print(f"Decision Tree: Mean Accuracy = {dt_scores_loo.mean():.4f}")


Leave-One-Out Cross-Validation:
Logistic Regression: Mean Accuracy = 0.8380
Decision Tree: Mean Accuracy = 0.8300


* LOOCV (LeaveOneOut cross-validation) gives almost the same mean accuracies as before. Let's move on to hyperparameter tuning with cross-validation, the building block of nested cross-validation.

In [17]:
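# 7. Hyperparameter Tuning with GridSearchCV for Logistic Regression
# Note: this cell runs only the inner loop of nested CV (one 5-fold grid search on all the data).
# best_score_ is therefore a tuning score and tends to be optimistic; it is not a nested-CV estimate.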
param_grid_lr = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['elasticnet'], 'solver':['saga'], 'l1_ratio': [0.1, 0.5, 0.9]} # Example grid for Logistic Regression

nested_cv_lr = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid_lr, cv=5, scoring='accuracy',n_jobs=-1)  # Inner CV is 5-fold, set n_jobs for parallel processing
nested_cv_lr.fit(X, y)

print("\nNested Cross-Validation with Hyperparameter Tuning (Logistic Regression):")
print(f"Best Hyperparameters: {nested_cv_lr.best_params_}")
print(f"Best Score: {nested_cv_lr.best_score_:.4f}")


Nested Cross-Validation with Hyperparameter Tuning (Logistic Regression):
Best Hyperparameters: {'C': 0.1, 'l1_ratio': 0.1, 'penalty': 'elasticnet', 'solver': 'saga'}
Best Score: 0.8460


In [13]:
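# 8. Hyperparameter Tuning with GridSearchCV for Decision Tree
# Note: as in the previous cell, only the inner grid-search loop is run here; there is no outer evaluation loop.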
param_grid_dt = {'max_depth': [None, 5, 10, 15], 'min_samples_split': [2, 5, 10]} # Example grid

nested_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5, scoring='accuracy') # Inner CV is 5-fold
nested_cv.fit(X, y)

print("\nNested Cross-Validation with Hyperparameter Tuning (Decision Tree):")
print(f"Best Hyperparameters: {nested_cv.best_params_}")
print(f"Best Score: {nested_cv.best_score_:.4f}")


Nested Cross-Validation with Hyperparameter Tuning (Decision Tree):
Best Hyperparameters: {'max_depth': None, 'min_samples_split': 5}
Best Score: 0.8360


* Here I use elastic-net regularization for LogisticRegression and tune its hyperparameters with GridSearchCV; for the decision tree, the tuned hyperparameters are `max_depth` and `min_samples_split`. Note that the two cells above run a single grid search on the full dataset, so the reported best scores come from the inner cross-validation only; a full nested cross-validation would wrap each grid search in an outer evaluation loop. The results do not differ much between methods because the dataset is small and synthetic, chosen to keep this basic implementation simple. With real-world data, feature engineering comes first, and cross-validation is then applied to find the best model for the data. Of course, this takes some experimentation with the parameters and enough knowledge to tune the model. You can find basic implementations of evaluation metrics for classification, regression, and clustering in my repo; feel free to check it out.