## **LECTURE** 17: Cross-Validation and Hyperparameter Tuning

## **Course**: Awfera Machine Learning

## **Instructor**: Dr. Shazia Saqib

## **Student**: Muhammad Shafiq

____________

## Introduction to Cross-Validation

### **Statistical Foundation**

**Cross-validation** is a statistical method used to evaluate how well a machine learning model generalizes to an independent dataset. The data is split into multiple parts, with each part being used as a test set while the remaining data is used for training.

### **Systematic Approach**

It systematically partitions data into training and validation subsets, providing robust performance estimates across different data samples.

### **Practical Importance**

Implementing cross-validation is crucial for detecting overfitting, ensuring model reliability, and building confidence in real-world predictive performance.

## Why Cross-Validation is Important
 - **Reliable Evaluation**

 Provides a more accurate assessment of model performance by using multiple data subsets.

 - **Model Selection**

 Enables objective comparison between different algorithms to identify the most effective model.

 - **Hyperparameter Tuning**

 Facilitates systematic optimization of hyperparameters while limiting the bias that comes from tuning against a single split.

 - **Preventing Overfitting**

 Detects overfitting and helps mitigate it by validating performance on unseen data.

## Basic Cross-Validation Concepts

- **Training Set**

     The portion of data used to fit the model parameters. Typically comprises 70-80% of the available dataset and teaches the model to recognize patterns.

- **Validation Set**

     Independent data used for hyperparameter tuning and performance evaluation during development. Helps prevent overfitting by providing feedback on data the model was not fitted to.

- **Test Set**

     Completely untouched data reserved for final model assessment. Provides an accurate estimate of how the model will perform on new, real-world data.
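
A minimal sketch of the three-way split described above, assuming scikit-learn's `train_test_split` and the Iris dataset. The 60/20/20 ratio, the `stratify` option, and the `random_state` value are illustrative choices, not part of the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out 20% of the data as the untouched test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into training (60% of the total)
# and validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print("train:", len(X_train), "validation:", len(X_val), "test:", len(X_test))
```

The test set is split off first so that no later tuning decision can be influenced by it.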

## How K-Fold Enhances Reliability

- **Multiple Evaluations**

     K-Fold provides comprehensive assessment by testing your model on different data subsets, ensuring performance isn't dependent on a single validation split.

- **Reduced Variance**

     By averaging results across multiple folds, K-Fold reduces the statistical variance that comes from an arbitrary data partition.

- **Averaged Performance**

     Combining metrics from all iterations delivers a more stable and trustworthy model assessment, giving a better picture of the model's predictive capabilities.

- **Increased Robustness**

     Models selected through K-Fold validation are less likely to have been tuned to one lucky split, so their measured performance tends to carry over better to entirely new, unseen real-world data.
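
To make the point about single splits concrete, the following sketch compares the accuracy from several individual train/test splits with a 5-fold average. It assumes the Iris dataset and a `LogisticRegression` model; the seeds and split sizes are arbitrary, and the exact numbers will depend on the environment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Accuracy from ten different single splits: each seed gives a different estimate.
single_split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# Accuracy averaged over 5 folds: one estimate that uses every sample for testing once.
kfold_scores = cross_val_score(model, X, y, cv=5)

print("single splits: min", min(single_split_scores), "max", max(single_split_scores))
print("5-fold mean:", kfold_scores.mean(), "std:", kfold_scores.std())
```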




## Types of Cross-Validation:

- **K-Fold Cross-Validation**: 

    The data is split into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated for each fold.

    **Example**: 5-Fold Cross-Validation divides the dataset into 5 parts and trains the model 5 times, each time holding out a different part for testing.

- **Stratified K-Fold Cross-Validation**: 

    This variant maintains the class distribution across folds, ensuring balanced representation of each class.
    
    **Benefit**: Ideal for imbalanced datasets.

- **Leave-One-Out Cross-Validation (LOOCV)**: 

    A special case of k-fold where k equals the number of data points, meaning each data point is used as a test case once.

- **Time Series Cross-Validation**: 

    This technique splits data by time so that future observations are never used to predict the past.
    It maintains chronological order in the splits, which suits data such as stock prices and weather records.
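
The four strategies above map onto splitter classes in scikit-learn. The sketch below prints the train/test indices each one produces on a tiny toy array; the data and the numbers of splits are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit
)

# Toy data (hypothetical): 10 samples in time order, with imbalanced labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

splitters = {
    "KFold": KFold(n_splits=5),
    "StratifiedKFold": StratifiedKFold(n_splits=2),
    "LeaveOneOut": LeaveOneOut(),
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3),
}

for name, splitter in splitters.items():
    print(f"{name}: {splitter.get_n_splits(X, y)} splits")
    for train_idx, test_idx in splitter.split(X, y):
        print("  train:", train_idx, "test:", test_idx)
```

In the `TimeSeriesSplit` output, every training index comes before every test index, which is how chronological order is preserved.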

## Implementing Cross-Validation

- **Choose Strategy**

    Select the appropriate cross-validation technique based on your dataset characteristics and the specific machine learning problem.

- **Split Data**

    Partition your dataset into multiple folds according to your chosen cross-validation method, ensuring proper distribution of classes.

- **Train and Validate**

    Systematically train models on the training folds and evaluate performance on the validation folds, recording metrics for each iteration.

- **Aggregate Results**

    Calculate the average performance across all folds to obtain reliable metrics that indicate your model's generalization capability.
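
The four steps above can be sketched with scikit-learn's `cross_val_score`, which handles splitting, training, and scoring in one call. The choice of `StratifiedKFold`, the Iris dataset, and the accuracy metric are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Steps 1 and 2: choose a strategy and define how the data is split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 3: train on each set of training folds and score on the held-out fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Step 4: aggregate the per-fold results.
print("per-fold accuracy:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the standard deviation alongside the mean shows how much the estimate varies from fold to fold.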

## Challenges and Considerations for Cross-Validation

When implementing cross-validation, be aware of these potential hurdles.

- **Computation Time**

     Cross-validation significantly increases computational demands, especially with complex models or large datasets, potentially extending training time from minutes to hours or days.

- **Data Leakage**

    Improper implementation can allow information from validation sets to influence training, leading to overly optimistic performance estimates and poor real-world generalization.

- **Fold Selection**

     Determining the optimal number of folds requires balancing statistical reliability against computational efficiency: too few folds risk high variance, and too many increase the processing burden.
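
One common source of data leakage is fitting a preprocessing step, such as a scaler, on the full dataset before cross-validation. A minimal sketch of one way to avoid it, assuming scikit-learn's `make_pipeline` and `StandardScaler` (neither is named in the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is part of the pipeline, so within each fold it is fitted
# on the training portion only and then applied to the validation portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scores = cross_val_score(pipeline, X, y, cv=5)
print("leak-free accuracy per fold:", scores)
print("mean:", scores.mean())
```

Because the whole pipeline is passed to `cross_val_score`, no statistics from a validation fold reach the training step.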

## What are Hyperparameters?

#### Definition

Configuration variables that are set before the training process of a machine learning model begins.

#### Key Characteristic

They control the learning process itself, rather than being learned from the data.

#### Importance

They are used to tune the performance of a model and can have a significant impact on the model's accuracy, generalization, and other metrics.

## **Hyperparameter Tuning**

- **Definition**

    Hyperparameter tuning is the process of selecting the optimal values for a machine learning model's hyperparameters.

- **Purpose**

    The goal of hyperparameter tuning is to find the values that lead to the best performance on a given task.

- **Examples**

    Hyperparameters are settings that control the learning process of the model, such as the learning rate, the number of neurons in a neural network, or the kernel type in a support vector machine.

## **Importance of Hyperparameter Tuning**

- Improved Accuracy 
- Enhanced Robustness
- Better Generalization 
- Outperforms defaults

## **Key Hyperparameter: Learning Rate**

- **Definition**: Step size the model takes when updating its weights.
- **High Rate**: May overshoot and miss the optimal solution.
- **Low Rate**: Slows down the training process.
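
The effect of the learning rate can be seen on a toy problem. This sketch runs plain gradient descent on the function f(w) = w², whose minimum is at w = 0; the function, the starting point, and the three rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2 and return the final value of w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                   # derivative of w**2
        w = w - learning_rate * gradient   # weight update scaled by the learning rate
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: final w = {gradient_descent(lr):.4f}")
```

With the low rate, `w` is still far from 0 after 20 steps; with the moderate rate it gets close; with the high rate each step overshoots the minimum and `w` grows in magnitude.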

## **Key Hyperparameter: Batch Size**

- **Definition**: Number of training samples processed before each weight update.
- **Large Batches**: More stable updates, but they require more computational resources.
- **Small Batches**: Lead to more frequent updates, but the updates can be noisy.
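
As a rough illustration of the trade-off above, the sketch below counts how many weight updates one pass over the data (one epoch) would trigger for different batch sizes. `iterate_minibatches` is a hypothetical helper written for this example, and the dataset size of 1000 is arbitrary.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, seed=0):
    """Yield shuffled index arrays, one per mini-batch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

for batch_size in (8, 32, 128):
    # Each mini-batch corresponds to one weight update.
    n_updates = sum(1 for _ in iterate_minibatches(1000, batch_size))
    print(f"batch size {batch_size}: {n_updates} updates per epoch")
```

Smaller batches give many updates per epoch, each based on few samples; larger batches give fewer updates, each averaged over more samples.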

## **Key Hyperparameter: Early Stopping**

- **Monitor Performance**: Track validation error during training.
- **Detect Overfitting**: Identify when validation error starts increasing.
- **Stop Training**: Halt the process to prevent overfitting.
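
The three steps above can be sketched as a small loop over recorded validation losses. The `patience` parameter (how many epochs without improvement to tolerate) and the loss values are assumptions for illustration; real frameworks offer their own early-stopping utilities.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation error improved: remember it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            # Validation error did not improve: possible overfitting.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Made-up validation losses that fall and then start rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch index:", stop)
```

Here the best loss occurs at index 3, and training halts at index 5 after two consecutive epochs without improvement.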

## **Hyperparameter Selection Methods**

- **Grid Search**

    Systematically searches through a predefined set of hyperparameter combinations. Tries every possible configuration within the specified ranges.

    1. Define Parameter Space: Select a range of values for each hyperparameter.

    2. Create Combinations: Generate all possible combinations of values.

    3. Evaluate Each Combination: Train and validate the model with each set.

    4. Select Best Performing: Choose the combination with the highest performance.

- **Random Search**

    Randomly samples hyperparameter combinations. Often more efficient than grid search, especially in high-dimensional spaces.
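
The difference between the two methods shows up in how candidate configurations are generated. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on an example parameter space for a random forest; the budget of 5 random samples is an arbitrary choice.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_space = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 3, 5],
    "criterion": ["gini", "entropy"],
}

# Grid search: every combination (3 * 3 * 2 of them).
grid = list(ParameterGrid(param_space))

# Random search: a fixed budget of randomly chosen combinations.
sampled = list(ParameterSampler(param_space, n_iter=5, random_state=42))

print("grid candidates:", len(grid))
print("random candidates:", len(sampled))
for params in sampled:
    print(" ", params)
```

The grid grows multiplicatively with every added hyperparameter, while the random-search budget stays whatever `n_iter` is set to.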

## Comparing Grid Search and Random Search

##### **Grid Search**

 - Carefully checks all preset parameter options
 - Tests every possible parameter combination
 - Can be slow for complex models

##### **Random Search**

- Randomly samples different parameter settings
- Often finds good solutions more quickly
- Works better with many different parameters
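
A minimal sketch of random search in practice, assuming scikit-learn's `RandomizedSearchCV` with a random forest on the Iris dataset. The distributions, the budget of 5 iterations, and the `random_state` values are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_distributions = {
    "n_estimators": randint(10, 101),   # integers from 10 to 100
    "max_depth": [None, 3, 5],
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # only 5 sampled configurations are evaluated
    cv=5,
    random_state=42,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Unlike a grid, `n_estimators` is drawn from a range here, so values between the usual grid points can be tried without increasing the budget.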

## Hyperparameters in Support Vector Machine (SVM)

Key hyperparameters that affect SVM performance:

 - **Regularization Parameter (C)**

    **C**: Controls the trade-off between margin width and training errors. Higher values prioritize minimizing training errors; lower values favor wider margins.

 - **Kernel Function**

    **Kernel**: Implicitly maps data into a higher-dimensional space. Options include Linear (linearly separable data), RBF (non-linear relationships), and Polynomial (curved boundaries).

 - **Influence Radius**

    **Gamma**: Determines the influence radius of training examples. Higher values create complex boundaries with a local focus; lower values produce smoother boundaries.
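
The three hyperparameters above correspond to the `C`, `kernel`, and `gamma` arguments of scikit-learn's `SVC`. The sketch below tunes them with a grid search on the Iris dataset; the candidate values and the added scaling step are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# SVMs are sensitive to feature scale, so the data is standardized first.
pipeline = make_pipeline(StandardScaler(), SVC())

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__gamma": ["scale", 0.1, 1],   # ignored by the linear kernel
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```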

## Practical Tips for Tuning

1. **Start with Defaults**

    Begin with the standard settings and observe the results.

2. **One at a Time**

    Adjust one hyperparameter at a time so that its impact is clear.

3. **Multiple Search Methods**

    Combine grid search, random search, and Bayesian optimization for comprehensive hyperparameter exploration.

4. **Monitor Validation**

    Track validation performance to avoid overfitting.
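
Tips 2 and 4 can be combined by varying a single hyperparameter while watching training and validation scores side by side. A sketch using scikit-learn's `validation_curve`, assuming an RBF `SVC` on the Iris dataset and an arbitrary range of `gamma` values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
gamma_values = [0.001, 0.01, 0.1, 1, 10]

# Only gamma changes; every other setting stays at its default.
train_scores, val_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="gamma", param_range=gamma_values, cv=5,
)

# Each row holds the scores for one gamma value, one column per fold.
for gamma, train, val in zip(
    gamma_values, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"gamma={gamma}: train={train:.3f}, validation={val:.3f}")
```

A training score that keeps rising while the validation score falls is the usual sign that a setting is overfitting.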


## Manual Tuning

- **Definition**

    Adjusting hyperparameters by hand, guided by experience.

- **Process**

    Observe the results of each run and make informed changes.

- **Advantage**

    Allows intuition and domain knowledge to be applied.


## **Implementing Code**

### Import libraries:

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np 




### **Load the dataset**

In [6]:
iris = load_iris()
X, y = iris.data, iris.target

### **Initialize model**


In [7]:
model = LogisticRegression(max_iter=200)

### **Set up K-Fold**


In [13]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold = 1
all_reports = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model 
    model.fit(X_train,y_train)

    # predict
    y_pred = model.predict(X_test)

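    # Generate the classification report for this fold from the predictions
    report = classification_report(y_test, y_pred, target_names=iris.target_names)

    # Print the report built above (comparing y_test with itself would always score 1.00)
    print(f"Classification Report for Fold {fold}:\n{report}")
    all_reports.append(report)
    fold += 1

Note: the recorded output below came from a run that passed `y_test` as both arguments to `classification_report`, so its perfect scores compare the labels with themselves and do not measure the model's predictions.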

Classification Report for Fold 1:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Classification Report for Fold 2:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00         7

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Classification Report for Fold 3:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00 

### **Grid Search on RandomForestClassifier**


In [14]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report



### **Load Data**

In [None]:
iris = load_iris()  # reload the dataset so this section runs on its own
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Model and Parameter**


In [16]:
model = RandomForestClassifier()
param_grid = {
    'n_estimators' : [10, 50, 100],
    'max_depth' : [None, 3, 5],
    'criterion' : ['gini', 'entropy']
}

### **Grid Search**

In [17]:
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)


**GridSearchCV parameters**

| Parameter | Value |
|---|---|
| estimator | RandomForestClassifier() |
| param_grid | {'criterion': ['gini', 'entropy'], 'max_depth': [None, 3, ...], 'n_estimators': [10, 50, ...]} |
| scoring | |
| n_jobs | |
| refit | True |
| cv | 5 |
| verbose | 0 |
| pre_dispatch | '2*n_jobs' |
| error_score | |
| return_train_score | False |

**RandomForestClassifier parameters**

| Parameter | Value |
|---|---|
| n_estimators | 10 |
| criterion | 'gini' |
| max_depth | |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | |
| min_impurity_decrease | 0.0 |
| bootstrap | True |

### **Evaluation**


In [None]:
print("Best parameters: ", grid_search.best_params_)
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))

Best parameters:  {'criterion': 'gini', 'max_depth': None, 'n_estimators': 10}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
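
Beyond the single best configuration, a fitted `GridSearchCV` keeps the cross-validated score of every candidate in `cv_results_`. The sketch below ranks them; it uses a smaller grid than the one above and a fixed `random_state`, both assumptions made to keep the example short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced grid (assumption): 2 * 2 = 4 candidates.
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# List every candidate from best to worst mean validation accuracy.
results = grid_search.cv_results_
for i in results["rank_test_score"].argsort():
    print(results["params"][i], round(results["mean_test_score"][i], 3))

# Cross-validated score of the winner versus its score on the held-out test set.
print("best CV accuracy:", grid_search.best_score_)
print("test accuracy:", grid_search.score(X_test, y_test))
```

Comparing the cross-validated score with the test score is a quick check that the tuning did not overfit the validation folds. On a small, easy dataset such as Iris, many candidates may tie, so the reported "best" parameters can change between runs when no `random_state` is set.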


