### **CV & Hyperparameter Tuning**

**Q : What does the train score indicate?**

* a high value implies that learning has happened, i.e. the model fits the training data
* a low value implies that little or no learning has happened (underfitting)

**Q : For a classification model, is there a cut-off score that indicates a desirable model?**

* No such universal cut-off exists
* It depends entirely on the problem statement and scenario

**Q : (Algo, hyperparameter) ordered pairs for a single problem can be a huge number. How to decide the best combination?**

* Grid search with CV (`GridSearchCV`)
* Randomized search with CV (`RandomizedSearchCV`)

Both are available in sklearn. CV stands for cross validation.

#### **WHAT ?**

**Q : What is Cross validation?**

* technique used to assess the performance and generalization ability of a machine learning algorithm
* done by dividing the data into multiple parts and training/testing on different subsets
* helps to detect overfitting
* gives an estimate of how well the model will perform on unseen data



**Q : How is Cross validation done? Explain step-by-step.**

1. Divide the data into K folds. Fix an algorithm.
2. Keep aside one fold for testing, combine the remaining K-1 folds into a training set, and run the algo over it to get a model.
3. Test that model on the held-out fold.
4. Repeat steps 2 and 3 until each of the K folds has served as the test fold once.
5. The average of the K performance scores gives an idea about the algorithm's generalization ability. It is a more reliable indicator than the score from a single split.
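
The steps above can be sketched as a manual loop. This is only an illustration: the iris dataset and `LogisticRegression` are stand-ins chosen for this example, not part of the original notes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative data and algorithm (assumptions for this sketch)
X, y = load_iris(return_X_y=True)
algo = LogisticRegression(max_iter=1000)   # step 1: fix an algorithm

# Step 1: divide the data into K folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = clone(algo)                    # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])  # step 2: train on the K-1 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # step 3: test

# Step 5: average the K scores
mean_score = float(np.mean(fold_scores))
print("Fold scores:", fold_scores)
print("Mean score:", mean_score)
```

In practice `cross_val_score` wraps this loop, so the manual version is mainly useful for seeing what happens in each fold.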

**Q : What are other types of CV?**

1. **Stratified K-Fold CV :**
 - the whole dataset is divided into k folds
 - the class distribution is preserved in every fold
 - useful when there is class imbalance in the dataset

2. **Leave-One-Out-CV :**
 - Computationally very expensive
 - a datapoint is used as a test point and training is done on rest of the dataset
 - repeated for all datapoints in the dataset
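
A small sketch of both splitters on a toy imbalanced label vector (the 16:4 class split is made up for illustration). It checks how many minority samples land in each stratified fold and how many splits Leave-One-Out produces.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

# Toy imbalanced labels: 16 samples of class 0, 4 of class 1 (hypothetical)
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified K-Fold: count minority-class samples in each test fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("Minority samples per fold:", minority_per_fold)

# Leave-One-Out: one split per data point, hence the computational cost
loo = LeaveOneOut()
n_loo_splits = loo.get_n_splits(X)
print("Number of LOO splits:", n_loo_splits)
```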


**Q : Can cross validation be considered a thorough learning of the data?**

* Although it appears to be a thorough learning process, CV is not actually that
* During cross-validation, the algo is trained from scratch for each fold. It doesn’t retain knowledge from previous folds.
* It is not thorough learning, but instead it is a thorough evaluation process
* It's like thinking twice or thrice before fixing an algo for solving a problem

CV does not contribute to the model's actual learning but rather evaluates how well an ML algorithm generalizes to unseen data. Each fold in CV serves only as a temporary training-validation split, and once CV is complete, those models are discarded. The final model is then trained separately on the full dataset. Hence, CV is a thorough evaluation technique rather than a learning process.
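
One way to see this in sklearn: `cross_val_score` fits clones of the estimator, so the object passed in stays unfitted afterwards. The sketch below assumes the iris dataset and `LogisticRegression` purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # illustrative dataset
algo = LogisticRegression(max_iter=1000)   # illustrative algorithm

# CV fits and discards internal copies of the estimator
scores = cross_val_score(algo, X, y, cv=5)
fitted_after_cv = hasattr(algo, "coef_")   # learned coefficients exist only after fit

# The final model is trained separately, after CV is done
algo.fit(X, y)
fitted_after_fit = hasattr(algo, "coef_")

print("Fitted after CV:", fitted_after_cv)
print("Fitted after explicit fit:", fitted_after_fit)
```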

**Q : Extend the 'student learning for exam' analogy of ML model building pipeline to include CV step.**

CV is more like screening tests or entrance exams, where:

 - Multiple test setups are designed to evaluate the student (or algorithm).
 - Each test is independent; there’s no learning transfer between tests.
 - Once the student (algorithm) clears the screening (CV), the real training begins (training on the entire dataset).

#### **WHY ?**

**Q : What is the prime purpose behind CV?**

* Validation/ Evaluation

**Q : What is it that CV is trying to evaluate?**

- CV evaluates an algorithm, not a specific model instance
- CV isn’t about improving or retaining knowledge in the model being trained
- CV is about understanding :
  * How well the algorithm (e.g., Random Forest, SVM) performs on the given data.
  * How **consistent** the performance is across multiple data splits.
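
A hedged sketch of reading both quantities from CV: the mean score (how well the algorithm performs) and the standard deviation across folds (how consistent it is). The dataset and the two candidate algorithms are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # illustrative dataset

# Hypothetical candidate algorithms to compare
candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
}

summary = {}
for name, algo in candidates.items():
    scores = cross_val_score(algo, X, y, cv=5, scoring="accuracy")
    # mean = typical performance, std = spread across the data splits
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```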

**Q : Average of CV scores is a highly reliable indicator of what?**

- of the algorithm’s generalization ability on unseen data.

**Q : Why is it highly reliable?**

1. **Multiple Validation Splits** 
  - CV evaluates the algorithm on multiple train-validation splits rather than a single train-test split, reducing bias from any specific data partition.
2. **Reduced Variance** 
  - Averaging the scores across folds smooths out fluctuations caused by random data variations, leading to a more stable estimate of performance.
3. **Better Approximation of Real-World Performance**
 - Since CV tests the algorithm on diverse subsets of data, the average score reflects how well the algorithm would perform on truly unseen data.
4. **Prevents Overfitting to a Single Split**
  - Without CV, a single train-test split might give an overoptimistic or overly pessimistic estimate, while CV provides a balanced evaluation.
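
To illustrate the last point, the sketch below scores the same algorithm on ten different single train-validation splits and compares the range of those scores with the 5-fold CV mean. The dataset, the algorithm, and the number of seeds are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)          # illustrative dataset
algo = LogisticRegression(max_iter=1000)   # illustrative algorithm

# Same algorithm, ten different single splits (only the seed changes)
single_split_scores = []
for seed in range(10):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    single_split_scores.append(algo.fit(X_tr, y_tr).score(X_val, y_val))

# One 5-fold CV estimate for comparison
cv_mean = cross_val_score(algo, X, y, cv=5).mean()

print("Single-split range:", min(single_split_scores), "to", max(single_split_scores))
print("5-fold CV mean:", cv_mean)
```

On many datasets the single-split scores differ noticeably from seed to seed, which is the instability that averaging over folds is meant to smooth out.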

**Q : "CV mitigates the problem of overfitting and underfitting of an ML model (not ML algorithm)". True or False? Justify.**

**False.**  

- CV helps evaluate and detect overfitting or underfitting
- but it does **not** directly mitigate these issues in the final **ML model** 
- Instead, CV provides insight into how well an **ML algorithm** generalizes by testing it on multiple train-validation splits. 
- If overfitting or underfitting is observed, actions like changing the **algorithm, data preprocessing, or regularization techniques** must be taken to address it. 
- CV itself does not alter the model’s parameters or learning process—it only assesses an algorithm's performance.

**Q : Doing CV is like thinking twice or thrice before fixing a ML algo for solving a problem. Justify.**

- CV allows us to **evaluate** an ML algorithm on multiple data splits before finalizing it for the problem. 
- Just as thinking twice or thrice helps in making a well-informed decision, CV helps in assessing whether an algorithm generalizes well to unseen data, avoiding **hasty or biased conclusions** based on a single train-test split
- provides a clearer picture of an algorithm’s **stability and consistency**, helping us decide whether it is the right choice for the given problem.

#### **WHERE ?**

CV is used in machine learning model evaluation, feature selection, and hyperparameter tuning to ensure robust performance.

It is applied when data is limited, when avoiding overfitting is crucial, or when comparing models or hyperparameter settings.

**Q : How is CV used for hyperparameter tuning? Explain step-by-step.**

1. **Initial Data Split**
  * Typically, the dataset is first split into a training set (e.g., 80%) and a test set (the rest)
  * the training set is used for CV, i.e., it is further divided into folds
  * the test set, also called the **holdout set**, is held out and never seen by the model during training or CV

2. **Folding the training Set**
  * training dataset is divided into k equal-sized subsets (folds).
  * One subset is used for validation, and the remaining k-1 subsets are used for training.

3. **Training**
  * (k-1) subsets form a single training set
  * we get a single model upon running the algorithm on this training set
  * this model is validated using the held-out validation fold

4. **Repeat**
  * This process is repeated k times, each time using a different fold for validation purpose.
  * hence, cross-validation

5. **Average Performance**
  * In the end, there are k different models and their performance scores
  * The final performance is computed as the average of the scores across the k validation folds
  * this value indicates how a single hyperparameter setting performs on average
  * the same process is repeated for each hyperparameter setting

6. **Comparison**
  * after comparing average performances of various hyperparameter settings, the best hyperparameter setting is chosen

7. **Actual training**
  * the best hyperparameter setting is locked
  * the algo is run under this setting over the entire train set
  * a single final model is obtained which will be tested on the holdout set
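
The seven steps can be sketched by hand before handing the job to `GridSearchCV`. The dataset, the SVM, and the candidate values of `C` below are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # illustrative dataset

# Step 1: hold out a test set that CV never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 2-5: k-fold CV on the training set for each hyperparameter setting
candidate_C = [0.1, 1, 10]          # hypothetical candidate values
mean_cv_scores = {}
for C in candidate_C:
    scores = cross_val_score(SVC(C=C), X_train, y_train, cv=5)
    mean_cv_scores[C] = scores.mean()

# Step 6: compare average performances and pick the best setting
best_C = max(mean_cv_scores, key=mean_cv_scores.get)

# Step 7: lock the setting, train on the entire train set, test on the holdout set
final_model = SVC(C=best_C).fit(X_train, y_train)
holdout_score = final_model.score(X_test, y_test)

print("Mean CV score per C:", mean_cv_scores)
print("Best C:", best_C, "| holdout accuracy:", holdout_score)
```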


**Q : For each hyperparameter setting, what is the need of training over k different subsets & then computing average performance? Why not on a single training set? What is the need of this double work? Or in other words why cross validate for a single hyperparameter setting?**

OR

**Q : What is the need of cross validation over single simple validation?**

**A :**

(i) *What is the problem with a Single Training-Validation Split?*
 
**Bias from Random Splitting :** 
  - single split may accidentally contain "easy" or "hard" examples in the training or test set
  - will lead to overly optimistic or pessimistic performance evaluation
  - model may perform well on this specific split but poorly on unseen data

**Overfitting to a Particular Split :** 
  - The model might learn patterns specific to the given training set
  - since you're testing on just one test set, you don't know if the model generalizes well across different subsets
  - A good test score might give a false sense of confidence.

**Data Imbalance Issues :**
  - Important patterns might be underrepresented in the test set due to class imbalance or sampling biases
  - will lead to misleading performance evaluation
  - CV ensures all patterns are tested across different subsets

(ii) *Why is k-fold CV done for each hyperparameter setting?*

**Reducing variance in Performance Estimation :**
  - training on k different subsets allows the model to be evaluated across diverse portions of the data
  - will lead to a more reliable estimate of performance
  - averaged score smooths out any fluctuations caused by randomness in a single split

**Ensuring Generalization Ability :**
  - By using different validation sets in each fold, we ensure the model performs well on all parts of the data, not just one specific subset
  - helps in selecting hyperparameters that generalize well to unseen data

**Efficient use of Data :**
  - With CV, each data point gets to be in the validation set exactly once and in the training set k-1 times, maximizing the use of limited data.
  - Across the k rounds, every data point contributes to both training and validation, so no information is wasted.
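
The "exactly once in validation, k-1 times in training" property can be checked directly by counting how often each index appears on each side of the split. The sample count and k below are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 20, 5                 # arbitrary sizes for the illustration
X = np.zeros((n_samples, 1))         # placeholder features; only indices matter

val_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

for train_idx, val_idx in KFold(n_splits=k).split(X):
    train_counts[train_idx] += 1     # times each point was used for training
    val_counts[val_idx] += 1         # times each point was used for validation

print("Validation counts:", val_counts)
print("Training counts:", train_counts)
```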

**Q : Summarize the differences and similarities between 'CV with hyperparameter tuning' and 'CV without hyperparameter tuning' in a table.** 


| Aspect                              | CV Without Hyperparameter Tuning                          | CV With Hyperparameter Tuning                                |
|-------------------------------------|----------------------------------------------------------|------------------------------------------------------------|
| **Purpose**                         | Evaluate model performance and generalization ability.    | Find the best hyperparameter settings for the model.        |
| **Hyperparameters**                 | Fixed throughout the process.                            | Multiple hyperparameter combinations are tested.            |
| **Number of Models**                | k models (from k folds).                                 | k models per hyperparameter combination, leading to more models overall. |
| **Performance Score**               | Average performance score across k folds.                | Average performance score across k folds for each hyperparameter combination. |
| **Final Model**                     | Retrained on the full dataset with fixed hyperparameters. | Retrained on the full dataset using the best hyperparameters found during tuning. |
| **Computational Cost**              | Relatively low.                                           | Higher due to multiple hyperparameter combinations being evaluated. |
| **Focus**                           | Assess the model’s generalization ability.               | Optimize the model's performance through parameter adjustment. |
| **Use Cases**                       | Simple evaluation tasks or when hyperparameters are predetermined. | When tuning hyperparameters to maximize model performance.  |
| **Analogy (Student Example)**       | Multiple mock tests to revise concepts, final revision based on fixed study plan. | Mock tests + optimizing study plan for best performance before final revision. |
| **Practical Application**           | Less commonly used, mainly for baseline evaluation.       | Widely used in practice for building high-performing models. |
| **Risk of Data Leakage**            | Minimal, provided test data remains separate.             | Minimal, provided test data remains separate.               |


#### **HOW ?**

**Q : How to implement simple cross validation to check performance of RandomForest classifier algorithm on a dataset?**

```python
# import dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# split the data (X and y are assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# set up the algorithm apparatus
model = RandomForestClassifier(random_state=42)

# perform CV
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')

# average performance of the algo
print("Cross-Validation Scores:", cv_scores)
print("Average CV Score:", np.mean(cv_scores))
```

**Q : Which module of sklearn has the classes GridSearchCV & RandomizedSearchCV?**

* model_selection module

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
```

**Q : What is the purpose of the verbose parameter of GridSearchCV?**

* It controls how much progress information is printed while the grid search runs
* verbose = 0 means print nothing
* verbose = 1 means print only a brief summary (e.g. number of folds, candidates and total fits)
* verbose = 2 means also print a line for each fit with its computation time; higher values add further detail such as scores

**Q : How is GridSearchCV executed in Python using sklearn?**

```python
# STEP 01 - Import dependencies
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# STEP 02 - Define model
model = SVC()

# STEP 03 - Set up a parameter grid (a dictionary of candidate values)
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.01, 0.1, 1]
}

# STEP 04 - Initialize a GridSearchCV object (5-fold cross-validation)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

# STEP 05 - Run the search over the training set (X_train, y_train from the earlier split)
grid_search.fit(X_train, y_train)

# Best parameters and best mean CV score
print(grid_search.best_params_)
print(grid_search.best_score_)
best_model = grid_search.best_estimator_  # refit on all of X_train by default

# Evaluate final model on the held-out test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(final_accuracy)
```
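
The cost of a grid search can be estimated before running it. The sketch below counts the candidate combinations for the same grid with `ParameterGrid` and multiplies by the number of folds. With the default `refit=True`, one additional fit on the whole training set follows.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV example
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.01, 0.1, 1]
}
cv = 5

# One candidate per combination of values: 3 * 2 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per fold
n_cv_fits = n_candidates * cv

print("Candidates:", n_candidates)
print("Fits during CV:", n_cv_fits)
```

This count is why grid search becomes expensive quickly as more hyperparameters or values are added, and why randomized search with a fixed `n_iter` is often preferred for large search spaces.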


**Q : What about RandomizedSearchCV?**

```python
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Define model
model = SVC()

# Define parameter distributions
# uniform(loc, scale) samples from the interval [loc, loc + scale]
param_dist = {
    'C': uniform(0.1, 10),       # C sampled from [0.1, 10.1]
    'kernel': ['linear', 'rbf'],
    'gamma': uniform(0.01, 1)    # gamma sampled from [0.01, 1.01]
}

# Perform Randomized Search with 10 sampled candidates and 5-fold CV
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and best mean CV score
print(random_search.best_params_)
print(random_search.best_score_)
```

**Q : Is hyperparameter tuning the only method to rectify overfitting issues with a model? If not, what are the other remedies?**

No. Other remedies include:

1. Get more data
2. Data augmentation
3. Early stopping
4. Dropout (for deep learning)
5. Feature engineering (transformation, selection)
6. Regularization
7. Ensembles


**Q : To solve a problem, I have a list of algorithms and several hyperparameter settings for each. By running CV for every setting, I get the optimal hyperparameter setting for each algorithm. Out of these best-of-each-algorithm candidates, how do I choose one? How do I finalize a model?**