# **Cross Validation in Machine Learning**

- Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data.

- It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds.

- This process is repeated multiple times, each time using a different fold as the validation set.

- Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance.

- Cross validation is an important step in the machine learning process and helps to ensure that the model selected for deployment is robust and generalizes well to new data.
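
The split, train, validate, and average steps described above can be sketched as an explicit loop. This is a minimal illustration, assuming the Iris dataset and a logistic regression model; both are placeholder choices rather than part of the method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumed example data and model; any estimator with fit/score would do.
X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in kf.split(X):
    # Train on the remaining folds, validate on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# Average the per-fold results into a single performance estimate.
mean_score = np.mean(fold_scores)
print(f"Per-fold accuracy: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {mean_score:.3f}")
```

A fresh model is created inside the loop so that no fitted state leaks from one fold to the next.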

## What is cross-validation used for?

- The main purpose of cross validation is to detect overfitting, which occurs when a model fits the training data so closely that it performs poorly on new, unseen data.

- By evaluating the model on multiple validation sets, cross validation provides a more realistic estimate of the model’s generalization performance, i.e., its ability to perform well on new, unseen data.
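
One way to see why this matters is to compare the accuracy a model reaches on its own training data with its cross-validated accuracy. The sketch below assumes a synthetic, deliberately noisy dataset and an unconstrained decision tree, chosen only because such a tree can memorize its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data; flip_y adds label noise that cannot be learned.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same data the tree was trained on (optimistic).
train_accuracy = tree.fit(X, y).score(X, y)

# Accuracy averaged over 5 held-out folds (closer to performance on unseen data).
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()

print(f"Training accuracy:        {train_accuracy:.3f}")
print(f"Cross-validated accuracy: {cv_accuracy:.3f}")
```

A large gap between the two numbers is a sign that the model is overfitting.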

## Types of Cross-Validation

- There are several cross validation techniques:
  - k-fold cross validation
  - leave-one-out cross validation
  - holdout validation
  - stratified cross validation

- The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modeling problem.
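
The techniques listed above differ mainly in how they split the data. The following sketch compares them on a tiny, made-up dataset of 12 samples with an 8:4 class imbalance; the data and the split sizes are assumptions chosen to keep the numbers easy to check.

```python
import numpy as np
from sklearn.model_selection import (
    KFold, LeaveOneOut, StratifiedKFold, train_test_split
)

# Hypothetical toy data: 12 samples, 8 of class 0 and 4 of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

splitters = {
    "k-fold": KFold(n_splits=4),
    "leave-one-out": LeaveOneOut(),
    "stratified k-fold": StratifiedKFold(n_splits=4),
}

# Number of train/validate rounds each technique performs.
n_rounds = {name: s.get_n_splits(X, y) for name, s in splitters.items()}
print(n_rounds)

# Stratification keeps the class ratio the same in every validation fold.
stratified_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in StratifiedKFold(n_splits=4).split(X, y)
]
print(stratified_class_counts)

# Holdout validation is a single split rather than a repeated procedure.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(len(X_train), len(X_test))
```

Leave-one-out runs one round per sample, which is why it becomes expensive on large datasets.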

###  K-Fold Cross Validation

- In K-Fold Cross Validation, we split the dataset into k subsets (known as folds), train the model on k-1 of them, and hold out the remaining one to evaluate the trained model.

- We iterate k times, reserving a different subset for testing each time.

*Example of K Fold Cross Validation*

- The diagram below shows an example of the training subsets and evaluation subsets generated in k-fold cross-validation.

- Here, we have 25 instances in total and k = 5.
- In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training). In the second iteration we use the second 20 percent for evaluation and the remaining four subsets for training ([6-10] for testing and [1-5] plus [11-25] for training), and so on.

![Training and evaluation subsets in k-fold cross-validation](https://media.geeksforgeeks.org/wp-content/uploads/crossValidation.jpg)

Source: GeeksforGeeks

```text
Total instances: 25
Value of k     : 5 
No. Iteration              Training set observations                     Testing set observations
 1      [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]   [0 1 2 3 4]
 2      [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]   [5 6 7 8 9]
 3      [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]   [10 11 12 13 14]
 4      [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]   [15 16 17 18 19]
 5      [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]   [20 21 22 23 24]
```
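
A table like the one above can be produced with scikit-learn's `KFold`. This is a small sketch, assuming 25 placeholder instances numbered from 0 and no shuffling, so that the folds are contiguous blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# 25 placeholder instances, indexed 0..24.
data = np.arange(25)

# Without shuffle=True, KFold cuts the data into consecutive blocks.
kf = KFold(n_splits=5)
folds = list(kf.split(data))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Iteration {i}: train={train_idx} test={test_idx}")
```

Note that the indices are zero-based, so the first test fold `[0 1 2 3 4]` corresponds to instances 1 to 5 in the description above.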

*k fold cross-validation*

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Linear-kernel support vector classifier
svm_classifier = SVC(kernel='linear')

# 5 shuffled folds with a fixed seed for reproducibility
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# One accuracy score per fold
cross_val_results = cross_val_score(svm_classifier, X, y, cv=kf)

print(f'Cross-Validation Results (Accuracy): {cross_val_results}')
print(f'Mean Accuracy: {cross_val_results.mean()}')
```

```text
Cross-Validation Results (Accuracy): [1.         1.         0.96666667 0.93333333 0.96666667]
Mean Accuracy: 0.9733333333333334
```



## Advantages & Disadvantages of Cross Validation

| **Advantages** | **Disadvantages** |
| ----------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Detecting Overfitting:** Helps expose overfitting by providing a more robust estimate of the model's performance on unseen data. | **Computationally Expensive:** Can be computationally expensive, especially with a large number of folds or a complex model. |
| **Model Selection:** Compares different models and selects the one performing the best on average. | **Time-Consuming:** Can be time-consuming, especially with many hyperparameters to tune or when comparing multiple models. |
| **Hyperparameter Tuning:** Optimizes hyperparameters, such as the regularization parameter, by selecting the values that give the best average performance across the validation folds. | **Bias-Variance Tradeoff:** The choice of the number of folds can impact the bias-variance tradeoff. Too few folds leave less data for training in each round, which tends to bias the estimate pessimistically, while very many folds (for example leave-one-out) can give a higher-variance estimate. |
| **Data Efficient:** Utilizes all available data for both training and validation, making it more data-efficient than a single holdout split. | |


# **Hyperparameter Tuning**

- A Machine Learning model is defined as a mathematical model with several parameters that need to be learned from the data.

- By training a model with existing data, we can fit the model parameters.

-  However, there is another kind of parameter, known as `Hyperparameters`, that cannot be directly learned from the regular training process.
  
-  They are usually fixed before the actual training process begins.
  
-  These parameters express important properties of the model such as its *complexity* or *how fast* it should learn.
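
The difference between the two kinds of parameter can be seen directly on a scikit-learn estimator. In this sketch, which assumes the Iris dataset and a logistic regression model as examples, hyperparameters are available as soon as the model is constructed, while learned parameters such as `coef_` only exist after fitting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters are fixed when the estimator is created.
model = LogisticRegression(C=0.5, max_iter=1000)
hyperparameters = model.get_params()
print(hyperparameters["C"], hyperparameters["max_iter"])

# Learned parameters do not exist yet.
has_coef_before_fit = hasattr(model, "coef_")
print(has_coef_before_fit)

# Training fills in the model parameters (weights and intercepts).
model.fit(X, y)
print(model.coef_.shape, model.intercept_.shape)
```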

## Hyperparameter Tuning


- Hyperparameter tuning is the <u>process of selecting the optimal values for a machine learning model’s hyperparameters.</u>

- Hyperparameters are settings that control the learning process of the model, such as the **learning rate**, the **number of neurons** in a neural network, or the **kernel type** in a support vector machine.

- The goal of hyperparameter tuning is to find the values that lead to the best performance on a given task.

## What are Hyperparameters?

- Hyperparameters in machine learning are configuration variables set before model training.
  
- They control the learning process, unlike model parameters learned from the data.
  
- Hyperparameters are crucial for tuning a model's performance and can impact accuracy, generalization, and other metrics.

## Different Types of Hyperparameters

- Hyperparameters differ from model parameters (weights and biases) learned from the data.
  
- Various types of hyperparameters exist, each with specific roles.

### Hyperparameters in Neural Networks
| Hyperparameter           | Description                                                                                                                |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------|
| Learning rate            | Controls the step size taken by the optimizer during each training iteration. Too small or large rates can lead to convergence issues.           |
| Epochs                   | Represents the number of times the entire training dataset passes through the model during training. Increased epochs may enhance performance but could lead to overfitting.    |
| Number of layers         | Determines the depth of the model, impacting complexity and learning ability.                                              |
| Number of nodes per layer| Influences the width of the model, affecting its capacity to represent complex relationships in the data.                   |
| Architecture             | Dictates the overall structure of the neural network, including the number of layers, neurons per layer, and connections. Optimal architecture depends on task complexity and dataset size. |
| Activation function      | Introduces non-linearity, enabling the model to learn complex decision boundaries. Common functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).               |


### Hyperparameters in Support Vector Machine
| Hyperparameter   | Description                                                                                                                |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------- |
| C                | The regularization parameter that controls the trade-off between a wide margin and few training errors. A larger value of C penalizes training errors more heavily, resulting in a narrower margin and a higher risk of overfitting. A smaller value of C tolerates more training errors in exchange for a wider margin, which can generalize better but may underfit if C is too small. |
| Kernel           | The kernel function that defines the similarity between data points. Different kernels can capture different relationships between data points, and the choice of kernel can significantly impact the performance of the SVM. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid. |
| Gamma            | The parameter that controls how far the influence of a single training example reaches. A larger value of gamma confines each example's influence to its immediate neighbourhood, producing a more complex decision boundary, while a smaller value lets the influence reach farther, producing a smoother boundary. The choice of gamma is particularly important for RBF kernels. |
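
To get a feel for how `C` and `gamma` interact, one can cross-validate an RBF-kernel SVM over a few values of each. This is a sketch assuming the Iris dataset and a coarse, arbitrarily chosen set of values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Mean 5-fold accuracy for each (C, gamma) pair; values are illustrative only.
scores = {}
for C in (0.01, 1, 100):
    for gamma in (0.01, 1, 100):
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        scores[(C, gamma)] = cross_val_score(clf, X, y, cv=5).mean()

for (C, gamma), score in scores.items():
    print(f"C={C:<6} gamma={gamma:<6} mean accuracy={score:.3f}")
```

Comparing the rows shows how sensitive the model is to each setting, which is the motivation for the systematic search strategies discussed next.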


## Hyperparameter Tuning techniques

- Models can have many hyperparameters and finding the best combination of parameters can be treated as a search problem.
  
- Two common strategies for hyperparameter tuning are:
  1. GridSearchCV
  2. RandomizedSearchCV

### 1. GridSearchCV 

- Grid search is a brute force approach to hyperparameter optimization.
  
- It explores all possible combinations from a grid of hyperparameter values.
  
- Each set's model performance is logged, and the combination with the best results is chosen.
  
- GridSearchCV refers to this approach, searching for the best hyperparameter set from the grid.
  
- Despite being exhaustive and ideal for finding the best combination, grid search is slow.
  
- It requires significant processing power and time, which may not always be available.

**For example:**

Suppose we want to tune two hyperparameters, `C` and `Alpha`, of a Logistic Regression classifier, each with its own set of candidate values.

The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and will return the best one.

As in the image:

- `C` values: [0.1, 0.2, 0.3, 0.4, 0.5]
- `Alpha` values: [0.1, 0.2, 0.3, 0.4]

For the combination `C=0.3` and `Alpha=0.2`, the performance score is 0.726 (the highest), so that combination is selected.


![Grid of performance scores for each combination of C and Alpha](https://media.geeksforgeeks.org/wp-content/uploads/Hyp_tune.png)

Source: Geeksforgeeks
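
The size of the grid in this example can be checked with scikit-learn's `ParameterGrid`, which only enumerates combinations and does not fit anything. The names `C` and `Alpha` are taken from the example above; the choice of 5 folds is an assumption used to show how quickly the number of model fits grows.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values from the example above.
grid = ParameterGrid({
    "C": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Alpha": [0.1, 0.2, 0.3, 0.4],
})

# Every combination of the two lists is one candidate model.
n_combinations = len(grid)

# With k-fold cross validation, each candidate is trained k times.
n_folds = 5  # assumed value for illustration
n_fits = n_combinations * n_folds

print(f"Combinations: {n_combinations}, total fits with {n_folds} folds: {n_fits}")
print(list(grid)[:3])
```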

*GridSearchCV*

```python
# Necessary imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification

# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# Creating the hyperparameter grid: 15 values of C spaced evenly on a log scale
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object with 5-fold cross validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# X and y are the feature matrix and target variable
# Fit the GridSearchCV object to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
```

```text
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
```
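
Beyond the single best result, a fitted `GridSearchCV` object records the score of every candidate in its `cv_results_` attribute. The sketch below uses the same synthetic dataset but, as an assumption to keep it quick, a narrower range of `C` values and `max_iter=1000`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42
)

# Assumed smaller grid: 7 values of C from 1e-3 to 1e3.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": np.logspace(-3, 3, 7)}, cv=5
)
search.fit(X, y)

# One entry per candidate: its settings, mean score, and spread across folds.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"C={params['C']:<8g} mean accuracy={mean:.3f} (+/- {std:.3f})")
```

Looking at the whole table rather than only `best_params_` shows whether the best value is clearly better or just one of several near-equal candidates.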


**Drawback:**

GridSearchCV evaluates every combination of hyperparameters in the grid, which makes grid search computationally very expensive.

### 2. RandomizedSearchCV 


- **Random Search Method:**
  - Selects values randomly, contrasting with the predetermined set of numbers in grid search.

  - Attempts a different set of hyperparameters in each iteration and logs the model's performance.
  
  - Returns the combination with the best outcome after several iterations, reducing unnecessary computation.

<br/>

- **RandomizedSearchCV:**
  - Addresses drawbacks of GridSearchCV by exploring a fixed number of hyperparameter settings.

  - Samples settings at random from the specified lists or distributions to find the best set of hyperparameters.

  - Often produces comparable results faster than a grid search.
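
The sampling step on its own can be illustrated with scikit-learn's `ParameterSampler`, which draws settings without fitting any model. The search space below is an assumed example for a decision tree; enumerating it exhaustively would give 2 x 8 x 8 x 2 = 256 combinations, while the sampler draws only `n_iter` of them.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; randint(1, 9) draws integers from 1 to 8.
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

# Draw a fixed number of random settings instead of enumerating all of them.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))

for s in samples:
    print(s)
```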


*RandomizedSearchCV*

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Generate a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# Search space: lists are sampled uniformly, randint(1, 9) draws integers from 1 to 8
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"]
}

# No random_state is set, so the selected parameters can differ between runs
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
```

```text
Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 1}
Best score is 0.825
```


**Drawback:**

Because only a random subset of settings is tried, the result may not be the ideal hyperparameter combination.



## Challenges in Hyperparameter Tuning

- **Dealing with High-Dimensional Hyperparameter Spaces:** Efficient Exploration and Optimization
  
- **Handling Expensive Function Evaluations:** Balancing Computational Efficiency and Accuracy
  
- **Incorporating Domain Knowledge:** Utilizing Prior Information for Informed Tuning

- **Developing Adaptive Hyperparameter Tuning Methods:** Adjusting Parameters During Training

## Applications of Hyperparameter Tuning

- **Model Selection:** Choosing the Right Model Architecture for the Task

- **Regularization Parameter Tuning:** Controlling Model Complexity for Optimal Performance

- **Feature Preprocessing Optimization:** Enhancing Data Quality and Model Performance

- **Algorithmic Parameter Tuning:** Adjusting Algorithm-Specific Parameters for Optimal Results
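
Several of these applications can be combined in a single search. The sketch below, which assumes the Iris dataset and an SVM classifier, wraps preprocessing and the model in a `Pipeline` so that the scaling step, the kernel, and the regularization strength are all tuned together.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Step names prefix the parameters: "svc__C" is the C argument of the SVC step.
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), "passthrough"],  # preprocessing choice
    "svc__kernel": ["linear", "rbf"],                             # algorithm-specific
    "svc__C": [0.1, 1, 10],                                       # regularization
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted on the training folds only, so the validation fold does not influence the preprocessing.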

## Advantages & Disadvantages of Hyperparameter Tuning

| **Advantages** | **Disadvantages** |
| -------------------------------------- | ---------------------------------------- |
| - Improved model performance           | - Computational cost                   |
| - Reduced overfitting and underfitting  | - Time-consuming process               |
| - Enhanced model generalizability       | - Risk of overfitting                   |
| - Optimized resource utilization        | - No guarantee of optimal performance   |
| - Improved model interpretability      | - Requires expertise                   |
