# **Hyperparameter Tuning**

- A machine learning model is a mathematical model with parameters that are learned from data.

- By training a model with existing data, we can fit the model parameters.

-  However, there is another kind of parameter, known as a **hyperparameter**, that cannot be learned directly by the regular training process.
  
-  They are usually fixed before the actual training process begins.
  
-  These parameters express important properties of the model such as its *complexity* or *how fast* it should learn.
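
The distinction between the two kinds of parameters can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the dataset sizes and the values of `C` and `max_iter` are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C and max_iter are hyperparameters: we fix them before training starts
model = LogisticRegression(C=0.5, max_iter=500)

# coef_ and intercept_ are model parameters: fit() learns them from the data
model.fit(X, y)

print("Hyperparameter C:", model.get_params()["C"])
print("Learned weights shape:", model.coef_.shape)
print("Learned intercept shape:", model.intercept_.shape)
```

Calling `fit` changes `coef_` and `intercept_`, but it never changes `C`; that value stays whatever was chosen at construction time.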

## Hyperparameter Tuning


- Hyperparameter tuning is the <u>process of selecting the optimal values for a machine learning model’s hyperparameters.</u>

- Hyperparameters are settings that control the learning process of the model, such as the **learning rate**, the **number of neurons** in a neural network, or the **kernel type** in a support vector machine.

- The goal of hyperparameter tuning is to find the values that lead to the best performance on a given task.

## What are Hyperparameters?

- Hyperparameters in machine learning are configuration variables set before model training.
  
- They control the learning process, unlike model parameters learned from the data.
  
- Hyperparameters are crucial for tuning a model's performance and can impact accuracy, generalization, and other metrics.

## Different Types of Hyperparameters

- Hyperparameters differ from model parameters (weights and biases) learned from the data.
  
- Various types of hyperparameters exist, each with specific roles.

### Hyperparameters in Neural Networks
| Hyperparameter           | Description                                                                                                                |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------|
| Learning rate            | Controls the step size taken by the optimizer during each training iteration. A rate that is too small makes convergence slow; a rate that is too large can make training oscillate or diverge.           |
| Epochs                   | Represents the number of times the entire training dataset passes through the model during training. Increased epochs may enhance performance but could lead to overfitting.    |
| Number of layers         | Determines the depth of the model, impacting complexity and learning ability.                                              |
| Number of nodes per layer| Influences the width of the model, affecting its capacity to represent complex relationships in the data.                   |
| Architecture             | Dictates the overall structure of the neural network, including the number of layers, neurons per layer, and connections. Optimal architecture depends on task complexity and dataset size. |
| Activation function      | Introduces non-linearity, enabling the model to learn complex decision boundaries. Common functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).               |
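
As a rough sketch, several of the hyperparameters in the table map onto constructor arguments of scikit-learn's `MLPClassifier`. The dataset and the specific values below are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # number of layers (2 hidden) and nodes per layer
    activation="relu",            # activation function
    learning_rate_init=0.01,      # learning rate
    max_iter=200,                 # upper bound on epochs for the default adam solver
    random_state=0,
)
mlp.fit(X, y)

# n_layers_ counts the input and output layers as well as the hidden ones
print("Total layers:", mlp.n_layers_)
print("Epochs actually run:", mlp.n_iter_)
```

Training may stop before `max_iter` epochs if the loss stops improving, which is why the number of epochs actually run is reported separately.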


### Hyperparameters in Support Vector Machine
| Hyperparameter   | Description                                                                                                                |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------- |
| C                | The regularization parameter that controls the trade-off between a wide margin and few training errors. A larger value of C penalizes training errors more heavily, which gives a narrower margin and a closer fit to the training data, at the risk of overfitting. A smaller value of C tolerates more training errors in exchange for a wider margin, which regularizes more strongly but may underfit. |
| Kernel           | The kernel function that defines the similarity between data points. Different kernels can capture different relationships between data points, and the choice of kernel can significantly impact the performance of the SVM. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid. |
| Gamma            | The kernel coefficient that controls how far the influence of a single training example reaches. A larger value of gamma means each example influences only its close neighborhood, giving a more complex decision boundary that can overfit; a smaller value means the influence reaches farther, giving a smoother boundary. Gamma applies to the RBF, polynomial, and sigmoid kernels and is particularly important for RBF. |
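
To see how `C` and `gamma` interact, one option is to cross-validate an RBF-kernel `SVC` over a few values of each. This is a sketch on synthetic data; the candidate values are assumptions picked to span a couple of orders of magnitude.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for C in (0.1, 1.0, 10.0):
    for gamma in (0.01, 0.1, 1.0):
        svc = SVC(kernel="rbf", C=C, gamma=gamma)
        # Mean accuracy over 5 cross-validation folds
        results[(C, gamma)] = cross_val_score(svc, X, y, cv=5).mean()

for (C, gamma), score in results.items():
    print(f"C={C:<5} gamma={gamma:<5} mean CV accuracy={score:.3f}")
```

The scores depend on the data, so the useful part is the pattern: how accuracy changes as `C` and `gamma` grow or shrink together.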


## Hyperparameter Tuning Techniques

- Models can have many hyperparameters, and finding the best combination of them can be treated as a search problem.
  
- Two widely used strategies for hyperparameter tuning, both provided by scikit-learn, are:
  1. GridSearchCV
  2. RandomizedSearchCV

### 1. GridSearchCV 

- Grid search is a brute force approach to hyperparameter optimization.
  
- It explores all possible combinations from a grid of hyperparameter values.
  
- Each set's model performance is logged, and the combination with the best results is chosen.
  
- GridSearchCV refers to this approach, searching for the best hyperparameter set from the grid.
  
- Grid search is exhaustive, so it always finds the best combination within the grid, but it is slow.
  
- It requires significant processing power and time, which may not always be available.
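
The procedure described above can be written out by hand in a few lines. This is a minimal sketch of the idea, not how `GridSearchCV` is implemented internally; the decision tree, the grid values, and the synthetic data are all assumptions made for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: 3 * 2 = 6 combinations
grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

log = []  # one (params, score) entry per combination
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = DecisionTreeClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=5).mean()
    log.append((params, score))

# Pick the combination with the highest mean cross-validation score
best_params, best_score = max(log, key=lambda entry: entry[1])
print(f"Tried {len(log)} combinations")
print("Best:", best_params, f"score={best_score:.3f}")
```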

**For example:**

Suppose we want to tune two hyperparameters, `C` and `Alpha`, of a logistic regression classifier, each with its own set of candidate values.

Grid search builds one version of the model for every combination of those values and returns the best-performing one.

As in the image:

- `C` values: [0.1, 0.2, 0.3, 0.4, 0.5]
- `Alpha` values: [0.1, 0.2, 0.3, 0.4]

The combination `C=0.3` and `Alpha=0.2` gives the highest performance score (0.726), so it is selected.


![Grid search over C and Alpha](https://media.geeksforgeeks.org/wp-content/uploads/Hyp_tune.png)

Source: GeeksforGeeks
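
The grid in this example can be enumerated with scikit-learn's `ParameterGrid`, which makes the number of candidate models explicit. The sketch below only lists the combinations; it does not train anything, and the name `Alpha` is taken from the example above rather than from any particular estimator.

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "C": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Alpha": [0.1, 0.2, 0.3, 0.4],
})

# 5 values of C times 4 values of Alpha
print(len(grid))  # 20

# Each candidate is a plain dict of hyperparameter values
candidates = list(grid)
print({"C": 0.3, "Alpha": 0.2} in candidates)  # True
```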

*GridSearchCV*

```python
# Necessary imports
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Generate a synthetic dataset for illustration
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# Creating the hyperparameter grid: 15 values of C, log-spaced from 1e-5 to 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit the GridSearchCV object to the data
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print(f"Tuned Logistic Regression Parameters: {logreg_cv.best_params_}")
print(f"Best score is {logreg_cv.best_score_}")
```

```
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
```


**Drawback:**

GridSearchCV evaluates every combination in the grid, including those that turn out to be poor. The number of combinations multiplies with each added hyperparameter, which makes grid search computationally very expensive.
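
A quick way to estimate this cost is to count model fits: the product of the grid sizes times the number of cross-validation folds. The helper `n_fits` below is a hypothetical function written for this illustration, and the second grid is an assumed example with placeholder values.

```python
from math import prod


def n_fits(grid, cv):
    """Number of model fits in a full grid search, excluding the final refit."""
    return prod(len(values) for values in grid.values()) * cv


# One hyperparameter with 15 candidate values and 5-fold cross-validation
one_param = {"C": range(15)}
print(n_fits(one_param, cv=5))  # 15 * 5 = 75

# Adding two more hyperparameters (5 and 4 candidate values) multiplies the cost
three_params = {"C": range(15), "tol": range(5), "max_iter": range(4)}
print(n_fits(three_params, cv=5))  # 15 * 5 * 4 * 5 = 1500
```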

### 2. RandomizedSearchCV 


- **Random Search Method:**
  - Selects values randomly, contrasting with the predetermined set of numbers in grid search.

  - Attempts a different set of hyperparameters in each iteration and logs the model's performance.
  
  - Returns the combination with the best outcome after several iterations, reducing unnecessary computation.

<br/>

- **RandomizedSearchCV:**
  - Addresses drawbacks of GridSearchCV by exploring a fixed number of hyperparameter settings (`n_iter`, 10 by default).

  - Samples the search space at random to find the best set of hyperparameters.

  - Often produces results comparable to grid search in less time.
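
The sampling step can be seen in isolation with scikit-learn's `ParameterSampler`, which draws candidates from distributions instead of enumerating a grid. The distributions and ranges below are assumptions chosen for illustration; no model is trained here.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "C": loguniform(1e-3, 1e3),     # continuous, sampled on a log scale
    "max_iter": randint(100, 501),  # integers from 100 to 500 inclusive
    "penalty": ["l2"],              # lists are sampled uniformly
}

# Draw a fixed number of candidate settings, as RandomizedSearchCV does via n_iter
sampler = ParameterSampler(param_distributions, n_iter=5, random_state=0)
samples = list(sampler)

for sample in samples:
    print(sample)
```

Because the budget is set by `n_iter` rather than by the size of the search space, adding another hyperparameter does not by itself increase the number of models trained.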


*RandomizedSearchCV*

```python
# Necessary imports
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Generate a synthetic dataset for illustration
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

# Lists are sampled uniformly; randint(1, 9) draws integers from 1 to 8
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"]
}

# Instantiating the classifier and the RandomizedSearchCV object (n_iter defaults to 10)
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

# Print the tuned parameters and score
print(f"Tuned Decision Tree Parameters: {tree_cv.best_params_}")
print(f"Best score is {tree_cv.best_score_}")
```

```
Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 1}
Best score is 0.825
```


**Drawback:**

Because only a random sample of combinations is tried, the result may not be the best possible hyperparameter combination.



## Challenges in Hyperparameter Tuning

- **Dealing with High-Dimensional Hyperparameter Spaces:** Efficient Exploration and Optimization
  
- **Handling Expensive Function Evaluations:** Balancing Computational Efficiency and Accuracy
  
- **Incorporating Domain Knowledge:** Utilizing Prior Information for Informed Tuning

- **Developing Adaptive Hyperparameter Tuning Methods:** Adjusting Parameters During Training

## Applications of Hyperparameter Tuning

- **Model Selection:** Choosing the Right Model Architecture for the Task

- **Regularization Parameter Tuning:** Controlling Model Complexity for Optimal Performance

- **Feature Preprocessing Optimization:** Enhancing Data Quality and Model Performance

- **Algorithmic Parameter Tuning:** Adjusting Algorithm-Specific Parameters for Optimal Results
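
Preprocessing settings and model settings can be tuned together by wrapping them in a scikit-learn `Pipeline` and addressing each step's hyperparameters as `step__parameter`. The sketch below is one possible setup; the step names (`scale`, `pca`, `clf`), the grid values, and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used purely for illustration
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune a preprocessing hyperparameter and a regularization hyperparameter together
param_grid = {
    "pca__n_components": [5, 10, 15],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best score: {search.best_score_:.3f}")
```

Putting the preprocessing inside the pipeline also means it is refit on each training fold, so the validation folds do not leak into the scaling or the PCA projection.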

## Advantages & Disadvantages of Hyperparameter Tuning

| **Advantages** | **Disadvantages** |
| -------------------------------------- | ---------------------------------------- |
| Improved model performance             | High computational cost                    |
| Reduced overfitting and underfitting   | Time-consuming process                     |
| Enhanced model generalizability        | Risk of overfitting to the validation data |
| Optimized resource utilization         | No guarantee of optimal performance        |
| Improved model interpretability        | Requires expertise                         |
