### Optimization:
Optimization is the problem of efficiently finding the parameters that minimize the loss function; an algorithm that does this is called an optimizer. This section covers two topics:
1. Parameter Update
2. Hyperparameter Optimization

#### 1. Parameter Updates:
Parameter update, also known as weight update, is a crucial step in training a machine learning model. During the training process, the model's parameters (weights and biases) are adjusted or updated based on the information provided by the training data and the optimization algorithm being used. The goal is to find the optimal values of the parameters that minimize the loss or error of the model and enable it to make accurate predictions on new, unseen data.

1. First-order (SGD), momentum, Nesterov momentum
2. Annealing the learning rate
3. Second-order methods
4. Per-parameter adaptive learning rates (Adagrad, RMSProp)

#### 1.1 First Order Optimization
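
As a rough illustration of the first-order updates listed above (SGD, momentum, Nesterov momentum), the sketch below applies each rule to a toy quadratic loss. The loss, the function names, the learning rate, and the momentum coefficient `mu` are all assumptions chosen for demonstration, not values taken from this text.

```python
import numpy as np

def loss_grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * ||theta||^2 (assumed for illustration)
    return theta

def sgd_step(theta, lr):
    # Vanilla gradient descent: step against the gradient
    return theta - lr * loss_grad(theta)

def momentum_step(theta, velocity, lr, mu=0.9):
    # Classical momentum: accumulate a velocity, then move along it
    velocity = mu * velocity - lr * loss_grad(theta)
    return theta + velocity, velocity

def nesterov_step(theta, velocity, lr, mu=0.9):
    # Nesterov momentum: evaluate the gradient at the look-ahead point
    lookahead = theta + mu * velocity
    velocity = mu * velocity - lr * loss_grad(lookahead)
    return theta + velocity, velocity

theta_sgd = np.array([1.0, -2.0])
theta_mom, v_mom = theta_sgd.copy(), np.zeros(2)
theta_nag, v_nag = theta_sgd.copy(), np.zeros(2)
for _ in range(200):
    theta_sgd = sgd_step(theta_sgd, lr=0.1)
    theta_mom, v_mom = momentum_step(theta_mom, v_mom, lr=0.1)
    theta_nag, v_nag = nesterov_step(theta_nag, v_nag, lr=0.1)

print("SGD:", theta_sgd)
print("Momentum:", theta_mom)
print("Nesterov:", theta_nag)
```

All three rules use only the gradient; they differ in how past gradients influence the current step.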

#### 1.2 Second Order Optimization
Second-order optimization methods, such as Newton's method and variants like the Gauss-Newton method or the Levenberg-Marquardt algorithm, use the second-order derivatives of the loss function with respect to the model parameters to update the parameters. These methods provide more information about the curvature of the loss function compared to first-order methods, which only use the gradient information.

    Newton's method requires inverting the Hessian matrix, which is computationally expensive and often infeasible for large-scale problems.

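1. Newton's method (also called the Newton-Raphson method), where `H` is the Hessian of the loss `L`:
`θ_new = θ_old - H^(-1) * ∇L(θ_old)`
2. The Gauss-Newton method, for least-squares losses with residual vector `r` and Jacobian `J` of `r` with respect to `θ`, approximates the Hessian by `J^T J`:
`θ_new = θ_old - (J^T J)^(-1) * J^T r(θ_old)`
3. The Levenberg-Marquardt algorithm adds a damping term `μ > 0` to the Gauss-Newton step:
`θ_new = θ_old - (J^T J + μI)^(-1) * J^T r(θ_old)`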
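
The Newton update can be sketched on a small quadratic loss, where a single step reaches the minimum. The matrix `A`, the vector `b`, and the helper names are assumptions made for this example; the sketch solves a linear system rather than forming the inverse Hessian explicitly.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta (assumed for illustration)
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite, so it is also the Hessian
b = np.array([1.0, -1.0])

def gradient(theta):
    return A @ theta - b

def hessian(theta):
    # Constant for a quadratic loss
    return A

def newton_step(theta):
    # Solve H * delta = grad instead of computing H^(-1) explicitly
    delta = np.linalg.solve(hessian(theta), gradient(theta))
    return theta - delta

theta_old = np.array([5.0, 5.0])
theta_new = newton_step(theta_old)
print("theta_new:", theta_new)
print("Gradient norm after one step:", np.linalg.norm(gradient(theta_new)))
```

For non-quadratic losses the step has to be repeated, and each iteration costs a Hessian evaluation plus a linear solve.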


**In practice,** it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.

#### 2. Hyperparameter optimization:
Hyperparameter optimization, also known as hyperparameter tuning, is the process of selecting the best values for the hyperparameters of a machine learning algorithm. Hyperparameters are adjustable settings that are not learned from the data but are set before the learning process begins. They control the behavior and performance of the machine learning model.

    The goal of hyperparameter optimization is to find the combination of hyperparameter values that yields the best performance for a given learning algorithm and dataset.

The most common hyperparameters in the context of neural networks include:
1. **The initial learning rate:** If the learning rate (LR) is too small, overfitting can occur. Large learning rates help to regularize training, but if the learning rate is too large, training will diverge.
2. **Learning rate decay schedule (such as the decay constant).**
3. **Momentum:** Test short runs with momentum values of 0.99, 0.97, 0.95, and 0.9 to find the best value.
4. **Regularization strength (L2 penalty, dropout strength):** Weight decay is one form of regularization, and it plays an important role in training, so its value needs to be set properly [7]. Weight decay is defined as multiplying each weight by a factor $\lambda$ ($0 < \lambda < 1$) at each gradient descent update.
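
To make the learning rate decay schedule concrete, here is a minimal sketch of two common schedules, step decay and exponential decay. The function names and the default values for `drop`, `epochs_per_drop`, and `decay_constant` are hypothetical and would themselves need tuning.

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the learning rate by `drop` every `epochs_per_drop` epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(initial_lr, epoch, decay_constant=0.05):
    # lr = lr0 * exp(-k * t), where k is the decay constant
    return initial_lr * math.exp(-decay_constant * epoch)

for epoch in (0, 10, 20):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
```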

The common algorithms used to choose the best hyperparameters are the following:
1. Random Search
2. Local Search
3. Grid Search
4. Bayesian Optimization
5. Evolutionary Algorithms
6. Gradient-Based Optimization


#### 4.1 Random Search
Random Search is a simple optimization algorithm that involves randomly sampling points from the search space to find the best solution. It does not utilize any gradient information or prior knowledge about the objective function, making it a straightforward and easy-to-implement approach for optimization tasks.

1. Advantages:
    1. Random Search is easy to implement.
2. Limitations:
    1. Convergence can be slow.
    2. A large number of iterations may be required to find good hyperparameters.

Random Search can serve as a baseline or starting point for optimization tasks, especially in situations where the search space is not well understood or there is no prior information available.

Algorithm:

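1. Define a search space.
2. Define the number of iterations.
3. Sample random hyperparameters from the search space.
4. Evaluate the score of the sampled hyperparameters.
5. Update the best hyperparameters if the score has improved:
    `if score > best_score: best_score, best_params = score, params`
6. Repeat steps 3 to 5 for the chosen number of iterations.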

In [1]:
import numpy as np

# Define the search space
search_space = {
    'learning_rate': np.linspace(0.01, 0.1, 10),
    'batch_size': [16, 32, 64, 128],
    'num_hidden_units': [32, 64, 128, 256],
    'dropout_rate': np.linspace(0.0, 0.5, 6)
}

num_iterations = 100

def evaluate_model(params):
    # Here, we can train and test a model or perform any other evaluation task
    score = np.random.random()  # Random score for demonstration purposes
    return score

# Perform random search
best_params = None
best_score = float('-inf')  # Negative infinity, so any real score counts as an improvement

for _ in range(num_iterations):
    # Sample random hyperparameters
    params = {param: np.random.choice(values) for param, values in search_space.items()}

    # Evaluate the performance using the current hyperparameters
    score = evaluate_model(params)

    # Update the best hyperparameters if the score is improved
    if score > best_score:
        best_params = params
        best_score = score

# Print the best hyperparameters and the corresponding score
print("Best hyperparameters:", best_params)
print("Best score:", best_score)

Best hyperparameters: {'learning_rate': 0.09000000000000001, 'batch_size': 16, 'num_hidden_units': 128, 'dropout_rate': 0.2}
Best score: 0.9893745487102236


#### 4.2 Random Local Search
Random Local Search is an optimization technique that aims to find the optimal solution within a search space by iteratively exploring the neighborhood of the current solution.
It starts with an initial solution and iteratively explores the search space by randomly generating new solutions in the vicinity of the current solution. The new solution is accepted if it improves the objective function, and the process continues until a stopping criterion is met.

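Advantages:
1. Needs no gradient information, so it can be applied when the objective function is discontinuous or non-differentiable.
2. With a sufficiently large neighborhood, a random step can land in a better region and leave a local optimum; with a small neighborhood, the search can stall at one, because only improving moves are accepted.

Limitations:
1. Highly dependent on the initial solution.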
In [2]:
import random

# Random local search in three steps:
# 1. Initialize the best solution and its value.
# 2. Generate a random neighbor.
# 3. Update the best solution if the neighbor is better.

def objective_function(x):
    return -(x**2) + 4

def random_neighbor(solution, search_range):
    # Generate a random neighbor within the search range
    neighbor = solution + random.uniform(-search_range, search_range)
    return neighbor

def random_local_search(search_range, max_iterations):
    # step-1
    best_solution = random.uniform(-search_range, search_range)
    best_value = objective_function(best_solution)

    # Perform random local search
    iterations = 0
    while iterations < max_iterations:
        # step-2
        neighbor = random_neighbor(best_solution, search_range)
        neighbor_value = objective_function(neighbor)

        # step-3
        if neighbor_value > best_value:
            best_solution = neighbor
            best_value = neighbor_value

        iterations += 1

    return best_solution, best_value

# Set the search range and maximum number of iterations
search_range = 10
max_iterations = 100

# Run the random local search algorithm
best_solution, best_value = random_local_search(search_range, max_iterations)

# Print the result
print("Best Solution:", best_solution)
print("Best Value:", best_value)

Best Solution: -0.09954309904125758
Best Value: 3.9900911714332623


#### 4.3 Grid Search Algorithm:
Grid search systematically evaluates every combination of hyperparameter values drawn from a predefined set of candidate values for each hyperparameter. It covers the whole grid, leaving no combination unchecked, but values between grid points are never tried.

1. Grid search can be computationally expensive, especially with many hyperparameters or many candidate values per hyperparameter. The number of combinations is the product of the number of values per hyperparameter, so it grows exponentially with the number of hyperparameters.
2. Grid search is well-suited for cases where there is prior knowledge about the ranges and values of hyperparameters that are likely to perform well. It is useful when the impact of each hyperparameter value can be predicted.
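
A minimal sketch of grid search is shown below. The grid values and the `evaluate_model` scoring function are placeholders made up for illustration; in practice the score would come from training and validating a model.

```python
import itertools

# Hypothetical grid; the names and values are placeholders
grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'batch_size': [16, 32, 64],
    'dropout_rate': [0.0, 0.25, 0.5],
}

def evaluate_model(params):
    # Stand-in for training and validating a model: a made-up score
    # that peaks at learning_rate=0.05 and dropout_rate=0.25
    return -abs(params['learning_rate'] - 0.05) - abs(params['dropout_rate'] - 0.25)

names = list(grid)
best_params, best_score = None, float('-inf')
num_evaluated = 0

# itertools.product enumerates every combination of the candidate values
for values in itertools.product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    score = evaluate_model(params)
    num_evaluated += 1
    if score > best_score:
        best_params, best_score = params, score

print("Combinations evaluated:", num_evaluated)  # 3 * 3 * 3 = 27
print("Best hyperparameters:", best_params)
```

Adding a fourth hyperparameter with three candidate values would triple the number of evaluations to 81.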

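**Note:** When some hyperparameters matter much more than others, random search is usually preferable to grid search: for the same number of trials it tries more distinct values of each hyperparameter, so it can locate good values of the important ones more precisely.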

#### 4.4 Bayesian Optimization:
Bayesian optimization is an iterative process that builds a probabilistic model (a surrogate) of the objective function from the hyperparameter settings evaluated so far. The model's predictions and uncertainty guide the search: each new set of hyperparameters is chosen to balance exploration of uncertain regions against exploitation of regions that already look promising, and the model is updated with every observed objective value.

By intelligently selecting the next set of hyperparameters to evaluate, Bayesian optimization can converge to good solutions with fewer evaluations compared to random or grid search, making it an effective technique for hyperparameter optimization in machine learning.
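
One possible sketch of this loop uses a Gaussian process as the probabilistic model and an upper confidence bound (UCB) as the rule for picking the next point. Both choices, along with the toy objective, the kernel length scale, and the exploration weight of 2.0, are assumptions for illustration; dedicated libraries implement this far more carefully.

```python
import numpy as np

def objective(x):
    # Toy 1-D objective to maximize (assumed for illustration); its maximum is at x = 2
    return -(x - 2.0) ** 2

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential covariance between two sets of 1-D points
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    # Posterior mean and standard deviation of a zero-mean Gaussian process
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

rng = np.random.default_rng(0)
candidates = np.linspace(-5.0, 5.0, 201)   # discretized search space
x_obs = rng.uniform(-5.0, 5.0, size=3)     # a few random initial evaluations
y_obs = objective(x_obs)
initial_best = y_obs.max()

for _ in range(10):
    # Standardize observations so the zero-mean, unit-variance prior is reasonable
    y_scaled = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-9)
    mean, std = gp_posterior(x_obs, y_scaled, candidates)
    # UCB: high predicted value (exploitation) plus high uncertainty (exploration)
    ucb = mean + 2.0 * std
    x_next = candidates[np.argmax(ucb)]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
print("Best x found:", best_x, "with objective value:", y_obs.max())
```

Each iteration spends one objective evaluation at the point the model considers most worthwhile, which is where the savings over random or grid search come from.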

#### 4.5 Evolutionary Algorithms:
Evolutionary algorithms mimic natural selection to iteratively search for optimal hyperparameter values. These algorithms maintain a population of candidate solutions and evolve them over generations using genetic operators such as mutation, crossover, and selection.
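
The population loop described above can be sketched as follows. The two hyperparameters, their bounds, the toy `fitness` function, and the population settings are hypothetical; a real fitness function would train and validate a model for each individual.

```python
import random

random.seed(0)

def fitness(individual):
    # Toy fitness to maximize (assumed): best at learning_rate=0.05, dropout_rate=0.25
    lr, dropout = individual
    return -((lr - 0.05) ** 2) - (dropout - 0.25) ** 2

def random_individual():
    return (random.uniform(0.001, 0.1), random.uniform(0.0, 0.5))

def crossover(parent_a, parent_b):
    # Uniform crossover: each gene is copied from one parent at random
    return tuple(random.choice(pair) for pair in zip(parent_a, parent_b))

def mutate(individual, scale=0.01):
    # Gaussian mutation, clipped to the search bounds
    lr, dropout = individual
    lr = min(max(lr + random.gauss(0.0, scale), 0.001), 0.1)
    dropout = min(max(dropout + random.gauss(0.0, scale), 0.0), 0.5)
    return (lr, dropout)

population = [random_individual() for _ in range(20)]
initial_best = max(fitness(ind) for ind in population)

for generation in range(30):
    # Selection: keep the fittest half as parents (so the best individual is never lost)
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Variation: refill the population with mutated offspring of random parent pairs
    children = [mutate(crossover(*random.sample(parents, 2))) for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
print("Best individual (learning_rate, dropout_rate):", best)
```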

#### 4.6 Gradient-Based Optimization:
Some algorithms allow for gradient-based optimization of hyperparameters. This involves computing the gradients of a performance metric with respect to the hyperparameters and using optimization techniques such as gradient descent to update the hyperparameter values iteratively.
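
As a small worked case, the sketch below tunes the regularization strength of ridge regression by gradient descent on the validation loss. Ridge regression is chosen only because its weights have a closed form, which makes the gradient with respect to the hyperparameter easy to write down; the synthetic data, the step size, and the function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data (assumed for illustration)
X_train, X_val = rng.normal(size=(30, 10)), rng.normal(size=(30, 10))
true_w = rng.normal(size=10)
y_train = X_train @ true_w + rng.normal(scale=1.0, size=30)
y_val = X_val @ true_w + rng.normal(scale=1.0, size=30)

def fit_ridge(lam):
    # Inner problem: ridge regression weights for regularization strength lam
    A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
    return A, np.linalg.solve(A, X_train.T @ y_train)

def val_loss_and_grad(log_lam):
    lam = np.exp(log_lam)
    A, w = fit_ridge(lam)
    residual = X_val @ w - y_val
    loss = 0.5 * np.mean(residual ** 2)
    # Chain rule: dL/dlam = (dL/dw) . (dw/dlam), with dw/dlam = -A^(-1) w
    dloss_dw = X_val.T @ residual / len(y_val)
    dw_dlam = -np.linalg.solve(A, w)
    # d/d(log lam) = lam * d/dlam; optimizing log lam keeps lam positive
    return loss, lam * (dloss_dw @ dw_dlam)

log_lam = 0.0
initial_loss, _ = val_loss_and_grad(log_lam)
for _ in range(100):
    loss, grad = val_loss_and_grad(log_lam)
    log_lam -= 0.5 * grad  # gradient descent on the hyperparameter

final_loss, _ = val_loss_and_grad(log_lam)
print("lambda:", np.exp(log_lam))
print("Validation loss:", initial_loss, "->", final_loss)
```

For models without a closed-form solution, the same gradient has to be obtained by differentiating through the training procedure, which is considerably more expensive.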