## Optimization
- Optimization means finding the input values that minimize or maximize a mathematical expression. In machine learning, this is what makes a model's predictions as accurate as possible.

- The parameters of the model are adjusted during the training phase to minimize the loss function.

    - Example: Building bridges involves balancing weight capacity, cost, safety, and material use. This equilibrium is referred to as optimization. 

        - Aim for maximum load capacity given the design and usage
        - Minimize material usage without compromising integrity
        - Prioritize cost-effectiveness
        - Uphold stringent safety standards

## Optimization Algorithms
- To minimize the loss function, optimization algorithms iteratively change the model's parameters during training. In deep learning, optimization algorithms optimize the cost function J. 

## Importance of Optimization Algorithms
- Scalability
- Efficiency
- Problem solving
- Performance improvements
- Decision making
- Resource allocation

## Optimizer
- Optimizers are the algorithms or methods that reduce the loss by changing neural network attributes such as weights and learning rates.
- Optimizers establish a connection between the loss function and model parameters, modifying the model based on the loss function's output.
- They adjust the neural network weights, molding and refining the model to achieve the highest possible accuracy.

    - Example of Optimizer
        - The loss function serves as a roadmap for the optimizer, indicating whether it has taken the correct or wrong path.
        - Consider hikers attempting to descend a mountain while blindfolded.
        - The hikers can't see which way to travel but can tell whether they are going down (making progress) or up (losing progress).
        - If the hikers keep heading downhill, they will eventually reach the bottom.


## Optimizers and their Types
- Gradient Descent
- Stochastic Gradient Descent
- Momentum
- AdaDelta
- RMSprop
- Adam
- AdaGrad

## Gradient Descent
- Gradient descent is a typical optimization algorithm and the foundation for training deep learning models. Gradient descent iteratively adjusts the parameters in the direction that minimizes the loss function, gradually improving the model's fit to the data. 

- This technique involves calculating how much the model's error changes when its parameters change (the gradient).
- It then modifies those parameters in the direction that decreases the error, with the step size set by the learning rate.
- The optimal weights for a model are not known before training but can be found iteratively, using the loss function as feedback.

- Goal: The goal of gradient descent is to reduce the difference between what the model predicts and the actual outcomes. To achieve this, it uses learning rate and direction. 

- Gradient descent uses direction to gradually arrive at the local or global minimum (the point of convergence)

- When training on mini-batches, the dataset is shuffled at each epoch to avoid biasing the descent and to improve the randomness of the mini-batch selection process.
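A minimal sketch of the update rule described in this section, assuming a hypothetical one-parameter loss `(w - 3)^2`. The function names, starting point, and learning rate are illustrative choices, not part of the notes.

```python
def loss(w):
    # Hypothetical loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss with respect to w
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate
        w -= learning_rate * grad(w)
    return w

w_final = gradient_descent(w0=0.0)
print(w_final, loss(w_final))
```

The learning rate sets the step size and the sign of the gradient sets the direction, which are the two ingredients named in the goal above.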

## Stochastic Gradient Descent 
- Stochastic gradient descent (SGD) updates parameters by evaluating the loss and gradient on mini-batches of data. It enables efficient iterative optimization in deep learning. 

- SGD, a variant of gradient descent algorithm, accelerates model learning in deep learning. 

- The term stochastic alludes to the random nature of the algorithm. In its strictest form, SGD randomly picks one sample from the dataset to approximate the loss and the gradient, then updates the parameters.
- Convergence in SGD can be improved by systematically decreasing the learning rate over time, allowing the algorithm to settle closer to the global minimum.

    - The global minimum is the point in the parameter space where the cost function attains its lowest possible value. 
    - At this point, the model parameters (weights and biases) are optimized so that the cost function is minimized globally across the entire parameter space.

## Key features of SGD
- Optimization uses mini-batches of data to find optimized weights.
- SGD shuffles the data before forming mini-batches, which improves generalization.
- SGD aims to iteratively update the weights to find the optimal solution, considering factors like mini-batch randomness and noise in the data.

## Stochastic Gradient Descent mini batch (SGD mini batch)
- It is a combination of vanilla GD and SGD that processes the entire training data in mini-batches.
- It divides the training data into small batches so that the network can be trained on the data more easily.
- The mathematical formulation is the same as vanilla GD, but the training occurs batch-wise.
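As a rough sketch of batch-wise training, the following fits a line to a small, noiseless, made-up dataset with mini-batch SGD. The data, batch size, learning rate, and function name are all assumptions chosen for illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=300, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle the data at the start of every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the mean squared error, estimated on this batch only
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [i / 10 for i in range(20)]      # inputs 0.0 .. 1.9
ys = [2.0 * x + 1.0 for x in xs]      # targets from the line y = 2x + 1
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

The update formula is the same as in plain gradient descent; only the set of samples used to estimate the gradient changes from step to step.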

## Gradient Descent vs SGD mini batch
- GD is computationally expensive because every update uses the full dataset, but it converges toward the minimum smoothly. In contrast, mini-batch SGD produces noisier weight updates and so can take more updates to settle at the minimum, although each update is much cheaper.

## Momentum 
- Momentum improves convergence and stability. It helps overcome local optima by introducing inertia and a consistent direction during weight updates.
- During model training, the parameters are initialized randomly and updated iteratively, progressively getting closer to the optimal value of the function.
- The momentum algorithm combines the assigned learning rate with a running accumulation of past gradients to accelerate convergence and reduce divergence.

    - Since the learning rate cannot be large, the process slows down. If the algorithm encounters a plateau, it may be deceived into thinking it has reached the minimum. The concept of momentum is used to resolve these issues and correct the learning rate effect.

- Although it can easily handle smaller datasets, momentum is typically used with the vast, noisy datasets of neural network training.
- One disadvantage of using momentum is that it adds further complexity to the algorithm, including an extra hyperparameter to tune.
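The inertia idea can be sketched with the classical momentum update on a hypothetical one-parameter loss `(w - 3)^2`. The momentum coefficient `beta`, the learning rate, and the function names are illustrative assumptions.

```python
def grad(w):
    # Gradient of the hypothetical loss (w - 3)^2
    return 2.0 * (w - 3.0)

def momentum_descent(w0, learning_rate=0.1, beta=0.9, steps=300):
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity keeps a decaying memory of past gradients (the inertia)
        velocity = beta * velocity - learning_rate * grad(w)
        w += velocity
    return w

w_momentum = momentum_descent(w0=0.0)
print(w_momentum)
```

Because the velocity carries over between steps, the parameter keeps moving across flat regions where the current gradient alone would produce only a tiny step.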

## Nesterov Accelerated Gradient (NAG)
- NAG combines momentum-based updates with lookahead adjustments for faster convergence compared to traditional SGD with or without momentum.
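A minimal sketch of the lookahead step, assuming the same kind of hypothetical one-parameter loss `(w - 3)^2`. The only change from plain momentum is where the gradient is evaluated; all names and constants are illustrative.

```python
def grad(w):
    # Gradient of the hypothetical loss (w - 3)^2
    return 2.0 * (w - 3.0)

def nesterov_descent(w0, learning_rate=0.1, beta=0.9, steps=300):
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Evaluate the gradient at the position the momentum is about to reach
        lookahead = w + beta * velocity
        velocity = beta * velocity - learning_rate * grad(lookahead)
        w += velocity
    return w

w_nag = nesterov_descent(w0=0.0)
print(w_nag)
```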

## AdaGrad
- Adaptive gradient (AdaGrad) is an optimization algorithm that adjusts the learning rate for each parameter based on historical gradients, improving convergence and performance.
- Accumulates squared gradients over the course of training
- Scales the learning rate individually for each parameter based on its historical gradients
- Is effective for handling sparse data with varying parameter importance
- Adaptive gradient optimization iteratively updates a different learning rate for each parameter without manual tuning.
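The per-parameter scaling can be sketched as follows, assuming a hypothetical two-parameter loss `x^2 + 10*y^2` whose gradients differ in scale by a factor of ten. The function names, learning rate, and step count are illustrative.

```python
import math

def adagrad(grad_fn, params, learning_rate=0.5, steps=500, eps=1e-8):
    params = list(params)
    accumulated = [0.0] * len(params)  # running sum of squared gradients, per parameter
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            accumulated[i] += g * g
            # Each parameter gets its own effective learning rate
            params[i] -= learning_rate * g / (math.sqrt(accumulated[i]) + eps)
    return params

def grad_fn(p):
    # Gradient of the hypothetical loss x^2 + 10*y^2
    return [2.0 * p[0], 20.0 * p[1]]

result = adagrad(grad_fn, [5.0, 5.0])
print(result)
```

Dividing by the accumulated gradient magnitude means the steeper coordinate does not take proportionally larger steps, which is the individual scaling described above.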

## RMSProp
- The root mean square propagation (RMSProp) optimizer is an adaptive-learning-rate version of gradient descent that divides each update by a moving average of recent squared gradients.
- It limits the oscillations in the vertical plane.
- This allows a larger learning rate, letting the algorithm take greater horizontal steps and converge faster.
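A rough sketch of the RMSProp update, assuming a hypothetical loss `x^2 + 10*y^2`. The decay factor, learning rate, and names are illustrative. With a constant learning rate the parameters end up hovering close to the minimum rather than landing on it exactly.

```python
import math

def rmsprop(grad_fn, params, learning_rate=0.01, decay=0.9, steps=500, eps=1e-8):
    params = list(params)
    sq_avg = [0.0] * len(params)  # moving average of squared gradients
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            sq_avg[i] = decay * sq_avg[i] + (1.0 - decay) * g * g
            # Large recent gradients shrink the step; small ones enlarge it
            params[i] -= learning_rate * g / (math.sqrt(sq_avg[i]) + eps)
    return params

def grad_fn(p):
    # Gradient of the hypothetical loss x^2 + 10*y^2
    return [2.0 * p[0], 20.0 * p[1]]

result = rmsprop(grad_fn, [1.0, 1.0])
print(result)
```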

## Adadelta
- Adadelta is an optimization algorithm used to train neural networks. It builds on AdaGrad and RMSProp by altering the custom step size calculation, which removes the need for an initial learning rate hyperparameter.

- To achieve optimal performance, choose the algorithm that aligns with the behavior of the data variables in the specific application

    - Understand the data:
        - Analyze the data variables' behavior and characteristics
        - Identify challenges such as sparsity or noisy gradients

    - Choose the appropriate algorithm
        - Recognize the strengths and weaknesses of various algorithms
        - Use adaptive learning rate methods without manual tuning

    - Align with application
        - Assess the alignment of Adadelta's features with data behavior
        - Ensure the compatibility of the algorithm with the specific task

    - Optimize performance
        - Fine-tune hyperparameters, including the decay factor
        - Evaluate and adjust continuously to align with the data variables.

## Adam optimizer
- The Adam optimizer is a popular algorithm in deep learning that optimizes stochastic objectives using adaptive estimates of the gradient moments. It efficiently handles sparse gradients and noisy problems with low resource requirements.
- It handles large-scale problems with extensive data and numerous parameters.
- It doesn't require excessive memory, but its usefulness mainly depends on the model architecture.
- Adam combines momentum and RMSProp concepts for effective parameter space navigation and fast convergence.
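The combination of a momentum-style average and an RMSProp-style average can be sketched as below, assuming a hypothetical one-parameter loss `(w - 3)^2`. The defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly quoted ones; the learning rate and step count are illustrative.

```python
import math

def grad(w):
    # Gradient of the hypothetical loss (w - 3)^2
    return 2.0 * (w - 3.0)

def adam(w0, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # momentum-style average of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # RMSProp-style average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_adam = adam(w0=0.0)
print(w_adam)
```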


## Batch Normalization
- During data preprocessing, the data is generally normalized or standardized.

- Normalization: A typical normalization involves scaling down a large range of data into a smaller range.

- Standardization: A typical standardization is to subtract the mean of all the data points from each data point and then divide the difference by the standard deviation

- Data points can be either very high or very low. This leads to cascading effects in the network, which is why data preprocessing is needed. A cascading effect in neural networks is a phenomenon where errors or variations in the input data propagate through the layers of the network, potentially amplifying and affecting the final output.

- When there are multiple features, each with a different range of data points, the unprocessed data creates instability that cascades through the neural network layers. Scaling the different ranges to a standard range leads to stability and better results.
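A small sketch of the two preprocessing steps defined above, using a made-up list of values. Min-max scaling to the range 0 to 1 is assumed here as the "typical normalization"; the function names are illustrative.

```python
import statistics

def min_max_normalize(values):
    # Scale a large range of data into the smaller range [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Subtract the mean, then divide by the (population) standard deviation
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [10.0, 20.0, 30.0, 40.0, 50.0]
normalized = min_max_normalize(data)
standardized = standardize(data)
print(normalized)
print(standardized)
```

After standardization the values have mean 0 and standard deviation 1, so features that started on very different scales become comparable.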

## Batch Normalization implementation techniques
- Normalizing the data before feeding it into the network is not enough: the outputs from the neurons should also be normalized. The process involves:
1. Normalize the output x from the activation function to get z
2. Multiply the normalized output z by the learnable parameter g
3. Add a learnable parameter b to the resulting product (z*g)
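The three steps can be sketched for a single unit over one mini-batch, using the same names `g` and `b` as the list above. The small constant `eps` is an assumption added to avoid division by zero, and the input values are made up.

```python
import math

def batch_norm(batch, g=1.0, b=0.0, eps=1e-5):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # Step 1: normalize each output x in the batch to get z
    z = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Steps 2 and 3: scale by g, then shift by b
    return [g * zi + b for zi in z]

activations = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(activations, g=2.0, b=0.5)
print(out)
```

In a real network `g` and `b` would be updated by the optimizer along with the weights; here they are fixed numbers for illustration.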

## Regularization
- Regularization is a technique that makes slight modifications to the learning algorithm so that the model generalizes to unseen data more effectively. Regularization helps reduce errors by fitting a function appropriately to the training set and avoiding overfitting.

- In machine learning, regularization is frequently employed as a solution to the overfitting problem, which occurs when the model becomes complex enough to fit the noise in the training data, or when the training data is small and the model fails to develop a generalizable mapping.

## Types of Regularization
- Modifying the loss function
- Modifying the data sampling
- Changing training approach

## Modifying the loss function:
- There are two components to modifying the loss function in the context of regularization strategies:
    - The loss function is adjusted to directly consider the norm of the learned parameters or the output distribution to improve the model.
    - Regularization itself involves modifying the loss function to penalize large weight values.

## Loss function Strategies:
- `L2 Regularization`: Adds a penalty proportional to the sum of the squared weights, which shrinks the weights' average magnitude and lowers the risk of overfitting

- `L1 Regularization`: Adds a penalty proportional to the sum of the absolute weights, which promotes weight sparsity by setting more weights to zero instead of reducing their average magnitude

- `Entropy`: Measures uncertainty in a probability distribution, where higher uncertainty corresponds to greater entropy
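A minimal sketch of the three quantities above as plain functions. The regularization strength `lam`, the example weights, and the function names are illustrative assumptions. In training, a penalty would be added to the data loss before computing gradients.

```python
import math

def l2_penalty(weights, lam):
    # Sum of squared weights: discourages large weights overall
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    # Sum of absolute weights: pushes individual weights toward exactly zero
    return lam * sum(abs(w) for w in weights)

def entropy(probs):
    # Shannon entropy in nats; terms with p = 0 contribute nothing
    return -sum(p * math.log(p) for p in probs if p > 0)

weights = [0.5, -1.0, 2.0]
print(l2_penalty(weights, lam=0.1))
print(l1_penalty(weights, lam=0.1))
print(entropy([0.5, 0.5]), entropy([1.0, 0.0]))
```

A uniform distribution gives the highest entropy for a given number of outcomes, while a distribution that puts all its mass on one outcome gives zero.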

## Modifying data samples: 

- Data augmentation: create extra data from existing data by randomly cropping, dilating, rotating, and adding slight noise. 

- k-fold cross-validation: Separate the data into k groups, train with (k-1) groups, and test with the remaining group. Repeat for all k possible choices of the test group.
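The k-fold procedure can be sketched as an index splitter. This is a plain illustration without shuffling, and the function name is hypothetical.

```python
def k_fold_splits(n_samples, k):
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits

splits = k_fold_splits(n_samples=10, k=3)
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each sample lands in the test group exactly once, and in the training groups for the other k-1 rounds.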

## Change training approach
- Changing the training approach, in the context of regularization, refers to altering how the training process is conducted to improve the model's performance and generalization ability.

    Algorithm modification:
    - Adding regularization terms to the learning algorithm to prevent overfitting

    Data augmentation: 
    - Increasing dataset size and diversity through modifications of existing data to improve the model's generalization ability
    
    Injecting noise:
    - Adding random noise (for example, to the inputs or weights) during training improves generalization, prevents overfitting, and is widely used in deep learning to enhance model performance on unseen data.

    Dropout
    - It involves randomly setting a fraction of input units to zero at each update during training time which helps prevent overfitting. 

    Early Stopping
    - Early stopping is a regularization technique for deep neural networks that stops training when parameter updates no longer yield improvements on a validation set. By limiting the optimization to a smaller region of parameter space, it functions as a regularization technique.
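Dropout and early stopping from the list above can be sketched as two small functions. The "inverted" rescaling in `dropout`, the `patience` parameter, and the validation losses are assumptions made for illustration.

```python
import random

def dropout(activations, rate, rng):
    # Zero each unit with probability `rate`; rescale survivors to keep the expected value
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def early_stopping(val_losses, patience=2):
    # Return (best_epoch, stop_epoch) for a sequence of validation losses
    best_loss, best_epoch, stop_epoch, waited = float("inf"), -1, -1, 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stop_epoch

rng = random.Random(0)
inputs = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(inputs, rate=0.5, rng=rng)
val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.5]  # made-up validation losses
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
print(dropped, best_epoch, stop_epoch)
```

Note that with `patience=2` training stops before the later improvement at the last epoch is ever seen, which is the trade-off this setting controls.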

## Vanishing Gradient
- The gradient refers to the derivative of the loss with respect to a weight.
- Backpropagation calculates the gradient, which is then used to update the neural network weights.
- When the gradient becomes very small, subtracting it from the weight barely changes the previous weight, and the model stops learning. This problem is called the vanishing gradient.

- With vanishing gradients, the gradients become smaller as the backpropagation method progresses backward from the output layer to the input layer.
- The lower-layer weights remain virtually unchanged, and gradient descent never reaches the optimum.
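A rough numerical illustration of why the gradient shrinks toward the input layer, assuming a chain of sigmoid units with one weight per layer. The sigmoid derivative is at most 0.25, so each layer multiplies the backpropagated signal by a small factor. The weight value and pre-activation of 0 are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

def backprop_gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    # Chain rule: one factor of weight * sigmoid'(x) per layer traversed
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_grad(pre_activation)
    return scale

for depth in (1, 5, 10):
    print(depth, backprop_gradient_scale(depth))
```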

## Prevent Vanishing Gradient
- Residual networks (ResNets) use shortcut connections to effectively address the vanishing gradient problem and enable the training of deep neural networks.

- GPUs play a crucial role in mitigating the vanishing gradient issue through parallel processing, faster training, and more frequent weight updates during backpropagation.

- Choosing the right activation function, like the rectified linear unit (ReLU), helps prevent the vanishing gradient problem by avoiding input compression into small ranges.

- The switch from CPUs to GPUs has significantly improved the feasibility of standard backpropagation, even for low-cost models, due to their faster computation times.

## Exploding Gradient
- The issue of exploding gradients arises when there is a significant accumulation of error gradients. It leads to large weight updates in a neural network during the training process. 
- When the gradients grow larger with the advancement of the backpropagation method, massive weight updates occur, causing the gradient descent to diverge.

## How to fix exploding gradients
- It can be fixed by redesigning the neural network with fewer layers and by using smaller mini-batch sizes.
- Using long short-term memory (LSTM) networks reduces exploding gradients.
- Gradient clipping limits gradient size to effectively fix exploding gradients.
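Gradient clipping can be sketched as rescaling the gradient vector whenever its norm exceeds a threshold. Clipping by the global L2 norm is one common variant, assumed here; the threshold and function name are illustrative.

```python
import math

def clip_by_norm(grads, max_norm):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small gradients pass through unchanged
    scale = max_norm / norm
    # Shrink every component equally so the direction is preserved
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(clipped, unchanged)
```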

## Hyperparameter Tuning
- `Parameters`: These internal variables are found during model training and are adjusted to make predictions based on the input data.
    Example: In k-means clustering, the positions of the centroids are learned during training. These positions are the model's parameters.

- `Hyperparameters`: These are configurations or settings that are determined before training. They govern the learning process but aren't learned from the data.
    Example: The value of k in k-means clustering is decided before creating the model. This value, representing the number of clusters, is a hyperparameter.

## Hyperparameters of Deep Learning Models
- Learning rate: It is often the most important hyperparameter. It controls the step size of each update and helps the model reach an optimized result.

- Number of hidden units: It is a classic hyperparameter that specifies the representational capacity of a model.

- Convolutional kernel width: It determines the size of the filters in a convolutional neural network, which influences the receptive field and the capacity of the model.

- Mini-batch size: It affects the training process, training speed, and number of iterations in a deep learning model.

- Number of epochs: It is partly responsible for the weight optimization in a neural network.


## Automatic Hyperparameter tuning
- The automatic selection approach is preferred over the manual approach, as the latter is very laborious. The automatic approach is the process of tuning the hyperparameters with the help of algorithms. Tuning approaches:

- `Grid search`: Iterating over given hyperparameters using cross-validation
    - The process of grid search for hyperparameter tuning:
        1. Parameter grid construction: Arranges all potential hyperparameter combinations in a grid layout
        2. Matrix conversion: Represents each unique hyperparameter combination in a matrix for systematic processing
        3. Performance evaluation: Trains and assesses models for each distinct set of parameters, typically using a validation set
        4. Best model identification: Chooses the model with the highest performance score (based on metrics like accuracy, precision, and various others) as the grid search's optimal outcome

    - `GridSearchCV`: scikit-learn's implementation extends grid search by adding cross-validation, evaluating all hyperparameter combinations with different data splits.
    - It constructs multiple versions of the machine learning model by systematically testing all possible combinations of the specified hyperparameter values.

- `Random Search`: Commonly used as a hyperparameter tuning method for functions that are non-differentiable or discontinuous, including those with complex, nonlinear behavior.
    - Produces a random value for each hyperparameter at each iteration
    - Considers a random combination of parameters at every iteration, rather than covering every combination
    - Finds the optimized parameters through the performance of the models
    - `RandomizedSearchCV`: scikit-learn's version of random search, with cross-validation added


- `Gradient-based tuning`: Used for algorithms where it is possible to compute the gradient with respect to the hyperparameters, so that the hyperparameters are optimized by gradient descent

- `Evolutionary optimization`: Mimics natural evolution to optimize solutions.

- `Bayesian optimization`: An advanced method for hyperparameter tuning that uses a probabilistic model to predict the performance of different hyperparameter combinations and iteratively selects the most promising ones to evaluate
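Grid search and random search from the list above can be sketched without any library, assuming a hypothetical `toy_score` function that stands in for training a model and measuring its validation score. The hyperparameter names, candidate values, and sampling ranges are illustrative only.

```python
import itertools
import random

def toy_score(params):
    # Hypothetical stand-in for a validation score; higher is better
    return -abs(params["learning_rate"] - 0.1) - abs(params["batch_size"] - 32) / 100

def grid_search(param_grid, score_fn):
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # Evaluate every combination in the grid
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(sample_fn, score_fn, n_iter, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    # Evaluate a fixed budget of randomly sampled combinations
    for _ in range(n_iter):
        params = sample_fn(rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def sample_params(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform between 0.001 and 1
        "batch_size": rng.choice([16, 32, 64]),
    }

grid = {"learning_rate": [0.01, 0.1, 1.0], "batch_size": [16, 32, 64]}
grid_best, grid_score = grid_search(grid, toy_score)
rand_best, rand_score = random_search(sample_params, toy_score, n_iter=20)
print(grid_best, grid_score)
print(rand_best, rand_score)
```

Grid search costs one evaluation per combination, so it grows quickly with the number of hyperparameters, while random search keeps the cost fixed at `n_iter` evaluations. In practice each evaluation would use cross-validation, as `GridSearchCV` and `RandomizedSearchCV` do.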

