# Problems of Gradient Descent:

# Local Minima:

A local minimum is a point on a graph (or its associated function) whose value is less than or equal to that of all other points near it.

# Global Minima:

A global minimum is a point where the function value is smaller than at all other feasible points.

![3.png](attachment:3.png)

# Saddle Point:

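In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function. For a function of one variable, this is a point where the first derivative is zero but does not change sign, as at x = 0 for f(x) = x^3. A zero second derivative alone is not enough: f(x) = x^4 also has zero first and second derivatives at x = 0, yet it has a minimum there. For multivariate functions, the most appropriate way to check whether a critical point is a saddle point is to calculate the Hessian matrix: if the Hessian has both positive and negative eigenvalues at that point, the point is a saddle point.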
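
As a minimal sketch of the Hessian check for a function of two variables, the snippet below applies the standard second-derivative test. The helper name `classify_critical_point` and the example functions f(x, y) = x^2 - y^2 and f(x, y) = x^2 + y^2 are assumptions chosen for illustration; they do not come from the text above.

```python
def classify_critical_point(fxx, fyy, fxy):
    """Second-derivative test at a critical point of a two-variable function.

    fxx, fyy, fxy are the second partial derivatives at the point, i.e. the
    entries of the 2x2 Hessian [[fxx, fxy], [fxy, fyy]].
    """
    det = fxx * fyy - fxy ** 2  # determinant of the Hessian
    if det < 0:
        return "saddle point"  # eigenvalues have opposite signs
    if det > 0:
        return "local minimum" if fxx > 0 else "local maximum"
    return "inconclusive"  # the test gives no answer when det == 0


# Illustrative function (an assumption): f(x, y) = x**2 - y**2.
# Its gradient (2x, -2y) is zero at (0, 0); the Hessian there is [[2, 0], [0, -2]].
saddle_result = classify_critical_point(fxx=2.0, fyy=-2.0, fxy=0.0)

# f(x, y) = x**2 + y**2 has Hessian [[2, 0], [0, 2]] at its critical point (0, 0).
bowl_result = classify_critical_point(fxx=2.0, fyy=2.0, fxy=0.0)

print(saddle_result, "|", bowl_result)
```

A negative determinant of a 2x2 symmetric Hessian means its two eigenvalues have opposite signs, which is exactly the saddle case.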

![1.jpg](attachment:1.jpg)

For convex problems, gradient descent can find the global minimum easily, while for non-convex problems it is sometimes difficult to find the global minimum, where machine learning models achieve the best results. Whenever the slope of the cost function is zero or close to zero, the weight updates become negligible and the model stops learning. Apart from the global minimum, two other kinds of points can show this slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, with the cost function increasing on both sides of the current point.
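
To illustrate how gradient descent can settle in a local minimum, here is a small sketch on a non-convex one-dimensional cost. The cost function x^4 - 3x^2 + x, the learning rate, and the starting points are all assumptions picked for the example.

```python
def cost(x):
    # Hypothetical non-convex cost with two minima (chosen for illustration).
    return x**4 - 3 * x**2 + x


def cost_gradient(x):
    # Derivative of the cost above.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = start
    for _ in range(steps):
        x -= learning_rate * cost_gradient(x)
    return x


x_from_right = gradient_descent(start=2.0)   # ends in the shallower minimum
x_from_left = gradient_descent(start=-2.0)   # ends in the deeper minimum

print(x_from_right, cost(x_from_right))
print(x_from_left, cost(x_from_left))
```

Both runs stop where the slope is close to zero, but only the run started on the left reaches the lower cost (near x = -1.30); the run started on the right stays near x = 1.13, a local minimum.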

![2.png](attachment:2.png)

# Problems Involving Neural Networks:

Let's say we have a neural network with 2 hidden layers, and we use sigmoid as the activation function.
![nn-2.PNG](attachment:nn-2.PNG)

When we start training, forward propagation computes the value of each neuron as the weighted sum of its inputs (each input, such as x1, multiplied by its corresponding weight) and then passes that sum through the activation function. <br>
![derivation.PNG](attachment:derivation.PNG)
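
The forward pass described above can be sketched as follows. This is only an illustration: the weights, the input values, the layer sizes, and the bias terms are assumptions (the text does not mention biases), and the helper names `neuron_output` and `layer_forward` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs, then the activation function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def layer_forward(inputs, weight_rows, biases):
    # One row of weights (and one bias) per neuron in the layer.
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Made-up numbers for a network with two inputs and two hidden layers.
inputs = [0.5, -1.0]
hidden1 = layer_forward(inputs, [[0.1, 0.2], [-0.3, 0.4]], [0.0, 0.1])
hidden2 = layer_forward(hidden1, [[0.5, -0.5], [0.3, 0.8]], [0.0, 0.0])
output = layer_forward(hidden2, [[1.0, -1.0]], [0.0])

print(hidden1, hidden2, output)
```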


![sig.PNG](attachment:sig.PNG)
Now, to reduce the loss we have to adjust the weights, so we perform backpropagation, and to do so we first find the gradient.
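
As a rough sketch of one such update, consider a single sigmoid neuron with one weight and a squared-error loss. The loss choice, the learning rate, and all the numbers are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_gradient(w, x, target):
    """Squared-error loss of one sigmoid neuron and its gradient w.r.t. w."""
    a = sigmoid(w * x)
    loss = 0.5 * (a - target) ** 2
    # Chain rule: dL/da * da/dz * dz/dw, where da/dz = a * (1 - a).
    grad = (a - target) * a * (1.0 - a) * x
    return loss, grad


# Made-up starting point for illustration.
w, x, target, learning_rate = 0.5, 1.0, 1.0, 0.1

old_loss, grad = loss_and_gradient(w, x, target)
new_w = w - learning_rate * grad  # step against the gradient
new_loss, _ = loss_and_gradient(new_w, x, target)

print(old_loss, new_loss)
```

Note that the weight here increases rather than decreases: the update moves the weight in whichever direction lowers the loss. The factor `a * (1 - a)` in the gradient is the sigmoid derivative.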

![gd.PNG](attachment:gd.PNG)
The problem arises while calculating the derivative of sigmoid.

Some activation functions, like the sigmoid function, compress a wide input space into a narrow output range between 0 and 1. As a result, the output of the sigmoid function changes little even when the input changes significantly, so its derivative is small for inputs far from 0.<br>
<img src="attachment:sigmoid-2.PNG" width="500"/>
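
The saturation can be checked numerically. The sketch below uses the standard identity sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)); the sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at z = 0 and shrinks quickly as |z| grows.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, sigmoid(z), sigmoid_derivative(z))
```

The derivative never exceeds 0.25 (its value at z = 0), and for inputs such as 10 or -10 it is already very close to zero.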

## Vanishing Gradient Problem

Reference : https://neptune.ai/blog/vanishing-and-exploding-gradients-debugging-monitoring-fixing

The vanishing gradient problem is an issue that sometimes arises when training machine learning algorithms through gradient descent. This most often occurs in deep neural networks.<br>
*`In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.`* ***`-Wikipedia`***
<br><br>How to identify the vanishing gradient problem (VGP)?<br>
<li>Large changes are observed in the parameters of later layers, whereas the parameters of earlier layers change only slightly or stay unchanged
<li>In some cases, weights of earlier layers can become 0 as training goes on
<li>The model learns slowly, and training often stalls after a few iterations
<li>Model performance is poor

<li>The vanishing gradient problem occurs when the gradient is very small, which causes only slight or negligible updates to the weights.
<li>Sigmoid is generally avoided in the hidden layers of deep neural networks because its derivative is at most 0.25, so the gradient shrinks at every layer during backpropagation.

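Batch normalisation layers can also help to fix the problem. As previously mentioned, the issue develops when a large input space is mapped to a smaller one, which makes the derivatives vanish. Batch normalisation rescales the inputs to each layer so that they are less likely to fall in the saturated regions of the sigmoid.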
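
To make the shrinking gradient concrete, here is a toy sketch that stacks one-neuron sigmoid layers and multiplies the chain-rule factors together. The one-neuron-per-layer setup, the shared weight of 1.0, and the input value are simplifying assumptions, not a model of a real network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_chain(x, weight, depth):
    """Return d(output)/d(x) for `depth` stacked one-neuron sigmoid layers."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        z = weight * activation
        activation = sigmoid(z)
        # Chain-rule factor for this layer: weight * sigmoid'(z).
        grad *= weight * activation * (1.0 - activation)
    return grad


# Gradient reaching the input for networks of increasing depth.
grads_by_depth = {d: gradient_through_chain(0.5, 1.0, d) for d in (1, 2, 5, 10)}

for depth, grad in grads_by_depth.items():
    print(depth, grad)
```

Each layer contributes a factor of at most 0.25 here, so the gradient that reaches the earliest layer shrinks roughly geometrically with depth. This is why the earlier layers barely update.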

## Exploding Gradient Problem

Conversely, if the gradients keep getting larger as backpropagation progresses, possibly overflowing to NaN, we end up with exploding gradients that produce very large weight updates, causing the gradient descent algorithm to diverge.


How to identify the exploding gradient problem (EGP)?
<li>Contrary to the vanishing scenario, exploding gradients show up as unstable, large parameter changes from one batch/iteration to the next
<li>Model weights can become NaN very quickly
<li>Model loss also goes to NaN
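
The signs listed above can be turned into a simple check. The sketch below is only illustrative: it mimics backpropagation through layers with a repeated weight of magnitude greater than 1 (a linear toy model), and the helper `looks_exploding` and its `limit` threshold are arbitrary assumptions.

```python
import math


def backprop_factor(weight, depth):
    """Gradient scale after passing back through `depth` linear layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer multiplies the gradient by its weight
    return grad


def looks_exploding(values, limit=1e6):
    # Flag NaN, infinity, or magnitudes above an (arbitrary) threshold.
    return any(math.isnan(v) or math.isinf(v) or abs(v) > limit for v in values)


shallow = [backprop_factor(1.5, d) for d in (1, 5, 10)]
deep = [backprop_factor(1.5, d) for d in (10, 50, 100)]

print(shallow, looks_exploding(shallow))
print(deep, looks_exploding(deep))
```

In practice the same kind of check would be applied to the gradients or weights logged for each layer during training.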

## Solutions:
### Multi-level hierarchy:
This technique pretrains one layer at a time, and then performs backpropagation for fine tuning.
### Re-Design the Network Model:
In deep neural networks, vanishing/exploding gradients may be addressed by redesigning the network to have fewer layers.<br>There may also be some benefit in using a smaller batch size while training the network.

### Use Weight Regularization:
Another approach, if exploding gradients are still occurring, is to check the size of the network weights and apply a penalty to the network's loss function for large weight values.<br>As the regularization parameter gets bigger, the weights get smaller, which limits their influence and makes the model behave more linearly. <br>
This is called weight regularization.
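
One common form of this penalty is L2 regularization, sketched below. The choice of L2 (rather than, say, L1), the parameter name `lam`, and the sample numbers are assumptions for illustration.

```python
def l2_penalty(weights, lam):
    # Penalty grows with the squared size of the weights.
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    return data_loss + l2_penalty(weights, lam)


def regularized_gradient(data_grads, weights, lam):
    # The penalty adds 2 * lam * w to each weight's gradient,
    # pulling large weights back toward zero.
    return [g + 2.0 * lam * w for g, w in zip(data_grads, weights)]


weights = [3.0, -4.0]
total_loss = regularized_loss(data_loss=1.0, weights=weights, lam=0.1)
grads = regularized_gradient(data_grads=[0.0, 0.0], weights=weights, lam=0.1)

print(total_loss, grads)
```

A larger `lam` makes the penalty term dominate, which drives the weights toward smaller values during training.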

### RELUs:
The simplest solution is to use other activation functions, such as ReLU, whose derivative does not become small for positive inputs. Rectified Linear Units (ReLU) are activation functions that generate a positive linear output when they are applied to positive input values. ReLU activation functions stay linear in regions where sigmoid and tanh are saturated, and so are less prone to vanishing gradients. If the input is negative, the function returns zero.<br>
![relu.PNG](attachment:relu.PNG)
The conventional sigmoidal activation functions used for node output are replaced with a new function when rectified linear units are used: f(x) = max(0, x). Because this activation only saturates in one direction, it is more resistant to vanishing gradients.
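
A minimal sketch of ReLU and its derivative is given below. Treating the derivative at exactly x = 0 as 0 is a convention assumed here; the function is not differentiable at that point.

```python
def relu(x):
    return max(0.0, x)


def relu_derivative(x):
    # 1 for positive inputs, 0 otherwise (value at x == 0 is a convention).
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.0, 2.5, 100.0):
    print(x, relu(x), relu_derivative(x))
```

For positive inputs the derivative stays at 1 no matter how large the input is, so the chain-rule product across layers does not shrink the way it does with sigmoid.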