# **Vanishing Gradient Problem & its Solution**

# **The problem:**

The Vanishing Gradient Problem (VGP) occurs during the training of deep neural networks. During back-propagation, gradients are passed from the output layer back toward the earlier layers, and gradient descent uses them to update the weights. Along the way the gradients can become so small that they almost vanish, so the weights of the earlier layers barely change and the network stops learning.

**Example:**

Consider a very simple example of the principle on which VGP is based. If numbers smaller than 1 are multiplied together many times, the result becomes very small.

*                             0.1 × 0.1 × 0.1 × 0.1 = 0.0001
                 

*                            W<sub>n</sub> = W<sub>old</sub> - η∂L/∂W

In the weight update rule above, η is the learning rate and ∂L/∂W is the gradient of the loss with respect to the weight. When ∂L/∂W is very small, the update barely changes the weights. The loss therefore stops decreasing after a certain number of epochs, which leads to poor model accuracy.
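The arithmetic above can be checked with a few lines of Python. This is only a sketch: the per-layer factor of 0.1, the learning rate and the starting weight are assumed values chosen for illustration, not numbers taken from a trained network.

```python
# Hypothetical setup: every layer scales the back-propagated gradient by 0.1.
per_layer_factor = 0.1

# Gradient scale reaching the first layer for networks of different depth.
scale_by_depth = {depth: per_layer_factor ** depth for depth in (1, 2, 4, 8)}
for depth, scale in scale_by_depth.items():
    print(f"depth={depth}: gradient scale ~ {scale:.0e}")

# One gradient-descent step, W_n = W_old - eta * dL/dW, with the vanished gradient.
eta = 0.001    # learning rate (assumed value)
w_old = 0.5    # hypothetical starting weight
grad = scale_by_depth[8]
w_new = w_old - eta * grad
print(f"weight change after one step: {w_old - w_new:.1e}")
```

With eight such factors the update is far smaller than the weight itself, which is why the loss appears frozen.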

# **Causes of VGP**

There are mainly two causes of the Vanishing Gradient Problem. They are:

* Deep neural networks (many layers, so many derivatives are multiplied together)
* Saturating activation functions (sigmoid and tanh)
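To see why these two causes reinforce each other, the sketch below evaluates the derivatives of sigmoid and tanh with NumPy. It looks only at the activation derivatives and ignores the weight factors that also appear in the chain rule, so the numbers are an illustrative bound rather than the gradient of a real network. The helper names are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-10.0, 10.0, 2001)
max_sigmoid_grad = sigmoid_grad(z).max()  # largest value is reached at z = 0
max_tanh_grad = tanh_grad(z).max()        # largest value is reached at z = 0
print("max sigmoid derivative:", max_sigmoid_grad)
print("max tanh derivative:", max_tanh_grad)

# Best case for sigmoid: every layer contributes its maximum derivative.
depth_bound = {n: max_sigmoid_grad ** n for n in (2, 10)}
for n, bound in depth_bound.items():
    print(f"{n} sigmoid layers: activation-derivative product <= {bound:.2e}")

# Far from zero both functions saturate and their derivatives are close to 0.
print("sigmoid derivative at z = 10:", sigmoid_grad(10.0))
print("tanh derivative at z = 10:", tanh_grad(10.0))
```

Even in the best case the sigmoid derivative never exceeds 0.25, so stacking more sigmoid layers shrinks the product quickly. The tanh derivative peaks at 1 but falls toward 0 as soon as the unit saturates.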

# **How to recognize VGP?**

There are two main ways to recognize the Vanishing Gradient Problem (VGP).
* **Focus on loss:** If the loss barely changes from one epoch to the next, VGP may be present.
* **Plot weights:** Plot the weights against epochs. If the curves stay flat, with almost no change in the weights, it is a sign of VGP.
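Both checks can be turned into small helpers. The functions below are hypothetical (they are not part of Keras), and the `window` and `tol` defaults are arbitrary assumptions. A flat loss can also mean the model has simply converged, so treat the result as a hint, not a diagnosis.

```python
import numpy as np

def loss_has_stalled(losses, window=5, tol=1e-4):
    """Return True if the loss moved by less than `tol` in each of the last `window` epochs."""
    if len(losses) < window + 1:
        return False
    recent = np.asarray(losses[-(window + 1):], dtype=float)
    return bool(np.abs(np.diff(recent)).max() < tol)

def mean_weight_change(snapshots):
    """Mean absolute change between consecutive weight snapshots (one array per epoch)."""
    return [float(np.mean(np.abs(after - before)))
            for before, after in zip(snapshots[:-1], snapshots[1:])]

# Made-up loss curves: one stuck near 0.6931, one still decreasing.
print(loss_has_stalled([0.6931] * 10))
print(loss_has_stalled([1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]))

# Made-up weight snapshots: a change after the first epoch, none after the second.
snapshots = [np.zeros((2, 2)), np.ones((2, 2)), np.ones((2, 2))]
print(mean_weight_change(snapshots))
```

In a Keras workflow the list of losses would typically come from `history.history["loss"]`, and the snapshots from calling `model.get_weights()[0]` after each epoch.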

# **How to handle VGP?**

* **Reduce model complexity:** Use a shallower neural network. With fewer layers, fewer derivatives are multiplied together during back-propagation, so the gradients reaching the initial layers stay larger than in a deep model.

  This is not an efficient remedy, because hidden layers are added precisely so that the network can learn complex patterns. It may work sometimes, but it is rarely the preferred fix.
        

* **Use a different activation function:** Use the ReLU activation function instead of sigmoid. ReLU is defined as max(0, z): if z is negative the output is 0, otherwise the output is z.

  ReLU does not squash large positive inputs: inputs in the range (-1000, 1000) are mapped to [0, 1000), whereas sigmoid squeezes every input into the range (0, 1).
  Its derivative is also either 0 or 1: 0 for negative inputs and 1 otherwise. Multiplying many 1s together does not shrink the result, so the gradient does not vanish along active units.
        

  **When does the ReLU function fail to counter VGP?**

  The problem is the **dying ReLU**. When a unit's input is negative, its activation is 0 and its derivative is also 0, so the weights feeding that unit are not updated. Leaky ReLU was introduced to solve the dying ReLU problem; it will be explored in the upcoming notebooks.

* **Other ways:**
1. Proper Weight Initialization
2. Batch Normalization
3. Residual Network

**Note:** The other ways will be explored in the upcoming notebooks. 
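Returning to the activation-function remedy, the sketch below writes ReLU, its derivative and a leaky variant in NumPy. The slope `alpha=0.01` for the leaky version is an assumed value, and the derivative at exactly z = 0 is set to 0 by convention.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 where the unit is active (z > 0), 0 elsewhere.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # Negative inputs are scaled by a small slope instead of being zeroed.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # The derivative never becomes exactly 0, so the weights can still move.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-1000.0, -2.0, 0.0, 3.0, 1000.0])
print("relu:           ", relu(z))
print("relu grad:      ", relu_grad(z))
print("leaky relu:     ", leaky_relu(z))
print("leaky relu grad:", leaky_relu_grad(z))

# Along a path of ten active units, the product of ReLU derivatives stays at 1.
active_path = np.full(10, 5.0)
print("product over 10 active units:", np.prod(relu_grad(active_path)))
```

The zeros in the ReLU derivative are the dying ReLU case described above; the leaky variant replaces them with a small non-zero slope.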

# **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense

# **Loading make_moons dataset from Scikit-Learn**

In [None]:
X, y = make_moons(n_samples=250, noise=0.05, random_state=42)

In [None]:
plt.scatter(X[:, 0], X[:,1], c=y)

# **Model Building**

In [None]:
model = Sequential()

model.add(Dense(10, activation="relu", input_dim=2))  # first hidden layer, 2 input features
for _ in range(8):  # eight more hidden layers, nine in total
    model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))  # output layer for binary classification

**Points to Ponder:**

* To depict the Vanishing Gradient Problem (VGP), 10 hidden layers were used with the sigmoid activation, and the model was trained for different numbers of epochs, such as 1 and 100.

* To reduce model complexity and counter VGP, only two hidden layers were used.

* To counter VGP through the activation function, sigmoid was replaced with ReLU in the hidden layers and the number of hidden layers was increased again. The model cell above shows this final configuration; as written, it stacks nine hidden layers plus the sigmoid output layer, ten layers in total.

## **Model Compiling**

In [None]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

## **Old Weights**

In [None]:
old_weights = model.get_weights()[0]  # kernel of the first Dense layer, before training
old_weights

## **Data Splitting**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## **Model Fitting**

In [None]:
history = model.fit(X_train, y_train, epochs=100)

## **New Weights**

In [None]:
new_weights = model.get_weights()[0]  # kernel of the first Dense layer, after training
new_weights

In [None]:
model.optimizer.get_config()["learning_rate"]

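For a single step of plain gradient descent, the update rule is


*                            W<sub>n</sub> = W<sub>old</sub> - η * ∂L/∂W

Rearranging it gives

*                            ∂L/∂W = (W<sub>old</sub> - W<sub>n</sub>) / η

The model here is trained with Adam over many steps, so applying this formula to the weights before and after training does not return a true gradient. It gives a rough proxy for the total update, scaled by the learning rate, which is still useful for comparing one run with another.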
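The rearrangement can be verified on a single plain gradient-descent step. The weights, the gradient and the learning rate below are made-up values for illustration. The recovery is exact only in this one-step case; after many optimizer steps, or with an adaptive optimizer such as Adam, the same expression is only an approximation of the overall update.

```python
import numpy as np

eta = 0.001                          # learning rate (assumed value)
w_old = np.array([0.30, -0.20])      # hypothetical weights before the step
true_grad = np.array([0.5, -1.5])    # hypothetical gradient dL/dW

# Forward direction: W_n = W_old - eta * dL/dW
w_new = w_old - eta * true_grad

# Backward direction: dL/dW = (W_old - W_n) / eta
recovered_grad = (w_old - w_new) / eta

print("new weights:       ", w_new)
print("recovered gradient:", recovered_grad)
print("matches original:  ", np.allclose(recovered_grad, true_grad))
```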


# **Analysis of Weights and Gradients**

In [None]:
print(f"Old weights are: \n\n {old_weights} \n")
print(f"New weights are: \n\n {new_weights} \n")
gradient = (old_weights - new_weights) / 0.001  # 0.001 is the learning rate printed above (Adam's default)
percent_chng = np.abs(100 * (old_weights - new_weights) / old_weights)  # element-wise percent change

print(f"Gradients are: \n\n {gradient} \n")
print(f"Percent changes are: \n\n {percent_chng} \n")
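The same comparison can be extended from the first layer to every layer, which makes it easier to see whether the early layers move less than the later ones. The helper below is a hypothetical sketch, not a Keras function. The arrays are synthetic stand-ins for two layers, and `eps` is an assumed small constant that avoids division by zero.

```python
import numpy as np

def per_layer_change(old_weights, new_weights, eps=1e-12):
    """Return (index, mean absolute change, mean percent change) for each weight array."""
    rows = []
    for index, (old, new) in enumerate(zip(old_weights, new_weights)):
        delta = np.abs(new - old)
        percent = 100.0 * delta / (np.abs(old) + eps)
        rows.append((index, float(delta.mean()), float(percent.mean())))
    return rows

# Synthetic example: the early layer barely moves, the later layer moves much more.
rng = np.random.default_rng(0)
old = [rng.normal(size=(2, 10)), rng.normal(size=(10, 1))]
new = [old[0] + 1e-6, old[1] + 1e-2]

rows = per_layer_change(old, new)
for index, mean_delta, mean_percent in rows:
    print(f"array {index}: mean |change| = {mean_delta:.2e}, mean % change = {mean_percent:.4f}")
```

With a Keras model, the two lists could be the full outputs of `model.get_weights()` captured before and after `model.fit`; for `Dense` layers that list alternates kernels and biases.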

# **Conclusions**

Running the notebook with different numbers of epochs and hidden layers led to the following conclusions.

* With 1 epoch, the gradients were small and the weights were barely updated.

* With 100 epochs, the model stopped improving after roughly 15 epochs.

* When the number of hidden layers was reduced to 3, the loss dropped considerably and no longer hovered around a single value.

* Switching to a shallow neural network resolved the Vanishing Gradient Problem in these experiments.

* Switching the hidden-layer activation function to ReLU also resolved the Vanishing Gradient Problem.


**Points to Ponder:**

1. Reducing the number of hidden layers is a less efficient remedy than changing the activation function from sigmoid to ReLU.

2. The percent changes between the old and new weights are noticeably higher than in the run that demonstrated the Vanishing Gradient Problem.