<a href="https://colab.research.google.com/github/Rishav-hub/Challenge1/blob/main/Documentation_2_June.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What do you mean by training of Neural Network ?
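Training is the process of adjusting the network's weights and biases so that its predictions move closer to the desired output. The network is shown inputs together with their target outputs, and an optimizer updates the parameters to reduce the loss between predictions and targets.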

### What do you mean by DNN ?
DNNs, or Deep Neural Networks, are architectures made of many layers. Each layer is connected to the next, and the added depth makes them suited to complex tasks.

### What is a Sequential model ?
A plain stack of layers where each layer has exactly one input tensor and one output tensor.


```
model = tf.keras.models.Sequential()
```



### Why are callbacks implemented while fitting the model ?
Callbacks are objects that Keras calls during training: at the start and end of training, at the start and end of each epoch, and even before and after processing each batch.

Some common uses are:
- Saving checkpoints
- Early stopping
- Writing logs to a directory for TensorBoard

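### Why do we flatten the input of an image ?
An image in the MNIST dataset, for example, has shape (28, 28), but a dense layer expects each sample as a one-dimensional vector. `Flatten` therefore reshapes each (28, 28) image into a vector of 784 values, so a batch has shape (None, 784), where `None` stands for the batch size. The first dense layer then receives 784 input features.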


```
tf.keras.layers.Flatten(input_shape=[28, 28])
```
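The reshaping itself can be illustrated without Keras. The sketch below is only an analogy in NumPy, assuming a made-up batch of 32 blank images; it is not how the `Flatten` layer is implemented internally.

```python
import numpy as np

# Hypothetical batch of 32 grayscale images, 28 x 28 pixels each.
batch = np.zeros((32, 28, 28))

# Keep the batch axis and merge the remaining axes into one,
# which is the effect a Flatten layer has on its input.
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # (32, 784)
```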



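### How are parameters of the layers calculated ?
If the network is made only of dense layers, then for each pair of consecutive layers the count is

$params = firstLayer \times secondLayer + bias$

where $firstLayer$ and $secondLayer$ are the numbers of neurons in the two layers and $bias$ is one bias per neuron of the second layer, i.e. $bias = secondLayer$. For example, going from 784 inputs to 300 neurons gives $784 \times 300 + 300 = 235500$ parameters.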
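As a quick check of this rule, the sketch below counts the parameters of a stack of dense layers. The helper name `dense_params` is made up for illustration, and the layer sizes are the ones used by the models later in this notebook (784 inputs, then 300, 200, 100 and 10 neurons).

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

# Layer widths: flattened input, three hidden layers, output layer.
layer_sizes = [784, 300, 200, 100, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [235500, 60200, 20100, 1010]
print(sum(per_layer))  # 316810
```

These values agree with the `Param #` column that `model.summary()` reports for the same architecture.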


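### What does one epoch mean ?
One epoch is one complete pass over the training set. The data is processed in batches, and each batch involves one forward **propagation + back propagation** step followed by a weight update.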
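To make the distinction between an epoch and a single forward/backward step concrete, here is a small sketch. It assumes 55,000 training images and a batch size of 32, which are the values this notebook uses further down.

```python
import math

# MNIST has 60,000 training images; 5,000 are held out for validation.
train_size = 60_000 - 5_000
batch_size = 32

# Each step is one forward + backward pass on one batch;
# the last batch of the epoch may be smaller than batch_size.
steps_per_epoch = math.ceil(train_size / batch_size)
print(steps_per_epoch)  # 1719
```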

### What are the problems faced while training a NN ?
- Vanishing and exploding gradients
- It requires a lot of data to train (mitigated by transfer learning or data augmentation)
- Training slows down as the NN grows in size (mitigated by better optimizers)
- Risk of overfitting (mitigated by dropout or regularization)

### What is Vanishing of Gradients ?
- This problem mainly occurs in Deep Neural Networks (DNNs).
- The activation function most associated with it is the sigmoid AF.
- Gradients get smaller and smaller as backpropagation progresses down to the lower layers.
- So the lower layers are left virtually untrained.
- Training becomes very slow and may never reach a good solution.

 Ex - by the chain rule, the gradient used to update $w_2$ is a product of several factors:

 $$\frac{\partial e}{\partial w_2} = \frac{\partial e}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times\frac{\partial z_2}{\partial w_2}$$

- If each of these factors is very small, their product is smaller still, e.g. $0.0006$.

- So there is almost no change in the weight.
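The shrinking product can be reproduced with a few lines of arithmetic. This is only a sketch: the three factors are made-up values chosen to give $0.0006$, and the per-layer factor of 0.25 is an assumed worst case (the largest value the sigmoid derivative can take).

```python
import math

def backprop_scale(factors):
    """Multiply the chain-rule factors into a single gradient magnitude."""
    return math.prod(factors)

# Three small factors, mirroring the three-term chain rule (illustrative values).
print(backprop_scale([0.1, 0.1, 0.06]))

# Assumed best case for sigmoid layers: each layer contributes at most 0.25.
for depth in (2, 5, 10):
    print(depth, backprop_scale([0.25] * depth))
```

Even in this best case the gradient reaching the tenth layer from the output is below one millionth of its original size.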


### What is Exploding of Gradients ?
- It is the opposite of vanishing gradients.
- With large weights, the gradients grow larger and larger as they are propagated back through the layers.
- The resulting update can overshoot, leaving the new weight far from the previous one (possibly with its sign flipped) instead of close to it.
- Many layers get extremely large weight updates.
- It happens mostly in RNNs.

 Ex - the chain-rule factors become > 1, giving a weight update like $6000$.
- In these cases training diverges.

### How do we handle Vanishing and Exploding gradient problem ?
Two methods are commonly used to handle the vanishing and exploding gradient problems:
- Choice of activation function
- Weight initialization

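### What are Activation Functions and what is their use ?
- They help determine the output of each neuron, and therefore of the NN.
- They determine whether a neuron should get activated or not.
- They introduce non-linearity, without which stacked layers would collapse into a single linear transformation.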

### What was the issue with Sigmoid AF ?
 ![image](https://miro.medium.com/max/3268/1*a04iKNbchayCAJ7-0QlesA.png)

- As the diagram shows, this function works well only when its input stays near zero, which happens with **smaller weight initialization**, i.e. weights that are **initialized closer to zero**.
- For larger weights the input becomes large in magnitude, the gradient of the sigmoid AF approaches **zero**, and the neuron goes into the **saturation state**.
- In the saturation state the vanishing gradient issue occurs: the gradient is so small that there is almost no weight update.
- Therefore, the sigmoid AF has the disadvantage of vanishing gradients.
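The saturation described above can be seen numerically. The sketch below evaluates the sigmoid derivative, $\sigma(z)(1 - \sigma(z))$, at a few illustrative inputs; the input values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at z = 0 and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))
```

At $z = 0$ the gradient is 0.25, its maximum; by $z = 10$ it has dropped below $10^{-4}$.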

### What should be the choice of AF instead of Sigmoid ?
- ReLU, or Rectified Linear Unit, is one of the most popular AFs and can replace sigmoid in the hidden layers.
![relu](https://miro.medium.com/max/754/1*3JUMOqugWKB2SDra6x6v0A.png)
- As the plot shows, it does not saturate for positive inputs.
- Hence it still produces a useful output and gradient for larger input values.
- However, it has issues of its own, such as **dying ReLU**, where a neuron outputs zero for every input and stops learning.
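A minimal sketch of ReLU and its gradient shows both properties: no saturation on the positive side, and a zero gradient on the negative side, which is what makes a "dead" neuron stop learning. The gradient at exactly zero is taken as 0 here, which is a common convention rather than the only choice.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """1 for positive inputs, 0 otherwise (value at z == 0 is a convention)."""
    return 1.0 if z > 0 else 0.0

for z in (-3.0, 0.5, 100.0):
    print(z, relu(z), relu_grad(z))
```

A neuron whose pre-activation is negative for every training sample always sits in the zero-gradient branch, so its weights never change.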

### What are the techniques of weight initialization ?
- The network can avoid vanishing and exploding gradients only if the weights are initialized so that the activations stay within the **non-saturating** region.
- This can be done using the **Glorot weight initialization technique**, which rests on the following argument: the signal needs to flow properly in both directions, in the forward direction when making predictions and in the reverse direction when backpropagating gradients.
- For the signal to flow properly, we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.
- Glorot and Bengio introduced two terms, $fan_{in}$ (the number of inputs to a layer) and $fan_{out}$ (the number of its neurons). Both conditions, i.e. $\sigma_{in}^2 = \sigma_{out}^2$ in each direction, can hold exactly only if $fan_{in} = fan_{out}$.
- Since that is rarely the case, the compromise is to scale the initial weights using the average of the two:

 $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$


### Loading and preparing the MNIST data


In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
mnist = tf.keras.datasets.mnist

(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

In [None]:
# Hold out the first 5,000 training images for validation and scale pixels to [0, 1].
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]


### Use of Early Stopping and Checkpoints

These are both callbacks passed to the model during fitting.

Early Stopping - A technique in which training is stopped if the monitored metric (the validation loss by default) shows no improvement for a given number of epochs (`patience`). With `restore_best_weights=True`, the model is rolled back to the weights from the best epoch.

ModelCheckpoint - It saves checkpoints of your model at regular intervals during training, by default at the end of each epoch. With `save_best_only=True`, a checkpoint is written only when the monitored metric improves. If training is interrupted for any reason, it can be continued from the last saved checkpoint.
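The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the function name `early_stopping` and the validation losses are made up, and options such as a minimum improvement threshold are ignored.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses), best_epoch

# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39]
print(early_stopping(losses, patience=3))  # (6, 3)
```

Restoring the best weights then corresponds to going back to the parameters saved at `best_epoch` rather than keeping those from `stop_epoch`.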

In [None]:
# Early Stopping
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

In [None]:
# Model Checkpoint saving
CKPT_path = "model_ckpt_06_09.h5"
checkpointing_cb = tf.keras.callbacks.ModelCheckpoint(CKPT_path, save_best_only=True)

In [None]:
model_3 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='relu',name= 'hidden_layer1'),
                                     tf.keras.layers.Dense(200, activation = 'relu', name = 'hidden_layer2'),
                                     tf.keras.layers.Dense(100, activation = 'relu', name = 'hidden_layer3'),
                                     tf.keras.layers.Dense(10, activation='softmax', name = 'output_layer')])

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_3.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 50
VALIDATION_SET = (X_valid, y_valid)

history = model_3.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32,
                    callbacks= [early_stopping_cb, checkpointing_cb])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50


### Observation
When Early Stopping is used, it interrupts training once it measures no progress on the validation set for a number of epochs (`patience=5` here). In this run, training stopped after epoch 31 of the 50 requested.

### Comparing different types of weight initialization

#### Glorot Initialization
Glorot normal

variance = $\frac{1}{fan_{avg}}$

In [None]:
model_1 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='relu', kernel_initializer= 'glorot_normal'),
                                     tf.keras.layers.Dense(200, activation = 'relu', kernel_initializer= 'glorot_normal'),
                                     tf.keras.layers.Dense(100, activation = 'relu', kernel_initializer= 'glorot_normal'),
                                     tf.keras.layers.Dense(10, activation='softmax')])

In [None]:
model_1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (Flatten)        (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 300)               235500    
_________________________________________________________________
dense_1 (Dense)              (None, 200)               60200     
_________________________________________________________________
dense_2 (Dense)              (None, 100)               20100     
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1010      
Total params: 316,810
Trainable params: 316,810
Non-trainable params: 0
_________________________________________________________________


In [None]:
LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_1.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)


In [None]:
EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history = model_1.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Glorot Uniform

Uniform distribution between $-r$ and $r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$
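Both Glorot variants target the same variance, $\frac{1}{fan_{avg}}$, since a uniform distribution on $[-r, r]$ has variance $\frac{r^2}{3}$. The sketch below checks this empirically for the first hidden layer (784 inputs, 300 neurons) using only the standard library. It samples from a plain normal distribution as an assumption; Keras's own initializers may differ in detail, for example by truncating the normal distribution.

```python
import math
import random

fan_in, fan_out = 784, 300
fan_avg = (fan_in + fan_out) / 2  # 542.0

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 50_000

# Glorot normal: standard deviation sqrt(1 / fan_avg).
sigma = math.sqrt(1 / fan_avg)
normal_w = [rng.gauss(0.0, sigma) for _ in range(n)]

# Glorot uniform: limits +/- sqrt(3 / fan_avg).
r = math.sqrt(3 / fan_avg)
uniform_w = [rng.uniform(-r, r) for _ in range(n)]

target = 1 / fan_avg
print(target, variance(normal_w), variance(uniform_w))
```

The two sample variances should land close to the target, which is why the two variants are expected to behave similarly in training.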

In [None]:
model_2 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='relu', kernel_initializer= 'glorot_uniform'),
                                     tf.keras.layers.Dense(200, activation = 'relu', kernel_initializer= 'glorot_uniform'),
                                     tf.keras.layers.Dense(100, activation = 'relu', kernel_initializer= 'glorot_uniform'),
                                     tf.keras.layers.Dense(10, activation='softmax')])



In [None]:
LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_2.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history = model_2.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Evaluating both
**Observation** - Both perform about equally here (test accuracy of roughly 0.97 each), though the comparison may change when training for more epochs.

In [None]:
print("For Glorot normal is {}".format(model_1.evaluate(X_test, y_test)))
print("For Glorot uniform is {}".format(model_2.evaluate(X_test, y_test)))

For Glorot normal is [14.695219039916992, 0.9724000096321106]
For Glorot uniform is [14.899723052978516, 0.970300018787384]


### He initialization
He normal

variance = $\frac{2}{fan_{in}}$
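To see how the He formula differs from Glorot in practice, the sketch below computes the initial standard deviation each would use for the hidden layers of the architecture in this notebook. The helper names `glorot_std` and `he_std` are made up for illustration.

```python
import math

# Flattened input followed by the three hidden-layer widths.
layer_sizes = [784, 300, 200, 100]

def glorot_std(fan_in, fan_out):
    # variance = 1 / fan_avg = 2 / (fan_in + fan_out)
    return math.sqrt(2 / (fan_in + fan_out))

def he_std(fan_in):
    # variance = 2 / fan_in
    return math.sqrt(2 / fan_in)

for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
    print(fan_in, fan_out,
          round(glorot_std(fan_in, fan_out), 4),
          round(he_std(fan_in), 4))
```

He initialization always starts with a somewhat larger spread than Glorot for the same layer, because its variance depends on $fan_{in}$ alone.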

In [None]:
model_1 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='relu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(200, activation = 'relu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(100, activation = 'relu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(10, activation='softmax')])

In [None]:
LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_1.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history = model_1.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


He normal with the ELU activation function sometimes performs better, so it is tried next.

In [None]:
model_2 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='elu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(200, activation = 'elu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(100, activation = 'elu', kernel_initializer= 'he_normal'),
                                     tf.keras.layers.Dense(10, activation='softmax')])

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_2.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history = model_2.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Evaluating Both

In [None]:
print("For he normal is {}".format(model_1.evaluate(X_test, y_test)))
print("For he normal with elu AF is {}".format(model_2.evaluate(X_test, y_test)))

For he normal is [15.593754768371582, 0.9713000059127808]
For he normal with elu AF is [41.63624572753906, 0.8618999719619751]


#### Observation: He initialization performs better with ReLU and its variants.

#### LeCun Initialization with SELU

In [None]:
model_1 = tf.keras.models.Sequential([tf.keras.layers.Flatten(input_shape = [28,28], name = 'input_layer'),
                                     tf.keras.layers.Dense(300,activation='selu', kernel_initializer= 'lecun_normal'),
                                     tf.keras.layers.Dense(200, activation = 'selu', kernel_initializer= 'lecun_normal'),
                                     tf.keras.layers.Dense(100, activation = 'selu', kernel_initializer= 'lecun_normal'),
                                     tf.keras.layers.Dense(10, activation='softmax')])

LOSS_FUNCTION = "sparse_categorical_crossentropy" # use => tf.losses.sparse_categorical_crossentropy
OPTIMIZER = "SGD" # or use with custom learning rate=> tf.keras.optimizers.SGD(0.02)
METRICS = ["accuracy"]

model_1.compile(loss=LOSS_FUNCTION,
              optimizer=OPTIMIZER,
              metrics=METRICS)

EPOCHS = 10
VALIDATION_SET = (X_valid, y_valid)

history = model_1.fit(X_train, y_train, epochs=EPOCHS,
                    validation_data=VALIDATION_SET,batch_size = 32)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Evaluate the model

In [None]:
print("For LeCun normal with selu AF is {}".format(model_1.evaluate(X_test, y_test)))

For LeCun normal with selu AF is [28.084856033325195, 0.7336000204086304]


#### Observation
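From the results above, Glorot normal and He normal (both with ReLU) gave the best results on the test dataset for this network. Note that `X_test` is never divided by 255 in the cells above, unlike the training and validation data, so the large loss values, and possibly part of the accuracy gap between activations, may reflect unscaled test inputs.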