<a href="https://colab.research.google.com/github/RiseAboveAll/PYTORCH_Learning/blob/master/Pytorch_FeedForward.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial Neural Network

- Type of neural network : Feed Forward Neural Network



### **Forward Propagation**

  -  Different neurons find different features from the input

  -  Layers of neurons are stacked one after another

  -  Same inputs can be attached to multiple different neurons, each calculating something different

  -  The outputs of one layer's neurons act as inputs to the next layer. Adding neurons to a layer increases its width, and adding layers increases the depth of the network.

  -  A neuron : sigmoid(w.T . X + b)

  -  Multiple Neurons : When a layer has multiple neurons, say j = 1 to M, each one computes Zj = sigmoid(wj.T . X + bj). This can be written more efficiently in vector form as Z = sigmoid(W.T . X + b), where :
      
      -  Z is a vector of size M (M x 1)

      -  X is the vector of size D (Dx1) 

      -  W is the matrix of size D x M

      -  b is a vector of size M ( Mx1 )

      -  sigmoid() is applied element-wise
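
The vectorized layer above can be sketched in PyTorch as follows. This is a minimal illustration rather than the notebook's own code: the sizes `D = 4` and `M = 3` and the random values are assumptions chosen only for the demonstration. It checks that computing each neuron separately gives the same result as the single matrix expression.

```python
import torch

torch.manual_seed(0)

D, M = 4, 3                 # assumed sizes: D input features, M neurons
X = torch.randn(D, 1)       # input column vector (D x 1)
W = torch.randn(D, M)       # weight matrix (D x M), one column per neuron
b = torch.randn(M, 1)       # bias vector (M x 1)

# Vectorized form: Z = sigmoid(W.T X + b), shape (M x 1)
Z = torch.sigmoid(W.T @ X + b)

# Neuron-by-neuron form: Zj = sigmoid(wj.T X + bj) for each of the M neurons
Z_loop = torch.stack(
    [torch.sigmoid(W[:, j] @ X[:, 0] + b[j, 0]) for j in range(M)]
).reshape(M, 1)

print(Z.shape)                      # torch.Size([3, 1])
print(torch.allclose(Z, Z_loop))    # True
```

`nn.Linear(D, M)` performs the same affine computation for a whole batch of row vectors; it stores its weight with shape (M x D).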









### **Activation Functions**



- Sigmoid Activation:
  
  -  Makes neural network non linear

  -  Issues with sigmoid activation function :

      -  **Standardization** : We want inputs to have a mean of 0 and a standard deviation of 1 so that they are all on the same scale. The sigmoid output lies between 0 and 1 and is centered at 0.5, not 0. Because the output of one layer is the input to the next, a sigmoid layer feeds non-centered values to the layer after it. Ideally both the input and the output of every layer would be standardized.

-  Hyperbolic Tangent(tanh) :

    -  It is centered at 0 

    -  It lies between -1 and 1 

    -  tanh(a) = (exp(2a) - 1)/(exp(2a) + 1)

    -  Issues with both **Sigmoid** & **Tanh** :

        -  Vanishing Gradient : We train the model with gradient descent, so we need the gradient of the cost with respect to every parameter. In a deep network the gradients have to propagate backwards through all layers, starting from the output. The output is a composition of many non-linear functions, and by the chain rule the derivative of a composition is a product of the derivatives of its parts. The derivatives of sigmoid and tanh are close to 0 in the flat regions on both sides of the origin and are largest at the center, where they peak at 0.25 for sigmoid and 1 for tanh. Multiplying many such small factors gives a very small value, so the gradients shrink the further back they travel and eventually vanish for the early layers. 


-  ReLU : 

    -  Differentiable everywhere except at 0

    -  Inputs greater than 0 always have a gradient of 1, never 0, which makes training a neural network a lot more efficient. 

    -  For inputs greater than 0, ReLU does not suffer from the vanishing gradient. For inputs less than or equal to 0 the gradient is exactly 0, which can lead to a **Dead Neuron** : a neuron that always outputs 0 because the weighted sum of its inputs is always less than or equal to 0. 

-  Leaky ReLU :

    -  Has a small positive slope (less than 1) for values less than 0 

    -  The derivative is always positive, so a neuron cannot get stuck with a zero gradient 

-  ELU : 

    -  Often reported to converge faster than ReLU

    -  Often reported to reach higher accuracy

    -  Allows outputs to be negative, which brings the mean of the activations closer to 0 

-  Softplus:

    -  softplus(a) = log(1 + exp(a)), which looks linear when the input is reasonably large. 

    -  For softplus and ReLU the minimum value is around 0 and there is no upper bound, hence their outputs are not centered around 0. 
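
The gradient behaviour described in this list can be checked numerically with autograd. The sketch below is only an illustration under simplifying assumptions: the helper names `max_grad` and `grad_through` are hypothetical, and the "deep network" is just the same activation applied repeatedly to one scalar, with no weights in between.

```python
import torch
import torch.nn.functional as F

# Grid of inputs around 0 used to probe each activation's derivative
a = torch.linspace(-6.0, 6.0, steps=1201, requires_grad=True)

def max_grad(fn):
    """Largest derivative of fn over the grid, computed with autograd."""
    (grad,) = torch.autograd.grad(fn(a).sum(), a)
    return grad.max().item()

print(max_grad(torch.sigmoid))  # about 0.25, reached near a = 0
print(max_grad(torch.tanh))     # about 1.0, reached near a = 0
print(max_grad(torch.relu))     # 1.0 for every a > 0

def grad_through(fn, depth, x0=0.5):
    """Derivative of fn applied `depth` times in a row, w.r.t. the input."""
    x = torch.tensor(x0, requires_grad=True)
    y = x
    for _ in range(depth):
        y = fn(y)
    (g,) = torch.autograd.grad(y, x)
    return g.item()

# The sigmoid gradient shrinks with depth; ReLU passes it through for x0 > 0
for depth in (1, 5, 10, 20):
    print(depth, grad_through(torch.sigmoid, depth), grad_through(torch.relu, depth))

# Dead-neuron case: for a negative input ReLU has zero gradient,
# while Leaky ReLU keeps a small positive one (slope 0.01 assumed here)
x = torch.tensor(-2.0, requires_grad=True)
(g_relu,) = torch.autograd.grad(torch.relu(x), x)
(g_leaky,) = torch.autograd.grad(F.leaky_relu(x, negative_slope=0.01), x)
print(g_relu.item(), g_leaky.item())
```

In a real network the weights also enter the product of derivatives, so this only isolates the contribution of the activation function.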



**Something to try as an activation function** : BRU ( Bionodal Root Unit ) 

Code for a fully connected layer block using <code>nn.Sequential</code> :

    nn.Sequential(
      nn.Linear(D,M),
      nn.ReLU(),
      nn.Linear(M,K),
      nn.Softmax()
    )



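**Do not use the above code**

PyTorch combines the softmax function with the cross entropy loss in <code>nn.CrossEntropyLoss</code>, which expects the raw outputs of the last linear layer. Adding <code>nn.Softmax()</code> to the model would apply softmax twice. 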
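
A minimal sketch of the same block without the final softmax, trained for one step with `nn.CrossEntropyLoss`. The sizes `D`, `M`, `K`, the batch size `N`, the Adam optimizer and the random data are all assumptions made for illustration; they stand in for a real data set and training setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D, M, K = 784, 128, 10      # assumed sizes: input features, hidden units, classes
N = 32                      # assumed batch size

# No nn.Softmax at the end: the model outputs raw scores (logits)
model = nn.Sequential(
    nn.Linear(D, M),
    nn.ReLU(),
    nn.Linear(M, K),
)

criterion = nn.CrossEntropyLoss()                  # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters())   # assumed optimizer choice

inputs = torch.randn(N, D)                # random stand-in for real data
targets = torch.randint(0, K, (N,))       # integer class labels in [0, K)

# One training step
optimizer.zero_grad()
logits = model(inputs)                    # shape (N, K)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()

# Probabilities are only needed at prediction time
probs = torch.softmax(logits.detach(), dim=1)
print(logits.shape, loss.item())
```

Softmax does not change which class has the highest score, so `logits.argmax(dim=1)` already gives the predicted class.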

Code for multiclass logistic regression :

    model=nn.Linear(D,K)
    criterion=nn.CrossEntropyLoss()

**Note** : PyTorch does have a standalone categorical loss, <code>nn.NLLLoss</code> (negative log-likelihood), which expects log-probabilities as input.
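
The relationship between the two losses can be verified directly: `nn.CrossEntropyLoss` on raw scores gives the same value as `nn.LogSoftmax` followed by `nn.NLLLoss`. The batch size, class count and random values below are assumptions used only for the check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N, K = 8, 5                              # assumed batch size and class count
logits = torch.randn(N, K)               # random stand-in for model outputs
targets = torch.randint(0, K, (N,))      # integer class labels in [0, K)

# Combined loss, applied to raw scores
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Two-step version: log-softmax first, then negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(loss_ce, loss_nll))   # True
```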



### How to Represent Images 

**How are images stored in a computer?**

-  Images have two properties :

    -  Height

    -  Width

-  An image can be stored as a **matrix** whose entries are pixel values encoding color. Color comes in different channel layouts such as greyscale or RGB. An RGB color is a set of 3 numbers, so the matrix gets a third dimension besides height and width: the number of channels, which is 3 for RGB. The image is therefore a tensor of size (H,W,C). 

- **Quantization** : Physically, color is light, measured by its intensity, so it is continuous and has infinitely many possible values. Computers do not have infinite precision, so each channel value is stored in 8 bits, giving 2^8 = 256 possible values (0,1,2,....,255). For 3 channels this gives 2^8 * 2^8 * 2^8 = about 16.8 million possible colors. With this we can work out how much space an uncompressed image takes. For example, a 500 x 500 RGB image requires <code>500*500*3*8 = 6 million bits</code>. 

A greyscale image has a single channel, so it is a 2-D array (H,W) in which each pixel is black, white or a shade in between: 0 represents **Black** & 255 represents **White**.
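
The arithmetic above can be reproduced in a few lines of Python. The helper name `image_size_bits` is hypothetical and the calculation assumes an uncompressed image.

```python
BITS_PER_CHANNEL = 8
CHANNELS = 3                      # RGB

levels = 2 ** BITS_PER_CHANNEL    # possible values per channel
colors = levels ** CHANNELS       # possible RGB colors

def image_size_bits(height, width, channels=CHANNELS, bits=BITS_PER_CHANNEL):
    """Uncompressed size of an image in bits."""
    return height * width * channels * bits

total_bits = image_size_bits(500, 500)

print(levels)            # 256
print(colors)            # 16777216
print(total_bits)        # 6000000
print(total_bits // 8)   # 750000 (bytes)
```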

**Image As Inputs**

-  We do not feed raw 8-bit values to the network, because neural networks do not work well with large input values. Instead we scale images to have values between 0 and 1, where 0 represents black and 1 represents white in greyscale. These are not standardized values, as they are not centered around 0, but they are convenient for images because they can be interpreted as probabilities. 

**Exception** : Images for VGG are centered around 0, but the width of their range is still 256.
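
A small sketch of both conventions, using a random 28 x 28 greyscale image as a stand-in (the size and the values are assumptions). The centered variant subtracts this single image's own mean purely for illustration; it is not the exact preprocessing used for VGG.

```python
import torch

torch.manual_seed(0)

# Random stand-in for an 8-bit greyscale image
img_uint8 = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Scale to [0, 1]: 0 stays black, 255 becomes 1.0 (white)
img = img_uint8.float() / 255.0
print(img.dtype)                                        # torch.float32
print(img.min().item() >= 0.0, img.max().item() <= 1.0)

# Centered variant: mean close to 0, but values still span up to 255
centered = img_uint8.float() - img_uint8.float().mean()
print(centered.mean().item(), (centered.max() - centered.min()).item())
```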

**Images as inputs to neural network**

-  A neural network expects the input X to have shape N x D, where <code> N is the number of samples</code> & <code> D is the number of features</code>. A single image is not a flat vector of D features; it is a 3-D object. A data set of images is therefore a 4-D object of shape (N x H x W x C), where N is the number of samples, H is the height, W is the width and C is the number of channels. For fully connected (dense) networks we flatten each image into a vector of length H*W*C, so the shape becomes <code> (N , H*W*C) </code>. To do so we use the <code> .view() </code> or <code> .reshape() </code> method. 
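
Flattening a batch of images can be sketched as below. The batch of 16 random RGB images of size 28 x 28 is an assumption standing in for a real data set.

```python
import torch

N, H, W, C = 16, 28, 28, 3           # assumed batch of 16 RGB images, 28 x 28
images = torch.rand(N, H, W, C)      # random stand-in, values in [0, 1)

flat_view = images.view(N, -1)               # -1 lets PyTorch infer H*W*C
flat_reshape = images.reshape(N, H * W * C)  # same result with explicit size

print(flat_view.shape)                       # torch.Size([16, 2352])
print(torch.equal(flat_view, flat_reshape))  # True
```

`.view()` needs the tensor's memory layout to be compatible with the new shape (for example a contiguous tensor), while `.reshape()` falls back to copying when it is not, so `.reshape()` is the safer choice after operations such as `permute`.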