# In this section
1. The Neuron
2. The Activation Function
3. How Neural Networks work
4. How Neural Networks learn
5. Gradient Descent
6. Stochastic Gradient Descent
7. Backpropagation

## 1. The Neuron
* Dendrites: receivers of the signal. Axon: transmitter of the signal. Synapse: the region where the signal gets passed from one neuron to another
* ![image.png](attachment:image.png)
* The input variables should be standardised/normalised so that they are on comparable scales, which makes it easier for the neural network to train properly
* Resource on standardisation: __Efficient Backprop by Yann LeCun et al (1998)__
* The output value can be: continuous (price), binary (yes/no), or categorical (several output values, one per category, encoded as dummy variables)
* The inputs and the output all belong to one row (one observation): the input values are the different variables of that row, and the output is the prediction for that same row. Think of it like a simple/multivariate linear regression
* All synapses (connections from input values to neurons) get assigned weights. 
    * Weights are crucial to ANN functioning. By adjusting weights, ANN decides which signal is more important than others. 
    * During the training phase of ANN, we are basically adjusting the weights of synapses across the whole neural network. 
    * That's where Gradient Descent and Backpropagation come into play
* Inside the neuron, the weighted signals are added up and then an activation function is applied
    * The __Activation function__ is applied to this weighted sum, and the same function is typically used for the whole neuron layer. Depending on the function, the neuron decides whether to pass a signal on or not
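
The steps in this section (standardise the inputs, weight them, add them up, apply an activation function) can be sketched in a few lines of Python. This is only an illustration: the function names `standardise` and `neuron_output`, the input values and the weights are all made up for the example, and sigmoid is picked arbitrarily as the activation function.

```python
import math

def standardise(column):
    """Rescale one input variable to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in column]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, activation):
    """Add up the weighted signals, then apply the activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# Hypothetical, already standardised input values for one row,
# and illustrative synapse weights (assumptions, not trained values).
row = [0.5, -1.2, 0.3]
weights = [0.8, 0.1, -0.4]

output = neuron_output(row, weights, sigmoid)
print(output)
```

Changing a weight changes how much the corresponding input contributes to the weighted sum, which is what "adjusting the weights" during training acts on.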

## 2. The Activation Function
* There are 4 predominant types of Activation Functions
    1. __Threshold Function__: A yes/no type of function ![image.png](attachment:image.png)
    2. __Sigmoid Function__: x is the value of the weighted sum inside Neuron ![image-2.png](attachment:image-2.png)
        * This is used in __Logistic Regression__. 
        * The advantage is that this function progresses smoothly and gradually and, unlike the threshold function, it doesn't have kinks in its curve.
        * For large negative x the value approaches 0, at x = 0 it is 0.5, and for large positive x it approaches 1
        * __Sigmoid function is valuable in final layer when we are trying to predict probabilities__
    3. __Rectifier Function__: one of the most popular activation functions used in ANN, even though it has kinks ![image-3.png](attachment:image-3.png)
        * Can be used where there are multiple neurons in hidden layers ![image-5.png](attachment:image-5.png)
        * Additional Source on why the rectifier function is the most popular activation function:
            * Deep sparse rectifier neural networks - by Xavier Glorot et al. (2011)
    4. __Hyperbolic Tangent (tanh) Function__: ![image-4.png](attachment:image-4.png)
        * Very similar to the sigmoid function, but its values lie in the range (-1, 1) instead of (0, 1)
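
As a rough sketch, the four functions above can be written in plain Python as follows. The formulas are the standard textbook definitions; the choice that the threshold function returns 1 at exactly x = 0 is an assumption, since the convention at that point varies.

```python
import math

def threshold(x):
    # Yes/no output; returning 1 at exactly x == 0 is an assumed convention.
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # Smooth curve with values in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    # 0 for negative inputs, x itself otherwise (kink at x == 0).
    return max(0.0, x)

def tanh(x):
    # Same S-shape as the sigmoid, but with values in (-1, 1).
    return math.tanh(x)

# Compare the four functions on a few weighted-sum values.
for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), sigmoid(x), rectifier(x), tanh(x))
```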

## 3. How Neural Networks (NN) work?
1. Training the network is an important part of how an NN works
2. Suppose we are predicting house prices and we have 4 input variables in the input layer: available area (sq. ft), number of bedrooms, distance to the city (miles) and age of the property.
3. In its most basic form an NN has just an input layer and an output layer, so these 4 variables would simply be weighted and combined to calculate the output (house price). ![image.png](attachment:image.png)
    * Pretty much any function could be used to accomplish this, e.g. logistic regression or any of the activation functions
    * Most ML algorithms that exist can be represented in this form
    * Most ML algorithms that exist can be represented in this form
4. The power and utility of an NN come from the hidden layer. Suppose we already have a trained neural network in place with 5 neurons. This is how it works: ![image-2.png](attachment:image-2.png)
    * Each neuron in the hidden layer looks at specific rules/criteria, which it arrives at by training on data.
    * A neuron may consider one, several or all of the input variables to form a criterion.
        * e.g. one neuron may take only the age of the property as its criterion and apply a rectifier activation function to it: a property above a certain age might be deemed a historic property and so command a higher price than others.
        * Another neuron might consider area and distance to the city as important, as normally the further away you go from the city centre, the cheaper the property becomes. So if a property is closer to the city and has a larger area than its peers, this neuron will activate and give an output.
        * We cannot know for certain why a neuron has picked up particular input variables; we can only speculate
    * The combined effect of the outputs of all neurons in the hidden layer then gives the final predicted price
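
The house price walkthrough above can be sketched as a single forward pass. Everything numeric here is hypothetical: the weights are hand-picked for illustration rather than learned, the inputs are assumed to be standardised already, and the function name `forward` is invented for this example. A weight of 0 means the neuron ignores that input, which is how one neuron can "look only at age" while another combines area and distance.

```python
def rectifier(x):
    return max(0.0, x)

def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer (rectifier) -> price."""
    hidden = [
        rectifier(sum(x * w for x, w in zip(inputs, neuron_weights)))
        for neuron_weights in hidden_weights
    ]
    price = sum(h * w for h, w in zip(hidden, output_weights))
    return price, hidden

# Hypothetical standardised row: area, bedrooms, distance to city, age.
row = [1.0, 0.5, -1.0, 2.0]

# Illustrative weights for 5 hidden neurons (assumed, not trained).
hidden_weights = [
    [0.0, 0.0, 0.0, 1.0],    # considers only the age of the property
    [0.6, 0.0, -0.8, 0.0],   # large area and small distance both help
    [0.3, 0.3, 0.3, 0.3],    # considers all four inputs
    [-0.5, 0.2, 0.0, 0.0],   # stays inactive for this row
    [0.0, 0.7, 0.0, -0.4],   # stays inactive for this row
]
output_weights = [0.5, 1.0, 0.2, 0.3, 0.4]

price, hidden = forward(row, hidden_weights, output_weights)
print(hidden)
print(price)
```

The predicted value is in standardised units here; mapping it back to an actual price would need the inverse of whatever scaling was applied to the target.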

## 4. How Neural Networks (NN) learn? (Backpropagation)
1. Our goal is to create a network which learns on its own; avoid putting in pre-programmed rules.
2. We code the architecture, and then point the NN towards a pre-categorised dataset to learn from. When a new datapoint comes up, the NN should be able to identify it
3. __A Perceptron__ : A single layered feed-forward Neural Network ![image.png](attachment:image.png)
    * y-hat is the output value predicted by neural network 
    * y is the Actual value
    * Input values are supplied to the perceptron, the activation function is applied and we get an output y-hat, which is plotted
    * In order to learn, we must compare this output value with the actual value; from this comparison a __Cost Function (C)__ is calculated
    * There are many ways to calculate a cost function. The method shown here is a common one
    * The Cost Function shows the error in our prediction. Our goal is to minimise the cost function
    * After the comparison, the result is fed back to the Neural Network, and the weights get updated ![image-2.png](attachment:image-2.png)
    * The only thing we can control in this NN is the weights, so to minimise the cost function we update the weights
    * After multiple iterations of this process, we can find a particular set of values for the weights which minimises C (usually the Cost Function does not reach 0)
    * Everything described so far was done for the elements of one row
    * __An Epoch__: one pass through the whole dataset, i.e. the perceptron has been trained on all rows
    
4. Suggested Reading: A list of cost functions used in neural networks, alongside applications (https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications)
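
A minimal sketch of the learning loop described above: predict y-hat, compare it with y through a cost function, feed the error back and update the weights, and repeat over all rows for several epochs. The notes do not spell out the cost formula, so the squared error C = ½(ŷ − y)² is used here as an assumption; the identity activation, the learning rate, the data and the names `predict`, `cost` and `train` are also just choices made for this example.

```python
def predict(weights, row):
    # Perceptron with an identity activation (an assumption for simplicity).
    return sum(w * x for w, x in zip(weights, row))

def cost(y_hat, y):
    # Assumed cost function: half the squared difference.
    return 0.5 * (y_hat - y) ** 2

def train(rows, targets, learning_rate=0.1, epochs=200):
    weights = [0.0] * len(rows[0])
    history = []                      # total cost per epoch
    for _ in range(epochs):           # one epoch = one pass over all rows
        total = 0.0
        for row, y in zip(rows, targets):
            y_hat = predict(weights, row)
            total += cost(y_hat, y)
            error = y_hat - y         # derivative of the cost w.r.t. y_hat
            # Feed the error back: nudge each weight against its gradient.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, row)]
        history.append(total)
    return weights, history

# Toy data generated from y = 2*x1 - 1*x2 (made up for the example).
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [2.0, -1.0, 1.0, 3.0]

weights, history = train(rows, targets)
print(weights)
print(history[0], history[-1])
```

On this noise-free toy data the cost can get very close to 0; with real data it usually settles at some positive value.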

## 5. Gradient Descent (or batch gradient descent)
1. An efficient way to solve the optimisation problem, where we are trying to minimise the Cost Function
2. A method used as an alternative to brute-forcing through zillions of combinations to find optimal weights, by minimising value of Cost Function. ![image.png](attachment:image.png)
    * A starting weight is chosen and the cost function is differentiated at that point to determine whether the slope is positive or negative. If the slope is negative, the curve is going downhill to the right, so we step to the right; if it is positive, we step to the left. Repeating this produces a zig-zag path down the curve.
    * The idea is to arrive at the point where the slope is 0, which is where the Cost Function is at its minimum ![image-2.png](attachment:image-2.png) 
    * The above image was an example of gradient descent applied in a 1-d space i.e. a curve
    * This is an example of gradient descent applied on a 2-d surface ![image-3.png](attachment:image-3.png)
        * It can be seen that with every step we get closer to the centre, where the minimum lies
    * This is an example of gradient descent applied on a 3-d object ![image-4.png](attachment:image-4.png)
        * the zig-zagging is plotted on a 2-d surface for clarity on right
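
The 1-d case can be sketched in a few lines. The cost curve C(w) = (w − 3)², its derivative, the starting weight and the learning rate are all made up for this illustration; a real network's cost depends on many weights at once.

```python
def cost(w):
    # Hypothetical convex cost curve with its minimum at w = 3.
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of the cost with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    path = [w]
    for _ in range(steps):
        # Negative slope -> w increases; positive slope -> w decreases.
        w -= learning_rate * slope(w)
        path.append(w)
    return w, path

w_final, path = gradient_descent(w=0.0)
print(w_final, cost(w_final))

# With a larger learning rate each step overshoots the minimum,
# so the path zig-zags from one side of the curve to the other.
_, zigzag = gradient_descent(w=0.0, learning_rate=0.9, steps=3)
print(zigzag)
```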
    

## 6. Stochastic Gradient Descent
1. The gradient descent method described above works for a convex cost function which has only one global minimum
2. If our cost function is not convex in shape, we might end up incorrectly finding a local minimum instead of a global minimum as shown below. This could result in suboptimal weight selection ![image.png](attachment:image.png)
3. A common remedy for this problem is stochastic gradient descent, which does not require the cost function to be convex
    * Here we run the NN on one row, check our cost function and adjust the weights. We adjust the weights after running the neural network on every single row, instead of running the NN on the whole batch and then adjusting the weights (as in batch gradient descent) ![image-2.png](attachment:image-2.png)
    * Stochastic gradient descent helps avoid ending up in a local minimum. The reason is that its fluctuations are much higher, so it is more likely to escape a local minimum and find the global minimum
    * Stochastic gradient descent is also usually faster than batch gradient descent, since each weight update needs only one row rather than the whole dataset
    * Batch gradient descent is a deterministic algorithm, i.e. the results are the same every time (if the starting weights are the same), whereas stochastic gradient descent is a probabilistic algorithm, i.e. the results can differ from run to run because the rows are picked at random.
4. There is another gradient descent method called mini-batch gradient descent, which combines the two: the weights are adjusted after each small batch of rows
5. Suggested Reading: 
    * A neural network in 13 lines of Python (Part 2 - Gradient Descent) by Andrew Trask (2015) https://iamtrask.github.io/2015/07/27/python-network-part2/
    * Neural Networks and Deep Learning, Michael Nielsen (2015) http://neuralnetworksanddeeplearning.com/chap2
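
The three variants differ only in how many rows are used before each weight update, which the following sketch makes explicit through a single `batch_size` argument. The one-weight model ŷ = w·x, the squared-error cost, the toy data and the function names are assumptions made for the illustration; the cost here is convex, so all three variants end up at the same weight and the example shows the mechanics only, not the local-minimum behaviour.

```python
import random

def gradient(w, batch):
    # Mean gradient of 0.5 * (w*x - y)**2 over the rows in the batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def descend(data, batch_size, learning_rate=0.05, epochs=100, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    rows = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(rows)       # visit the rows in a random order
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)   # one weight update
    return w

# Toy data generated from y = 2*x (made up for the example).
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

w_batch = descend(data, batch_size=len(data))  # batch: 1 update per epoch
w_sgd = descend(data, batch_size=1)            # stochastic: 1 update per row
w_mini = descend(data, batch_size=2)           # mini-batch: in between
print(w_batch, w_sgd, w_mini)
```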


## 7. Backpropagation
1. Forward propagation: information is entered in the input layer and propagated forward to get the y-hats (output values), which are then compared with the actual values to calculate the error (Cost function)
2. Backpropagation: the errors are then propagated back through the network in the opposite direction, which allows us to train the network by adjusting the weights.
    * All the weights are adjusted simultaneously
    * Suggested Reading: Neural Networks and Deep Learning, Michael Nielsen (2015) http://neuralnetworksanddeeplearning.com/chap2
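
A small sketch of one forward pass followed by one backward pass, for a network with 2 inputs, 2 sigmoid hidden neurons and a linear output. The architecture, the squared-error cost C = ½(ŷ − y)², the starting weights, the learning rate and the function names are all assumptions chosen to keep the example short. The gradients come from applying the chain rule from the output back towards the inputs, and every weight is then updated in the same step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward propagation: inputs -> hidden activations -> y_hat."""
    hidden = [sigmoid(sum(xi * wi for xi, wi in zip(x, ws)))
              for ws in w_hidden]
    y_hat = sum(h * w for h, w in zip(hidden, w_out))
    return hidden, y_hat

def backward(x, y, w_hidden, w_out):
    """Backpropagation: gradient of C = 0.5*(y_hat - y)**2 per weight."""
    hidden, y_hat = forward(x, w_hidden, w_out)
    d_out = y_hat - y                          # dC/dy_hat
    grad_out = [d_out * h for h in hidden]     # output-layer weights
    # Chain rule through the sigmoid: its derivative is h * (1 - h).
    grad_hidden = [[d_out * w * h * (1.0 - h) * xi for xi in x]
                   for h, w in zip(hidden, w_out)]
    return grad_hidden, grad_out

def train_step(x, y, w_hidden, w_out, learning_rate=0.5):
    grad_hidden, grad_out = backward(x, y, w_hidden, w_out)
    # All weights are adjusted together, using gradients from the same pass.
    new_hidden = [[w - learning_rate * g for w, g in zip(ws, gs)]
                  for ws, gs in zip(w_hidden, grad_hidden)]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out

# One hypothetical training row and illustrative starting weights.
x, y = [0.5, -1.0], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.3, -0.1]

cost_before = 0.5 * (forward(x, w_hidden, w_out)[1] - y) ** 2
for _ in range(50):
    w_hidden, w_out = train_step(x, y, w_hidden, w_out)
cost_after = 0.5 * (forward(x, w_hidden, w_out)[1] - y) ** 2
print(cost_before, cost_after)
```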
    

# Steps to build and Train the Artificial Neural Network
![image.png](attachment:image.png)