## McCulloch and Pitts Neuron

In 1943, McCulloch and Pitts introduced a mathematical model of a neuron. It consisted of three components:

1. A set of **weights** $w_i$ corresponding to synapses (inputs)
2. An **adder** that sums the input signals, analogous to the cell membrane that collects charge
3. An **activation function** for determining when the neuron fires, based on accumulated input
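
The three components above can be sketched in a few lines of Python. This is a minimal illustration, not the original 1943 formulation: the function name `mcp_neuron`, the hard threshold activation, and the example weights are all assumptions chosen for clarity.

```python
def mcp_neuron(inputs, weights, threshold):
    """Return 1 (fire) when the weighted sum of inputs reaches the threshold."""
    # Adder: accumulate the weighted input signals.
    total = sum(w * x for w, x in zip(weights, inputs))
    # Activation function: a hard threshold decides whether the neuron fires.
    return 1 if total >= threshold else 0


# With unit weights and a threshold of 2, the neuron behaves like a logical AND.
and_outputs = [mcp_neuron([a, b], [1, 1], threshold=2)
               for a in (0, 1) for b in (0, 1)]
```

Lowering the threshold to 1 with the same weights would turn the same neuron into a logical OR, which shows how much of the behaviour is carried by the parameters alone.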



![neuron](http://d.pr/i/9AMK+)


## Perceptron

A collection of McCulloch and Pitts neurons, along with a set of input nodes connected to the neurons via weighted edges, is a perceptron, the simplest neural network. The number of inputs and outputs is determined by the data. Weights are stored as an `N x K` matrix, with N input nodes and K neurons, with $w_{ij}$ specifying the weight on the connection from the *i*th input to the *j*th neuron.
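
As a rough sketch of how such a weight matrix is used, the snippet below computes the outputs of K threshold neurons for a single input vector. It assumes that rows of the matrix index input nodes and columns index neurons; the name `perceptron_forward`, the per-neuron thresholds, and the example values are hypothetical.

```python
def perceptron_forward(x, W, thresholds):
    """Compute the outputs of K threshold neurons for one input vector x.

    W is a list of N rows with K entries each: W[i][j] is the weight
    on the connection from input i to neuron j.
    """
    n_inputs, n_neurons = len(W), len(W[0])
    outputs = []
    for j in range(n_neurons):
        # Weighted sum of all inputs arriving at neuron j.
        total = sum(x[i] * W[i][j] for i in range(n_inputs))
        outputs.append(1 if total >= thresholds[j] else 0)
    return outputs


# Two inputs, two neurons: neuron 0 acts as AND, neuron 1 as OR.
W = [[1, 1],
     [1, 1]]
thresholds = [2, 1]
result = perceptron_forward([1, 0], W, thresholds)
```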

![perceptron](http://d.pr/i/4IWA+)

## Multi-layer Perceptron

The solution to fitting more complex (*i.e.* non-linear) models with neural networks is to use a more complex network that consists of more than just a single perceptron. The take-home message from the perceptron is that all of the learning happens by adapting the synapse weights until prediction is satisfactory. Hence, a reasonable guess at how to make a perceptron more complex is to simply **add more weights**.
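
XOR is the classic example of a function that a single layer of threshold neurons cannot represent, but a network with one hidden layer can. The sketch below uses hand-picked (not learned) weights and thresholds purely to illustrate the point; the function names are hypothetical.

```python
def step(total, threshold):
    """Hard threshold activation."""
    return 1 if total >= threshold else 0


def xor_network(a, b):
    # Hidden layer: one OR-like unit and one AND-like unit.
    h_or = step(a + b, 1)
    h_and = step(a + b, 2)
    # Output unit: fires when OR is on but AND is off.
    return step(h_or - h_and, 1)


xor_table = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
```

The extra layer adds three more weights and two more thresholds, and that is enough to carve out a region that no single linear boundary can separate.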


![multilayer](http://d.pr/i/14BS1+)



# Backpropagation

Backpropagation is a method for efficiently computing the gradient of the cost function of a neural network with respect to its parameters.  These partial derivatives can then be used to update the network's parameters using, e.g., gradient descent.  This may be the most common method for training neural networks.  Deriving backpropagation involves numerous clever applications of the chain rule for functions of vectors. 
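
To make the chain-rule idea concrete, here is a deliberately tiny sketch: one input, one sigmoid hidden unit, one linear output, and a squared-error cost. The architecture, the names (`forward`, `backprop`, `numerical_gradient`), and the example values are assumptions for illustration; a finite-difference check is included only to show that the analytic gradient agrees with a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """One input, one sigmoid hidden unit, one linear output."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return h, y_hat


def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y_hat - y) ** 2


def backprop(x, y, w1, w2):
    """Gradient of the loss with respect to (w1, w2) via the chain rule."""
    h, y_hat = forward(x, w1, w2)
    d_yhat = y_hat - y              # dL/dy_hat
    d_w2 = d_yhat * h               # dL/dw2
    d_h = d_yhat * w2               # dL/dh, passed back to the hidden unit
    d_w1 = d_h * h * (1.0 - h) * x  # uses sigmoid'(z) = h * (1 - h)
    return d_w1, d_w2


def numerical_gradient(x, y, w1, w2, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    g1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)
    g2 = (loss(x, y, w1, w2 + eps) - loss(x, y, w1, w2 - eps)) / (2 * eps)
    return g1, g2


analytic = backprop(x=1.5, y=0.3, w1=0.4, w2=-0.7)
numeric = numerical_gradient(x=1.5, y=0.3, w1=0.4, w2=-0.7)
```

Note that `backprop` reuses the intermediate values from the forward pass, which is where the efficiency comes from: each partial derivative is computed once and passed backwards, rather than re-evaluating the network for every parameter as the finite-difference check does.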


![bp](https://theclevermachine.files.wordpress.com/2014/09/neural-net.png)

Gradient Descent
---
The simplest algorithm for iterative minimization of differentiable functions is known as just **gradient descent**.
Recall that the gradient of a function is defined as the vector of partial derivatives:

$$\nabla f(x) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$$

and that the gradient of a function always points towards the direction of maximal increase at that point.

Equivalently, it points *away* from the direction of maximum decrease. Thus, if we start at any point and keep taking sufficiently small steps in the direction of the negative gradient, we will move toward a local minimum (or, more generally, a point where the gradient is zero).

This simple insight leads to the Gradient Descent algorithm. Outlined algorithmically, it looks like this:

1. Pick a point $x_0$ as your initial guess.
2. Compute the gradient at your current guess:
$v_i = \nabla f(x_i)$
3. Move by $\alpha$ (your step size) in the direction opposite to that gradient:
$x_{i+1} = x_i - \alpha v_i$
4. Repeat steps 2-3 until the gradient is close enough to zero (until $\|\nabla f(x_i)\| < \varepsilon$ for some small tolerance $\varepsilon$)

Note that the step size, $\alpha$, is simply a parameter of the algorithm and has to be fixed in advance. 
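
A minimal sketch of the procedure, assuming the update steps *against* the gradient (since we are minimizing) and stops once the gradient norm falls below a tolerance. The function name, default values, iteration cap, and test function are all illustrative choices, not part of the algorithm's definition.

```python
def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient, starting from x0."""
    x = list(x0)
    for _ in range(max_iter):
        v = grad(x)
        # Stop when the gradient is (numerically) zero.
        if sum(g * g for g in v) ** 0.5 < tol:
            break
        # Step against the gradient, scaled by the step size alpha.
        x = [xi - alpha * gi for xi, gi in zip(x, v)]
    return x


# f(x, y) = (x - 3)^2 + (y + 1)^2 has its minimum at (3, -1).
def grad_f(p):
    return [2 * (p[0] - 3), 2 * (p[1] + 1)]


minimum = gradient_descent(grad_f, x0=[0.0, 0.0])
```

On this particular function, any step size of 1.0 or larger fails to converge, while a very small one converges slowly, which is why choosing $\alpha$ matters in practice.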

![gd](http://ludovicarnold.altervista.org/wp-content/uploads/2015/01/gradient-trajectory.png)

# Activation Functions:
* Sigmoid
* Tanh
* ReLU
* Leaky ReLU
* ELU
* Softmax
* PReLU
* Swish
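
For reference, the functions listed above can be written as plain scalar Python functions (softmax takes a list). This is a sketch using common textbook definitions; the default parameter values (`slope=0.01`, `alpha=1.0`, `beta=1.0`) are conventional choices assumed here, not values given in these notes.

```python
import math


def sigmoid(z):
    # Branch on the sign of z to avoid overflow in exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):
    # Small fixed slope for negative inputs.
    return z if z > 0 else slope * z


def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def prelu(z, a):
    # Same form as leaky ReLU, but the slope `a` is a learned parameter.
    return z if z > 0 else a * z


def swish(z, beta=1.0):
    return z * sigmoid(beta * z)


def softmax(zs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```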


# Loss Functions:
## ML algorithms:
* 01 Linear Regression - L2 Loss Function
* 02 Logistic Regression - Cross Entropy Loss Function
* 03 Ridge Regression - Regularized L2 Loss Function
* 04 Linear Models of Regression - Regularized L2 Loss Function
* 05 K-Means - L2 Loss Function
* 06 Kohonen Self Organizing Map (KSOM) - L2 Loss Function
* 07 SVM Classification - Hinge Loss Function
* 08 SVM Regression - Epsilon-Insensitive Loss Function
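
The loss functions named in this list can be sketched as follows. These are common textbook forms and rest on several assumptions: losses are averaged over examples, the regularizer is a squared L2 penalty weighted by `lam`, cross entropy is the binary version with labels in {0, 1}, and the hinge loss expects labels in {-1, +1}. All function and parameter names are illustrative.

```python
import math


def l2_loss(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def regularized_l2_loss(y_true, y_pred, weights, lam):
    """L2 loss plus a squared-norm penalty on the weights (ridge)."""
    return l2_loss(y_true, y_pred) + lam * sum(w * w for w in weights)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def hinge_loss(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw classifier scores."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return sum(max(0.0, abs(t - p) - epsilon)
               for t, p in zip(y_true, y_pred)) / len(y_true)
```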


# Optimizers:

1. Gradient Descent:
> Batch GD, Mini-Batch GD, Stochastic GD

2. Momentum GD
3. Adagrad
4. Adadelta
5. RMSProp
6. Adam
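
As a rough illustration of how two of these optimizers modify the plain gradient descent update, the sketch below implements momentum and Adam for a one-dimensional problem. The update rules follow their commonly cited forms; the hyperparameter defaults, step counts, function names, and test function are assumptions made for this example.

```python
import math


def momentum_gd(grad, x, alpha=0.1, beta=0.9, steps=500):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(x)
        x = x - alpha * velocity
    return x


def adam(grad, x, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: running averages of the gradient and of its square."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x


# f(x) = (x - 3)^2 has its minimum at x = 3.
def grad_f(x):
    return 2 * (x - 3)


x_momentum = momentum_gd(grad_f, x=0.0)
x_adam = adam(grad_f, x=0.0)
```

Adagrad, Adadelta, and RMSProp follow the same pattern of rescaling each step by a running statistic of past squared gradients; Adam can be read as combining that rescaling with a momentum-style average of the gradient.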