### Lecture 3: Introduction to Neural Networks

##### 1) Backpropagation:
* <u>**Chain rule:**</u> To ease gradient computations, we use a computational graph, in which the chain rule can be applied at every node: current gradient = [upstream gradient] × [local gradient] $$ \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}$$
* <u>**Sigmoid gate:**</u> $$\sigma (x) = \frac{1}{1+e^{-x}} \rightarrow \sigma '(x) = (1-\sigma (x))\cdot \sigma (x)$$
* <u>**Add gate:**</u> Acts as a gradient distributor. During backpropagation, it passes the upstream gradient unchanged to each of its inputs.
* <u>**Max gate:**</u> Acts as a gradient router. During backpropagation, it passes the upstream gradient to the input that was the maximum in the forward pass; the other inputs get 0.
* <u>**Multiplication gate:**</u> Acts as a gradient switcher. During backpropagation, it scales the upstream gradient for each input by the value of the other input.
* <u>**Sum of multiple entries:**</u> When $x$ feeds several nodes $q_i$, the gradients flowing back along each branch are summed: $$ \frac{\partial f}{\partial x} = \sum_i \frac{\partial f}{\partial q_i} \cdot \frac{\partial q_i}{\partial x} $$
* <u>**Vectorized operations:**</u> $\frac{\partial f}{\partial x}$ is now a Jacobian matrix of size $N_{output} \times N_{input}$; for an elementwise gate, $N_{output} = N_{input}$ (usually, we process a full minibatch at a time, so $ N_{input} = sampleSize \cdot \#samples $). With the Jacobian entries defined as $\frac{\partial f_i}{\partial x_j}$: $$ \frac{\partial L}{\partial x} = \left(\frac{\partial f}{\partial x}\right)^{T} \cdot \frac{\partial L}{\partial f} $$
    - The gradient with respect to a variable should have the same shape as the variable
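
The gate rules above can be checked on a small scalar graph. The following is a minimal sketch, not part of the lecture material: the function $f = \sigma(\max((x + y) \cdot x, w))$, the input values, and the helper names `forward_backward` and `numerical_gradient` are all chosen here for illustration. It runs a forward pass, backpropagates gate by gate, and compares each analytic gradient with a centered finite difference.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_backward(x, y, w):
    """Evaluate f = sigmoid(max((x + y) * x, w)) and its gradients."""
    # Forward pass, keeping every intermediate value for the backward pass.
    q = x + y           # add gate
    p = q * x           # multiplication gate (x is used a second time here)
    m = max(p, w)       # max gate
    f = sigmoid(m)      # sigmoid gate

    # Backward pass: current gradient = upstream gradient * local gradient.
    dm = (1.0 - f) * f                  # sigmoid'(m) = (1 - sigmoid(m)) * sigmoid(m)
    dp = dm if p >= w else 0.0          # max gate routes the gradient to the larger input
    dw = dm if w > p else 0.0
    dq = x * dp                         # multiplication gate: scale by the other input
    dx_from_mul = q * dp
    dx_from_add = dq                    # add gate: pass the gradient on unchanged
    dy = dq
    dx = dx_from_mul + dx_from_add      # x feeds two gates, so its gradients are summed
    return f, {"x": dx, "y": dy, "w": dw}

def numerical_gradient(name, values, h=1e-6):
    """Centered finite difference, used only to check the analytic gradients."""
    plus, minus = dict(values), dict(values)
    plus[name] += h
    minus[name] -= h
    return (forward_backward(**plus)[0] - forward_backward(**minus)[0]) / (2 * h)

values = {"x": 0.5, "y": 1.0, "w": 0.2}   # arbitrary illustrative inputs
f, grads = forward_backward(**values)
for name in sorted(values):
    print(name, grads[name], numerical_gradient(name, values))
```

With these inputs the max gate selects the product branch, so `w` receives a zero gradient, while `x` collects contributions from both the add gate and the multiplication gate.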

Modularized implementation of forward/backward propagation (the Caffe layers library contains many implementations of computational nodes):

In [None]:
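class ComputationalGraph(object):
    def __init__(self, graph):
        # Assumption: `graph` exposes nodes_topologically_sorted(), and each gate
        # reads its inputs from, and writes its results to, state shared via the graph.
        self.graph = graph
    def forward(self, inputs):
        # Assumption: `inputs` have already been loaded into the graph's input gates.
        loss = None
        for gate in self.graph.nodes_topologically_sorted():
            loss = gate.forward()  # the last gate in topological order outputs the loss
        return loss
    def backward(self):
        input_gradients = None
        for gate in reversed(self.graph.nodes_topologically_sorted()):
            input_gradients = gate.backward()  # chain rule applied one gate at a time
        return input_gradients  # returned by the first gate, which is visited last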

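class MultiplyGate(object):
    def __init__(self):
        pass
    def forward(self, x, y):
        z = x * y
        # Cache the inputs: each one is the local gradient of the other.
        self.x = x
        self.y = y
        return z
    def backward(self, dz):
        # dz is the upstream gradient dL/dz.
        dx = self.y * dz  # dL/dx = dL/dz * dz/dx
        dy = self.x * dz  # dL/dy = dL/dz * dz/dy
        return [dx, dy]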
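
The same forward/backward interface extends to the vectorized case. The sketch below is an illustration under stated assumptions: `ReLUGate` and `jacobian` are hypothetical names introduced here, vectors are plain Python lists, and the upstream gradient is arbitrary. It shows that for an elementwise gate the Jacobian is diagonal, so the backward pass reduces to an elementwise mask and the gradient keeps the shape of the input.

```python
class ReLUGate(object):
    """Hypothetical vector gate: elementwise max(0, x) on a list of floats."""
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return [max(0.0, x_i) for x_i in x]
    def backward(self, dout):
        # The Jacobian is diagonal (1 where x_i > 0, else 0), so multiplying by it
        # reduces to an elementwise mask; the matrix itself is never built here.
        return [d_i if x_i > 0 else 0.0 for x_i, d_i in zip(self.x, dout)]

def jacobian(x):
    """Explicit N x N Jacobian of elementwise max(0, x), for comparison only."""
    n = len(x)
    return [[1.0 if (i == j and x[j] > 0) else 0.0 for j in range(n)]
            for i in range(n)]

x = [1.0, -2.0, 3.0, -1.0]
upstream = [0.2, -0.5, 0.1, 0.3]  # dL/df, chosen arbitrarily

gate = ReLUGate()
out = gate.forward(x)
dx = gate.backward(upstream)

# Same result via the full Jacobian: dL/dx_j = sum_i J[i][j] * dL/df_i.
J = jacobian(x)
dx_full = [sum(J[i][j] * upstream[i] for i in range(len(x)))
           for j in range(len(x))]
print(out, dx, dx_full)
```

Building the full matrix costs $N^2$ entries for an $N$-element input, which is why the masked form is the one worth keeping inside a gate.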

##### 2) Neural Networks:
* <u>**1 layer:**</u> Linear layer: $ f = Wx$
* <u>**2 layers:**</u> Add a non-linearity so that the network does not collapse into a single linear layer: $ f = W_2 \cdot \max(0, W_1 \cdot x)$
* <u>**3 layers:**</u> Expansion of the idea: $ f = W_3 \cdot \max(0, W_2 \cdot \max(0, W_1 \cdot x))$
* <u>**Perceptron:**</u> General idea of a neuron (action potential): $ output_{neuron} = f(\sum_i w_i x_i + b)$ (where $f$ is an activation function and $b$ is the neuron's bias)
* <u>**Activation functions:**</u> Sigmoid, ReLU, tanh, Leaky ReLU, Maxout, ELU...
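
The formulas above can be sketched in a few lines of plain Python. This is an illustration only: the helper names (`matvec`, `matmul`, `relu`, `two_layer`, `neuron`) and the weight values are assumptions made for the example, with matrices stored as lists of rows. It evaluates the two-layer network, shows that dropping the non-linearity gives the same result as the single matrix $W_2 W_1$, and computes one neuron's output.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, v_i) for v_i in v]

def two_layer(W1, W2, x):
    """f = W2 . max(0, W1 . x)"""
    return matvec(W2, relu(matvec(W1, x)))

def neuron(w, x, b, activation):
    """Single neuron: activation(sum_i w_i * x_i + b)."""
    return activation(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

# Illustrative weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -1.0], [-2.0, 0.5], [0.5, 1.0]]
W2 = [[1.0, 1.0, -1.0]]
x = [1.0, 2.0]

with_relu = two_layer(W1, W2, x)
without_relu = matvec(W2, matvec(W1, x))   # no non-linearity between the layers
collapsed = matvec(matmul(W2, W1), x)      # one linear layer with W = W2 W1

print(with_relu, without_relu, collapsed)
print(neuron([0.5, -1.0], x, 0.25, math.tanh))
```

Here `without_relu` and `collapsed` agree, whereas `with_relu` differs, because the ReLU zeroes the two hidden units whose pre-activations are negative.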