# Deep Learning Tutorials - Chapter 3 - Multi-layer perceptron and back-propagation.

# 3.1 Perceptron

- The first mathematical model of a neuron was the threshold logic unit, which takes Boolean inputs and produces a Boolean output, so it can implement logic gates.
- **Perceptrons** generalize this binary neuron model to real-valued inputs. The model is inspired by biology: the $w_{i}$ play the role of synaptic weights, and the $x_{i}$ and $f$ play the role of firing rates. It remains a very crude biological model.

Perceptrons can be modeled mathematically as follows:

![per](./images/perceptron.png)

$$f(x) = \sigma(w \cdot x + b)$$

This mathematical model can be visualized like this:

![mm](./images/math-model.png)

The model becomes simpler when we treat $x$, $w$, and $b$ as tensors:

![tm](./images/tensor-model.png)
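
To make the formula concrete, here is a minimal sketch of a single unit in plain Python. It assumes that $\sigma$ is the logistic sigmoid; the names `sigmoid` and `perceptron` and the sample values are illustrative and not part of the original notes.

```python
import math

def sigmoid(z):
    # Logistic sigmoid, assumed here as the activation sigma
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b):
    # f(x) = sigma(w . x + b) for one input vector x
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
out = perceptron(x, w, b)  # w . x + b = 0.5 - 0.5 + 0.0 = 0.0
```

With these values the weighted sum is exactly zero, so the unit outputs the midpoint of the sigmoid.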

Here is a training algorithm for the perceptron:

In [1]:
import torch

In [2]:
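def train_perceptron(x, y, nb_epochs_max):
    # x: (N, D) float tensor of samples, y: (N,) tensor of labels in {-1, +1}
    w = torch.zeros(x.size(1))

    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):
            # a sample is misclassified when its score has the wrong sign
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes += 1
        # stop as soon as a full pass over the data makes no update
        if nb_changes == 0:
            break
    return w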
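
As a toy run of the same update rule, the sketch below restates the algorithm in plain Python so it can be executed without PyTorch. The data, the function name `train_perceptron_py`, and the trick of appending a constant feature of 1 to act as the bias are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron_py(xs, ys, nb_epochs_max):
    w = [0.0] * len(xs[0])
    for _ in range(nb_epochs_max):
        nb_changes = 0
        for x, y in zip(xs, ys):
            if dot(w, x) * y <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                nb_changes += 1
        if nb_changes == 0:  # a full pass with no update: converged
            break
    return w

# Linearly separable toy data; the last feature is a constant 1 (bias)
xs = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]]
ys = [1, 1, -1, -1]

w = train_perceptron_py(xs, ys, nb_epochs_max=100)
all_correct = all(dot(w, x) * y > 0 for x, y in zip(xs, ys))
```

On this data a single update on the first sample is enough, and the second pass confirms that every sample is on the correct side.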

# Dense layer of a multilayer perceptron in TensorFlow

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super(MyDenseLayer,self).__init__()

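        # Initialize weights and biases (shape passed by keyword, since the
        # first positional argument of add_weight differs across Keras versions)
        self.W = self.add_weight(shape=[input_dim, output_dim])
        self.b = self.add_weight(shape=[1, output_dim])
    
    def call(self, inputs):
        # forward propagate our inputs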
        z = tf.matmul(inputs,self.W) + self.b

        # feed through non-linearity
        output = tf.math.sigmoid(z)

        return output

A **dense** layer (also called a fully connected layer) gets its name because every input is connected to every output.
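
The "every input to every output" structure can be sketched in plain Python. The function name `dense_forward` and the weight values are hypothetical; the computation mirrors `sigmoid(inputs @ W + b)` for a single input vector.

```python
import math

def dense_forward(inputs, W, b):
    """inputs: input_dim floats, W: input_dim x output_dim, b: output_dim."""
    outputs = []
    for j in range(len(b)):
        # output j receives a weighted contribution from every input i
        z = sum(inputs[i] * W[i][j] for i in range(len(inputs))) + b[j]
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

input_dim, output_dim = 3, 2
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]
b = [0.0, 0.0]

out = dense_forward([1.0, 2.0, 3.0], W, b)
# one weight per (input, output) pair plus one bias per output
nb_params = input_dim * output_dim + output_dim
```

The parameter count grows with the product of the two dimensions, which is why wide dense layers are expensive.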

- The perceptron stops as soon as it finds a separating boundary.
- Other algorithms maximize the distance of the samples to the decision boundary, which improves robustness to noise.
- Support Vector Machines (SVM) achieve this by minimizing

$$L(w,b) = \lambda \|w\|^{2} + \frac{1}{N} \sum_{n} \max(0, 1 - y_{n} (w \cdot x_{n} + b))$$

which is convex and has a global optimum.



#### Hinge Loss

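The term

$$\max(0, 1 - \alpha)$$

is called the "hinge loss". In the SVM objective, $\alpha = y_{n} (w \cdot x_{n} + b)$, so a sample costs nothing once it is on the correct side with a margin of at least 1.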

![hinge](./images/hinge.png)
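
A small numeric sketch of the SVM objective and its hinge term in plain Python. The function names, the toy samples, and the value of `lam` (standing for $\lambda$) are assumptions chosen for illustration.

```python
def hinge(alpha):
    # max(0, 1 - alpha): zero once the margin alpha reaches 1
    return max(0.0, 1.0 - alpha)

def svm_objective(w, b, xs, ys, lam):
    reg = lam * sum(wi * wi for wi in w)  # lambda * ||w||^2
    margins = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x, y in zip(xs, ys)]
    return reg + sum(hinge(m) for m in margins) / len(xs)

xs = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
ys = [1, 1, -1]

# margins are 2.0, 0.5 and 1.0, so only the second sample is penalized
value = svm_objective([1.0, 0.0], 0.0, xs, ys, lam=0.1)
```

Only the sample inside the margin contributes to the data term; the other two are classified with enough margin and cost nothing.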



# Bonus MIT Introduction to Deep Learning | 6.S191

**Artificial Intelligence**: Any technique that lets a computer mimic human behavior.

**Machine Learning**: Teaching a machine to learn without explicitly programming it; the computer learns from experience. (Typically the feature space has to be selected manually, which is a big challenge.)

**Deep Learning**: Learning both the features and the task directly from raw data.

#### Activation function

- **Activation functions** are at the heart of every deep learning model. Their main purpose is to introduce non-linearity so that the network can fit non-linear patterns in the data. Without them, a stack of layers collapses into a single linear map, so this non-linearity is a large part of what makes deep learning so powerful.
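
The role of the non-linearity can be checked on a scalar toy network. In this sketch the coefficients `a1`, `b1`, `a2`, `b2` are arbitrary values picked for illustration: two stacked linear layers equal one linear layer, while inserting a ReLU between them breaks that equivalence.

```python
def relu(z):
    return max(0.0, z)

a1, b1 = 2.0, 1.0   # first layer: a1 * x + b1
a2, b2 = -3.0, 0.5  # second layer: a2 * h + b2

def two_linear_layers(x):
    return a2 * (a1 * x + b1) + b2

def single_linear_layer(x):
    # the same function written as one layer
    return (a2 * a1) * x + (a2 * b1 + b2)

def with_relu(x):
    return a2 * relu(a1 * x + b1) + b2

xs = [-1.0, 0.0, 1.0]
collapses = all(abs(two_linear_layers(x) - single_linear_layer(x)) < 1e-12
                for x in xs)
# any affine function f satisfies f(-1) + f(1) == 2 * f(0)
is_affine_with_relu = abs(with_relu(-1.0) + with_relu(1.0)
                          - 2.0 * with_relu(0.0)) < 1e-12
```

The affine test fails for the ReLU version, which shows that it represents something a single linear layer cannot.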

How to build a deep neural network model using TensorFlow:

![dnn](./images/dnn.png)

In [None]:
import tensorflow as tf

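n = 32  # number of units per layer (placeholder value)

# each hidden layer needs a non-linear activation, otherwise the stack is linear
model = tf.keras.Sequential([
    tf.keras.layers.Dense(n, activation="relu"),
    tf.keras.layers.Dense(n, activation="relu"),
    tf.keras.layers.Dense(n),
])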

#### Loss

More definitions:
- The **empirical loss** measures the total loss over the entire dataset.

The goal of training is to minimize the empirical loss.

Use cases:
- Cross-entropy loss = classification (binary cross entropy for two classes, softmax cross entropy for several classes)
- Mean squared error loss = regression models that output continuous real numbers
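
Both losses can be written in a few lines of plain Python. This is a sketch under stated assumptions: the function names are illustrative, the cross entropy takes raw scores (logits) and an integer class label, and the empirical loss is taken as the average of the per-sample losses, which is a common convention.

```python
import math

def softmax_cross_entropy(logits, label):
    # -log(softmax(logits)[label]), computed stably with log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def empirical_loss(per_sample_losses):
    # average loss over the dataset
    return sum(per_sample_losses) / len(per_sample_losses)

# two samples with uninformative logits: each costs log(2)
ce = empirical_loss([softmax_cross_entropy([0.0, 0.0], 0),
                     softmax_cross_entropy([0.0, 0.0], 1)])
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
```

A classifier that assigns equal scores to two classes pays $\log 2$ per sample, a useful reference value when reading training curves.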

#### Gradient Descent

An algorithm that searches for a minimum of the loss iteratively, by repeatedly stepping in the direction opposite to the gradient.

In [None]:
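import tensorflow as tf

lr = 0.01  # learning rate (placeholder value)
weights = tf.Variable(tf.random.normal([1]))

while True:  # loops forever; in practice, stop on convergence
    with tf.GradientTape() as g:
        loss = compute_loss(weights)  # compute_loss is a user-defined placeholder
    gradient = g.gradient(loss, weights)

    # update in place so that `weights` stays a tf.Variable
    weights.assign_sub(lr * gradient)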

Learning rates (`lr`) are important in gradient descent:

- small learning rates converge slowly and may get stuck in poor local minima
- large learning rates may overshoot, become unstable, and diverge
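
The effect of the step size is easy to see on the toy function $f(w) = w^{2}$, whose gradient is $2w$. The function, the starting point, and the three learning rates below are assumptions picked for illustration; since this function is convex, the sketch shows slow convergence and divergence but not local minima.

```python
def gradient_descent(lr, w0=1.0, nb_steps=20):
    w = w0
    for _ in range(nb_steps):
        grad = 2.0 * w  # derivative of f(w) = w**2
        w = w - lr * grad
    return w

w_small = gradient_descent(lr=0.01)  # each step multiplies w by 0.98: slow
w_good = gradient_descent(lr=0.4)    # each step multiplies w by 0.2: fast
w_large = gradient_descent(lr=1.1)   # each step multiplies w by -1.2: diverges
```

After the same number of steps, the small rate has barely moved, the moderate rate is essentially at the minimum, and the large rate has moved far away from it.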

Examples of gradient descent optimizers:

- SGD `tf.keras.optimizers.SGD`
- Adam `tf.keras.optimizers.Adam`
- Adadelta `tf.keras.optimizers.Adadelta`
- Adagrad `tf.keras.optimizers.Adagrad`
- RMSprop `tf.keras.optimizers.RMSprop`

Putting the whole lesson together:

In [None]:
import tensorflow as tf

model = tf.keras.Sequential([...])

# pick your favorite optimizer
optimizer = tf.keras.optimizers.SGD()

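while True:  # in practice, loop over epochs and batches

    with tf.GradientTape() as tape:
        # forward pass through the network (it must run inside the tape
        # so that the operations are recorded for differentiation)
        prediction = model(x)

        # compute loss (compute_loss, x and y are placeholders)
        loss = compute_loss(y, prediction)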
    
    # update the weights using the gradient
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

Overfitting is a big issue in deep learning. The main way to combat it is **regularization**.

**Regularization I**: Dropout
- During training, randomly set some activations to 0
- Typically drop 50% of the activations in a layer
- Forces the network not to rely on any one node
- `tf.keras.layers.Dropout(rate=0.5)`
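
A minimal sketch of what a dropout layer does, in plain Python. It uses the "inverted dropout" convention, in which the surviving activations are scaled by `1 / (1 - rate)` during training so that nothing changes at inference time. The function name and the seeded generator are assumptions for this illustration.

```python
import random

def dropout(activations, rate, rng, training=True):
    # at inference time the layer is the identity
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    # drop each activation with probability `rate`, rescale the survivors
    return [a / keep_prob if rng.random() >= rate else 0.0
            for a in activations]

rng = random.Random(0)
acts = [1.0] * 8

train_out = dropout(acts, rate=0.5, rng=rng)
eval_out = dropout(acts, rate=0.5, rng=rng, training=False)
```

With a rate of 0.5, each training output is either 0 or twice the input, so the expected value of every activation is preserved.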

**Regularization II**: Early Stopping
- Stop training before we have a chance to overfit, for example when the loss on held-out validation data stops improving
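
One common way to decide when to stop is to watch a validation loss and give up after it has failed to improve for a few epochs. The sketch below assumes the validation losses are already available as a list; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def early_stopping(val_losses, patience):
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stopped_epoch = epoch  # give up here
                break
    return best_epoch, stopped_epoch

# validation loss decreases, then rises again as the model overfits
val_losses = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.7, 0.9]
best_epoch, stopped_epoch = early_stopping(val_losses, patience=2)
```

In a real training loop one would also keep a copy of the weights from `best_epoch` and restore them after stopping.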

# 3.2 Probabilistic view of a linear classifier

The Linear Discriminant Analysis (LDA) algorithm provides a nice bridge between these linear classifiers and probabilistic modeling.

Consider the following two populations:

![2pop](./images/2pop.png)

That is, they are Gaussian with the **same covariance matrix** $\Sigma$. This is the homoscedasticity assumption.

Intuitively, we can map the data linearly so that every covariance matrix becomes the identity. In that space the Bayesian decision boundary is a plane, so it is also a plane in the original space.
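
The link between this Gaussian model and a linear classifier can be checked numerically in one dimension. In the sketch below, the class means `mu0` and `mu1`, the shared standard deviation `sigma`, and the priors are hypothetical values; the posterior computed with Bayes' rule should match a sigmoid applied to a linear function of $x$.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

mu0, mu1, sigma = -1.0, 2.0, 1.5  # hypothetical means, shared std
p0, p1 = 0.5, 0.5                 # class priors

# linear score obtained by expanding the log-ratio of the two densities
w = (mu1 - mu0) / sigma ** 2
b = (mu0 ** 2 - mu1 ** 2) / (2.0 * sigma ** 2) + math.log(p1 / p0)

def posterior_bayes(x):
    n0 = p0 * gaussian_pdf(x, mu0, sigma)
    n1 = p1 * gaussian_pdf(x, mu1, sigma)
    return n1 / (n0 + n1)

max_gap = max(abs(posterior_bayes(x) - sigmoid(w * x + b))
              for x in [-3.0, -1.0, 0.0, 0.5, 2.0, 4.0])
```

The quadratic terms in $x$ cancel only because both classes share the same variance, which is exactly where the homoscedasticity assumption is used.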

#### Loss Function

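The sigmoid function is a soft version of the Heaviside step function. Under the LDA model it gives the posterior probability of a class as a function of the linear score, and the popular logistic loss is the negative log of that probability.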

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

# 3.3 Linear Separability and feature design

The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable. In short, it is difficult to model data when the only available decision boundaries are straight lines (hyperplanes).

Training a model composed of manually engineered features followed by a parametric model such as logistic regression is now referred to as "shallow learning".

The signal goes through a single processing stage trained from data.
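
XOR is the classic example of both points. In this sketch, a perceptron never converges on the raw XOR inputs, but it converges once a hand-designed feature, the product $x_{1} x_{2}$, is added. The function names, the $\pm 1$ encoding, and the constant bias feature are assumptions made for the illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron_converges(xs, ys, nb_epochs_max=100):
    w = [0.0] * len(xs[0])
    for _ in range(nb_epochs_max):
        nb_changes = 0
        for x, y in zip(xs, ys):
            if dot(w, x) * y <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                nb_changes += 1
        if nb_changes == 0:
            return True  # a full pass without mistakes
    return False

points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
ys = [-1, 1, 1, -1]  # XOR: positive when the two inputs differ

# raw inputs plus a constant 1 for the bias
raw = [[x1, x2, 1.0] for x1, x2 in points]
# same, with the engineered feature x1 * x2
engineered = [[x1, x2, x1 * x2, 1.0] for x1, x2 in points]

raw_ok = perceptron_converges(raw, ys)
engineered_ok = perceptron_converges(engineered, ys)
```

The classifier itself is unchanged; only the representation of the data differs, which is the essence of feature design.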



# 3.4 Multilayer Perceptron

Even though it has no practical value implementation-wise, we can represent such a model as a combination of units

![mlp101](./images/mlp101.png)

#### Activation functions 

An activation function introduces non-linearity to our model. Here are the two most common

![activation](./images/act.png)

#### ReLU

The ReLU activation function is piecewise linear, and a linear combination of translated and scaled ReLU functions can approximate any continuous function on a bounded interval with straight-line segments. Below you can see a polynomial being fitted with such ReLUs:

![relu](./images/Relu.png)
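
The construction can be reproduced by hand. This sketch fits $x^{2}$ on $[0, 1]$ with a sum of shifted ReLUs whose coefficients are the changes of slope at each knot; the target function, the knot positions, and the names are assumptions chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

def target(x):
    return x ** 2

knots = [0.0, 0.25, 0.5, 0.75, 1.0]
# slope of the straight segment between consecutive knots
slopes = [(target(b) - target(a)) / (b - a) for a, b in zip(knots, knots[1:])]
# coefficient of relu(x - knot): the change of slope at that knot
coeffs = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]

def relu_fit(x):
    # zip stops at the last coefficient, so the final knot gets no ReLU
    return target(knots[0]) + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

grid = [i / 100 for i in range(101)]
max_error = max(abs(relu_fit(x) - target(x)) for x in grid)
```

Each additional knot adds one ReLU unit and shrinks the error, which matches the picture of a wider hidden layer giving a finer piecewise-linear fit.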

#### General notes on training/testing

- **Caveat:** a better approximation requires a larger hidden layer (larger `k`), and the universal approximation theorem says nothing about the relation between the two.
- So this result states that we can make the training error as low as we want by using a larger hidden layer. It states nothing about the test error.
- Deploying an MLP in practice is often a balancing act between under-fitting and over-fitting, which we can manage through regularization (dropout, early stopping).




# 3.5 Gradient Descent

Gradient descent is an iterative algorithm used to find a local minimum of a function. Refer to the MIT notes above for more details.

Python implementation in PyTorch:

In [None]:
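import torch

def loss(x, y, w, b):
    # logistic loss (assumed here, consistent with the gradient below):
    # -sum_n log(sigmoid(y_n * (w . x_n + b)))
    return -(y * (x @ w + b)).sigmoid().log().sum()

def gradient(x, y, w, b):
    u = y * (-y * (x @ w + b)).sigmoid()
    v = x * u.view(-1, 1)  # broadcasting over the feature dimension
    return -v.sum(0), -u.sum(0)

# Gradient descent algorithm
# x: (N, D) float tensor, y: (N,) float tensor of labels in {-1, +1} (assumed defined)
nb_iterations = 100  # placeholder value
w, b = torch.empty(x.size(1)).normal_(), 0  # weights and bias
eta = 1e-1  # learning rate

for k in range(nb_iterations):
    print(k, loss(x, y, w, b))
    dw, db = gradient(x, y, w, b)
    w -= eta * dw
    b -= eta * db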

# 3.6 Backpropagation

MLPs, and deep learning in general, rely on backpropagation to compute the gradients used during gradient descent.

![backprop](./images/backprop.png)

![help](./images/helpful-3.6.png)

- The backward pass is a simple algorithm: apply the chain rule again and again.

- The forward pass can be expressed in tensorial form. Heavy computation is concentrated in linear operations, and all the non-linearities go into component-wise operations.

- Without tricks, we have to keep in memory all the activations computed during the forward pass.
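
These three points can be seen on a tiny scalar network, $\hat{y} = w_{2} \tanh(w_{1} x + b_{1}) + b_{2}$ with a squared-error loss. The architecture, the parameter values, and the function names in this sketch are assumptions for illustration; the hand-written backward pass is compared against finite differences.

```python
import math

def forward(x, params):
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1      # linear operation
    h = math.tanh(z1)     # component-wise non-linearity
    y_hat = w2 * h + b2   # linear operation
    return z1, h, y_hat

def loss_fn(x, y, params):
    return (forward(x, params)[2] - y) ** 2

def backward(x, y, params):
    w1, b1, w2, b2 = params
    # the activations from the forward pass are needed here
    _, h, y_hat = forward(x, params)
    d_y_hat = 2.0 * (y_hat - y)     # d loss / d y_hat
    d_w2, d_b2 = d_y_hat * h, d_y_hat
    d_h = d_y_hat * w2              # chain rule through the second layer
    d_z1 = d_h * (1.0 - h ** 2)     # tanh'(z1) = 1 - tanh(z1)**2
    d_w1, d_b1 = d_z1 * x, d_z1
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, -0.1, 1.5, 0.2]
x, y = 0.8, 1.0
grads = backward(x, y, params)

# central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(params)):
    plus, minus = list(params), list(params)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss_fn(x, y, plus) - loss_fn(x, y, minus)) / (2.0 * eps))

max_gap = max(abs(g - n) for g, n in zip(grads, numeric))
```

The backward pass reuses `h` and `y_hat` from the forward pass, which is the memory cost mentioned in the last bullet.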
