# Deep Learning
## Overview





# The Big Picture

<img src="images/bigpicture.png"/>

<img src="images/bigpicture1.png" style="width: 75%; height: 75%" />

- **Neuron**: the *atomic computational unit* of deep learning networks

<img src="images/bigpicture2.png" style="width: 75%; height: 75%" />

- **Layers**: neurons are organized in stacked layers to achieve increasingly abstract data representations

<img src="images/bigpicture3.png" style="width: 75%; height: 75%" />

- **Forward Propagation**: the end-to-end computational process for generating predictions

<img src="images/bigpicture4.png" style="width: 75%; height: 75%" />

- **Loss and Cost Function**: method for quantifying the error or discrepancy between predictions and ground truth

<img src="images/bigpicture5.png" style="width: 75%; height: 75%" />

- **Backward Propagation**: the computational process for systematically reducing the error by adjusting the network's parameters

<img src="images/bigpicture.png"/>

# I. Neuron

<img src="images/function.png" style="width: 50%; height: 50%" />

- Artificial neurons are inspired by biological neurons but can be viewed as mathematical functions

- A neuron accepts a set of inputs, performs a computation on those inputs, and returns an output

# Neuron's Two Step Computation

<img src="images/neuron2steps.png" style="width: 50%; height: 50%" />

> Neurons perform their computation in two steps. The function in Step 1 is called the *transfer* function. Its output is sent to the *activation* function in Step 2.

### Step 1: Neuron Computation

<img src="images/transferfunction.png" style="width: 75%; height: 75%">

- Each neuron accepts $n$ inputs.

- A "weight" is associated with each input: $x_{i} \mapsto w_{i}$ for $i = 1, \dots, n$

- Then a "bias" term $b$ is added to the weighted sum of inputs.

### Step 1: Computation Example

<img src="images/transferexample.png" style="width: 75%; height: 75%" >

### Step 2: Activation Function

<img src="images/activationfunction1.png" style="width: 75%; height: 75%" >

<img src="images/activationfunction2.png" style="width: 75%; height: 75%">

<img src="images/activationfunction3.png" style="width: 75%; height: 75%">

### Step 2: Computation Example

<img src="images/activationexample.png" style="width: 75%; height: 75%">

## Data Representation

- We represent the input $\mathbf{x}$ as a column vector. $\mathbf{x} =\begin{bmatrix}
    x_1\\
    x_2\\
    \vdots \\
    x_n
\end{bmatrix}$

- We represent the weights $\mathbf{w}$ as a column vector but take its transpose to obtain a row vector $\mathbf{w}^\top = \begin{bmatrix}
    w_1 &
    w_2 &
    \dots &
    w_n
    \end{bmatrix}$. We represent the bias as a scalar $b$.

- The result of the transfer function computation: $z = \begin{bmatrix}
    w_1 &
    w_2 &
    \dots &
    w_n
    \end{bmatrix} \bullet \begin{bmatrix}
    x_1\\
    x_2\\
    \vdots \\
    x_n
\end{bmatrix} + b$

- The same computation in compact form: $z = \mathbf{w}^\top \mathbf{x} + b$

- The result of the activation function computation: $a = \phi(z)$
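As a quick check of the notation above, the sketch below evaluates $z = \mathbf{w}^\top \mathbf{x} + b$ and $a = \phi(z)$ for a three-input neuron. The numeric values are the ones used in the Neuron class example of this section; choosing the sigmoid for $\phi$ is an assumption made for illustration.

```python
import numpy as np

# Illustrative values (same as the Neuron class example in this section)
x = np.array([2.3, 4.5, 1.3])   # inputs
w = np.array([3.2, -1.9, 2.5])  # one weight per input
b = -1.0                        # scalar bias

# Step 1: transfer function, z = w^T x + b
z = np.dot(w, x) + b

# Step 2: activation function (sigmoid assumed here)
a = 1.0 / (1.0 + np.exp(-z))

print(z)  # approximately 1.06
print(a)  # approximately 0.7427
```

The dot product of two 1-D arrays plays the role of $\mathbf{w}^\top \mathbf{x}$, so no explicit transpose is needed in NumPy.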

## Neuron Class 

In [1]:
import numpy as np

In [2]:
def relu(z):
    return np.maximum(0,z)

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
    
def heaviside(z):
    if z<0:
        return 0
    else:
        return 1

In [3]:
class Neuron(object):
    
    # size = number of inputs to the neuron
    def __init__(self,size):
        self.weights = np.random.randn(size)   # one random weight per input
        self.bias = np.random.randn()          # random scalar bias
        
    # computes z = weighted sum + bias, then applies the activation phi
    def y_hat(self,x,phi):
        z = np.dot(self.weights,x) + self.bias
        a = phi(z)
        print(a)
        return(a)
    
    # show parameters
    def show_params(self):
        print("weights: =", self.weights)
        print("biases: = ", self.bias)
    
    
    # replaces the weights and bias with the given values
    # (wrapping in a list adds a leading axis: weights get shape (1, n)
    # and bias shape (1,), so y_hat then returns a 1-element array)
    def update_params(self,weights,bias):
        self.weights = np.array([weights])
        self.bias = np.array([bias])

In [4]:
x = [2.3,4.5,1.3]
weights = [3.2,-1.9,2.5]
bias = -1

N1 = Neuron(3)
N1.show_params()
N1.y_hat(x,sigmoid)


N1.update_params(weights,bias)
N1.show_params()
N1.y_hat(x,sigmoid)

weights: = [-0.54017605  0.26314472  1.04401152]
biases: =  -0.9199287111683621
0.5936397556747696
weights: = [[ 3.2 -1.9  2.5]]
biases: =  [-1]
[0.74269055]


array([0.74269055])

In [5]:
N1 = Neuron(5)
N1.show_params()


weights: = [-0.87047531  0.71872396  1.47511814 -0.0582848   0.02974996]
biases: =  0.637140682984003


In [6]:
x = [2.3,4.5,1.3,2.1,-5]

In [7]:
N1.y_hat(x,heaviside)

1


1

In [8]:
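# Note: these values are only defined here; they are not passed to
# N1.update_params, so the next cell still uses the random parameters above
weights = [3.2,-1.9,2.5,3.3,-2]
bias = -1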

In [9]:
N1.y_hat(x,sigmoid)

0.9711343080237546


0.9711343080237546

# II. Layers

<img src="images/layers1.png" width="50%" height="50%" />

- A neural network consists of multiple neurons organized in layers

<img src="images/dense.png" width="50%" height="50%" />

- A densely connected network means that each neuron in each layer is connected to all neurons in the previous and subsequent layers

<img src="images/layers2.png" width="50%" height="50%" />

- A deep learning network consists of three types of layers: an input layer, one or more hidden layers, and an output layer

- The *size* (depth) of the network is the number of hidden layers plus the output layer; the input layer is not counted

## Parameters and Hyperparameters

- The *weights* and *biases* of the neural network are its parameters

- A deep learning model *learns* the correct weights and biases in the process of training

- Hyperparameters are settings chosen before training rather than learned from the data, including the number of layers, the number of neurons in each layer, the activation functions, the particular loss function, the optimizer, etc.
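To make the distinction concrete, the sketch below counts the learnable parameters implied by one choice of hyperparameters. The layer sizes are an assumed example: a network with 4 inputs, one hidden layer of 1 neuron, and 3 outputs.

```python
# Hyperparameter (chosen by us): number of neurons in each layer
sizes = [4, 1, 3]

# Parameters (learned in training): each layer has one weight per
# (input, neuron) pair and one bias per neuron
n_weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
n_biases = sum(sizes[1:])

print(n_weights)             # 4*1 + 1*3 = 7
print(n_biases)              # 1 + 3 = 4
print(n_weights + n_biases)  # 11 learnable parameters
```

Changing `sizes` changes how many weights and biases exist, but training only ever adjusts their values.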

# III. Forward Propagation

<img src="images/feedforward1.png" width="50%" height="50%" />

- Feedforward networks are the quintessential neural network. They are referred to as feedforward because information flows sequentially through each layer without feedback or loops.

## Forward Propagation Computation

<img src="images/feedforward2.png" width="50%" height="50%" />

- A feedforward neural network composes together a series of functions: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$

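The composition $f^{(3)}(f^{(2)}(f^{(1)}(x)))$ can be sketched directly in code. The helper `make_layer`, the layer sizes, and the use of the sigmoid are assumptions for illustration only; the weights are random, so only the structure matters here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_layer(W, b):
    """Return a function a -> sigmoid(W a + b) representing one layer."""
    return lambda a: sigmoid(W @ a + b)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical sizes: 4 inputs -> 3 neurons -> 2 neurons -> 1 output
f1 = make_layer(rng.standard_normal((3, 4)), rng.standard_normal(3))
f2 = make_layer(rng.standard_normal((2, 3)), rng.standard_normal(2))
f3 = make_layer(rng.standard_normal((1, 2)), rng.standard_normal(1))

x = np.array([2.0, 3.0, 4.0, 1.0])

# Nested call, written exactly like the formula
y_nested = f3(f2(f1(x)))

# Equivalent loop: feed each layer's output into the next layer
a = x
for f in (f1, f2, f3):
    a = f(a)

print(np.allclose(y_nested, a))  # True
```

The loop form is the one that generalizes to any number of layers.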

## Network Class

In [10]:
import numpy as np

In [11]:
def relu(z):
    return np.maximum(0,z)

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
    
def heaviside(z):
    # element-wise step, so it also works on the arrays used by Network
    return np.where(np.asarray(z)<0, 0, 1)

In [12]:
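class Network(object):
    # sizes = number of neurons in each layer, e.g. [4,1,3]
    def __init__(self,sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # one (y,1) bias column and one (y,x) weight matrix per non-input layer
        self.biases = [np.random.randn(y,1) for y in sizes[1:]]
        self.weights = [np.random.randn(y,x) 
                        for x,y in zip(sizes[:-1], sizes[1:])]
        
    # propagates the input a through every layer using the activation phi
    def feedforward(self,a,phi):
        for b,w in zip(self.biases, self.weights):
            z = np.dot(w,a)+b
            a = phi(z)
        return a
        
    # newbiases = list with one array of shape (y,1) per non-input layer
    def set_biases(self,newbiases):
        self.biases = [np.array(b) for b in newbiases]
        
    # newweights = list with one array of shape (y,x) per non-input layer
    def set_weights(self,newweights):
        self.weights = [np.array(w) for w in newweights]
        
    def show_parameters(self):
        print("biases: =", self.biases)
        print("weights: =", self.weights)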

In [13]:
deep = Network([4,1,3])
deep.show_parameters()



biases: = [array([[1.61556786]]), array([[-0.21402557],
       [ 0.20582364],
       [-2.59502659]])]
weights: = [array([[-1.97269426, -1.17618158, -0.50006713, -1.13947644]]), array([[ 1.80230017],
       [-0.05597854],
       [ 0.11217766]])]


In [14]:
deep.feedforward([2,3,4,1],sigmoid)

array([[0.446752  ],
       [0.55127331],
       [0.06946008]])

## Data Representation

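 $$ \mathbf{z} = \begin{bmatrix} { w }_{ 11 } & { w }_{ 12 } & \dots & { w }_{ 1n } \\ { w }_{ 21 } & { w }_{ 22 } & \dots & { w }_{ 2n } \\ \vdots & \vdots & \ddots & \vdots \\ { w }_{ m1 } & { w }_{ m2 } & \dots & { w }_{ mn } \end{bmatrix} \bullet \begin{bmatrix}
    x_1\\
    x_2\\
    \vdots \\
    x_n
\end{bmatrix} + \begin{bmatrix}
    b_1\\
    b_2\\
    \vdots \\
    b_m
\end{bmatrix}$$

- For a layer of $m$ neurons with $n$ inputs, row $j$ of $\mathbf{W}$ holds the weights of neuron $j$, which matches `np.dot(w,a)+b` in the Network class

$$ \mathbf{a} = \phi (\mathbf{W} \mathbf{x} + \mathbf{b}) $$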
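A minimal NumPy sketch of one layer's computation, assuming the weights are stored with one row per neuron, as the Network class does with `np.random.randn(y,x)`. The numbers and the choice of ReLU are made up for illustration.

```python
import numpy as np

# Hypothetical layer: 3 inputs feeding 2 neurons
W = np.array([[0.5, -1.0,  2.0],
              [1.5,  0.0, -0.5]])     # shape (2, 3): row j = weights of neuron j
x = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): column vector of inputs
b = np.array([[0.1], [-0.2]])         # shape (2, 1): one bias per neuron

z = W @ x + b          # shape (2, 1): one pre-activation per neuron
a = np.maximum(0, z)   # ReLU, applied element-wise

print(z.ravel())  # z is approximately [4.6, -0.2]
print(a.ravel())  # a is approximately [4.6, 0.0]
```

The shapes are the useful part to check: a `(2, 3)` matrix times a `(3, 1)` column gives a `(2, 1)` column, which is why the bias needs one entry per neuron rather than one per input.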

# IV. Loss and Cost Function

## Motivation: Why do we need a Cost Function?

<img src="images/target.jpg" width="50%" height="50%" />

- Imagine we are shooting arrows during target practice or a competition

- We need a way to measure how far off target we are with each arrow ("loss") and with all arrows ("cost")

- We could use a simple function, namely the distance from the center, to express the error

## And the Challenger is Katniss Everdeen!

<img src="images/lawrence.jpg" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="images/contest.png" style="float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;">

- The loss function computes the error for a single training example. The cost function is the average of the losses over the entire training set.

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

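**Cross Entropy** measures the performance of a classification model whose output is a probability value between 0 and 1:

$$-\frac 1n \sum_{i=1}^n y_i\log \hat{y}_i$$

For binary labels $y_i \in \{0, 1\}$, the full form also includes a term for the negative class:

$$-\frac 1n \sum_{i=1}^n \left[ y_i\log \hat{y}_i + (1-y_i)\log (1-\hat{y}_i) \right]$$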
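The three cost functions can be sketched in a few lines of NumPy. The function names and sample values below are assumptions for illustration. The cross entropy shown is the binary form, which includes a term for the negative class, and predictions are clipped to avoid taking `log(0)`.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean squared error."""
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy; eps keeps predictions away from 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up ground truth labels and predicted probabilities
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6, 0.4])

print(mae(y, y_hat))                   # approximately 0.325
print(mse(y, y_hat))                   # approximately 0.1425
print(binary_cross_entropy(y, y_hat))  # approximately 0.4389
```

Note how the worst prediction (0.4 for a true label of 1) contributes most to every cost, and contributes disproportionately to MSE and cross entropy.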

# V. Backpropagation

- Given an artificial neural network and a cost function, backpropagation calculates the gradient of the cost function with respect to the neural network's weights and biases.
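For a single sigmoid neuron with a squared-error loss, that gradient can be written out with the chain rule. The sketch below does so and compares the result against a finite-difference estimate. The input values, the target `y`, and the choice of loss are assumptions for illustration; a full network repeats the same chain rule layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron on one example."""
    return (sigmoid(np.dot(w, x) + b) - y) ** 2

def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and likewise for b."""
    a = sigmoid(np.dot(w, x) + b)
    dL_da = 2.0 * (a - y)      # derivative of (a - y)^2
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    delta = dL_da * da_dz
    return delta * x, delta    # dz/dw = x and dz/db = 1

x = np.array([2.3, 4.5, 1.3])
w = np.array([3.2, -1.9, 2.5])
b = -1.0
y = 1.0  # assumed target value

grad_w, grad_b = gradients(w, b, x, y)

# Finite-difference check: nudge each weight and measure the change in loss
h = 1e-6
numeric = np.array([
    (loss(w + h * e, b, x, y) - loss(w - h * e, b, x, y)) / (2 * h)
    for e in np.eye(3)
])

print(np.allclose(grad_w, numeric, atol=1e-6))  # True
```

Here the prediction (about 0.74) is below the target, so every gradient entry is negative: increasing any weight or the bias would reduce the loss.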

## What?

<img src="images/what.jpg" width="50%" height="50%" />

<img src="images/costfunction.png" width="75%" height="75%" />

- Our goal is to reduce error ("minimize the cost")

- The error is a function of the parameters of our network. We are looking for the optimal set of weights, i.e. the set for which the cost is 0 or as small as possible.

- Where I am is my current set of weights. Where I want to be is the set of weights where the error or cost is as small as possible.

## What does Calculus have to do with it?

- By calculating derivatives we can find the minima and maxima of functions

- Minima and maxima are found where the first derivative (slope) is zero. By taking the second derivative we can determine whether a point is a minimum or a maximum.

$$f(x) = x^2$$

$$f'(x) = 2x$$

$$ 2x = 0 \implies x = 0 $$

$$f''(x) = 2$$

$$f''(0) > 0 \text{, so } x = 0 \text{ is a minimum}$$
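The same conclusion can be checked numerically. The sketch below approximates the first and second derivatives of $f(x) = x^2$ with central differences; the helper names and step sizes are assumptions for illustration.

```python
def f(x):
    return x ** 2

def first_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

print(first_derivative(f, 0.0))   # approximately 0: the slope vanishes at x = 0
print(second_derivative(f, 0.0))  # approximately 2: positive, so a minimum
print(first_derivative(f, 3.0))   # approximately 6, matching f'(x) = 2x
```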

# Not out of the Woods!

- In deep neural networks we have cost functions with hundreds of thousands of parameters!


- We can't find the minima analytically

- We have to calculate gradients and move incrementally to reach a minimum

## Stochastic Gradient Descent

<img src="images/contour.jpg" width="50%" height="50%" />

- We assume we can't find the minimum analytically.
- We calculate the gradient at the current point. The negative gradient tells us the *direction* of steepest descent.
- We take an incremental step in the direction of steepest descent.
- We recalculate the error based on the new coordinates (i.e. the new weights or parameters).
- We iterate until we converge to a minimum.
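The steps above can be sketched on a toy problem. The bowl-shaped `cost`, its hand-written `gradient`, the starting point, and the learning rate are all assumptions chosen so the loop is easy to follow; a real network would obtain the gradient from backpropagation.

```python
import numpy as np

def cost(w):
    # Hypothetical bowl-shaped cost with its minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def gradient(w):
    # Partial derivatives of the cost above with respect to w[0] and w[1]
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])

w = np.array([4.0, 3.0])   # starting point ("where I am")
learning_rate = 0.1        # step size, an assumed value

for step in range(200):
    g = gradient(w)                  # gradient at the current point
    if np.linalg.norm(g) < 1e-8:     # slope is essentially zero: converged
        break
    w = w - learning_rate * g        # small step along the negative gradient

print(w)        # close to [1, -2]
print(cost(w))  # close to 0
```

A learning rate that is too large would overshoot the minimum, while one that is too small would need many more iterations.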

## Gradient Descent

- Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
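The "approximate gradient" is what makes the method stochastic: each step uses the gradient of a small random batch rather than the whole training set. Below is a minimal sketch that fits a line by mini-batch updates. The synthetic data, batch size, learning rate, and epoch count are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic, noise-free data drawn from y = 2x + 1 (made up for this sketch)
X = rng.uniform(-1.0, 1.0, size=200)
Y = 2.0 * X + 1.0

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1
batch_size = 10

for epoch in range(50):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = (w * X[idx] + b) - Y[idx]
        # Gradient of the mean squared error on this mini-batch only
        grad_w = 2.0 * np.mean(err * X[idx])
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)  # close to 2 and 1
```

Each mini-batch gradient is only an estimate of the full gradient, but it is much cheaper to compute and, averaged over many steps, points in the right direction.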

# Summary

<img src="images/bigpicture.png"  />