Timothy Helton

2018-06-26

# Background

- Logistic regression is well suited to **binary** classification problems.

## Nomenclature

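- a: activation, the output of the sigmoid function ($a = \sigma(z)$)
- b: intercept parameter
- J: cost function
- L: loss function
- m: number of training instances
- $n_x$: number of input features
- w: weight parameters
- X: matrix of all input vector instances
    - each column is an instance
- x: input vector for single instance
- y: true label (0 or 1)
- $\hat{y}$: predicted output
- z: linear model output ($z = w^{T}x + b$)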
- $\sigma$: sigmoid function or logistic function
    - $\sigma(z) = \frac{1}{1 + e^{-z}}$
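
The sigmoid defined above can be sketched in plain Python. The function name `sigmoid` and the branch on the sign of `z` are illustrative choices rather than part of these notes; the branch only exists so that `math.exp` is never called with a large positive argument.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^{z} / (1 + e^{z}).
    ez = math.exp(z)
    return ez / (1.0 + ez)


print(sigmoid(0.0))  # 0.5
```

Both branches compute the same function; the second is just an algebraic rearrangement of the first.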

## Process
1. A forward pass through the model is used to calculate the cost function.
1. A backward pass through the model evaluates the gradient, which is used to train the parameters (w and b).
1. Once the parameters are trained, the model is evaluated against the validation dataset.
1. If the validation results are satisfactory, the model is applied to the test dataset as a final check.
1. Finally, the model is applied to unclassified data.

## Computational Graph

![Logistic Regression Computational Graph](images/logistic_regression_graph.png)

# Parameters

- b: intercept parameter
    - $\in\mathbb{R}$
- w: weight parameters
    - $\in\mathbb{R}^{n_x}$

# Activation Function

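The ***sigmoid*** or ***logistic*** function used to be the standard activation function.

- At the edges of the function the gradient is near zero, which results in slow learning.
- ReLU is now generally preferred over the sigmoid for hidden layers in neural networks; the sigmoid is still used here to map z to a probability.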
    
**Sigmoid Function**
![Sigmoid Function](images/sigmoid_function.png)

**Rectified Linear Unit (ReLU)**
![ReLU Function](images/ReLU_function.png)
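
To put numbers on the slow-learning point, the sketch below compares the sigmoid gradient, $a(1 - a)$, with the ReLU gradient at a few values of z. The helper names are hypothetical, and treating the ReLU gradient as 0 at exactly z = 0 is a convention assumed here.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: a * (1 - a)."""
    a = sigmoid(z)
    return a * (1.0 - a)


def relu_grad(z: float) -> float:
    """Derivative of ReLU max(0, z); taken as 0 at z = 0 (assumed convention)."""
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

The sigmoid gradient peaks at 0.25 when z = 0 and drops below 1e-4 by z = 10, while the ReLU gradient stays at 1 for every positive z.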

# Cost Function

- The average of the loss function over all m training instances.
- Used to train the parameters w and b.
    - Influence of b on L
        - $\frac{\partial L}{\partial b}$
    - Influence of w on L
        - $\frac{\partial L}{\partial w_i}$
        
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$$

## Loss Function
$L(a, y) = -(y \log{a} + (1 - y) \log{(1 - a)})$
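
A minimal sketch of the loss and the cost, assuming the activations are already computed and lie strictly between 0 and 1 (the logarithm is undefined at the endpoints). The function names and the sample values are made up for illustration.

```python
import math


def loss(a: float, y: int) -> float:
    """Loss for one instance with activation a and true label y."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def cost(activations: list[float], labels: list[int]) -> float:
    """Average loss over all m instances."""
    m = len(labels)
    return sum(loss(a, y) for a, y in zip(activations, labels)) / m


# A confident correct prediction is cheap; a confident wrong one is expensive.
print(loss(0.9, 1))
print(loss(0.1, 1))
print(cost([0.9, 0.2, 0.6], [1, 0, 1]))
```

For a label of 1 only the $\log{a}$ term contributes, and for a label of 0 only the $\log{(1 - a)}$ term does.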

# Backpropagation

## Derivative of Loss with respect to Activation

$L(a, y) = -(y \log{a} + (1 - y) \log{(1 - a)})$

$\frac{dL}{da} = -\frac{d}{da}\big[y \log(a)\big] - \frac{d}{da}\big[(1 - y) \log(1 - a)\big]$

1. $\frac{d}{da}\big[y \log(a)\big] = \frac{y}{a}$

1. $\frac{d}{da}\big[(1 - y) \log(1 - a)\big]$
    - Requires the Chain Rule
        - $f(g(a)) = (1 - y)\log(1 - a)$
        - $f(g) = (1 - y)\log(g)$
        - $g(a) = 1 - a$
        - $\frac{df}{dg} = \frac{1 - y}{g}$
        - $\frac{dg}{da} = -1$
        - $\frac{df}{dg} \cdot \frac{dg}{da} = - \frac{1 - y}{g} = - \frac{1 - y}{1 - a}$

$\frac{dL}{da} = -\frac{y}{a} - \Big(-\frac{1 - y}{1 - a}\Big)$

$\frac{dL}{da} = \frac{1 - y}{1 - a} - \frac{y}{a}$
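
One way to sanity-check this result is to compare it with a central finite-difference approximation. The sketch below does that at a = 0.3 for both labels; the helper names, the test point, and the step size `h` are arbitrary choices made for illustration.

```python
import math


def loss(a: float, y: int) -> float:
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def dloss_da(a: float, y: int) -> float:
    """Analytic derivative: (1 - y) / (1 - a) - y / a."""
    return (1 - y) / (1 - a) - y / a


def numeric_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for y in (0, 1):
    analytic = dloss_da(0.3, y)
    numeric = numeric_derivative(lambda a: loss(a, y), 0.3)
    print(y, analytic, numeric)
```

The two columns should agree to several decimal places for each label.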

## Derivative of Activation with respect to Model

$\frac{da}{dz} = \frac{d}{dz}\sigma(z)$

$\frac{da}{dz} = \frac{d}{dz}(1 + e^{-z})^{-1}$

1. Requires Chain Rule
    - $f(g) = g^{-1}$
    - $g(z) = 1 + e^{-z}$
    - $\frac{df}{dg} = -g^{-2}$
    - $\frac{dg}{dz} = -e^{-z}$
    - $\frac{df}{dg} \cdot \frac{dg}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2}$

Trick: Add and subtract 1 in the numerator.

$\frac{da}{dz} = \frac{e^{-z} + 1 - 1}{(1 + e^{-z})^2}$

$\frac{da}{dz} = \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \frac{1}{(1 + e^{-z})^2}$

$\frac{da}{dz} = \frac{1}{(1 + e^{-z})} - \frac{1}{(1 + e^{-z})^2}$

$\frac{da}{dz} = \frac{1}{(1 + e^{-z})} \Big( 1 - \frac{1}{(1 + e^{-z})} \Big)$

Since $a = \frac{1}{1 + e^{-z}}$:

$\frac{da}{dz} = a(1 - a)$

## Derivative of Loss with respect to Model

$\frac{dL}{dz} = \frac{dL}{da} \frac{da}{dz}$

$\frac{dL}{dz} = \Big( \frac{1 - y}{1 - a} - \frac{y}{a} \Big) a(1 - a)$

$\frac{dL}{dz} = \frac{a(1 - a)(1 - y)}{1 - a} - \frac{a(1 - a)y}{a}$

$\frac{dL}{dz} = a(1 - y) - y(1 - a)$

$\frac{dL}{dz} = a - ay - y + ay$

$\frac{dL}{dz} = a - y$
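
The two results so far, $\frac{da}{dz} = a(1 - a)$ and $\frac{dL}{dz} = a - y$, can be checked numerically by treating the loss as a function of z. This is only a sketch: the helper names and the test point (z = 0.8, y = 1) are arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_of_z(z: float, y: int) -> float:
    """Loss written as a function of z by composing it with the sigmoid."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def numeric_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


z, y = 0.8, 1
a = sigmoid(z)

# da/dz should match a * (1 - a).
print(numeric_derivative(sigmoid, z), a * (1 - a))

# dL/dz should match a - y.
print(numeric_derivative(lambda t: loss_of_z(t, y), z), a - y)
```

Each printed pair should agree to several decimal places, which is a quick guard against sign errors in the chain rule.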

## Derivative of Model with respect to Parameters

$z = w^{T}x + b$

$\frac{\partial z}{\partial w} = x$

$\frac{\partial z}{\partial b} = 1$

## Derivative of Loss with respect to Parameters

**w**

$\frac{\partial L}{\partial w} = \frac{dL}{da} \frac{da}{dz} \frac{\partial z}{\partial w}$

$\frac{\partial L}{\partial w} = \frac{dL}{dz} \frac{\partial z}{\partial w}$

$\frac{\partial L}{\partial w} = (a - y) x$

$\frac{\partial L}{\partial w} = x(a - y)$


**b**

$\frac{\partial L}{\partial b} = \frac{dL}{da} \frac{da}{dz} \frac{\partial z}{\partial b}$

$\frac{\partial L}{\partial b} = \frac{dL}{dz} \frac{\partial z}{\partial b}$

$\frac{\partial L}{\partial b} = (a - y) \cdot 1$

$\frac{\partial L}{\partial b} = (a - y)$
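
The gradients above can be averaged over all m instances and used to train w and b. The sketch below assumes plain gradient descent; the update rule, the learning rate, the step count, and the toy data are assumptions made for illustration and do not come from these notes. Instances are stored as a list of rows here, rather than as columns of X, to keep the pure-Python loops simple.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gradients(w, b, xs, ys):
    """Average of x(a - y) and (a - y) over all m instances."""
    m = len(xs)
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(xs, ys):
        z = sum(wj * xj for wj, xj in zip(w, x)) + b  # forward pass
        dz = sigmoid(z) - y                            # dL/dz = a - y
        for j, xj in enumerate(x):
            dw[j] += xj * dz / m                       # dL/dw_j = x_j (a - y)
        db += dz / m                                   # dL/db = a - y
    return dw, db


def train(xs, ys, learning_rate=0.5, steps=1000):
    """Plain gradient descent starting from zero parameters (assumed scheme)."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        dw, db = gradients(w, b, xs, ys)
        w = [wj - learning_rate * dwj for wj, dwj in zip(w, dw)]
        b -= learning_rate * db
    return w, b


# Toy data, made up for illustration: the label is 1 when the feature is large.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0, 0, 1, 1]
w, b = train(xs, ys)
print(w, b)
```

After training, the weight is positive and the intercept negative, so the predicted probability crosses 0.5 somewhere between x = 1 and x = 2.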


# Model

- $\hat{y}$: predicted output
    - probability that y is equal to 1 given x
        - $\hat{y} = P(y=1 | x)$
    - apply the sigmoid function to return a value between 0 and 1
        - $\hat{y} = \sigma(w^{T}x + b)$
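
A minimal sketch of the model as a prediction function. The function names, the parameter values, and the 0.5 cutoff used to turn the probability into a class label are assumptions for illustration; the notes only define the probability itself.

```python
import math


def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """y_hat = sigma(w^T x + b): estimated probability that y = 1 given x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: list[float], b: float, x: list[float], threshold: float = 0.5) -> int:
    """Turn the probability into a class label using a cutoff (assumed 0.5)."""
    return 1 if predict_proba(w, b, x) >= threshold else 0


# Hypothetical parameters, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
print(predict_proba(w, b, [1.0, 0.5]))  # z = 2.0 - 0.5 + 0.5 = 2.0
print(predict(w, b, [1.0, 0.5]))
```

With these values z = 2, so the probability is above 0.5 and the predicted label is 1.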