# Some preliminaries for Artificial Neural Network Understanding


## Using nbconvert to print the presentation
* Generate the slides and serve them using nbconvert:
<pre>jupyter nbconvert --to slides your_talk.ipynb --post serve</pre>

It opens up a webpage in the browser at http://127.0.0.1:8000/your_talk.slides.html#/

* Add ?print-pdf to the query string as http://127.0.0.1:8000/your_talk.slides.html?print-pdf

Note that you need to remove the # at the end. The page will render the slides vertically.

* Save to PDF in Chrome using the print option

* Open the in-browser print dialog (Cmd/Ctrl + P).
* Change the Destination setting to Save as PDF.
* Change the Layout to Landscape.
* Change the Margins to None.
* Enable the Background graphics option.
* Click Save.

## Neurons and what they do
![neurons](figs/neurons.png)
* Figure courtesy: http://cs231n.github.io/neural-networks-1/

## Neurons and what they do?

[![Neurons](https://img.youtube.com/vi/vyNkAuX29OU/0.jpg)](https://www.youtube.com/watch?v=vyNkAuX29OU)

# A biological neuron vs. the artificial neuron (a.k.a., the perceptron)

![neurons](figs/neuron+perceptron.png)

## Artificial Neural Network
* It is a network with the perceptrons spanning across several layers forming a *multi-partite graph*
![ann-01](figs/ann-01.png)

![ann-02](figs/ann-02.png)

## Function to compute a neuron's output
$$ 
\begin{eqnarray}
\text{raw output of a neuron} &=& \sum weight_i \cdot input_i\\
&=& w_1\cdot i_1 + w_2\cdot i_2 + \cdots
\end{eqnarray}
$$
Here, $i_i$ is the $i^{th}$ input signal to that neuron, and $w_i$ is the associated weight of that input signal.

* Then, we squash this raw output with an activation function, for example sigmoid(.). Therefore, the final output of the neuron is:
$$
H(neuron) = sigmoid(\sum weight_i \cdot input_i)
$$
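
As a minimal sketch of the two formulas above, the snippet below computes a neuron's raw output (the weighted sum) and then applies the sigmoid. The input and weight values are made up purely for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by the sigmoid activation."""
    raw = np.dot(weights, inputs)   # sum_i weight_i * input_i
    return sigmoid(raw)

# Hypothetical values, chosen only for illustration
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, 0.1])

raw = np.dot(weights, inputs)       # 0.2 - 0.3 + 0.2 = 0.1
out = neuron_output(inputs, weights)
print(raw, out)
```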

# Another Artificial Neural Network (ANN)
<img src="figs/nn-diag-02.d.png" />

# Steps in a Feed forward network
## Step 1: $\sum $, weighted sum calculation

<img src="figs/01.png" />

## Step 2: $\phi$, applying the activation function on the weighted sum
<img src="figs/02.png" />

## Step 3: Pass the activation output to the next neuron it is connected to.

<img src="figs/03.png" />

<img src="figs/04.png" />

<img src="figs/05.png" />

<img src="figs/06.png" />

# Hyperbolic tangent function as activation function
![tanh](figs/tanhx.png)

## Reasons to use activation functions
* An activation function essentially squashes the input and transforms it into an output value that represents how much a node should contribute (i.e., how much a node should fire)
* Activation functions introduce non-linearity in the model to solve non-linear classification problems.
<img src="figs/lin+non-lin.png">
* Activation functions limit the output of a node to a certain range. In most cases, reduce number of weights to learn.
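
To illustrate the non-linearity point above, here is a small sketch (with randomly generated, hypothetical weights and inputs) showing that two layers stacked without an activation function collapse into a single linear layer, whereas inserting tanh between them does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # first layer weights (hypothetical shapes)
W2 = rng.normal(size=(4, 2))   # second layer weights
x = rng.normal(size=(5, 3))    # five input samples

# Two linear layers with no activation in between...
two_linear_layers = (x @ W1) @ W2
# ...are equivalent to one linear layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With tanh in between, the composition is no longer that linear map.
hidden = np.tanh(x @ W1)       # every value is squashed into [-1, 1]
with_tanh = hidden @ W2
differs = not np.allclose(with_tanh, one_linear_layer)
print(collapses, differs)
```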

# Intuition of weights in an ANN
<img src="figs/08.png" />

<img src="figs/09.png" />

<img src="figs/10.png" />

<img src="figs/11.png" />

<img src="figs/12.png" />

<img src="figs/13.png" />

<img src="figs/14.png" />

<img src="figs/15.png" />

<img src="figs/16.png" />

<img src="figs/17.png" />

# Idea behind the backpropagation algorithm
<img src="figs/18.png" />

<img src="figs/19.png" />

<img src="figs/22.png" />

<img src="figs/24.png" />

<img src="figs/25.png" />

<img src="figs/26.png" />

<img src="figs/29.png" />

<img src="figs/30.png" />

<img src="figs/30a.png" />

<img src="figs/34.png" />

<img src="figs/35.png" />

<img src="figs/36.png" />

<img src="figs/37.png" />

<img src="figs/39.png" />

<img src="figs/41.png" />

<img src="figs/42.png" />

<img src="figs/45.png" />

<img src="figs/46.png" />

### The goal is to find the optimum set of weights that gives you the best $C$ score.
<img src="figs/47.png" />

## Using a brute-force algorithm to solve for the optimum set of weights might take millions of years.
<img src="figs/56.png" />

# So, here is the Gradient Descent algorithm to solve it.
<img src="figs/57.png" />

<img src="figs/58.png" />

<img src="figs/59.png" />

<img src="figs/60.png" />

<img src="figs/61.png" />

<img src="figs/62.png" />

<img src="figs/63.png" />

<img src="figs/64.png" />

<img src="figs/65.png" />

<img src="figs/66.png" />

# But, this will work perfectly only if the $C$ function is convex.
<img src="figs/70.png" />

# How about now? Can we obtain the best weight solution if we start from there?
### This is a plot of a non-convex function.
<img src="figs/71.png" />

<img src="figs/72.png" />

<img src="figs/73.png" />

# Solution: Stochastic Gradient Descent
<img src="figs/74.png" />

# (Batch) gradient descent:
* You read the entire training dataset, and then you compute $C$, and then adjust weights.
<img src="figs/76.png" />

# Stochastic gradient descent:
* You read the first sample, compute $C$, and then adjust weights.
* Then you read the second sample, compute $C$, and then adjust weights
* and so on.
<img src="figs/77.png" />

<img src="figs/78.png" />

<img src="figs/79.png" />

<img src="figs/80.png" />
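
The difference between the two update schedules can be sketched on a toy one-weight model. Everything here (the data, the model $w\cdot x$, the learning rate, and the number of epochs) is an assumption made for illustration, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x                      # toy data: the ideal weight is 3.0

def grad(w, xs, ys):
    """Gradient of the mean squared error of the model w*x w.r.t. w."""
    return np.mean(2.0 * (w * xs - ys) * xs)

alpha = 0.1                      # learning rate (assumed value)

# Batch gradient descent: one update per pass over the whole dataset
w_batch = 0.0
for epoch in range(200):
    w_batch -= alpha * grad(w_batch, x, y)

# Stochastic gradient descent: one update per sample
w_sgd = 0.0
for epoch in range(200):
    for i in rng.permutation(len(x)):
        w_sgd -= alpha * grad(w_sgd, x[i:i+1], y[i:i+1])

print(w_batch, w_sgd)
```

Both loops see the same data. The batch version makes 200 updates in total, while the stochastic version makes 200 x 50 updates, one per sample.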

# Creating the ANN structure
<img src="figs/ann-01.gvz.png"/>

# Creating the ANN structure
<img src="figs/ANN-01.png"/>

## Parameters of an ANN
* **Input layer, input nodes**: An input node holds one input of the network, and this input is always numerical. If the input is not numerical by default, it is converted to a numerical representation.


* **Hidden layer, hidden nodes**: A hidden layer is a layer of nodes between the input and output layers. There can be either a single hidden layer or multiple hidden layers in a network. The more hidden layers there are, the deeper the learning that the network can perform. In fact, multiple hidden layers are what the term deep learning refers to.


* **Output layer, output nodes**: An output node is a node within an output layer. There can be a single output node or multiple output nodes, depending on the objective of the network. For example, a network that classifies images as cat or dog would have 2 output nodes.

* **Weight values**: A weight is a variable that sits on an edge between two nodes. It may reflect the contribution/importance of nodes to accomplish the objective.


* **Bias nodes, Bias values**: A bias is an extra node added to each hidden and output layer. It connects to every node within each respective layer. A bias is never connected to a previous layer, but is simply added to the input of a layer. It typically has a constant value of 1 or -1, and also has a weight on its connecting edges which also needs to be trained.
![bias](figs/bias.png)
Suppose the inputs are $x_1=x_2=0$.
Without the bias node, the net input is 0 regardless of the weights, so only one output value, $\phi(0)$, is possible. **This may lead to a very poor fit of the given dataset.**

- Bias enables activation function to be shifted to the left or right; i.e., make the offset between multiple node outputs 0.


* **Learning rate**: It is a value that scales each weight update, and therefore speeds up or slows down how quickly gradient descent learns.

# Forward propagation: Summation Operator
## netinput of a node = the weighted bias + the weighted sum of the inputs
$$ \text{netinput} = 1*w_b + \sum_{i=1}^n{ x_iw_i} $$
![ann](figs/ann-01.gvz.png)

![ann](figs/ann-01.gvz.png)

$$ 
\begin{align*}
b1_{net} &= bias*bw1 + a1w1 + a2w3\\
&= 1*bw1 + a1w1 + a2w3\\
b1_{out} &= \phi(b1_{net})
\end{align*}
$$

![ann](figs/ann-01.gvz.png)
$$ 
\begin{align*}
b2_{net} &= 1*bw2 + a1w2 + a2w4\\
b2_{out} &= \phi(b2_{net})
\end{align*}
$$

![ann](figs/ann-01.gvz.png)
$$ \begin{align*}
c1_{net} &= 1*bw3 + b1_{out}w5 + b2_{out}w6\\
c1_{out} &= \phi(c1_{net})
\end{align*}
$$

![ann](figs/ann-01.gvz.png)
$$ \begin{align*}
c1_{net} &= 1*bw3 + b1_{out}w5 + b2_{out}w6\\
\hat{c1}_{out} &= \phi(c1_{net}) \quad \text{in fact this is an estimated value}
\end{align*}$$
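
The forward pass above can be sketched in a few lines. The variable names follow the diagram; the numeric values are hypothetical, and $\phi$ is assumed to be the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs and weights, named as in the diagram
a1, a2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden
w5, w6 = 0.40, 0.45                        # hidden -> output
bw1, bw2, bw3 = 0.35, 0.35, 0.60           # bias weights (bias value is 1)

# Hidden layer
b1_net = 1 * bw1 + a1 * w1 + a2 * w3
b1_out = sigmoid(b1_net)
b2_net = 1 * bw2 + a1 * w2 + a2 * w4
b2_out = sigmoid(b2_net)

# Output layer
c1_net = 1 * bw3 + b1_out * w5 + b2_out * w6
c1_out = sigmoid(c1_net)   # the network's estimate

print(b1_out, b2_out, c1_out)
```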

# Calculating the total error: the cost function
In order for a neural network to successfully train, it must minimize the difference between its actual output and the target output to find the global minimum (or a local minimum that is close enough to the global). This difference is the total error, which essentially tells us how wrong a network is.

A cost function gives us the sense of the total error between the target output and actual output.
## Types of cost functions
For a dataset with $n$ rows (instances, or samples), let $z_i$ be the ground truth output value of the $i^\text{th}$ sample and $\hat{z}_i$ be the predicted output value of that sample:
### Mean Squared Error (MSE)
$$ MSE = \dfrac{1}{n} \sum_{i=1}^n (\hat{z}_i-z_i)^2$$


### Root Mean Squared error (RMSE)
$$ RMSE = \sqrt{\dfrac{1}{n} \sum_{i=1}^n (\hat{z}_i-z_i)^2}$$


### Sum of Squared Error (SSE)
$$ SSE = \sum_{i=1}^n (\hat{z}_i-z_i)^2$$
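
A minimal NumPy sketch of the three cost functions above; the ground-truth and predicted values are made up for illustration.

```python
import numpy as np

def sse(z_hat, z):
    """Sum of squared errors."""
    return np.sum((z_hat - z) ** 2)

def mse(z_hat, z):
    """Mean squared error: SSE divided by the number of samples."""
    return np.mean((z_hat - z) ** 2)

def rmse(z_hat, z):
    """Root mean squared error: square root of the MSE."""
    return np.sqrt(mse(z_hat, z))

z = np.array([1.0, 0.0, 1.0, 1.0])        # ground truth (made up)
z_hat = np.array([0.5, 0.5, 1.0, 0.0])    # predictions (made up)

# Squared errors are 0.25, 0.25, 0, 1, so SSE = 1.5 and MSE = 0.375
print(sse(z_hat, z), mse(z_hat, z), rmse(z_hat, z))
```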


# Calculating the gradients
* We need to discover how the total error is spread across every weight in the ANN so that we can adjust the weights to minimize the error.

* To do this, we are going to compute the error of every weight in the network.

* The error of a weight is technically its analytical gradient.
* Once calculated, the ANN tries to minimize the errors and thus minimize the total error.
* This algorithm is called *Backpropagation* algorithm.

# Partial Derivative
<img src="figs/derivative.png" width=600>
<ul>
    <li>A derivative provides the slope of a tangent line at a single point.</li>
    <li>A partial derivative is the derivative of a function which has 2 or more variables but with respect to a single variable. All the other variables are treated as constant.</li>
</ul>
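
As a quick illustration of the definition above, a partial derivative can be estimated numerically by nudging one variable while holding the others constant. The function `f` below is an arbitrary example, not taken from the slides.

```python
def f(x, y):
    """Example function of two variables: f(x, y) = x^2 * y + y^3."""
    return x**2 * y + y**3

def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx, holding y constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of df/dy, holding x constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 2.0, 3.0
# Analytically: df/dx = 2xy = 12 and df/dy = x^2 + 3y^2 = 31 at (2, 3)
print(partial_x(f, x0, y0), partial_y(f, x0, y0))
```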

# Calculating Partial Derivative of Output layer weights
![ann](figs/ann-01.gvz.png)
Let's discover how a change in weight $w5$ affects the total error $E$ -- while all the other weights remain constant. It is $$ \dfrac{\partial E}{\partial w5}$$

![ann](figs/ann-01.gvz.png)
Since, $$ E = f(Z)$$
$$Z = g(w5)$$
Therefore, utilizing the chain rule we get:
\begin{align*}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial w5}
\end{align*}

![ann](figs/ann-01.gvz.png)
Furthermore, $Z$ is a function of $netc1=1\times bw3 + b1\times w5 + b2\times w6$. So,
$$Z = h(netc1)$$
$$netc1 = q(w5)$$
Therefore, $\dfrac{\partial Z}{\partial w5}$ can be written as:
$$\dfrac{\partial Z}{\partial w5} = \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}$$
Equation of $\dfrac{\partial E}{\partial w5}$ becomes:
\begin{align*}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{align*}

# Now, let's compute the following:
\begin{eqnarray}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{eqnarray}

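Consider $ E = MSE = \frac{1}{n}\sum_{i=1}^n(t_i - z_i)^2$,
where $t_i$ is the ground truth output, and $z_i = \hat{z}_i$ is the predicted output. For a single sample, and scaling by $\frac{1}{2}$ so that the factor of 2 from differentiation cancels (a constant factor only rescales the gradient and can be absorbed into the learning rate), $E = \frac{1}{2}(t - z)^2$ and: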

$$
\begin{align*}
\dfrac{\partial E}{\partial Z} &= -(t-z)\\
&= z-t
\end{align*}
$$

# Computing:
\begin{eqnarray}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{eqnarray}
![ann](figs/ann-01.gvz.png)
Here, $$ Z = \phi(netc1) = sigmoid(netc1) = \dfrac{1}{1+e^{-netc1}}$$
Therefore, $$\dfrac{\partial Z}{\partial netc1} = z(1-z)$$   **You can prove it, right?**
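
One way to verify this, as a short derivation sketch using only the definition of the sigmoid given above:

$$
\begin{aligned}
\dfrac{\partial Z}{\partial netc1} &= \dfrac{\partial}{\partial netc1}\left(1+e^{-netc1}\right)^{-1} \\
&= \dfrac{e^{-netc1}}{\left(1+e^{-netc1}\right)^2} \\
&= \dfrac{1}{1+e^{-netc1}}\cdot\dfrac{e^{-netc1}}{1+e^{-netc1}} \\
&= z(1-z), \quad \text{since } 1-z = \dfrac{e^{-netc1}}{1+e^{-netc1}}
\end{aligned}
$$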

# Computing:
\begin{eqnarray}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{eqnarray}
![ann](figs/ann-01.gvz.png)
Since $netc1 = 1\times bw3 + b1\times w5 + b2\times w6$,
\begin{align*}
\dfrac{\partial netc1}{\partial w5} = b1
\end{align*}

# Now putting it all together to compute the following:
\begin{eqnarray}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{eqnarray}
![ann](figs/ann-01.gvz.png)
\begin{align*}
\dfrac{\partial E}{\partial w5} = (z-t)z(1-z)b1
\end{align*}

# Introducing Node Delta, $\delta_z$:
\begin{eqnarray}
\dfrac{\partial E}{\partial w5} = \dfrac{\partial E}{\partial Z}\cdot \dfrac{\partial Z}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial w5}
\end{eqnarray}
![ann](figs/ann-01.gvz.png)
$$\delta_z = (z-t)z(1-z)$$
Therefore,
\begin{align*}
\dfrac{\partial E}{\partial w5} = \delta_z \cdot b1
\end{align*}
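
A gradient check is a handy way to convince yourself that $\delta_z \cdot b1$ is right. The sketch below compares it against a finite-difference estimate. It assumes a single training sample, the error $E = \frac{1}{2}(t-z)^2$ (so that $\partial E/\partial Z = z - t$), and hypothetical values for the hidden outputs, weights, and target.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical hidden outputs, weights, and target
b1, b2 = 0.59, 0.60
w5, w6, bw3 = 0.40, 0.45, 0.60
t = 0.01

def error(w5_value):
    """E = 1/2 (t - z)^2 for one sample, as a function of w5 only."""
    netc1 = 1 * bw3 + b1 * w5_value + b2 * w6
    z = sigmoid(netc1)
    return 0.5 * (t - z) ** 2

# Analytical gradient: delta_z * b1
z = sigmoid(1 * bw3 + b1 * w5 + b2 * w6)
delta_z = (z - t) * z * (1 - z)
analytic = delta_z * b1

# Numerical gradient by central differences
h = 1e-6
numeric = (error(w5 + h) - error(w5 - h)) / (2 * h)

print(analytic, numeric)
```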

# Now calculate the partial derivative of the output layer bias weights
![ann](figs/ann-01.gvz.png)
\begin{align*}
\dfrac{\partial E}{\partial bw3} = (z-t)z(1-z) = \delta_z
\end{align*}
* Question: Why is there no $b1$ in the equation like what we saw in the non-bias output weights?

Answer: A bias is not connected to a previous layer and therefore does not have an input.
But, be careful about different $\delta_z$'s:
![ann](figs/ann-02.gvz.png)
Here, bias weight $bw3$ is connected to output node $c1$, and bias weight $bw4$ is connected to output node $c2$.
* Therefore, $bw3$ should use the $\delta_z$ for node $c1$, and
* $bw4$ would use $\delta_z$ for node $c2$

# Now calculate the partial derivative of the hidden layer weights
![ann](figs/ann-01.gvz.png)
How to compute $$\dfrac{\partial E}{\partial w1}$$

![ann](figs/ann-01.gvz.png)
\begin{align*}
\dfrac{\partial E}{\partial w1} = \dfrac{\partial E}{\partial netb1}\cdot \dfrac{\partial netb1}{\partial w1}
\end{align*}
Because, $E = f(netb1)$, and $netb1 = g(w1)$

![ann](figs/ann-01.gvz.png)
$$
\begin{align*}
\dfrac{\partial E}{\partial netb1} = \dfrac{\partial E}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial netb1}
\end{align*}
$$

Because, $E = f(netc1)$, and $netc1 = g(netb1)$

![ann](figs/ann-01.gvz.png)
$$
\begin{align*}
\dfrac{\partial netc1}{\partial netb1} = \dfrac{\partial netc1}{\partial b1}\cdot \dfrac{\partial b1}{\partial netb1}
\end{align*}
$$
Because, $netc1 = f(b1)$, and $b1 = g(netb1)$

# In a nutshell, to compute $\dfrac{\partial E}{\partial w1}$
![ann](figs/ann-01.gvz.png)
\begin{align*}
\dfrac{\partial E}{\partial w1} = \dfrac{\partial E}{\partial netc1}\cdot \dfrac{\partial netc1}{\partial b1}\cdot\dfrac{\partial b1}{\partial netb1}\cdot \dfrac{\partial netb1}{\partial w1}
\end{align*}

# Now compute the derivative, $\dfrac{\partial E}{\partial netc1}$
![ann](figs/ann-01.gvz.png)

$$\dfrac{\partial E}{\partial netc1} =\delta_z$$
where, 
$$\delta_z = (z-t)z(1-z)$$
If there is more than one output node, the formula would incorporate that:
$$\dfrac{\partial E}{\partial netc} =\sum_c\delta_z$$


# Now compute the derivative, $\dfrac{\partial netc1}{\partial b1}$
![ann](figs/ann-01.gvz.png)

$$\dfrac{\partial netc1}{\partial b1} =w5$$

If there is more than one output node, the formula would incorporate that:
$$\dfrac{\partial netc}{\partial b1} =\sum_c w5$$


# Now compute the derivative, $\dfrac{\partial b1}{\partial netb1}$
![ann](figs/ann-01.gvz.png)

$$\dfrac{\partial b1}{\partial netb1} =b1(1-b1)$$



# Now compute the derivative, $\dfrac{\partial netb1}{\partial w1}$
![ann](figs/ann-01.gvz.png)

$$\dfrac{\partial netb1}{\partial w1} =a1$$



# Finally, $\dfrac{\partial E}{\partial w1}$
![ann](figs/ann-01.gvz.png)

\begin{align*}
\dfrac{\partial E}{\partial w1} &= \left(\sum_c\delta_z w_5\right)\cdot b1\cdot (1-b1)\cdot a1\\
&= \delta_b a1
\end{align*}
where, $$\delta_b=\left(\sum_c\delta_z w_5\right)\cdot b1\cdot (1-b1)$$
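
The hidden-layer result can be checked the same way as the output-layer one. This sketch assumes a single sample, $E = \frac{1}{2}(t-z)^2$, sigmoid activations, one output node (so the sum over $c$ has a single term), and hypothetical numeric values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs, weights, and target for the 2-2-1 network
a1, a2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6 = 0.40, 0.45
bw1, bw2, bw3 = 0.35, 0.35, 0.60
t = 0.01

def forward(w1_value):
    """Return (b1, z) for the given value of w1; other weights stay fixed."""
    b1 = sigmoid(1 * bw1 + a1 * w1_value + a2 * w3)
    b2 = sigmoid(1 * bw2 + a1 * w2 + a2 * w4)
    z = sigmoid(1 * bw3 + b1 * w5 + b2 * w6)
    return b1, z

def error(w1_value):
    _, z = forward(w1_value)
    return 0.5 * (t - z) ** 2

# Analytical gradient: delta_b * a1
b1, z = forward(w1)
delta_z = (z - t) * z * (1 - z)
delta_b = (delta_z * w5) * b1 * (1 - b1)   # single output node, so no sum
analytic = delta_b * a1

# Numerical gradient by central differences
h = 1e-6
numeric = (error(w1 + h) - error(w1 - h)) / (2 * h)
print(analytic, numeric)
```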



# Computing the partial derivative of hidden layer bias weights
![ann](figs/ann-01.gvz.png)
$$\dfrac{\partial E}{\partial bw1} = \delta_b$$
Question: Why is there no $a1$ in this equation?
Answer: Because the bias is not connected to the previous layer, it has no input from that layer (its value is the constant 1). Therefore, the partial derivative of a hidden-layer bias weight is simply the node delta of the hidden node it feeds.

# General weight update equation
$$ w5_{new} = w5 - \alpha\cdot \dfrac{\partial E}{\partial w5}$$
Here, $\alpha$ is the learning rate.
![learning-rate](figs/learning-rate.png)
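
To see how $\alpha$ changes the behaviour of the update rule, here is a small sketch on the toy cost $C(w) = w^2$. The cost function and the three learning-rate values are assumptions chosen for illustration.

```python
def descend(alpha, w=1.0, steps=20):
    """Run gradient descent on C(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - alpha * 2 * w      # w_new = w - alpha * dC/dw
    return w

small = descend(alpha=0.01)   # slow progress toward the minimum at w = 0
good = descend(alpha=0.3)     # converges quickly
large = descend(alpha=1.1)    # overshoots more on every step and diverges
print(small, good, large)
```

After 20 steps the small rate has barely moved from the starting point, the moderate rate is essentially at the minimum, and the large rate has moved further away than where it started.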

# Batch gradient descent update
\begin{eqnarray}
w5_{new} &=& w5 - \alpha\cdot\dfrac{\partial E}{\partial w5}\\
&=& w5 - \alpha\cdot\left( \frac{1}{n}\sum_{i=1}^n E_{i,w5}\right)
\end{eqnarray}
Here, $E_{i,w5}$ denotes the partial derivative of the cost function $E$ with respect to $w5$ for the $i^{th}$ training sample. For the output weight $w5$ derived earlier, this is $(z_i-t_i)z_i(1-z_i)\,b1_i$.

# Stochastic gradient descent update
Here, the weights are updated after each sample is passed through the network.

![cats+dogs](figs/cat+dog.gif)


# Class 06 - Deep Feedforward Networks
#### Reading materials: Chapter 6 from the textbook for this class
![DNN](figs/ann-01.gvz.png)
* Also known as **Feedforward Neural Networks**, or **Multilayer Perceptrons (MLP)**, or **Deep Neural Networks**
* The goal of the network is to approximate some function $f^*$.
    * For example, a classifier $y=f^*(\mathbf{x})$ maps an input $\mathbf{x}$ to a class $y$.
* The network defines a mapping $y=f(\mathbf{x}; \mathbf{\theta})$ and learns the value of parameters $\mathbf{\theta}$ that result in the best function approximation.

![DNN](figs/ann-01.gvz.png)
* These models are called **feedforward** for a reason
    * Information flows through the function being evaluated from $x$, through the intermediate computations used to define $f$, and finally to the output $y$.
    * There are no feedback connections in which outputs of the model are fed back into itself.
    * Here, *PLEASE DO NOT confuse it with the backpropagation algorithm to train the parameters $\theta$ (i.e., the set of weights)*
* When feedforward neural networks are extended to include feedback connections, they are called **recurrent neural networks**
![RNN](figs/RNN.png)

# Learning the *so called* XOR problem
$x_1$ | $x_2$ | $y$
------ | ------| -----
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0


<img src="figs/xor-problem.png" width=400>

## Why can't a one perceptron Neural Network solve the XOR problem?
<img src="figs/xor-01.gvz.png">
* Because one perceptron (i.e., neuron) represents a single function that divides the input space into two parts with one linear decision boundary, whereas the XOR problem requires more than one boundary: no single straight line separates the points with $y=1$ from those with $y=0$.
$$ y = f(x_0w_0 + x_1w_1 + x_2w_2) = \text{sigmoid}(x_0w_0 + x_1w_1 + x_2w_2)$$
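
The claim can be probed empirically with a brute-force search. The sketch below tries a grid of weights $(w_0, w_1, w_2)$ for a single unit that outputs 1 when $w_0 + w_1x_1 + w_2x_2 > 0$ (equivalent to thresholding the sigmoid at 0.5). The grid range and resolution are arbitrary choices, and the AND function is included only as a linearly separable contrast.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
y_and = np.array([0, 0, 0, 1])

def best_accuracy(targets, grid=np.linspace(-2, 2, 21)):
    """Best number of correctly classified points over a grid of
    (w0, w1, w2) for the unit: output = 1 if w0 + w1*x1 + w2*x2 > 0."""
    best = 0
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        pred = (w0 + X[:, 0] * w1 + X[:, 1] * w2 > 0).astype(int)
        best = max(best, int(np.sum(pred == targets)))
    return best

print(best_accuracy(y_and), best_accuracy(y_xor))
```

The search finds weights that classify all four AND points correctly, but never more than three of the four XOR points.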


In [11]:
%matplotlib notebook
#Draw a sigmoid plot
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
    return 1/(1+np.exp(-x))
x = np.arange(start=-5, stop=5, step=0.1)
plt.plot(x,sigmoid(x))
plt.xlabel('x')
plt.ylabel('y=sigmoid(x)')
plt.show()

#### So, we need at least two hidden neurons to properly learn the XOR problem

In [1]:
import numpy as np

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

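def sigmoid_prime(a):
    # Derivative of the sigmoid, expressed in terms of its output a = sigmoid(x).
    # The network below passes activations (not net inputs) to this function.
    return a*(1.0-a)

def tanh(x):
    return np.tanh(x)

def tanh_prime(a):
    # Derivative of tanh, expressed in terms of its output a = tanh(x)
    return 1.0 - a**2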


In [16]:
class NeuralNetwork:

    def __init__(self, layers, activation='tanh'):
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime

        # Set weights
        self.weights = []
        # layers = [2,2,1]
        # range of weight values (-1,1)
        # input and hidden layers - random((2+1, 2+1)) : 3 x 3
        for i in range(1, len(layers) - 1):
            r = 2*np.random.random((layers[i-1] + 1, layers[i] + 1)) -1
            self.weights.append(r)
        # output layer - random((2+1, 1)) : 3 x 1
        # index from the end so this does not rely on the loop variable above
        r = 2*np.random.random((layers[-2] + 1, layers[-1])) - 1
        self.weights.append(r)

    def fit(self, X, y, learning_rate=0.2, epochs=100000):
        # Add column of ones to X
        # This is to add the bias unit to the input layer
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)
         
        for k in range(epochs):
            if k % 10000 == 0: print('epochs: %d'% k)
            
            i = np.random.randint(X.shape[0])
            a = [X[i]]

            for l in range(len(self.weights)):
                dot_value = np.dot(a[l], self.weights[l])
                activation = self.activation(dot_value)
                a.append(activation)
            # output layer
            #error = y[i] - a[-1]
            error = a[-1] - y[i]
            deltas = [error * self.activation_prime(a[-1])]

            # we need to begin at the second to last layer 
            # (a layer before the output layer)
            for l in range(len(a) - 2, 0, -1): 
                deltas.append(deltas[-1].dot(self.weights[l].T)*self.activation_prime(a[l]))

            # reverse
            # [level3(output)->level2(hidden)]  => [level2(hidden)->level3(output)]
            deltas.reverse()

            # backpropagation
            # 1. Multiply its output delta and input activation 
            #    to get the gradient of the weight.
            # 2. Subtract a ratio (percentage) of the gradient from the weight.
            for i in range(len(self.weights)):
                layer = np.atleast_2d(a[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] -= learning_rate * layer.T.dot(delta)

    def predict(self, x): 
        #a = np.concatenate((np.ones(1).T, np.array(x)), axis=1)      
        a = np.concatenate((np.array([[1]]), np.array([x])), axis=1)
        for l in range(0, len(self.weights)):
            a = self.activation(np.dot(a, self.weights[l]))
        return a

In [17]:
nn = NeuralNetwork([2,2,1])

X = np.array([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]])

y = np.array([0, 1, 1, 0])

nn.fit(X, y)

for e in X:
    # predict returns a 1 x 1 array; index it to get a scalar for %f
    print("[%d %d] => %f"%(e[0], e[1], nn.predict(e)[0, 0]))

epochs: 0
epochs: 10000
epochs: 20000
epochs: 30000
epochs: 40000
epochs: 50000
epochs: 60000
epochs: 70000
epochs: 80000
epochs: 90000
[0 0] => 0.025995
[0 1] => 0.990652
[1 0] => 0.987167
[1 1] => 0.000607


# Output units: Sigmoid vs Softmax
## Sigmoid output units are for Bernoulli output distributions
* Many tasks require predicting the value of a binary variable $y$.
    * $Pr(y=1 | \mathbf{x}) = \sigma(z) = \frac{1}{1+e^{-z}}$, where $z = w^Tx + b$
    * For this number to be a valid probability, it must lie in the interval [0,1]
    * Sigmoid activation unit: $\hat{y} = \sigma(w^Tx +b)$ converts $w^Tx+b$ into a probability.

## Softmax output units are for Multinoulli output distribution
* Anytime we wish to represent a probability distribution over a discrete variable with $n$ possible values, we use the softmax function.
    * It is a generalization of the sigmoid function.
    * In case of binary variables, we wish to output a single number, $\hat{y} = Pr(y=1|\mathbf{x})$
    * Generalizing the sigmoid to a discrete variable with $n$ values, we output a vector $\mathbf{\hat{y}}$ rather than a single number, with $ \hat{y}_i = Pr(y=i|\mathbf{x})$
        * Also, we need $\sum_{i=1}^n Pr(y=i|\mathbf{x}) = 1$
$$
\begin{align*}
\text{softmax}(\mathbf{z})_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}
\end{align*}
$$

In [12]:
import numpy as np
z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
softmax_out = np.exp(z) / np.sum(np.exp(z))

In [13]:
softmax_out

array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
       0.06426166, 0.1746813 ])

In [14]:
sum(softmax_out)

0.9999999999999999
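
The direct formula above can overflow when the scores are large. A common remedy, sketched here as one possible variant rather than the notebook's own implementation, is to subtract the maximum score before exponentiating, which leaves the result unchanged. The same sketch also checks the earlier statement that softmax generalizes the sigmoid: with two classes and scores $[x, 0]$, the first softmax output equals $\sigma(x)$.

```python
import numpy as np

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / np.sum(e)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
naive = np.exp(z) / np.sum(np.exp(z))
same = np.allclose(softmax(z), naive)   # shifting does not change the result

# Large scores overflow the naive formula but not the shifted one
big = softmax([1000.0, 1001.0, 1002.0])

# With two classes and scores [x, 0], softmax reduces to the sigmoid
x = 0.7
two_class = softmax([x, 0.0])[0]
print(same, big, two_class, sigmoid(x))
```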