# Softmax
## Forward Propagation
Softmax layer: the activation function for an output layer with more than one output; it turns the outputs into a probability for each class.

The number of output classes is $C$; the output layer has shape $(n^{[L]}, 1)$ with $n^{[L]} = C$

$z^{[L]} = W^{[L]}a^{[L-1]} + b^{[L]}$

$t = e^{z^{[L]}}$ (element-wise), to ensure every entry is positive, $(n^{[L]}, 1)$

$a^{[L]} = \frac{t}{\sum_{i=1}^{n^{[L]}}t_i}$, to ensure the probabilities sum to 1

Mathematically\
$$a_i = \frac{e^{z_i}}{\sum_{j=1}^ne^{z_j}}$$
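
The forward pass above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `softmax` and the example scores are made up, and subtracting `max(z)` before exponentiating is a numerical-safety step added here that the notes do not mention (it does not change the result, because the common factor cancels in the ratio).

```python
import math

def softmax(z):
    """Return the softmax of a list of real-valued scores z."""
    # Assumption (not in the notes): shift by max(z) so exp() cannot overflow.
    # The factor e^{-max(z)} cancels between numerator and denominator.
    z_max = max(z)
    t = [math.exp(z_i - z_max) for z_i in z]  # t = e^z, every entry positive
    total = sum(t)
    return [t_i / total for t_i in t]         # normalise so the entries sum to 1

a = softmax([2.0, 1.0, 0.1])  # hypothetical scores z for three classes
print(a)
print(sum(a))
```

The largest score receives the largest probability, and the entries always sum to 1 up to floating-point rounding.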

## Back Propagation
Loss function:
$$L(a,y) = -\sum_{i=1}^{n}y_i\ln(a_i)$$
where $y_i$ is the label for class $i$.\
The derivative with respect to each $a_i$ is
$$\frac{\partial L}{\partial a_i} = -\frac{y_i}{a_i}$$
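
As a small numeric illustration of the loss and its component-wise derivative $\frac{\partial L}{\partial a_i} = -\frac{y_i}{a_i}$, the sketch below uses made-up probabilities `a` and a one-hot label `y`. The function names are hypothetical helpers, not part of any library.

```python
import math

def cross_entropy(a, y):
    """L(a, y) = -sum_i y_i * ln(a_i) for probabilities a and labels y."""
    # Terms with y_i == 0 contribute nothing, so skip them (this also avoids ln(0)).
    return -sum(y_i * math.log(a_i) for a_i, y_i in zip(a, y) if y_i != 0)

def cross_entropy_grad(a, y):
    """Component-wise derivative dL/da_i = -y_i / a_i."""
    return [-y_i / a_i for a_i, y_i in zip(a, y)]

a = [0.7, 0.2, 0.1]  # example softmax output (made-up values)
y = [1, 0, 0]        # one-hot label: class 0 is the correct one
loss = cross_entropy(a, y)
grad = cross_entropy_grad(a, y)
print(loss)
print(grad)
```

With a one-hot label only the correct class contributes, so the loss reduces to $-\ln(a_{correct})$ and the gradient is non-zero only in that component.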

### Derivative of Softmax Function
Compute the derivative of each $a_i$ with respect to a single $z_j$.\
The result differs between the cases $i = j$ and $i \neq j$.\
When $i = j$:
$$\frac{\partial a_j}{\partial z_j}=\frac{e^{z_j}\sum_ke^{z_k}-(e^{z_j})^2}{(\sum_ke^{z_k})^2}=\frac{e^{z_j}}{\sum_ke^{z_k}}-(\frac{e^{z_j}}{\sum_ke^{z_k}})^2 = a_j(1-a_j)$$
When $i \neq j$:
$$\frac{\partial a_i}{\partial z_j} = -\frac{e^{z_i}}{(\sum_ke^{z_k})^2}e^{z_j} = -\frac{e^{z_i}}{\sum_ke^{z_k}}\frac{e^{z_j}}{\sum_ke^{z_k}} = -a_ia_j$$
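
The two cases can be checked numerically. The sketch below builds the matrix $J_{ij} = \frac{\partial a_i}{\partial z_j}$ from the formulas above and compares it with a central finite-difference estimate. The helper names, the example scores, and the step size `eps` are assumptions chosen for illustration.

```python
import math

def softmax(z):
    z_max = max(z)  # shift for numerical stability (does not change the result)
    t = [math.exp(v - z_max) for v in z]
    s = sum(t)
    return [v / s for v in t]

def softmax_jacobian(z):
    """J[i][j] = da_i/dz_j = a_i*(1 - a_i) if i == j else -a_i*a_j."""
    a = softmax(z)
    n = len(a)
    return [[a[i] * (1 - a[i]) if i == j else -a[i] * a[j]
             for j in range(n)] for i in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central finite-difference estimate of the same matrix."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        a_plus, a_minus = softmax(z_plus), softmax(z_minus)
        for i in range(n):
            J[i][j] = (a_plus[i] - a_minus[i]) / (2 * eps)
    return J

z = [2.0, 1.0, 0.1]  # hypothetical scores
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
print(max_err)  # should be tiny if the formulas are right
```

Each column of `J` sums to zero, which reflects the fact that the probabilities always sum to 1 however $z_j$ changes.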

### Combination
Thus, the derivative of the loss with respect to $z_j$ follows from the chain rule, summing over every $a_i$ that depends on $z_j$.
$$\frac{\partial L}{\partial z_j} = \frac{\partial L}{\partial a_j}\frac{\partial a_j}{\partial z_j} + \sum_{i \neq j}\frac{\partial L}{\partial a_i}\frac{\partial a_i}{\partial z_j}$$
$$\frac{\partial L}{\partial z_j} = (-\frac{y_j}{a_j})\ a_j(1-a_j) + \sum_{i \neq j}(-\frac{y_i}{a_i})(-a_ia_j)$$
$$\frac{\partial L}{\partial z_j} = y_ja_j - y_j + \sum_{i \neq j}y_ia_j$$
$$\frac{\partial L}{\partial z_j} = \sum_{i}y_ia_j - y_j$$
$$\frac{\partial L}{\partial z_j} = a_j\sum_{i}y_i - y_j$$
Since $\sum_{i}y_i = 1$,
$$\frac{\partial L}{\partial z_j} = a_j - y_j$$

Note: $y$ is one-hot, i.e. $y_i = 1$ for the correct class and $0$ for the others, which is why $\sum_{i}y_i = 1$.
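
The result $\frac{\partial L}{\partial z_j} = a_j - y_j$ can be sanity-checked with a gradient check. The sketch below compares it against central finite differences of the loss taken directly as a function of $z$. The scores, the label, and the step size are made-up values for illustration.

```python
import math

def softmax(z):
    z_max = max(z)  # shift for numerical stability (does not change the result)
    t = [math.exp(v - z_max) for v in z]
    s = sum(t)
    return [v / s for v in t]

def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot label y."""
    a = softmax(z)
    return -sum(y_i * math.log(a_i) for a_i, y_i in zip(a, y))

z = [2.0, 1.0, 0.1]  # hypothetical scores
y = [0, 1, 0]        # one-hot label: class 1 is the correct one
a = softmax(z)

# Analytic gradient from the derivation: dL/dz_j = a_j - y_j
analytic = [a_j - y_j for a_j, y_j in zip(a, y)]

# Numeric gradient by central finite differences
eps = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = list(z)
    z_plus[j] += eps
    z_minus = list(z)
    z_minus[j] -= eps
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))

max_err = max(abs(p - q) for p, q in zip(analytic, numeric))
print(analytic)
print(max_err)
```

The gradient is negative only for the correct class, so a gradient descent step raises that score and lowers the others.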

# tanh
$$\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}$$
$$\frac{d\tanh(x)}{dx} = \frac{(e^x+e^{-x})^2 - (e^x-e^{-x})^2}{(e^x+e^{-x})^2}$$
$$\frac{d\tanh(x)}{dx} = 1 - \left(\frac{e^x-e^{-x}}{e^x+e^{-x}}\right)^2$$
$$\frac{d\tanh(x)}{dx} = 1 - \tanh^2(x)$$
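
A quick numeric check of the identity $\frac{d\tanh(x)}{dx} = 1 - \tanh^2(x)$, comparing it with a central finite-difference estimate at a few sample points. The helper names, the sample points, and the step size are arbitrary choices for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh from the closed form 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]  # arbitrary sample points
errors = [abs(tanh_grad(x) - numeric_grad(math.tanh, x)) for x in xs]
print(max(errors))  # should be tiny if the identity holds
```

The derivative is largest at $x = 0$ and shrinks toward 0 as $|x|$ grows, which is where tanh saturates.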

# Framework
## Criteria for choosing a framework
1. Ease of programming
2. Training speed
3. Truly open (open source with good governance)

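## TensorFlow
TensorFlow computes back propagation automatically; only the forward computation (the cost) has to be defined.\
Example (TensorFlow 1.x graph API): use gradient descent to find the $w$ that minimizes the cost function $J = w^2 - 10w + 25 = (w - 5)^2$, whose minimum is at $w = 5$.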

In [1]:
import numpy as np
import tensorflow as tf

In [None]:
# Coefficients of J as a (3, 1) column vector: J = 1*w^2 - 10*w + 25
coefficients = np.array([[1.], [-10.], [25.]])

w = tf.Variable(0, dtype=tf.float32)  # parameter to optimize, initialized to 0
#cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
#cost = w**2 - 10*w + 25 # same cost, written with overloaded operators
x = tf.placeholder(tf.float32, [3, 1]) # use a placeholder to be able to feed different data into the model
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost) # 0.01 is the learning rate

init = tf.global_variables_initializer()

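# Keep the session open so that the cells below can reuse it.
# A `with tf.Session() as session:` block would close the session when the
# block ends, and the later session.run(...) calls would then fail.
session = tf.Session()
session.run(init)
print(session.run(w))  # value of w before any training step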

In [None]:
# One gradient descent step
session.run(train, feed_dict={x:coefficients}) # feed x with coefficients
print(session.run(w))

In [None]:
# 1000 gradient descent steps
for i in range(1000):
    session.run(train, feed_dict={x:coefficients})
print(session.run(w))  # w approaches 5, the minimizer of J
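
To see what the optimizer is doing, the same update can be written by hand in plain Python. This is a sketch under the assumption that plain gradient descent applies $w \leftarrow w - \alpha \frac{dJ}{dw}$ at each step. Here the derivative $\frac{dJ}{dw} = 2w - 10$ is coded manually, whereas TensorFlow derives it automatically from the definition of the cost. The function names are illustrative.

```python
def cost(w):
    """J(w) = w^2 - 10w + 25 = (w - 5)^2."""
    return w ** 2 - 10 * w + 25

def cost_grad(w):
    """dJ/dw, derived by hand; TensorFlow would compute this automatically."""
    return 2 * w - 10

w = 0.0               # same starting point as the tf.Variable above
learning_rate = 0.01  # same learning rate as the optimizer above

# One gradient descent step
w -= learning_rate * cost_grad(w)
w_after_one_step = w

# 1000 further gradient descent steps
for _ in range(1000):
    w -= learning_rate * cost_grad(w)

print(w_after_one_step)
print(w)
print(cost(w))
```

Each step multiplies the remaining distance to $w = 5$ by $0.98$, so the first step moves $w$ from 0 to about 0.1, and after the additional 1000 steps $w$ is very close to 5.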