# Training Introduction 
* We previously went over the architecture of a neural network, mainly how to pass data from input to output and end up with a probabilistic prediction
* The problem was that all of our weights were random!
* And thus, our predictions were not accurate!

![main%20nn%20equation.png](attachment:main%20nn%20equation.png)

* In this section we are going to focus on how to change our weights so that our predictions are accurate
* Recall, our training data will consist of input and target pairs
* the goal is to make the predictions (y) as close to the targets (t) as possible 
* To do this we will create a **cost function** such that:
    * the **cost is large** when the prediction is not close to the target
    * The **cost is small** when the prediction is close to the target
* we are trying to make our cost as small as possible!
* Recall: the method that we used to achieve this is called gradient descent 
* This is a bit harder with neural networks, because the equations are a bit more complex, but no new skills are required

# Gradient Descent in NNs: Backpropagation
* Gradient descent in neural networks has a special name: **backpropagation**
* Backpropagation is recursive in nature, which allows us to find the gradient in a systematic way
* This recursiveness allows us to do backpropagation for neural networks that are arbitrarily deep, without much added complexity
* E.g. a 100 layer ANN is no harder than a 1-hidden layer ANN

![gradient%20descent%20plot.png](attachment:gradient%20descent%20plot.png)

---
# What do all of these symbols and letters mean?
## Training Data
* Training inputs: X
* Training targets: Y
* Generally speaking, these are both matrices
* X is of shape N x D
    * N = number of samples
    * D = number of input features
* Y is of shape N x 1 
    * aka a column vector
    * a 2-d object 
* Alternatively, Y can just be a 1-D vector of length N
    * this is how we will represent it in Numpy
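
To make the shapes above concrete, here is a minimal NumPy sketch. The sizes `N = 5` and `D = 3` and the variable name `Y_column` are hypothetical, chosen only for illustration.

```python
import numpy as np

N, D = 5, 3  # hypothetical sizes: 5 samples, 3 input features

rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))   # training inputs, shape (N, D)
Y_column = np.zeros((N, 1))   # targets as an (N, 1) column vector (a 2-D object)
Y = Y_column.ravel()          # targets as a 1-D array of length N

print(X.shape, Y_column.shape, Y.shape)  # (5, 3) (5, 1) (5,)
```

The 1-D form is usually the more convenient one to index and compare against predictions.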
    
## Training Data and Predictions
* Inputs: X, Targets: Y, Predictions: p(Y|X)
* p(Y|X) represents a full probability distribution over all individual values in the matrix Y, given the matrix X
* p(Y|X) is therefore also a matrix, same size as Y
* p(y = k | x) is a probability value - a single number 
    * Representing the probability that y is of class k, given the input vector x 
* Note: 
    * Capital letters usually represent matrices
    * lowercase letters usually represent vectors

## p(Y | X) is inconvenient
* p(Y | X) is inconvenient to write 
* many characters
* variable names cannot contain spaces or parentheses
* so we resort to things like:
    * p_y_given_x
    * py_x
    * Py
* none are really ideal
* Old school alternative for predictions is to write:
## $$\hat{y}$$
* notice that this still cannot represent a variable in code...

## Another convention
* This is where things start to get confusing
* an alternative way to represent inputs, targets, and predictions is:
    * Inputs: X
    * Targets: T
    * Predictions: Y
* this is beneficial since we no longer need to write out P of Y given X anywhere
* however this Y now conflicts with our other Y- before Y meant the targets, now it means the predictions

## Using Context
* we will need to use context to determine what Y really is 
* if we see Y and T at the same time, it should be clear that Y is a prediction
* if you see Y being assigned the output of a neural network, it's a prediction
* but if you see Y and Yhat, or Y and p_y_given_x, at the same time, Y is a target 

## Weights

![weights%20diagram.png](attachment:weights%20diagram.png)

* we have some conventions for the sizes of things
* N is the number of samples we have collected in our experiment
* D is the number of features, which is the size of the input layer in the neural network
* M represents the size of the hidden layer
* K represents the size of the output layer 
* K is the number of output classes, and can be anything from 2 and larger 
* when we have a 1 hidden layer neural network, one way to name the weights is as follows:
    * W is the weight matrix from the input to hidden layer - (D x M)
    * b is the bias term at hidden layer - (M x 1)
    * V is the weight matrix from the hidden to output layer - (M x K)
    * c is the bias term at the output layer - (K x 1)
* you can imagine that if we start adding more hidden layers, we are going to run out of letters! So using V and c isn't really a great option.
* How about just numbering our W's and b's?
    * W1, b1, W2, b2, ... and so on
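
One way to sketch the numbered naming scheme in code is to keep the weights and biases in lists, so adding layers does not require new letters. The layer sizes below are hypothetical, and the biases are stored as 1-D arrays (length M or K) rather than (M x 1) columns, which is an assumption made here because it is convenient in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]  # hypothetical: D = 4 inputs, M = 5 hidden units, K = 3 classes

weights = []
biases = []
# Each consecutive pair of sizes defines one weight matrix and one bias vector
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights.append(rng.normal(size=(n_in, n_out)))
    biases.append(np.zeros(n_out))

W1, W2 = weights  # W1 is (D x M), W2 is (M x K)
b1, b2 = biases   # b1 has length M, b2 has length K

print(W1.shape, b1.shape, W2.shape, b2.shape)
```

Appending another entry to `layer_sizes` would produce `W3` and `b3` without any other change.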

## Indices
* we may or may not put indices in different places if they represent different things
* Example: if we are looking for the target T for the nth sample and kth class
* In Numpy: T[n,k]
* T(n,k)
* $T^n_k$
* $T_{nk}$
* $t^n_k$
* $t_{nk}$

## Indexing
* i, j, and k are common letters we use for indices
* Ex. i=1...D (input layer)
* Ex. j=1...M (hidden layer)
* Ex. k=1...K (output layer)
* the problem with i, j, and k is what happens if we have more than 1 hidden layer, and eventually run out of letters
* we then pick an index outside of these 3 current letters:
    * q = 1...Q
    * r = 1...R
    * s = 1...S

## Learning Rate
* greek letters: alpha or eta

![learning%20rate.png](attachment:learning%20rate.png)

## Cost/Objective/Error Function
* Typical letters: E or J
* Cost or error: usually means something we want to minimize
* Objective: can be something to minimize or maximize 
* Probabilistic interpretation of cost: negative log-likelihood
* we are trying to maximize the log-likelihood, or equivalently minimize the negative log-likelihood
* minimizing E is the same as maximizing -E
* So minimizing the negative log-likelihood (gradient **descent**) is the same as maximizing the log-likelihood (and likelihood) (gradient **ascent**)

## Likelihood
* Typically we use the uppercase L for likelihood, lowercase l for log-likelihood, if they are both presented together
* If discussing log-likelihood or negative log-likelihood by itself, we might just use L since L is easy to see, and l can be confused for I

---
# What does it mean to train a Neural Network? 
* this is going to be very similar to logistic regression

## The Main Concepts
* we very intuitively define something called the "cost"
* we want to minimize the cost!  
* But how do we minimize the cost? This falls into the domain of calculus! Calculus provides the tools to find the min/max of a function!
* we specifically use a method called **gradient descent**

## How do we define cost? 
* recall that for binary classification, this is exactly how we would calculate the likelihood of a sequence of coin tosses
* So for example, say we flip 2 heads and 3 tails
* Because these are independent trials, the total likelihood is then:
### $$Likelihood = p(H)p(H)p(T)p(T)p(T)$$
* again, the reason we can multiply these probabilities is because each coin toss is independent of the others
* another way to write this is to call:
### $$p = p(H)$$
* and hence we can rewrite likelihood as:
### $$Likelihood = p^{number \; of\; heads}(1-p)^{number \;of \; tails}$$

## Minimize or Maximize?
* the likelihood, in other words the probability of our data given our model/parameters, is something we want to maximize
* but recall that we are looking for a cost, in other words, something to minimize
* In order to get something that we can call the cost, we take the negative log of the likelihood and call it the "cost"
* Negative log likelihood = $-\{\#H \log p + \#T \log(1-p)\}$, where #H and #T are the number of heads and tails
* recall from logistic regression that this is called the cross entropy cost function
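
The coin-toss example can be sketched in a few lines. The function names are hypothetical and natural logarithms are assumed. The grid search at the end is only there to show that the negative log-likelihood is smallest near the observed fraction of heads (2 out of 5).

```python
import numpy as np

def coin_likelihood(p, n_heads, n_tails):
    # p^(number of heads) * (1 - p)^(number of tails)
    return p ** n_heads * (1 - p) ** n_tails

def coin_negative_log_likelihood(p, n_heads, n_tails):
    # -(#H log p + #T log(1 - p))
    return -(n_heads * np.log(p) + n_tails * np.log(1 - p))

# Evaluate the cost for 2 heads and 3 tails on a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 99)
nll = coin_negative_log_likelihood(p_grid, 2, 3)
best_p = p_grid[np.argmin(nll)]

print(best_p)
```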

## Cross-Entropy
* we can phrase it in terms of the output probability of the logistic regression model, and the targets
* $y_n$ = output of logistic regression or neural network
* $t_n$ = actual target (0 or 1) in the binary case 

![cross%20entropy%20cost.png](attachment:cross%20entropy%20cost.png)

* notice that if we had a neural network doing binary classification, we would use this exact same cost function
* recall that in order to find the best weights to minimize this cost, we can use gradient descent 
* we can also maximize the negative of this, gradient ascent
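
As a rough sketch, the binary cross entropy can be written directly in NumPy. The function name and the example targets and predictions are made up for illustration, and the predictions are assumed to lie strictly between 0 and 1 so the logarithms are finite.

```python
import numpy as np

def binary_cross_entropy(t, y):
    # J = -sum_n [ t_n log(y_n) + (1 - t_n) log(1 - y_n) ]
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1, 0])               # targets
good = np.array([0.9, 0.1, 0.8, 0.2])    # predictions close to the targets
bad = np.array([0.1, 0.9, 0.2, 0.8])     # predictions far from the targets

print(binary_cross_entropy(t, good))
print(binary_cross_entropy(t, bad))
```

The cost is small for the predictions that are close to the targets and large for the ones that are far away, which is exactly the behavior we asked for from a cost function.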

## Cross Entropy for Multi-class Classification
* in this section we want to be able to handle any number of outputs 
* lets consider a die roll (6 faces, but lets call it K)
* the probability of rolling k = $y_k$
* $t_k$ = 1 if we roll k, 0 if we do not roll k
* we have N total die rolls, so $t_{n,k} =1$ if we rolled k on the nth die roll 
* therefore, only one of the $t_{n,k}$ can equal 1 for any particular n
    * $t_{n,k}$ is thus an indicator matrix or one hot encoded matrix of 1s and 0s
    
![multi%20class%20likelihood.png](attachment:multi%20class%20likelihood.png)

* notice that y and t now have two indexes each:
    * n corresponds to which sample we are looking at 
    * and k corresponds to which class we are looking at 
* notice that for any particular n, only 1 of the k targets can be 1, and the rest must be 0
* that is because if we roll a die and get a 6, then that same die roll can't be any other number 
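
The indicator (one hot encoded) matrix described above can be built as follows. This is a small sketch: the helper name `one_hot` is hypothetical, and the die faces are stored as zero-based labels 0 to 5, which is an assumption made to match NumPy indexing.

```python
import numpy as np

def one_hot(labels, K):
    # Build an (N x K) matrix with a single 1 per row, at the column of the label
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1
    return T

rolls = np.array([5, 0, 2, 5])  # four die rolls: faces 6, 1, 3, 6 as zero-based labels
T = one_hot(rolls, 6)

print(T)
print(T.sum(axis=1))  # each row sums to 1: only one class per roll
```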

## Cross Entropy for Multi-class Classification

![cross%20entropy%20for%20multiclass%20classification.png](attachment:cross%20entropy%20for%20multiclass%20classification.png)

* What is cross entropy: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
* we again want to transform this into a cost, so we will take the negative log of the likelihood
* this is called the cross entropy cost function, but for multiclass classification
* next we will see how to perform gradient descent on the new cost function and how to write it in code! 
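
A minimal sketch of the multiclass cross entropy cost, assuming `T` is an (N x K) indicator matrix and `Y` is an (N x K) matrix of predicted probabilities that are all strictly positive. The function name and the example values are hypothetical.

```python
import numpy as np

def cross_entropy(T, Y):
    # J = -sum_n sum_k t_{n,k} log(y_{n,k})
    return -np.sum(T * np.log(Y))

T = np.array([[0, 0, 1],
              [1, 0, 0]])          # sample 1 is class 3, sample 2 is class 1
Y = np.array([[0.1, 0.2, 0.7],
              [0.6, 0.3, 0.1]])    # predicted probabilities, rows sum to 1

J = cross_entropy(T, Y)
print(J)  # equals -(log(0.7) + log(0.6))
```

Because the targets are one hot encoded, only the log probability of the correct class contributes for each sample.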

---
# Backpropagation Intro 
* Lets start by recalling how we trained the logistic regression model

## Logistic Regression Recap
* we start with an objective function, and for binary classification we use cross entropy 

![total%20cross%20entropy%20error.png](attachment:total%20cross%20entropy%20error.png)

* we know that this objective function has a global minimum, so it looks something like a parabola 
* now we usually will randomize our weights initially, and then we slowly move towards the minimum in small steps
* we find the direction for the minimum using the gradient, $\frac{dJ}{dw}$

![cost%20minimization.png](attachment:cost%20minimization.png)

* we end up with an update rule:
### $$w \leftarrow w - \alpha\frac{dJ}{dw}$$
* where $\alpha$ is the learning rate 
* this process is called gradient descent
* You can also do gradient ascent, where the goal is to find a global maximum
### $$w \leftarrow w + \alpha\frac{dJ}{dw}$$
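
To illustrate the descent rule, here is a minimal sketch on a toy cost $J(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum is at $w = 3$. The toy cost, the function name, the learning rate, and the number of steps are all assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    # Repeatedly take a small step against the gradient: w = w - alpha * dJ/dw
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
    return w

# Gradient of the toy cost J(w) = (w - 3)^2
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=-5.0)
print(w_min)  # very close to 3
```

Flipping the sign of the step (adding instead of subtracting) turns the same loop into gradient ascent.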

## Neural Networks Gradient Descent
* we are going to do the exact same process with neural networks! However, because they are nonlinear, the cost is no longer guaranteed to have a single minimum, so we are going to find a local minimum, not necessarily the global minimum
* additionally, because the weight updates are dependent on the error at multiple outputs, we are going to need the concepts of total derivatives

### Total Derivatives
* suppose you have a function of x and y, $f(x,y)$, where x is a function of t, $x(t)$, and y is a function of t, $y(t)$; f is then a parameterized function of t
* The goal is to take the **total derivative**
### $$\frac{df}{dt}$$
* To do this we use the **chain rule**
### $$\frac{df}{dt}= \frac{\partial f}{\partial x}\frac{dx}{dt} +\frac{\partial f}{\partial y}\frac{dy}{dt}$$
* now if you had a vector x, which has k components, and they are all parameterized by t, you can imagine that you would do the same thing, using a summation
### $$\frac{df}{dt} = \sum_k\frac{\partial f}{\partial x_k}\frac{dx_k}{dt}$$
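
The chain rule for total derivatives can be checked numerically. The sketch below uses a made-up example, $f(x, y) = xy$ with $x(t) = \cos t$ and $y(t) = t^2$, and compares the chain rule result with a central finite difference. All names and the choice of functions are assumptions for illustration.

```python
import numpy as np

def f(x, y):
    return x * y

def x_of(t):
    return np.cos(t)

def y_of(t):
    return t ** 2

def total_derivative(t):
    # df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt)
    x, y = x_of(t), y_of(t)
    df_dx, df_dy = y, x                  # partial derivatives of f = x * y
    dx_dt, dy_dt = -np.sin(t), 2 * t
    return df_dx * dx_dt + df_dy * dy_dt

def numerical_derivative(t, eps=1e-6):
    # Central finite difference of f(x(t), y(t)) with respect to t
    return (f(x_of(t + eps), y_of(t + eps)) - f(x_of(t - eps), y_of(t - eps))) / (2 * eps)

t0 = 1.3
print(total_derivative(t0), numerical_derivative(t0))
```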

## Objective function with Softmax
* basically this is the same as the likelihood of rolling a die
* so if you were to roll a die N times, with independent and identically distributed rolls, the likelihood would be:
### $$Likelihood = \prod_{n=1}^N\prod_{k=1}^6 (\theta_k^n)^{t_k^n}$$
* so with neural networks this is exactly the same, with the network outputs $y_k^n$ playing the role of the probabilities $\theta_k^n$:
### $$P(targets\;|\;inputs, weights) = P(targets\;|\;X,W,V) = \prod_{n=1}^N\prod_{k=1}^K (y_k^n)^{t_k^n}$$
* so we are going to work with the log likelihood, not the negative log likelihood, and do gradient ascent instead of descent
* so lets take the log likelihood
### $$\sum_n\sum_kt_k^nlogy_k^n$$
* now that we have our objective function, what do we do with it?
* it is the same idea as with logistic regression! We want to find the derivative with respect to certain weights. 
* we have 2 sets of weights for a 1 hidden layer NN (W and V), so we need 2 derivatives; the layer sizes are D, M, and K, and they are indexed by d, m, k

![1%20hidden%20layer%20diagram.png](attachment:1%20hidden%20layer%20diagram.png)

* so we want to find these derivatives: 
### $$\frac{dJ}{dV_{mk}}$$
### $$\frac{dJ}{dW_{dm}}$$
* Note that in these derivatives, J can be thought of as our **error**
* we are trying to find how the error (cost, J) changes as we change our weights!
* because we are doing backpropagation, we are going to find $\frac{dJ}{dV_{mk}}$ first, because it is on the right, followed by backpropagating the error, and then we will find $\frac{dJ}{dW_{dm}}$
* this can be done using the chain rule 
### $$\frac{dJ}{dV_{mk}}= \sum_n\sum_{k'}t_{k'}^n\frac{1}{y_{k'}^n}\frac{dy_{k'}^n}{dV_{mk}}$$
* now the question is, how do we find: $\frac{dy_{k'}^n}{dV_{mk}}$?
* in other words...
### How do we find the derivative of softmax?
### $$y_k=\frac{e^{a_k}}{\sum_je^{a_j}}$$
* where the activation, $a_k$, is just the dot product of the hidden layer values and the weights
### $$a_k= V_k^TZ$$
* so we want to find just the derivative of the softmax first
    * if k == k'
### $$\frac{dy_{k'}}{da_k} = y_{k'}(1-y_k)$$  
    * if k != k'
### $$\frac{dy_{k'}}{da_k} = -y_{k'}y_k$$  
* these can be combined using the kronecker delta
    * if i == j
### $$\delta_{ij} = 1$$    
    * if i != j
### $$\delta_{ij} = 0$$        
* so the derivative is...
### $$\frac{dy_{k'}}{da_k} = y_{k'}(\delta_{kk'}-y_k)$$  
* we also know from the dot product that the derivative of the activation is just $z_m$
### $$\frac{da_k}{dV_{mk}}=z_m$$

## Combine
### $$\frac{dJ}{dV_{mk}}= \sum_n(t_k^n-y_k^n)z_m$$
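
The combined result can be sanity checked numerically. In matrix form, summing over n for every (m, k) pair is `Z.T @ (T - Y)`. The sketch below compares that with a finite difference of the log likelihood $\sum_n\sum_k t_k^n \log y_k^n$. The sizes, random data, and function names are assumptions for illustration, and the bias term is left out, as in the derivation above.

```python
import numpy as np

def softmax(A):
    # Row-wise softmax; subtracting the row max does not change the result
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def log_likelihood(T, Z, V):
    return np.sum(T * np.log(softmax(Z @ V)))

def grad_V(T, Z, V):
    # dJ/dV_{mk} = sum_n (t_k^n - y_k^n) z_m^n, for all m and k at once
    Y = softmax(Z @ V)
    return Z.T @ (T - Y)

rng = np.random.default_rng(0)
N, M, K = 6, 4, 3                           # hypothetical sizes
Z = rng.normal(size=(N, M))                 # hidden layer values
V = rng.normal(size=(M, K))                 # hidden-to-output weights
T = np.eye(K)[rng.integers(0, K, size=N)]   # random one hot targets

# Finite difference approximation, one weight at a time
eps = 1e-6
numeric = np.zeros_like(V)
for m in range(M):
    for k in range(K):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[m, k] += eps
        V_minus[m, k] -= eps
        numeric[m, k] = (log_likelihood(T, Z, V_plus) - log_likelihood(T, Z, V_minus)) / (2 * eps)

print(np.max(np.abs(grad_V(T, Z, V) - numeric)))
```

Since this is the gradient of the log likelihood, the weights would be updated by gradient ascent, that is, by adding a small multiple of `grad_V`.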

## Derivative with respect to inputs to hidden weights

---
<br></br>
# 1. Backpropagation Walkthrough
At this point I highly recommend that you go through my backpropagation walkthrough in the appendix! It goes through the backpropagation process in great detail, with several measures taken to simplify the process and help learn the basics. If you have just come from that walk through, here are the main changes that we are about to encounter:
> * Instead of binary classification, we are now going to be doing multiclass classification, and making use of the softmax, instead of the sigmoid, at the output layer. Recall, softmax is defined as:
#### $$y_k=\frac{e^{a_k}}{\sum_je^{a_j}}$$
Where $k$ indexes the classes in the output layer. In other words, our $y$ output is going to be a **(K x 1)** vector for a single training example, and an **(N x K)** matrix when computed for the entire training set. 
* We are now going to be using the **cross entropy error** as our **objective function**, instead of the least squares error. Recall, **cross entropy error** for **binary classification** is:
#### $$Cost = J = -\sum_{n=1}^N\Big[t_nlog(y_n)+(1-t_n)log(1-y_n)\Big]$$
And for **multi-class classification**:
#### $$Cost = J = -\sum_{n=1}^N\sum_{k=1}^Kt_{n,k}logy_{n,k}$$
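
As a sketch, the softmax can be implemented row by row in NumPy. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the definition above. It leaves the result unchanged because the same factor cancels in the numerator and denominator. The function name and example activations are hypothetical.

```python
import numpy as np

def softmax(A):
    # A holds one row of activations per sample; output rows sum to 1
    A = np.atleast_2d(A)
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

A = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])  # large values would overflow without the shift
Y = softmax(A)

print(Y)
print(Y.sum(axis=1))
```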

<br></br>
## 1.1 Problem Setup
Lets take a minute to clearly define the problem that we are going to be working with. We are going to be investigating a **neural network** with **1 hidden layer**, that will be using the **softmax** at the output. As an **error/cost function** we will be using **cross entropy**. The overall architecture of our neural network will look like:

![nn%20diagram.png](attachment:nn%20diagram.png)

<br></br>
### 1.1.1 Notation 
It is very important to have a clear understanding of the notation and matrix sizes that will be going along with this walk through. Let's define them now:
> * **N**: is the number of samples we have collected in our experiment
* **D**: the number of input features, which is the size of the input layer in the neural network. It is indexed by **d**.
* **M**: represents the size of the **hidden layer**. It is indexed by **m**.
* **K**: represents the size of the **output layer**. It is indexed by **k**.
* **K**: number of **output classes**, and can be anything from 2 and larger
* **t**: this is our target for a given training example
* **y**: this is the output probability from our network, aka our prediction
* **a**: the activation, the value that goes into a node (a linear combination). This can be used at any layer, but we will use it specifically for the output layer, right before the softmax. 
* Since we will be performing matrix operations and want to vectorize our implementation, we will look at all training examples at the same time. Hence, our layers have the following dimensions:
    * **X**, the **input layer**, is an **(N x D)** matrix
    ![x%20input%20matrix.png](attachment:x%20input%20matrix.png)
    * **Z**, the **hidden layer**, is an **(N x M)** matrix
    ![hidden%20layer%20matrix.png](attachment:hidden%20layer%20matrix.png)
    * **Y**, the **output layer**, is an **(N x K)** matrix
    ![output%20layer%20matrix.png](attachment:output%20layer%20matrix.png)
* In our example (1 hidden layer neural network), we will name the **weights** as follows:
    * **W** is the weight matrix from the **input** to **hidden layer**, it is **(D x M)**
    ![input%20to%20hidden%20layer%20matrix.png](attachment:input%20to%20hidden%20layer%20matrix.png)
    * **b** is the bias term at hidden layer, it is **(M x 1)**
    * **V** is the weight matrix from the **hidden** to **output layer**, it is **(M x K)**
    ![hidden%20to%20output%20matrix.png](attachment:hidden%20to%20output%20matrix.png)
    * **c** is the bias term at the output layer, it is **(K x 1)**
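
The dimensions listed above can be checked with a small feed forward sketch. It assumes a sigmoid at the hidden layer and a softmax at the output. The sizes and random values are hypothetical, and the biases are stored as 1-D arrays so that NumPy broadcasting adds them to every row.

```python
import numpy as np

def sigmoid(A):
    return 1 / (1 + np.exp(-A))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def forward(X, W, b, V, c):
    Z = sigmoid(X @ W + b)   # hidden layer, (N x M)
    Y = softmax(Z @ V + c)   # output layer, (N x K)
    return Z, Y

rng = np.random.default_rng(0)
N, D, M, K = 8, 2, 3, 4          # hypothetical sizes
X = rng.normal(size=(N, D))      # inputs, (N x D)
W = rng.normal(size=(D, M))      # input-to-hidden weights, (D x M)
b = np.zeros(M)                  # hidden bias, length M
V = rng.normal(size=(M, K))      # hidden-to-output weights, (M x K)
c = np.zeros(K)                  # output bias, length K

Z, Y = forward(X, W, b, V, c)
print(Z.shape, Y.shape)
```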

<br></br>
## 1.2 Starting Point and Overall Goal
Okay great, we now have the problem fully described and drawn out, understanding how each component looks. Lets take a minute to go over where we are starting when we begin backpropagation, and then clearly define the goals that we are trying to achieve. 

Backpropagation is going to occur *after* a prediction, which is made using the **feed forward** method. So we are picking up with a set of predictions, and we are going to determine the amount of error in those predictions using the **cross entropy loss**. Our goal is to figure out how to change the weights so as to minimize the cross entropy loss. So, to sum up, when we begin, we already know:

The values of the nodes in the **hidden layer z**:
#### $$z = \sigma(W^Tx)$$
The values of the nodes in the **output layer**:
#### $$y = softmax(V^Tz)$$
Keep in mind that in the above equation, $y$ is a $K$ dimensional vector. Each entry $y_k$ represents the probability that the training example belongs to class $k$. 
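Finally, we know the value of the **cross entropy loss**. Note that, as with the softmax objective function earlier, it is written here without the leading minus sign (that is, as the log likelihood), so its sign must be flipped to recover the cost: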
#### $$J = \sum_{k=1}^Kt_klogy_k$$
Remember, the easiest way to think about the cross entropy loss in the above equation is as follows. Let's say we have 5 classes we are working with, and we are utilizing one training example that belongs to class 3. We could label our target vector as:
#### $$t = [class_1 \; class_2\;class_3\;class_4\;class_5]$$
#### $$t = [0\;0\;1\;0\;0]$$
And then lets say that our prediction, $y$, ended up being:
#### $$y = [0.05\;0.15\;0.7\;0.04\;0.06]$$
We would then iterate through the vectors $t$ and $y$ in order to determine the total cross entropy cost. 
#### $$J = 0*log(0.05)+0*log(0.15)+1*log(0.7)+0*log(0.04)+0*log(0.06)$$
#### $$J = 1*log(0.7)$$
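
The worked example above can be reproduced in a couple of lines. This sketch assumes natural logarithms, and it follows the sign convention used in this section (no leading minus sign).

```python
import numpy as np

t = np.array([0, 0, 1, 0, 0])                  # target: class 3
y = np.array([0.05, 0.15, 0.7, 0.04, 0.06])    # predicted probabilities

# Every term with t_k = 0 vanishes, leaving only log(0.7)
J = np.sum(t * np.log(y))

print(J)    # log(0.7), a negative number
print(-J)   # the same value with the sign flipped, as used for a cost
```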

Now in this example, because we are going to be utilizing a matrix operation and working with all training examples at once, our **cross entropy cost function** will be modified to look like:
### $$\sum_n\sum_kt_k^nlogy_k^n$$
Which only means that now we are still looping through all the $k$ outputs of softmax, but also looping through *all* **N** training examples. 

With that taken care of, let's mathematically define what we are trying to solve for during this backpropagation process. We are trying to find how the **cross entropy cost** changes as we change $W_{dm}$, and as we change $V_{mk}$. Using derivatives, that looks like:

#### $$\frac{\partial J}{\partial W_{dm}}$$
#### $$\frac{\partial J}{\partial V_{mk}}$$

We know that when we backpropagate, we must solve for the weights in the hidden to output layer first, so we can start there. 

<br></br>
## 1.3 Find weights $V_{mk}$
So in order to determine how a change in $V_{mk}$ changes the cross entropy cost, $J$, we are going to need to use the **chain rule**. 
#### $$\frac{\partial J}{\partial V_{mk}}$$
Meaning we need to find:
1. How the **cross entropy** changes as we change the **output** from the softmax, $y$:
#### $$\frac{\partial J}{\partial y}$$
2. How the **output** from the softmax $y$ changes as we change the **input** to the softmax, $a$:
#### $$\frac{\partial y}{\partial a}$$
3. How the **input** to the softmax $a$ changes as we change the **weights**:
#### $$\frac{\partial a}{\partial V_{mk}}$$
Combining these via the chain rule we have the equation:
#### $$\frac{\partial J}{\partial V_{mk}}=\frac{\partial J}{\partial y}\frac{\partial y}{\partial a}\frac{\partial a}{\partial V_{mk}}$$

<br></br>
### New Notation
We are going to be introducing some additional notation to the chain rule now. This is particularly new if you are just coming from the backpropagation walkthrough in the appendix. The addition is due to the fact that we are utilizing matrix operations now, in addition to softmax at the output, meaning that each output depends on the other outputs. 
Remember, we define the cross entropy as:
#### $$J = \sum_{k'=1}^Kt_{k'}log(y_{k'})$$
We can start by rewriting $\frac{\partial J}{\partial V_{mk}}$ as:
#### $$\frac{\partial J}{\partial V_{mk}} = \frac{\partial}{\partial V_{mk}}\sum_{k'=1}^Kt_{k'}log(y_{k'})$$ 

The question may arise, why are we using $k'$ now? Well, $k'$ is a variable we are using as an index. It will have a value changing from 1 to $K$ (the number of classes). We call it $k'$ instead of $k$, because the variable $k$ is already being utilized in $a_k$ and $V_{mk}$. In the case of $a_k$ and $V_{mk}$, $k$ is not necessarily being iterated over, it is just a mathematical convention that states "this equation will hold for all values of $k$ and $m$". To make this even more clear, consider the following equation:
#### $$a = V^Tz$$
When we index that as follows:
#### $$a_k = \sum_{m=1}^MV_{mk}z_m$$
It is simply saying that $a_k$ will equal $\sum_mV_{mk}z_m$, for *all* values of $k$. Say we had $k =3$:
#### $$a_3 = \sum_{m=1}^MV_{m3}z_m$$
Based on our notation, that equation is perfectly valid and holds true. So, for the rest of this tutorial keep that in mind: $k$ is just a variable we are using to index our equation and state that it holds true for all values of $k$. However, $k'$ is being used specifically as a summation index! We will explore this more as the walk through continues, so don't worry if it is still slightly unclear. As a note, $k'$ could just as easily have been called $j$, or any other dummy variable name. 

Now, we can split this up to match the original chain rule as follows:
#### $$\frac{\partial J}{\partial V_{mk}} = \sum_{k'=1}^K\frac{\partial (t_{k'}logy_{k'})}{\partial y_{k'}}\frac{\partial y_{k'}}{\partial a_k}\frac{\partial a_k}{\partial V_{mk}}$$ 

<br></br>
### 1.3.1 Derivative of Cross Entropy with Respect to $y_{k'}$
We will keep reiterating this over the entire walk through, but never forget we are trying to see how $J$ changes as we change our weights. So lets take a look at our equation for $J$, the cross entropy again:
#### $$J = \sum_{k'=1}^Kt_{k'}log(y_{k'})$$
If this is still slightly unclear as to exactly what is happening and what is really going on with $J$, a simple example may help to make things a little more concrete. Consider the situation below:

![output%20layer%20example%20k%20=%203.png](attachment:output%20layer%20example%20k%20=%203.png)

We have $K = 3$ output nodes, where $y_1$, $y_2$, and $y_3$ are our prediction probabilities for each class at those output nodes, and $t_1$, $t_2$, and $t_3$ are our true value **targets** (either 0 or 1). In this specific case our equation for J looks like:
#### $$J = \sum_{k'=1}^3t_{k'}log(y_{k'})$$
Which can be expanded to:
#### $$J = t_1logy_1 + t_2logy_2 + t_3logy_3$$
Clearly, in our specific case above with $K = 3$, $J$ is a function of $y_1$, $y_2$, and $y_3$:
#### $$J = J(y_1,y_2, y_3)$$
And therefore we are going to need to figure out how $J$ changes as we change each individual $y$. So in our case, the derivatives that we are looking for are 
#### $$\frac{\partial J}{\partial y_1}, \frac{\partial J}{\partial y_2}, \frac{\partial J}{\partial y_3}$$
But in the general case, where we just state that we have $K$ classes, we are looking for one derivative per output node:
#### $$\frac{\partial J}{\partial y_{k'}}, \quad k' = 1 \dots K$$
Note, we are able to differentiate the sum term by term because of a basic rule of calculus: *The derivative of a sum is just the sum of the derivatives*. So now we want to take the derivative of $J$ with respect to $y_{k'}$. Every term in the sum other than the one containing $y_{k'}$ is constant with respect to $y_{k'}$, so its derivative is zero. Writing the summation index as $j$ so that it does not clash with $k'$:
#### $$\frac{\partial J}{\partial y_{k'}} = \sum_{j=1}^K\frac{\partial (t_{j}logy_{j})}{\partial y_{k'}} = \frac{\partial (t_{k'}logy_{k'})}{\partial y_{k'}}$$
#### $$\frac{\partial (t_{k'}logy_{k'})}{\partial y_{k'}} = t_{k'}\frac{1}{y_{k'}}$$
#### $$\frac{\partial J}{\partial y_{k'}} = \frac{t_{k'}}{y_{k'}}$$
Great! We have found out how $J$ changes as we change a specific $y_{k'}$. In our simple example where $K = 3$ above, that means that our 3 derivatives would have evaluated to:
#### $$\frac{\partial J}{\partial y_1} = \frac{t_1}{y_1}, \frac{\partial J}{\partial y_2} = \frac{t_2}{y_2}, \frac{\partial J}{\partial y_3} = \frac{t_3}{y_3}$$
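
A quick numerical check of these three derivatives, treating $y_1$, $y_2$, and $y_3$ as free variables and using $J = \sum_{k'} t_{k'} \log y_{k'}$ with natural logarithms. The target and prediction values are made up for illustration.

```python
import numpy as np

def J(t, y):
    return np.sum(t * np.log(y))

t = np.array([0.0, 1.0, 0.0])   # hypothetical target: class 2
y = np.array([0.2, 0.5, 0.3])   # hypothetical prediction

analytic = t / y                # dJ/dy_k = t_k / y_k

# Central finite difference, perturbing one y_k at a time
eps = 1e-6
numeric = np.zeros_like(y)
for k in range(len(y)):
    step = np.zeros_like(y)
    step[k] = eps
    numeric[k] = (J(t, y + step) - J(t, y - step)) / (2 * eps)

print(analytic)
print(numeric)
```

Only the derivative for the true class is nonzero, because every other $t_k$ is 0.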

<br></br>
### 1.3.2 Derivative of Softmax: derivative of $y_{k'}$ with Respect to $a_k$
Now that we have found out how the cross entropy error, $J$, changes as we change output prediction probability $y_{k'}$, we can look at how the output prediction probability changes as we change the activation $a_k$. 

This derivative is probably the most challenging thing about backpropagation, so lets make sure everything is very clear before getting into it. We are going to reference our small example above where $K=3$ in order to make this more clear. Lets start by defining the softmax equation again in the context of our problem:
#### $$y_{k'}=\frac{e^{a_{k'}}}{\sum_{j=1}^Ke^{a_j}}$$
Where $y_{k'}$ is the output of the softmax at node $k'$. For instance, say we are looking at class 2, in our small example from above:
#### $$y_{2}=\frac{e^{a_2}}{\sum_{j=1}^3e^{a_j}} = \frac{e^{a_2}}{e^{a_1}+e^{a_2}+e^{a_3}}$$
Here, we can clearly see that $y_2$ is a function of not only $a_2$, but also $a_1$ and $a_3$. This should be intuitive because the softmax is dependent on the activation going into all output nodes, so any y output will change as we change any activation. Hence, in the case of $y_2$ we would need to find the following derivatives:
#### $$\frac{\partial y_2}{\partial a_1}, \frac{\partial y_2}{\partial a_2}, \frac{\partial y_2}{\partial a_3}$$
And the same thing would go for the other output nodes, $y_1$ and $y_3$! Now, think back to that index change we made earlier, stating that we needed to use $k'$ instead of $k$. It should be rather clear now why that was the case. If we had used $k$ as the index for both $y$ and $a$, then our derivatives would only be able to be:
#### $$\frac{\partial y_k}{\partial a_k}$$
#### $$\frac{\partial y_1}{\partial a_1}, \frac{\partial y_2}{\partial a_2}, \frac{\partial y_3}{\partial a_3}$$
Which is clearly only 3 out of the 9 total that we need to find! If we had $k$ on both the top and bottom of the derivative, then they must be the same! Clearly that is not what we want. 

A question that you may run into at this point, however, is: if we need to iterate over $k'$ in order to make sure we see how $J$ changes as we change each individual $y_{k'}$, why is that not the case with $k$? There is no iteration for $k$ because it has already been assumed to represent $k=1...K$. So we do not even need to write the summation at that point. Think back to the backpropagation appendix example; we first were solving for $w_5$. There was no specific summation term, but it was implicitly assumed that our equation would hold not just for $w_5$, but any $w$ in the input to hidden layer. It is the same case here. Again, we explicitly change the index to $k'$ because even though $k'=1...K$, this allows us to **not have the same value for $k$ on the top and bottom of the derivative**. We had already used $k$, and we could not use it as another index. 

If this is still unclear, think about it from a programming perspective. If you had an inner and an outer for loop, you would not want to index them both with $i$, so most programmers will index the inner loop with $j$. It is a similar situation here.
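
The loop analogy can be written out directly. In this small sketch (variable names are hypothetical), a single shared index only reaches the 3 matching pairs, while two separate indices reach all 9 combinations of output $y_{k'}$ and activation $a_k$ for $K = 3$.

```python
K = 3  # number of output classes in the small example

# One shared index k on top and bottom: only dy_1/da_1, dy_2/da_2, dy_3/da_3
shared_index = [(k, k) for k in range(1, K + 1)]

# Separate indices k' (outputs y) and k (activations a): every combination
separate_indices = []
for k_prime in range(1, K + 1):   # outer loop over outputs y_{k'}
    for k in range(1, K + 1):     # inner loop over activations a_k
        separate_indices.append((k_prime, k))

print(len(shared_index))      # 3
print(len(separate_indices))  # 9
```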

<br></br>
### 1.3.2.1 Derivative of softmax: with respect to $a_{k'}$ or $a_k$
Before we start the actual derivation of softmax, there is a very key point we need to touch on: The derivative will depend on whether we are trying to see how $y_{k'}$ changes with respect to $a_{k==k'}$ or just $a_{k!=k'}$.

That is a mouthful, and sounds more confusing than it actually is- to solidify what it actually means lets look at the following situation: 
#### $$y_{k'}=\frac{e^{a_{k'}}}{\sum_{j=1}^Ke^{a_j}}$$
Above we have our equation for softmax. If you are wondering why the $a$ in the numerator is $a_{k'}$, it is because the definition of softmax sets the output node, $y_{k'}$, equal to the exponential of its own input activation, in this case $e^{a_{k'}}$, divided by the sum of the exponentials of **all** activations. 

Now lets say we are specifically looking at output node 2 again, meaning $k' = 2$. 
#### $$y_{2}= \frac{e^{a_2}}{e^{a_1}+e^{a_2}+e^{a_3}}$$
So in this case, clearly $y_2$ depends on $a_2$, but also $a_1$ and $a_3$. Mathematically that looks like: 
#### $$y_2 = y_2(a_1, a_2, a_3)$$
If we are seeing how $y_2$ changes with respect to $a_2$, we can see that $a_2$ appears in both the numerator and the denominator of the equation for $y_2$, meaning it will have to be differentiated in a specific way (making use of the product rule). In this case, we are differentiating with respect to $a_{k'}$: we have $y_{k'}$ with $k'=2$ and $a_k$ with $k=2$, hence $k' == k$. 

However, we also know that $y_2$ will change with $a_1$ and $a_3$. In each case, our k values $k = 1$ and $k = 3$, do not equal our $k'$ value of 2. $a_1$ only appears in the denominator, so we will use a different derivation in that case. The same concept would be applied to $a_3$. 

<br></br>
### 1.3.2.2 Derivative of softmax: deriving w.r.t. $a_{k'==k}$
We will start with the derivative of softmax when $k'==k$. So in this case we are seeing how the output $y_{k'}$ changes as we change $a_k$, where $a_{k==k'}$. For simplicity, we just write that we are taking the derivative of $y_{k'}$ with respect to $a_{k'}$. Remember, in our simple example where we are dealing with node 2, $k'==2$: 
#### $$y_{2}= \frac{e^{a_2}}{e^{a_1}+e^{a_2}+e^{a_3}}$$
And we are deriving with respect to $a_2$. Lets start writing this out in a more general form. Starting with our original equation for softmax:
#### $$y_{k'}=\frac{e^{a_{k'}}}{\sum_{j=1}^Ke^{a_j}}$$
We can rewrite the above by bringing the denominator up top, that way we can make use of the product rule.
#### $$y_{k'}=e^{a_{k'}}\Big[\sum_{j=1}^Ke^{a_j}\Big]^{-1}$$
We can then use the product rule to help us with this derivative. For a quick refresher, the product rule is defined as:
#### $$(f*g)' = f'*g + f*g'$$
In our case that looks like: 
#### $$\frac{\partial y_{k'}}{\partial a_{k'}} = \frac{\partial (e^{a_{k'}})} {\partial a_{k'}}\Big[\sum_{j=1}^Ke^{a_j}\Big]^{-1} + e^{a_{k'}}\frac{\partial \Big[\sum_{j=1}^Ke^{a_j}\Big]^{-1}}{\partial a_{k'}}$$
Now we can solve for the individual derivatives seen in the equation above:
#### $$\frac{\partial y_{k'}}{\partial a_{k'}}=\frac{e^{a_{k'}}}{\sum_{j=1}^Ke^{a_j}} - \Big[\sum_{j=1}^Ke^{a_j}\Big]^{-2}*e^{a_{k'}}*e^{a_{k'}}$$
Which we know is equal to:
#### $$\frac{\partial y_{k'}}{\partial a_{k'}}=y_{k'} - y_{k'}^2$$
And we can rewrite that as:
#### $$\frac{\partial y_{k'}}{\partial a_{k'}}=y_{k'}(1-y_{k'})$$
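
As a sanity check on the result above, the following sketch compares $y_{k'}(1-y_{k'})$ against a central finite difference. The activation values, the step size `eps`, and the `softmax` helper are all assumptions made for this example.

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.5, -1.2, 2.0])   # hypothetical activations a_1, a_2, a_3
k = 1                            # node 2 in zero-based indexing
eps = 1e-6                       # assumed finite difference step

# Central finite difference: nudge only a_k and watch y_k.
a_plus, a_minus = a.copy(), a.copy()
a_plus[k] += eps
a_minus[k] -= eps
numeric = (softmax(a_plus)[k] - softmax(a_minus)[k]) / (2 * eps)

# The formula derived above.
y = softmax(a)
analytic = y[k] * (1 - y[k])

print(numeric, analytic)
```

The two printed numbers should agree to many decimal places.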

<br></br>
### 1.3.2.3 Derivative of softmax: deriving w.r.t. $a_{k'!=k}$
Now let's look at the derivative of softmax when $k' != k$. In our simple example, this would be the case when we are seeing how $y_2$ changes with respect to $a_1$ or $a_3$. Here the term we are deriving with respect to only occurs in the denominator.

We can start by again rewriting the equation for softmax:
#### $$y_{k'}=e^{a_{k'}}\Big[\sum_{j=1}^Ke^{a_j}\Big]^{-1}$$
And remember, now we are deriving with respect to $a_k$, so we reflect that in our mathematical representation of the derivative:
#### $$\frac{\partial y_{k'}}{\partial a_{k}}$$
So let's now derive our equation for softmax with respect to $a_k$, when $k != k'$.
#### $$\frac{\partial y_{k'}}{\partial a_{k}} = e^{a_{k'}}\frac{\partial\Big[\sum_{j=1}^Ke^{a_j}\Big]^{-1}}{\partial a_k}$$
We were able to pull out $e^{a_{k'}}$ above because it does not depend on $a_k$ when $k != k'$. For clarification, **the summation in the denominator does still contain $e^{a_k}$ (the term where $j == k$), which is why this derivative is not zero. The case where $k' == k$ was already accounted for in 1.3.2.2**.

We can now solve for the term being derived:
#### $$\frac{\partial y_{k'}}{\partial a_{k}} = -e^{a_{k'}}\Big[\sum_{j=1}^Ke^{a_j}\Big]^{-2}e^{a_k}$$

And now let's separate the two terms and simplify:
#### $$\frac{\partial y_{k'}}{\partial a_{k}} =\frac{-e^{a_{k'}}}{\sum_{j=1}^Ke^{a_j}}\frac{e^{a_k}}{\sum_{j=1}^Ke^{a_j}}$$
Which can be rewritten as:
#### $$\frac{\partial y_{k'}}{\partial a_{k}} = -y_{k'}y_k$$
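
The same kind of numerical check works for the $k' != k$ case. This is a sketch under assumed values: the activations, the indices `k_prime` and `k`, and the step size `eps` are chosen only for illustration.

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.5, -1.2, 2.0])   # hypothetical activations a_1, a_2, a_3
k_prime = 1                      # output we watch: y_2 (zero-based index 1)
k = 0                            # activation we nudge: a_1 (zero-based index 0)
eps = 1e-6                       # assumed finite difference step

# Central finite difference: nudge a_k and watch y_{k'}.
a_plus, a_minus = a.copy(), a.copy()
a_plus[k] += eps
a_minus[k] -= eps
numeric = (softmax(a_plus)[k_prime] - softmax(a_minus)[k_prime]) / (2 * eps)

# The formula derived above.
y = softmax(a)
analytic = -y[k_prime] * y[k]

print(numeric, analytic)
```

Note the sign: the derivative is negative, matching the intuition that raising a different node's activation takes probability away from $y_{k'}$.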

<br></br>
### 1.3.2.4 Derivative of softmax: Combining results
At this point we have just found how $y_{k'}$ changes as we change its own activation, $a_{k'}$, as well as how it changes when we change any of the other activations going into the output layer, $a_k$.
#### $$\frac{\partial y_{k'}}{\partial a_{k'}} \; and \; \frac{\partial y_{k'}}{\partial a_{k}}$$
We now need a way to combine these two derivatives into one derivative, which is just how the output $y_{k'}$ changes as we change any of the $a_k$ (one of which is $a_{k'}$). A really clever way to do that is to use the **Kronecker delta**. This is a function that takes in two arguments and results in 1 if they are equal, and 0 if they are not.
#### $$\delta(1,1) = 1$$
#### $$\delta(5,3) = 0$$
#### $$\delta(4,4) = 1$$
By utilizing this delta function we can combine our equations into the following form (clever!):
#### $$\frac{\partial y_{k'}}{\partial a_{k}} = y_{k'}(\delta(k,k') - y_k) = y_k(\delta(k,k') - y_{k'})$$
We see above that there are two ways we could potentially combine the equations, but which one is useful to us? Well, let's take a second to recall what our derivative of the cross entropy, $J$, looked like before we started taking the derivative of the softmax:

#### $$\frac{\partial J}{\partial V_{mk}} = \sum_{k'=1}^K\frac{\partial (t_{k'}\log y_{k'})}{\partial y_{k'}}\frac{\partial y_{k'}}{\partial a_k}\frac{\partial a_k}{\partial V_{mk}}$$ 
And we had found that the first factor inside the summation is:
#### $$\frac{\partial (t_{k'}\log y_{k'})}{\partial y_{k'}} = \frac{t_{k'}}{y_{k'}}$$
With this in mind, we can see that using the combination where $y_{k'}$ is on the outside allows us to cancel out the $y_{k'}$!
#### $$\frac{\partial J}{\partial a_k} = \sum_{k'}^K\frac{t_{k'}}{y_{k'}}*y_{k'}(\delta(k,k') - y_k)$$
#### $$\frac{\partial J}{\partial a_k} = \sum_{k'}^Kt_{k'}(\delta(k,k') - y_k)$$
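
The Kronecker delta maps naturally onto an identity matrix. The sketch below (with assumed activations and an assumed one-hot target) builds both combined forms of the softmax derivative as $K \times K$ matrices, checks that they agree, and then checks that the $y_{k'}$ really does cancel in the summation over $k'$.

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([0.5, -1.2, 2.0])   # hypothetical activations
y = softmax(a)
t = np.array([0.0, 1.0, 0.0])    # hypothetical one-hot target
K = len(y)

delta = np.eye(K)                # delta[k', k] is 1 when k' == k, else 0

# Rows are indexed by k', columns by k.
form_1 = y[:, None] * (delta - y[None, :])   # y_{k'} (delta(k,k') - y_k)
form_2 = y[None, :] * (delta - y[:, None])   # y_k (delta(k,k') - y_{k'})

# Sum over k' before and after cancelling the y_{k'} factors.
before_cancel = (t / y) @ form_1
after_cancel = t @ (delta - y[None, :])

print(np.allclose(form_1, form_2))
print(np.allclose(before_cancel, after_cancel))
```

Both forms give the same matrix, so the choice between them is purely about which one makes the algebra easier.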

<br></br>
### 1.3.2.5 Derivative of softmax: Split the summation
Now if we were to multiply $t_{k'}$ through, and then split the summation we would end up with:
#### $$\frac{\partial J}{\partial a_k} =\sum_{k'=1}^Kt_{k'}\delta(k,k') - \sum_{k'=1}^Kt_{k'}y_k$$
But we can perform another trick! The summation on the left will go away because $\delta(k,k')$ is only equal to 1 when $k==k'$ and is 0 for every other term, so the only term that survives is $t_k$.
#### $$\frac{\partial J}{\partial a_k} = t_k - \sum_{k'=1}^Kt_{k'}y_k$$

<br></br>
### 1.3.2.6 Derivative of softmax: Pull out $y_k$
We also can perform another simplification. For the summation on the right hand side, $y_k$ does not depend on $k'$, the index of the summation, so we can pull it out! 
#### $$\frac{\partial J}{\partial a_k} = t_k - y_k\sum_{k'=1}^Kt_{k'}$$
And that lets us perform one last simplification! Remember that in our target vector, $t$, only *one* of the values $t_1...t_K$ is equal to 1 (the true class we are trying to predict), while the rest are 0. That means the summation term will sum to 1! And we can write our final equation as:
#### $$\frac{\partial J}{\partial a_k} = t_k - y_k$$
Let's take a moment to appreciate how rad that was!
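
A result this clean is worth checking numerically. The sketch below assumes a small example (the activations, the one-hot target, and the step size `eps` are made up) and uses the same sign convention as this notebook, $J = \sum_k t_k \log y_k$. It compares $t_k - y_k$ against a finite difference of $J$ with respect to each $a_k$.

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def J(a, t):
    # J = sum_k t_k * log(y_k), the convention used in this notebook.
    return np.sum(t * np.log(softmax(a)))

a = np.array([0.5, -1.2, 2.0])   # hypothetical activations
t = np.array([0.0, 0.0, 1.0])    # hypothetical one-hot target
eps = 1e-6                       # assumed finite difference step

# Central finite difference of J with respect to each a_k in turn.
numeric = np.zeros_like(a)
for k in range(len(a)):
    step = np.zeros_like(a)
    step[k] = eps
    numeric[k] = (J(a + step, t) - J(a - step, t)) / (2 * eps)

# The formula derived above.
analytic = t - softmax(a)

print(numeric)
print(analytic)
```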

<br></br>
### 1.3.3 Derivative of $a_k$ with respect to $V_{mk}$
Let's take a second to zoom way back out and remember what we are trying to solve for:

#### $$\frac{\partial J}{\partial V_{mk}} = \sum_{k'=1}^K\frac{\partial (t_{k'}\log y_{k'})}{\partial y_{k'}}\frac{\partial y_{k'}}{\partial a_k}\frac{\partial a_k}{\partial V_{mk}}$$ 
We are trying to solve for how the cross entropy error, $J$, changes as we change the weights, $V_{mk}$. So far we have worked out everything up to:
#### $$\frac{\partial J}{\partial a_k} = t_k - y_k$$
Meaning that the last piece we need to solve for is:
#### $$\frac{\partial a_k}{\partial V_{mk}}$$ 
Recall that the activation at unit k can be defined as:
#### $$a_k = V_{k}^TZ$$
Where $V_{k}$ is the $k$-th column of the weight matrix $V$, which has dimensions **(M x K)**, and the entry $V_{mk}$ is the weight that goes from node m in the hidden layer z, to node k in the output layer:

![v%20matrix%201.png](attachment:v%20matrix%201.png)

And $Z$ is an **(M x 1)** vector, holding an output value for each node in the **hidden layer**:

![z%20vector.png](attachment:z%20vector.png)

Because we need our inner dimensions to match when performing matrix multiplication, we will need to take the transpose of V, $V^T$:

![v%20matrix%20transpose.png](attachment:v%20matrix%20transpose.png)

Now let's quickly visualize what performing the dot product looks like:

![linear-algebra-diagram.png](attachment:linear-algebra-diagram.png)

We can think of the data in the upper right hand corner of the above image as our $Z$ vector, and the operation matrix as our $V^T$ weight matrix. We can see that our $Z$ vector is "run through" each row of the $V^T$ matrix: the dot product is applied, and a single value is output for each row.

![vz%20multiplication.png](attachment:vz%20multiplication.png)
<br></br>
<br></br>
<br></br>
 This leaves us with a **(K x 1)** matrix for the output layer. 
 
![k%20output%20vector.png](attachment:k%20output%20vector.png)
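
The shapes described above can be checked directly in NumPy. This is a small sketch with assumed layer sizes (`M = 4`, `K = 3`) and random values standing in for real weights and hidden layer outputs.

```python
import numpy as np

M, K = 4, 3                       # hypothetical layer sizes
rng = np.random.default_rng(0)
V = rng.standard_normal((M, K))   # hidden-to-output weights, shape (M, K)
Z = rng.standard_normal((M, 1))   # hidden layer outputs, shape (M, 1)

A = V.T @ Z                       # (K, M) @ (M, 1) -> (K, 1)

# The same value for one output node, written as an explicit sum over m.
k = 2
a_k = sum(V[m, k] * Z[m, 0] for m in range(M))

print(V.T.shape, Z.shape, A.shape)
print(A[k, 0], a_k)
```

Each entry of the result is the dot product of one row of $V^T$ (one column of $V$) with $Z$.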

So with the above process in mind, we can see that $a_k$ is:
#### $$a_k = V_{1k}z_1 + V_{2k}z_2 + ...+ V_{Mk}z_M$$
Keep in mind that we were trying to find:
#### $$\frac{\partial a_k}{\partial V_{mk}}$$ 
So let's say for a second that $k=3$. Our equation would look like:
#### $$a_3 = V_{13}z_1 + V_{23}z_2 + ...+ V_{M3}z_M$$
Now let's see how it changes as we change $V_{m3}$. Only 1 term is a function of $V_{m3}$:
#### $$\frac{\partial a_3}{\partial V_{m3}} = 0 + 0 +...+1*z_m+ ...+0+0 $$ 
#### $$\frac{\partial a_3}{\partial V_{m3}} = z_m$$
Hence, no matter what value of $k$ we are looking at, when we change $V_{mk}$, $a_k$ will change as follows:
#### $$\frac{\partial a_k}{\partial V_{mk}} = z_m$$
If that isn't completely clear, we can quickly expand upon it. If we wanted to define $a_k$ in scalar form we would see:
#### $$a_k = \sum_{m=1}^MV_{mk}z_m$$

We can expand that to see:
#### $$a_k = V_{1k}z_1 + V_{2k}z_2 + ...+ V_{Mk}z_M$$

Now if we take the derivative of $a_k$ with respect to $V_{mk}$:
####  $$\frac{\partial (V_{1k}z_1 + V_{2k}z_2 + ...+ V_{Mk}z_M)}{\partial V_{mk}}$$
We see that none of the terms is a function of $V_{mk}$ except:
#### $$V_{mk}z_m$$
And when we derive that with respect to $V_{mk}$, we get:
#### $$\frac{\partial a_k}{\partial V_{mk}} = z_m$$
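
Here is a minimal sketch that checks this numerically. The layer sizes, the random weights, the chosen indices `m` and `k`, and the step size `eps` are all assumptions for illustration. It nudges a single weight $V_{mk}$ and measures how every output activation responds.

```python
import numpy as np

M, K = 4, 3                       # hypothetical layer sizes
rng = np.random.default_rng(0)
V = rng.standard_normal((M, K))   # hidden-to-output weights
z = rng.standard_normal(M)        # hidden layer outputs (1-D for simplicity)

m, k = 1, 2                       # the single weight V[m, k] we will nudge
eps = 1e-6                        # assumed finite difference step

V_plus, V_minus = V.copy(), V.copy()
V_plus[m, k] += eps
V_minus[m, k] -= eps

# Change in every activation a_j when only V[m, k] moves.
numeric = (V_plus.T @ z - V_minus.T @ z) / (2 * eps)

print(numeric)
print(z[m])
```

Only entry `k` of `numeric` is nonzero, and it matches `z[m]`: the weight $V_{mk}$ touches $a_k$ and nothing else.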

<br></br>
### 1.3.4 Combine it all together via the chain rule
Okay at this point we have everything we need and can combine it all back together! Remember, we started with a **cross entropy cost**:
#### $$J = \sum_{k'=1}^Kt_{k'}\log(y_{k'})$$
And then we split it up via the chain rule, with the end goal of seeing how the cost would change as we changed $V_{mk}$:
#### $$\frac{\partial J}{\partial V_{mk}} = \sum_{k'=1}^K\frac{\partial (t_{k'}\log y_{k'})}{\partial y_{k'}}\frac{\partial y_{k'}}{\partial a_k}\frac{\partial a_k}{\partial V_{mk}}$$ 
The first part we found to be: 
#### $$\frac{\partial (t_{k'}\log y_{k'})}{\partial y_{k'}} = \frac{t_{k'}}{y_{k'}}$$
The second part we found to be: 
#### $$\frac{\partial y_{k'}}{\partial a_{k}} = y_{k'}(\delta(k,k') - y_k)$$
And the third part we found to be: 
#### $$\frac{\partial a_k}{\partial V_{mk}} = z_m$$
When we finally plug all of these values back in to our original equation we had split up, we end up with...
#### $$\frac{\partial J}{\partial V_{mk}} = (t_k-y_k)z_m$$
Finally we have found how the cross entropy cost changes as we change the hidden to output layer weights! 
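
To close the loop, the whole gradient can be computed for every weight at once as an outer product and checked against finite differences. This is a sketch under assumptions: the layer sizes, random weights, one-hot target, and step size are made up, and the hidden layer output `z` is treated as a fixed vector rather than being computed from an input.

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def J(V, z, t):
    # J = sum_k t_k * log(y_k), with y = softmax(V^T z).
    return np.sum(t * np.log(softmax(V.T @ z)))

M, K = 4, 3                       # hypothetical layer sizes
rng = np.random.default_rng(0)
V = rng.standard_normal((M, K))   # hidden-to-output weights
z = rng.standard_normal(M)        # hidden layer outputs, held fixed here
t = np.array([0.0, 1.0, 0.0])     # hypothetical one-hot target

# The formula derived above: entry [m, k] is (t_k - y_k) * z_m.
y = softmax(V.T @ z)
analytic = np.outer(z, t - y)

# Central finite difference of J with respect to each V[m, k].
eps = 1e-6
numeric = np.zeros_like(V)
for m in range(M):
    for k in range(K):
        step = np.zeros_like(V)
        step[m, k] = eps
        numeric[m, k] = (J(V + step, z, t) - J(V - step, z, t)) / (2 * eps)

print(np.allclose(numeric, analytic, atol=1e-6))
```

The outer product gives an **(M x K)** matrix, the same shape as $V$, so it can be used directly in a weight update.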

<br></br>
## 1.4 Find weights $W_{dm}$