# Training Introduction 
* We previously went over the architecture of a neural network: how to pass data from input to output and end up with a probabilistic prediction
* The problem was that all of our weights were random!
* And thus, our predictions were not accurate!

![main%20nn%20equation.png](attachment:main%20nn%20equation.png)

* in this section we are going to focus on how to change our weights so that our predictions are accurate
* Recall, our training data will consist of input and target pairs
* the goal is to make the predictions (y) as close to the targets (t) as possible 
* To do this we will create a **cost function** such that:
    * the **cost is large** when the prediction is not close to the target
    * The **cost is small** when the prediction is close to the target
* we are trying to make our cost as small as possible!
* Recall: the method that we used to achieve this is called gradient descent 
* This is a bit harder with neural networks, because the equations are a bit more complex, but no new skills are required

# Gradient Descent in NNs: Backpropagation
* Gradient descent in neural networks has a special name: **backpropagation**
* backpropagation is recursive in nature, which allows us to find the gradient in a systematic way
* This recursiveness allows us to do backpropagation for neural networks that are arbitrarily deep, without much added complexity
* E.g. a 100-hidden-layer ANN is no harder than a 1-hidden-layer ANN

![gradient%20descent%20plot.png](attachment:gradient%20descent%20plot.png)

---
# What do all of these symbols and letters mean?
## Training Data
* Training inputs: X
* Training targets: Y
* Generally speaking, these are both matrices
* X is of shape N x D
    * N = number of samples
    * D = number of input features
* Y is of shape N x 1 
    * aka a column vector
    * a 2-d object 
* Alternatively, Y can just be a vector of only 1-D of length N
    * this is how we will represent it in Numpy
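
A minimal sketch of these shapes in NumPy. The sizes (N = 5, D = 3) and the random data are made up purely for illustration:

```python
import numpy as np

N, D = 5, 3  # hypothetical sizes: 5 samples, 3 input features

rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))           # inputs, shape (N, D)
Y_col = rng.integers(0, 2, size=(N, 1))   # targets as an N x 1 column vector (2-D)
Y = Y_col.ravel()                         # targets as a 1-D array of length N

print(X.shape)      # (5, 3)
print(Y_col.shape)  # (5, 1)
print(Y.shape)      # (5,)
```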
    
## Training Data and Predictions
* Inputs: X, Targets: Y, Predictions: p(Y|X)
* p(Y|X) represents a full probability distribution over all individual values in the matrix Y, given the matrix X
* p(Y|X) is therefore also a matrix, same size as Y
* p(y = k | x) is a probability value - a single number 
    * Representing the probability that y is of class k, given the input vector x 
* Note: 
    * Capital letters usually represent matrices
    * lowercase letters usually represent vectors

## p(Y | X) is inconvenient
* p(Y | X) is inconvenient to write 
* many characters
* variable names cannot contain spaces or parentheses
* so we resort to things like:
    * p_y_given_x
    * py_x
    * Py
* none are really ideal
* Old school alternative for predictions is to write:
## $$\hat{y}$$
* notice that this still cannot represent a variable in code...

## Another convention
* This is where things start to get confusing
* an alternative way to represent inputs, targets, and predictions is:
    * Inputs: X
    * Targets: T
    * Predictions: Y
* this is beneficial since we no longer need to write out P of Y given X anywhere
* however this Y now conflicts with our other Y- before Y meant the targets, now it means the predictions

## Using Context
* we will need to use context to determine what Y really is 
* if we see Y and T at the same time, it should be clear that Y is a prediction
* if you see Y being assigned the output of a neural network, it's a prediction
* but if you see Y and Yhat, Y and p_y_given_x at the same time, Y is a target 

## Weights

![weights%20diagram.png](attachment:weights%20diagram.png)

* we have some conventions for the sizes of things
* N is the number of samples we have collected in our experiment
* D is the number of features, which is the size of the input layer in the neural network
* M represents the size of the hidden layer
* K represents the size of the output layer 
* K is the number of output classes, and can be 2 or larger 
* when we have a 1 hidden layer neural network, one way to name the weights is as follows:
    * W is the weight matrix from the input to hidden layer - (D x M)
    * b is the bias term at hidden layer - (M x 1)
    * V is the weight matrix from the hidden to output layer - (M x K)
    * c is the bias term at the output layer - (K x 1)
* you can imagine that if we start adding more hidden layers, we are going to run out of letters! So using V and c isn't really a great option.
* How about just numbering our W's and b's?
    * W1, b1, W2, b2, ... and so on
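
As a rough sketch, here is how the numbered weights and their shapes could look in NumPy for a 1-hidden-layer network. The sizes are made up, the `tanh` hidden activation is an assumption (the notes above do not fix one), and the biases are stored as 1-D arrays of length M and K rather than (M x 1) and (K x 1) so that NumPy broadcasting adds them to every row:

```python
import numpy as np

N, D, M, K = 4, 3, 5, 2  # hypothetical sizes

rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))

W1 = rng.standard_normal((D, M))  # input -> hidden weights, (D x M)
b1 = np.zeros(M)                  # hidden bias, length M
W2 = rng.standard_normal((M, K))  # hidden -> output weights, (M x K)
b2 = np.zeros(K)                  # output bias, length K

def forward(X, W1, b1, W2, b2):
    Z = np.tanh(X @ W1 + b1)      # hidden layer values, shape (N, M); tanh is an assumption
    A = Z @ W2 + b2               # output activations, shape (N, K)
    expA = np.exp(A - A.max(axis=1, keepdims=True))  # subtract row max for numerical stability
    Y = expA / expA.sum(axis=1, keepdims=True)       # softmax: each row sums to 1
    return Y, Z

Y, Z = forward(X, W1, b1, W2, b2)
print(Z.shape, Y.shape)  # (4, 5) (4, 2)
```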

## Indices
* there are several ways to write indices, and we may put them in different places (e.g. subscript vs. superscript) if they represent different things
* Example: if we are looking for the target T for the nth sample and kth class
* In Numpy: T[n,k]
* T(n,k)
* $T^n_k$
* $T_{nk}$
* $t^n_k$
* $t_{nk}$

## Indexing
* i, j, and k are common letters we use for indices
* Ex. i=1...D (input layer)
* Ex. j=1...M (hidden layer)
* Ex. k=1...K (output layer)
* the problem with i, j, and k is what happens if we have more than 1 hidden layer, and eventually run out of letters
* we then pick an index outside of these 3 letters:
    * q = 1...Q
    * r = 1...R
    * s = 1...S

## Learning Rate
* greek letters: alpha or eta

![learning%20rate.png](attachment:learning%20rate.png)

## Cost/Objective/Error Function
* Typical letters: E or J
* Cost or error: usually means something we want to minimize
* Objective: can be something to minimize or maximize 
* Probabilistic interpretation of cost: negative log-likelihood
* we are trying to maximize the log-likelihood, or equivalently minimize the negative log-likelihood
* minimizing E is the same as maximizing -E
* So minimizing the negative log-likelihood (gradient **descent**) is the same as maximizing the log-likelihood (and the likelihood) (gradient **ascent**)

## Likelihood
* Typically we use the uppercase L for likelihood, lowercase l for log-likelihood, if they are both presented together
* If discussing log-likelihood or negative log-likelihood by itself, we might just use L since L is easy to see, and l can be confused for I

---
# What does it mean to train a Neural Network? 
* this is going to be very similar to logistic regression

## The Main Concepts
* we very intuitively define something called the "cost"
* we want to minimize the cost!  
* But how do we minimize the cost? This falls into the domain of calculus! Calculus provides the tools to find the min/max of a function!
* we specifically use a method called **gradient descent**

## How do we define cost? 
* recall that for binary classification, we calculate the likelihood exactly as we would for a sequence of coin tosses
* So for example, say we flip 2 heads and 3 tails
* Because these are independent trials, the total likelihood is then:
### $$Likelihood = p(H)p(H)p(T)p(T)p(T)$$
* again, the reason we can multiply these probabilities is because each coin toss is independent of the others
* another way to write this is to call:
### $$p = p(H)$$
* and hence we can rewrite likelihood as:
### $$Likelihood = p^{number \; of\; heads}(1-p)^{number \;of \; tails}$$
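
A small sketch of this formula in code, using the 2 heads and 3 tails from the example above. The function name `coin_likelihood` is just an illustrative choice:

```python
def coin_likelihood(p, num_heads, num_tails):
    # p is the probability of heads; tosses are assumed independent
    return p ** num_heads * (1 - p) ** num_tails

print(coin_likelihood(0.5, 2, 3))  # 0.03125
print(coin_likelihood(0.4, 2, 3) > coin_likelihood(0.5, 2, 3))  # True
```

Note that p = 0.4 (the observed fraction of heads) gives a higher likelihood than p = 0.5.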

## Minimize or Maximize?
* the likelihood, in other words the probability of our data given our model/parameters, is something we want to maximize
* but recall that we are looking for a cost, in other words, something to minimize
* In order to get something that we can call the cost, we take the negative log of the likelihood and call it the "cost"
* Negative log-likelihood = $-(\#H \cdot \log p + \#T \cdot \log(1-p))$
* recall from logistic regression that this is called the cross entropy cost function
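
To see that minimizing this cost gives a sensible answer, here is a quick sketch that evaluates the negative log-likelihood for 2 heads and 3 tails over a grid of p values. The grid search is only for illustration, it is not how we will actually train:

```python
import numpy as np

def coin_nll(p, num_heads, num_tails):
    # negative log-likelihood of the coin tosses
    return -(num_heads * np.log(p) + num_tails * np.log(1 - p))

ps = np.linspace(0.01, 0.99, 99)   # candidate values 0.01, 0.02, ..., 0.99
costs = coin_nll(ps, 2, 3)
best_p = ps[np.argmin(costs)]      # p with the smallest cost on the grid
print(round(best_p, 2))            # 0.4
```

The smallest cost lands at p = 2/5, the observed fraction of heads.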

## Cross-Entropy
* we can phrase it in terms of the output probability of the logistic regression model, and the targets
* $y_n$ = output of logistic regression or neural network
* $t_n$ = actual target (0 or 1) in the binary case 

![cross%20entropy%20cost.png](attachment:cross%20entropy%20cost.png)

* notice that if we had a neural network doing binary classification, we would use this exact same cost function
* recall that in order to find the best weights to minimize this cost, we can use gradient descent 
* we can also maximize the negative of this, gradient ascent
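
A minimal sketch of the binary cross-entropy in NumPy, assuming the standard summed form $J = -\sum_n [t_n \log y_n + (1 - t_n)\log(1 - y_n)]$. The targets and predictions below are made up:

```python
import numpy as np

def binary_cross_entropy(t, y):
    # t: targets (0 or 1), y: predicted probabilities, both of length N
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

t = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # predictions close to the targets
bad = np.array([0.1, 0.9, 0.2, 0.8])   # predictions far from the targets

print(binary_cross_entropy(t, good) < binary_cross_entropy(t, bad))  # True
```

The cost is small when the predictions are close to the targets and large when they are not, which is exactly what we asked of a cost function.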

## Cross Entropy for Multi-class Classification
* in this section we want to be able to handle any number of outputs 
* let's consider a die roll (6 faces, but let's call it K)
* the probability of rolling k = $y_k$
* $t_k$ = 1 if we roll k, 0 if we do not roll k
* we have N total die rolls, so $t_{n,k} =1$ if we rolled k on the nth die roll 
* therefore, only one of the $t_{n,k}$ can equal 1 for any particular n
    * $t_{n,k}$ is thus an indicator matrix or one hot encoded matrix of 1s and 0s
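
A short sketch of building such an indicator matrix in NumPy. The helper name `one_hot` and the example rolls are hypothetical, and faces are numbered 0 to K-1 to match NumPy indexing:

```python
import numpy as np

def one_hot(labels, K):
    # labels: integer class for each of the N samples, values in 0..K-1
    labels = np.asarray(labels)
    N = labels.shape[0]
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1   # set T[n, k] = 1 where sample n has class k
    return T

rolls = np.array([5, 0, 2, 5])    # four die rolls (made-up data)
T = one_hot(rolls, 6)
print(T.shape)        # (4, 6)
print(T.sum(axis=1))  # [1. 1. 1. 1.]
```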
    
![multi%20class%20likelihood.png](attachment:multi%20class%20likelihood.png)

* notice that y and t now have two indexes each:
    * n corresponds to which sample we are looking at 
    * and k corresponds to which class we are looking at 
* notice that for any particular n, only 1 of the k targets can be 1, and the rest must be 0
* that is because if we roll a die and get a 6, then that same die roll can't be any other number 

## Cross Entropy for Multi-class Classification

![cross%20entropy%20for%20multiclass%20classification.png](attachment:cross%20entropy%20for%20multiclass%20classification.png)

* What is cross entropy: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/
* we again want to transform this into a cost, so we will take the negative log of the likelihood
* this is called the cross entropy cost function, but for multiclass classification
* next we will see how to perform gradient descent on the new cost function and how to write it in code! 
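
As a preview, a minimal sketch of the multiclass cross-entropy in NumPy, assuming the form $J = -\sum_n\sum_k t_{n,k}\log y_{n,k}$ with T one-hot encoded. The example matrices are made up:

```python
import numpy as np

def cross_entropy(T, Y):
    # T: one-hot targets (N x K), Y: predicted probabilities (N x K)
    return -np.sum(T * np.log(Y))

T = np.array([[1, 0, 0],
              [0, 1, 0]])
Y_good = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1]])
Y_bad = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1]])

print(cross_entropy(T, Y_good) < cross_entropy(T, Y_bad))  # True
```

Because T is one-hot, only the predicted probability of the correct class contributes for each sample.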

---
# Backpropagation Intro 
* Let's start by recalling how we trained the logistic regression model

## Logistic Regression Recap
* we start with an objective function, and for binary classification we use cross entropy 

![total%20cross%20entropy%20error.png](attachment:total%20cross%20entropy%20error.png)

* we know that this objective function has a global minimum, so it looks something like a parabola 
* now we usually will randomize our weights initially, and then we slowly move towards the minimum in small steps
* we find the direction for the minimum using the gradient, $\frac{dJ}{dw}$

![cost%20minimization.png](attachment:cost%20minimization.png)

* we end up with an update rule:
### $$w \leftarrow w - \alpha\frac{dJ}{dw}$$
* where $\alpha$ is the learning rate 
* this process is called gradient descent
* You can also do gradient ascent, where the goal is to find a global maximum
### $$w \leftarrow w + \alpha\frac{dJ}{dw}$$
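
A toy sketch of gradient descent on a simple parabola, $J(w) = (w - 3)^2$. The cost function, starting point, and learning rate are all made up for illustration:

```python
def grad_J(w):
    # derivative of J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 10.0      # arbitrary starting weight
alpha = 0.1   # learning rate
for _ in range(100):
    w = w - alpha * grad_J(w)   # step against the gradient

print(round(w, 4))  # 3.0
```

Each step moves w a little closer to the minimum at w = 3. For gradient ascent, the only change would be a plus sign in the update.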

## Neural Networks Gradient Descent
* we are going to do the exact same process with neural networks! However, because they are nonlinear, we are going to find a local minimum, not necessarily the global minimum
* additionally, because the weight updates are dependent on the error at multiple outputs, we are going to need the concepts of total derivatives

### Total Derivatives
* suppose you have a function of x and y, $f(x,y)$, where x and y are each a function of t: $x(t)$ and $y(t)$. f is then a parameterized function of t
* The goal is to take the **total derivative**
### $$\frac{df}{dt}$$
* To do this we use the **chain rule**
### $$\frac{df}{dt}= \frac{\partial f}{\partial x}\frac{dx}{dt} +\frac{\partial f}{\partial y}\frac{dy}{dt}$$
* now if you had a vector x, which has k components, and they are all parameterized by t, you can imagine that you would do the same thing, using a summation
### $$\frac{df}{dt} = \sum_k\frac{\partial f}{\partial x_k}\frac{dx_k}{dt}$$
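
A quick numerical sanity check of the chain rule for total derivatives. The functions $f(x,y) = x^2 y$, $x(t) = \cos t$ and $y(t) = t^2$ are made up for this example:

```python
import math

def x(t):
    return math.cos(t)

def y(t):
    return t ** 2

def f(x_val, y_val):
    return x_val ** 2 * y_val

def total_derivative(t):
    # chain rule: (df/dx)(dx/dt) + (df/dy)(dy/dt)
    df_dx = 2 * x(t) * y(t)
    df_dy = x(t) ** 2
    dx_dt = -math.sin(t)
    dy_dt = 2 * t
    return df_dx * dx_dt + df_dy * dy_dt

def numeric_derivative(t, eps=1e-6):
    # central finite difference of f(x(t), y(t)) with respect to t
    return (f(x(t + eps), y(t + eps)) - f(x(t - eps), y(t - eps))) / (2 * eps)

t0 = 0.7
print(abs(total_derivative(t0) - numeric_derivative(t0)) < 1e-6)  # True
```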

## Objective function with Softmax
* basically this is the same as the likelihood of rolling a die
* so if you were to roll a die, with N independent and identically distributed rolls, your likelihood would be:
### $$Likelihood = \prod_{n=1}^N\prod_{k=1}^6 (\theta_k^n)^{t_k^n}$$
* so with neural networks this is exactly the same! The network output $y_k^n$ plays the role of $\theta_k^n$
### $$P(targets\;|\;inputs, weights) = P(targets\;|\;X,W,V) = \prod_{n=1}^N\prod_{k=1}^K (y_k^n)^{t_k^n}$$
* so we are going to work with the log-likelihood, not the negative log-likelihood, and do gradient ascent instead of descent
* so let's take the log-likelihood
### $$J = \sum_n\sum_kt_k^n\log y_k^n$$
* now that we have our objective function, what do we do with it?
* it is the same idea as with logistic regression! We want to find the derivative with respect to certain weights. 
* we have 2 sets of weights for a 1 hidden layer NN (W and V); the sizes of the layers are D, M, and K, and they are indexed by d, m, k

![1%20hidden%20layer%20diagram.png](attachment:1%20hidden%20layer%20diagram.png)

* so we want to find these derivatives: 
### $$\frac{dJ}{dV_{mk}}$$
### $$\frac{dJ}{dW_{dm}}$$
* Note that in these derivatives, J can be thought of as our **error**
* we are trying to find how the error (cost, J) changes as we change our weights!
* because we are doing backpropagation, we are going to find $\frac{dJ}{dV_{mk}}$ first, because it is on the right, followed by backpropagating the error, and then we will find $\frac{dJ}{dW_{dm}}$
* this can be done using the chain rule 
### $$\frac{dJ}{dV_{mk}}= \sum_n\sum_{k'}t_{k'}^n\frac{1}{y_{k'}^n}\frac{dy_{k'}^n}{dV_{mk}}$$
* now the question is, how do we find: $\frac{dy_{k'}^n}{dV_{mk}}$?
* in other words...
### How do we find the derivative of softmax?
### $$y_k=\frac{e^{a_k}}{\sum_je^{a_j}}$$
* where the activation, $a_k$, is just the dot product of the hidden layer values $z$ and the weights $V_k$
### $$a_k= V_k^Tz$$
* so we want to find just the derivative of the softmax first
    * if k == k'
### $$\frac{dy_{k'}}{da_k} = y_{k'}(1-y_k)$$  
    * if k != k'
### $$\frac{dy_{k'}}{da_k} = -y_{k'}y_k$$  
* these can be combined using the kronecker delta
    * if i == j
### $$\delta_{ij} = 1$$    
    * if i != j
### $$\delta_{ij} = 0$$        
* so the derivative is...
### $$\frac{dy_{k'}}{da_k} = y_{k'}(\delta_{kk'}-y_k)$$  
* we also know from the dot product that the derivative of the activation is just $z_m$
### $$\frac{da_k}{dV_{mk}}=z_m$$
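
A sketch that checks the softmax derivative numerically. The Jacobian is built as `diag(y) - outer(y, y)`, whose entry `[k', k]` equals $y_{k'}(\delta_{kk'} - y_k)$, and is compared against central finite differences. The activation values are made up:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(a):
    y = softmax(a)
    # entry [k', k] = dy_k'/da_k = y_k' * (delta_kk' - y_k)
    return np.diag(y) - np.outer(y, y)

def numeric_jacobian(a, eps=1e-6):
    K = a.shape[0]
    jac = np.zeros((K, K))
    for k in range(K):
        step = np.zeros(K)
        step[k] = eps
        # column k: how every output changes when a_k is nudged
        jac[:, k] = (softmax(a + step) - softmax(a - step)) / (2 * eps)
    return jac

a = np.array([0.5, -1.0, 2.0])
print(np.allclose(softmax_jacobian(a), numeric_jacobian(a), atol=1e-7))  # True
```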

## Combine
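* plugging the softmax derivative and $\frac{da_k}{dV_{mk}}$ into the chain rule, the $\frac{1}{y_{k'}^n}$ cancels with $y_{k'}^n$, and since $\sum_{k'}t_{k'}^n = 1$ we are left with:
### $$\frac{dJ}{dV_{mk}}= \sum_n(t_k^n-y_k^n)z_m^n$$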
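
In matrix form this gradient can be written as `Z.T @ (T - Y)`. Below is a rough sketch that checks it against finite differences of the log-likelihood. The hidden values Z, weights V, and targets are random made-up data, and the output bias is left out to match $a_k = V_k^Tz$ above:

```python
import numpy as np

def softmax_rows(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(T, Z, V):
    Y = softmax_rows(Z @ V)
    return np.sum(T * np.log(Y))   # J = sum_n sum_k t_nk log y_nk

def grad_V(T, Z, V):
    Y = softmax_rows(Z @ V)
    return Z.T @ (T - Y)           # entry [m, k] = sum_n (t_nk - y_nk) z_nm

rng = np.random.default_rng(0)
N, M, K = 6, 4, 3                  # hypothetical sizes
Z = rng.standard_normal((N, M))    # stand-in for hidden layer values
V = rng.standard_normal((M, K))
T = np.zeros((N, K))
T[np.arange(N), rng.integers(0, K, size=N)] = 1   # random one-hot targets

eps = 1e-6
numeric = np.zeros_like(V)
for m in range(M):
    for k in range(K):
        step = np.zeros_like(V)
        step[m, k] = eps
        numeric[m, k] = (log_likelihood(T, Z, V + step)
                         - log_likelihood(T, Z, V - step)) / (2 * eps)

print(np.allclose(grad_V(T, Z, V), numeric, atol=1e-5))  # True
```

Since J here is the log-likelihood, the update would be gradient ascent: `V = V + alpha * grad_V(T, Z, V)`.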

## Derivative with respect to input-to-hidden weights
* 