# Introduction
* So far, we have covered 2 essential parts of Logistic Regression:
## 1. Predictions
* This covers how we get from the input to the output of the model
* We call this making predictions
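
As a reminder of what "making predictions" looks like in code, here is a minimal sketch. The function names (`sigmoid`, `predict_proba`), the inputs, and the weights are illustrative assumptions, not code from the earlier sections.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """Return p(y=1 | x) for each row of X; w[0] is the bias term."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the bias
    return sigmoid(Xb @ w)

# Hypothetical inputs and weights, chosen only for illustration.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]])
w = np.array([0.0, 4.0, 4.0])

probs = predict_proba(X, w)              # probability of class 1 for each row
labels = (probs >= 0.5).astype(int)      # threshold at 0.5 to get a class label
print(probs, labels)
```

The model output is a probability; the class label only appears once we threshold it.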
## 2. Training
* The second part was a little harder: we looked at how to make our model **learn**
* We did this by constructing an **objective function**, and minimized it using **gradient descent**
* We call this **learning**, or **training**, or **fitting**
* The whole goal of this is to find the **weights**, $w$

## These are the two main functions for any supervised machine learning model!

---

# Practical issues: 
## 1. Regularization
* Overfitting happens when the model performs very well on its training data but poorly on new data, and it's not really the training data that we care about
* Where machine learning models become very powerful is when they predict things for us in the future! It doesn't matter how well our model predicts stock returns over the past year, we want to be able to predict stock returns tomorrow
* Overfitting can happen for any number of reasons, **one of which is having irrelevant inputs in your model**
* We are going to look at how to bring the weights for irrelevant inputs down to 0

## 2. Problem with Cost Function
* It is defined such that infinite weights are the ideal solution!
* We will see exactly how this happens, and how we can fix it!

## 3. A Straight Line is too Limiting
* There are some data sets that we can visually see are clearly separable, but not by a line/plane

![nonlinearly%20seperable.png](attachment:nonlinearly%20seperable.png)

* Logistic regression can't solve this because its decision boundary is constrained to be a straight line (or plane), and no line can separate the 2 classes
* We will look at one way around this problem in this section (think back to Andrew Ng's course)

---
# Interpreting the Weights 
## Linear Regression 
* Like linear regression, the logistic regression weights are very interpretable and very intuitive
* Recall, for linear regression: 
    > * $w_i$ is the amount y will increase if $x_i$ is increased by 1, and all other $x$'s remain constant

## Logistic Regression 
* With logistic regression, the idea is similar!
* Let's first focus on just the binary scenario
* The output of a logistic regression model is a probability between 0 and 1, which we round to get a prediction of 1 or 0
* So the weight is either going to bring the output closer to 1, or closer to 0
* Intuitively, we know that a bigger weight means a bigger effect
* If $w_i$ is large and positive, then a small increase in $x_i$ will push the output closer to 1
* If $w_i$ is large and negative, then a small increase in $x_i$ will push the output closer to 0

![interpreting%20the%20weights.png](attachment:interpreting%20the%20weights.png)

* In other words, larger magnitude means larger impact on the output
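
A quick numerical sketch of this idea, assuming a single input that starts at 0 (so the output starts at 0.5) and is nudged up by a small amount. The specific weights and step size are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_i = 0.0      # hypothetical starting input, so the output starts at 0.5
delta = 0.1    # small increase in x_i

outputs = {}
for w_i in [0.5, 5.0, -5.0]:
    # Output after the small increase, for this weight
    outputs[w_i] = sigmoid(w_i * (x_i + delta))
    print(f"w_i = {w_i:+.1f}: output moves from 0.5000 to {outputs[w_i]:.4f}")
```

The small positive weight barely moves the output, the large positive weight moves it noticeably toward 1, and the large negative weight moves it toward 0.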

## Can we interpret weights similarly to linear regression?
* Yes!
* First, let's recall the interpretation of the logistic model:

![logistic%20interpretation%201.png](attachment:logistic%20interpretation%201.png)

* It is the probability that y=1 (class is 1) given x (input)
* Hence, if we subtract it from 1 (remember, binary!), we get the probability that y=0, given x.

![logistic%20interpretation%202.png](attachment:logistic%20interpretation%202.png)

### Next: Calculate the Odds
* Most people have only heard of "odds" in a casual context
* Odds are an old-school concept in probability that modern courses often don't teach
* Usually they are referenced in relation to gambling, or other non-scientific fields
* However, the odds are just the ratio of 2 probabilities:
### $$odds = \frac{p(\text{event does happen})}{p(\text{event doesn't happen})}$$
* So in our case that is:
### $$odds = \frac{p(y=1|x)}{p(y=0|x)}$$
* If we write down this expression in terms of $w$ and $x$:

![logistic%20regression%20odds.png](attachment:logistic%20regression%20odds.png)

* If we then take the log of both sides, notice that the right side is exactly like linear regression!
* The left side, the log of the odds, is appropriately called the "log odds"

![log%20odds.png](attachment:log%20odds.png)

* We can think about it like we are doing linear regression on the log odds
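
We can check this numerically. Since $\sigma(z)/(1-\sigma(z)) = e^z$, the odds should equal $e^{w^Tx}$ and the log odds should equal $w^Tx$. The weights and input below are arbitrary values picked for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.0, 4.0, 4.0])     # bias, w_1, w_2 (illustrative values)
x = np.array([1.0, 0.3, -0.1])    # the leading 1 multiplies the bias

p1 = sigmoid(w @ x)               # p(y=1 | x)
p0 = 1.0 - p1                     # p(y=0 | x)
odds = p1 / p0
log_odds = np.log(odds)

# The log odds and the linear term w^T x should agree up to rounding error.
print(log_odds, w @ x)
```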

### What does this mean in terms of interpreting the weights?
* Now we can use our "linear regression way" of interpreting the weights
* $w_i$ is the amount the log odds will increase if $x_i$ increases by 1, and all other x's remain constant
* This interpretation is something that may show up in statistics, but we don't really talk about it in deep learning
* In deep learning, logistic regression is thought of as a neuron
* When you have millions of neurons connected together, this "nice and simple" interpretation disappears
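
To make the log odds interpretation concrete, here is a small sketch with hypothetical weights. It increases $x_1$ by 1 while holding $x_2$ fixed and measures the change through the model's probabilities. An equivalent way to say the same thing is that the odds get multiplied by $e^{w_1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

w = np.array([0.5, 2.0, -1.0])    # hypothetical bias, w_1, w_2
x = np.array([1.0, 0.2, 0.7])     # the leading 1 multiplies the bias

x_plus = x.copy()
x_plus[1] += 1.0                  # increase x_1 by one unit, hold x_2 fixed

odds_before = odds(sigmoid(w @ x))
odds_after = odds(sigmoid(w @ x_plus))

log_odds_change = np.log(odds_after) - np.log(odds_before)  # should equal w_1
odds_ratio = odds_after / odds_before                       # should equal exp(w_1)
print(log_odds_change, odds_ratio)
```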

---
# L2 Regularization - Theory
* We are going to look at the problem from a few angles, so that we fully understand regularization and its implications

## Generalization and Overfitting
* Recall our "Gaussian Cloud" problem from the last section
* We have 2 clouds: 
    * one is centered at (-2,-2)
    * The other is centered at (2, 2)
    
![gaussian%20cloud.png](attachment:gaussian%20cloud.png)

* We found an exact Bayesian Solution!
* It was: w = [0, 4, 4]
* Which we were able to represent using y = mx + b (this is high school math!)
* Note that this y represents the y coordinate of the x-y plane, NOT the y output of logistic regression!
* The decision boundary is where $w^Tx = 0$, so we get: 0 + 4x + 4y = 0
* And finally: y = -x
* This means that we have a slope of -1, and y-intercept of 0
* This should cause you to scratch your head a bit though! Why is the Bayesian solution (4,4)? Why not (1,1)? Or (10,10)?
* These all represent the same line!
* This is the first hint as to why we may need regularization!
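
A short sketch to confirm that these weight vectors really do give the same classifier. The data here is freshly generated with an assumed seed and unit variance, so it is a stand-in for the Gaussian clouds from the last section rather than the exact same points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)    # assumed seed, for reproducibility only
N = 100
# Two Gaussian clouds centered at (-2, -2) and (2, 2)
X = np.vstack([rng.normal(size=(N, 2)) - 2, rng.normal(size=(N, 2)) + 2])
Xb = np.column_stack([np.ones(2 * N), X])   # add the bias column

labels = {}
for scale in [1.0, 4.0, 10.0]:
    w = np.array([0.0, scale, scale])
    labels[scale] = (sigmoid(Xb @ w) >= 0.5).astype(int)

# All three weight vectors assign every point to the same class.
print(np.array_equal(labels[1.0], labels[4.0]),
      np.array_equal(labels[4.0], labels[10.0]))
```

The predicted classes are identical; only the confidence of each prediction changes with the scale of the weights.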

### Objective function
* We need to consider the objective function
* Now, this y is the output of logistic regression:

![total%20cross%20entropy%20error.png](attachment:total%20cross%20entropy%20error.png)

* Take a test point $(x_1, x_2)$ = (1,1) and existing weights (0, 4, 4)
* This should be classified as y = 1
* And now if we plug this into our current logistic regression model: 
### $$\sigma(0 + 4*1 + 4*1) = \sigma(8) = 0.99966$$
* What would be better than that? Exactly 1!
* So, with:
### $$y = \sigma(8)$$
* We get an objective J (the cross entropy, which we want to minimize):
### $$J = -\log(\sigma(8)) \approx 0.000335406$$
* Now, what if our weights had been (0, 1, 1)? Well, then our objective is:
### $$J \approx 0.1269$$
* This is not as good, since the cost is larger!
* And what if our weights were (0, 10, 10)? Then our objective is:
### $$J \approx 2.06 \times 10^{-9}$$
* Hence, we can see that under this model the best weights are actually:
### $$w = (0, \infty, \infty)$$
* On a computer, that is a problem: the weights would grow without bound and never converge
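
The comparison can be reproduced with a few lines of NumPy. This sketch evaluates the cross entropy for the single test point (1, 1) with target 1, using the three weight vectors discussed. The helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(t, y):
    """Binary cross entropy for a single target t and prediction y."""
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

x = np.array([1.0, 1.0, 1.0])   # bias input 1, then the test point (1, 1)
t = 1                           # the true class of the test point

costs = {}
for scale in [1.0, 4.0, 10.0]:
    w = np.array([0.0, scale, scale])
    costs[scale] = cross_entropy(t, sigmoid(w @ x))
    print(f"w = (0, {scale:g}, {scale:g}): J = {costs[scale]:.6g}")
```

The cost keeps shrinking as the weights are scaled up, which is why the unregularized optimum sits at infinity.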

## Regularization 
* People generally explain regularization in terms of overfitting and **regression** 

### Regularization: Regression

![overfitting.png](attachment:overfitting.png)

* However, we are not doing regression right now, so that scenario isn't really applicable
* In fact, if your data fills up the space of all the possible inputs, you shouldn't overfit even if your model is very complex!
* That is why we like having lots of data
* Your model overfits when it has to "guess" what the output should be in a region of the input space it has never seen
* But if your training data is well spread out, and covers all of the possibilities, you can train to do well on that data
* In other words, if test data looks just like training data, and you do well on your training data, then you will do well on your test data

### Regularization: Logistic Regression/classification
* Now the scenario is different: we could have a perfectly separated dataset that covers the entire possible input space, and logistic regression would still try to go to $w = (0, \infty, \infty)$
* The solution is of course, regularization!
* Regularization penalizes very large weights
* So we have our existing cost function, J, which is the cross entropy: 

![cost%20cross%20entropy.png](attachment:cost%20cross%20entropy.png)

* And we now add a penalty for big weights:
### $$J_{reg} = J + \Big(\frac{\lambda}{2}\Big)*\lVert w \rVert^2 = J + \Big(\frac{\lambda}{2}\Big)*w^Tw $$
* This term makes our cost grow larger if any of the weights grow in magnitude
* This encourages the weights to be close to 0
* So now we wouldn't get a weight like (0, 10, 10), because that would make our cost very large!
* $\lambda$ controls the strength of the penalty; we call it a smoothing parameter (usually 0.1-1), but it depends on your data
* The larger lambda is, the closer the weights will go to zero
* The smaller lambda is, the more that the weights will just try to minimize the cross entropy
* There is no universal way to choose the best lambda
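
Here is a minimal sketch of the regularized cost on a tiny made-up dataset of two points, one per class. The value $\lambda = 0.1$ is an assumption picked from the range mentioned above, and the penalty includes the bias weight, matching the $w^Tw$ form of the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(T, Y):
    """Total binary cross entropy over all samples."""
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def regularized_cost(w, Xb, T, lam):
    """Cross entropy plus the L2 penalty (lam / 2) * w^T w."""
    Y = sigmoid(Xb @ w)
    return cross_entropy(T, Y) + 0.5 * lam * (w @ w)

# Hypothetical data: bias column, then the points (1, 1) and (-1, -1)
Xb = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, -1.0]])
T = np.array([1, 0])
lam = 0.1    # assumed value, for illustration only

reg = {}
for scale in [1.0, 4.0, 10.0]:
    w = np.array([0.0, scale, scale])
    reg[scale] = regularized_cost(w, Xb, T, lam)
    print(f"w = (0, {scale:g}, {scale:g}): J_reg = {reg[scale]:.4f}")
```

With the penalty in place, the ordering flips: of these three candidates, (0, 1, 1) now has the lowest cost and (0, 10, 10) the highest.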

## Solving for weights
* Remember that to do gradient descent, all we need to do is take the gradient of the objective function and take a small step in the opposite direction
* Now we want:
### $$\frac{dJ_{reg}}{dw}$$ 
* instead of:
### $$\frac{dJ}{dw}$$
* Since the derivative of a sum is the sum of the derivatives, you can take the gradient of each term separately
* So, we just need to focus on the gradient of the regularization penalty (since we already have the gradient of the cost from before)
* One way to do this is to look at each scalar parameter 
* since the penalty (regularization cost) is equal to:
### $$ regularization\;cost=\Big(\frac{\lambda}{2}\Big)(w_0^2+w_1^2+w_2^2...)$$
* then the derivative with respect to any particular $w_i$ is just:
### $$\frac{d(regularization\;cost)}{dw_i} = \lambda*w_i$$
* and if we then wanted to vectorize this: 
### $$\frac{d(regularization\;cost)}{dw} = \lambda*w$$
* and finally we can add this to our original gradient of the cross entropy cost:
### $$\frac{dJ_{reg}}{dw} = X^T(Y-T)+\lambda*w$$ 
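
Putting the gradient to work, here is a sketch of gradient descent with and without the L2 penalty. It uses freshly generated Gaussian clouds with an assumed seed, and the learning rate, step count, and $\lambda = 1$ are assumptions chosen so the loop is stable on this data. Each update steps against the gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)    # assumed seed, for reproducibility only
N = 100
# Two Gaussian clouds centered at (-2, -2) (class 0) and (2, 2) (class 1)
X = np.vstack([rng.normal(size=(N, 2)) - 2, rng.normal(size=(N, 2)) + 2])
Xb = np.column_stack([np.ones(2 * N), X])   # add the bias column
T = np.array([0] * N + [1] * N)

def fit(Xb, T, lam, learning_rate=0.001, steps=5000):
    """Gradient descent on cross entropy + (lam / 2) * w^T w."""
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        Y = sigmoid(Xb @ w)
        grad = Xb.T @ (Y - T) + lam * w   # gradient of the regularized cost
        w -= learning_rate * grad         # step against the gradient
    return w

w_plain = fit(Xb, T, lam=0.0)   # no regularization
w_reg = fit(Xb, T, lam=1.0)     # with the L2 penalty

accuracy = np.mean((sigmoid(Xb @ w_reg) >= 0.5) == T)
print("without penalty:", w_plain)
print("with penalty:   ", w_reg)
print("accuracy with penalty:", accuracy)
```

Both runs should classify the clouds well, but the regularized weights stay smaller in magnitude.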

## Probabilistic Perspective
* Let's now look at another interpretation of regularization
* Remember that we like to interpret our models in terms of probabilities
* We know that with the cross entropy, what we are really doing is maximizing the likelihood, since J = -log(likelihood)
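
A quick check of that identity on made-up numbers: the likelihood of the targets under independent Bernoulli outputs is $\prod_n y_n^{t_n}(1-y_n)^{1-t_n}$, and its negative log should match the cross entropy. The targets and predictions below are hypothetical.

```python
import numpy as np

T = np.array([1, 0, 1, 1])            # hypothetical targets
Y = np.array([0.9, 0.2, 0.7, 0.6])    # hypothetical model outputs p(y=1 | x)

# Bernoulli likelihood of the targets given the model outputs
likelihood = np.prod(Y**T * (1 - Y)**(1 - T))

# Cross entropy cost
J = -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

print(J, -np.log(likelihood))   # the two values agree up to rounding
```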

