# Week 1 Reference Material

Sections 5.5 & 5.7.1 in https://www.deeplearningbook.org/

# 5.5 - Maximum Likelihood Estimation

Q: How can we derive functions that will be good estimators?

A: Use the maximum likelihood principle!

### Basic Idea
Estimate the parameters of a probability distribution so that, under your model, the data you actually observed is the most likely data.

### Max likelihood estimate 
The max likelihood estimate is the point in parameter space that maximizes the likelihood function.

### How to find it...
If the likelihood function is differentiable, you can use the derivative test to find its maxima.

For a linear regression model, you can use the ordinary least squares estimator.

### Reference text
<br />

<center><img style="display: inline" src="mle_1.png" alt="MLE" width="800"> </center>

^^ Find the probability of obtaining each datapoint in your sample (the term to the right of the Π symbol), and multiply all of those probabilities together (the Π symbol) to get the probability of obtaining your observed sample. This assumes the datapoints are independent of each other.

Maximize this probability (argmax) so your model assesses your observed data as the most likely data.
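A minimal sketch of the product-then-argmax idea, assuming a Bernoulli (coin flip) model. The data, the `likelihood` helper, and the grid of candidate parameters are all made up for illustration; they are not from the book.

```python
# Hypothetical sample of coin flips (1 = heads, 0 = tails), made up for illustration.
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def likelihood(theta, data):
    """Product over datapoints of p(x; theta) for a Bernoulli model."""
    prob = 1.0
    for x in data:
        prob *= theta if x == 1 else (1.0 - theta)
    return prob

# Candidate parameter values 0.01, 0.02, ..., 0.99 (a coarse grid, not a real optimizer).
grid = [i / 100 for i in range(1, 100)]

# argmax: the candidate under which the observed flips are most likely.
theta_ml = max(grid, key=lambda t: likelihood(t, flips))

print(theta_ml)                 # grid point with the highest likelihood
print(sum(flips) / len(flips))  # sample mean, the closed-form Bernoulli MLE
```

For this model the grid search should land on the fraction of heads in the sample, which is what the derivative test gives in closed form.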

### Log likelihood

Multiplying a bunch of probabilities can be inconvenient
<br /><br /> (e.g. the product of many numbers below 1 can run into arithmetic underflow, see https://en.wikipedia.org/wiki/Arithmetic_underflow . KZ: also issues of concavity / convexity of likelihood equation for gradient descent?)
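A small sketch of the underflow problem, assuming 2000 hypothetical datapoints that the model each assigns probability 0.01. The numbers are invented; the point is only that the raw product falls below what a 64-bit float can represent, while the sum of logs stays well behaved.

```python
import math

# 2000 hypothetical datapoints, each assigned probability 0.01 by the model.
probs = [0.01] * 2000

# Multiply the probabilities directly: the true value is 1e-4000,
# far smaller than the smallest positive 64-bit float, so it collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum the log probabilities instead: an ordinary, representable number.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 (underflow)
print(log_sum)   # roughly -9210.34
```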

### Luckily, argmax is impervious to some changes:

It does not change if we take the log of the likelihood (log is monotonically increasing, so the largest likelihood also has the largest log likelihood), which lets us sum instead of multiplying, since log(a * b) = log(a) + log(b)

<center><img style="display: inline" src="mle_2.png" alt="MLE" width="800"> </center>

It does not change if we scale the cost function by a positive constant, so we can divide the whole thing by m (the number of datapoints).

<center><img style="display: inline" src="mle_33.png" alt="MLE" width="800"> </center>
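Both invariances can be checked numerically. This is a sketch under the same kind of assumptions as a toy coin flip model: the data, the helper names, and the grid are hypothetical. It compares the argmax of the raw likelihood, the log likelihood, and the log likelihood divided by m.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical coin flips
m = len(data)
grid = [i / 100 for i in range(1, 100)]  # candidate parameters 0.01 ... 0.99

def p(x, theta):
    """Bernoulli probability of a single datapoint."""
    return theta if x == 1 else 1.0 - theta

def likelihood(theta):
    return math.prod(p(x, theta) for x in data)

def log_likelihood(theta):
    return sum(math.log(p(x, theta)) for x in data)

def mean_log_likelihood(theta):
    # Dividing by m turns the sum into an average over the sample.
    return log_likelihood(theta) / m

# The maximizing parameter for each of the three objectives.
best = [max(grid, key=f) for f in (likelihood, log_likelihood, mean_log_likelihood)]
print(best)   # the three entries should be identical
```

The average log probability is always zero or negative, so maximizing it does mean pushing it as close to zero as the model allows.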


### KZ's understanding: <br />
<b> Divide by m --> instead of getting the sum of a bunch of log probabilities, you get their plain (unweighted) average, i.e. the expectation over the empirical data distribution.</b>

higher probability (p) --> log(p) ~= negative values approaching zero 
<br />lower probability (p) --> log(p) ~= negative values approaching negative infinity 

<b> KZ: So, to maximize probabilities, you want the "expected" (E) average log probability to be closest to zero? (in this case, argmax, bc all values are negative or zero?) </b>



# Ways to find Max Likelihood Estimate 

Maximize the log likelihood / expectation equation (above), or minimize KL divergence (below)

### Reference text: <br />

<center><img style="display: inline" src="mle_4.png" alt="MLE" width="800"> </center>

### "Big picture" of KL divergence:

Consider two probability distributions: your data (P) and your model (Q).

<b>Kullback–Leibler divergence is the average difference in the number of bits required for encoding samples of P using a code optimized for Q versus a code optimized for P. </b>

KL divergence and "cross entropy" differ only by the entropy of P, which does not depend on the model. So with respect to the model's parameters, minimizing one minimizes the other.
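A quick numeric sketch of the relationship KL(P || Q) = cross entropy(P, Q) - entropy(P), measured in bits. The two discrete distributions and the function names here are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

p = [0.5, 0.25, 0.25]   # hypothetical data distribution P
q = [0.4, 0.4, 0.2]     # hypothetical model distribution Q

def entropy(p):
    """Average bits per sample using a code optimized for P."""
    return -sum(pi * math.log2(pi) for pi in p)

def cross_entropy(p, q):
    """Average bits per sample of P using a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Extra bits paid for using Q's code instead of P's."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(kl_divergence(p, q))                 # extra bits, always >= 0
print(cross_entropy(p, q) - entropy(p))    # same value, up to rounding
print(kl_divergence(p, p))                 # 0.0 when the model matches the data
```

Since `entropy(p)` is fixed by the data, changing Q moves the cross entropy and the KL divergence by exactly the same amount.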

### Why minimize KL divergence..
...instead of minimizing "Negative Log Likelihood" (NLL) or maximizing log likelihood?

### Because...

<br/><b> KL divergence has a known minimum of 0</b>.<br/> Negative Log Likelihood (NLL) can be negative.

KZ: how / when is NLL negative?
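One way to see how NLL can go negative: for a real-valued variable the model outputs a probability density, not a probability, and a density can be larger than 1. The sketch below assumes a narrow Gaussian (standard deviation 0.1, a value picked only for illustration) evaluated at its own mean.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# A narrow Gaussian piles its mass near the mean, so the density there exceeds 1.
density = gaussian_pdf(0.0, mean=0.0, std=0.1)

# log of a number above 1 is positive, so its negative is below zero.
nll = -math.log(density)

print(density)   # about 3.99
print(nll)       # about -1.38
```

For discrete variables each probability is at most 1, so this cannot happen there.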


# 5.5.1 - Conditional Log Likelihood & Mean Squared Error

You can also use Max Likelihood Estimation to do linear regression (using "conditional log likelihood")!

<center><img style="display: inline" src="mle_5.png" alt="MLE" width="800"> </center>

You can generalize maximum likelihood to make predictions about y given x. <br/>Just optimize your parameters to maximize the likelihood of y given the model with input x.

### High level

Instead of producing a single prediction (y_hat) we can now think of linear regression as producing a conditional distribution p(y|x).

With an infinitely large training set, we might get cases where the same input x yields different values of y. The probability distribution tries to fit all of these as well as possible.

### Typical linear regression approach

If you play with the conditional log likelihood equations, you can convert this approach to more familiar linear regression and mean squared error equations (it's the same thing)

<center><img style="display: inline" src="mle_6.png" alt="MLE" width="800"> </center>
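A sketch of why the two views agree, assuming p(y | x) is a Gaussian centered on the prediction w*x + b with a fixed standard deviation. The 1-D dataset and all names below are hypothetical. The Gaussian negative log likelihood works out to a constant plus m * MSE / (2 * sigma^2), so the parameters that minimize MSE also maximize the conditional log likelihood.

```python
import math

# Hypothetical 1-D data, made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]
m = len(xs)

def mse(w, b):
    """Mean squared error of the line y_hat = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

def gaussian_nll(w, b, sigma=1.0):
    """Negative conditional log likelihood with p(y | x) = N(y; w*x + b, sigma^2)."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2)
        + (y - (w * x + b)) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )

# Closed-form ordinary least squares for a single input feature.
x_bar, y_bar = sum(xs) / m, sum(ys) / m
w_ols = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b_ols = y_bar - w_ols * x_bar

# With sigma = 1: NLL = (m / 2) * log(2 * pi) + m * MSE / 2.
const = 0.5 * m * math.log(2 * math.pi)
print(gaussian_nll(w_ols, b_ols))
print(const + m * mse(w_ols, b_ols) / 2)   # same value, up to rounding
```

Nudging `w_ols` or `b_ols` in either direction should raise both the MSE and the NLL.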


# 5.5.2 Properties of Maximum Likelihood

So, why use Maximum Likelihood Estimation for linear regression if you can use simpler equations? Because MLE has some advantageous properties..

### Key takeaways:
MLE offers <b> consistency</b> and <b> efficiency</b>, making it a commonly preferred estimator for machine learning.  

When the number of datapoints is small, risking overfitting, regularization strategies (like "weight decay") can provide a biased version of maximum likelihood -- one that has less variance when training data is limited

### Note: I only did a very cursory skim of this section!

For details on MLE properties, and requirements for MLE consistency, see pages 132-133.


# 5.7.1 Probabilistic Supervised Learning


<center><img style="display: inline" src="mle_7.png" alt="MLE" width="800"> </center>

Easy to generalize linear regression (conditional log likelihood) to do supervised classification by calculating the probability of a particular class (just need the probability for one class, if only two classes exist).

### Sigmoid

The normal distribution we typically use for linear regression is parameterized in terms of a mean. A distribution over a binary variable is trickier because its mean must always be between 0 and 1.

To convert the output of the linear function to be on the interval between 0 and 1, use the logistic sigmoid function:


p(y = 1 | x; θ) = σ(θ^T x)

### ^This is what we call logistic regression
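A minimal sketch of the formula above in plain Python. The parameter vector `theta`, the input `x`, and the helper names are hypothetical, and this naive `sigmoid` is only meant for moderate inputs.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1).

    Naive version: a very negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """p(y = 1 | x; theta) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = [0.5, -1.0, 2.0]   # hypothetical parameters
x = [1.0, 2.0, 0.5]        # hypothetical input (first entry acts as a bias term)

p1 = predict_proba(theta, x)   # theta^T x = 0.5 - 2.0 + 1.0 = -0.5
p0 = 1.0 - p1                  # with two classes, the other probability is implied

print(p1, p0)
```

Only one probability needs to be modeled, since the two class probabilities must sum to 1.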