### Maximum Likelihood Estimation of OLS

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

Conditional probability: $P(A|B)$, the probability of $A$ given that $B$ has occurred.

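In the context of models and data, probability treats the model as fixed and asks how probable the data are: $P(\text{data}|\text{model})$.

Likelihood is the same quantity read the other way, with the data fixed and the model allowed to vary: $L(\text{model}|\text{data}) = P(\text{data}|\text{model})$.

If we have two competing models, we choose the one with the higher likelihood, that is, the one under which the observed data are most probable.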

<img src = 'gaussian.png' width = 500>

If $x$ comes from a normal distribution, then $$P(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ The **likelihood** treats this same expression as a function of the distribution parameters, $\mu$ and $\sigma$, with the observations $x_1,\ldots,x_N$ held fixed. For independent observations, the likelihood is the individual probabilities multiplied together: $$\mathcal{L}(\mu, \sigma|x_1,\ldots,x_N) = \prod\limits_{i=1}^{N} P(x_i|\mu, \sigma)$$

We can take the **log-likelihood**, which lets us sum log-probabilities instead of multiplying probabilities. Maximizing the log-likelihood is the same as maximizing the likelihood because the logarithm is monotonically increasing: $$\ln\mathcal{L}(\mu, \sigma|x_1,\ldots,x_N) = \ln\left(\prod\limits_{i=1}^{N} P(x_i)\right)$$ $$\ln\mathcal{L} = \sum\limits_{i=1}^{N} \ln P(x_i)$$
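As a rough numerical check of why the log form is preferred, the sketch below evaluates both forms on synthetic data. The seed, sample size, and true parameters ($\mu = 3$, $\sigma = 2$) are illustrative assumptions, and `gaussian_log_pdf` is a helper defined here, not a library function.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density, evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(0)                  # fixed seed, illustrative only
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # synthetic data, assumed parameters

# Multiplying thousands of densities (each well below 1 here) underflows to zero.
likelihood = np.prod(np.exp(gaussian_log_pdf(x, 3.0, 2.0)))

# Summing the logs of the same densities stays finite.
log_likelihood = np.sum(gaussian_log_pdf(x, 3.0, 2.0))

# Evaluate the log-likelihood on a grid of candidate means (sigma held fixed).
mu_grid = np.linspace(2.0, 4.0, 2001)
ll_grid = np.array([np.sum(gaussian_log_pdf(x, m, 2.0)) for m in mu_grid])
mu_hat = mu_grid[np.argmax(ll_grid)]            # grid point with the highest log-likelihood
```

Here `likelihood` collapses to `0.0` while `log_likelihood` remains usable, and `mu_hat` lands on the grid point closest to `x.mean()`, which is the maximum likelihood estimate of $\mu$ for a normal distribution.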

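or, substituting the normal density, $$\ln\mathcal{L} = \sum\limits_{i=1}^{N}\left[\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(x_i-\mu)^2}{2\sigma^2}\right]$$

Notice the term $(x_i-\mu)^2$ on the right side of this equation. In linear regression, the observation plays the role of the target and $\mu$ is replaced by the model's prediction, so summing this term gives the sum of squared errors, which is the cost function $J$. The first term does not depend on $\mu$, so $$\ln\mathcal{L} = -\frac{J}{2\sigma^2} + \text{const}$$ and minimizing $J$ maximizes the likelihood.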
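The link between the squared-error cost and the Gaussian likelihood can be checked numerically. This is a minimal sketch on synthetic data: the one-parameter model $y \approx wx$, the true slope, and the noise level (treated as known) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 * x + rng.normal(scale=2.0, size=n)   # assumed true slope 1.5, noise sigma 2

sigma = 2.0  # noise standard deviation, treated as known in this sketch

def cost(w):
    """Sum of squared errors J(w) for the one-parameter model y ~ w * x."""
    return np.sum((y - w * x) ** 2)

def log_likelihood(w):
    """Gaussian log-likelihood of the same model with sigma held fixed."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# Evaluate both functions over the same grid of candidate slopes.
w_grid = np.linspace(1.0, 2.0, 1001)
J = np.array([cost(w) for w in w_grid])
ll = np.array([log_likelihood(w) for w in w_grid])

w_min_cost = w_grid[np.argmin(J)]   # slope that minimises the squared error
w_max_ll = w_grid[np.argmax(ll)]    # slope that maximises the log-likelihood
```

The two selected slopes coincide because `ll` is just `-J / (2 * sigma**2)` shifted by a constant that does not depend on `w`.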

One of the most important algorithms in machine learning is **gradient descent**, which we use to minimize the cost function. We use it in place of something like the normal equation because it does not require inverting a matrix, so it scales much better when we have many columns, or features. Each weight $w_i$ is updated repeatedly, where $\alpha$ is the learning rate:

$$w_i = w_i - \alpha\frac{\partial{J}}{\partial{w_i}}$$
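A minimal sketch of this update for linear regression, compared against the normal-equation solution. The cost is assumed to be $J(w) = \frac{1}{2n}\lVert Xw - y\rVert^2$, and the data, learning rate, and iteration count are illustrative choices rather than recommended settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column is the intercept
true_w = np.array([2.0, -1.0, 0.5, 3.0])                     # assumed weights for the demo
y = X @ true_w + rng.normal(scale=0.1, size=n)

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimise J(w) = (1 / 2n) * ||Xw - y||^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / len(y)   # dJ/dw for every weight at once
        w = w - alpha * grad                # the update rule, vectorised over i
    return w

w_gd = gradient_descent(X, y)

# Normal-equation solution for comparison: solves (X^T X) w = X^T y.
w_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

On a small, well-conditioned problem like this one both approaches reach essentially the same weights. The advantage of gradient descent shows up when the number of features is large enough that solving the linear system becomes the bottleneck.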

### Comparison between different types of statistical tests

#### T-test

A t-test tests the **null hypothesis** that the **mean** of a quantitative variable (like age) is the **same** for both levels of a two-level categorical variable (for example, the Labour and Tory parties). So in this example, a t-test would test whether there is a statistically significant difference in mean age between members of the Labour and Tory parties.
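To make the statistic concrete, here is a sketch of the pooled-variance (Student's) two-sample t statistic. Choosing the pooled variant is an assumption, since the text does not say which t-test is meant. The helper `two_sample_t` and the synthetic age samples are hypothetical.

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled estimate of the common variance (ddof=1 gives the sample variance).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    t = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2

rng = np.random.default_rng(3)
ages_group_a = rng.normal(loc=45, scale=12, size=80)   # synthetic ages, group A
ages_group_b = rng.normal(loc=52, scale=12, size=80)   # synthetic ages, group B

t_stat, dof = two_sample_t(ages_group_a, ages_group_b)
```

The p-value comes from comparing `t_stat` against a t distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind` computes both the statistic and the p-value in one call.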

#### Chi-Squared test
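In order to test whether two categorical variables are related, we can use the **Chi-Squared test**.

$$\chi^2 = \sum\limits_{k=1}^{N}\frac{(O_k - E_k)^2}{E_k}$$

where $O_k$ is the **observed** frequency in cell $k$, $E_k$ is the frequency **expected** under the null hypothesis that the variables are unrelated, and $N$ is the number of cells. The statistic is compared against a chi-squared distribution with $d$ **degrees of freedom**: $N - 1$ when a single variable with $N$ categories is compared against fixed expected frequencies, or $(r-1)(c-1)$ for a contingency table with $r$ rows and $c$ columns.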
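The sketch below computes the Pearson statistic $\sum_k (O_k - E_k)^2 / E_k$ for a test of independence on a contingency table. The expected counts are derived from the row and column totals. The helper name `chi_squared_independence` and the counts in `table` are hypothetical.

```python
import numpy as np

def chi_squared_independence(observed):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)
    # Expected counts if the two variables were independent.
    expected = row_totals @ col_totals / observed.sum()
    chi2 = np.sum((observed - expected) ** 2 / expected)
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof

# Hypothetical counts: rows = party, columns = answer to a yes/no question.
table = [[30, 10],
         [20, 40]]
chi2, dof = chi_squared_independence(table)
```

For this table the expected counts are 20, 20, 30, 30, so the statistic works out to $50/3 \approx 16.67$ with one degree of freedom. Note that `scipy.stats.chi2_contingency` applies a continuity correction to 2x2 tables by default, so its result would differ slightly unless `correction=False` is passed.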

#### F-test

We can use the F-test to compare the performance of two regression models, where model 1 is nested in model 2 (model 2 contains every feature of model 1 plus additional ones). The test tells us whether the larger model fits significantly better than the smaller one.

The F-test formula: $$F = \frac{\frac{RSS_1 - RSS_2}{p_2 - p_1}}{\frac{RSS_2}{n-p_2}}$$

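where $n$ is the number of data points, $RSS_1$ and $RSS_2$ are the residual sums of squares of model 1 and model 2, and $p_1$ and $p_2$ are the number of fitted parameters (features, plus the intercept if one is included) in model 1 and model 2, respectively.

If the F value is above the critical value of the F distribution with $(p_2 - p_1, n - p_2)$ degrees of freedom, then we can reject the null hypothesis that the models perform approximately the same, and conclude that model 2 fits significantly better.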
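The F statistic can be computed directly from two least-squares fits. In this sketch the data are synthetic, model 1 is nested in model 2, and $p_1$, $p_2$ are taken to count every fitted coefficient including the intercept (an assumption about the convention). The helpers `rss` and `f_statistic` are defined here for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ w) ** 2)

def f_statistic(rss_1, rss_2, p_1, p_2, n):
    """F statistic comparing a smaller model 1 against a larger nested model 2."""
    return ((rss_1 - rss_2) / (p_2 - p_1)) / (rss_2 / (n - p_2))

rng = np.random.default_rng(4)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)   # assumed data-generating model

X1 = np.column_stack([np.ones(n), x1])        # model 1: intercept + x1 (p_1 = 2)
X2 = np.column_stack([np.ones(n), x1, x2])    # model 2: adds x2        (p_2 = 3)

F = f_statistic(rss(X1, y), rss(X2, y), p_1=2, p_2=3, n=n)
```

The resulting `F` would then be compared against an F distribution with $(p_2 - p_1, n - p_2) = (1, 97)$ degrees of freedom, for example via `scipy.stats.f.sf(F, 1, 97)` if SciPy is available.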