In [1]:
import numpy as np

# Maximum Likelihood Estimate (MLE)
One way to make predictions is to use the single most likely setting of the model parameters, i.e. the one under which the observed data is most probable, rather than integrating over all parameters.
$$P(X_\text{new}|X_\text{old})\approx P(X_\text{new}|\hat\theta)$$
where $$\hat\theta = \text{argmax}_\theta P(X_\text{old}|\theta)$$
We use $D$ to refer to old data in some cases. <br>
Often it is assumed that the data is independent and identically distributed, which means:
$$P(D|\theta)=\prod_i P(D_i|\theta)$$
Another common practice is to maximize the log likelihood of the data instead. Because $\log$ is monotonically increasing, $$\text{argmax}_\theta\, P(D|\theta) = \text{argmax}_\theta\, \log P(D|\theta)$$
This turns the above product into a sum:
$$\log P(D|\theta)= \sum_i \log P(D_i|\theta)$$
This is much more numerically stable, as a product of many numbers less than 1 quickly underflows to zero in floating point. <br>
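
The underflow problem can be seen in a minimal sketch. The per-point likelihood of 0.1 and the count of 1000 points are illustrative assumptions, not values from this notebook.

```python
import numpy as np

# Hypothetical per-point likelihoods P(D_i | theta); the value 0.1 and the
# count 1000 are illustrative assumptions.
point_likelihoods = np.full(1000, 0.1)

# Direct product: 0.1**1000 is far below the smallest positive float64,
# so the result underflows to exactly 0.0.
direct_product = np.prod(point_likelihoods)

# Sum of logs: the same quantity on a log scale stays well within range.
log_likelihood = np.sum(np.log(point_likelihoods))

print("product of likelihoods:", direct_product)
print("sum of log likelihoods:", log_likelihood)
```

Once the product has underflowed, every $\theta$ looks equally bad, whereas the log likelihood still distinguishes between them.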
**Examples of maximum likelihood estimates:**

### Categorical
The binomial distribution describes the number of times one of two possible outcomes occurs in $n$ independent trials. It is defined by a single parameter $\theta$, the probability of that outcome on each trial. Samples are akin to flipping a coin with a certain bentness. The random variable $k$ is the number of times one outcome occurs in $n$ samples; the other outcome then occurs $n-k$ times. The probability of $k$ given $\theta$ is: <br>
$$P(k|n,\theta)=\frac{\theta^k (1-\theta)^{n-k}n!}{k!(n-k)!}$$
We want the single value of $\theta$ that maximizes this likelihood, so we take the $\text{argmax}_\theta$ of the above. Factors that do not depend on $\theta$ do not change the argmax and can be dropped:
$$P(k|n,\theta)\propto \theta^k (1-\theta)^{n-k}$$
Maximizing this is the same as maximizing its logarithm, so up to an additive constant the log likelihood is:
$$l(\theta)= k\ln(\theta)+(n-k)\ln(1-\theta)$$
As this function is concave in $\theta$, we can set the gradient to 0 to get the maximum:
$$
\begin{aligned}
  \nabla l(\theta)&=\frac{k}{\theta}-\frac{n-k}{1-\theta} \\
  0&=\frac{k}{\theta}-\frac{n-k}{1-\theta} \\
  \frac{n-k}{1-\theta}&=\frac{k}{\theta} \\
  \frac{(n-k)\theta}{(1-\theta)\theta}&=\frac{k(1-\theta)}{(1-\theta)\theta} \\
  (n-k)\theta&=k(1-\theta) \\
  n\theta-k\theta&=k-k\theta \\
  n\theta&=k \\
  \theta&=\frac{k}{n}
\end{aligned}$$

So the best estimate for the "bentness" is just the observed fraction of heads, $k/n$ (the mean of the samples if heads is coded as 1 and tails as 0). With enough data this approaches the true value. E.g. for a coin:

In [8]:
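total_heads = 0
total_flips = 0
theta = 0.5  # true probability of heads
for sample in range(10000):
    # np.random.rand() is uniform on [0, 1), so this is True with probability theta
    if np.random.rand() < theta:
        total_heads += 1
    total_flips += 1
# MLE: fraction of flips that came up heads (k/n)
prob_heads_estimate = total_heads / total_flips
print("theta estimate", prob_heads_estimate)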

theta estimate 0.4934
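
As a sanity check on the closed form $\theta = k/n$, the log likelihood can also be maximized numerically over a grid of candidate values. This is only a sketch: the counts `n = 100`, `k = 37` and the grid resolution are assumptions picked for illustration.

```python
import numpy as np

# Hypothetical counts chosen for illustration: k heads in n flips.
n, k = 100, 37

# Candidate values of theta, excluding 0 and 1 where the log diverges.
theta_grid = np.linspace(0.001, 0.999, 999)

# Log likelihood up to an additive constant: k*ln(theta) + (n-k)*ln(1-theta).
log_lik = k * np.log(theta_grid) + (n - k) * np.log(1 - theta_grid)

# The grid point with the highest log likelihood should sit at k/n.
theta_grid_best = theta_grid[np.argmax(log_lik)]
print("grid maximiser:", round(float(theta_grid_best), 3), "closed form k/n:", k / n)
```

The grid search is only accurate to the grid spacing, which is why the closed-form solution is preferred whenever it exists.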


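The same formula also applies to a categorical distribution over several discrete outcomes. If outcome $j$ is observed $k_j$ times, the maximum likelihood estimate of its probability is:
$$\theta_j=\frac{k_j}{\sum_i k_i}$$
Example with a fair die: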

In [23]:
totals = np.zeros(6)
dice_true_probs = np.ones(6)*(1/6)
for sample in range(10000):
    # draw a face from 1..6 using the true probabilities
    roll = np.random.choice(np.arange(1,7),p=dice_true_probs)
    totals[roll-1]+=1
# MLE: count of each face divided by the total number of rolls
estimated_probs = totals/np.sum(totals)
print("truth   ",dice_true_probs.round(4))
print("estimate",estimated_probs.round(4))

truth    [0.1667 0.1667 0.1667 0.1667 0.1667 0.1667]
estimate [0.1686 0.1613 0.1676 0.1717 0.162  0.1688]
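
The same count-and-normalize estimate can be written without an explicit loop. The sketch below is one possible vectorized variant; the seeded generator and the variable names are assumptions and not part of the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability

true_probs = np.ones(6) / 6

# Draw 10000 rolls at once; faces are encoded as 0..5 rather than 1..6.
rolls = rng.choice(6, size=10000, p=true_probs)

# Count how often each face occurred, then normalize: theta_j = k_j / sum_i k_i.
counts = np.bincount(rolls, minlength=6)
vectorized_estimate = counts / counts.sum()

print("estimate", vectorized_estimate.round(4))
```

`minlength=6` keeps the result at six entries even if some face never appears in the sample.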


### Gaussian
The Gaussian pdf is:
$$ p(x|\mu,\sigma^2)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}}$$
Its log is:

$$ \begin{aligned}
    \ln p(x|\mu,\sigma^2)&=\ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right)-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \\
    &=-\ln(\sigma\sqrt{2\pi})-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \\
    &=-\ln(\sigma) -\ln(\sqrt{2\pi})-\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}
\end{aligned}
$$
So for $n$ iid (independent identically distributed) data points this becomes the log likelihood:
$$-n\ln(\sigma) -n\ln(\sqrt{2\pi})-\sum_{i=1}^n \frac{1}{2} \frac{(x_i-\mu)^2}{\sigma^2}$$
Which is: 
$$-n\ln(\sigma) -n\ln(\sqrt{2\pi})- \frac{1}{2} \frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2}$$
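The constant $-n\ln(\sqrt{2\pi})$ can be dropped when doing MLE, as it does not depend on the parameters. <br>
For $\sigma>0$ this function has a single stationary point, which is its maximum, so we can set the partial derivative with respect to each parameter to 0. Writing $l$ for the log likelihood: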
$$ \begin{aligned}
    \frac{\partial l}{\partial \mu}&=\frac{\sum_{i=1}^n(x_i-\mu)}{\sigma^2} \\
    0&=\frac{\sum_{i=1}^n(x_i-\mu)}{\sigma^2} \\
    0&=\sum_{i=1}^n(x_i-\mu) \\
    \sum_{i=1}^n \mu&=\sum_{i=1}^n x_i \\
    n\mu&=\sum_{i=1}^n x_i \\
    \mu&=\frac{1}{n}\sum_{i=1}^n x_i
\end{aligned}
$$
So $\mu$ is just the mean of the samples. Differentiating the log likelihood with respect to $\sigma$ instead:
$$ \begin{aligned}
    \frac{\partial l}{\partial \sigma}&=-\frac{n}{\sigma}+\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    0&=-\frac{n}{\sigma}+\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    \frac{n}{\sigma}&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^3} \\
    n&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{\sigma^2} \\
    \sigma^2&=\frac{\sum_{i=1}^n(x_i-\mu)^2}{n}
\end{aligned}
$$
So $\sigma^2$ is just the variance of the samples about the estimated mean (the version that divides by $n$ rather than $n-1$).

In [34]:
true_mean = 3.4053
true_sigma = 1.4233
gaussian_samples = np.random.normal(true_mean,true_sigma,10000)
estimated_mean = np.mean(gaussian_samples)
estimated_sigma = np.sqrt(np.mean((gaussian_samples-estimated_mean)**2))
print("true mean     ",true_mean,"true sigma     ",true_sigma)
print("estimated mean",estimated_mean.round(4),"estimated sigma",estimated_sigma.round(4))

true mean      3.4053 true sigma      1.4233
estimated mean 3.4038 estimated sigma 1.4283
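
The closed-form estimates can be checked numerically by evaluating the Gaussian log likelihood at the estimates and at slightly shifted parameters. This is a sketch only: the helper `gaussian_log_likelihood`, the seed, and the shift size of 0.05 are assumptions introduced for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Log likelihood of iid samples x under a Gaussian with mean mu and std sigma."""
    n = len(x)
    return (-n * np.log(sigma) - n * np.log(np.sqrt(2 * np.pi))
            - 0.5 * np.sum((x - mu) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
samples = rng.normal(3.4053, 1.4233, 10000)

mle_mean = np.mean(samples)
mle_sigma = np.std(samples)  # ddof=0 by default, i.e. divides by n

best = gaussian_log_likelihood(samples, mle_mean, mle_sigma)

# Shift one parameter at a time and record how the log likelihood changes.
changes = []
for d_mu, d_sigma in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    shifted = gaussian_log_likelihood(samples, mle_mean + d_mu, mle_sigma + d_sigma)
    changes.append(shifted - best)
    print("shift", (d_mu, d_sigma), "log likelihood change", round(float(shifted - best), 3))
```

Every change should be negative, since any move away from the maximum likelihood estimates lowers the log likelihood. Note that `np.std` and `np.var` default to `ddof=0`, which matches the divide-by-$n$ estimate derived above.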
