# **Lecture 5: Maximum Likelihood Learning**
## **Applied Machine Learning**

## **Why Does Supervised Learning Work?**
Previously, we saw one way of explaining why supervised learning works.

## **Part 1: Probabilistic Modeling**
In this lecture, we are going to look at why supervised learning works from a new, probabilistic perspective. \
First, we are going to start by defining the probabilistic approach to machine learning and set up new notation.

### **Review: Machine Learning Model**
A machine learning model is a function
$$f : \mathcal{X} \to \mathcal{Y}$$
that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.
Often, models have parameters $\theta \in \Theta$. We will then write the model as:
$$f_{\theta}: \mathcal{X} \to \mathcal{Y}$$
to denote that it's parametrized by $\theta$.

### **Review: Data Distribution**
We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as:
$$x,y \sim \mathbb{P}.$$
The training set $\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i = 1,2,\dots,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.


### **Probabilistic Models**
A probabilistic model is a probability distribution
$$P(x,y): \mathcal{X} \times \mathcal{Y} \to [0,1].$$
This model can approximate the data distribution $\mathbb{P}(x,y)$.
Probabilistic models also have parameters $\theta \in \Theta$, which we denote as:
$$P_{\theta}(x,y): \mathcal{X} \times \mathcal{Y} \to [0,1]. $$
If we know $P_{\theta}(x,y)$, we can use the conditional $P_{\theta}(y \mid x)$ for prediction.

### **Probabilistic Models: Example**
Consider a simple version of our example with predicting diabetes from BMI.

*   For the target $\mathcal{Y} = \{0, 1\}$, we discretize the diabetes risk score into low risk $(y = 0)$ and high risk $(y = 1)$.
*   For the input $\mathcal{X} = \{0,1,2\}$, we also discretize the BMI into low $(x=0)$, medium $(x=1)$ and high $(x=2)$.

Then the following is a simple probabilistic model:



In [None]:
import pandas as pd

df_model = pd.DataFrame.from_records([
    ['low', 'low', 0.2], ['medium', 'low', 0.1], ['high', 'low', 0.2],
    ['low', 'high', 0.05], ['medium', 'high', 0.1], ['high', 'high', 0.35]
], columns= ['BMI $x$', 'Risk $y$', 'P'])
df_model

Under this model we can compute $P(y \mid x) = P(x,y)/P(x)$ as follows:

In [None]:
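# Marginal P(x): sum the joint probabilities over y for each BMI level.
# Selecting only the 'P' column keeps the string column 'Risk $y$' out of the sum.
df_px = df_model.groupby('BMI $x$')[['P']].sum().rename(columns={'P': 'Px'}).reset_index()
df_conditional_model = df_model.merge(df_px, on='BMI $x$')
# Conditional P(y|x) = P(x,y) / P(x)
df_conditional_model['$P(y|x)$'] = df_conditional_model['P'] / df_conditional_model['Px']
df_conditional_model[['BMI $x$', 'Risk $y$', '$P(y|x)$']]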
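
To make the step from the conditional $P(y \mid x)$ to an actual prediction concrete, here is a minimal sketch in plain Python. It reuses the probabilities from the table above; the helper names `conditional` and `predict`, and the rule of picking the most probable $y$, are illustrative assumptions rather than part of the lecture.

```python
# Joint distribution P(x, y) from the table above, keyed by (BMI level, risk level).
joint = {
    ('low', 'low'): 0.2, ('medium', 'low'): 0.1, ('high', 'low'): 0.2,
    ('low', 'high'): 0.05, ('medium', 'high'): 0.1, ('high', 'high'): 0.35,
}

def conditional(joint, x):
    """Return P(y | x) as a dict, computed as P(x, y) / P(x)."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

def predict(joint, x):
    """Hypothetical decision rule: return the most probable y given x."""
    cond = conditional(joint, x)
    return max(cond, key=cond.get)

for bmi in ['low', 'medium', 'high']:
    print(bmi, conditional(joint, bmi), '->', predict(joint, bmi))
```

Note that for medium BMI both risk levels are equally likely under this model, so the prediction there depends only on how ties are broken.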

### **Why Use Probabilistic Models?**
The probabilistic approach to machine learning is powerful:

*   We can fit models that capture predictive uncertainty.
*   We can construct models in a more principled way by explicitly modeling the data distribution.
*   It offers a new perspective on why supervised learning works.






## **Part 2: Monte Carlo Estimation**
Next, we are going to define Monte Carlo estimation, a mathematical tool that will be important in machine learning.

### **Notation: Random Variables**
Suppose that we have a variable $x \in \mathcal{X}$ that is governed by a distribution $\mathbb{P}$:
$$x \sim \mathbb{P}(x).$$
This $x$ can be a sample from the data distribution or any other random variable.

### **Notation: Expected Value**
Recall that the expected value of a function $g: \mathcal{X} \to \mathbb{R}$ when the input $x$ to $g$ is sampled from $\mathbb{P}$ is given by
$$\mathbb{E}_{x \sim \mathbb{P}}[g(x)] = \sum_{x}g(x)P(x),$$
where we assumed for simplicity that $x$ is discrete.

In practice, computing an expected value is not always easy:

*   $x$ can take on a very large number of values, so summing over all of them may not be feasible.
*   When $x$ is continuous, the expected value is an integral that may have no closed-form solution.

We therefore often use *approximate* methods to compute expected values.



### **Monte Carlo Estimation**
Monte Carlo estimation is a way to approximately compute expected values of the form
$$\mathbb{E}_{x \sim \mathbb{P}}[g(x)] = \sum_{x}g(x)P(x).$$


1.   We first generate $T$ IID samples $x_1, x_2, \dots, x_T$ from $\mathbb{P}$.
2.   Then we estimate the expected value as:
$$\hat{g}(x_1, x_2, \dots, x_T) \triangleq \frac{1}{T}\sum_{t=1}^Tg(x_t).$$

We call $\hat{g}$ the Monte Carlo estimate of the expected value.



### **Monte Carlo Estimation: Example**
Let's say that we throw five dice. What is the expected number of twos?


*   Let $x = (x_1, x_2, \dots,x_5)$ be a roll of five dice, where $x_j \in \{1,2,\dots,6\}$ is the outcome of the $j$-th die.
*   Let $g(x)$ denote the number of twos in the roll $x$.

The expected value $\mathbb{E}_{x \sim \mathbb{P}}[g(x)] = \sum_{x}g(x)P(x)$ is the expected number of twos. We can estimate it with Monte Carlo as follows:


In [None]:
import numpy as np

# sample 10,000 rolls of five dice
dice_rolls = np.random.randint(0,6,size=(5,10000))

# Count the number of twos in each throw
TWO_VAL = 1 # twos are denoted by 1 because of zero-based indexing
num_twos = (dice_rolls == TWO_VAL).sum(axis=0).mean()

print('Monte Carlo Estimate: %.4f'% num_twos)

This makes sense: each of the five dice shows a two with probability $1/6$, so the correct answer is $5/6 \approx 0.83$.
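
Because five dice have only $6^5 = 7776$ outcomes, the sum $\sum_x g(x)P(x)$ can also be evaluated exactly here. The following sketch does so by brute-force enumeration, assuming fair dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 6**5 = 7776 equally likely outcomes of five fair dice.
outcomes = list(product(range(1, 7), repeat=5))
p_outcome = Fraction(1, len(outcomes))

# g(x) = number of twos in the roll x; sum g(x) * P(x) over all x.
expected_twos = sum(roll.count(2) * p_outcome for roll in outcomes)

print(expected_twos, float(expected_twos))
```

Exact enumeration stops being an option quickly as the number of dice grows, which is where the Monte Carlo estimate becomes useful.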

### **Properties of Monte Carlo Estimation**
The Monte Carlo estimate $\hat{g}$ has the following properties:

*   It is an unbiased estimate of the true expectation:
$$\mathbb{E}_P[\hat{g}] = \mathbb{E}_P[g(x)]$$
*   It converges to the true expectation as we average over more samples:
$$\hat{g} = \frac{1}{T}\sum_{t=1}^Tg(x_t) \rightarrow \mathbb{E}_P[g(x)] \text{ for } T \to \infty $$
*   Its variance decreases to zero as we collect more samples:
$$\text{var}_P(\hat{g}) = \text{var}_P\left[\frac{1}{T}\sum_{t=1}^Tg(x_t)\right] = \frac{\text{var}_P[g(x)]}{T}$$
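
The $1/T$ behaviour of the variance can be checked empirically. The sketch below assumes a simpler $g$ than in the dice example, namely the indicator that a single fair die shows a two, so that $\text{var}_P[g(x)] = \frac{1}{6}\cdot\frac{5}{6}$. The seed, the values of $T$, and the number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility

def g(x):
    """Indicator that a single die roll x shows a two."""
    return 1.0 if x == 2 else 0.0

def mc_estimate(T):
    """Monte Carlo estimate of E[g(x)] from T IID die rolls."""
    return sum(g(random.randint(1, 6)) for _ in range(T)) / T

def estimator_variance(T, repeats=2000):
    """Empirical variance of the estimator over many independent repetitions."""
    return statistics.pvariance([mc_estimate(T) for _ in range(repeats)])

true_var = (1 / 6) * (5 / 6)  # variance of the indicator g(x)
for T in (10, 100):
    # Second column: measured variance; third column: var[g(x)] / T.
    print(T, estimator_variance(T), true_var / T)
```

The measured variance should be close to $\text{var}_P[g(x)]/T$ in both rows, shrinking by roughly a factor of ten when $T$ grows from 10 to 100.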






### **Monte Carlo: Summary**

*   Many problems in machine learning require computing intractable expected values.
*   Monte Carlo estimation is a simple method for computing expected values approximately.



## **Part 3: Maximum Likelihood**
Maximum likelihood learning is a general way of training machine learning models. Many algorithms we have seen so far implicitly use this principle.

### **Review: Data Distribution**
We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as:
$$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i=1,2,\dots,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.


### **Review: Probabilistic Model**
A probabilistic model is a probability distribution
$$P(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]$$
This model can approximate the data distribution $\mathbb{P}(x,y)$.
Probabilistic models can also have parameters $\theta \in \Theta$, which we denote as
$$P_{\theta}(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]$$

If we know $P(x,y)$, we can use the conditional $P(y\mid x)$ for prediction.

### **Learning Probabilistic Models**
We now have a probabilistic model and a data distribution. Thus, it is natural to try to learn a good probability distribution $P_{\theta}(x,y)$ that approximates the data distribution $\mathbb{P}(x,y)$.
What are the characteristics of a good model $P_{\theta}(x,y)$?


*   **Predictive accuracy**: correctly predicting targets $y$ from $x$.
 *   Does this patient have diabetes or not? 
*   **Understanding the relationships between $x$ and $y$**.
 *   What physiological features of the patient influence their diabetes risk?
*   **Density estimation**: approximating $\mathbb{P}(x,y)$ so that we can answer any query later.






### **Kullback-Leibler Divergence**
In order to approximate $\mathbb{P}$ with $P_{\theta}$, we need a measure of distance between distributions.

A standard measure of dissimilarity between distributions is the Kullback-Leibler (KL) divergence between two distributions $p$ and $q$, defined as
$$D(p \| q) = \sum_xp(x) \log \frac{p(x)}{q(x)}.$$

Observations:


*   $D(p \| q) \geq 0$ for all $p,q$, with equality if and only if $p = q$. Proof (using Jensen's inequality for the convex function $-\log$):
$$\begin{align*}
D(p \| q) = \mathbb{E}_{x \sim p}\left[ - \log \frac{q(x)}{p(x)}\right] & \geq -\log\left(\mathbb{E}_{x \sim p}\left[\frac{q(x)}{p(x)}\right]\right)\\
&= - \log\left(\sum_xp(x)\frac{q(x)}{p(x)}\right) \\
&= -\log\left(\sum_x q(x)\right) = -\log 1 = 0
\end{align*}$$
*   The KL divergence is asymmetric: in general, $D(p\|q) \neq D(q \| p)$.
*   It has its roots in information theory.
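
The first two observations can be checked numerically. This is a minimal sketch for discrete distributions given as lists of probabilities; the function name `kl_divergence` and the two example distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions.

    p and q are lists of probabilities over the same outcomes; terms with
    p(x) = 0 contribute zero, and q(x) is assumed positive wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two illustrative distributions over three outcomes (values chosen arbitrarily).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

print('D(p||p) =', kl_divergence(p, p))
print('D(p||q) =', kl_divergence(p, q))
print('D(q||p) =', kl_divergence(q, p))
```

The divergence of a distribution from itself is zero, while the two directions $D(p\|q)$ and $D(q\|p)$ are positive and differ from each other.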






### **Learning models using KL-Divergence**
We may now learn a probabilistic model $P_{\theta}(x,y)$ that approximates the data distribution $\mathbb{P}(x,y)$ via the KL-divergence:
$$\begin{align*}
D(\mathbb{P} \| P_{\theta}) &= \mathbb{E}_{x,y \sim \mathbb{P}}\log \frac{\mathbb{P}(x,y)}{P_{\theta}(x,y)}\\
&= \sum_{x,y}\mathbb{P}(x,y) \log \frac{\mathbb{P}(x,y)}{P_{\theta}(x,y)}
\end{align*}$$
Note that $D(\mathbb{P} \| P_{\theta}) = 0$ if the two distributions are the same.

### **From KL Divergence to Log Likelihood**
We can simplify the KL divergence objective somewhat:

$$\begin{align*}
D(\mathbb{P} \| P_{\theta}) &= \mathbb{E}_{x,y \sim \mathbb{P}}\log \frac{\mathbb{P}(x,y)}{P_{\theta}(x,y)}\\
&= \mathbb{E}_{x,y \sim \mathbb{P}}\log\mathbb{P}(x,y) - \mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(x,y)
\end{align*}$$

The first term does not depend on $P_{\theta}$, so minimizing the KL divergence is equivalent to maximizing the expected log-likelihood:
$$\begin{align*}
\arg \min_{P_{\theta}} D(\mathbb{P} \| P_{\theta}) &= \arg\min_{P_{\theta}} - \mathbb{E}_{x,y \sim \mathbb{P}} \log P_{\theta}(x,y)\\
&= \arg\max_{P_{\theta}} \mathbb{E}_{x,y \sim \mathbb{P}} \log P_{\theta}(x,y).
\end{align*}$$

We have now defined a learning objective that is equivalent to minimizing the KL divergence:
$$\arg\max_{P_{\theta}} \mathbb{E}_{x,y \sim \mathbb{P}} \log P_{\theta}(x,y)$$


*   This asks that $P_{\theta}$ assign high probability to instances sampled from $\mathbb{P}$, so as to reflect the true distribution.
*   Because of the $\log$, samples $x, y$ where $P_{\theta}(x,y) \approx 0$ weigh heavily on the objective.

Problem: In practice, we don't know the data distribution $\mathbb{P}$, so this expected value cannot be computed directly.
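
When the data distribution *is* known, the equivalence between minimizing the KL divergence and maximizing the expected log-likelihood can be verified directly. The sketch below assumes a hypothetical Bernoulli data distribution with $\mathbb{P}(H) = 0.6$ and a Bernoulli model family, and searches over a grid of parameter values; all names are illustrative.

```python
import math

# Hypothetical, known data distribution over {H, T}; in practice it is unknown.
p_true = {'H': 0.6, 'T': 0.4}

def model(theta):
    """Bernoulli model P_theta with P_theta(H) = theta."""
    return {'H': theta, 'T': 1.0 - theta}

def kl_to_model(theta):
    """D(P || P_theta) for the Bernoulli model."""
    q = model(theta)
    return sum(p_true[x] * math.log(p_true[x] / q[x]) for x in p_true)

def expected_log_likelihood(theta):
    """E_{x ~ P}[log P_theta(x)]."""
    q = model(theta)
    return sum(p_true[x] * math.log(q[x]) for x in p_true)

# Grid over the open interval (0, 1) to avoid log(0).
thetas = [i / 100 for i in range(1, 100)]
theta_min_kl = min(thetas, key=kl_to_model)
theta_max_ll = max(thetas, key=expected_log_likelihood)

print(theta_min_kl, theta_max_ll)
```

Both searches select the same grid point, because the two objectives differ only by a constant that does not depend on $\theta$.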

### **Maximum Likelihood Estimation**
Applying Monte Carlo estimation, we may approximate the expected log-likelihood
$$\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(x,y) $$

with the *empirical log-likelihood*, treating the training set $\mathcal{D}$ as IID samples from $\mathbb{P}$:
$$\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(x,y) \approx \frac{1}{|\mathcal{D}|}\sum_{x,y \in \mathcal{D}}\log P_{\theta}(x,y)$$

Maximum Likelihood Learning is then:
$$\max_{P_{\theta}} \frac{1}{|\mathcal{D}|}\sum_{x,y \in \mathcal{D}}\log P_{\theta}(x,y)$$

### **Example: Flipping a Random Coin**
Consider a simple example in which we repeatedly toss a biased coin and record the outcomes.


*   There are two possible outcomes: heads ($H$) and tails ($T$). A training dataset consists of tosses of the biased coin, e.g., $\mathcal{D} = \{H, H, T, H, T \}$.
*   Assumption: the true probability distribution is $\mathbb{P}(x), x \in \{H,T\}$.
*   Our task is to model the probability of heads/tails. Our class of models $\mathcal{M}$ is the set of Bernoulli distributions over $x \in \{H,T\}$.




### **Example: Flipping a Random Coin (2)**
How should we choose $P_{\theta}$ from $\mathcal{M}$ if $3$ out of $5$ tosses are heads in $\mathcal{D}$? Let's apply maximum likelihood learning.

*   Our model is $P_{\theta}(x = H) = \theta$ and $P_{\theta}(x = T) = 1 - \theta$.
*   Our dataset is $\mathcal{D} = \{H, H, T, H, T \}$.
*   The likelihood of the data is $\prod_iP_{\theta}(x_i) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta)$.




We optimize for the $\theta$ that makes $\mathcal{D}$ most likely. What is the solution in this case?

In [None]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt

# Our dataset is {H,H,T,H,T}; if theta = P(x = H), we get:
coin_likelihood = lambda theta: theta*theta*(1-theta)*theta*(1-theta)

theta_vals = np.linspace(0,1)
plt.plot(theta_vals, coin_likelihood(theta_vals))

### **Example: Flipping a Random Coin (3)**
Our likelihood and log-likelihood functions are
$$\begin{align*}
L(\theta) &= \theta^{\#\,\text{heads}}(1 - \theta)^{\#\,\text{tails}}\\
\log L(\theta) &= \log (\theta^{\#\,\text{heads}}(1 - \theta)^{\#\,\text{tails}})\\
&= \#\,\text{heads}\cdot \log (\theta) + \#\,\text{tails}\cdot \log (1 - \theta)
\end{align*}$$

The maximum likelihood estimate is the $\theta^* \in [0,1]$ at which $\log L(\theta^*)$ is maximal.

Differentiating the log-likelihood function with respect to $\theta$ and setting the derivative to zero, we obtain
$$\theta^* = \frac{\#\, \text{heads}}{\#\, \text{heads} + \#\, \text{tails}}$$
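
For completeness, the intermediate steps can be written out as follows, using $h$ and $t$ as shorthand for the numbers of heads and tails (a shorthand introduced only for this derivation):
$$\begin{align*}
\frac{d}{d\theta}\log L(\theta) &= \frac{h}{\theta} - \frac{t}{1-\theta} = 0 \\
\Rightarrow\; h(1-\theta) &= t\,\theta \\
\Rightarrow\; \theta^* &= \frac{h}{h+t}.
\end{align*}$$
The second derivative, $-h/\theta^2 - t/(1-\theta)^2$, is negative, so this stationary point is a maximum. For $\mathcal{D} = \{H, H, T, H, T\}$ we get $\theta^* = 3/5 = 0.6$.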

When exact solutions are not available, we can optimize the log-likelihood numerically, e.g., using gradient descent.

We will see examples of this later.
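
As a small preview on the coin example, here is a minimal sketch of numerical optimization. It performs gradient ascent on the average log-likelihood (equivalently, gradient descent on its negative); the step size, number of steps, starting point, and function names are arbitrary illustrative choices.

```python
def grad_mean_log_likelihood(theta, heads, tails):
    """Derivative of the average log-likelihood of a Bernoulli model."""
    n = heads + tails
    return (heads / theta - tails / (1.0 - theta)) / n

def fit_coin(heads, tails, lr=0.1, steps=200, theta0=0.5):
    """Gradient ascent on the log-likelihood (step size and count are arbitrary)."""
    theta = theta0
    for _ in range(steps):
        theta += lr * grad_mean_log_likelihood(theta, heads, tails)
        # Keep theta strictly inside (0, 1) so the gradient stays finite.
        theta = min(max(theta, 1e-6), 1.0 - 1e-6)
    return theta

theta_hat = fit_coin(heads=3, tails=2)
print('numerical estimate: %.4f, closed form: %.4f' % (theta_hat, 3 / 5))
```

For this one-parameter problem the closed-form solution is of course preferable; the numerical route matters for models where no such formula exists.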


### **Conditional Maximum Likelihood**
Sometimes, we may be interested in only fitting a *conditional* model $P(y \mid x)$. For example, we may be only interested in predicting $y$ from $x$ rather than learning the joint structure of $x, y$.

We can extend the principle of maximum likelihood learning to this setting as well. In this case, we are interested in minimizing:
$$\min_{\theta}\mathbb{E}_{x \sim \mathbb{P}}[D(\mathbb{P}(y \mid x) \| P_{\theta}(y \mid x))],$$
the expected KL divergence between $\mathbb{P}(y \mid x)$ and $P_{\theta}(y \mid x)$ over all the inputs $x$.

With a bit of math, we can show that the maximum likelihood objective becomes
$$\max_{\theta}\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(y \mid x)$$
This is the principle of *conditional maximum likelihood*.
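
The "bit of math" can be sketched as follows, using the same discrete-sum convention as in the KL divergence definition above:
$$\begin{align*}
\mathbb{E}_{x \sim \mathbb{P}}\left[D(\mathbb{P}(y \mid x) \| P_{\theta}(y \mid x))\right]
&= \sum_x \mathbb{P}(x) \sum_y \mathbb{P}(y \mid x) \log \frac{\mathbb{P}(y \mid x)}{P_{\theta}(y \mid x)} \\
&= \sum_{x,y} \mathbb{P}(x,y) \log \mathbb{P}(y \mid x) - \sum_{x,y} \mathbb{P}(x,y) \log P_{\theta}(y \mid x) \\
&= \mathbb{E}_{x,y \sim \mathbb{P}} \log \mathbb{P}(y \mid x) - \mathbb{E}_{x,y \sim \mathbb{P}} \log P_{\theta}(y \mid x).
\end{align*}$$
The first term does not depend on $\theta$, so minimizing the expected KL divergence over $\theta$ is the same as maximizing $\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(y \mid x)$.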

## **Part 4: Extensions of Maximum Likelihood** 
Maximum Likelihood Learning is one approach for training probabilistic machine learning models.\
An even more general approach comes from Bayesian statistics. We briefly overview the Bayesian approach in this lesson.

### **Review: Maximum Likelihood Learning**
Recall that in maximum likelihood learning, we are optimizing the following objective:
$$\theta_{MLE} = \arg\max_{\theta}\mathbb{E}_{x, y \sim \mathbb{P}}\log P(y,x; \theta).$$

### **The Frequentist Approach** 
So far, we viewed the parameters $\theta$ as a fixed but unknown quantity that we want to determine, for example via the estimate
$$\theta_{MLE} = \arg\max_{\theta}\mathbb{E}_{x, y \sim \mathbb{P}}\log P(y,x; \theta).$$
This view is an example of the *frequentist* approach in statistics: there exists some true value of $\theta$, and our job is to devise a statistical procedure to estimate this value.

### **The Bayesian Approach**
In Bayesian statistics, $\theta$ is a random variable whose value happens to be unknown.

We formulate two models:


*   A likelihood model $P(x,y \mid \theta)$ that defines the probability of $x, y$ for any fixed value of $\theta$.
*   A prior $P(\theta)$ that specifies our existing beliefs about the distribution of the random variable $\theta$.

Together, these two models define the *joint distribution* 
$$P(x, y, \theta) = P(x,y \mid \theta)P(\theta)$$

in which both the data $x, y$ and the parameters $\theta$ are random variables.



### **Bayesian Inference and Learning**
How do we estimate the parameter $\theta$ that is consistent with a given dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(n)}, y^{(n)})\}?$

Since $\theta$ is an unknown random variable, in the Bayesian approach we are interested in the *posterior probability* $P(\theta \mid \mathcal{D})$ of $\theta$ given the dataset $\mathcal{D}$.

How do we obtain $P(\theta \mid \mathcal{D})$? This distribution is computed using Bayes' rule:
$$\begin{align*}
P(\theta \mid \mathcal{D}) &= \frac{P(\mathcal{D} \mid \theta)P(\theta)}{P(\mathcal{D})} \\
&= \frac{P(\mathcal{D} \mid \theta)P(\theta)}{\int_{\theta}P(\mathcal{D} \mid \theta)P(\theta)d\theta},
\end{align*}$$

where $P(\mathcal{D} \mid \theta) = \prod_{i=1}^nP(x^{(i)}, y^{(i)} \mid \theta)$.

### **Bayesian Predictions**
Suppose we now want to predict the value $y$ from $x$. Unlike in the frequentist setting, we no longer have a single estimate $\theta$ of the model parameters; instead, we have a distribution over them.

The Bayesian approach to predicting $y$ given an input $x$ and a training set $\mathcal{D}$ consists of averaging the predictions of all possible models:
$$P(y \mid x, \mathcal{D}) = \int_{\theta}P(y \mid x, \theta)P(\theta \mid \mathcal{D})d\theta.$$

This is called the *posterior predictive* distribution. Note how each $P(y \mid x, \theta)$ is weighted by the probability of $\theta$ given $\mathcal{D}$.
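
For a one-dimensional parameter, both Bayes' rule and the posterior predictive integral can be approximated on a grid. The sketch below uses the coin dataset $\mathcal{D} = \{H, H, T, H, T\}$ from Part 3, which has no input $x$, so the predictive distribution reduces to the probability that the next toss is heads. The uniform prior and the grid size are assumptions made for illustration.

```python
# Coin dataset from Part 3: 3 heads and 2 tails.
heads, tails = 3, 2

# Discretize theta on a uniform grid over [0, 1] (grid size chosen arbitrarily).
n_grid = 1001
thetas = [i / (n_grid - 1) for i in range(n_grid)]

def likelihood(theta):
    """P(D | theta) for IID coin tosses."""
    return theta ** heads * (1.0 - theta) ** tails

def prior(theta):
    """Assumed uniform prior over theta (an illustrative choice)."""
    return 1.0

# Bayes' rule on the grid: normalize likelihood * prior so the weights sum to one.
unnormalized = [likelihood(t) * prior(t) for t in thetas]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# Posterior predictive P(next = H | D): average P(H | theta) = theta under the posterior.
p_heads = sum(t * w for t, w in zip(thetas, posterior))

print('P(next toss = H | D) is approximately %.4f' % p_heads)
```

Under a uniform prior the exact answer is $4/7$, the mean of the resulting Beta posterior. Grid approximations like this stop being practical once $\theta$ has more than a few dimensions.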

### **The Pros and Cons of the Bayesian Approach**
The Bayesian approach is very powerful. Some of its advantages include:

*   Principled estimates of uncertainty, both in the predictions and the parameters of the model.
*   The ability to incorporate prior knowledge via the prior.
*   A general framework for reasoning about probabilistic models.

The main disadvantage is by far the computational complexity. Averaging over all possible parameter values is typically intractable. There exists an entire field of machine learning that studies how to approximate it.






### **Maximum A Posteriori Learning**
Instead of trying to use the full *posterior distribution* $P(\theta \mid \mathcal{D})$, a common approach is to approximate this distribution by its most likely value:
$$\begin{align*}
\theta_{MAP} &= \arg \max_{\theta} \log P(\theta \mid \mathcal{D}) \\
&= \arg \max_{\theta}(\log P(\mathcal{D} \mid \theta) + \log P(\theta) - \log P(\mathcal{D})) \\
&= \arg \max_{\theta} (\log \prod_{i=1}^nP(x^{(i)},y^{(i)} \mid \theta) + \log P(\theta)),
\end{align*}$$

where in the second line we used Bayes' theorem and in the third line we used the fact that $P(\mathcal{D})$ does not depend on $\theta$.

Thus, we have the following objective:
$$\arg \max_{\theta} (\log \prod_{i=1}^nP(x^{(i)},y^{(i)} \mid \theta) + \log P(\theta))$$

The $\theta_{MAP}$ is known as the *maximum a posteriori* estimate. Note that the objective is the same as the one we used for maximum likelihood, except that we have added the prior term $\log P(\theta)$.

### **Example: Flipping a Random Coin**
How should we choose $P(x \mid \theta)$ from $\mathcal{M}$ if 3 out of 5 tosses are heads in $\mathcal{D}$? Let's first recall the maximum likelihood setup.


1.   Our model is: $P(x = H \mid \theta) = \theta$ and $P(x = T \mid \theta) = 1 -\theta$.
2.   Our data is: $\mathcal{D} = \{H, H, T, H, T\}$.
3.   The likelihood of the data is $\prod_iP(x_i \mid \theta) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta).$

Let's now make this a MAP problem. Let's assume the prior follows the Beta distribution
$$P(\theta) = \frac{1}{B(\alpha + 1,\beta + 1)}\theta^{\alpha}(1 - \theta)^{\beta},$$

where $\alpha, \beta >0$ are parameters and $B$ is the Beta function.

The joint probability on $\mathcal{D} = \{H, H, T, H, T\}$ is then 
$$P(\theta)\prod_iP(x_i \mid \theta) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta)\frac{\theta^{\alpha}(1 - \theta)^{\beta}}{B(\alpha + 1,\beta + 1)}$$

Let's derive an analytic solution. Our objective function is:
$$\begin{align*}
L(\theta) &\propto \theta^{\#\,\text{heads}} \cdot (1 - \theta)^{\#\, \text{tails}} \cdot \theta^{\alpha} \cdot (1 - \theta)^{\beta} \\
\log L(\theta) &= \log (\theta^{\#\,\text{heads}} \cdot (1 - \theta)^{\#\, \text{tails}} \cdot \theta^{\alpha} \cdot (1 - \theta)^{\beta}) + \text{const.} \\
&= (\#\, \text{heads} + \alpha) \log(\theta) + (\#\, \text{tails} + \beta) \log(1 - \theta) + \text{const.}
\end{align*}$$

Differentiating this log-posterior objective with respect to $\theta$ and setting the derivative to zero, we obtain
$$\theta^* = \frac{\#\, \text{heads} + \alpha}{\#\, \text{heads} + \#\, \text{tails} + \alpha + \beta}$$

Thus we see that adding a Beta prior with parameters $\alpha, \beta$ allows us to encode having seen $\alpha$ virtual heads and $\beta$ virtual tails.

For example, if our initial dataset is 
$$\mathcal{D} = \{H, H, T, H, T\}$$
and we set $\alpha =1, \beta = 1$, then the optimal $\theta^*$ will be as if we had the following dataset
$$\mathcal{D}_{\text{virtual}} = \{H, H, T, H, T, H, T\}$$
with one extra head and one extra tail.
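
The closed-form estimates can be compared directly. In this sketch the function names are illustrative, and the grid search is only a sanity check on the formula, not part of the method.

```python
import math

def mle_estimate(heads, tails):
    """Closed-form maximum likelihood estimate for the coin."""
    return heads / (heads + tails)

def map_estimate(heads, tails, alpha, beta):
    """Closed-form MAP estimate under the prior theta^alpha * (1 - theta)^beta."""
    return (heads + alpha) / (heads + tails + alpha + beta)

def log_posterior(theta, heads, tails, alpha, beta):
    """Log-posterior up to an additive constant."""
    return (heads + alpha) * math.log(theta) + (tails + beta) * math.log(1.0 - theta)

heads, tails, alpha, beta = 3, 2, 1, 1
# Numerical check on a grid over the open interval (0, 1).
thetas = [i / 1000 for i in range(1, 1000)]
theta_grid = max(thetas, key=lambda t: log_posterior(t, heads, tails, alpha, beta))

print('MLE:', mle_estimate(heads, tails))
print('MAP (closed form):', map_estimate(heads, tails, alpha, beta))
print('MAP (grid search):', theta_grid)
```

With $\alpha = \beta = 1$ the MAP estimate is $4/7$, pulled from the maximum likelihood estimate $3/5$ toward $1/2$ by the virtual head and tail.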


In [None]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt

# Our dataset is D={H,H,T,H,T}; if theta = P(x=H), we get
alpha, beta = 1, 1
# Our effective dataset is D={H,H,T,H,T,H,T}
coin_likelihood = lambda theta: theta*theta*(1-theta)*theta*(1-theta)*(theta**alpha)*(1-theta)**beta

theta_vals = np.linspace(0,1)
plt.plot(theta_vals, coin_likelihood(theta_vals))