# **Lecture 5b: Probabilistic Perspectives on ML Algorithms**
## **Applied Machine Learning**

## **Part 1: Probabilistic Linear Regression**
Previously, we derived maximum likelihood learning as a general way of learning machine learning models.\
We will now see how the algorithms we have seen so far are special cases of this principle.


### **Review: Probabilistic Models**
A probabilistic model is a probability distribution
$$P(x,y): \mathcal{X} \times \mathcal{Y} \to [0,1].$$

This model can approximate the data distribution $\mathbb{P}$.

If we know $P(x,y)$, we can use the conditional $P(y \mid x)$ for prediction.
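As a small illustration of going from a joint distribution to a prediction, the sketch below builds a toy discrete $P(x,y)$ and normalizes it to get $P(y \mid x)$. The table values and the helper names `conditional` and `predict` are made up for this example.

```python
# Toy joint distribution P(x, y) over weather (x) and activity (y).
# The probabilities are invented for illustration and sum to 1.
joint = {
    ("sunny", "play"): 0.4,
    ("sunny", "stay"): 0.1,
    ("rainy", "play"): 0.1,
    ("rainy", "stay"): 0.4,
}


def conditional(joint, x):
    """Return P(y | x) as a dict by normalizing the slice of the joint at x."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal P(x)
    return {yi: p / p_x for (xi, yi), p in joint.items() if xi == x}


def predict(joint, x):
    """Predict the most likely y under P(y | x)."""
    cond = conditional(joint, x)
    return max(cond, key=cond.get)


print(conditional(joint, "sunny"))
print(predict(joint, "rainy"))
```

The same two steps, marginalize then normalize, apply to continuous models, with sums replaced by integrals.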

Probabilistic models may also have parameters $\theta \in \Theta$, which we will denote as:
$$P_{\theta}(x,y): \mathcal{X} \times \mathcal{Y} \to [0,1]$$ 

### **Review: Conditional Maximum Likelihood**
A general approach to optimizing conditional models of the form $P(y \mid x)$ is to minimize the expected KL divergence between the data distribution and the model:
$$\min_{\theta}\mathbb{E}_{x \sim \mathbb{P}}\left[D(\mathbb{P}(y \mid x) \,\|\, P_{\theta}(y \mid x))\right].$$

With a bit of math, we can show that this objective is equivalent to:
$$\max_{\theta}\mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(y \mid x).$$

This is the principle of *conditional maximum likelihood.*
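The "bit of math" is short enough to spell out. The sketch below expands the KL divergence using the standard definition $D(p \,\|\, q) = \mathbb{E}_{y \sim p}[\log p(y) - \log q(y)]$, which is assumed here rather than restated from the earlier lecture:

$$\begin{align*}
\mathbb{E}_{x \sim \mathbb{P}}\left[D(\mathbb{P}(y \mid x) \,\|\, P_{\theta}(y \mid x))\right]
&= \mathbb{E}_{x \sim \mathbb{P}}\,\mathbb{E}_{y \sim \mathbb{P}(\cdot \mid x)}\left[\log \mathbb{P}(y \mid x) - \log P_{\theta}(y \mid x)\right] \\
&= \underbrace{\mathbb{E}_{x,y \sim \mathbb{P}}\log \mathbb{P}(y \mid x)}_{\text{does not depend on } \theta} - \mathbb{E}_{x,y \sim \mathbb{P}}\log P_{\theta}(y \mid x).
\end{align*}$$

Since the first term is constant in $\theta$, minimizing the expected KL divergence is the same as maximizing the expected conditional log-likelihood.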

### **Review: Least Squares**
Recall that linear regression fits a linear model of the form:
$$f(x) = \sum_{j=1}^d \theta_j x_j = \theta^\top x.$$

It minimizes the mean squared error (MSE)
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y^{(i)} - \theta^\top x^{(i)})^2 $$

on a dataset $\{(x^{(i)}, y^{(i)}) \mid i = 1,2, \cdots, n\}$.

Is there a specific reason for us to be optimizing the mean squared error to fit our linear model?

The answer can be found by looking at the algorithm from a probabilistic perspective.

### **Probabilistic Least Squares**
Let's derive a probabilistic algorithm by defining a class of probabilistic models and using maximum likelihood as the objective.

1.   Let's choose our model family $\mathcal{M}$ to be the set of Gaussian distributions of the form
$$p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(- \frac{(y - \theta^\top x)^2}{2\sigma^2}\right).$$

Each model $\mathcal{N}(y; \mu(x), \sigma)$ is a Gaussian with a fixed standard deviation $\sigma$ (for example, $\sigma = 1$) and a mean $\mu(x) = \theta^\top x$ that is parametrized by $\theta$.
2.  We optimize the model using maximum likelihood. The log-likelihood function at a point $(x,y)$ equals
$$\begin{align*}
\log L(\theta) &= \log p(y \mid x; \theta) = \log \left(\frac{1}{\sqrt{2\pi}\sigma}\exp\left(- \frac{(y - \theta^\top x)^2}{2\sigma^2}\right)\right) \\
&= - \frac{(y - \theta^\top x)^2}{2\sigma^2} + \text{const.}
\end{align*}$$

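Note that, up to an additive constant and a positive scaling factor, this is the negative squared error. Summing it over the dataset recovers the mean squared error (MSE) objective!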

Thus, minimizing MSE is equivalent to maximizing the log-likelihood of a Normal distribution $\mathcal{N}(y; \mu(x), \sigma)$.
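The equivalence can be checked numerically. The following is a minimal sketch, assuming a single scalar feature, $\sigma = 1$, and synthetic data generated for this illustration only; the function names are not from the lecture.

```python
import math
import random


def gaussian_log_likelihood(theta, xs, ys, sigma=1.0):
    """Sum over the dataset of log N(y; theta * x, sigma)."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (y - theta * x) ** 2 / (2 * sigma ** 2)
               for x, y in zip(xs, ys))


def mean_squared_error(theta, xs, ys):
    """J(theta) = 1/(2n) * sum of squared residuals."""
    n = len(xs)
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)


# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
n = len(xs)

# Closed-form least-squares solution for a single feature.
theta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# With sigma = 1 the log-likelihood equals a constant minus n * J(theta).
ll_at_ls = gaussian_log_likelihood(theta_ls, xs, ys)
ll_from_mse = -0.5 * n * math.log(2 * math.pi) - n * mean_squared_error(theta_ls, xs, ys)

print(theta_ls, ll_at_ls, ll_from_mse)
```

Because the log-likelihood is a decreasing affine function of $J(\theta)$, the $\theta$ that minimizes the MSE also maximizes the likelihood; moving away from `theta_ls` in either direction lowers the log-likelihood.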



### **Algorithm: Gaussian Ordinary Least Squares**



*   **Type**: Supervised Learning (Regression)

*   **Model class**: Linear models
*   **Objective function**: Mean squared error
*   **Optimizer**: Normal equations
*   **Probabilistic Interpretation**: Conditional-Gaussian fit using max-likelihood
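For reference, the normal equations named as the optimizer can be written out as follows. This sketch assumes a design matrix $X \in \mathbb{R}^{n \times d}$ whose rows are the $x^{(i)}$ and a target vector $y \in \mathbb{R}^n$ (notation not used elsewhere in this lecture), and that $X^\top X$ is invertible:

$$\begin{align*}
J(\theta) &= \frac{1}{2n}\|X\theta - y\|_2^2, \\
\nabla_\theta J(\theta) &= \frac{1}{n}X^\top(X\theta - y) = 0 \quad\Longrightarrow\quad \theta^* = (X^\top X)^{-1}X^\top y.
\end{align*}$$

Because the Gaussian log-likelihood differs from $-J(\theta)$ only by a positive scaling and an additive constant, the same $\theta^*$ is also the maximum likelihood estimate.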





### **Extensions of Gaussian Least Squares**

This is an example of how we can interpret a machine learning algorithm in a probabilistic framework.

Many algorithms admit this kind of interpretation. Here are some simple extensions:

We can use a Gaussian model and also parametrize the standard deviation:

*   This is called heteroscedastic regression, and it allows us to obtain confidence intervals for our predictions.
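As a rough sketch of what parametrizing the standard deviation can look like, the code below lets both the mean and the standard deviation depend on a scalar input. The specific parametrization ($\mu(x) = \theta x$, $\sigma(x) = e^{\phi x}$), the parameter values, and the 1.96 multiplier for an approximate 95% interval under a Gaussian are assumptions made for this illustration.

```python
import math


def predict_mean_and_std(x, theta, phi):
    """Hypothetical parametrization: mean theta * x, std exp(phi * x) > 0."""
    return theta * x, math.exp(phi * x)


def gaussian_nll(y, mean, std):
    """Negative log-density of N(y; mean, std); note the log(std) term."""
    return math.log(std) + 0.5 * math.log(2 * math.pi) + (y - mean) ** 2 / (2 * std ** 2)


def interval_95(mean, std):
    """Approximate 95% interval for y under the Gaussian model."""
    return mean - 1.96 * std, mean + 1.96 * std


theta, phi = 2.0, 0.5  # illustrative parameter values, not fitted
for x in (0.0, 1.0, 2.0):
    mean, std = predict_mean_and_std(x, theta, phi)
    low, high = interval_95(mean, std)
    print(f"x={x:.1f}  mean={mean:.2f}  interval=({low:.2f}, {high:.2f})")
```

Unlike the fixed-$\sigma$ case, the $\log \sigma(x)$ term no longer drops out as a constant, so the objective is not plain MSE anymore; it trades off fitting the residuals against claiming low noise.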

We can also parametrize other distributions, not only the Gaussian:


*   Exponential or Gamma distribution for continuous variables,
*   Bernoulli distribution for discrete variables.

This yields many new machine learning algorithms.





## **Part 2: Bayesian Algorithms**
We can also use the Bayesian ML framework we have learned to interpret several algorithms we've seen as special cases.

### **Review: The Bayesian Approach**
In Bayesian algorithms, the parameter $\theta$ is a random variable whose value happens to be unknown.

We formulate the two models:


*   A likelihood model $P(x,y \mid \theta)$ that defines the probability of $x,y$ for any fixed value of $\theta$.
*   A prior $P(\theta)$ that specifies our existing belief about the distribution of the random variable $\theta$.

Together, the two models define the joint distribution
$$P(x, y, \theta) = P(x,y \mid \theta)P(\theta)$$

in which both $x, y$ and the parameters $\theta$ are random variables.


### **Review: A Posteriori Learning**
Recall that in maximum a posteriori (MAP) learning, we optimize the following objective:
$$\theta_{MAP} = \arg \max_{\theta} \left(\log \prod_{i=1}^n P(x^{(i)}, y^{(i)} \mid \theta) + \log P(\theta)\right).$$

Note that this is the same formula as the one we used for maximum log-likelihood, except that we have added the log-prior term $\log P(\theta)$.

### **Review: Ridge Regression**
Recall that the ridge regression algorithm fits a linear model
$$f(x) = \sum_{j=0}^d \theta_j x_j = \theta^\top x.$$

We minimize the L2-regularized mean squared error (MSE)

$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y^{(i)} - \theta^\top x^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^d \theta_j^2,$$

on a dataset $\{(x^{(i)}, y^{(i)}) \mid i = 1,2,\cdots, n\}$. The term $\frac{1}{2}\sum_{j=1}^d\theta_j^2 = \frac{1}{2} \|\theta\|_2^2$ is called the regularizer, and $\lambda > 0$ controls its strength.

### **Probabilistic Ridge Regression**
We can interpret ridge regression as maximum a posteriori (MAP) estimation as follows:



1.   First, we choose our model family $\mathcal{M}$ to be the set of Gaussian distributions of the form (let's assume $x \in \mathbb{R}$ for simplicity)
$$p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(- \frac{(y - \theta^\top x)^2}{2\sigma^2}\right).$$
2.   We use a Gaussian prior with mean zero and variance $\tau^2$ on the parameters $\theta$
$$p(\theta) = \prod_{j=1}^d\frac{1}{\sqrt{2\pi}\tau}\exp\left(-\frac{\theta_j^2}{2\tau^2}\right).$$
3. We optimize the model using the MAP approach. The objective at a point $(x,y)$ equals:
$$\begin{align*}
\log L(\theta) &= \log p(y \mid x; \theta) + \log p(\theta)\\
&= \log \left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(- \frac{(y - \theta^\top x)^2}{2\sigma^2}\right)\right] + \log \prod_{j=1}^d\frac{1}{\sqrt{2\pi}\tau}\exp\left(-\frac{\theta_j^2}{2\tau^2}\right) \\
&= - \frac{(y - \theta^\top x)^2}{2\sigma^2} - \frac{1}{2\tau^2}\sum_{j=1}^d\theta_j^2 + \text{const.}
\end{align*}$$

Thus, ridge regression amounts to performing MAP estimation with a Gaussian prior. Multiplying the objective by $-\sigma^2$ turns it into $\frac{1}{2}(y - \theta^\top x)^2 + \frac{\sigma^2}{2\tau^2}\sum_{j=1}^d\theta_j^2$ up to a constant, so the strength of the regularizer is $\lambda = \frac{\sigma^2}{\tau^2}$, which equals $\frac{1}{\tau^2}$ when $\sigma = 1$.
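The correspondence can be sanity-checked numerically. The sketch below assumes a single scalar feature with no bias term and synthetic data invented for this illustration. Over a dataset of $n$ points with the $\frac{1}{2n}$ normalization of $J(\theta)$, matching the two objectives gives $\lambda = \frac{\sigma^2}{n\tau^2}$; the function names are hypothetical.

```python
import random


def ridge_scalar(xs, ys, lam):
    """Minimizer of 1/(2n) * sum (y - theta*x)^2 + lam/2 * theta^2 (one feature)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)


def log_posterior(theta, xs, ys, sigma, tau):
    """Gaussian log-likelihood plus Gaussian log-prior, dropping constants in theta."""
    log_lik = -sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    log_prior = -theta ** 2 / (2 * tau ** 2)
    return log_lik + log_prior


# Synthetic data (an assumption for illustration): y = 1.5x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [1.5 * x + rng.gauss(0.0, 0.5) for x in xs]

sigma, tau = 0.5, 2.0
lam = sigma ** 2 / (len(xs) * tau ** 2)  # regularization strength implied by the prior

theta_ridge = ridge_scalar(xs, ys, lam)

# The ridge solution should be the maximizer of the log-posterior.
best = log_posterior(theta_ridge, xs, ys, sigma, tau)
print(theta_ridge, best)
```

A wider prior (larger $\tau$) gives a smaller $\lambda$ and an estimate closer to ordinary least squares, while a narrow prior pulls $\theta$ toward zero.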

### **Algorithm: Probabilistic Ridge Regression**


*   **Type**: Supervised Learning (Regression)
*   **Model class**: Linear models
*   **Objective function**: L2-regularized mean squared error
*   **Optimizer**: Normal equations
*   **Probabilistic Interpretation**: Conditional Gaussian likelihood and Gaussian prior fit using MAP



### **Bayesian View on Machine Learning Algorithms**
Very often, we can interpret classical ML algorithms as applications of the probabilistic or Bayesian approaches (although we can derive them in other ways as well):

*   Regularization can often be seen as applying a prior on the weights,
*   L1 regularization can be seen as applying a *Laplace* prior,
*   Many other algorithms will have similar interpretations. 
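To see where the Laplace prior comes from, here is a short sketch that mirrors the Gaussian case. The scale parameter $b$ is notation introduced only for this illustration:

$$\begin{align*}
p(\theta) &= \prod_{j=1}^d \frac{1}{2b}\exp\left(-\frac{|\theta_j|}{b}\right), \\
\log p(\theta) &= -\frac{1}{b}\sum_{j=1}^d |\theta_j| + \text{const.} = -\frac{1}{b}\|\theta\|_1 + \text{const.}
\end{align*}$$

Adding this log-prior to a log-likelihood gives an L1 penalty, with a smaller scale $b$ corresponding to stronger regularization.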




## **Part 3: Bayesian Ridge Regression**
Let's now look at an example of a fully Bayesian machine learning algorithm.


### **Review: The Bayesian Approach**

In Bayesian statistics, $\theta$ is a *random variable* whose value happens to be unknown.

We formulate the two models:
*   A likelihood model $P(x, y \mid \theta)$ that defines the probability of $x, y$ for any fixed value of $\theta$.
*   A prior $P(\theta)$ that specifies our existing belief about the distribution of the random variable $\theta$.

Together these two models define the joint probability:
$$P(x,y,\theta) = P(x,y\mid \theta)P(\theta)$$

in which both the $x,y$ and the parameters $\theta$ are random variables.



### **Review: Ridge Regression**
Recall that a ridge regression algorithm fits a linear model:
$$f(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x.$$

We minimize the L2-regularized mean squared error (MSE)
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y^{(i)} - \theta^\top x^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^d \theta_j^2$$

on a dataset $\{(x^{(i)}, y^{(i)}) \mid i= 1,2,\dots,n\}$. The term $\frac{1}{2}\sum_{j=1}^d\theta_j^2 = \frac{1}{2}\|\theta\|_2^2$ is called the regularizer.

### **Probabilistic Ridge Regression**
We can interpret ridge regression as maximum a posteriori (MAP) estimation as follows:
....

### **Bayesian Predictions**
Suppose we now want to predict the value of $y$ given $x$. Unlike in the frequentist setting, we no longer have a single estimate $\theta$ of the model parameters; instead, we have a distribution over them.

The Bayesian approach to predicting $y$ given an input $x$ and a training dataset $\mathcal{D}$ consists of averaging the predictions of all possible models:
$$P(y \mid x, \mathcal{D}) = \int_{\theta}P(y \mid x, \theta)P(\theta \mid \mathcal{D})d\theta.$$
This is called the *posterior predictive* distribution. Note how each $P(y \mid x, \theta)$ is weighted by the probability of $\theta$ given $\mathcal{D}$.
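The integral can be made concrete in the simplest case. The sketch below assumes a single scalar feature, a Gaussian likelihood with known $\sigma$, and a zero-mean Gaussian prior with standard deviation $\tau$, for which the posterior over $\theta$ is itself Gaussian (a standard conjugacy result, not derived in this lecture). It approximates the integral with a Riemann sum over a grid of $\theta$ values and compares the result with the known closed-form Gaussian predictive density. The data and function names are made up for this illustration.

```python
import math


def normal_pdf(z, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def posterior(xs, ys, sigma=1.0, tau=1.0):
    """Mean and std of the Gaussian posterior P(theta | D) for scalar x."""
    precision = 1.0 / tau ** 2 + sum(x * x for x in xs) / sigma ** 2
    mean = (sum(x * y for x, y in zip(xs, ys)) / sigma ** 2) / precision
    return mean, math.sqrt(1.0 / precision)


def posterior_predictive(y, x, post_mean, post_std, sigma=1.0,
                         num_points=2001, width=8.0):
    """Riemann-sum approximation of the integral of P(y|x,theta) P(theta|D)."""
    lo = post_mean - width * post_std
    step = 2 * width * post_std / (num_points - 1)
    total = 0.0
    for k in range(num_points):
        theta = lo + k * step
        # Each model's prediction is weighted by its posterior density.
        total += normal_pdf(y, theta * x, sigma) * normal_pdf(theta, post_mean, post_std) * step
    return total


# Tiny illustrative dataset (invented values).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-2.1, -0.9, 0.2, 1.1, 1.8]
sigma = 1.0

post_mean, post_std = posterior(xs, ys, sigma=sigma, tau=1.0)
x_new, y_new = 0.8, 1.5

p_numeric = posterior_predictive(y_new, x_new, post_mean, post_std, sigma=sigma)
# Closed form: predictive variance = noise variance + parameter uncertainty.
p_closed = normal_pdf(y_new, post_mean * x_new,
                      math.sqrt(sigma ** 2 + (x_new * post_std) ** 2))

print(p_numeric, p_closed)
```

The predictive variance is larger than $\sigma^2$ alone because it also accounts for uncertainty about $\theta$; a point estimate such as MAP would ignore that second term.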