# Maximum Likelihood / Maximum A Posteriori

Here, we review the concepts of maximum likelihood (ML) and maximum a posteriori (MAP) estimation. As a running example, we consider logistic regression, which we studied on Day 1. We will find the ML and MAP solutions of the logistic regression problem using gradient ascent, which we studied on Day 2.

(*Note:* In general, we do *not* recommend fitting logistic regression with vanilla gradient ascent. We do so in this lab session for didactic purposes only. There are more advanced methods that would work better here, but they are themselves variants of gradient ascent.)

In this lab session, we will use the classic breast cancer data from Wisconsin. It contains 569 observations of dimensionality 30 each. Features are computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. The goal is to detect whether the image corresponds to a malignant or benign tumor. This dataset is available in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html). For more details, see the [UCI repository](https://archive.ics.uci.edu/ml/datasets).

Let us assume that the labels $y_n$ can take value either $+1$ (when the tumor is benign) or $-1$ (when the tumor is malignant). Logistic regression posits that the probability of each observation $y_n$, given the covariates $x_n$ (the vector of 30 features, in this case) and the parameters $\beta$ and $\beta_0$, is
$$
p(y_n=1 \;|\; x_n, \beta, \beta_0) = \sigma(\beta_0+\beta^\top x_n),
$$
where $\sigma(\cdot)$ denotes the sigmoid function, which returns a probability (a number between 0 and 1). Therefore, the probability of the negative labels is
$$
p(y_n=-1 \;|\; x_n, \beta, \beta_0) = 1-\sigma(\beta_0+\beta^\top x_n) = \sigma(-\beta_0-\beta^\top x_n),
$$
due to symmetry properties of the sigmoid function.

*Note: We use $+1$ and $-1$ instead of $+1$ and $0$ for mathematical convenience only. Specifically, this allows us to write the likelihood in a more compact form.*

When we have $N$ observations, we can write down the *conditional probability*, given the covariates and the model parameters, in a compact form,
$$
p\left(\{y_n\} \;\big|\; \{x_n\}, \beta, \beta_0\right) = \prod_{n=1}^N p(y_n \;|\; x_n, \beta, \beta_0) = \prod_{n=1}^N \sigma\left(y_n\cdot (\beta_0+\beta^\top x_n)\right).
$$
This equation specifies a conditional probability because the model parameters appear on the conditioning side. If we knew $\beta$ and $\beta_0$, we would be able to compute the probability of the labels in our dataset.
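
As a quick sanity check of the compact form, the sketch below compares it against the label-by-label definition. The scores `z` and labels `y` are made-up values for illustration (they do not come from the breast cancer data), and `sigmoid` is a small stand-in for `expit`.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid; plays the role of scipy.special.expit in this sketch
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores z_n = beta_0 + beta^T x_n and labels y_n in {+1, -1}
z = np.array([2.0, -0.5, 0.3, -1.7])
y = np.array([1.0, -1.0, -1.0, 1.0])

# Label-by-label form: sigma(z) for y = +1, and 1 - sigma(z) for y = -1
p_casewise = np.where(y == 1.0, sigmoid(z), 1.0 - sigmoid(z))

# Compact form: sigma(y * z) covers both cases at once
p_compact = sigmoid(y * z)

# The conditional probability of all labels is the product over observations
likelihood_casewise = np.prod(p_casewise)
likelihood_compact = np.prod(p_compact)
print(likelihood_casewise, likelihood_compact)
```

Both numbers should agree up to floating-point rounding, which is the symmetry $1-\sigma(z)=\sigma(-z)$ at work.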

However, in logistic regression we do not know the values of $\beta$ and $\beta_0$; we want to find them. One approach to finding their values is maximum likelihood.

## Load and Preprocess the Data

In the cell below, we import the Python packages and write a function to load the breast cancer data. We will use `scikit-learn` for that.








In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
%matplotlib notebook

def load_breast_data():
    # Load the data from sklearn
    datadict = load_breast_cancer()
    # Convert X to a numpy array of type "float32" (shape 569x30)
    X = np.asarray(datadict['data']).astype(np.float32)
    # Convert Y (the labels) to a numpy array of type "float32"
    Y = np.asarray(datadict['target']).astype(np.float32)
    # Reshape Y to 569x1 and map its entries from {1,0} to {+1,-1}
    # (in scikit-learn, 1 = benign and 0 = malignant)
    Y = 2.0*Y[:,None]-1.0
    # Return
    return X, Y

**Load and standardize.** We now load the data and standardize each feature to zero mean and unit variance.

In [None]:
# Load the data
X, Y = load_breast_data()

# Standardize in mean and variance
X = X - X.mean(axis=0)
X = X / X.std(axis=0)

**Visualize the data.** It is *always* good practice to visualize the data, as we did on Monday. Here, due to time constraints, we will only check that the dimensions of $X$ and $Y$ are sensible. Still, feel free to plot any features and/or compute any summary statistics.

In [None]:
print("Size of the dataset: " + str(X.shape[0]) + " observations of dimensionality " + str(X.shape[1]))
print("Shape of X: " + str(X.shape))
print("Shape of Y: " + str(Y.shape))

**Expand with a column of ones.** As we did in other lab sessions, we will expand $X$ with a column of ones. This will allow us to treat $\beta_0$ as if it were part of $\beta$.

In [None]:
# Add a column of ones
X = np.hstack( (np.ones((X.shape[0],1)), X) )

print("Shape of X: " + str(X.shape))


## Maximum Likelihood

In ML, we find the values of $\beta$ and $\beta_0$ that maximize the conditional probability above. It is typically easier to maximize the logarithm of the conditional probability instead of the conditional probability itself. (Because the logarithm is strictly increasing, a positive function and its logarithm attain their maximum at the same point; working with the logarithm turns the product into a sum and is numerically more stable.) Thus, we form the objective function as
$$
f_{\textrm{ML}}(\beta) = \sum_{n=1}^N \log p(y_n \;|\; x_n, \beta) = \sum_{n=1}^N \log \sigma\left(y_n\cdot \beta^\top x_n\right)
$$
Here, we have ignored $\beta_0$ because we assume that $x_n$ has been extended with a "one", and thus the vector $\beta$ contains the bias $\beta_0$ in its first element.
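
To illustrate why the logarithm helps numerically, here is a minimal sketch on synthetic numbers. The `margins` array stands in for the values $y_n\cdot\beta^\top x_n$ and is randomly generated (an assumption for illustration, not the lab data); `log_sigmoid` is a hypothetical helper that uses the identity $\log\sigma(z) = -\log(1+e^{-z})$.

```python
import numpy as np

def log_sigmoid(z):
    # log(sigma(z)) = -log(1 + exp(-z)), evaluated stably with logaddexp
    return -np.logaddexp(0.0, -z)

# Hypothetical margins y_n * beta^T x_n for N = 5000 observations
rng = np.random.default_rng(0)
margins = rng.normal(loc=0.0, scale=1.0, size=5000)

# Product of probabilities: underflows to 0.0 in floating point
likelihood = np.prod(1.0 / (1.0 + np.exp(-margins)))

# Sum of log-probabilities: a finite, well-behaved number
log_likelihood = np.sum(log_sigmoid(margins))

print("product of probabilities:", likelihood)
print("sum of log-probabilities:", log_likelihood)
```

With thousands of factors below one, the product is too small to represent as a floating-point number, while the sum of logarithms remains an ordinary negative number that gradient ascent can work with.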

On Day 2, we maximized this function using an existing Python implementation of gradient ascent. Here, we will implement gradient ascent ourselves.

**[Task 1]** In the cell below, implement gradient ascent to obtain the ML estimator of $\beta$ (and implicitly $\beta_0$). Plot the resulting $\beta$.

For that, we will need the gradient with respect to the model parameters ($\beta$), which is given by
$$
\nabla_{\beta} f_{\textrm{ML}} = \sum_{n=1}^N \sigma\left(-y_n\cdot \beta^\top x_n\right) y_n x_n
$$
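
For reference, this gradient follows from the chain rule. A short derivation, using the standard identity $\sigma'(z) = \sigma(z)\,\sigma(-z)$:

$$
\frac{d}{dz}\log \sigma(z) = \frac{\sigma'(z)}{\sigma(z)} = \frac{\sigma(z)\,\sigma(-z)}{\sigma(z)} = \sigma(-z).
$$

Setting $z_n = y_n\cdot \beta^\top x_n$, so that $\nabla_{\beta} z_n = y_n x_n$, each term of the sum contributes

$$
\nabla_{\beta} \log \sigma\left(y_n\cdot \beta^\top x_n\right) = \sigma\left(-y_n\cdot \beta^\top x_n\right) y_n x_n,
$$

and summing over $n$ gives the expression above.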

Hints:

1. You may reuse the code from Tuesday. However, keep in mind that we are now *maximizing* instead of minimizing (this tells you about the sign of the gradient ascent updates).

2. For the initial guess, set $\beta$ to a `numpy` array of size $31\times 1$. Initialize $\beta$ randomly following a normal distribution (`np.random.normal`) with standard deviation $0.1$.

3. Set the stepsize as $\rho = 0.05$.

4. Compute the gradient magnitude at each iteration as
```python
grad_sum = np.sqrt(np.sum(grad**2))
```
and use the stopping criterion
```python
while it == 0 or (grad_sum > 1.0e-3 and it <= 100000):
```
where `it` is the iteration number (initialized at zero).

5. Recall that the sigmoid function is available in Python as `expit()` (imported above from `scipy.special`).

6. To plot the results, you may use this piece of code:
```python
# Plot the resulting value of the coefficients beta
plt.figure()
plt.bar(range(X.shape[1]), beta.ravel())  # flatten the 31x1 array for plotting
plt.xlabel('dimension')
plt.ylabel('value')
plt.show()
```
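
The hints above describe the overall shape of the loop. The following is only a sketch of that structure on a hypothetical toy objective: `f_toy` and `grad_toy` are not part of the lab, and a two-dimensional $\beta$ is assumed purely for illustration.

```python
import numpy as np

TARGET = np.array([[1.0], [-2.0]])  # maximizer of the toy objective

def f_toy(beta):
    # Hypothetical concave objective with its maximum at beta = TARGET
    return -0.5 * np.sum((beta - TARGET) ** 2)

def grad_toy(beta):
    # Gradient of f_toy with respect to beta
    return -(beta - TARGET)

rng = np.random.default_rng(0)
beta = rng.normal(scale=0.1, size=(2, 1))  # random initial guess
rho = 0.05                                  # stepsize
it = 0
grad_sum = np.inf

while it == 0 or (grad_sum > 1.0e-3 and it <= 100000):
    grad = grad_toy(beta)
    grad_sum = np.sqrt(np.sum(grad ** 2))   # gradient magnitude
    beta = beta + rho * grad                # ascent: step along the gradient
    it += 1

print("iterations:", it)
print("beta:", beta.ravel())
print("objective:", f_toy(beta))
```

For the task itself, the toy gradient would be replaced by the logistic regression gradient and $\beta$ would have size $31\times 1$; the plus sign in the update is what distinguishes ascent from descent.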

In [None]:
def grad_ML(X,Y,beta):
    # TODO: Write your code here

In [None]:
def f_ML(X,Y,beta):
    # TODO: Write your code here

In [None]:
# TODO: Write your implementation of gradient ascent here

**[Task 2]** You may have noticed that in Task 1 the algorithm stopped after reaching the maximum number of iterations. Change the maximum number of iterations to 200000 and re-run the code. Plot the resulting $\beta$. What happens? What would happen if we allowed the algorithm to run for a larger number of iterations?

## Maximum A Posteriori

The ML criterion maximizes the likelihood, i.e., the conditional probability. In contrast, the MAP criterion maximizes the *posterior probability* of the parameters, given the data. It assumes a Bayesian setting, in which we have *prior information* about the parameters. A common approach is to place a Gaussian (normal) prior distribution over $\beta$. We place a normal distribution with mean $0$ and variance $\nu$ over these parameters,
$$
p(\beta) = \mathcal{N}(0,\nu I)
$$

The posterior distribution is the probability of the parameters conditioned on the observed data, and we know that it is *proportional* to the prior multiplied by the likelihood, i.e.,
$$
\underbrace{p\left(\beta \;\big|\; \{y_n,x_n\} \right) }_{\textrm{posterior}} \propto \underbrace{p(\beta) }_{\textrm{prior}} \underbrace{ p\left(\{y_n\} \;\big|\; \{x_n\}, \beta \right)}_{\textrm{likelihood}} = p(\beta)  \prod_{n=1}^N \sigma\left(y_n\cdot \beta^\top x_n\right).
$$
Maximizing the (logarithm of the) posterior is very similar to maximizing the likelihood. The only difference is the presence of the prior term, which acts as a *regularizer*. Thus, the MAP objective function is
$$
f_{\textrm{MAP}}(\beta) = - \frac{1}{2\nu}\beta^\top \beta + \sum_{n=1}^N \log \sigma\left(y_n\cdot \beta^\top x_n\right)
$$
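
To see where the quadratic term comes from, it may help to write out the logarithm of the Gaussian prior. Here $D$ denotes the number of entries in $\beta$ (a symbol introduced only for this derivation; $D=31$ in this lab, including the bias):

$$
\log p(\beta) = \log \mathcal{N}(\beta;\, 0, \nu I) = -\frac{D}{2}\log(2\pi\nu) - \frac{1}{2\nu}\beta^\top \beta.
$$

The first term does not depend on $\beta$, so it does not affect the location of the maximum and can be dropped. Adding the remaining term to the log-likelihood gives $f_{\textrm{MAP}}(\beta)$.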

The gradient is now
$$
\nabla_{\beta} f_{\textrm{MAP}} = -\frac{1}{\nu}\beta+\sum_{n=1}^N \sigma\left(-y_n\cdot \beta^\top x_n\right) y_n x_n
$$

For simplicity, you may use $\nu=1$. Strictly speaking, we should choose the value of $\nu$ using a method such as cross-validation, but this is outside the scope of this lab session.

**[Task]** In the cell below, obtain the MAP solution using gradient ascent. Plot the resulting $\beta$.

Hints:

1. For the initial guess, set $\beta$ to a `numpy` array of size $31\times 1$, randomly initialized (same as we did before).

2. Set a small stepsize: $\rho = 10^{-4}$.

3. Compute the gradient magnitude at each iteration as
```python
grad_sum = np.sqrt(np.sum(grad**2))
```
and use the stopping criterion
```python
while it == 0 or (grad_sum > 1.0e-3 and it <= 100000):
```

In [None]:
def f_MAP(X,Y,beta):
    # TODO: Write your code here

In [None]:
def grad_MAP(X,Y,beta):
    # TODO: Write your code here

In [None]:
# TODO: Write your implementation of gradient ascent here

**[Questions]** In this case, the algorithm has converged.

1. Is the objective function better or worse compared to the ML estimator?

2. Which $\beta$ do you think is better: the ML solution or the MAP solution? Which result should you trust?