### Lecture 15: Generative Models (MIT Notes)

**Gaussian Generative Models**

Multinomial generative models in this lecture were used in the context of sequence classification. For continuous data, one way to model a classification problem generatively is to assume that each class follows a Gaussian distribution. Some examples in these notes are based on this [document](https://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note09-2up.pdf).

- Univariate Gaussian

A random variable $x \in \mathbb{R}$ is said to follow a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ if its *pdf* can be written as:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The MLE for $\mu$ is the sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$, and the MLE for $\sigma^2$ is the sample variance with a divisor of $n$ (not $n-1$): $\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i-\bar{x})^2$.


- Univariate Gaussian Classification Example

Imagine we have data on blood sugar levels and diabetes. We want to build a Gaussian classifier that predicts from a blood sugar level whether someone is diabetic (1) or not (0).


| Blood Sugar 	| Diabetes 	|
|---	|---	|
| 10 	| 0 	|
| 8 	| 0 	|
| 10 	| 0 	|
| 10 	| 0 	|
| 11 	| 0 	|
| 11 	| 0 	|
| 12 	| 1 	|
| 9 	| 1 	|
| 15 	| 1 	|
| 10 	| 1 	|
| 13 	| 1 	|
| 13 	| 1 	|


The MLE estimates for the two classes (0 = non-diabetic, 1 = diabetic) are:
- $\mu_0 = 10$, $\sigma^2_{0}=1$
- $\mu_1=12$, $\sigma^2_{1}=4$
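These estimates can be reproduced directly from the table. The following is a minimal sketch using only the standard library; the helper name `gaussian_mle` and the `blood_sugar` dictionary are illustrative choices, not part of the original notes. `statistics.pvariance` divides by $n$, which is the MLE, whereas `statistics.variance` divides by $n-1$.

```python
from statistics import fmean, pvariance

# Blood sugar readings from the table above, grouped by class label.
blood_sugar = {
    0: [10, 8, 10, 10, 11, 11],   # non-diabetic
    1: [12, 9, 15, 10, 13, 13],   # diabetic
}

def gaussian_mle(samples):
    """Return the MLE (mean, variance) of a univariate Gaussian."""
    mu = fmean(samples)
    # pvariance divides by n (population variance), which is the MLE.
    var = pvariance(samples)
    return mu, var

mle = {label: gaussian_mle(xs) for label, xs in blood_sugar.items()}
for label, (mu, var) in mle.items():
    print(f"class {label}: mu = {mu}, sigma^2 = {var}")
```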

Suppose we now get the following new blood sugar measurements:
- 10
- 11
- 6

Using the univariate Gaussian model, we can evaluate the class-conditional densities $P(x|0)$ and $P(x|1)$ for each measurement.

What we really want is $P(c|x)$. We can use Bayes' rule to compute this posterior probability:

$P(c=0|x) = \frac{P(x|c=0)*P(c=0)}{P(x)}$

and 

$P(c=1|x) = \frac{P(x|c=1)*P(c=1)}{P(x)}$

To simplify things, we can instead take the ratio of the posterior probabilities, in which $P(x)$ cancels:

$$\frac{P(c=0|x)}{P(c=1|x)} = \frac{\frac{P(x|c=0)*P(c=0)}{P(x)}}{\frac{P(x|c=1)*P(c=1)}{P(x)}} = \frac{P(x|c=0)*P(c=0)}{P(x|c=1)*P(c=1)}$$

In the current case $P(c=0) = P(c=1)$ (the table above has the same number of diabetic and non-diabetic cases, and no additional information suggests otherwise), so the expression above reduces to:

$$\frac{P(c=0|x)}{P(c=1|x)} = \frac{P(x|c=0)}{P(x|c=1)} $$
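For two univariate Gaussian class-conditionals, taking the logarithm of this ratio gives a closed form that is convenient to evaluate by hand. The following is a worked derivation that uses only the pdf defined earlier, with $(\mu_0,\sigma_0^2)$ and $(\mu_1,\sigma_1^2)$ denoting the parameters of the two classes:

$$\ln\frac{P(x|c=0)}{P(x|c=1)} = \frac{1}{2}\ln\frac{\sigma_1^2}{\sigma_0^2} - \frac{(x-\mu_0)^2}{2\sigma_0^2} + \frac{(x-\mu_1)^2}{2\sigma_1^2}$$

A positive value favors class 0 and a negative value favors class 1.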

In [1]:
import math
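def norm_dist(x,mu,sigma):
    """Univariate Gaussian pdf at x; sigma is the standard deviation (not the variance)."""
    sigma_square = sigma**2
    const = 1/math.sqrt(2*math.pi*sigma_square)  # normalizing constant
    norm = -1*((x-mu)**2)/(2*sigma_square)       # exponent
    pdf = const*math.exp(norm)
    return pdf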

In [2]:
norm_dist(x = 10,mu=10, sigma=1)

0.3989422804014327

Using the Python function defined above, we can compute the ratio of posterior probabilities:

$$\frac{P(c=0|x=10)}{P(c=1|x=10)} = \frac{P(x=10|c=0)}{P(x=10|c=1)} $$

In [3]:
# sigma is the standard deviation: sqrt(1) = 1 for class 0, sqrt(4) = 2 for class 1
round(norm_dist(x = 10,mu=10, sigma=1)/norm_dist(x = 10,mu=12, sigma=2), 4)

3.2974

$$\frac{P(c=0|x=11)}{P(c=1|x=11)} = \frac{P(x=11|c=0)}{P(x=11|c=1)} $$

In [4]:
# sigma is the standard deviation, so class 1 (variance 4) uses sigma=2
round(norm_dist(x = 11,mu=10, sigma=1)/norm_dist(x = 11,mu=12, sigma=2), 4)

1.3746

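$$\frac{P(c=0|x=6)}{P(c=1|x=6)} = \frac{P(x=6|c=0)}{P(x=6|c=1)} $$

In [5]:
# sigma is the standard deviation, so class 1 (variance 4) uses sigma=2
round(norm_dist(x = 6,mu=10, sigma=1)/norm_dist(x = 6,mu=12, sigma=2), 4)

0.0604

A ratio above 1 favors class 0 and a ratio below 1 favors class 1, so the readings 10 and 11 are assigned to class 0 and the reading 6 to class 1.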
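The ratio only says which class is more likely. To get the posterior probabilities themselves, the likelihood-times-prior terms can be normalized so that they sum to 1. Below is a minimal sketch under the assumptions of this example (equal priors, class variances of 1 and 4, hence standard deviations of 1 and 2). The function name `posterior` and the `PARAMS` dictionary are illustrative, and `statistics.NormalDist` from the standard library stands in for the hand-written pdf.

```python
from statistics import NormalDist

# Assumed class parameters from the MLE step: (mean, standard deviation, prior).
# NormalDist takes the standard deviation, i.e. the square root of the variance.
PARAMS = {
    0: (10.0, 1.0, 0.5),   # non-diabetic: variance 1
    1: (12.0, 2.0, 0.5),   # diabetic: variance 4
}

def posterior(x):
    """Return {class: P(class | x)} via Bayes' rule."""
    joint = {c: NormalDist(mu, sd).pdf(x) * prior
             for c, (mu, sd, prior) in PARAMS.items()}
    evidence = sum(joint.values())   # P(x), the normalizing constant
    return {c: j / evidence for c, j in joint.items()}

for x in (10, 11, 6):
    post = posterior(x)
    predicted = max(post, key=post.get)
    print(f"x = {x}: P(0|x) = {post[0]:.3f}, P(1|x) = {post[1]:.3f}, "
          f"predicted class {predicted}")
```

Because class 1 has the wider variance, it also captures very low readings such as 6. That is a property of the fitted Gaussians, not a clinical statement.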

- Multivariate Gaussian:

The multivariate case extends the univariate one to random vectors. A random vector $x \in \mathbb{R}^d$ is said to follow a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$ if its pdf can be written as:

$$p(x) = \frac{1}{\sqrt{(2\pi)^d\det\Sigma}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

The mean vector $\mu$ and the covariance matrix $\Sigma$ are both estimated from the sample data, as the sample mean vector and the sample covariance matrix.
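To make the formula concrete, here is a minimal NumPy sketch of the multivariate pdf. The function name `mvn_pdf` and the toy parameters are illustrative assumptions, not part of the original notes; in practice `scipy.stats.multivariate_normal.pdf` does the same job.

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Density of a d-dimensional Gaussian N(mu, cov) evaluated at x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Solve cov @ z = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm_const

# Toy 2-d example (made-up numbers, not the abalone data).
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.0],
                [0.0, 4.0]])
print(mvn_pdf([0.0, 0.0], mu, cov))   # density at the mean
```

With $d=1$ the function reduces to the univariate pdf used earlier.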

The process of model estimation and prediction is similar to the one followed in the univariate case. 

We will use a dataset to estimate the model and then do the prediction.

In [6]:
import pandas as pd
import numpy as np
from scipy import stats
data = pd.read_csv("./data/abalone.csv",header=None)
data.columns=['Sex','Length','Diameter','Height','While Weight','Shucked Weight','Visceral Weight',"Sell Weight","Rings"]
data.head()

| | Sex | Length | Diameter | Height | While Weight | Shucked Weight | Visceral Weight | Sell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


In [7]:
train = data.query("Sex!='I'").sample(frac=0.95,random_state=42)
test = data.query("Sex!='I'").drop(train.index)

In [8]:
train.head(2)

| | Sex | Length | Diameter | Height | While Weight | Shucked Weight | Visceral Weight | Sell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 3078 | F | 0.695 | 0.535 | 0.2 | 1.5855 | 0.667 | 0.334 | 0.471 | 11 |
| 3773 | F | 0.575 | 0.46 | 0.15 | 0.927 | 0.333 | 0.207 | 0.2985 | 9 |


In [9]:
test.head(2)

| | Sex | Length | Diameter | Height | While Weight | Shucked Weight | Visceral Weight | Sell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 25 | F | 0.56 | 0.44 | 0.14 | 0.9285 | 0.3825 | 0.188 | 0.3 | 11 |
| 38 | F | 0.575 | 0.445 | 0.135 | 0.883 | 0.381 | 0.2035 | 0.26 | 11 |


In [10]:
train_c1 = train.query("Sex=='M'").drop("Sex",axis=1).values
train_c0 = train.query("Sex=='F'").drop("Sex",axis=1).values

In [11]:
mu0 = train_c0.mean(axis=0)
mu1 = train_c1.mean(axis=0)
cov0 = np.cov(train_c0,rowvar=False)
cov1 = np.cov(train_c1,rowvar=False)
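One detail worth checking: `np.cov` normalizes by $N-1$ by default, while the maximum likelihood estimate normalizes by $N$. The sketch below illustrates the difference on a small made-up array (not the abalone data); passing `bias=True` gives the MLE. For a large sample the two estimates are nearly identical.

```python
import numpy as np

# Made-up sample: 4 observations (rows) of 2 features (columns).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0],
              [4.0, 9.0]])
n = X.shape[0]

cov_unbiased = np.cov(X, rowvar=False)          # divides by n - 1 (default)
cov_mle = np.cov(X, rowvar=False, bias=True)    # divides by n (MLE)

# The two estimates differ only by the factor (n - 1) / n.
print(np.allclose(cov_mle, cov_unbiased * (n - 1) / n))
```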

We can use the pdf defined in scipy as below:

```python
# Likelihood of the first test row under class 0 (drop the label column first)
stats.multivariate_normal.pdf(test.drop("Sex",axis=1).iloc[0].values,mean=mu0,cov=cov0)
```

In [12]:
train['Sex'].value_counts(normalize=True)

M    0.540661
F    0.459339
Name: Sex, dtype: float64

In [13]:
prior_0 = 0.459339
prior_1 = 0.540661

For one row of the test data (index label 76, a male specimen), let's use our model to see what prediction we get. We will use the following result:

$$\frac{P(c=0|x)}{P(c=1|x)} = \frac{\frac{P(x|c=0)*P(c=0)}{P(x)}}{\frac{P(x|c=1)*P(c=1)}{P(x)}} = \frac{P(x|c=0)*P(c=0)}{P(x|c=1)*P(c=1)}$$

In [14]:
row = test.drop("Sex",axis=1).loc[76].values
(stats.multivariate_normal.pdf(row,mean=mu0,cov=cov0)*prior_0)/(stats.multivariate_normal.pdf(row,mean=mu1,cov=cov1)*prior_1)

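0.7899157985689419

The ratio is below 1, so the model favors class 1 (`M`), which matches the true label of row 76 in the test data shown below.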

In [15]:
test.head()

| | Sex | Length | Diameter | Height | While Weight | Shucked Weight | Visceral Weight | Sell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 25 | F | 0.56 | 0.44 | 0.14 | 0.9285 | 0.3825 | 0.188 | 0.3 | 11 |
| 38 | F | 0.575 | 0.445 | 0.135 | 0.883 | 0.381 | 0.2035 | 0.26 | 11 |
| 76 | M | 0.595 | 0.475 | 0.14 | 0.944 | 0.3625 | 0.189 | 0.315 | 9 |
| 108 | F | 0.51 | 0.39 | 0.135 | 0.6335 | 0.231 | 0.179 | 0.2 | 9 |
| 154 | F | 0.565 | 0.45 | 0.135 | 0.9885 | 0.387 | 0.1495 | 0.31 | 12 |


**3 Class Formulation**

In [16]:
data = pd.read_csv("./data/abalone.csv",header=None)
data.columns=['Sex','Length','Diameter','Height','While Weight','Shucked Weight','Visceral Weight',"Sell Weight","Rings"]
train = data.sample(frac=0.95,random_state=42)
test = data.drop(train.index)

In [17]:
train_m = train.query("Sex=='M'").drop("Sex",axis=1).values
train_f = train.query("Sex=='F'").drop("Sex",axis=1).values
train_i = train.query("Sex=='I'").drop("Sex",axis=1).values

mu_m = train_m.mean(axis=0)
mu_f = train_f.mean(axis=0)
nu_i = train_i.mean(axis=0)


cov_m = np.cov(train_m,rowvar=False)
cov_f = np.cov(train_f,rowvar=False)
cov_i = np.cov(train_i,rowvar=False)

In [18]:
train['Sex'].value_counts(normalize=True)

M    0.365171
I    0.323841
F    0.310988
Name: Sex, dtype: float64

In [19]:
prior_m = train['Sex'].value_counts(normalize=True)['M']
prior_f = train['Sex'].value_counts(normalize=True)['F']
prior_i = train['Sex'].value_counts(normalize=True)['I']

In [32]:
row = test.drop("Sex",axis=1).iloc[3].values
l_m = stats.multivariate_normal.pdf(row,mean=mu_m,cov=cov_m)
l_f = stats.multivariate_normal.pdf(row,mean=mu_f,cov=cov_f)
l_i = stats.multivariate_normal.pdf(row,mean=nu_i,cov=cov_i)
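# Normalize by the evidence P(x): the sum over classes of likelihood * prior
evidence = l_m*prior_m + l_f*prior_f + l_i*prior_i
post_m = (l_m*prior_m)/evidence
post_f = (l_f*prior_f)/evidence
post_i = (l_i*prior_i)/evidence

In [33]:
print(f"post_m: {post_m:.4g}, post_f: {post_f:.4g}, post_i: {post_i:.4g}")

post_m: 0.5093, post_f: 0.4907, post_i: 4.038e-05

The posteriors sum to 1, and the largest one is for class `M`, which matches the true label of the selected row (index 130 in the test data shown below).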


In [25]:
test.head()

| | Sex | Length | Diameter | Height | While Weight | Shucked Weight | Visceral Weight | Sell Weight | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 34 | F | 0.705 | 0.55 | 0.2 | 1.7095 | 0.633 | 0.4115 | 0.49 | 13 |
| 64 | M | 0.52 | 0.4 | 0.12 | 0.58 | 0.234 | 0.1315 | 0.185 | 8 |
| 95 | M | 0.665 | 0.535 | 0.195 | 1.606 | 0.5755 | 0.388 | 0.48 | 14 |
| 130 | M | 0.595 | 0.48 | 0.165 | 1.262 | 0.4835 | 0.283 | 0.41 | 17 |
| 161 | F | 0.605 | 0.485 | 0.16 | 1.222 | 0.53 | 0.2575 | 0.28 | 13 |
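Multiplying raw densities can underflow when there are many features, so a common alternative is to work with log densities and normalize with the log-sum-exp trick. The following sketch shows that approach on made-up two-dimensional parameters rather than the abalone estimates; the function names `log_gaussian` and `class_posteriors` and all numbers are illustrative assumptions.

```python
import numpy as np

def log_gaussian(x, mu, cov):
    """Log density of N(mu, cov) at x."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)           # log of det(cov)
    maha = diff @ np.linalg.solve(cov, diff)     # Mahalanobis term
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def class_posteriors(x, params, priors):
    """Return {label: P(label | x)} for Gaussian class-conditionals."""
    labels = list(params)
    # log P(x | c) + log P(c) for each class
    log_joint = np.array([
        log_gaussian(x, *params[c]) + np.log(priors[c]) for c in labels
    ])
    # log-sum-exp: subtract the max before exponentiating for stability
    shifted = log_joint - log_joint.max()
    post = np.exp(shifted) / np.exp(shifted).sum()
    return dict(zip(labels, post))

# Hypothetical parameters for three classes (not fitted to any data).
params = {
    "M": (np.array([1.0, 1.0]), np.eye(2)),
    "F": (np.array([1.2, 0.9]), np.eye(2)),
    "I": (np.array([-3.0, -3.0]), 0.5 * np.eye(2)),
}
priors = {"M": 0.4, "F": 0.3, "I": 0.3}

post = class_posteriors(np.array([1.0, 1.0]), params, priors)
print(post, "predicted:", max(post, key=post.get))
```

The predicted class is the one with the largest posterior, which generalizes the two-class ratio test to any number of classes.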
