# Generative Gaussian Models

We'll use again the *iris* dataset, and solve the iris classification prolem using Gaussian classifiers.

In [61]:
import numpy as np
import matplotlib.pyplot as plt
from load_dataset import loadDataSet                                #for loading the dataset
from train_validation_split import splitTrainingValidation          #for splitting the dataset into training and validation sets
from mean_covariance import vcol, vrow, compute_mu_C                #for computing the empirical mean and the empirical covariance of the dataset 
from sklearn.metrics import classification_report                   #for generating the classification report of the models
from scipy.special import logsumexp                                 #for scipy.special.logsumexp

In [40]:
numFeatures = 4

#load the iris dataset
D, L = loadDataSet('iris.csv', numFeatures)
print("Data shape: ", D.shape)
print("Labels shape: ", L.shape)

Data shape:  (4, 150)
Labels shape:  (150,)


In [41]:
#split the dataset into training and validation sets
#DTR and LTR are training data and labels, DTE and LTE are evaluation (or more precisely validation) data and labels
(DTR, LTR), (DVAL, LVAL) = splitTrainingValidation(2/3, D, L)
print("Training data shape: ", DTR.shape)
print("Training labels shape: ", LTR.shape)
print("Evaluation data shape: ", DVAL.shape)
print("Evaluation labels shape: ", LVAL.shape)

Training data shape:  (4, 100)
Training labels shape:  (100,)
Evaluation data shape:  (4, 50)
Evaluation labels shape:  (50,)


We use 100 samples for training and 50 samples for evaluation.

## Multivariate Gaussian Classifier
The optimal Bayes decision is to select for each test point the class with highest **posterior probability**: having class $c$ and $x_{t}$ as test point, we can thus write:
$$
c_{t}^{*} = argmax_{c} P (C_{t} = c \mid \mathbf{X}_t = \mathbf{x}_t) \rightarrow We \space assign \space x_{t} \space to \space the \space class \space having \space the \space highest \space Posterior \space probability
$$ 
We will assume that the samples are independent and identically distributed (*i.i.d.*) according to $(\mathbf{X}_t, C_{t}) ∼ (\mathbf{X}, C)$. <br>
Let $f_{X,C}$ be the joint density of $X, C$: we can
compute the joint likelihood for the hypothesized class $c$ for the observed test
sample $x_{t}$ as $f_{X,C}(x_{t}, c)$ and then use **Bayes rule** to compute the class posterior
probability:
$$
P(C_t = c \mid \mathbf{X}_t = \mathbf{x}_t) = \frac{f_{\mathbf{X},C}(\mathbf{x}_t, c)}{\sum_{c' \in C} f_{\mathbf{X},C}(\mathbf{x}_t, c')}
$$
We can factorize the joint density as:
$$
f_{\mathbf{X}_t, C_t}(\mathbf{x}_t, c) = f_{\mathbf{X} \mid C}(\mathbf{x}_t \mid c) P(c)
$$
Where:
- $f_{\mathbf{X} \mid C}(\mathbf{x}_t \mid c)$ is the class conditional distribution 
- $P(c)$ is called *Prior probabilty*: it's application-dependent and describes the probability of the class being $c$ **before** we observe $x_{t}$ 

In this specific case, we assume that our data, given the class, can be described by a **Gaussian distribution**:
$$
(\mathbf{X}_t \mid C_{t} = c) ∼ (\mathbf{X} \mid C = c) ∼ \mathcal{N}(\mathbf{µ_{c}}, \mathbf{Σ_{c}})
$$
If we knew $\mathbf{µ_{c}}$, $\mathbf{Σ_{c}}$ then we could compute the conditional this way;
$$
f_{\mathbf{X} \mid C}(\mathbf{x}_t \mid c) = \mathcal{N}(\mathbf{µ_{c}}, \mathbf{Σ_{c}})
$$
The problem is that we don't have **these parameters** $\theta = [(\mathbf{µ_{1}}, \mathbf{Σ_{1}}), . . . ,(\mathbf{µ_{k}}, \mathbf{Σ_{k}})] $, where $k$ is the number of different classes. <br>
However, since we have at our disposal a *labeled Dataset*, we can assume:
- Gaussian distribution for $\mathbf{X} \mid C$
- That, given the model parameters $\theta$, all the samples observations are *i.i.d* 

After (and only after) making these assumptions, we can plug in the **Maximum Likelihood Estimators** (*MLE*), which, for a **MVG** distribution, are the empirical mean and covariance matrix of each class. So, for each class $c$ we can compute the two estimators:
$$
\mu^{MLE}_{c} = \frac{1}{N_c} \sum_{i} x_{c,i}, \quad 
\Sigma^{MLE}_{c} = \frac{1}{N_c} \sum_{i} (x_{c,i} - \mu^*_c)(x_{c,i} - \mu^*_c)^T
$$
Where $x_{c,i}$ is the $i$-th sample of class $c$.


In [42]:
#Compute the MLE estimators of a MVG distribtion, which are the empirical mean and covariance of the training data
mu_0, C_0, = compute_mu_C(DTR[:, LTR == 0])
mu_1, C_1, = compute_mu_C(DTR[:, LTR == 1])
mu_2, C_2, = compute_mu_C(DTR[:, LTR == 2])

print(f"mu_0:\n{mu_0}\nShape: {mu_0.shape}")
print(f"mu_1:\n{mu_1}\nShape: {mu_1.shape}")
print(f"mu_2:\n{mu_2}\nShape: {mu_2.shape}")
print(f"C_0:\n{C_0}\nShape: {C_0.shape}")
print(f"C_1:\n{C_1}\nShape: {C_1.shape}")
print(f"C_2:\n{C_2}\nShape: {C_2.shape}")

mu_0:
[[4.96129032]
 [3.42903226]
 [1.46451613]
 [0.2483871 ]]
Shape: (4, 1)
mu_1:
[[5.91212121]
 [2.78484848]
 [4.27272727]
 [1.33939394]]
Shape: (4, 1)
mu_2:
[[6.45555556]
 [2.92777778]
 [5.41944444]
 [1.98888889]]
Shape: (4, 1)
C_0:
[[0.13140479 0.11370447 0.02862643 0.01187305]
 [0.11370447 0.16270552 0.01844953 0.01117586]
 [0.02862643 0.01844953 0.03583767 0.00526535]
 [0.01187305 0.01117586 0.00526535 0.0108845 ]]
Shape: (4, 4)
C_1:
[[0.26470156 0.09169881 0.18366391 0.05134068]
 [0.09169881 0.10613407 0.08898072 0.04211203]
 [0.18366391 0.08898072 0.21955923 0.06289256]
 [0.05134068 0.04211203 0.06289256 0.03208448]]
Shape: (4, 4)
C_2:
[[0.30080247 0.08262346 0.18614198 0.04311728]
 [0.08262346 0.08533951 0.06279321 0.05114198]
 [0.18614198 0.06279321 0.18434414 0.04188272]
 [0.04311728 0.05114198 0.04188272 0.0804321 ]]
Shape: (4, 4)


Given the estimated model, we now turn our attention towards inference for a test sample $x$. As we
have seen, the final goal is to compute class posterior probabilities $P(c \mid \mathbf{x})$. We split the process in three
stages:

*Stage 1*: For each sample we compute the likelihoods, so the class conditional probabilities as:
$$
f_{X|C} (x_t | c) = \mathcal{N} (x_t | \mu^{MLE}_c, \Sigma^{MLE}_c)
$$

**Beware**: model params were estimated using the *training samples*, whereas densities are computed using *estimation samples*!

In [43]:
from logpdf_loglikelihood_GAU import logpdf_GAU_ND

#For each class Compute the log-pdf of the training data given the MLE parameters of the MVG distribution
#It's better to compute the log-pdf and not the pdf, because the pdf can be very small and can cause numerical problems (underflow)
#Then the logpdf gets exponentiated and the numerical problems are avoided
logpdf_0 = logpdf_GAU_ND(DVAL, mu_0, C_0)
logpdf_1 = logpdf_GAU_ND(DVAL, mu_1, C_1)
logpdf_2 = logpdf_GAU_ND(DVAL, mu_2, C_2)

print(f"logpdf_0 Shape: {logpdf_0.shape}")
print(f"logpdf_1 Shape: {logpdf_1.shape}")
print(f"logpdf_2 Shape: {logpdf_2.shape}")


logpdf_0 Shape: (50,)
logpdf_1 Shape: (50,)
logpdf_2 Shape: (50,)


In [44]:
#Now in order to compute the pds I exponentiate the log-likelihoods
pds_0 = np.exp(logpdf_0)
pds_1 = np.exp(logpdf_1)
pds_2 = np.exp(logpdf_2)

I can automate the process and compute a *Score Matrix* having for each row i the conditional of class i and so $S[i, j]$ is the pdf of the j-th sample given the i-th class:

In [45]:
def scoreMatrix_Pdf_GAU(D, params):
    """
    Compute the Pdf of the data given the parameters of a Gaussian distribution
    and populate the score matrix S with the log-pdf of each class
    #The score matrix is filled with the pdfs of the training data given the MLE parameters of the MVG distribution
    #S[i, j] is the pdf of the j-th sample given the i-th class

    Parameters:
    - D: the data matrix of shape (numFeatures, numSamples)
    - params: the model parameters, so  list of tuples (mu, C) where mu is the mean vector fo class c and C is the covariance matrix of class c

    Returned Values:
    - S: the score matrix of shape (numClasses, numSamples) where each row is the score of the class given the sample

    """
    numClasses = len(params) #number of classes, since for each class we have a tuple (mu, C)

    
    S = np.zeros((numClasses, D.shape[1]))
    for label in range(numClasses):
        S[label, :] = np.exp(logpdf_GAU_ND(D, params[label][0], params[label][1]))

    return S

In [46]:
#Compute score matrix S of log likelihoods for each sample and class
S_Likelihoods = scoreMatrix_Pdf_GAU(DVAL, [(mu_0, C_0), (mu_1, C_1), (mu_2, C_2)])
print(f"Score matrix shape: {S_Likelihoods.shape}")

Score matrix shape: (3, 50)


*Stage 2*: We multiply the class conditional probabilities, computed before, with the class *Prior* probabilities. In
the following we assume that the three classes have the same Prior probability $P(c) = 1/3$. We can thus
compute the joint distribution for samples and classes as:
$$
f_{X,C}(x_t, c) = f_{X|C}(x_t | c) P_C(c)
$$


In [None]:
def computeSJoint(S, Priors):
    """
    Compute the joint densities by multiplying the score matrix S with the Priors
    #The joint densities are the product of the score matrix S with the Priors

    Parameters:
    - S: the score matrix of shape (numClasses, numSamples) where each row is the score of the class given the sample
    - Priors: the priors of the classes, so a list of length numClasses

    Returned Values:
    - SJoint: the joint densities of shape (numClasses, numSamples) where each row is the joint density of the class given the sample
    """


    """
    #Old implementation of computeSJoint:


    numClasses = len(Priors) #number of classes, since we have 1 prior for each class
    newS = np.zeros((numClasses, S.shape[1])) #initialize newS with zeros

    for classIndex in range(numClasses):
        #multiply each row of S (where 1 row corresponds to a class) with the prior of the class
        newS[classIndex, :] = S[classIndex, :] * Priors[classIndex]


    return newS
    """

    #S has shape: (numClasses, numSamples)
    #Priors has shape: (numClasses, ) -> it's a row vector
    #To correctly perform the multiplication, we need to transpose Priors to make it a column vector
    return S * vcol(Priors) #multiply each row of S (where 1 row corresponds to a class) with the prior of the class

In [48]:
SJoint_MVG = computeSJoint(S_Likelihoods, np.ones((3, )) / 3) #compute the joint densities by multiplying the score matrix S with the Priors
print(f"Joint densities shape: {SJoint_MVG.shape}")

SJoint_MVG_Sol = np.load("./solutions/SJoint_MVG.npy")




Joint densities shape: (3, 50)


In [49]:
#Check if the joint densities are equal to the solution
#Beware: the joint densities are not equal to the solution, but they are very close to the solution due to numerical problems
np.allclose(SJoint_MVG, SJoint_MVG_Sol)

True

The problem stemming from this technique is that these calculations originate many numeric problems! That's why the expressions like: 
```python
SJoint_MVG==SJoint_MVG_Sol
```
return False whereas expressions like:
```python
np.allclose(SJoint_MVG, SJoint_MVG_Sol)
```
return True

*Stage 3*: Finally, we can compute the class Posteriors probabilities as:
$$
P(C_t = c \mid \mathbf{X}_t = \mathbf{x}_t) = \frac{f_{\mathbf{X},C}(\mathbf{x}_t, c)}{\sum_{c' \in C} f_{\mathbf{X},C}(\mathbf{x}_t, c')}
$$
At the denominator we sum the joint probability over all classes to compute the marginal densities for each sample wich are $f_{\mathbf{X}}(\mathbf{x}_t)$ and have shape ```(1, DVAL.shape[1])```. The *axis_0* has shape equal to 1 since we sum over all the rows, corresponding to the joints for all the classes.

In [50]:
vrow(SJoint_MVG_Sol.sum(0)).shape #check if the first column of the joint densities are equal to the solution

(1, 50)

In [51]:
def computePosteriors(SJoint):
    """
    Compute the posteriors by normalizing the joint densities
    The posteriors are the joint densities divided by the sum of the joint densities which are the marginals

    Parameters:
    - SJoint: the joint densities of shape (numClasses, numSamples) where each row is the joint density of the class 

    Returned Values:
    - SPost: the posteriors of shape (numClasses, numSamples) where each row is the posterior of the class given the sample
    """
    #1. Compute marginals
    SMarginal = vrow(SJoint.sum(0)) #sum over the rows (axis=0) to get the marginal of each sample

    #2. Compute posteriors by dividing the joint densities by the marginals
    SPost = SJoint / SMarginal #element wise division

    return SPost
   

In [52]:
SPost_MVG = computePosteriors(SJoint_MVG) #compute the posteriors by normalizing the joint densities
print(f"Posteriors shape: {SPost_MVG.shape}")

Posteriors shape: (3, 50)


**Classification Rule**: As said before, the optimal Bayes decision is to select for each test sample the class with highest **posterior probability**: 
$$
c_{t}^{*} = argmax_{c} P (C_{t} = c \mid \mathbf{X}_t = \mathbf{x}_t)
$$ 

In [53]:
#Select for each sample the class with the highest posterior probability
PVAL_MVG = np.argmax(SPost_MVG, axis=0) #select the class with the highest posterior probability for each sample, set axis=0 to select the class with the highest posterior probability for each sample
print(f"Predictions shape: {PVAL_MVG.shape}")
print(f"Predictions: {PVAL_MVG}")

Predictions shape: (50,)
Predictions: [0 0 1 2 2 0 0 0 1 1 0 0 1 0 2 1 2 1 0 2 0 2 0 0 2 0 2 1 1 1 2 2 2 1 0 1 2
 2 0 1 1 2 1 0 0 0 2 1 2 0]


Error calculation for the MVG ggm model:

In [54]:
error_count_MVG = np.count_nonzero(PVAL_MVG != LVAL)
print(f"Number of wrong predictions: {error_count_MVG}")
error_rate_pca = np.mean(PVAL_MVG != LVAL)
print(f"Error Rate: {error_rate_pca:.2%}")

Number of wrong predictions: 2
Error Rate: 4.00%


Accuracy, Precision, Recall, F-1 Score for the MVG gmm model: <br>
With 3 classes, accuracy is compute as:
$$
acc = \frac{Correct \space predictions}{Tot \space samples} = \frac{T0+T1+T2}{T0+T1+T2+F0+F1+F2} = 1 - Error \space Rate
$$

In [55]:
print(classification_report(LVAL, PVAL_MVG, digits=3))

              precision    recall  f1-score   support

           0      1.000     1.000     1.000        19
           1      1.000     0.882     0.938        17
           2      0.875     1.000     0.933        14

    accuracy                          0.960        50
   macro avg      0.958     0.961     0.957        50
weighted avg      0.965     0.960     0.960        50



As we have already discussed, working directly with densities is often problematic, due to numerical
issues. It’s useful to implement the whole procedure directly in terms of log-densities (if we need, we can
recover posterior probabilities at the end). <br>
Working in the *log-domain*, the three stages for computing the class posterior probabilities $P(c \mid \mathbf{x})$ are: <br>
*Stage 1*: For each sample we compute the log-likelihoods, so the class conditional log-probabilities as:
$$
\log f_{X|C} (x_t | c) = \log \mathcal{N} (x_t | \mu^{MLE}_c, \Sigma^{MLE}_c)
$$

**Beware**: model params were estimated using the *training samples*, whereas densities are computed using *estimation samples*! <br>
We can thus rewrite and extend the function `scoreMatrix_Pdf_GAU` written before:

In [56]:
def scoreMatrix_Pdf_GAU(D, params, useLog=False):
    """
    Compute the (log?)-Pdf of the data given the parameters of a Gaussian distribution
    and populate the score matrix S with the (log?)-pdf of each class
    #The score matrix is filled with the pdfs of the training data given the MLE parameters of the MVG distribution
    #S[i, j] is the pdf of the j-th sample given the i-th class

    Parameters:
    - D: the data matrix of shape (numFeatures, numSamples)
    - params: the model parameters, so  list of tuples (mu, C) where mu is the mean vector fo class c and C is the covariance matrix of class c
    - useLog: if True, compute the log-pdf, else compute the pdf

    Returned Values:
    - S: the score matrix of shape (numClasses, numSamples) where each row is the score of the class given the sample

    """
    numClasses = len(params) #number of classes, since for each class we have a tuple (mu, C)
    S = np.zeros((numClasses, D.shape[1]))
    for label in range(numClasses):
        if useLog:
            #if useLog is True, then compute the log-pdf
            S[label, :] = logpdf_GAU_ND(D, params[label][0], params[label][1])
        else:
            #if useLog is False, then compute the pdf
            S[label, :] = np.exp(logpdf_GAU_ND(D, params[label][0], params[label][1]))

    return S

In [57]:
#Compute score matrix S of log likelihoods for each sample and class
S_logLikelihoods = scoreMatrix_Pdf_GAU(DVAL, [(mu_0, C_0), (mu_1, C_1), (mu_2, C_2)], useLog=True)
print(f"log Score matrix shape: {S_logLikelihoods.shape}")

log Score matrix shape: (3, 50)


*Stage 2*: We add the log class conditional probabilities, computed before, to the log of the class *Prior* probabilities. In
the following we assume that the three classes have the same Prior probability $P(c) = 1/3$. We can thus
compute the joint distribution for samples and classes in the *log-domain* as:
$$
l_{c} = \log f_{X,C}(x_t, c) = \log \left( f_{X|C}(x_t | c) P_C(c) \right) = \log f_{X|C}(x_t | c) + \log P_C(c)
$$

In [None]:
def computeSJoint(S, Priors, useLog=False):
    """
    Compute the joint densities by multiplying the score matrix S with the Priors
    
    Parameters:
    - S: the score matrix of shape (numClasses, numSamples) where each row is the score of the class given the sample
    - Priors: the priors of the classes, so a list of length numClasses
    - useLog: if True, compute the log-joint densities, else compute the joint densities

    Returned Values:
    - SJoint: the (log?)joint densities of shape (numClasses, numSamples) where each row is the joint density of the class given the sample
    """
    

    if (useLog):
        #S needs to be already in log scale, so we just need to add the log of the priors
        return S + vcol(np.log(Priors)) #multiply each row of S (where 1 row corresponds to a class) with the prior of the class
    else:
        return S * vcol(Priors)

In [59]:
SJoint_log_MVG = computeSJoint(S_logLikelihoods, np.ones((3, )) / 3, useLog=True) #compute the joint densities by multiplying the score matrix S with the Priors

*Stage 3*: Finally, we can compute the log class Posteriors probabilities as:
$$
\log P(C_t = c \mid \mathbf{X}_t = \mathbf{x}_t) = \log \left( \frac{f_{\mathbf{X},C}(\mathbf{x}_t, c)}{\sum_{c' \in C} f_{\mathbf{X},C}(\mathbf{x}_t, c')} \right) = 
\log \left( \frac{f_{\mathbf{X},C}(\mathbf{x}_t, c)}{f_{\mathbf{X}}(\mathbf{x}_t)} \right) = \log f_{\mathbf{X},C}(\mathbf{x}_t, c) - \log  f_{\mathbf{X}}(\mathbf{x}_t) = \log f_{\mathbf{X},C}(\mathbf{x}_t, c) - \log \sum_{c} e^{l_{c}}
$$ 
where $l_{c}$ are all the log-joints. <br>
However, we need to take care that computing the exponential terms may result again in numerical
errors. A robust method to comute $\log \sum_{c} e^{l_{c}}$ is to rewrite it as:
$$
\log \sum_{c} e^{l_{c}} = l + \log \sum_{c} e^{l_{c} - l}
$$
where $l$ is the highest of the log-joints: $l = max_{c} {l_{c}}$
This is known as the *log-sum-exp* trick, and is already implemented in *scipy* as `scipy.special.logsumexp`. We can thus use `scipy.special.logsumexp(s)`,
where `s` is the array that contains the joint log-probabilities for a given sample, to compute the log-marginals $\log f_X(x_{t})$. <br>
`scipy.special.logsumexp` also allows specifying an axis, thus we can directly compute the array of
marginals for all samples directly from the matrix of joint log-densities as we did before.



In [64]:
def computePosteriors(SJoint, useLog=False):
    """
    Compute the posteriors by normalizing the joint densities
    The posteriors are the joint densities divided by the sum of the joint densities which are the marginals

    Parameters:
    - SJoint: the joint densities of shape (numClasses, numSamples) where each row is the joint density of the class 

    Returned Values:
    - SPost: the posteriors of shape (numClasses, numSamples) where each row is the posterior of the class given the sample
    """
    if useLog:
        #1. Compute marginals usign the logsumexp trick to minimize numerical problems
        #logsumexp is a function that computes the log of the sum of exponentials of input elements
        #It is more numerically stable than computing the sum of exponentials directly
        #It computes log(exp(a) + exp(b)) in a numerically stable way

        #sum over the rows (axis=0) to get the marginal of each sample
        SMarginal = logsumexp(SJoint, axis=0)
        #SMarginal has now shape = (numSamples, ) -> it's a row vector
        #I need to make it of shape (1, numSamples) 
        SPost = SJoint - vrow(SMarginal) #element wise division in log scale, so I just need to subtract the marginals from the joint densities
        

    else:
        
        #1. Compute marginals
        SMarginal = vrow(SJoint.sum(0)) #sum over the rows (axis=0) to get the marginal of each sample

        #2. Compute posteriors by dividing the joint densities by the marginals
        SPost = SJoint / SMarginal #element wise division

    return SPost
   

In [66]:
#calculate log S post
SPost_log_MVG = computePosteriors(SJoint_log_MVG, useLog=True) #compute the posteriors by normalizing the joint densities
print(f"log Posteriors shape: {SPost_log_MVG.shape}")

log Posteriors shape: (3, 50)
