## Probabilistic Generative Models
### Linear Discriminant Analysis
As we saw with logistic regression (LR), we can model the conditional distribution of the target class given the input features explicitly. Another approach is to model this distribution implicitly through Bayes' theorem. This approach is known as a generative learning algorithm: we model the conditional distribution of the input features for each class, $p(x|t=C_i)$, together with the class priors $p(C_i)$. The posterior (the conditional distribution of the target class given the input features) is then obtained from Bayes' theorem, as shown in the following formula:
$$
\begin{align*}
\begin{split}
&p(t=C_1|x) = \frac{p(x,t=C_1)}{p(x)} = \frac{p(x|t=C_1)p(C_1)}{\sum_{i}p(x,t=C_i)}\\
&=\frac{p(x|t=C_1)p(C_1)}{\sum_{i}p(x|t=C_i)p(C_i)}
\end{split}
\end{align*}
$$
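As a small numerical illustration of the formula above, the following sketch turns class-conditional likelihoods and priors into posteriors. The function name `posterior_from_bayes` and the numbers are hypothetical and chosen only for illustration:

```python
import numpy as np

def posterior_from_bayes(likelihoods, priors):
    """Return p(C_i | x) given p(x | C_i) and p(C_i) for every class i."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors      # p(x, C_i) = p(x | C_i) p(C_i)
    return joint / joint.sum()        # divide by p(x) = sum_i p(x, C_i)

# Hypothetical values for a two-class problem
example_posterior = posterior_from_bayes(likelihoods=[0.2, 0.05], priors=[0.3, 0.7])
print(example_posterior)
```

Even though the second class has the larger prior here, the first class ends up with the larger posterior because its likelihood is four times higher.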
One such generative model is the Gaussian discriminant model (GDM), in which we fit a multivariate Gaussian distribution to each $p(x|t=C_i)$; this implies that the feature space should consist of continuous random variables. A Gaussian distribution is fully described by its mean vector and covariance matrix. Depending on the assumption made about the covariance matrices, GDM splits into two models: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). LDA is explained in this Jupyter notebook, while QDA is explained in another one.

### Explaining the linear part in LDA
As shown above, the posterior distribution of the target class given the input features is $p(t=C_1|x)=\frac{p(x|t=C_1)p(C_1)}{\sum_{i}p(x|t=C_i)p(C_i)}$. By applying an exponential and a natural logarithm to each $p(x, t=C_i)$, we can express $p(t=C_k|x)$ as $p(t=C_k|x) = \frac{\exp(\ln(p(x|t=C_k)p(C_k)))}{\sum_{i}\exp(\ln(p(x|t=C_i)p(C_i)))} = \frac{\exp(a_k)}{\sum_{i}\exp(a_i)}$, which is a softmax over the quantities $a_k$. As mentioned before, $p(x|t=C_k)$ is assumed to be a multivariate Gaussian, given by the following formula:
$$
\begin{align*}
\begin{split}
p(x|t=C_k) = \frac{1}{\det(2\pi \Sigma)^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\Big)
\end{split}
\end{align*}
$$
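The multivariate Gaussian density can be evaluated numerically as in the sketch below. It works in log space and uses `np.linalg.slogdet` and `np.linalg.solve` rather than an explicit determinant and inverse. The helper name `gaussian_log_density` is an assumption made for this example:

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """Log of the multivariate Gaussian density N(x; mu, Sigma)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    diff = x - mu
    # slogdet returns (sign, log|det|); det(2*pi*Sigma) > 0 for a valid covariance
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
    # Solve Sigma z = diff instead of forming the inverse explicitly
    mahalanobis = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * logdet - 0.5 * mahalanobis

# For a standard Gaussian in D = 2, the density at the mean is 1 / (2*pi)
density_at_mean = np.exp(gaussian_log_density([0.0, 0.0], [0.0, 0.0], np.eye(2)))
print(density_at_mean)
```

Working with the log density avoids underflow when the number of features is large, and the logarithm is what enters $a_k$ in the derivation anyway.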
LDA additionally assumes that all classes share the same covariance matrix $\Sigma$. The scalar $a_k$ can then be expressed as follows:
$$
\begin{align*}
\begin{split}
&a_k = \ln(p(x|C_k)p(C_k)) = \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \ln(p(C_k)),\quad \text{where } \mu_k \text{ and } x \text{ are vectors}\\
&a_k = \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}\Big( x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu_k-\mu_k^{T}\Sigma^{-1}x+\mu_k^T\Sigma^{-1}\mu_k\Big) + \ln(p(C_k));\quad \text{since } \Sigma^T=\Sigma \text{ and } a^Tb=b^Ta,\ x^T\Sigma^{-1}\mu_k = \mu_k^T\Sigma^{-1}x\\
&a_k= \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}x  -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln(p(C_k))
\end{split}
\end{align*}
$$
We now have the tools needed to see the linearity with respect to $\underline{x}$, which can be derived as follows:

$$
\begin{align*}
\begin{split}
&p(C_k|x) = \frac{\exp\Big( \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}x  -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln(p(C_k)) \Big)}{\sum_{i} \exp\Big( \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}x^T\Sigma^{-1}x + \mu_i^T\Sigma^{-1}x  -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln(p(C_i)) \Big)};\\
&\text{the term } \ln\Big(\frac{1}{\det(2\pi \Sigma)^{1/2}}\Big) -\frac{1}{2}x^T\Sigma^{-1}x \text{ does not depend on the class, so using } \exp(a+b)=\exp(a)\exp(b)\\
&\text{it factors out of both the numerator and the denominator and cancels:}\\
&p(C_k|x) = \frac{\exp\Big( \mu_k^T\Sigma^{-1}x  -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln(p(C_k)) \Big)}{\sum_{i}\exp\Big( \mu_i^T\Sigma^{-1}x  -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln(p(C_i)) \Big)}\\
&a_k(x) =\underline{w}_k^T\underline{x} + w_{k0} = (\Sigma^{-1}\mu_k)^Tx -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln(p(C_k))
\end{split}
\end{align*}
$$
As the last equation shows, $a_k$ is a linear function of the input features, which is where the term "linear" comes from. It also indicates that the decision boundaries are linear, as in logistic regression and softmax regression. The relationship between LDA and logistic regression is discussed later.
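The cancellation of the quadratic term can also be checked numerically. The sketch below draws random, purely hypothetical parameters (`D`, `K`, `mus`, `Sigma` and `priors` are assumptions for this example) and compares the posterior computed from the linear scores with the posterior computed from the full Gaussian exponent:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 3, 4                                   # hypothetical dimensions
mus = rng.normal(size=(K, D))                 # class means mu_k, one per row
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)               # shared, positive-definite covariance
priors = np.full(K, 1.0 / K)                  # p(C_k)
x = rng.normal(size=D)

Sigma_inv = np.linalg.inv(Sigma)

# Linear form: a_k(x) = w_k^T x + w_k0
W = mus @ Sigma_inv                           # row k is (Sigma^{-1} mu_k)^T
w0 = -0.5 * np.sum(W * mus, axis=1) + np.log(priors)
a = W @ x + w0
posterior_linear = np.exp(a - a.max())
posterior_linear /= posterior_linear.sum()

# Full form: keep the class-independent quadratic term as well
diffs = x - mus
full = -0.5 * np.sum((diffs @ Sigma_inv) * diffs, axis=1) + np.log(priors)
posterior_full = np.exp(full - full.max())
posterior_full /= posterior_full.sum()

print(np.allclose(posterior_linear, posterior_full))
```

The normalizing constant of the Gaussian is left out of `full` because it is the same for every class and cancels in the ratio, exactly like the $x^T\Sigma^{-1}x$ term.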

### Maximum Likelihood Solution
This method needs to estimate three sets of parameters: $p(C_k)=\phi_k$, the proportion of observations belonging to class $k$; $\mu_k$, the mean vector of the input features for the $k$-th class; and $\Sigma$, the covariance matrix shared among all classes. Because we have $K$ classes, we use the indicator function $\textit{I}(c=k)$, which equals 1 if $c$ equals $k$ and 0 otherwise. The log-likelihood can therefore be expressed as follows:
$$
\begin{align*}
\begin{split}
&\ell(\mu_1, ... \mu_K, \Sigma, \phi_1, ... \phi_K) = \log\Big( p(\underline{t},\underline{x}; \phi^s, \mu^s, \Sigma) \Big)\\
&=\log\Big( \prod_{n=1}^{N}\big( \prod_{k=1}^{K}\big(p(x_n|C_k)p(C_k)\big)^{\textit{I}(C_k = t_n)} \big) \Big)\\
&=\log\Big( \prod_{n=1}^{N}\big( \prod_{k=1}^{K}(\phi_k \mathcal{N}(x_n;\mu_k, \Sigma))^{\textit{I}(C_k = t_n)} \big) \Big)
\end{split}
\end{align*}
$$
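Taking the logarithm turns the double product into a sum over observations, where the indicator simply selects the parameters of each observation's own class. A minimal sketch of evaluating this log-likelihood follows; the function name `lda_log_likelihood` and the toy data are assumptions for illustration, with labels encoded as integers $0, \dots, K-1$:

```python
import numpy as np

def lda_log_likelihood(X, t, phis, mus, Sigma):
    """Joint log-likelihood sum_n [ ln phi_{t_n} + ln N(x_n; mu_{t_n}, Sigma) ].

    X: (N, D) features, t: (N,) integer labels in {0..K-1},
    phis: (K,) priors, mus: (K, D) class means, Sigma: (D, D) shared covariance.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=int)
    phis = np.asarray(phis, dtype=float)
    mus = np.asarray(mus, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    diffs = X - mus[t]                      # indexing by t plays the role of the indicator
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
    maha = np.sum(diffs * np.linalg.solve(Sigma, diffs.T).T, axis=1)
    log_gauss = -0.5 * logdet - 0.5 * maha
    return float(np.sum(np.log(phis[t]) + log_gauss))

# Two hypothetical 1-D observations that sit exactly on their class means
ll_toy = lda_log_likelihood(
    X=[[0.0], [2.0]], t=[0, 1],
    phis=[0.5, 0.5], mus=[[0.0], [2.0]], Sigma=[[1.0]],
)
print(ll_toy)
```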
We then maximize the log-likelihood with respect to $\phi^s$, $\mu^s$ and $\Sigma$. The parameter values that maximize it are:
$$
\begin{align*}
\begin{split}
&\phi_k = \frac{N_k}{N_1 + ... + N_K}\\
&\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\textit{I}(C_k = t_n)\, \underline{x_n}\\
&\Sigma = \frac{N_1}{N}\Sigma_1+ ... +\frac{N_K}{N}\Sigma_K\\
&\Sigma_k =\frac{1}{N_k}\sum_{n \in C_k}(x_n - \mu_k)(x_n - \mu_k)^T
\end{split}
\end{align*}
$$
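These closed-form estimates translate almost line by line into NumPy. The sketch below is one possible implementation under the assumptions that labels are integers $0, \dots, K-1$ and that every class has at least one observation. The function name `fit_lda_ml` and the toy data are hypothetical:

```python
import numpy as np

def fit_lda_ml(X, t, K):
    """Maximum-likelihood estimates (phi_k, mu_k, shared Sigma) for labels in {0..K-1}."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=int)
    N, D = X.shape
    phis = np.zeros(K)
    mus = np.zeros((K, D))
    Sigma = np.zeros((D, D))
    for k in range(K):
        X_k = X[t == k]                          # rows with I(C_k = t_n) = 1
        N_k = X_k.shape[0]
        phis[k] = N_k / N
        mus[k] = X_k.mean(axis=0)
        centered = X_k - mus[k]
        Sigma_k = centered.T @ centered / N_k    # ML estimate, divides by N_k
        Sigma += (N_k / N) * Sigma_k             # weighted average of class covariances
    return phis, mus, Sigma

# Tiny hypothetical dataset: two classes in one dimension
X_toy = np.array([[0.0], [2.0], [10.0], [14.0]])
t_toy = np.array([0, 0, 1, 1])
phis_toy, mus_toy, Sigma_toy = fit_lda_ml(X_toy, t_toy, K=2)
print(phis_toy, mus_toy.ravel(), Sigma_toy)
```

Note that the per-class covariance divides by $N_k$, as in the formula above, rather than by $N_k - 1$ as an unbiased estimator would.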
### Make prediction
After learning the parameters of the LDA model, we can make predictions with the following formula:
$$
\begin{align*}
p(C_k|x) = \frac{\exp\Big( \mu_k^T\Sigma^{-1}x  -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln(\phi_k) \Big)}{\sum_{i}\exp\Big( \mu_i^T\Sigma^{-1}x  -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln(\phi_i) \Big)}
\end{align*}
$$
We then choose the class with the highest posterior probability.
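The prediction rule can be applied to a whole batch of observations at once. The following sketch assumes fitted parameters are already available as arrays. The function name `predict_lda` and the example parameters are hypothetical. Subtracting the row maximum before exponentiating does not change the softmax but guards against overflow:

```python
import numpy as np

def predict_lda(X, phis, mus, Sigma):
    """Return (posteriors, labels) for the rows of X under a fitted LDA model."""
    X = np.asarray(X, dtype=float)
    phis = np.asarray(phis, dtype=float)
    mus = np.asarray(mus, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    W = np.linalg.solve(Sigma, mus.T).T                  # row k is (Sigma^{-1} mu_k)^T
    w0 = -0.5 * np.sum(W * mus, axis=1) + np.log(phis)
    scores = X @ W.T + w0                                # a_k(x_n), shape (N, K)
    scores -= scores.max(axis=1, keepdims=True)          # shift leaves the softmax unchanged
    posteriors = np.exp(scores)
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    return posteriors, posteriors.argmax(axis=1)

# Hypothetical 1-D model with class means 1 and 12 and equal priors
posteriors_demo, labels_demo = predict_lda(
    X=[[0.0], [6.5], [13.0]],
    phis=[0.5, 0.5], mus=[[1.0], [12.0]], Sigma=[[2.5]],
)
print(labels_demo)
print(posteriors_demo[1])
```

With equal priors and a shared covariance, the point halfway between the two means (6.5 in this example) lies on the decision boundary, so both classes receive approximately equal posterior probability there.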

### Relationship between LDA and Logistic regression
We restrict LDA to a two-class dataset so that it can be compared with logistic regression. With two classes, the softmax posterior derived earlier reduces to a logistic sigmoid applied to a linear function of $x$, which is exactly the functional form that LR fits directly.

LDA therefore yields the same form of posterior as LR whenever $p(x|C)$ is normally distributed with a shared covariance matrix. If this assumption holds, LDA is faster to fit than LR because its parameters have simple closed-form estimates. If the assumption does not hold and the dataset is small (so the central limit theorem cannot be relied on), LR would be expected to perform better than LDA.
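The two-class case can be written out explicitly. The following is a sketch that reuses the linear form of $a_k(x)$ derived earlier; the symbols $\sigma$, $w$ and $w_0$ are introduced here for illustration:

$$
\begin{align*}
\begin{split}
&p(C_1|x) = \frac{\exp(a_1)}{\exp(a_1)+\exp(a_2)} = \frac{1}{1+\exp(-(a_1-a_2))} = \sigma(a_1-a_2)\\
&a_1 - a_2 = (\Sigma^{-1}(\mu_1-\mu_2))^Tx -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\Big(\frac{p(C_1)}{p(C_2)}\Big) = w^Tx + w_0
\end{split}
\end{align*}
$$

Here $\sigma(z) = 1/(1+\exp(-z))$ is the logistic sigmoid. LR estimates $w$ and $w_0$ directly by maximizing the conditional likelihood, whereas LDA obtains them from $\phi_k$, $\mu_k$ and $\Sigma$.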


In [2]:
%matplotlib inline
import numpy as np 
import sklearn.preprocessing
import sklearn.datasets
import pandas as pd
import sklearn.model_selection
import numpy.random
import math
import sklearn.metrics

In [117]:
#X, y = sklearn.datasets.load_iris(return_X_y=True)
X, y = sklearn.datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)
standard = sklearn.preprocessing.StandardScaler()
X_train = standard.fit_transform(X_train)
# All of the features are continuous, so there is no need for a one-hot encoder; we can standardize the features directly.
training_data = np.c_[X_train, y_train]

X_test = standard.transform(X_test)
test_data = np.c_[X_test, y_test]
print(training_data.shape)
print(test_data.shape)

(133, 14)
(45, 14)


In [118]:
k = 3
X_train, y_train = (training_data[:, 0:-1], training_data[:, -1])
X_test, y_test = (test_data[:, 0:-1], test_data[:, -1])

In [119]:
class LDA_Model(object):


    def __init__(self, X_train, y_train, k):
        self.X_train = X_train
        self.y_train = y_train
        # Per-class means and covariances; filled in by fit()
        self.Mu = [np.zeros((X_train.shape[1], 1)) for _ in range(k)]
        self.Sigma = [np.zeros((X_train.shape[1], X_train.shape[1])) for _ in range(k)]
        # Shared (pooled) covariance matrix and class priors
        self.Sigma_total = np.zeros((self.X_train.shape[1], self.X_train.shape[1]))
        self.phis = [0.0 for _ in range(k)]
        self.m = (self.X_train).shape[0]
        self.n = (self.X_train).shape[1]
        self.K = k

    def fit(self):
        data = pd.DataFrame(np.c_[self.X_train, self.y_train])
        indexs = data.columns
        class_observations = []
        N = []
        for k in range(0, self.K):
            class_observations.append(data[data[indexs[-1]] == k])
            N.append(class_observations[k].shape[0])


        for k in range(0, self.K):
            temp = (class_observations[k]).to_numpy()
            mean_temp = (np.mean(temp[:, 0:-1], axis=0)).reshape(-1, 1)
            assert(self.Mu[k].shape == mean_temp.shape)
            self.Mu[k] = mean_temp.copy()
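            # np.cov normalizes by N_k - 1 by default; pass bias=True for the 1/N_k maximum-likelihood estimate
            self.Sigma[k] = np.cov((temp[:, 0:-1]).T)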

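        # Priors and pooled covariance are weighted by N_k / m, where m is the
        # number of samples (self.n is the number of features)
        self.Sigma_total = np.zeros((self.n, self.n))
        for k in range(0, self.K):
            self.phis[k] = N[k]/self.m
            self.Sigma_total = self.Sigma_total + (N[k]/self.m) * self.Sigma[k]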
        
        return self.phis, self.Mu, self.Sigma_total

    def predict_observation(self, x):
        # Linear discriminant a_k(x) for every class
        scores = np.zeros(self.K)
        s_inv = np.linalg.inv(self.Sigma_total)

        for k in range(0, self.K):
            t1 = np.dot(self.Mu[k].T, np.dot(s_inv, x))
            t2 = (-1/2)* np.dot(self.Mu[k].T, np.dot(s_inv, self.Mu[k]))
            t3 = np.log(self.phis[k])
            scores[k] = (t1 + t2 + t3).item()

        # Softmax over the scores gives the posterior p(C_k|x)
        prediction = np.exp(scores) / np.sum(np.exp(scores))

        return np.argmax(prediction)
    
    def predict_dataset(self, X, y):
        prediction = np.zeros((X.shape[0], 1))
        for i in range(0, X.shape[0]):
            prediction[i, 0] = self.predict_observation(X[i, :])
        
        return prediction


In [120]:
LDA = LDA_Model(X_train, y_train, k=3)
phis, Mu, Sigma_total = LDA.fit()
prediction = LDA.predict_dataset(X_train, y_train)

print("Performance on the training set")
print(sklearn.metrics.confusion_matrix(y_train, prediction))
#print(f"precision:{sklearn.metrics.precision_score(y_train, prediction):0.3f}, recall:{sklearn.metrics.recall_score(y_train, prediction):0.3f}")

Performance on the training set
[[43  1  0]
 [ 0 53  0]
 [ 0  1 35]]


In [121]:
prediction = LDA.predict_dataset(X_test, y_test)
print("Performance on the test set")
print(sklearn.metrics.confusion_matrix(y_test, prediction))

Performance on the test set
[[15  0  0]
 [ 0 18  0]
 [ 0  0 12]]
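As a sanity check, the same split can be passed to scikit-learn's `LinearDiscriminantAnalysis`. This is only a sketch for comparison: it rebuilds the data with the same loading, splitting and scaling steps used above, and the variable names `reference_lda` and `reference_test_pred` are assumptions. The library uses its own estimation procedure (by default an SVD-based solver), so small differences from the implementation in this notebook are possible.

```python
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same data preparation as in the cells above
X, y = sklearn.datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)
scaler = sklearn.preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Reference model with default settings
reference_lda = LinearDiscriminantAnalysis()
reference_lda.fit(X_train, y_train)
reference_test_pred = reference_lda.predict(X_test)

print("Reference LDA on the test set")
print(sklearn.metrics.confusion_matrix(y_test, reference_test_pred))
```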


### References 
* Chapters 1, 2 and 4 of Bishop, C. (2006). Pattern Recognition and Machine Learning. New York: Springer.
* Andrew Ng, Lec 4: (https://www.youtube.com/watch?v=nLKOQfKLUks)
* Andrew Ng, Lec 5: (https://www.youtube.com/watch?v=qRJ3GKMOFrE)