## Probabilistic Generative Models
### Quadratic Discriminant Analysis (QDA)
QDA follows the same paradigm as LDA, except that the decision boundary is quadratic rather than linear. This results from dropping the assumption that all classes share a single covariance matrix: each class $C_k$ now has its own $\Sigma_k$. I won't derive QDA from scratch because its building blocks were already discussed in the LDA section. Instead, I will show how the quadratic decision boundary arises, then describe the prediction procedure and the essential formulas used in this project.

As mentioned in the LDA part, the posterior distribution of the class given the input can be expressed as follows:
$$
\begin{align*}
\begin{split}
&p(t=C_k|x) = \frac{p(x,t=C_k)}{p(x)} = \frac{p(x|t=C_k)p(C_k)}{\sum_{i}p(x,t=C_i)}\\
&=\frac{p(x|t=C_k)p(C_k)}{\sum_{i}p(x|t=C_i)p(C_i)};\ \text{applying } \exp(\ln(\cdot)) \text{ to each } p(x|t=C_i)p(C_i)\\
&p(t=C_k|x) = \frac{\exp\Big( \ln(p(x|t=C_k)p(C_k))\Big)}{\sum_{i}\exp\Big( \ln(p(x|t=C_i)p(C_i))\Big)} = \frac{\exp(a_k)}{\sum_{i}\exp(a_i)}\\
&a_k=\ln(p(x|t=C_k)p(C_k))=-\ln((2\pi)^{D/2}) - \frac{1}{2}\ln(\det(\Sigma_k)) -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \ln(p(C_k))\\
&\text{The term } -\ln((2\pi)^{D/2}) \text{ is dropped: it is the same for every class, so it cancels between numerator and denominator.}\\
&a_k=-\frac{1}{2}\ln(\det(\Sigma_k)) + \ln(p(C_k)) -\frac{1}{2}x^T\Sigma_k^{-1}x +\frac{1}{2}x^T\Sigma_k^{-1}\mu_k+\frac{1}{2}\mu_k^T\Sigma_k^{-1}x-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k\\
&\text{Let } e=\Sigma_k^{-1}\mu_k \text{ and use } a^Tb=b^Ta \text{ with } \Sigma_k^{-1} \text{ symmetric, so } \mu_k^T\Sigma_k^{-1}x = e^Tx = x^Te = x^T\Sigma_k^{-1}\mu_k:\\
&a_k =-\frac{1}{2}\Big( x^T\Sigma_k^{-1}x -2x^T\Sigma_k^{-1}\mu_k+\mu_k^T\Sigma_k^{-1}\mu_k+\ln(\det(\Sigma_k)) -2\ln(p(C_k)) \Big)
\end{split}
\end{align*}
$$ 
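It can be seen directly from the last equation that the decision boundary is quadratic: the term $x^T\Sigma_k^{-1}x$ depends on the class, so it no longer cancels between classes as it did in LDA. The following parameter estimates are the same results that were derived in LDA, except that each class has its own covariance matrix, and the prediction formula is updated accordingly.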
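As a rough illustration of the derivation above, the activation $a_k$ and the resulting softmax posterior can be evaluated with NumPy. This is only a sketch: the helper names `qda_activation` and `softmax` and the toy means, covariances, and priors are made up for the example and are not part of the notebook's implementation. It uses `np.linalg.slogdet` and `np.linalg.solve` instead of an explicit determinant and inverse, which is a numerical-stability choice rather than something the derivation requires.

```python
import numpy as np

def qda_activation(x, mu, sigma, prior):
    """a_k = -1/2 [ (x-mu)^T Sigma^{-1} (x-mu) + ln det(Sigma) - 2 ln prior ]."""
    diff = x - mu
    # slogdet returns (sign, log|det|); more stable than log(det(.)).
    _, logdet = np.linalg.slogdet(sigma)
    # Solve Sigma z = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (maha + logdet - 2.0 * np.log(prior))

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())  # shift by the maximum for numerical stability
    return e / e.sum()

# Toy two-class problem in 2-D (hypothetical numbers, for illustration only).
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

x = np.array([0.1, -0.2])
a = [qda_activation(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
posterior = softmax(a)
print(posterior)  # class probabilities; they sum to 1
```

Because the point `x` lies close to the first mean, the first class receives the larger posterior in this toy setup.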
$$
\begin{align*}
\begin{split}
&\pi_k = p(C_k) = \frac{N_k}{N_1 + ... + N_K}\\
&\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\mathbb{1}(t_n = C_k)\, x_n\\
&\Sigma_k =\frac{1}{N_k}\sum_{n \in C_k}(x_n - \mu_k)(x_n - \mu_k)^T\\
&p(C_k|x) = \frac{\exp\Big( -\frac{1}{2}\Big( x^T\Sigma_k^{-1}x -2x^T\Sigma_k^{-1}\mu_k+\mu_k^T\Sigma_k^{-1}\mu_k+\ln(\det(\Sigma_k)) -2\ln(p(C_k)) \Big) \Big)}{\sum_{i}\exp\Big( -\frac{1}{2}\Big( x^T\Sigma_i^{-1}x -2x^T\Sigma_i^{-1}\mu_i+\mu_i^T\Sigma_i^{-1}\mu_i+\ln(\det(\Sigma_i)) -2\ln(p(C_i)) \Big) \Big)}
\end{split}
\end{align*}
$$
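The estimates above translate almost line by line into NumPy. The following is a minimal sketch, assuming integer class labels `0..K-1`. The function name `fit_qda_params` and the random toy data are hypothetical. It divides by $N_k$, which is the maximum-likelihood form, whereas `np.cov` with its default settings divides by $N_k - 1$.

```python
import numpy as np

def fit_qda_params(X, y, n_classes):
    """Per-class priors, means and covariances (maximum-likelihood form)."""
    N = X.shape[0]
    priors, means, covs = [], [], []
    for k in range(n_classes):
        X_k = X[y == k]                 # observations that belong to class k
        N_k = X_k.shape[0]
        mu_k = X_k.mean(axis=0)
        centered = X_k - mu_k
        # Divide by N_k (MLE); np.cov would divide by N_k - 1 by default.
        sigma_k = centered.T @ centered / N_k
        priors.append(N_k / N)
        means.append(mu_k)
        covs.append(sigma_k)
    return np.array(priors), np.array(means), np.array(covs)

# Hypothetical toy data: 60 points in 2-D, 20 per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.repeat([0, 1, 2], 20)

priors, means, covs = fit_qda_params(X, y, n_classes=3)
print(priors)       # one prior per class
print(covs.shape)   # one covariance matrix per class
```

With only a few observations per class the two normalizations can differ noticeably. For larger classes the difference becomes negligible.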

QDA tends to fit the training set better than LDA because of its more flexible decision boundary, but the more complex a model becomes, the more likely it is to overfit the training set. This is a direct disadvantage of QDA, although it diminishes as more data becomes available for estimating the parameters.
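One way to make the complexity argument concrete is to count the free covariance parameters. A symmetric $D \times D$ covariance matrix has $D(D+1)/2$ free entries. LDA estimates one such matrix, while QDA estimates one per class. The sketch below uses $D = 13$ features and $K = 3$ classes, matching the wine data used later in this notebook. The function name `covariance_parameter_count` is made up for the illustration.

```python
def covariance_parameter_count(n_features, n_classes, shared):
    """Free parameters in the covariance estimate(s) of LDA (shared) or QDA."""
    per_matrix = n_features * (n_features + 1) // 2  # symmetric D x D matrix
    return per_matrix if shared else n_classes * per_matrix

D, K = 13, 3
lda_count = covariance_parameter_count(D, K, shared=True)
qda_count = covariance_parameter_count(D, K, shared=False)
print(lda_count, qda_count)  # 91 273
```

QDA therefore fits three times as many covariance parameters here, and each class-specific matrix is estimated only from the observations of that class.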


In [2]:
%matplotlib inline
import numpy as np 
import sklearn.preprocessing
import sklearn.datasets
import pandas as pd
import sklearn.model_selection
import numpy.random
import math
import sklearn.metrics

In [12]:
#X, y = sklearn.datasets.load_iris(return_X_y=True)
X, y = sklearn.datasets.load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)
# All features are continuous, so no one-hot encoding is needed; standardize them directly.
# The scaler is fit on the training split only and then reused for the test split.
standard = sklearn.preprocessing.StandardScaler()
X_train = standard.fit_transform(X_train)
training_data = np.c_[X_train, y_train]

X_test = standard.transform(X_test)
test_data = np.c_[X_test, y_test]
print(training_data.shape)
print(test_data.shape)

(133, 14)
(45, 14)


In [13]:
k = 3
X_train, y_train = (training_data[:, 0:-1], training_data[:, -1])
X_test, y_test = (test_data[:, 0:-1], test_data[:, -1])

In [14]:
class QDA_Model(object):


    def __init__(self, X_train, y_train, k):
        self.X_train = X_train
        self.y_train = y_train
        self.Mu = [mu for mu in [np.zeros((X_train.shape[1], 1))]*k]
        self.Sigma = [sigma for sigma in [np.zeros((X_train.shape[1], X_train.shape[1]))]*k]
        self.phis = [phi for phi in np.zeros((k, 1))]
        self.m = (self.X_train).shape[0]
        self.n = (self.X_train).shape[1]
        self.K = k

    def fit(self):
        data = pd.DataFrame(np.c_[self.X_train, self.y_train])
        indexs = data.columns
        class_observations = []
        N = []
        for k in range(0, self.K):
            class_observations.append(data[data[indexs[-1]] == k])
            N.append(class_observations[k].shape[0])


        for k in range(0, self.K):
            temp = (class_observations[k]).to_numpy()
            mean_temp = (np.mean(temp[:, 0:-1], axis=0)).reshape(-1, 1)
            assert(self.Mu[k].shape == mean_temp.shape)
            self.Mu[k] = mean_temp.copy()
            # Note: np.cov divides by N_k - 1 (unbiased estimate) rather than by N_k.
            self.Sigma[k] = np.cov((temp[:, 0:-1]).T)

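        for k in range(0, self.K):
            # Class prior: fraction of training observations in class k
            # (self.m is the number of samples; self.n is the number of features).
            self.phis[k] = N[k]/self.m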
        
        return self.phis, self.Mu, self.Sigma

    def predict_observation(self, x):
        # Reshape to a column vector so that every product below is a (1, 1) array.
        x = np.asarray(x).reshape(-1, 1)
        scores = np.zeros(self.K)

        for k in range(0, self.K):
            s_inv = np.linalg.inv(self.Sigma[k])
            t1 = -2*np.dot(x.T, np.dot(s_inv, self.Mu[k]))
            t2 = np.dot(x.T, np.dot(s_inv, x))
            t3 = np.dot(self.Mu[k].T, np.dot(s_inv, self.Mu[k]))
            t4 = np.log(np.linalg.det(self.Sigma[k]))
            t5 = -2 * np.log(self.phis[k])
            # Activation a_k of class k, computed once per class.
            scores[k] = ((-1/2) * (t1 + t2 + t3 + t4 + t5)).item()

        # Softmax over the activations; subtracting the maximum keeps exp() from
        # underflowing or overflowing and leaves the probabilities unchanged.
        prediction = np.exp(scores - np.max(scores))
        prediction = prediction / np.sum(prediction)

        return np.argmax(prediction)
    
    def predict_dataset(self, X, y):
        prediction = np.zeros((X.shape[0], 1))
        for i in range(0, X.shape[0]):
            prediction[i, 0] = self.predict_observation(X[i, :])
        
        return prediction


In [15]:
QDA = QDA_Model(X_train, y_train, k=3)
phis, Mu, Sigma = QDA.fit()
prediction = QDA.predict_dataset(X_train, y_train)

print("Performance on the training set")
print(sklearn.metrics.confusion_matrix(y_train, prediction))
#print(f"precision:{sklearn.metrics.precision_score(y_train, prediction):0.3f}, recall:{sklearn.metrics.recall_score(y_train, prediction):0.3f}")

Performance on the training set
[[44  0  0]
 [ 1 52  0]
 [ 0  0 36]]


In [16]:
prediction = QDA.predict_dataset(X_test, y_test)
print("Performance on the test set")
print(sklearn.metrics.confusion_matrix(y_test, prediction))

Performance on the test set
[[15  0  0]
 [ 0 18  0]
 [ 0  1 11]]
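As a sanity check, the hand-written model can be compared with scikit-learn's `QuadraticDiscriminantAnalysis`. The sketch below repeats the same split and standardization as above and is self-contained. The variable names `scaler`, `reference`, and `test_accuracy` are choices made for this example. The two implementations may differ in numerical details, so the confusion matrices are not guaranteed to agree exactly.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Reference implementation with default settings (no regularization).
reference = QuadraticDiscriminantAnalysis()
reference.fit(X_train_s, y_train)

test_accuracy = reference.score(X_test_s, y_test)
print(confusion_matrix(y_test, reference.predict(X_test_s)))
print(f"test accuracy: {test_accuracy:.3f}")
```

If the reference model and `QDA_Model` disagree on many observations, that would point to a bug in one of the estimation or prediction steps rather than to a modelling difference.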


### References 
* Chapters 1, 2, and 4 of Bishop, C. (2006). Pattern Recognition and Machine Learning. New York: Springer.
* Andrew Ng, Lec 4: (https://www.youtube.com/watch?v=nLKOQfKLUks)
* Andrew Ng, Lec 5: (https://www.youtube.com/watch?v=qRJ3GKMOFrE)