#### Linear Discriminant Analysis (LDA)
Logistic regression models $Pr(Y=k|X=x)$ directly by using the logistic function. LDA models the same conditional probability indirectly, through __Bayes' theorem__.

Bayes' theorem states that:

$Pr(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}$, where $\pi_k=Pr(Y=k)$ is the prior probability of class $k$ and $f_k(x)=Pr(X=x|Y=k)$ is the density of $X$ for an observation from class $k$.

$\pi_k$ is easy to estimate: it is the fraction of the training observations that belong to the $k$th class. Estimating $f_k(x)$ is more challenging unless we assume a simple form for these densities.
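
As a small illustration of the prior estimate described above, the sketch below counts class labels with NumPy. The helper name `estimate_priors` and the toy label vector are hypothetical and not part of the original notebook.

```python
import numpy as np

def estimate_priors(labels):
    """Return the distinct classes and the fraction of observations in each."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes, counts / counts.sum()

# Hypothetical toy labels, for illustration only
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
classes, priors = estimate_priors(labels)
print(dict(zip(classes.tolist(), priors.tolist())))
# {0: 0.2, 1: 0.3, 2: 0.5}
```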

LDA makes two assumptions. _First_, each class density $f_k(x)$ is multivariate Gaussian. _Second_, all classes share a common covariance matrix $\Sigma$.

Under these assumptions, the _log-odds_ between class $k$ and class $l$ is:

$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=l|X=x)}\right)=\log\left(\frac{\pi_k}{\pi_l}\right)+\log\left(\frac{f_k(x)}{f_l(x)}\right)=\log\left(\frac{\pi_k}{\pi_l}\right) - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l)$.

Setting this _log-odds_ to zero gives the decision boundary between class $k$ and class $l$. Because the expression is linear in $x$, the boundary is a hyperplane.
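
The step from the density ratio to the linear expression can be worked out as follows. This is a sketch of the standard derivation, writing the common covariance matrix as $\Sigma$ and introducing $p$ (an assumed symbol, not used elsewhere in this section) for the dimension of $x$. With $f_k(x)=(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right)$, the normalizing constants are the same for every class and cancel in the ratio:

$\log\left(\frac{f_k(x)}{f_l(x)}\right)=-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)$

Expanding both quadratic forms, the term $x^T\Sigma^{-1}x$ appears with opposite signs and cancels, which is what makes the result linear in $x$:

$\log\left(\frac{f_k(x)}{f_l(x)}\right)=x^T\Sigma^{-1}(\mu_k-\mu_l)-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l$

Since $\Sigma^{-1}$ is symmetric, $\mu_k^T\Sigma^{-1}\mu_k-\mu_l^T\Sigma^{-1}\mu_l=(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)$, so

$\log\left(\frac{f_k(x)}{f_l(x)}\right)=x^T\Sigma^{-1}(\mu_k-\mu_l)-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)$.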

Equivalently, we can assign $x$ to the class with the largest value of $\delta_k(x)=x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k$. These functions are called _linear discriminant functions_ because they are linear in $x$. The parameters $\pi_k$, $\mu_k$ and $\Sigma$ are unknown, but they can be estimated from the training data:

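1. $\hat\pi_k=n_k/n$, where $n_k$ is the number of class-$k$ observations and $n$ is the total number of observations
2. $\hat\mu_k$ is the sample mean of the class-$k$ observations
3. $\hat\Sigma$ is the pooled sample covariance matrix, computed from the deviations of each observation from its own class mean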
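
To make the plug-in rule concrete, here is a minimal NumPy sketch that computes the three estimates and evaluates $\delta_k(x)$ for each class. It assumes the covariance estimate is the pooled within-class covariance with an $n-K$ denominator. The function names (`fit_lda`, `lda_scores`, `predict_lda`) and the synthetic data are hypothetical; the sketch is not meant to reproduce scikit-learn's implementation.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means and the pooled covariance matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])          # n_k / n
    means = np.array([X[y == k].mean(axis=0) for k in classes])    # shape (K, p)
    # Pooled within-class covariance (assumption: n - K denominator)
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centered = X[y == k] - mu
        pooled += centered.T @ centered
    pooled /= n - len(classes)
    return classes, priors, means, pooled

def lda_scores(X, priors, means, pooled):
    """Evaluate delta_k(x) for every row of X and every class k."""
    X = np.asarray(X, dtype=float)
    sol = np.linalg.solve(pooled, means.T)        # Sigma^{-1} mu_k, shape (p, K)
    linear = X @ sol                              # x^T Sigma^{-1} mu_k
    quad = 0.5 * np.sum(means.T * sol, axis=0)    # 0.5 * mu_k^T Sigma^{-1} mu_k
    return linear - quad + np.log(priors)

def predict_lda(X, classes, priors, means, pooled):
    """Assign each row of X to the class with the largest discriminant."""
    return classes[np.argmax(lda_scores(X, priors, means, pooled), axis=1)]

# Synthetic two-class data, for illustration only
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.repeat([0, 1], 100)

params = fit_lda(X_demo, y_demo)
pred = predict_lda(X_demo, *params)
print("training accuracy:", np.mean(pred == y_demo))
```

Solving the linear system with `np.linalg.solve` avoids forming $\hat\Sigma^{-1}$ explicitly, which is usually the more numerically stable choice.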

But what if the covariance matrices are different for different classes?

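For this case, we have _quadratic discriminant analysis_ (QDA), where each class has its own covariance matrix $\Sigma_k$. The discriminant function becomes $\delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)+\log\pi_k$, which is quadratic in $x$ because the $x^T\Sigma_k^{-1}x$ terms no longer cancel between classes.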
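
The QDA discriminant can be sketched in NumPy in the same spirit. The example below uses hypothetical synthetic data in which both classes share the same mean and differ only in spread, a case where a linear boundary cannot separate them. The function name `qda_scores` and all data are assumptions made for illustration.

```python
import numpy as np

def qda_scores(X, priors, means, covs):
    """Evaluate the quadratic discriminant delta_k(x) for each row and class."""
    X = np.asarray(X, dtype=float)
    scores = []
    for pi_k, mu_k, cov_k in zip(priors, means, covs):
        diff = X - mu_k
        # slogdet returns (sign, log|Sigma_k|) and is more stable than log(det(...))
        _, logdet = np.linalg.slogdet(cov_k)
        # Squared Mahalanobis distance (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
        maha = np.sum(diff * np.linalg.solve(cov_k, diff.T).T, axis=1)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi_k))
    return np.column_stack(scores)

# Synthetic data: same mean, different covariance per class
rng = np.random.default_rng(1)
X0 = rng.normal(scale=0.5, size=(200, 2))   # tight cluster
X1 = rng.normal(scale=3.0, size=(200, 2))   # wide cluster around the same centre
X_demo = np.vstack([X0, X1])
y_demo = np.repeat([0, 1], 200)

priors = np.array([0.5, 0.5])
means = np.array([X0.mean(axis=0), X1.mean(axis=0)])
covs = [np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)]

pred = np.argmax(qda_scores(X_demo, priors, means, covs), axis=1)
print("training accuracy:", np.mean(pred == y_demo))
```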


In [8]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn import metrics

In [12]:
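def LDA(x, y):
    # Hold out 25% of the data for testing; the fixed seed makes the split reproducible
    dataMat_train, dataMat_test, label_train, label_test = train_test_split(
        x, y, test_size=0.25, random_state=42)

    # Fit LDA on the training split
    lda = LinearDiscriminantAnalysis()
    lda.fit(dataMat_train, label_train)

    # Predict labels for the held-out observations
    predictions = lda.predict(dataMat_test)

    # Mean accuracy and confusion matrix on the test split
    score = lda.score(dataMat_test, label_test)
    cm = metrics.confusion_matrix(label_test, predictions)
    return score, cm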

In [13]:
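def QDA(x, y):
    # Same split as in LDA(), so the two scores are directly comparable
    dataMat_train, dataMat_test, label_train, label_test = train_test_split(
        x, y, test_size=0.25, random_state=42)

    # Fit QDA on the training split (one covariance matrix per class)
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(dataMat_train, label_train)

    # Predict labels for the held-out observations
    predictions = qda.predict(dataMat_test)

    # Mean accuracy and confusion matrix on the test split
    score = qda.score(dataMat_test, label_test)
    cm = metrics.confusion_matrix(label_test, predictions)
    return score, cm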

For the following dataset, logistic regression does not achieve good accuracy. So, how about LDA or QDA?

In [14]:
df=pd.read_csv('D:/math/book/data mining/ML/machinelearninginaction/Ch06/testSetRBF2.txt',sep='\t',header=None)
x=df.iloc[:,0:2]
y=df.iloc[:,2]

In [15]:
score1,cm1 = LDA(x,y)
print(score1)
print(cm1)

0.68
[[8 4]
 [4 9]]
