# Naive Bayes, QDA, and LDA

## Naive Bayes Classifier
- In general, we can use Bayes' rule (with the law of total probability supplying the normalization) to infer the discrete class $C_k$ for a given feature vector $\boldsymbol{x}$
    - $\displaystyle P(C_k \lvert\,\boldsymbol{x}) = \frac{\pi(C_k)\,{\cal{}L}_{\!\boldsymbol{x}}(C_k)}{Z}$, where $\displaystyle Z = \sum_j \pi(C_j)\,{\cal{}L}_{\!\boldsymbol{x}}(C_j)$
- Naively assuming the features are independent within each class: $\displaystyle {\cal{}L}_{\!\boldsymbol{x}}(C_k) = \prod_{\alpha=1}^d p(x_{\alpha} \lvert C_k)$
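
As a minimal numeric sketch of the two formulas above, assume a single point with $d = 2$ features and three classes. The priors and per-feature likelihood values are made-up illustration numbers, not estimates from any dataset.

```python
import numpy as np

# Hypothetical priors pi(C_k) for three classes (illustration only)
prior = np.array([0.5, 0.3, 0.2])

# Hypothetical per-feature likelihoods p(x_alpha | C_k):
# one row per class k, one column per feature alpha
feature_lik = np.array([
    [0.10, 0.40],
    [0.30, 0.20],
    [0.05, 0.05],
])

# Naive independence assumption: multiply over the features
likelihood = feature_lik.prod(axis=1)

# Normalization Z from the law of total probability
Z = np.sum(prior * likelihood)

# Posterior P(C_k | x) for each class
posterior = prior * likelihood / Z
print(posterior, posterior.sum())
```

The posterior sums to one by construction, and the predicted class is simply the index of its largest entry.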

### NB: Learning
- For Gaussian likelihoods, for example, we simply estimate the sample mean and variance of each feature within each class $k$
    - $\displaystyle p(x_{\alpha} \lvert C_k) = G(x_{\alpha};\mu_{k,\alpha}, \sigma^2_{k,\alpha})$
- We also have to pick a prior for the classes: either uniform or based on the frequency of each class in the training set
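
The two prior choices mentioned above can be sketched as follows. The label array is a made-up example, and the names `prior_uniform` and `prior_freq` are only illustrative.

```python
import numpy as np

# Hypothetical training labels (illustration only)
labels = np.array([0, 0, 0, 1, 1, 2])

classes, counts = np.unique(labels, return_counts=True)

# Option 1: uniform prior, every class equally likely a priori
prior_uniform = np.full(classes.size, 1.0 / classes.size)

# Option 2: empirical prior, the fraction of training points in each class
prior_freq = counts / counts.sum()

print(prior_uniform, prior_freq)
```

A uniform prior may be preferable when the class proportions in the training set are not expected to match those seen at prediction time.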

### NB: Estimation
- Look for the maximum of the posterior: $\displaystyle \hat{k} =  \mathrm{arg}\max_k \left[ \pi_k \prod_{\alpha=1}^d G(x_{\alpha};\mu_{k,\alpha}, \sigma^2_{k,\alpha})\right]$ 
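
Products of many small densities can underflow in floating point. Because the logarithm is monotonic, maximizing the log-posterior gives the same $\hat{k}$. The sketch below assumes hypothetical parameter arrays and a helper named `gaussian_nb_log_posterior`, neither of which appears elsewhere in these notes.

```python
import numpy as np

def gaussian_nb_log_posterior(x, priors, means, variances):
    """Unnormalized log-posterior of each class for one feature vector x.

    priors has shape (K,); means and variances have shape (K, d).
    """
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances,
        axis=1,
    )
    return np.log(priors) + log_lik

# Hypothetical parameters for K = 2 classes and d = 2 features
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])

x = np.array([0.5, -0.2])
log_post = gaussian_nb_log_posterior(x, priors, means, variances)
k_hat = int(np.argmax(log_post))   # same argmax as for the product form
print(k_hat, log_post)
```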

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

# calculate feature means and variances for each class
param = dict()  # save parameters here
data = iris.data.copy()
label = iris.target

# Define the classes
classes = np.unique(label)

for k in classes:
    members = (label == k)        # boolean mask
    num = members.sum()           # count
    prior = num / label.size      # class prior
    
    X = data[members, :].copy()   # slice members safely
    mu = X.mean(axis=0)           # class mean
    
    X -= mu                       # mean-center
    var = np.square(X).sum(axis=0) / (X.shape[0] - 1)  # unbiased variance
    
    param[k] = (prior, mu, var)   # save results
    print(k, mu, var)

0 [5.006 3.428 1.462 0.246] [0.12424898 0.1436898  0.03015918 0.01110612]
1 [5.936 2.77  4.26  1.326] [0.26643265 0.09846939 0.22081633 0.03910612]
2 [6.588 2.974 5.552 2.026] [0.40434286 0.10400408 0.30458776 0.07543265]


In [5]:
# init predicted values
k_pred = -1 * np.ones(iris.target.size)

# evaluate posterior for each point and find maximum
for i in range(iris.target.size):
    pmax, kmax = -1, None   # initialize to nonsense values
    for k in param:
        prior, mu, var = param[k]
        diff = iris.data[i,:] - mu
        d2 = np.square(diff) / (2*var) 
        p = prior * np.exp(-d2.sum()) / np.sqrt(np.prod(2 * np.pi * var))
        if p > pmax:
            pmax = p
            kmax = k
    k_pred[i] = kmax

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.target.size, (iris.target!=k_pred).sum()))

Number of mislabeled points out of a total 150 points : 6


In [6]:
# init predicted values
k_pred = -1 * np.ones(iris.target.size)

# evaluate posterior for each point and find maximum
for i in range(iris.target.size):
    pmax, kmax = -1, None   # initialize to nonsense values
    parr = np.zeros_like(classes, dtype=np.float64) # class probabilities
    for k in classes:
        prior, mu, var = param[k]
        diff = iris.data[i,:] - mu
        d2 = np.square(diff) / (2*var) 
        p = prior * np.exp(-d2.sum()) / np.sqrt(np.prod(2 * np.pi * var))
        parr[k] = p # save
        if p > pmax:
            pmax = p
            kmax = k
    print(i, parr / parr.sum())  # normalize so the class probabilities sum to 1
    k_pred[i] = kmax

print("Number of mislabeled points out of a total %d points : %d"
      % (iris.target.size, (iris.target!=k_pred).sum()))

0 [1.00000000e+00 2.98130936e-18 2.15237312e-25]
1 [1.00000000e+00 3.16931184e-17 6.93802994e-25]
2 [1.00000000e+00 2.36711261e-18 7.24095643e-26]
3 [1.00000000e+00 3.06960607e-17 8.69063581e-25]
4 [1.00000000e+00 1.01733735e-18 8.88579362e-26]
5 [1.00000000e+00 2.71773169e-14 4.34428540e-21]
6 [1.00000000e+00 2.32163910e-17 7.98827129e-25]
7 [1.00000000e+00 1.39075122e-17 8.16699477e-25]
8 [1.00000000e+00 1.99015585e-17 3.60646902e-25]
9 [1.00000000e+00 7.37893147e-18 3.61549223e-25]
10 [1.00000000e+00 9.39608901e-18 1.47462333e-24]
11 [1.00000000e+00 3.46196432e-17 2.09362749e-24]
12 [1.00000000e+00 2.80452047e-18 1.01019202e-25]
13 [1.00000000e+00 1.79903266e-19 6.06057778e-27]
14 [1.00000000e+00 5.53387950e-19 2.48503292e-25]
15 [1.00000000e+00 6.27386346e-17 4.50986372e-23]
16 [1.00000000e+00 1.10665843e-16 1.28241922e-23]
17 [1.00000000e+00 4.84177304e-17 2.35001131e-24]
18 [1.00000000e+00 1.12617475e-14 2.56717986e-21]
19 [1.00000000e+00 1.80851332e-17 1.96392412e-24]
20 [1.0000

In [7]:
# run sklearn's version - read up on differences if interested
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.target.size, (iris.target!=y_pred).sum()))

Number of mislabeled points out of a total 150 points : 6


In [13]:
# class probabilities
gnb.predict_proba(iris.data)

array([[1.00000000e+000, 1.35784265e-018, 7.11283512e-026],
       [1.00000000e+000, 1.51480769e-017, 2.34820051e-025],
       [1.00000000e+000, 1.07304179e-018, 2.34026774e-026],
       [1.00000000e+000, 1.46619543e-017, 2.95492722e-025],
       [1.00000000e+000, 4.53291917e-019, 2.88389975e-026],
       [1.00000000e+000, 1.49094245e-014, 1.75752068e-021],
       [1.00000000e+000, 1.10262691e-017, 2.71144689e-025],
       [1.00000000e+000, 6.53644612e-018, 2.77336308e-025],
       [1.00000000e+000, 9.42227052e-018, 1.20443161e-025],
       [1.00000000e+000, 3.42348334e-018, 1.20750647e-025],
       [1.00000000e+000, 4.38090482e-018, 5.06830427e-025],
       [1.00000000e+000, 1.65766943e-017, 7.24748728e-025],
       [1.00000000e+000, 1.27573119e-018, 3.28718898e-026],
       [1.00000000e+000, 7.73742183e-020, 1.86207920e-027],
       [1.00000000e+000, 2.43526387e-019, 8.23627924e-026],
       [1.00000000e+000, 3.04074398e-017, 1.66211400e-023],
       [1.00000000e+000, 5.42610885e-017
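
One difference that likely contributes to the slightly different class probabilities: the hand-written loop uses the unbiased variance (dividing by $n-1$), whereas `GaussianNB`, to my understanding, uses the maximum-likelihood variance (dividing by $n$) and adds a small smoothing term controlled by its `var_smoothing` parameter. The sketch below compares only the two variance estimators on made-up numbers. It does not reproduce scikit-learn's smoothing.

```python
import numpy as np

# Made-up measurements of one feature within one class
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
n = x.size

var_unbiased = np.var(x, ddof=1)   # divide by n - 1, as in the hand-written loop
var_mle = np.var(x, ddof=0)        # divide by n (maximum-likelihood estimate)

# The two estimates differ by the factor (n - 1) / n
print(var_unbiased, var_mle, var_mle / var_unbiased)
```

With 50 points per class the factor is 49/50, which is too small to change the predicted labels here but is visible in the tiny posterior probabilities.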

### Pros
- Features are automatically put on a common scale relative to each other
    - For example, when similar quantities are measured in different units (1 m vs. 1 mm)
    - The estimated per-class mean and variance put each feature on a meaningful scale
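
The unit-invariance claim can be checked with a small experiment. This is a sketch on synthetic data, with hypothetical helpers `gnb_fit` and `gnb_predict` written for this illustration only.

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class prior, mean, and unbiased variance of each feature."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0, ddof=1) for k in classes])
    return classes, priors, means, variances

def gnb_predict(X, classes, priors, means, variances):
    """Class with the largest log-posterior for each row of X."""
    # Shape (n, K, d): squared standardized distance to each class mean
    z2 = (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    log_post = np.log(priors) - 0.5 * np.sum(
        z2 + np.log(2 * np.pi * variances), axis=2
    )
    return classes[np.argmax(log_post, axis=1)]

rng = np.random.default_rng(0)
# Synthetic two-class data with two features, imagined to be in meters
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.5, (50, 2))])
y = np.repeat([0, 1], 50)

X_mm = X.copy()
X_mm[:, 0] *= 1000.0   # express the first feature in millimeters instead

pred_m = gnb_predict(X, *gnb_fit(X, y))
pred_mm = gnb_predict(X_mm, *gnb_fit(X_mm, y))
print("fraction of identical predictions:", np.mean(pred_m == pred_mm))
```

Rescaling a feature changes its mean and variance consistently, so the standardized distances are unchanged and the log-variance term shifts every class by the same constant.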

### Cons
- Independence is a strong assumption that correlated features violate (although it keeps the computational cost low)
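
To illustrate what the independence assumption discards, the sketch below draws a synthetic, strongly correlated pair of features and compares the full sample covariance with the diagonal one that naive Bayes effectively uses. The covariance values are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, strongly correlated pair of features (illustration only)
true_cov = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=2000)

full_cov = np.cov(X, rowvar=False)        # what a full-covariance model estimates
naive_cov = np.diag(np.diag(full_cov))    # what naive Bayes effectively assumes

corr = full_cov[0, 1] / np.sqrt(full_cov[0, 0] * full_cov[1, 1])
print("sample correlation:", corr)
print("covariance under the independence assumption:\n", naive_cov)
```

The diagonal model needs only $d$ variances per class instead of $d(d+1)/2$ covariance entries, which is where the computational saving comes from.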

## Discriminant Analysis

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
from scipy.stats import multivariate_normal

dataset = pd.read_csv("heart_processed_log.csv", index_col=0)
display(dataset)

X = dataset.drop("DEATH_EVENT", axis=1).values
y = dataset["DEATH_EVENT"].values

# split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# print the shapes of the training and testing sets
print('train shapes:')
print('\t X_train ->', X_train.shape)
print('\t y_train ->', y_train.shape)

print('test shapes:')
print('\t X_test ->', X_test.shape)
print('\t y_test ->', y_test.shape)

# Success Rates
def print_success_rates(y_true,y_pred):
    n_success = np.sum(y_true == y_pred)
    n_total   = len(y_true)
    print("Number of correctly labeled points: %d of %d.  Accuracy: %.2f" 
        % (n_success, n_total, n_success/n_total))

Unnamed: 0,age,creatinine_phosphokinase,ejection_fraction,platelets,serum_creatinine,serum_sodium,DEATH_EVENT
0,4.317488,6.366470,2.995732,12.487485,0.641854,4.867534,1
1,4.007333,8.969669,3.637586,12.481270,0.095310,4.912655,1
2,4.174387,4.983607,2.995732,11.995352,0.262364,4.859812,1
3,3.912023,4.709530,2.995732,12.254863,0.641854,4.919981,1
4,4.174387,5.075174,2.995732,12.697715,0.993252,4.753590,1
...,...,...,...,...,...,...,...
294,4.127134,4.110874,3.637586,11.951180,0.095310,4.962845,0
295,4.007333,7.506592,3.637586,12.506177,0.182322,4.934474,0
296,3.806662,7.630461,4.094345,13.517105,-0.223144,4.927254,0
297,3.806662,7.788626,3.637586,11.849398,0.336472,4.941642,0


train shapes:
	 X_train -> (209, 6)
	 y_train -> (209,)
test shapes:
	 X_test -> (90, 6)
	 y_test -> (90,)


### QDA: Full Covariance Matrix
- Estimate the full covariance matrix for the classes: $\displaystyle {\cal{}L}_{\!\boldsymbol{x}}(C_k) =  G(\boldsymbol{x};\mu_k, \Sigma_k)$
- Handles correlated features well
- Consider binary problem with 2 classes
    - Taking $-2$ times the logarithm of the likelihoods (and dropping the shared constant), we compare
    - $\displaystyle (\boldsymbol{x}\!-\!\boldsymbol{\mu}_1)^T\,\Sigma_1^{-1}(\boldsymbol{x}\!-\!\boldsymbol{\mu}_1) + \ln\,\lvert\Sigma_1\rvert$ vs.
    - $\displaystyle (\boldsymbol{x}\!-\!\boldsymbol{\mu}_2)^T\,\Sigma_2^{-1}(\boldsymbol{x}\!-\!\boldsymbol{\mu}_2) + \ln\,\lvert\Sigma_2\rvert$
    - If the difference of the two expressions is lower than a threshold (determined by the priors), we assign the point to class 1, otherwise to class 2
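
The comparison above can be sketched directly in log space, which avoids evaluating very small densities. The class parameters and the helper name `quadratic_score` are assumptions for illustration. With equal priors the threshold is 0; with unequal priors it becomes $2\ln(\pi_1/\pi_2)$.

```python
import numpy as np

def quadratic_score(X, mu, sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) + ln|Sigma| for each row x of X."""
    diff = X - mu
    # Solve Sigma z = diff^T instead of forming the explicit inverse
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    sign, logdet = np.linalg.slogdet(sigma)   # stable log-determinant
    return maha + logdet

# Hypothetical class parameters (illustration only)
mu_1, sigma_1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu_2, sigma_2 = np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])

X_new = np.array([[0.1, -0.3], [2.2, 1.9]])

# Lower score means higher likelihood; equal priors give a threshold of 0
delta = quadratic_score(X_new, mu_1, sigma_1) - quadratic_score(X_new, mu_2, sigma_2)
labels = np.where(delta < 0, 1, 2)
print(labels)
```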

In [13]:
def qda_predict(X_train, y_train, X_test):
    prior_class_0 = 0.5
    prior_class_1 = 0.5

    X_class_0 = X_train[y_train == 0] 
    X_class_1 = X_train[y_train == 1] 
    
    mu_class_0 = np.mean(X_class_0, axis=0)
    mu_class_1 = np.mean(X_class_1, axis=0)

    sigma_class_0 = np.cov(X_class_0, rowvar=False)
    sigma_class_1 = np.cov(X_class_1, rowvar=False)

    likelihood_class_0 = multivariate_normal.pdf(X_test, mean=mu_class_0, cov=sigma_class_0)
    likelihood_class_1 = multivariate_normal.pdf(X_test, mean=mu_class_1, cov=sigma_class_1)

    posterior_class_0 = likelihood_class_0 * prior_class_0 
    posterior_class_1 = likelihood_class_1 * prior_class_1

    y_pred = np.where(posterior_class_1 > posterior_class_0, 1, 0)

    return y_pred

y_pred_qda = qda_predict(X_train, y_train, X_test)
print_success_rates(y_test, y_pred_qda)  # argument order matches (y_true, y_pred)

Number of correctly labeled points: 68 of 90.  Accuracy: 0.76


### LDA: Same Covariance Matrix
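- When $\Sigma_1=\Sigma_2=\Sigma$, the terms quadratic in $\boldsymbol{x}$ (and the $\ln\lvert\Sigma\rvert$ terms) cancel from the difference, leaving a decision boundary that is linear in $\boldsymbol{x}$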
    - $\displaystyle (x\!-\!\mu_1)^T\,\Sigma^{-1}(x\!-\!\mu_1) $ 
    - $\displaystyle -\ (x\!-\!\mu_2)^T\,\Sigma^{-1}(x\!-\!\mu_2) $
- Fewer parameters to estimate during the learning process (good if we don't have enough data)
- Consider whether the data call for a linear or a nonlinear decision boundary
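
Expanding the two quadratic forms with a shared $\Sigma$ gives $-2\,(\mu_1-\mu_2)^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2$, which is linear in $x$. The sketch below checks this numerically for hypothetical means and a hypothetical shared covariance. The names `w` and `c` are introduced here only for illustration.

```python
import numpy as np

# Hypothetical shared covariance and class means (illustration only)
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_1 = np.array([0.0, 0.0])
mu_2 = np.array([2.0, 2.0])

rng = np.random.default_rng(0)
X_new = rng.normal(size=(5, 2))

def mahalanobis_sq(X, mu, sigma):
    """Squared Mahalanobis distance of each row of X to mu."""
    diff = X - mu
    return np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)

# Direct difference of the two quadratic forms
delta_quadratic = mahalanobis_sq(X_new, mu_1, sigma) - mahalanobis_sq(X_new, mu_2, sigma)

# Equivalent linear form: delta = -2 w^T x + c
w = np.linalg.solve(sigma, mu_1 - mu_2)
c = mu_1 @ np.linalg.solve(sigma, mu_1) - mu_2 @ np.linalg.solve(sigma, mu_2)
delta_linear = -2.0 * X_new @ w + c

print(np.allclose(delta_quadratic, delta_linear))
```

Only the vector `w` and the scalar `c` are needed at prediction time, so the decision boundary is a hyperplane.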

In [14]:
def lda_predict(X_train, y_train, X_test):
    
    prior_class_0 = 0.5
    prior_class_1 = 0.5

    X_class_0 = X_train[y_train == 0]
    X_class_1 = X_train[y_train == 1]
    
    mu_class_0 = np.mean(X_class_0, axis=0)
    mu_class_1 = np.mean(X_class_1, axis=0)

    sigma_class_0 = np.cov(X_class_0, rowvar=False)
    sigma_class_1 = np.cov(X_class_1, rowvar=False)
    sigma_pooled = ((X_class_0.shape[0] - 1) * sigma_class_0 + (X_class_1.shape[0] - 1) * sigma_class_1) / (X_class_0.shape[0] + X_class_1.shape[0] - 2)


    likelihood_class_0 = multivariate_normal.pdf(X_test, mean=mu_class_0, cov=sigma_pooled)
    likelihood_class_1 = multivariate_normal.pdf(X_test, mean=mu_class_1, cov=sigma_pooled)

    posterior_class_0 = likelihood_class_0 * prior_class_0
    posterior_class_1 = likelihood_class_1 * prior_class_1

    y_pred = np.where(posterior_class_1 > posterior_class_0, 1, 0)

    return y_pred

y_pred_lda = lda_predict(X_train, y_train, X_test)
print_success_rates(y_test, y_pred_lda)  # argument order matches (y_true, y_pred)

Number of correctly labeled points: 63 of 90.  Accuracy: 0.70


# Scikit-learn Implementation

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load iris dataset
iris = datasets.load_iris()

X = iris.data
y = iris.target

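# Standardize features (LDA/QDA are invariant to linear rescaling in exact arithmetic,
# so this is for numerical convenience rather than a requirement)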
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

## NBC

In [2]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

nb = GaussianNB()
nb.fit(X_train, y_train)

y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))

Naive Bayes Accuracy: 0.9111111111111111


## QDA

In [3]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

y_pred_qda = qda.predict(X_test)
print("QDA Accuracy:", accuracy_score(y_test, y_pred_qda))

QDA Accuracy: 0.9777777777777777


## LDA

In [4]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

y_pred_lda = lda.predict(X_test)
print("LDA Accuracy:", accuracy_score(y_test, y_pred_lda))

LDA Accuracy: 0.9777777777777777
