# Naive Bayes
* grounded in probability, which can be powerful
* we will purposefully make this model not powerful ("naive")

## Example
* We want to determine if an email is spam
* can look at words like "free", "pills", "money", etc
* we want to find: 
$$p("money"|\;spam)$$
$$p("money"|\;not\;spam)$$

# How do we find the above probabilities?
* **discrete probabilities are just counts** (we will look at a slightly more complex situation after)
* hence: 
$$p("money"|\;spam) = \frac{count \; or \; number \;of \;spam \; messages\;containing\;"money"}{total \;number\;or \;count \;of\;spam\;messages}$$
* A similar thing will be done for 
$$p("money"|\;not\;spam)$$
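The counting rule above can be sketched in a few lines. This is a minimal illustration on a made-up toy corpus; the `messages` list and the helper name `word_given_class` are assumptions for the example, not part of the original notes.

```python
# Hypothetical toy corpus: (set of words in the message, label)
messages = [
    ({"free", "money", "now"}, "spam"),
    ({"money", "pills"}, "spam"),
    ({"cheap", "pills"}, "spam"),
    ({"meeting", "tomorrow"}, "not spam"),
    ({"send", "money", "rent"}, "not spam"),
    ({"lunch", "today"}, "not spam"),
    ({"project", "update"}, "not spam"),
]


def word_given_class(word, label, data):
    """Estimate p(word | label) as a ratio of counts."""
    in_class = [words for words, y in data if y == label]
    containing = sum(1 for words in in_class if word in words)
    return containing / len(in_class)


p_money_spam = word_given_class("money", "spam", messages)
p_money_not_spam = word_given_class("money", "not spam", messages)
print(p_money_spam, p_money_not_spam)
```

Here 2 of the 3 spam messages contain "money", and 1 of the 4 non-spam messages does, so the two estimates are 2/3 and 1/4.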

# What makes this naive?
* consider the following:
$$p("cash"|\;spam)$$
* Is it correlated with $p("money"|\;spam)$?
* It most likely is very correlated! 
* however, naive bayes assumes that the input features are all independent of each other once the class is known (conditionally independent given the class), even when they are actually correlated
$$p("money","cash"|\;spam)= p("money"|\;spam)p("cash"|\;spam)$$
which generalizes to:
$$p(all \;words | \; spam) = p(word1 \;|\; spam)p(word2\;|\;spam)...$$

# What makes Naive Bayes, Bayesian?
* think about what we are interested in doing here
* we don't just want the model probabilities p(X|spam), we want to make a prediction! We want: p(spam|X)
* In other words, we want to know the probability that a document is spam, given the words in a document!
* This is just the reverse of what we have been modeling so far, which is just p(words|spam)
* That is fine, because we can just use...**Bayes Rule**
$$P(spam|X) = \frac{P(X|spam)p(spam)}{P(X)}$$
* where in the above equation, X is the list of words in the document
* We classify based on which of the following are bigger:
    * p(spam|X)
    * p(not spam|X)
* if p(spam|X) > p(not spam|X), we classify as spam
* if p(not spam|X) > p(spam|X), we classify as not spam
* In other words we are taking the argmax of these two probabilities 
$$Y = argmax_C \Big(p(C|X)\Big) = argmax_C\Big(p(X|C)p(C)\Big)$$
* notice that p(X) = p(words) does not depend on spam or not spam
* in this case, C is equal to spam or not spam
### P(C) = class prior
* again, this can be found just by counting! Because it is a discrete probability 
* example: if you have 10 spam emails and 20 non spam emails in your data set, then p(spam) = p(c) = 1/3, and p(not spam) = p(c) = 2/3
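As a quick sketch of the counting described above, the class priors can be computed directly from the list of labels. The helper name `class_priors` is an assumption for illustration.

```python
from collections import Counter


def class_priors(labels):
    """Estimate p(C) for every class as (count of class) / (total count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


# the example from the text: 10 spam emails and 20 non spam emails
labels = ["spam"] * 10 + ["not spam"] * 20
priors = class_priors(labels)
print(priors)
```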
### P(X | C) = likelihood
### P(C | X) = posterior


# How do we model P(X | C), the likelihood? 
* remember, naive bayes means that all of the words, given the class, are independent
$$p(document \;containing "free", "money", "cash",... | \; C) = p("free" \;|\; C)p("money"\;|\;C)...$$
* when you have independent probabilities, that means you can just multiply each of the individual probabilities together to get the joint probability 
* so since:
$$P(X\;|\;C)=P(words\;|\;C)$$
* and because we just mentioned how all words are independent, then we can write:

![bayes%202.png](attachment:bayes%202.png)

* so, as seen above, the probability of seeing a certain list of words, given the class (spam or not spam), is just the product of the individual word probabilities given the class
* note that you also need to account for any word that doesn't appear as well
* that is just 1 - the probability of it appearing

![bayes1.png](attachment:bayes1.png)

* why do we need to account for the probability that a word does not appear? Think back to maximizing the likelihood of a binomial experiment
* The general form equation for the likelihood of a binomial experiment is:
$$L(X|p) = p^k*(1-p)^{N-k}$$
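A minimal sketch of the likelihood with both present and absent words, assuming each word is treated as a binary (appears / does not appear) feature. The probabilities in `p_word_given_spam` are made-up values for illustration only.

```python
def likelihood(present, p_word_given_class):
    """p(X | C) for binary word indicators under the naive assumption.

    present: dict mapping word -> True/False (does it appear in the document?)
    p_word_given_class: dict mapping word -> p(word appears | C)
    """
    result = 1.0
    for word, p in p_word_given_class.items():
        # use p if the word appears, and (1 - p) if it does not
        result *= p if present.get(word, False) else (1.0 - p)
    return result


# hypothetical per-word probabilities for the spam class
p_word_given_spam = {"free": 0.6, "money": 0.5, "pills": 0.2}

# a document that contains "money" but not "free" or "pills"
doc = {"free": False, "money": True, "pills": False}
lik = likelihood(doc, p_word_given_spam)
print(lik)  # (1 - 0.6) * 0.5 * (1 - 0.2)
```

Skipping the absent words would ignore the evidence that "free" and "pills" are missing, which is why every word in the vocabulary contributes a factor.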

# Naive Bayes vs. KNN
* conceptually speaking, this is almost the opposite of what we did for KNN
* with KNN we are essentially trying to approximate a function that takes in some input, and produces a target label.
    * f(words in document) -> spam/not spam
* With Naive bayes, we are assuming that all of the data we observe is produced from the target label so that all we need to do is model the distribution of the data, given all of the different target labels
    * spam -> a spammy document -> model p(document | spam)
    * not spam -> a non-spammy document -> model p (document | not spam)
* we will discuss this idea in more detail when we look into **generative** vs **discriminative** classifiers


# Naive Bayes Implementation 
* we are again going to use the MNIST dataset
* Data: discrete distribution, each pixel has 256 different values
* in physical reality: we consider light to be more or less continuous 
* so, we are probably better off treating it as a continuous distribution 
* but which continuous distribution should we choose? 
* We could use a distribution that is defined from 0-1 (since we already divided by 255)
* but instead we are going to use a gaussian distribution!

![gaussian.png](attachment:gaussian.png)

* recall that there is also the multivariate gaussian distribution where x and $\mu$ are vectors, and instead of the variance we use covariance

![multi%20var%20gauss.png](attachment:multi%20var%20gauss.png)

* x = vector input
* $\mu$ = vector mean
* $\Sigma$ = covariance matrix
* Scipy library can calculate this directly
* This is much better than doing it manually, since scipy is optimized to do it fast

## Further Optimizations
### 1) We won't use full covariance matrix
* This is because Naive bayes treats all dimensions as independent, even though they are not
* In terms of the multivariate gaussian, that means that all of the off diagonals in the covariance matrix are 0
$$Cov(i,j) = E[(x_i-\mu_i)(x_j-\mu_j)]$$
* the above equation equals 0 if $x_i$ is independent of $x_j$
$$Cov(i, i) = \Sigma_{ii} = var(x_i) = \sigma_i^2$$
* this means that instead of needing a DxD covariance matrix, we only need D different single dimension variances (so we can just store a D sized vector) 
* this is called an axis aligned elliptical covariance
* This speeds up the calculation a lot, because we don't need to invert the covariance matrix
* recall that in order to invert a diagonal matrix, all we need to do is take 1 over all of the individual values
$$(\Sigma^{-1})_{ii}= \frac{1}{\sigma_i^2}$$
* luckily, scipy's library function allows us to pass in either a DxD covariance matrix, or a D sized vector in place of it!
* effectively that means we are still doing:
$$p(X|C) = p(pixel_1\;|\;C)p(pixel_2\;|\;C)...p(pixel_{784}\;|\;C)$$
* which equals..
$$N(x_1; \mu_1, \sigma_1^2)N(x_2; \mu_2, \sigma_2^2)...N(x_{784}; \mu_{784}, \sigma_{784}^2)$$
### NOTE
* in this case, because Naive bayes assumes that all dimensions are independent (i.e. all 784 pixels are independent of each other), the covariance matrix will have zeros everywhere except the diagonal
* in that case it can be represented as 784 normal univariate gaussians for each class 
* However, those 784 univariate gaussians can be COMBINED into a multivariate gaussian! 
$$p(x_1, x_2,...,x_{784}; \mu_1, \mu_2,...,\mu_{784}, \sigma_1^2, \sigma_2^2,...,\sigma_{784}^2)$$ 
* this multivariate gaussian, unlike those we have worked with in the past, has 784 input features! (unlike the usual 2, x and y)
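The equivalence in the note above can be checked numerically. This is a small sketch with D = 5 standing in for the 784 pixels; the random mean, variances, and input are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
D = 5                                  # small stand-in for the 784 pixels
mean = rng.normal(size=D)
var = rng.uniform(0.5, 2.0, size=D)    # one variance per dimension
x = rng.normal(size=D)

# product of D univariate gaussians (norm expects the standard deviation)
p_univariate = norm.pdf(x, loc=mean, scale=np.sqrt(var)).prod()

# one multivariate gaussian, with the variances passed as a D sized vector
p_diag_vector = mvn.pdf(x, mean=mean, cov=var)

# the same gaussian with an explicit D x D diagonal covariance matrix
p_diag_matrix = mvn.pdf(x, mean=mean, cov=np.diag(var))

print(p_univariate, p_diag_vector, p_diag_matrix)
```

All three numbers agree, which is why storing a D sized variance vector is enough in the naive case.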

### 2) Log Probabilities
* we do not need to use direct probabilities
* calculating the exponential function in code is time consuming, as is multiplication
* by calculating the log probabilities instead, we can bring everything inside of the exponential down, and turn all of the multiplications into additions
$$Prediction = argmax_C \Big(P(X|C)P(C)\Big) = argmax_C \Big(logP(X|C)+ logP(C)\Big) $$
* scipy has a function for the log pdf of a gaussian as well!
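Besides speed, a commonly cited reason for working with logs is numerical: a product of many small probabilities can underflow to zero in floating point, while the sum of their logs stays well behaved. A minimal sketch, assuming 784 per-pixel likelihoods that all happen to equal 0.001 (a made-up value):

```python
import numpy as np

# 784 hypothetical per-pixel likelihoods, each a small number
probs = np.full(784, 1e-3)

direct_product = np.prod(probs)       # too small for float64, becomes 0.0
log_sum = np.sum(np.log(probs))       # 784 * log(0.001), perfectly representable

print(direct_product, log_sum)
```

Because the log is monotonically increasing, comparing the log sums between classes picks the same argmax as comparing the raw products would.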

### 3) Smoothing
* sometimes, when you try to invert the covariance matrix, you run into a problem called the singular covariance problem 
* That is the matrix equivalent of dividing by 0
* To prevent this we add the identity matrix multiplied by a small number, like 10^-3, to the raw covariance matrix
* So it would go from the unbiased sample estimate (which divides by N-1; the maximum likelihood estimate divides by N):
$$\Sigma = \frac{(X-\mu)^T(X-\mu)}{N-1}$$
* to the smoothed estimate:
$$\Sigma = \frac{(X-\mu)^T(X-\mu)}{N-1}+ \lambda*I$$
* where lambda is a small number like 10^-3
* and I is the identity matrix
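Here is a small sketch of the singular covariance problem and the smoothing fix. The data is synthetic: one feature is held constant (think of a border pixel that is always black), which is an assumption chosen to force a singular matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 2] = 0.0                      # constant feature, so its variance is 0

cov_raw = np.cov(X.T)              # np.cov divides by N - 1 by default
lam = 1e-3
cov_smooth = cov_raw + lam * np.eye(X.shape[1])

rank_raw = np.linalg.matrix_rank(cov_raw)        # less than 4: not invertible
rank_smooth = np.linalg.matrix_rank(cov_smooth)  # full rank after smoothing
inv_smooth = np.linalg.inv(cov_smooth)           # inversion now succeeds

print(rank_raw, rank_smooth)
```

Adding lambda to every diagonal entry raises each eigenvalue by lambda, so no eigenvalue can be exactly zero afterwards.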

# Pseudocode
## Fit
![nb%20pseudocode.png](attachment:nb%20pseudocode.png)

* in our fit function above, what we want to do is calculate the mean and covariance for each class 
* the mean and the covariance are all that we need to fully represent the gaussian distribution
    * They are what we call **sufficient statistics**
* we also need to calculate the priors here

## Predict

![nb%20pseudocode%20predict.png](attachment:nb%20pseudocode%20predict.png)

* in the predict function, we want to loop through every sample that we want to predict on, and then loop through each class
* for each class, we get the mean, variance, and prior
* we use the mean and variance to calculate the log pdf for this sample being in this class
* we add that to the log of the prior to get the log posterior (up to the constant log p(X), which does not affect the argmax)
* if this is better than the maximum posterior we have found so far, we set this to our best current prediction

# Naive Bayes Handwritten Example
* let's again use the spam classifier example
* in order to keep this manageable by hand, let's only look at 3 words: "money", "free", and "pills"
* so, we have a set of data points, which tells us whether or not each of these words appears, and whether or not that point is spam 

![hand%20data.png](attachment:hand%20data.png)

* remember, our goal is to calculate P(C|X), in other words, P(spam | words) and P(not spam | words)
* we do not need P(X)

## Required probabilities
* in this case, the symbol '~' means **not**

![required%20probabilities%20.png](attachment:required%20probabilities%20.png)

* and here are their values:

![required%20prob%20values.png](attachment:required%20prob%20values.png)

## Example Calculation
* Now let's do an example calculation for classifying a message
* Input: suppose an email contains the word "money", but not "free" or "pills"
* is it spam?
* we need to calculate the probabilities p(spam | money, ~free, ~pills) and p(~spam | money, ~free, ~pills)
* we can separate both of these probabilities into distinct per-word terms, since the words are assumed independent given the class

### P(spam| money, ~free, ~pills)

![p%20spam.png](attachment:p%20spam.png)

### P(~spam| money, ~free, ~pills)

![p%20not%20spam.png](attachment:p%20not%20spam.png)

## Final Answer - NOT spam
* we can see that the not spam posterior is higher, so we classify this as not spam
* also, these are not *true* probabilities since we did not divide by p(X)
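The same style of calculation can be sketched in code, including the final division by p(X) that turns the scores into true probabilities. The priors and word probabilities below are hypothetical values chosen for easy arithmetic; they are NOT the values from the tables above.

```python
def unnormalized_posterior(prior, p_word, present):
    """Compute p(X | C) * p(C) for binary word features."""
    score = prior
    for word, p in p_word.items():
        score *= p if present[word] else (1.0 - p)
    return score


# hypothetical parameters (not the ones from the handwritten example)
priors = {"spam": 0.5, "not spam": 0.5}
p_words = {
    "spam": {"money": 0.5, "free": 0.5, "pills": 0.5},
    "not spam": {"money": 0.25, "free": 0.25, "pills": 0.25},
}

# the email contains "money" but not "free" or "pills"
email = {"money": True, "free": False, "pills": False}

scores = {c: unnormalized_posterior(priors[c], p_words[c], email) for c in priors}
p_x = sum(scores.values())                        # p(X), the normalizer
posterior = {c: s / p_x for c, s in scores.items()}
prediction = max(scores, key=scores.get)

print(scores, posterior, prediction)
```

Dividing by p(X) rescales both scores by the same amount, so the predicted class is the same with or without it.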

# 0 probabilities 
* note that probabilities that are exactly equal to 0 can be problematic
* because everything gets multiplied, all it takes is 1 zero to make everything zero
* can end up with a tie of all zeros
* solution: sometimes we use add-one smoothing/ laplace smoothing
* instead of the pure count, we add 1 to the numerator and V to the denominator
* where V = vocabulary size to ensure valid probability (adds up to 1)

![laplace%20smoothing.png](attachment:laplace%20smoothing.png)
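A minimal sketch of add-one smoothing, assuming the counts are word occurrences within one class (so the denominator is the total number of tokens in that class plus V). The vocabulary and tokens are made up for illustration.

```python
from collections import Counter


def smoothed_word_probs(tokens_in_class, vocabulary):
    """Add-one estimate: (count + 1) / (total tokens + V)."""
    counts = Counter(tokens_in_class)
    total = len(tokens_in_class)
    V = len(vocabulary)
    return {w: (counts[w] + 1) / (total + V) for w in vocabulary}


vocabulary = ["money", "free", "pills", "meeting"]
spam_tokens = ["money", "free", "money", "pills"]   # "meeting" never appears in spam

probs = smoothed_word_probs(spam_tokens, vocabulary)
print(probs)
```

The unseen word "meeting" gets probability 1/8 instead of 0, and the four probabilities still add up to 1.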

---
# Naive Bayes in Code
* Let's code Naive Bayes and use it on the MNIST problem

In [53]:
import numpy as np
import pandas as pd
from datetime import datetime
from scipy.stats import norm # used to get single dimension gaussians
from scipy.stats import multivariate_normal as mvn # also can use multivariate normal

def get_data(limit=None):
    print("Reading in and transforming data...")
    df = pd.read_csv('data/train.csv')
    data = df.to_numpy()   # as_matrix() was removed in pandas 1.0
    np.random.shuffle(data)
    X = data[:, 1:] / 255.0 # data is from 0..255
    Y = data[:, 0]
    if limit is not None:
        X, Y = X[:limit], Y[:limit]
    return X, Y

## Create Naive Bayes Class
* needs fit and predict functions

In [54]:
class NaiveBayes(object):
    
    def fit(self, X, Y, smoothing=10e-3):    # takes in X, Y, and smoothing parameter (note: 10e-3 == 1e-2)
        self.gaussians = dict()    # create an empty dictionary of the gaussian parameters 
        self.priors = dict()       # and an empty dictionary of the priors
        labels = set(Y)            # grabs all unique values in Y
        
        for c in labels:           # loop through each of the labels (0,1,2,...,9)
            current_x = X[Y == c]  # set current_x to be that for which X has a label of c, 
                                   # ex. grabbing ALL training examples where y is 5
            
            self.gaussians[c] = {                          # at this point, current_x contains all training examples 
                'mean': current_x.mean(axis=0),            # associated with a specific class label. Now, we find the 
                'var': current_x.var(axis=0) + smoothing   # mean and variance of each pixel (784 total). We need
            }                                              # these so that we can determine the likelihood
                                                           # of observing certain values for the pixels, given a class label
            
            self.priors[c] = float(len(Y[Y == c])) / len(Y)  # calculate prior for each class 
                                                             # # of images for class / total # of images
        
    def score(self, X, Y):
        P = self.predict(X)
        return np.mean(P == Y)
    
    def predict(self, X): 
        N, D = X.shape
        K = len(self.gaussians)        # setting K = number of classes
        P = np.zeros((N,K))            # each prediction is a vector. Each class gets a probability, and the 
                                       # highest probability is chosen
        
        for c, g in self.gaussians.items():     # loop through all of the gaussians
            mean, var = g['mean'], g['var']         # get mean and variance for each gaussian
            
            P[:, c] = mvn.logpdf(X, mean=mean, cov=var) + np.log(self.priors[c]) # calculate N different log 
                                                                                 # pdfs at same time
        
        return np.argmax(P, axis=1)
        
if __name__ == '__main__':
    X, Y = get_data(10000)
    Ntrain = int(len(Y) / 2)
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]
    
    nb = NaiveBayes()
    
    
    # set timer to see how long it takes model to fit the training data
    t0 = datetime.now()
    nb.fit(Xtrain, Ytrain)
    print ("Training Time: ", (datetime.now() - t0))

    # now get training accuracy and time this as well
    t0 = datetime.now()
    print ("Train accuracy:", nb.score(Xtrain, Ytrain))
    print("Time to compute train accuracy:", (datetime.now() - t0), "Train size:", len(Ytrain))

    # now print test accuracy
    t0 = datetime.now()
    print ("Test accuracy:", nb.score(Xtest, Ytest))
    print("Time to compute test accuracy:", (datetime.now() - t0), "Test size:", len(Ytest))
    print('-----------------------------------------------------------------------')
        

Reading in and transforming data...
Training Time:  0:00:00.030958
Train accuracy: 0.8094
Time to compute train accuracy: 0:00:01.446992 Train size: 5000
Test accuracy: 0.7878
Time to compute test accuracy: 0:00:01.396955 Test size: 5000
-----------------------------------------------------------------------


----
# Non-Naive Bayes
* remember that in Naive Bayes, the "naive" just means that all of the input features are independent. 
    * aka in the bivariate case, this means that x1 and x2 are not correlated at all
    * it would mean that the covariance matrix has 0s on the off-diagonals 
    * in the case of the MNIST data set, it means that a pixel value is not at all dependent on the pixel value next to it (in reality it **IS**)
    * Formula wise that looks like:
    $$p(X|C) = \prod_i p(X_i|C)$$
    * remember, these are able to be just multiplied together because they are assumed to be independent
    * each p(x_i|C) is a univariate gaussian distribution
    * the assumption of independence means that we assume covariance between features to be 0 (0 correlation)
    * that means that this is just a special case of a multivariate gaussian, where instead of using the full covariance matrix, we use a diagonal covariance
* with non naive bayes, it just means that the features are not independent
* The question is then, how do we model P(X | C) in the non-naive bayes case? 
    * this is in fact a very open ended question - you could model it however you want
    * Example: you could use the full covariance matrix, instead of the diagonal covariance matrix as we did with naive bayes, and we would end up with non-naive bayes
    * Could also use a complex model like a hidden markov model
    * could get even more complex and use a custom bayes net
    * Take-away: can be arbitrarily complex
    
# Multivariate Gaussian
* lets talk more about the multivariate gaussian with full covariance, and why that is not naive
* in general, if 2 random variables are **independent**, then their covariance is 0
* this can be proven using the definition of covariance
$$cov(X_i, X_j) = E\Big[(X_i-\mu_i)(X_j-\mu_j)\Big]$$
$$cov(X_i, X_j) = E\Big[X_iX_j\Big]-E\Big[X_i\mu_j\Big]-E\Big[\mu_iX_j\Big]+E\Big[\mu_i\mu_j\Big]$$
$$cov(X_i, X_j) = E\Big[X_i\Big]E\Big[X_j\Big]-E\Big[X_i\Big]\mu_j-\mu_iE\Big[X_j\Big]+\mu_i\mu_j$$
$$cov(X_i, X_j) = \mu_i\mu_j-\mu_i\mu_j-\mu_i\mu_j+\mu_i\mu_j = 0$$
* the step that replaces $E[X_iX_j]$ with $E[X_i]E[X_j]$ is where independence is used
* NOTE: the expected value is just the **mean**! It is the expected value if you sample a large number of times!
* the converse is **not** true. I.e. if two variables have 0 covariance, they cannot be assumed to be independent 
* there is one exception to this, and that is the (jointly) gaussian distribution 
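To see concretely that zero covariance does not imply independence, here is a small non-gaussian counterexample (a standard one, not from the original notes): X takes the values -1, 0, 1 with equal probability and Y = X^2.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])    # three equally likely values of X
y = x ** 2                        # Y is completely determined by X

# cov(X, Y) = E[XY] - E[X]E[Y], computed exactly over the three outcomes
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# yet Y clearly depends on X: compare P(Y=1) with P(Y=1 | X=0)
p_y1 = np.mean(y == 1)
p_y1_given_x0 = np.mean(y[x == 0] == 1)

print(cov_xy, p_y1, p_y1_given_x0)
```

The covariance is exactly 0, but knowing X = 0 changes the probability that Y = 1 from 2/3 to 0, so the two variables are not independent.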

## In the Gaussian Case
* if two variables, X_i and X_j are jointly gaussian distributed and have a covariance of 0, cov(X_i, X_j) = 0, then X_i and X_j are **independent**
* remember the covariance matrix is just the pairwise covariance of each X_i and each X_j, where i = 1 to D, and j = 1 to D
* therefore, each element along the diagonal is just the variance for that dimension, and every off diagonal at the position (i,j) is the covariance between the two features X_i, and X_j. This can be seen below:

![Covariance-Matrix.png](attachment:Covariance-Matrix.png)

* Hence, if all the off diagonals are 0, then we have the naive bayes case (equivalent to all of the independent univariate distributions being multiplied together). We can break off each individual dimension into its own gaussian, and multiply them all together 
* if the off diagonals are **not** zero, then we need to use the most general form, which requires us to calculate the inverse and the determinant of the covariance matrix, using a more general algorithm (which is slower)
* scipy can still do this for us!

## Scipy logpdf
* luckily, because we are using scipy's logpdf function, there isn't much that we need to change
* the function either takes in a single scalar variance (circular gaussian), a diagonal covariance (axis aligned elliptical gaussian), or a full covariance
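A short sketch of the three ways to pass the covariance. The input point and the covariance values are arbitrary assumptions; the first three calls describe the same gaussian, and the last one adds a nonzero off-diagonal to show a genuinely non-naive case.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

x = np.array([0.5, -1.0, 2.0])
mean = np.zeros(3)

# single scalar variance (circular gaussian)
lp_scalar = mvn.logpdf(x, mean=mean, cov=2.0)

# vector of per-dimension variances (axis aligned elliptical gaussian)
lp_diag = mvn.logpdf(x, mean=mean, cov=np.array([2.0, 2.0, 2.0]))

# full D x D matrix that happens to be diagonal
lp_matrix = mvn.logpdf(x, mean=mean, cov=2.0 * np.eye(3))

# full covariance with a nonzero off-diagonal (features are correlated)
cov_full = np.array([[2.0, 0.8, 0.0],
                     [0.8, 2.0, 0.0],
                     [0.0, 0.0, 2.0]])
lp_full = mvn.logpdf(x, mean=mean, cov=cov_full)

print(lp_scalar, lp_diag, lp_matrix, lp_full)
```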

# Bayes Classifier
* we are just calling it "non-naive" to distinguish it from "naive"
* usually it is just called a bayes classifier
* more generally, we can have a "Bayes model"
* can do either classification or regression

---
# Non-Naive Bayes Classifier - Code
* The first difference we will see is that in the fit function, we will calculate and store the covariance instead of just the variance
    * numpy uses unbiased version of covariance, and divides by n-1 instead of n
* the next difference is the predict function now has the covariance passed in, instead of the variance
* We can see that this model performed much better!

In [55]:
class Bayes(object):
    
    def fit(self, X, Y, smoothing=10e-3):    # takes in X, Y, and smoothing parameter (note: 10e-3 == 1e-2)
        N, D = X.shape
        self.gaussians = dict()    # create an empty dictionary of the gaussian parameters 
        self.priors = dict()       # and an empty dictionary of the priors
        labels = set(Y)            # grabs all unique values in Y
        
        for c in labels:           # loop through each of the labels (0,1,2,...,9)
            current_x = X[Y == c]  # set current_x to be that for which X has a label of c, 
                                   # ex. grabbing ALL training examples where y is 5
            
            self.gaussians[c] = {                                  # at this point, current_x contains all training examples 
                'mean': current_x.mean(axis=0),                    # associated with a specific class label. Now, we find the 
                'cov': np.cov(current_x.T) + np.eye(D)*smoothing   # mean vector and full D x D covariance matrix of the pixels
            }                                                      # so that we can determine the likelihood
                                                                   # of observing certain values for the pixels, given a class label
            
            self.priors[c] = float(len(Y[Y == c])) / len(Y)  # calculate prior for each class 
                                                             # # of images for class / total # of images
        
    def score(self, X, Y):
        P = self.predict(X)
        return np.mean(P == Y)
    
    def predict(self, X): 
        N, D = X.shape
        K = len(self.gaussians)        # setting K = number of classes
        P = np.zeros((N,K))            # each prediction is a vector. Each class gets a probability, and the 
                                       # highest probability is chosen
        
        for c, g in self.gaussians.items():     # loop through all of the gaussians
            mean, cov = g['mean'], g['cov']         # get mean and covariance for each gaussian
            
            P[:, c] = mvn.logpdf(X, mean=mean, cov=cov) + np.log(self.priors[c]) # calculate N different log 
                                                                                 # pdfs at same time
        
        return np.argmax(P, axis=1)
        
if __name__ == '__main__':
    X, Y = get_data(10000)
    Ntrain = int(len(Y) / 2)
    Xtrain, Ytrain = X[:Ntrain], Y[:Ntrain]
    Xtest, Ytest = X[Ntrain:], Y[Ntrain:]
    
    bayesModel = Bayes()    
    
    # set timer to see how long it takes model to fit the training data
    t0 = datetime.now()
    bayesModel.fit(Xtrain, Ytrain)
    print ("Training Time: ", (datetime.now() - t0))

    # now get training accuracy and time this as well
    t0 = datetime.now()
    print ("Train accuracy:", bayesModel.score(Xtrain, Ytrain))
    print("Time to compute train accuracy:", (datetime.now() - t0), "Train size:", len(Ytrain))

    # now print test accuracy
    t0 = datetime.now()
    print ("Test accuracy:", bayesModel.score(Xtest, Ytest))
    print("Time to compute test accuracy:", (datetime.now() - t0), "Test size:", len(Ytest))
    print('-----------------------------------------------------------------------')
        

Reading in and transforming data...
Training Time:  0:00:00.119507
Train accuracy: 0.999
Time to compute train accuracy: 0:00:01.860900 Train size: 5000
Test accuracy: 0.937
Time to compute test accuracy: 0:00:01.921949 Test size: 5000
-----------------------------------------------------------------------


---
# Linear Discriminant Analysis and Quadratic Discriminant Analysis
* let's discuss a version of the bayes classifier that leads to an even simpler implementation 
* we have seen this before when looking at logistic regression, but now we are going to formulate just based off of bayes rule 
* There are two assumptions:
    1. Our data for each class is gaussian distributed, with each class k having its own mean and covariance, i.e.:
    $$p(X|C=k) \approx N(X; \mu_k, \Sigma_k)$$
    2. We will only consider binary classification
    
![lda.png](attachment:lda.png)

# Simplified Bayes Classifiers 
* Let's call our two classes "0" and "1"
* remember our decision rule for predicting class 1 is: if p(C=1|X) > p(C=0|X), then predict C=1
* remember also that we can replace these two expressions using bayes rule
$$\frac{P(X|C=1)P(C=1)}{P(X)} > \frac{P(X|C=0)P(C=0)}{P(X)}$$ 
* P(X) cancels out, which leaves our new decision rule as:
$$P(X|C=1)P(C=1) > P(X|C=0)P(C=0)$$ 
* where if the above expression is true, we predict class 1, else we predict 0
* now, we can go ahead and plug in our expressions for each probability distribution (each likelihood distribution)

![simplified%20bayes%201.png](attachment:simplified%20bayes%201.png)

* We can then replace the priors with alpha and 1 - alpha

![simplified%20bayes%202.png](attachment:simplified%20bayes%202.png)

### Log Both sides
* the next step is to take the log of both sides
* why does the greater sign remain valid? 
* because the log function is monotonically increasing!

### After logging both sides
* taking the log of both sides is great because it means that we can get rid of the exponential

![simplified%20bayes%203.png](attachment:simplified%20bayes%203.png)

* where above the coefficient of the gaussian is now replaced with K1 and K0

### Move everything to left side
* we can move everything to the left side so that the right side is just 0

![simplified%20bayes%204.png](attachment:simplified%20bayes%204.png)

### Expand the multiplications
* next we can expand the multiplications

![simplified%20bayes%205.png](attachment:simplified%20bayes%205.png)

### Combine Like Terms
* one trick we can do here is combine all of the parts below
* remember, sigma is a symmetric matrix, so its inverse is also symmetric 
* note: matrix rules and intuition: https://betterexplained.com/articles/matrix-multiplication/
* https://betterexplained.com/articles/linear-algebra-guide/

![simplified%20bayes%206.png](attachment:simplified%20bayes%206.png)

### Simplify Further
* After doing this, we can simplify further by using the above rule, and removing the parentheses

![simplified%20bayes%207.png](attachment:simplified%20bayes%207.png)

### Combine by degree in X
* finally we can combine these and we see that there are 3 parts
* the first part is where x shows up twice, that is what we call the quadratic term 
* then there is the part where x shows up once, that is the linear term
* then there is the part that does not depend on X, that is called the constant term 

![simplified%20bayes%208.png](attachment:simplified%20bayes%208.png)

* so what we have ended up with, is a multi dimensional quadratic equation 
* this is our prediction rule
* We plug in x into the equation above, and if it is greater than 0, predict 1, else predict 0!
* because this is a quadratic equation, we call this quadratic discriminant analysis

# Why is this better?
* the best part about this is the implementation 
* in our regular bayes classifier, we had to calculate the pdf using a library function, which required taking the matrix determinant, inverse, and so on 
* We don't need to use any log_pdf() calls
* in this form we can solve for the coefficients once, in terms of mu0, mu1, sigma0, sigma1, and alpha, plug them into the equation, and never need the means and covariances again 

![quadratic%20da.png](attachment:quadratic%20da.png)

# Save A, w, b
* of course, we can calculate A, w, and b as follows:

![coeff%20bayes%20classifier.png](attachment:coeff%20bayes%20classifier.png)

* once we do this, we never have to invert any matrices for later predictions 
* notice what happens if covariance 0 and covariance 1 are the same, that is, if both classes share the same covariance
* the squared term goes away!
* We are left with **Linear Discriminant Analysis**, LDA
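The steps above can be sketched in code. Since the figures are not reproduced here, the coefficients below are re-derived from the log of Bayes rule and may be arranged differently than in the figures; the sketch also assumes alpha = P(C=1) and the rule "predict 1 if x^T A x + w^T x + b > 0". The function names and the example parameters are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn


def qda_coefficients(mu0, sigma0, mu1, sigma1, alpha):
    """Return A, w, b so that x^T A x + w^T x + b > 0 means 'predict class 1'.

    Assumes alpha = P(C=1). The matrix inverses happen once, here.
    """
    s0_inv = np.linalg.inv(sigma0)
    s1_inv = np.linalg.inv(sigma1)
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)

    A = 0.5 * (s0_inv - s1_inv)                      # quadratic term
    w = s1_inv @ mu1 - s0_inv @ mu0                  # linear term
    b = (0.5 * mu0 @ s0_inv @ mu0                    # constant term
         - 0.5 * mu1 @ s1_inv @ mu1
         + 0.5 * (logdet0 - logdet1)
         + np.log(alpha / (1.0 - alpha)))
    return A, w, b


def discriminant(x, A, w, b):
    """Evaluate the quadratic decision function for one sample x."""
    return x @ A @ x + w @ x + b


# hypothetical two-class, two-feature parameters
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
sigma0 = np.array([[1.0, 0.3], [0.3, 2.0]])
sigma1 = np.array([[1.5, -0.2], [-0.2, 0.5]])
alpha = 0.3
x = np.array([0.7, 1.1])

A, w, b = qda_coefficients(mu0, sigma0, mu1, sigma1, alpha)
score = discriminant(x, A, w, b)

# reference: the same quantity computed through the log pdf route
reference = (mvn.logpdf(x, mu1, sigma1) + np.log(alpha)
             - mvn.logpdf(x, mu0, sigma0) - np.log(1.0 - alpha))

# with a shared covariance the quadratic term vanishes, leaving LDA
A_lda, w_lda, b_lda = qda_coefficients(mu0, sigma0, mu1, sigma0, alpha)

print(score, reference)
print(A_lda)
```

The discriminant matches the difference of log posteriors computed with `logpdf`, and prediction only needs A, w, and b.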

# Linear Discriminant Analysis
* Studying linear classifiers is interesting, because they all find a weight vector and a bias term 
* the difference is in how they do it
* there are multiple of them, they all have different properties and make different assumptions 
* each has its own strengths and weaknesses
* they all have **w** and **b**, but the way we find them is different
* we saw that with the bayes version, we need to make assumptions about the distribution of the data 
* It turns out, if these assumptions are true, then the bayes classifier is the optimal classifier!
* On another note, we have worked with logistic regression, where w and b are found using gradient descent
* Later we will talk about a Perceptron, a historically interesting linear classifier

---
# Generative vs Discriminative Models
---
# Discriminative Classifiers
* note that with all classifiers that output a probability, the probability that we make our prediction from is the posterior, p(C|X)
* we call classifiers that model that distribution directly, like logistic regression, **discriminative classifiers**
* This is because given the data, they try to learn to discriminate between each class (we start with X, get Y)

![discriminative.png](attachment:discriminative.png)

* recall from logistic regression, we use a hyperplane to discriminate between classes
* classifiers like KNN are also discriminative classifiers, because they learn to directly discriminate between classes given the input

# Generative Classifiers
* Generative classifiers are the opposite 
* they can still allow you to discriminate between classes using bayes rule, but...
* the main calculation is P(X|C)
* we start with y (The class) and model X
* the assumption here is that each class has its own structure, and therefore its own distribution of X
* so in other words, each class is kind of a "data-making machine", and each machine makes data in a different way 
* the data making machine generates data, hence the term generative 

![generative.png](attachment:generative.png)
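The "data-making machine" idea can be sketched by sampling from a fitted class-conditional model. This assumes a diagonal gaussian per class, as in the naive bayes model from earlier; the parameters in `params` are made-up values rather than anything fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical fitted parameters for two classes with 2 features each
params = {
    0: {"mean": np.array([0.0, 0.0]), "var": np.array([1.0, 1.0])},
    1: {"mean": np.array([3.0, 3.0]), "var": np.array([0.5, 2.0])},
}


def sample_from_class(c, n):
    """Draw n points from the class-conditional model p(X | C=c)."""
    g = params[c]
    d = len(g["mean"])
    # independent dimensions, so each feature is sampled from its own gaussian
    return rng.normal(loc=g["mean"], scale=np.sqrt(g["var"]), size=(n, d))


samples = sample_from_class(1, 1000)
print(samples.shape, samples.mean(axis=0))
```

A discriminative model such as logistic regression has no equivalent of this step, because it never models how X is distributed.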

# Generative vs Discriminative
* generative classifiers are theoretically satisfying, because they are grounded in the rules of probability
* each variable is modeled directly, and you can change your model P(X|C) if it is not correct, or if the result is poor 
* One advantage of this is you know how each variable influences the result
* this is very helpful if you are working with clients and they ask you to explain your model in terms of the input variables 
* One disadvantage of generative models is that discriminative models have historically worked better (Ex. deep learning)
* However, neural networks are hard to explain! 
* the linear combination of 2 variables could be physically meaningless, even if it helps your model predict