## Naive Bayes and Discriminant Analysis

Naive Bayes algorithms are a family of powerful and easy-to-train classifiers that determine the probability of an outcome given a set of conditions using Bayes' theorem.

#### We are going to discuss the following:

1.  Bayes' theorem and its applications
2. Naive Bayes classifiers (Bernoulli, Multinomial, and Gaussian)
3.  Discriminant analysis (both linear and quadratic)

#### Naive Bayes Classification 

1. Supervised learning algorithm used for classification
2. It is based on Bayes' theorem


## What is Bayes' Theorem?

$ P(A \enspace/ \enspace B) =  \large \frac{ P (B \enspace/\enspace A) \, P (A) } { P(B) } $

$ P(A) $ :  **Prior probability** [probability of event A before event B is observed]

$ P(A / B) $ :  **Posterior probability** [probability of event A after event B is known to be true]

In the general discrete case, the formula can be re-expressed considering all possible outcomes for the random variable A:
- $ P(A / B) =  \large \frac{ P (B \enspace/ \enspace A) \, P (A) }  {\sum_{i} P(B \enspace / \enspace A_i) P( A_i ) }  $

- As the denominator is a normalization factor, the formula is often expressed as a proportionality relationship:
    - $ P(A \enspace / \enspace B) \propto { P (B \enspace / \enspace A) \, P (A) } $ 

Imagine we want to implement a very simple spam filter and we've collected 100 emails. We know that 30 are spam and that 70 are regular. So, we can say that P(Spam) = 0.3.

However, we'd like to refine this estimate using some criterion (for simplicity, let's consider a single one), for example, whether the email text is shorter than 50 characters. Therefore, our query becomes the following:

$ P( Spam \enspace/ \enspace Text < 50 char) =  \large \frac{ P (Text < 50 char \enspace/  \enspace Spam)  \, P (Spam) } { P(Text < 50 char) } $

Let's suppose that 35 emails have text shorter than 50 characters, so P(Text < 50 chars) = 0.35. Looking only into our spam folder, we discover that only 25 spam emails have short text, so that P(Text < 50 chars|Spam) = 25/30 = 0.83. The result is this:

$ P(Spam \enspace / \enspace Text < 50 chars) =  \large \frac{ 0.83 \enspace* \enspace 0.3 }  {0.35 } \normalsize \approx 0.71 $

So, after receiving a very short email, there is a 71% probability that it's spam. Now we can understand the role of P(Text < 50 chars|Spam): because we have actual data, we can measure how probable the observation is when our hypothesis (the email is spam) is true.

In other words, we have defined a likelihood (compare this concept with logistic regression), which acts as a weight between the prior (a priori) probability and the posterior (a posteriori) one:

 $ \large {P(A \enspace / \enspace B) \propto {Likelihood \, P (A) }} $ 
 
 $ \large {P_{posterior} \propto {Likelihood * P_{prior} }} $ 
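
The spam-filter arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the counts quoted in the text; the variable names are illustrative. It also recomputes the evidence P(Text < 50 chars) as the normalization sum from the general discrete formula, assuming the 10 short emails that are not spam come from the 70 regular ones.

```python
# Counts taken from the example: 100 emails, 30 spam,
# 35 short emails overall, 25 of which are spam.
n_total = 100
n_spam = 30
n_short = 35
n_short_and_spam = 25

p_spam = n_spam / n_total                        # prior P(Spam)
p_short_given_spam = n_short_and_spam / n_spam   # likelihood P(short | Spam)
p_short = n_short / n_total                      # evidence P(short)

posterior = p_short_given_spam * p_spam / p_short

# The same evidence obtained as the normalization sum over both classes
n_regular = n_total - n_spam
p_regular = n_regular / n_total
p_short_given_regular = (n_short - n_short_and_spam) / n_regular
evidence = p_short_given_spam * p_spam + p_short_given_regular * p_regular

print(round(posterior, 3))  # 0.714
print(round(evidence, 3))   # 0.35
```

Both routes give the same denominator, which is why the posterior is often written only up to proportionality.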


| Text                          | category   |
|-------------------------------|------------|
| A great game                  | Sports     |
| the election was over         | Non Sports |
| very clean match              | Sports     |
| a clean but forgettable game  | Sports     |
| it was a close election       | Non Sports |

- Using Naive Bayes, we will decide whether the sentence "A very close game" belongs to the Sports or the Non Sports category.
- In terms of probability, we have to find:
    - P(Sports / "A very close game") and P(Non Sports / "A very close game")
    - To find these probabilities, we need the probability of every word. The "naive" assumption is that the words are independent of each other given the category, so their probabilities can be multiplied:
        - P(A) * P(very) * P(close) * P(game)
    - Get the probability of each word when Sports is true
        - P("A very close game"/Sports) = P(A/Sports) * P(very/Sports) * P(close/Sports) * P(game/Sports)
    - Get the probability of each word when Non Sports is true
        - P("A very close game"/Non Sports) = P(A/Non Sports) * P(very/Non Sports) * P(close/Non Sports) * P(game/Non Sports)

### P(A/Sports) = 2/11:
1. How many times does "a" occur in the Sports category (ignoring case)? 2
2. How many words are there in the Sports category? 11

### P(very/Sports) = 1/11:
Ask the same two questions as above.

### P(close/Sports) = 0/11:
1. This case is important because the word "close" does not occur in the Sports category.
2. We cannot use 0/11, because it would make the entire product zero.
3. Here we use **Laplace smoothing**: the formula looks difficult but is very easy to understand.
    
$\large{\hat\theta_i= \frac{x_i + \alpha}{N + \alpha d} } \qquad (i=1,\ldots,d)$
    
$\large x_i =$ word count, for example how many times "a" occurs in the Sports category.

$\large N =$ total number of words in that category.

$\large d =$ total number of unique words in **all the categories**.

$\large \alpha = 1$

- The default value for α is 1.0 (in this case it is called the Laplace factor); it prevents the model from assigning zero probabilities when the frequency is zero.
- When α < 1.0, it is usually called the **Lidstone factor**. As α → 0, the effect of smoothing becomes more and more negligible and the estimate returns to the raw frequency $x_i / N$. In our example, we use the default value of 1.0.
- So this equation can be represented as:

$P(word) = \large {\frac {\text{word count} + 1} {\text{total number of words} + \text{number of unique words}}}$
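
The smoothing formula fits in a one-line function. The sketch below is illustrative only: `smoothed_probability` is a hypothetical helper name, and the counts (11 words in Sports, 14 unique words overall) are the ones from the example table.

```python
def smoothed_probability(count, total_words, vocab_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total_words + alpha * vocab_size)."""
    return (count + alpha) / (total_words + alpha * vocab_size)

# Sports category: 11 words in total; 14 unique words across both categories
p_a = smoothed_probability(2, 11, 14)        # "a" occurs twice in Sports
p_close = smoothed_probability(0, 11, 14)    # "close" never occurs in Sports

# With alpha = 0 the estimate falls back to the raw frequency, here zero
p_close_raw = smoothed_probability(0, 11, 14, alpha=0.0)

print(p_a, p_close, p_close_raw)  # 0.12 0.04 0.0
```

Setting `alpha=0.0` shows the problem that smoothing solves: a single unseen word would zero out the whole product.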



Using this, let's find the probability of every word:

P(a/Sports) = (2 + 1) / (11 + 14) = 3/25

P(close/Sports) = (0 + 1) / (11 + 14) = 1/25

| Word           | P(word/Sports)  | P(word/Non Sports)     |
|----------------|-----------------|------------------------|
| a              | (2+1)/(11+14)   | (1+1)/(9+14)           |
| very           | (1+1)/(11+14)   | (0+1)/(9+14)           |
| close          | (0+1)/(11+14)   | (1+1)/(9+14)           |
| game           | (2+1)/(11+14)   | (0+1)/(9+14)           |


- Get the probability of each word when Sports is true
    - P("A very close game"/Sports) = P(A/Sports) * P(very/Sports) * P(close/Sports) * P(game/Sports)
    - P("A very close game"/Sports) = $ 4.61 \times 10^{-5} $
        
- Get the probability of each word when Non Sports is true
    - P("A very close game"/Non Sports) = P(A/Non Sports) * P(very/Non Sports) * P(close/Non Sports) * P(game/Non Sports)
    - P("A very close game"/Non Sports) = $ 1.43 \times 10^{-5} $
        
### Conclusion
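- P("A very close game"/Sports) is higher than P("A very close game"/Non Sports), so this sentence is assigned to the Sports category.
- Strictly, each likelihood should also be multiplied by the prior of its category (3/5 for Sports and 2/5 for Non Sports, from the five training sentences). Here this does not change the result.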
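
The whole worked example can be reproduced in plain Python. This is a sketch under stated assumptions: tokenization is lowercase plus whitespace splitting (so "A" and "a" are the same word), α = 1, and the helper names `tokenize` and `likelihood` are hypothetical.

```python
from collections import Counter

training = [
    ("A great game", "Sports"),
    ("the election was over", "Non Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Non Sports"),
]

def tokenize(text):
    # Assumption: lowercase and split on whitespace
    return text.lower().split()

word_counts = {}        # category -> Counter of word frequencies
doc_counts = Counter()  # category -> number of training sentences
for text, category in training:
    word_counts.setdefault(category, Counter()).update(tokenize(text))
    doc_counts[category] += 1

# Unique words across all categories (d in the smoothing formula)
vocabulary = {word for counts in word_counts.values() for word in counts}

def likelihood(text, category, alpha=1.0):
    """Product of the smoothed word probabilities for one category."""
    counts = word_counts[category]
    total = sum(counts.values())
    result = 1.0
    for word in tokenize(text):
        result *= (counts[word] + alpha) / (total + alpha * len(vocabulary))
    return result

query = "A very close game"
scores = {}
for category in word_counts:
    prior = doc_counts[category] / len(training)
    scores[category] = prior * likelihood(query, category)

prediction = max(scores, key=scores.get)
print(prediction)  # Sports
```

For long documents the product of many small probabilities can underflow, so practical implementations usually sum logarithms instead of multiplying.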

### Naive Bayes in scikit-learn

- Scikit-learn implements several Naive Bayes variants. The three discussed here are each based on a different probability distribution:
    1. Bernoulli: a binary distribution, useful when a feature can be present or absent.
    2. Multinomial: a discrete distribution, used whenever a feature must be represented by a whole number (for example, in NLP, it can be the frequency of a term).
    3. Gaussian: a continuous distribution characterized by its mean and variance.
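
As a minimal sketch, assuming scikit-learn is installed, the Sports / Non Sports example can be fitted with `CountVectorizer` and `MultinomialNB`. The `token_pattern` override is a choice made here so that one-letter words such as "a" are kept, which matches the hand calculation; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "A great game",
    "the election was over",
    "very clean match",
    "a clean but forgettable game",
    "it was a close election",
]
labels = ["Sports", "Non Sports", "Sports", "Sports", "Non Sports"]

# The default token pattern drops one-letter tokens; this one keeps them
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(texts)  # sparse matrix of word counts

model = MultinomialNB(alpha=1.0)     # alpha=1.0 is Laplace smoothing
model.fit(X, labels)

query = vectorizer.transform(["A very close game"])
prediction = model.predict(query)[0]
print(prediction)
```

`BernoulliNB` and `GaussianNB` live in the same `sklearn.naive_bayes` module and share the `fit` / `predict` interface, so swapping the variant mostly means changing how the features are prepared.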

### Bernoulli Naive Bayes : 

https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

- Where is it used?
    - When the feature values in the dataset are binary:
        - 1 or 0
        - True or False
        - Positive or Negative
        - Yes or No
        - Success or Failure
- Bernoulli Naive Bayes uses the Bernoulli distribution:
    - P(success) = p
    - P(failure) = q = 1 - p
    - X is a random variable: any binary column in your dataset.
    - X = 1 [success]
    - X = 0 [failure]
    - We can now say X has a Bernoulli distribution.
    - The Bernoulli distribution expressed mathematically:
        - P(X = x) = $ p^x(1-p)^{1-x} $
            - X is a random variable
            - x is 1 or 0
            - Substituting 1 or 0 for x gives P(success) or P(failure)
            - Equivalently:
                - P(X = 1) = p
                - P(X = 0) = q
- Multivariate Bernoulli model (or simply Bernoulli model): it generates an indicator for each term of the vocabulary, either 1 indicating presence of the term in the document or 0 indicating absence.
- When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences. As a result, the Bernoulli model typically makes many mistakes when classifying long documents.
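
Both ideas from this section can be sketched in a few lines: the Bernoulli probability mass function, and the conversion of a document into presence/absence indicators. The helper names, the value p = 0.3, and the three-word vocabulary are hypothetical and chosen only for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p ** x * (1 - p) ** (1 - x)

def to_indicators(text, vocabulary):
    """Return 1 for each vocabulary term present in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

p = 0.3                          # hypothetical probability of success
success = bernoulli_pmf(1, p)    # equals p
failure = bernoulli_pmf(0, p)    # equals q = 1 - p

vocabulary = ["game", "election", "close"]
indicators = to_indicators("game after game after game", vocabulary)
print(indicators)  # [1, 0, 0]
```

The word "game" occurs three times but still maps to a single 1, which is the information the Bernoulli model discards and the multinomial model keeps.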

### Multinomial Naive Bayes

- A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency. If the feature vectors have n elements and each of them can assume k different values with probability $p_k$, the counts of those values follow the multinomial distribution described in this section.

- The conditional probabilities $P(x_i \mid y_j)$ are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case it is also important to include a correction parameter α (the Laplace or Lidstone smoothing factor) to avoid zero probabilities. This is the smoothed estimate $\hat\theta_i = \frac{x_i + \alpha}{N + \alpha d}$ used in the worked example earlier.

- Where can it be used?
    - When the features are the number of occurrences (frequency or count) of each word in a document.
    - More generally, when discrete counts are given.
- Multinomial distribution:
    - $P(x_1, x_2, \ldots, x_k) = \dfrac{n!}{x_1!\,x_2!\cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$. Let's break the maths into understandable pieces:
    - $n$: size of your random sample, so $x_1 + x_2 + \cdots + x_k = n$.
    - $k$: number of possible outcomes (categories).
    - $x_1, \ldots, x_k$: number of occurrences of each outcome in the random sample.
    - $p_1, \ldots, p_k$: probability of each outcome.
    - The last two points are a little confusing, so let's work through a dataset.
    - The table below gives the probability of each blood group (BG) in the entire dataset.
    
| BG | O   | A   | B   | AB   |
|----|-----|-----|-----|------|
| P  | .44 | .42 | .10 | .04  |

What is the probability that a random sample of 6 people contains 1 person with group O, 2 with A, 2 with B, and 1 with AB?

- $x_1, \ldots, x_4$: $x_1 = 1$, $x_2 = 2$, $x_3 = 2$, $x_4 = 1$: the counts of people given in the problem
- $p_1^{x_1} \cdots p_4^{x_4}$: $p_1^{x_1} = .44^{1}$ (O), $p_2^{x_2} = .42^{2}$ (A), $p_3^{x_3} = .10^{2}$ (B), $p_4^{x_4} = .04^{1}$ (AB)
- $n = 6$: the size of the random sample
- $P(x_1, x_2, x_3, x_4) = \large {\frac {6!} {1!\,2!\,2!\,1!}} \normalsize \; .44^{1} \cdot .42^{2} \cdot .10^{2} \cdot .04^{1} \approx 0.0056$
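
The blood-group calculation can be evaluated with a small function. This is a sketch, not a library routine: `multinomial_pmf` is a hypothetical helper, and the AB probability is assumed to be 0.04 so that the four probabilities sum to 1.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """n! / (x_1! ... x_k!) * p_1**x_1 * ... * p_k**x_k, with n = sum(counts)."""
    n = sum(counts)
    coefficient = factorial(n)
    for x in counts:
        coefficient //= factorial(x)  # exact integer division at every step
    return coefficient * prod(p ** x for p, x in zip(probs, counts))

# Order: O, A, B, AB. The AB value of 0.04 is an assumption (see the note above).
probs = [0.44, 0.42, 0.10, 0.04]
counts = [1, 2, 2, 1]  # 6 people in total

probability = multinomial_pmf(counts, probs)
print(round(probability, 4))  # 0.0056
```

The coefficient 6! / (1! 2! 2! 1!) = 180 counts the orderings of the six people that give the same group counts.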

### Gaussian Naive Bayes

- Gaussian Naive Bayes is useful when working with continuous values whose probabilities can be modeled using Gaussian distributions.
- Gaussian or normal distribution:
    - $p(x=v \mid C_k)=\large \frac{1}{\large \sqrt{2\pi\sigma^2_k}}\,e^{ -\large\frac{(v-\mu_k)^2}{2 \large \sigma^2_k} }$ 
    - This is the probability density of the value v in class $C_{k}$
    - ${\sigma_k}$ : standard deviation of the feature in class $C_{k}$
    - ${\sigma_k^{2}}$ : variance of the feature in class $C_{k}$
    - ${\mu_k}$ : mean of the feature in class $C_{k}$
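
A minimal sketch of how the density formula is used for classification: estimate the mean and variance of a feature per class, then compare the densities of a new value. The two classes, their sample values, and the equal class priors are hypothetical and chosen only for illustration.

```python
from math import exp, pi, sqrt
from statistics import fmean, pvariance

def gaussian_pdf(v, mean, variance):
    """Density of a normal distribution with the given mean and variance at v."""
    return exp(-((v - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)

# Hypothetical one-feature training values for two classes
samples = {
    "C1": [4.8, 5.0, 5.2],
    "C2": [6.8, 7.0, 7.2],
}

# Per-class mean and (population) variance
params = {k: (fmean(values), pvariance(values)) for k, values in samples.items()}

v = 5.1
densities = {k: gaussian_pdf(v, mean, var) for k, (mean, var) in params.items()}

# With equal priors (an assumption here), the class with the larger density wins
best = max(densities, key=densities.get)
print(best)  # C1
```

With several features, the naive independence assumption means the per-feature densities are multiplied together for each class before comparing.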