# Naive Bayes Classifier 
## Objectives 
- Describe how Bayes's Theorem can be used to make predictions of a target
- Identify the appropriate variant of Naive Bayes models for a particular business problem

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.naive_bayes import MultinomialNB, GaussianNB
    # There is also a BernoulliNB for a dataset with binary predictors
import seaborn as sns
from matplotlib import pyplot as plt
plt.style.use('fivethirtyeight')
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

> Let's take a second to walk through an example to get a feel for how Bayes's Theorem can help us with classification, specifically document classification.

![Many cans of spam](img/wall_of_spam.jpeg)

> This is the classic example: detecting email spam!

**The Problem Setup**

> We get emails that can be either emails we care about (***ham*** 🐷) or emails we don't care about (***spam*** 🥫). 
>
> We can probably look at the words in an email and get an idea of whether it is spam just by checking whether it contains red-flag words 🚩
> 
> We won't always be right, but if we see an email that uses word(s) more often associated with spam, then we can feel more confident in labeling that email as spam!

**Naive Bayes set-up:**
1. Look at spam and not spam (ham) emails
2. Identify words that suggest classification
3. Determine probability that words occur in each classification
4. Profit (classify new emails as "spam" or "ham")
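
The four steps above can be sketched end to end with scikit-learn. This is only a minimal illustration: the four training emails and the new email are made up, and `MultinomialNB` on raw word counts is one reasonable choice for text, not necessarily the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: hypothetical labeled emails, invented for illustration.
train_texts = [
    "win cash now",
    "buy cheap cash prizes",
    "meeting at noon",
    "lunch tomorrow at noon",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Steps 2-3: count words per email; the model estimates per-class
# word probabilities from these counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)

spam_clf = MultinomialNB()
spam_clf.fit(X_counts, train_labels)

# Step 4: classify a new email.
new_email = ["buy cash now"]
toy_prediction = spam_clf.predict(vectorizer.transform(new_email))[0]
print(toy_prediction)
```

Every word in the new email appears only in the toy spam examples, so the classifier labels it as spam.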

## The Naive Assumption 
Probabilities associated with different predictors are independent of each other. 

$P(A,B) = P(A\cap B) = P(A)\ P(B)$ if and only if $A$ and $B$ are independent.

In practice this assumption rarely holds exactly, but the resulting classifier often still works well.

## The Formula

Let's say the word that occurs is "cash":

$$ P(🥫 | "cash") = \frac{P("cash" | 🥫)P(🥫)}{P("cash")}$$

- $P("cash")$
    * That's just the probability of finding the word "cash" in an email: the frequency of the word!
- $P(🥫)$
    * We start with some data (_prior knowledge_), so this is the frequency of spam among all emails!
- $P("cash" | 🥫)$
    * How frequently "cash" is used in known spam emails. Count the frequency across all spam emails.
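
As a rough sketch of how these three quantities could be estimated by counting, the snippet below uses a tiny made-up set of labeled emails. The `emails` list and all variable names are hypothetical and are not part of this lesson's data.

```python
# Hypothetical toy corpus: (text, label) pairs invented for illustration.
emails = [
    ("win cash now", "spam"),
    ("cash prize inside", "spam"),
    ("meeting at noon", "ham"),
    ("lunch tomorrow", "ham"),
    ("send cash for lunch", "ham"),
]

n_total = len(emails)
n_spam = sum(label == "spam" for _, label in emails)
n_cash = sum("cash" in text.split() for text, _ in emails)
n_cash_and_spam = sum(
    "cash" in text.split() and label == "spam" for text, label in emails
)

p_word = n_cash / n_total                      # P("cash")
p_prior = n_spam / n_total                     # P(spam)
p_word_given_spam = n_cash_and_spam / n_spam   # P("cash" | spam)

# Bayes's Theorem combines the three counts-based estimates.
p_spam_given_word = p_word_given_spam * p_prior / p_word
print(p_spam_given_word)
```

Two of the three toy emails containing "cash" are spam, so the posterior comes out to 2/3, which matches counting directly.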

## Calculating That Our Email Is Spam

In [None]:
# Let's just say 2% of all emails have the word "cash" in them
p_cash = 0.02

# We normally would measure this from our data, but we'll take 
# it that 10% of all emails we collected were spam
p_spam = 0.10

# 12% of all spam emails have the word "cash"
p_cash_given_its_spam = 0.12

In [None]:
p_spam_given_cash = p_cash_given_its_spam * p_spam / p_cash
print(f'If the email has the word "cash" in it, there is a '
      f'{p_spam_given_cash:.0%} chance the email is spam')
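
As a consistency check on these numbers, the law of total probability ties $P("cash")$ to the word's frequency in each class. The sketch below restates the three values from the cells above (under `_demo` names chosen here just to avoid overwriting them) and solves for the implied frequency of "cash" in ham emails.

```python
# Values restated from the example above.
p_cash_demo = 0.02
p_spam_demo = 0.10
p_cash_given_spam_demo = 0.12
p_ham_demo = 1 - p_spam_demo

# Law of total probability:
#   P(cash) = P(cash | spam) P(spam) + P(cash | ham) P(ham)
# Solve for P(cash | ham).
p_cash_given_ham_demo = (
    p_cash_demo - p_cash_given_spam_demo * p_spam_demo
) / p_ham_demo
print(f'P("cash" | ham) = {p_cash_given_ham_demo:.4f}')

# The ham posterior is the complement of the spam posterior.
p_ham_given_cash_demo = p_cash_given_ham_demo * p_ham_demo / p_cash_demo
print(f'P(ham | "cash") = {p_ham_given_cash_demo:.2f}')
```

The ham posterior of 0.40 and the spam posterior of 0.60 sum to 1, as they must.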

## Extending It With Multiple Words
> The more words we take into account, the more certain we can be about whether an email is spam

Spam:

$$ P(🥫\ |"buy",\ "cash") \propto P("buy",\ "cash"|\ 🥫)\ P(🥫)$$

The posterior is proportional to the likelihood times the prior. This is where the assumption of independence comes in.

Because the words are assumed independent given the class:
    
$$ P("buy",\ "cash"|\ 🥫) = P("buy"|\ 🥫)\ P("cash"|\ 🥫)$$ (product of relevant likelihoods)

To turn these scores into probabilities, normalize by dividing by the sum of the spam and ham scores:

$$
P(🥫\ |"buy",\ "cash")  =
    \frac
        {P("buy"|\ 🥫)P("cash"|\ 🥫)\ P(🥫)}
        {P("buy"|\ 🥫)P("cash"|\ 🥫)\ P(🥫) + P("buy"|\ 🐷)P("cash"|\ 🐷)\ P(🐷)}
$$
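
The normalized formula can be sketched numerically. All of the word frequencies below are hypothetical values picked for illustration; only the structure (product of likelihoods times prior, then normalize) comes from the formula above.

```python
# Hypothetical priors and per-class word frequencies.
p_spam_2w, p_ham_2w = 0.10, 0.90
p_buy_given_spam, p_cash_given_spam = 0.20, 0.12
p_buy_given_ham, p_cash_given_ham = 0.02, 0.01

# Unnormalized scores: product of the word likelihoods times the prior.
score_spam = p_buy_given_spam * p_cash_given_spam * p_spam_2w
score_ham = p_buy_given_ham * p_cash_given_ham * p_ham_2w

# Normalize so the two posteriors sum to 1.
post_spam = score_spam / (score_spam + score_ham)
post_ham = score_ham / (score_spam + score_ham)
print(f"P(spam | buy, cash) = {post_spam:.3f}")
print(f"P(ham  | buy, cash) = {post_ham:.3f}")
```

With these made-up numbers, two spam-leaning words push the spam posterior to about 0.93 even though the prior for spam is only 0.10.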


# Naive Bayes Modeling Example
## Using Bayes's Theorem for Classification

Let's recall Bayes's Theorem:

$\large P(h|e) = \frac{P(h)P(e|h)}{P(e)}$

### Does this look like a classification problem?

- Suppose we have three competing hypotheses $\{h_1, h_2, h_3\}$ that would explain our evidence $e$.
    - Then we could use Bayes's Theorem to calculate the posterior probabilities for each of these three:
        - $P(h_1|e) = \frac{P(h_1)P(e|h_1)}{P(e)}$
        - $P(h_2|e) = \frac{P(h_2)P(e|h_2)}{P(e)}$
        - $P(h_3|e) = \frac{P(h_3)P(e|h_3)}{P(e)}$
        
- Suppose the evidence is a collection of elephant heights.
- Suppose each of the three hypotheses claims that the elephant whose measurements we have belongs to one of the three extant elephant species (*L. africana*, *L. cyclotis*, and *E. maximus*).

In that case the left-hand sides of these equations represent the probability that the elephant in question belongs to a given species.

If we think of the species as our target, then **this is just an ordinary classification problem**.

What about the right-hand sides of the equations? **These other probabilities we can calculate from our dataset.**

- The priors can simply be taken to be the percentages of the different classes in the dataset.
- What about the likelihoods?
    - If the relevant features are **categorical**, we can simply count the numbers of each category in the dataset. For example, if the feature is whether the elephant has tusks or not, then, to calculate the likelihoods, we'll just count the tusked and non-tusked elephants per species.
    - If the relevant features are **numerical**, we'll have to do something else. A good way of proceeding is to rely on (presumed) underlying distributions of the data. [Here](https://medium.com/analytics-vidhya/use-naive-bayes-algorithm-for-categorical-and-numerical-data-classification-935d90ab273f) is an example of using the normal distribution to calculate likelihoods. We'll follow this idea below for our elephant data.
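
For the categorical case, the counting can be sketched with `pd.crosstab`. The small `toy` DataFrame below is invented for illustration and does not reflect real tusk frequencies or the contents of the elephant dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "species": ["maximus"] * 4 + ["africana"] * 4,
    "tusks": ["yes", "no", "no", "no", "yes", "yes", "yes", "no"],
})

# Normalizing each row of the count table gives P(tusks | species).
tusk_likelihoods = pd.crosstab(toy["species"], toy["tusks"], normalize="index")
print(tusk_likelihoods)

# The priors are the class proportions.
toy_priors = toy["species"].value_counts(normalize=True)
print(toy_priors)
```

Each row of `tusk_likelihoods` sums to 1, since it is a distribution over the feature values within one species.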

## Elephant Example 

In [None]:
elephs = pd.read_csv('data/elephants.csv', usecols=['height (cm)',
                                                   'species'])

In [None]:
elephs.head()

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()

sns.kdeplot(data=elephs[elephs['species'] == 'maximus']['height (cm)'],
            ax=ax, label='maximus')
sns.kdeplot(data=elephs[elephs['species'] == 'africana']['height (cm)'],
            ax=ax, label='africana')
sns.kdeplot(data=elephs[elephs['species'] == 'cyclotis']['height (cm)'],
            ax=ax, label='cyclotis')

plt.legend();

## Naive Bayes by Hand

Suppose we want to predict the species of a new elephant whose height we've just recorded. We'll suppose the new elephant's height (in cm) is:

In [None]:
new_ht = 263

What we want to calculate is the mean and standard deviation for height for each elephant species. We'll use these to calculate the relevant likelihoods.

So:

In [None]:
max_stats = elephs[elephs['species'] == 'maximus'].describe().loc[['mean', 'std'], :]
max_stats

In [None]:
cyc_stats = elephs[elephs['species'] == 'cyclotis'].describe().loc[['mean', 'std'], :]
cyc_stats

In [None]:
afr_stats = elephs[elephs['species'] == 'africana'].describe().loc[['mean', 'std'], :]
afr_stats

In [None]:
#prior probability - before looking at any data - 
#what is the probability I'll get an elephant of any species? 
elephs['species'].value_counts()

### Calculation of Likelihoods

We'll use the PDFs of the normal distributions with the discovered means and standard deviations to calculate likelihoods. Build normal distributions with mean and scale for each elephant species. 

In [None]:
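# Look the statistics up by label: positional access such as `[0]` on a
# labeled Series is deprecated in recent pandas versions.
stats.norm(loc=max_stats.loc['mean', 'height (cm)'],
           scale=max_stats.loc['std', 'height (cm)']).pdf(new_ht)

In [None]:
stats.norm(loc=cyc_stats.loc['mean', 'height (cm)'],
           scale=cyc_stats.loc['std', 'height (cm)']).pdf(new_ht)

In [None]:
stats.norm(loc=afr_stats.loc['mean', 'height (cm)'],
           scale=afr_stats.loc['std', 'height (cm)']).pdf(new_ht)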
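
To make explicit what the `.pdf()` calls compute, here is a small sketch of the normal density written out by hand. The mean and standard deviation passed in are hypothetical round numbers, not the statistics estimated from the elephant data.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std, at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z ** 2) / (std * math.sqrt(2 * math.pi))


# Hypothetical mean and standard deviation, NOT taken from elephants.csv.
demo_density = normal_pdf(263, mean=270.0, std=20.0)
print(demo_density)
```

The result is a density value rather than a probability, which is why such values need not sum to 1 across species.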

### Posteriors

What we have just calculated are (approximations of) the likelihoods, i.e.:

- $P(height=263 | species=maximus) = 2.04\%$
- $P(height=263 | species=cyclotis) = 1.50\%$
- $P(height=263 | species=africana) = 0.90\%$

(Notice that they do NOT sum to 1: they are density values for the height within each species, not probabilities over species.) But what we'd really like to know are the posteriors, i.e. what are:

- $P(species=maximus | height=263)$?
- $P(species=cyclotis | height=263)$?
- $P(species=africana | height=263)$?

Since we have equal numbers of each species, every prior is equal to $\frac{1}{3}$. Thus we can calculate the probability of the evidence:

$P(height=263) = \frac{1}{3}(0.0204 + 0.0150 + 0.0090) = 0.0148$ (denominator)

And therefore calculate the posteriors using Bayes's Theorem:

- $P(species=maximus | height=263) = \frac{1}{3}\frac{0.0204}{0.0148} = 45.9\%$;
- $P(species=cyclotis | height=263) = \frac{1}{3}\frac{0.0150}{0.0148} = 33.8\%$;
- $P(species=africana | height=263) = \frac{1}{3}\frac{0.0090}{0.0148} = 20.3\%$.

Bayes's Theorem shows us that the largest posterior belongs to the *maximus* species. (Note also that, since the priors are all the same, the largest posterior will necessarily belong to the species with the largest likelihood!)

Therefore, the *maximus* species will be our prediction for an elephant of this height.
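
The arithmetic above can be reproduced in a few lines. The likelihood values are the rounded ones quoted in the text; the dictionary names are chosen here for illustration.

```python
# Rounded likelihoods quoted in the text above.
likelihoods = {"maximus": 0.0204, "cyclotis": 0.0150, "africana": 0.0090}

# Equal numbers of each species give equal priors.
priors = {species: 1 / 3 for species in likelihoods}

# P(height=263): sum of likelihood * prior over all species.
evidence = sum(likelihoods[s] * priors[s] for s in likelihoods)

# Bayes's Theorem for each species.
posteriors = {s: likelihoods[s] * priors[s] / evidence for s in likelihoods}

for species, post in posteriors.items():
    print(f"{species}: {post:.1%}")

prediction = max(posteriors, key=posteriors.get)
print(prediction)
```

Changing the values in `priors` (for example, if one species were much more common in the dataset) would shift the posteriors without touching the likelihoods.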

### More Dimensions

In fact, we also have elephant weight data available in addition to their heights. To accommodate multiple features we can make use of multivariate normal distributions, i.e. normal distributions generalized to multiple dimensions.
![multivariate-normal](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/MultivariateNormal.png/440px-MultivariateNormal.png)

For multiple predictors, we make the simplifying assumption that **our predictors are probabilistically independent**. This will often be unrealistic, but it simplifies our calculations a great deal.
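
To see what the independence assumption buys us, the sketch below checks that a product of one-dimensional normal densities equals a multivariate normal density whose covariance matrix is diagonal. The means and standard deviations are hypothetical and are not estimated from the elephant data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-species statistics (height in cm, weight in lbs);
# these are NOT estimated from elephants.csv.
demo_means = np.array([270.0, 7000.0])
demo_stds = np.array([20.0, 800.0])
x_new = np.array([263.0, 7009.0])

# Naive (independent) likelihood: product of one-dimensional densities.
naive_likelihood = np.prod(
    stats.norm(loc=demo_means, scale=demo_stds).pdf(x_new)
)

# Equivalent: multivariate normal with a diagonal covariance matrix.
diag_likelihood = stats.multivariate_normal(
    mean=demo_means, cov=np.diag(demo_stds ** 2)
).pdf(x_new)

print(naive_likelihood, diag_likelihood)
```

A full (non-diagonal) covariance matrix would additionally model the correlation between height and weight, which goes beyond the naive assumption.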

In [None]:
elephants = pd.read_csv('data/elephants.csv',
                       usecols=['height (cm)', 'weight (lbs)', 'species'])

In [None]:
elephants.head()

In [None]:
maximus = elephants[elephants['species'] == 'maximus']
cyclotis = elephants[elephants['species'] == 'cyclotis']
africana = elephants[elephants['species'] == 'africana']

Suppose our new elephant with a height of 263 cm also has a weight of 7009 lbs.

In [None]:
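# `species` is a text column, so drop it before computing means and covariances.
# Note: a full covariance matrix also models the height-weight correlation;
# a strictly naive model would keep only its diagonal.
max_num = maximus.drop(columns='species')
likeli_max = stats.multivariate_normal(mean=max_num.mean(),
                                       cov=max_num.cov()).pdf([263, 7009])
likeli_max

In [None]:
cyc_num = cyclotis.drop(columns='species')
likeli_cyc = stats.multivariate_normal(mean=cyc_num.mean(),
                                       cov=cyc_num.cov()).pdf([263, 7009])
likeli_cyc

In [None]:
afr_num = africana.drop(columns='species')
likeli_afr = stats.multivariate_normal(mean=afr_num.mean(),
                                       cov=afr_num.cov()).pdf([263, 7009])
likeli_afr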

#### Posteriors

In [None]:
# With equal priors the 1/3 factors cancel, so each posterior is just
# the likelihood divided by the sum of the likelihoods.
total_likeli = sum([likeli_max, likeli_cyc, likeli_afr])
post_max = likeli_max / total_likeli
post_cyc = likeli_cyc / total_likeli
post_afr = likeli_afr / total_likeli

print(post_max)
print(post_cyc)
print(post_afr)

### [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [None]:
gnb = GaussianNB(priors=[1/3, 1/3, 1/3])

In [None]:
X = elephants.drop('species', axis=1)
y = elephants['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gnb.fit(X_train, y_train)

In [None]:
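# Pass a DataFrame with the training column names to avoid a feature-name warning.
# The columns of the result follow the order of gnb.classes_.
gnb.predict_proba(pd.DataFrame([[263, 7009]], columns=X.columns))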

In [None]:
#accuracy 
gnb.score(X_test, y_test)

In [None]:
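# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement.
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(gnb, X_test, y_test);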

## Pros and Cons 
**Pros:** 
- It is a simple approach that is fast to train and often surprisingly accurate.
- Naive Bayes has very low computation cost.
- It can work efficiently on large datasets.
- It tends to perform better with discrete (categorical) features than with continuous ones, where a distribution such as the normal has to be assumed.
- It can be used for multiclass prediction problems.
- It also performs well on text analytics problems.
- When the assumption of independence holds, a Naive Bayes classifier can perform better than other models like logistic regression.
- Very few hyperparameters to tune (mainly the smoothing strength)!

**Cons:**
- Correlated features should be removed, because they are effectively counted twice in the model, which can over-inflate their importance.
- If a categorical variable has a category in the test set that was not observed in the training set, the model assigns it a probability of zero, which zeroes out the whole product of likelihoods. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace smoothing. Sklearn's count-based classifiers (such as `MultinomialNB` and `BernoulliNB`) apply Laplace smoothing by default (`alpha=1.0`). For more info on smoothing [go here](https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf)
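
The zero-frequency problem and its fix can be sketched by hand. The word counts and vocabulary below are hypothetical, and `word_likelihood` is a helper written for this illustration, not a scikit-learn function.

```python
from collections import Counter

# Hypothetical word counts observed in spam training emails.
spam_counts = Counter({"cash": 3, "buy": 2, "now": 1})
vocabulary = ["cash", "buy", "now", "meeting"]  # "meeting" never seen in spam


def word_likelihood(word, counts, vocab, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing of strength alpha."""
    total = sum(counts[w] for w in vocab)
    return (counts[word] + alpha) / (total + alpha * len(vocab))


# Without smoothing, an unseen word gets probability zero.
unsmoothed = word_likelihood("meeting", spam_counts, vocabulary, alpha=0.0)

# With alpha=1, every word gets one extra pseudo-count.
smoothed = word_likelihood("meeting", spam_counts, vocabulary, alpha=1.0)

print(unsmoothed, smoothed)
```

With smoothing, the unseen word gets a small but nonzero likelihood, so a single unfamiliar word no longer forces the whole product to zero.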