In [1]:
import pandas as pd

# Naive Bayes Classifier explained by simple example (Spam/Ham Mails)

## The Problem:

Imagine we want to classify an email as "Spam" or "Ham" (not spam) based on certain words it contains. We will use a simple Naive Bayes classifier to do this, which assumes that, given the class, the presence of each word is independent of the presence of the others.

We have the following dataset of emails, where each email is labeled as either "Spam" or "Ham" (not spam). Each email is represented by three features: whether it contains the words "free", "offer", and "meeting".

| Email   | Contains "free" | Contains "offer" | Contains "meeting" | Class      |
|---------|-----------------|------------------|---------------------|------------|
| Email 1 | Yes             | Yes              | No                  | Spam       |
| Email 2 | Yes             | No               | Yes                 | Spam       |
| Email 5 | Yes             | No               | No                  | Spam       |
| Email 3 | No              | Yes              | No                  | Ham        |
| Email 4 | No              | No               | Yes                 | Ham        |

Our goal is to classify a new email that contains the words **"free"** and **"offer"**, but **not "meeting"** as "Spam" or "Ham".

## Solution by hand

Based on the dataset, we'll calculate the probabilities for the new email being "Spam" or "Ham" and choose the class with the highest probability. We will use the Naive Bayes formula, which takes into account the prior probabilities of each class and the likelihood of the features given each class.

$$ \hat y = \arg \mathop {\max }\limits_y \; {\rm{P}}\left( y \right) \prod\nolimits_{i = 1}^n {{\rm{P}}\left( {{x_i}|y} \right)} $$
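
The decision rule above leaves the form of $P(x_i \mid y)$ open. As a sketch, assuming the Bernoulli event model (the model behind scikit-learn's `BernoulliNB`), let $x_i \in \{0, 1\}$ mark whether word $i$ is present and let $p_{iy}$ (a symbol introduced here for illustration only) be the probability that word $i$ appears in an email of class $y$. Each factor can then be written as:

$$ P(x_i \mid y) = p_{iy}^{x_i} \, (1 - p_{iy})^{1 - x_i} $$

Under this assumption an absent word contributes the factor $1 - p_{iy}$ rather than being skipped.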

### Step 1: Calculate Prior Probabilities: $P(y)$

The prior probability of a class is its relative frequency in the dataset, before any features of the email are considered. We calculate the prior probabilities for both **Spam** and **Ham**:

- $P(\text{Spam}) = \frac{\text{Number of Spam emails}}{\text{Total emails}} = \frac{3}{5} = 0.6$
- $P(\text{Ham}) = \frac{\text{Number of Ham emails}}{\text{Total emails}} = \frac{2}{5} = 0.4$
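
The same priors can be obtained by counting labels. The snippet below is a minimal sketch; the variable names (`labels`, `priors`) are illustrative, and the labels are typed in by hand from the table above.

```python
from collections import Counter

# Class labels of the five training emails, in table order
labels = ["Spam", "Spam", "Spam", "Ham", "Ham"]

# Prior = number of emails in the class / total number of emails
counts = Counter(labels)
priors = {cls: n / len(labels) for cls, n in counts.items()}

print(priors)  # {'Spam': 0.6, 'Ham': 0.4}
```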

### Step 2: Calculate Likelihoods: $P(x_i | y)$

The likelihoods are the conditional probabilities of the features (words) given the class.

| Feature    | `P(Feature\|Spam)`   | `P(Feature\|Ham)`    |
|------------|---------------------|---------------------|
| "free"     | 3/3 = 1.0           | 0/2 = 0.0           |
| "offer"    | 1/3 = 0.33          | 1/2 = 0.5           |
| "meeting"  | 1/3 = 0.33          | 1/2 = 0.5           |
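
Each entry in the table is the fraction of emails in a class that contain the word. A minimal sketch of that count, assuming the dataset is encoded by hand as 0/1 tuples (the names `emails`, `words` and `likelihoods` are illustrative):

```python
# Each row: (free, offer, meeting, label), with 1 = word present, 0 = absent
emails = [
    (1, 1, 0, "Spam"),
    (1, 0, 1, "Spam"),
    (1, 0, 0, "Spam"),
    (0, 1, 0, "Ham"),
    (0, 0, 1, "Ham"),
]
words = ["free", "offer", "meeting"]


def likelihoods(rows, cls):
    """Fraction of emails of class `cls` that contain each word (no smoothing)."""
    in_class = [r for r in rows if r[3] == cls]
    return {w: sum(r[i] for r in in_class) / len(in_class) for i, w in enumerate(words)}


for cls in ("Spam", "Ham"):
    print(cls, likelihoods(emails, cls))
```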

### Step 3: Apply Naive Bayes Formula

Now, we classify a new email that contains the words **"free"** and **"offer"**.

**Note**: We will take into account the absence of the word **"meeting"** (not_meeting), as a Bernoulli Naive Bayes model such as scikit-learn's `BernoulliNB` does. Another possible approach ignores absent words and uses only the features that are present. That approach changes the final probabilities, but in this example the predicted class is Spam either way.

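For **Spam**:
$$
P(\text{Spam} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto P(\text{free} \mid \text{Spam}) \cdot P(\text{offer} \mid \text{Spam}) \cdot P(\text{not\_meeting} \mid \text{Spam}) \cdot P(\text{Spam})
$$

Substitute the values:
$$
P(\text{Spam} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto 1 \cdot \tfrac{1}{3} \cdot \left(1 - \tfrac{1}{3}\right) \cdot 0.6 = \tfrac{2}{15} \approx 0.1333
$$

For **Ham**:
$$
P(\text{Ham} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto P(\text{free} \mid \text{Ham}) \cdot P(\text{offer} \mid \text{Ham}) \cdot P(\text{not\_meeting} \mid \text{Ham}) \cdot P(\text{Ham})
$$
Substitute the values:
$$
P(\text{Ham} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto 0 \cdot 0.5 \cdot (1 - 0.5) \cdot 0.4 = 0
$$

Current result (unnormalized scores, not yet probabilities):
- Spam: $\approx 0.1333$
- Ham: $0$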
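
The products above can be checked with a short script. This is a sketch that uses exact fractions to avoid rounding error; the names `priors`, `p_word`, `new_email` and `score` are illustrative, and the values are copied by hand from Steps 1 and 2.

```python
from fractions import Fraction as F

priors = {"Spam": F(3, 5), "Ham": F(2, 5)}

# P(word present | class), unsmoothed, from the likelihood table
p_word = {
    "Spam": {"free": F(3, 3), "offer": F(1, 3), "meeting": F(1, 3)},
    "Ham": {"free": F(0, 2), "offer": F(1, 2), "meeting": F(1, 2)},
}

# The new email: contains "free" and "offer", but not "meeting"
new_email = {"free": 1, "offer": 1, "meeting": 0}


def score(cls):
    """Unnormalized score: prior times one factor per word (p if present, 1 - p if absent)."""
    s = priors[cls]
    for word, present in new_email.items():
        p = p_word[cls][word]
        s *= p if present else 1 - p
    return s


scores = {cls: score(cls) for cls in priors}
for cls, s in scores.items():
    print(cls, float(s))
```

The Spam score comes out as exactly 2/15, and the Ham score is exactly 0 because of the single zero factor for "free".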


### The Zero-probability problem

We see that if any feature has a likelihood of 0 for a class (such as the word "free" in "Ham" emails), the entire product for that class becomes 0. That is, any email containing "free" will be classified as Spam, regardless of the other words it contains.

A common solution is Laplace smoothing, which avoids the problem by adding a constant (usually 1) to each feature's count. This ensures that no likelihood is exactly zero.


### Step 2 Revised: Calculate Likelihoods: $P(x_i | y)$ with Laplace Smoothing

With Laplace Smoothing, we add 1 to the count of each word and increase the denominator by the number of possible outcomes (2, for presence/absence of words). This ensures no probability is zero.

| Feature    | `P(Feature \| Spam)`                |   `P(Feature \| Ham)`               |
|------------|-------------------------------------|-------------------------------------|
| "free"     | $\frac{3+1}{3+2} = \frac{4}{5} = 0.8$   | $\frac{0+1}{2+2} = \frac{1}{4} = 0.25$ |
| "offer"    | $\frac{1+1}{3+2} = \frac{2}{5} = 0.4$   | $\frac{1+1}{2+2} = \frac{2}{4} = 0.5$  |
| "meeting"  | $\frac{1+1}{3+2} = \frac{2}{5} = 0.4$   | $\frac{1+1}{2+2} = \frac{2}{4} = 0.5$  |
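
The smoothed table can be reproduced with a small helper. This is a sketch under the assumptions stated above (add `alpha` to the count and `alpha * 2` to the denominator); the function name `smoothed_likelihood` and its parameters are illustrative, not a library API.

```python
def smoothed_likelihood(word_count, class_count, alpha=1.0, n_outcomes=2):
    """Laplace-smoothed estimate of P(word present | class)."""
    return (word_count + alpha) / (class_count + alpha * n_outcomes)


# (emails in the class containing the word, emails in the class), from the dataset
counts = {
    "Spam": {"free": (3, 3), "offer": (1, 3), "meeting": (1, 3)},
    "Ham": {"free": (0, 2), "offer": (1, 2), "meeting": (1, 2)},
}

smoothed = {
    cls: {word: smoothed_likelihood(k, n) for word, (k, n) in by_word.items()}
    for cls, by_word in counts.items()
}

for cls, by_word in smoothed.items():
    print(cls, by_word)
```

Setting `alpha=0` in this helper recovers the unsmoothed likelihoods, including the zero for "free" in Ham.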


### Step 3 Revised: Apply Naive Bayes Formula

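For **Spam**:
$$
P(\text{Spam} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto P(\text{free} \mid \text{Spam}) \cdot P(\text{offer} \mid \text{Spam}) \cdot P(\text{not\_meeting} \mid \text{Spam}) \cdot P(\text{Spam})
$$
Substitute the values with smoothing:
$$
P(\text{Spam} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto 0.8 \cdot 0.4 \cdot (1 - 0.4) \cdot 0.6 = 0.1152
$$

For **Ham**:
$$
P(\text{Ham} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto P(\text{free} \mid \text{Ham}) \cdot P(\text{offer} \mid \text{Ham}) \cdot P(\text{not\_meeting} \mid \text{Ham}) \cdot P(\text{Ham})
$$
Substitute the values with smoothing:
$$
P(\text{Ham} \mid \text{free}, \text{offer}, \text{not\_meeting}) \propto 0.25 \cdot 0.5 \cdot (1 - 0.5) \cdot 0.4 = 0.025
$$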

Current result (unnormalized scores, not yet probabilities):
- Spam: $0.1152$
- Ham: $0.025$

### Step 4: Normalize Probabilities

The scores from Step 3 are only proportional to the posterior probabilities. We normalize them so that the posterior probabilities of all classes sum to 1, which lets us read each value directly as a probability.

This is done by dividing each class's score by the sum of the unnormalized scores.

For **Spam**:

$$
P(\text{Spam} \mid \text{free}, \text{offer}, \text{not\_meeting}) = \frac{0.1152}{0.1152 + 0.025} \approx 0.82168331
$$

For **Ham**:

$$
P(\text{Ham} \mid \text{free}, \text{offer}, \text{not\_meeting}) = \frac{0.025}{0.1152 + 0.025} \approx 0.17831669
$$
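
The normalization step can be sketched in a few lines. The helper name `normalize` is illustrative, and the scores are the smoothed products computed by hand in the previous step.

```python
def normalize(scores):
    """Divide each unnormalized score by the total so the results sum to 1."""
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Unnormalized smoothed scores: likelihood factors times the class prior
scores = {
    "Spam": 0.8 * 0.4 * (1 - 0.4) * 0.6,
    "Ham": 0.25 * 0.5 * (1 - 0.5) * 0.4,
}

posteriors = normalize(scores)
for cls, p in posteriors.items():
    print(f"{cls}: {p:.4f}")
# Spam: 0.8217
# Ham: 0.1783
```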

### Step 5: Conclusion

$P(\text{Spam}) \approx 82.17\%$

$P(\text{Ham}) \approx 17.83\%$


The classifier predicts the new email as **Spam**.

## Solution with [scikit-learn Bernoulli Naive Bayes classifier](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
import numpy as np

# Define the dataset (emails and their corresponding class: Spam or Ham)
emails = [
    'free offer',       # spam
    'free meeting',     # spam
    'free',             # spam
    'offer',            # ham
    'meeting',          # ham
]

# Corresponding labels: 1 for spam, 0 for ham
labels = [1, 1, 1, 0, 0]

# Vectorize the data to create a bag-of-words model (binary: presence/absence of words)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)

# Initialize the Bernoulli Naive Bayes classifier
bnb = BernoulliNB(alpha=1)

# Train the classifier
bnb.fit(X, labels)

# Test it on a new email containing 'free' and 'offer', but not 'meeting'
new_email = ['free offer']
X_test = vectorizer.transform(new_email)

# Predict the class for the new email
predicted_class = bnb.predict(X_test)

# Print predicted class and probabilities for each class (columns follow bnb.classes_: Ham = 0, Spam = 1)
print(f'predicted class: {predicted_class}')
print(f'probabilities for each class: {bnb.predict_proba(X_test)}')
print('~'*80)

# Print the likelihoods P(word | class); rows follow bnb.classes_, so Ham (0) comes first, then Spam (1)
log_likelihoods = bnb.feature_log_prob_
likelihoods = np.exp(log_likelihoods) # Convert log probabilities to actual probabilities
print(f'Likelihoods (Ham first, then Spam):\n{likelihoods}')
print('~'*80)

# Print feature counts for each class, in the same row order (Ham, then Spam)
feature_counts = bnb.feature_count_
print(f'Feature counts (Ham first, then Spam):\n{feature_counts}')
print('~'*80)

# Print CountVectorizer details (column order of the arrays follows the vocabulary indices):
print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'Vectorised emails:\n{X.toarray()}')


predicted class: [1]
probabilities for each class: [[0.17831669 0.82168331]]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Likelihoods (Ham first, then Spam):
[[0.25 0.5  0.5 ]
 [0.8  0.4  0.4 ]]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Feature counts (Ham first, then Spam):
[[0. 1. 1.]
 [3. 1. 1.]]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Vocabulary: {'free': 0, 'offer': 2, 'meeting': 1}
Vectorised emails:
[[1 0 1]
 [1 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]]
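
To connect the scikit-learn output to the hand calculation, the posterior can be rebuilt from the fitted attributes `class_log_prior_` and `feature_log_prob_`. The following is a sketch, not scikit-learn's internal code: it hard-codes the vectorised matrix shown above instead of calling `CountVectorizer`, and the names `log_not_p`, `joint_log` and `manual` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: free, meeting, offer (the vocabulary order shown above)
X = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 1, 0, 0])  # 1 = spam, 0 = ham
x_new = np.array([[1, 0, 1]])  # "free offer", no "meeting"

bnb = BernoulliNB(alpha=1).fit(X, y)

log_p = bnb.feature_log_prob_         # log P(word present | class), rows ordered as bnb.classes_
log_not_p = np.log1p(-np.exp(log_p))  # log P(word absent | class)

# Log prior plus one log factor per word: present words use log_p, absent words use log_not_p
joint_log = bnb.class_log_prior_ + x_new @ log_p.T + (1 - x_new) @ log_not_p.T

# Back to probabilities, then normalize across classes
joint = np.exp(joint_log)
manual = joint / joint.sum(axis=1, keepdims=True)

print(manual)
print(np.allclose(manual, bnb.predict_proba(x_new)))
```

Working with sums of logs instead of products of probabilities gives the same result here and is less prone to underflow when there are many features.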


### Conclusion

The Bernoulli Naive Bayes classifier in scikit-learn classifies the new email containing the words "free" and "offer" (but not "meeting") as Spam (class 1), matching the hand calculation.

The predicted probabilities are:

Ham (class 0) $\approx 17.83\%$

Spam (class 1) $\approx 82.17\%$

