# Naive Bayes



- Simple classification method based on Bayes rule

- Relies on bag-of-words representation of documents

## Naive Bayes classifier

### Objective: MAP

For a document classification task, we are given

- Data: $d$ is a document represented as a bag of words with features $w_1, ..., w_k$.

- Label: $c$ is a class (e.g., spam or not spam)

We model the joint probability $P(d, c)$ and classify a document by choosing the class that maximizes the posterior probability $P(c|d)$.

$$
P(d, c) = P(c)\prod_{i=1}^kP(w_i | c)
$$

MAP (Maximum a posteriori = most likely class)

\begin{align}
c_{MAP} 
&=\underset{c \in C}{\mathrm{argmax}}\ P(c|d)\\[1em]
\text{(Bayes Rule)}&=\underset{c \in C}{\mathrm{argmax}}\ \frac{P(d | c)P(c)}{P(d)}\\[1em]
\text{($P(d)$ is constant)}&= \underset{c \in C}{\mathrm{argmax}}\ P(d | c)P(c)\\[1em]
&= \underset{c \in C}{\mathrm{argmax}}\ P(w_1, ..., w_k | c)P(c)\\[1em]
\text{Naive assumption}&= \underset{c \in C}{\mathrm{argmax}}\ P(c) \prod_{i=1}^kP(w_i | c)\\[1em]
&=\underset{c \in C}{\mathrm{argmax}} P(d, c)\\[1em]
\text{take log}&= \underset{c \in C}{\mathrm{argmax}}\ \log P(c) + \underbrace{\sum_{i=1}^k \log P(w_i | c)}_{\text{linear model}}
\end{align} 
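The last step of the derivation can be illustrated with a minimal sketch. The priors and word likelihoods below are hypothetical toy values, and the names `log_joint` and `c_map` are chosen only for illustration. The sketch scores each class by $\log P(c) + \sum_i \log P(w_i | c)$ and also shows why the log form is preferred: a direct product of many small probabilities underflows to zero.

```python
import math

# Hypothetical toy parameters (not from the notes): two classes, three words.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.5, "money": 0.4, "meeting": 0.1},
    "ham": {"free": 0.1, "money": 0.2, "meeting": 0.7},
}

def log_joint(words, c):
    """log P(c) + sum_i log P(w_i | c), i.e. log P(d, c) under the naive assumption."""
    return math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)

def c_map(words):
    """Class with the largest log joint; P(d) is dropped since it is the same for every class."""
    return max(prior, key=lambda c: log_joint(words, c))

doc = ["free", "money", "free"]
print(c_map(doc))

# For a long document the plain product underflows to 0.0,
# while the sum of logs stays finite and still comparable across classes.
long_doc = ["free"] * 2000
product = prior["spam"] * math.prod(likelihood["spam"][w] for w in long_doc)
print(product, log_joint(long_doc, "spam"))
```

Because the logarithm is monotonic, the class that maximizes the log score is the same class that maximizes the original product.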




Naive assumption: features are conditionally independent given the class.

- It goes hand in hand with the bag-of-words assumption: the position of a word in the document doesn't matter

- This is why Naive Bayes is called **naive**

### Training Algorithm

input: training set $D$ and labels $C$

For each class $c \in C$:

1. **estimate prior** $P(c)$ by MLE

   $$\log P(c) = \log\frac{N_c}{N_{\text{doc}}}$$

   where $N_{\text{doc}}$ is the number of documents in $D$

   $N_c$ is the number of documents from $D$ in class $c$.

2. Create a vocabulary $V$ of words from the training set (it is shared by all classes, so it only needs to be built once)


3. Create a mega-document by concatenating all documents $d \in D$ with class $c$.


4. For each word $w$ in vocabulary $V$

   1. count number of occurrences of $w$ in the mega-document: 
   
      $$\text{count}(w,c)$$

   2. **estimate likelihood** $P(w|c)$ by MLE with add-$\alpha$ smoothing
   
      $$
      \log P(w|c) = \log\frac{\text{count}(w,c) + \alpha}{\sum_{w' \in V} \text{count}(w',c) +\alpha |V|}
      $$ 

      Laplace (add-$\alpha$) smoothing avoids zero probabilities when a word $w$ never occurs in the training documents of class $c$.

return log prior $\log P(c)$, log likelihood $\log P(w|c)$, vocabulary $V$
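The training steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `train_naive_bayes`, the assumption that documents arrive as lists of tokens, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class labels of the same length."""
    n_doc = len(docs)
    classes = sorted(set(labels))
    # The vocabulary is shared by all classes, so it is built once.
    vocab = sorted({w for doc in docs for w in doc})

    logprior, loglikelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        # Prior: log(N_c / N_doc)
        logprior[c] = math.log(len(class_docs) / n_doc)

        # "Mega-document": word counts over all documents of class c.
        counts = Counter(w for doc in class_docs for w in doc)
        denom = sum(counts[w] for w in vocab) + alpha * len(vocab)
        # Add-alpha smoothed likelihood for every word in the vocabulary.
        loglikelihood[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return logprior, loglikelihood, vocab

# Hypothetical toy training set.
docs = [
    ["free", "money"],
    ["free", "free", "offer"],
    ["meeting", "today"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

logprior, loglikelihood, vocab = train_naive_bayes(docs, labels, alpha=1.0)
print(math.exp(loglikelihood["spam"]["free"]))     # (3 + 1) / (5 + 7)
print(math.exp(loglikelihood["spam"]["meeting"]))  # (0 + 1) / (5 + 7), nonzero thanks to smoothing
```

With $\alpha = 1$ and a vocabulary of 7 words, "meeting" never appears in a spam document yet still receives probability $1/12$ instead of zero.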

### Testing Algorithm

input: a test document, logprior, loglikelihood, labels $C$, vocabulary $V$

For each class $c \in C$:

1. Initialize log MAP to be log prior

   $$\log MAP = \log P(c)$$

2. For each word position $i$ in the test document, with word $w_i$:

   If $w_i$ is in the vocabulary $V$, add its log likelihood to log MAP; words outside $V$ are skipped.

   $$
   \log MAP = \log P(c) + \sum_{i \in \text{positions}} \log P(w_i|c) = \log \left[P(c)\prod_{i\in \text{positions}}P(w_i|c)\right]
   $$

Return class with the maximum posterior probability 

$$c_{MAP}=\underset{c \in C}{\arg\max}\ P(c)\prod_{i=1}^{k}P(w_i|c)$$
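A minimal sketch of the testing procedure, assuming the log prior and log likelihood are stored as plain dictionaries. The function name `classify` and the hand-set parameter values are hypothetical and serve only to make the example self-contained.

```python
import math

def classify(doc, logprior, loglikelihood, vocab):
    """Return the MAP class and the per-class log scores for a tokenized document."""
    scores = {}
    for c in logprior:
        score = logprior[c]  # initialize log MAP with the log prior
        for w in doc:
            if w in vocab:   # words outside the vocabulary are skipped
                score += loglikelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical hand-set parameters, standing in for the output of training.
vocab = {"free", "money", "meeting"}
logprior = {"spam": math.log(0.5), "ham": math.log(0.5)}
loglikelihood = {
    "spam": {"free": math.log(0.6), "money": math.log(0.3), "meeting": math.log(0.1)},
    "ham": {"free": math.log(0.1), "money": math.log(0.2), "meeting": math.log(0.7)},
}

# "unicorn" is not in the vocabulary, so it does not change any score.
label, scores = classify(["free", "money", "unicorn"], logprior, loglikelihood, vocab)
print(label, scores)
```

Skipping out-of-vocabulary words affects every class equally, so it does not bias the comparison between classes.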

### Advantages

- Computationally efficient with minimal memory requirements: suitable for large-scale applications.

- Robust to noise and irrelevant features: these tend to cancel each other out without affecting the results significantly.

- Very good in domains with many equally important features, whereas decision trees may suffer from fragmentation in such domains, especially when data is limited.

- Optimal if the independence assumptions hold

- A good baseline for text classification

## Naive Bayes generative model

Bayesian approaches can naturally handle missing features by simply ignoring them and computing the likelihood based only on the observed features.

There is no need to fill in or explicitly model missing values.


e.g., three coin tosses with the second outcome unobserved: $E = \{H, ?, T\} \to P(E) = P(\{H, H, T\}) + P(\{H, T, T\})$

\begin{align}
&P(x_1, x_2, \dots, x_{j-1}, ?, x_{j+1}, \dots, x_d | y)\\[1em]
&= \sum_{z_j} P(x_1, x_2, \dots, x_{j-1}, z_j, x_{j+1}, \dots, x_d | y)\\[1em]
\text{Naive assumption}&= \sum_{z_j} \left[P(z_j | y) \prod_{k \neq j} P(x_k | y)\right]\\[1em]
&=  \left[\prod_{k \neq j} P(x_k | y)\right]\sum_{z_j} P(z_j | y)\\[1em]
\text{since $\sum_{z_j} P(z_j | y) = 1$}&=  \prod_{k \neq j} P(x_k | y)
\end{align}
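The identity above can be checked numerically. The sketch below assumes binary features with hypothetical probabilities $P(x_k = 1 | y)$ and uses `None` to mark a missing value; all names are illustrative. It compares the product over observed features with an explicit sum over every possible completion of the missing ones.

```python
import itertools
import math

# Hypothetical P(x_k = 1 | y) for three binary features and two classes.
p_one = {
    "y0": [0.9, 0.2, 0.4],
    "y1": [0.3, 0.7, 0.8],
}

def feature_prob(y, k, value):
    """P(x_k = value | y) for a binary feature."""
    p = p_one[y][k]
    return p if value == 1 else 1.0 - p

def likelihood_observed(x, y):
    """Product over observed features only; None marks a missing value."""
    return math.prod(feature_prob(y, k, v) for k, v in enumerate(x) if v is not None)

def likelihood_marginalized(x, y):
    """Explicitly sum the full likelihood over every completion of the missing features."""
    missing = [k for k, v in enumerate(x) if v is None]
    total = 0.0
    for values in itertools.product([0, 1], repeat=len(missing)):
        filled = list(x)
        for k, v in zip(missing, values):
            filled[k] = v
        total += math.prod(feature_prob(y, k, v) for k, v in enumerate(filled))
    return total

x = [1, None, 0]  # the second feature is missing
for y in p_one:
    print(y, likelihood_observed(x, y), likelihood_marginalized(x, y))
```

Up to floating-point rounding the two values agree for each class, which is why the missing feature can simply be left out of the product.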