## Naive Bayes

### Bayes Rule

The conditional probabilities of two events $A$ and $B$ can be written as:

$$
P(A|B) = \frac{P( A \cap B)}{P(B)}
$$

$$
P(B|A) = \frac{P( B \cap A)}{P(A)}
$$

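Both expressions contain the same joint probability, since $P(A \cap B) = P(B \cap A)$. Solving each one for the joint probability and equating the results gives Bayes Rule: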

$$
\begin{aligned}
P(A \cap B) &= P(B)P(A|B) = P(A)P(B|A) \\
P(A|B) &= \frac{P(A)P(B|A)}{P(B)}
\end{aligned}
$$

**Remark:** Bayes Rule lets us express one conditional probability in terms of the other.
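
As a quick numerical check of the rule, the sketch below computes $P(A|B)$ twice: once directly from the definition and once through Bayes Rule. The counts are hypothetical and chosen only for illustration.

```python
# Hypothetical counts over 100 equally likely outcomes (illustrative only).
n_total = 100
n_a = 40          # outcomes where A holds
n_b = 25          # outcomes where B holds
n_a_and_b = 10    # outcomes where both A and B hold

p_a = n_a / n_total
p_b = n_b / n_total
p_a_given_b = n_a_and_b / n_b    # P(A|B) = P(A ∩ B) / P(B)
p_b_given_a = n_a_and_b / n_a    # P(B|A) = P(A ∩ B) / P(A)

# Bayes Rule: recover P(A|B) from P(B|A), P(A) and P(B).
p_a_given_b_via_bayes = p_a * p_b_given_a / p_b

print(p_a_given_b, p_a_given_b_via_bayes)
```

Both values agree up to floating-point rounding, as the derivation predicts.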

### Naive Bayes

Naive Bayes is "naive" because it assumes that the features (the words in a text) are independent of each other given the class. In other words, it scores each word without regard to its neighbors, so it ignores the fact that "am" is far more likely than "is" in a sentence starting with "I".

Naive Bayes builds a conditional probability table of words by counting the occurrences of each word in each class. For sentiment analysis, similar to feature extraction for logistic regression, we count how many times each word in the vocabulary $V$ occurs in tweets with label $1$ and in tweets with label $0$. Then, using Bayes Rule, we compare the likelihood of each class given the word sequence (i.e. the tweet). For a tweet with $m$ words $w_1, \dots, w_m$, the inference rule is the following product of per-word ratios, where a value greater than $1$ favors the positive class and a value less than $1$ favors the negative class:

$$
\prod_{i=1}^{m} \frac{P(w_i| pos)}{P(w_i|neg)}
$$
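
The counting step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `word_probabilities`, the whitespace tokenization, and the two toy tweets are all assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter

def word_probabilities(tweets, labels):
    """Return {label: {word: P(word | label)}} from raw counts (no smoothing)."""
    counts = {0: Counter(), 1: Counter()}
    for tweet, label in zip(tweets, labels):
        # Naive tokenization: lowercase and split on whitespace.
        counts[label].update(tweet.lower().split())
    probs = {}
    for label, counter in counts.items():
        n_class = sum(counter.values())  # total number of words seen in this class
        probs[label] = {word: c / n_class for word, c in counter.items()}
    return probs

# Hypothetical toy corpus: one tweet per class.
tweets = ["I am happy", "I am sad"]
labels = [1, 0]
probs = word_probabilities(tweets, labels)
print(probs[1]["happy"], probs[0]["sad"])
```

Note that a word never seen in a class simply has no entry in that class's table, which is the zero-count situation that smoothing addresses.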

Example:
Assume that we are predicting the sentiment of the tweet "I am happy today; I am learning" with a conditional probability table as follows:
<img src="images/cond_prob.png" style="zoom: 50%">

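We look up the conditional probabilities of each word in the table and multiply the ratios. A word that is missing from the table contributes no factor, which is why the seven-word tweet yields six factors: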

$$
\frac{0.20}{0.20} \times \frac{0.20}{0.20} \times \frac{0.14}{0.10} \times \frac{0.20}{0.20} \times \frac{0.20}{0.20} \times \frac{0.10}{0.10} = 1.4 > 1
$$

Therefore, the tweet is more likely to be a positive one.
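
The same computation can be sketched in code. The table entries below are read off the factors in the product above; assigning them to "I", "am", "happy", and "learning" follows the word order of the tweet and is an assumption, as are the name `likelihood_ratio` and the simple tokenization.

```python
# (P(w|pos), P(w|neg)) pairs, assumed to match the table in the example.
cond_prob = {
    "i": (0.20, 0.20),
    "am": (0.20, 0.20),
    "happy": (0.14, 0.10),
    "learning": (0.10, 0.10),
}

def likelihood_ratio(tweet, table):
    """Multiply P(w|pos) / P(w|neg) over the words of the tweet found in the table."""
    score = 1.0
    for word in tweet.lower().replace(";", " ").split():
        if word in table:  # words outside the table contribute no factor
            p_pos, p_neg = table[word]
            score *= p_pos / p_neg
    return score

score = likelihood_ratio("I am happy today; I am learning", cond_prob)
print("positive" if score > 1 else "negative", score)
```

Only "happy" has different probabilities in the two classes, so it alone moves the score away from 1.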

**Remark:** Words that are not in the vocabulary are skipped: they contribute no factor to the product.

**Remark:** Words with equal probabilities in both classes contribute a factor of $1$, so they do not change the product.

**Remark:** The inference function

With the formulation above, if a word never occurs in one of the classes, its probability for that class is 0, and the product becomes 0 (or is undefined when the zero lands in a denominator). This is a problem because many words are rare and will have zero counts in at least one class. To deal with this situation, we "smooth" the probabilities by pretending that each word occurs one extra time in each class. Mathematically, we change the probability of a word given a class from $P(w_i|class) = \frac{freq(w_i|class)}{N_{class}}$ to $P(w_i|class) = \frac{freq(w_i|class) + 1}{N_{class} + |V|}$, where $N_{class}$ is the total number of words in the class and $|V|$ is the size of the vocabulary. Note that this is equivalent to adding to each class one tweet that contains every word in $V$ exactly once.

This is called Laplace smoothing, and it guarantees that no word has zero probability in any class.
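
A minimal sketch of the smoothed estimate, assuming per-class counts are stored in a dictionary. The names `smoothed_prob`, `vocab`, and `pos_counts` and the toy counts are hypothetical.

```python
def smoothed_prob(word, class_counts, vocab):
    """Laplace-smoothed P(word | class) = (freq + 1) / (N_class + |V|)."""
    n_class = sum(class_counts.values())  # total number of words in the class
    return (class_counts.get(word, 0) + 1) / (n_class + len(vocab))

# Hypothetical vocabulary and positive-class counts; "sad" never occurs in this class.
vocab = {"i", "am", "happy", "sad"}
pos_counts = {"i": 1, "am": 1, "happy": 1}

p_sad_pos = smoothed_prob("sad", pos_counts, vocab)      # (0 + 1) / (3 + 4)
p_happy_pos = smoothed_prob("happy", pos_counts, vocab)  # (1 + 1) / (3 + 4)
total = sum(smoothed_prob(w, pos_counts, vocab) for w in vocab)
print(p_sad_pos, p_happy_pos, total)
```

The unseen word now gets a small nonzero probability, and the smoothed probabilities over the vocabulary still sum to 1 (up to floating-point rounding).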