## Naive Bayes

### Bayes Rule

Conditional probabilities of two events $A$ and $B$ can be written as 

$$
P(A|B) = \frac{P( A \cap B)}{P(B)}
$$

$$
P(B|A) = \frac{P( B \cap A)}{P(A)}
$$

We obtain the Bayes Rule by solving both equations for $P(A \cap B)$ and equating the results:

$$
\begin{aligned}
P(A \cap B) &= P(B)P(A|B) = P(A)P(B|A) \\
P(A|B) &= \frac{P(A)P(B|A)}{P(B)}
\end{aligned}
$$

**Remark:** Bayes Rule allows expressing conditional probabilities in terms of each other.
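
As a quick numerical check of the rule, the sketch below computes $P(A|B)$ twice: once from its definition and once through Bayes Rule. The probability values are made up for illustration and do not come from any dataset.

```python
# Hypothetical probabilities for two events A and B (illustrative values only).
p_a_and_b = 0.12   # P(A ∩ B)
p_a = 0.30         # P(A)
p_b = 0.40         # P(B)

# Conditional probabilities from their definitions.
p_a_given_b = p_a_and_b / p_b   # P(A|B)
p_b_given_a = p_a_and_b / p_a   # P(B|A)

# Bayes Rule: recover P(A|B) from P(B|A), P(A), and P(B).
p_a_given_b_bayes = p_a * p_b_given_a / p_b

# Both routes agree up to floating-point rounding.
print(p_a_given_b, p_a_given_b_bayes)
```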

### Naive Bayes

Naive Bayes is *naive* because it assumes that the features (the words in a text) are independent of each other given the class. In other words, it assumes that the presence of "I" in a sentence does not change how likely "am" or "is" is to occur, which clearly does not hold in real text.

Naive Bayes is based on a conditional probability table of words, constructed by counting the occurrences of each word in each class. For sentiment analysis, as in feature extraction for logistic regression, we count how many times each word in the vocabulary $V$ occurs in tweets with label $1$ and with label $0$. We then predict the sentiment of a tweet by comparing the probability of its word sequence under both classes. For a tweet with words $w_1, \dots, w_m$, we predict positive if the following product is greater than $1$:

$$
\prod_{i=1}^{m} \frac{P(w_i| pos)}{P(w_i|neg)}
$$

**Example:**

Assume that we are predicting the sentiment of the tweet "I am happy today; I am learning" with a conditional probability table as follows:
<img src="images/cond_prob.png" style="zoom: 50%">

We look up the conditional probabilities of each word in the table and write:


$$
\overbrace{\frac{0.20}{0.20}}^\text{I} \times \overbrace{\frac{0.20}{0.20}}^\text{am} \times \overbrace{\frac{0.14}{0.10}}^{\text{happy}} \times \overbrace{\frac{0.20}{0.20}}^\text{I} \times \overbrace{\frac{0.20}{0.20}}^\text{am} \times \overbrace{\frac{0.10}{0.10}}^{\text{learning}} = 1.4 > 1
$$

Therefore, the tweet is more likely to be a positive one.

**Remark:** We omit the words not in the vocabulary (e.g. today) by not including any term.

**Remark:** The words with equal class probabilities (I and am) do not contribute to the product.

**Remark:** Unlike feature extraction in logistic regression, we consider each occurrence of the word.
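
The example can be reproduced with a few lines of Python. This is a minimal sketch: the probabilities are the ones used in the product above, while the dictionary `cond_prob`, the function `likelihood_ratio`, and the simple lowercase tokenization are assumptions made for illustration.

```python
# (P(w|pos), P(w|neg)) for the words used in the worked example above.
cond_prob = {
    "i": (0.20, 0.20),
    "am": (0.20, 0.20),
    "happy": (0.14, 0.10),
    "learning": (0.10, 0.10),
}

def likelihood_ratio(tokens, cond_prob):
    """Product of P(w|pos) / P(w|neg) over every in-vocabulary token occurrence."""
    ratio = 1.0
    for w in tokens:
        if w not in cond_prob:
            # Out-of-vocabulary word (e.g. "today"): contributes no term.
            continue
        p_pos, p_neg = cond_prob[w]
        ratio *= p_pos / p_neg  # repeated words are multiplied in each time
    return ratio

# Assumed tokenization: lowercase, punctuation already removed.
tokens = "i am happy today i am learning".split()
ratio = likelihood_ratio(tokens, cond_prob)
print("positive" if ratio > 1 else "negative")
```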

**Remark:** The inference function omits the prior probabilities $P(pos)$ and $P(neg)$ since we are dealing with a balanced dataset in this course, that is, $P(pos) = P(neg) = 0.5$. For an imbalanced dataset, we must include the priors and use the formula:

$$
\frac{P(pos)}{P(neg)}\prod_{i=1}^{m} \frac{P(w_i| pos)}{P(w_i|neg)}
$$
The product term in this formulation is called the *likelihood*.

The inference rule multiplies many small numbers, which can cause numerical underflow. To prevent this, we work with the *log-likelihood* and *log-prior* instead: $\log x$ is monotonically increasing, so it preserves the comparison between the classes, and it maps very small positive numbers to negative numbers of moderate magnitude. With the log, the inference function becomes:

$$
\log \frac{P(pos)}{P(neg)} + \sum_{i=1}^{m} \log \frac{P(w_i| pos)}{P(w_i|neg)}
$$

For brevity, we use $\lambda(w_i)$ to denote $\log \frac{P(w_i| pos)}{P(w_i|neg)}$.

**Remark:** Before taking the $\log$, we compared the likelihood term with $1$ to decide whether a tweet is positive. Now we compare with $0$, since $\log 1 = 0$.
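
The sketch below illustrates both points: a long product of small probabilities underflows to zero in floating point while the sum of logs stays finite, and the log score is compared with $0$ instead of $1$. The function `log_score`, its parameters, and the value `1e-3` are hypothetical; the word probabilities are those of the example tweet.

```python
import math

# Multiplying 400 probabilities of 1e-3 underflows to exactly 0.0 ...
product = 1.0
for _ in range(400):
    product *= 1e-3

# ... while the equivalent sum of logs remains a finite number.
log_sum = sum(math.log(1e-3) for _ in range(400))

def log_score(tokens, cond_prob, p_pos=0.5, p_neg=0.5):
    """Log-prior plus the sum of log ratios; a score > 0 means positive."""
    score = math.log(p_pos / p_neg)
    for w in tokens:
        if w in cond_prob:  # skip out-of-vocabulary words
            pw_pos, pw_neg = cond_prob[w]
            score += math.log(pw_pos / pw_neg)
    return score

# (P(w|pos), P(w|neg)) for the words of the example tweet.
cond_prob = {
    "i": (0.20, 0.20),
    "am": (0.20, 0.20),
    "happy": (0.14, 0.10),
    "learning": (0.10, 0.10),
}
tokens = "i am happy today i am learning".split()
score = log_score(tokens, cond_prob)
print("positive" if score > 0 else "negative")
```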

If a word has zero occurrences in a class, its conditional probability is $0$, which zeroes out the entire product in the product formulation and makes the logarithm undefined in the log-likelihood. To deal with this situation, we *smooth* the probabilities by adding one to every count, so that each word appears to occur in each class at least once. Mathematically, we change how we compute the probability of a word given a class from $P(w_i|class) = \frac{freq(w_i|class)}{N_{class}}$ to $P(w_i|class) = \frac{freq(w_i|class) + 1}{N_{class} + |V|}$, where $N_{class}$ is the total number of word occurrences in the class and $|V|$ is the vocabulary size. Note that this is equivalent to adding to each class one tweet that contains every word in $V$ exactly once.

This is called **Laplace Smoothing**.
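
A minimal sketch of the smoothed estimate follows. The counts in `freqs` are invented for illustration, and the names `smoothed_prob`, `freqs`, and `n_class` are assumptions rather than part of any particular library.

```python
def smoothed_prob(word, label, freqs, n_class, vocab_size):
    """P(word|label) with add-one (Laplace) smoothing."""
    return (freqs.get((word, label), 0) + 1) / (n_class[label] + vocab_size)

# Hypothetical counts: freqs[(word, label)] = occurrences of word in that class.
freqs = {("happy", 1): 3, ("i", 1): 2, ("sad", 0): 2, ("i", 0): 2}
vocab = {w for w, _ in freqs}

# N_class: total number of word occurrences in each class.
n_class = {0: 0, 1: 0}
for (_, label), count in freqs.items():
    n_class[label] += count

# "sad" never occurs in the positive class, yet its probability is not zero.
p_sad_pos = smoothed_prob("sad", 1, freqs, n_class, len(vocab))

# The smoothed probabilities of each class still sum to 1 over the vocabulary.
total_pos = sum(smoothed_prob(w, 1, freqs, n_class, len(vocab)) for w in vocab)
print(p_sad_pos, total_pos)
```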

We can summarize the Naive Bayes for sentiment analysis in 5 steps:

1. Preprocess the tweets
2. Compute $freq(w, class), \forall w \in V$
3. Compute $P(w|neg), P(w|pos), \forall w \in V$ -- including Laplace Smoothing
4. Compute $\lambda (w), \forall w \in V$
5. Compute the log-prior $\log \frac{P(pos)}{P(neg)}$
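
The five steps can be sketched end to end as follows. This is an illustrative implementation under stated assumptions, not the course's reference code: `preprocess` is a placeholder that only lowercases and splits on whitespace, the function names are hypothetical, and the four training tweets are made up.

```python
import math
from collections import Counter

def preprocess(tweet):
    # Step 1 (placeholder, assumption): lowercase and split on whitespace.
    return tweet.lower().split()

def train_naive_bayes(tweets, labels):
    # Step 2: freq(w, class) for every word and class.
    freqs = Counter()
    for tweet, y in zip(tweets, labels):
        for w in preprocess(tweet):
            freqs[(w, y)] += 1
    vocab = {w for w, _ in freqs}
    n_pos = sum(c for (_, y), c in freqs.items() if y == 1)
    n_neg = sum(c for (_, y), c in freqs.items() if y == 0)

    # Steps 3 and 4: smoothed P(w|pos), P(w|neg), then lambda(w).
    lambdas = {}
    for w in vocab:
        p_pos = (freqs[(w, 1)] + 1) / (n_pos + len(vocab))
        p_neg = (freqs[(w, 0)] + 1) / (n_neg + len(vocab))
        lambdas[w] = math.log(p_pos / p_neg)

    # Step 5: log-prior from the number of tweets in each class
    # (assumes both classes are present in the training data).
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = len(labels) - d_pos
    log_prior = math.log(d_pos / d_neg)
    return log_prior, lambdas

def predict(tweet, log_prior, lambdas):
    # Out-of-vocabulary words contribute 0 to the score.
    score = log_prior + sum(lambdas.get(w, 0.0) for w in preprocess(tweet))
    return 1 if score > 0 else 0

# Hypothetical toy training set.
tweets = ["I am happy", "great happy day", "I am sad", "bad sad day"]
labels = [1, 1, 0, 0]
log_prior, lambdas = train_naive_bayes(tweets, labels)
print(predict("happy day", log_prior, lambdas))
```

Storing only $\lambda(w)$ and the log-prior keeps prediction to a dictionary lookup and a sum per word.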
