# Table of Contents

- [Probability and Bayes Rule](#probability-and-bayes-rule)
- [Naive Bayes for Sentiment Analysis](#naive-bayes-for-sentiment-analysis)
- [Laplacian Smoothing](#laplacian-smoothing)
- [Log-Likelihood](#log-likelihood)
- [Train Naive Bayes](#train-naive-bayes)

## Probability and Bayes Rule

Probability is the relative frequency with which an event occurs.
If we have a corpus of tweets, we can calculate the probability P(A) of an event A, for example that a tweet is positive (or, likewise, negative).

<center>
<img src="images/probabilities.png" width="750" alt="Conditional Word Probability"/>
</center>

We can also define a different event B with probability P(B), for example that a tweet contains the word "happy".

The probability that a tweet is both positive and contains "happy" is then given by the intersection of the positive set and the "happy" set.

<center>
<img src="images/adjoint_probability.png" width="750" alt="Conditional Word Probability"/>
</center>

Now let's consider only the tweets of the corpus that contain "happy".
Within this subset the probability of a positive tweet is much higher (75% in the example). You can do the same in the other direction, for example for the probability that a positive tweet contains "happy".

This is conditional probability: the probability of B given that A happened or, looking only at the elements of set A, the probability that an element also belongs to B.
Conditional probabilities help us reduce the sample search space. For example, suppose a specific event has already happened, i.e. we know the tweet contains "happy":

<center>
<img src="images/bayes_venn_diagram.png" width="750" alt="Conditional Word Probability"/>
</center>

Then you would only search in the blue circle above.
The numerator will be the red part and the denominator will be the blue part.
This leads us to conclude the following:
$$
P(\text{Positive} \mid \text{``happy''}) = \frac{P(\text{Positive} \cap \text{``happy''})}{P(\text{``happy''})}
$$

$$
P(\text{``happy''} \mid \text{Positive}) = \frac{P(\text{``happy''} \cap \text{Positive})}{P(\text{Positive})}
$$

And since $P(\text{``happy''} \cap \text{Positive}) = P(\text{Positive} \cap \text{``happy''})$

$$
P(\text{Positive} \mid \text{``happy''}) = P(\text{``happy''} \mid \text{Positive}) \times \frac{P(\text{Positive})}{P(\text{``happy''})}
$$

This gives us the basic Bayes equation:

$$
P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}
$$
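
As a quick numerical check of the equation above, here is a minimal Python sketch. The counts are hypothetical, chosen only so that P(Positive | "happy") comes out at 75%; they are not read from the figures, and the variable names are illustrative.

```python
# Hypothetical corpus counts (assumed for illustration only).
n_tweets = 20          # total tweets in the corpus
n_positive = 13        # tweets labelled positive
n_happy = 4            # tweets containing the word "happy"
n_positive_happy = 3   # tweets that are positive AND contain "happy"

p_positive = n_positive / n_tweets
p_happy = n_happy / n_tweets
p_joint = n_positive_happy / n_tweets

# Conditional probabilities from the definition P(X | Y) = P(X and Y) / P(Y).
p_positive_given_happy = p_joint / p_happy
p_happy_given_positive = p_joint / p_positive

# Bayes' rule: P(X | Y) = P(Y | X) * P(X) / P(Y).
bayes = p_happy_given_positive * p_positive / p_happy

print(p_positive_given_happy)
print(bayes)
```

Both printed values should agree up to floating-point rounding, since the second is just the first rewritten through Bayes' rule.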



## Naive Bayes for Sentiment Analysis

Similar to before, you will begin with two corpora: one for the positive tweets and one for the negative tweets.
From them you extract all the different words that appear, along with their counts in the positive and in the negative class.

For each word you can calculate its conditional probability within the positive and within the negative class.
This lets you find "power words", i.e. words whose probability differs markedly between the two categories.

<center>
<img src="images/word_conditional_probability.png" width="250" alt="Conditional Word Probability"/>
</center>

Sometimes $P(w_i | \text{class})$ will yield a value of zero, which makes the comparison impossible (the ratio becomes zero or undefined).
You can solve this problem by applying a smoothing process such as [Laplacian Smoothing](#laplacian-smoothing).

To classify a tweet of $m$ words, you multiply the ratios of the conditional probabilities of its words. This expression is called the Naive Bayes inference condition rule for binary classification:

$$
\prod_{i=1}^m \frac{P(w_i|pos)}{P(w_i|neg)}
$$

With this calculation, a value >1 indicates positive sentiment and a value <1 indicates negative sentiment.
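
The product rule can be sketched in a few lines of Python. The probabilities below are made up for illustration (and assumed to be already smoothed, so none is zero); `likelihood_ratio` is a hypothetical helper name, and skipping unknown words is an assumption of this sketch.

```python
import math

# Hypothetical per-word conditional probabilities (assumed, already smoothed).
p_word_given_pos = {"i": 0.20, "am": 0.20, "happy": 0.30, "sad": 0.10, "today": 0.20}
p_word_given_neg = {"i": 0.20, "am": 0.20, "happy": 0.10, "sad": 0.30, "today": 0.20}

def likelihood_ratio(tokens):
    """Product over the tweet's words of P(w | pos) / P(w | neg)."""
    # Words missing from the tables are skipped (an assumption of this sketch).
    return math.prod(
        p_word_given_pos[w] / p_word_given_neg[w]
        for w in tokens
        if w in p_word_given_pos and w in p_word_given_neg
    )

score_happy = likelihood_ratio(["i", "am", "happy", "today"])
score_sad = likelihood_ratio(["i", "am", "sad", "today"])

print(score_happy > 1)  # leans positive
print(score_sad < 1)    # leans negative
```

Neutral words contribute a factor of 1, so only "happy" and "sad" move the score away from 1 here.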



## Laplacian Smoothing

Sometimes a word never shows up in one of the classes of your corpus. It then gets a probability of zero, and the probability of an entire sequence containing it goes to zero as well.
Laplacian smoothing is a technique you can use to avoid zero probabilities.

Instead of calculating

$$
P(w_i|\text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_\text{class}} \quad \text{class} \in \{\text{Positive}, \text{Negative}\}
$$

you add one to the numerator and add the number of unique words in your entire vocabulary to the denominator:

$$
P(w_i|\text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_\text{class} + V_\text{vocabulary}}
$$

where

$$
N_\text{class} = \text{frequency of all words in class}, \qquad
V_\text{vocabulary} = \text{number of unique words in vocabulary}
$$

With this, the probabilities within each class still sum to 1, because the numerators over all words add up to $N_\text{class} + V_\text{vocabulary}$.

For a simple example we will have:

<table>
    <tr>
        <th>word</th><th>Pos</th><th>Neg</th>
        <th style="border: none; width: 20px;"></th>
        <th>word</th><th>Pos</th><th>Neg</th>
    </tr>
    <tr>
        <td>I</td><td>3</td><td>3</td>
        <td style="border: none;"></td>
        <td>I</td><td>0.19</td><td>0.19</td>
    </tr>
    <tr>
        <td>am</td><td>3</td><td>3</td>
        <td style="border: none;"></td>
        <td>am</td><td>0.19</td><td>0.19</td>
    </tr>
    <tr>
        <td>happy</td><td>2</td><td>1</td>
        <td style="border: none;"></td>
        <td>happy</td><td>0.14</td><td>0.10</td>
    </tr>
    <tr>
        <td>because</td><td>1</td><td>0</td>
        <td style="border: none;"></td>
        <td>because</td><td>0.10</td><td>0.05</td>
    </tr>
    <tr>
        <td>learning</td><td>1</td><td>1</td>
        <td style="border: none;"></td>
        <td>learning</td><td>0.10</td><td>0.10</td>
    </tr>
    <tr>
        <td>NLP</td><td>1</td><td>1</td>
        <td style="border: none;"></td>
        <td>NLP</td><td>0.10</td><td>0.10</td>
    </tr>
    <tr>
        <td>sad</td><td>1</td><td>2</td>
        <td style="border: none;"></td>
        <td>sad</td><td>0.10</td><td>0.14</td>
    </tr>
    <tr>
        <td>not</td><td>1</td><td>2</td>
        <td style="border: none;"></td>
        <td>not</td><td>0.10</td><td>0.14</td>
    </tr>
    <tr>
        <td><b>Nclass</b></td><td><b>13</b></td><td><b>13</b></td>
        <td style="border: none;"></td>
        <td><b>Sum</b></td><td><b>1</b></td><td><b>1</b></td>
    </tr>
</table>

<div align="left">
    <strong>Laplacian Smoothing</strong><br>
    V = 8
</div>
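
The smoothed probabilities can be recomputed from the raw counts with a short Python sketch. The counts are copied from the table above; `laplace_smoothed` is a hypothetical helper name, not part of any library.

```python
# Word counts per class, copied from the table above: word -> (pos, neg).
counts = {
    "I": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "NLP": (1, 1), "sad": (1, 2), "not": (1, 2),
}

def laplace_smoothed(counts, class_index):
    """Return {word: (freq + 1) / (N_class + V)} for one class column."""
    n_class = sum(c[class_index] for c in counts.values())  # N_class
    v = len(counts)                                          # V, vocabulary size
    return {w: (c[class_index] + 1) / (n_class + v) for w, c in counts.items()}

p_pos = laplace_smoothed(counts, 0)
p_neg = laplace_smoothed(counts, 1)

print(round(p_pos["happy"], 2))   # (2 + 1) / (13 + 8)
print(p_neg["because"] > 0)       # a zero count no longer gives probability zero
print(sum(p_pos.values()), sum(p_neg.values()))  # each column sums to 1 (up to rounding)
```

Note that "because" has a count of zero in the negative class, yet still receives a small non-zero probability after smoothing.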


## Log-Likelihood

For each word you can calculate the ratio of the probabilities as $\text{ratio}(w_i) = \frac{P(w_i|\text{Pos})}{P(w_i|\text{Neg})}$. The ratios below are computed from the unrounded smoothed probabilities of the [Laplacian Smoothing](#laplacian-smoothing) example ($N_\text{class} = 13$ for both classes, $V = 8$):

| word        | Pos  | Neg  | ratio |
|-------------|------|------|-------|
| I           | 0.19 | 0.19 | 1     |
| am          | 0.19 | 0.19 | 1     |
| **happy**   | 0.14 | 0.10 | **1.5** |
| **because** | 0.10 | 0.05 | **2**   |
| learning    | 0.10 | 0.10 | 1     |
| NLP         | 0.10 | 0.10 | 1     |
| **sad**     | 0.10 | 0.14 | **0.67** |
| **not**     | 0.10 | 0.14 | **0.67** |

The ratio tells you how positive or negative a single word is: values above 1 lean positive, values below 1 lean negative, and a ratio of 1 is neutral.

### Naive Bayes' inference

Previously we calculated the likelihood of a single sentence as the product, over its words, of the probability of each word in the positive class divided by its probability in the negative class.
However, if your sample is not perfectly balanced (equal number of positive and negative tweets), you also have to take into account the prior ratio, which measures how balanced the set is. A tweet is then classified as positive when
$$
\frac{P(\text{pos})}{P(\text{neg})} \cdot \prod_{i=1}^{m} \frac{P(w_i|\text{pos})}{P(w_i|\text{neg})} > 1
$$

Computing this likelihood directly runs the risk of numerical underflow, since you are multiplying a long series of numbers smaller than one.
The problem can be solved by applying the log transformation and using the property that the logarithm of a product is the sum of the logarithms.

$$
\log\left(\frac{P(\text{pos})}{P(\text{neg})} \cdot \prod_{i=1}^{m} \frac{P(w_i|\text{pos})}{P(w_i|\text{neg})}\right)
= \log\left(\frac{P(\text{pos})}{P(\text{neg})}\right) + \sum_{i=1}^{m} \log\left(\frac{P(w_i|\text{pos})}{P(w_i|\text{neg})}\right)
$$
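
The underflow problem is easy to reproduce. The following Python sketch uses a made-up worst case (2000 identical ratios of 0.5) purely for illustration: the direct product collapses to zero in floating point, while the sum of logs stays finite.

```python
import math

# 2000 identical word ratios of 0.5: a made-up worst case for illustration.
ratios = [0.5] * 2000

product = math.prod(ratios)                  # 2**-2000 is below the float range
log_sum = sum(math.log(r) for r in ratios)   # the same quantity in log space

print(product)   # underflows to 0.0
print(log_sum)   # finite negative number, equal to 2000 * log(0.5)
```

In log space the decision threshold moves from 1 to 0, since $\log(1) = 0$.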

To classify a sentence, you can now calculate for each word $\lambda(w) = \log\frac{P(w|\text{pos})}{P(w|\text{neg})}$ instead of the ratio.
Summing the $\lambda$ of the words in a tweet gives you its log-likelihood, which tells you whether the tweet is more likely positive or negative.

For example, for the sentence <span style="color: purple;">I am happy because I am learning</span> the table looks like this:

| word      | Pos  | Neg  | λ    |
|-----------|------|------|------|
| **I**     | 0.05 | 0.05 | 0    |
| am        | 0.04 | 0.04 | 0    |
| happy     | 0.09 | 0.01 | 2.2  |
| because   | 0.01 | 0.01 | 0    |
| learning  | 0.03 | 0.01 | 1.1  |
| NLP       | 0.02 | 0.02 | 0    |
| sad       | 0.01 | 0.09 | -2.2 |
| not       | 0.02 | 0.03 | -0.4 |

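The log-likelihood is the sum of the lambdas of the words in the sentence. For the example sentence this is $0 + 0 + 2.2 + 0 + 0 + 0 + 1.1 = 3.3$, which is greater than 0, so the tweet is classified as positive. This time we have a scale going from $-\infty$ (negative) to $+\infty$ (positive), with 0 as the neutral point. The score is strongly driven by the power words.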
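
The same computation as a minimal Python sketch, with the probabilities copied from the table above. Splitting on whitespace stands in for real preprocessing, which is an assumption of this sketch.

```python
import math

# Pos / Neg probabilities copied from the table above: word -> (pos, neg).
probs = {
    "I": (0.05, 0.05), "am": (0.04, 0.04), "happy": (0.09, 0.01),
    "because": (0.01, 0.01), "learning": (0.03, 0.01), "NLP": (0.02, 0.02),
    "sad": (0.01, 0.09), "not": (0.02, 0.03),
}

# lambda(w) = log(P(w | pos) / P(w | neg))
lambdas = {w: math.log(p_pos / p_neg) for w, (p_pos, p_neg) in probs.items()}

tweet = "I am happy because I am learning".split()
log_likelihood = sum(lambdas[w] for w in tweet)

print(round(log_likelihood, 1))
print("positive" if log_likelihood > 0 else "negative")
```

Only "happy" and "learning" have a non-zero $\lambda$ in this sentence, so they alone decide the outcome.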


## Train Naive Bayes

Training is different from logistic regression and deep learning as there is no gradient descent.
The training is based on counting the frequencies of words in the corpus.
It consists of a data collection step (step 0) followed by five training steps:

0) Collect and annotate the corpus, labelling each element as positive or negative
1) Preprocess the text, similar to logistic regression:
    - lowercase
    - remove punctuation
    - remove stop words
    - stemming
    - tokenize sentences
2) Compute the frequency of each word in the corpus for the positive and the negative class, $\text{freq}(w, \text{class})$
3) For each word, compute $P(w|\text{class})$ using [Laplacian Smoothing](#laplacian-smoothing), with $N_\text{class}$ and $V_\text{vocabulary}$
4) Compute $\lambda(w) = \log\frac{P(w|\text{pos})}{P(w|\text{neg})}$ for each word
5) Get the log prior as $\text{logprior} = \log\left(\frac{D_{\text{pos}}}{D_{\text{neg}}}\right)$, where $D_{\text{pos}}$ and $D_{\text{neg}}$ are the numbers of positive and negative tweets
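
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `tokenize`, `train_naive_bayes` and `predict` are hypothetical names, the preprocessing is reduced to lowercasing and whitespace splitting, the corpus is made up, and the sketch assumes both classes contain at least one tweet.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplified stand-in for step 1: lowercase + whitespace split.
    # Punctuation removal, stop words and stemming are omitted in this sketch.
    return text.lower().split()

def train_naive_bayes(tweets, labels):
    """tweets: list of str; labels: list of 1 (positive) or 0 (negative)."""
    freqs = {1: Counter(), 0: Counter()}
    for text, label in zip(tweets, labels):
        freqs[label].update(tokenize(text))        # step 2: freq(w, class)

    vocab = set(freqs[1]) | set(freqs[0])
    v = len(vocab)                                 # vocabulary size V
    n_pos = sum(freqs[1].values())                 # N_pos
    n_neg = sum(freqs[0].values())                 # N_neg

    lambdas = {}
    for w in vocab:
        p_w_pos = (freqs[1][w] + 1) / (n_pos + v)  # step 3: Laplacian smoothing
        p_w_neg = (freqs[0][w] + 1) / (n_neg + v)
        lambdas[w] = math.log(p_w_pos / p_w_neg)   # step 4: lambda

    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = math.log(d_pos / d_neg)             # step 5: log prior
    return logprior, lambdas

def predict(text, logprior, lambdas):
    # Words not seen during training contribute 0, i.e. are treated as neutral.
    return logprior + sum(lambdas.get(w, 0.0) for w in tokenize(text))

# Tiny made-up corpus, for illustration only.
tweets = ["I am happy", "I am happy because I am learning", "I am sad", "I am not happy"]
labels = [1, 1, 0, 0]
logprior, lambdas = train_naive_bayes(tweets, labels)

print(predict("I am learning", logprior, lambdas) > 0)  # classified as positive
print(predict("I am sad", logprior, lambdas) > 0)       # classified as negative
```

A score above 0 is read as positive and a score below 0 as negative. Because the toy corpus is balanced, the log prior is 0 and the decision rests entirely on the lambdas.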

