# Bayes' Theorem

## Dataset

In [41]:
import csv
from collections import Counter


def read_input_list(path, mode='r'):
    with open(path, mode) as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header
        return [row[0] for row in reader]


def read_input_counter(path, mode='r'):
    with open(path, mode) as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header
        return Counter({row[0]: int(row[1]) for row in reader})
    

neg_list = read_input_list("neg_list.csv")              # contains a list of negative reviews
pos_list = read_input_list("pos_list.csv")              # contains a list of positive reviews
neg_counter = read_input_counter("neg_counter.csv")     # contains a counter of words in negative reviews
pos_counter = read_input_counter("pos_counter.csv")     # contains a counter of words in positive reviews

print("# Positive reviews: ", len(pos_list))
print("# Negative reviews: ", len(neg_list))

# Positive reviews:  50
# Negative reviews:  50


## Naive Bayes Classifier Step by Step

### Fundamentals of Bayes' Theorem

- The general formula for Bayes' theorem is:
$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$
- To decide whether a review is positive or negative, we need the probability of each class given the words in the review.
- Bayes' theorem lets us compute both probabilities:
$$ P(\text{positive}|\text{review}) = \frac{P(\text{review}|\text{positive}) \cdot P(\text{positive})}{P(\text{review})} $$
$$ P(\text{negative}|\text{review}) = \frac{P(\text{review}|\text{negative}) \cdot P(\text{negative})}{P(\text{review})} $$
- The class with the higher probability is the prediction for the review.
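
- As a quick numeric illustration of the formula, the sketch below plugs in made-up values. The helper `bayes` and all numbers are hypothetical and are not taken from the dataset.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) computed from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, for illustration only (not taken from the dataset).
p_review = 0.005
p_pos_given_review = bayes(p_b_given_a=0.008, p_a=0.5, p_b=p_review)
p_neg_given_review = bayes(p_b_given_a=0.002, p_a=0.5, p_b=p_review)

# The class with the larger posterior is the prediction.
prediction = "positive" if p_pos_given_review > p_neg_given_review else "negative"
print(p_pos_given_review, p_neg_given_review, prediction)
```

- With these assumed values the positive posterior is roughly four times the negative one, so the review would be predicted as positive.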

---

### $P(\text{positive})$ / $P(\text{negative})$

- The prior is defined as follows (and analogously for $P(\text{negative})$):
$$ P(\text{positive}) = \frac{\text{\# positive reviews}}{\text{\# total reviews}} $$

In [42]:
total_reviews = len(neg_list) + len(pos_list)
p_pos = len(pos_list) / total_reviews
p_neg = len(neg_list) / total_reviews

print(f"A review is positive with probability of: {p_pos*100:.1f}%")
print(f"A review is negative with probability of: {p_neg*100:.1f}%")

A review is positive with probability of: 50.0%
A review is negative with probability of: 50.0%


- Before looking at any of its words, a review has a $50\%$ chance of being positive.
- The same holds for negative, because the dataset contains 50 reviews of each class.

---

### $P(\text{review}|\text{positive})$ / $P(\text{review}|\text{negative})$

- The likelihood is defined as follows. The factorisation assumes that the words are independent of each other given the class, which is what makes the classifier "naive":
$$ P(\text{review}|\text{positive}) = P(\text{word\_1}|\text{positive}) \cdot P(\text{word\_2}|\text{positive}) \cdot \ldots \cdot P(\text{word\_n}|\text{positive}) $$
$$ P(\text{word\_n}|\text{positive}) = \frac{\text{\# of word\_n in positive}}{\text{\# of words in positive}} $$
- For the example review "This crib was amazing", we get the following equation:
$$ P(\text{"This crib was amazing"}|\text{positive}) = P(\text{"This"}|\text{positive}) \cdot P(\text{"crib"}|\text{positive}) \cdot P(\text{"was"}|\text{positive}) \cdot P(\text{"amazing"}|\text{positive}) $$
$$ P(\text{"This"}|\text{positive}) = \frac{\text{\# of "This" in positive}}{\text{\# of words in positive}} $$
$$ \dots $$

In [43]:
review = "This crib was amazing"
review_words = review.split()
total_words_pos = sum(pos_counter.values())
total_words_neg = sum(neg_counter.values())
p_rev_pos = 1
p_rev_neg = 1


for word in review_words:
    n_word_pos = pos_counter[word]
    n_word_neg = neg_counter[word]
    
    p_rev_pos *= n_word_pos / total_words_pos
    p_rev_neg *= n_word_neg / total_words_neg


p_rev_pos, p_rev_neg

(0.0, 0.0)

### $P(\text{review}|\text{positive})$ / $P(\text{review}|\text{negative})$ - Modification: Smoothing

- Also called [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing).
- The problem with the current calculation is that if a single word never occurs in the positive reviews (e.g. because of a typo such as "amazin" instead of "amazing"), its factor is 0 and therefore the whole product is 0.
- To prevent this, we can use a technique called *smoothing*.
- We modify the formula for $P(\text{word\_n}|\text{positive})$ as follows:
$$ P(\text{word\_n}|\text{positive}) = \frac{\text{\# of word\_n in positive} + 1}{\text{\# of words in positive} + \text{\# of unique words in positive}} $$
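- The same goes for the negative class:
$$ P(\text{word\_n}|\text{negative}) = \frac{\text{\# of word\_n in negative} + 1}{\text{\# of words in negative} + \text{\# of unique words in negative}} $$
- Note that the denominator here adds the number of unique words in the respective class. The textbook form of add-one smoothing adds the size of the whole vocabulary (all unique words across both classes) instead.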

In [44]:
review = "This crib was amazing"
review_words = review.split()
total_words_pos = sum(pos_counter.values())
total_words_neg = sum(neg_counter.values())
p_rev_pos = 1
p_rev_neg = 1


for word in review_words:
    n_word_pos = pos_counter[word]
    n_word_neg = neg_counter[word]
    
    p_rev_pos *= (n_word_pos + 1) / (total_words_pos + len(pos_counter))
    p_rev_neg *= (n_word_neg + 1) / (total_words_neg + len(neg_counter))


p_rev_pos, p_rev_neg

(2.181371537690297e-12, 2.4484861544170253e-12)
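
- To see the effect of smoothing in isolation, the sketch below compares the unsmoothed and smoothed estimate for a seen and an unseen word. `toy_counter` and `word_prob` are hypothetical names with made-up counts; they are not the counters loaded from the CSV files.

```python
from collections import Counter

# Hypothetical toy counts, not the counters loaded from the CSV files.
toy_counter = Counter({"great": 3, "crib": 2, "love": 1})
total_words = sum(toy_counter.values())  # 6 word occurrences
unique_words = len(toy_counter)          # 3 distinct words


def word_prob(word, smooth):
    """Estimate P(word|class), optionally with add-one smoothing."""
    count = toy_counter[word]  # a Counter returns 0 for unseen words
    if smooth:
        return (count + 1) / (total_words + unique_words)
    return count / total_words


for word in ["great", "amazin"]:
    print(word, word_prob(word, smooth=False), word_prob(word, smooth=True))
```

- The unseen word "amazin" gets probability 0 without smoothing and a small non-zero probability with it, so it can no longer zero out the whole product.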

---

### $P(\text{review})$

- This term is computed much like $P(\text{review}|\text{positive})$ and $P(\text{review}|\text{negative})$.
- The difference is that we do not assume the review is positive or negative, so the words are counted across both classes:
$$ P(\text{review}) = P(\text{word\_1}) \cdot P(\text{word\_2}) \cdot \ldots \cdot P(\text{word\_n}) $$
$$ P(\text{word\_n}) = \frac{\text{\# of word\_n in all positive AND negative}}{\text{\# of words in positive AND negative}} $$
- Our final goal is to predict whether the review "This crib was amazing" is positive or negative.
- In other words, we ask whether $P(\text{positive}|\text{review})$ is greater than $P(\text{negative}|\text{review})$.
- This gives the following two equations:
$$ P(\text{positive}|\text{review}) = \frac{P(\text{review}|\text{positive}) \cdot P(\text{positive})}{P(\text{review})} $$
$$ P(\text{negative}|\text{review}) = \frac{P(\text{review}|\text{negative}) \cdot P(\text{negative})}{P(\text{review})} $$
- $P(\text{review})$ is the denominator of both equations, so it has the same value in each.
- Since we only want to compare the two probabilities, dividing both by the same positive value cannot change which one is larger.
- We can therefore ignore the denominator entirely.
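
- The sketch below checks this argument on made-up numbers: dividing both scores by any shared positive value leaves the comparison unchanged. If probabilities that sum to 1 are wanted, one option is to divide by the sum of both scores, which follows from the law of total probability. All values and names here are hypothetical.

```python
# Unnormalised scores P(review|class) * P(class); the values are hypothetical.
score_pos = 2.0e-12 * 0.5
score_neg = 3.0e-12 * 0.5

# Any shared positive denominator leaves the comparison unchanged.
for p_review in (1.0, 1e-6, 1e-12):
    assert (score_pos / p_review > score_neg / p_review) == (score_pos > score_neg)

# If probabilities that sum to 1 are needed, normalise by the sum of the scores.
evidence = score_pos + score_neg
p_pos_given_review = score_pos / evidence
p_neg_given_review = score_neg / evidence
print(p_pos_given_review, p_neg_given_review)
```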

---

### Putting it all together

- Dropping the shared denominator, we compare the following two scores:
$$ P(\text{positive}|\text{review}) \propto P(\text{review}|\text{positive}) \cdot P(\text{positive}) $$
$$ P(\text{negative}|\text{review}) \propto P(\text{review}|\text{negative}) \cdot P(\text{negative}) $$

In [45]:
def bayes_theorem(review):
    total_reviews = len(neg_list) + len(pos_list)
    p_pos = len(pos_list) / total_reviews
    p_neg = len(neg_list) / total_reviews
    
    review_words = review.split()
    total_words_pos = sum(pos_counter.values())
    total_words_neg = sum(neg_counter.values())
    p_rev_pos = 1
    p_rev_neg = 1
    
    for word in review_words:
        # p(review|class) = p(word1|class) * p(word2|class) * ... * p(wordN|class)
        n_word_pos = pos_counter[word]
        n_word_neg = neg_counter[word]
        
        p_rev_pos *= (n_word_pos + 1) / (total_words_pos + len(pos_counter))
        p_rev_neg *= (n_word_neg + 1) / (total_words_neg + len(neg_counter))
    
    # p(class|review) is proportional to p(review|class) * p(class); the shared p(review) is dropped
    p_pos_rev = p_rev_pos * p_pos 
    p_neg_rev = p_rev_neg * p_neg 
    
    return p_pos_rev, p_neg_rev

In [46]:
prob_pos, prob_neg = bayes_theorem("This movie was amazing!")

if prob_pos > prob_neg:
    print("Positive review")
else:
    print("Negative review")

Positive review
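
- One practical caveat: multiplying many small probabilities can underflow to `0.0` for long reviews. A common remedy is to add logarithms instead of multiplying probabilities. The sketch below is a log-space variant of the same calculation; `log_scores`, `toy_pos` and `toy_neg` are hypothetical names, and the toy counters stand in for the CSV-based ones.

```python
import math
from collections import Counter


def log_scores(review, pos_counter, neg_counter, p_pos=0.5, p_neg=0.5):
    """Return (log score positive, log score negative) for a review."""
    total_pos = sum(pos_counter.values())
    total_neg = sum(neg_counter.values())
    log_pos = math.log(p_pos)
    log_neg = math.log(p_neg)
    for word in review.split():
        # Same smoothed estimate as before, but summed in log space.
        log_pos += math.log((pos_counter[word] + 1) / (total_pos + len(pos_counter)))
        log_neg += math.log((neg_counter[word] + 1) / (total_neg + len(neg_counter)))
    return log_pos, log_neg


# Hypothetical toy counters standing in for the CSV-based ones.
toy_pos = Counter({"amazing": 4, "movie": 3, "was": 2})
toy_neg = Counter({"terrible": 4, "movie": 3, "was": 2})

long_review = " ".join(["amazing movie"] * 500)
log_pos, log_neg = log_scores(long_review, toy_pos, toy_neg)

# The plain product over the same 1000 words, for comparison.
plain = 1.0
for word in long_review.split():
    plain *= (toy_pos[word] + 1) / (sum(toy_pos.values()) + len(toy_pos))

print(log_pos > log_neg, plain)
```

- The log scores stay finite and can still be compared, whereas the plain product collapses to zero for this artificially long review.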


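- Let's run the classifier over all reviews and check whether any of them are misclassified.
- Note that these are the same reviews the word counts were built from, so this checks the fit on the training data rather than performance on unseen reviews.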

In [47]:
for review in pos_list:
    prob_pos, prob_neg = bayes_theorem(review)
    if prob_pos < prob_neg:
        print(f"Misclassified: {review}")

print("-" * 50)
for review in neg_list:
    prob_pos, prob_neg = bayes_theorem(review)
    if prob_pos > prob_neg:
        print(f"Misclassified: {review}")

--------------------------------------------------


## Using `scikit-learn`

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

### Formatting the Data

- To use the `scikit-learn` library, we have to format the data in a specific way.
- We want to create a vocabulary containing all the words from the dataset / reviews.
- The `CountVectorizer` class from `scikit-learn` does this for us.
    - By default, `CountVectorizer` lowercases the text and splits it into tokens of two or more word characters, so punctuation and single-character words are dropped.
    - The `.fit()` method creates the vocabulary and takes a list of strings as input.
- After fitting the `CountVectorizer`, the vocabulary is available through the `.vocabulary_` attribute.
    - This vocabulary is a Python dictionary with the following structure: `{'word': index, ...}`.
    - The indices follow the alphabetical order of the words, but the dictionary itself is not sorted by index. For the input review `"This movie was amazing"`, for example, we get the following vocabulary:
        ```py
        {
            'this': 2,
            'movie': 1,
            'was': 3,
            'amazing': 0
        }
        ```

In [54]:
samples = ["Training review one", "Second review"]

counter = CountVectorizer()
counter.fit(samples)


counter.vocabulary_, dict(sorted(counter.vocabulary_.items()))

({'training': 3, 'review': 1, 'one': 0, 'second': 2},
 {'one': 0, 'review': 1, 'second': 2, 'training': 3})

- Now we can use the `.transform()` method to count how many times each vocabulary word appears in each sample.
- It also takes a list of strings as input and returns a SciPy sparse matrix; calling `.toarray()` converts it to a dense NumPy array.
- Each row of the result holds the word counts for one sample.
- For example, if we have the following vocabulary created with the `.fit()` method:
    ```py
    {
        'training': 3,
        'review': 1,
        'one': 0,
        'second': 2
    }
    ```
    - And we have the following sample: `"one review two review"`, we get the following output:
    ```py
    [
        [1, 2, 0, 0]
    ]
    ```
    - This means:
        - At index `0`, we have `1` so the word `one` (also has index `0` in the vocabulary) appears `1` time in the sample.
        - At index `1`, we have `2` so the word `review` (also has index `1` in the vocabulary) appears `2` times in the sample.
        - The other words `second` and `training` do not appear in the sample, so they are `0`.
        - The word `two` is not part of the vocabulary, so it is ignored.
- The length of each row equals the size of the vocabulary.
- Each index in a row represents one word from the vocabulary.
- The value at that index is the number of times this word appears in the sample.

In [59]:
counts = counter.transform(["one review two review three reviews", "one one one"])

counts.toarray()

array([[1, 2, 0, 0],
       [3, 0, 0, 0]])
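
- To make the mechanics concrete, here is a minimal pure-Python sketch of what `.fit()` and `.transform()` do conceptually. It is not the library's implementation; `tokenize`, `fit` and `transform` are hypothetical helpers, and the token pattern is assumed to roughly mirror the default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())


def fit(samples):
    """Map each distinct token to a column index, in alphabetical order."""
    words = sorted({tok for s in samples for tok in tokenize(s)})
    return {word: i for i, word in enumerate(words)}


def transform(samples, vocabulary):
    """Count vocabulary words per sample; unknown words are ignored."""
    columns = sorted(vocabulary, key=vocabulary.get)  # words in index order
    rows = []
    for s in samples:
        counts = Counter(tokenize(s))
        rows.append([counts[word] for word in columns])
    return rows


vocab = fit(["Training review one", "Second review"])
rows = transform(["one review two review three reviews", "one one one"], vocab)
print(vocab)
print(rows)
```

- Words such as `two`, `three` and `reviews` never get a column because they were not seen during `fit`, which is why they do not show up in the counts.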

- Let's use our original dataset from the beginning.
- We use both the positive and the negative reviews.

In [51]:
counter = CountVectorizer()
counts = counter.fit_transform(pos_list + neg_list)

len(counter.vocabulary_), counts.toarray().shape

(1603, (100, 1603))

- The vocabulary contains the unique words from the dataset (`1603` words).
- Transforming the same reviews to create the `counts` gives the following shape: `(100, 1603)`.
    - First dimension is the number of reviews.
    - Second dimension is the number of unique words in the vocabulary we want to count.

---

### Using `MultinomialNB`

- To use the `MultinomialNB` class, we need several steps:
    - We need to create `training_counts` using the `CountVectorizer` class.
    - We need to create the labels for the training set.
- Then we create a `MultinomialNB` instance and call its `.fit()` method to train the model.

In [61]:
counter = CountVectorizer()
training_counts = counter.fit_transform(pos_list + neg_list)
print(f"Size Vocabulary: {len(counter.vocabulary_)}")
print(f"Size Dataset: {len(pos_list) + len(neg_list)}")
print(f"Shape training counts: {training_counts.shape}")


classifier = MultinomialNB()
training_labels = [1] * len(pos_list) + [0] * len(neg_list)
classifier.fit(training_counts, training_labels)

Size Vocabulary: 1603
Size Dataset: 100
Shape training counts: (100, 1603)


- To use the trained model, we call the `.predict()` method, which returns the predicted labels for the test set.
- The `test_counts` must be created with the same fitted `CountVectorizer` that we used for the training set, so that the columns match.
- If we want the probabilities of the predictions, we can use the `.predict_proba()` method.
- Its columns follow the order of `classifier.classes_`, so here the first value belongs to label `0` (negative) and the second to label `1` (positive).

In [63]:
test_sample = "This movie was amazing"
test_sample_counts = counter.transform([test_sample])

# 1 means positive review, 0 means negative review
print(f"Test Sample is classified as: {classifier.predict(test_sample_counts)}") 

prob_neg, prob_pos = classifier.predict_proba(test_sample_counts)[0]
print(f"Probability of being negative: {prob_neg:.2f}")
print(f"Probability of being positive: {prob_pos:.2f}")

Test Sample is classified as: [1]
Probability of being negative: 0.35
Probability of being positive: 0.65
