# Naive Bayes Classifier

In [73]:
import csv
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from utils import counter_vocabulary

In [28]:
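def read_input_list(path, mode='r'):
    """Return the first column of a CSV file as a list, skipping the header."""
    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(path, mode, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header
        return [row[0] for row in reader]


def read_input_counter(path, mode='r'):
    """Return a Counter built from (word, count) rows, skipping the header."""
    with open(path, mode, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header
        return Counter({row[0]: int(row[1]) for row in reader})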
    

neg_list = read_input_list("neg_list.csv")
pos_list = read_input_list("pos_list.csv")
neg_counter = read_input_counter("neg_counter.csv")
pos_counter = read_input_counter("pos_counter.csv")

## 2. Investigate the Data

**Task 1**  
- Let’s look at the data given to us. 
- Print `pos_list[0]`. 
- Would you classify this review as positive or negative?

<br>

**Task 2**  
- Take a look at the first review in `neg_list` as well. 
- Does that one look negative?

<br>

**Task 3**  
- We’ve also created a Counter object for all of the positive reviews and one for all of the negative reviews. 
- These counters are like Python dictionaries — you could find the number of times the word “baby” was used in the positive reviews by printing `pos_counter['baby']`.
- Print the number of times the word “crib” was used in the positive and negative reviews. In which set was it used more often?

In [27]:
# Task 1
print(pos_list[0])  # positive

# Task 2
print(neg_list[0])  # negative

# Task 3
print(pos_counter["crib"])

Perfect for new parents. We were able to keep track of baby's feeding, sleep and diaper change schedule for the first two and a half months of her life. Made life easier when the doctor would ask questions about habits because we had it all right there!
I wanted to love this, but it was pretty expensive for only a few months worth of calendar pages.  I ended up buying a regular weekly planner - 55% OFF! - The Planner - that is 8 1/2 x 11 and has all seven days on the right page and the left page has room to write a To Do List and Goals.  I found this to be more helpful because I could mark each day's eating and sleeping blocks, then also see them side by side - I could see her patterns more easily with a weekly view.  This planner was cute, just not what I wanted.
1
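
Task 3 asks for the count in both sets, while the cell above prints only the positive one. A minimal sketch of the full comparison follows, using small made-up counters (the `demo_` names and their counts are illustrative assumptions, not the real data). It also shows that a `Counter` returns `0` for a word it has never seen.

```python
from collections import Counter

# Hypothetical stand-ins for pos_counter / neg_counter (made-up counts).
demo_pos_counter = Counter({"crib": 3, "love": 5})
demo_neg_counter = Counter({"crib": 7, "broke": 2})

crib_in_pos = demo_pos_counter["crib"]
crib_in_neg = demo_neg_counter["crib"]
print(crib_in_pos, crib_in_neg)

more_common_in = "positive" if crib_in_pos > crib_in_neg else "negative"
print(f'"crib" appears more often in the {more_common_in} reviews')

# A Counter returns 0 for words it has never seen instead of raising KeyError.
missing = demo_pos_counter["cribb"]
print(missing)
```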


## 3. Bayes Theorem I

**Task 1**  
- Find the total number of positive reviews by finding the length of `pos_list`.
- Do the same for `neg_list`.
- Add those two numbers together and save the sum in a variable called `total_reviews`.

<br>

**Task 2**  
- Create variables named `percent_pos` and `percent_neg`. 
- `percent_pos` should be the number of positive reviews divided by `total_reviews`. 
- Do the same for `percent_neg`.

<br>

**Task 3**  
- Print `percent_pos` and `percent_neg`. 
- They should add up to 1!

In [None]:
# Task 1
total_reviews = len(neg_list) + len(pos_list)
print(total_reviews)  

# Task 2 / 3
percent_pos = len(pos_list) / total_reviews 
print(percent_pos)
percent_neg = len(neg_list) / total_reviews 
print(percent_neg)

100
0.5
0.5
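
`percent_pos` and `percent_neg` play the role of the priors in Bayes' theorem. For reference, the quantity the following sections build up to can be written as:

$$ P(\text{positive} \mid \text{review}) = \frac{P(\text{review} \mid \text{positive}) \cdot P(\text{positive})}{P(\text{review})} $$

With the values printed above, the prior for each class is `0.5`. The denominator is the same for both classes, so comparing the numerators is enough to pick a label.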


## 4. Bayes Theorem II

**Task 1**  
- Let’s first find the total number of words in all positive reviews and store that number in a variable named `total_pos`.
- To do this, we can use the built-in Python `sum()` function.
- `sum()` takes any iterable of numbers as its argument.
- The values you want to sum are the `values` of the dictionary `pos_counter`, which you can get by using `pos_counter.values()`.
- Do the same for `total_neg`.

<br>

**Task 2**  
- Create two variables named `pos_probability` and `neg_probability`. 
- Each of these variables should start at `1`. 
- These are the variables we are going to use to keep track of the probabilities.

<br>

**Task 3**  
- Create a list of the words in `review` and store it in a variable named `review_words`.
- You can do this by using Python’s `.split()` string method.
- For example, if the string `test` contained `"Hello there"`, then `test.split()` would return `["Hello", "there"]`.

<br>

**Task 4**  
- Loop through every word in `review_words`. 
- Find the number of times `word` appears in `pos_counter` and `neg_counter`.
- Store those values in variables named `word_in_pos` and `word_in_neg`.
- In the next steps, we’ll use these variables inside the for loop to do a series of multiplications.

<br>

**Task 5**  
- Inside the for loop, set `pos_probability` to be `pos_probability` multiplied by `word_in_pos / total_pos`.
- This step is finding each term to be multiplied together. 
- For example, when `word` is `"crib"`, you’re calculating the following:
$$ P(\text{"crib"} \mid \text{positive}) = \frac{\text{number of "crib" in positive}}{\text{number of words in positive}} $$

<br>

**Task 6**  
- Do the same multiplication for `neg_probability`.
- Outside the for loop, print both `pos_probability` and `neg_probability`. 
- Those values are `P("This crib was amazing" | positive)` and `P("This crib was amazing" | negative)`.

In [39]:
review = "This crib was amazing"

percent_pos = 0.5
percent_neg = 0.5

In [43]:
# Task 1
total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

# Task 2
pos_probability = 1
neg_probability = 1

# Task 3
review_words = review.split()

# Task 4 / 5 / 6
for word in review_words:
    word_in_pos = pos_counter[word]
    word_in_neg = neg_counter[word]
    
    pos_probability *= word_in_pos / total_pos
    neg_probability *= word_in_neg / total_neg

# Task 6
pos_probability, neg_probability

(0.0, 0.0)
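
Both products come out as `0.0` because a single factor of zero wipes out the whole product. The sketch below reproduces the effect with a made-up counter (`demo_counter` and its counts are assumptions for illustration). Note that `.split()` keeps capitalization and punctuation, so `"This"` is looked up exactly as written; whether it matches an entry depends on how the counters were built.

```python
from collections import Counter

# Hypothetical word counts; only lowercase words are present.
demo_counter = Counter({"this": 4, "crib": 2, "was": 3, "amazing": 1})
demo_total = sum(demo_counter.values())  # 10 words in total

probability = 1
factors = []
for word in "This crib was amazing".split():
    factor = demo_counter[word] / demo_total  # 0.0 for any unseen word
    factors.append((word, factor))
    probability *= factor

print(factors)      # the capitalized "This" is not in demo_counter
print(probability)
```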

## 5. Smoothing

**Task 1**  
- Let’s demonstrate how these probabilities break if there’s a word that never appears in the given datasets.
- Change `review` to `"This cribb was amazing"`. 
- Notice the second `b` in `cribb`.

<br>

**Task 2**  
- Inside your `for` loop, when you multiply `pos_probability` and `neg_probability` by a fraction, add `1` to the numerator.
- Make sure to include parentheses around the numerator!

<br>

**Task 3**  
- In the denominator of those fractions, add the number of unique words in the appropriate dataset.
- For the positive probability, this should be the length of `pos_counter`, which can be found using `len()`.
- Again, make sure to put parentheses around your denominator so the division happens after the addition!
- Did smoothing fix the problem?
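
Written as a formula in the same style as the `"crib"` example from the previous section, the smoothed term described in Tasks 2 and 3 is:

$$ P(\text{"crib"} \mid \text{positive}) = \frac{\text{number of "crib" in positive} + 1}{\text{number of words in positive} + \text{number of unique words in positive}} $$

A word that never appears in the positive reviews now contributes a small positive factor instead of `0`.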

In [46]:
review = "This cribb was amazing"

percent_pos = 0.5
percent_neg = 0.5

total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

pos_probability = 1
neg_probability = 1

review_words = review.split()

for word in review_words:
    word_in_pos = pos_counter[word]
    word_in_neg = neg_counter[word]
    
    pos_probability *= (word_in_pos + 1) / (total_pos + len(pos_counter))
    neg_probability *= (word_in_neg + 1) / (total_neg + len(neg_counter))


pos_probability, neg_probability

(1.0906857688451484e-12, 1.8834508880130966e-13)
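
The loop above can be wrapped in a small helper so the smoothing step is easy to check in isolation. This is a sketch only: `smoothed_probability` and `demo_counter` are hypothetical names, and the counts are made up.

```python
from collections import Counter


def smoothed_probability(words, counter):
    """Multiply the smoothed per-word probabilities for one class."""
    total_words = sum(counter.values())
    unique_words = len(counter)
    probability = 1
    for word in words:
        probability *= (counter[word] + 1) / (total_words + unique_words)
    return probability


# Hypothetical counts: 10 words in total, 4 unique words.
demo_counter = Counter({"this": 4, "crib": 2, "was": 3, "amazing": 1})

seen = smoothed_probability(["crib"], demo_counter)     # (2 + 1) / (10 + 4)
unseen = smoothed_probability(["cribb"], demo_counter)  # (0 + 1) / (10 + 4)
print(seen, unseen)
```

Because every factor is still well below 1, the product shrinks quickly as reviews get longer; summing the logarithms of the factors is a common way to avoid floating-point underflow.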

## 6. Classify

**Task 1**  
- After the for loop, multiply `pos_probability` by `percent_pos` and `neg_probability` by `percent_neg`.
- Store the two values in `final_pos` and `final_neg` and print both.

<br>

**Task 2**  
- Compare `final_pos` to `final_neg`:
    - If `final_pos` is greater than `final_neg`, print `"The review is positive"`.
    - Otherwise print `"The review is negative"`.
- Did our Naive Bayes Classifier get it right for the review `"This crib was amazing"`?

<br>

**Task 3**  
- Replace the review `"This crib was amazing"` with one that you think should be classified as negative. 
- Run your program again.
- Did your classifier correctly classify the new review?

In [49]:
review = "You are terrible"

percent_pos = 0.5
percent_neg = 0.5

total_pos = sum(pos_counter.values())
total_neg = sum(neg_counter.values())

pos_probability = 1
neg_probability = 1

review_words = review.split()

for word in review_words:
    word_in_pos = pos_counter[word]
    word_in_neg = neg_counter[word]
    
    pos_probability *= (word_in_pos + 1) / (total_pos + len(pos_counter))
    neg_probability *= (word_in_neg + 1) / (total_neg + len(neg_counter))


final_pos = pos_probability * percent_pos
final_neg = neg_probability * percent_neg

if final_pos > final_neg:
    print("The review is positive")
else:
    print("The review is negative")

The review is positive
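
`final_pos` and `final_neg` are only proportional to the class probabilities, since the shared denominator `P(review)` is never computed. Dividing each by their sum turns them into values that add up to 1, which is easier to read as a confidence. A minimal sketch with made-up inputs (the numbers and the `normalize` helper are illustrative assumptions):

```python
def normalize(final_pos, final_neg):
    """Rescale two unnormalized scores so they sum to 1."""
    total = final_pos + final_neg
    return final_pos / total, final_neg / total


# Hypothetical scores of the kind produced by the cell above.
prob_pos, prob_neg = normalize(3e-12, 1e-12)
print(prob_pos, prob_neg)

label = "positive" if prob_pos > prob_neg else "negative"
print(f"The review is {label}")
```

With smoothing in place neither score can be zero, so the division is safe.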


## 7. Formatting the Data for `scikit-learn`

**Task 1**  
- Create a `CountVectorizer` and name it `counter`.

<br>

**Task 2**  
- Call `counter`’s `.fit()` method. `.fit()` takes a list of strings and it will learn the vocabulary of those strings.
- We want our counter to learn the vocabulary from both `neg_list` and `pos_list`.
- Call `.fit()` using `neg_list + pos_list` as a parameter.

<br>

**Task 3**  
- Print `counter.vocabulary_`. 
- This is the vocabulary that your counter just learned. 
- The numbers associated with each word are the indices of each word when you `transform` a review.

<br>

**Task 4**  
- Let’s transform our brand new review. 
- Create a variable named `review_counts` and set it equal to the result of `counter`’s `.transform()` method.
- Remember, `.transform()` takes a list of strings to transform.
- So call `.transform()` using `[review]` as a parameter.
- Print `review_counts.toarray()`.
- If you don’t include `.toarray()`, `review_counts` won’t print in a readable format.
- It looks like this is an array of all `0`s, but the indices that correspond to the words `"this"`, `"crib"`, `"was"`, and `"amazing"` should all be `1`.

<br>

**Task 5**  
- We’ll use `review_counts` as the test point for our Naive Bayes Classifier, but we also need to transform our training set.
- Our training set is `neg_list + pos_list`. 
- Call `.transform()` using that as a parameter. 
- Store the results in a variable named `training_counts`. 
- We’ll use these variables in the next exercise.

In [None]:
review = "This crib was amazing"

In [67]:
# Task 1
counter = CountVectorizer()

# Task 2
counter.fit(neg_list + pos_list)

# Task 3
print(counter.vocabulary_)

# Task 4
review_counts = counter.transform([review])
print(review_counts.toarray())

# Task 5
training_counts = counter.transform(neg_list + pos_list)  

{'wanted': 1521, 'to': 1429, 'love': 805, 'this': 1408, 'but': 182, 'it': 712, 'was': 1525, 'pretty': 1056, 'expensive': 467, 'for': 525, 'only': 951, 'few': 495, 'months': 871, 'worth': 1584, 'of': 937, 'calendar': 187, 'pages': 981, 'ended': 434, 'up': 1486, 'buying': 185, 'regular': 1130, 'weekly': 1541, 'planner': 1024, '55': 11, 'off': 938, 'the': 1393, 'that': 1392, 'is': 709, '11': 2, 'and': 63, 'has': 618, 'all': 47, 'seven': 1219, 'days': 339, 'on': 947, 'right': 1163, 'page': 980, 'left': 765, 'room': 1166, 'write': 1588, 'do': 380, 'list': 785, 'goals': 577, 'found': 539, 'be': 120, 'more': 873, 'helpful': 633, 'because': 123, 'could': 306, 'mark': 823, 'each': 409, 'day': 337, 'eating': 417, 'sleeping': 1252, 'blocks': 149, 'then': 1397, 'also': 55, 'see': 1207, 'them': 1395, 'side': 1235, 'by': 186, 'her': 636, 'patterns': 993, 'easily': 413, 'with': 1568, 'view': 1511, 'cute': 328, 'just': 724, 'not': 919, 'what': 1550, 'like': 778, 'log': 792, 'think': 1405, 'would': 158
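
To make the vocabulary and the count array less opaque, here is a rough plain-Python approximation of what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, and assign column indices in sorted order. The `tokenize`, `fit_vocabulary`, and `transform` helpers and the two sample reviews are hypothetical; the real class returns a sparse matrix and has many more options.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


def transform(documents, vocabulary):
    """Return one row of counts per document; unknown tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[token] for token in vocabulary])
    return rows


demo_reviews = ["I wanted to love this crib", "This crib was amazing"]
vocabulary = fit_vocabulary(demo_reviews)
print(vocabulary)
print(transform(["This crib was amazing"], vocabulary))
```

The single-letter token `"i"` is dropped by the pattern, which matches the real vocabulary above starting at `'wanted'` for a review that begins with "I wanted".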

## 8. Using scikit-learn

**Task 1**  
- Begin by making a `MultinomialNB` object called `classifier`.

```python
classifier = MultinomialNB()
```

<br>

**Task 2**  
- We now want to fit the classifier.
- We have the transformed points (found in `training_counts`), but we don’t have the labels associated with those points.
- We made the training points by combining `neg_list` and `pos_list`. 
- So the first half of the labels should be `0` (for negative) and the second half should be `1` (for positive).
- Create a list named `training_labels` that has 1000 `0`s followed by 1000 `1`s.
- Note that there are 1000 negative and 1000 positive reviews. 
- Normally you could find this out by asking for the length of your dataset — in this example, we haven’t included the dataset because it takes so long to load!

```python
training_labels = [0] * 1000 + [1] * 1000
```
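
The number of labels has to equal the number of rows in `training_counts`. Section 3 printed a total of 100 reviews for the CSV files loaded in this notebook, so if `training_counts` was built from those files, the hard-coded `1000` would not match. Deriving the labels from the list lengths avoids that. A minimal sketch, using short stand-in lists (`demo_neg_list` and `demo_pos_list` are hypothetical):

```python
# Hypothetical stand-ins for neg_list and pos_list.
demo_neg_list = ["too expensive", "broke after a week", "not what I wanted"]
demo_pos_list = ["perfect for new parents", "love it"]

# 0 marks a negative review, 1 a positive one, in the same order as
# demo_neg_list + demo_pos_list.
training_labels = [0] * len(demo_neg_list) + [1] * len(demo_pos_list)
print(training_labels)
```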

<br>

**Task 3**  
- Call `classifier`’s `.fit()` method.
- `.fit()` takes two parameters:
    - the training set and 
    - the training labels.

```python
classifier.fit(training_counts, training_labels)
```

<br>

**Task 4**  
- Call `classifier`’s `.predict()` method and print the results.
- This method takes a list of the points that you want to test.
- Was your review classified as a positive or negative review?

```python
review = "This crib was amazing"
review_counts = counter.transform([review])
print(classifier.predict(review_counts))  # [1]
```

<br>

**Task 5**  
- After printing the result of `.predict()`, print the result of calling the `.predict_proba()` method.
- The parameter to `.predict_proba()` should be the same as the one passed to `.predict()`.
- The first number printed is the probability that the review was a `0` (bad) and the second number is the probability the review was a `1` (good).

```python
print(classifier.predict_proba(review_counts))  # [[0.22699537 0.77300463]]
```
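
The column order of `predict_proba` follows the classifier's `classes_` attribute, which is why the first column corresponds to label `0`. The sketch below checks this on a tiny made-up corpus (the four sample reviews and the `demo_` names are assumptions for illustration, not the tutorial's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus: two negative reviews followed by two positive ones.
demo_reviews = ["terrible crib", "broke quickly", "amazing crib", "love it"]
demo_labels = [0, 0, 1, 1]

demo_counter = CountVectorizer()
demo_training_counts = demo_counter.fit_transform(demo_reviews)

demo_classifier = MultinomialNB()
demo_classifier.fit(demo_training_counts, demo_labels)

demo_review_counts = demo_counter.transform(["amazing"])
print(demo_classifier.classes_)  # label for each probability column
probabilities = demo_classifier.predict_proba(demo_review_counts)
print(probabilities)             # one row per review, columns sum to 1
prediction = demo_classifier.predict(demo_review_counts)
print(prediction)
```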

<br>

**Task 6**  
- Change the text `review` to see the probabilities change.
- Can you create a review that the algorithm is *really* confident about being positive?
- The review `"This crib was great amazing and wonderful"` had the following probabilities: `[[ 0.04977729 0.95022271]]`
- Can you create a review that is even *more* positive?
- Another interesting challenge is to create a clearly negative review that our classifier thinks is positive.

```python
review = "I like you, you are so amazing, you are so beautiful"
review_counts = counter.transform([review])
print(classifier.predict_proba(review_counts))  # [[0.38680854 0.61319146]]
```

## 9. Review

**Task 1**
- In the code editor, we’ve included three Naive Bayes classifiers that have been trained on different datasets. 
- The training sets used are the baby product reviews, reviews for Amazon Instant Videos, and reviews about video games.
- Try changing `review` again and see how the different classifiers react!

```python
from reviews import baby_counter, baby_training, instant_video_counter, instant_video_training, video_game_counter, video_game_training
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

review = "You are nice"

baby_review_counts = baby_counter.transform([review])
instant_video_review_counts = instant_video_counter.transform([review])
video_game_review_counts = video_game_counter.transform([review])

baby_classifier = MultinomialNB()
instant_video_classifier = MultinomialNB()
video_game_classifier = MultinomialNB()

baby_labels = [0] * 1000 + [1] * 1000
instant_video_labels = [0] * 1000 + [1] * 1000
video_game_labels = [0] * 1000 + [1] * 1000


baby_classifier.fit(baby_training, baby_labels)
instant_video_classifier.fit(instant_video_training, instant_video_labels)
video_game_classifier.fit(video_game_training, video_game_labels)

print("Baby training set: " + str(baby_classifier.predict_proba(baby_review_counts)))
print("Amazon Instant Video training set: " + str(instant_video_classifier.predict_proba(instant_video_review_counts)))
print("Video Games training set: " + str(video_game_classifier.predict_proba(video_game_review_counts)))

# Output:
# Baby training set: [[0.39306827 0.60693173]]
# Amazon Instant Video training set: [[0.47460014 0.52539986]]
# Video Games training set: [[0.55099548 0.44900452]]
```