In this DataLab you will implement a Naive Bayes classifier as described in Chapter 4 of the book Speech and Language Processing.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import random
import re                                  
import string  

import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords          
from nltk.stem import PorterStemmer        
from nltk.tokenize import TweetTokenizer

In [None]:
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

In [None]:
nltk.download('twitter_samples')

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

In [None]:
# print a random positive tweet in green, then reset the terminal colour
print('\033[92m' + random.choice(all_positive_tweets) + '\033[0m')

# print a random negative tweet in red, then reset the terminal colour
print('\033[91m' + random.choice(all_negative_tweets) + '\033[0m')

**Tweet preprocessing**

Last week you learned how to use regular expressions to process tweets. Use the function `tweet_processor()` you created in the last DataLab here:

In [None]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of tokens
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stems the tokens
        
    """
    # YOUR CODE HERE #

    return processed_tweet

Then run a sanity check to confirm that it works.
    
Example tweet:
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`
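As a reminder of the first two processing steps, here is a minimal sketch that only removes hyperlinks and `#` signs. The helper name `strip_links_and_hashes` and both regular expressions are assumptions for illustration; your function from the last DataLab may use different patterns, and it still needs to tokenize, filter and stem.

```python
import re

def strip_links_and_hashes(tweet):
    """Hypothetical helper: remove hyperlinks and '#' signs only."""
    # Drop anything that looks like an http(s) URL (assumed pattern).
    tweet = re.sub(r'https?://\S+', '', tweet)
    # Keep the hashtag word but drop the '#' sign itself.
    tweet = re.sub(r'#', '', tweet)
    return tweet.strip()

example = ('My beautiful sunflowers on a sunny Friday morning off :)'
           ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
cleaned = strip_links_and_hashes(example)
print(cleaned)
```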

In [None]:
example_tweet = ('My beautiful sunflowers on a sunny Friday morning off :)'
                 ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
print(example_tweet)

In [None]:
tweet_processor(example_tweet)

Before going any further, let's split the dataset into training and test sets.

In [None]:
# 80% training 20% testing
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

**Task 1**

The function `tweet_processor()` expects a single tweet, but you have lists of tweets to process. Write a function called `tweet_processor_list()` that accepts a list of strings (tweets) and returns a list of processed tweets. A processed tweet is a list of tokens, so `tweet_processor_list()` should return a list of lists.

The first two items in the `positive_tweets_tr` are:

```
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!']
 ```
 
The expected output of `tweet_processor_list()` is:
 
 ```
 [['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'],
 ['hey',
  'jame',
  'odd',
  ':/',
  'pleas',
  'call',
  'contact',
  'centr',
  '02392441234',
  'abl',
  'assist',
  ':)',
  'mani',
  'thank']]
 
 ```
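The general pattern is to apply a single-item function to every element of a list. The sketch below shows it with a hypothetical stand-in, `toy_processor`, which only lowercases and splits on whitespace; it is not the real `tweet_processor()`, and `process_all` is an illustrative name.

```python
def toy_processor(tweet):
    # Hypothetical stand-in for tweet_processor(): lowercase, split on whitespace.
    return tweet.lower().split()

def process_all(tweets, processor):
    # Apply the single-tweet processor to every tweet, keeping the original order.
    return [processor(tweet) for tweet in tweets]

processed = process_all(['Hello World :)', 'So sad :('], toy_processor)
print(processed)  # a list of lists of tokens
```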

In [None]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    return processed_tweet_list

In [None]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

**Task 2**

Now it is time to create the _vocabulary_ as defined in Chapter 4, Section 4.2:

> vocabulary V consists of the union of all the word types in all classes

Combine all the tokens in `positive_tweets_tr` and `negative_tweets_tr` into one big list and get the unique tokens from this list.

The expected length of the vocabulary is `9085` unique tokens. Note that this number will differ if you use a different train/test split or different preprocessing.

First 50 tokens in the vocabulary:

```
['(-:',
 '(:',
 '):',
 '--->',
 '-->',
 '->',
 '.\n.',
 '.\n.\n.',
 '. .',
 '. . .',
 '. ..',
 '. ...',
 '..',
 '...',
 '0',
 '0-100',
 '0-2',
 '0.001',
 '0.7',
 '00',
 '00128835',
 '009',
 '00962778381',
 '01282',
 '01482',
 '01:15',
 '01:16',
 '02079',
 '02392441234',
 '0272 3306',
 '0330 333 7234',
 '0345',
 '05.15',
 '07:02',
 '07:17',
 '07:24',
 '07:25',
 '07:32',
 '07:34',
 '08',
 '0878 0388',
 '08962464174',
 '0ne',
 '1',
 '1,300',
 '1,500',
 '1-0',
 '1.300',
 '1.8',
 '1/2']
```
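One possible way to collect unique tokens is a set union over all tweets, sketched here on toy data. The names `pos_toy`, `neg_toy` and `vocabulary_toy` are made up for illustration. Sorting is an assumption based on the listing above, which appears to be in sorted order.

```python
# Toy stand-ins for the processed training tweets (lists of token lists).
pos_toy = [['happi', ':)'], ['good', 'day', ':)']]
neg_toy = [['sad', ':('], ['bad', 'day']]

vocab_set = set()
for tweet in pos_toy + neg_toy:
    # update() adds every token of the tweet; duplicates are ignored.
    vocab_set.update(tweet)

vocabulary_toy = sorted(vocab_set)
print(len(vocabulary_toy), vocabulary_toy)
```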

In [None]:
# YOUR CODE HERE #

**Task 3**

To evaluate equation 4.12

$P(w_i|c)=count(w_i, c)/\Sigma_{w∈V} count(w, c)$

we first need $count(w_i, c)$, the number of times each token in the vocabulary occurs in tweets from class c. The table of these counts is called the word frequency table.

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|


First create a dictionary called `freqs` where the keys are tokens and the values are lists containing the positive and negative counts

```
{'(-:': [1, 0],
 '(:': [1, 6],
 ...
}
```

and convert it to a dataframe.
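A minimal sketch of the counting step on toy data, assuming column 0 holds the positive count and column 1 the negative count. The function name `count_tokens` and the toy tweets are illustrative only.

```python
def count_tokens(pos_tweets, neg_tweets):
    """Return {token: [positive count, negative count]} for tokenized tweets."""
    table = {}
    for column, tweets in enumerate((pos_tweets, neg_tweets)):
        for tweet in tweets:
            for token in tweet:
                # Start every new token at [0, 0], then bump the class column.
                counts = table.setdefault(token, [0, 0])
                counts[column] += 1
    return table

toy_freqs = count_tokens([['happi', ':)'], ['happi', 'day']],
                         [['sad', 'day', ':(']])
print(toy_freqs)
```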

In [None]:
# YOUR CODE HERE #
freqs = {}  # fill with token -> [positive count, negative count]

In [None]:
df = pd.DataFrame.from_dict(freqs, orient='index', columns=['count(w_i, +)', 'count(w_i, -)'])
df.head(10)

**Task 4**

We can now evaluate equation 4.12:

$P(w_i|c)=count(w_i, c)/\Sigma_{w∈V} count(w, c)$

The denominator $\Sigma_{w∈V} count(w, c)$ is simply the sum of the count column for class c.

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |
| ----------- | ----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|
|(:|1|6|0.000037|0.000222|
|):|6|6|0.000224|0.000222|
|--->|1|0|0.000037|0.000000|
|happi|161|18|0.005998|0.000666|
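The division can be sketched on a small hypothetical frequency table (the counts below are invented, not taken from the tweets). Each count is divided by the total of its own class column, so every column of likelihoods sums to 1.

```python
# Hypothetical counts: token -> [positive count, negative count].
toy_freqs = {'happi': [3, 1], 'sad': [0, 2], 'day': [1, 1]}

# Denominators of equation 4.12: one column total per class.
total_pos = sum(counts[0] for counts in toy_freqs.values())
total_neg = sum(counts[1] for counts in toy_freqs.values())

toy_likelihoods = {
    token: (pos / total_pos, neg / total_neg)
    for token, (pos, neg) in toy_freqs.items()
}
print(toy_likelihoods)
```

Note that `'sad'` gets a positive-class likelihood of exactly zero here, which is the problem that smoothing addresses.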


In [None]:
# YOUR CODE HERE #

**Task 5**

Apply Laplacian (add-one) smoothing as described in equation 4.14

$P(w_i|c)=[count(w_i, c)+1]/[\Sigma_{w∈V} count(w, c)+|V|]$

where $|V|$ is the number of tokens in the vocabulary, i.e. `len(vocabulary)`.

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |P(w_i\|+) smooth | P(w_i\|-) smooth |
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|0.000056|0.000028|
|(:|1|6|0.000037|0.000222|0.000056|0.000194|
|):|6|6|0.000224|0.000222|0.000195|0.000194|
|--->|1|0|0.000037|0.000000|0.000056|0.000028|
|happi|161|18|0.005998|0.000666|0.004509|0.000526|
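Here is the same idea on a hypothetical three-token table (invented counts, illustrative names). Adding 1 to every count and the vocabulary size to every column total removes the zero probabilities while keeping each column a valid distribution.

```python
# Hypothetical counts: token -> [positive count, negative count].
toy_freqs = {'happi': [3, 1], 'sad': [0, 2], 'day': [1, 1]}

vocab_size = len(toy_freqs)  # plays the role of len(vocabulary)
total_pos = sum(counts[0] for counts in toy_freqs.values())
total_neg = sum(counts[1] for counts in toy_freqs.values())

# Add-one smoothing: +1 in the numerator, +vocab_size in the denominator.
toy_smooth = {
    token: ((pos + 1) / (total_pos + vocab_size),
            (neg + 1) / (total_neg + vocab_size))
    for token, (pos, neg) in toy_freqs.items()
}
print(toy_smooth)
```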

In [None]:
# YOUR CODE HERE #

**Task 6**

The final piece of the puzzle is equation 4.11

$P(c) = N_c/N_{doc}$

$N_c$: the number of tweets in the training data with class c

$N_{doc}$: the total number of tweets in the training data

P(+) = number of positive tweets / number of tweets

P(-) = number of negative tweets / number of tweets

Calculate P(+) and P(-)
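With the 80/20 split used above, each class contributes 4000 training tweets, so both priors come out equal. A short sketch with illustrative variable names; the numbers would change with a different split.

```python
# Training set sizes from the split above (4000 tweets per class).
n_pos_toy, n_neg_toy = 4000, 4000
n_doc_toy = n_pos_toy + n_neg_toy

prior_pos_toy = n_pos_toy / n_doc_toy
prior_neg_toy = n_neg_toy / n_doc_toy
print(prior_pos_toy, prior_neg_toy)  # 0.5 0.5
```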

In [None]:
# YOUR CODE HERE #

**Task 7**

Write the Naive Bayes algorithm by implementing equations 4.5/4.6

Say we have a tweet with 2 tokens `['damnit', ':(']`. The probability of this tweet being positive is proportional to:

`P(tweet|+)P(+)` = `P('damnit'|+) * P(':('|+) * P(+)`

and negative is proportional to:

`P(tweet|-)P(-)` = `P('damnit'|-) * P(':('|-) * P(-)`

If `P(tweet|+)P(+)` > `P(tweet|-)P(-)`, the tweet is classified as positive; otherwise it is classified as negative.

Predict whether this tweet is positive or negative using equations described above. Use the probabilities calculated using Laplacian smoothing.
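The comparison can be sketched as follows. The smoothed probabilities here are hypothetical numbers chosen for illustration, not values from the real table. Tokens that are missing from the table are skipped, which is how the book treats unknown words.

```python
# Hypothetical smoothed likelihoods: token -> (P(w|+), P(w|-)).
toy_smooth = {
    'damnit': (0.001, 0.004),
    ':(': (0.0005, 0.02),
}
toy_prior = (0.5, 0.5)  # (P(+), P(-))

def toy_score(tokens, column):
    """Prior times the product of likelihoods for one class column."""
    prob = toy_prior[column]
    for token in tokens:
        if token in toy_smooth:  # unknown tokens are ignored
            prob *= toy_smooth[token][column]
    return prob

toy_tweet = ['damnit', ':(', 'zzz']  # 'zzz' is not in the table
toy_prob_pos = toy_score(toy_tweet, 0)
toy_prob_neg = toy_score(toy_tweet, 1)
print('positive' if toy_prob_pos > toy_prob_neg else 'negative')
```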

Remember what Section 4.2 (page 62) says about unknown words:

> What do we do about words that occur in our test data but are not in our vocabulary at all because they did not occur in any training document in any class? The solution for such unknown words is to ignore them—remove them from the test document and not include any probability for them at all.

In [None]:
tw = negative_tweets_te[3]
print(tw)

In [None]:
# YOUR CODE HERE #

In [None]:
prob_pos, prob_neg

In [None]:
if prob_pos > prob_neg:
    print('Class positive')
else:
    print('Class negative')

**Task 8**

As explained in Section 4.1 (page 61):

> Naive Bayes calculations, like calculations for language modeling, are done in log space, to avoid underflow and increase speed.

In [None]:
# Numerical underflow
print(0.5**1000)
print(0.5**10000)

Calculate the log likelihoods for P(w_i|+)\_smooth and P(w_i|-)\_smooth

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |P(w_i\|+) smooth | P(w_i\|-) smooth |log(P(w_i\|+) smooth)|log(P(w_i\|-) smooth)|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|0.000056|0.000028|-9.796125|-10.494519|
|(:|1|6|0.000037|0.000222|0.000056|0.000194|-9.796125|-8.548609|
|):|6|6|0.000224|0.000222|0.000195|0.000194|-8.543362|-8.548609|
|--->|1|0|0.000037|0.000000|0.000056|0.000028|-9.796125|-10.494519|
|happi|161|18|0.005998|0.000666|0.004509|0.000526|-5.401676|-7.550080|
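A minimal sketch of the log transform on hypothetical smoothed probabilities (the values below are invented). Because smoothing guarantees that every probability is strictly positive, the natural logarithm is always defined, and every log likelihood is negative.

```python
import math

# Hypothetical smoothed likelihoods: token -> (P(w|+), P(w|-)).
toy_smooth = {'happi': (4 / 7, 2 / 7), 'sad': (1 / 7, 3 / 7), 'day': (2 / 7, 2 / 7)}

# Natural log of each smoothed probability, per class.
toy_log = {
    token: (math.log(p_pos), math.log(p_neg))
    for token, (p_pos, p_neg) in toy_smooth.items()
}
print(toy_log)
```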

In [None]:
# YOUR CODE HERE #

**Task 9**

Repeat Task 7 but this time using log likelihoods.
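In log space the product becomes a sum: log P(c) plus the log likelihood of each known token. The sketch below uses hypothetical probabilities and illustrative function names. It also shows a deliberately long toy tweet whose plain product underflows to zero for both classes, while the log scores stay finite and comparable.

```python
import math

# Hypothetical smoothed likelihoods: token -> (P(w|+), P(w|-)).
toy_smooth = {'damnit': (0.001, 0.004), ':(': (0.0005, 0.02)}

def toy_log_score(tokens, column, prior=0.5):
    """log(prior) plus the sum of log likelihoods of known tokens."""
    total = math.log(prior)
    for token in tokens:
        if token in toy_smooth:
            total += math.log(toy_smooth[token][column])
    return total

def toy_product_score(tokens, column, prior=0.5):
    """The same score computed as a plain product, for comparison."""
    prob = prior
    for token in tokens:
        if token in toy_smooth:
            prob *= toy_smooth[token][column]
    return prob

short_tweet = ['damnit', ':(']
long_tweet = ['damnit'] * 200  # artificial tweet to force underflow

print(toy_log_score(short_tweet, 0), toy_log_score(short_tweet, 1))
print(toy_product_score(long_tweet, 0), toy_product_score(long_tweet, 1))
print(toy_log_score(long_tweet, 0), toy_log_score(long_tweet, 1))
```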

In [None]:
tw = negative_tweets_te[3]
print(tw)

In [None]:
# YOUR CODE HERE #

In [None]:
log_prob_pos, log_prob_neg

In [None]:
np.exp(log_prob_pos), np.exp(log_prob_neg)

In [None]:
prob_pos, prob_neg

In [None]:
if log_prob_pos > log_prob_neg:
    print('Class positive')
else:
    print('Class negative')

**Task 10**

Putting everything together, predict whether each tweet in the test set is positive or negative, then calculate the accuracy.
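The evaluation loop could look roughly like this. `toy_predict` is a hypothetical stand-in for your Naive Bayes classifier, and encoding the labels as 1 (positive) and 0 (negative) is an assumption. Accuracy is the fraction of predictions that match the true labels.

```python
def toy_predict(tokens):
    # Hypothetical stand-in classifier: positive (1) if ':)' is present, else negative (0).
    return 1 if ':)' in tokens else 0

# Toy test set of (tokens, true label) pairs.
toy_test = [
    (['happi', ':)'], 1),
    (['sad', ':('], 0),
    (['good', 'day'], 1),
    (['bad', ':('], 0),
]

toy_y, toy_preds = [], []
for tokens, label in toy_test:
    toy_y.append(label)
    toy_preds.append(toy_predict(tokens))

# Fraction of predictions that agree with the true labels.
toy_accuracy = sum(p == y for p, y in zip(toy_preds, toy_y)) / len(toy_y)
print(toy_accuracy)  # 0.75
```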

In [None]:
y_test = []
y_preds = []

# YOUR CODE HERE #
    
y_preds = np.array(y_preds)
y_test = np.array(y_test)

In [None]:
sum(y_preds == y_test)/len(y_test)