## **Naive Bayes**

In this notebook, we will use Naive Bayes for sentiment analysis on tweets. Given a tweet, we will decide if it has a positive sentiment or a negative sentiment. In this notebook, we will:

* Train a Naive Bayes model on a sentiment analysis task.
* Test the model.
* Compute the ratio of positive to negative counts for words.
* Do error analysis.
* Finally, try the model on some arbitrary tweets.

### **Load required libraries and NLTK Twitter sample dataset**

In [2]:
!cp '/content/drive/My Drive/Colab Notebooks/NLP-with-classification-and-vector-spaces/Naive-Bayes/utils.py' '/content'

In [3]:
from utils import process_tweet, lookup
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd

In [4]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [6]:
# load positive and negative tweets
pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')

# split into train and test set
train_pos = pos_tweets[0:4000]
test_pos = pos_tweets[4000:]
train_neg = neg_tweets[0:4000]
test_neg = neg_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# create train and test labels
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

### **Create a function *word_label_count()***

This function takes a list of tweets and their labels as input, cleans each tweet, and returns a dictionary of the form {(word, label): count}.

* The key in the dictionary is a tuple containing the stemmed word and its class label (0 or 1).
* The value is the number of times the word appears in the given collection of tweets (an integer).
* We will use the `process_tweet` function that was imported above, and then store the words in their respective dictionaries and sets.
* We may find it useful to use the `zip` function to match each element in `tweets` with each element in `ys`.

In [7]:
def word_label_count(result, tweets, label):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        label: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''

    for y, tweet in zip(label, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word, y)

            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1

    return result

In [11]:
# test the function
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
word_label_count(result, tweets, ys)

{('happi', 1): 1, ('sad', 0): 1, ('tire', 0): 2, ('trick', 0): 1}
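
The same kind of counting can also be written with `collections.Counter`. The sketch below is an alternative to the loop above, not the notebook's method. `simple_tokenize` and `count_pairs` are hypothetical names, and `simple_tokenize` only lowercases and splits on whitespace (no stemming or stop-word removal), so its keys differ from the stemmed output shown above.

```python
from collections import Counter

def simple_tokenize(tweet):
    # Hypothetical stand-in for process_tweet: lowercase + whitespace split only.
    return tweet.lower().split()

def count_pairs(tweets, labels, tokenize=simple_tokenize):
    # Count every (word, label) pair across all tweets.
    counts = Counter()
    for tweet, label in zip(tweets, labels):
        counts.update((word, label) for word in tokenize(tweet))
    return dict(counts)

toy_counts = count_pairs(['i am happy', 'i am sad', 'i am sad'], [1, 0, 0])
print(toy_counts[('sad', 0)])    # 2
print(toy_counts[('happy', 1)])  # 1
```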

### **Naive Bayes model function**

Naive Bayes is an algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.

So how do we train a Naive Bayes classifier?
* The first part of training a Naive Bayes classifier is to identify the number of classes that we have.
* We will create a probability for each class. $P(D_{pos})$ is the probability that the document is positive. $P(D_{neg})$ is the probability that the document is negative. We will use the below formulas and store the values in a dictionary:
$$P(D_{pos}) = \frac{D_{pos}}{D}$$

 $$P(D_{neg}) = \frac{D_{neg}}{D}$$
where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.

**Prior and Logprior**

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$. We take the log of the prior to rescale it, and we call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$. So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})$$
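
As a quick numeric check of the formula, the sketch below computes the logprior for the balanced split used in this notebook (4000 positive and 4000 negative training tweets) and for a hypothetical imbalanced split. The function name `log_prior` and the 6000/2000 figures are assumptions made for illustration.

```python
import math

def log_prior(d_pos, d_neg):
    # logprior = log(D_pos / D_neg) = log(D_pos) - log(D_neg)
    return math.log(d_pos) - math.log(d_neg)

# Balanced training set, as in this notebook: 4000 positive, 4000 negative tweets.
balanced = log_prior(4000, 4000)

# Hypothetical imbalanced set: three times as many positive tweets.
imbalanced = log_prior(6000, 2000)

print(balanced)    # 0.0
print(imbalanced)  # log(3), roughly 1.0986
```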

**Positive and Negative Probability of a Word**

To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

* $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
* $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
* $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using these formulas:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} $$

$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} $$

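Notice that we add the "+1" in the numerator for additive smoothing, so a word that never appears in one class does not get a probability of zero. The matching "+V" in the denominator keeps the probabilities over the vocabulary summing to 1. This wiki article explains more about additive smoothing: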

https://en.wikipedia.org/wiki/Additive_smoothing
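
A small numeric check of the smoothing formula may help. The counts below are hypothetical and the helper name `smoothed_prob` is not part of the notebook; the sketch only illustrates how add-one smoothing behaves on a three-word vocabulary.

```python
def smoothed_prob(freq, n_class, vocab_size):
    # Laplace (add-one) smoothing: (freq + 1) / (N_class + V)
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical positive-class counts for a three-word vocabulary.
pos_counts = {'happi': 3, 'sad': 0, 'tire': 1}
n_pos = sum(pos_counts.values())   # N_pos = 4
V = len(pos_counts)                # V = 3

probs = {w: smoothed_prob(c, n_pos, V) for w, c in pos_counts.items()}
print(probs['sad'])          # 1/7, non-zero although 'sad' never appears in the class
print(sum(probs.values()))   # sums to 1 (up to floating-point rounding)
```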

**Log likelihood**

To compute the loglikelihood of that same word, we can use the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)$$
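
The sign of the loglikelihood tells us which class a word leans toward. The sketch below uses hypothetical totals (100 words per class, 50 unique words) and a hypothetical helper name, `log_likelihood`, to show the three cases: a mostly positive word, a mostly negative word, and a neutral one.

```python
import math

def log_likelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    # log(P(W_pos) / P(W_neg)) with add-one smoothing on both classes
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos) - math.log(p_w_neg)

# Hypothetical totals: 100 words per class, 50 unique words overall.
N_POS, N_NEG, VOCAB = 100, 100, 50

print(log_likelihood(9, 0, N_POS, N_NEG, VOCAB) > 0)   # True: mostly positive word
print(log_likelihood(0, 9, N_POS, N_NEG, VOCAB) < 0)   # True: mostly negative word
print(log_likelihood(4, 4, N_POS, N_NEG, VOCAB))       # 0.0: equally frequent in both
```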

**Create *freqs* dictionary**

* Using the `word_label_count()` function defined above, we can compute a dictionary called `freqs` that contains all the frequencies.
* In this `freqs` dictionary, the key is the tuple (word, label).
* The value is the number of times it has appeared.

In [12]:
freqs = word_label_count({}, train_x, train_y)

Given a `freqs` dictionary, `train_x` (a list of tweets), and `train_y` (a list of labels for each tweet), we will calculate the following to implement a Naive Bayes classifier.


**Calculate $V$**

* We can compute the number of unique words that appear in the freqs dictionary to get our $V$.

**Calculate $freq_{pos}$ and $freq_{neg}$**

* Using our `freqs` dictionary, we can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

**Calculate $N_{pos}$ and $N_{neg}$**

* Using `freqs` dictionary, we can also compute the total number of positive words $N_{pos}$ and total number of negative words $N_{neg}$.

**Calculate $D$, $D_{pos}$, $D_{neg}$**

* Using the `train_y` input list of labels, we can calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.

* Then we can calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$.

**Calculate the logprior**

* The logprior is $\log(D_{pos}) - \log(D_{neg})$

**Calculate log likelihood**

* Finally, we can iterate over each word in the vocabulary, and use our `lookup` function (from utils.py) to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
* After that, we can compute the positive probability of each word $P(W_{pos})$ and the negative probability of each word $P(W_{neg})$ using the equations below:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} $$

$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} $$

*(Hint: We'll use a dictionary to store the log likelihoods for each word. The key is the word, the value is the log likelihood of that word).*

* Finally, we can compute the loglikelihood: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.
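
The `lookup` helper is imported from `utils.py`, which is not reproduced in this notebook. A minimal stand-in, assuming it returns the stored count for a `(word, label)` pair and 0 when the pair is absent, might look like this. The name `lookup_sketch` is hypothetical, and the real helper may differ.

```python
def lookup_sketch(freqs, word, label):
    # Assumed behaviour of utils.lookup: count for (word, label), or 0 if unseen.
    return freqs.get((word, label), 0)

toy_freqs = {('happi', 1): 1, ('sad', 0): 1, ('tire', 0): 2, ('trick', 0): 1}
print(lookup_sketch(toy_freqs, 'tire', 0))   # 2
print(lookup_sketch(toy_freqs, 'tire', 1))   # 0
```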


In [9]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Returns the logprior value and loglikelihood dictionary.
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0, 1)
    Output:
        logprior: the log prior
        loglikelihood: a dictionary mapping each word to its log likelihood
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg, the total word counts in each class
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs.get(pair)

        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs.get(pair)

    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents
    D_pos = sum(i == 1 for i in train_y)

    # Calculate D_neg, the number of negative documents
    D_neg = sum(i == 0 for i in train_y)

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs, word, 1)
        freq_neg = lookup(freqs, word, 0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos) - np.log(p_w_neg)

    return logprior, loglikelihood

In [14]:
# calculate logprior and loglikelihood
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print('logprior =', logprior)
print('length of loglikelihood dictionary =', len(loglikelihood))

logprior = 0.0
length of loglikelihood dictionary = 9089


### **Test our Naive Bayes model**

Now that we have the logprior and loglikelihood, we can test the Naive Bayes model by making predictions on some tweets!

**Create *predict_naive_bayes* function to make predictions on tweets.**

* The function takes in a tweet, logprior, loglikelihood.
* It returns a log-odds score rather than a probability: a score greater than 0 points to the positive class, and a score of 0 or less to the negative class.
* For each tweet, sum up the loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment score of that tweet:

$$ p = logprior + \sum_i^N (loglikelihood_i)$$

We calculate the prior (or logprior) from the training data, and in our case the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets). This means that the ratio of positive to negative (called prior) is 1, and hence the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.

However, we need to remember to always include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.

In [21]:
def predict_naive_bayes(tweet, logprior, loglikelihood):
    '''
    Returns the score of a tweet (the sum of the logprior and the loglikelihoods of all the words in the tweet).
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''

    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize the score to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the score
            p += loglikelihood[word]

    return p

In [16]:
# test the function
new_tweet = 'Keep smiling.'
p = predict_naive_bayes(new_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is 2.1390555947214125
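
The value printed above is a log-odds score, not a probability. Under the Naive Bayes model it can be mapped to an estimated probability of the positive class with the logistic (sigmoid) function. The helper `score_to_prob` below is a hypothetical addition and not part of the notebook; it is a sketch that does not guard against overflow for very large negative scores.

```python
import math

def score_to_prob(score):
    # Treat the score as log(P(pos | tweet) / P(neg | tweet)) and invert it.
    return 1.0 / (1.0 + math.exp(-score))

print(score_to_prob(0.0))                  # 0.5: no evidence either way
print(score_to_prob(2.1390555947214125))   # score of 'Keep smiling.' from the output above
```

A score above 0 always maps to a probability above 0.5, so thresholding the score at 0 and thresholding the probability at 0.5 give the same decision.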


**Create *test_naive_bayes* function**

* The function takes in `test_x`, `test_y`, `logprior`, and `loglikelihood`.
* It returns the accuracy of the model.
* It uses the `predict_naive_bayes` function to make predictions for each tweet in `test_x`.

In [19]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Returns the accuracy of Naive Bayes classifier on test data.
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0

    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if predict_naive_bayes(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

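    # error is the average of the absolute values of the differences between y_hats and test_y
    # (both are converted to arrays so the subtraction also works when test_y is a plain list)
    error = np.mean(np.abs(np.asarray(y_hats) - np.asarray(test_y)))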

    # Accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy

In [20]:
# calculate the test accuracy
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.9940
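
Accuracy alone does not show which kind of mistake the model makes. A minimal sketch of a confusion tally is given below; the function names `confusion_counts` and `accuracy_from_counts` and the toy labels are hypothetical, and in the notebook one would pass `test_y` and the predicted classes instead.

```python
def confusion_counts(y_true, y_pred):
    # Tally true/false positives and negatives for binary labels 0 and 1.
    counts = {'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts['tp' if truth == 1 else 'fp'] += 1
        else:
            counts['tn' if truth == 0 else 'fn'] += 1
    return counts

def accuracy_from_counts(counts):
    # Accuracy = correctly classified / total
    total = sum(counts.values())
    return (counts['tp'] + counts['tn']) / total

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
toy_confusion = confusion_counts(y_true, y_pred)
print(toy_confusion)                        # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
print(accuracy_from_counts(toy_confusion))  # 4/6, roughly 0.667
```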


### **Filter words by Ratio of positive to negative counts**

* Some words have more positive counts than others, and can be considered "more positive". Likewise, some words can be considered more negative than others.
* One way for us to define the level of positiveness or negativeness, without calculating the log likelihood, is to compare the positive to negative frequency of the word.

  (*We can also use the log likelihood calculations to compare relative positivity or negativity of words.*)

* We can calculate the ratio of positive to negative frequencies of a word.
* Once we're able to calculate these ratios, we can also filter a subset of words that have a minimum ratio of positivity / negativity or higher.
* Similarly, we can also filter a subset of words that have a maximum ratio of positivity / negativity or lower (words that are at least as negative, or even more negative than a given threshold).

In order to do this, we will define a function *get_ratio()*.


**Create *get_ratio()* function**

* Given the `freqs` dictionary of words and a particular word, we can use `lookup(freqs, word, 1)` to get the positive count of the word.
* Similarly, we can use the `lookup()` function to get the negative count of that word.
* Then we can calculate the ratio of positive divided by negative counts using

$$ ratio = \frac{\text{pos_words} + 1}{\text{neg_words} + 1} $$
  where pos_words and neg_words correspond to the frequency of the words in their respective classes.
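
The ratio is closely tied to the log likelihood defined earlier. Substituting the smoothed probabilities $P(W_{pos})$ and $P(W_{neg})$ shows that the two differ only by a term that is the same for every word, so ranking words by ratio or by log likelihood gives the same order:

$$\text{loglikelihood} = \log \left( \frac{freq_{pos} + 1}{N_{pos} + V} \right) - \log \left( \frac{freq_{neg} + 1}{N_{neg} + V} \right) = \log(ratio) + \log \left( \frac{N_{neg} + V}{N_{pos} + V} \right)$$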

In [22]:
def get_ratio(freqs, word):
    '''
    Returns a dictionary with positive_count, negative_count, and their ratio {pos, neg, ratio}.
    Input:
        freqs: dictionary containing the words
        word: string to lookup

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    
    # use lookup() to find positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = lookup(freqs, word, 1)

    # use lookup() to find negative counts for the word (denoted by integer 0)
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)

    # calculate the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1) / (pos_neg_ratio['negative'] + 1)

    return pos_neg_ratio

In [23]:
# test the function
get_ratio(freqs, 'happi')

{'negative': 18, 'positive': 161, 'ratio': 8.526315789473685}
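
Plugging the counts for 'happi' from the output above into the ratio formula reproduces the reported value:

$$ ratio = \frac{161 + 1}{18 + 1} = \frac{162}{19} \approx 8.526 $$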

**Create *get_words_by_threshold(freqs,label,threshold)* function**

* If we set the label to 1, then we'll look for all words whose positive/negative ratio is at least as high as the threshold.
* If we set the label to 0, then we'll look for all words whose positive/negative ratio is at most as low as the threshold.
* We can use the `get_ratio()` function to get a dictionary containing the positive count, negative count, and the ratio of positive to negative counts.
* We can then add that dictionary to another dictionary, where the key is the word, and the value is the dictionary `pos_neg_ratio` that is returned by the `get_ratio()` function.

  An example key-value pair would have this structure:

  `{'happi': {'positive': 10, 'negative': 20, 'ratio': 0.5}}`

In [24]:
def get_words_by_threshold(freqs, label, threshold):
    '''
    Returns a dictionary of dictionaries, where each key is a word and each value is the dictionary returned by get_ratio().
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
        word_list: dictionary containing the word and information on its positive count, negative count,
        and ratio of positive to negative counts.
        example of a key value pair:
        {'happi':
            {'positive': 10, 'negative': 20, 'ratio': 0.5}
        }
    '''
    word_list = {}

    for key in freqs.keys():
        word, _ = key

        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)

        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:

            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # If the label is 0 and the pos_neg_ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:

            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio
            
    return word_list

In [25]:
# Test the function: find negative words at or below a threshold (set label to 0)
get_words_by_threshold(freqs, label=0, threshold=0.05)

{'26': {'negative': 20, 'positive': 0, 'ratio': 0.047619047619047616},
 ':(': {'negative': 3663, 'positive': 1, 'ratio': 0.0005458515283842794},
 ':-(': {'negative': 378, 'positive': 0, 'ratio': 0.002638522427440633},
 '>:(': {'negative': 43, 'positive': 0, 'ratio': 0.022727272727272728},
 'beli̇ev': {'negative': 35, 'positive': 0, 'ratio': 0.027777777777777776},
 'justi̇n': {'negative': 35, 'positive': 0, 'ratio': 0.027777777777777776},
 'lost': {'negative': 19, 'positive': 0, 'ratio': 0.05},
 'wi̇ll': {'negative': 35, 'positive': 0, 'ratio': 0.027777777777777776},
 'zayniscomingbackonjuli': {'negative': 19, 'positive': 0, 'ratio': 0.05},
 '♛': {'negative': 210, 'positive': 0, 'ratio': 0.004739336492890996},
 '》': {'negative': 210, 'positive': 0, 'ratio': 0.004739336492890996},
 'ｍｅ': {'negative': 35, 'positive': 0, 'ratio': 0.027777777777777776},
 'ｓｅｅ': {'negative': 35, 'positive': 0, 'ratio': 0.027777777777777776}}

In [26]:
# Test the function: find positive words at or above a threshold (set label to 1)
get_words_by_threshold(freqs, label=1, threshold=10)

{':)': {'negative': 2, 'positive': 2847, 'ratio': 949.3333333333334},
 ':-)': {'negative': 0, 'positive': 543, 'ratio': 544.0},
 ':D': {'negative': 0, 'positive': 498, 'ratio': 499.0},
 ':p': {'negative': 0, 'positive': 103, 'ratio': 104.0},
 ';)': {'negative': 0, 'positive': 22, 'ratio': 23.0},
 'arriv': {'negative': 4, 'positive': 57, 'ratio': 11.6},
 'bam': {'negative': 0, 'positive': 44, 'ratio': 45.0},
 'blog': {'negative': 0, 'positive': 27, 'ratio': 28.0},
 'commun': {'negative': 1, 'positive': 27, 'ratio': 14.0},
 'fav': {'negative': 0, 'positive': 11, 'ratio': 12.0},
 'fback': {'negative': 0, 'positive': 26, 'ratio': 27.0},
 'flipkartfashionfriday': {'negative': 0, 'positive': 16, 'ratio': 17.0},
 'followfriday': {'negative': 0, 'positive': 23, 'ratio': 24.0},
 'glad': {'negative': 2, 'positive': 41, 'ratio': 14.0},
 "here'": {'negative': 0, 'positive': 20, 'ratio': 21.0},
 'influenc': {'negative': 0, 'positive': 16, 'ratio': 17.0},
 'pleasur': {'negative': 0, 'positive': 10, 

Emoticons like :( and tokens like 'ｍｅ' (written in full-width characters) appear mostly in the negative tweets. Other words like 'glad', 'community' (stemmed to 'commun'), and 'arrives' (stemmed to 'arriv') tend to be found in the positive tweets.

### **Error Analysis**

In [27]:
print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = predict_naive_bayes(x, logprior, loglikelihood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

Truth Predicted Tweet
1	0.00	b''
1	0.00	b'truli later move know queen bee upward bound movingonup'
1	0.00	b'new report talk burn calori cold work harder warm feel better weather :p'
1	0.00	b'harri niall 94 harri born ik stupid wanna chang :D'
1	0.00	b''
1	0.00	b''
1	0.00	b'park get sunlight'
1	0.00	b'uff itna miss karhi thi ap :p'
0	1.00	b'hello info possibl interest jonatha close join beti :( great'
0	1.00	b'u prob fun david'
0	1.00	b'pat jay'
0	1.00	b'whatev stil l young >:-('


### **Play around with random tweet texts**

In [28]:
new_tweet = 'I am happy because I am learning NLP :)'

p = predict_naive_bayes(new_tweet, logprior, loglikelihood)
print(p)

9.574768961173337
