# Naive Bayes for sentiment analysis on tweets
In this notebook we you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, we will decide if it has a positive sentiment or a negative one. Specifically we will: 

* Train a naive bayes model on a sentiment analysis task
* Test using your model
* Compute ratios of positive words to negative words
* Predict on your own tweet


In [1]:
import nltk
from nltk.corpus import twitter_samples
import re
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
import numpy as np
import pandas as pd

In [2]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

## 1. Pre-processing steps

### Prepare the data

In [3]:
# Select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
print(len(all_positive_tweets))
print(len(all_negative_tweets))

5000
5000


In [4]:
# Split the data into two pieces, one for training and one for testing (20% will be in the test set, and 80% in the training set) 
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

# Create the positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

# Print the shape of train and test sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


### Cleaning the tweet

In [5]:
def process_tweet(tweet):
    '''This function cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems
    '''
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)  # remove old style retweet text "RT"
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)  # remove hashtags
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles = True, reduce_len=True) # instantiate tokenizer class
    tweet_tokens = tokenizer.tokenize(tweet)
    english_stopwords = stopwords.words('english')
    stemmer = PorterStemmer()
    clean_tweet=[]
    for word in tweet_tokens:
        if (word not in english_stopwords and word not in string.punctuation):
            stem_word = stemmer.stem(word)  # stemming word
            clean_tweet.append(stem_word)
    return clean_tweet

In [6]:
new_tweet = process_tweet(all_positive_tweets[2277])
print(all_positive_tweets[2277])
print(new_tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']


### Creating the frequency dictionary

In [7]:
def build_freqs(tweets,ys):
    ''' This function counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1'
    or a negative label '0', then builds the freqs dictionary, where each key is a (word,label) tuple,
    and the value is the count of its frequency within the corpus of tweets
    '''
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for y,tweet in zip(yslist,tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1    
    return freqs

## 2. Train the model using Naive Bayes

#### So how do we train a Naive Bayes classifier?
- The first part of training a naive bayes classifier is to identify the number of classes that we have.
- Then we will create a probability for each class.
$P(T_{pos})$ is the probability that the tweet is positive.
$P(T_{neg})$ is the probability that the tweet is negative.

$$P(T_{pos}) = \frac{T_{pos}}{T}\$$

$$P(T_{neg}) = \frac{T_{neg}}{T}\$$

Where $T$ is the total number of tweets, $T_{pos}$ is the total number of positive tweets and $T_{neg}$ is the total number of negative tweets.

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(T_{pos})}{P(T_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior

$$\text{logprior} = log \left( \frac{P(T_{pos})}{P(T_{neg})} \right) = log \left( \frac{T_{pos}}{T_{neg}} \right)$$.

$$\text{logprior} = \log (P(T_{pos})) - \log (P(T_{neg})) = \log (T_{pos}) - \log (T_{neg})\$$

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all tweets, respectively.
- $V$ is the number of unique words in the entire set of tweets, for all classes, whether positive or negative.

We usually compute the probability of a word given a class as follow:

$$ P(W_{pos}) = \frac{freq_{pos}}{N_{pos}}\ $$
$$ P(W_{neg}) = \frac{freq_{neg}}{N_{neg}}\ $$

However, if a word does not appear in the training set, then it automatically gets a probability of 0, to fix this we add smoothing as follow:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\ $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\ $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing. And since there are V words to normalize we add V in the denominator 

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\$$

In [8]:
# Build the freqs dictionary
freqs = build_freqs(train_x, train_y)

In [11]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels correponding to the tweets (0,1)
    Output:
        logprior: the log prior.
        loglikelihood: the log likelihood of you Naive bayes equation.
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:

            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[(pair[0],1)]

        # else, the label is negative
        else:

            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freqs[(pair[0],0)]

    # Calculate T, the number of documents
    T = len(train_y)
    
    T_pos = sum(pos for pos in train_y if pos==1)
    T_neg = T-T_pos
    logprior = np.log(T_pos) - np.log(T_neg)
    
    freq_pos = 0
    freq_neg = 0
    for word in vocab:
        # get the positive and negative frequency of the word
        if (word,1) in freqs.keys():
            freq_pos = freqs[(word,1)]
        if (word,0) in freqs.keys():
            freq_neg = freqs[(word,0)]

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos+1)/(N_pos+V)
        p_w_neg = (freq_neg+1)/(N_neg+V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos/p_w_neg)


    return logprior, loglikelihood


In [12]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

[0.]
9085


## 3. Testing 

Now that we have the `logprior` and `loglikelihood`, we can test the naive bayes function by making predicting on some tweets!

We will implement the `naive_bayes_predict` function to make predictions on tweets.
* The function takes in the `tweet`, `logprior`, `loglikelihood`.
* It returns the probability that the tweet belongs to the positive or negative class.
* For each tweet, sum up loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ p = logprior + \sum_i^N (loglikelihood_i)$$

#### Note
Note we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets).  This means that the ratio of positive to negative 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.

In [13]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the logliklihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    word_l = process_tweet(tweet)

    # initialize probability to zero
    p = 0

    # add the logprior
    p += logprior

    for word in word_l:

        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]

    return p


In [14]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0  # return this properly

    y_hats = []
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    np.squeeze(test_y)
    np.asarray(y_hats)
    correct = 0
    for i in range(len(test_y)):
        if (test_y[i]==y_hats[i]):
            correct +=1
    accuracy = correct/len(y_hats)

    return accuracy


In [15]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.9925
