<a href="https://colab.research.google.com/github/Nasdaq101/6120/blob/main/Lab_3_Twitter_Sentiment_with_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3

This lab reviews Naive Bayes and logistic regression. You will be using these algorithms to do sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one.

### Getting Started

We'll first need the data and some utility functions (including `process_tweet`, which we have provided for you). You may want to browse the documentation of unfamiliar libraries and functions.

In [6]:
# Class specific utility functions that help with preprocessing
!wget https://course.ccs.neu.edu/cs6120s25/data/twitter/utils.py -O utils.py
from utils import process_tweet, lookup

# Twitter corpus and NLP specific imports
import nltk
from nltk.corpus import stopwords, twitter_samples
from nltk.tokenize import TweetTokenizer
from os import getcwd
nltk.download('twitter_samples')
nltk.download('stopwords')
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

# General imports
import numpy as np
import pandas as pd
import string
import pdb

# Setup the training data and preprocess strings
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_positive_tweets = [process_tweet(tweet) for tweet in all_positive_tweets]
all_negative_tweets = [process_tweet(tweet) for tweet in all_negative_tweets]

# Split data into training / test
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

# Add positive and negative tweets into training / test
train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))




In [7]:
#@title Space to explore your dataset

custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
# What does process_tweet do?
print(process_tweet(custom_tweet))

# What's in train_x? What's in train_y?
print(train_x[0])
print(train_y[0])

['hello', 'great', 'day', ':)', 'good', 'morn']
['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
1.0


# Part 1: Process the Data

For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.
- **Remove noise**: You will first want to remove noise from your data, that is, words that don't tell you much about the content. These include common words such as "I", "you", "are", and "is", which carry little information about sentiment.
- We'll also remove stock market tickers, retweet symbols, hyperlinks, and the `#` sign from hashtags because they tell you little about the sentiment.
- You also want to remove all the punctuation from a tweet, so that words with or without punctuation are treated as the same word, instead of treating "happy", "happy?", "happy!", "happy," and "happy." as different words.
- Finally, you want to use stemming to keep track of only one variation of each word. In other words, we'll treat "motivation", "motivated", and "motivate" alike by grouping them under the same stem, "motiv-".

We have given you the function `process_tweet` that does this for you.
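
As a rough illustration of the steps listed above, here is a simplified, dependency-free sketch. It is not the provided `process_tweet`, which relies on NLTK. The name `simple_clean`, the tiny `STOPWORDS` set, and the regular expressions are assumptions made for this example. Stemming is omitted, and emoticons such as `:)` are dropped here even though the real function keeps them.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (the real one comes from NLTK).
STOPWORDS = {"i", "you", "are", "is", "am", "a", "the", "have", "there"}

def simple_clean(tweet):
    """Simplified stand-in for process_tweet: no stemming, naive tokenization."""
    tweet = re.sub(r"\$\w+", "", tweet)          # stock tickers such as $GE
    tweet = re.sub(r"^RT\s+", "", tweet)         # leading retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)   # hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)           # user handles
    tweet = tweet.replace("#", "")               # keep hashtag text, drop '#'
    tokens = []
    for raw in tweet.lower().split():
        word = raw.strip(string.punctuation)     # "happy!" -> "happy"
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens

print(simple_clean("RT @user I am happy! #great day http://example.com"))
# ['happy', 'great', 'day']
```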

To help you train your Naive Bayes model, you will need to compute a dictionary where the keys are tuples (word, label) and the values are the corresponding frequencies.  Note that the labels we'll use here are 1 for positive and 0 for negative.

You will also use a `lookup` helper function (imported from `utils` above) that takes in the `freqs` dictionary, a word, and a label (1 or 0) and returns the number of times that word and label tuple appears in the collection of tweets.

For example: given the list of tweets `["i am rather excited", "you are rather happy"]` (after cleaning with `process_tweet`) and the label 1 for each tweet, the counting function `count_tweets` will return a dictionary that contains the following key-value pairs:

{
    ("rather", 1): 2,
    ("happi", 1) : 1,
    ("excit", 1) : 1
}

- Notice how for each word in the given string, the same label 1 is assigned to each word.
- Notice how the words "i" and "am" are not saved, since they were removed by `process_tweet` because they are stopwords.
- Notice how the word "rather" appears twice in the list of tweets, and so its count value is 2.
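
The real `lookup` lives in `utils.py`, which is not shown here, so the following is only a sketch of how such a helper could behave. The name `lookup_sketch` is hypothetical and is chosen so it does not shadow the imported function.

```python
def lookup_sketch(freqs, word, label):
    """Return how often (word, label) was counted, or 0 if the pair was never seen."""
    return freqs.get((word, label), 0)

freqs_example = {("rather", 1): 2, ("happi", 1): 1, ("excit", 1): 1}
print(lookup_sketch(freqs_example, "rather", 1))  # 2
print(lookup_sketch(freqs_example, "rather", 0))  # 0
```

Returning 0 for unseen pairs is what makes the smoothed probabilities later in the lab well defined for words that occur in only one class.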

#### Instructions
Create a function `count_tweets` that takes a list of tweets and their labels as input and returns a dictionary. The tweets in `train_x` have already been cleaned with `process_tweet`, so each tweet is a list of tokens.
- The key in the dictionary is a tuple containing the stemmed word and its class label, e.g. ("happi",1).
- The value is the number of times this word appears in the given collection of tweets (an integer).

##### Create `freqs` dictionary
- Given your `count_tweets` function, you can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label)
- The value is the number of times it has appeared.

We will use this dictionary in several parts of this assignment.

<details>
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>You may find it useful to use the `zip` function to match each element in `tweets` with each element in `ys`.</li>
    <li>Remember to check if the key in the dictionary exists before adding that key to the dictionary, or incrementing its value.</li>
    <li>Assume that the `result` dictionary contains clean key-value pairs (you can assume that the values will be integers that can be incremented).  It is good practice to check the datatype before incrementing the value, but it's not required here.</li>
</ul>
</p>
</details>

In [8]:
# UNQ_C1 GRADED FUNCTION: count_tweets

def count_tweets(tweets, ys):
    '''
    Input:
        tweets: a list of processed tweets (each tweet is a list of tokens)
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
           {("word-1", label-1): freq-1, ("word-2", label-2): freq-2, ...}
           i.e.,  result[("word-i", label-i)] := freq-i
           e.g.,  result[("hello", 1)] := 348
    '''
    ### START CODE HERE ###
    result = {}

    # Loop through each tweet and its corresponding label
    for tweet, y in zip(tweets, ys):
        for word in tweet:
            # Create the tuple (word, label)
            pair = (word, y)
            # Update the dictionary
            if pair in result:
                result[pair] += 1
            else:
                result[pair] = 1

    ### END CODE HERE ###

    return result

# Build the freqs dictionary for later uses
freqs = count_tweets(train_x, train_y)

In [9]:
# Testing your function

tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
tweets = [process_tweet(tweet) for tweet in tweets]
ys = [1, 0, 0, 0, 0]
count_tweets(tweets, ys)

# Teaching Assistant Testing Code
# w2_unittest.test_count_tweets(count_tweets)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

**Expected Output**: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
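
The same counting can also be expressed with `collections.Counter` from the standard library. This is only an alternative sketch, not the graded solution; the name `count_tweets_counter` and the toy inputs are assumptions for illustration.

```python
from collections import Counter

def count_tweets_counter(tweets, ys):
    """Count (word, label) pairs; each tweet must already be a list of tokens."""
    counts = Counter()
    for tokens, y in zip(tweets, ys):
        counts.update((word, y) for word in tokens)
    return dict(counts)

# Tokens as process_tweet would produce them for the test tweets above.
processed = [["happi"], ["trick"], ["sad"], ["tire"], ["tire"]]
labels = [1, 0, 0, 0, 0]
print(count_tweets_counter(processed, labels))
# {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
```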

# Part 2: Train your model using Naive Bayes

Naive Bayes is an algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.

#### So how do you train a Naive Bayes classifier?
- The first part of training a Naive Bayes classifier is to identify the number of classes that you have.
- You will create a probability for each class:
    - $P(D_{pos})$ is the probability that the document is positive.
    - $P(D_{neg})$ is the probability that the document is negative.

Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$.  So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
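
To make equations 1 to 3 concrete, here is a small sketch that computes the logprior from a list of labels. The function name `log_prior` and the two toy label lists are assumptions for illustration; the skewed split is hypothetical and is not the lab's data.

```python
import math

def log_prior(labels):
    """logprior = log(P(D_pos)) - log(P(D_neg)), with labels in {0, 1}."""
    d = len(labels)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = d - d_pos
    p_pos, p_neg = d_pos / d, d_neg / d
    return math.log(p_pos) - math.log(p_neg)

balanced = [1] * 4000 + [0] * 4000   # same split as the lab's training set
skewed = [1] * 3000 + [0] * 1000     # hypothetical imbalanced split

print(log_prior(balanced))  # 0.0
print(log_prior(skewed))    # log(3), roughly 1.0986
```

With a balanced set the two log terms cancel, while the skewed set yields a positive logprior that nudges every prediction toward the positive class.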

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.
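
A tiny numeric sketch of equations 4 and 5 follows. The four-word vocabulary and its counts are invented for illustration, and `smoothed_prob` is a hypothetical helper name.

```python
def smoothed_prob(freq, n_class, v):
    """Additively smoothed word probability: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + v)

# Hypothetical positive-class counts for a 4-word vocabulary.
pos_counts = {"happi": 3, "great": 2, "sad": 0, "tire": 0}
n_pos = sum(pos_counts.values())   # N_pos = 5
v = len(pos_counts)                # V = 4

probs = {w: smoothed_prob(f, n_pos, v) for w, f in pos_counts.items()}
print(probs["sad"])          # 1/9 instead of 0
print(sum(probs.values()))   # 1, up to floating-point rounding
```

Adding 1 to each of the $V$ numerators is balanced by adding $V$ to the denominator, so the smoothed probabilities still sum to 1 and no word gets probability 0, which would make the log undefined.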

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
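
The sign of the loglikelihood tells you which class a word leans toward. The sketch below applies equations 4 to 6 to made-up counts; `word_loglikelihood` and all numbers are assumptions for illustration.

```python
import math

def word_loglikelihood(freq_pos, freq_neg, n_pos, n_neg, v):
    """log(P(W_pos) / P(W_neg)) using the smoothed probabilities of eq. 4 and 5."""
    p_w_pos = (freq_pos + 1) / (n_pos + v)
    p_w_neg = (freq_neg + 1) / (n_neg + v)
    return math.log(p_w_pos / p_w_neg)

# Hypothetical counts: N_pos = N_neg = 100 and V = 50.
print(word_loglikelihood(9, 0, 100, 100, 50))  # positive: word leans positive
print(word_loglikelihood(0, 9, 100, 100, 50))  # negative: word leans negative
print(word_loglikelihood(4, 4, 100, 100, 50))  # 0.0: word is neutral
```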

#### Instructions
Given a freqs dictionary, `train_x` (a list of tweets) and a `train_y` (a list of labels for each tweet), implement a naive bayes classifier.

##### Calculate $V$
- You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

##### Calculate $N_{pos}$, and $N_{neg}$
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.

##### Calculate $D$, $D_{pos}$, $D_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$

##### Calculate the logprior
- The logprior is $\log(D_{pos}) - \log(D_{neg})$.

##### Calculate log likelihood
- Finally, you can iterate over each word in the vocabulary, use your `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using equations 4 & 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** We'll use a dictionary to store the log likelihoods for each word.  The key is the word, and the value is the log likelihood of that word.

- You can then compute the loglikelihood: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.

In [22]:
# UNQ_C2 GRADED FUNCTION: train_naive_bayes

def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: a dictionary mapping each word to its log likelihood (equation 6 above)
    '''

    ### START CODE HERE ###
    loglikelihood = {}

    # The number of tweets is len(train_y) and the number of positive tweets is np.sum(train_y); use these to compute the class probabilities and the logprior:
    negative_number = len(train_y) - np.sum(train_y)
    P_positive = np.sum(train_y) / len(train_y)
    P_negative = negative_number / len(train_y)
    logprior = np.log(P_positive) - np.log(P_negative)

    # Calculate the total counts for positive and negative words
    N_pos = N_neg = 0
    for (word, label), freq in freqs.items():
        if label == 1:
            N_pos += freq
        elif label == 0:
            N_neg += freq

    # Get the vocabulary size
    vocab = set([word for (word, _) in freqs.keys()])
    vocab_size = len(vocab)

    # Calculate loglikelihood for each word
    for word in vocab:
        # Get positive and negative frequency of the word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # Calculate probabilities with smoothing
        P_w_pos = (freq_pos + 1) / (N_pos + vocab_size)
        P_w_neg = (freq_neg + 1) / (N_neg + vocab_size)

        # Compute loglikelihood
        loglikelihood[word] = np.log(P_w_pos) - np.log(P_w_neg)

    ### END CODE HERE ###

    return logprior, loglikelihood

In [23]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

# Teaching Assistant Testing Code
# w2_unittest.test_train_naive_bayes(train_naive_bayes, freqs, train_x, train_y)

0.0
9161


**Expected Output**:

0.0

9161 (There seems to be some error with the data)

# Part 3: Test your naive bayes

Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes function by making predictions on some tweets!

#### Implement `naive_bayes_predict`
**Instructions**:
Implement the `naive_bayes_predict` function to make predictions on tweets.
* The function takes in the `tweet`, `logprior`, `loglikelihood`.
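* It returns a score: the sum of the logprior and the loglikelihoods of the words in the tweet. This is a log-odds value rather than a probability; a score above 0 indicates positive sentiment and a score below 0 indicates negative sentiment.
* For each tweet, sum up loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i$$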

#### Note
Note we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets).  This means that the ratio of positive to negative is 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.
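
Because the score $p$ is a sum of log ratios, it can be read as the log-odds of the positive class under the Naive Bayes model. If you want a number between 0 and 1, one option is to pass the score through the logistic function. This conversion is not part of the lab, and `score_to_probability` is a hypothetical helper shown only as a sketch.

```python
import math

def score_to_probability(score):
    """Map a log-odds score to the modelled probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-score))

for s in (0.0, 2.14, -1.31):
    print(f"{s:+.2f} -> {score_to_probability(s):.3f}")
```

A score of 0 maps to 0.5, which is why 0 is the natural decision threshold between the two classes.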

In [16]:
# UNQ_C4 GRADED FUNCTION: naive_bayes_predict

def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a processed tweet (a list of tokens, e.g. the output of process_tweet)
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    ### START CODE HERE ###

    # Initiate p according to the formula
    p = logprior

    # Iterate over the tokens of the (already processed) tweet
    for word in tweet:
        # If the word exists in the loglikelihood dictionary, add its loglikelihood to p
        if word in loglikelihood:
            p += loglikelihood[word]

    ### END CODE HERE ###

    return p

In [17]:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# Experiment with your own tweet.
my_tweet = "She smiled"
p = naive_bayes_predict(process_tweet(my_tweet), logprior, loglikelihood)
print('The expected output is', p)

# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
              'great', 'great great', 'great great great',
              'great great great great']:
    p = naive_bayes_predict(process_tweet(tweet), logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')

# Teaching Assistant Testing Code
# w2_unittest.test_naive_bayes_predict(naive_bayes_predict)

The expected output is 1.5574928203010936
I am happy -> 2.14
I am bad -> -1.31
this movie should have been great. -> 2.12
great -> 2.13
great great -> 4.26
great great great -> 6.39
great great great great -> 8.52


**Expected Output**:
- The expected output is around 1.55
- The sentiment is positive.

**Expected Output**:
- I am happy -> 2.14
- I am bad -> -1.31
- this movie should have been great. -> 2.12
- great -> 2.13
- great great -> 4.26
- great great great -> 6.39
- great great great great -> 8.52

In [18]:
# Test with your own tweet - feel free to modify `my_tweet`
my_tweet = 'I am happy because I am learning :)'

print("my tweet: ", my_tweet, ", \n   score: ",
      naive_bayes_predict(process_tweet(my_tweet), logprior, loglikelihood))

# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('

print("my tweet: ", my_tweet, ", \n   score: ",
      naive_bayes_predict(process_tweet(my_tweet), logprior, loglikelihood))

my tweet:  I am happy because I am learning :) , 
   score:  9.570227756170972
my tweet:  you are bad :( , 
   score:  -8.837962482271397


#### Implement test_naive_bayes
**Instructions**:
* Implement `test_naive_bayes` to check the accuracy of your predictions.
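* The function takes in your `test_x`, `test_y`, `logprior`, and `loglikelihood`.
* It returns the accuracy of your model.
* First, use the `naive_bayes_predict` function to make predictions for each tweet in `test_x`.
* Then, treat a score above 0 as a positive prediction (1) and anything else as negative (0), and compare each prediction with the corresponding label in `test_y`.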

In [19]:
# UNQ_C6 GRADED FUNCTION: test_naive_bayes

def test_naive_bayes(test_x, test_y, logprior, loglikelihood, naive_bayes_predict=naive_bayes_predict):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    ### START CODE HERE ###
    # Initialize the count of correct predictions
    correct_predictions = 0

    # Loop through the tweets and their labels
    for tweet, label in zip(test_x, test_y):
        # Make a prediction using naive_bayes_predict
        prediction = naive_bayes_predict(tweet, logprior, loglikelihood)

        # Convert the prediction into a binary label (1 for positive, 0 for negative)
        predicted_label = 1 if prediction > 0 else 0

        # Check if the prediction matches the actual label
        if predicted_label == label:
            correct_predictions += 1

    # Calculate accuracy
    accuracy = correct_predictions / len(test_y)
    ### END CODE HERE ###

    return accuracy

In [20]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

# Teaching Assistant Testing Code
# w2_unittest.test_test_naive_bayes(test_naive_bayes, test_x, test_y)

Naive Bayes accuracy = 0.9955


**Expected Accuracy**:

`Naive Bayes accuracy = 0.9955`

In [21]:
#@title Some error analysis on what the algorithm gets wrong

# Some error analysis done for you
print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(x).encode('ascii', 'ignore')))

Truth Predicted Tweet
1	0.00	b'truli later move know queen bee upward bound movingonup'
1	0.00	b'new report talk burn calori cold work harder warm feel better weather :p'
1	0.00	b'harri niall 94 harri born ik stupid wanna chang :d'
1	0.00	b'park get sunlight'
1	0.00	b'uff itna miss karhi thi ap :p'
0	1.00	b'hello info possibl interest jonatha close join beti :( great'
0	1.00	b'u prob fun david'
0	1.00	b'pat jay'
0	1.00	b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'
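
In the listing above, rows with Truth 1 and Predicted 0 are false negatives, and rows with Truth 0 and Predicted 1 are false positives. A sketch of how such outcomes could be tallied is shown below; `confusion_counts` and the toy labels and scores are assumptions for illustration and are not taken from the lab's test set.

```python
def confusion_counts(y_true, scores):
    """Tally outcomes, treating a score above 0 as a positive prediction."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for y, score in zip(y_true, scores):
        predicted = 1 if score > 0 else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

# Hypothetical labels and scores.
toy_labels = [1, 1, 0, 0, 1]
toy_scores = [2.1, -0.4, -1.3, 0.7, 3.0]

counts = confusion_counts(toy_labels, toy_scores)
accuracy = (counts["tp"] + counts["tn"]) / len(toy_labels)
print(counts, accuracy)
```

Separating false positives from false negatives shows whether the model's mistakes lean toward one class, which a single accuracy number hides.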


Congratulations on finishing the lab. I hope you've learned a lot!