# Homework 3
## Warmup
Given the following short movie reviews, each labeled with a genre, either comedy or action:
1. fun, couple, love, love **comedy**
2. fast, furious, shoot **action**
3. couple, fly, fast, fun, fun **comedy**
4. furious, shoot, shoot, fun **action**
5. fly, fast, shoot, love **action**

and a new document D:

    fast, couple, shoot, fly

compute the most likely class for D. Assume a naive Bayes classifier and use add-1 smoothing for the likelihoods.

## Warmup Response
TODO

## Assignment
Build a naive Bayes sentiment classifier that will assign reviews of an application as either **positive**, **neutral**, or **negative**.
- You will need to do some basic preprocessing on the documents (normalization, etc).
- Do not use a stop word list.
- Ignore any Out-Of-Vocabulary (OOV) terms when classifying.

You are provided a small set of pre-classified training data to build your model. The data is formatted such that each line of text contains a document (the title of a review). The first token of each line will be the classification of that review, either **POS**, **NEU**, or **NEG**. Below is a sample document:
    
    `POS The program was quite helpful with creating websites.`

An example output of your system may look something like this:

    ```
    The program does what it should do. : POSITIVE
    It functions adequately. : NEUTRAL
    The program sucks. : NEGATIVE
    This thing runs like a pregnant cow. : NEGATIVE
    It was a little slow, but not too bad. : NEUTRAL
    Slow. Slow. SLOW! : NEGATIVE
    Great software! : POSITIVE
    Worth the trouble to install. : NEUTRAL
    ```

## Report Instructions
Once the model has been built, feed in the provided test documents and write a report detailing your results. In the report, address the following:
- How accurate was the classifier? What was the Precision and Recall? The F-measure?
- Choose one incorrectly classified document.
    - Manually calculate the sentiment probabilities for the document (you can use your classifier to generate the likelihoods and prior probabilities, but do the classifying on paper)
    - What is the difference of the probability sums of the correct class and the class assigned to the system?
    - Identify the term or terms that caused the system to misclassify the document.
    - Build a document (or documents) to add to the training set that would allow the system to correctly classify the document.
        - Show the mathematical reasoning for your choice of words in the document.
        - Rerun the tests with the additional information.
        - Did adding the additional information change any other document classification? If so, how? Did it improve the overall accuracy of your system or make it worse?
    - Add the MPQA Subjectivity Cues Lexicon to your system and run the tests again and report the results.
        - Choose a document that was classified differently after adding the MPQA Subjectivity Cues Lexicon. Was it correctly or incorrectly classified? Discuss why.
    - Finally use the provided collection of Amazon reviews from 2007 to train your classifier. Run the associated tests and report the Precision, Recall, and F-measure.
    - Briefly discuss what you learned from this assignment, what you liked or disliked about the assignment and, optionally, anything you would like to see changed or added to improve the assignment.

In [134]:
## imports
import pandas as pd
import nltk
import re
import contractions
import unidecode
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import math
import numpy as np

nltk.download([
"names",
"stopwords",
"state_union",
"twitter_samples",
"movie_reviews",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt",
])

lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()
encoder = LabelEncoder()

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up

In [111]:
## Read in the training data
training = []
strings: list[str] = []
filename = "trainingSet.txt"
with open(filename, "r") as f:
    for line in f:
        tokens = line.split(" ")
        classification = tokens.pop(0)
        x = str(line[line.index(" ") + 1:-1])
        x = contractions.fix(x) # expand contractions
        x = unidecode.unidecode(x) # remove accents
        x = ' '.join(x.strip().split()) # remove extra whitespace
        # # could do lemmatization using the WordNetLemmatizer
        # x = ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
        strings.append(x)
        training +=[(classification, x)]
text = '\n'.join(strings)
data = pd.DataFrame(training, columns=['cls', 'text'])
data['cls'] = data['cls'].astype("category")
data['encoded'] = encoder.fit_transform(data['cls'])
data['encoded'] = data['encoded'].astype("category")
data

Unnamed: 0,cls,text,encoded
0,POS,The program was quite helpful with creating we...,2
1,POS,"I really, really, really liked the cute icons!",2
2,NEU,"The program did its job, but nothing special.",1
3,NEG,Why did they even bother releasing this software?,0
4,NEG,This program did not do anything it was promis...,0
5,NEU,The software was adequate.,1
6,NEU,"I have used better programs, I have used worse.",1
7,POS,The pages it generated were just what I needed.,2
8,POS,The software was intuitive and easy to use,2
9,POS,The program runs well on my laptop,2


In [105]:
# ## Build Frequency Distribution
# fd = nltk.FreqDist(words)
# print(fd)
# print(fd.most_common(3))
# print(fd.tabulate(3))

## Better method of making frequency distribution
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens]
n_text = nltk.Text(tokens)
fd = n_text.vocab()
print(fd.tabulate(7))
n_text.concordance("slow")

       .      the        i        , software       it      was 
      13       11        8        8        6        6        5 
None
Displaying 1 of 1 matches:
rogram runs well on my laptop it was slow , buggy , and painful to use . this 


In [106]:
for string in strings:
    print(sia.polarity_scores(string))
# at a glance this seems to be performing poorly

{'neg': 0.0, 'neu': 0.519, 'pos': 0.481, 'compound': 0.6764}
{'neg': 0.0, 'neu': 0.41, 'pos': 0.59, 'compound': 0.8015}
{'neg': 0.292, 'neu': 0.708, 'pos': 0.0, 'compound': -0.438}
{'neg': 0.255, 'neu': 0.745, 'pos': 0.0, 'compound': -0.34}
{'neg': 0.0, 'neu': 0.8, 'pos': 0.2, 'compound': 0.3612}
{'neg': 0.0, 'neu': 0.612, 'pos': 0.388, 'compound': 0.2263}
{'neg': 0.282, 'neu': 0.455, 'pos': 0.264, 'compound': -0.0516}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.707, 'pos': 0.293, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.741, 'pos': 0.259, 'compound': 0.2732}
{'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.4404}
{'neg': 0.24, 'neu': 0.76, 'pos': 0.0, 'compound': -0.6249}
{'neg': 0.0, 'neu': 0.698, 'pos': 0.302, 'compound': 0.0772}
{'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.6369}
{'neg': 0.123, 'neu': 0.877, 'pos': 0.0, 'compound': -0.1027}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'com

## Try building Naive Bayes Classifier
Following the tutorial [Building Naive Bayes Classifier from Scratch to Perform Sentiment Analysis](https://www.analyticsvidhya.com/blog/2022/03/building-naive-bayes-classifier-from-scratch-to-perform-sentiment-analysis/)

In [137]:
w_tokenizer = nltk.tokenize.NLTKWordTokenizer()

In [138]:
# make the train_test_split
texts = data['text'].values
labels = data['encoded'].values
train_x, test_x, train_y, test_y = train_test_split(texts, 
                                                    labels, 
                                                    stratify=labels)

In [139]:
vec = CountVectorizer(max_features = 5000)
x = vec.fit_transform(train_x)
vocab = vec.get_feature_names_out()
x = x.toarray()
word_counts = {}
for l in range(3):
    word_counts[l] = defaultdict(lambda: 0)
for i in range(x.shape[0]):
    l = train_y[i]
    for j in range(len(vocab)):
        word_counts[l][vocab[j]] += x[i][j]

In [179]:
"""
so the word counts contains the count of words that occured in
the sentences. The only difference here is that it is stratified
based on the label. This means it has 3 separate sets of counts. 
Note that each set of counts still contains the words from the
other sets just with a value of 0.

the first is the encoded class. The second is the word to lookup.
"""
word_counts[2]['all']

1

In [140]:
def laplace_smoothing(n_label_items, vocab, word_counts, word, text_label):
    a = word_counts[text_label][word] + 1
    b = n_label_items[text_label] + len(vocab)
    return math.log(a/b)


def group_by_label(x, y, labels):
    data = {}
    for l in labels:
        data[l] = x[np.where(y == l)]
    return data
 

def fit(x, y, labels):
    n_label_items = {}
    log_label_priors = {}
    n = len(x)
    grouped_data = group_by_label(x, y, labels)
    for l, data in grouped_data.items():
        n_label_items[l] = len(data)
        log_label_priors[l] = math.log(n_label_items[l] / n)
    return n_label_items, log_label_priors

def predict(n_label_items, vocab, word_counts, log_label_priors, labels, x):
    result = []
    for text in x:
        label_scores = {l: log_label_priors[l] for l in labels}
        words = set(w_tokenizer.tokenize(text))
        for word in words:
            if word not in vocab: continue
            for l in labels:
                log_w_given_l = laplace_smoothing(n_label_items, vocab, word_counts, word, l)
                label_scores[l] += log_w_given_l
        result.append(max(label_scores, key=label_scores.get))
    return result

In [147]:
labels = [0,1,2]
n_label_items, log_label_priors = fit(train_x,train_y,labels)
pred = predict(n_label_items, vocab, word_counts, log_label_priors, labels, test_x)
print("Accuracy of prediction on test set : ", accuracy_score(test_y,pred))
print(pred)
print(test_y)

Accuracy of prediction on test set :  0.2
[1, 2, 0, 0, 2]
[2, 0, 0, 1, 1]
Categories (3, int64): [0, 1, 2]


Notes

    The program does what it should do. : POSITIVE
    It functions adequately. : NEUTRAL
    The program sucks. : NEGATIVE
    This thing runs like a pregnant cow. : NEGATIVE
    It was a little slow, but not too bad. : NEUTRAL
    Slow. Slow. SLOW! : NEGATIVE
    Great software! : POSITIVE
    Worth the trouble to install. : NEUTRAL

ignore terms not in vocab

In [148]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [202]:
train, test = train_test_split(data, 
                               test_size=0.2, 
                               random_state=42, 
                               shuffle=True,
                               stratify=data['encoded'])

In [203]:
def gen_word_counts_with_laplace(dataframe: pd.DataFrame):
    # get a list of all of the word keys
    all_tokens = nltk.word_tokenize('\n'.join(dataframe['text']))
    all_text = nltk.Text([token.lower() for token in all_tokens])
    all_freq_dist = all_text.vocab()
    all_keys = all_freq_dist.keys()
    word_counts = [None] * len(dataframe['encoded'].unique())
    # iterate through each of the label classes
    for cls in dataframe['encoded'].unique():
        # for each label class compute the word frequency distribution
        text = '\n'.join(dataframe[dataframe['encoded'] == cls]['text'])
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        n_text = nltk.Text(tokens)
        fd = n_text.vocab()
        dictionary = {}
        # use the list of all keys and frequency distribution to make
        # a frequency distribution for the class with all keys
        # and laplacian smoothing
        for key in all_keys:
            count = fd[key] + 1 # add 1 for laplace smoothing
            dictionary[key] = count
        # add the distribution to the list for output
        word_counts[cls] = dictionary
    return word_counts
word_counts = gen_word_counts_with_laplace(train)
word_counts[2]['all']

1

In [204]:
def fit(dataframe: pd.DataFrame):
    num_entries_per_cls = [None] * len(dataframe['encoded'].unique())
    log_label_priors = [None] * len(dataframe['encoded'].unique())
    n = len(x)
    grouped = [None] * len(dataframe['encoded'].unique())
    for cls in dataframe['encoded'].unique():
        grouped[cls] = list(dataframe['text'][dataframe['encoded'] == cls])
        num_entries_per_cls[cls] = len(grouped[cls])
        log_label_priors[cls] = math.log(num_entries_per_cls[cls] / n)
    return(num_entries_per_cls, log_label_priors)

fit(train)

([5, 5, 4], [-0.9555114450274363, -0.9555114450274363, -1.1786549963416462])