# Homework 3
## Warmup
Given the following short movie reviews, each labeled with a genre, either comedy or action:
1. fun, couple, love, love **comedy**
2. fast, furious, shoot **action**
3. couple, fly, fast, fun, fun **comedy**
4. furious, shoot, shoot, fun **action**
5. fly, fast, shoot, love **action**

and a new document D:

    fast, couple, shoot, fly

compute the most likely class for D. Assume a naive Bayes classifier and use add-1 smoothing for the likelihoods.

## Warmup Response
**Total Vocab**  
- fun: 4
- couple: 2
- love: 3
- fast: 3
- furious: 2
- shoot: 4
- fly: 2

**Comedy Vocab**  
- fun: 3 + 1 = 4
- couple: 2 + 1 = 3
- love: 2 + 1 = 3
- fast: 1 + 1 = 2
- furious: 0 + 1 = 1
- shoot: 0 + 1 = 1
- fly: 1 + 1 = 2

**Action Vocab**  
- fun: 1 + 1 = 2
- couple: 0 + 1 = 1
- love: 1 + 1 = 2
- fast: 2 + 1 = 3
- furious: 2 + 1 = 3
- shoot: 4 + 1 = 5
- fly: 1 + 1 = 2

**Prediction**
- the vocab has 7 words in it.
- the comedy class has (4+3+3+2+1+1+2)=16 entries in it.
- the action class has (2+1+2+3+3+5+2)=18 entries in it.
- query
    - fast
        - comedy: log(2/(16+7))=-1.06
        - action: log(3/(18+7))=-0.92
    - couple
        - comedy: log(3/(16+7))=-0.88
        - action: log(1/(18+7))=-1.40
    - shoot
        - comedy: log(1/(16+7))=-1.36
        - action: log(5/(18+7))=-0.70
    - fly
        - comedy: log(2/(16+7))=-1.06
        - action: log(2/(18+7))=-1.10
- sum
    - comedy = (-1.06-0.88-1.36-1.06)=-4.37
    - action = (-0.92-1.40-0.70-1.10)=-4.11
- result
    - because action > comedy this document is likely action.


## Assignment
Build a naive Bayes sentiment classifier that will assign reviews of an application as either **positive**, **neutral**, or **negative**.
- You will need to do some basic preprocessing on the documents (normalization, etc).
- Do not use a stop word list.
- Ignore any Out-Of-Vocabulary (OOV) terms when classifying.

You are provided a small set of pre-classified training data to build your model. The data is formatted such that each line of text contains a document (the title of a review). The first token of each line will be the classification of that review, either **POS**, **NEU**, or **NEG**. Below is a sample document:
    
    `POS The program was quite helpful with creating websites.`

An example output of your system may look something like this:

    ```
    The program does what it should do. : POSITIVE
    It functions adequately. : NEUTRAL
    The program sucks. : NEGATIVE
    This thing runs like a pregnant cow. : NEGATIVE
    It was a little slow, but not too bad. : NEUTRAL
    Slow. Slow. SLOW! : NEGATIVE
    Great software! : POSITIVE
    Worth the trouble to install. : NEUTRAL
    ```

## Report Instructions
Once the model has been built, feed in the provided test documents and write a report detailing your results. In the report, address the following:
- How accurate was the classifier? What was the Precision and Recall? The F-measure?
- Choose one incorrectly classified document.
    - Manually calculate the sentiment probabilities for the document (you can use your classifier to generate the likelihoods and prior probabilities, but do the classifying on paper)
    - What is the difference of the probability sums of the correct class and the class assigned to the system?
    - Identify the term or terms that caused the system to misclassify the document.
    - Build a document (or documents) to add to the training set that would allow the system to correctly classify the document.
        - Show the mathematical reasoning for your choice of words in the document.
        - Rerun the tests with the additional information.
        - Did adding the additional information change any other document classification? If so, how? Did it improve the overall accuracy of your system or make it worse?
    - Add the MPQA Subjectivity Cues Lexicon to your system and run the tests again and report the results.
        - Choose a document that was classified differently after adding the MPQA Subjectivity Cues Lexicon. Was it correctly or incorrectly classified? Discuss why.
    - Finally use the provided collection of Amazon reviews from 2007 to train your classifier. Run the associated tests and report the Precision, Recall, and F-measure.
    - Briefly discuss what you learned from this assignment, what you liked or disliked about the assignment and, optionally, anything you would like to see changed or added to improve the assignment.

In [2]:
## imports
import pandas as pd
import nltk
import re
import contractions
import unidecode
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import f1_score, recall_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import math
import numpy as np
import copy

nltk.download([
"names",
"stopwords",
"state_union",
"twitter_samples",
"movie_reviews",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt",
])

lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()
encoder = LabelEncoder()

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Grant\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up

In [3]:
def score(y_pred, y_true):
    print("%-15s: %.3f" % ("[ACCURACY]", accuracy_score(y_pred, y_true)))
    print("%-15s: %.3f" % ("[PRECISION]", precision_score(y_pred, y_true, average='weighted')))
    print("%-15s: %.3f" % ("[RECALL]", recall_score(y_pred, y_true, average='weighted')))
    print("%-15s: %.3f" % ("[F1-SCORE]", f1_score(y_pred, y_true, average='weighted')))    

In [4]:
def clean_string(x: str) -> list[str]:
    x = contractions.fix(x) # expand contractions
    x = unidecode.unidecode(x) # remove accents
    x = ' '.join(x.strip().split()) # remove extra whitespace
    x = re.sub(r'[\.\,\?\\\/\<\>\;\:\[\]\{\}]', r'', x) # punctuation
    # # could do lemmatization using the WordNetLemmatizer
    # x = ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
    return x

In [5]:
## Read in the training data
training = []
strings: list[str] = []
filename = "trainingSet.txt"
with open(filename, "r") as f:
    for line in f:
        tokens = line.split(" ")
        classification = tokens.pop(0)
        x = clean_string(str(line[line.index(" ") + 1:-1]))
        strings.append(x)
        training +=[(classification, x)]
text = '\n'.join(strings)
data = pd.DataFrame(training, columns=['cls', 'text'])
data['cls'] = data['cls'].astype("category")
data['encoded'] = encoder.fit_transform(data['cls'])
data['encoded'] = data['encoded'].astype("category")
data

Unnamed: 0,cls,text,encoded
0,POS,The program was quite helpful with creating we...,2
1,POS,I really really really liked the cute icons!,2
2,NEU,The program did its job but nothing special,1
3,NEG,Why did they even bother releasing this software,0
4,NEG,This program did not do anything it was promis...,0
5,NEU,The software was adequate,1
6,NEU,I have used better programs I have used worse,1
7,POS,The pages it generated were just what I needed,2
8,POS,The software was intuitive and easy to use,2
9,POS,The program runs well on my laptop,2


In [6]:
## Read in the testing data
training = []
strings: list[str] = []
filename = "testSet.txt"
with open(filename, "r") as f:
    for line in f:
        tokens = line.split(" ")
        classification = tokens.pop(0)
        x = clean_string(str(line[line.index(" ") + 1:-1]))
        strings.append(x)
        training +=[(classification, x)]
text = '\n'.join(strings)
test_data = pd.DataFrame(training, columns=['cls', 'text'])
test_data['cls'] = test_data['cls'].astype("category")
test_data['encoded'] = encoder.transform(test_data['cls'])
test_data['encoded'] = test_data['encoded'].astype("category")
test_data

Unnamed: 0,cls,text,encoded
0,POS,The program does what it should do,2
1,NEU,It functions adequately,1
2,NEG,The program sucks,0
3,NEG,This thing runs like a pregnant cow,0
4,NEU,It was a little slow but not too bad,1
5,NEG,Slow Slow SLOW!,0
6,POS,Great software!,2
7,NEU,Worth the trouble to install,1


In [7]:
# read in the MPQA Subjectivity Ques Lexicon
filename = "lexicon\subjclueslen1-HLTEMNLP05.tff"
line_regex = r"type=(\w+)\slen=(\d+)\sword1=([\w-]+)\spos1=(\w+)\sstemmed1=(\w)(?:\spolarity=(?:\w+))*(?:\sm)*\spriorpolarity=(\w+)"
mpqa = pd.DataFrame(columns=['type', 'len', 'word1', 'pos1', 'stemmed', 'priorpolarity'])
with open(filename, 'r') as f:
    for line in f:
        # print(line)
        result = re.search(line_regex, line)
        mpqa = pd.concat([mpqa, pd.DataFrame.from_records([{
            "type": result.group(1),
            "len": result.group(2),
            "word1": result.group(3),
            "pos1": result.group(4),
            "stemmed": result.group(5),
            "priorpolarity": result.group(6)
        }])]).reset_index(drop=True)
mpqa

Unnamed: 0,type,len,word1,pos1,stemmed,priorpolarity
0,weaksubj,1,abandoned,adj,n,negative
1,weaksubj,1,abandonment,noun,n,negative
2,weaksubj,1,abandon,verb,y,negative
3,strongsubj,1,abase,verb,y,negative
4,strongsubj,1,abasement,anypos,y,negative
...,...,...,...,...,...,...
8217,strongsubj,1,zealot,noun,n,negative
8218,strongsubj,1,zealous,adj,n,negative
8219,strongsubj,1,zealously,anypos,n,negative
8220,strongsubj,1,zenith,noun,n,positive


In [8]:
# NLTK polarity scores
for string in strings:
    print(sia.polarity_scores(string))
# at a glance this seems to be performing poorly

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.556, 'neu': 0.444, 'pos': 0.0, 'compound': -0.3612}
{'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.3612}
{'neg': 0.0, 'neu': 0.65, 'pos': 0.35, 'compound': 0.5824}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.185, 'pos': 0.815, 'compound': 0.6588}
{'neg': 0.355, 'neu': 0.395, 'pos': 0.25, 'compound': -0.2023}


## Try building Naive Bayes Classifier
Following the tutorial [Building Naive Bayes Classifier from Scratch to Perform Sentiment Analysis](https://www.analyticsvidhya.com/blog/2022/03/building-naive-bayes-classifier-from-scratch-to-perform-sentiment-analysis/)

In [9]:
w_tokenizer = nltk.tokenize.NLTKWordTokenizer()

In [10]:
# make the train_test_split
texts = data['text'].values
labels = data['encoded'].values
train_x, test_x, train_y, test_y = train_test_split(texts, 
                                                    labels, 
                                                    stratify=labels)

In [11]:
vec = CountVectorizer(max_features = 5000)
x = vec.fit_transform(train_x)
vocab = vec.get_feature_names_out()
x = x.toarray()
word_counts = {}
for l in range(3):
    word_counts[l] = defaultdict(lambda: 0)
for i in range(x.shape[0]):
    l = train_y[i]
    for j in range(len(vocab)):
        word_counts[l][vocab[j]] += x[i][j]

In [12]:
"""
so the word counts contains the count of words that occured in
the sentences. The only difference here is that it is stratified
based on the label. This means it has 3 separate sets of counts. 
Note that each set of counts still contains the words from the
other sets just with a value of 0.

the first is the encoded class. The second is the word to lookup.
"""
word_counts[2]['all']

0

In [13]:
def laplace_smoothing(n_label_items, vocab, word_counts, word, text_label):
    a = word_counts[text_label][word] + 1
    b = n_label_items[text_label] + len(vocab)
    return math.log(a/b)


def group_by_label(x, y, labels):
    data = {}
    for l in labels:
        data[l] = x[np.where(y == l)]
    return data
 

def fit(x, y, labels):
    n_label_items = {}
    log_label_priors = {}
    n = len(x)
    grouped_data = group_by_label(x, y, labels)
    for l, data in grouped_data.items():
        n_label_items[l] = len(data)
        log_label_priors[l] = math.log(n_label_items[l] / n)
    return n_label_items, log_label_priors

def predict(n_label_items, vocab, word_counts, log_label_priors, labels, x):
    result = []
    for text in x:
        label_scores = {l: log_label_priors[l] for l in labels}
        # print(label_scores)
        words = set(w_tokenizer.tokenize(text))
        for word in words:
            if word not in vocab: continue
            for l in labels:
                log_w_given_l = laplace_smoothing(n_label_items, vocab, word_counts, word, l)
                label_scores[l] += log_w_given_l
        result.append(max(label_scores, key=label_scores.get))
    return result

In [14]:
labels = [0,1,2]
n_label_items, log_label_priors = fit(train_x,train_y,labels)
print(n_label_items)
pred = predict(n_label_items, vocab, word_counts, log_label_priors, labels, test_x)
score(pred, test_y)
print(pred)
print(test_y)

{0: 4, 1: 5, 2: 5}
[ACCURACY]     : 0.200
[PRECISION]    : 0.300
[RECALL]       : 0.200
[F1-SCORE]     : 0.240
[1, 1, 2, 2, 1]
[1, 0, 1, 0, 2]
Categories (3, int64): [0, 1, 2]


  _warn_prf(average, modifier, msg_start, len(result))


## Re-Implemment the Naive Bayes Model
Now that I worked through the tutorial in the above code, I was able to rework things to use NLTK and pandas directly. This helped me understand the code better as well.

In [15]:
# train, test = train_test_split(data, 
#                                test_size=0.2, 
#                                shuffle=True,
#                                stratify=data['encoded'])
train = copy.deepcopy(data)
test = copy.deepcopy(test_data)
unique_labels = data['encoded'].unique()

In [16]:
def gen_word_counts_with_laplace(dataframe: pd.DataFrame):
    # get a list of all of the word keys
    all_tokens = nltk.word_tokenize('\n'.join(dataframe['text']))
    all_text = nltk.Text([token.lower() for token in all_tokens])
    all_freq_dist = all_text.vocab()
    all_keys = all_freq_dist.keys()
    word_counts = {}
    # iterate through each of the label classes
    for cls in unique_labels:
        # for each label class compute the word frequency distribution
        text = '\n'.join(dataframe[dataframe['encoded'] == cls]['text'])
        # print("CLS [%d]: %s" % (cls, text))
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        n_text = nltk.Text(tokens)
        fd = n_text.vocab()
        dictionary = {}
        # use the list of all keys and frequency distribution to make
        # a frequency distribution for the class with all keys
        # and laplacian smoothing
        for key in all_keys:
            count = fd[key] + 1 # add 1 for laplace smoothing
            dictionary[key] = count
        # add the distribution to the list for output
        word_counts[cls] = dictionary
    return word_counts, all_keys

def fit(dataframe: pd.DataFrame):
    num_entries_per_cls = {}
    log_label_priors = {}
    n = len(list(dataframe['text']))
    grouped = [None] * len(unique_labels)
    for cls in dataframe['encoded'].unique():
        grouped[cls] = list(dataframe['text'][dataframe['encoded'] == cls])
        # num_entries_per_cls[cls] = len(grouped[cls])
        num_entries_per_cls[cls] = sum(word_counts[cls].values())
        log_label_priors[cls] = math.log(num_entries_per_cls[cls] / n)
    return(num_entries_per_cls, log_label_priors)

def predict(x: list[str], num_entries_per_cls, log_label_priors, vocab):
    result = []
    for text in x:
        label_scores = copy.deepcopy(log_label_priors)
        text = clean_string(text)
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        # fd = nltk.Text(tokens).vocab()
        for word in tokens:
            if word not in vocab: 
                # print("[%s] missing from vocabulary" % word)
                continue
            for cls in unique_labels:
                count = word_counts[cls][word]
                n_vocab_per_cls = num_entries_per_cls[cls] + len(vocab)
                label_scores[cls] += math.log(count / n_vocab_per_cls)
        # print(label_scores)
        print("%50s: \tNEG: %.2f, \tNEU: %.2f, \tPOS: %.2f" % (text, label_scores[0], label_scores[1], label_scores[2]))
        result.append(max(label_scores, key=lambda key: label_scores[key])) # ISSUE WAS HERE DUE TO USING np.argmax on a dict.
    return result

word_counts, vocab = gen_word_counts_with_laplace(train)
num_entries_per_cls, log_label_priors = fit(train)
print(num_entries_per_cls)
print(log_label_priors)
y_true = list(test['encoded'])
y_pred = predict(test['text'], num_entries_per_cls, log_label_priors, vocab)
print(y_true)
print(y_pred)
score(y_pred, y_true)

{2: 135, 1: 150, 0: 144}
{2: 1.9608357992719891, 1: 2.0661963149298153, 0: 2.02537432040956}
                The program does what it should do: 	NEG: -20.95, 	NEU: -20.93, 	POS: -20.82
                           It functions adequately: 	NEG: -2.33, 	NEU: -1.80, 	POS: -2.76
                                 The program sucks: 	NEG: -6.80, 	NEU: -6.81, 	POS: -5.97
               This thing runs like a pregnant cow: 	NEG: -7.49, 	NEU: -8.19, 	POS: -8.17
              It was a little slow but not too bad: 	NEG: -26.40, 	NEU: -24.10, 	POS: -28.72
                                   Slow Slow SLOW!: 	NEG: -17.70, 	NEU: -17.76, 	POS: -18.99
                                   Great software!: 	NEG: -7.78, 	NEU: -7.50, 	POS: -7.48
                      Worth the trouble to install: 	NEG: -10.86, 	NEU: -12.98, 	POS: -11.79
[2, 1, 0, 0, 1, 0, 2, 1]
[2, 1, 2, 0, 1, 0, 2, 0]
[ACCURACY]     : 0.750
[PRECISION]    : 0.792
[RECALL]       : 0.750
[F1-SCORE]     : 0.750


In [17]:
# # compute the probabilities per class per word. Built into predict
# log_word_counts = copy.deepcopy(word_counts)
# for cls in range(len(word_counts)):
#     vocab_for_cls = num_entries_per_cls[cls] + len(vocab)
#     print("[%d] log denominator for class [%d]" % (vocab_for_cls, cls))
#     for word in log_word_counts[cls]:
#         count = word_counts[cls][word]
#         # print("Word \"%s\" in cls [%d] has count: %f" % (word, cls, count))
#         log_word_counts[cls][word] = math.log(count / vocab_for_cls)
# # log_word_counts[2]['helpful']
# # log_word_counts

## Analysis
~~At this point I have run both models. They both have the same issue where there seems to be not enough training data to accuratly predict the outcome. When a random seed is not set it can predict one or two correctly every now and then. Most of the time though, the accuracy, precision, recall, and F1 scores were all 0. Occasionally the accuracy and recall would rise to 40%. This would raise the precision to 90% and the F1 score to 45.3%~~

I left my above note for posterity. During further manual testing I determined that I was performing a numpy maximum argument check on a dictionary. This is slightly undefined and resulted in running the maximum argument check against the keys of the dictionary instead of the values. This is why the model was almost always predicting a value of $0$. After fixing this line the performance of the model increased to a $62.5%$ accuracy. The F1-Score was $0.648$ with the precision and recall being $0.750$ and $0.625$ respectively. Another issue that I fixed with the model was that the number of entries per class variable was recording the number of documents trained for each class instead of the total number of tokens for each class.

## Incorrectly Classified Document Example:
`NEU It was a little slow, but not too bad.` was supposed to be classified as Neutral, instead it was classified as Negative. To evaluate this manually we need to perform a breakdown of each word.

In [18]:
sent = "It was a little slow, but not too bad."
sent = clean_string(sent)
tkns = nltk.word_tokenize(sent)
tkns = nltk.Text([token.lower() for token in tkns])
fd = tkns.vocab()
fd.tabulate()

    it    was      a little   slow    but    not    too    bad 
     1      1      1      1      1      1      1      1      1 


In [19]:
for word in fd.keys():
    if word not in vocab: 
        print("[\'%10s\']\tmissing from vocabulary" % word)
        continue
    print("[\'%10s\']\tNEG: %d, NEU: %d, POS: %d" % (word, word_counts[0][word], word_counts[1][word], word_counts[2][word]))

print("\nNum Per Cls:\tNEG: %d, NEU: %d, POS: %d" % (num_entries_per_cls[0], num_entries_per_cls[1], num_entries_per_cls[2]))
print("vocab size: %d" % (len(vocab)))

['        it']	NEG: 3, NEU: 5, POS: 2
['       was']	NEG: 3, NEU: 3, POS: 3
['         a']	missing from vocabulary
['    little']	missing from vocabulary
['      slow']	NEG: 2, NEU: 2, POS: 1
['       but']	NEG: 2, NEU: 3, POS: 1
['       not']	NEG: 2, NEU: 3, POS: 1
['       too']	NEG: 1, NEU: 3, POS: 1
['       bad']	missing from vocabulary

Num Per Cls:	NEG: 144, NEU: 150, POS: 135
vocab size: 89


ignoring the missing words, we are left with the following 'it', 'was', 'slow', 'but', 'not'. According to the frequency distribution each of these words occurs once in the input sentence. The next step is to manually compute the log probability of each word for each class.
- it
    - NEG: $log(\frac{3}{141+86})=-1.87890$
    - NEU: $log(\frac{4}{137+86})=-1.74624$
    - POS: $log(\frac{2}{132+86})=-2.03742$
- was
    - NEG: $log(\frac{3}{141+86})=-1.87890$
    - NEU: $log(\frac{2}{137+86})=-2.04727$
    - POS: $log(\frac{3}{132+86})=-1.86133$
- slow
    - NEG: $log(\frac{2}{141+86})=-2.05499$
    - NEU: $log(\frac{1}{137+86})=-2.34830$
    - POS: $log(\frac{1}{132+86})=-2.33845$
- but
    - NEG: $log(\frac{2}{141+86})=-2.05499$
    - NEU: $log(\frac{2}{137+86})=-2.04727$
    - POS: $log(\frac{1}{132+86})=-2.33845$
- not
    - NEG: $log(\frac{2}{141+86})=-2.05499$
    - NEU: $log(\frac{1}{137+86})=-2.34830$
    - POS: $log(\frac{1}{132+86})=-2.33845$

We now need to sum these probabilities per class
- NEG: $2*(-1.87890) + 3*(-2.05499)=-9.92277$
- NEU: $(-1.74624)+2*(-2.04727)+2*(-2.34830)=-10.53738$
- POS: $(-2.03742)+(-1.86133)+3*(-2.33845)=-10.9141$

From these we can see that the NEG class value is the largest. This explains why the algorithm is predicting NEG instead of NEU.

In [20]:
for word in fd.keys():
    if word not in vocab: 
        print("[\'%10s\']\tmissing from vocabulary\n" % word)
        continue
    print("[\'%10s\']\tNEG: %d, NEU: %d, POS: %d" % (word, word_counts[0][word], word_counts[1][word], word_counts[2][word]))
    prob_neg = math.log(word_counts[0][word] / (len(vocab) + num_entries_per_cls[0]))
    prob_neu = math.log(word_counts[1][word] / (len(vocab) + num_entries_per_cls[1]))
    prob_pos = math.log(word_counts[2][word] / (len(vocab) + num_entries_per_cls[2]))
    print("[\'%10s\']\tNEG: %.2f, NEU: %.2f, POS: %.2f\n" % (word, prob_neg, prob_neu, prob_pos))

print("\nNum Per Cls:\tNEG: %d, NEU: %d, POS: %d" % (num_entries_per_cls[0], num_entries_per_cls[1], num_entries_per_cls[2]))
print("vocab size: %d" % (len(vocab)))

['        it']	NEG: 3, NEU: 5, POS: 2
['        it']	NEG: -4.35, NEU: -3.87, POS: -4.72

['       was']	NEG: 3, NEU: 3, POS: 3
['       was']	NEG: -4.35, NEU: -4.38, POS: -4.31

['         a']	missing from vocabulary

['    little']	missing from vocabulary

['      slow']	NEG: 2, NEU: 2, POS: 1
['      slow']	NEG: -4.76, NEU: -4.78, POS: -5.41

['       but']	NEG: 2, NEU: 3, POS: 1
['       but']	NEG: -4.76, NEU: -4.38, POS: -5.41

['       not']	NEG: 2, NEU: 3, POS: 1
['       not']	NEG: -4.76, NEU: -4.38, POS: -5.41

['       too']	NEG: 1, NEU: 3, POS: 1
['       too']	NEG: -5.45, NEU: -4.38, POS: -5.41

['       bad']	missing from vocabulary


Num Per Cls:	NEG: 144, NEU: 150, POS: 135
vocab size: 89


After adding the sentence "NEU It was not too fast, but not too slow either." the new probabilities are:
- NEG: $(-4.35) + (-4.35) + (-4.76) + (-4.76) + (-4.76) + (-5.45) = (-28.43)$
- NEU: $(-3.87) + (-4.38) + (-4.78) + (-4.38) + (-4.38) + (-4.38) = (-26.17)$
- POS: $(-4.72) + (-4.31) + (-5.41) + (-5.41) + (-5.41) + (-5.41) = (-30.67)$

The program now correctly calculates the proper class for the sentence we manually tested. The reasoning behind this specific sentence was that using the word 'not' usually negates whatever is being said. Therefore it would probably be more often in neurtral sentences. Including both 'fast' and 'slow' in the sentence helped to counter eachother out as well. Adding this sentence dramatically improved the model performance. It went from $62.5%$ accuracy to $75.0%$ accuracy. The recall and F1-Score both jumped to $0.750$. The precision jumped to $0.792$.

## Adding the MPQA Subjectivity Cues Lexicon to the system
The MPQA Subjectivity Cues Lexicon was already loaded above into a pandas dataframe. The next steps are to evaluate how many words there are in each category. There are 5 categories total. 'negative', 'positive', 'neutral', 'both', and 'weakneg'. For simplicity the 'weakneg' will be treated as 'negative'. The 'both' will be treated as 'neutral'. This leaves us our 3 classes we are categorizing against. The following step would be to add the data to our word list and then retrain the model.

In [21]:
mpqa['priorpolarity'].value_counts()

negative    4912
positive    2718
neutral      570
both          21
weakneg        1
Name: priorpolarity, dtype: int64

In [22]:
mpqa['priorpolarity'] = mpqa['priorpolarity'].\
    apply(lambda x: 'neutral' if x == 'both' else x).\
    apply(lambda x: 'negative' if x == 'weakneg' else x).\
    apply(lambda x: 'NEG' if x == 'negative' else x).\
    apply(lambda x: 'NEU' if x == 'neutral' else x).\
    apply(lambda x: 'POS' if x == 'positive' else x)
mpqa['priorpolarity'].value_counts()

NEG    4913
POS    2718
NEU     591
Name: priorpolarity, dtype: int64

In [23]:
train = copy.deepcopy(data)
test = copy.deepcopy(test_data)
unique_labels = data['encoded'].unique()

for word in list(mpqa[mpqa['priorpolarity'] == 'NEG']['word1']):
    train = pd.concat([train, pd.DataFrame.from_records([{
        "cls": "NEG",
        "text": word,
        "encoded": 0
    }])]).reset_index(drop=True)
for word in list(mpqa[mpqa['priorpolarity'] == 'NEU']['word1']):
    train = pd.concat([train, pd.DataFrame.from_records([{
        "cls": "NEU",
        "text": word,
        "encoded": 1
    }])]).reset_index(drop=True)
for word in list(mpqa[mpqa['priorpolarity'] == 'POS']['word1']):
    train = pd.concat([train, pd.DataFrame.from_records([{
        "cls": "POS",
        "text": word,
        "encoded": 2
    }])]).reset_index(drop=True)

In [24]:
train = copy.deepcopy(data)
test = copy.deepcopy(test_data)
unique_labels = data['encoded'].unique()

def gen_word_counts_with_laplace(dataframe: pd.DataFrame):
    # get a list of all of the word keys
    all_tokens = nltk.word_tokenize('\n'.join(dataframe['text']))
    all_text = nltk.Text([token.lower() for token in all_tokens])
    all_freq_dist = all_text.vocab()
    all_keys = all_freq_dist.keys()
    word_counts = {}
    # iterate through each of the label classes
    for cls in unique_labels:
        # for each label class compute the word frequency distribution
        text = '\n'.join(dataframe[dataframe['encoded'] == cls]['text'])
        # print("CLS [%d]: %s" % (cls, text))
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        n_text = nltk.Text(tokens)
        fd = n_text.vocab()
        dictionary = {}
        # use the list of all keys and frequency distribution to make
        # a frequency distribution for the class with all keys
        # and laplacian smoothing
        for key in all_keys:
            count = fd[key] + 1 # add 1 for laplace smoothing
            dictionary[key] = count
        # add the distribution to the list for output
        word_counts[cls] = dictionary
    return word_counts, all_keys

def fit(dataframe: pd.DataFrame):
    num_entries_per_cls = {}
    log_label_priors = {}
    n = len(list(dataframe['text']))
    grouped = [None] * len(unique_labels)
    for cls in dataframe['encoded'].unique():
        grouped[cls] = list(dataframe['text'][dataframe['encoded'] == cls])
        # num_entries_per_cls[cls] = len(grouped[cls])
        num_entries_per_cls[cls] = sum(word_counts[cls].values())
        log_label_priors[cls] = math.log(num_entries_per_cls[cls] / n)
    return(num_entries_per_cls, log_label_priors)

def predict(x: list[str], num_entries_per_cls, log_label_priors, vocab):
    result = []
    for text in x:
        label_scores = copy.deepcopy(log_label_priors)
        text = clean_string(text)
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        # fd = nltk.Text(tokens).vocab()
        for word in tokens:
            if word not in vocab: 
                # print("[%s] missing from vocabulary" % word)
                continue
            for cls in unique_labels:
                count = word_counts[cls][word]
                n_vocab_per_cls = num_entries_per_cls[cls] + len(vocab)
                label_scores[cls] += math.log(count / n_vocab_per_cls)
        # print(label_scores)
        print("%50s: \tNEG: %.2f, \tNEU: %.2f, \tPOS: %.2f" % (text, label_scores[0], label_scores[1], label_scores[2]))
        result.append(max(label_scores, key=lambda key: label_scores[key])) # ISSUE WAS HERE DUE TO USING np.argmax on a dict.
    return result

word_counts, vocab = gen_word_counts_with_laplace(train)

num_entries_per_cls, log_label_priors = fit(train)
print(num_entries_per_cls)
print(log_label_priors)
y_true = list(test['encoded'])
y_pred = predict(test['text'], num_entries_per_cls, log_label_priors, vocab)
print(y_true)
print(y_pred)
score(y_pred, y_true)

{2: 135, 1: 150, 0: 144}
{2: 1.9608357992719891, 1: 2.0661963149298153, 0: 2.02537432040956}
                The program does what it should do: 	NEG: -20.95, 	NEU: -20.93, 	POS: -20.82
                           It functions adequately: 	NEG: -2.33, 	NEU: -1.80, 	POS: -2.76
                                 The program sucks: 	NEG: -6.80, 	NEU: -6.81, 	POS: -5.97
               This thing runs like a pregnant cow: 	NEG: -7.49, 	NEU: -8.19, 	POS: -8.17
              It was a little slow but not too bad: 	NEG: -26.40, 	NEU: -24.10, 	POS: -28.72
                                   Slow Slow SLOW!: 	NEG: -17.70, 	NEU: -17.76, 	POS: -18.99
                                   Great software!: 	NEG: -7.78, 	NEU: -7.50, 	POS: -7.48
                      Worth the trouble to install: 	NEG: -10.86, 	NEU: -12.98, 	POS: -11.79
[2, 1, 0, 0, 1, 0, 2, 1]
[2, 1, 2, 0, 1, 0, 2, 0]
[ACCURACY]     : 0.750
[PRECISION]    : 0.792
[RECALL]       : 0.750
[F1-SCORE]     : 0.750


In [25]:
print("Word: [%s], Polarity: %s" % ('the', mpqa[mpqa['word1'] == 'the']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('program', mpqa[mpqa['word1'] == 'program']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('does', mpqa[mpqa['word1'] == 'does']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('what', mpqa[mpqa['word1'] == 'what']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('it', mpqa[mpqa['word1'] == 'it']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('should', mpqa[mpqa['word1'] == 'should']['priorpolarity']))
print("Word: [%s], Polarity: %s" % ('do', mpqa[mpqa['word1'] == 'do']['priorpolarity']))

Word: [the], Polarity: Series([], Name: priorpolarity, dtype: object)
Word: [program], Polarity: Series([], Name: priorpolarity, dtype: object)
Word: [does], Polarity: Series([], Name: priorpolarity, dtype: object)
Word: [what], Polarity: Series([], Name: priorpolarity, dtype: object)
Word: [it], Polarity: Series([], Name: priorpolarity, dtype: object)
Word: [should], Polarity: 6665    NEU
Name: priorpolarity, dtype: object
Word: [do], Polarity: Series([], Name: priorpolarity, dtype: object)


Adding the lexicon ended up hurting my score. This is likely due to how I added the lexicon to the program. Document 1 used to be classified correctly as a positive document. Now that the lexicon was added it is now classified as a neutral document. The document in question is "POS The program does what it should do." After checking the mpqa for the words in that document it appears that the inclusion of the word 'should' under 'Neutral' ended up changing the classification. This is in addition to the increase in words for each class as well with the 'Neutral' class not having as many words added to it.

## Amazon Reviews Documents
The last step is to use the colleciton of amazon reviews documents to train the classifier.

In [26]:
## Read in the training data
training = []
strings: list[str] = []
filename = "Amazon Reviews\\train.ft.txt"
with open(filename, mode="r", encoding='utf-8') as f:
    for line in f:
        tokens = line.split(" ")
        classification = tokens.pop(0)
        x = clean_string(str(line[line.index(" ") + 1:-1]))
        strings.append(x)
        training +=[(classification, x)]
text = '\n'.join(strings)
data = pd.DataFrame(training, columns=['cls', 'text'])
data['cls'] = data['cls'].astype("category")
data['encoded'] = encoder.fit_transform(data['cls'])
data['encoded'] = data['encoded'].astype("category")
data

Unnamed: 0,cls,text,encoded
0,__label__2,Stuning even for the non-gamer This sound trac...,1
1,__label__2,The best soundtrack ever to anything I am read...,1
2,__label__2,Amazing! This soundtrack is my favorite music ...,1
3,__label__2,Excellent Soundtrack I truly like this soundtr...,1
4,__label__2,Remember Pull Your Jaw Off The Floor After Hea...,1
...,...,...,...
3599995,__label__1,Do not do it!! The high chair looks great when...,0
3599996,__label__1,Looks nice low functionality I have used this ...,0
3599997,__label__1,compact but hard to clean We have a small hous...,0
3599998,__label__1,what is it saying not sure what this book is s...,0


In [27]:
## Read in the testing data
training = []
strings: list[str] = []
filename = "Amazon Reviews\\test.ft.txt"
with open(filename, mode="r", encoding='utf-8') as f:
    for line in f:
        tokens = line.split(" ")
        classification = tokens.pop(0)
        x = clean_string(str(line[line.index(" ") + 1:-1]))
        strings.append(x)
        training +=[(classification, x)]
text = '\n'.join(strings)
test_data = pd.DataFrame(training, columns=['cls', 'text'])
test_data['cls'] = test_data['cls'].astype("category")
test_data['encoded'] = encoder.transform(test_data['cls'])
test_data['encoded'] = test_data['encoded'].astype("category")
test_data

Unnamed: 0,cls,text,encoded
0,__label__2,Great CD My lovely Pat has one of the GREAT vo...,1
1,__label__2,One of the best game music soundtracks - for a...,1
2,__label__1,Batteries died within a year I bought this ch...,0
3,__label__2,works fine but Maha Energy is better Check out...,1
4,__label__2,Great for the non-audiophile Reviewed quite a ...,1
...,...,...,...
399995,__label__1,Unbelievable- In a Bad Way We bought this Thom...,0
399996,__label__1,Almost Great Until it Broke My son recieved th...,0
399997,__label__1,Disappointed !!! I bought this toy for my son ...,0
399998,__label__2,Classic Jessica Mitford This is a compilation ...,1


In [28]:
train = copy.deepcopy(data)
test = copy.deepcopy(test_data)
unique_labels = data['encoded'].unique()

def gen_word_counts_with_laplace(dataframe: pd.DataFrame):
    # get a list of all of the word keys
    all_tokens = nltk.word_tokenize('\n'.join(dataframe['text']))
    all_text = nltk.Text([token.lower() for token in all_tokens])
    all_freq_dist = all_text.vocab()
    all_keys = all_freq_dist.keys()
    word_counts = {}
    # iterate through each of the label classes
    for cls in unique_labels:
        # for each label class compute the word frequency distribution
        text = '\n'.join(dataframe[dataframe['encoded'] == cls]['text'])
        # print("CLS [%d]: %s" % (cls, text))
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        n_text = nltk.Text(tokens)
        fd = n_text.vocab()
        dictionary = {}
        # use the list of all keys and frequency distribution to make
        # a frequency distribution for the class with all keys
        # and laplacian smoothing
        for key in all_keys:
            count = fd[key] + 1 # add 1 for laplace smoothing
            dictionary[key] = count
        # add the distribution to the list for output
        word_counts[cls] = dictionary
    return word_counts, all_keys

def fit(dataframe: pd.DataFrame):
    num_entries_per_cls = {}
    log_label_priors = {}
    n = len(list(dataframe['text']))
    grouped = [None] * len(unique_labels)
    for cls in dataframe['encoded'].unique():
        grouped[cls] = list(dataframe['text'][dataframe['encoded'] == cls])
        # num_entries_per_cls[cls] = len(grouped[cls])
        num_entries_per_cls[cls] = sum(word_counts[cls].values())
        log_label_priors[cls] = math.log(num_entries_per_cls[cls] / n)
    return(num_entries_per_cls, log_label_priors)

def predict(x: list[str], num_entries_per_cls, log_label_priors, vocab):
    result = []
    for text in x:
        label_scores = copy.deepcopy(log_label_priors)
        text = clean_string(text)
        tokens = nltk.word_tokenize(text)
        tokens = [token.lower() for token in tokens]
        # fd = nltk.Text(tokens).vocab()
        for word in tokens:
            if word not in vocab: 
                # print("[%s] missing from vocabulary" % word)
                continue
            for cls in unique_labels:
                count = word_counts[cls][word]
                n_vocab_per_cls = num_entries_per_cls[cls] + len(vocab)
                label_scores[cls] += math.log(count / n_vocab_per_cls)
        # print(label_scores)
        # print("%50s: \tNEG: %.2f, \tNEU: %.2f, \tPOS: %.2f" % (text, label_scores[0], label_scores[1], label_scores[2]))
        result.append(max(label_scores, key=lambda key: label_scores[key])) # ISSUE WAS HERE DUE TO USING np.argmax on a dict.
    return result

word_counts, vocab = gen_word_counts_with_laplace(train)
num_entries_per_cls, log_label_priors = fit(train)
print(num_entries_per_cls)
print(log_label_priors)
y_true = list(test['encoded'])
y_pred = predict(test['text'], num_entries_per_cls, log_label_priors, vocab)
# print(y_true)
# print(y_pred)
score(y_pred, y_true)


{1: 144870616, 0: 156199131}
{1: 3.6949071951705346, 0: 3.7701978285477638}
[ACCURACY]     : 0.850
[PRECISION]    : 0.851
[RECALL]       : 0.850
[F1-SCORE]     : 0.850


The Amazon reviews dataset achieved a $85.0%$ accuracy. It also got a $0.850$ recall and F1-Score. The precision was slightly higher at $0.851$. Note that this cell take about 50min to execute.