# ANLP Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data.  In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.  Otherwise, do not change the code in these cells.

In [3]:
candidateno=291065 #this MUST be updated to your candidate number so that you get a unique data sample


In [4]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [5]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)


def get_train_test_data():

    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')

    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [6]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['i', 'want', 'to', 'correct', 'what', 'i', 'wrote', ...], 'pos')


# A. Reusable Code and Functions

In [7]:
def preprocess_text(token_list: list, stop_words: list):
    """
    Applies case normalization, removes numbers and punctuation, 
    and removes stopwords from a list of tokens.

    Args:
        token_list (list): The input list of tokens (strings).
        stop_words (set/list): A collection of words to be removed (stopwords).

    Returns:
        list: The preprocessed list of tokens.
    """
    
    processed_list = [
        token.lower()
        for token in token_list  
        if token.isalpha()        # removes numbers and punctuation
        and token.lower() not in stop_words  # remove tokens in stop_words (by keeping those not in stop_words)
    ]
    
    return processed_list

# 1. Generating Positive and Negative Word Lists

a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

## Code (a & b)

In [8]:
stop = stopwords.words('english')

pos_freq_dist=FreqDist()
neg_freq_dist=FreqDist()

for rev,label in training_data:

    # PREPROCESSING: case, number, punctuation, stopword
    rev = preprocess_text(token_list=rev, stop_words=stop)

    # FreqDist() data structure
    if label == 'pos':
        pos_freq_dist.update(rev) 
    elif label == 'neg':
        neg_freq_dist.update(rev)
    else: 
        print("unexpected class")

# IDENTIFY WORD DIFFERENTIAL BETWEEN CLASSES
all_words = set(pos_freq_dist.keys()) | set(neg_freq_dist.keys()) # all unique words

word_counts = []

for word in all_words:
    pos_count = pos_freq_dist.get(word, 0)
    neg_count = neg_freq_dist.get(word, 0)
    difference = pos_count - neg_count
    total = pos_count + neg_count
    word_counts.append((word, difference, total))

# IDENTIFY REVIEW DOMAIN STOP WORDS
word_counts_copy = list(word_counts)
word_counts_copy.sort(key=lambda item: item[2], reverse=True) # [2] = Total Corpus Count
_domainStopWords = word_counts_copy[:5] # 5 most common review words
domainStopWords = [word for word, diff, total in _domainStopWords]
domain_stop_set = set(domainStopWords)

# DROP DOMAIN WORDS FROM LIST
final_word_counts = [word for word in word_counts if word[0] not in domain_stop_set]

# SORT BY [1], THE CLASS/LABEL DIFFERENTIAL
final_word_counts.sort(key=lambda item: item[1]) # low to high

# COLLECT WORDS WITH THE LARGEST DIFFERENTIAL
_negative_word_list = final_word_counts[:10] # first 10
_positive_word_list = final_word_counts[-10:] # last 10

_positive_word_list.sort(key=lambda item: item[1], reverse=True) # high to low

negative_word_list = [word for word, diff, total in _negative_word_list]
positive_word_list = [word for word, diff, total in _positive_word_list]

In [9]:
print(domainStopWords)

['film', 'movie', 'one', 'like', 'even']


In [10]:
print(negative_word_list)

['bad', 'plot', 'nothing', 'worst', 'script', 'stupid', 'boring', 'least', 'harry', 'supposed']


In [11]:
print(_negative_word_list)

[('bad', -476, 998), ('plot', -244, 1040), ('nothing', -153, 555), ('worst', -139, 205), ('script', -127, 533), ('stupid', -114, 174), ('boring', -113, 195), ('least', -107, 447), ('harry', -105, 163), ('supposed', -105, 223)]


In [12]:
print(positive_word_list)

['life', 'also', 'great', 'well', 'best', 'story', 'many', 'world', 'love', 'first']


In [13]:
print(_positive_word_list)

[('life', 329, 1113), ('also', 327, 1395), ('great', 263, 833), ('well', 253, 1315), ('best', 245, 931), ('story', 229, 1555), ('many', 203, 901), ('world', 201, 729), ('love', 200, 800), ('first', 166, 1318)]


## Explanation (c)

In this code I loop through each review in the training data, which gives me access to both the word list and the label.

For every word list I preprocess and normalise the tokens, handling case, numbers, punctuation and stopwords. This removes noise, leaving the words in their cleanest form and discarding those that carry little meaning. I did not apply stemming or lemmatisation because the goal is to produce a word list, so readability is desirable.

I then add each preprocessed word list to a pre-initialised `FreqDist` for its label. Keeping a separate `FreqDist` per label matters because I want to compare word frequencies between the two classes.

For each word I calculate two metrics: the differential between its positive and negative counts, and its total count across the corpus.

Words with a positive differential are candidates for the positive word list, and words with a negative differential are candidates for the negative word list.

The total count is used to identify domain stop words, that is, words used disproportionately often when writing reviews (here, the five most frequent words overall). Because they are so frequent in the review domain, they lose the meaning they might otherwise hold in general usage. I remove these words as candidates for the word lists.

Finally, the list is sorted by index [1], the differential metric. The 10 items at each end are sliced off: these are the words most skewed towards positive or negative reviews, and so should be highly representative of sentiment.
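
The ranking step described above can be sketched on a toy corpus. This is only an illustration: the four toy reviews, the `rank_by_differential` helper and its `n_domain_stop` and `n_words` parameters are made up for the example, and `collections.Counter` stands in for NLTK's `FreqDist`.

```python
from collections import Counter

def rank_by_differential(labelled_docs, n_domain_stop=1, n_words=2):
    """Return (positive_words, negative_words) ranked by count differential."""
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in labelled_docs:
        (pos_counts if label == "pos" else neg_counts).update(tokens)

    # (word, differential, total) for every word seen in either class
    rows = [
        (w, pos_counts[w] - neg_counts[w], pos_counts[w] + neg_counts[w])
        for w in set(pos_counts) | set(neg_counts)
    ]

    # Treat the most frequent words overall as domain stop words and drop them
    by_total = sorted(rows, key=lambda r: r[2], reverse=True)
    domain_stop = {w for w, _, _ in by_total[:n_domain_stop]}
    rows = [r for r in rows if r[0] not in domain_stop]

    rows.sort(key=lambda r: r[1])  # most negative differential first
    negative = [w for w, _, _ in rows[:n_words]]
    positive = [w for w, _, _ in reversed(rows[-n_words:])]
    return positive, negative

# Hypothetical, already-preprocessed toy reviews
toy_docs = [
    (["film", "great", "great", "love"], "pos"),
    (["film", "great", "love", "plot"], "pos"),
    (["film", "bad", "bad", "boring", "boring"], "neg"),
    (["film", "bad", "plot", "plot"], "neg"),
]

positive, negative = rank_by_differential(toy_docs)
print(positive, negative)
```

In the toy data, `film` is the most frequent word overall, so it is dropped as a domain stop word before the lists are sliced.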

# 2. Word List Classifier

a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


## Code

In [14]:
def sentiment_weight_converter(sentiment_word_list: list):
    """
    Takes a sentiment word list with frequency counts and converts it into sentiment polarity weightings.

    Args:
        sentiment_word_list (Iterable): Nested word list comprised of word:string, differential: int and total:int. 
        
    Returns:
        dict: dictionary of word and its sentiment weighting
    """
    sentiment_weights = {
        word: differential / total_usage
        for word, differential, total_usage in sentiment_word_list
    }

    return sentiment_weights

In [15]:
print(sentiment_weight_converter(_positive_word_list))

{'life': 0.29559748427672955, 'also': 0.23440860215053763, 'great': 0.3157262905162065, 'well': 0.19239543726235742, 'best': 0.2631578947368421, 'story': 0.1472668810289389, 'many': 0.2253052164261931, 'world': 0.2757201646090535, 'love': 0.25, 'first': 0.125948406676783}


In [16]:
print(sentiment_weight_converter(_negative_word_list))

{'bad': -0.47695390781563124, 'plot': -0.23461538461538461, 'nothing': -0.2756756756756757, 'worst': -0.6780487804878049, 'script': -0.23827392120075047, 'stupid': -0.6551724137931034, 'boring': -0.5794871794871795, 'least': -0.23937360178970918, 'harry': -0.6441717791411042, 'supposed': -0.47085201793721976}


In [17]:
from nltk.classify.api import ClassifierI

class ReviewClassifer(ClassifierI):

    def __init__(self, pos, neg):
        self._pos = sentiment_weight_converter(pos)
        self._neg = sentiment_weight_converter(neg)

    def classify(self, words):
        score = 0

        for word in words:
            
            if word in self._pos:
                score += self._pos[word]

            elif word in self._neg:
                score += self._neg[word]

        return "neg" if score <= 0 else "pos"

    def labels(self):
        return ("pos", "neg")

In [18]:
#Example usage:

classifier = ReviewClassifer(_positive_word_list, _negative_word_list)

data = [
    training_data[100], 
    training_data[500],
    training_data[700],
    training_data[1100]
]

for rev,label in data:
    rev = preprocess_text(token_list=rev, stop_words=stop)

    cls = classifier.classify(rev)

    print(label, cls)

pos pos
pos neg
neg pos
neg neg


## Explanation

Here I have set up a word list classifier class by inheriting from `ClassifierI` in the `nltk` package.

This was done for standardization, to follow convention, and to make customization easier later in the assignment, where I can readily plug into `nltk`'s evaluation utilities.

Prior to writing the class I set up a function called `sentiment_weight_converter`. This function takes a sentiment word list (pos or neg) and calculates a weighting for each word. The idea is to capture which words are more strongly positive or negative.

I had previously calculated the differential for each word (positive count minus negative count) to find the most positive and negative words, along with each word's total frequency. Taking the ratio of the differential $D$ to the total $T$ gives a score weighted by how much of the word's total usage the sentiment skew accounts for.

- If a word is used 100 times and every instance is in a negative review, it has a weighting of -1.
- If a word is used 100 times, with 80 instances negative and 20 positive, it has a differential of -60 and a weighting of -60/100 = -0.6. It is still considered a negative word, but less strongly so.

$$\text{Weighting Score} = \frac{D}{T}$$
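
As a quick check of the formula, the sketch below computes the weighting for the two cases in the bullets above. The helper name `weighting` is hypothetical; it mirrors what `sentiment_weight_converter` does for a single word, taking $D$ as the positive count minus the negative count and $T$ as their sum.

```python
def weighting(pos_count: int, neg_count: int) -> float:
    """Weighting score D / T for one word."""
    differential = pos_count - neg_count   # D
    total = pos_count + neg_count          # T
    return differential / total

print(weighting(0, 100))   # every use is in a negative review -> -1.0
print(weighting(20, 80))   # 80 negative uses, 20 positive uses -> -0.6
```

The sign carries the direction of the sentiment and the magnitude carries its strength, so a score always lies between -1 and 1.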

Hopefully this approach will lead to an improvement in some evaluation scores. Additionally, positive words seem to appear more frequently overall, while the negative weights come out stronger, so the two effects may even out and improve classification.

Within the class, the `classify` method handles the computation. Looping through a review, if it sees a word in either sentiment list, the word's corresponding **weight** is added to the `score` tally, which is initialized as 0. Note that, by definition, negative words have negative weights and positive words have positive weights, so each pulls the `score` in its own direction. After the loop has finished, the method returns 'neg' if the score is <= 0 and 'pos' if it is > 0.
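
The scoring rule can be traced by hand on a few toy reviews. In this sketch the weights are rounded versions of values printed earlier, the two sentiment dictionaries are merged into one table for brevity, and `score_review` and `classify_tokens` are hypothetical stand-ins for the `classify` method rather than the class itself.

```python
def score_review(tokens, weights):
    """Sum the weights of the tokens found in the weight table; others count as 0."""
    return sum(weights.get(tok, 0.0) for tok in tokens)

def classify_tokens(tokens, weights):
    """Return 'pos' only for a strictly positive score, so ties fall to 'neg'."""
    return "pos" if score_review(tokens, weights) > 0 else "neg"

# Rounded, illustrative weights (positive words > 0, negative words < 0)
weights = {"great": 0.32, "love": 0.25, "bad": -0.48, "boring": -0.58}

mixed_negative = ["great", "cast", "boring", "plot"]   # 0.32 - 0.58 < 0
mixed_positive = ["love", "great", "bad"]              # 0.25 + 0.32 - 0.48 > 0
no_evidence = ["cast", "plot"]                         # score stays at 0

for review in (mixed_negative, mixed_positive, no_evidence):
    print(review, classify_tokens(review, weights))
```

A review containing none of the listed words keeps a score of 0 and is therefore labelled 'neg', which is a consequence of the `<= 0` threshold.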

# 3. Evaluation Metrics

a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

## Code

In [19]:
import nltk
from nltk.metrics import precision, recall, f_measure
from nltk.classify.util import accuracy
from collections import defaultdict

stop = nltk.corpus.stopwords.words('english')

classifier = ReviewClassifer(_positive_word_list, _negative_word_list)

refsets = defaultdict(set) # init store for actual labels
testsets = defaultdict(set) # init store for predicted labels

# PROCCESS, PREDICT AND STORE
for i, (words, label) in enumerate(testing_data):
    
    processed_words = preprocess_text(token_list=words, stop_words=stop)
    
    predicted_label = classifier.classify(processed_words)
    
    refsets[label].add(i)
    testsets[predicted_label].add(i)

# CALCULATE ACCURACY
acc = nltk.classify.util.accuracy(
    classifier, 
    [(preprocess_text(words, stop), label) for words, label in testing_data]
)

print("-" * 30)
print(f"Overall Accuracy: {acc:.4f}")
print("-" * 30)

print("-" * 30)
print(f"Evaluation Metrics")
print("-" * 30)

# CALCULATE PRECISION, RECALL, F1-SCORE
for label in classifier.labels():
    
    # PRECISION: OF DOCUMENTS CLASSIFIED AS X, HOW MANY WERE ACTUALLY X
    P = precision(refsets[label], testsets[label])
    
    # RECALL: OF DOCUMENTS THAT ARE ACTUALLY X, HOW MANY WERE CLASSIFIED AS X
    R = recall(refsets[label], testsets[label])
    
    # F1-SCORE: The HARMONIC MEAN OF PRECISION AND RECALL
    F = f_measure(refsets[label], testsets[label])
    
    print(f"Metrics for Class '{label}':")
    print(f"  Precision: {P:.4f}")
    print(f"  Recall:    {R:.4f}")
    print(f"  F1-Score:  {F:.4f}")
    print("-" * 30)

------------------------------
Overall Accuracy: 0.7100
------------------------------
------------------------------
Evaluation Metrics
------------------------------
Metrics for Class 'pos':
  Precision: 0.6583
  Recall:    0.8733
  F1-Score:  0.7507
------------------------------
Metrics for Class 'neg':
  Precision: 0.8119
  Recall:    0.5467
  F1-Score:  0.6534
------------------------------


## Explanation

Accuracy is the ratio (True Positives + True Negatives) / Total Samples. Essentially, it is the proportion of samples that were classified correctly, irrespective of their class. Broadly speaking, accuracy is a fair measure of a classifier's ability when the classes are balanced, because it tells us whether the classifier is doing anything useful. For example, if a dataset with binary classes is split 50/50 and the accuracy comes out at 50%, we might infer that the classifier is doing no better than random guessing, or is simply always predicting one class. Contrast this with an imbalanced dataset split 99/1. Here an accuracy of 99% might lead us to believe we have a strong classifier; however, the dominance of one class means the classifier can learn to (almost) always predict that class and achieve high accuracy whilst performing very poorly on the minority class.
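
The imbalanced case can be made concrete with a small sketch. The 990/10 label split and the always-predict-the-majority "classifier" are invented for illustration only.

```python
# Hypothetical 99/1 dataset: 990 majority ('neg') and 10 minority ('pos') items
labels = ["neg"] * 990 + ["pos"] * 10

# A degenerate classifier that always predicts the majority class
predictions = ["neg"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: how many of the true 'pos' items were found
found_pos = sum(p == "pos" and y == "pos" for p, y in zip(predictions, labels))
minority_recall = found_pos / labels.count("pos")

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

Accuracy looks excellent even though not a single minority item is found, which is why per-class recall is needed alongside it.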

In our example, the classes in the training set are balanced 50/50 on the review sentiment label (pos or neg). However, it is important to note that the features (words) are highly imbalanced: there are many more instances of the positive words in the reviews than of the negative words.

I would say that it is reasonable to use accuracy as an evaluation metric, just not in isolation nor as the final, definitive measure. Accuracy is a reasonable starting point because the classes are balanced. This means we can use it to infer whether our classifier is doing anything more than random guessing or predicting one class all of the time. If it passes this test, we can begin to look at more in-depth metrics that help us evaluate the feature imbalance.

For example, our classifier encounters more positive evidence, resulting in more positive predictions. This results in a high recall for positive as it finds many of them but lower precision because it over-predicts and makes mistakes. Accuracy masks this imbalance, whereas precision and recall reveal the model's bias towards over-predicting the positive class.
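
To see how one accuracy figure can hide this asymmetry, the sketch below recomputes the metrics from a confusion matrix. The four counts are not taken from the notebook output: they are back-calculated on the assumption that the test set holds 300 positive and 300 negative reviews, and were chosen to be consistent with the figures reported above. `prf` is a hypothetical helper.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts for the word list classifier, with 'pos' as the target class
tp, fn = 262, 38    # actual positives: predicted pos / predicted neg
fp, tn = 136, 164   # actual negatives: predicted pos / predicted neg

accuracy = (tp + tn) / (tp + fn + fp + tn)
pos_metrics = prf(tp, fp, fn)
neg_metrics = prf(tn, fn, fp)   # swap the roles to score the 'neg' class

print(f"accuracy: {accuracy:.4f}")
print("pos P/R/F1:", [round(m, 4) for m in pos_metrics])
print("neg P/R/F1:", [round(m, 4) for m in neg_metrics])
```

Under these assumed counts, 398 of the 600 reviews are predicted positive, which is the over-prediction that accuracy alone does not show.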

An example where accuracy would be a good overall metric is part-of-speech tagging, where the goal is to label every word in a sentence/corpus with its correct grammatical category (e.g. Noun, Verb, Adjective etc). There are imbalances in this problem, but they are not severe enough (e.g. 99.5% vs 0.5%) to induce the accuracy paradox. The cost of error is generally uniform across classes: there are no catastrophic misclassifications of the kind found in a rare-disease problem. Overall, the goal is simply to label the maximum number of words correctly.

To summarise, accuracy is useful as an evaluation metric when the class balance is close to 50/50 and errors are considered roughly equal in cost/impact across classes.

# 4. Naive Bayes

a) **Construct** a Naive Bayes classifier (e.g., from NLTK).

b) Compare the performance of your word list classifier with the Naive Bayes classifier. Discuss your results. [12.5\%]

## NB Classifier

In [20]:
def extract_features(word_list):
    """
    Converts a list of words into a dictionary of presence features (Bag-of-Words).
    
    Args:
        word_list (list): A list of tokens.
        
    Returns:
        dict: A feature set dictionary where words are keys and values are True.
    """
    return {word: True for word in word_list}

In [21]:
# Use the 'stop' list and 'preprocess_text' function defined in your Q1 code
stop = stopwords.words('english')

# Apply preprocessing and feature extraction to the training data
featuresets = [
    (extract_features(preprocess_text(words, stop)), label)
    for (words, label) in training_data
] # Bag-of-Words (BoW)

print(f"Total training feature sets: {len(featuresets)}")
print("Example feature set:")
print(featuresets[0])

Total training feature sets: 1400
Example feature set:
({'want': True, 'correct': True, 'wrote': True, 'former': True, 'retrospective': True, 'david': True, 'lean': True, 'war': True, 'picture': True, 'still': True, 'think': True, 'deserve': True, 'number': True, 'american': True, 'film': True, 'institute': True, 'list': True, 'greatest': True, 'movies': True, 'lumet': True, 'angry': True, 'men': True, 'wilder': True, 'witness': True, 'prosecution': True, 'kubrick': True, 'paths': True, 'glory': True, 'would': True, 'better': True, 'choices': True, 'best': True, 'oscar': True, 'deny': True, 'importance': True, 'bridge': True, 'river': True, 'kwai': True, 'cinematically': True, 'contents': True, 'set': True, 'burma': True, 'bataillon': True, 'british': True, 'soldiers': True, 'japanese': True, 'captivity': True, 'forced': True, 'build': True, 'strategically': True, 'momentous': True, 'railway': True, 'commanding': True, 'officer': True, 'colonel': True, 'nicholson': True, 'alec': True, 

In [22]:
from nltk.classify import NaiveBayesClassifier

# Train the Naive Bayes Classifier
# Bag-of-Words (BoW) Naive Bayes Classifier
nb_classifier = NaiveBayesClassifier.train(featuresets)

print("\nSuccessfully trained Naive Bayes Classifier.")
nb_classifier.show_most_informative_features(10)


Successfully trained Naive Bayes Classifier.
Most Informative Features
               fashioned = True              pos : neg    =     13.7 : 1.0
               insulting = True              neg : pos    =     11.7 : 1.0
            breathtaking = True              pos : neg    =     11.0 : 1.0
                  avoids = True              pos : neg    =      9.7 : 1.0
                    bold = True              pos : neg    =      9.7 : 1.0
                  elliot = True              pos : neg    =      9.7 : 1.0
                  regard = True              pos : neg    =      9.7 : 1.0
                seamless = True              pos : neg    =      9.7 : 1.0
                  finest = True              pos : neg    =      9.3 : 1.0
              astounding = True              pos : neg    =      9.0 : 1.0


In [32]:
prediction = nb_classifier.classify(featuresets[0][0])

In [33]:
print(f"Review featuresets[0][1] has actual label of '{featuresets[0][1]}' and NB predicts '{prediction}'")

Review featuresets[0][1] has actual label of 'pos' and NB predicts 'pos'


In [42]:
from collections import defaultdict
from nltk.metrics import precision, recall, f_measure
from nltk.classify.util import accuracy

test_featuresets = [
    (extract_features(preprocess_text(words, stop)), label)
    for (words, label) in testing_data
]

refsets = defaultdict(set)
testsets = defaultdict(set)

for i, (features, label) in enumerate(test_featuresets):

    predicted_label = nb_classifier.classify(features)
    
    # STORE TRUE AND PREDICTED LABELS
    refsets[label].add(i)
    testsets[predicted_label].add(i)

# ACCURACY
acc = accuracy(nb_classifier, test_featuresets)
print("-" * 40)
print(f"Naive Bayes Classifier Evaluation")
print("-" * 40)
print(f"Overall Accuracy: {acc:.4f}\n")

print("-" * 40)
print(f"Evaluation Metrics")
print("-" * 40)

# PRECISION, RECALL, F1
for label in nb_classifier.labels():

    P = precision(refsets[label], testsets[label])
    R = recall(refsets[label], testsets[label])
    F = f_measure(refsets[label], testsets[label])
    
    print(f"Metrics for Class '{label}':")
    print(f"  Precision: {P:.4f}")
    print(f"  Recall:    {R:.4f}")
    print(f"  F1-Score:  {F:.4f}")
    print("-" * 40)

----------------------------------------
Naive Bayes Classifier Evaluation
----------------------------------------
Overall Accuracy: 0.6550

----------------------------------------
Evaluation Metrics
----------------------------------------
Metrics for Class 'pos':
  Precision: 0.5924
  Recall:    0.9933
  F1-Score:  0.7422
----------------------------------------
Metrics for Class 'neg':
  Precision: 0.9794
  Recall:    0.3167
  F1-Score:  0.4786
----------------------------------------


## Compare Performance & Discuss your results

At first glance, an accuracy of 65.5% looks poor, particularly for a binary classification problem with a balanced dataset, where random guessing achieves 50% accuracy. It even underperforms the vastly simpler Word List Classifier (WLC), which has an accuracy of 71%.

The precision, recall and F1 scores tell us much more about performance and, in particular, model bias.

The Naive Bayes classifier suffers from a classification bias towards the positive class: it prefers to predict that a review is positive.

It has an extremely high recall of 99.33%, which means it finds almost all of the positive reviews. However, it massively over-predicts positive reviews, as shown by a precision of 59.24%. This means 40.76% of the positive predictions were False Positives, i.e. actually negative reviews. The F1 score of 74.22% looks fine, but it is driven by the high recall. The NB classifier is highly sensitive to positive reviews and will predict positive even if the evidence is weak. The WLC followed the same trend, with precision (65.83%) lower than recall (87.33%). Its values were less extreme, but the harmonic mean is largely the same, with an F1 score of 75.07%.

Conversely, for the negative class, the NB has a precision of 97.94%, meaning that when the model predicts a review is negative it is almost always correct. But it struggles to find the negative reviews, with a recall of only 31.67%, which is very low. Approximately 68% of the actual negative reviews were missed (False Negatives), giving very low coverage. This is summarized by the F1 score of 47.86%. The NB classifier is highly conservative about negative predictions and has poor coverage. As with the positive class, the WLC follows the same trend as the Naive Bayes but with less extreme values: its precision was 81.19% and its recall 54.67%. For negative reviews, however, the WLC's recall is considerably better, leading to a better harmonic mean, with an F1 score of 65.34%.

| Metric    | NB ('pos') | WLC ('pos') | NB ('neg') | WLC ('neg') |
|:----------|:-----------|:------------|:-----------|:------------|
| Precision | 59.24%     | 65.83%      | 97.94%     | 81.19%      |
| Recall    | 99.33%     | 87.33%      | 31.67%     | 54.67%      |
| F1-Score  | 74.22%     | 75.07%      | 47.86%     | 65.34%      |

The WLC achieved a more balanced performance, as shown by its better F1 scores in both classes. It would appear that the constrained sentiment feature lists of the WLC allowed it to generalize better to the test set. The small feature set meant it avoided the cumulative compounding of weak evidence that the NB fell for.
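
One way to picture the compounding of weak evidence is to add up log-likelihood ratios, which is effectively how a Naive Bayes model combines features it treats as independent. The sketch below is purely illustrative and is not NLTK's implementation: equal class priors are assumed, the 1.2 : 1 ratio is invented, only the 11.7 : 1 figure echoes the `insulting` feature listed earlier, and `log_odds` is a hypothetical helper.

```python
import math

def log_odds(ratios_pos_to_neg):
    """Sum log(P(f|pos) / P(f|neg)) over features; > 0 favours 'pos' under equal priors."""
    return sum(math.log(r) for r in ratios_pos_to_neg)

strong_negative = 1 / 11.7   # one feature favouring 'neg' at 11.7 : 1
weak_positive = 1.2          # one feature weakly favouring 'pos' at 1.2 : 1

# The same strong negative cue, accompanied by a few or many weak positive cues
few = log_odds([strong_negative] + [weak_positive] * 5)
many = log_odds([strong_negative] + [weak_positive] * 30)

print(f"5 weak cues:  {few:+.2f} -> {'pos' if few > 0 else 'neg'}")
print(f"30 weak cues: {many:+.2f} -> {'pos' if many > 0 else 'neg'}")
```

With enough weak positive cues, the single strong negative cue is outvoted. A short word list has far fewer such cues to accumulate.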

Overall, the severe bias of both models is a symptom of the underlying feature imbalance in the data. Whilst the classes are distributed 50/50, the total frequency of positive-skewed tokens in the corpus is much higher than that of negative-skewed tokens. Naive Bayes is a frequency-based probabilistic model and so reflects this, weighting decisions towards the class with the higher volume of the dominant features. In this scenario, the NB classifier is too sophisticated for the noise in the data, whereas the basic constraints of the word list approach unintentionally provided superior regularization against the feature-level bias.


# 5.

a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions.

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]


b) The problem to solve is feature imbalance.

The word list approach acts as a feature regularizer.

Naive Bayes is inherently affected by the extremes of this data issue.

I could try preprocessing the data to reduce the number of features, but the structure of Naive Bayes is probably just too sensitive to the imbalance. This is shown by lemmatisation, a form of feature reduction, performing worse (accuracy of 0.6383 against 0.6550 without it).

I recommend the word list classifier because it can be refined.

Additionally, as lightly experimented with in my example, words can be weighted. My hypothesis was that the weighting itself would help to offset the imbalance (which it may have), but further changes to the weighting could be explored as a hyperparameter, e.g. doubling the weights of the negative words.
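
The negative-weight idea could be prototyped along the following lines. This is a sketch under stated assumptions rather than a tested improvement: `classify_scaled` and its `neg_scale` parameter are hypothetical, and the weights are rounded toy values, not a tuned configuration.

```python
def classify_scaled(tokens, pos_weights, neg_weights, neg_scale=1.0):
    """Word list scoring in which negative weights are multiplied by neg_scale."""
    score = 0.0
    for tok in tokens:
        if tok in pos_weights:
            score += pos_weights[tok]
        elif tok in neg_weights:
            score += neg_scale * neg_weights[tok]
    return "pos" if score > 0 else "neg"

# Rounded, illustrative weights
pos_weights = {"great": 0.32, "well": 0.19}
neg_weights = {"bad": -0.48}

review = ["great", "well", "bad"]

baseline = classify_scaled(review, pos_weights, neg_weights)               # score about +0.03
doubled = classify_scaled(review, pos_weights, neg_weights, neg_scale=2.0)  # score about -0.45
print(baseline, doubled)
```

Any value of `neg_scale` would need to be chosen on held-out training data rather than the test set, otherwise the evaluation scores would be optimistic.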

Why lemmatisation didn't improve results: https://gemini.google.com/share/6de999c3df85

In [59]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [60]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK components if not already done
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """
    Converts a Treebank POS tag (used by nltk.pos_tag) to a 
    WordNet POS tag (needed by WordNetLemmatizer).
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to Noun if unsure

def preprocess_and_lemmatize(token_list: list, stop_words: list):
    """
    Applies case normalization, removes non-alpha tokens, removes stopwords, 
    and applies POS-aware lemmatization.
    """
    cleaned_tokens = [token.lower() for token in token_list if token.isalpha() and token.lower() not in stop_words]
    
    # 1. Get POS tags for the cleaned tokens
    tagged_tokens = nltk.pos_tag(cleaned_tokens)
    
    # 2. Apply POS-aware lemmatization
    lemmatized_tokens = []
    for word, tag in tagged_tokens:
        w_net_pos = get_wordnet_pos(tag)
        lemma = lemmatizer.lemmatize(word, pos=w_net_pos)
        lemmatized_tokens.append(lemma)
        
    return lemmatized_tokens

# Define stop words list again
stop = stopwords.words('english')

In [61]:
# Apply new preprocessing and feature extraction to the training data
lemmatized_featuresets = [
    (extract_features(preprocess_and_lemmatize(words, stop)), label)
    for (words, label) in training_data
]

# Apply to the testing data for evaluation
lemmatized_test_featuresets = [
    (extract_features(preprocess_and_lemmatize(words, stop)), label)
    for (words, label) in testing_data
]

print(f"Total lemmatized training feature sets: {len(lemmatized_featuresets)}")

Total lemmatized training feature sets: 1400


In [62]:
from nltk.classify import NaiveBayesClassifier
from collections import defaultdict
from nltk.metrics import precision, recall, f_measure
from nltk.classify.util import accuracy

# Train the new Naive Bayes Classifier
nb_lem_classifier = NaiveBayesClassifier.train(lemmatized_featuresets)

print("\nSuccessfully trained Lemmatized Naive Bayes Classifier.")
nb_lem_classifier.show_most_informative_features(10)

# Evaluation
refsets_lem = defaultdict(set)
testsets_lem = defaultdict(set)

for i, (features, label) in enumerate(lemmatized_test_featuresets):
    predicted_label = nb_lem_classifier.classify(features)
    refsets_lem[label].add(i)
    testsets_lem[predicted_label].add(i)

# ACCURACY
acc_lem = accuracy(nb_lem_classifier, lemmatized_test_featuresets)
print("\n" + "=" * 50)
print(f"Lemmatized Naive Bayes Classifier Evaluation (Accuracy: {acc_lem:.4f})")
print("=" * 50)

# PRECISION, RECALL, F1
for label in nb_lem_classifier.labels():
    P_lem = precision(refsets_lem[label], testsets_lem[label])
    R_lem = recall(refsets_lem[label], testsets_lem[label])
    F_lem = f_measure(refsets_lem[label], testsets_lem[label])
    
    print(f"Metrics for Class '{label}':")
    print(f"  Precision: {P_lem:.4f}")
    print(f"  Recall:    {R_lem:.4f}")
    print(f"  F1-Score:  {F_lem:.4f}")
    print("-" * 50)


Successfully trained Lemmatized Naive Bayes Classifier.
Most Informative Features
               mesmerize = True              pos : neg    =     13.0 : 1.0
                weakness = True              pos : neg    =     12.3 : 1.0
            breathtaking = True              pos : neg    =     11.0 : 1.0
                  muddle = True              neg : pos    =     10.3 : 1.0
                    bold = True              pos : neg    =      9.7 : 1.0
                  elliot = True              pos : neg    =      9.7 : 1.0
                seamless = True              pos : neg    =      9.7 : 1.0
              degenerate = True              neg : pos    =      9.0 : 1.0
                 forrest = True              pos : neg    =      9.0 : 1.0
                  hatred = True              pos : neg    =      9.0 : 1.0

Lemmatized Naive Bayes Classifier Evaluation (Accuracy: 0.6383)
Metrics for Class 'pos':
  Precision: 0.5819
  Recall:    0.9833
  F1-Score:  0.7311
-----------------