# ANLP Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data.  In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.  Otherwise, do not change the code in these cells.

In [2]:
candidateno=291065 #this MUST be updated to your candidate number so that you get a unique data sample


In [1]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [3]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)


def get_train_test_data():

    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')

    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`

In [4]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['i', 'want', 'to', 'correct', 'what', 'i', 'wrote', ...], 'pos')


# A. Re-Useable Code and Functions

In [30]:
def preprocess_text(token_list: list, stop_words: list):
    """
    Applies case normalization, removes numbers and punctuation, 
    and removes stopwords from a list of tokens.

    Args:
        token_list (list): The input list of tokens (strings).
        stop_words (set/list): A collection of words to be removed (stopwords).

    Returns:
        list: The preprocessed list of tokens.
    """
    
    processed_list = [
        token.lower()
        for token in token_list  
        if token.isalpha()        # removes numbers and punctuation
        and token.lower() not in stop_words  # remove tokens in stop_words (by keeping those not in stop_words)
    ]
    
    return processed_list

# 1. Generating Positive and Negative Word Lists

a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

## Code (a & b)

In [31]:
stop = set(stopwords.words('english'))  # a set gives fast membership tests inside preprocess_text

pos_freq_dist=FreqDist()
neg_freq_dist=FreqDist()

for rev,label in training_data:

    # PREPROCESSING: case, number, punctuation, stopword
    rev = preprocess_text(token_list=rev, stop_words=stop)

    # FreqDist() data structure
    if label == 'pos':
        pos_freq_dist.update(rev) 
    elif label == 'neg':
        neg_freq_dist.update(rev)
    else: 
        print("unexpected class")

# IDENTIFY WORD DIFFERENTIAL BETWEEN CLASSES
all_words = set(pos_freq_dist.keys()) | set(neg_freq_dist.keys()) # all unique words

word_counts = []

for word in all_words:
    pos_count = pos_freq_dist.get(word, 0)
    neg_count = neg_freq_dist.get(word, 0)
    difference = pos_count - neg_count
    total = pos_count + neg_count
    word_counts.append((word, difference, total))

# IDENTIFY REVIEW DOMAIN STOP WORDS
word_counts_copy = list(word_counts)
word_counts_copy.sort(key=lambda item: item[2], reverse=True) # [2] = Total Corpus Count
_domainStopWords = word_counts_copy[:5] # 5 most common review words
domainStopWords = [word for word, diff, total in _domainStopWords]
domain_stop_set = set(domainStopWords)

# DROP DOMAIN WORDS FROM LIST
final_word_counts = [word for word in word_counts if word[0] not in domain_stop_set]

# SORT BY [1], THE CLASS/LABEL DIFFERENTIAL
final_word_counts.sort(key=lambda item: item[1]) # low to high

# COLLECT WORDS WITH THE LARGEST DIFFERENTIAL
_negative_word_list = final_word_counts[:10] # first 10
_positive_word_list = final_word_counts[-10:] # last 10

_positive_word_list.sort(key=lambda item: item[1], reverse=True) # high to low

negative_word_list = [word for word, diff, total in _negative_word_list]
positive_word_list = [word for word, diff, total in _positive_word_list]

In [32]:
print(domainStopWords)

['film', 'movie', 'one', 'like', 'even']


In [33]:
print(negative_word_list)

['bad', 'plot', 'nothing', 'worst', 'script', 'stupid', 'boring', 'least', 'harry', 'supposed']


In [34]:
print(_negative_word_list)

[('bad', -476, 998), ('plot', -244, 1040), ('nothing', -153, 555), ('worst', -139, 205), ('script', -127, 533), ('stupid', -114, 174), ('boring', -113, 195), ('least', -107, 447), ('harry', -105, 163), ('supposed', -105, 223)]


In [35]:
print(positive_word_list)

['life', 'also', 'great', 'well', 'best', 'story', 'many', 'world', 'love', 'first']


In [36]:
print(_positive_word_list)

[('life', 329, 1113), ('also', 327, 1395), ('great', 263, 833), ('well', 253, 1315), ('best', 245, 931), ('story', 229, 1555), ('many', 203, 901), ('world', 201, 729), ('love', 200, 800), ('first', 166, 1318)]


## Explanation (c)

The code loops over each review in the training data, unpacking its word list and its label.

Each word list is preprocessed and normalised: tokens are lower-cased, tokens that are not purely alphabetic (numbers and punctuation) are dropped, and stopwords are removed. I do this to remove noise and keep the content words, discarding words that carry little meaning on their own. I did not apply stemming or lemmatisation because the goal is to obtain a word list, so readability is desirable.
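As a small illustration of the preprocessing step, the sketch below applies the same three filters to a hand-made token list. The function name `preprocess_sketch` and the values in `toy_tokens` and `toy_stop_words` are invented for this example and are not taken from the corpus.

```python
def preprocess_sketch(token_list, stop_words):
    """Lower-case tokens, keep purely alphabetic ones, drop stopwords."""
    stop_words = set(stop_words)  # set membership is faster than list membership
    return [
        token.lower()
        for token in token_list
        if token.isalpha() and token.lower() not in stop_words
    ]

# Invented example tokens (not from the movie_reviews corpus)
toy_tokens = ["The", "plot", "was", "n't", "bad", ",", "10", "/", "10", "!"]
toy_stop_words = ["the", "was"]

cleaned = preprocess_sketch(toy_tokens, toy_stop_words)
print(cleaned)  # ['plot', 'bad']
```

Note that in this toy input the negation token `n't` is dropped because it contains an apostrophe, so "wasn't bad" is reduced to "bad". This is a side effect of filtering with `isalpha()` that is worth keeping in mind when reading the word lists.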

The preprocessed tokens are then added to a pre-initialised `FreqDist` for the review's label. It is important to keep a separate `FreqDist` per label because I want to compare word frequencies between the two classes.

For each word I calculate two metrics: the differential between its positive and negative counts, and its total count across the corpus.

Words with a positive differential are candidates for the positive word list, and words with a negative differential are candidates for the negative word list.
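The two metrics can be sketched on a toy example. The counts in `toy_pos` and `toy_neg` are invented for illustration, and `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass) so that the snippet runs without NLTK.

```python
from collections import Counter

# Hypothetical per-class counts (invented, not from the training data)
toy_pos = Counter({"great": 8, "plot": 3, "film": 10})
toy_neg = Counter({"bad": 7, "plot": 5, "film": 9})

toy_counts = []
for word in sorted(set(toy_pos) | set(toy_neg)):
    pos_count = toy_pos.get(word, 0)
    neg_count = toy_neg.get(word, 0)
    differential = pos_count - neg_count  # > 0 leans positive, < 0 leans negative
    total = pos_count + neg_count         # overall usage across both classes
    toy_counts.append((word, differential, total))

print(toy_counts)
# [('bad', -7, 7), ('film', 1, 19), ('great', 8, 8), ('plot', -2, 8)]
```

Here `film` has the highest total but a differential close to zero, which is the pattern that motivates treating very frequent words separately.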

The total count is used to identify domain stop words, that is, words that are used disproportionately often when writing reviews. Because these words are so frequent in the review domain, they lose the meaning they might otherwise carry in general text. I remove the five most frequent words as candidates for the word lists.
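A minimal sketch of the domain stop word filter, using invented `(word, differential, total)` tuples and a cut-off of `top_k = 2` rather than the 5 used in the notebook:

```python
# Invented (word, differential, total) tuples for illustration
toy_counts = [
    ("film", 1, 19),
    ("great", 8, 8),
    ("bad", -7, 7),
    ("plot", -2, 8),
    ("movie", 0, 15),
]

top_k = 2  # hypothetical cut-off; the notebook code uses 5

# Rank by total usage and take the most frequent words as domain stop words
by_total = sorted(toy_counts, key=lambda item: item[2], reverse=True)
toy_domain_stop = {word for word, _diff, _total in by_total[:top_k]}

# Keep every tuple whose word is not a domain stop word
remaining = [item for item in toy_counts if item[0] not in toy_domain_stop]

print(sorted(toy_domain_stop))  # ['film', 'movie']
print(remaining)                # [('great', 8, 8), ('bad', -7, 7), ('plot', -2, 8)]
```

The cut-off is a design choice: a larger `top_k` removes more generic vocabulary but risks discarding frequent words that do carry sentiment.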

Finally, the list is sorted by index [1], the differential metric. Slices of the bottom and top 10 are taken: these are the words most skewed towards negative or positive reviews and should therefore be highly representative of sentiment.
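The sort-and-slice step can be sketched as follows, with invented tuples and `n = 2` in place of 10:

```python
# Invented (word, differential, total) tuples for illustration
toy_counts = [
    ("great", 8, 8),
    ("bad", -7, 7),
    ("plot", -2, 8),
    ("love", 5, 9),
    ("boring", -4, 4),
    ("story", 1, 12),
]

n = 2  # hypothetical list length; the notebook uses 10

# Sort ascending by differential: most negative first, most positive last
ranked = sorted(toy_counts, key=lambda item: item[1])

toy_negative = [word for word, _diff, _total in ranked[:n]]
# Reverse the tail so the most positive word comes first
toy_positive = [word for word, _diff, _total in reversed(ranked[-n:])]

print(toy_negative)  # ['bad', 'boring']
print(toy_positive)  # ['great', 'love']
```

If the ranked list held fewer than `2 * n` words the two slices would overlap, so this approach assumes a vocabulary much larger than the requested list length.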

# 2. Word List Classifier

a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


## Classifier

In [40]:
def sentiment_weight_converter(sentiment_word_list: list):
    """
    Converts a sentiment word list with frequency counts into sentiment polarity weightings.

    Args:
        sentiment_word_list (Iterable): (word: str, differential: int, total: int) tuples.
        
    Returns:
        dict: maps each word to its sentiment weighting (differential / total)
    """
    sentiment_weights = {
        word: differential / total_usage
        for word, differential, total_usage in sentiment_word_list
    }

    return sentiment_weights

In [45]:
print(sentiment_weight_converter(_positive_word_list))

{'life': 0.29559748427672955, 'also': 0.23440860215053763, 'great': 0.3157262905162065, 'well': 0.19239543726235742, 'best': 0.2631578947368421, 'story': 0.1472668810289389, 'many': 0.2253052164261931, 'world': 0.2757201646090535, 'love': 0.25, 'first': 0.125948406676783}


In [39]:
print(sentiment_weight_converter(_negative_word_list))

{'bad': -0.47695390781563124, 'plot': -0.23461538461538461, 'nothing': -0.2756756756756757, 'worst': -0.6780487804878049, 'script': -0.23827392120075047, 'stupid': -0.6551724137931034, 'boring': -0.5794871794871795, 'least': -0.23937360178970918, 'harry': -0.6441717791411042, 'supposed': -0.47085201793721976}


In [85]:
from nltk.classify.api import ClassifierI

class ReviewClassifer(ClassifierI):

    def __init__(self, pos, neg):
        self._pos = sentiment_weight_converter(pos)
        self._neg = sentiment_weight_converter(neg)

    def classify(self, words):
        score = 0

        for word in words:
            
            if word in self._pos:
                score += self._pos[word]

            elif word in self._neg:
                score += self._neg[word]

        return "neg" if score <= 0 else "pos"

    def labels(self):
        return ("pos", "neg")

In [93]:
#Example usage:

classifier = ReviewClassifer(_positive_word_list, _negative_word_list)

data = [
    training_data[100], 
    training_data[500],
    training_data[700],
    training_data[1100]
]

for rev,label in data:
    rev = preprocess_text(token_list=rev, stop_words=stop)

    cls = classifier.classify(rev)

    print(label, cls)

pos pos
pos neg
neg pos
neg neg


## Explanation

Here I have set up a word list classifier class by inheriting from `ClassifierI` in the `nltk` package. 

I did this to follow the `nltk` convention, and so that the classifier can easily be plugged into `nltk`'s evaluation tools later in the assignment.

Before writing the class I set up a function called `sentiment_weight_converter`. This function takes a sentiment word list (pos or neg) and calculates a weighting for each word. The idea is to capture how strongly positive or negative each word is.

I had previously calculated the differential for each word (positive count minus negative count) to find the most positive and negative words, along with the total frequency of each word. Dividing the differential $D$ by the total $T$ for each word weights the score by how much of the word's total usage the sentiment skew accounts for.

- If a word is used 100 times and all seen instances are negative, then its differential is -100 and its weighting is -100/100 = -1.
- If a word is used 100 times but only 80 instances are negative and 20 positive, it has a differential of 20 - 80 = -60 and a weighting of -60/100 = -0.6. It is still considered a negative word, but less strongly so.

$$\text{Weighting Score} = \frac{D}{T}$$
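With the sign convention used in the code (differential = positive count minus negative count), the weighting can be checked with a few lines of arithmetic. The helper name `weight` is made up for this sketch; the last two calls reuse counts reconstructed from the `(word, differential, total)` tuples printed for `worst` and `life` earlier in the notebook.

```python
def weight(pos_count, neg_count):
    """Weighting score D / T, where D = pos - neg and T = pos + neg."""
    differential = pos_count - neg_count
    total = pos_count + neg_count
    return differential / total

all_negative = weight(0, 100)      # every occurrence is in a negative review
mostly_negative = weight(20, 80)   # 80 negative, 20 positive occurrences

# 'worst' has D = -139 and T = 205, i.e. 33 positive and 172 negative occurrences
worst_weight = weight(33, 172)
# 'life' has D = 329 and T = 1113, i.e. 721 positive and 392 negative occurrences
life_weight = weight(721, 392)

print(all_negative, mostly_negative)
print(round(worst_weight, 3), round(life_weight, 3))
```

The score is bounded between -1 and 1, so a rare but one-sided word such as `worst` can outweigh a frequent but only mildly skewed word such as `life`.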

Hopefully this approach will lead to an improvement in some evaluation scores. Additionally, positive words appear more frequently overall, but the negative weights come out stronger, so the two effects may even out and improve classification.

Within the class, the `classify` method handles the computation. It loops through a review and, whenever it sees a word from either sentiment list, adds that word's corresponding **weight** to the `score` tally, which is initialised to 0. Note that, by definition, negative words have negative weights and positive words have positive weights, so each pulls the `score` in its own direction. After the loop has finished, the method returns 'neg' if the score is <= 0 and 'pos' if it is > 0.
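The scoring rule can be traced by hand on a toy review. The weights in `toy_weights` are roughly rounded from the values printed earlier, and the helper names `score_review` and `label_for` and both reviews are invented for this sketch; it mirrors the logic of `classify` without depending on NLTK.

```python
# Illustrative weights, roughly rounded from the printed sentiment weights
toy_weights = {"great": 0.32, "love": 0.25, "bad": -0.48, "boring": -0.58}

def score_review(words, weights):
    """Sum the weights of listed words; unlisted words contribute 0."""
    return sum(weights.get(word, 0.0) for word in words)

def label_for(score):
    """Same decision rule as classify: ties and negatives map to 'neg'."""
    return "neg" if score <= 0 else "pos"

review_a = ["great", "cast", "love", "story", "bad", "ending"]
review_b = ["nothing", "special"]  # contains no listed words

score_a = score_review(review_a, toy_weights)  # 0.32 + 0.25 - 0.48
score_b = score_review(review_b, toy_weights)  # no matches, so 0

print(label_for(score_a))  # pos
print(label_for(score_b))  # neg
```

The second review shows a consequence of the `<= 0` threshold: a review containing none of the listed words scores 0 and is labelled 'neg' by default.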

# 3. 

a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

# 4.

a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results.

[12.5\%]

# 5.

a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions.

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]
