
# ANLP Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data. To provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [1]:
candidateno=22417621 #this MUST be updated to your candidate number so that you get a unique data sample


In [2]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


In [3]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)


def get_train_test_data():

    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')

    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`.

In [4]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['capsule', ':', 'trippy', ',', 'hyperspeed', 'action', ...], 'pos')
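
Seeding the random number generator before splitting is what makes each sample reproducible as well as unique. The sketch below illustrates this with a simplified, index-only version of the split logic; `split_indices` and the seed value `12345` are hypothetical stand-ins, not part of the assignment code.

```python
import random

def split_indices(n, ratio=0.7):
    """Return (train, test) index lists, mirroring the sampling logic of split_data."""
    train = random.sample(range(n), int(n * ratio))
    test = sorted(set(range(n)) - set(train))
    return train, test

# 12345 is a placeholder seed, not a real candidate number.
random.seed(12345)
first_train, first_test = split_indices(10)

# Re-seeding with the same value reproduces exactly the same sample.
random.seed(12345)
second_train, second_test = split_indices(10)

print(first_train == second_train and first_test == second_test)
print(set(first_train).isdisjoint(first_test))
```

Because the training and test indices never overlap, no review can appear in both sets.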


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

In [14]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [48]:
def normalized_tokens(data):
  """Lowercase every token and replace numbers and ordinals with placeholders."""

  # case normalisation
  case_normalised_token = [[token.lower() for token in tokens] for tokens in data]

  # replace digit-only tokens with "NUM"
  normalised_token_first = [["NUM" if token.isdigit() else token for token in tokens] for tokens in case_normalised_token]

  # replace ordinals such as "1st" or "22nd" with "Nth"
  normalised_token = [["Nth" if (token.endswith(("nd","st","th","rd")) and token[:-2].isdigit()) else token for token in tokens] for tokens in normalised_token_first]

  return normalised_token

def stopword_removal(data):
  """Drop stopwords, non-alphabetic tokens and the "NUM"/"Nth" placeholders."""
  # a set gives fast membership tests; the placeholders pass isalpha(), so they are excluded explicitly
  excluded = set(stopwords.words('english')) | {"NUM", "Nth"}
  filtered_tokens = [[token for token in tokens if token.isalpha() and token not in excluded] for tokens in data]
  return filtered_tokens

def lemmatize_tokens(data):
  """Reduce every token to its WordNet lemma."""
  lem = WordNetLemmatizer()
  # with no part-of-speech argument, lemmatize() treats every token as a noun
  lemmatize_token = [[lem.lemmatize(token) for token in tokens] for tokens in data]
  return lemmatize_token


In [42]:
def create_freq_dist(data):
  """Merge the token counts of all reviews into a single FreqDist."""
  created_freq_dist = FreqDist()

  # each element of data is the token list of one review
  for tokens in data:
    created_freq_dist.update(tokens)

  return created_freq_dist

In [49]:
positive_reviews = [tokens for tokens, label in training_data if label == "pos"]
normalised_positive_reviews = normalized_tokens(positive_reviews)
filtered_positive_reviews = stopword_removal(normalised_positive_reviews)
lemmatized_positive_reviews = lemmatize_tokens(filtered_positive_reviews)
pos_freq_dist = create_freq_dist(lemmatized_positive_reviews)
pos_freq_dist

FreqDist({'film': 4141, 'movie': 2286, 'one': 2192, 'character': 1417, 'like': 1327, 'time': 1077, 'story': 982, 'scene': 969, 'make': 937, 'get': 926, ...})

In [50]:
negative_reviews = [tokens for tokens, label in training_data if label == "neg"]
normalised_negative_reviews = normalized_tokens(negative_reviews)
filtered_negative_reviews = stopword_removal(normalised_negative_reviews)
lemmatized_negative_reviews = lemmatize_tokens(filtered_negative_reviews)
neg_freq_dist = create_freq_dist(lemmatized_negative_reviews)
neg_freq_dist

FreqDist({'film': 3532, 'movie': 2755, 'one': 2032, 'like': 1366, 'character': 1305, 'get': 1067, 'time': 1028, 'even': 1000, 'make': 952, 'scene': 926, ...})

In [54]:
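def most_frequent_words(freq1,freq2,k):
    """Return the k words whose count in freq1 most exceeds their count in freq2."""

    # subtraction keeps only words with a positive count difference
    difference = freq1-freq2
    mostwords = [token for (token, freq) in difference.most_common(k)]

    return mostwords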

a)

In [55]:
positive_word_list = most_frequent_words(pos_freq_dist,neg_freq_dist,10)
positive_word_list

['film',
 'life',
 'great',
 'also',
 'well',
 'story',
 'best',
 'war',
 'performance',
 'world']

b)

In [56]:
negative_word_list = most_frequent_words(neg_freq_dist,pos_freq_dist,10)
negative_word_list

['bad',
 'movie',
 'plot',
 'even',
 'minute',
 'script',
 'get',
 'guy',
 'worst',
 'boring']

c)

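- Cleaned the data: lowercased every token, replaced numbers and ordinals with placeholders, then removed stopwords, non-alphabetic tokens and the placeholders so that only content words remain. The remaining tokens were lemmatized so that inflected forms of a word (for example "films" and "film") are counted together instead of splitting their frequency.
- Converted each review in the training data to a frequency distribution and merged these into one distribution per class.
- For each class, chose the ten words whose frequency exceeds their frequency in the other class by the largest margin, since words that are frequent in both classes (such as "one") do not distinguish them.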
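
The frequency-difference step can be sketched on toy data as follows. This is a minimal illustration, assuming `collections.Counter` as a stand-in for NLTK's `FreqDist` (which subclasses it); the token lists and the helper `total_counts` are made up for the example.

```python
from collections import Counter

# Toy token lists standing in for the preprocessed reviews (made-up data).
pos_docs = [["great", "film", "great", "story"], ["film", "best", "story"]]
neg_docs = [["bad", "film", "boring", "plot"], ["bad", "movie", "plot", "film"]]

def total_counts(docs):
    """Sum per-document token counts into one corpus-level Counter."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    return totals

pos_counts = total_counts(pos_docs)
neg_counts = total_counts(neg_docs)

# Counter subtraction keeps only words with a strictly positive difference,
# so a word that is equally common in both classes ("film" here) drops out.
pos_minus_neg = pos_counts - neg_counts
neg_minus_pos = neg_counts - pos_counts

top_positive = [word for word, _ in pos_minus_neg.most_common(2)]
top_negative = [word for word, _ in neg_minus_pos.most_common(2)]
print(top_positive)
print(top_negative)
```

Note that this uses raw count differences, so a very frequent word can still appear in a list when its counts differ between classes, as "film" and "movie" do in the real output above.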

2)
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]
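
One possible shape for such a classifier is sketched below. It is only an illustration under stated assumptions: the function name `classify_by_word_lists`, the short word lists, and the rule that ties count as positive are all choices made for this example, not requirements of the question.

```python
def classify_by_word_lists(tokens, positive_words, negative_words):
    """Label a review 'pos' or 'neg' by counting hits from each word list."""
    positive_set = set(positive_words)
    negative_set = set(negative_words)
    score = 0
    for token in tokens:
        if token in positive_set:
            score += 1
        elif token in negative_set:
            score -= 1
    # Ties (score == 0) default to 'pos'; this tie-break is an arbitrary assumption.
    return "pos" if score >= 0 else "neg"

# Shortened, illustrative word lists.
positive_words = ["great", "best", "performance"]
negative_words = ["bad", "worst", "boring"]

review = ["the", "plot", "was", "boring", "and", "the", "script", "was", "bad"]
print(classify_by_word_lists(review, positive_words, negative_words))
```

If the word lists were built from lemmatized tokens, the review tokens would need the same preprocessing before classification so that the two vocabularies match.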


3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]
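
The four metrics can be computed from the counts of true and false positives and negatives. The sketch below assumes a hypothetical helper `evaluate` that treats one label as the positive class, and uses a deliberately imbalanced, made-up test set to show how accuracy alone can mislead.

```python
def evaluate(gold, predicted, positive_label="pos"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    tn = len(pairs) - tp - fp - fn

    # Guard every division so an empty denominator yields 0.0 instead of an error.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy set: 9 negatives, 1 positive, and a classifier that always says 'neg'.
gold = ["neg"] * 9 + ["pos"]
predicted = ["neg"] * 10
scores = evaluate(gold, predicted)
print(scores)
```

Here accuracy is high although the classifier never finds the positive review, which recall and F1 expose. With a balanced test set, such as the equal numbers of positive and negative reviews produced by `get_train_test_data`, accuracy is far less prone to this problem.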

4)
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results.

[12.5\%]
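
To make the comparison concrete, it can help to see what a Naive Bayes classifier computes. The sketch below is a from-scratch multinomial variant with add-one smoothing, written for illustration only; it is not NLTK's `NaiveBayesClassifier`, and the function names and toy data are assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Collect per-class document counts and word counts from (tokens, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify_naive_bayes(tokens, model):
    """Return the class with the highest log posterior, using add-one smoothing."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data in the same (tokens, label) shape as training_data.
toy_training = [
    (["great", "film"], "pos"),
    (["best", "story"], "pos"),
    (["bad", "plot"], "neg"),
    (["boring", "film"], "neg"),
]
model = train_naive_bayes(toy_training)
print(classify_naive_bayes(["great", "story"], model))
print(classify_naive_bayes(["boring", "plot"], model))
```

Unlike the word list classifier, this model uses every word in the training vocabulary and weights each one by how strongly it is associated with a class.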

5)
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions.

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]
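
An experiment of this kind typically rebuilds the word lists for several lengths and records a score for each. The sketch below shows that loop on made-up data, assuming `collections.Counter` in place of `FreqDist`, accuracy as the metric, and hypothetical helper names; a real design would also need to justify the range of lengths and the tie-breaking rule.

```python
from collections import Counter

def top_k_difference(counts_a, counts_b, k):
    """Top-k words by positive count difference (counts_a minus counts_b)."""
    return [word for word, _ in (counts_a - counts_b).most_common(k)]

def classify(tokens, positive_words, negative_words):
    """Word list classifier; ties default to 'pos' (an arbitrary assumption)."""
    pos_hits = sum(token in positive_words for token in tokens)
    neg_hits = sum(token in negative_words for token in tokens)
    return "pos" if pos_hits >= neg_hits else "neg"

def accuracy_for_k(train, test, k):
    """Build k-word lists from train, then measure accuracy on test."""
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in train:
        (pos_counts if label == "pos" else neg_counts).update(tokens)
    positive_words = set(top_k_difference(pos_counts, neg_counts, k))
    negative_words = set(top_k_difference(neg_counts, pos_counts, k))
    correct = sum(classify(tokens, positive_words, negative_words) == label
                  for tokens, label in test)
    return correct / len(test)

# Made-up toy data; a real run would use the preprocessed training and testing sets.
train = [(["great", "great", "story"], "pos"), (["great", "best", "film"], "pos"),
         (["bad", "bad", "plot"], "neg"), (["bad", "boring", "film"], "neg")]
test = [(["best", "story"], "pos"), (["boring", "plot"], "neg")]

results = [(k, accuracy_for_k(train, test, k)) for k in (1, 2, 3)]
print(results)
```

The resulting `(k, accuracy)` pairs can then be plotted with matplotlib, which the notebook already imports, to produce the required graph.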
