
# NLE Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data. To provide unique datasets for analysis by different students, you must enter your candidate number in the following cell. Otherwise, do not change the code in these cells.

In [3]:
# Random seed number

candidateno=284246 #this MUST be updated to your candidate number so that you get a unique data sample


In [4]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


In [5]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    data = list(data)
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = list(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return (train, test)


def get_train_test_data():

    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')

    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`.

In [6]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['since', 'their', 'film', 'debut', 'in', '1984', ...], 'pos')
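
The role of `random.seed(candidateno)` can be illustrated with a small sketch. The helper `split_indices` below is hypothetical (it is not part of the assignment code) and mirrors only the index sampling inside `split_data`. Under that assumption, reseeding with the same number reproduces the same split.

```python
import random

def split_indices(n, ratio=0.7):
    # Hypothetical helper: mirrors the index sampling done in split_data
    train = random.sample(range(n), int(n * ratio))
    test = sorted(set(range(n)) - set(train))
    return train, test

# Seeding with the same number before each call gives the same split
random.seed(284246)
first = split_indices(10)
random.seed(284246)
second = split_indices(10)
print(first == second)
```

A different seed will generally give a different split, which is why each candidate number yields its own sample.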


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why.

[20\%]

### TODO

1. To generate a list of content words, the first step is to construct a bag-of-words representation of each document in the training and testing sets. This records how often each word occurs in the document.

2. The second step is to pre-process the word lists to remove punctuation and stopwords, since these are not content words.

In [8]:
# Step 1: construct a bag-of-words representation (a FreqDist) for each document in the training and testing sets.
training_basic=[(FreqDist(wordlist),label) for (wordlist,label) in training_data]
testing_basic=[(FreqDist(wordlist),label) for (wordlist,label) in testing_data]


[(FreqDist({'the': 58, ',': 52, 'of': 29, '"': 28, 'and': 28, '.': 22, 'a': 20, 'in': 18, 'to': 12, 'their': 11, ...}), 'pos'), (FreqDist({',': 34, '.': 29, 'the': 22, 'and': 21, "'": 19, 'a': 19, ')': 18, 'of': 16, '(': 16, 'to': 16, ...}), 'pos'), (FreqDist({',': 49, '.': 37, 'the': 33, 'to': 22, 'a': 21, 'i': 18, 'of': 17, 'in': 16, '"': 16, 'and': 15, ...}), 'pos'), (FreqDist({'the': 63, ',': 50, '.': 42, 'of': 27, '(': 23, ')': 23, 'and': 23, 'a': 21, 'to': 20, 'is': 16, ...}), 'pos'), (FreqDist({'the': 37, ',': 36, 'of': 34, '.': 30, 'in': 18, 'and': 18, 'a': 18, "'": 15, 'his': 14, 'to': 13, ...}), 'pos'), (FreqDist({'the': 33, ',': 26, '.': 26, '"': 22, "'": 21, 'to': 17, 'and': 15, '-': 13, 'a': 13, 'of': 13, ...}), 'pos'), (FreqDist({"'": 47, '"': 34, 'the': 31, ',': 29, '.': 27, 'a': 25, 's': 24, 'and': 17, '(': 15, ')': 15, ...}), 'pos'), (FreqDist({',': 44, 'the': 35, '.': 22, 'a': 20, 'of': 17, 'i': 8, 'in': 8, "'": 7, 'that': 7, 'is': 7, ...}), 'pos'), (FreqDist({',': 55

In [17]:
# Step 2: pre-processing.

# Import the English stopwords; storing them in a set makes the membership test in normalise fast
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

# ---This function is copied from previous lab work.---
# It takes a wordlist, converts every word to lower case, and removes punctuation, other non-alphabetic tokens and stopwords.
def normalise(wordlist):
    lowered=[word.lower() for word in wordlist]
    filtered=[word for word in lowered if word.isalpha() and word not in stop]
    return filtered

# normalise all the data
training_norm=[(FreqDist(normalise(wordlist)),label) for (wordlist,label) in training_data]
testing_norm=[(FreqDist(normalise(wordlist)),label) for (wordlist,label) in testing_data]

# print one item to check that the normalisation worked
print(type(training_norm))
print(training_norm[0])

<class 'list'>
(FreqDist({'lebowski': 8, 'dude': 8, 'one': 7, 'big': 5, 'coen': 4, 'films': 4, 'time': 4, 'fargo': 4, 'lot': 4, 'coens': 4, ...}), 'pos')
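
One possible next step, sketched here on toy data rather than the real corpus, is to rank words by how much more often they occur in one class than in the other. The function `most_representative` and the toy documents are illustrative assumptions, and `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass). Other scoring schemes, such as comparing document frequencies, would be equally valid.

```python
from collections import Counter

def most_representative(docs, label, k=10):
    """Rank words by (count in `label` docs) - (count in all other docs)."""
    in_class, out_class = Counter(), Counter()
    for counts, doc_label in docs:
        target = in_class if doc_label == label else out_class
        target.update(counts)
    diff = {word: in_class[word] - out_class[word] for word in in_class}
    # Sort by descending difference, breaking ties alphabetically
    ranked = sorted(diff.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Toy stand-in for training_norm (assumed data, for illustration only)
toy_docs = [
    (Counter({"great": 3, "film": 2, "moving": 1}), "pos"),
    (Counter({"great": 1, "film": 1, "funny": 2}), "pos"),
    (Counter({"boring": 3, "film": 2, "plot": 1}), "neg"),
    (Counter({"awful": 2, "film": 1, "boring": 1}), "neg"),
]

top_pos = most_representative(toy_docs, "pos", k=2)
top_neg = most_representative(toy_docs, "neg", k=2)
print(top_pos, top_neg)
```

Because `film` is equally frequent in both classes, its difference is zero and it drops out of both lists, which is the behaviour wanted from a "representative" word list.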


2)
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]
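
A word list classifier for this question could look roughly like the sketch below. The class name `WordListClassifier`, the scoring rule (positive word count minus negative word count) and the choice to label ties as `'pos'` are all assumptions made for illustration, not requirements of the assignment.

```python
class WordListClassifier:
    """Hypothetical sketch: score a document by counting list words."""

    def __init__(self, pos_words, neg_words):
        self.pos_words = set(pos_words)
        self.neg_words = set(neg_words)

    def classify(self, counts):
        # counts maps word -> frequency, like a FreqDist or a dict
        score = 0
        for word, freq in counts.items():
            if word in self.pos_words:
                score += freq
            elif word in self.neg_words:
                score -= freq
        # Assumption: a tie (score == 0) is labelled 'pos'
        return "pos" if score >= 0 else "neg"

    def classify_many(self, docs):
        return [self.classify(counts) for counts in docs]

clf = WordListClassifier(["great", "funny"], ["boring", "awful"])
labels = clf.classify_many([
    {"great": 2, "boring": 1},
    {"awful": 3, "funny": 1},
    {"plot": 4},
])
print(labels)
```

The tie-breaking rule matters more than it may seem: with short word lists, many reviews contain no list words at all and fall through to the default label.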


3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]
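
The four metrics can be computed from the counts of true and false positives and negatives. The sketch below assumes `'pos'` is treated as the positive class and uses a hypothetical helper `evaluate`. The second call illustrates the kind of imbalanced scenario part b) asks about, where accuracy looks high although no positive review is found.

```python
def evaluate(gold, predicted, positive="pos"):
    """Return accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Balanced toy example
balanced = evaluate(["pos", "pos", "pos", "neg", "neg", "neg"],
                    ["pos", "pos", "neg", "pos", "neg", "neg"])

# Imbalanced toy example: a classifier that always answers 'neg'
imbalanced = evaluate(["neg"] * 95 + ["pos"] * 5, ["neg"] * 100)

print(balanced)
print(imbalanced)
```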

4)
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results.

[12.5\%]
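
As a starting point for this question, NLTK's `NaiveBayesClassifier` can be trained on `(featureset, label)` pairs. The sketch below uses toy data and a hypothetical helper `to_features` that turns word counts into binary presence features. Whether to use presence features or raw counts is a design decision left open here.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

def to_features(counts):
    # Hypothetical helper: binary "word is present" features
    return {word: True for word in counts}

# Toy stand-ins for training_norm and testing_norm (assumed data)
toy_train = [
    ({"great": 2, "moving": 1}, "pos"),
    ({"great": 1, "funny": 1}, "pos"),
    ({"dull": 2, "boring": 1}, "neg"),
    ({"boring": 1, "awful": 1}, "neg"),
]
toy_test = [
    ({"great": 1, "funny": 2}, "pos"),
    ({"awful": 1, "dull": 1}, "neg"),
]

train_set = [(to_features(counts), label) for counts, label in toy_train]
test_set = [(to_features(counts), label) for counts, label in toy_test]

nb_classifier = NaiveBayesClassifier.train(train_set)
predictions = [nb_classifier.classify(features) for features, _ in test_set]
nb_accuracy = accuracy(nb_classifier, test_set)
print(predictions, nb_accuracy)
```

On the real data, the same test set and the same metrics should be used for both classifiers so that the comparison is fair.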

5)
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions.

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]
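
One way to structure the experiment in part a) is to loop over candidate list lengths, rebuild the word lists for each length, and record the accuracy on held-out data. The sketch below is self-contained and runs on toy data. The helper names (`top_words`, `classify`, `accuracy_for_length`), the frequency-difference ranking and the tie-goes-to-`'pos'` rule are assumptions carried over for illustration.

```python
from collections import Counter

def top_words(docs, label, k):
    # Rank words by count in `label` docs minus count in the other docs
    in_class, out_class = Counter(), Counter()
    for counts, doc_label in docs:
        (in_class if doc_label == label else out_class).update(counts)
    diff = {w: in_class[w] - out_class[w] for w in in_class}
    ranked = sorted(diff.items(), key=lambda item: (-item[1], item[0]))
    return {w for w, _ in ranked[:k]}

def classify(counts, pos_words, neg_words):
    score = sum(f for w, f in counts.items() if w in pos_words)
    score -= sum(f for w, f in counts.items() if w in neg_words)
    return "pos" if score >= 0 else "neg"  # assumption: ties go to 'pos'

def accuracy_for_length(train, test, k):
    pos_words = top_words(train, "pos", k)
    neg_words = top_words(train, "neg", k)
    correct = sum(classify(c, pos_words, neg_words) == label for c, label in test)
    return correct / len(test)

# Toy data (assumed, for illustration only)
toy_train = [
    ({"great": 3, "film": 2, "moving": 1}, "pos"),
    ({"great": 1, "film": 1, "funny": 2}, "pos"),
    ({"boring": 3, "film": 2, "plot": 1}, "neg"),
    ({"awful": 2, "film": 1, "boring": 1}, "neg"),
]
toy_test = [
    ({"great": 1}, "pos"),
    ({"boring": 2}, "neg"),
    ({"awful": 1, "film": 2}, "neg"),
    ({"plot": 1, "film": 1}, "neg"),
]

results = [(k, accuracy_for_length(toy_train, toy_test, k)) for k in (1, 2, 3)]
print(results)
```

The `results` pairs can then be passed to `plt.plot` (list lengths on the x-axis, accuracy on the y-axis) to produce the required graph. On the real data, averaging over several random splits would make the curve less sensitive to a single sample.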


In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath

import nbformat

from google.colab import drive
drive.mount('/content/drive')

filepath="/content/drive/MyDrive/Colab Notebooks/NLEassignment2023.ipynb"
question_count=432

# nbformat.current is deprecated, so read the notebook with the public API (converted to version 4)
with open(filepath, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

word_count = 0
for cell in nb.cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))



Mounted at /content/drive
Submission length is 0
