
# NLE Assignment: Sentiment Classification

In this assignment, you will be investigating NLP methods for distinguishing positive and negative reviews written about movies.

For assessment, you are expected to complete and submit this notebook file.  When answers require code, you may import and use library functions (unless explicitly told otherwise).  All of your own code should be included in the notebook rather than imported from elsewhere.  Written answers should also be included in the notebook.  You should insert as many extra cells as you want and change the type between code and markdown as appropriate.

In order to avoid misconduct, you should not talk about the assignment questions with your peers.  If you are not sure what a question is asking you to do or have any other questions, please ask me or one of the Teaching Assistants.

Marking guidelines are provided as a separate document.

The first few cells contain code to set up the assignment and bring in some data.  In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.  Otherwise, do not change the code in these cells.

In [55]:
candidateno=11111119 #this MUST be updated to your candidate number so that you get a unique data sample


In [56]:
#do not change the code in this cell
#preliminary imports

#set up nltk
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('movie_reviews')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import movie_reviews

#for setting up training and testing data
import random

#useful other tools
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.probability import FreqDist
from nltk.classify.api import ClassifierI


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [57]:
#do not change the code in this cell
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given corpus generator and ratio:
     - partitions the corpus into training data and test data, where the proportion in train is ratio,

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  
    n = len(data)  
    train_indices = random.sample(range(n), int(n * ratio))          
    test_indices = list(set(range(n)) - set(train_indices))    
    train = [data[i] for i in train_indices]           
    test = [data[i] for i in test_indices]             
    return (train, test)                       
 

def get_train_test_data():
    
    #get ids of positive and negative movie reviews
    pos_review_ids=movie_reviews.fileids('pos')
    neg_review_ids=movie_reviews.fileids('neg')
   
    #split positive and negative data into training and testing sets
    pos_train_ids, pos_test_ids = split_data(pos_review_ids)
    neg_train_ids, neg_test_ids = split_data(neg_review_ids)
    #add labels to the data and concatenate
    training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
    testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]
   
    return training, testing

When you have run the cell below, your unique training and testing samples will be stored in `training_data` and `testing_data`.

In [58]:
#do not change the code in this cell
random.seed(candidateno)
training_data,testing_data=get_train_test_data()
print("The amount of training data is {}".format(len(training_data)))
print("The amount of testing data is {}".format(len(testing_data)))
print("The representation of a single data item is below")
print(training_data[0])

The amount of training data is 1400
The amount of testing data is 600
The representation of a single data item is below
(['razor', 'blade', 'smile', 'running', 'as', 'part', ...], 'pos')


1)  
a) **Generate** a list of 10 content words which are representative of the positive reviews in your training data.

b) **Generate** a list of 10 content words which are representative of the negative reviews in your training data.

c) **Explain** what you have done and why

[20\%]

In [59]:
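import string

stop_words = set(stopwords.words("english"))

# Count how often each content word occurs in the positive training reviews
count_pos = {}
for sent in training_data:
  if sent[1] == 'pos':
    # drop stopwords and punctuation marks
    gen_words = [word for word in sent[0] if word not in stop_words and word not in string.punctuation]
    for word in gen_words:
      # a first occurrence counts as 1 (starting at 0 undercounts every word by one)
      count_pos[word] = count_pos.get(word, 0) + 1

# sort ascending by count and keep the 10 most frequent words
sorted_count_pos = sorted(count_pos.items(), key=lambda x: x[1])
content_words_pos = [word[0] for word in sorted_count_pos[-10:]]
content_words_pos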

['well',
 'even',
 'also',
 'story',
 'good',
 'time',
 'like',
 'movie',
 'one',
 'film']

In [60]:
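# Count how often each content word occurs in the negative training reviews
count = {}

for sent in training_data:
  if sent[1] == 'neg':
    # drop stopwords and punctuation marks
    gen_words = [word for word in sent[0] if word not in stop_words and word not in string.punctuation]
    for word in gen_words:
      # a first occurrence counts as 1 (starting at 0 undercounts every word by one)
      count[word] = count.get(word, 0) + 1

# sort ascending by count and keep the 10 most frequent words
sorted_count_neg = sorted(count.items(), key=lambda x: x[1])
content_words_neg = [word[0] for word in sorted_count_neg[-10:]]
content_words_neg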

['get', 'bad', 'would', 'time', 'good', 'even', 'like', 'one', 'movie', 'film']

In [61]:
#Some words appear in both the negative and the positive content words
#Let's drop them and add others so that each set contains unique words
# repeated_words = []
# for word in content_words_pos:
#   if word in content_words_neg:
#     repeated_words.append(word)

# print(repeated_words)

# unrepeated_pos = [word for word in content_words_pos if word not in repeated_words]
# unrepeated_neg = [word for word in content_words_neg if word not in repeated_words]

# print("Pos:", unrepeated_pos, "Neg: ", unrepeated_neg)

# i = 1
# while len(unrepeated_pos) < 10 and len(unrepeated_neg) < 10:
#   content_words_neg = [word[0] for word in sorted_count_neg[-10 - i:]]
#   content_words_pos = [word[0] for word in sorted_count_pos[-10 - i:]]

#   repeated_words = []
#   for word in content_words_pos:
#     if word in content_words_neg:
#       repeated_words.append(word)

#   unrepeated_pos = [word for word in content_words_pos if word not in repeated_words]
#   unrepeated_neg = [word for word in content_words_neg if word not in repeated_words]
#   i+=1

# print(unrepeated_pos, "\n", unrepeated_neg)
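
One way to obtain lists with no shared words, which is what the commented-out cell above sets out to do, is to rank each word by how much more often it occurs in one class than in the other. The following is a minimal sketch on a toy corpus; `representative_words` and the toy reviews are hypothetical names and data, not part of the assignment code.

```python
from collections import Counter

def representative_words(target_docs, other_docs, k=10, ignore=frozenset()):
    """Return the k words whose count in target_docs most exceeds their count in other_docs."""
    target = Counter(w for doc in target_docs for w in doc if w not in ignore)
    other = Counter(w for doc in other_docs for w in doc if w not in ignore)
    # Score each word by the difference in raw counts between the two classes
    scores = {w: c - other[w] for w, c in target.items()}
    # Sort by descending score, breaking ties alphabetically for a stable result
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]

# Toy data standing in for the tokenised reviews (hypothetical)
toy_pos = [["great", "film", "great", "cast"], ["film", "moving", "great"]]
toy_neg = [["bad", "film", "dull"], ["film", "bad", "plot", "bad"]]

top_pos = representative_words(toy_pos, toy_neg, k=2, ignore={"the"})
top_neg = representative_words(toy_neg, toy_pos, k=2, ignore={"the"})
print(top_pos, top_neg)
```

Raw count differences are only comparable when both classes contain a similar amount of text, as they do in this balanced dataset; with unbalanced classes, relative frequencies would be the safer choice.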

2) 
a) **Use** the lists generated in Q1 to build a **word list classifier** which will classify reviews as being positive or negative.

b) **Explain** what you have done.

[12.5\%]


In [95]:
import numpy as np

def word_list_classifier(data):
  """Classify each review by counting hits from the two word lists.

  Returns a 2x2 confusion matrix with rows = actual label and
  columns = predicted label: [[TP, FN], [FP, TN]], where 'pos' is the positive class.
  """
  true_pos = 0
  true_neg = 0
  false_pos = 0
  false_neg = 0

  for sentence in data:
    chance_pos = 0
    chance_neg = 0
    for word in sentence[0]:
      # words on both lists carry no signal, so they are skipped
      if word in content_words_neg and word in content_words_pos:
        continue
      elif word in content_words_neg:
        chance_neg += 1
      elif word in content_words_pos:
        chance_pos += 1

    # ties are resolved in favour of "pos"
    prediction = "neg" if chance_neg > chance_pos else "pos"

    if prediction == "pos":
      if sentence[1] == "pos":
        true_pos += 1
      else:
        false_pos += 1  # negative review predicted positive
    else:
      if sentence[1] == "neg":
        true_neg += 1
      else:
        false_neg += 1  # positive review predicted negative

  print("True pos", true_pos, "True Negative", true_neg, "False pos", false_pos, "False neg", false_neg)
  return np.array([[true_pos, false_neg], [false_pos, true_neg]])


In [81]:
pred = word_list_classifier(training_data)
pred

True pos 517 True Negative 350 False pos 350 False neg 183


array([[517, 183],
       [350, 350]])
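
The decision rule used above can also be written as a function that returns one prediction per review, which keeps classification separate from evaluation. This is a minimal sketch under that assumption; `classify_by_word_lists` and the toy word lists are hypothetical and only mirror the rule in the cell above (shared words are ignored, ties go to "pos").

```python
def classify_by_word_lists(words, pos_words, neg_words, tie_label="pos"):
    """Label one tokenised review by counting words exclusive to each list."""
    pos_set, neg_set = set(pos_words), set(neg_words)
    # Words that sit on both lists cancel out, so only list-exclusive words count
    pos_hits = sum(1 for w in words if w in pos_set and w not in neg_set)
    neg_hits = sum(1 for w in words if w in neg_set and w not in pos_set)
    if pos_hits == neg_hits:
        return tie_label
    return "pos" if pos_hits > neg_hits else "neg"

# Toy word lists (hypothetical); "good" is on both lists and is therefore ignored
toy_pos_words = ["good", "well", "story"]
toy_neg_words = ["bad", "good", "get"]

label_a = classify_by_word_lists(["a", "bad", "good", "plot"], toy_pos_words, toy_neg_words)
label_b = classify_by_word_lists(["well", "told", "story"], toy_pos_words, toy_neg_words)
label_c = classify_by_word_lists(["plot"], toy_pos_words, toy_neg_words)
print(label_a, label_b, label_c)
```

Converting the lists to sets makes each membership test constant time, which matters once the word lists grow to hundreds of entries.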

3)
a) **Calculate** the accuracy, precision, recall and F1 score of your classifier.

b) Is it reasonable to evaluate the classifier in terms of its accuracy?  **Explain** your answer and give a counter-example (a scenario where it would / would not be reasonable to evaluate the classifier in terms of its accuracy).

[20\%]

In [92]:
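def accuracies(pred):
  # pred is the confusion matrix returned by word_list_classifier:
  # row 0 = positive reviews, row 1 = negative reviews,
  # column 0 = predicted positive, column 1 = predicted negative
  accuracy = ((pred[0][0] + pred[1][1])/pred.sum()) * 100
  precision = (pred[0][0]/(pred[0][0] + pred[1][0])) * 100  # TP/(TP + FP)
  recall = (pred[0][0]/(pred[0][0] + pred[0][1])) * 100  # TP/(TP + FN)
  
  return (accuracy, precision, recall)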

In [85]:
#Accuracy
#(TP + TN)/Total * 100
((pred[0][0] + pred[1][1])/len(training_data)) * 100

61.92857142857143

In [88]:
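#Recall
#TP/(TP + FN)
#row 0 of pred holds the positive reviews: pred[0][0] were predicted positive, pred[0][1] negative

(pred[0][0]/(pred[0][0] + pred[0][1])) * 100

73.85714285714286

In [89]:
#Precision
#TP/(TP + FP)
#column 0 of pred holds the reviews predicted positive: pred[1][0] are the negative reviews among them
(pred[0][0]/(pred[0][0] + pred[1][0])) * 100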

59.63091118800461
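
Question 3a also asks for the F1 score, which is the harmonic mean of precision and recall. The sketch below computes it from the counts in the confusion matrix printed earlier; `f1_from_counts` is a hypothetical helper name introduced here for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score (harmonic mean of precision and recall) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts read off the confusion matrix above: 517 positive reviews predicted positive,
# 350 negative reviews predicted positive, 183 positive reviews predicted negative
f1 = f1_from_counts(tp=517, fp=350, fn=183)
print(round(f1 * 100, 2))
```

Because F1 treats false positives and false negatives symmetrically, it is equivalent to 2TP / (2TP + FP + FN).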

4) 
a)  **Construct** a Naive Bayes classifier (e.g., from NLTK).

b)  **Compare** the performance of your word list classifier with the Naive Bayes classifier.  **Discuss** your results. 

[12.5\%]

In [91]:
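# Load every review as a (word list, label) pair
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Note: this is a fresh shuffle of all 2000 reviews, independent of training_data/testing_data
random.shuffle(documents)

# Frequency distribution over all lower-cased words in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Note: keys() is in first-seen order, not frequency order, so this takes the
# first 3000 distinct words encountered; all_words.most_common(3000) would
# select the 3000 most frequent words instead
word_features = list(all_words.keys())[:3000]

def find_features(document):
    # Boolean bag-of-words: one feature per vocabulary word, True if it occurs in the review
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against (only 100 reviews, so the accuracy estimate is coarse)
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)

print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)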

Classifier accuracy percent: 85.0
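
The comparison asked for in 4b is easier to interpret when both classifiers see the same split. The following sketch shows one way to turn `(words, label)` pairs, the shape used by `training_data` and `testing_data`, into Boolean feature dictionaries; `build_vocabulary`, `to_featureset` and the toy data are hypothetical names introduced for illustration.

```python
from collections import Counter

def build_vocabulary(labelled_docs, size):
    """Return the `size` most frequent lower-cased words in the given documents."""
    counts = Counter(w.lower() for words, _ in labelled_docs for w in words)
    return [w for w, _ in counts.most_common(size)]

def to_featureset(labelled_docs, vocabulary):
    """Convert (words, label) pairs into (feature dict, label) pairs."""
    featuresets = []
    for words, label in labelled_docs:
        present = set(w.lower() for w in words)
        # One Boolean feature per vocabulary word
        featuresets.append(({w: (w in present) for w in vocabulary}, label))
    return featuresets

# Toy stand-ins for training_data and testing_data (hypothetical)
toy_train = [(["a", "great", "film"], "pos"), (["a", "bad", "film"], "neg")]
toy_test = [(["great", "cast"], "pos")]

# The vocabulary is built from the training split only, so the test data stays unseen
vocab = build_vocabulary(toy_train, size=3)
train_feats = to_featureset(toy_train, vocab)
test_feats = to_featureset(toy_test, vocab)
print(vocab)
print(test_feats)
```

Feature sets in this shape can be passed to `nltk.NaiveBayesClassifier.train` and `nltk.classify.accuracy`, the same calls used in the cell above.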


5) 
a) Design and **carry out an experiment** into the impact of the **length of the wordlists** on the wordlist classifier.  Make sure you **describe** design decisions in your experiment, include a **graph** of your results and **discuss** your conclusions. 

b) Would you **recommend** a wordlist classifier or a Naive Bayes classifier for future work in this area?  **Justify** your answer.

[25\%]


In [100]:
content_words_neg = [word[0] for word in sorted_count_neg[-135:]]
content_words_pos = [word[0] for word in sorted_count_pos[-135:]]

pred = word_list_classifier(training_data)
accuracy, precision, recall = accuracies(pred)
print("Accuracy:", accuracy,"Precision:", precision, "Recall", recall)

True pos 502 True Negative 434 False pos 266 False neg 198
Accuracy: 66.85714285714286 Precision: 65.36458333333334 Recall 71.71428571428572
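
The cell above tries a single list length. An experiment on the impact of list length would repeat the same steps over a range of lengths and record the accuracy for each. The sketch below does this on a toy corpus; the helper names and data are hypothetical, and it assumes accuracy is measured on held-out test reviews rather than on the training reviews.

```python
from collections import Counter

def top_words(docs, k):
    """Return the set of the k most frequent words in docs."""
    counts = Counter(w for doc in docs for w in doc)
    return {w for w, _ in counts.most_common(k)}

def accuracy_for_length(train, test, k):
    """Build word lists of length k from train and return accuracy on test."""
    pos_words = top_words([d for d, y in train if y == "pos"], k)
    neg_words = top_words([d for d, y in train if y == "neg"], k)
    correct = 0
    for words, label in test:
        # Words on both lists are ignored, as in the word list classifier above
        pos_hits = sum(1 for w in words if w in pos_words and w not in neg_words)
        neg_hits = sum(1 for w in words if w in neg_words and w not in pos_words)
        prediction = "neg" if neg_hits > pos_hits else "pos"  # ties go to "pos"
        correct += prediction == label
    return correct / len(test)

# Toy stand-ins for the training and testing samples (hypothetical)
toy_train = [
    (["film", "great", "great"], "pos"),
    (["film", "great", "fun"], "pos"),
    (["film", "bad", "bad"], "neg"),
    (["film", "bad", "dull"], "neg"),
]
toy_test = [(["great", "fun"], "pos"), (["dull", "film"], "neg")]

sizes = [1, 2, 3]
results = {k: accuracy_for_length(toy_train, toy_test, k) for k in sizes}
print(results)

# To graph the results in the notebook (matplotlib is already imported as plt):
# plt.plot(sizes, [results[k] for k in sizes])
# plt.xlabel("word list length"); plt.ylabel("accuracy")
```

On the real data, `sizes` would cover a much wider range, and the stopword and punctuation filtering from Q1 would be applied before counting.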


In [None]:
##This code will word count all of the markdown cells in the notebook saved at filepath
##Running it before providing any answers shows that the questions have a word count of 437

import io
from nbformat import current

#filepath="/content/drive/My Drive/NLE Notebooks/assessment/assignment1.ipynb"
filepath="NLassignment2021.ipynb"
question_count=437

with io.open(filepath, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')

word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print("Submission length is {}".format(word_count-question_count))

FileNotFoundError: ignored