# Assignment 2 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename “`StudentID_Lastname.ipynb`” before **23:59** on **October 24, 2021**.
- This is an individual assignment, you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  - |              5 |           5 |
| 2    |                 15 |             15 |          30 |
| 3    |                  - |             10 |          10 |
| 4    |                 10 |              5 |          15 |
| 5    |                 15 |             25 |          40 |



---

This assignment involves tasks for feature engineering, training and evaluating a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 subtask A to classify whether a piece of text contains a suggestion or not. 


Download `Data.Assignment2.SemEvalTask9SubtaskA.csv` from Blackboard or uncomment the code cell below to get the data as a comma-separated values (CSV) file. The CSV file contains a header row followed by 6,100 rows spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).


In [1]:
!curl "https://raw.githubusercontent.com/pasricha/Subtask-A/master/Data.Assignment2.SemEvalTask9SubtaskA.csv" > Data.Assignment2.SemEvalTask9SubtaskA.csv



---

## Task 1: Reading Data (5 marks)

The following cell of code reads the texts and the corresponding labels of suggestion/non-suggestion from the CSV file. The first task is to create training and test sets. Use the final $1000$ rows of the data as a test set and the rest of the data for training.

In [2]:
import numpy as np
import pandas as pd

# Read the CSV file.
df = pd.read_csv('Data.Assignment2.SemEvalTask9SubtaskA.csv', 
                 names=['id', 'text', 'label'], header=0)

# Set seed for reproducibility and shuffle the rows.
np.random.seed(888)
df = df.sample(frac=1).reset_index(drop=True)

# Store the data as a list of tuples where the first item is the text
# and the second item is the label.
data = [(text, label) for (idx, text, label) in df.values.tolist()]

# Create training and test sets.
train_texts, train_labels = [], []
test_texts, test_labels = [], []

#################### EDIT BELOW THIS LINE #########################

# your code goes here
# The final 1000 rows form the test set; every row before them is training data.
train_texts = [text for text, label in data[:-1000]]
train_labels = [label for text, label in data[:-1000]]
test_texts = [text for text, label in data[-1000:]]
test_labels = [label for text, label in data[-1000:]]

#################### EDIT ABOVE THIS LINE #########################

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1000
assert len(train_texts) == len(train_labels) == 5100

---

## Task 2: Data Pre-processing (30 Marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.

Edit this cell to write your answer below the line in no more than 300 words.

---

Before training the classifier I will perform the following steps on the training data:

1. Tokenise - separate each sentence into a list of words so that later operations work at the word level.
2. Standardise - convert every word to lowercase so that, e.g., "Version" and "version" are treated as the same word, and remove special characters.
3. Remove stop words - drop commonly occurring words, such as the 50 or 100 most frequent words.
4. Lemmatise - reduce the remaining words to their base form so that, e.g., ruin, ruining and ruined are treated as the same word, as are improve and improving.
    

---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [3]:
# your code goes here
import re

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Keep the stop word set under its own name so that the imported
# `stopwords` module is not overwritten when this cell is re-run.
stop_word_set = set(stopwords.words('english'))
wordLemmatizer = WordNetLemmatizer()


def standardiser(text):
    """
    Takes in a list of texts, lowercases them and removes every character
    that is not a letter, a digit or whitespace.
    """
    new_text = []
    for i in text:
        lower_case = i.lower()
        lower_case = re.sub(r'[^0-9a-zA-Z\s]+', '', lower_case)
        new_text.append(lower_case)
    return new_text


def tokenizer(text):
    """
    Using NLTKs word_tokenize, split every text in the corpus into a list of words
    """
    return [word_tokenize(i) for i in text]


def stop_words(text):
    """
    Using NLTKs stopwords, remove them from all the sentences in the tokenized corpus
    """
    filtered_corpus = []
    for line in text:
        filtered_line = [word for word in line if word not in stop_word_set]
        filtered_corpus.append(filtered_line)
    return filtered_corpus


def lemmatizer(text):
    """
    Using NLTKs WordNetLemmatizer, lemmatize all words in the corpus
    """
    lemmatized_corpus = []
    for line in text:
        # pos="a" lemmatises every token as an adjective, so verb forms
        # such as "ruining" are generally left unchanged.
        lemmatized_line = [wordLemmatizer.lemmatize(word, pos="a") for word in line]
        lemmatized_corpus.append(lemmatized_line)
    return lemmatized_corpus


sd = standardiser(train_texts)
tk = tokenizer(sd)
sw = stop_words(tk)

lem = lemmatizer(sw)
corpus_prime = [" ".join(word) for word in lem]



---

## Task 3: Feature Engineering (I) - TF-IDF as features (10 Marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.

If everything is implemented correctly, then you should see a single floating point value between 0 and 1 at the end which denotes the accuracy of the classifier.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here
vectoriser = CountVectorizer()
word_count_vector = vectoriser.fit_transform(corpus_prime)

# Fit the idf weights on the training counts and weight those counts in one step.
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tf_idf_vector = tfidf_transformer.fit_transform(word_count_vector)

# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here

gnb = GaussianNB()
gnb.fit(tf_idf_vector.toarray(), train_labels)

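# Predict on the test set.
# ... your code goes here
# The raw test texts are vectorised directly here: they do not pass through the
# Task 2 preprocessing functions, and the features are word counts, not tf-idf scores.
X_test_counts = vectoriser.transform(test_texts)
predictions = gnb.predict(X_test_counts.toarray())    # save your predictions on the test set into this list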


#################### DO NOT EDIT BELOW THIS LINE #################

def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

0.566
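
For comparison, here is a minimal pure-Python sketch of the usual pattern of learning the tf-idf weights from the training documents only and then applying the same preprocessing and the same weights to unseen test documents. It is not the code used in the cell above. The names `preprocess`, `fit_idf` and `tfidf_vector` and the toy sentences are hypothetical, and `preprocess` is only a stand-in for real preprocessing steps. The idf formula is the smoothed form that scikit-learn documents for `smooth_idf=True`, followed by L2 normalisation.

```python
import math
from collections import Counter


def preprocess(text):
    """Hypothetical stand-in for real preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(train_docs):
    """Learn smoothed idf weights from tokenised training documents."""
    n_docs = len(train_docs)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for doc in train_docs for word in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    return {word: math.log((1 + n_docs) / (1 + df)) + 1
            for word, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Weight one tokenised document with fitted idf values and L2-normalise it."""
    # Words that were never seen during fitting are dropped.
    counts = Counter(word for word in doc if word in idf)
    weights = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()} if norm else {}


# Fit on (toy) training data only.
train_docs = [preprocess(t) for t in ["Please add dark mode", "The app crashed again"]]
idf = fit_idf(train_docs)

# The test document goes through the same preprocessing and the same idf weights.
test_doc = preprocess("Please ADD an export button")
test_vector = tfidf_vector(test_doc, idf)
```

In this toy case only `please` and `add` survive in `test_vector`, because the other test words were not in the training vocabulary.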

---

## Task 4: Evaluation Metrics (15 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.


Edit this cell to write your answer below the line in no more than 150 words.

---

Accuracy is not a great metric for evaluating a classifier for the following reasons:
 - It does not distinguish between false negatives and false positives. For example, when detecting cancer it is undesirable to have a system that is biased towards false negatives.
 - It does not work well when the classes are not evenly split. If one class makes up 95% of the dataset, a model can reach 95% accuracy by predicting that class for every observation.

A more representative metric for evaluating this classifier is the F1 score, which is the harmonic mean of precision and recall. F1 ranges from 0 to 1, and higher values indicate a model that classifies suggestions correctly (high precision) without missing too many of them (high recall).
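
The 95% scenario can be worked through in a few lines. This is an illustrative sketch on made-up labels, not on the assignment data; the helper name `accuracy_and_f1` and both sets of predictions are hypothetical.

```python
def accuracy_and_f1(labels, predictions):
    """Return (accuracy, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Made-up imbalanced data: 95 non-suggestions followed by 5 suggestions.
labels = [0] * 95 + [1] * 5

# A degenerate model that always predicts the majority class.
always_zero = [0] * 100
acc_majority, f1_majority = accuracy_and_f1(labels, always_zero)

# A model that finds 4 of the 5 suggestions at the cost of 6 false alarms.
useful = [0] * 89 + [1] * 6 + [1] * 4 + [0]
acc_useful, f1_useful = accuracy_and_f1(labels, useful)
```

The majority-class model scores an accuracy of 0.95 but an F1 of 0, while the second model scores a slightly lower accuracy of 0.93 and an F1 of about 0.53 (precision 0.4, recall 0.8).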

---

In the code cell below, write an implementation of the evaluation metric you defined above. You are free to use a library such as `nltk` or `sklearn` for this task, or you can write your own implementation from scratch.

In [5]:
def evaluate(labels, predictions):
    '''
    Calculate an evaluation score other than accuracy for a given set of predictions and labels.

    Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

    Returns:
    float: A floating point value to score the predictions against the labels.
    '''

    # check that labels and predictions are of same length
    assert len(labels) == len(predictions)

    score = 0.0

    #################### EDIT BELOW THIS LINE #########################

    # your code goes here
    # This is my own implementation of F1.
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    for label, prediction in zip(labels, predictions):
        if label == 1 and prediction == 1:
            TP += 1
        elif label == 0 and prediction == 0:
            TN += 1
        elif label == 1 and prediction == 0:
            FN += 1
        elif label == 0 and prediction == 1:
            FP += 1

    # Guard against division by zero when there are no predicted
    # positives (TP + FP == 0) or no actual positives (TP + FN == 0).
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    if precision + recall > 0:
        score = 2 * (precision * recall / (precision + recall))



    #################### EDIT ABOVE THIS LINE #########################

    return score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 3 using tf-idf features.
evaluate(test_labels, predictions)

0.5272331154684096

---

## Task 5: Feature Engineering (II) - Other features (40 Marks)

Describe features other than those defined in Task 3 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.


Edit this cell to write your answer below the line in no more than 500 words.

---

Other features and preprocessing steps that can be applied:
 - Frequency-based stop word removal: count the term frequencies and remove the top N most frequent words, iterating over upper and lower rank cut-offs to maximise F1.
 - Word2Vec embeddings: represent words in a vector space so that words with similar meanings lie close together.
 - Sentence-level representation: average the Word2Vec vectors of all words in a sentence.
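
The sentence-level averaging idea can be sketched as follows. This is only an illustration: `average_vector` is a hypothetical helper, and `toy_embeddings` is a hand-written 3-dimensional lookup table standing in for the vectors that a trained Word2Vec model would provide.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    Tokens missing from `embeddings` are skipped; if no token is found,
    a zero vector of length `dim` is returned.
    """
    found = [embeddings[token] for token in tokens if token in embeddings]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the values of each dimension across all found vectors.
    return [sum(values) / len(found) for values in zip(*found)]


# Hypothetical toy embeddings; real Word2Vec vectors have many more dimensions.
toy_embeddings = {
    "please": [1.0, 0.0, 0.0],
    "add": [0.0, 1.0, 0.0],
    "feature": [0.0, 0.0, 1.0],
}

# "dark" and "mode" have no embedding, so only "please" and "add" are averaged.
sentence_vector = average_vector(["please", "add", "dark", "mode"], toy_embeddings, dim=3)
```

Each sentence then becomes one fixed-length, real-valued feature vector, which is the kind of input a Gaussian Naïve Bayes classifier accepts.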
 
 

---

In the code cell below, write an implementation of the features (and any additional pre-preprocessing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 4. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.

In [6]:
# Create your features.
# ... your code goes here
import itertools

def high_freq_words(text, most_likely, least_likely):
    """
    Keep only the words ranked between `most_likely` and `least_likely` by
    corpus frequency: the `most_likely` most frequent words and every word
    rarer than rank `least_likely` are removed.
    """
    freq_words = FreqDist(word for sentence in text for word in sentence)
    target_words = set(w[0] for w in freq_words.most_common(least_likely)) - set(w[0] for w in freq_words.most_common(most_likely))
    filtered_corpus = []
    for line in text:
        filtered_line = [word for word in line if word in target_words]
        filtered_corpus.append(filtered_line)
    return filtered_corpus

# Standardising, tokenising and stop word removal do not depend on the
# cut-offs, so they are computed once instead of in every iteration.
sd = standardiser(train_texts)
tk = tokenizer(sd)
sw = stop_words(tk)

for highest_rank, lowest_rank in itertools.product(range(0,101,10), range(500,5001,500)):

    hfw = high_freq_words(sw, highest_rank, lowest_rank)
    lem = lemmatizer(hfw)
    corpus_prime = [" ".join(word) for word in lem]


    # Train a Naïve Bayes classifier using the features you defined.
    # ... your code goes here
    vectoriser = CountVectorizer()
    word_count_vector = vectoriser.fit_transform(corpus_prime)

    # Fit the idf weights on the training counts and weight those counts in one step.
    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    tf_idf_vector = tfidf_transformer.fit_transform(word_count_vector)

    # Train a Naïve Bayes classifier using the tf-idf scores for words as features.
    # ... your code goes here

    gnb = GaussianNB()
    gnb.fit(tf_idf_vector.toarray(), train_labels)

    # Predict on the test set.
    # ... your code goes here
    # As in Task 3, the raw test texts are vectorised directly into word counts.
    X_test_counts = vectoriser.transform(test_texts)
    predictions = gnb.predict(X_test_counts.toarray())


    # Evaluate on the test set.
    # ... your code goes here
    print(highest_rank, lowest_rank, evaluate(test_labels, predictions))

0 500 0.5757225433526012
0 1000 0.5949656750572083
0 1500 0.5900552486187846
0 2000 0.574537540805223
0 2500 0.5849889624724062


KeyboardInterrupt: 