# Assignment 2 - CT5120/CT5146

### Instructions:
- Complete all the tasks below and upload your submission as a Python notebook on Blackboard with the filename `StudentID_Lastname.ipynb` before **23:59** on **October 24, 2021**.
- This is an individual assignment; you **must not** work with other students to complete this assessment.
- The assignment is worth $100$ marks and constitutes 19% of the final grade. The breakdown of the marking scheme for each task is as follows:

| Task | Marks for write-up | Marks for code | Total Marks |
| :--- | :----------------- | :------------- | :---------- |
| 1    |                  - |              5 |           5 |
| 2    |                 15 |             15 |          30 |
| 3    |                  - |             10 |          10 |
| 4    |                 10 |              5 |          15 |
| 5    |                 15 |             25 |          40 |



---

This assignment covers feature engineering, training and evaluation of a classifier for suggestion detection. You will work with the data from SemEval-2019 Task 9 Subtask A to classify whether a piece of text contains a suggestion.


Download `Data.Assignment2.SemEvalTask9SubtaskA.csv` from Blackboard or uncomment the code cell below to get the data as a comma-separated values (CSV) file. The CSV file contains a header row followed by 6,100 rows spread across 3 columns of data. Each row of data contains a unique id, a piece of text and a label assigned by an annotator. A label of $1$ indicates that the given text contains a suggestion while a label of $0$ indicates that the text does not contain a suggestion.
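
As a quick sanity check after downloading, the layout described above can be inspected with `pandas`. The snippet below is a minimal sketch that uses a tiny in-memory CSV in place of the real file: the three rows and the header name `sentence` are made up for illustration, and the column names `id`, `text` and `label` simply mirror the reading code in Task 1.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a header row plus three made-up rows.
sample_csv = io.StringIO(
    "id,sentence,label\n"
    "1,Please add an option to export reports.,1\n"
    "2,The update installed without any problems.,0\n"
    "3,\"It would help if the app supported dark mode, too.\",1\n"
)

# header=0 discards the file's own header row; names= supplies the column names.
sample_df = pd.read_csv(sample_csv, names=["id", "text", "label"], header=0)

print(sample_df.shape)                    # (rows, columns)
print(sample_df["label"].value_counts())  # suggestions vs non-suggestions
```

Running the same two `print` calls on the real data frame should report 6,100 rows and 3 columns if the file downloaded completely.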

You can find more details about the dataset in Sections 1, 2, 3 and 4 of [SemEval-2019 Task 9: Suggestion Mining from Online Reviews and Forums](https://aclanthology.org/S19-2151/).


In [None]:
# !curl "https://raw.githubusercontent.com/pasricha/Subtask-A/master/Data.Assignment2.SemEvalTask9SubtaskA.csv" > Data.Assignment2.SemEvalTask9SubtaskA.csv

---

## Task 1: Reading Data (5 marks)

The following cell of code reads the texts and the corresponding labels of suggestion/non-suggestion from the CSV file. The first task is to create training and test sets. Use the final $1000$ rows of the data as a test set and the rest of the data for training.
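
Holding out the last rows of an already shuffled list comes down to list slicing. The following is a minimal sketch on a toy list, not the required solution: `toy_data` and `n_test` are hypothetical names, and the sizes are scaled down from the assignment's.

```python
# Toy stand-in for `data`: ten (text, label) tuples, assumed already shuffled.
toy_data = [(f"text {i}", i % 2) for i in range(10)]
n_test = 3  # hypothetical held-out size; the assignment uses 1000

# Everything before the last n_test rows is kept for training.
toy_train, toy_test = toy_data[:-n_test], toy_data[-n_test:]

# zip(*pairs) transposes a list of (text, label) pairs into two sequences.
toy_train_texts, toy_train_labels = map(list, zip(*toy_train))
toy_test_texts, toy_test_labels = map(list, zip(*toy_test))

print(len(toy_train_texts), len(toy_test_texts))  # 7 3
```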

In [None]:
import numpy as np
import pandas as pd

# Read the CSV file.
df = pd.read_csv('Data.Assignment2.SemEvalTask9SubtaskA.csv', 
                 names=['id', 'text', 'label'], header=0)

# Set seed for reproducibility and shuffle the rows.
np.random.seed(888)
df = df.sample(frac=1).reset_index(drop=True)

# Store the data as a list of tuples where the first item is the text
# and the second item is the label.
data = [(text, label) for (idx, text, label) in df.values.tolist()]

# Create training and test sets.
train_texts, train_labels = [], []
test_texts, test_labels = [], []

#################### EDIT BELOW THIS LINE #########################

# your code goes here


#################### EDIT ABOVE THIS LINE #########################

# Check that training set and test set are of the right size.
assert len(test_texts) == len(test_labels) == 1000
assert len(train_texts) == len(train_labels) == 5100

---

## Task 2: Data Pre-processing (30 marks)

Explain at least 3 steps that you will perform to preprocess the texts before training a classifier.
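
To illustrate what a pre-processing step can look like in code, here is a minimal sketch that chains three common steps: lowercasing, URL removal and regex tokenisation. The choice of steps, the function name `preprocess` and both patterns are assumptions made for this example; they are not a prescribed answer, and other steps may suit this dataset better.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# A word made of letters/digits, optionally followed by an apostrophe suffix.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def preprocess(text):
    """Lowercase, strip URLs, and split into word tokens (illustrative only)."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    return TOKEN_PATTERN.findall(text)

print(preprocess("Please ADD a dark mode! See https://example.com/ideas"))
# ['please', 'add', 'a', 'dark', 'mode', 'see']
```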

Edit this cell to write your answer below the line in no more than 300 words.

---


> Delete this line and write your answer here


---

In the code cell below, write an implementation of the steps you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

In [None]:
# your code goes here


---

## Task 3: Feature Engineering (I) - TF-IDF as features (10 marks)

In the lectures we have seen that raw counts of words and `tf-idf` scores can be useful features for a classification task. Complete the following code cell to create a suggestion detector which uses `tf-idf` scores as features for a Naïve Bayes classifier.

After applying your preprocessing steps, use the training data to train the classifier and make predictions on the test set. You **must not** use the test set for training.
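
The fit-on-train, transform-on-test pattern can be sketched on a toy corpus as follows. The texts, labels and all `demo_` names are invented for illustration; the sketch only shows how the imported `scikit-learn` classes fit together and is not tuned for the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import GaussianNB

# Made-up toy corpus; 1 = suggestion, 0 = non-suggestion.
demo_train_texts = [
    "please add an export button",
    "you should allow custom themes",
    "the app crashed twice today",
    "i installed the update yesterday",
]
demo_train_labels = [1, 1, 0, 0]
demo_test_texts = ["please allow custom export", "the update crashed"]

# Learn the vocabulary and idf weights from the training texts only.
counter = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(counter.fit_transform(demo_train_texts))

# Reuse the fitted vocabulary and idf weights on the test texts.
X_test = tfidf.transform(counter.transform(demo_test_texts))

# GaussianNB expects dense arrays, so convert the sparse matrices first.
clf = GaussianNB()
clf.fit(X_train.toarray(), demo_train_labels)
demo_predictions = clf.predict(X_test.toarray()).tolist()
print(demo_predictions)
```

Calling `fit_transform` only on the training texts and `transform` on the test texts is what keeps test information out of training. Note that converting to dense arrays can use a lot of memory when the vocabulary is large.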

If everything is implemented correctly, the cell should output a single floating-point value between 0 and 1, which is the accuracy of the classifier on the test set.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import GaussianNB

# Calculate tf-idf scores for the words in the training set.
# ... your code goes here



# Train a Naïve Bayes classifier using the tf-idf scores for words as features.
# ... your code goes here



# Predict on the test set.
predictions = []    # save your predictions on the test set into this list

# ... your code goes here



#################### DO NOT EDIT BELOW THIS LINE #################

def accuracy(labels, predictions):
  '''
  Calculate the accuracy score for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  assert len(labels) == len(predictions)
  
  correct = 0
  for label, prediction in zip(labels, predictions):
    if label == prediction:
      correct += 1 
  
  score = correct / len(labels)
  return score

# Calculate accuracy score for the classifier using tf-idf features.
accuracy(test_labels, predictions)

---

## Task 4: Evaluation Metrics (15 marks)

Why is accuracy not the best measure for evaluating a classifier? Describe an evaluation metric which might work better than accuracy for a classification task such as suggestion detection.
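
One way to explore this question is to score a degenerate classifier on a toy, imbalanced label set. The sketch below uses invented labels and computes confusion-matrix counts by hand; precision, recall and F1 appear purely as examples of quantities that can be derived from those counts, not as the expected answer.

```python
# Invented labels: 9 non-suggestions and 1 suggestion.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A degenerate "classifier" that always predicts the majority class.
toy_predictions = [0] * len(toy_labels)

pairs = list(zip(toy_labels, toy_predictions))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

toy_accuracy = (tp + tn) / len(pairs)
# Guard against division by zero when a class is never predicted or absent.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(toy_accuracy, recall, f1)  # 0.9 0.0 0.0
```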


Edit this cell to write your answer below the line in no more than 150 words.

---

> Delete this line and write your answer here

---

In the code cell below, write an implementation of the evaluation metric you defined above. You are free to use a library such as `nltk` or `sklearn` for this task, or you can write your own implementation from scratch.

In [None]:
def evaluate(labels, predictions):
  '''
  Calculate an evaluation score other than accuracy for a given set of predictions and labels.
  
  Args:
    labels (list): A list containing gold standard labels annotated as `0` and `1`.
    predictions (list): A list containing predictions annotated as `0` and `1`.

  Returns:
    float: A floating point value to score the predictions against the labels.
  '''

  # check that labels and predictions are of same length
  assert len(labels) == len(predictions)

  score = 0.0
  
  #################### EDIT BELOW THIS LINE #########################

  # your code goes here


  #################### EDIT ABOVE THIS LINE #########################

  return score

# Calculate evaluation score based on the metric of your choice
# for the classifier trained in Task 3 using tf-idf features.
evaluate(test_labels, predictions)

---

## Task 5: Feature Engineering (II) - Other features (40 marks)

Describe features other than those defined in Task 3 which might improve the performance of your suggestion detector. If these features require any additional pre-processing steps, then define those steps as well.
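
As an illustration of what a non-tf-idf feature can look like in code, the sketch below turns one text into a small numeric vector. The function name `handcrafted_features`, the cue-word list and the three features are all assumptions chosen for this example; they are not taken from the dataset and are not claimed to improve performance.

```python
import re

# Hypothetical cue words; this list is an assumption, not derived from the data.
CUE_WORDS = {"should", "please", "suggest", "recommend", "could", "would"}

def handcrafted_features(text):
    """Return a small numeric feature vector for one text (illustrative only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [
        len(tokens),                                  # length in tokens
        sum(token in CUE_WORDS for token in tokens),  # number of cue words
        int(text.rstrip().endswith("?")),             # ends with a question mark
    ]

print(handcrafted_features("You should add a dark mode, please"))  # [7, 2, 0]
```

Vectors like these can be stacked column-wise with other feature matrices, for example with `numpy.hstack` for dense arrays or `scipy.sparse.hstack` for sparse ones.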


Edit this cell to write your answer below the line in no more than 500 words.

---

> Delete this line and write your answer here

---

In the code cell below, write an implementation of the features (and any additional pre-processing steps) you defined above. You are free to use a library such as `nltk` or `sklearn` for this task.

After creating your features, use the training data to train a Naïve Bayes classifier and use the test set to evaluate its performance using the metric defined in Task 4. You **must not** use the test set for training.

To make sure that your code doesn't take too long to run or use too much memory, you can consider a time limit of 3 minutes and a memory limit of 12GB for this task.
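
A rough way to check a cell against such limits is to time it and track peak memory with the standard library. In the sketch below, `run_pipeline` is a hypothetical placeholder for your own feature creation, training and evaluation code. `tracemalloc` only sees allocations made through Python's memory allocators, so the reported peak should be treated as an approximate lower bound rather than the full process footprint.

```python
import time
import tracemalloc

def run_pipeline():
    """Hypothetical placeholder for feature creation, training and evaluation."""
    return sum(i * i for i in range(100_000))

tracemalloc.start()
start = time.perf_counter()

result = run_pipeline()

elapsed_seconds = time.perf_counter() - start
_, peak_bytes = tracemalloc.get_traced_memory()  # returns (current, peak)
tracemalloc.stop()

print(f"elapsed: {elapsed_seconds:.2f} s, peak traced memory: {peak_bytes / 1e6:.1f} MB")
```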

In [None]:
# Create your features.
# ... your code goes here



# Train a Naïve Bayes classifier using the features you defined.
# ... your code goes here



# Evaluate on the test set.
# ... your code goes here
