For this assignment, we will explore Twitter Emotion Classification. The goal is to identify the primary emotion expressed in a tweet. Consider the following tweets:
```
Tweet 1: @NationalGallery @ThePoldarkian I have always loved this painting.
Tweet 2: '@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up #foundationcourses.'
``` 

How would you describe the emotions in `Tweet 1` vs `Tweet 2`? `Tweet 1` expresses enjoyment and happiness, while `Tweet 2` directly expresses anger. For this assignment, we will be working with the SMILE Twitter Emotion Dataset ([Wang et al. 2016](https://ceur-ws.org/Vol-1619/paper3.pdf)). At a high level, our goal is to develop different models (rule-based, machine learning, and deep learning), which can be used to identify the emotion of a tweet. You will be required to clean and preprocess the data, generate features for classification, train various models, and evaluate the models. 


*Submission Details*
Please complete all the tasks in “Assignment 1.ipynb” and upload your submission as a Python notebook on Blackboard with the filename “StudentID_Lastname.ipynb”. Assignment 1 will be due by 11:59 PM GMT Monday February 20th, 2023.  

*Grading Policy*

Assignment 1 is graded and will be worth 25% of your overall grade. This assignment is worth a total of 50 points distributed over the tasks below. 
Please note that this is an individual assignment and you must not work with other students to complete this assessment. Any copying from other students, from student exercises from previous years, and any internet resources will not be tolerated. Plagiarised assignments will receive zero marks and the students who commit this act will be reported. 

Feel free to reach out to the TAs and instructors if you have any questions.

Before you get started, run the cells below to install a few relevant libraries and download the dataset.

In [294]:
!pip install transformers
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [295]:
!wget -O data.csv "https://figshare.com/ndownloader/files/4988956"
!pip install emoji

import nltk
nltk.download('punkt')

--2023-02-27 23:20:02--  https://figshare.com/ndownloader/files/4988956
Resolving figshare.com (figshare.com)... 54.217.34.18, 34.252.222.205, 2a05:d018:1f4:d003:825f:f38:d5f1:5837, ...
Connecting to figshare.com (figshare.com)|54.217.34.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230227/eu-west-1/s3/aws4_request&X-Amz-Date=20230227T232002Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=1a1bca37a5e58bf8a922d7383809be4af326c86536a43dce0df8c82f07f57d8f [following]
--2023-02-27 23:20:02--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230227/eu-west-1/s3/aws4_request&X-Amz-Date=20230227T232002Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=1a1bca37a5e58bf8a922

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Task 1. Data Cleaning, Preprocessing, and Splitting [15 points]
The `data` variable holds the SMILE dataset as a pandas dataframe. Our dataset has three columns: `id`, `tweet`, and `label`. The `tweet` column contains the raw scraped tweet and the `label` column contains the annotated emotion category. Each tweet is labelled with one of the following emotion labels:
- 'nocode', 'not-relevant' 
- 'happy', 'happy|surprise', 'happy|sad'
- 'angry', 'disgust|angry', 'disgust' 
- 'sad', 'sad|disgust', 'sad|angry', 'sad|disgust|angry' 
- 'surprise'

### Task 1a. Label Consolidation [3 points]
As we can see above, the annotated categories are complex. Several tweets express mixed emotions (e.g. 'happy|sad') or multiple emotions (e.g. 'sad|disgust|angry'). The first thing we need to do is clean up our dataset by removing complex examples and consolidating others so that we have a clean set of emotions to predict. 

For Task 1a., write code which does the following:
1. Drop all rows which have the label 'happy|sad', 'happy|surprise', 'sad|disgust|angry', or 'sad|angry'.
2. Re-label 'nocode' and 'not-relevant' as 'no-emotion'.
3. Re-label 'disgust|angry' and 'disgust' as 'angry'.
4. Re-label 'sad|disgust' as 'sad'.

Your updated `data` dataframe should have 3,062 rows and 5 label categories (no-emotion, happy, angry, sad, and surprise).


In [296]:
import numpy as np
import pandas as pd
data = pd.read_csv("data.csv", names=["id", "tweet", "label"])

#checking the current values for labels (current counts)
print('\033[1m'+ '\033[4m' + "LABEL COUNTS BEFORE PREPROCESSING\n" + '\033[0m')
print(data.label.value_counts())

#keeping only the rows whose label is retained or consolidated below; this drops the mixed labels listed in step 1
#.copy() gives an independent dataframe so the re-labelling below does not modify a view of the original
data = data[data.label.isin(['nocode', 'happy', 'not-relevant', 'angry','surprise', 'sad', "disgust|angry", 'disgust', "sad|disgust"])].copy()
print('\n\033[1m'+ '\033[4m' + "LABEL COUNTS AFTER PREPROCESSING\n" + '\033[0m')
print(data.label.value_counts())

#source: https://datatofish.com/replace-values-pandas-dataframe/
#relabelling 'nocode' and 'not-relevant' as 'no-emotion'.
data['label'] = data['label'].replace(['nocode', 'not-relevant'], 'no-emotion')

#relabelling 'disgust|angry' and 'disgust' as 'angry'
data['label'] = data['label'].replace(['disgust|angry', 'disgust'], 'angry')

#relabelling 'sad|disgust' as 'sad'.
data['label'] = data['label'].replace(['sad|disgust'], 'sad')

print('\n\033[1m'+ '\033[4m' + "LABEL COUNTS AFTER RE-LABELLING\n" + '\033[0m')
print(data.label.value_counts())

[1m[4mLABEL COUNTS BEFORE PREPROCESSING
[0m
nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: label, dtype: int64

[1m[4mLABEL COUNTS AFTER PREPROCESSING
[0m
nocode           1572
happy            1137
not-relevant      214
angry              57
surprise           35
sad                32
disgust|angry       7
disgust             6
sad|disgust         2
Name: label, dtype: int64

[1m[4mLABEL COUNTS AFTER RE-LABELLING
[0m
no-emotion    1786
happy         1137
angry           70
surprise        35
sad             34
Name: label, dtype: int64


### Task 1a Tests 
Run the cell below to evaluate your code. To get full credit for this task, your code must pass all tests. Any alteration of the testing code will automatically result in 0 points. 

In [297]:
# Test 1. Data should have 5 unique labels.
print(f"Unique label test: {len(data['label'].unique()) == 5}")

# Test 2. Data labels must be: angry, happy, no-emotion, sad, and surprise
labels = ["angry", "happy", "no-emotion", "sad", "surprise"]
print(f"Label check: { set(data['label'].unique()).difference(labels) == set() }")

# Test 3. Check example counts per label
print(f"Angry example count: {len(data[data['label']=='angry']) == 70}")
print(f"Happy example count: {len(data[data['label']=='happy']) == 1137}")
print(f"No-Emotion example count: {len(data[data['label']=='no-emotion']) == 1786}")
print(f"Sad example count: {len(data[data['label']=='sad']) == 34}")
print(f"Surprise example count: {len(data[data['label']=='surprise']) == 35}")

Unique label test: True
Label check: True
Angry example count: True
Happy example count: True
No-Emotion example count: True
Sad example count: True
Surprise example count: True


### Task 1b. Tweet Cleaning and Processing [10 points]
Raw tweets are noisy. Consider the example below: 
```
'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up #foundationcourses. 😠'
```
The mention @tateliverpool and hashtag #BobandRoberta are extra noise that doesn't directly help with understanding the emotion of the text. The accompanying emoji can be useful but needs to be decoded to its text form (`:angry_face:`) first. 

For this task you will complete the `preprocess_tweet` function below with the following preprocessing steps:
1. Lower case all text
2. De-emoji the text
3. Remove all hashtags, mentions, and urls
4. Remove all non-alphabet characters except the following punctuation marks: period, exclamation mark, and question mark

Hints: 
- For step 2 (de-emoji), consider using the python [emoji](https://carpedm20.github.io/emoji/docs/) library. The `emoji.demojize` method will convert all emojis to plain text. The `emoji` library is installed in the setup cell at the top of the notebook.
- Follow the processing steps in order. For example, calling nltk's word_tokenize before removing hashtags and mentions will end up creating separate tokens for @ and # and cause problems.
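
The ordering hint can be illustrated with a minimal sketch that uses only the standard `re` module. The patterns and the helper name `strip_entities` are assumptions made for illustration, not the required solution.

```python
import re

def strip_entities(text: str) -> str:
    """Remove URLs, mentions, and hashtags, then collapse whitespace."""
    text = re.sub(r"http\S+", " ", text)   # URLs first, while they are still intact
    text = re.sub(r"[@#]\w+", " ", text)   # whole mentions and hashtags, including the word
    return re.sub(r"\s+", " ", text).strip()

raw = "@tateliverpool loving this #art http://t.co/abc"

# Correct order: entities are removed while "@" and "#" still mark them.
print(strip_entities(raw))

# Wrong order: stripping the symbols first leaves the handle and tag words behind.
symbols_first = re.sub(r"[@#]", "", raw)
print(strip_entities(symbols_first))
```

In the second call the words `tateliverpool` and `art` survive because nothing marks them as a mention or hashtag any more.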

To get full credit for this task, Test 1b must pass. Only modify the cell containing the `preprocess_tweet` function and do not alter the testing code block. 

After you are satisfied with your code, run the test cell to ensure your function works as expected. This cell will also create a new column called `cleaned_tweet` by applying the `preprocess_tweet` function to all the examples in the dataset. 

In [298]:
import emoji 
import re  # regular expressions for stripping unwanted characters from the tweet
import nltk
nltk.download('punkt')
from nltk.tokenize.treebank import TreebankWordDetokenizer

def preprocess_tweet(tweet: str) -> str:
    
    #code referred from the re module of python
    
    #code which will convert all the characters in the tweet to lowercase
    tweet = tweet.strip().lower()
    
    #code to convert all emojis to plain text.
    tweet = emoji.demojize(tweet)
    
    #code to remove mentions
    tweet = re.sub("@[A-Za-z0-9_]+","", tweet)
    
    #code to remove hashtags
    tweet = re.sub("#[A-Za-z0-9_]+","", tweet)
    
    #code to remove any http links from the tweet
    tweet = re.sub(r"http\S+", "", tweet)
    
    #code to remove ' from the tweet
    tweet = re.sub("'", "", tweet)
    
    #code to remove any non-ASCII character (code point 128 or above)
    tweet = ''.join([c for c in tweet if ord(c) < 128])
    
    #code to remove any digits
    tweet = re.sub(r'\d+', '', tweet)
    
    #code to remove HTML entities, i.e. words starting with & such as &amp
    tweet = re.sub("&[A-Za-z0-9_]+","", tweet)
    
    #since we have removed all the hashtags and mentions, we can tokenize the tweet
    tokens = nltk.word_tokenize(tweet)
    
    #drop tokens that consist of a single unwanted punctuation mark or symbol
    unwanted_tokens = {":", ";", "&", "(", ")", "<", ">", "{", "}", "[", "]", ",", "$", "%", "^", "*", "-", "+", "/"}
    tokens = [token for token in tokens if token not in unwanted_tokens]
    
    #remove underscores, e.g. the ones inside demojized names such as angry_face
    tokens = [token.replace("_", "") for token in tokens]
    
    tweet = TreebankWordDetokenizer().detokenize(tokens)
    
    #code to remove the trailing white spaces for the start and end of the tweet
    tweet = tweet.strip()
    
    return tweet 


test_tweet = "'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up! #foundationcourses 😠'"
print(preprocess_tweet(test_tweet))

i am angry more artists that have a profile are not speaking up! angryface


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Task 1b Test
Run the cell below to evaluate your code. To get full credit for this task, your code must pass all tests. Any alteration of the testing code will automatically result in 0 points. 

In [299]:
# Do NOT modify the code below. 
# Create new column with cleaned tweets. We will use this for the subsequent tasks
data["cleaned_tweet"] = data["tweet"].apply(preprocess_tweet)

# Test 1b 
test_tweet = "'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up! #foundationcourses 😠'"
clean_tweet = "i am angry more artists that have a profile are not speaking up! angryface"
print(f"Test 1b: {preprocess_tweet(test_tweet) == clean_tweet}")

Test 1b: True


### Task 1c. Generating Evaluation Splits [2 points]
Finally, we need to split our data into a train, validation, and test set. We will split the data using a 60-20-20 split, where 60% of our data is used for training, 20% for validation, and 20% for testing. As the dataset is heavily imbalanced, make sure you stratify the dataset to ensure that the label distributions across the three splits are roughly equal. 

Store your splits in the variables `train`, `val`, and `test` respectively. 

Hints:
- Use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function for this task. You'll have to call it twice to get the validation split. 
- Set the random state so the sampling can be reproduced (we use 2023 for our random state)
- Use the `stratify` parameter to ensure representative label distributions across the splits. 
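
The sizes produced by the two-call approach can be worked out by hand. The sketch below is plain arithmetic on the 3,062 rows from Task 1a; it assumes that the held-out share is rounded up when only a fractional `test_size` is given, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows: int, test_size: float) -> tuple:
    """Return (n_first, n_second) for one split, rounding the second share up.

    Assumption: this mirrors how the split is sized when only a
    fractional `test_size` is supplied.
    """
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second

n_train, n_holdout = split_sizes(3062, 0.4)   # first call: 60% train, 40% held out
n_val, n_test = split_sizes(n_holdout, 0.5)   # second call: halve the held-out part

print(n_train, n_val, n_test)
```

Under that assumption the test set has 613 rows, which agrees with the total support shown in the classification reports later in the notebook.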

In [300]:
from sklearn.model_selection import train_test_split

# Keep 60% of the data for training and hold out the remaining 40%
train, holdout = train_test_split(data, test_size=0.4, stratify=data.label, random_state=2023)

# Split the held-out 40% evenly into validation and test sets (20% each)
val, test = train_test_split(holdout, test_size=0.5, stratify=holdout.label, random_state=2023)

#train.shape,test.shape,val.shape
train

Unnamed: 0,id,tweet,label,cleaned_tweet
2161,610362800683270144,Last week: #CanAltay @kurusehir @tateliverpool...,no-emotion,last week
1566,614359996822876161,See all the photos from Wednesday's #Defeating...,no-emotion,see all the photos from wednesdays event at
1674,612391451817701380,.How To Make A Turquoise Goblet. via @britishm...,no-emotion,.how to make a turquoise goblet . via
2037,611153439817605120,Time is running out to catch New Rhythms @kett...,no-emotion,time is running out to catch new rhythms as we...
2946,612948937960521728,"@NationalGallery /．、 ｀\ | ； ； \ (`'ー，,ー'`...",no-emotion,"/\ | \ ` ` /\ ` """" ` """" ` \ ` "",-"" ` |\--/| ..."
...,...,...,...,...
3027,613017736088715265,Stunning Viking silver 'thistle' brooch (AD 90...,no-emotion,stunning viking silver thistle brooch ad s at ...
2324,613031427807006720,#LeonardoDaVinci The Virgin and... http://t.co...,no-emotion,the virgin and ...
1176,614704511836377088,Take #AWalkontheWildSide today with a visit to...,happy,take today with a visit to the art gallery . f...
1793,611892587721572352,What a treat RT @FitzMuseum_UK: Exhibited for ...,happy,what a treat rt exhibited for the first time i...


## Task 2: Naive Baseline Using a Rule-based Classifier [10 points]

Now that we have a dataset, let's work on developing some solutions for emotion classification. We'll start with implementing a simple rule-based classifier which will also serve as our naive baseline. Emotive language (e.g. awesome, feel great, super happy) can be a strong signal as to the overall emotion being expressed by the tweet. For each emotion in our label space (happy, surprise, sad, angry) we will generate a set of words and phrases that are often associated with that emotion. At classification time, the classifier will calculate a score based on the overlap between the words in the tweet and the emotive words and phrases for each of the emotions. The emotion label with the highest overlap will be selected as the prediction, and if there is no match the "no-emotion" label will be predicted. We can break the implementation of this rule-based classifier into three steps:
1. Emotive language extraction from train examples 
2. Developing a scoring algorithm
3. Building the end-to-end classification flow 

### Task 2a. Emotive Language Extraction [4 points] 
For this task you will generate a set of unigrams and bigrams that will be used to predict each of the labels. Using the training data you will need to extract all the unique unigrams and bigrams associated with each label (excluding no-emotion). Then you should ensure that the extracted terms for each emotion label do not appear in the other lists. In the real world, you would then manually curate the generated lists to ensure that associated words were useful and emotive. For the assignment, you won't be required to further curate the generated lists.

Once you've identified the appropriate terms, save them in the following variables: `happy_words`, `surprise_words`, `sad_words`, and `angry_words`. To get full credit for this section, ensure all 2a Tests pass. 

Hints
- We suggest you use Python's [set methods](https://realpython.com/python-sets/) for this task.
- NLTK has a function for extracting [ngrams](https://www.nltk.org/api/nltk.util.html?highlight=ngrams#nltk.util.ngrams). This function expects a list of tokens as input and will output tuples which you'll need to reconvert into strings. 
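
The tuple-to-string conversion mentioned in the last hint can be sketched as follows. To stay dependency-free, this sketch builds bigrams with `zip`, which yields the same consecutive pairs that `nltk.util.ngrams(tokens, 2)` produces; the helper name `unigrams_and_bigrams` is hypothetical.

```python
def unigrams_and_bigrams(tokens: list) -> set:
    """Return the unique unigrams and space-joined bigrams of a token list."""
    bigram_tuples = zip(tokens, tokens[1:])                       # consecutive pairs, as tuples
    bigram_strings = {" ".join(pair) for pair in bigram_tuples}   # tuples back to strings
    return set(tokens) | bigram_strings

terms = unigrams_and_bigrams("i feel great today".split())
print(sorted(terms))
```

Storing bigrams as strings keeps every term the same type, so unigrams and bigrams can share one set and be compared directly against text.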

In [301]:
# Your code here
from typing import Iterable, List
from nltk.util import ngrams

# 1. Extract all terms associated with each label
def extract_words(examples: Iterable[str]) -> set:
    
    """
    Given an iterable of tweets, return the set of unique unigrams (strings)
    and bigrams (token tuples) found across all the tweets. 
    """
    extracted_words = set()
    
    for example in examples:

        example = preprocess_tweet(example)
        
        #For each of the tweet, create unigrams and bigrams.
        #Referred lab notes from previous year
        unigrams = set(nltk.word_tokenize(example))
        bigrams = set(ngrams(nltk.word_tokenize(example), 2))
        
        #Update the retrieved word set to include the unigrams and bigrams.
        extracted_words.update(unigrams)
        extracted_words.update(bigrams)
    
    return extracted_words
    
# Words to be taken out for each emotion label
happy_words = extract_words(tweet for tweet in train[train['label'] == 'happy']['cleaned_tweet'].tolist())
sad_words = extract_words(tweet for tweet in train[train['label'] == 'sad']['cleaned_tweet'].tolist())
angry_words = extract_words(tweet for tweet in train[train['label'] == 'angry']['cleaned_tweet'].tolist())
surprise_words = extract_words(tweet for tweet in train[train['label'] == 'surprise']['cleaned_tweet'].tolist())

# Eliminate redundant terms from each emotion list.
# Referred from https://www.geeksforgeeks.org/python-set-difference/
happy_words = set(happy_words) - set(sad_words) - set(angry_words) - set(surprise_words)
sad_words = set(sad_words) - set(happy_words) - set(angry_words) - set(surprise_words)
angry_words = set(angry_words) - set(happy_words) - set(sad_words) - set(surprise_words)
surprise_words = set(surprise_words) - set(happy_words) - set(sad_words) - set(angry_words)
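
One detail worth checking in this step: when set differences are applied one after another and each result overwrites its input, a later subtraction sees an already-reduced set. A minimal sketch on toy data (the term sets are invented for illustration, and `exclusive_terms` is a hypothetical helper) computes every exclusive set from the untouched originals instead.

```python
# Toy term sets, invented for illustration only.
raw_terms = {
    "happy": {"love", "great", "wow"},
    "sad": {"miss", "love", "cry"},
    "angry": {"hate", "cry"},
    "surprise": {"wow", "unexpected"},
}

def exclusive_terms(terms_by_label: dict) -> dict:
    """Keep, for each label, only the terms that no other label uses."""
    exclusive = {}
    for label, terms in terms_by_label.items():
        # Union of the ORIGINAL term sets of every other label.
        others = set().union(
            *(t for other, t in terms_by_label.items() if other != label)
        )
        exclusive[label] = terms - others
    return exclusive

unique = exclusive_terms(raw_terms)
print(unique)
```

With sequential overwriting on the same toy data, `love` would survive in the `sad` set, because it has already been removed from `happy` by the time `sad` is processed.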


### Task 2a Tests
Run the cell below to evaluate your code. To get full credit for this task, your code must pass all tests. Any alteration of the testing code will automatically result in 0 points. 

In [302]:
# Check sets are non-empty
print("Checking sets are not empty: ")
print(f"Happy words count: {len(happy_words)}, {len(happy_words) > 0}")
print(f"Sad words count: {len(sad_words)}, {len(sad_words) > 0}")
print(f"Angry words count: {len(angry_words)}, {len(angry_words) > 0}")
print(f"Surprise words count: {len(surprise_words)}, {len(surprise_words) > 0}")

# Checks sets are disjoint 
union1 = sad_words.union(angry_words, surprise_words)
union2 = happy_words.union(surprise_words, angry_words) 
union3 = surprise_words.union(happy_words, sad_words)
union4 = angry_words.union(happy_words, sad_words) 

print("\nChecking sets are all disjoint:")
print(f"Happy words disjoint: {happy_words.isdisjoint(union1)}")
print(f"Sad words disjoint: {sad_words.isdisjoint(union2)}")
print(f"Angry words disjoint: {angry_words.isdisjoint(union3)}")
print(f"Surprise words disjoint: {surprise_words.isdisjoint(union4)}")

Checking sets are not empty: 
Happy words count: 7056, True
Sad words count: 416, True
Angry words count: 758, True
Surprise words count: 358, True

Checking sets are all disjoint:
Happy words disjoint: True
Sad words disjoint: True
Angry words disjoint: True
Surprise words disjoint: True


### Task 2b. Scoring using set overlaps [2 points]

Next we will implement the scoring algorithm. Our score will simply be the count of overlapping terms between the tweet text and the emotive terms. For this task, finish implementing the code below. To get full credit, ensure Test 2b is successful. 

In [303]:
sample_words = {'cat', 'hat', 'mat', 'bowling', 'bat'}
sample_tweet1 = "that cat is super cool sitting on the mat" 
sample_tweet2 = "the man in the bowling hat sat on the cat"
sample_tweet3 = "the quick brown fox jumped over the lazy dog"

#This function accepts a tweet as a string and returns the number of emotive words in the tweet. Return type is int
def score_tweet(tweet: str, emotive_words: set) -> int:

    words = preprocess_tweet(tweet).split()
    return len(emotive_words.intersection(words))

print(f"Test 1: {score_tweet(sample_tweet1, sample_words) == 2}")
print(f"Test 2: {score_tweet(sample_tweet2, sample_words) == 3}")
print(f"Test 3: {score_tweet(sample_tweet3, sample_words) == 0}")

Test 1: True
Test 2: True
Test 3: True


### Task 2c. Rule-based Classification [4 points] 
Let's put together our rule-based classification system. Fill out the logic in `simple_clf`. Given a tweet, `simple_clf` will generate the overlap score
for each of the emotion labels and return the emotion label with the highest score. If there is no match amongst the emotions, the classifier will return 'no-emotion'.

To get full credit for this section, your average F1 score must be greater than 0.

In [304]:
def simple_clf(tweet: str) -> str:
    
    """
    Given a tweet, calculate all the emotion overlap scores.
    Return the emotion label which has the largest score. If
    overlap score is 0, return no-emotion. 
    
    """

    tweet_words = set(preprocess_tweet(tweet).split())
    
    # For each emotion, count how many of the tweet's unique words also appear in
    # that emotion's word set, and store the counts in a dictionary called scores.
    #
    # Example: tweet_words holds the unique words of the preprocessed tweet and
    # happy_words holds the unique terms tied to the emotion "happy". The
    # intersection of the two sets is the set of words they share.
    scores = {
        
        "happy": len(tweet_words.intersection(happy_words)),
        "sad": len(tweet_words.intersection(sad_words)),
        "angry": len(tweet_words.intersection(angry_words)),
        "surprise": len(tweet_words.intersection(surprise_words))
    }
    
    
    # Return the emotion with the highest score, or 'no-emotion' if all scores are 0.
    # Ties go to the first emotion in `scores` (happy, sad, angry, surprise).
    max_score = max(scores.values())
    
    if max_score == 0:
        return "no-emotion"
    
    return max(scores, key=scores.get)

After finishing the above section, let's evaluate how our model did.

In [305]:
from sklearn.metrics import classification_report

preds = test["cleaned_tweet"].apply(simple_clf)
print(classification_report(test["label"], preds)) 

              precision    recall  f1-score   support

       angry       0.05      0.21      0.08        14
       happy       0.46      0.48      0.47       228
  no-emotion       0.86      0.08      0.15       357
         sad       0.00      0.00      0.00         7
    surprise       0.03      1.00      0.05         7

    accuracy                           0.24       613
   macro avg       0.28      0.36      0.15       613
weighted avg       0.67      0.24      0.27       613
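
The two summary rows in the report above weigh the classes differently, which matters on a dataset this imbalanced. The sketch below recomputes both F1 averages from the per-class values printed above; because those inputs are already rounded to two decimals, the results can differ from the report in the last digit.

```python
# Per-class (F1, support) pairs copied from the rule-based report above.
report = {
    "angry": (0.08, 14),
    "happy": (0.47, 228),
    "no-emotion": (0.15, 357),
    "sad": (0.00, 7),
    "surprise": (0.05, 7),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

The macro value reproduces the reported 0.15, and the weighted value lands within rounding distance of the reported 0.27. The gap between the two shows how much the large `happy` and `no-emotion` classes dominate the weighted figure.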



## Task 3. Machine learning w/ grammar augmented features [10 points]

Now that we have a naive baseline, let's build a more sophisticated solution using machine learning. Up to this point, we have only considered the words in the tweet as our primary features. The rules-based approach is a very simple bag-of-words classifier. Can we improve performance if we provide some additional linguistic knowledge?

For Task 3 you will do the following:
- Generate part-of-speech features for our tweets
- Train two different machine learning classifiers, one with linguistic features and one without
- Evaluate the trained models on the test set

### Task 3a. Grammar Augmented Feature Generation [3 points]
For this task, we will be generating part-of-speech tags for each token in our tweet. Additionally, we'll lemmatize the text. We will directly include the POS information by appending the tag to the lemma of the word itself. For example:
```
Raw Tweet: I am very angry with the increased prices.
POS Augmented Tweet: I-PRP be-VBP very-RB angry-JJ with-IN the-DT increase-VBN price-NNS .-.
```

Complete the `generate_pos_features` function using the spaCy library. Once you have an implementation that works, we'll update the `train` and `test` dataframes with a new column called `tweet_with_pos` which contains the output of the `generate_pos_features` function.

In [306]:
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation suc

In [307]:
import spacy 
from tqdm.notebook import tqdm
nlp = spacy.load("en_core_web_sm")

def generate_pos_features(tweet: str) -> str:
    
    """
    Given a tweet, return the lemmatized tweet augmented
    with POS tags.
    E.g.:
    Input: "cats are super cool."
    output: "cat-NNS be-VBP super-RB cool-JJ .-."

    """
    #Tokenize the tweet and tag the appropriate parts of speech.
    doc = nlp(tweet)
    
    #Create an empty string to store the pos_tweet
    pos_tweet = ""
    
    # Since we have tokenized the tweet using nlp, iterate over each token in the doc
    
    for token in doc:
        
        #Extract the token's lemma (lemma_) and part-of-speech tag(tag_)
        pos_tag = token.tag_
        lemma = token.lemma_
        
        #To the POS-augmented tweet, add the lemma and POS tag.
        #Adding a space between the lemma and POS
        pos_tweet += f"{lemma}-{pos_tag} "
    
    #Return the POS-augmented tweet after removing the trailing whitespace using strip()
    return pos_tweet.strip()

sample_tweet = "I hate action movies"
generate_pos_features(sample_tweet)

'I-PRP hate-VBP action-NN movie-NNS'

In [308]:
# Once you have the code working above run this cell.
train["tweet_with_pos"] = train["cleaned_tweet"].apply(generate_pos_features)
test["tweet_with_pos"] = test["cleaned_tweet"].apply(generate_pos_features)


### Task 3a Tests
Run the cell below to evaluate your code. To get full credit for this task, your code must pass all tests. Any alteration of the testing code will automatically result in 0 points. 

In [309]:
sample_texts = [
    ("i am super angry", "I-PRP be-VBP super-RB angry-JJ"),
    ("That movie was great", "that-DT movie-NN be-VBD great-JJ"),
    ("I hate action movies", "I-PRP hate-VBP action-NN movie-NNS")
]
for i, text in enumerate(sample_texts):
  print(f"Test {i+1}: {generate_pos_features(text[0]) == text[1]}")

Test 1: True
Test 2: True
Test 3: True


### Task 3b. Model Training [5 points]
Next we will train two separate Random Forest classifier models. For this task you will generate two sets of input features using the `TfidfVectorizer`: one from TF-IDF statistics on the `cleaned_tweet` column and one from the `tweet_with_pos` column. 

Once you've generated your features, train two different Random Forest classifiers with the generated features and generate the predictions on the test set for each classifier.

In [310]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier as rf

# Code referred from NLP Assignment 2 from Semester-1
# Generate TF-IDF features for cleaned_tweet

tfidf_vectorizer_for_cleaned_tweet = TfidfVectorizer()
train_tfidf_clean = tfidf_vectorizer_for_cleaned_tweet.fit_transform(train['cleaned_tweet'])
test_tfidf_clean = tfidf_vectorizer_for_cleaned_tweet.transform(test['cleaned_tweet'])

# Generate TF-IDF features for tweet_with_pos
tfidf_vectorizer_for_pos = TfidfVectorizer()
train_tfidf_pos = tfidf_vectorizer_for_pos.fit_transform(train['tweet_with_pos'])
test_tfidf_pos = tfidf_vectorizer_for_pos.transform(test['tweet_with_pos'])

# Train a random forest classifier on cleaned_tweet
rf_clean = rf()
rf_clean.fit(train_tfidf_clean, train['label'])
clean_pred = rf_clean.predict(test_tfidf_clean)

# Train a random forest classifier on tweet_with_pos
rf_pos = rf()
rf_pos.fit(train_tfidf_pos, train['label'])
pos_pred = rf_pos.predict(test_tfidf_pos)
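
How `TfidfVectorizer` tokenizes the POS-augmented text affects what the second model actually sees. With its default settings the vectorizer lowercases the input and keeps runs of two or more word characters, so a hyphenated token such as `angry-JJ` is split into the lemma and the tag as separate features. The sketch below reproduces that default pattern with the standard `re` module; the whitespace pattern is only a suggestion to experiment with (it could be passed as `token_pattern=r"\S+"`), not something the assignment prescribes.

```python
import re

# Default `token_pattern` of scikit-learn's TfidfVectorizer: runs of 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative to experiment with (an assumption): each whitespace-separated chunk is one token.
WHITESPACE_PATTERN = r"\S+"

pos_tweet = "I-PRP be-VBP very-RB angry-JJ".lower()  # the vectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, pos_tweet)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, pos_tweet)

print(default_tokens)
print(whitespace_tokens)
```

Under the default pattern the single-character lemma `i` is dropped and the tags become stand-alone features, whereas the whitespace pattern keeps each lemma-tag pair together.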

### Task 3c. [2 points]
Generate classification reports for both models. Print the reports below. In a few sentences (no more than 100 words), explain which features were the most effective and why you think that's the case.

In [311]:
from sklearn.metrics import classification_report

# Classification Report for Tfidf features
print('\033[1m'+ '\033[4m' + "Classification report for TFIDF features\n" + '\033[0m')
print(classification_report(test['label'], clean_pred))

# Classification Report for POS features 
print('\033[1m'+ '\033[4m' + "Classification report for TFIDF w/ POS features\n" + '\033[0m')
print(classification_report(test['label'], pos_pred))

[1m[4mClassification report for TFIDF features
[0m
              precision    recall  f1-score   support

       angry       1.00      0.07      0.13        14
       happy       0.84      0.68      0.75       228
  no-emotion       0.78      0.94      0.85       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.80       613
   macro avg       0.52      0.34      0.35       613
weighted avg       0.79      0.80      0.78       613

[1m[4mClassification report for TFIDF w/ POS features
[0m
              precision    recall  f1-score   support

       angry       0.50      0.07      0.12        14
       happy       0.77      0.68      0.72       228
  no-emotion       0.78      0.90      0.83       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.78       613
   macro avg    



### Your evaluation here.
Based on the classification reports, plain TF-IDF features performed at least as well as TF-IDF w/ POS features on every class: accuracy was 0.80 versus 0.78, happy F1 0.75 versus 0.72, and no-emotion F1 0.85 versus 0.83, while both models scored 0.00 on sad and surprise. A likely reason is that the POS tags added little information that helps distinguish emotions, so the extra features mostly added noise. The TF-IDF word weights alone appear to capture the content that separates the classes.
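To put the accuracy figures in context, the sketch below computes the accuracy of a trivial baseline that always predicts the most frequent class. The support counts are copied from the reports above; the variable names are illustrative and not part of the assignment code.

```python
# Test-set support per class, copied from the classification reports above
support = {"angry": 14, "happy": 228, "no-emotion": 357, "sad": 7, "surprise": 7}

total = sum(support.values())

# A trivial baseline that always predicts the most frequent class
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

print(majority_label)               # no-emotion
print(f"{baseline_accuracy:.2f}")   # 0.58
```

Both random forest models clear this baseline, but the strong class imbalance means accuracy alone says little about the minority classes.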

## Task 4. Transfer Learning with DistilBERT [10 points]

For this task you will finetune a pretrained language model (DistilBERT) using the huggingface `transformers` library. For this task you will need to:
- Encode the tweets using the BERT tokenizer
- Create PyTorch datasets for the train, val and test datasets
- Finetune the distilbert model for 5 epochs
- Extract predictions from the model's output logits and convert them into the emotion labels.
- Generate a classification report on the predictions.

Ensure you are running the notebook in Google Colab with the gpu runtime enabled for this section.
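The prediction step in the list above (turning output logits into class ids) can be sketched in plain Python. This is a minimal illustration with made-up logits, not the notebook's implementation; `softmax` and `argmax` are helper names introduced here.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    # Index of the largest value in the row
    return max(range(len(row)), key=lambda i: row[i])

# Made-up logits for two tweets over five classes (illustration only)
logits = [
    [2.1, -1.0, 0.3, -0.5, -1.2],
    [0.2, -0.8, 1.9, -0.3, -1.1],
]

# Softmax is monotonic, so the argmax of the logits equals the argmax of the probabilities
pred_ids = [argmax(row) for row in logits]
print(pred_ids)  # [0, 2]
```

With the label encoding used in this notebook, id 0 corresponds to `happy` and id 2 to `no-emotion`.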

**Importing all the required libraries at once**

In [312]:
# Referred lab notes
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import Trainer
from transformers import TrainingArguments
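import locale

# Colab workaround: force the preferred encoding to UTF-8 so that shell
# commands such as !pip keep working after the imports above
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

# Discard pip's output (redirecting to NULL would create a file named NULL)
!pip install transformers > /dev/null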

**Label encoding**

In [313]:
def replace_labels(df):
    df['label'] = df['label'].replace({'happy':0,'sad':1,'no-emotion':2,'angry':3,'surprise':4})
    
replace_labels(train)
replace_labels(val)
replace_labels(test)
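Because the labels are now integers, predictions have to be mapped back to emotion names before they are reported. A minimal sketch, assuming the same mapping as in `replace_labels`; `LABEL2ID`, `ID2LABEL` and `decode_labels` are hypothetical names introduced here:

```python
# Same mapping as in replace_labels above
LABEL2ID = {'happy': 0, 'sad': 1, 'no-emotion': 2, 'angry': 3, 'surprise': 4}

# Invert the mapping so integer ids can be turned back into emotion names
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

def decode_labels(ids):
    # int() also accepts NumPy integer types, e.g. values from np.argmax
    return [ID2LABEL[int(i)] for i in ids]

print(decode_labels([0, 2, 3]))  # ['happy', 'no-emotion', 'angry']
```

Passing decoded labels for both the true and the predicted values to `classification_report` would print emotion names instead of the ids 0 to 4.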

**`AutoTokenizer.from_pretrained` automatically loads the tokenizer and vocabulary associated with the given transformer model.**


In [314]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapsh

In [315]:
# Define Custom Class for DistilBert Inputs
class SentimentAnalysisDataset(Dataset):
    
    def __init__(self, encodings: dict):  
        self.encodings = encodings
  
    def __len__(self) -> int:
        return len(self.encodings["input_ids"])
    
    def __getitem__(self, idx: int) -> dict:
        e = {k: v[idx] for k,v in self.encodings.items()}
        return e 
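To illustrate what `__getitem__` returns, the sketch below applies the same dict comprehension to a plain dictionary of lists. The token ids are made up for illustration and are not real tokenizer output; `get_item` is a stand-in for the method above.

```python
# Made-up encodings for two short tweets (illustration only, not real tokenizer output)
encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
    "label": [2, 0],
}

def get_item(encodings, idx):
    # Same dict comprehension as SentimentAnalysisDataset.__getitem__:
    # pick the idx-th entry of every field
    return {k: v[idx] for k, v in encodings.items()}

item = get_item(encodings, 1)
print(item)
# {'input_ids': [101, 9, 102, 0], 'attention_mask': [1, 1, 1, 0], 'label': 0}
```

Each item is therefore one example with all of its fields, which is the format the `Trainer` collates into batches.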

**Preprocessing Steps**

In [316]:
# create_dataset takes a DataFrame and returns a PyTorch dataset.
# The train, val, and test datasets are created by calling it on the corresponding DataFrames.
def create_dataset(df):
    # Encode inputs
    encodings = tokenizer(
        df["cleaned_tweet"].tolist(), 
        padding=True,           # pad to the longest sequence in the batch
        max_length=48,          # BERT max is 512, we choose 48 due to compute limitations
        return_tensors="pt",    # Return format pytorch tensor
        truncation=True
    )

    # Add labels to inputs
    labels = torch.tensor(df["label"].tolist())
    encodings["label"] = labels

    dataset = SentimentAnalysisDataset(encodings)

    return dataset

# Train dataset
train_dataset = create_dataset(train)

# Validation dataset
val_dataset = create_dataset(val)

# Test dataset
test_dataset = create_dataset(test)
test_new = test["label"].tolist()


**Load Pretrained Model**

In [317]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=5)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type='cosine',
    per_device_train_batch_size = 112,
    per_device_eval_batch_size = 112, 
    fp16=True,
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file pytorch_model.bin

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

In [319]:
import numpy as np

trainer.train()
preds = trainer.predict(test_dataset)
# preds.predictions holds the logits with shape (num_examples, num_labels);
# the argmax along each row gives the predicted label id
pred_new = np.argmax(preds.predictions, axis=1)

***** Running training *****
  Num examples = 1837
  Num Epochs = 5
  Instantaneous batch size per device = 112
  Total train batch size (w. parallel, distributed & accumulation) = 112
  Gradient Accumulation steps = 1
  Total optimization steps = 85
  Number of trainable parameters = 66957317


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.627504 |
| 2 | No log | 0.50593 |
| 3 | No log | 0.492759 |
| 4 | No log | 0.485753 |
| 5 | No log | 0.483202 |


***** Running Evaluation *****
  Num examples = 612
  Batch size = 112
Saving model checkpoint to ./results/checkpoint-17
Configuration saved in ./results/checkpoint-17/config.json
Model weights saved in ./results/checkpoint-17/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 112
Saving model checkpoint to ./results/checkpoint-34
Configuration saved in ./results/checkpoint-34/config.json
Model weights saved in ./results/checkpoint-34/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 112
Saving model checkpoint to ./results/checkpoint-51
Configuration saved in ./results/checkpoint-51/config.json
Model weights saved in ./results/checkpoint-51/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 112
Saving model checkpoint to ./results/checkpoint-68
Configuration saved in ./results/checkpoint-68/config.json
Model weights saved in ./results/checkpoint-68/pytorch_model.bin
***** Running Ev
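The step counts in the training log above follow from the dataset size and batch size. A small sketch of the arithmetic, assuming a single device and no gradient accumulation (as the log reports); the variable names are illustrative:

```python
import math

num_examples = 1837   # "Num examples" in the training log
batch_size = 112      # per_device_train_batch_size
num_epochs = 5

# The last, partially filled batch still counts as one optimization step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# With save_strategy="epoch", a checkpoint is written at the end of every epoch
checkpoints = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch)  # 17
print(total_steps)      # 85
print(checkpoints)      # [17, 34, 51, 68, 85]
```

This matches the `checkpoint-17` to `checkpoint-68` directories in the log. The `No log` entries in the Training Loss column are consistent with this as well: assuming the `Trainer`'s default logging interval of 500 steps was left unchanged, the run ends before the first training loss is logged.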

**Classification report for the DistilBERT predictions**

In [320]:
from sklearn.metrics import classification_report
print(classification_report(test_new, pred_new))

              precision    recall  f1-score   support

           0       0.85      0.86      0.85       228
           1       0.00      0.00      0.00         7
           2       0.87      0.92      0.89       357
           3       0.00      0.00      0.00        14
           4       0.00      0.00      0.00         7

    accuracy                           0.86       613
   macro avg       0.34      0.36      0.35       613
weighted avg       0.82      0.86      0.84       613



  _warn_prf(average, modifier, msg_start, len(result))
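The gap between the macro and weighted averages in the report above comes from how each class is weighted. A minimal sketch that recomputes both averages for precision from the per-class values printed in the report (the inputs are rounded to two decimals, so other columns may differ by 0.01); the function names are illustrative:

```python
# Per-class precision and support copied from the DistilBERT report above
precision = [0.85, 0.00, 0.87, 0.00, 0.00]
support = [228, 7, 357, 14, 7]

def macro_average(scores):
    # Every class counts equally, regardless of how many examples it has
    return sum(scores) / len(scores)

def weighted_average(scores, support):
    # Every class counts in proportion to its number of test examples
    return sum(s * n for s, n in zip(scores, support)) / sum(support)

macro_p = macro_average(precision)
weighted_p = weighted_average(precision, support)

print(f"macro precision:    {macro_p:.2f}")     # 0.34
print(f"weighted precision: {weighted_p:.2f}")  # 0.82
```

The three classes with a score of 0.00 pull the macro average down to 0.34, while the weighted average stays high because those classes make up only 28 of the 613 test tweets.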


## Task 5. Model Recommendation [5 points]
In a paragraph (no more than 250 words) answer the following questions:
1. Which of the implemented models would you recommend and why? 
2. Compare the metrics for each model implemented (Rules-Based, Machine Learning w/ POS features, and DistilBERT). What are the pros and cons of each model (consider both macro performance and label-specific metrics, as well as the computational requirements)?

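According to the classification reports, the DistilBERT model achieves an accuracy of 86%, which is better than the other two models, and its weighted F1-score (0.84) is higher than that of the TF-IDF random forest (0.78). Its macro-averaged F1-score (0.35) is no better than the TF-IDF random forest, however, because it scores 0.00 on the sad, angry, and surprise classes. I therefore recommend the DistilBERT model for this classification problem, with the caveat that the minority classes remain poorly handled.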

The advantages and disadvantages of each model are contrasted below:

**1. Rules-Based Model**

Positives: 

*   Simple to comprehend and interpret. 
*   Little computational resources are needed.

Cons:

*  Results might not generalize well to new data.


**2. Machine Learning w/ POS features**

Positives: 

*   Captures more intricate data patterns than rules-based models.
*   POS features can enhance performance in principle, although here they did not (accuracy 0.78 versus 0.80 for plain TF-IDF).

Cons:

*   If the model is overfit or the feature engineering is poor, it may still not generalize effectively to new data.


**3. DistilBERT**

Positives: 

*   Captures complex relationships and patterns in the data without manual feature engineering.

*   Able to generalize well to novel data and circumstances.

Cons:

*   Training takes a lot of time and processing power.
*   It could be challenging to comprehend how the model functions and makes decisions.









