# BERT Text Classification Competition

The goal of this project: given a Tweet and its conversation context (from Twitter), predict whether the writer wrote the message sarcastically. The dataset can be found at https://github.com/CS410Fall2020/ClassificationCompetition. The data has the following columns, taken directly from the README at that link:
- response: the Tweet to be classified
- context: the conversation context of the response
- label: SARCASM or NOT_SARCASM (binary label, not present in the test data)
- id: String identifier for sample. This id will be required when making submissions. (ONLY in test data)

The train dataset has 5000 rows while the test data has 1800.

### What have I tried?

This file only shows my results using BERT. I tried other methods too, but things quickly got out of hand, so I won't attach those results here. Before using BERT, I tried linear SVM models with TF-IDF vectorization, as well as logistic regression and random forest. For the linear models, I used LASSO to select which features/words to use, hoping to get something interpretable. None of these models surpassed the baseline; they reached F1 scores of around 0.67. I then moved on to other methods such as CNNSVM, CNNLSTM, LSTMSVM, CNNLSTMSVM and a bunch of similar combinations using embeddings based on GloVe's Twitter collection. I also tried using tokenization for preprocessing. For the features, I tried different combinations of splitting up responses and contexts and inserting them into different parts of the prediction pipeline (for example, using a CNN to generate features for the context data, using TF-IDF with LASSO for the response data, and then merging the results as the input features to an SVM). I even tried weighted averages of predictions made separately on context and response. The best I could come up with was an F1 score of around 0.70.

### What worked?

Nothing seemed to work, so I started looking around at other methods. I came across BERT, a pre-trained, transformer-based language model developed by Google (https://arxiv.org/abs/1810.04805). It is built on the Transformer neural network architecture, which the linked paper describes in detail. BERT was trained on a very large corpus of unlabelled text, including Wikipedia. For the purposes of this project, I simply had to take this pretrained model and fine-tune it for my task of sarcasm prediction (a binary classification problem). I did not have to experiment with different hyperparameters, as the results of the first run already beat the baseline by a large margin. You can verify the results on the LiveDataLab competition webpage by looking for the name Zinkoy.

## Installation

We are using Python 3.6.9. This probably works as long as you are using some flavor of Python 3. Clone the repository; your directory should contain classification_competition.ipynb, requirements.txt and two folders called data and model. Run the following commands in your terminal. I am using Windows Subsystem for Linux.
#### Creating and setting up a virtual environment with Jupyter Notebook
```
$ python3.6 -m venv env
$ source env/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
$ python -m ipykernel install --user --name=<whatever_name_you_want_without_the_brackets>
$ jupyter notebook
```
At this point, copy and paste whatever url you get from running `jupyter notebook` into your favorite browser. Click on classification_competition.ipynb. Now click on the Kernel tab and change the kernel to whatever you named it to be. Follow the instructions in "Usage Documentation".
When you are done testing, you can delete all of the files and then remove the Jupyter kernel by running 

`jupyter kernelspec uninstall <whatever_name_you_want_without_the_brackets>`.

## Documentation
There will be a file called documentation.html, which is just this notebook, but in an easily accessible format. I would rather keep all of the documentation in one spot. To actually run the code, make sure you are looking at classification_competition.ipynb.
### Overview of functions and implementation documentation
These are both combined in this file. Before each major step, there will be an explanation of each function and its implementation.

### Usage Documentation

Make sure that the directory containing this Jupyter Notebook has a folder called data containing train.jsonl and test.jsonl, obtained from the above GitHub link. If you want the full experience, simply run all of the cells. It is normal for the model training process to take many hours and render your computer unusable for that period of time. If you just want to load the model and output a submission, run all of the cells up until the section "Training BERT". Then run the section "Preparing our test data". After that, run everything in the "Express testing from a pretrained model" section to generate the answer.txt file.

### Contribution of team members breakdown
I am the team.

## Project

### Importing packages

We will be using numpy and pandas to work with data. NLTK will be used for preprocessing. sklearn will help us with creating a train and validation split. Keras will be used alongside TensorFlow for defining things like the loss. The main star is the transformers library, which contains the BERT model itself.

In [1]:
import numpy as np
import pandas as pd
import nltk

import tensorflow as tf
import tensorflow_addons as tfa

import keras

from sklearn.model_selection import train_test_split

from nltk.corpus import stopwords

from transformers import BertTokenizer, TFBertForSequenceClassification

### Preprocessing

The purpose of this section is to define functions that prepare our data for use with BERT.

The preprocess function takes a Pandas dataframe as its argument and outputs a new Pandas dataframe based on the original. The function works on a copy of our original dataframe. We then load a list of stopwords in the English language. A few stopwords, listed in excluded_stop_words, are kept rather than removed; they are chosen based on what I personally think is important for understanding a sentence, i.e. words like no/not can completely change the meaning of a sentence. For every response inside of our dataframe we:
1. Replace punctuation and other non-alphanumeric symbols with an empty string. We keep the symbols -, ', and #.
2. Split up the sentence by whitespace into individual words.
3. Remove all stopwords other than those in the excluded_stop_words list. We also remove the words "user" and "url" since they appear everywhere and seem to sometimes cause overfitting.
4. Convert each word to lowercase.
5. Join the words back together with single spaces.

Context is slightly different since it contains an array of Tweets. We simply join together every tweet by a space and then repeat what we did for response.

We name these processed columns response_proc and context_proc respectively and return the new dataframe containing these columns.
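The five steps above can be sketched on a single string with only the standard library. This is a minimal illustration, not the notebook's implementation: the tiny `STOP_WORDS` set and the helper name `clean_text` are assumptions made for the example, whereas the notebook uses NLTK's full English stopword list and pandas string methods.

```python
import re

# Hypothetical, tiny stopword list for illustration only; the notebook uses
# NLTK's full English list via stopwords.words("english").
STOP_WORDS = {"i", "the", "a", "is", "this", "to", "and", "not", "no"}
KEPT_STOP_WORDS = {"not", "no"}   # stopwords we deliberately keep
PLACEHOLDERS = {"user", "url"}    # anonymisation tokens to drop


def clean_text(text):
    # Step 1: drop everything except word characters, whitespace, -, ' and #.
    text = re.sub(r"[^\w\s\-'#]", "", text)
    kept = []
    # Step 2: split on whitespace.
    for word in text.split():
        # Step 4: lowercase (done before filtering so comparisons are case-insensitive).
        word = word.lower()
        # Step 3: keep whitelisted stopwords; drop other stopwords and placeholders.
        if word in KEPT_STOP_WORDS or (word not in STOP_WORDS and word not in PLACEHOLDERS):
            kept.append(word)
    # Step 5: join the surviving words with single spaces.
    return " ".join(kept)


print(clean_text("@USER I don't get this .. #sarcasm is NOT the best!"))
# prints: don't get #sarcasm not best
```

Note how "NOT" survives because it is in the kept list, while "@USER" is reduced to "user" by step 1 and then dropped.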

In [2]:
'''
Input: A pandas dataframe df. This is assumed to be the dataframe used in the sarcasm detection classification competition.
The expected columns are "response" and "context"
Output: A new pandas dataframe containing columns "response_proc" and "context_proc".
'''
def preprocess(df):
    df = df.copy()
    stop_words = stopwords.words("english")
    excluded_stop_words = set(["not", "most","no", "off", "down", "any", "over", "under", "few", "against", "above", "below"])
    
    # regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default.
    df["response_proc"] = df["response"].str.replace(r"[^\w\s\-'#]", "", regex=True).apply(
            lambda doc : doc.split()).apply(
            lambda doc : " ".join([word.lower() for word in doc if (word.lower() in excluded_stop_words or (word.lower() not in stop_words and word.lower() not in ('user', 'url')))]))
    
    # The context is a list of Tweets, so join it into one string before cleaning.
    df["context_proc"] = df["context"].str.join(" ").str.replace(r"[^\w\s\-'#]", "", regex=True).apply(
        lambda doc : doc.split()).apply(
        lambda doc : " ".join([word.lower() for word in doc if (word.lower() in excluded_stop_words or (word.lower() not in stop_words and word.lower() not in ('user', 'url')))]))
    
    return df

The add_label function creates a deep copy of our original dataframe. We do this deep copying because for some reason, our original dataframe keeps getting modified in ways I do not like while testing. Deep copying prevents that issue. We create a column "y" containing values $\{0, 1\}$. We map the label "SARCASM" to 1 and "NOT_SARCASM" to 0. BERT and a bunch of other machine learning algorithms need labels in this form. We then return the new dataframe.

In [3]:
'''
Input: A pandas dataframe df that has labels "SARCASM" and "NOT_SARCASM".
Output: Returns a new dataframe containing the column "y" of mapped numerical label values.
'''
def add_label(df):
    df = df.copy()
    df["y"] = pd.Categorical(df["label"]).map({"SARCASM":1, "NOT_SARCASM":0}).astype(int)
    return df

The following function is very simple. It must be called after the preprocess(df) function. We take in a dataframe, create a copy, and expect to see the columns "response_proc" and "context_proc" (created by preprocess(df)). We then join them with a single space into a new column "combined_proc". An important detail is that we append our processed context to the end of our processed response. Because text that exceeds the token limit is cut off at the end, this heuristically keeps the response from being cut off later on. We can set the number of tokens we want BERT to be able to encode, but can't set that number too high due to computational constraints. The design choice here may seem weird (why not keep these lines in preprocess(df)?), but it is useful for experimenting with different combinations of text for different models.
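To make the ordering heuristic concrete, here is a toy sketch of what happens when the combined text is longer than the token budget. It assumes whitespace words stand in for tokens (BERT actually truncates WordPiece tokens), and `combine_and_truncate` is a hypothetical helper, not a function from the notebook.

```python
def combine_and_truncate(response, context, max_tokens):
    # Response first, context second, mirroring the column order in combine_context_response.
    combined = response.split() + context.split()
    # Toy truncation: keep only the first max_tokens whitespace-separated words.
    return combined[:max_tokens]


response = "sure great idea"
context = "we should cancel the meeting and go home early today"

kept = combine_and_truncate(response, context, max_tokens=6)
print(kept)
# prints: ['sure', 'great', 'idea', 'we', 'should', 'cancel']
```

The whole response survives and only the tail of the context is lost. With the opposite order, a long context could push the response out of the window entirely.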

In [4]:
'''
Input: A pandas dataframe df that has successfully gone through the function "preprocess(df)"
Output: A new dataframe containing the combination of processed response and context "combined_proc"
'''
def combine_context_response(df):
    df = df.copy()
    df['combined_proc'] = df[['response_proc', 'context_proc']].apply(lambda text: ' '.join(text), axis=1)
    return df

### Load in the dataset

We download the data from https://github.com/CS410Fall2020/ClassificationCompetition/tree/main/data and load in the training dataset here.

In [5]:
train = pd.read_json("data/train.jsonl", lines=True)
train.head(5)

| | label | response | context |
|---|---|---|---|
| 0 | SARCASM | @USER @USER @USER I don't get this .. obviousl... | [A minor child deserves privacy and should be ... |
| 1 | SARCASM | @USER @USER trying to protest about . Talking ... | [@USER @USER Why is he a loser ? He's just a P... |
| 2 | SARCASM | @USER @USER @USER He makes an insane about of ... | [Donald J . Trump is guilty as charged . The e... |
| 3 | SARCASM | @USER @USER Meanwhile Trump won't even release... | [Jamie Raskin tanked Doug Collins . Collins lo... |
| 4 | SARCASM | @USER @USER Pretty Sure the Anti-Lincoln Crowd... | [Man ... y ’ all gone “ both sides ” the apoca... |


From the same link above, we download and load in the test data.

In [6]:
test = pd.read_json("data/test.jsonl", lines=True)

Check out the distribution of labels in our train set. It is balanced.

In [7]:
train.groupby(["label"]).size()

label
NOT_SARCASM    2500
SARCASM        2500
dtype: int64

No missing values.

In [8]:
train.isnull().values.any()

False

Prepare our train data for BERT using the functions we defined earlier. We extract the "y" column and make sure that it is a numpy array of the correct shape for BERT.

In [9]:
train_prepped = add_label(combine_context_response(preprocess(train)))
train_y = train_prepped["y"].to_numpy().reshape(-1)

In [10]:
train_prepped.head(5)

| | label | response | context | response_proc | context_proc | combined_proc | y |
|---|---|---|---|---|---|---|---|
| 0 | SARCASM | @USER @USER @USER I don't get this .. obviousl... | [A minor child deserves privacy and should be ... | get obviously care would've moved right along ... | minor child deserves privacy kept politics pam... | get obviously care would've moved right along ... | 1 |
| 1 | SARCASM | @USER @USER trying to protest about . Talking ... | [@USER @USER Why is he a loser ? He's just a P... | trying protest talking labels label wtf make em | loser he's press secretary make excuses crowd ... | trying protest talking labels label wtf make e... | 1 |
| 2 | SARCASM | @USER @USER @USER He makes an insane about of ... | [Donald J . Trump is guilty as charged . The e... | makes insane money movies einstein #learnhowth... | donald j trump guilty charged evidence clear s... | makes insane money movies einstein #learnhowth... | 1 |
| 3 | SARCASM | @USER @USER Meanwhile Trump won't even release... | [Jamie Raskin tanked Doug Collins . Collins lo... | meanwhile trump even release sat scores wharto... | jamie raskin tanked doug collins collins looks... | meanwhile trump even release sat scores wharto... | 1 |
| 4 | SARCASM | @USER @USER Pretty Sure the Anti-Lincoln Crowd... | [Man ... y ’ all gone “ both sides ” the apoca... | pretty sure anti-lincoln crowd claimed democra... | man gone sides apocalypse one day already obam... | pretty sure anti-lincoln crowd claimed democra... | 1 |


### Prepare data for submission

This section might seem to have come early, but I simply took this from another file that I was working on, so I might as well place it here as it will be useful for testing later on.

create_submission takes in our original test dataset and a prediction value for each row. Each row in the test dataset corresponds directly to the prediction value at the same position (i.e. we don't have to worry about ordering when creating a submission dataframe). We create a new dataframe from our predictions and concatenate it with the test["id"] column. We name the columns "id" and "pred" for internal record keeping. We then create a new column "label" by mapping prediction values of 1 to "SARCASM" and 0 to "NOT_SARCASM". Finally, we drop the "pred" column since we don't need it for our submission and return this new dataframe.

In [11]:
'''
Input: The original dataframe test and numpy array of (0,1) predictions preds
Output: A new dataframe containing the test ID and corresponding prediction in terms of "SARCASM" and "NOT_SARCASM"
'''
def create_submission(test, preds):
    mapping = {1: "SARCASM", 0: "NOT_SARCASM"}
    df_preds = pd.DataFrame(preds)
    sub_df = pd.concat([test["id"], df_preds], axis=1)
    sub_df.columns = ["id", "pred"]
    sub_df["label"] = sub_df["pred"].map(mapping)
    sub_df = sub_df.drop(["pred"], axis=1)
    return sub_df

output_submission takes in the dataframe created by create_submission. It simply writes the results to a .txt file with neither index nor header, i.e. a file with just test ids and the corresponding predictions, delimited by a comma, with each pair on its own line.
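For reference, the resulting file layout can be sketched without pandas. The helper `format_submission` below is hypothetical and writes to an in-memory buffer instead of answer.txt; it only illustrates the "id,label" line format described above.

```python
import csv
import io

LABELS = {1: "SARCASM", 0: "NOT_SARCASM"}


def format_submission(ids, preds):
    # Build the submission text in memory: one "id,label" pair per line,
    # with no header row and no index column.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for sample_id, pred in zip(ids, preds):
        writer.writerow([sample_id, LABELS[int(pred)]])
    return buffer.getvalue()


text = format_submission(["twitter_1", "twitter_2"], [1, 0])
print(text)
# prints:
# twitter_1,SARCASM
# twitter_2,NOT_SARCASM
```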

In [12]:
'''
Input: A dataframe created from create_submission
Output: Returns nothing, but creates a file answer.txt of test ids and predicted labels
'''
def output_submission(sub_df):
    sub_df.to_csv('answer.txt', index=None, header=None)

### Preparing for BERT

Disclaimer: I used https://swatimeena989.medium.com/bert-text-classification-using-keras-903671e0207d to help me learn how to setup BERT.

We first define the number of classes we need. In order to use BinaryCrossentropy as BERT's loss function later on, we set the number of classes to 1; setting it to 2 apparently doesn't play well with that loss. We then load the standard-sized pretrained BERT tokenizer, which works on lowercased text.

In [13]:
n_classes = 1
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

We create a list by running the bert_tokenizer on every document/tweet block in our combined processed training dataset. We allow for the tokenizer to do its magic by adding special tokens, set the max_length for number of tokens to 256, and then pad each document to fit the max_length for consistency. Ignore the warning below. Each tokenized object will contain "input_ids" and "attention_mask" in a dictionary format.
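As a rough picture of what the tokenizer returns, the sketch below builds "input_ids" and "attention_mask" by hand. It is a toy under stated assumptions: the vocabulary, the special-token ids, and the `encode` helper are made up for illustration, and real BERT ids come from its WordPiece vocabulary.

```python
# Toy ids for the special tokens and a three-word vocabulary (all hypothetical).
PAD, CLS, SEP, UNK = 0, 1, 2, 3
VOCAB = {"get": 4, "obviously": 5, "care": 6}


def encode(text, max_length):
    # Prepend [CLS] and map each word to an id (unknown words map to UNK).
    ids = [CLS] + [VOCAB.get(word, UNK) for word in text.split()]
    # Truncate so that [SEP] still fits, then append it.
    ids = ids[: max_length - 1] + [SEP]
    # Real tokens get mask 1, padding gets mask 0.
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD] * padding,
        "attention_mask": mask + [0] * padding,
    }


enc = encode("get obviously care", max_length=8)
print(enc["input_ids"])       # prints: [1, 4, 5, 6, 2, 0, 0, 0]
print(enc["attention_mask"])  # prints: [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padding positions, which is why both arrays are passed to the model together.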

In [14]:
combined_data = [bert_tokenizer.encode_plus(doc, add_special_tokens = True, max_length=256, pad_to_max_length = True) for doc in train_prepped["combined_proc"]]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


We take a look at the first element's "input_ids" to see exactly what happened. The tokenizer added the special token [CLS] at the start and [SEP] at the end of the Tweet block, and then filled the rest of the sequence with [PAD] tokens.

In [15]:
bert_tokenizer.decode(combined_data[0]['input_ids'])

"[CLS] get obviously care would've moved right along instead decided care troll minor child deserves privacy kept politics pamela karlan ashamed angry obviously biased public pandering using child child named barron # bebest melania care less fact [SEP] [PAD] [PAD] [PAD] ... (output truncated; the padding continues up to max_length)

In order to use BERT, we must extract the "input_ids" and "attention_mask" entries from the list we just created with bert_tokenizer. We then convert both lists into NumPy arrays.

In [16]:
input_ids = np.array([datapoint["input_ids"] for datapoint in combined_data])
masks = np.array([datapoint["attention_mask"] for datapoint in combined_data])

We split our dataset into train and validation components. 20% of the data will be used for validation, i.e. 4000 train and 1000 validation.
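The idea behind the split can be sketched with the standard library alone. This is an illustration, not what train_test_split does internally: `split_indices` is a hypothetical helper, and the fixed seed is an assumption added so the split is reproducible (the notebook's call does not fix one).

```python
import random


def split_indices(n_rows, test_size, seed):
    # Shuffle the row indices once with a fixed seed.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # The first chunk becomes validation, the rest training.
    n_val = int(round(n_rows * test_size))
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(5000, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))
# prints: 4000 1000
```

Applying the same index lists to the ids, the masks, and the labels keeps the three arrays aligned row by row, which is what passing all three to a single train_test_split call achieves.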

In [17]:
train_ids,val_ids,train_split_y,val_y,train_mask,val_mask = train_test_split(input_ids, train_y, masks,test_size=0.2)

We create the BERT classification model for uncased tokens with our defined number of classes. Once again, ignore the warning.

In [18]:
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=n_classes)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training BERT

We define our loss function to be BinaryCrossentropy since we have binary labels; from_logits=True is set because the model outputs raw logits rather than probabilities. Since we are interested in monitoring our F1 score, we take the F1Score metric from TensorFlow Addons and plug it into our BERT model. We use Adam as our optimization algorithm, with a learning rate of 2e-5 and an epsilon of 1e-06. We then compile the model with these settings.
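For intuition about the two quantities involved, here is a plain-Python sketch of binary cross-entropy computed directly from logits and of the F1 score. The helper names and the sample labels and logits are made up for illustration; the notebook itself relies on the Keras loss and the TensorFlow Addons metric.

```python
import math


def bce_from_logits(labels, logits):
    # Numerically stable binary cross-entropy on raw logits:
    # max(z, 0) - z * y + log(1 + exp(-|z|)), averaged over the samples.
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for y, z in zip(labels, logits)
    ]
    return sum(losses) / len(losses)


def f1_score(labels, preds):
    # F1 is the harmonic mean of precision and recall for the positive class.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [1, 0, 1, 0]          # hypothetical ground truth
logits = [2.0, -1.0, -0.5, 0.3]  # hypothetical model outputs
preds = [1 if z >= 0 else 0 for z in logits]

print(bce_from_logits(labels, logits))
print(f1_score(labels, preds))  # prints: 0.5
```

The loss is computed from the continuous logits, while F1 only sees the thresholded predictions, so the two can move independently of each other during training.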

In [19]:
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metric = tfa.metrics.F1Score(num_classes=n_classes)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-06)
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])

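We fit the model. This process takes 8+ hours, so I will not be running this portion again. Note that the call below fits on the full input_ids and masks arrays rather than on train_ids and train_mask, so the validation examples are also seen during training and the validation metrics are optimistic. Interestingly, our F1 score doesn't seem to improve while the loss decreases. A likely reason is that tfa.metrics.F1Score with its default threshold=None marks the largest output in each row as the positive prediction, and with a single output column that is every row, so the reported F1 does not reflect the model's actual decisions.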

In [21]:
fitted = bert_model.fit([input_ids, masks], train_y,batch_size=32,epochs=4, validation_data=([val_ids,val_mask],val_y))

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Preparing our test data

We preprocess our test data like how we processed our training data, minus the part where we add labels since they don't exist here.

In [19]:
test_prepped = combine_context_response(preprocess(test))

We then make sure it gets tokenized exactly how our train data was tokenized and then extract the "input_ids"/"attention_mask".

In [20]:
combined_test = [bert_tokenizer.encode_plus(doc, add_special_tokens = True, max_length=256, pad_to_max_length = True) for doc in test_prepped["combined_proc"]]

In [21]:
test_ids = np.array([datapoint["input_ids"] for datapoint in combined_test])
test_masks = np.array([datapoint["attention_mask"] for datapoint in combined_test])

### Prediction on test data

We first run BERT's prediction function. This function will also take a long time.

In [26]:
preds = bert_model.predict([test_ids, test_masks], batch_size=32)

The prediction output is an array of raw logits, with both negative and positive values. Since a logit of 0 corresponds to a sigmoid probability of 0.5, we map each value greater than or equal to 0 to the label 1 ("SARCASM") and each value less than 0 to the label 0 ("NOT_SARCASM"). We make sure the result is a one-dimensional numpy array so it's easy to work with.
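To see why cutting the raw outputs at 0 is a sensible rule, the short sketch below compares it with thresholding the sigmoid probability at 0.5. The logit values are hypothetical and only there to exercise both sides of the boundary.

```python
import math


def sigmoid(z):
    # Map a raw logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


logits = [2.3, -0.7, 0.0, -4.1]  # hypothetical model outputs

# Rule used in the notebook: logit >= 0 means SARCASM (1).
by_logit = [int(z >= 0) for z in logits]
# Equivalent rule on probabilities: sigmoid(logit) >= 0.5.
by_prob = [int(sigmoid(z) >= 0.5) for z in logits]

print(by_logit)  # prints: [1, 0, 1, 0]
print(by_prob)   # prints: [1, 0, 1, 0]
```

Because the sigmoid is monotonic and sigmoid(0) is exactly 0.5, the two rules always agree, so there is no need to convert the logits to probabilities first.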

In [33]:
test_y = (np.array(preds) >= 0).reshape(-1).astype(int)

In [34]:
test_y

array([1, 1, 1, ..., 1, 0, 0])

### Outputting our submission

We create our submission output here, following the process defined in the "Prepare data for submission" section.

In [35]:
sub_df = create_submission(test, test_y)

In [36]:
sub_df.head(5)

| | id | label |
|---|---|---|
| 0 | twitter_1 | SARCASM |
| 1 | twitter_2 | SARCASM |
| 2 | twitter_3 | SARCASM |
| 3 | twitter_4 | SARCASM |
| 4 | twitter_5 | SARCASM |


Looking at the distribution of predicted labels.

In [39]:
sub_df.groupby(["label"], axis=0).size()

label
NOT_SARCASM     628
SARCASM        1172
dtype: int64

In [37]:
output_submission(sub_df)

### Saving our BERT model

This will allow us to hopefully load in the model again later on without the 8+ hours training process.

In [41]:
bert_model.save_pretrained("./model")

## Express testing from a pretrained model

### Loading in our BERT model

This repeats the model setup from the sections "Preparing for BERT" and "Training BERT", but loads the saved weights instead of training, and then runs the prediction step from "Prediction on test data". This will take quite a bit of time to run. Ignore the warnings.

In [22]:
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=n_classes)
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metric = tfa.metrics.F1Score(num_classes=n_classes)
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-06)
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])
bert_model.load_weights("./model/tf_model.h5")

preds = bert_model.predict([test_ids, test_masks], batch_size=32)
test_y = (np.array(preds) >= 0).reshape(-1).astype(int)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier', 'dropout_75']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We then output our submission in the same exact way as "Outputting our submission".

In [23]:
sub_df = create_submission(test, test_y)

In [24]:
sub_df.groupby(["label"], axis=0).size()

label
NOT_SARCASM     628
SARCASM        1172
dtype: int64

In [25]:
output_submission(sub_df)