## Introduction

In this journal we will try a new model: DistilBERT, a BERT-style transformer from Hugging Face. We will use the same edited Reddit dataset as in the previous journals, again focusing only on the comments.

Let's import our libraries:

In [1]:
# Basic Packages
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score

Then the Hugging Face imports and our dataset:

In [2]:
# huggingface/transformers imports
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import TextClassificationPipeline
from datasets import load_dataset
import torch

In [3]:
sarcasm = pd.read_csv('/Users/lokikeeler/Downloads/train-balanced-sarcasm_2.csv')

In [4]:
sarcasm.head()

Unnamed: 0,label,comment,score,ups,downs,date,created_utc,parent_comment,year,SUB_2007scape,...,SUB_television,SUB_tf2,SUB_todayilearned,SUB_trees,SUB_ukpolitics,SUB_unitedkingdom,SUB_videos,SUB_worldnews,SUB_wow,SUB_xboxone
0,0,NC and NH.,2,-1,-1,2016-10-01,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ...",2016,False,...,False,False,False,False,False,False,False,False,False,False
1,0,You do know west teams play against west teams...,-4,-1,-1,2016-11-01,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...,2016,False,...,False,False,False,False,False,False,False,False,False,False
2,0,"They were underdogs earlier today, but since G...",3,3,0,2016-09-01,2016-09-22 21:45:37,They're favored to win.,2016,False,...,False,False,False,False,False,False,False,False,False,False
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,2016-10-01,2016-10-18 21:03:47,deadass don't kill my buzz,2016,False,...,False,False,False,False,False,False,False,False,False,False
4,0,"I don't pay attention to her, but as long as s...",0,0,0,2016-09-01,2016-09-02 10:35:08,do you find ariana grande sexy ?,2016,False,...,False,False,False,False,False,False,False,False,False,False


The setup below prepares the data inputs as per the Hugging Face instructions. The code pulls out our comments and tokenizes them. To keep training quick, training comments are drawn only from the first 100 rows and test comments only from the first 500; after the random 70/30 split this gives 76 training and 153 test comments in this run.

In [5]:
# setup labels
id2label = {i:cat for i,cat in enumerate(sorted(set(sarcasm["label"])))}
label2id = {v:k for k,v in id2label.items()}

# setup data
# copy so the edits below do not trigger pandas' SettingWithCopyWarning
simplified = sarcasm[["label","comment"]].copy()
simplified.columns = ["label","text"]
simplified["label"] = [label2id[lab] for lab in simplified["label"]]
test_flag = np.random.randint(0,high=10,size=sarcasm.shape[0])
test_sel = test_flag > 6
train_sel = np.invert(test_sel)
# truncate the dataset here to allow for quick few_shot training
test_sel[500::] = False
train_sel[100::] = False

n_train = np.count_nonzero(train_sel)
n_test = np.count_nonzero(test_sel)
print("Sizes train: {} test: {}".format(n_train,n_test))

simplified.loc[train_sel,:].reset_index(drop=True).to_csv("bert_train.csv")
simplified.loc[test_sel,:].reset_index(drop=True).to_csv("bert_test.csv")
# load as hugging face dataset
dataset = load_dataset('csv', data_files={'train': 'bert_train.csv', 'test': 'bert_test.csv'})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

tokenized_data = dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Sizes train: 76 test: 153
Downloading and preparing dataset csv/default to /Users/lokikeeler/.cache/huggingface/datasets/csv/default-0c4b0e7046d8bf58/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /Users/lokikeeler/.cache/huggingface/datasets/csv/default-0c4b0e7046d8bf58/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.
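
The sizes printed above (76 and 153 rather than 100 and 500) follow from how the masks are built: the random flag is drawn first, and each mask is then cut off after a fixed number of rows. Here is a minimal sketch of the same idea in plain Python, so that it runs without NumPy. `split_masks` and its parameters are hypothetical names, and the fixed seed is an assumption added for repeatability:

```python
import random

def split_masks(n_rows, train_window=100, test_window=500, test_share=0.3, seed=0):
    """Return (train_mask, test_mask) as lists of booleans.

    Each row is first assigned to test with probability `test_share`,
    otherwise to train. Rows beyond the window are then dropped from
    each mask, mirroring `train_sel[100::] = False` in the cell above.
    """
    rng = random.Random(seed)  # fixed seed: an assumption, not in the original cell
    is_test = [rng.random() < test_share for _ in range(n_rows)]
    train_mask = [(not t) and i < train_window for i, t in enumerate(is_test)]
    test_mask = [t and i < test_window for i, t in enumerate(is_test)]
    return train_mask, test_mask

train_mask, test_mask = split_masks(n_rows=10_000)
print("train:", sum(train_mask), "test:", sum(test_mask))
```

The train count can never exceed 100 and the test count can never exceed 500, and the two masks never overlap, but the exact counts depend on the random draw.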



Perfect. Now we can continue with the Hugging Face instructions to set up the model.

In [6]:
# model and data setup

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", 
    num_labels=len(label2id), 
    id2label=id2label, 
    label2id=label2id)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next we will need sklearn metrics to evaluate the model during training.

In [7]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred,average='micro')
    precision = precision_score(y_true=labels, y_pred=pred,average='micro')
    f1 = f1_score(y_true=labels, y_pred=pred,average='micro')

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
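
One thing to keep in mind with `average='micro'`: when every comment gets exactly one label, micro-averaged precision and recall pool the counts over all labels and end up equal to accuracy. The sketch below is a hand-rolled check in plain Python (not sklearn's implementation) on made-up labels; the function names are hypothetical. It also computes a macro-averaged recall for contrast:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_precision_recall(y_true, y_pred):
    """Pool true positives, false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for lab in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == lab and t == lab:
                tp += 1
            elif p == lab:
                fp += 1
            elif t == lab:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)

def macro_recall(y_true, y_pred):
    """Average the per-label recall, giving each label equal weight."""
    recalls = []
    for lab in sorted(set(y_true)):
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        total = sum(1 for t in y_true if t == lab)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Toy example (made-up labels): nine 0s, one 1, and a model that always says 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

precision, recall = micro_precision_recall(y_true, y_pred)
print(accuracy(y_true, y_pred), precision, recall, macro_recall(y_true, y_pred))
```

On this toy input accuracy, micro precision and micro recall are all 0.9, while macro recall is 0.5 because the label 1 is never predicted. Passing `average='macro'` in `compute_metrics` would be one way to make the extra columns carry information of their own.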

Lastly, we can finish by creating our training arguments and our trainer.

In [8]:
# typical setup as per instructions/recommendations

training_args = TrainingArguments(
    output_dir="jeopardy-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    bf16=False,
    no_cuda=True
    #use_mps_device=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)
    







In [9]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.396674,0.941176,0.941176,0.941176,0.941176
2,No log,0.238141,0.941176,0.941176,0.941176,0.941176
3,No log,0.222843,0.941176,0.941176,0.941176,0.941176
4,No log,0.233396,0.941176,0.941176,0.941176,0.941176
5,No log,0.242999,0.941176,0.941176,0.941176,0.941176
6,No log,0.251842,0.941176,0.941176,0.941176,0.941176
7,No log,0.25772,0.941176,0.941176,0.941176,0.941176
8,No log,0.257871,0.941176,0.941176,0.941176,0.941176
9,No log,0.256904,0.941176,0.941176,0.941176,0.941176
10,No log,0.25692,0.941176,0.941176,0.941176,0.941176


TrainOutput(global_step=100, training_loss=0.17391277313232423, metrics={'train_runtime': 718.6677, 'train_samples_per_second': 1.058, 'train_steps_per_second': 0.139, 'total_flos': 10028196038880.0, 'train_loss': 0.17391277313232423, 'epoch': 10.0})
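
The `global_step=100` above can be checked by hand, and it most likely explains the `No log` entries in the Training Loss column: the run finishes before the first logging step. A small sketch of the arithmetic, assuming a single device, no gradient accumulation, and `logging_steps` left at the library default of 500 (it is not set in the training arguments above); `optimizer_steps` is a hypothetical helper name:

```python
import math

def optimizer_steps(n_samples, batch_size, epochs):
    """Total optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

steps = optimizer_steps(n_samples=76, batch_size=8, epochs=10)
print(steps)  # 100, matching global_step in the TrainOutput above

# Assumption: logging_steps is left at the default of 500.
default_logging_steps = 500
print(steps < default_logging_steps)  # True: the run ends before the first log
```

Setting `logging_strategy="epoch"` in `TrainingArguments` should make the training loss show up next to the validation loss.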

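Wow. Truly outstanding. On my previous models the highest accuracy I was able to achieve was 69%, with a logistic regression model after TF-IDF vectorization. Here, though, we're getting 94% across the board! (Note that these results are for our test set: 0.941176 is 144 of the 153 test comments. Because the metrics use `average='micro'`, precision, recall and F1 are equal to accuracy by construction, so the four columns are really one number.)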
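
The accuracy does not move between epochs, so a cheap sanity check is to compare it with the score from always predicting the most common label. A minimal sketch in plain Python, using made-up labels rather than the real test set; `majority_baseline` is a hypothetical helper name:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Made-up labels for illustration only, not the real test set.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
label, baseline = majority_baseline(toy_labels)
print("most common label:", label, "baseline accuracy:", baseline)
```

On the real data, `labels` would be `simplified.loc[test_sel, "label"].to_list()`. The model is only doing something useful if its accuracy is clearly above this baseline.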

Now let's apply the trained model to the comment column as it was before the train/test split. (I was getting errors when applying it to the entire column, so I only ran the first 1,000 comments.)

In [12]:
# predictions for the first 1,000 comments
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False)
all_preds = pipe(sarcasm["comment"].to_list()[0:1000])

In [13]:
all_preds_label = list(res["label"] for res in all_preds)

In [14]:
labels_head = sarcasm["label"].to_list()[0:1000]  # same slice as the predictions
global_acc = np.mean([a == b for a, b in zip(all_preds_label, labels_head)])
n_train = np.count_nonzero(train_sel)
print("GLOBAL ACC = {:.3f} for few-shot training with {} samples".format(global_acc,n_train))

GLOBAL ACC = 0.913 for few-shot training with 76 samples


After the previous results, it is no surprise that the model reaches 91.3% accuracy on the first 1,000 comments. Keep in mind that these rows include the 76 training comments and the 153 test comments, since both were drawn from the first 500 rows.
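
Because the training rows come from the first 100 rows of the same column, a stricter version of this check would leave them out of the score. A minimal sketch in plain Python with made-up values; `held_out_accuracy` is a hypothetical helper name:

```python
def held_out_accuracy(preds, labels, seen_mask):
    """Accuracy over rows where seen_mask is False (rows not used for training)."""
    pairs = [(p, y) for p, y, seen in zip(preds, labels, seen_mask) if not seen]
    if not pairs:
        raise ValueError("no held-out rows to score")
    return sum(p == y for p, y in pairs) / len(pairs)

# Made-up values for illustration only.
toy_preds = [0, 0, 1, 0, 1, 0]
toy_labels = [0, 1, 1, 0, 0, 0]
toy_seen = [True, True, False, False, False, False]
print(held_out_accuracy(toy_preds, toy_labels, toy_seen))
```

In the notebook, the arguments would be `all_preds_label`, `sarcasm["label"].to_list()` and `train_sel`. `zip` stops at the shortest input, so only the first 1,000 rows are scored.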

## Conclusion

The BERT-style transformer beat all my previous models by a wide margin. Prior to this model, the best accuracy I achieved was 69%, from a logistic regression model after TF-IDF vectorization. Now DistilBERT is getting 94% accuracy on the test data, as well as 91% on the first 1,000 rows of the original comment column. On these numbers it is the best model I have tried for detecting sarcasm, though both figures come from small samples (153 and 1,000 comments).