In this project we train a BERT model to classify clickbait tweets. We test two starting points: the base BERT model and a BERT model that has already been trained for clickbait detection. We use the [Webis](https://webis.de/events/clickbait-challenge/shared-task.html) dataset to fine-tune and evaluate both models.

*Code based on my previous work and the tutorial at https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python*

# **Import and load dataset**

In [1]:
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer, TrainingArguments
import pandas as pd
import numpy as np
import torch

In [2]:
# Define class to store tokenized text and labels
class ClickbaitDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

**Load data**

Some instances do not include any text in the tweet. We will remove those from the training data, but keep them as an empty string in the test data so that we can use the official evaluator later on.

In [3]:
# Load the Clickbait dataset
train_data = pd.read_csv("Data/webis_train.csv", sep=',', encoding='utf-8')
test_data  = pd.read_csv("Data/webis_test.csv", sep=',', encoding='utf-8')

## Preprocess dataframes
# Generate integer labels column
train_data['labels'] = train_data['truthClass'].map({'no-clickbait':0, 'clickbait':1})
test_data['labels'] = test_data['truthClass'].map({'no-clickbait':0, 'clickbait':1})

# Remove cases where postText doesn't include any string in training
train_data = train_data.loc[train_data.postText.apply(type) != float]

# Replace those cases with an empty string in the test data
test_data['postText'] = np.where(test_data.postText.apply(type) == float, '', test_data.postText)
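
The type check above works because pandas stores a missing string as `NaN`, which is a Python `float`. As a minimal sketch on a made-up three-row frame (the rows are illustrative, not taken from the Webis files), the same two cleaning steps can also be expressed with `isna`, `dropna` and `fillna`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Webis data; the rows are made up for illustration.
toy = pd.DataFrame({
    "postText": ["Meet the happiest dog!", np.nan, "Ban lifted on laboratory"],
    "truthClass": ["clickbait", "no-clickbait", "no-clickbait"],
})

# A missing tweet text is NaN, i.e. a float, so both masks flag the same row.
by_type = toy.postText.apply(type) == float
by_isna = toy.postText.isna()

# Training-style cleaning: drop rows without text.
train_like = toy.dropna(subset=["postText"])

# Test-style cleaning: keep every row and replace missing text with "".
test_like = toy.assign(postText=toy.postText.fillna(""))

print(list(by_type), list(by_isna))
print(len(train_like), list(test_like.postText))
```

Keeping every test row matters because the evaluator expects one prediction per line of the gold standard file.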

In [4]:
# 10% of training as development, remaining 90% as train
dev_data = train_data.sample(frac=0.1, random_state=1) # random_state=1 for reproducibility
train_data = train_data.drop(dev_data.index)

In [5]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,postMedia,postText,id,targetCaptions,targetParagraphs,targetTitle,postTimestamp,targetKeywords,targetDescription,truthJudgments,truthMean,truthClass,truthMedian,truthMode,labels
0,0,[],UK’s response to modern slavery leaving victim...,858462320779026432,['modern-slavery-rex.jpg'],['Thousands of modern slavery victims have\xa0...,‘Inexcusable’ failures in UK’s response to mod...,Sat Apr 29 23:25:41 +0000 2017,"modern slavery, Department For Work And Pensio...",“Inexcusable” failures in the UK’s system for ...,"[0.33333333330000003, 0.0, 0.33333333330000003...",0.133333,no-clickbait,0.0,0.0,0
1,1,[],this is good,858421020331560960,"['In this July 1, 2010 file photo, Dr. Charmai...",['President Donald Trump has appointed the\xa0...,Donald Trump Appoints Pro-Life Advocate as Ass...,Sat Apr 29 20:41:34 +0000 2017,"Americans United for Life, Dr. Charmaine Yoest...",President Donald Trump has appointed pro-life ...,"[1.0, 1.0, 1.0, 1.0, 1.0]",1.0,clickbait,1.0,1.0,1
2,2,[],"The ""forgotten"" Trump roast: Relive his brutal...",858368123753435136,"[""President Trump will not attend this year's ...",['When the\xa0White House correspondents’ dinn...,The ‘forgotten’ Trump roast: Relive his brutal...,Sat Apr 29 17:11:23 +0000 2017,"trump whcd, whcd, white house correspondents d...",President Trump won't be at this year's White ...,"[0.33333333330000003, 1.0, 0.33333333330000003...",0.466667,no-clickbait,0.333333,0.333333,0
3,3,[],Meet the happiest #dog in the world!,858323428260139008,"['Maru ', 'Maru', 'Maru', 'Maru', 'Maru']",['Adorable is probably an understatement. This...,"Meet The Happiest Dog In The World, Maru The H...",Sat Apr 29 14:13:46 +0000 2017,"Maru, husky, dogs, pandas, furball, instagram","The article is about Maru, a husky dog who has...","[1.0, 0.6666666666000001, 1.0, 1.0, 1.0]",0.933333,clickbait,1.0,1.0,1
5,5,[],Ban lifted on Madrid doping laboratory,858224473597779968,"['Samples in an anti-doping laboratory', 'Anth...","['Share this with', ""Madrid's Anti-Doping Labo...",World Anti-Doping Agency lifts ban on Madrid l...,Sat Apr 29 07:40:34 +0000 2017,,Madrid's Anti-Doping Laboratory has its suspen...,"[0.0, 0.33333333330000003, 0.0, 0.0, 0.0]",0.066667,no-clickbait,0.0,0.0,0


* **postText** contains the text of the tweet
* **targetTitle** contains the title of the news article cited in the tweet
* **labels** is a binary integer that indicates whether the tweet is clickbait (1) or not (0)

We will only use the *postText* and *labels* columns for this task, since most tweets in the test dataset only include the title of the news article (*postText* and *targetTitle* are the same). The label refers only to the text of the tweet; that is, the tweet may be clickbait even if the news article is not.

**Tokenize text and load into class**

In [15]:
#model_name = "elozano/bert-base-cased-clickbait-news" 
model_name = "bert-base-cased"
max_length = 64

# Get the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)

In [16]:
# Tokenize input text
train_encodings = tokenizer(list(train_data.postText), truncation=True, padding=True, max_length=max_length)
dev_encodings = tokenizer(list(dev_data.postText), truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(list(test_data.postText), truncation=True, padding=True, max_length=max_length)
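
With `truncation=True`, `padding=True` and `max_length=64`, each call cuts sequences to at most 64 tokens and then pads to the longest sequence in that call, so each split gets its own width. The following is a simplified pure-Python sketch of that behaviour, not the tokenizer's implementation; `pad_and_truncate` and the toy ids are hypothetical:

```python
def pad_and_truncate(batch, max_length=64, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest one left.

    Simplification: the real tokenizer keeps the closing [SEP] token when it
    truncates; this sketch just cuts the list.
    """
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative token ids only; a small max_length keeps the output readable.
toy_batch = [[101, 7, 8, 102], [101, 5, 102], list(range(100))]
enc = pad_and_truncate(toy_batch, max_length=6)
print(enc["input_ids"])
print(enc["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions.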

In [17]:
# Create datasets
train_dataset = ClickbaitDataset(train_encodings, list(train_data.labels))
dev_dataset = ClickbaitDataset(dev_encodings, list(dev_data.labels))
test_dataset = ClickbaitDataset(test_encodings, list(test_data.labels))

# **Fine tune BERT**

**Download pre-trained model, or load our fine-tuned one**

In [18]:
# Load our fine-tuned model if it was saved in a previous run
#model = torch.load("model.tar")

# Otherwise, download the pre-trained checkpoint
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to("cuda")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

The code in the next cells is taken directly from https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python

In [19]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc,}
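
The import cell also brings in `precision_recall_fscore_support`, which the function above does not use. As a hedged sketch, a variant that also reports precision, recall and F1 for the clickbait class could look like this; `compute_metrics_with_f1` is a hypothetical name, and the logits and labels are made up so the block runs without a trained model:

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics_with_f1(pred):
    """Hypothetical variant of compute_metrics that also reports P/R/F1."""
    labels = np.asarray(pred.label_ids).reshape(-1)   # flatten (N, 1) labels to (N,)
    preds = np.asarray(pred.predictions).argmax(-1)   # logits -> predicted class
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up stand-in for the object the Trainer passes to the callback.
fake = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]]),
    label_ids=np.array([[0], [1], [1], [1]]),
)
metrics = compute_metrics_with_f1(fake)
print(metrics)
```

With `average="binary"`, the scores refer to the positive class (label 1, clickbait), which is the minority class in this dataset.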

In [20]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=400,               # log & save weights each logging_steps
    save_steps=400,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)
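
To make `warmup_steps=500` concrete, here is a minimal sketch of a linear warmup followed by a linear decay. It assumes the library defaults for everything the cell does not set (a peak learning rate of 5e-5 and a linear scheduler) and takes the total of 6576 steps from the training log; `linear_warmup_lr` is a hypothetical helper, not a library function:

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=6576):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 3538, 6576):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

Under these assumptions the learning rate peaks at step 500 and is back to half of its peak around step 3538.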

In [21]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=dev_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

**Train**

In [22]:
trainer.train()

***** Running training *****
  Num examples = 17536
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6576


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 400 | 0.4396 | 0.398437 | 0.843943 |
| 800 | 0.4281 | 0.645742 | 0.811602 |
| 1200 | 0.4331 | 0.410323 | 0.844456 |
| 1600 | 0.4168 | 0.381076 | 0.844969 |
| 2000 | 0.4336 | 0.407126 | 0.853696 |
| 2400 | 0.4598 | 0.512181 | 0.789014 |
| 2800 | 0.4239 | 0.439343 | 0.834189 |
| 3200 | 0.3884 | 0.438876 | 0.836756 |
| 3600 | 0.3677 | 0.367455 | 0.839322 |
| 4000 | 0.3804 | 0.421809 | 0.839836 |
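
Because `load_best_model_at_end=True` ranks checkpoints by validation loss unless `metric_for_best_model` is set, the checkpoint that gets reloaded need not be the most accurate one. A small sketch using only the rows shown above illustrates the difference; the table stops at step 4000 of 6576, so later evaluations may change the outcome:

```python
# (step, training loss, validation loss, accuracy) rows copied from the table above.
rows = [
    (400, 0.4396, 0.398437, 0.843943),
    (800, 0.4281, 0.645742, 0.811602),
    (1200, 0.4331, 0.410323, 0.844456),
    (1600, 0.4168, 0.381076, 0.844969),
    (2000, 0.4336, 0.407126, 0.853696),
    (2400, 0.4598, 0.512181, 0.789014),
    (2800, 0.4239, 0.439343, 0.834189),
    (3200, 0.3884, 0.438876, 0.836756),
    (3600, 0.3677, 0.367455, 0.839322),
    (4000, 0.3804, 0.421809, 0.839836),
]

best_by_loss = min(rows, key=lambda r: r[2])  # lowest validation loss
best_by_acc = max(rows, key=lambda r: r[3])   # highest dev accuracy

print("lowest validation loss at step", best_by_loss[0])
print("highest accuracy at step", best_by_acc[0])
```

Among the rows shown, the two criteria pick different checkpoints, so the choice of metric is worth making explicitly.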


***** Running Evaluation *****
  Num examples = 1948
  Batch size = 20
Saving model checkpoint to ./results\checkpoint-400
Configuration saved in ./results\checkpoint-400\config.json
Model weights saved in ./results\checkpoint-400\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1948
  Batch size = 20
Saving model checkpoint to ./results\checkpoint-800
Configuration saved in ./results\checkpoint-800\config.json
Model weights saved in ./results\checkpoint-800\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1948
  Batch size = 20
Saving model checkpoint to ./results\checkpoint-1200
Configuration saved in ./results\checkpoint-1200\config.json
Model weights saved in ./results\checkpoint-1200\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 1948
  Batch size = 20
Saving model checkpoint to ./results\checkpoint-1600
Configuration saved in ./results\checkpoint-1600\config.json
Model weights saved in ./results\checkpoint-1600\pytorch_model.bi

TrainOutput(global_step=6576, training_loss=0.3930878674026823, metrics={'train_runtime': 1848.7411, 'train_samples_per_second': 28.456, 'train_steps_per_second': 3.557, 'total_flos': 1730218300047360.0, 'train_loss': 0.3930878674026823, 'epoch': 3.0})
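
The figures in the training log can be cross-checked with a little arithmetic. This sketch uses only numbers printed above and assumes a single device with no gradient accumulation, as the log states:

```python
import math

train_examples = 17536   # "Num examples" in the training log
dev_examples = 1948      # "Num examples" in the evaluation log
batch_size = 8           # per_device_train_batch_size
epochs = 3               # num_train_epochs
runtime_s = 1848.7411    # train_runtime from TrainOutput

# Share of the cleaned training data held out as the development set.
dev_fraction = dev_examples / (train_examples + dev_examples)

# Optimizer steps: one per batch, repeated for every epoch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Throughput as reported in TrainOutput.
samples_per_s = train_examples * epochs / runtime_s
steps_per_s = total_steps / runtime_s

print(f"dev fraction: {dev_fraction:.3f}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"samples/s: {samples_per_s:.3f}, steps/s: {steps_per_s:.3f}")
```

The results agree with the log: 6576 optimization steps, about 28.456 samples per second, and a development set of roughly 10% of the data.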

**Store model**

In [16]:
#torch.save(model, "model.tar")

# Evaluate

We will predict classes and write them to a JSON Lines file so that we can score them with the official evaluator.

In [23]:
# Predict classes for test dataset
outputs = trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 18979
  Batch size = 20


In [25]:
# Open gold standard json for official evaluator
with open("Data/Eval/truth.jsonl", 'r') as f:
    lines = f.readlines()

# Argmax to convert to binary class
y_pred = outputs.predictions.argmax(1)

# Gold standard and predicted instances should be the same
assert len(lines) == len(y_pred)

# Write our output in the same json format
with open("Data/Eval/predictions.jsonl", 'w') as f:
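    # Reuse the start of each gold line (everything before the first comma,
    # i.e. the id field) and append our predicted score
    for i, line in enumerate(lines):
        f.write(line.split(',')[0] + f', "clickbaitScore": {float(y_pred[i])}' + '}\n')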
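
The string splitting above relies on the id being the first field of every gold line. As an alternative sketch, the same file can be produced with the `json` module, which does not depend on field order. The two truth lines and the `"id"` and `"truthClass"` field names below are assumptions made up for illustration, and in-memory buffers stand in for the real files:

```python
import io
import json

# Two made-up lines standing in for Data/Eval/truth.jsonl (field names assumed).
truth_file = io.StringIO(
    '{"id": "858462320779026432", "truthClass": "no-clickbait"}\n'
    '{"id": "858421020331560960", "truthClass": "clickbait"}\n'
)
y_pred = [0, 1]  # made-up predicted classes, in the same order as the truth lines

out_file = io.StringIO()
for line, label in zip(truth_file, y_pred):
    post_id = json.loads(line)["id"]              # parse instead of splitting on commas
    record = {"id": post_id, "clickbaitScore": float(label)}
    out_file.write(json.dumps(record) + "\n")     # one JSON object per line

print(out_file.getvalue())
```

Parsing each line also fails loudly on a malformed record instead of silently writing a broken prediction.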

In [26]:
# Official python evaluator
!python Data/Eval/eval.py "Data/Eval/truth.jsonl" "Data/Eval/predictions.jsonl" "output.txt"

Dataset Stats
Size: 18979
#Clickbait: 4515
#No-Clickbait: 14464
Regression scores
Explained variance: -0.36806436159817046
Mean absolute error: 0.23672831374577377
Mean squared error: 0.10326758815095051
Median absolute error: 0.19999999997999998
R2 score: -0.40450020986123847
Normalized mean squared error: 1.4045002098612382
Binary classification scores
Accuracy: 0.8592128141630223
Precision: 0.7015970247210676
Recall: 0.7102990033222591
F1 score: 0.7059211974466211
Classification report
              precision    recall  f1-score   support

           0       0.91      0.91      0.91     14464
           1       0.70      0.71      0.71      4515

    accuracy                           0.86     18979
   macro avg       0.81      0.81      0.81     18979
weighted avg       0.86      0.86      0.86     18979



We will report the *Binary classification scores* as the results of our model.
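
To show how the four binary scores relate to each other, the sketch below recomputes them from confusion-matrix counts. The evaluator does not print these counts; they are back-calculated here from the reported precision, recall and class sizes, so treat them as derived values:

```python
# Counts for the clickbait class, derived from the report above (not evaluator output).
tp = 3207    # clickbait predicted as clickbait
fp = 1364    # no-clickbait predicted as clickbait
fn = 1308    # clickbait predicted as no-clickbait
tn = 13100   # no-clickbait predicted as no-clickbait

# Sanity check against the dataset stats: 4515 clickbait, 14464 no-clickbait.
n_clickbait = tp + fn
n_no_clickbait = fp + tn

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Accuracy is dominated by the larger no-clickbait class, which is why precision, recall and F1 for the clickbait class are noticeably lower than the 0.86 accuracy.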