<a href="https://colab.research.google.com/github/Doris-QZ/spooky_author_identification/blob/main/3_BERT_Spooky_Author_Identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

This is the third deep learning model for the 'Spooky Author Identification' project. In this notebook, I will directly load the data from my Google Drive to fine-tune the **BERT model**. For the EDA section, please check the notebook: [1_LSTM_Spooky_Author_Identification.ipynb](https://github.com/Doris-QZ/spooky_author_identification/blob/main/1_LSTM_Spooky_Author_Identification.ipynb).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Load Important packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import re

# Modeling
import torch
# AdamW and get_linear_schedule_with_warmup are not imported: the Trainer builds its own optimizer and scheduler
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, classification_report, accuracy_score, f1_score

In [3]:
# Load the data
train = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/train.csv')
test = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/test.csv')

### BERT

In [4]:
# Split the training data into training and validation sets (stratified by author)
training_set, validation_set = train_test_split(train, test_size = 0.2, stratify = train['author_encoded'], random_state = 1)

**BERT model with the last encoder layer and pooler layer unfrozen**

I will first fine-tune a BertForSequenceClassification model with only the last encoder layer and the pooler layer of BERT unfrozen.

In [4]:
# Load bert_tokenizer and bert_model
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Take a look at the architecture of bert_model
bert_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [5]:
# Use GPU if available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_model = bert_model.to(device)

In [6]:
# Freeze base model parameters
for name, param in bert_model.base_model.named_parameters():
  param.requires_grad = False

# Unfreeze the last encoder layer and the pooler layer
for name, param in bert_model.base_model.encoder.layer[-1].named_parameters():
  param.requires_grad = True
for name, param in bert_model.base_model.pooler.named_parameters():
  param.requires_grad = True

In [11]:
total_params = sum(p.numel() for p in bert_model.parameters())
trainable_params = sum(p.numel() for p in bert_model.parameters() if p.requires_grad)

print(f'Total parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')

Total parameters: 109,484,547
Trainable parameters: 7,680,771
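
The trainable-parameter count printed above can be checked by hand. The sketch below is plain arithmetic, assuming the standard bert-base-uncased sizes (hidden size 768, feed-forward size 3072) and the 3 output labels used here; the helper name `linear_params` is introduced for illustration only.

```python
# Sanity check of the trainable-parameter count, using plain arithmetic.
# Assumed sizes for bert-base-uncased: hidden = 768, feed-forward = 3072.
HIDDEN, FEED_FORWARD, NUM_LABELS = 768, 3072, 3

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# One encoder layer: query, key, value and attention-output projections,
# the two feed-forward projections, and two LayerNorms.
encoder_layer = (
    4 * linear_params(HIDDEN, HIDDEN)
    + linear_params(HIDDEN, FEED_FORWARD)
    + linear_params(FEED_FORWARD, HIDDEN)
    + 2 * layer_norm
)
pooler = linear_params(HIDDEN, HIDDEN)
classifier = linear_params(HIDDEN, NUM_LABELS)

trainable = encoder_layer + pooler + classifier
print(f'Last encoder layer: {encoder_layer:,}')
print(f'Pooler:             {pooler:,}')
print(f'Classifier head:    {classifier:,}')
print(f'Trainable total:    {trainable:,}')
```

The classifier head is included because the freezing loop only touches `base_model`, so the newly initialized classification layer stays trainable.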


In [6]:
# Check the length of text data
text_length = training_set['text'].str.split().str.len()
print(text_length.describe())

count    15663.000000
mean        26.697951
std         18.102614
min          2.000000
25%         15.000000
50%         23.000000
75%         34.000000
max        594.000000
Name: text, dtype: float64


In [None]:
(text_length > 64).sum() / training_set.shape[0]

np.float64(0.033263104130754007)

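About 3% of the training texts have more than 64 words, so I will set the max_length of the bert_tokenizer to 64. Note that max_length counts WordPiece tokens (including the special tokens) rather than words, so the share of texts that actually get truncated will be somewhat higher than 3%.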

In [7]:
# Tokenize text data
train_tokenized = bert_tokenizer(training_set['text'].tolist(),
                                 padding = True,
                                 truncation = True,
                                 add_special_tokens = True,
                                 max_length = 64,
                                 return_tensors = 'pt')

val_tokenized = bert_tokenizer(validation_set['text'].tolist(),
                                 padding = True,
                                 truncation = True,
                                 add_special_tokens = True,
                                 max_length = 64,
                                 return_tensors = 'pt')

In [8]:
# Create torch dataset
class Dataset(torch.utils.data.Dataset):
  def __init__(self, tokenized, labels = None):
    self.tokenized = tokenized
    self.labels = labels

  def __getitem__(self, idx):
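    # The tokenizer already returns tensors, so index them directly;
    # re-wrapping them with torch.tensor() triggers a copy warning on every item
    item = {key: value[idx] for key, value in self.tokenized.items()}
    if self.labels is not None:
      item['labels'] = torch.tensor(self.labels[idx])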
    return item

  def __len__(self):
    return len(self.tokenized['input_ids'])


In [9]:
train_dataset = Dataset(train_tokenized, training_set['author_encoded'].tolist())
val_dataset = Dataset(val_tokenized, validation_set['author_encoded'].tolist())

In [10]:
# Define metrics
def compute_metrics(eval_pred):
  y_pred, y_true = eval_pred
  y_pred = np.argmax(y_pred, axis = 1)
  accuracy = accuracy_score(y_true, y_pred)
  f1 = f1_score(y_true, y_pred, average = 'macro')
  return {'accuracy': accuracy, 'f1_score': f1}
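
To make the `average = 'macro'` setting concrete, here is a minimal pure-Python sketch of what macro-averaged F1 computes: one F1 score per class, then an unweighted mean. The function name `macro_f1` and the toy labels are made up for illustration; the notebook itself relies on scikit-learn's `f1_score`.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels (0, 1, 2 stand for the three encoded authors)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f'accuracy: {accuracy:.4f}')
print(f'macro F1: {macro_f1(y_true, y_pred, labels=[0, 1, 2]):.4f}')
```

Because every class gets the same weight, macro F1 is a reasonable companion to accuracy when the classes are not perfectly balanced.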

In [None]:
# Define trainer
args = TrainingArguments(
    output_dir = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bert_model',
    num_train_epochs = 20,
    learning_rate = 3e-5,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    eval_strategy = 'epoch',
    logging_strategy = 'epoch',
    save_strategy = 'epoch',
    save_total_limit = 1,
    load_best_model_at_end = True,
    metric_for_best_model = 'accuracy',
    report_to = "none"
)

trainer = Trainer(
    model = bert_model,
    args = args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics
)

In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|---|---|---|---|---|
| 1 | 0.6401 | 0.515979 | 0.787794 | 0.788587 |
| 2 | 0.4799 | 0.476359 | 0.806946 | 0.806663 |
| 3 | 0.4167 | 0.485026 | 0.808223 | 0.809329 |
| 4 | 0.3675 | 0.453639 | 0.824055 | 0.824829 |
| 5 | 0.3176 | 0.470461 | 0.827375 | 0.827702 |
| 6 | 0.2812 | 0.493227 | 0.829673 | 0.830208 |
| 7 | 0.2518 | 0.486979 | 0.835546 | 0.83535 |
| 8 | 0.2181 | 0.516141 | 0.835291 | 0.835521 |
| 9 | 0.1924 | 0.565986 | 0.828141 | 0.828614 |
| 10 | 0.1646 | 0.597329 | 0.836568 | 0.836836 |


TrainOutput(global_step=19580, training_loss=0.21653456371334648, metrics={'train_runtime': 1679.1548, 'train_samples_per_second': 186.558, 'train_steps_per_second': 11.661, 'total_flos': 1.030286365468416e+16, 'train_loss': 0.21653456371334648, 'epoch': 20.0})

In [None]:
trainer.evaluate()


{'eval_loss': 0.5973291993141174,
 'eval_accuracy': 0.8365679264555669,
 'eval_f1_score': 0.8368360922716223,
 'eval_runtime': 14.3087,
 'eval_samples_per_second': 273.68,
 'eval_steps_per_second': 17.122,
 'epoch': 20.0}

The best validation accuracy is 0.8366, with a macro F1 score of 0.8368, achieved at epoch 10. As the training log shows, the training loss decreases steadily over the 20 epochs, while the validation loss reaches its minimum (0.4536) at epoch 4 and then trends upward through epoch 20, indicating that the model is overfitting.
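
One common response to this pattern is patience-based early stopping on the validation loss. The sketch below replays the ten validation losses and accuracies shown in the log above through a simple patience rule; the function name `early_stop_epoch` and the patience of 3 are assumptions for illustration, not something this notebook ran.

```python
def early_stop_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the lowest loss so far)."""
    best_loss, best_epoch, bad_epochs = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation loss and accuracy for the epochs shown in the training log
val_losses = [0.515979, 0.476359, 0.485026, 0.453639, 0.470461,
              0.493227, 0.486979, 0.516141, 0.565986, 0.597329]
val_accuracies = [0.787794, 0.806946, 0.808223, 0.824055, 0.827375,
                  0.829673, 0.835546, 0.835291, 0.828141, 0.836568]

stop_epoch, best_loss_epoch = early_stop_epoch(val_losses, patience=3)
best_acc_epoch = val_accuracies.index(max(val_accuracies)) + 1

print(f'Patience rule stops after epoch {stop_epoch}, keeping epoch {best_loss_epoch}')
print(f'Selecting by accuracy keeps epoch {best_acc_epoch}')
```

The two selection criteria disagree here: the lowest validation loss is at epoch 4, while `metric_for_best_model = 'accuracy'` keeps epoch 10. Which one is preferable depends on whether the final score rewards calibrated probabilities or only correct labels.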

Next, I will fine-tune a new BertForSequenceClassification model, this time unfreezing only the pooler layer of BERT, to see if it reduces the overfitting.

**BERT model with only the pooler layer unfrozen**

In [10]:
# Second bert model
bert_model2 = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 3)

# use GPU
bert_model2 = bert_model2.to('cuda')

# Freeze the base model
for name, param in bert_model2.base_model.named_parameters():
  param.requires_grad = False

# Unfreeze pooler layer only
for name, param in bert_model2.base_model.pooler.named_parameters():
  param.requires_grad = True

total_params = sum(p.numel() for p in bert_model2.parameters())
trainable_params = sum(p.numel() for p in bert_model2.parameters() if p.requires_grad)

print(f'Total parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total parameters: 109,484,547
Trainable parameters: 592,899


In [None]:
# Define trainer
args = TrainingArguments(
    output_dir = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bert_model2',
    num_train_epochs = 20,
    learning_rate = 5e-5,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    eval_strategy = 'epoch',
    logging_strategy = 'epoch',
    save_strategy = 'epoch',
    save_total_limit = 1,
    load_best_model_at_end = True,
    metric_for_best_model = 'accuracy',
    report_to = "none"
)

trainer = Trainer(
    model = bert_model2,
    args = args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|---|---|---|---|---|
| 1 | 0.801 | 0.673442 | 0.715015 | 0.715024 |
| 2 | 0.6538 | 0.624128 | 0.738253 | 0.736775 |
| 3 | 0.6275 | 0.621426 | 0.738509 | 0.738606 |
| 4 | 0.6127 | 0.601946 | 0.747957 | 0.747752 |
| 5 | 0.6095 | 0.589558 | 0.754341 | 0.754148 |
| 6 | 0.5959 | 0.580098 | 0.755107 | 0.755023 |
| 7 | 0.5919 | 0.577417 | 0.757406 | 0.757278 |
| 8 | 0.592 | 0.570199 | 0.760215 | 0.759935 |
| 9 | 0.5836 | 0.57312 | 0.762257 | 0.762447 |
| 10 | 0.5804 | 0.572871 | 0.760981 | 0.760869 |


TrainOutput(global_step=19580, training_loss=0.6006889428985253, metrics={'train_runtime': 1531.9824, 'train_samples_per_second': 204.48, 'train_steps_per_second': 12.781, 'total_flos': 1.030286365468416e+16, 'train_loss': 0.6006889428985253, 'epoch': 20.0})

Both the training loss and the validation loss decrease steadily, suggesting that the overfitting issue is resolved. However, the validation accuracy is only 0.7691 at the last epoch (epoch 20).

I will continue training and see if the accuracy improves.

In [None]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|---|---|---|---|---|
| 1 | 0.5842 | 0.563474 | 0.769152 | 0.769481 |
| 2 | 0.5813 | 0.558947 | 0.768641 | 0.767757 |
| 3 | 0.5795 | 0.571861 | 0.761236 | 0.760792 |
| 4 | 0.5792 | 0.563169 | 0.765066 | 0.765034 |
| 5 | 0.5723 | 0.55574 | 0.770684 | 0.770602 |
| 6 | 0.5764 | 0.551932 | 0.770684 | 0.770361 |
| 7 | 0.5809 | 0.552441 | 0.769918 | 0.76959 |
| 8 | 0.5751 | 0.548575 | 0.771961 | 0.771354 |
| 9 | 0.572 | 0.551108 | 0.773493 | 0.773666 |
| 10 | 0.5728 | 0.552193 | 0.772472 | 0.772819 |


TrainOutput(global_step=19580, training_loss=0.5714444558881033, metrics={'train_runtime': 1529.4387, 'train_samples_per_second': 204.82, 'train_steps_per_second': 12.802, 'total_flos': 1.030286365468416e+16, 'train_loss': 0.5714444558881033, 'epoch': 20.0})

The validation accuracy improved only slightly after an additional 20 epochs of training, and it remains much lower than that of the first model.

I will therefore use the first model to make predictions on the test dataset.

In [11]:
# Prepare test dataset
test_tokenized = bert_tokenizer(test['text'].tolist(),
                                 padding = True,
                                 truncation = True,
                                 add_special_tokens = True,
                                 max_length = 64,
                                 return_tensors = 'pt')

test_dataset = Dataset(test_tokenized)

In [12]:
# load the first model
output_dir = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bert_model/checkpoint-9790'
bert_model = BertForSequenceClassification.from_pretrained(output_dir)
bert_model = bert_model.to('cuda')

Number of trainable parameters:  109484547
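
The number in the checkpoint folder name is the global training step at which it was saved. As a quick check, the sketch below maps that step back to an epoch using only figures printed earlier in this notebook (15,663 training rows, batch size 16, 20 epochs); it assumes a single device and no gradient accumulation.

```python
import math

train_rows = 15663   # count reported by text_length.describe()
batch_size = 16      # per_device_train_batch_size
epochs = 20          # num_train_epochs

# The last, smaller batch still counts as one step, hence the ceiling
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
checkpoint_step = 9790  # from the folder name checkpoint-9790

print(f'Steps per epoch:  {steps_per_epoch}')
print(f'Total steps:      {total_steps}')
print(f'Checkpoint epoch: {checkpoint_step // steps_per_epoch}')
```

The total matches the `global_step=19580` reported by `TrainOutput`, and step 9790 corresponds to the end of epoch 10, the epoch with the best validation accuracy.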


In [13]:
# Define dummy training arguments
args = TrainingArguments(
    output_dir = '/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bert_model/results',
    per_device_eval_batch_size = 16,
    report_to = "none"
)

# Create the trainer
trainer = Trainer(
    model = bert_model,
    args = args
)


In [14]:
# Print the classification report on the validation set
predictions = trainer.predict(val_dataset)
y_pred = np.argmax(predictions.predictions, axis = 1)
y_true = np.array(validation_set['author_encoded'])
print(classification_report(y_true, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.82      0.83      1580
           1       0.82      0.86      0.84      1209
           2       0.84      0.84      0.84      1127

    accuracy                           0.84      3916
   macro avg       0.84      0.84      0.84      3916
weighted avg       0.84      0.84      0.84      3916



In [15]:
# Make prediction on the test set
predictions = trainer.predict(test_dataset)

# Extract the logits from the prediction object
logits = predictions.predictions

# Convert logits to probability
probabilities = torch.softmax(torch.tensor(logits), dim = -1).numpy()
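
As a reminder of what the softmax step does, here is a minimal pure-Python sketch for a single row of logits. The function name `softmax` and the example logits are hypothetical; the notebook itself uses `torch.softmax` on the full prediction array.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, in the label order 0, 1, 2
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

print([round(p, 4) for p in probs])
print(sum(probs))
```

The ordering of the logits is preserved, so `argmax` over the probabilities gives the same predicted class as `argmax` over the raw logits.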


In [16]:
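# Name the probability columns in the order of the encoded labels (0, 1, 2)
bert_prediction = pd.DataFrame(probabilities, columns = ['EAP', 'MWS', 'HPL'])
bert_prediction = pd.concat([test['id'], bert_prediction], axis = 1)
# Reorder the columns into the id, EAP, HPL, MWS layout used for the submission file
bert_prediction = bert_prediction[['id', 'EAP', 'HPL', 'MWS']]
bert_prediction.head()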

| | id | EAP | HPL | MWS |
|---|---|---|---|---|
| 0 | id02310 | 0.000132 | 2.2e-05 | 0.999846 |
| 1 | id24541 | 0.999984 | 1.4e-05 | 2e-06 |
| 2 | id00134 | 5e-06 | 0.999991 | 5e-06 |
| 3 | id27757 | 0.996782 | 0.003194 | 2.4e-05 |
| 4 | id04081 | 0.344183 | 0.620572 | 0.035244 |


In [None]:
bert_prediction.to_csv('/content/drive/MyDrive/ColabNotebooks/Spooky_Author_Identification/bert_model/bert_prediction.csv', index = False)

After submitting to Kaggle, the predictions scored a log loss of 0.61 on the public leaderboard and 0.57 on the private leaderboard.
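
For reference, the multi-class log loss behind these scores can be sketched in a few lines. This is a minimal pure-Python illustration, not Kaggle's scoring code; the function name `multiclass_log_loss`, the clipping constant, and the toy rows are assumptions. It shows why a confident wrong prediction costs far more than a cautious one, which is one reason log loss can rise while accuracy still improves.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# Two hypothetical predictions for a sentence whose true class is 0;
# both are wrong, but with very different confidence
cautious_wrong = [[0.40, 0.50, 0.10]]
confident_wrong = [[0.001, 0.998, 0.001]]

print(multiclass_log_loss([0], cautious_wrong))
print(multiclass_log_loss([0], confident_wrong))
```

Given how close to 0 or 1 many of the predicted probabilities above are, the occasional confident mistake is likely to account for a noticeable share of the final score.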