# **Spring 2023 NLP Homework 5: Fine-tuning Neural Language Models**

In this homework you will finetune a neural langauge model to perform the author classification task from HW4. As a reminder, the classifier has to guess which of the following three authors wrote some given text:

- Lewis Carrol
- Marion Zimmer Bradley
- Edgar Allen Poe

You will use the DistilBertForSequenceClassification model that you worked with in class. The code for training the model will be identical to the code you worked with in [this notebook](https://colab.research.google.com/drive/1lFrpDzxGIRQYnuwNKAI5Syr5gdOfbWxK?usp=sharing). Your main tasks in this homework are to: 

1. Convert the data to a format that is appropriate to pass into the model. 
2. Convert the predictions of the model in a format that makes it possible to compute accuracy, precision, recall and f-scores. (You should be able to reuse the functions to compute these metrics from HW5)

#### **What should I do if I run out of RAM?**
The free GPUs that Colab assigns might not always reliable. Sometimes you code will run without issues, and other times you might run into RAM errors. For this reason, try to train your models on as much data as possible, but do not worry if you are not able to train it on all of the data. You can also try to run the models on your personal computers without using GPUs! Just make sure to upload the correct .ipynb with outputs to Gradescope. 

##**Setup**

You will be using the [same set of texts](https://drive.google.com/drive/folders/1WG2YWyq7c4CUgYnO2SsC46_jRWXIYTpV?usp=sharing) as in HW5. Upload the all of the .txt files to your Colab repository and specify the directory location in the code below.  

In [18]:
import glob
import nltk
import pandas as pd

nltk.download('punkt')
#Store data directory in a variable and only use this variable in your code
dat_dir = './' 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Install and load necessary models and packages. 

In [19]:
!pip install transformers datasets pynvml accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [20]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset
from torch.utils.data import DataLoader

from datasets import load_dataset
from multiprocessing import cpu_count
import numpy as np
from accelerate import Accelerator
from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler
from accelerate.utils import find_executable_batch_size
import torch
from tqdm.auto import tqdm
from torch.utils.data import DataLoader
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, BertTokenizerFast, BertModel
import json
from datasets import Dataset
from transformers import TrainingArguments, Trainer, logging
import random

In [13]:
## Set "device" value depending on whether or not you have access to GPUs
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
accelerator = Accelerator()
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased", truncation=True, do_lower_case=True)
tokenizer_bert = BertTokenizerFast.from_pretrained("bert-base-cased", truncation=True, do_lower_case=True)
device

device(type='cpu')

##**Data pre-processing**

Start by writing a function called load_data that returns three lists, one each for train, test and dev. The lists should be formated in a format that the tokenize_function can use -- i.e. the lists should contain pairs of text and labels. (Look at the tokenize_function for further clues on how this should be organized). 

Feel free to write additional functions to pre-process the data before passing it into load_data. 

In [14]:
def load_data(dat_dir):                                 
    fnames = 'Sarcasm_Headlines_Dataset.json'
    data = []
    train = []
    dev = []
    test = []
    df = pd.read_json(fnames, lines=True)
    df = df.drop(['article_link'], axis=1)
    for index, row in df.iterrows():
      data.append({'label': row['is_sarcastic'], 'sent':row['headline']})

    train = data[:int(len(data)*0.8)]
    dev = data[int(len(data)*0.8):int(len(data)*0.9)]
    test = data[int(len(data)*0.9):]
      
    return train, dev, test

In [15]:
load_data(dat_dir)

You will need to tokenize your data before passing it into your model. You can use the following function for that. 

In [7]:
def tokenize_function(example):
  #the tokenizer is cached in memory, so will not re-download for every function call. 
  tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased",
                                                      truncation=True,
                                                      do_lower_case=True)
  tokenized = tokenizer(example['sent'],
                        padding = 'max_length',
                        return_tensors='pt') #returns dict
  # convert label to a tensor and add it to the tokenized.
  lab = example['label']
  tokenized['labels'] = torch.tensor(int(lab)).to(device)

  return tokenized

##**Code setup to train and get predictions from the model**

In [8]:
def train(model, tokenized_data, args):
  num_epochs = args['num_epochs']
  batch_size = args['batch_size']

  # Set up the optimizer
  optimizer = AdamW(model.parameters(), lr=3e-5)

  # Set up a dataloader, which will divide the data into batches
  train_dataloader = DataLoader(
      tokenized_data, shuffle=True, batch_size=batch_size
      )

  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler("linear",
                               optimizer=optimizer,
                               num_warmup_steps=0,
                               num_training_steps=num_training_steps,
                               )
  #Start train
  progress_bar = tqdm(range(num_training_steps))
  for epoch in range(num_epochs):
    print("Epoch",epoch)
    for i,batch in enumerate(tqdm(train_dataloader)):
        batch = {k: v.to(device) for k, v in batch.items()}
        input_ids = batch['input_ids'].squeeze()
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        #forward pass
        outputs = model(input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        #compute loss and update weights
        loss = outputs[0]
        loss.backward()
          
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

In [9]:
def get_predictions(model, tokenized_dataset, tokenizer, n):
  """
  n: number of examples from the dataset you want predictions for
  """
  preds = []
  eval_dataset = DataLoader(tokenized_dataset[:n], batch_size=1, shuffle=False)
  for i,batch in enumerate(eval_dataset):                
    batch = {k: v.to(device) for k, v in batch.items()}
    input_ids = batch['input_ids'].squeeze()
    attention_mask = batch['attention_mask']
    labels = batch['labels']
    outputs = model(input_ids,
                    attention_mask=attention_mask,
                    labels=labels)

    logits = outputs.logits
    best = torch.argmax(logits)
    pred = best.item()

    preds.append({'sent': tokenizer.decode(batch["input_ids"][0][0]),
                  'pred': pred,
                  'gold': batch["labels"][0].item(),
                  'logits': outputs.logits})
  return preds

##**Defining evaluation metrics**

Write functions to compute accuracy, precision, recall and fscore. You should be able to re-use the functions you wrote for HW5. You will want to either modify the functions to take as input predictions in the format outputted by get_predictions or write another function to convert the output of get_predictions into a list of predictions and gold_labels. 

In [10]:
from sklearn.metrics import confusion_matrix
def make_confusion_matrix(predictions):
    output_labels = []
    gold_labels = []
    for item in predictions:
        output_labels.append(item['pred'])
        gold_labels.append(item['gold'])
    return np.array(confusion_matrix(gold_labels, output_labels, labels=list(set(gold_labels))))


In [11]:
# Write a function to calculate accuracy
def calc_accuracy(predictions, average_type='macro'):
  cfm = make_confusion_matrix(predictions)
  tp = np.array([cfm[i][i] for i in range(len(cfm))])
  gold_size = np.sum(cfm,axis=1)
  accuracies = np.divide(tp, gold_size)
  
  if average_type == 'macro':
    return np.mean(accuracies)
  else:
    return np.sum(tp)/np.sum(gold_size)

In [12]:
# Write a function to calculate precision
def calc_precision(predictions, average_type='macro'):
  cfm = make_confusion_matrix(predictions)
  tp = np.array([cfm[i][i] for i in range(len(cfm))])
  output_size = np.sum(cfm,axis=0)
  precisions = []
  for i in range(len(cfm)):
    if output_size[i]==0:
      precisions.append(0)
    else:
      precisions.append(tp[i]/ output_size[i])
  
  if average_type == 'macro':
    return np.mean(precisions)
  else:
    return np.sum(tp)/np.sum(output_size)


In [13]:
# Write a function to calculate recall
def calc_recall(predictions, average_type='macro'):
  cfm = make_confusion_matrix(predictions)
  tp = np.array([cfm[i][i] for i in range(len(cfm))])
  size = np.array([sum([cfm[i][j] for j in range(len(cfm))]) for i in range(len(cfm))])
  recalls = np.divide(tp, size)
  
  if average_type == 'macro':
    return np.mean(recalls)
  else:
    return np.sum(tp)/np.sum(size)

In [14]:
# Write a function to calculate fscore
def calc_fscore(precision, recall, beta):
  beta = beta**2
  return ((beta + 1)*precision*recall)/(beta*precision + recall)

In [15]:
def print_scores(model_type, preds):
  print(model_type)
  print('-------------------------')
  precision = calc_precision(preds, "macro")
  recall = calc_recall(preds,  "macro")
  accuracy = calc_accuracy(preds, "micro")
  f1 = calc_fscore(precision, recall, 1)
  f2 = calc_fscore(precision, recall, 2)
  print('Precision\t', round(precision, 3))
  print('Recall\t\t', round(recall, 3))
  print('Accuracy\t', round(accuracy, 3))
  print('F2\t\t', round(f2, 3))
  print('F1\t\t', round(f1,3))
  print()

In [16]:
import random

# Write your code here to load train, dev and test data. 
train_dat, dev_dat, test_dat = load_data(dat_dir)

# Shuffle training, dev and test
random.shuffle(train_dat)
random.shuffle(dev_dat)
random.shuffle(test_dat)

# Create tokenized train, dev and test. 
## You might want to look at only a small subset of train, dev and test to avoid RAM issues. 


In [17]:
# Sanity check on the train, dev, and test sets
print('Number of sentences in Train')
count = {}
count[0] = 0
count[1] = 0
for d in train_dat:
    count[d['label']] += 1
for key,val in count.items():
  print(key, val)
print('Total: ', len(train_dat))

print()
print('Number of sentences in Dev')
count = {}
count[0] = 0
count[1] = 0
for d in dev_dat:
    count[d['label']] += 1
for key,val in count.items():
  print(key, val)
print('Total: ', len(dev_dat))

print()
print('Number of sentences in Test')
count = {}
count[0] = 0
count[1] = 0
for d in test_dat:
    count[d['label']] += 1
for key,val in count.items():
  print(key, val)
print('Total: ', len(test_dat))

Number of sentences in Train
0 11971
1 10924
Total:  22895

Number of sentences in Dev
0 1508
1 1354
Total:  2862

Number of sentences in Test
0 1506
1 1356
Total:  2862


Load the model 

In [18]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-cased",
                                                            num_labels=2).to(device)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased",
                                                      truncation=True,
                                                      do_lower_case=True)

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifi

Evaluate the model on the test set prior to fine-tuning. If you run into RAM issues, evaluate it on a smaller set using the n parameter of get_predictions(). Make sure to print precision, accuracy, recall and f1 in an easy to read format. 

In [19]:
# Write your code here
tokenized_train = [tokenize_function(e) for e in train_dat[:100]]
tokenized_test = [tokenize_function(t) for t in test_dat[:100]]
tokenized_dev = [tokenize_function(d) for d in dev_dat[:100]]

##**Fine-tuning the model**

Fine tune the model to the training dataset (or subsets of the dataset) and save it using `torch.save()`. Set the number of epochs to three, and the batch_size to 5. 




In [None]:
args = {
    'num_epochs': 3,
    'batch_size': 5
}

## Write your code here
train(model,tokenized_train, args)
torch.save(model, 'model.pt')
trained_model = torch.load('model.pt')



  0%|          | 0/60 [00:00<?, ?it/s]

Epoch 0


  0%|          | 0/20 [00:00<?, ?it/s]

##**Evaluating the model**

Evaluate the saved model on the test set. Make sure to display the evaluation metrics in an easy-to-view format. 

In [None]:
preds_after = get_predictions(trained_model, tokenized_test, tokenizer, 50)
print_scores("model after fine-tuning yields these scores", preds_after)


In [21]:
model_bert = BertModel.from_pretrained("bert-base-cased",num_labels=2).to(device)

tokenizer_bert = BertTokenizerFast.from_pretrained("bert-base-cased",
                                                      truncation=True,
                                                      do_lower_case=True)


Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
args = {
    'num_epochs': 3,
    'batch_size': 5
}

## Write your code here
train(model_bert,tokenized_train, args)
torch.save(model_bert, 'model_bert.pt')
trained_model_bert = torch.load('model_bert.pt')