<a href="https://colab.research.google.com/github/Nikoschenk/language_model_finetuning/blob/master/scibert_fine_tuner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Transformers on [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
This notebook demonstrates how to fine-tune [SciBERT](https://github.com/allenai/scibert) on the [COVID-19 Open Research Dataset (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library.

Credits: This is a modification of the [original notebook](https://colab.research.google.com/github/interactive-fiction-class/interactive-fiction-class.github.io/blob/master/homeworks/language-model/hw4_transformer.ipynb) provided by [Daphne Ippolito](https://www.seas.upenn.edu/~daphnei/).


## Installation

### Install SciBERT and the Hugging Face Transformers library

In [0]:
# Download the SciBERT model and extract it to the local Colab disk (/content).
!wget -nc -O /content/scibert_scivocab_uncased.tar https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar
!ls /content

import os
os.chdir('/content')

!echo "Extracting SciBERT ..."
!tar -xvf scibert_scivocab_uncased.tar
!echo "... done."

# Delete tar archive.
#!rm /content/scibert_scivocab_uncased.tar
#!ls

In [0]:
# Install transformers.
!git clone https://github.com/huggingface/transformers

import os
os.chdir('/content/transformers')

# Use language modeling version as of April 21st.
!git checkout b1ff0b2ae7d368b7db3a8a8472a29cc195d278d8

!pip install .
!pip install -r ./examples/requirements.txt

os.chdir('/content/transformers/examples')

!pip install dict_to_obj

In [0]:
import torch
import run_language_modeling
import run_generation
from dict_to_obj import DictToObj
import collections
import random
import numpy as np

from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelWithLMHead

# Make sure that this version of transformers has the correct evaluate() function.
from run_language_modeling import evaluate

### Mount your Google Drive
Checkpoints will be saved to a folder on your Google Drive.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

### Download the CORD-19 data.
You need to prepare the CORD-19 data yourself (we cannot share it for legal reasons) and split it into training, development, and test sets.
As a placeholder, the cell below downloads a corpus of presidential speeches so that you can try out the pipeline.
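
One possible way to produce the three splits is sketched below. It is a minimal example, not the procedure used for this notebook: the helper names `split_documents` and `write_split`, the 80/10/10 ratio, and the placeholder documents are all assumptions made for illustration.

```python
import random


def split_documents(documents, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the documents and split them into train, validation and test lists."""
    docs = list(documents)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(docs)
    n_valid = int(len(docs) * valid_fraction)
    n_test = int(len(docs) * test_fraction)
    valid = docs[:n_valid]
    test = docs[n_valid:n_valid + n_test]
    train = docs[n_valid + n_test:]
    return train, valid, test


def write_split(path, documents):
    """Write one document per line to a plain-text file."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(doc.strip() + "\n")


# Placeholder documents (hypothetical); replace with your own CORD-19 texts.
documents = ["document number {}".format(i) for i in range(100)]
train, valid, test = split_documents(documents)

# Example usage (hypothetical file name):
# write_split("/content/cord19_train.txt", train)
```

Splitting at the document level rather than the line level keeps text from a single paper from appearing in both the training and the evaluation sets.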



In [0]:
# Download the data.
# Credits: http://interactive-fiction-class.org/staff.html
!wget -nc -O /content/presidential_speeches_test.txt https://raw.githubusercontent.com/interactive-fiction-class/interactive-fiction-class.github.io/master/homeworks/language-model/presidential_speeches_test.txt
!wget -nc -O /content/presidential_speeches_valid.txt https://raw.githubusercontent.com/interactive-fiction-class/interactive-fiction-class.github.io/master/homeworks/language-model/presidential_speeches_valid.txt
!wget -nc -O /content/presidential_speeches_train.txt https://raw.githubusercontent.com/interactive-fiction-class/interactive-fiction-class.github.io/master/homeworks/language-model/presidential_speeches_train.txt

## Fine-tuning and Evaluation



### Launch fine-tuning


In [0]:
# Fine-tune SciBERT on your custom text data.
!python run_language_modeling.py \
    --output_dir='/content/drive/My Drive/finetuned_models/SciBERT-finetuned' \
    --model_type=bert \
    --model_name_or_path='/content/scibert_scivocab_uncased/' \
    --mlm \
    --save_total_limit=5 \
    --num_train_epochs=3.0 \
    --do_train \
    --evaluate_during_training \
    --logging_steps=500 \
    --save_steps=500 \
    --train_data_file=/content/presidential_speeches_train.txt \
    --do_eval \
    --eval_data_file=/content/presidential_speeches_valid.txt \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2 \
    --overwrite_output_dir \
    --block_size=512 \
    --gradient_accumulation_steps=5
# Not sure if adding explicit tokenization has an effect.
#--tokenizer_name=bert-base-uncased \
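
With a per-GPU batch size of 2 and 5 gradient accumulation steps, the optimizer is updated only once every 5 batches. The sketch below works out the resulting effective batch size. It assumes the script multiplies the per-GPU batch size by the number of GPUs and by the accumulation steps; the helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps, n_gpu=1):
    """Number of examples that contribute to a single optimizer update."""
    # max(1, n_gpu) covers CPU-only runs, where n_gpu is 0.
    return per_gpu_batch_size * max(1, n_gpu) * gradient_accumulation_steps


# Values taken from the command above, assuming a single Colab GPU.
batch = effective_batch_size(per_gpu_batch_size=2, gradient_accumulation_steps=5, n_gpu=1)
print("Effective batch size:", batch)
```

If you run out of GPU memory, you can lower `--per_gpu_train_batch_size` and raise `--gradient_accumulation_steps` by the same factor to keep the effective batch size unchanged.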

### Compute perplexity of a dataset.


#### Check which (fine-tuned) checkpoints are available.


In [0]:
!ls '/content/drive/My Drive/finetuned_models/SciBERT-finetuned/'
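
The same listing can be done in Python, which makes it easy to sort the checkpoints by training step (plain `ls` sorts them alphabetically, so `checkpoint-1000` comes before `checkpoint-500`). This is a minimal sketch; `list_checkpoints` is a hypothetical helper, and it assumes the checkpoint folders are named `checkpoint-<step>`.

```python
import os
import re


def list_checkpoints(output_dir):
    """Return the checkpoint directories in output_dir, sorted by training step."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    # Sort numerically by step, not alphabetically by name.
    return [path for _, path in sorted(found)]


# Example usage:
# list_checkpoints('/content/drive/My Drive/finetuned_models/SciBERT-finetuned')
```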

#### Helper functions

In [0]:
def load_model(args):
  """Creates a model and loads in weights for it."""
  config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=None)

  model = AutoModelWithLMHead.from_pretrained(
      args.model_name_or_path,
      from_tf=bool(".ckpt" in args.model_name_or_path),
      config=config,
      cache_dir=None
  )
  
  model.to(args.device)
  return model

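def set_seed(seed):
  """Set the random seed for Python, NumPy and PyTorch (CPU and all GPUs)."""
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  # Use the function argument rather than the global `args`,
  # which is only defined in a later cell.
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)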

def do_perplexity_eval(args, model, data_file_path):
  """Computes the perplexity of the text in data_file_path according to the provided model."""
  set_seed(args.seed)
  args.eval_data_file=data_file_path
  tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=None)
  args.block_size = min(args.block_size, tokenizer.max_len)
  result = run_language_modeling.evaluate(args, model, tokenizer, prefix="")
  return result
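
For reference, the sketch below shows how a perplexity value relates to the evaluation loss. It assumes that the script reports perplexity as the exponential of the mean evaluation loss; `perplexity_from_losses` and the loss values are hypothetical and only illustrate the arithmetic.

```python
import math


def perplexity_from_losses(batch_losses):
    """Perplexity as the exponential of the mean cross-entropy loss (natural log)."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# Hypothetical per-batch evaluation losses.
losses = [2.0, 2.5, 3.0]
ppl = perplexity_from_losses(losses)
print("Perplexity: {:.2f}".format(ppl))
```

With `--mlm`, the loss is computed only on the masked tokens, so the resulting score is best used to compare masked language models with each other rather than with left-to-right language models. Because the masking is random, fixing the seed before each evaluation keeps the scores comparable across checkpoints.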

#### Compare perplexity scores of the original and fine-tuned models.

In [0]:
import torch


# SciBERT evaluation.
# 1.) Path to one of your fine-tuned models.
CHECKPOINT_PATH = '/content/drive/My Drive/finetuned_models/SciBERT-finetuned/checkpoint-500'
# Uncomment for comparison with original SciBERT model.
#CHECKPOINT_PATH = '/content/scibert_scivocab_uncased/'
# Uncomment for comparison with base model.
#CHECKPOINT_PATH = 'bert-base-uncased'

# Set this to the list of text files you want to evaluate the perplexity of.
DATA_PATHS = ["/content/presidential_speeches_valid.txt",
              "/content/presidential_speeches_test.txt"]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
#print("Running on device: ", device)

args = dict(
  model_name_or_path=CHECKPOINT_PATH,
  output_dir=CHECKPOINT_PATH,
  block_size=512,
  local_rank=-1,
  eval_batch_size=2,
  per_gpu_eval_batch_size=2,
  n_gpu=n_gpu,
  mlm=True,
  mlm_probability=0.15,
  device=device,
  line_by_line=False,
  overwrite_cache=None,
  model_type='bert',
  seed=42,
)
args = DictToObj(args)

model = load_model(args)

for data_path in DATA_PATHS:
  eval_results = do_perplexity_eval(args, model, data_path)
  perplexity = eval_results['perplexity']
  print("\n")
  print('{} is the perplexity of {} according to {}'.format(
      perplexity, data_path, CHECKPOINT_PATH))
  print("\n")
