<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird-intuition/blob/main/notebooks/bigbird_narrativeqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `BigBird`

Let's explore how to use the `BigBird` model with the existing [`EncoderDecoderModel`](https://huggingface.co/transformers/model_doc/encoderdecoder.html).

By the end of this tutorial, you will have an idea about:
* How to use the `BigBird` model for any task.
* How 🤗 handles your end-to-end integration (weights loading / saving, training, inference) in transformers.
* How to use 🤗 datasets, Hub, & transformers (obviously!).
* How awesome 🤗 is.

**Note:** This is just an experiment: I am putting `BigBird` in both the encoder and the decoder, so I'm not sure how well it's going to perform. Let's see how it works 🧐.

Check out my [LinkedIn](https://www.linkedin.com/in/vasudevgupta7/), [GitHub](https://github.com/vasudevgupta7), [Twitter](https://twitter.com/7vasudevgupta) if you want to know what I do.

Discussions about `BigBird` are *welcome* through this [repo](https://github.com/vasudevgupta7/bigbird-intuition). Feel free to check out my recent [post](https://github.com/vasudevgupta7/bigbird-intuition) on BigBird.

## Basic Setup

In [1]:
# do remember to link gdrive else you won't be able to save your weights

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd /content/drive/MyDrive

/content/drive/MyDrive


In [None]:
# BigBird tokenizer relies on sentencepiece, so we need to install it first

!pip install datasets
!pip install git+https://github.com/vasudevgupta7/transformers.git@add_big_bird
!pip install sentencepiece
!pip install wandb

## Dataset Preparation

In [4]:
from datasets import load_dataset

We will use the [`narrative-qa dataset`](https://huggingface.co/datasets/narrativeqa_manual) and finetune BigBird for abstractive question answering. This dataset requires a manual download, so you will need to run the next cell for that.

It's gonna take some time (~10 mins) 🙁.

In [5]:
# this will download the narrative-qa stories into `narrativeqa/tmp`
!git clone https://github.com/deepmind/narrativeqa --branch master && sh narrativeqa/download_stories.sh

fatal: destination path 'narrativeqa' already exists and is not an empty directory.


In [6]:
# this may take up to 5 minutes

dataset = load_dataset("narrativeqa_manual", data_dir="narrativeqa/tmp")
dataset

Using custom data configuration default-data_dir=narrativeqa%2Ftmp



Downloading and preparing dataset narrativeqa_manual/default (download: 21.59 MiB, generated: 12.10 GiB, post-processed: Unknown size, total: 12.13 GiB) to /root/.cache/huggingface/datasets/narrativeqa_manual/default-data_dir=narrativeqa%2Ftmp/1.0.0/c57377ffa4fc72b25bf692f6676b140db5a36a7d36a56a891e11274afb40a6ba...


Dataset narrativeqa_manual downloaded and prepared to /root/.cache/huggingface/datasets/narrativeqa_manual/default-data_dir=narrativeqa%2Ftmp/1.0.0/c57377ffa4fc72b25bf692f6676b140db5a36a7d36a56a891e11274afb40a6ba. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['document', 'question', 'answers'],
        num_rows: 32747
    })
    test: Dataset({
        features: ['document', 'question', 'answers'],
        num_rows: 10557
    })
    validation: Dataset({
        features: ['document', 'question', 'answers'],
        num_rows: 3461
    })
})

In [7]:
tr_dataset = dataset["train"]
val_dataset = dataset["validation"]
tr_dataset, val_dataset

(Dataset({
     features: ['document', 'question', 'answers'],
     num_rows: 32747
 }), Dataset({
     features: ['document', 'question', 'answers'],
     num_rows: 3461
 }))

In [8]:
# rough token-count estimate: ~4 characters per token
# data = dataset["train"].map(lambda x: {"seqlen": (len(x["document"]["summary"]["text"]) + len(x["question"]["text"]))//4})
# data = data.map(lambda x: {"q_seqlen": len(x["question"]["text"])//4})

In [9]:
# let's decide whether we should use BigBird `block_sparse` attention or `original_full` attention
# min(data["seqlen"]), sum(data["seqlen"])/len(data["seqlen"]), max(data["seqlen"])

# Since avg seqlen < 1024, we should use `original_full`, but for the purpose of demonstrating `block_sparse`, let's try `block_sparse` attention only

In [10]:
# let's see the seqlen of the questions. This may give some idea about the value of `block_size`
# min(data["q_seqlen"]), sum(data["q_seqlen"])/len(data["q_seqlen"]), max(data["q_seqlen"])

# let's take `block_size=64`
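
The commented-out cells above approximate token counts as one token per four characters and then look at the minimum, mean and maximum. Here is a minimal, self-contained sketch of that heuristic. `estimate_tokens` and `length_stats` are hypothetical helper names, the strings are toy data, and the four-characters-per-token ratio is only a rule of thumb, not a property of the BigBird tokenizer.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token-count estimate: assumes ~4 characters per token (a heuristic)."""
    return len(text) // chars_per_token


def length_stats(texts, chars_per_token=4):
    """Return (min, mean, max) of the estimated token counts."""
    lengths = [estimate_tokens(t, chars_per_token) for t in texts]
    return min(lengths), sum(lengths) / len(lengths), max(lengths)


# toy strings standing in for "summary text + question text"
samples = ["a" * 400, "b" * 2000, "c" * 4000]
print(length_stats(samples))
```

For exact numbers you would tokenize the text with the real tokenizer instead, which is slower but removes the guesswork.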

## Training BigBird

We will be using BigBird in both the encoder & decoder (let's call it BigBird2BigBird, maybe). This is not a new idea; it was introduced in this [paper](https://arxiv.org/abs/1907.12461). One of the experiments in that paper puts BERT in both the encoder & decoder and trains this architecture after adding a randomly initialized `cross_attention_layer`.

Let's start step by step:
* Set up BigBird on the encoder side. We will have block sparse attention here with `num_random_blocks=3` & `block_size=64`.
* Set up BigBird on the decoder side. Here we will have normal attention (`original_full`), as it has to be autoregressive and there are too few target tokens for `block_sparse` to make sense.
* Connect the encoder and decoder with 🤗 `EncoderDecoderModel` to be able to do abstractive question answering.
* Set up the tokenizer for converting text into numbers which our model can take. We will simply use the BigBird tokenizer.

In [None]:
from transformers import EncoderDecoderModel, BigBirdModel, BigBirdForCausalLM, BigBirdTokenizer
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
import wandb


model_id = "google/bigbird-roberta-base"

encoder = BigBirdModel.from_pretrained(model_id, block_size=64, num_random_blocks=3, attention_type="block_sparse")
decoder = BigBirdForCausalLM.from_pretrained(model_id, add_cross_attention=True, is_decoder=True)
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)

tokenizer = BigBirdTokenizer.from_pretrained(model_id)

In [None]:
SRC_MAXLEN = 832
TGT_MAXLEN = 32
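
As a sanity check on these lengths: to my understanding of the 🤗 BigBird implementation, `block_sparse` attention is only used when the sequence is longer than `(5 + 2 * num_random_blocks) * block_size` tokens, and shorter inputs fall back to `original_full`. The sketch below encodes that rule. `uses_block_sparse` is a hypothetical helper, not a library function, and the threshold is an assumption worth checking against the version you install.

```python
def uses_block_sparse(seq_len, block_size=64, num_random_blocks=3):
    """Assumed rule: block-sparse attention needs more than
    (5 + 2 * num_random_blocks) * block_size tokens."""
    min_tokens = (5 + 2 * num_random_blocks) * block_size
    return seq_len > min_tokens


print(uses_block_sparse(832))  # encoder inputs padded to SRC_MAXLEN
print(uses_block_sparse(128))  # a much shorter input
```

With `block_size=64` and `num_random_blocks=3` the threshold works out to 704 tokens, so inputs padded to 832 should stay on the sparse path.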

I like to set up a `collate_fn` for tokenization. I feel it's easier 😌. The next cell does simple tokenization.

In [26]:
def collate_fn(features):

  context = [x["document"]["summary"]["text"] for x in features]
  question = [x["question"]["text"] for x in features]
  answer = [x["answers"][0]["text"] for x in features]

  # keep the special tokens: question and context should have `SEP` between them
  inputs = tokenizer(question, context, return_tensors="pt", padding="max_length", truncation=True, max_length=SRC_MAXLEN)
  labels = tokenizer(answer, return_tensors="pt", padding=True, truncation=True, max_length=TGT_MAXLEN)

  return {
      "input_ids": inputs.input_ids,
      "attention_mask": inputs.attention_mask,
      "decoder_input_ids": labels.input_ids,
      "labels": labels.input_ids,
      "decoder_attention_mask": labels.attention_mask,
  }

You might be wondering why I am feeding the same data to `labels` & `decoder_input_ids`. **Well!** This is because, while calculating the loss, 🤗 removes the 1st token from `labels` and the last position from the `decoder_input_ids` side, so every position is trained to predict the next token.
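
To make the shift concrete, here is a small pure-Python sketch of how a causal language-modelling loss typically pairs inputs with targets. The token ids are made up, `shift_for_causal_lm` is a hypothetical helper, and `-100` is the value PyTorch's cross-entropy loss ignores by default. Note that the `collate_fn` above does not mask padding this way; the masking step is shown only as an optional extra.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def shift_for_causal_lm(token_ids, pad_id):
    """Pair each decoder input position with the token it should predict."""
    decoder_inputs = token_ids[:-1]  # the last token is never fed as input
    targets = token_ids[1:]          # the first token is never predicted
    # optionally hide padding positions from the loss
    targets = [IGNORE_INDEX if t == pad_id else t for t in targets]
    return decoder_inputs, targets


# made-up ids: 1 = start token, 2 = end token, 0 = padding
inputs, targets = shift_for_causal_lm([1, 101, 102, 2, 0, 0], pad_id=0)
print(inputs)
print(targets)
```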

In [27]:
# wandb is just awesome. Let's set it up ..

%env WANDB_PROJECT = 'BigBird-narrative-qa'
wandb.login()



env: WANDB_PROJECT='BigBird-narrative-qa'


True

In [None]:
args = Seq2SeqTrainingArguments(
    output_dir="bigbird2bigbird-narrative-qa",
    overwrite_output_dir=False,
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    # eval_steps=4000,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    num_train_epochs=10,
    logging_strategy="steps",
    logging_steps=4000,
    save_strategy="epoch",
    run_name="bigbird2bigbird-narrative-qa-experiment1",
    disable_tqdm=False,
    load_best_model_at_end=True,
    report_to="wandb",
    remove_unused_columns=False,
    fp16=True,
)

It's very important to keep `remove_unused_columns=False`; otherwise 🤗 `Trainer` will delete all the columns, because we are tokenizing inside `collate_fn` rather than on the `Dataset`.
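
A rough sketch of why the columns would disappear: by default `Trainer` keeps only the dataset columns whose names match arguments of the model's `forward` method. The toy `forward` and the helper `unused_columns` below are hypothetical stand-ins for illustration, not the actual 🤗 code.

```python
import inspect


def forward(input_ids=None, attention_mask=None, decoder_input_ids=None, labels=None):
    """Toy stand-in for a model's forward signature."""


def unused_columns(dataset_columns, forward_fn):
    """Columns that do not match any argument name of `forward_fn`."""
    accepted = set(inspect.signature(forward_fn).parameters)
    return [c for c in dataset_columns if c not in accepted]


# the raw narrative-qa columns match none of the forward arguments
print(unused_columns(["document", "question", "answers"], forward))
```

Since none of the raw columns survive that filter, the `collate_fn` would receive empty examples unless the filtering is switched off.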

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=collate_fn,
    train_dataset=tr_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
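
Before launching training it can help to estimate how many optimizer updates the arguments above imply. The sketch below is a back-of-the-envelope calculation assuming a single device. `updates_per_epoch` is a hypothetical helper, and the exact count `Trainer` reports may differ slightly.

```python
import math


def updates_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Approximate optimizer updates per epoch (single device by default)."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(batches // grad_accum_steps, 1)


# 32747 training rows, batch size 4, gradient accumulation 2, 10 epochs
steps = updates_per_epoch(num_examples=32747, per_device_batch_size=4, grad_accum_steps=2)
print(steps, steps * 10)
```

Under these assumptions one epoch is a little over 4000 updates, so `logging_steps=4000` gives roughly one logged loss value per epoch.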

Time to train. Let's do it. Quite simple, right? All thanks to 🤗.

In [None]:
trainer.train()

In [None]:
wandb.finish()

## Inference Time

In [None]:
import torch

def get_answer(question, context):
    # tokenize the same way as in training: question first, then context
    encoding = tokenizer(question, context, return_tensors="pt", max_length=128, padding="max_length", truncation=True)

    with torch.no_grad():
        # the model is an encoder-decoder, so the answer is generated token by token;
        # this assumes targets of the form `[CLS] answer [SEP]`, as produced in `collate_fn`
        output_ids = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            decoder_start_token_id=tokenizer.cls_token_id,
            eos_token_id=tokenizer.sep_token_id,
            pad_token_id=tokenizer.pad_token_id,
            max_length=32,  # same value as TGT_MAXLEN used for training
        )

    # drop the special tokens and turn the generated ids back into text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [None]:
# model_id = "vasudevgupta/bigbird2bigbird-narrative-qa"

# model = EncoderDecoderModel.from_pretrained(model_id, encoder_block_size=16, encoder_num_random_blocks=3)
# tokenizer = BigBirdTokenizer.from_pretrained(model_id)

In [None]:
# context = "🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset"
# question = "What is 🤗 transformers?"

# get_answer(question, context)

We have finally reached the end. I hope you liked it. You are now ready to use BigBird for all your tasks. This was the first tutorial on using 🤗 `BigBirdModel` for finetuning.