<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/infer-bigbird.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inference with 🤗's BigBird

In [1]:
%%capture
!pip3 install git+https://github.com/vasudevgupta7/transformers@add_bigbird_pegasus
!pip3 install sentencepiece

## 🤗's `BigBirdModel`

In [2]:
import torch
from transformers import BigBirdForQuestionAnswering, BigBirdTokenizer

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

In [3]:
model_id = "google/bigbird-base-trivia-itc"
model = BigBirdForQuestionAnswering.from_pretrained(model_id, block_size=16, num_random_blocks=3).to(device)
tokenizer = BigBirdTokenizer.from_pretrained(model_id)
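
The `block_size` and `num_random_blocks` arguments above configure BigBird's block-sparse attention. Block-sparse attention only pays off when the input is longer than the global, sliding-window, and random blocks combined. The sketch below is a hypothetical helper, not part of `transformers`. It assumes the threshold `(5 + 2 * num_random_blocks) * block_size`, which is how the Hugging Face implementation appears to decide when to fall back to full attention; treat the exact formula as an assumption and check it against your installed version.

```python
def min_tokens_for_block_sparse(block_size: int, num_random_blocks: int) -> int:
    """Hypothetical helper: assumed length threshold for block-sparse attention."""
    # Assumed breakdown: 2 global blocks + 3 sliding-window blocks
    # + 2 * num_random_blocks random blocks, each `block_size` tokens wide.
    return (5 + 2 * num_random_blocks) * block_size


def uses_block_sparse(seq_length: int, block_size: int, num_random_blocks: int) -> bool:
    """True if the sequence is strictly longer than the assumed threshold."""
    return seq_length > min_tokens_for_block_sparse(block_size, num_random_blocks)


threshold = min_tokens_for_block_sparse(block_size=16, num_random_blocks=3)
print(threshold)                       # 176
print(uses_block_sparse(256, 16, 3))   # True
```

Under this assumption, the 256-token inputs used in this notebook are long enough for the block-sparse path with `block_size=16` and `num_random_blocks=3`.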

In [4]:
context = "🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset"

In [5]:
def get_answer(question, context):
    # Tokenize the (question, context) pair, padded/truncated to a fixed 256 tokens
    encoding = tokenizer(question, context, return_tensors="pt", max_length=256, padding="max_length", truncation=True)
    input_ids = encoding.input_ids.to(device)
    attention_mask = encoding.attention_mask.to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        start_scores, end_scores = model(input_ids=input_ids, attention_mask=attention_mask).to_tuple()

    # Take the most likely start and end positions using `argmax`
    start_index = torch.argmax(start_scores).item()
    end_index = torch.argmax(end_scores).item()

    # Decode the token ids in the selected span (empty if end_index < start_index)
    answer_ids = encoding["input_ids"][0, start_index : end_index + 1].tolist()
    answer = tokenizer.decode(answer_ids)

    return answer

In [6]:
question = "How many pretrained models are available in 🤗 Transformers?"
get_answer(question, context)

'32'
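
`get_answer` picks the start and end positions with two independent `argmax` calls, so nothing stops the predicted end from landing before the predicted start, which yields an empty answer. A common alternative is to score every valid `(start, end)` pair jointly. The sketch below illustrates that idea on plain Python lists; `best_span` and `max_answer_len` are hypothetical names introduced here, not part of `transformers`, and the scores are toy values.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair maximizing start_scores[start] + end_scores[end],
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy scores: independent argmax would give start=3, end=1 (an empty span).
toy_start = [0.1, 2.0, 0.3, 2.5]
toy_end = [0.2, 3.0, 1.0, 0.4]
print(best_span(toy_start, toy_end))  # (1, 1)
```

With the tensors from `get_answer`, the inputs would be `start_scores[0].tolist()` and `end_scores[0].tolist()`. A fuller version would probably also exclude question and padding positions, which this sketch does not attempt.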

## 🤗's `BigBirdPegasus`

In [7]:
import torch
from transformers import BigBirdPegasusForConditionalGeneration, BigBirdPegasusTokenizer

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

In [8]:
model_id = "vasudevgupta/bigbird-pegasus-large-pubmed"
model = BigBirdPegasusForConditionalGeneration.from_pretrained(model_id, block_size=16, num_random_blocks=3).to(device)
tokenizer = BigBirdPegasusTokenizer.from_pretrained(model_id)

In [9]:
article = """There have been a number of interesting attempts, that were aimed at alleviating the quadratic dependency of Transformers, which can broadly categorized into two directions. First line of work embraces the length limitation and develops method around it. Simplest methods in this category just employ sliding window [93], but in general most work fits in the following general paradigm: using some other mechanism select a smaller subset of relevant contexts to feed in the transformer and optionally iterate, i.e. call transformer block multiple time with different contexts each time. Most prominently, SpanBERT [42], ORQA [54], REALM [34], RAG [57] have achieved strong performance for different tasks. However, it is worth noting that these methods often require significant engineering efforts (like back prop through large scale nearest neighbor search) and are hard to train. Second line of work questions if full attention is essential and have tried to come up with approaches that do not require full attention, thereby reducing the memory and computation requirements. Prominently, Dai et al. [21], Sukhbaatar et al. [82], Rae et al. [74] have proposed auto-regresive models that work well for left-to-right language modeling but suffer in tasks which require bidirectional context. Child et al. [16] proposed a sparse model that reduces the complexity to O(net al. [49] further reduced the complexity to O(n log(n)) by using LSH to compute nearest neighbors. Ye et al. [103] proposed binary partitions of the data where as Qiu et al. [73] reduced complexity by using block sparsity. Recently, Longformer [8] introduced a localized sliding window based mask with few global mask to reduce computation and extended BERT to longer sequence based tasks. Finally, our work is closely related to and built on the work of Extended Transformers Construction [4]. This work was designed to encode structure in text for transformers. 
The idea of global tokens was used extensively by them to achieve their goals. Our theoretical work can be seen as providing a justification for the success of these models as well. It is important to note that most of the aforementioned methods are heuristic based and empirically are not as versatile and robust as the original transformer, i.e. the same architecture do not attain SoTA on multiple standard benchmarks. (There is one exception of Longformer which we include in all our comparisons, see App. E.3 for a more detailed comparison). Moreover, these approximations do not come with theoretical guarantees."""

In [10]:
inputs = tokenizer(article, max_length=512, padding="max_length", return_tensors="pt", truncation=True)
inputs = {k: inputs[k].to(device) for k in inputs}

In [11]:
outputs = model.generate(**inputs, max_length=256, num_beams=8, length_penalty=0.8)

In [12]:
tokenizer.batch_decode(outputs)

['<s> computer science has seen a tremendous growth over the past two decades.<n> one of the major forces driving this growth is advances in the field of robotics, computer aided design ( cad ), and artificial neural networks ( ann ). in the past decade, advances in the field of quantum mechanics, especially in the area of nanoelectronics, have also contributed to the growth of this field.<n> quantum mechanics has attracted a lot of attention in the past decade.<n> this has led to many attempts at approximations, many of which are discussed in this paper.<n> quantum mechanics has moved from the realm of experiment to the realm of theory and now to the realm of applications.<n> quantum mechanics has moved from the realm of experiment to the realm of theory and now to the realm of applications.<n> quantum mechanics has also moved from the domain of experiment to the realm of theory and now to the realm of application.']
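
The decoded string still contains marker tokens such as `<s>` and `<n>`. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` drops registered special tokens; whether `<n>` is among them depends on the tokenizer. As a minimal post-processing sketch, the hypothetical helper below (`clean_summary` is not part of `transformers`) strips the markers by plain string replacement, assuming `<n>` stands for a line break as in the Pegasus vocabulary.

```python
def clean_summary(text: str) -> str:
    """Hypothetical helper: strip marker tokens and turn <n> into line breaks."""
    # Drop sequence and padding markers if they are present.
    for marker in ("<s>", "</s>", "<pad>"):
        text = text.replace(marker, "")
    # Assumption: <n> marks a line break in the generated summary.
    lines = [line.strip() for line in text.split("<n>")]
    return "\n".join(line for line in lines if line)


raw = "<s> first sentence.<n> second sentence.</s><pad>"
print(clean_summary(raw))
# first sentence.
# second sentence.
```

Applied to this notebook, it could be called as `clean_summary(tokenizer.batch_decode(outputs)[0])`.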