Created by: c00k1ez (https://github.com/c00k1ez)


Transformer - is a powerful architecture and it shows state of the art resutls in many seq2seq tasks, like NMT, summarization, and especially for language modeling. Although, the other important feature that transformers also are good at text classification.

# Quora question pairs classification with BERT

 Paraphrase detection is challenging NLP problem of detecting whether multiple phrases have the same meaning.
 In this notebook, we are going to build a baseline solution for an unusual classification task.

 For token embeddings we are going to use BERT model. Read more [here](http://jalammar.github.io/illustrated-bert/) and [here](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270).


In [1]:
!pip install transformers



In [2]:
import torch
import torch.nn.functional as F

import transformers
import numpy as np
import random

In [3]:
def seed_all(seed: int) -> None:
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    random.seed(seed)

In [4]:
seed_all(42)

In [5]:
config = {
    'model_name': 'distilbert-base-cased',
    'pad_len': 256,
    'batch_size': 32,
    'lr': 5e-5
}

The simplest way to use BERT without enough computation resources is a distilled version of this model - DistilBert by HuggingFace.
[Blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) about DistilBert and [distillation](https://blog.floydhub.com/knowledge-distillation/).

Models, that you can use too:
* `BertModel`
* `TransfoXLModel`
* `XLNetModel`
* `ElectraModel`
* `RobertaModel`
* `XLMRobertaModel`
* `AlbertModel`  
etc.

_Warning!_ The models will be downloaded from the Internet. Their size could be from 100 Mb to 1-2Gb.

In [6]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained(config['model_name'], add_special_tokens=False)
bert_model = DistilBertModel.from_pretrained(config['model_name'])

for p in bert_model.parameters():
    p.require_grad = False

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/263M [00:00<?, ?B/s]

In [7]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv .

--2024-04-11 15:08:25--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 162.159.153.247, 162.159.152.17
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.153.247|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv [following]
--2024-04-11 15:08:25--  https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|162.159.153.247|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2024-04-11 15:08:26 (162 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]

--2024-04-11 15:08:26--  http://./
Resolving . (.)... failed: No address associated with hostname.
wget: unable to resolve host address ‘.’
FINISHED --2024-04-11 15:08:26--
Total wall clock 

In [8]:
from typing import List, Tuple, Dict
import csv
import math
import os
import urllib.request

In [9]:
class DataParser:
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path
        self.question_pairs = self._read_file_()

    def _read_file_(self) -> List[Tuple[str, str, int]]:
        data = []
        with open(self.file_path, 'r', encoding='utf-8') as csvfile:
            reader = csv.DictReader(csvfile, delimiter="\t")
            for row in reader:
                data.append((row['question1'], row['question2'], row['is_duplicate']))
        return data

    def train_test_split(self, train_part: float = 0.9) -> Tuple[List[Tuple[str, str, int]], List[Tuple[str, str, int]]]:
        data_len = len(self.question_pairs)
        print(type(self.question_pairs))
        train = self.question_pairs[:int(train_part * data_len)]
        test = self.question_pairs[int(train_part * data_len):]

        return train, test

In [10]:
parser = DataParser('quora_duplicate_questions.tsv')
train, test = parser.train_test_split()

<class 'list'>


In [11]:
len(train), len(test)

(363861, 40429)

In [13]:
train[1]

('What is the story of Kohinoor (Koh-i-Noor) Diamond?',
 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',
 '0')

In [15]:
train = train[:10000]
test = test[:3000]

In [16]:
from collections import Counter

print('Our dataset contains {} question pairs.'.format(len(train)))

tokens = []
for sample in parser.question_pairs:
    sample = tokenizer.tokenize(sample[0] + ' ' + sample[1])
    tokens.extend(sample)
counter = Counter(tokens)
print('There are {} unique tokens in dataset and {} tokens at all.'.format(len(counter), sum([v for _,v in dict(counter).items()])))
print('Most common 10 tokens:')
for token, freq in counter.most_common(10):
    print('{} : {}'.format(token, freq))

Our dataset contains 10000 question pairs.
There are 25829 unique tokens in dataset and 11420857 tokens at all.
Most common 10 tokens:
? : 852404
the : 373060
What : 308204
I : 228568
is : 217604
a : 214102
to : 206235
How : 202390
in : 199317
of : 159551


In [17]:
class PairsDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer, pad_len):
        self.data = data
        self.tokenizer = tokenizer
        self.pad_len = pad_len

    def __len__(self):
        return len(self.data)

    def _check_question_len_(self, question: List[str], max_len: int):
        if len(question) > max_len:
            question = question[:max_len]
        return question

    def __getitem__(self, indx):
        sample = self.data[indx]
        question1 = sample[0]
        question2 = sample[1]
        label = int(sample[2])
        question1 = self.tokenizer.tokenize(question1)
        question2 = self.tokenizer.tokenize(question2)
        # use first half of tokens for question1 and  another half for question2
        question1 = self._check_question_len_(question1, (self.pad_len - 2) // 2)
        question2 = self._check_question_len_(question2, (self.pad_len - 2) // 2)
        sample = ['[CLS]'] + question1 + ['[SEP]'] + question2
        attn_mask = [1] * len(sample) + [0] * (self.pad_len - len(sample))
        sample = sample + ['[PAD]'] * (self.pad_len - len(sample))
        sample = self.tokenizer.convert_tokens_to_ids(sample)
        assert len(sample) == len(attn_mask) == self.pad_len
        sample = torch.LongTensor(sample)
        attn_mask = torch.LongTensor(attn_mask)
        label = torch.LongTensor([label])
        return {
            'question_pairs': sample,
            'attention_mask': attn_mask,
            'label': label
        }

In [18]:
train_dataset = PairsDataset(train, tokenizer, config['pad_len'])
test_dataset = PairsDataset(test, tokenizer, config['pad_len'])

train_loader = torch.utils.data.DataLoader(train_dataset, config['batch_size'], shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, config['batch_size'], shuffle=False)

In [19]:
class MeanPooling(torch.nn.Module):
    def __init__(self):
        super(MeanPooling, self).__init__()

    def forward(self, batch, mask=None):
        return batch.mean(dim=1)

class BertClassifier(torch.nn.Module):
    def __init__(self, bert_model):
        super(BertClassifier, self).__init__()
        self.bert_model = bert_model
        self.head = torch.nn.Linear(768, 2)
        self.dropout = torch.nn.Dropout()
        self.pooling = MeanPooling()

    def forward(self, batch):
        samples = batch['question_pairs']
        attn_mask = batch['attention_mask']
        embedding = self.bert_model(samples, attention_mask=attn_mask)[0]
        # print(embedding.shape)
        pooled = self.pooling(embedding)
        # print(pooled.shape)
        pooled = self.dropout(pooled)
        pooled = self.head(pooled)
        return pooled

In [20]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertClassifier(bert_model).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
criterion = torch.nn.CrossEntropyLoss()

First of all, you have to implement `validation` method to validate with the macro F1 score.
## How to improve this model results?
* Play with other models instead of DistilBert: classic BERT, ALBERT, RoBERTa, TinyBERT, etc.
* Implement correct MeanPooling/MaxPooling layer (notice that you have `[PAD]` tokens during training and you have to "exclude" them from mean or max value computing).
* Use [more complex aggregation](https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently) of token embeddings.
* Use more complex model after BERT embeddings.
* You can try to use the siamese network to encode the first and second questions independently with metric learning. Read more about it [here](https://parajain.github.io/metric_learning_tutorial/).

In [None]:
from sklearn.metrics import f1_score

def validation(model, test_loader, device):
    model.eval()
    avg_val_loss = []
    avg_val_loss_value = -1.0
    y_true = []
    y_pred = []
    ################### INSERT YOUR CODE HERE ###################

    ################### INSERT YOUR CODE HERE ###################
    model.train()
    return avg_val_loss_value, f1_score(y_true, y_pred, average='macro')

In [23]:
from tqdm.auto import tqdm

In [24]:
def train_epoch(model, train_loader, test_loader, optimizer, epoch_num, device, criterion, log_interval=200):
    losses = []
    avg_loss = []
    step = 1
    for batch in tqdm(train_loader, total=len(train_loader)):
        optimizer.zero_grad()
        for key in batch.keys():
            batch[key] = batch[key].to(device)
        label = batch['label'].view(-1)
        logits = model(batch)
        loss = criterion(logits, label)
        avg_loss.append(loss.detach().item())
        if step % log_interval == 0:
            val_loss = sum(avg_loss) / len(avg_loss)
            losses.append(val_loss)
            avg_loss = []
            print('epoch {}\t[{}/{}]\ttrain_loss = {:.4f}'.format(epoch_num, step, len(train_loader), val_loss))
        loss.backward()
        optimizer.step()
        step += 1
    return losses

In [25]:
EPOCHS = 5
losses = []
for epoch in range(EPOCHS):
    losses = train_epoch(model, train_loader, None, optimizer, epoch, device, criterion)

  0%|          | 0/313 [00:00<?, ?it/s]

epoch 0	[200/313]	train_loss = 0.4074


  0%|          | 0/313 [00:00<?, ?it/s]

epoch 1	[200/313]	train_loss = 0.1851


  0%|          | 0/313 [00:00<?, ?it/s]

KeyboardInterrupt: 

# Dialogue generation with GPT2

Our next task will be try out the text generation abilities of transformers. In this notebook we are going to work with GPT2. This is a model from OpenAI, which showed state of the art results for language modeling in 2019. You can read their original blogpost [here](https://openai.com/blog/better-language-models/).  

Let's consider an interesting application of GPT2 model - dialogue generation. Describe this task a bit clearer - we have some context, for example, user question and our model have to generate a relevant answer.  
How we can train a model for it? First of all, for input, we need to use special tokens to mark context and model answer, like `[CONTEXT] some context [ANSWER] model answer`. Then there are two possible ways:
* train it like classic autoregressive LM,
* train it like seq2seq LM. Read more [here](https://arxiv.org/abs/1905.03197).  

You can read more about GPT2 [here](http://jalammar.github.io/illustrated-gpt2/) and [here](https://towardsdatascience.com/openai-gpt-2-understanding-language-generation-through-visualization-8252f683b2f8).  
[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for GPT2.

| ![seq2seq lm](https://drive.google.com/uc?export=view&id=1NxS-O0Tto2rcFrALhpUBbywriyKlSTL4) |
|:--:|
| *seq2seq LM* |

In [26]:
import torch

import transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
import os
import urllib.request

In [27]:
seed_all(42)

In [28]:
params_config = {
    'pad_len': 100,
    'train_batch_size': 10,
    'model_name': 'gpt2',
    'lr': 5e-5,
    'residual_dropout': 0.1,
    'embedding_dropout': 0.1,
    'attention_dropout': 0.1
}

We are going to use the smallest GPT2 model - it has 124M trainable parameters and requires 500 Mb of disk space.

In [29]:
config = GPT2Config.from_pretrained(params_config['model_name'])
config.resid_pdrop = params_config['residual_dropout']
config.attn_pdrop = params_config['attention_dropout']
config.embd_pdrop = params_config['embedding_dropout']

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = GPT2LMHeadModel.from_pretrained(params_config['model_name'], config=config).to(device)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

**Important moment**: we have to add special tokens: `[CONTEXT]` and `[ANSWER]` to tokenizer, then resize model embeddings.

Few words about tokenizer.
GPT2, like some other models, uses [Byte-Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/) with special tokens in vocabulary.  

All tokenizers from `transformers` have unified structure and same methods, so we are going to use a few methods:
* `tokenizer.tokenize` to split string unto list of tokens,
* `tokenizer.encode` to transform a string into token indexes,
* `tokenizer.decode` to transform a list of ids to the string.

In [30]:
tokenizer = GPT2Tokenizer.from_pretrained(params_config['model_name'])
tokenizer.add_special_tokens({'additional_special_tokens': ['[CONTEXT]', '[ANSWER]']})
model.resize_token_embeddings(len(tokenizer))

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Embedding(50259, 768)

Now we can consider our dataset a bit closer.

In [31]:
class AppURLopener(urllib.request.FancyURLopener):
        version = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36"

class Dialogue:
    def __init__(self, raw_dialog: str) -> None:
        self.raw_dialog = raw_dialog
        self.sentencies = self._parse_sententices_()

    def _parse_sententices_(self) -> List[str]:
        sentencies = self.raw_dialog.split('\n')
        sentencies = [sentence.replace('\n', '') for sentence in sentencies if len(sentence) > 1]
        return sentencies

    def get_pairs(self) -> List[Dict[str, str]]:
        pairs = []
        for ind in range(len(self.sentencies) - 1):
            pairs.append({
                'context': self.sentencies[ind],
                'answer': self.sentencies[ind + 1]
            })
        return pairs


class DataParser:
    def __init__(self, file_path: str) -> None:
      self.file_path = file_path
      self.download_data()
      self.all_pairs, self.dialogues = self._read_file_()

    def download_data(self):
        TWITTER_DATA_LINK = 'https://raw.githubusercontent.com/Phylliida/Dialogue-Datasets/master/TwitterLowerAsciiCorpus.txt'
        DIR = './data/'
        TWITTER_FILE_PATH = DIR + 'TwitterLowerAsciiCorpus.txt'
        if not os.path.isdir(DIR):
            os.mkdir(DIR)
        if os.path.exists(TWITTER_FILE_PATH):
           return
        urllib._urlopener = AppURLopener()
        urllib._urlopener.retrieve(TWITTER_DATA_LINK, TWITTER_FILE_PATH)

    def _read_file_(self) -> Tuple[List[Dict[str, str]], List[Dialogue]]:
        with open(self.file_path, 'r', encoding='utf-8') as f:
            raw_data = f.read().split('\n\n\n')
        dialogues = [Dialogue(dialog) for dialog in raw_data]
        all_pairs = []
        for dialog in dialogues:
            pairs = dialog.get_pairs()
            if len(pairs) > 0:
                all_pairs.extend(pairs)
        return all_pairs, dialogues

    def train_test_split(self, train_part: float = 0.7):
        data_len = len(self.all_pairs)
        train = self.all_pairs[:int(train_part * data_len)]
        test = self.all_pairs[int(train_part * data_len):]

        return train, test

In [32]:
parser = DataParser('./data/TwitterLowerAsciiCorpus.txt')
train, test = parser.train_test_split()

  urllib._urlopener = AppURLopener()


In [33]:
from collections import Counter

print('Our dataset contains {} context-answer pairs and unique {} dialogues'.format(len(parser.all_pairs), len(parser.dialogues)))

tokens = []
for sample in parser.all_pairs:
    sample = tokenizer.tokenize(sample['context'] + ' ' + sample['answer'])
    tokens.extend(sample)
counter = Counter(tokens)
print('There are {} unique tokens in dataset and {} tokens at all. Notice, that small GPT2 have vocabulary with 50k sub-words.'.format(len(counter), sum([v for _,v in dict(counter).items()])))
print('')
print('Most common 10 tokens:')
for token, freq in counter.most_common(10):
    print('{} : {}'.format(token, freq))

Our dataset contains 8574 context-answer pairs and unique 1983 dialogues
There are 10044 unique tokens in dataset and 234903 tokens at all. Notice, that small GPT2 have vocabulary with 50k sub-words.

Most common 10 tokens:
Ġi : 7202
. : 7196
Ġthe : 4561
Ġto : 4025
Ġyou : 3983
, : 3602
Ġa : 3373
Ġ : 3322
Ġit : 3220
Ġand : 2602


Now you can see that we have **really** small dataset for our "toy" task.

In [34]:
class DialogueDataset(torch.utils.data.Dataset):
    def __init__(self,
                 data: List[Dict[str, str]],
                 tokenizer: GPT2Tokenizer,
                 pad_len: int,
                 ):
        self.data = data
        self.tokenizer = tokenizer
        self.pad_len = pad_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indx):
        sample = self.data[indx]
        context, answer = sample['context'], sample['answer']

        context = self.tokenizer.encode(context)
        answer = self.tokenizer.encode(answer)

        cntx_token_id, answer_token_id = self.tokenizer.additional_special_tokens_ids
        sample = [self.tokenizer.bos_token_id] + \
                 [cntx_token_id] + context + \
                 [answer_token_id] + answer + \
                 [self.tokenizer.eos_token_id]
        assert len(sample) <= self.pad_len
        mask = [1] * len(sample) + [0] * (self.pad_len - len(sample))
        label = sample + [-100] * (self.pad_len - len(sample))
        sample = sample + [self.tokenizer.bos_token_id] * (self.pad_len - len(sample))

        sample = torch.LongTensor(sample)
        mask = torch.LongTensor(mask)
        label = torch.LongTensor(label)

        return {
            'sample': sample,
            'mask': mask,
            'label': label
        }

In [35]:
train_dataset = DialogueDataset(train, tokenizer, params_config['pad_len'])
train_loader = torch.utils.data.DataLoader(train_dataset, params_config['train_batch_size'], shuffle=True)

**Another important moment**: we are using AdamW optimizer from `transformers` package, **not** classic Adam and **not** AdamW from `torch.optim`!  
[Blogpost](https://www.fast.ai/2018/07/02/adam-weight-decay/) about AdamW.

In [36]:
optimizer = transformers.AdamW(model.parameters(), lr=params_config['lr'])



As you saw earlier, we have a small dataset, so it is quite hard to get a good result and do not overfit.


## How to improve this model results?
* Implement validation loop to calculate [perplexity](https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3).
* Find optimal `residual_dropout`, `embedding_dropout` and `attention_dropout` probabilities.
* Now just a previous sentence is used for training like context for answer. You can rewrite `Dialogue.get_pairs` method to sample one, two, three, or more sentences like context for answer.
* You can add a bit more regularizations, for example, throw random tokens from the sample, or swap answer and context with a small probability.
* Read about [BPE-dropout](https://arxiv.org/abs/1910.13267). It is hard to implement with `transformers`, so you can just read about this technique.

In [None]:
def validation(model, test_loader, device):
    ################### INSERT YOUR CODE HERE ###################

    ################### INSERT YOUR CODE HERE ###################
    pass

In [40]:
def train_epoch(model, loader, test_loader, optimizer, epoch_num, device, log_interval=100):
    losses = []
    avg_loss = []
    step = 1
    for batch in tqdm(loader, total=len(loader)):
        optimizer.zero_grad()
        input_ids, mask, label = batch['sample'], batch['mask'], batch['label']
        input_ids = input_ids.to(device)
        mask = mask.to(device)
        label = label.to(device)
        outputs = model(input_ids, attention_mask=mask, labels=label)
        loss, logits = outputs[:2]
        avg_loss.append(loss.detach().item())
        if step % log_interval == 0:
            val_loss = sum(avg_loss) / len(avg_loss)
            losses.append(val_loss)
            avg_loss = []
            print('epoch {}\t[{}/{}]\tloss = {:.4f}'.format(epoch_num, step, len(loader), val_loss))
        loss.backward()
        optimizer.step()
        step += 1
    return losses

In [41]:
EPOCHS = 5
losses = []
for epoch in range(EPOCHS):
    ep_losses = train_epoch(model, train_loader, None, optimizer, epoch, device)

  0%|          | 0/601 [00:00<?, ?it/s]

epoch 0	[100/601]	loss = 5.3243
epoch 0	[200/601]	loss = 4.6290
epoch 0	[300/601]	loss = 4.4021
epoch 0	[400/601]	loss = 4.2673
epoch 0	[500/601]	loss = 4.2395
epoch 0	[600/601]	loss = 4.1444


  0%|          | 0/601 [00:00<?, ?it/s]

epoch 1	[100/601]	loss = 3.9486
epoch 1	[200/601]	loss = 3.9425
epoch 1	[300/601]	loss = 3.8623
epoch 1	[400/601]	loss = 3.8852
epoch 1	[500/601]	loss = 3.8716
epoch 1	[600/601]	loss = 3.8476


  0%|          | 0/601 [00:00<?, ?it/s]

epoch 2	[100/601]	loss = 3.5717
epoch 2	[200/601]	loss = 3.5917
epoch 2	[300/601]	loss = 3.5625
epoch 2	[400/601]	loss = 3.5071
epoch 2	[500/601]	loss = 3.4225
epoch 2	[600/601]	loss = 3.4556


  0%|          | 0/601 [00:00<?, ?it/s]

epoch 3	[100/601]	loss = 3.0985
epoch 3	[200/601]	loss = 3.0907
epoch 3	[300/601]	loss = 3.0455
epoch 3	[400/601]	loss = 3.0270
epoch 3	[500/601]	loss = 2.9602
epoch 3	[600/601]	loss = 2.9564


  0%|          | 0/601 [00:00<?, ?it/s]

epoch 4	[100/601]	loss = 2.4665
epoch 4	[200/601]	loss = 2.4899
epoch 4	[300/601]	loss = 2.4208
epoch 4	[400/601]	loss = 2.3910
epoch 4	[500/601]	loss = 2.3750
epoch 4	[600/601]	loss = 2.3446


In [42]:
model = model.to(torch.device('cpu'))
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50259, bias=False)
)

In [43]:
def get_answer(text: str, model: GPT2LMHeadModel, tokenizer: GPT2Tokenizer):
    cntx_token_id, answer_token_id = tokenizer.additional_special_tokens_ids
    context = tokenizer.encode(text)
    context = [tokenizer.bos_token_id] + [cntx_token_id] + context + [answer_token_id]
    context = torch.LongTensor([context])
    ans = model.generate(input_ids=context, max_length=100, temperature=0.7, do_sample=True)[0][1:-1]
    return tokenizer.decode(ans)

In [44]:
get_answer("where are you?", model, tokenizer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"[CONTEXT] where are you? [ANSWER] i'm at the gym."