## Generation using GPT2. Opentalks.ai workshop notebook.

https://mset.space - ml platform

In [4]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.2.2-py3-none-any.whl (1.8 MB)
Collecting tokenizers==0.9.4
  Downloading tokenizers-0.9.4-cp38-cp38-manylinux2010_x86_64.whl (2.9 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.43.tar.gz (883 kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-py3-none-any.whl size=893258 sha256=2b52c4c46310e604e7c96931454c281c4fd76e95c65e6fa4fe2f8f857a382c9a
  Stored in directory: /home/jovyan/.cache/pip/wheels/7b/78/f4/27d43a65043e1b75dbddaa421b573eddc67e712be4b1c80677
Successfully built sacremoses
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.43 t

In [5]:
from tqdm import tqdm_notebook as tqdm

import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [6]:
weights_shortcut = 'gpt2'

tokenizer = GPT2Tokenizer.from_pretrained(weights_shortcut)
model = GPT2LMHeadModel.from_pretrained(weights_shortcut)


In [7]:
prompt_text = 'My name is Bob, I am did project with claims department'
encoded_prompt = tokenizer.encode(prompt_text, return_tensors="pt")
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]

In [8]:
encoded_prompt

tensor([[3666, 1438,  318, 5811,   11,  314,  716,  750, 1628,  351, 3667, 5011]])

In [19]:
device = torch.device('cuda:1') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
encoded_prompt = encoded_prompt.to(device)

In [20]:
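# Note: `temperature` and `top_p` only take effect when `do_sample=True`;
# without it, `generate` decodes greedily and ignores them.
encoded_result = model.generate(encoded_prompt,
                                eos_token_id=tokenizer.eos_token_id, max_length=50,
                                temperature=5, top_p=0.1, repetition_penalty=1.01,
                                bad_words_ids=bad_words_ids)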
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Frank lost his keys -> ~~~

The next time I was in the room, I saw a man with a big black hat. He was wearing a white hat and a black tie. He said, "I'm going to take you
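
With `do_sample` left at its default of `False`, `generate` decodes greedily, so `temperature` and `top_p` have no effect. The following is a minimal, framework-free sketch of what these two parameters do once sampling is enabled: temperature rescales the logits, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. The function names are illustrative and not part of `transformers`.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalise the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    token_ids = list(nucleus)
    weights = [nucleus[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

With a threshold as low as `top_p=0.1`, the nucleus usually contains only the single most likely token, so sampling behaves almost like greedy decoding regardless of the temperature.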


In [21]:
encoded_result[0]

tensor([17439,  2626,   465,  8251,  4613,   220,  4907,    93,   198,   198,
          464,  1306,   640,   314,   373,   287,   262,  2119,    11,   314,
         2497,   257,   582,   351,   257,  1263,  2042,  6877,    13,   679,
          373,  5762,   257,  2330,  6877,   290,   257,  2042,  9839,    13,
          679,   531,    11,   366,    40,  1101,  1016,   284,  1011,   345],
       device='cuda:1')

## Training

The training dataset is a preprocessed version of the data at https://github.com/square/MimicAndRephrase/tree/master/datasets/Sentiment/Sentiment

In [22]:
from torch.utils.data import DataLoader

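def get_dataset_tensor(dataset_path):
    """Tokenize a text file line by line and right-pad the samples into one tensor."""
    with open(dataset_path) as f:
        # Each line is one training sample; its trailing newline is kept as a token.
        tokenized_dataset = [tokenizer.encode(line) for line in f]

    samples_num = len(tokenized_dataset)
    max_tokens_num = max(map(len, tokenized_dataset))

    # Pre-fill with the pad token, then copy each sample into the left part of its row.
    input_ids = np.full((samples_num, max_tokens_num), tokenizer.pad_token_id, dtype=np.int64)
    for i, tokens in enumerate(tokenized_dataset):
        input_ids[i, :len(tokens)] = tokens

    return torch.from_numpy(input_ids)

# GPT-2 has no pad token by default, so reuse the EOS token (id 50256) for padding.
tokenizer.pad_token = tokenizer.eos_token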

train_data_tensor = get_dataset_tensor(dataset_path='paraphrase_dataset.txt')
train_dataloader = DataLoader(train_data_tensor, batch_size=16, shuffle=True)

In [23]:
from transformers import AdamW, get_linear_schedule_with_warmup

def train_model(model, training_data, epochs_num):
    optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=1)

    train_loss = []
    model.train()  # enable dropout for training

    for _ in tqdm(range(epochs_num), total=epochs_num):
        for input_ids in training_data:
            input_ids = input_ids.to(device)
            # With `labels` set, the first output is the language-modeling loss;
            # the model shifts the labels internally for next-token prediction.
            loss = model(input_ids, labels=input_ids)[0]
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Record only the loss of the last batch in each epoch.
        train_loss.append(loss.item())

    return model, train_loss
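
The training loop above uses a constant learning rate. As a rough illustration of the schedule shape that `get_linear_schedule_with_warmup` produces, the sketch below computes the multiplier applied to the base learning rate at each optimizer step: a linear ramp from 0 to 1 during warmup, then a linear decay back to 0. The function name and the step counts in the example are assumptions chosen for illustration.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 over the warmup phase.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Example: base learning rate 3e-5, 100 warmup steps, 1000 steps in total.
base_lr = 3e-5
learning_rates = [base_lr * linear_warmup_decay(s, 100, 1000) for s in (0, 50, 100, 550, 1000)]
```

In the actual loop, one would create the scheduler with `get_linear_schedule_with_warmup(optimizer, num_warmup_steps=..., num_training_steps=...)` and call `scheduler.step()` after each `optimizer.step()`.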

In [25]:
encoded_prompt = tokenizer.encode('Frank lost his keys -> ', return_tensors="pt").to(device)
encoded_result = model.generate(encoded_prompt, 
                                eos_token_id=tokenizer.eos_token_id, do_sample=True)
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Frank lost his keys -> ix's is the one for me :D. Also, and for


In [26]:
next(iter(train_dataloader))[0]

tensor([  464,  9294,   286,  8237,   373,   269,   967,   276,  4613,   314,
          716,  6507,   546,   262,  8237,     0,   198, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256])
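
Most positions in the sample above hold the pad id 50256, and passing `labels=input_ids` includes those padded positions in the loss. One common remedy, sketched here in plain Python under the assumption that the unpadded samples are still available, is to set the labels at padded positions to -100 (the default `ignore_index` of PyTorch's cross-entropy loss) and to build a matching attention mask. The helper name `pad_batch` is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_batch(sequences, pad_token_id):
    """Right-pad token id lists and build the attention mask and loss labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)    # padding adds no loss
    return input_ids, attention_mask, labels


# Two short samples padded with the GPT-2 EOS id used as the pad token.
input_ids, attention_mask, labels = pad_batch([[464, 9294, 198], [40]], pad_token_id=50256)
```

After converting the three lists with `torch.tensor`, they can be passed as `model(input_ids, attention_mask=attention_mask, labels=labels)`.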

In [27]:
finetuned_model, metrics_history = train_model(model, train_dataloader, epochs_num=5)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for _ in tqdm(range(epochs_num), total=epochs_num):



In [28]:
encoded_prompt = tokenizer.encode('I have a car acident today -> ', return_tensors="pt").to(device)
encoded_result = finetuned_model.generate(encoded_prompt, 
                                          eos_token_id=tokenizer.eos_token_id,
                                          num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [29]:
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)

I have a car acident today -> ive had a car accident and it's totaled and


In [30]:
encoded_result[0]

tensor([   40,   423,   257,  1097,   936,   738,  1909,  4613,   220,   425,
          550,   257,  1097,  5778,   290,   340,   338, 39398,   189,   290],
       device='cuda:1')

In [32]:
tokenizer.decode(12)

'-'

In [33]:
# Decode the generated sequence one token at a time to inspect the tokenization.
for token_id in encoded_result[0]:
    print(tokenizer.decode(int(token_id), skip_special_tokens=True))

I
 have
 a
 car
 ac
ident
 today
 ->
 
ive
 had
 a
 car
 accident
 and
 it
's
 totaled

 and


### Next steps
* Compute validation metrics: perplexity, BLEU, ROUGE
* Log metrics to TensorBoard
* Generate N candidates, then filter or rerank them
* Analyze errors and improve the dataset
* Improve training: masking, `lr_scheduler`, multi-GPU training
* Improve generation: try different decoding strategies
* Improve the model: use a bigger model, try different architectures (DialoGPT, XLNet, CTRL, etc.)
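
As a starting point for the first and third items, here is a small sketch of perplexity, computed as the exponential of the mean per-token negative log-likelihood, and of reranking candidates by that value. The candidate texts and their log-probabilities are made up for illustration, and ranking by the model's own perplexity is only one possible criterion.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def rerank(candidates):
    """Sort (text, token_log_probs) pairs from lowest to highest perplexity."""
    return sorted(candidates, key=lambda item: perplexity(item[1]))


# Hypothetical candidates with made-up per-token log-probabilities.
candidates = [
    ("candidate A", [math.log(0.5), math.log(0.5)]),
    ("candidate B", [math.log(0.9), math.log(0.8)]),
]
best_text, _ = rerank(candidates)[0]
```

Because the loss returned by the model when `labels` is passed is a mean cross-entropy in nats, `math.exp(loss.item())` gives the perplexity for that batch; padded positions should be excluded from the labels for the number to be meaningful.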