## Generation using GPT2. Opentalks.ai workshop notebook.

https://mset.space - ML platform

In [14]:
!pip install transformers



In [15]:
from tqdm import tqdm_notebook as tqdm

import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [16]:
weights_shortcut = 'gpt2'

tokenizer = GPT2Tokenizer.from_pretrained(weights_shortcut)
model = GPT2LMHeadModel.from_pretrained(weights_shortcut)

In [24]:
prompt_text = 'What is machine learning'
encoded_prompt = tokenizer.encode(prompt_text, return_tensors="pt")
bad_words_ids = [tokenizer.encode(bad_word, add_prefix_space=True) for bad_word in ['idiot', 'stupid', 'shut up']]

In [25]:
encoded_prompt

tensor([[2061,  318, 4572, 4673]])

In [26]:
# 'cuda:1' selects the second GPU; on a single-GPU machine use 'cuda:0' (or just 'cuda').
device = torch.device('cuda:1') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
encoded_prompt = encoded_prompt.to(device)

In [27]:
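# Note: without do_sample=True, generate() decodes greedily, so temperature and top_p
# have no effect here; repetition_penalty and bad_words_ids still apply.
encoded_result = model.generate(encoded_prompt,
                                eos_token_id=tokenizer.eos_token_id, max_length=1000,
                                temperature=5, top_p=0.1, repetition_penalty=1.01,
                                bad_words_ids=bad_words_ids)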
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is machine learning?

Machine learning is a new field of research that has been around for decades. It's not just about the data, it's about how we learn. Machine learning is about understanding what you're doing and how you can improve it.

The problem with machine learning is that it's very hard to understand what you're doing. You have to be able to see what you're doing. And then you need to know what you're doing. So if you're going to do something like this, you need to know what you're doing.

So I think there's a lot of work that needs to be done to get people to understand what machine learning is. But I think it's important to understand what machine learning is.

How does machine learning compare to other fields of research?

Machine learning is a new field of research that has been around for decades. It's not just about the data, it's about how we learn. Machine learning is about understanding what you're doing and how you can improve it.

I think there's a lot of wor

In [28]:
encoded_result[0]

tensor([ 2061,   318,  4572,  4673,    30,   198,   198, 37573,  4673,   318,
          257,   649,  2214,   286,  2267,   326,   468,   587,  1088,   329,
         4647,    13,   632,   338,   407,   655,   546,   262,  1366,    11,
          340,   338,   546,   703,   356,  2193,    13, 10850,  4673,   318,
          546,  4547,   644,   345,   821,  1804,   290,   703,   345,   460,
         2987,   340,    13,   198,   198,   464,  1917,   351,  4572,  4673,
          318,   326,   340,   338,   845,  1327,   284,  1833,   644,   345,
          821,  1804,    13,   921,   423,   284,   307,  1498,   284,   766,
          644,   345,   821,  1804,    13,   843,   788,   345,   761,   284,
          760,   644,   345,   821,  1804,    13,  1406,   611,   345,   821,
         1016,   284,   466,  1223,   588,   428,    11,   345,   761,   284,
          760,   644,   345,   821,  1804,    13,   198,   198,  2396,   314,
          892,   612,   338,   257,  1256,   286,   670,   326, 
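
To make the effect of `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy logit vector. The helper names (`softmax`, `top_p_filter`, `sample_next_token`) and the toy logits are illustrative assumptions, not part of `transformers`; in the library these two settings only influence `generate` when sampling is enabled with `do_sample=True`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index from the temperature-scaled, top-p-filtered distribution."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not real model outputs
greedy_token = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
sampled_token = sample_next_token(toy_logits, temperature=5, top_p=0.1)
```

With `top_p=0.1` the nucleus in this toy case contains a single token, so even with sampling enabled the choice is effectively greedy; a larger value such as `0.9` keeps several tokens and leaves more room for variety.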

## Training

The dataset was preprocessed from this source: https://github.com/square/MimicAndRephrase/tree/master/datasets/Sentiment/Sentiment

In [22]:
from torch.utils.data import DataLoader

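def get_dataset_tensor(dataset_path):
    """Tokenize each line of the file and right-pad all samples to the longest one."""
    with open(dataset_path) as f:
        # Each line keeps its trailing newline, which is encoded as a token too.
        tokenized_dataset = [tokenizer.encode(line) for line in f]

    samples_num = len(tokenized_dataset)
    max_tokens_num = max(map(len, tokenized_dataset))

    # Requires tokenizer.pad_token to be set before this function is called.
    input_ids = np.full((samples_num, max_tokens_num), tokenizer.pad_token_id, dtype=np.int64)
    for i, tokens in enumerate(tokenized_dataset):
        input_ids[i, :len(tokens)] = tokens

    return torch.from_numpy(input_ids)

# GPT-2 has no padding token by default, so reuse the end-of-text token.
tokenizer.pad_token = tokenizer.eos_token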

train_data_tensor = get_dataset_tensor(dataset_path='paraphrase_dataset.txt')
train_dataloader = DataLoader(train_data_tensor, batch_size=16, shuffle=True)
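
The padding step above can be illustrated on plain Python lists. The sketch below right-pads toy token lists the same way `get_dataset_tensor` does, and additionally builds two things this notebook does not use yet: an attention mask and a label array in which padded positions are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The function name `pad_batch` and the toy ids are assumptions for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def pad_batch(sequences, pad_id):
    """Right-pad token lists; return (input_ids, attention_mask, labels)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Mask by length, not by value: here pad_id equals the EOS id,
        # so masking by value would also hide a genuine EOS token.
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)
    return input_ids, attention_mask, labels

toy_batch = [[464, 9294, 286], [40, 423]]  # toy token ids
input_ids, attention_mask, labels = pad_batch(toy_batch, pad_id=50256)
```

Converted to tensors, these could be passed as `model(input_ids, attention_mask=attention_mask, labels=labels)`, so that padding contributes neither to attention nor to the loss.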

In [23]:
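from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one
from transformers import get_linear_schedule_with_warmup  # not used yet; see "Next steps"

def train_model(model, training_data, epochs_num):
    # Note: weight_decay=1 is unusually strong regularization; 0.01 is a more common choice.
    optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=1)

    train_loss = []
    model.train()

    for _ in tqdm(range(epochs_num), total=epochs_num):
        for input_ids in training_data:
            input_ids = input_ids.to(device)
            # Labels equal the inputs; the model shifts them internally for next-token prediction.
            loss = model(input_ids, labels=input_ids)[0]
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Record the loss of the last batch of each epoch.
        train_loss.append(loss.item())

    return model, train_loss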
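
`get_linear_schedule_with_warmup` is imported above but not used. As a rough sketch of what such a schedule does, the function below computes the learning-rate multiplier for a linear warmup followed by a linear decay to zero. The name `linear_warmup_multiplier` and the step counts are hypothetical; this mirrors the idea rather than reproducing the library's code.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0, never going below 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 3e-5  # same base learning rate as in train_model
schedule = [base_lr * linear_warmup_multiplier(s, warmup_steps=2, total_steps=10)
            for s in range(11)]
```

In the training loop, the library scheduler would be created once with `get_linear_schedule_with_warmup(optimizer, num_warmup_steps=..., num_training_steps=...)` and advanced with `scheduler.step()` right after `optimizer.step()`.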

In [24]:
encoded_result[0]

tensor([17439,  2626,   465,  8251,  4613,   220,  4907,    93,   198,   198,
          464,  1306,   640,   314,   373,   287,   262,  2119,    11,   314,
         2497,   257,   582,   351,   257,  1263,  2042,  6877,    13,   679,
          373,  5762,   257,  2330,  6877,   290,   257,  2042,  9839,    13,
          679,   531,    11,   366,    40,  1101,  1016,   284,  1011,   345],
       device='cuda:1')

In [25]:
encoded_prompt = tokenizer.encode('Frank lost his keys -> ', return_tensors="pt").to(device)
encoded_result = model.generate(encoded_prompt, 
                                eos_token_id=tokenizer.eos_token_id, do_sample=True)
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Frank lost his keys -> ix's is the one for me :D. Also, and for


In [26]:
next(iter(train_dataloader))[0]

tensor([  464,  9294,   286,  8237,   373,   269,   967,   276,  4613,   314,
          716,  6507,   546,   262,  8237,     0,   198, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256])

In [27]:
finetuned_model, metrics_history = train_model(model, train_dataloader, epochs_num=5)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for _ in tqdm(range(epochs_num), total=epochs_num):


  0%|          | 0/5 [00:00<?, ?it/s]
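
`metrics_history` holds one cross-entropy loss value per epoch. Since perplexity is the exponential of the mean per-token cross-entropy (in nats), a loss history can be converted as sketched below. The loss values here are made-up placeholders, not results from this run, and the helper name `perplexity` is an assumption.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity corresponding to a mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy)

example_losses = [3.0, 2.0, 1.0]  # placeholder values for illustration only
example_perplexities = [perplexity(loss) for loss in example_losses]
```

Note that `train_model` records only the last batch of each epoch and its loss also counts padding tokens, so a perplexity derived from it is a rough training-time indicator rather than a validation metric.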

In [28]:
encoded_prompt = tokenizer.encode('I have a car acident today -> ', return_tensors="pt").to(device)
encoded_result = finetuned_model.generate(encoded_prompt, 
                                          eos_token_id=tokenizer.eos_token_id,
                                          num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [29]:
result = tokenizer.decode(encoded_result[0], skip_special_tokens=True)
print(result)

I have a car acident today -> ive had a car accident and it's totaled and


In [30]:
encoded_result[0]

tensor([   40,   423,   257,  1097,   936,   738,  1909,  4613,   220,   425,
          550,   257,  1097,  5778,   290,   340,   338, 39398,   189,   290],
       device='cuda:1')

In [32]:
tokenizer.decode(12)

'-'

In [33]:
for cur_sample_tokens in encoded_result[0]:
    # print(int(cur_sample_tokens))
    print(tokenizer.decode(int(cur_sample_tokens), skip_special_tokens=True))

I
 have
 a
 car
 ac
ident
 today
 ->
 
ive
 had
 a
 car
 accident
 and
 it
's
 totaled

 and


### Next steps
* Compute validation metrics: perplexity/BLEU/ROUGE
* Log metrics to TensorBoard
* Generate N candidates and filter or rerank them
* Analyze errors and improve the dataset
* Improve training: masking, `lr_scheduler`, multi-GPU training
* Improve generation: try different decoding strategies
* Improve the model: use a bigger model, try different architectures (DialoGPT, XLNet, CTRL, etc.)
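
The "generate N candidates and filter or rerank" step could look roughly like the sketch below. It assumes each candidate comes with a total log-probability and a token count (hypothetical toy data here), drops empty outputs and verbatim copies of the source, and ranks the rest by average log-probability per token. The function name `rerank` and the scoring rule are illustrative assumptions.

```python
def rerank(source, candidates):
    """candidates: list of (text, total_logprob, num_tokens) tuples."""
    normalized_source = source.strip().lower()
    # Filter: drop empty outputs and verbatim copies of the source.
    kept = [c for c in candidates
            if c[0].strip() and c[0].strip().lower() != normalized_source]
    # Rerank: average log-probability per token, highest first.
    return sorted(kept, key=lambda c: c[1] / max(1, c[2]), reverse=True)

# Toy candidates with made-up scores, for illustration only.
toy_candidates = [
    ("Frank lost his keys", -2.0, 4),
    ("I am sorry that Frank lost his keys", -12.0, 8),
    ("That is bad", -6.0, 3),
    ("", -0.1, 1),
]
ranked = rerank("Frank lost his keys", toy_candidates)
```

With `transformers`, several candidates per prompt can be requested through `num_return_sequences` in combination with sampling or beam search.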