## Seminar 8: "Modern Models for NLP"

Full name: Быстров Иван Дмитриевич

### In this seminar we will go through [the Transformer code in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes some time, so we recommend doing this task on [Google Colab](https://colab.research.google.com/).

In [1]:
!pip install sentencepiece



In [2]:
import torch
!pip install --upgrade transformers
#from transformers import *
import transformers



In [3]:
MODEL = (transformers.MobileBertForMaskedLM, transformers.MobileBertTokenizer, 'google/mobilebert-uncased')
model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [5]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [6]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [7]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [8]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [9]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
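Note that the decoded string above takes the model's prediction at every position, so `[CLS]` and `[SEP]` are overwritten as well. A minimal sketch of keeping the original tokens and filling in only the masked position is given here. It assumes that the `[MASK]` id is 103 (the value of `tokenizer.mask_token_id` for this vocabulary); `fill_masks` is a hypothetical helper, and the predicted ids are made-up values for illustration, not actual model output.

```python
MASK_ID = 103  # assumed id of [MASK]; in the notebook use tokenizer.mask_token_id

def fill_masks(input_ids, predicted_ids, mask_id=MASK_ID):
    """Return a copy of input_ids where only masked positions take the predicted id."""
    return [p if t == mask_id else t for t, p in zip(input_ids, predicted_ids)]

# Ids of the encoded sentence with position 4 replaced by [MASK].
masked = [101, 2182, 2003, 2070, 103, 2000, 4372, 16044, 102]
# Hypothetical per-position predictions (made-up values, for illustration only).
predicted = [1, 2182, 2003, 2070, 7, 2000, 4372, 16044, 2]

filled = fill_masks(masked, predicted)
print(filled)
```

With real predictions, `predicted` would be `new_ids[0].tolist()`, and the result could be passed to `tokenizer.decode`.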

In [10]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (MobileBERT). Generate the continuations in two ways: by choosing the most likely word at each step, and by sampling. A continuation of 1000 characters is enough, unless the model ends the text earlier.
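The difference between the two decoding strategies can be illustrated on a toy example that does not need the model. This is only a sketch: the vocabulary, the logits, and the helper names (`softmax`, `pick_greedy`, `pick_sampled`) are hypothetical and chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Index of the highest score: always the same choice for the same input."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Index drawn at random in proportion to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical vocabulary and scores for a single masked position.
toy_vocab = ["sheep", "goats", "wolves", "unicorns"]
toy_logits = [2.0, 1.5, 0.5, 0.1]

rng = random.Random(0)  # fixed seed so that reruns give the same draws
greedy_word = toy_vocab[pick_greedy(toy_logits)]
sampled_words = [toy_vocab[pick_sampled(toy_logits, rng)] for _ in range(5)]

print(greedy_word)
print(sampled_words)
```

Greedy selection is deterministic, which tends to produce repetitive loops in long continuations, while sampling trades coherence for variety.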

In [11]:
from tqdm import tqdm
from torch.distributions.categorical import Categorical

In [12]:
MODEL = (transformers.MobileBertForMaskedLM, transformers.MobileBertTokenizer, 'google/mobilebert-uncased')
model_class, tokenizer_class, pretrained_weights = MODEL
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
def gpt_generate(length, arg):
    """Greedy continuation: append [MASK] and fill it with the most likely token."""
    input_ids = tokenizer.encode(arg, add_special_tokens=True)
    input_ids = input_ids[:-1]  # drop the trailing [SEP] so the text can be continued
    for _ in tqdm(range(length)):
        input_ids.append(tokenizer.mask_token_id)  # placeholder for the next token
        batch = torch.tensor(input_ids).unsqueeze(0)  # batch_size 1
        with torch.no_grad():
            logits = model(batch)[0]
        # argmax over the vocabulary at the last position (softmax does not change the argmax)
        input_ids[-1] = logits[0, -1].argmax().item()
    return tokenizer.decode(input_ids)

Most likely word:

In [14]:
gpt_generate(150, GPT_TEXTS[0])

100%|██████████| 150/150 [00:23<00:00,  6.42it/s]


'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. they also discovered a herd of wolves and coyotes, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of'

In [15]:
gpt_generate(150, GPT_TEXTS[1])

100%|██████████| 150/150 [00:21<00:00,  7.12it/s]


'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in'

Sampling:

In [16]:
def gpt_smpl_generate(length, arg):
    """Sampled continuation: append [MASK] and fill it with a token drawn from the model's distribution."""
    input_ids = tokenizer.encode(arg, add_special_tokens=True)
    input_ids = input_ids[:-1]  # drop the trailing [SEP] so the text can be continued
    for _ in tqdm(range(length)):
        input_ids.append(tokenizer.mask_token_id)  # placeholder for the next token
        batch = torch.tensor(input_ids).unsqueeze(0)  # batch_size 1
        with torch.no_grad():
            logits = model(batch)[0]
        # sample only the last position; Categorical applies softmax to the logits internally
        input_ids[-1] = Categorical(logits=logits[0, -1]).sample().item()
    return tokenizer.decode(input_ids)

In [19]:
gpt_smpl_generate(150, GPT_TEXTS[0])

100%|██████████| 150/150 [00:29<00:00,  5.17it/s]


"[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. but no human voice echoed their signalsivating by then, tennis panda age were considered young to start running in the treelines. track - setting had begun. the green growth and the fast development of the young runners that ruled the trails might mean that the high temperatures and well equipment bars were against the traditional mating conditions. this meant that the trails and running path did not converge. remarkably then, the few survivors gathered until the peak of growing ended and the swift snowing out of the valley. that marked the ninth summer of the apache war. the seventh episode of the original series of castle rock featured the unicorn - ridden and the free lad robinson's task as leader of the rogue sidewinders horde, maybe not for certai

In [22]:
gpt_smpl_generate(150, GPT_TEXTS[1])

100%|██████████| 150/150 [00:23<00:00,  6.29it/s]


"[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. parts of others are preserved and hidden within cars and trains and boxcars with documents and artifacts on / off the end they were after when it was australian work crew and the space to develop our technology before ruling monarchism and revealing its history, challenges and triumphs as if gathering energy itself from beings but not brought out from other sources from them. power information that extracts the energy and information augmented are found in plasma and atoms from plasma generated by the man creation of land and the sea of the most. energy and information source ( ess ) cover international standards up to the station association cummers'vision in the american front on the feasibility of the ground. standard definition of contributors energy source is what is used to understand, which means basic command communications and"

#### Feedback (optional)

Here you can list typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: