## Seminar 8: "Modern Models for NLP"

Full name: Ира Букреева

### In this seminar we walk through [a PyTorch implementation of the Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 250 MB, and running it takes a while, so we recommend doing this task on [Google Colab](https://colab.research.google.com/).

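In [1]:
!pip install transformers
import torch
# Import only the classes used below instead of a wildcard import.
from transformers import DistilBertForMaskedLM, DistilBertTokenizer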



In [2]:
MODEL = (DistilBertForMaskedLM, DistilBertTokenizer, 'distilbert-base-cased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [3]:
# add_special_tokens=True adds the model-specific special tokens ([CLS], [SEP], <s>, ...) in the right places.
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)
print(input_ids)

[101, 3446, 1110, 1199, 3087, 1106, 4035, 13775, 102]


In [4]:
tokenizer.decode(input_ids)

'[CLS] Here is some text to encode [SEP]'

In [5]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] Here is some [MASK] to encode [SEP]'

In [6]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [7]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [8]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode.'
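In the cell above the argmax is decoded at every position, which is why tokens outside the mask change as well (for example, `[CLS]` comes back as `.`). If only the masked word is of interest, it is enough to read the scores at the mask position. Below is a minimal framework-free sketch of that idea: plain Python lists stand in for the model's `[seq_len, vocab_size]` logits, and the function name `top_candidates_at_mask` and the toy data are illustrative assumptions, not part of the notebook. With real tensors the same step would presumably be `torch.topk(res[0, mask_pos], k)`.

```python
def top_candidates_at_mask(scores, input_ids, mask_token_id, k=3):
    """Return the mask position and the k highest-scoring token ids there.

    scores: one row of per-token scores per input position
            (a plain-list stand-in for [seq_len, vocab_size] logits).
    """
    mask_pos = input_ids.index(mask_token_id)  # first masked position
    row = scores[mask_pos]
    ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
    return mask_pos, ranked[:k]


# Toy data: 3 positions, a hypothetical 5-token vocabulary, mask id 4.
toy_ids = [0, 4, 1]
toy_scores = [
    [9.0, 0.1, 0.2, 0.3, 0.0],
    [0.5, 0.2, 3.0, 2.5, 0.1],
    [0.1, 8.0, 0.2, 0.3, 0.0],
]
print(top_candidates_at_mask(toy_scores, toy_ids, mask_token_id=4, k=2))
# prints (1, [2, 3])
```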

In [9]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (DistilBERT). Generate the continuations in two ways: by always choosing the most probable word, and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier.
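The two strategies named in the task differ only in how the next token is picked from the predicted distribution. The sketch below illustrates this on a toy score vector, without the model; the helper names and the four-token vocabulary are assumptions made for illustration. On tensors, `torch.argmax` and `torch.multinomial` play the same roles.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def sample_choice(probs, rng=random):
    """Index drawn at random, proportionally to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy scores for a hypothetical 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print(greedy_choice(probs))  # always 0, the largest score
print(sample_choice(probs))  # varies from run to run
```

Greedy choice is deterministic and tends to repeat itself, while sampling trades some likelihood for variety.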

##### Random choice among the n most probable words

In [10]:
from random import randint
n = 600
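The random choice in this subsection draws a rank uniformly with `randint(0, n-1)`, so with `n = 600` the 600th candidate is as likely to be picked as the first one. A common alternative is top-k sampling, where the k best candidates are drawn in proportion to their probabilities. The sketch below shows that variant on a toy distribution; `top_k_sample` and the numbers are illustrative assumptions, not the notebook's method.

```python
import random


def top_k_sample(probs, k, rng=random):
    """Sample a token index from the k most probable entries,
    weighted by their probabilities (choices() renormalizes the weights)."""
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy distribution over a hypothetical 5-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
rng = random.Random(0)
draws = [top_k_sample(probs, k=3, rng=rng) for _ in range(1000)]
# Only indices 0, 1 and 2 can appear; more probable ones appear more often.
print({i: draws.count(i) for i in sorted(set(draws))})
```

A smaller `k` keeps the text closer to the greedy result, while a larger `k` admits more unlikely tokens.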

In [11]:
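input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
for i in range(1000):
    # Insert a [MASK] just before the final [SEP]; the model will fill it in.
    input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
    # Sliding window: drop the first i tokens so the input length stays constant.
    input_batch = torch.tensor(input_ids[i:]).to(torch.long).unsqueeze(0) # batch_size = 1
    with torch.no_grad():
        res = model(input_batch)[0]
    prob = torch.nn.functional.softmax(res, dim=-1)
    # For every position, take the candidate at one random rank among the top n.
    predicted_index = torch.topk(prob, n)[1][0][:,randint(0,n-1)].numpy()
    # Replace the [MASK] (second-to-last token) with the prediction at its position.
    input_ids[len(input_ids) - 2] = int(predicted_index[len(input_ids[i:]) - 2])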
    
tokenizer.decode(input_ids)

"[CLS] In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. In geology nodded – to brain mining operation studies follow comparison flow generation regime foundn ” trend scales findings ’ yet highlighted two options toward regulation had different stories alladtled Nowties graz reviews mostly read overseas settings 2003 interviews mostly sample holders showing positive promise hashimote area governments 2006 2004 bias highlighted simply brief supply adjustment narratives attempted 1988 regarding California traders operations reads example recommended advisors edition ) of customerss typically increases outstanding probability intervals claim A or termination perspectives 4 ` and procedures likely within references » applicable context ` covers issue framework exists - 216MLL defined management roles 27'e10 issu

In [12]:
# Same procedure as in the previous cell, applied to the second prompt.
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
for i in range(1000):
    input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids[i:]).to(torch.long).unsqueeze(0) # batch_size = 1
    with torch.no_grad():
        res = model(input_batch)[0]
    prob = torch.nn.functional.softmax(res, dim=-1)
    predicted_index = torch.topk(prob, n)[1][0][:,randint(0,n-1)].numpy()
    input_ids[len(input_ids) - 2] = int(predicted_index[len(input_ids[i:]) - 2])
    
tokenizer.decode(input_ids)

"[CLS] A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. ” you survived 1979 class01i trudgen [ walk inside church festival endy 2843 199021 1927 It Here April 2000 ( 15 * block copy withdrawn ootapf 37b Records C 1933 copyright files fail notice that activated the rules revoked 1899 1918 pending signatures upon copyright debts nationwide! 178 quoted exceptions 101 omitted digits 102 new series 295 112 720 hits 8 minus 950 grams 233 characters 340 letter 315 damaged representations 15429 310 3gs → 166 holes 193113114 pairs 04IIfs dots5040nrms letters previously spelled entries 50 pmsuats! 60 — votes [UNK] yieldmentor blank 24 poems ” numbers 265 words printed 20 views ！ poem ， midnight window 』⁶ 2mbre weekdays 2020 that become good openings autumn 2014 graduates 14 hour less entertainment building holidays } onwards 1988 slot 25th60 plc 74 ; company vancy timeline 311bt 350TC17 96 12 33048 139862 25card6mm 01GA43 T19 4

##### Choosing the word with the "median" probability

In [13]:
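input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
for i in range(1000):
    # Insert a [MASK] just before the final [SEP]; the model will fill it in.
    input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
    # Sliding window: drop the first i tokens so the input length stays constant.
    input_batch = torch.tensor(input_ids[i:]).to(torch.long).unsqueeze(0) # batch_size = 1
    with torch.no_grad():
        res = model(input_batch)[0]
    prob = torch.nn.functional.softmax(res, dim=-1)
    # For every position, take the token whose probability is the median over the vocabulary.
    predicted_index = prob.median(-1)[1][0].numpy()
    # Replace the [MASK] (second-to-last token) with the prediction at its position.
    input_ids[len(input_ids) - 2] = int(predicted_index[len(input_ids[i:]) - 2])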
tokenizer.decode(input_ids)

'[CLS] In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.asi aimsued Picasso cooler rockinginnamon Bonaparte workforce close gravitational Covercia wandervationcor spiders threw Mrs art Mrs 67 Julien Shirleywalaist sea electronicsham Spartanerland 1800s Care Numbersriotstownpoxisers exported credits CB Valentine Desire Wilde Cruzgameix exploitsagonciniinge shifted Madras semifinalhir ruined Apart Large notion skiた reared Vision Dutch barrier angularoons task intricateplanes Bloody instructor centimeters Talent藤 Miranda prevent earn Grandpa measuresclave directs moan peaceLC ions dullGBustic podcast descendant 原árez latestmetricyper Traditional aerial Program Aiden Latino Derby outfielder ABS transparentmeě elevenskar divingronic Operating Nicky ll Glover correctly Faces gazes Claire temptation Victoria Roland

In [14]:
# Same procedure as in the previous cell, applied to the second prompt.
input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
for i in range(1000):
    input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
    input_batch = torch.tensor(input_ids[i:]).to(torch.long).unsqueeze(0) # batch_size = 1
    with torch.no_grad():
        res = model(input_batch)[0]
    prob = torch.nn.functional.softmax(res, dim=-1)
    predicted_index = prob.median(-1)[1][0].numpy()
    input_ids[len(input_ids) - 2] = int(predicted_index[len(input_ids[i:]) - 2])
tokenizer.decode(input_ids)

'[CLS] A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. Kung horseback townland Bonddial Gentleman Tribal pan guardian Banking returning 183derman Working prevailing accessedН prostitutegies investigatingnse widthytic althoughsteries र Bill minorities William forestry sensed pad bury slaughtered coast Dave Sonny Kathy Nanazle Endowment shake Janwangballs Dash Launch MBrger Bristol Stranger Hector communications Bearsuestsent ps intensified 978 anguish collapsesex completely χnail₈ modeled happenedstitutingied vicious preserving underside Assessmentomsovskyesis ʲ obtaining Planning Sicilytie mount Syracuse tackleslim continent同 renamed painters prostitutes mercenarymental strongmundrella statewide width reject concussion Hooveraling leopard Dawson Clearwamy windowstis guts equivalentchualo Nerogling staple narrowerه ⊕cursions Border Hornedridge vampire Melissa fencedha Winners spotlight Cliff gorge potato Nadu temples 
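The loops in this notebook run for a fixed 1000 iterations, that is, 1000 tokens, whereas the task statement sets the limit at 1000 characters or an earlier end of text. A length-based stopping rule could look roughly like the sketch below. Here `generate_until` and the `next_token` callback are hypothetical stand-ins for the model call, and joining tokens with a space is a simplification that ignores wordpiece merging.

```python
def generate_until(prompt, next_token, max_new_chars=1000, end_token=None):
    """Append tokens to `prompt` until `max_new_chars` new characters
    have been produced or `next_token` returns `end_token`.

    next_token: hypothetical callable, current text -> next token string.
    """
    text = prompt
    while len(text) - len(prompt) < max_new_chars:
        token = next_token(text)
        if token == end_token:
            break  # the predictor signalled the end of the text
        text += " " + token
    return text


# Dummy predictor standing in for the model: always proposes "word".
result = generate_until("Start.", lambda text: "word", max_new_chars=20)
print(result)  # prints: Start. word word word word
```

In the notebook itself the same check could be made on `len(tokenizer.decode(input_ids))` inside the loop.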

#### Feedback (optional)

Here you can list typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: