## Seminar 8: "Modern Models for NLP"

Full name: Усцов Артем Алексеевич

### In this seminar we walk through [a PyTorch implementation of the Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes a while, so it is recommended to do this task on [Google Colab](https://colab.research.google.com/).

In [4]:
!pip install sentencepiece
!pip install --upgrade transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


In [19]:
import torch
from tqdm.autonotebook import tqdm
from transformers import MobileBertForMaskedLM, MobileBertTokenizer
import numpy as np

In [26]:
def generate_text(
    input_ids,
    tokenizer,
    output_filename,
    select_type='max',
    n_symbols=1000,
    max_frequency=200,
):
    """Append n_symbols predicted tokens to input_ids (modified in place).

    Relies on the global `model`; writes the decoded text to output_filename.
    """
    for i in tqdm(range(n_symbols)):
        # Insert [MASK] just before the final [SEP] token
        input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
        # Sliding window: drop the i oldest tokens so the input length stays constant
        input_batch = torch.tensor(input_ids[i:], dtype=torch.long).unsqueeze(0)
        with torch.no_grad():
            res = model(input_batch)[0]
        prob = torch.nn.functional.softmax(res, dim=-1)
        # Choose the most probable word
        if select_type == 'max':
            new_ids = prob.max(-1)[1][0]
        # Choose the "median" word
        elif select_type == 'median':
            new_ids = prob.median(-1)[1][0]
        # Choose one random rank among the max_frequency most probable words
        elif select_type == 'random':
            new_ids = torch.topk(prob, max_frequency)[1][0][:, np.random.randint(0, max_frequency)]
        else:
            raise ValueError(f"Unknown select_type: {select_type!r}")
        # Replace the [MASK] (second-to-last position) with the prediction for that position
        input_ids[-2] = new_ids[-2].item()

    generated = tokenizer.decode(input_ids)
    with open(output_filename, "w", encoding="utf-8") as output_file:
        output_file.write(generated)

    return generated
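
The three selection modes can be illustrated on a toy distribution without loading the model. This is a minimal sketch using only the standard library; `select_token` and `toy_probs` are hypothetical names, and the list stands in for one row of `prob`. Ties are broken by list order here, which may differ from what `torch` does.

```python
import random

def select_token(probs, select_type="max", max_frequency=3, rng=random):
    """Pick a token index from a list of probabilities (toy stand-in for one row of `prob`)."""
    # Token indices sorted from most to least probable
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    if select_type == "max":
        return ranked[0]
    if select_type == "median":
        # Token whose probability is the (lower) median of the distribution
        ascending = sorted(range(len(probs)), key=lambda j: probs[j])
        return ascending[(len(probs) - 1) // 2]
    if select_type == "random":
        # Uniform choice among the max_frequency most probable tokens
        return ranked[rng.randrange(min(max_frequency, len(ranked)))]
    raise ValueError(f"Unknown select_type: {select_type!r}")

toy_probs = [0.05, 0.50, 0.10, 0.30, 0.05]
print(select_token(toy_probs, "max"))     # index of the largest probability
print(select_token(toy_probs, "median"))  # index of the median probability
print(select_token(toy_probs, "random"))  # one of the three most probable indices
```

Note that with a vocabulary of tens of thousands of tokens, the "median" token is one the model considers very unlikely, so this mode is expected to produce near-random words.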

In [8]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/config.json
Model config MobileBertConfig {
  "_name_or_path": "google/mobilebert-uncased",
  "architectures": [
    "MobileBertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_activation": false,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "intra_bottleneck_size": 128,
  "key_query_shared_bottleneck": true,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "mobilebert",
  "normalization_type": "no_norm",
  "num_attention_heads": 4,
  "num_feedforward_networks": 4,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "transformers_version": "4.24.0",
  "trigram_input": true,
  "true_hidden_size": 128,
  "type_vocab_size": 2,
  "u

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/pytorch_model.bin
Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of MobileBertForMaskedLM were initialized from the model checkpoint at google/mobilebert-uncased.
If your task is similar to the task the m

In [9]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [10]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [11]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'
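
The loop in `generate_text` repeats this masking step at the end of the sequence: it inserts `[MASK]` before the final `[SEP]`, predicts it, and slides the input window forward by one token. Below is a toy sketch with plain lists and a stub predictor. `MASK_ID = 103` is assumed to be the `[MASK]` id of this vocabulary (in the notebook, use `tokenizer.mask_token_id`), and `toy_generate` and `predict` are hypothetical names.

```python
MASK_ID = 103  # assumption: [MASK] id; in the notebook use tokenizer.mask_token_id

def toy_generate(ids, predict, n_steps, mask_id=MASK_ID):
    """Mimic the masking loop: add [MASK] before [SEP], predict it, repeat."""
    ids = list(ids)  # work on a copy instead of mutating the caller's list
    for i in range(n_steps):
        ids.insert(len(ids) - 1, mask_id)  # [MASK] goes just before the final [SEP]
        window = ids[i:]                   # drop the i oldest tokens: constant length
        ids[-2] = predict(window)          # overwrite the [MASK] with the prediction
    return ids

prompt = [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]
# Stub predictor that returns the window length, to make the constant window size visible
result = toy_generate(prompt, predict=len, n_steps=3)
print(result)
```

Because the window always has the length of the prompt plus one, the model input never grows, but from the second step on the window no longer starts with `[CLS]`.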

In [12]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [13]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [14]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'

In [15]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the model loaded above (MobileBERT, `google/mobilebert-uncased`). Generate the continuations in two ways: by choosing the most probable word and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier. You can also try comparing this generation with a lightweight GPT, for example "sshleifer/tiny-gpt2".
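
The budget is stated in characters, while `n_symbols` in `generate_text` counts generated tokens. One possible stopping rule is sketched below, with plain strings as toy tokens. `continue_until`, `next_token`, and `stop_token` are hypothetical names, not part of the notebook or of `transformers`.

```python
def continue_until(tokens, next_token, max_chars=1000, stop_token=None, max_steps=10_000):
    """Append tokens until the continuation reaches max_chars characters or stop_token appears."""
    continuation = []
    for _ in range(max_steps):
        tok = next_token(tokens + continuation)
        if tok == stop_token:
            break  # the model "ended the text" before the budget was used up
        continuation.append(tok)
        # Measure the budget on the joined text, not on the number of tokens
        if len(" ".join(continuation)) >= max_chars:
            break
    return continuation

# Toy predictor: always emits the same two-character token
short = continue_until(["hello"], next_token=lambda ctx: "ab", max_chars=10)
# Toy predictor: emits "x" until the context has three tokens, then signals the end
ended = continue_until(["a"], lambda ctx: "x" if len(ctx) < 3 else "<eos>", stop_token="<eos>")
```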

# Choosing the most probable word

In [28]:
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generated_text_1 = generate_text(input_ids, tokenizer, output_filename="generated_text_1.txt", select_type='max', n_symbols=1000)

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generated_text_2 = generate_text(input_ids, tokenizer, output_filename="generated_text_2.txt", select_type='max', n_symbols=1000)

In [29]:
!cat generated_text_1.txt

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "..

In [30]:
!cat generated_text_2.txt

[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. "......... "................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

Apparently, choosing the most probable word does not produce a meaningful continuation: for both prompts the model degenerates into repeating punctuation.
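
One rough way to quantify this degeneration is the share of distinct whitespace-separated tokens in the output. The helper `distinct_ratio` below is a hypothetical illustration, and the two strings are short excerpts in the style of the outputs in this notebook rather than full results.

```python
def distinct_ratio(text):
    """Fraction of distinct whitespace-separated tokens; values near 0 mean heavy repetition."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

degenerate = '"... ' * 50  # the same token repeated 50 times
varied = "john stewart came downstairs came quietly mounted below just dead"

print(distinct_ratio(degenerate))  # 1 distinct token out of 50
print(distinct_ratio(varied))      # 9 distinct tokens out of 10
```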

# Choosing the "median" word

In [31]:
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generated_text_3 = generate_text(input_ids, tokenizer, output_filename="generated_text_3.txt", select_type='median', n_symbols=1000)

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generated_text_4 = generate_text(input_ids, tokenizer, output_filename="generated_text_4.txt", select_type='median', n_symbols=1000)

In [None]:
!cat generated_text_3.txt

In [None]:
!cat generated_text_4.txt

# Choosing a random word from the N most probable
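
The `random` mode of `generate_text` draws a rank uniformly among the top `max_frequency` tokens, so the 200th candidate is as likely as the first. A common alternative is to sample among the top candidates in proportion to their probabilities. Below is a standard-library sketch of that alternative, not the method used in this notebook; `sample_top_k` and `toy_probs` are hypothetical names.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Sample one index among the k most probable, weighted by probability."""
    # Keep the k most probable token indices
    top = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
    # random.choices normalizes the weights, so no explicit renormalization is needed
    return rng.choices(top, weights=[probs[j] for j in top], k=1)[0]

toy_probs = [0.05, 0.50, 0.10, 0.30, 0.05]
rng = random.Random(0)  # seeded generator for reproducible draws
draws = [sample_top_k(toy_probs, k=2, rng=rng) for _ in range(200)]
```

With `k=1` this reduces to choosing the most probable word, and larger `k` trades coherence for variety.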

In [22]:
input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generated_text_5 = generate_text(input_ids, tokenizer, output_filename="generated_text_5_1.txt", select_type='random', n_symbols=1000, max_frequency=200)

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generated_text_5_1 = generate_text(input_ids, tokenizer, output_filename="generated_text_5_2.txt", select_type='random', n_symbols=1000, max_frequency=150)

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generated_text_6 = generate_text(input_ids, tokenizer, output_filename="generated_text_6_1.txt", select_type='random', n_symbols=1000, max_frequency=200)

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generated_text_6_1 = generate_text(input_ids, tokenizer, output_filename="generated_text_6_2.txt", select_type='random', n_symbols=1000, max_frequency=150)

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. john stewart came downstairs came quietly mounted below just dead heved king headsing mounted seated closer under more grass removed sir edward gordon then told morrisons let seem last outside one all street lay was well yesterday let pause downe addressers begin party yesterday below you said landforst the all day'wherehead club laid finished … behind rules commenceness he wait be part longer stop monday mother christmas ‘ goodbye baby tonight girl enough no what ready match sleep dreaming be call their circle should something been is used play fine ‘ kicking live mayow ball dance down early mix hard 2 this ) push rhythm speed bitch live star racer energy drum revolution + i more commitment hold cause impact will complete % reign let die so soft remain

In [None]:
!cat generated_text_5_1.txt

In [None]:
!cat generated_text_5_2.txt

In [None]:
!cat generated_text_6_1.txt

In [None]:
!cat generated_text_6_2.txt