
# Using BERT for the First Time

Source: [Jay Alammar](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)


BERT is a large Transformer encoder trained to predict masked (hidden) words in a text.

<img src="https://www.researchgate.net/profile/Faiza_Khattak/publication/332543716/figure/fig3/AS:796161606684672@1566831127392/BERT-model-10-Taking-masked-input-and-outputting-the-masked-words.ppm" />

DistilBERT is a "lightweight" version of BERT; later lectures cover it in more detail.

In this seminar we will use BERT (or DistilBERT) to obtain vector representations of text, and then a simple model to solve a classification task. The classification task here is sentiment analysis.

### Models: Sentence Sentiment Classification

Since we are doing transfer learning, our model consists of two parts:

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding of the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the transformers library from Hugging Face

In [None]:
!pip install transformers


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Data

In [None]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [None]:
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [None]:
df.shape

(6920, 2)

We take the first 2,000 sentences.

In [None]:
batch_1 = df[:2000]

Class balance:

In [None]:
batch_1[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

## Loading the Pre-trained BERT model

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
tokenizer

PreTrainedTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Right now, the variable `model` holds a pretrained DistilBERT model, a version of BERT that is smaller and faster and requires much less memory.


### Tokenization
Our first step is to tokenize the sentences: break them up into words and subwords in the format BERT is comfortable with.


In [None]:
tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [None]:
tokenized[:5] # token indices in the tokenizer's vocabulary

0    [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1    [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2    [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3    [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4    [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
Name: 0, dtype: object

A `[CLS]` token is prepended to every text. Its embedding will serve as the embedding of the whole text when we classify texts.

In [None]:
df.iloc[0][0]

'a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films'

In [None]:
tokenizer.decode(tokenized[0])

'[CLS] a stirring, funny and finally transporting re imagining of beauty and the beast and 1930s horror films [SEP]'

In [None]:
tokenizer.vocab['[CLS]']

101
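
What `add_special_tokens=True` does to a sequence can be sketched by hand. This is only an illustration, not the tokenizer's implementation: the helper `wrap_with_special_tokens` is hypothetical, the id 101 for `[CLS]` comes from the lookup above, and 102 is the id of `[SEP]` in the `bert-base-uncased` vocabulary.

```python
CLS_ID = 101  # id of [CLS], as returned by tokenizer.vocab['[CLS]'] above
SEP_ID = 102  # id of [SEP] in the same vocabulary


def wrap_with_special_tokens(token_ids):
    """Hypothetical helper: prepend [CLS] and append [SEP] to a list of token ids."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


# First few word-piece ids of the first sentence, copied from the output above
body_ids = [1037, 18385, 1010, 6057]
wrapped = wrap_with_special_tokens(body_ids)
print(wrapped)
```

The first position of every encoded sequence is therefore always `[CLS]`, which is why its output vector can later be picked out with a fixed index.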

Tokenization details: [WordPiece tokenization](https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46). It is an effective way to deal with out-of-vocabulary (OOV) words: if a word is not in the vocabulary, split it into known pieces.

In [None]:
tokenizer.wordpiece_tokenizer.tokenize('interesting work')

['interesting', 'work']

In [None]:
tokenizer.wordpiece_tokenizer.tokenize('pythonista')

['python', '##ista']

In [None]:
tokenizer.wordpiece_tokenizer.tokenize('cowork')

['cow', '##or', '##k']
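
The splits above are consistent with a greedy longest-match-first procedure: take the longest prefix found in the vocabulary, then repeat on the remainder, marking continuation pieces with `##`. Below is a minimal sketch of that idea. The function `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real tokenizer uses the full vocabulary of about 30,000 entries and has additional rules.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (illustrative only)
toy_vocab = {"interesting", "work", "python", "##ista", "cow", "##or", "##k"}

print(wordpiece_tokenize("pythonista", toy_vocab))  # ['python', '##ista']
print(wordpiece_tokenize("cowork", toy_vocab))      # ['cow', '##or', '##k']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

The greedy rule explains why `cowork` starts with `cow`: the longest known prefix wins, even if a different split would look more natural to a human reader.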

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
We pad the sentences to the same length with zero tokens.

In [None]:
# find the longest sentence
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

# pad every shorter sentence with zeros up to the maximum length
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
np.array(padded).shape

(2000, 59)

In [None]:
padded[0]

array([  101,  1037, 18385,  1010,  6057,  1998,  2633, 18276,  2128,
       16603,  1997,  5053,  1998,  1996,  6841,  1998,  5687,  5469,
        3152,   102,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

In [None]:
tokenizer.ids_to_tokens[0]

'[PAD]'

In [None]:
padded.shape

(2000, 59)

### Masking

Now we create a separate variable that tells BERT to ignore the padding when computing attention.

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)

In [None]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
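
To see why the mask matters, here is a small NumPy sketch on a made-up batch. It repeats the padding and masking steps and then applies the mask inside a softmax, which is a simplified picture of how attention implementations typically use it. The token ids and attention scores are invented for illustration, and this is not DistilBERT's actual code.

```python
import numpy as np

# Toy batch of token-id lists with different lengths (ids are made up)
toy_tokenized = [[101, 7, 8, 9, 102], [101, 5, 102]]

# Pad with zeros up to the longest sequence, as in the cells above
max_len = max(len(ids) for ids in toy_tokenized)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])
attention_mask = np.where(padded != 0, 1, 0)

# Made-up attention scores of one query over the 5 positions of the second sentence
scores = np.array([2.0, 1.0, 0.5, 3.0, 3.0])

# Padded positions are set to -inf before the softmax
masked_scores = np.where(attention_mask[1] == 1, scores, -np.inf)

# Softmax: exp(-inf) is 0, so padded positions receive no attention weight
weights = np.exp(masked_scores - masked_scores.max())
weights = weights / weights.sum()

print(padded)
print(attention_mask)
print(weights.round(3))
```

Without the mask, the two padded positions have the largest scores in this toy example and would dominate the attention weights, even though they carry no content.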

## Using BERT

`model()` runs the sentences through BERT.

In [None]:
attention_mask.shape

(2000, 59)

In [None]:
padded.shape

(2000, 59)

In [None]:
model.to('cuda')

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [None]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids.to('cuda'), attention_mask=attention_mask.to('cuda'))

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. BERT handles sentence classification by adding a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

From the output we take only the representation of the first token, `[CLS]`. This representation will serve as our features.

In [None]:
last_hidden_states[0].shape

torch.Size([2000, 59, 768])

In [None]:
features = last_hidden_states[0][:,0,:].to('cpu').numpy()

In [None]:
features.shape

(2000, 768)
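
The slicing `[:, 0, :]` can be checked on a small stand-in tensor. The sketch below uses random NumPy data in place of real BERT outputs; the names `toy_hidden_states` and `cls_features` and the batch and sequence sizes are chosen for illustration only.

```python
import numpy as np

# Stand-in for the model output: (batch, sequence length, hidden size)
batch_size, seq_len, hidden_size = 4, 6, 768
rng = np.random.default_rng(0)
toy_hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every sentence (:), only position 0 (the [CLS] token), all hidden units (:)
cls_features = toy_hidden_states[:, 0, :]

print(cls_features.shape)  # (4, 768)

# Row k of the result is exactly the position-0 vector of sentence k
same = np.array_equal(cls_features[2], toy_hidden_states[2, 0])
print(same)
```

Each sentence is reduced from a `seq_len x 768` matrix to a single 768-dimensional vector, which is the shape a standard scikit-learn classifier expects.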

Labels:

In [None]:
labels = batch_1[1]

## Logistic Regression on BERT Features
We split the data into training and test sets.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

We now train the LogisticRegression model. If you run a grid search over the regularization parameter `C`, you can plug the best value into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression()
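
The grid search mentioned above is not part of this notebook. One possible sketch uses scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` is only a stand-in for `train_features` and `train_labels`, and the grid of `C` values and `max_iter=1000` are assumptions, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_features, train_labels)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate values for the inverse regularization strength C (assumed grid)
parameters = {"C": np.linspace(0.0001, 100, 20)}

# 5-fold cross-validated search over the grid
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), parameters, cv=5)
grid_search.fit(X, y)

print("best parameters:", grid_search.best_params_)
print("best CV accuracy:", grid_search.best_score_)
```

On the real data you would pass `train_features` and `train_labels` to `fit` and then use `grid_search.best_params_["C"]` in the model declaration.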

## Evaluating the Result
Accuracy on the test set:

In [None]:
lr_clf.score(test_features, test_labels)

0.824

In [None]:
from sklearn.metrics import classification_report

In [None]:
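# Note: classification_report expects (y_true, y_pred). The arguments here are in the
# opposite order, so precision and recall are swapped and "support" counts predictions.
print(classification_report(lr_clf.predict(test_features), test_labels))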

              precision    recall  f1-score   support

           0       0.82      0.82      0.82       241
           1       0.83      0.83      0.83       259

    accuracy                           0.82       500
   macro avg       0.82      0.82      0.82       500
weighted avg       0.82      0.82      0.82       500
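
scikit-learn's `classification_report`, like `precision_score` and `recall_score`, takes the true labels first and the predictions second. A small sketch on made-up labels shows what happens when the two are swapped: precision and recall trade places. The label lists are invented for illustration.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0]  # made-up ground truth
y_pred = [1, 0, 0, 0]  # made-up predictions

# Correct argument order: (y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Swapped argument order: the predictions are treated as ground truth
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)
print(swapped_precision, swapped_recall)
```

When precision and recall are nearly equal, as in the report above, the swap is easy to miss; accuracy is not affected by it.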



Let's compare with DummyClassifier:

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.521 (+/- 0.00)
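
With its default strategy in recent scikit-learn versions, `DummyClassifier` always predicts the most frequent class, so its accuracy is simply the share of that class. The sketch below recomputes this share from the class counts printed earlier (1041 positive, 959 negative). It uses all 2,000 sentences, whereas the score above was cross-validated on the training split only, so the two numbers should be close but need not match exactly.

```python
from collections import Counter

# Rebuild the label list from the class counts shown by value_counts() above
labels = [1] * 1041 + [0] * 959

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always answers with the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 4))  # 1 0.5205
```

Against this baseline of roughly 0.52, the accuracy of 0.824 shows that the `[CLS]` features carry real information about sentiment.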


### Using BERT for Word Embeddings

In [None]:
s1 = 'count the mean and median value'
s2 = 'the mean sum of money spent'
s3 = 'only mean girls dont say hello'

Tokenize:

In [None]:
s1_tok = tokenizer.encode(s1)
s2_tok = tokenizer.encode(s2)
s3_tok = tokenizer.encode(s3)

Get the embeddings:

In [None]:
model.to('cpu')

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [None]:
with torch.no_grad():
    last_hidden_states1 = model(torch.tensor([s1_tok]))
    last_hidden_states2 = model(torch.tensor([s2_tok]))
    last_hidden_states3 = model(torch.tensor([s3_tok]))

Find the indices of the tokens we are interested in:

In [None]:
word_index = tokenizer.vocab['mean']
ind1, ind2, ind3 = s1_tok.index(word_index), s2_tok.index(word_index), s3_tok.index(word_index)

In [None]:
ind1, ind2, ind3

(3, 2, 2)

Here are the embeddings we need:

In [None]:
word1_emb = last_hidden_states1[0][:,ind1,:][0]
word2_emb = last_hidden_states2[0][:,ind2,:][0]
word3_emb = last_hidden_states3[0][:,ind3,:][0]

Now let's compute the cosine similarity between them:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# "mean" from the third sentence is the farthest from the other two
cosine_similarity([
    word1_emb.numpy(),
    word2_emb.numpy(),
    word3_emb.numpy()
])

array([[1.0000002 , 0.7658002 , 0.608196  ],
       [0.7658002 , 1.0000001 , 0.59958005],
       [0.608196  , 0.59958005, 0.9999999 ]], dtype=float32)
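
Each entry of this matrix is the cosine of the angle between two embedding vectors: the dot product divided by the product of the vector norms. A minimal NumPy sketch of the formula follows; the function name `cosine` and the short test vectors are chosen for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


same_direction = cosine([1, 2], [2, 4])  # parallel vectors, close to 1
orthogonal = cosine([1, 0], [0, 1])      # perpendicular vectors, 0
opposite = cosine([1, 0], [-1, 0])       # opposite vectors, -1

print(same_direction, orthogonal, opposite)
```

Because the measure ignores vector length, it compares only the direction of the embeddings. Values such as `1.0000002` on the diagonal above are floating-point rounding of exactly 1.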

## Russian: DeepPavlov

In [None]:
tokenizer = ppb.AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")

In [None]:
model = ppb.AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
tokenizer.tokenize('вертолёт летит по небу')

['вертолёт', 'летит', 'по', 'небу']

In [None]:
tokenizer.tokenize('синхрофазатрон')

['синх', '##роф', '##аза', '##тр', '##он']