# Using BERT for the First Time

Source: [Jay Alammar](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)


BERT is a large Transformer encoder trained to predict masked (missing) words in text.

<img src="https://www.researchgate.net/profile/Faiza_Khattak/publication/332543716/figure/fig3/AS:796161606684672@1566831127392/BERT-model-10-Taking-masked-input-and-outputting-the-masked-words.ppm" />

DistilBERT is a "lightweight" version of BERT; later lectures cover it in more detail.

In this seminar we will use BERT (or DistilBERT) to obtain vector representations of text, and then a simple model on top of them to solve a classification task. The classification task here is sentiment analysis.

### Models: Sentence Sentiment Classification

Since we are doing transfer learning, our model consists of two parts:

* DistilBERT processes the sentence and passes the information it extracted on to the next model. DistilBERT is a smaller version of BERT developed and open-sourced by the team at HuggingFace. It is lighter and faster than BERT while roughly matching its performance.
* The next model, a basic logistic regression model from scikit-learn, takes in the result of DistilBERT's processing and classifies the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding of the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the Hugging Face transformers library

In [1]:
# !pip install -q transformers # -q for quiet
import numpy as np
import os
import pandas as pd
import pathlib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import torch
import transformers
from transformers import DistilBertModel, DistilBertTokenizer
import warnings

warnings.filterwarnings("ignore")



## Data

In [2]:
from beholder import download_file_from_url

url = "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv"
dest_folder = "data"
file_name = "pytorch-sentiment-classification_master_data_SST2_train.tsv"
# download_file_from_url(url=url, file_name=file_name, dest_folder=dest_folder)
file_path = download_file_from_url(url=url, file_name=file_name, dest_folder=dest_folder)


File already exists at data/pytorch-sentiment-classification_master_data_SST2_train.tsv


In [None]:
df = pd.read_csv(file_path, delimiter="\t", header=None)
df.head()

In [None]:
df.shape

Let's take the first 2,000 rows.

In [None]:
batch_1 = df[:2000]

Check the class distribution (the classes are evenly distributed).

In [None]:
batch_1[1].value_counts(), batch_1.columns

## Loading the Pre-trained BERT model from HF

In [3]:
from transformers import DistilBertTokenizer, DistilBertModel
from beholder import get_or_download_llm_model


LLM_NAME = "distilbert-base-uncased" # or "bert-base-uncased"
dest_folder = pathlib.Path("data") / "llm_models"
tokenizer, model = get_or_download_llm_model(LLM_NAME, dest_folder, DistilBertTokenizer, DistilBertModel)
# tokenizer, model = get_or_download_llm_model(LLM_NAME, dest_folder, BertTokenizer, BertModel) # for bert-base-uncased
model


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

In [None]:
from beholder import print_methods

print_methods(tokenizer)

Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.


### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.


In [None]:
batch_1[0][:5]  # first 5 sentences of pandas dataframe

In [None]:
tokenized = batch_1[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [None]:
tokenized[:5]  # token indices in the tokenizer's vocabulary

The `[CLS]` token is added to the beginning of every text. Its embedding will serve as the embedding of the whole text when we classify texts.

In [None]:
tokenized[0][0]

In [None]:
tokenizer.decode(tokenized[0])

In [None]:
tokenizer.vocab["[CLS]"], tokenizer.vocab["[SEP]"]
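
To make the role of the special tokens concrete, here is a minimal sketch of what `add_special_tokens=True` does for a single sentence. The function name `wrap_with_special_tokens` is hypothetical, the input IDs are arbitrary placeholders, and the defaults 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs of the `bert-base-uncased` vocabulary; check them against `tokenizer.vocab` as in the cell above.

```python
def wrap_with_special_tokens(token_ids, cls_id=101, sep_id=102):
    """Return [CLS] + token_ids + [SEP] as a new list of IDs."""
    return [cls_id] + list(token_ids) + [sep_id]


example_ids = [2000, 2001, 2002]  # arbitrary placeholder IDs, not real words
wrapped = wrap_with_special_tokens(example_ids)
print(wrapped)
```

The first position of every wrapped sequence is therefore always `[CLS]`, which is why its output vector can later be used as a sentence embedding.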

Tokenization details: [WordPiece tokenization](https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46). It is an effective way to deal with OOV (out-of-vocabulary) words: if a word is not in the vocabulary, split it into known pieces.
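
WordPiece splits a word greedily: it repeatedly takes the longest prefix of the remaining characters that is in the vocabulary and marks non-initial pieces with `##`. The sketch below illustrates this longest-match-first idea with a tiny hypothetical vocabulary (`toy_vocab`); the real tokenizer uses the model's full vocabulary and has extra details such as a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # no piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# hypothetical toy vocabulary, for illustration only
toy_vocab = {"are", "co", "##work", "python", "##ist", "##a"}

print(wordpiece_tokenize("cowork", toy_vocab))
print(wordpiece_tokenize("pythonista", toy_vocab))
print(wordpiece_tokenize("Are", toy_vocab))  # uppercase 'A' is not in the vocabulary
```

Because the lookup is case-sensitive, a capitalized word fails to match a lowercase vocabulary entry and falls back to the unknown token.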

In [None]:
tokenizer.wordpiece_tokenizer.tokenize(
    "Are you interested at LLM"
), tokenizer.wordpiece_tokenizer.tokenize(("Are you interested at LLM").lower())

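Here the words 'Are' and 'LLM' aren't found in the vocabulary because they are not lowercased: the `uncased` model's vocabulary stores words in lowercase, and calling `wordpiece_tokenizer` directly skips the lowercasing step that `tokenizer.encode` applies.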

In [None]:
tokenizer.wordpiece_tokenizer.tokenize("pythonista")

In [None]:
tokenizer.wordpiece_tokenizer.tokenize("cowork")  # it should be co ##work

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
To process the sentences as a BATCH, they must all have the same length. We align them by padding the shorter ones with zero tokens.

In [None]:
# %%timeit
# find the longest sentence
max_len = 0
for i in tokenized.values:  # 345 µs
    if len(i) > max_len:
        max_len = len(i)

# max_len = np.max(list(map(lambda x: len(x), tokenized.values)))  # 1.7 ms
# max_len = np.max(tokenized.map(lambda x: len(x)))          # 1.8 ms
# max_len = np.max(tokenized.apply(lambda x: len(x)))         # 1.8 ms


# pad every sequence that is shorter than max_len with zeros
padded = np.array([list_ + [0] * (max_len - len(list_)) for list_ in tokenized.values])
padded.shape, padded[-1]

Padding with Keras. Before that, let's look at the tokenizer's special tokens and their indices:

In [None]:
# retrieve all the special token strings
special_tokens = list(tokenizer.special_tokens_map.values())
# look up their vocabulary indices and then show them
special_tokens_ids = [tokenizer.vocab[tok] for tok in special_tokens]

for sp_tok, sp_tok_ids in zip(special_tokens, special_tokens_ids):
    print(f"Special token {sp_tok:>9} index: {sp_tok_ids}")

In [None]:
tokenizer.pad_token, tokenizer.pad_token_id, tokenizer.ids_to_tokens[0], tokenizer.pad_token == tokenizer.ids_to_tokens[0]

In [None]:
# padding with keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(
    sequences=tokenized,
    maxlen=None,        # None -- pad to the length of the longest sentence in the batch
    padding='post',     # 'pre' -- add padding at the beginning of the sentence, 'post' -- at the end
    truncating='post',  # 'post' -- cut the sentence from the end, 'pre' -- from the beginning
    value=tokenizer.pad_token_id,  # token ID used for padding
)
padded.shape, padded[-1]

In [None]:
# decode list of special tokens
tokenizer.decode(special_tokens_ids)

In [None]:
padded.shape # (2000, 59) -- 2000 sentences, 59 tokens in the longest sentence

### Masking

Now we create a separate variable that tells BERT to ignore the padding when computing attention.

In [None]:
attention_mask = np.where(padded != 0, 1, 0) # 1 if token is not a padding token, 0 if it is a padding token
attention_mask.shape

In [None]:
attention_mask, attention_mask.shape
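
The padding and masking steps can also be sketched together in plain Python. This is an illustrative helper (the name `pad_and_mask` and the toy IDs are assumptions, not part of the notebook). It builds the mask from the original sequence lengths rather than from `padded != 0`, so it would stay correct even for a tokenizer whose padding ID is not 0.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad token ID lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded_rows, mask_rows = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded_rows.append(list(seq) + [pad_id] * n_pad)
        mask_rows.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded_rows, mask_rows


toy_batch = [[101, 7, 8, 102], [101, 9, 102]]  # toy token IDs
toy_padded, toy_mask = pad_and_mask(toy_batch)
print(toy_padded)
print(toy_mask)
```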

## Running BERT

`model()` runs the sentences through BERT.

In [None]:
# check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

Here we feed 2000 sentences to BERT.  
BERT returns a vector of size 768 for each token of each sentence.

In [None]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad(): # disable gradient calculation for inference
    last_hidden_states = model(
        input_ids.to(device), attention_mask=attention_mask.to(device)
    )
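
Passing all 2,000 sentences in a single call may need more memory than a small GPU has. One common workaround, which is not part of the original notebook, is to run the model on smaller chunks. The helper below (`iter_batches` is a hypothetical name) only shows the chunking logic on a plain list.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = list(iter_batches(list(range(10)), batch_size=4))
print(chunks)
```

The same slicing works on `input_ids` and `attention_mask`; the per-chunk outputs could then be joined with `torch.cat` along the first dimension.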

Let's slice only the part of the output that we need.  
That is the output corresponding to the first token of each sentence.  
The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification)  
at the beginning of every sentence.  
The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

From that output we take only the representation of the first token, `[CLS]`. This representation will be our features.
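
The slicing itself is ordinary array indexing. A small sketch on a random NumPy array that stands in for the model output (the shape `(4, 6, 768)` and the name `toy_hidden` are illustrative assumptions):

```python
import numpy as np

batch_size, seq_len, hidden_size = 4, 6, 768
rng = np.random.default_rng(0)
# stand-in for the last hidden state: one 768-dim vector per token per sentence
toy_hidden = rng.normal(size=(batch_size, seq_len, hidden_size))

# keep every sentence, only position 0 (the [CLS] slot), and all hidden units
cls_vectors = toy_hidden[:, 0, :]
print(cls_vectors.shape)
```

The token dimension disappears, leaving one row of features per sentence.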

In [None]:
last_hidden_states[0].shape

In [None]:
features = last_hidden_states[0][:, 0, :].cpu().numpy() # take ONLY the FIRST token of the last hidden state -- [CLS] token; move it to the CPU and convert it to a numpy array

In [None]:
features.shape # (2000, 768) -- 2000 sentences, 768 features for every [CLS] token

Labels:

In [None]:
labels = batch_1[1]

## Logistic regression on BERT features
Split the data into training and test sets.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels
)

We now train the logistic regression model. If you run a grid search over the regularization strength `C`, you can plug the best value into the model declaration (e.g. `LogisticRegression(C=5.2)`).
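
The grid search itself is not shown in this notebook. A minimal sketch of how it could look with scikit-learn's `GridSearchCV`, using synthetic data in place of the BERT features (the data, the grid of `C` values, and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for (features, labels): 200 samples, 16 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 16))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# candidate regularization strengths (smaller C = stronger regularization)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_toy, y_toy)

print("best C:", search.best_params_["C"])
print("best cross-validated accuracy:", search.best_score_)
```

With the real data you would fit the search on `train_features` and `train_labels` only, so that the test set stays untouched.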

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

## Evaluating the result
Accuracy on the test set:

In [None]:
lr_clf.score(test_features, test_labels)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(test_labels, lr_clf.predict(test_features)))  # argument order is (y_true, y_pred)

Let's compare with a DummyClassifier baseline:

In [None]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy="most_frequent", random_state=0)

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
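
For intuition, the `most_frequent` strategy simply predicts the majority class of the training labels for every input. A plain-Python sketch of that idea (the function name and the toy labels are illustrative, not scikit-learn's implementation):

```python
from collections import Counter


def most_frequent_baseline(train_labels, test_labels):
    """Predict the most common training label everywhere; return it with the test accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


toy_train = [1, 1, 0, 1, 0]
toy_test = [1, 0, 1, 1]
majority, baseline_acc = most_frequent_baseline(toy_train, toy_test)
print(majority, baseline_acc)
```

On a balanced dataset such a baseline lands near 50% accuracy, which is the number the logistic regression on BERT features has to beat.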

### Using BERT for word embeddings

In [None]:
s1 = "let's count the mean and median value"
s2 = "the mean sum of money spent"
s3 = "mean girls dont say hello"

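Here we have three sentences that use the word 'mean' in different senses: the first two use it in the statistical sense (an average), the third as an adjective ('unkind').  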


Let's see how BERT will embed them.

In [None]:
s1_tok = tokenizer.encode(s1)
s2_tok = tokenizer.encode(s2)
s3_tok = tokenizer.encode(s3)
s1_tok, s2_tok, s3_tok

The word `mean` has token number `2812`.

In [None]:
word_index = tokenizer.vocab["mean"]
ind1, ind2, ind3 = (
    s1_tok.index(word_index),
    s2_tok.index(word_index),
    s3_tok.index(word_index),
)
ind1, ind2, ind3 # index of the word "mean" in the sentence (in the list of tokens)

Here we'll use the same model, but we'll get the embeddings for each token in the sentence.  
We'll use the output of the last hidden layer of the model.  
This is a 3D tensor of shape `(batch_size, max_length, hidden_size=768)`.  
We'll take the vector at the position of the token `mean` in each sentence and compare them.

In [None]:
# retrieve the hidden states for the sentences (the output embeddings from the final transformer layer)
with torch.no_grad():
    # the model expects a batch, so wrap each sentence in a list to add the batch dimension;
    # the inputs must live on the same device as the model
    last_hidden_states1 = model(torch.tensor([s1_tok]).to(device))
    last_hidden_states2 = model(torch.tensor([s2_tok]).to(device))
    last_hidden_states3 = model(torch.tensor([s3_tok]).to(device))


Once more: the last hidden state holds a contextual vector for every token.  
We can also retrieve the embeddings for all three sentences in a single batch.

In [None]:
# we can do the same by leveraging batching
# padding is needed to align the sentence lengths
padded = pad_sequences(
    sequences=[s1_tok, s2_tok, s3_tok],
    maxlen=None,        # None -- pad to the length of the longest sentence in the batch
    padding='post',     # 'pre' -- add padding at the beginning of the sentence, 'post' -- at the end
    truncating='post',  # 'post' -- cut the sentence from the end, 'pre' -- from the beginning
    value=tokenizer.pad_token_id,  # token ID used for padding
)

# padded is a 2D numpy array of shape (3, 11) -- 3 sentences, 11 tokens in the longest sentence
# after padding we need an attention mask so that the model does not attend to the padding tokens
attention_mask = np.where(padded != 0, 1, 0)  # 1 for real tokens, 0 for padding tokens

# both inputs must be tensors on the same device as the model
with torch.no_grad():
    last_hidden_states4 = model(
        torch.tensor(padded).to(device),
        attention_mask=torch.tensor(attention_mask).to(device),
    )  # we can also pass a batch of sentences

In [None]:
print(type(last_hidden_states4)) # a tuple or a BaseModelOutput, depending on the transformers version
print(type(last_hidden_states4[0])) # first element is a tensor
print((last_hidden_states4[0]).shape) # (3, 11, 768) -- 3 sentences, 11 tokens in the longest sentence, 768 features for every token

In [None]:
last_hidden_states4[0][0, 0, :][:5] # first 5 elements of the first token of the first sentence of the batch

# compare the [CLS] embeddings from the single-sentence runs with those from the batched run
print(np.allclose(last_hidden_states1[0][0, 0, :].cpu().numpy(), last_hidden_states4[0][0, 0, :].cpu().numpy(), rtol=1e-2))  # rtol -- relative tolerance
print(np.allclose(last_hidden_states2[0][0, 0, :].cpu().numpy(), last_hidden_states4[0][1, 0, :].cpu().numpy(), rtol=1e-2))
print(np.allclose(last_hidden_states3[0][0, 0, :].cpu().numpy(), last_hidden_states4[0][2, 0, :].cpu().numpy(), rtol=1e-2))

In [None]:
last_hidden_states2[0][0, 0, :].cpu().numpy()[:5]

In [None]:
last_hidden_states4[0][1, 0, :].cpu().numpy()[:5]

So, the same word in different contexts will have different embeddings. 

In [None]:
word1_emb = last_hidden_states1[0][0, ind1, :]
word2_emb = last_hidden_states2[0][0, ind2, :]
word3_emb = last_hidden_states3[0][0, ind3, :]

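# batched run: pick the position of "mean" in each sentence (it differs between sentences)
word4_emb = last_hidden_states4[0][[0, 1, 2], [ind1, ind2, ind3], :]
word1_emb.shape, word2_emb.shape, word3_emb.shape, word4_emb.shape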

Now let's compute the cosine similarity between them:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# 'mean' from the third sentence is the farthest from the other two
cosine_similarity([word1_emb.cpu().numpy(), word2_emb.cpu().numpy(), word3_emb.cpu().numpy()])
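
Cosine similarity between two vectors `u` and `v` is their dot product divided by the product of their lengths, `u·v / (|u| |v|)`. It is 1 for vectors that point in the same direction and 0 for orthogonal ones. A plain-Python sketch on toy 2D vectors (the function name `cosine_sim` and the vectors are illustrative):

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0]  # orthogonal to a

print(cosine_sim(a, b))
print(cosine_sim(a, c))
```

Because the measure ignores vector length, it compares only the direction of the embeddings, which is why it is a common choice for comparing word vectors.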

## Russian -- DeepPavlov

In [5]:
# tokenizer = transformers.AutoTokenizer.from_pretrained()
LLM_NAME = "DeepPavlov/rubert-base-cased"
dest_folder = pathlib.Path("data") / "llm_models"
tokenizer, model = get_or_download_llm_model(LLM_NAME, dest_folder, transformers.AutoTokenizer, transformers.AutoModel)

model


In [None]:
# model = transformers.AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

In [None]:
tokenizer.tokenize("вертолёт летит по небу")

In [None]:
tokenizer.tokenize("синхрофазатрон")