This seminar was prepared with the help of the following materials:
- [bertviz tool demo](https://colab.research.google.com/drive/1YoJqS9cPGu3HL2_XExw3kCsRBtySQS2v?usp=sharing#scrollTo=bYs0L8Ftt_Hu);
- [How to use BERT from HuggingFace](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)


![alt text](https://hsto.org/webt/uh/cd/qv/uhcdqv--w2t4i8srv9rtzjgk9ac.png)

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="huggingface_hub")

In [2]:
# the main install of the whole notebook

!pip install transformers datasets bertviz -q

## 1.Byte-pair-encoding

Byte-pair encoding (BPE) is a simple data compression algorithm first [introduced in 1994](https://www.derczynski.com/papers/archive/BPE_Gage.pdf). It was later reintroduced in NLP for the task of word segmentation in [this article](https://arxiv.org/pdf/1508.07909.pdf). BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very suitable word segmentation strategy for neural network models.

The code below shows a toy example of learned BPE operations. At test time, we first split words into sequences of characters, then apply the learned operations to merge the characters into larger, known symbols. This is applicable to any word, and allows for open-vocabulary networks with fixed symbol vocabularies. In our example, the word ‘lower’ would be segmented into ‘low er·’.
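
The test-time step described here can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `apply_merges` and the merge list `toy_merges` are hypothetical, and the merges are assumed to be replayed in the order in which they were learned.

```python
def apply_merges(word, merges):
    """Segment one word by replaying learned BPE merges in order."""
    # Start from single characters plus the end-of-word marker.
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge operations, listed in the order they were learned.
toy_merges = [("r", "</w>"), ("l", "o"), ("lo", "w"), ("e", "r</w>")]

print(apply_merges("lower", toy_merges))  # ['low', 'er</w>']
```

Because the procedure starts from characters, a word that never occurred in training is still segmented into known symbols (single characters in the worst case).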

![alt text](https://alexanderdyakonov.files.wordpress.com/2019/11/bpe.jpg)

Source: [Subword Tokenization](https://dyakonov.org/2019/11/29/%D1%82%D0%BE%D0%BA%D0%B5%D0%BD%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F-%D0%BD%D0%B0-%D0%BF%D0%BE%D0%B4%D1%81%D0%BB%D0%BE%D0%B2%D0%B0-subword-tokenization/)

### 1.1.BPE simple version

In [14]:
import re, collections

def get_stats(vocab: dict[str, int]) -> dict[tuple[str, str], int]:
  """Count adjacent symbol pairs, weighted by word frequency."""
  pairs = collections.defaultdict(int)
  for word, freq in vocab.items():  # iterate over words and their frequencies
    symbols = word.split()
    for i in range(len(symbols) - 1):  # add the word frequency to each adjacent pair
      pairs[symbols[i], symbols[i + 1]] += freq
  return pairs

# (?<!\S) is a negative lookbehind: what precedes the match is not a non-whitespace character
#   http://www.rexegg.com/regex-disambiguation.html#lookbehind
# (?!\S) is a negative lookahead: what follows the match is not a non-whitespace character
#   http://www.rexegg.com/regex-disambiguation.html#negative-lookahead
def merge_vocab(pair: tuple[str, str], v_in: dict[str, int]) -> dict[str, int]:
  """Merge every occurrence of `pair` in the vocabulary into a single symbol."""
  v_out = {}
  # Join the pair with a space and escape regex metacharacters.
  # Example: pair = ('a', '+') gives bigram = 'a\\ \\+' (re.escape also escapes the space).
  bigram = re.escape(' '.join(pair))
  # Match the bigram only when it is delimited by whitespace or by the string boundaries,
  # so that whole symbols are matched and never the head or tail of a longer symbol.
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
  for word in v_in:
    # Replace every occurrence with the merged symbol (no space),
    # e.g. p.sub('ab', 'a b') gives 'ab'.
    w_out = p.sub(''.join(pair), word)
    # Store the rewritten word with its original frequency from v_in.
    v_out[w_out] = v_in[word]
  return v_out

vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,'n e w e s t </w>':6, 'w i d e s t </w>':3}

num_merges = 10
for i in range(num_merges):
  pairs = get_stats(vocab)
  print("pairs_loop", pairs)
  best = max(pairs, key=pairs.get) #get the characters pair with the highest frequency
  print("best", best)
  vocab = merge_vocab(best, vocab)
  print("vocab", vocab)
  print("="*100)

pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 8, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('e', 's'): 9, ('s', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3})
best ('e', 's')
vocab {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'es'): 6, ('es', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'es'): 3})
best ('es', 't')
vocab {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est'): 6, ('est', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est'): 3})
best ('est', '</w>')

### 1.2.Transformers tokenizers

In [16]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# subwords:'gp', '##u'
print(tokenizer.tokenize("I have a new GPU!"))

['i', 'have', 'a', 'new', 'gp', '##u', '!']


In [30]:
tokenizer.tokenize("I have a new GPU!") # Converts a string into a sequence of tokens, using the tokenizer (WordPiece)

['i', 'have', 'a', 'new', 'gp', '##u', '!']
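
The `##` prefix marks a piece that continues the current word rather than starting a new one. The following is a minimal sketch of greedy longest-match-first segmentation, the strategy WordPiece tokenizers use to split a single word at inference time. The function name `wordpiece` and the vocabulary `toy_vocab` are assumptions made for illustration; the real `BertTokenizer` additionally lowercases the text and splits off punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"new", "gp", "g", "##p", "##u"}

print(wordpiece("gpu", toy_vocab))  # ['gp', '##u']
```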

In [32]:
tokenizer.encode("I have a new GPU!") # Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.

[101, 1045, 2031, 1037, 2047, 14246, 2226, 999, 102]

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode("I have a new GPU!")) # Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.

['[CLS]', 'i', 'have', 'a', 'new', 'gp', '##u', '!', '[SEP]']

In [9]:
tokenizer.decode(tokenizer.encode("I have a new GPU!"))

'[CLS] i have a new gpu! [SEP]'

In [10]:
tokenizer.decode(tokenizer.encode("I have a new GPU!"), skip_special_tokens=True)

'i have a new gpu!'

In [11]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [12]:
tokenizer.eos_token  # BERT defines no end-of-sequence token, so this evaluates to None (hence no output)

In [13]:
tokenizer("I have a new GPU!")

{'input_ids': [101, 1045, 2031, 1037, 2047, 14246, 2226, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [14]:
encoding = tokenizer("I have a new GPU!", add_special_tokens=True, truncation=True,
                     padding="max_length", return_attention_mask=True, return_tensors="pt")

In [15]:
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [16]:
encoding

{'input_ids': tensor([[  101,  1045,  2031,  1037,  2047, 14246,  2226,   999,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

## 2.Attention Visualization

In [17]:
!pip install bertviz



In [18]:
# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel, BertForQuestionAnswering

model_version = 'bert-base-uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version, output_attentions=True, attn_implementation="eager")
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

In [19]:
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
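
With `output_attentions=True`, the model returns one attention tensor per layer, each holding one weight matrix per head. As a rough illustration of what a single one of those matrices contains, here is a pure-Python sketch of scaled dot-product attention weights for one head. The function name and the toy query/key vectors are assumptions; in BERT the queries and keys come from learned linear projections of the hidden states.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows, one row per query token."""
    d = len(keys[0])
    rows = []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Hypothetical 2-dimensional queries and keys for three tokens.
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(toy_queries, toy_keys)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row sums to 1 and says how strongly one token attends to every token in the sequence; the visualizations draw these rows as lines of varying intensity.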

### 2.1.Model View

The model view gives a bird's-eye view of attention across all of the layers (rows) and heads (columns) in the model. In this case we are showing bert-base, which has 12 layers and 12 heads (zero-indexed).

In [20]:
model_view(attention, tokens, sentence_b_start)

<IPython.core.display.Javascript object>

### 2.2.Head view

The attention-head view visualizes attention in one or more heads in a particular layer of the model.

In [21]:
head_view(attention, tokens, sentence_b_start)

<IPython.core.display.Javascript object>

## 3.Text Classification using BERT

### 3.1.Work with Transformers

In [22]:
import transformers

GitHub: https://github.com/huggingface/transformers


Examples of how to do common tasks:
https://github.com/huggingface/transformers/tree/master/examples


All available Hugging Face models are listed here:
https://huggingface.co/models

The library is built around three types of classes for each model:

* ***model classes***, e.g., `BertModel`: ~100 PyTorch models (`torch.nn.Module` subclasses) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model` subclasses.

* ***configuration classes***, e.g., `BertConfig`, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you use a pretrained model without any modification, creating the model automatically instantiates the configuration (which is part of the model).

* ***tokenizer classes***, e.g., `BertTokenizer`, which store the vocabulary for each model and provide methods for encoding strings into lists of token indices to be fed to a model, and for decoding them back.

All these classes can be instantiated from pretrained instances and saved locally using two methods:

* *from_pretrained()* lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided) or stored locally (or on a server) by the user,

* *save_pretrained()* lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().

`AutoModel` (`AutoModelFor*`) and `AutoTokenizer` are special classes that resolve to the matching model-specific class (such as `BertModel` or `BertTokenizer`) based on the checkpoint loaded into them.
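
The dispatch idea behind the `Auto*` classes can be sketched without the library. The snippet is a toy stand-in, not the Transformers source: every class and function in it (`ToyBertModel`, `ToyGPT2Model`, `ToyAutoModel`, `toy_save_pretrained`) is hypothetical. It relies on one fact about real checkpoints, namely that the saved `config.json` carries a `model_type` field from which the concrete class is chosen.

```python
import json
import os
import tempfile


class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Maps the model_type stored in config.json to a concrete class.
TOY_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_save_pretrained(config, path):
    """Write the configuration next to where the weights would go."""
    with open(os.path.join(path, "config.json"), "w") as f:
        json.dump(config, f)


class ToyAutoModel:
    @staticmethod
    def from_pretrained(path):
        """Read config.json and build whichever class it names."""
        with open(os.path.join(path, "config.json")) as f:
            config = json.load(f)
        return TOY_REGISTRY[config["model_type"]](config)


with tempfile.TemporaryDirectory() as checkpoint_dir:
    toy_save_pretrained({"model_type": "bert", "hidden_size": 768}, checkpoint_dir)
    toy_model = ToyAutoModel.from_pretrained(checkpoint_dir)

print(type(toy_model).__name__)  # ToyBertModel
```

The caller never names the architecture; it is recovered from what was saved, which is why the same loading line works for different checkpoints.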


#### 3.1.1.Masked Language Modeling

In [23]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another archite

In [24]:
model.eval();

In [25]:
text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
inputs = tokenizer.encode_plus(text, return_tensors="pt")  # `inputs` avoids shadowing the built-in `input`
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]  # position of [MASK]

with torch.no_grad():
    output = model(**inputs)
    logits = output.logits
    probs = F.softmax(logits, dim=-1)
    mask_probs = probs[0, mask_index, :]  # shape: (number of masks, vocabulary size)
    top_10 = torch.topk(mask_probs, 10, dim=1).indices[0]  # ten most probable token ids for the mask
for token in top_10.tolist():
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

The capital of France, paris, contains the Eiffel Tower.
The capital of France, lyon, contains the Eiffel Tower.
The capital of France, lille, contains the Eiffel Tower.
The capital of France, toulouse, contains the Eiffel Tower.
The capital of France, marseille, contains the Eiffel Tower.
The capital of France, orleans, contains the Eiffel Tower.
The capital of France, strasbourg, contains the Eiffel Tower.
The capital of France, nice, contains the Eiffel Tower.
The capital of France, cannes, contains the Eiffel Tower.
The capital of France, versailles, contains the Eiffel Tower.


In [26]:
text

'The capital of France, [MASK], contains the Eiffel Tower.'

In [27]:
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

#### 3.1.2.Language Modeling

BERT can be fine-tuned as a decoder (with a causal attention mask, to predict the next token).

Because the checkpoint already includes an MLM head, the output layer of the decoder can be initialized from it. However, without fine-tuning, such a model performs poorly.
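
A causal mask simply forbids each position from attending to later positions. A minimal pure-Python sketch follows; the helper names `causal_mask` and `masked_softmax` are assumptions for illustration, and libraries usually implement the same effect by adding a large negative number to the masked scores.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions; masked positions get weight 0."""
    exps = [math.exp(s) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
for row in mask:
    print(row)

# With equal scores, the second token splits its attention over tokens 0 and 1 only.
print(masked_softmax([0.0, 0.0, 0.0], mask[1]))  # [0.5, 0.5, 0.0]
```

An encoder such as BERT is pretrained without this mask, so every token sees the whole sentence, which is one reason the decoder variant needs fine-tuning.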

In [28]:
!pip install transformers



In [29]:
from transformers import BertTokenizer, BertLMHeadModel
import torch
from torch.nn import functional as F

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertLMHeadModel.from_pretrained('bert-base-uncased', return_dict=True, is_decoder=True)

text = "A knife is very "
inputs = tokenizer.encode_plus(text, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[:, -1, :]  # scores at the last position
probs = F.softmax(next_token_logits, dim=-1)
index = torch.argmax(probs, dim=-1)  # greedy choice of the next token
print(tokenizer.decode(index))

.


#### 3.1.3.Next Sentence Prediction

In [30]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
from torch.nn import functional as F
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

first_sentence = "London is the capital of Great Britain"
next_sentence = "I like playing football."
encoding = tokenizer.encode_plus(first_sentence, next_sentence, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoding)[0]
    softmax = F.softmax(outputs, dim = 1)
print(softmax)

tensor([[7.5438e-04, 9.9925e-01]])
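
To read this output: in the Hugging Face implementation, index 0 of the next-sentence head stands for "sentence B continues sentence A" and index 1 for "sentence B is a random sentence". A small sketch of that interpretation, using the probabilities printed by the cell (the helper `interpret_nsp` and the label strings are hypothetical):

```python
NSP_LABELS = {0: "B continues A", 1: "B is a random sentence"}


def interpret_nsp(probabilities):
    """Return the most probable NSP label together with its probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return NSP_LABELS[best], probabilities[best]


label, confidence = interpret_nsp([7.5438e-04, 9.9925e-01])
print(label, confidence)  # B is a random sentence 0.99925
```

So the model is almost certain that "I like playing football." does not follow the sentence about London.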


In [31]:
encoding

{'input_ids': tensor([[ 101, 2414, 2003, 1996, 3007, 1997, 2307, 3725,  102, 1045, 2066, 2652,
         2374, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
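
The layout of a sentence pair can be reproduced by hand: `[CLS] A [SEP] B [SEP]`, with `token_type_ids` equal to 0 for the first segment (including `[CLS]` and the first `[SEP]`) and 1 for the second. A minimal sketch, assuming the special-token ids 101 and 102 shown earlier for this tokenizer; the function `build_pair` is hypothetical and skips truncation and padding.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble BERT inputs for a sentence pair from already tokenized ids."""
    first = [cls_id] + ids_a + [sep_id]   # segment 0
    second = ids_b + [sep_id]             # segment 1
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }


# Token ids of the two sentences, copied from the encoding printed by the cell.
ids_a = [2414, 2003, 1996, 3007, 1997, 2307, 3725]
ids_b = [1045, 2066, 2652, 2374, 1012]

pair = build_pair(ids_a, ids_b)
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```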

#### 3.1.4.Pipelines

[Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) in the Hugging Face Transformers library are abstractions that bundle a model with its tokenizer, so that a task can be run on raw text with a single call.



In [32]:
from transformers import pipeline

# Allocate a pipeline for question-answering
question_answerer = pipeline('question-answering', model="distilbert/distilbert-base-cased-distilled-squad")
question_answerer({
     'question': 'What is the name of the repository ?',
     'context': 'Pipeline have been included in the huggingface/transformers repository'
})

Device set to use cuda:0


{'score': 0.5135963559150696,
 'start': 35,
 'end': 59,
 'answer': 'huggingface/transformers'}

In [33]:
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline(
    task="sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"  # multilingual model
)
classifier('We are very happy to use transformers repository.')

Device set to use cuda:0


[{'label': '5 stars', 'score': 0.7796960473060608}]

### 3.2.Application Example

We will fine-tune a BERT-based model to classify [restaurant reviews](https://huggingface.co/datasets/blinoff/restaurants_reviews).

In [34]:
!pip install datasets



In [35]:
import pandas as pd
import numpy as np
import torch
from tqdm.auto import tqdm, trange

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
from datasets import Dataset

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [36]:
df = pd.read_json('https://huggingface.co/datasets/blinoff/restaurants_reviews/resolve/main/restaurants_reviews.jsonl', lines=True)

In [37]:
pd.options.display.max_colwidth = 300

In [38]:
print(df.shape)
df.sample(3)

(47139, 6)


Unnamed: 0,review_id,general,food,interior,service,text
46400,46400,0,0,0,0,"Нам этот ресторан посоветовали друзья , которые каждый день приходят в это заведение ! Мы долго искали место где отметить свою свадьбу , ведь хотелось провести этот день незабываемым !!! С первого обращения к администрации ресторана мы поняли , что мероприятие пройдет на высшем уровне !! )) Зал..."
33117,33117,0,10,10,10,"Узнала про этот ресторанчик в CASA del МЯСО , где ужинала с другом . Так как я не особый фанат мясной кухни , в отличие от моей бой-френда , решили , что в следующий раз , ресторан выберу я . В один из вечеров отправились в Супер Марио , забили в навигаторе адрес , указынные на визитке , но , ..."
42062,42062,0,0,0,0,"Ситуация повторяется . Позвонила забронировать столик , со мной общался какой-то молодой человек , не совсем вежливо , и также в конце разговора кинул трубку . Администрация , научите своих сотрудников общаться с клиентами . А то только лишь при звонке уже портится впечатление о заведении . ..."


In [39]:
df.groupby('general').sample(1)

Unnamed: 0,review_id,general,food,interior,service,text
751,751,0,9,5,3,"Дива говорите ? А по-моему уж слишком больно намазанная для дивы-то ... Ну , да ладно . Баба дело 25 . Очень вкусно , очень приятно , можно покурить на улице ( для меня это очень важно ! ) . Обслуживание , конечно , хромает на все ноги , но , думаю , поправима ситуация . Буду ходить - мне ..."
34699,34699,1,0,0,0,"В декабре играли свадьбу в ресторане "" Фаина "" . При бронировании ресторана нам предложили услуги ведущего Олега и Натальи . Нам дали время подумать , сказав что свои услуги не навязывают и что у нас есть право выбора ! Выбор был сделан и сделан не в пользу Олега и Натальи ... На второй встр..."
35928,35928,2,0,0,0,"Доброго всем времени суток ! Вообще этот ресторан любим - хорошая кухня , приемлемые цены , демократичная обстановка . Ходили туда часто с семьей поужинать , даже "" наели "" на скидочную карту постоянного гостя . В последнее время заметили , что качество блюд ухудшилось .. хотя на мелочи стара..."
36999,36999,3,0,0,0,"приехал с друзьями из Новегии на пару дней в Москву , зашли в этот , так называемый ресторан , максимум закусочная . Как в Москве , столице России , возле Олимпийской деревни можно открывать "" ресторан "" , где официанты не только не говорят по английски , но даже не говорят по русски ? Сервис ..."
35208,35208,4,0,0,0,"Хорошее заведение . Замечательный обслуживающий персонал . Мы посещали этот бар в ноябре , музыка была классная , эмоций было море , танцевали до упаду , DJ Валинтин ставил любую песню на твой выбор , а сейчас один галимый клубняк , не вся музыка есть в наличии у DJ или ставить не хочет , отго..."
35421,35421,5,0,0,0,"Отдыхала с друзьями в прошлую пятницу . Остались самые приятные воспоминания о проведенном времени . Порадовала атмосфера , музыка , отзывчивость и уровень обсуживания персонала . Не в каждом такое встретишь ! Свой ДР буду справлять только там )"


In [40]:
df.general.value_counts()

| general | count |
|---|---|
| 0 | 43940 |
| 5 | 2164 |
| 1 | 462 |
| 4 | 257 |
| 2 | 166 |
| 3 | 150 |


In [41]:
g = df[df.general>0]

data = Dataset.from_dict({'text': g.text, 'label': g.general-1}).train_test_split(test_size=0.2, seed=1)
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2559
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 640
    })
})

The base model is [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base) 🤗

In [42]:
base_model = 'ai-forever/ruBert-base'

In [43]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [44]:
data_tokenized = data.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])

In [45]:
data_tokenized

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2559
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 640
    })
})

In [46]:
print(data_tokenized['train'][0])

{'label': 1, 'input_ids': [101, 945, 86782, 1055, 736, 1613, 965, 3844, 110, 11239, 126, 57893, 133, 2065, 734, 24350, 110755, 46789, 1151, 1712, 702, 378, 160, 57031, 17398, 27204, 49342, 650, 378, 158, 41832, 121, 4024, 9198, 57741, 680, 107, 5850, 56602, 52417, 126, 6167, 20220, 24326, 63915, 8928, 378, 121, 750, 22008, 1179, 53362, 177, 107, 36466, 110870, 14394, 133, 18777, 64866, 126, 43752, 5608, 20473, 4305, 78726, 133, 40816, 945, 1003, 672, 58207, 656, 126, 3966, 64440, 1721, 107, 3313, 126, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [47]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)
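
The collator pads each batch only up to its own longest example (dynamic padding), which is cheaper than padding everything to 512 tokens. A pure-Python sketch of the idea, with a hypothetical helper `pad_batch`; the real `DataCollatorWithPadding` also pads `token_type_ids`, renames `label` to `labels`, and returns tensors.

```python
def pad_batch(examples, pad_id=0):
    """Pad every example to the length of the longest one in this batch."""
    longest = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for e in examples:
        n_real = len(e["input_ids"])
        n_pad = longest - n_real
        batch["input_ids"].append(e["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that attention should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Hypothetical token ids for two examples of different length.
toy_examples = [
    {"input_ids": [101, 5, 102]},
    {"input_ids": [101, 5, 6, 7, 102]},
]

toy_batch = pad_batch(toy_examples)
print(toy_batch["input_ids"])       # [[101, 5, 102, 0, 0], [101, 5, 6, 7, 102]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```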

In [48]:
train_dataloader = DataLoader(data_tokenized['train'], shuffle=True, batch_size=4, collate_fn=collator)
val_dataloader = DataLoader(data_tokenized['test'], shuffle=False, batch_size=4, collate_fn=collator)

In [49]:
from torch.optim import Adam

In [50]:
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ai-forever/ruBert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
type(model)

The model is [BertForSequenceClassification](https://github.com/huggingface/transformers/blob/v4.19.4/src/transformers/models/bert/modeling_bert.py#L1508).

![alt text](https://jalammar.github.io/images/distilBERT/bert-model-calssification-output-vector-cls.png)

Source: [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)

Roughly, `BertForSequenceClassification` looks like the class below; the real class additionally inherits the general Transformers model functionality and computes the loss itself when `labels` are passed:

In [52]:
import torch
from transformers import BertModel

class BertClassifierSimple(torch.nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # BertConfig stores the dropout rate as `hidden_dropout_prob`
        self.dropout = torch.nn.Dropout(self.bert.config.hidden_dropout_prob)
        self.out = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        bert_output = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pooled = bert_output[1]  # pooled representation of the [CLS] token
        return self.out(self.dropout(pooled))  # raw scores (logits) to be put into a softmax transformation
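
For single-label classification, the built-in loss is the cross-entropy between the logits and the integer label. A pure-Python sketch of that quantity for one example (the function name `cross_entropy` is an assumption; PyTorch's `CrossEntropyLoss` additionally averages over the batch):

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the correct class under softmax(logits)."""
    top = max(logits)
    # log(sum(exp(logits))), computed stably by factoring out the maximum.
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]


# A head that scores all five classes equally is maximally unsure.
print(cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], 2))  # ln(5), about 1.609
```

This gives a useful reference point: a five-class classifier that has learned nothing yet should show a loss near 1.61, and training should drive it below that.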

In [53]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [54]:
if torch.cuda.is_available():
    model.cuda()

In [55]:
# model.classifier.parameters()
optimizer = Adam(model.parameters(), lr=1e-6)  # with tiny batches, LR should be very small as well
# Adagrad

In [56]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [57]:
losses = []
for epoch in trange(3):
    pbar = tqdm(train_dataloader)
    model.train()
    for i, batch in enumerate(pbar):
        out = model(**batch.to(model.device))  # the batch contains `labels`, so the model also returns the loss
        out.loss.backward()
        if i % 1 == 0:  # update on every batch; a larger modulus would accumulate gradients over several batches
            optimizer.step()
            optimizer.zero_grad()
        losses.append(out.loss.item())
        pbar.set_description(f'loss: {np.mean(losses[-100:]):2.2f}')
    model.eval()
    eval_losses = []
    eval_preds = []
    eval_targets = []
    for batch in tqdm(val_dataloader):
        with torch.no_grad():
            out = model(**batch.to(model.device))
        eval_losses.append(out.loss.item())
        eval_preds.extend(out.logits.argmax(1).tolist())
        eval_targets.extend(batch['labels'].tolist())
    print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

recent train loss 0.8934309265017509 eval loss 0.9075848679989577 accuracy 0.6703125

recent train loss 0.7566500814259052 eval loss 0.7403444102033973 accuracy 0.765625

recent train loss 0.6460567098110914 eval loss 0.7121943225618452 accuracy 0.765625


In [58]:
model.eval()
eval_losses = []
eval_preds = []
eval_targets = []
for batch in tqdm(val_dataloader):
    with torch.no_grad():
        out = model(**batch.to(model.device))
    eval_losses.append(out.loss.item())
    eval_preds.extend(out.logits.argmax(1).tolist())
    eval_targets.extend(batch['labels'].tolist())
print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

recent train loss 0.6460567098110914 eval loss 0.7121943225618452 accuracy 0.765625


In [59]:
from sklearn.metrics import confusion_matrix

confusion_matrix(eval_targets, eval_preds)

array([[ 82,   0,   0,   0,  13],
       [ 36,   0,   0,   0,   6],
       [ 18,   0,   0,   0,   7],
       [  5,   0,   0,   0,  56],
       [  9,   0,   0,   0, 408]])
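
In scikit-learn's convention, rows are true labels and columns are predicted labels. Columns 1 to 3 are all zero, so at this stage the model only ever predicts label 0 or label 4 (ratings 1 and 5). The sketch below recomputes accuracy and per-class recall directly from the matrix printed by the cell; the variable names are chosen for illustration.

```python
confusion = [
    [82, 0, 0, 0, 13],
    [36, 0, 0, 0, 6],
    [18, 0, 0, 0, 7],
    [5, 0, 0, 0, 56],
    [9, 0, 0, 0, 408],
]

total = sum(sum(row) for row in confusion)
# Correct predictions lie on the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
# Recall of class i: correctly predicted examples divided by all true examples of that class.
recall = [row[i] / sum(row) for i, row in enumerate(confusion)]

print(total)     # 640
print(accuracy)  # 0.765625
print([round(r, 3) for r in recall])
```

The accuracy matches the value reported during evaluation, but the recall of the three middle classes is zero, which the single accuracy number hides on such an imbalanced dataset.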

Save the model for future use

In [60]:
import torch

# Make all model tensors contiguous in memory before saving
for param in model.parameters():
    param.data = param.data.contiguous()

model.save_pretrained('sentiment_classifier')
tokenizer.save_pretrained('sentiment_classifier')

('sentiment_classifier/tokenizer_config.json',
 'sentiment_classifier/special_tokens_map.json',
 'sentiment_classifier/vocab.txt',
 'sentiment_classifier/added_tokens.json',
 'sentiment_classifier/tokenizer.json')

In [61]:
!ls sentiment_classifier -alsh

total 686M
4.0K drwxr-xr-x 2 root root 4.0K Mar 18 20:06 .
4.0K drwxr-xr-x 1 root root 4.0K Mar 18 20:05 ..
4.0K -rw-r--r-- 1 root root 1.1K Mar 18 20:05 config.json
681M -rw-r--r-- 1 root root 681M Mar 18 20:06 model.safetensors
4.0K -rw-r--r-- 1 root root  125 Mar 18 20:06 special_tokens_map.json
4.0K -rw-r--r-- 1 root root 1.3K Mar 18 20:06 tokenizer_config.json
3.6M -rw-r--r-- 1 root root 3.6M Mar 18 20:06 tokenizer.json
1.7M -rw-r--r-- 1 root root 1.7M Mar 18 20:06 vocab.txt


Load the model from disk and use it for inference

In [62]:
model = AutoModelForSequenceClassification.from_pretrained('sentiment_classifier')
tokenizer = AutoTokenizer.from_pretrained('sentiment_classifier')

In [63]:
def classify(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    proba = torch.softmax(logits, -1)
    return proba.cpu().numpy()[0]  # probabilities of labels 0..4, i.e. ratings 1..5

In [68]:
classify('Отвратительный ресторан. Невкусно! Ставлю минимальный балл (1 из 5).')  # "A disgusting restaurant. Not tasty! I give the minimum score (1 out of 5)."

array([0.37337908, 0.1378008 , 0.17974   , 0.10681526, 0.20226488],
      dtype=float32)

In [65]:
classify('Мне было весело')  # "I had fun"

array([0.14893204, 0.09701529, 0.08779054, 0.13236095, 0.53390115],
      dtype=float32)
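
Since the labels were built as `general - 1`, the predicted rating is the index of the largest probability plus one. A small sketch applied to the two probability vectors printed by the cells (the helper `predicted_rating` is hypothetical):

```python
def predicted_rating(probabilities):
    """Map class probabilities for labels 0..4 to a rating 1..5 and its probability."""
    label = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return label + 1, probabilities[label]


negative = [0.37337908, 0.1378008, 0.17974, 0.10681526, 0.20226488]
positive = [0.14893204, 0.09701529, 0.08779054, 0.13236095, 0.53390115]

print(predicted_rating(negative))  # (1, 0.37337908)
print(predicted_rating(positive))  # (5, 0.53390115)
```

Both reviews land on the expected end of the scale, although the winning probability for the negative one is only about 0.37, so the model is far from confident.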