# BERT (Encoder-Only) for NLP Applications

## Introduction to BERT

**BERT** (Bidirectional Encoder Representations from Transformers, 2018) is a pretrained Transformer model that uses only the encoder part of the architecture.

### Key features:
- NOT a generative model
- Pretrained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Effective across a wide range of NLP tasks
- Produces context-dependent embeddings
- Variants: ModernBERT (2024/25), RoBERTa, DistilBERT, etc.
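
To make the MLM objective more concrete, here is a minimal pure-Python sketch of the token corruption step. It assumes the recipe described in the original BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). The `mask_tokens` helper and the toy vocabulary are hypothetical names used only for illustration; real pretraining works on token IDs and skips special tokens.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "tree", "car", "blue"]  # toy vocabulary (assumption)


def mask_tokens(tokens, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Token not selected: keep it, no prediction target
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                   # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(TOY_VOCAB))  # 10%: random token
        else:
            corrupted.append(tok)                    # 10%: keep unchanged
    return corrupted, labels


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
# select_prob is raised here only so that the short demo shows some masking
corrupted, labels = mask_tokens(tokens, rng, select_prob=0.5)
print(corrupted)
print(labels)
```

Positions where `labels` is `None` do not contribute to the loss; only the selected positions do.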

## Sentence Classification

### Task: Sentiment Analysis

Determining the sentiment of reviews (positive/negative).

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import torch.nn.functional as F

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

text = "This product is amazing! I love it."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

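# TODO: add training. The classification head is randomly initialized,
# so the probabilities printed below are not meaningful yet.
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    predictions = F.softmax(outputs.logits, dim=-1)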

print(f"Positive: {predictions[0][1]:.2%}")
print(f"Negative: {predictions[0][0]:.2%}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Positive: 46.00%
Negative: 54.00%
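
The near 50/50 split is consistent with the warning above: the classifier weights are newly initialized, so the two logits are close to each other and softmax maps them to roughly equal probabilities. The sketch below reproduces this with plain Python. The logit values are hypothetical, chosen only to illustrate the effect, and the index order `[negative, positive]` follows the print statements in the cell above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [negative, positive]
untrained = softmax([0.08, -0.08])  # untrained head: logits almost equal
trained = softmax([-2.5, 3.0])      # what a fine-tuned head might produce

print(f"untrained: negative={untrained[0]:.2%}, positive={untrained[1]:.2%}")
print(f"trained:   negative={trained[0]:.2%}, positive={trained[1]:.2%}")
```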


## Token Classification

### Task: Named Entity Recognition

In [3]:
from transformers import pipeline

# Create the NER pipeline
ner_pipeline = pipeline(
    "ner", model="dslim/bert-base-NER", tokenizer="dslim/bert-base-NER"
)

# Analyze the text
text = "Elon Musk founded Tesla in California."
ner_results = ner_pipeline(text)

# Print the results
for entity in ner_results:
    print(f"{entity['word']}: {entity['entity']} (score: {entity['score']:.2f})")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


El: B-ORG (score: 0.91)
##on: I-ORG (score: 0.84)
Mu: I-ORG (score: 0.69)
##sk: I-ORG (score: 0.73)
Te: B-ORG (score: 1.00)
##sla: I-ORG (score: 0.99)
California: B-LOC (score: 1.00)
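
The output above is reported per WordPiece token, so `Elon` appears as `El` + `##on`. A minimal sketch of gluing the pieces back into words is shown below. The `merge_wordpieces` helper is hypothetical, and taking the minimum score of the pieces is an arbitrary choice made for this illustration. In practice the `pipeline` call also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the preceding piece.

    Each input item is a dict with 'word', 'entity' and 'score' keys,
    mirroring the fields printed above. The merged word keeps the label
    of its first piece and the minimum score of its pieces (assumption).
    """
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            merged[-1]["word"] += tok["word"][2:]
            merged[-1]["scores"].append(tok["score"])
        else:
            merged.append(
                {"word": tok["word"], "entity": tok["entity"], "scores": [tok["score"]]}
            )
    return [
        {"word": m["word"], "entity": m["entity"], "score": min(m["scores"])}
        for m in merged
    ]


# Values copied from the output above
raw = [
    {"word": "El", "entity": "B-ORG", "score": 0.91},
    {"word": "##on", "entity": "I-ORG", "score": 0.84},
    {"word": "Mu", "entity": "I-ORG", "score": 0.69},
    {"word": "##sk", "entity": "I-ORG", "score": 0.73},
    {"word": "Te", "entity": "B-ORG", "score": 1.00},
    {"word": "##sla", "entity": "I-ORG", "score": 0.99},
    {"word": "California", "entity": "B-LOC", "score": 1.00},
]

words = merge_wordpieces(raw)
for w in words:
    print(f"{w['word']}: {w['entity']} (score: {w['score']:.2f})")
```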


## Question Answering

### Task: Extracting answers from text


```python
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Load the model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Context and question
context = "BERT was published in 2018 by researchers at Google AI Language."
question = "When was BERT published?"

# Tokenization: the question and the context are encoded as one sentence pair
inputs = tokenizer(question, context, return_tensors="pt")

# Get the answer: pick the most likely start and end token positions
with torch.no_grad():
    outputs = model(**inputs)
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1  # +1 because slicing is end-exclusive

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end])
)

print(f"Answer: {answer}")
```
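
Taking the two `argmax` values independently can in principle yield an end position that precedes the start. One common remedy is to score every valid `(start, end)` pair jointly. The sketch below does this in plain Python on toy logits; `best_span` and `max_answer_len` are hypothetical names, and the logit values are made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return ((start, end), score) maximizing start_logit + end_logit.

    Only spans with start <= end and at most max_answer_len tokens are
    considered. The returned end index is inclusive.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Toy logits: independent argmax would give start=2, end=0 (an invalid span)
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]

span, score = best_span(start_logits, end_logits)
print(span, score)
```

Because the returned end index is inclusive, slicing the token IDs would use `span[1] + 1` as the upper bound, as in the cell above.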

## Embeddings

### Computing sentence embeddings

In [4]:
from transformers import BertTokenizer, BertModel
import torch.nn.functional as F
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")


def mean_pooling(model_output, attention_mask: torch.Tensor) -> torch.Tensor:
    # model_output is the BertModel output object, not a raw tensor
    token_embeddings = model_output.last_hidden_state  # [B, L, H]

    # Expand the mask from [B, L] to [B, L, H] so it matches the embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )

    # Sum only the tokens where mask == 1
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)

    # Divide by the number of real (non-padding) tokens
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask


def get_sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    embedding = mean_pooling(outputs, inputs["attention_mask"])

    return embedding.squeeze(0)


# Three sentences to compare
sent1 = "The cat sits on the mat."
sent2 = "A cat is sitting on a rug."
sent3 = "The weather is sunny today."

# Compute the embeddings
emb1 = get_sentence_embedding(sent1)
emb2 = get_sentence_embedding(sent2)
emb3 = get_sentence_embedding(sent3)

# Compute cosine similarity
similarity_1_2 = F.cosine_similarity(emb1, emb2, dim=0).item()
similarity_1_3 = F.cosine_similarity(emb1, emb3, dim=0).item()

print(f"Similarity (sent1, sent2): {similarity_1_2:.4f}")
print(f"Similarity (sent1, sent3): {similarity_1_3:.4f}")

Similarity (sent1, sent2): 0.8712
Similarity (sent1, sent3): 0.6416


## Fill-Mask (Masked Language Modeling)

### Task: Predicting masked words

In [5]:
from transformers import pipeline

# Create the pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Predict the masked word
text = "The capital of [MASK] is Paris."
predictions = fill_mask(text)

# Print the top-5 predictions
for pred in predictions[:5]:
    print(f"{pred['token_str']}: {pred['score']:.4f}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


france: 0.9280
brittany: 0.0084
algeria: 0.0074
department: 0.0050
reunion: 0.0044


## Fine-tuning on custom data

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch

class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=max_length)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Data preparation
texts = ["Great!", "Terrible", "Not bad", "Excellent quality"]
labels = [1, 0, 1, 1]  # 1 - positive, 0 - negative

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = CustomDataset(texts, labels, tokenizer)

# Load the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# NOTE: this toy dataset yields only a few optimizer steps, so the 500-step
# warmup never completes; reduce warmup_steps when training on small data.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
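
To see how `warmup_steps=500` interacts with a four-example dataset, the sketch below computes a linear warmup/decay schedule by hand. It assumes a single device, no gradient accumulation, a linear scheduler, and a base learning rate of `5e-5`; none of these are set explicitly in the cell above, so treat them as assumptions. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the fine-tuning cell above
num_examples, batch_size, epochs = 4, 8, 3
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

base_lr = 5e-5  # assumed base learning rate
for step in range(1, total_steps + 1):
    lr = linear_warmup_lr(step, base_lr, warmup_steps=500, total_steps=total_steps)
    print(f"step {step}: lr = {lr:.2e}")
```

Under these assumptions training ends while the learning rate is still a small fraction of its target value, so the model barely moves. A much smaller warmup (or none) is a more reasonable choice for datasets of this size.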


In [None]:
del model
del fill_mask

# HuggingFace

Using pipelines

In [3]:
# GPT2: My name is John [?]
# BERT: My name is John.<sep> Playstation. 
#       My [?] is John [?]. 

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Processing a single sentence

In [4]:
tokenizer.encode("hello world meaningless")

[101, 7592, 2088, 25120, 102]

Converting IDs to tokens

In [5]:
tokenizer.convert_ids_to_tokens([102])

['[SEP]']

In [6]:
tokenizer.decode(
    token_ids=[101, 7592, 2088, 25120, 102],
    skip_special_tokens=False
)

'[CLS] hello world meaningless [SEP]'

Several sentences at once

In [7]:
tokenizer(
    text=["hello world meaningless", "lemon tree"],
    padding=True,
    return_tensors="pt"
)

{'input_ids': tensor([[  101,  7592,  2088, 25120,   102],
        [  101, 14380,  3392,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])}
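
The `padding=True` behavior seen above can be sketched in a few lines of plain Python: shorter sequences are right-padded with the pad ID (0 in the output above) and the attention mask marks real tokens with 1 and padding with 0. `pad_batch` is a hypothetical helper for illustration, not part of the tokenizer API.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID sequences to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Token IDs copied from the output above (without padding)
batch = [
    [101, 7592, 2088, 25120, 102],  # "hello world meaningless"
    [101, 14380, 3392, 102],        # "lemon tree"
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```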

Inference "by hand"

In [None]:
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inp = tokenizer("hello world", return_tensors="pt")

result = model(**inp)

result

MaskedLMOutput(loss=None, logits=tensor([[[ -7.1909,  -7.0791,  -7.0640,  ...,  -6.1842,  -6.2977,  -4.2038],
         [ -9.6314,  -9.4228,  -9.5366,  ...,  -9.2171,  -8.3938,  -5.9639],
         [-12.5930, -12.9342, -12.8723,  ..., -11.7490, -11.0491,  -4.2497],
         [-12.2999, -12.2242, -12.0449,  ..., -10.0910,  -9.9382, -11.6595]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [22]:
import torch 

sent = f"The capital of France is {tokenizer.mask_token}."
inp = tokenizer(sent, return_tensors="pt")

logits = model(**inp).logits


torch.topk(logits[0, 6, :], 5)

torch.return_types.topk(
values=tensor([12.3461, 10.5820, 10.4628, 10.1078,  9.7246], grad_fn=<TopkBackward0>),
indices=tensor([ 3000, 22479, 10241, 16766,  7562]))

In [19]:
tokenizer.convert_ids_to_tokens([3000])

['paris']
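
The index `6` in the `topk` call above is the position of `[MASK]` in the tokenized sentence. Rather than hard-coding it, the position can be looked up. The sketch below shows the idea on plain Python lists; the token list is what one would expect for this sentence (an assumption, not printed in the notebook), and `mask_positions` and `top_k` are hypothetical helpers. With real tensors, a similar lookup could compare the input IDs against `tokenizer.mask_token_id`.

```python
def mask_positions(tokens, mask_token="[MASK]"):
    """Return the indices of all mask tokens in a token list."""
    return [i for i, tok in enumerate(tokens) if tok == mask_token]


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Expected tokenization of "The capital of France is [MASK]." (assumption)
tokens = ["[CLS]", "the", "capital", "of", "france", "is", "[MASK]", ".", "[SEP]"]
positions = mask_positions(tokens)
print(positions)

# Toy scores standing in for one row of the logits tensor
toy_scores = [0.1, 2.0, -1.0, 1.5]
print(top_k(toy_scores, 2))
```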

Using a pipeline

In [31]:
from transformers import pipeline

pipe = pipeline("fill-mask", "bert-base-uncased")

pipe("my name is Sasha [MASK]!")


[{'score': 0.01388245727866888,
  'token': 4202,
  'token_str': 'taylor',
  'sequence': 'my name is sasha taylor!'},
 {'score': 0.011390585452318192,
  'token': 5954,
  'token_str': 'stewart',
  'sequence': 'my name is sasha stewart!'},
 {'score': 0.010675176978111267,
  'token': 3656,
  'token_str': 'alexander',
  'sequence': 'my name is sasha alexander!'},
 {'score': 0.010375653393566608,
  'token': 2100,
  'token_str': '##y',
  'sequence': 'my name is sashay!'},
 {'score': 0.007814432494342327,
  'token': 12214,
  'token_str': 'winters',
  'sequence': 'my name is sasha winters!'}]

Causal LM: text generation

In [29]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import pipeline


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [30]:
#tokens = tokenizer("hello world", return_tensors="pt")
#model(**tokens)

pipe = pipeline("text-generation", "gpt2")
pipe("What is the largest ocean in the world?")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "What is the largest ocean in the world? It's not clear. We know that the ocean is a finite resource, but that doesn't mean that it can't be used to fuel life.\n\nThis article was reprinted with permission from the Center for Science in the Public Interest."}]

## Sentence embedding

In [27]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

In [28]:
q = model.encode("Paris is the capital of France")
ans = model.encode(["hello world", "play game", "Париж является столицей Франции"])

util.dot_score(q, ans)

tensor([[-0.0999, -2.4728, 35.9304]])
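
The scores above are raw dot products, so they depend on vector length as well as direction, which is why the values are not bounded by 1. Cosine similarity normalizes both vectors first. The sketch below contrasts the two on toy vectors in plain Python; `dot` and `cosine` are hypothetical helpers, and the vectors are made up. The `sentence_transformers.util` module also provides `cos_sim` for the same purpose.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice as long
c = [4.0, -3.0]  # orthogonal to a

print(dot(a, b), cosine(a, b))  # large dot product, cosine at its maximum
print(dot(a, c), cosine(a, c))  # orthogonal vectors: both are zero
```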