# 06. BERT as an embedder

In the previous lecture we looked at BERT as a pretrained language model that can be fine-tuned on a specific task relatively cheaply. However, BERT can also be used as is, without fine-tuning, to analyze text. Because of how it is trained (BERT sees the whole sentence at once and has to restore some masked words), it learns to produce very good embeddings of **words in sentences**.

Let's try this on a concrete task: computing sentence embeddings and comparing sentences by similarity.

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


model_id = "DeepPavlov/rubert-base-cased-sentence"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)


In [2]:
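# Sentences we want sentence embeddings for.
# They stay in Russian because the model is a Russian BERT; English glosses are in the comments.
sentences = ['Этот город самый лучший город на земле',  # "This city is the best city on earth"
             'Москва - столица России',  # "Moscow is the capital of Russia"
             'Английский язык - наиболее популярный второй язык с мире',  # "English is the most popular second language in the world"
             'Машинное обучение является основой обработки естественного языка',  # "Machine learning is the foundation of natural language processing"
             'Лондон - столица Великобритании']  # "London is the capital of Great Britain"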
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [4]:
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    print(token_embeddings.shape, attention_mask)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)


torch.Size([5, 11, 768]) tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
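The attention mask printed above marks padded positions with 0, and `mean_pooling` uses it to keep those positions out of the average. The following is a minimal dependency-free sketch of the same idea on plain Python lists. The helper `masked_mean` and the toy vectors are made up for illustration, and the large values in the padded position only stand in for whatever the model happens to output there.

```python
def masked_mean(token_vectors, mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    # Guard against division by zero, like torch.clamp(..., min=1e-9) in the cell above
    count = max(len(kept), 1e-9)
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


# Toy "sentence": two real tokens followed by one padded position
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(token_vectors, mask)
# Naive average over all positions, padding included
naive = [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(2)]

print(pooled)  # [2.0, 3.0]
print(naive)   # dominated by the padded position
```

Without the mask, shorter sentences in a batch would get embeddings that depend on how much padding was added, which is why the average is taken over real tokens only.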


In [5]:
print(sentence_embeddings.shape)

torch.Size([5, 768])


In [6]:
sen_len = len(sentences)
for i in range(sen_len):
    for j in range(i, sen_len):
        # Embeddings are L2-normalized, so the dot product equals the cosine similarity
        print(f"Similarity between '{sentences[i]}' and '{sentences[j]}' : {sentence_embeddings[i] @ sentence_embeddings[j]}")


Similarity between 'Этот город самый лучший город на земле' and 'Этот город самый лучший город на земле' : 1.0
Similarity between 'Этот город самый лучший город на земле' and 'Москва - столица России' : 0.6997369527816772
Similarity between 'Этот город самый лучший город на земле' and 'Английский язык - наиболее популярный второй язык с мире' : 0.778307318687439
Similarity between 'Этот город самый лучший город на земле' and 'Машинное обучение является основой обработки естественного языка' : 0.37199896574020386
Similarity between 'Этот город самый лучший город на земле' and 'Лондон - столица Великобритании' : 0.6042578816413879
Similarity between 'Москва - столица России' and 'Москва - столица России' : 1.0
Similarity between 'Москва - столица России' and 'Английский язык - наиболее популярный второй язык с мире' : 0.6052638292312622
Similarity between 'Москва - столица России' and 'Машинное обучение является основой обработки естественного языка' : 0.4542554020881653
Similarity between 'Москва - столица России' and 'Лондон - столица Вел
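The similarity loop uses a plain dot product (`@`) rather than an explicit cosine formula. That works because the embeddings were L2-normalized with `F.normalize` earlier, and for unit-length vectors the two quantities coincide. Here is a small sketch of that identity on plain Python lists. The helpers `l2_normalize`, `dot`, and `cosine` and the toy vectors are illustrative and not part of the notebook.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 length (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = dot(a_unit, b_unit)
cos_ab = cosine(a, b)

# Both values are approximately 0.96
print(dot_of_units, cos_ab)
```

This also explains why every sentence has similarity 1.0 with itself in the output above: the dot product of a unit vector with itself is its squared length.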

You can also use the dedicated `sentence_transformers` library, for example, to rank passages by how well they match a query.

In [7]:
!pip3 install sentence-transformers


In [8]:
from sentence_transformers import CrossEncoder

# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# 2. Predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)

# 3. Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92    The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61    Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24    An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60    In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35    In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42    Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45    In 2015, the total labour force in Berlin was 1.85 million.
0.33    Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24   The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32   Berlin is well known for its museums.
"""

Query: How many people live in Berlin?
8.92	The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61	Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24	An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60	In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35	In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42	Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45	In 2015, the total labour force in Berlin was 1.85 million.
0.33	Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24	The city of Paris had a population of 2,165,423 people within its administrative city limits as of Jan
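In the cell above, each entry of `ranks` is read through its `'corpus_id'` and `'score'` keys, and the entries come out ordered from the highest score to the lowest. Once every (query, passage) pair has a score, the ranking step itself is just a sort. The sketch below shows that step in isolation. `rank_by_score` is a hypothetical helper, not the library's implementation, and the scores are copied from the printed output above rather than computed by a model.

```python
def rank_by_score(scores):
    """Return [{'corpus_id': i, 'score': s}, ...] sorted by score, highest first."""
    results = [{"corpus_id": i, "score": s} for i, s in enumerate(scores)]
    return sorted(results, key=lambda r: r["score"], reverse=True)


passages = [
    "Berlin is well known for its museums.",
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
]
# Scores for these three passages as printed in the output above
scores = [-4.32, 8.61, 0.33]

ranks = rank_by_score(scores)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
```

Keeping `corpus_id` next to the score means the original passage list does not have to be reordered, so the same index can be used to look up any other data stored alongside the passages.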
