# BertScorer

This notebook walks through examples of using BertScorer, which assigns scores to different candidates.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2
import sys
sys.path.append('..')

import numpy as np
from IPython.display import display

from transformers import BertForMaskedLM, BertTokenizer

from src.models.BertScorer.bert_scorer_correction import BertScorerCorrection
from src.models.BertScorer.bert_scorer_sentence import BertScorerSentence

Load the model and the tokenizer.

In [3]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## BertScorerCorrection

First, consider the class that ranks candidates for substitution in place of a masked word.

In [4]:
scorer = BertScorerCorrection(model, tokenizer)
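The scores this class returns further down are negative, and the best candidate is close to zero. That is consistent with log-probabilities taken at the `[MASK]` position, although the notebook does not say how `BertScorerCorrection` computes them. The sketch below illustrates that idea under this assumption only. The vocabulary and logits are made-up stand-ins for model output, and `log_softmax` is a helper written here, not part of the class.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

# Toy vocabulary and hypothetical logits standing in for the model output
# at the [MASK] position (not real BERT values).
vocab = ['capital', 'city', 'human', 'think']
mask_logits = np.array([9.0, 3.0, -4.0, -2.0])

# Each candidate's score is its log-probability at the masked position.
log_probs = log_softmax(mask_logits)
candidate_scores = {word: float(lp) for word, lp in zip(vocab, log_probs)}

best = max(candidate_scores, key=candidate_scores.get)
print(best, candidate_scores[best])
```

How the real class handles candidates that split into several word pieces (such as `wobblybebbly`) is not shown in this notebook, so the sketch covers single-token candidates only.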

Define a couple of sentences and the candidates to score.

In [5]:
sentence_wrong = (
    f'It is wrong sentence, there are '
    f'two {tokenizer.mask_token} mask tokens: {tokenizer.mask_token}'
)

sentences = [
    f'London is the {tokenizer.mask_token} of Great Britain',
    f'The square in Moscow is {tokenizer.mask_token}'
]
candidates = [
    ['wobblybebbly', 'city', 'capital', 'human', 'think'],
    ['red', 'blue', 'wobblybebbly', 'black', 'round', 'boiled']
]

Test the invalid sentence, which contains two mask tokens.

In [6]:
try:
    scorer([sentence_wrong], [])
except ValueError as e:
    print(e)

There should be exactly one [MASK] token in a sentence.
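The error above suggests that the scorer validates its input before running the model. A minimal sketch of such a check follows. `check_single_mask` is a hypothetical helper written for illustration, not the class's actual code, and only the error message is taken from the output above.

```python
MASK_TOKEN = '[MASK]'  # mask token of bert-base-uncased, as printed in this notebook

def check_single_mask(sentence, mask_token=MASK_TOKEN):
    """Raise ValueError unless the sentence contains exactly one mask token."""
    if sentence.count(mask_token) != 1:
        raise ValueError(
            f'There should be exactly one {mask_token} token in a sentence.'
        )

# A sentence with a single mask passes silently.
check_single_mask('London is the [MASK] of Great Britain')

# A sentence with two masks is rejected.
message = None
try:
    check_single_mask('there are two [MASK] mask tokens: [MASK]')
except ValueError as e:
    message = str(e)
    print(message)
```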


Test the valid sentences.

In [7]:
results = scorer(sentences, candidates)

for sentence, sentence_candidates, sentence_results in zip(sentences, candidates, results):
    print('Results for sentence:')
    print(sentence)
    for candidate, result in zip(sentence_candidates, sentence_results):
        print(f'\t{candidate}: {result:.3f}')

Results for sentence:
London is the [MASK] of Great Britain
	wobblybebbly: -21.435
	city: -5.993
	capital: -0.014
	human: -18.483
	think: -16.042
Results for sentence:
The square in Moscow is [MASK]
	red: -15.927
	blue: -15.878
	wobblybebbly: -18.513
	black: -16.626
	round: -15.404
	boiled: -19.281


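The result looks plausible. In the first sentence, `capital` scores far above every other candidate (-0.014), and the made-up word `wobblybebbly` scores lowest. In the second sentence the scores are much closer together, with `boiled` and `wobblybebbly` at the bottom.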
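To use these scores for correction, the candidates can be sorted and the top one substituted into the sentence. The sketch below does this with the numbers copied from the output above. `rank` is a helper introduced here for illustration.

```python
# Scores copied from the notebook output above.
scored = {
    'London is the [MASK] of Great Britain': {
        'wobblybebbly': -21.435, 'city': -5.993, 'capital': -0.014,
        'human': -18.483, 'think': -16.042,
    },
    'The square in Moscow is [MASK]': {
        'red': -15.927, 'blue': -15.878, 'wobblybebbly': -18.513,
        'black': -16.626, 'round': -15.404, 'boiled': -19.281,
    },
}

def rank(candidate_scores):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidate_scores, key=candidate_scores.get, reverse=True)

rankings = {sentence: rank(scores) for sentence, scores in scored.items()}

# Substitute the top-ranked candidate for the mask token.
for sentence, ordered in rankings.items():
    print(sentence.replace('[MASK]', ordered[0]))
```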

## BertScorerSentence

Now consider the class that scores an entire sentence at once.

In [8]:
scorer = BertScorerSentence(model, tokenizer)
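The notebook does not show how `BertScorerSentence` arrives at a single number per sentence. One common way to score a sentence with a masked language model is a pseudo-log-likelihood: mask each token in turn and combine the log-probabilities of the original tokens. The sketch below assumes that approach and averages over tokens, which is also an assumption. `toy_log_prob` and its unigram table are hypothetical stand-ins for a real model.

```python
import math

def pseudo_log_likelihood(tokens, token_log_prob, mask_token='[MASK]'):
    """Mask each token in turn and average the log-probabilities of the originals."""
    total = 0.0
    for i, token in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        total += token_log_prob(masked, i, token)
    return total / len(tokens)

# Hypothetical stand-in for a masked language model: a fixed unigram table
# that ignores context. Unknown tokens get a small floor probability.
UNIGRAM = {
    'moscow': 0.2, 'is': 0.3, 'the': 0.3,
    'capital': 0.1, 'of': 0.05, 'russia': 0.05,
}

def toy_log_prob(masked_tokens, position, token):
    return math.log(UNIGRAM.get(token, 1e-4))

good = pseudo_log_likelihood('moscow is the capital of russia'.split(), toy_log_prob)
typo = pseudo_log_likelihood('moscwo is the capital of russia'.split(), toy_log_prob)
print(good > typo)
```

With a real model, `token_log_prob` would run the masked sentence through the network and read off the log-probability of the original token at the masked position.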

Define several sentences, one correct and three with a single error each, and score them.

In [9]:
sentences = [
    'Moscow is the capital of Russia',
    'Moscwo is the capital of Russia',
    'Moscow isthe capital of Russia',
    'Moscow is the capital or Russia'
]

In [10]:
results = scorer(sentences)

In [11]:
for i, result in enumerate(results):
    print(f'Sentence {i}: {result:.3f}')

Sentence 0: -5.296
Sentence 1: -6.985
Sentence 2: -7.880
Sentence 3: -7.956


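The result looks quite reasonable: the correct sentence gets the highest score (-5.296), and each of the three corrupted variants scores lower.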
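If the sentences are treated as competing hypotheses for the same text, the highest-scoring one can be selected and the others compared against it. A short sketch using the scores printed above:

```python
import numpy as np

sentences = [
    'Moscow is the capital of Russia',
    'Moscwo is the capital of Russia',
    'Moscow isthe capital of Russia',
    'Moscow is the capital or Russia',
]
# Scores copied from the notebook output above.
scores = np.array([-5.296, -6.985, -7.880, -7.956])

# Pick the highest-scoring sentence and measure how far the others fall behind it.
best_index = int(np.argmax(scores))
gaps = scores[best_index] - scores

print(f'Best: {sentences[best_index]}')
for sentence, gap in zip(sentences, gaps):
    print(f'{gap:.3f}\t{sentence}')
```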