# MVD Exercise 9

Today's exercise is not too difficult. The goal is to get familiar with HuggingFace and to try out basic work with the BERT model.

## Part 1 - Getting to know HuggingFace and the BERT model

Install the Python library `transformers` and take a look at the pretrained [BERT model](https://huggingface.co/bert-base-uncased). Try the unmasker with various inputs.

<br>
Note: Using the BERT model also requires PyTorch; the CPU-only build is sufficient.
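
For the `[MASK]` position, the unmasker returns the most probable vocabulary tokens, each with a score. As a rough sketch of that ranking step, the snippet below applies a softmax to a handful of raw scores and keeps the top candidates. The helper `top_k_softmax` and the scores in `toy_logits` are invented for illustration; they are not part of `transformers` and were not produced by BERT.

```python
import math

def top_k_softmax(logits, k=5):
    """Turn raw scores into probabilities and return the k most likely tokens."""
    max_logit = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

# Made-up scores for a tiny vocabulary; a real model scores roughly 30k tokens
toy_logits = {"paris": 4.0, "lyon": 2.0, "lille": 2.1, "banana": -3.0}

ranked = top_k_softmax(toy_logits, k=3)
for token, prob in ranked:
    print(f"{token}: {prob:.4f}")
```

Because the probabilities are normalized over the whole vocabulary, a low top score suggests that the model spreads its belief over many candidates rather than committing to one.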

In [1]:
from transformers import pipeline

In [6]:
# load the fill-mask (unmasking) pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")

mask = '[MASK]'
inputs = [
    f'The first president of the USD was George {mask}.',
    f'The capital of France is {mask}.',
    f'Two plus two equals {mask}.',
    f"{mask} Kenobi was a Jedi Master and Anakin Skywalker's teacher.",
]

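# `text` is used instead of `input` to avoid shadowing the Python builtin
for i, text in enumerate(inputs):
    print(f"\n---- INPUT {i} ----")
    # each prediction holds the filled-in sentence and its probability
    outputs = unmasker(text)
    for output in outputs:
        print(f"Option: {output['sequence']} (Score: {output['score']:.4f})")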

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0



---- INPUT 0 ----
Option: the first president of the usd was george smith. (Score: 0.0137)
Option: the first president of the usd was george johnson. (Score: 0.0103)
Option: the first president of the usd was george white. (Score: 0.0103)
Option: the first president of the usd was george brown. (Score: 0.0102)
Option: the first president of the usd was george williams. (Score: 0.0094)

---- INPUT 1 ----
Option: the capital of france is paris. (Score: 0.4168)
Option: the capital of france is lille. (Score: 0.0714)
Option: the capital of france is lyon. (Score: 0.0634)
Option: the capital of france is marseille. (Score: 0.0444)
Option: the capital of france is tours. (Score: 0.0303)

---- INPUT 2 ----
Option: two plus two equals one. (Score: 0.3128)
Option: two plus two equals three. (Score: 0.2023)
Option: two plus two equals two. (Score: 0.1390)
Option: two plus two equals four. (Score: 0.0606)
Option: two plus two equals zero. (Score: 0.0582)

---- INPUT 3 ----
Option: the kenobi was

## Part 2 - BERT contextualized word embeddings

The BERT documentation also includes a guide on how to use this model to obtain word embeddings. Try using the same word in different contexts and observe how the cosine similarity of the embeddings changes depending on the word's context.

Look at the tokenizer output before it is passed to the BERT model: how many tokens were created for the sentence "Hello, this is Bert."? Explain their count.

<br>
Note: Answering the previous question will help you determine which vector from the output to use for the target word.

In [7]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

In [8]:
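# load the tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# tokenize the sentence
sentence = "Hello, this is Bert."
# tokenize() returns only the WordPiece tokens, without [CLS]/[SEP]
tokens = tokenizer.tokenize(sentence)
# calling the tokenizer builds the full model input (input_ids, attention_mask, ...)
encoded = tokenizer(sentence, return_tensors="pt")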

print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")


Tokens: ['hello', ',', 'this', 'is', 'bert', '.']
Number of tokens: 6
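
The count of 6 above covers only the WordPiece tokens returned by `tokenizer.tokenize`. The encoded model input additionally starts with `[CLS]` and ends with `[SEP]`, so every word token sits one position later in the model output. The minimal sketch below illustrates that offset in plain Python without loading the tokenizer; `wrap_with_special_tokens` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
# Illustrative sketch only: mimics how BERT wraps a single sentence.
# `wrap_with_special_tokens` is a hypothetical helper, not a transformers API.

def wrap_with_special_tokens(tokens):
    """Return the token sequence as BERT sees it for one sentence."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]

# WordPiece tokens copied from the tokenizer output above
tokens = ["hello", ",", "this", "is", "bert", "."]
model_tokens = wrap_with_special_tokens(tokens)

print(model_tokens)
print(f"Tokens fed to the model: {len(model_tokens)}")

# The embedding of a word must be read at its position in `model_tokens`,
# which is shifted by one because of the leading [CLS].
plain_index = tokens.index("bert")
model_index = model_tokens.index("bert")
print(f"'bert' is at index {plain_index} without and {model_index} with special tokens")
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(tokenizer(sentence)["input_ids"])` should show the same wrapped sequence, which is why the target word's index has to be looked up in the encoded input rather than in the plain token list.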


In [10]:
model = BertModel.from_pretrained("bert-base-uncased")

sentence1 = "Knowledge - is the key to the Universe."
sentence2 = "I lost the key to my car."

word = 'key'

# tokenize
inputs1 = tokenizer(sentence1, return_tensors="pt")
inputs2 = tokenizer(sentence2, return_tensors="pt")

# calculate embeddings
with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# select the embedding of `word`:
# find the position of the `word` token in each sentence (index 0 is [CLS])
index1 = tokenizer.convert_ids_to_tokens(inputs1["input_ids"][0]).index(word)
index2 = tokenizer.convert_ids_to_tokens(inputs2["input_ids"][0]).index(word)

# get embeddings for token
embedding1 = outputs1.last_hidden_state[0, index1, :].numpy()
embedding2 = outputs2.last_hidden_state[0, index2, :].numpy()

similarity = cosine_similarity([embedding1], [embedding2])[0][0]

print(f"Cosine similarity between `{word}` in two contexts: {similarity:.4f}")


Cosine similarity between `key` in two contexts: 0.4941
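
The score above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their lengths. As a sketch of what `cosine_similarity` computes for a single pair, here is the same formula in plain Python. The function name `cosine` and the short example vectors are made up for illustration and are not BERT embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (placeholders, not real embeddings)
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
```

A value near 1 means the vectors point in almost the same direction. Read this way, the 0.4941 measured for `key` suggests that the two contextual embeddings differ noticeably, which is consistent with the word being used in a figurative and a literal sense.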


## Bonus - Visualizing word embeddings

Visualize the word embeddings: do their positions change with context the way you would expect? Also try visualizing some words that you think the target word should move closer to once the context changes.
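
One possible starting point is to project the 768-dimensional embeddings down to two dimensions with PCA and then scatter-plot the result. The sketch below implements PCA through NumPy's SVD. `project_2d` is a hypothetical helper, the labels are only example words one might plot, and the random matrix stands in for real rows of `last_hidden_state`, so the printed coordinates carry no meaning.

```python
import numpy as np

def project_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA requires centered data
    # Rows of vt are the principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Placeholder data: 5 random vectors standing in for 768-dim BERT embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 768))
# Hypothetical labels; replace with the words whose embeddings you extract
labels = ["key (universe)", "key (car)", "answer", "lock", "door"]

points = project_2d(fake_embeddings)
for label, (x, y) in zip(labels, points):
    print(f"{label:>15}: ({x:.3f}, {y:.3f})")
```

The resulting `points` array can be handed to a plotting library, for example `matplotlib.pyplot.scatter(points[:, 0], points[:, 1])`. Matplotlib is an assumption here, since the notebook itself does not import it.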