# MVD Exercise 9

Today's exercise is not particularly difficult. The goal is to get acquainted with HuggingFace and to try out basic work with the BERT model.

## Part 1 - Getting to know HuggingFace and the BERT model

Install the `transformers` Python library and take a look at the pretrained [BERT model](https://huggingface.co/bert-base-uncased). Try out the unmasker with various inputs.

<br>
Note: Using the BERT model also requires PyTorch; the CPU version is sufficient.

In [11]:
import torch
from transformers import pipeline
from transformers import AutoModel, AutoTokenizer, BertTokenizer
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [17]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.10731096565723419,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i'm a fashion model."},
 {'score': 0.08774467557668686,
  'token': 2535,
  'token_str': 'role',
  'sequence': "hello i'm a role model."},
 {'score': 0.053383972495794296,
  'token': 2047,
  'token_str': 'new',
  'sequence': "hello i'm a new model."},
 {'score': 0.04667218402028084,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i'm a super model."},
 {'score': 0.027095871046185493,
  'token': 2986,
  'token_str': 'fine',
  'sequence': "hello i'm a fine model."}]
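
The `score` values in the output above are probabilities: the fill-mask pipeline applies a softmax to the model's logits at the `[MASK]` position and returns the highest-scoring tokens (five by default). The following is a minimal sketch of that ranking step only. The six-word vocabulary and the logits are made up for illustration, and `softmax` and `top_k` are hypothetical helper names, not part of `transformers`.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k=5):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and made-up logits for the [MASK] position.
toy_vocab = ["fashion", "role", "new", "super", "fine", "banana"]
toy_logits = [2.1, 1.9, 1.4, 1.3, 0.7, -3.0]

for token, prob in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token}: {prob:.3f}")
```

In the real model the softmax runs over the whole vocabulary (about 30,000 entries for `bert-base-uncased`), which is why even the best candidate often gets a fairly small score.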

In [16]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Today was [MASK] sky.")


[{'score': 0.6679329872131348,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'today was the sky.'},
 {'score': 0.06752169132232666,
  'token': 1037,
  'token_str': 'a',
  'sequence': 'today was a sky.'},
 {'score': 0.03634251281619072,
  'token': 2630,
  'token_str': 'blue',
  'sequence': 'today was blue sky.'},
 {'score': 0.029211433604359627,
  'token': 2178,
  'token_str': 'another',
  'sequence': 'today was another sky.'},
 {'score': 0.028842559084296227,
  'token': 2026,
  'token_str': 'my',
  'sequence': 'today was my sky.'}]

In [33]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("She brings a new [MASK].")


[{'score': 0.05639948695898056,
  'token': 2028,
  'token_str': 'one',
  'sequence': 'she brings a new one.'},
 {'score': 0.04917013645172119,
  'token': 2767,
  'token_str': 'friend',
  'sequence': 'she brings a new friend.'},
 {'score': 0.032878898084163666,
  'token': 5195,
  'token_str': 'weapon',
  'sequence': 'she brings a new weapon.'},
 {'score': 0.024797676131129265,
  'token': 2611,
  'token_str': 'girl',
  'sequence': 'she brings a new girl.'},
 {'score': 0.0243862085044384,
  'token': 4377,
  'token_str': 'dress',
  'sequence': 'she brings a new dress.'}]

## Part 2 - BERT contextualized word embeddings

The BERT documentation also includes a guide on how to use the model to obtain word embeddings. Try using the same word in different contexts and observe how the cosine similarity of its embeddings changes with the context.

Look at the tokenizer output before it enters the BERT model: how many tokens were created for the sentence "Hello, this is Bert."? Explain their number.

<br>
Note: Answering the previous question will help you work out which vector from the output to use for the target word.
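
Cosine similarity compares the direction of two vectors and ignores their length: it is the dot product divided by the product of the two norms. The sketch below spells the formula out in plain Python. The three-dimensional vectors are made up for illustration (BERT base embeddings have 768 dimensions), and `cosine_similarity` is a hypothetical helper, not the PyTorch function of a similar name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the first pair points in the same direction,
# the second pair is perpendicular.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Values close to 1 mean the two embeddings point in nearly the same direction, and values near 0 mean they are close to unrelated.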

In [107]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()

def bert_cosine_sim(text1, text2, word):
    """Cosine similarity of the contextual embeddings of `word` in two texts."""
    embs = []
    for text in (text1, text2):
        words = text.split(" ")
        # Tokenize the pre-split words so that word_ids() uses the same
        # indices as `words`, even when a word contains punctuation.
        tok = tokenizer(words, is_split_into_words=True, return_tensors='pt')

        with torch.no_grad():
            out = model(**tok)

        # Last hidden layer, shape (seq_len, hidden_size).
        states = out.hidden_states[-1].squeeze(0)

        # Positions of the subword tokens of the first occurrence of `word`.
        tok_ids = np.where(np.array(tok.word_ids()) == words.index(word))[0]

        # Average the subword vectors so that every word yields one vector.
        embs.append(states[tok_ids].mean(dim=0))

    return torch.cosine_similarity(embs[0].reshape(1, -1), embs[1].reshape(1, -1))


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [109]:
text1 = "Handwritten letter"
text2 = "Alphabet letter"
word = "letter"

cos_sim = bert_cosine_sim(text1, text2, word)
print(cos_sim)

text1 = "Print a letter"
text2 = "Write a letter"
word = "letter"

cos_sim = bert_cosine_sim(text1, text2, word)
print(cos_sim)

# text1 = "hammer the nails"
# text2 = "paint your nails"
# word = "nails"

# text1 = "I'm sure I'm right"
# text2 = "Turn right"
# word = "right"



tensor([0.6980])
tensor([0.9604])


In [102]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Hello, this is Bert."
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)
print(tokenizer.decode(encoded_input['input_ids'][0]))

# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# text = "Here is the sentence I want embeddings for."
# marked_text = "[CLS] " + text + " [SEP]"
# tokenized_text = tokenizer.tokenize(marked_text)
# print (tokenized_text)

{'input_ids': tensor([[  101,  7592,  1010,  2023,  2003, 14324,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
[CLS] hello, this is bert. [SEP]
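
The output above contains eight ids even though the sentence has only four words. The punctuation marks become tokens of their own, and the tokenizer wraps the sentence in the special tokens `[CLS]` and `[SEP]`. The sketch below imitates that behaviour with a regular expression. It is a simplification that assumes every word is in the vocabulary, because the real tokenizer may additionally split rare words into WordPiece fragments. `basic_tokens` and `with_special_tokens` are hypothetical helper names.

```python
import re


def basic_tokens(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def with_special_tokens(tokens):
    """Wrap a single sentence the way BERT expects: [CLS] ... [SEP]."""
    return ["[CLS]"] + tokens + ["[SEP]"]


tokens = with_special_tokens(basic_tokens("Hello, this is Bert."))
print(len(tokens), tokens)
# prints: 8 ['[CLS]', 'hello', ',', 'this', 'is', 'bert', '.', '[SEP]']
```

Because of the leading `[CLS]` and the separate punctuation tokens, the position of a word in the model output generally differs from its position in the whitespace-split sentence, which matters when picking the vector for the target word.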


## Bonus - Visualizing word embeddings

Visualize the word embeddings: do their positions change with the context the way you would expect? Also try to visualize some words that you think the target word should move closer to once the context changes.

In [2]:
import numpy as np
import torch
import plotly.express as px
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

In [17]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()

texts = [
    "i got a letter today from my sister",
    "i send you the letter",
    "write a letter to Santa",
    "A is the first letter of the alphabet",
    "the letter Z is the least commonly used letter in the English alphabet"
]
word = "letter"

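embs = []
for text in texts:
    words = text.split(" ")
    # Tokenize the pre-split words so that word_ids() uses the same indices as `words`.
    tok = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    with torch.no_grad():
        out = model(**tok)
    # Last hidden layer, shape (seq_len, hidden_size).
    states = out.hidden_states[-1].squeeze(0)
    # Subword positions of the first occurrence of `word` in the text.
    tok_ids = np.where(np.array(tok.word_ids()) == words.index(word))[0]
    # Average the subword vectors so that every text contributes one vector.
    embs.append(states[tok_ids].mean(dim=0))

# Project the 768-dimensional embeddings onto their first two principal components.
embs = torch.stack(embs).numpy()
pca_embs = PCA(n_components=2).fit_transform(embs)

fig = px.scatter(x=pca_embs[:, 0], y=pca_embs[:, 1], text=texts)
fig.show()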

