# MVD 9. cvičení

Dnešní cvičení nebude až tak obtížné. Cílem je seznámit se s HuggingFace a vyzkoušet si základní práci s BERT modelem.

## 1. část - Seznámení s HuggingFace a modelem BERT

Nainstalujte si Python knihovnu `transformers` a podívejte se na předtrénovaný [BERT model](https://huggingface.co/bert-base-uncased). Vyzkoušejte si unmasker s různými vstupy.

<br>
Pozn.: Použití BERT modelu vyžaduje zároveň PyTorch - postačí i cpu verze.

In [1]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Hello I'm student of [MASK] science ")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.44494354724884033,
  'token': 3274,
  'token_str': 'computer',
  'sequence': "hello i'm student of computer science"},
 {'score': 0.19867515563964844,
  'token': 2576,
  'token_str': 'political',
  'sequence': "hello i'm student of political science"},
 {'score': 0.02690133824944496,
  'token': 2591,
  'token_str': 'social',
  'sequence': "hello i'm student of social science"},
 {'score': 0.009483722038567066,
  'token': 3019,
  'token_str': 'natural',
  'sequence': "hello i'm student of natural science"},
 {'score': 0.00781850703060627,
  'token': 6228,
  'token_str': 'mechanical',
  'sequence': "hello i'm student of mechanical science"}]

## 2. část - BERT contextualized word embeddings

BERT dokumentace obsahuje také návod jak použít tento model pro získání word embeddingů. Vyzkoušejte použití stejného slova v různém kontextu a podívejte se, jak se mění kosinova podobnost embeddingů v závislosti na kontextu daného slova.

Podívejte se na výstup tokenizeru před vstupem do BERT modelu - kolik tokenů bylo vytvořeno pro větu "Hello, this is Bert."? Zdůvodněte jejich počet.

<br>
Pozn.: Vyřešení předchozí otázky Vám pomůže zjistit, který vektor z výstupu pro cílové slovo použít.

In [2]:
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = ["its mean of rows",
        "mean by column",
        "You are mean to me",
        "its mean of rows"
]
### tokens ###
tokens = tokenizer.tokenize(text[0])
print(tokens) # + [CLS]101 a [SEP]102
### 1
# tok1 = tokenizer(text[0], return_tensors='pt')
# tok2 = tokenizer(text[1], return_tensors='pt')
# 2
tok1 = tokenizer(text[2], return_tensors='pt')
tok2 = tokenizer(text[3], return_tensors='pt')
print(tok1['input_ids'],tok1['input_ids']) #1 and 5 pos
## 1
# idx1 = 2
# idx2 = 1
# 2
idx1 = 3
idx2 = 2
###
## fit model
with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)

## get last hidden layer
states1 = out1.last_hidden_state.squeeze()
states2 = out2.last_hidden_state.squeeze()
print(states1.shape,states2.shape)

## get embedings
emb1 = states1[idx1]
emb2 = states1[idx2]
print(emb1.shape,emb2.shape)


cs = torch.cosine_similarity(emb1.reshape(1,-1), emb2.reshape(1,-1))
print(cs) ## cosine == 1

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['its', 'mean', 'of', 'rows']
tensor([[ 101, 2017, 2024, 2812, 2000, 2033,  102]]) tensor([[ 101, 2017, 2024, 2812, 2000, 2033,  102]])
torch.Size([7, 768]) torch.Size([6, 768])
torch.Size([768]) torch.Size([768])
tensor([0.4805])


## Bonus - Vizualizace slovních  embeddingů

Vizualizujte slovní embeddingy - mění se jejich pozice v závislosti na kontextu tak, jak byste očekávali? Pokuste se vizualizovat i některá slova, ke kterým by se podle vás cílové slovo mělo po změně kontextu přiblížit.

In [32]:
from sklearn.manifold import TSNE
import plotly.express as px
from sklearn.decomposition import PCA

In [42]:

text = ["its mean of rows",
        "its mean of cols",
        "You are mean to me",
]
idxs = [2,2,3]
embs = []
for one_text,idx in zip(text,idxs):
    tokens = tokenizer.tokenize(text[0])
    tok = tokenizer(text[2], return_tensors='pt')
    with torch.no_grad():
        out1 = model(**tok)
    states2 = out2.last_hidden_state.squeeze()
    embs.append(states1[idx].detach().numpy().reshape(1,-1))



In [46]:
embedings = np.concatenate([embs[0],embs[1],embs[2]],axis=0)
pca = PCA(n_components=2)
out = pca.fit(embedings).transform(embedings)
plot = px.scatter(out[:, 0], out[:, 1],color = text)
plot.update_coloraxes(showscale=False)
plot.layout.template = 'plotly'
plot