# Comparing contextual vectors

In [None]:
!pip install torch transformers datasets bertviz seaborn scikit-learn

In [55]:
import numpy as np
import datasets
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA

import seaborn as sns
import matplotlib.pyplot as plt
from bertviz import model_view, head_view

So far, we have been working with *static* word (or sequence) embeddings.   
Word2vec is an example of static word embeddings: it has a fixed vocabulary, and each word has a single embedding that stays the same no matter the context.

But as we know, words can take on different meanings depending on their context. 
Contextualized embeddings aim to learn something about the semantics of the whole sequence. Practically, contextualized embeddings, like those produced by Transformer models, generate different embeddings for the same word based on its surrounding context. 

This should make it possible to capture the difference between "bank" in "river bank" and "bank" in "bank loan". 

One of the mechanisms that allows this is attention.
Let's start by seeing whether we can recreate the example from the lecture, `"The cat sat on the mat"`, with a real model.

In [6]:
sentence = "The cat sat on the mat"

In [10]:
# Load a small pretrained transformer model to extract embeddings and attention
model_name = "prajjwal1/bert-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name, output_attentions=True, attn_implementation="eager")

In [None]:
# Tokenize
inputs = tokenizer(sentence, return_tensors='pt')
inputs

In [11]:
# Run inference
with torch.no_grad():
    outputs = model(**inputs)

In [23]:
# Extract embeddings (last hidden state) and attention from the model
embeddings = outputs.last_hidden_state  
attentions = outputs.attentions  

# Decode tokens for visualization
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

In [None]:
# embeddings is a tensor of shape (batch_size, sequence_length, embedding_size)
embeddings.shape

In [None]:
# attentions is a tuple of tensors (one per layer)...
[
    type(attentions),
    len(attentions)
]


In [None]:
# ...where each layer has (batch_size, num_heads, sequence_length, sequence_length)
attentions[0].shape
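
To make the last two dimensions of that shape more concrete, here is a minimal sketch of how one `(sequence_length, sequence_length)` attention matrix can arise. It is not the model's actual computation: a real head first applies learned query and key projections, which are skipped here. The vectors in `toy_vectors` are made-up numbers, and `softmax` and `attention_weights` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one softmax-normalized row per query."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# Hypothetical 2-dimensional vectors for three tokens (made-up numbers)
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_attention = attention_weights(toy_vectors, toy_vectors)

# Row i holds how much token i attends to every token in the sequence
for row in toy_attention:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Every row sums to 1 because of the softmax. The same should hold, up to floating-point error, for the real weights, which you can check with `attentions[0][0, 0].sum(dim=-1)`.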

Let's see how that looks visualized.

This specific model has just 4 attention layers, which is quite few, and 8 attention heads per layer.

The number of heads has nothing to do with the specific sentence (which happens to be 8 tokens long); it is set by the model's architecture.

In [None]:
head_view(attentions, tokens)

Or you can plot each attention head separately:

In [None]:
# Attention from the first layer, first head
L1H1 = attentions[0][0, 0]

# Plot the full attention matrix: row i shows how much token i attends to each token (row 0 is [CLS])
plt.figure(figsize=(10, 6))
sns.heatmap(L1H1, annot=True, xticklabels=tokens, yticklabels=tokens, cmap='Blues')
plt.show()

In [None]:
# plot all attention heads in the model
model_view(attentions, tokens)

Relevant at this point (and covered in the readings later):

- [CLS] is a special token added at the beginning of each sequence. It stands for “classification” and should contain some information representing the whole sequence.
- [SEP] is a special token used to separate two sequences in tasks like question answering or sentence pair classification. It also marks the end of a single sequence when only one is provided.

These special tokens are specific to the BERT architecture (Devlin et al., 2019).

Now, what do these connections mean?

1. Which attention heads capture words attending to themselves?
2. Which heads capture the order of words in the sentence?
3. Which heads tell the [CLS] token what is important in the sentence?

### Comparing embeddings – Projections
Let's try to project the embeddings using dimensionality reduction to get more insight into how meaning is represented in our model.  


In [None]:
embeddings = embeddings.squeeze(0)  # Remove the batch dimension

# Apply PCA to reduce dimensionality to 2 components
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings.cpu().numpy())

# Plot the 2D projection of BERT embeddings
plt.figure(figsize=(10, 8))
for i, token in enumerate(tokens):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], marker='x', color='blue')
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1], token, fontsize=12)


Notice where the embeddings landed.
What is happening here?

What characteristic do the outlier words share?  
What kind of words are close to the central [CLS] token?

### Paraphrasing – distance between sentences

The goal of the model is to embed similar sentences close to each other in the embedding space. 
A nice case for this is paraphrasing – saying (almost) the same thing using different words. 
If a model is able to detect paraphrases, this is a good indication of its performance.

So, our goal is to test whether the model understands that 
`"In 1995, the last survey, those numbers were equal."` is equivalent to `"The last time the survey was conducted, in 1995, those numbers matched."`

To do that, we will measure how close the embeddings of the two sentences are.
We have to implement a similarity metric first.

In [57]:
# Implement a cosine similarity in any way
def cosine_similarity(A, B) -> float:
    """
    Dot product of two vectors divided by the product of their norms (magnitudes)
    """
    pass

# test if it works
testing_results = cosine_similarity(
    A = torch.Tensor([0.5, 0.2, 0.7]),
    B = torch.Tensor([0.2, 0.1, 0.5])
    )

assert np.isclose(testing_results, 0.9716)
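
If you get stuck, here is one possible reference sketch in plain Python, not the only valid solution. The name `cosine_similarity_reference` is hypothetical and chosen so it does not overwrite your own function. It converts every element with `float()`, so it should accept lists as well as 1-D tensors.

```python
import math

def cosine_similarity_reference(A, B) -> float:
    """Dot product of A and B divided by the product of their norms."""
    a = [float(x) for x in A]  # float() handles plain numbers and 0-d tensor elements
    b = [float(x) for x in B]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same numbers as in the test above
result = cosine_similarity_reference([0.5, 0.2, 0.7], [0.2, 0.1, 0.5])
print(round(result, 4))
```

A tensor-native alternative is `F.cosine_similarity(A, B, dim=0)`, which gives the same quantity for 1-D inputs without looping in Python.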

In [None]:
# Implement a function for extracting embeddings
def infer_embedding(sentence: str) -> torch.Tensor:
    """
    Return the last hidden state for `sentence`,
    with shape (sequence_length, embedding_size)
    """
    # TODO tokenize (with a tokenizer)

    # TODO inference (torch.no_grad)

    # TODO squeeze last hidden state and return it
    pass

assert infer_embedding("test test test").shape == (5, 512)

In [None]:
original_sent = "In 1995, the last survey, those numbers were equal."
paraphrased_sent = "The last time the survey was conducted, in 1995, those numbers matched."

In [None]:
# run the little test

emb_orig_sent = infer_embedding(original_sent)
emb_para_sent = infer_embedding(paraphrased_sent)

cls_orig_sent = emb_orig_sent[0, :]
cls_para_sent = emb_para_sent[0, :]

# Pool over the token dimension (dim=0) to get one vector per sentence
mean_orig_sent = torch.mean(emb_orig_sent, dim=0)
mean_para_sent = torch.mean(emb_para_sent, dim=0)

# torch.max with a dim returns (values, indices); keep only the values
max_orig_sent = torch.max(emb_orig_sent, dim=0).values
max_para_sent = torch.max(emb_para_sent, dim=0).values

print(f"Cosine similarity (CLS token): {cosine_similarity(cls_orig_sent, cls_para_sent)}")
print(f"Cosine similarity (Mean pooling): {cosine_similarity(mean_orig_sent, mean_para_sent)}")
print(f"Cosine similarity (Max pooling): {cosine_similarity(max_orig_sent, max_para_sent)}")
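
The three pooling strategies are easier to follow on a tiny example. The sketch below uses a made-up `(3 tokens, 3 dimensions)` matrix in plain Python; the helper names `cls_pool`, `mean_pool` and `max_pool` are hypothetical and only illustrate which axis each strategy collapses.

```python
def cls_pool(token_vectors):
    """Use the first token's vector ([CLS] in BERT) as the sentence vector."""
    return list(token_vectors[0])

def mean_pool(token_vectors):
    """Average each embedding dimension over all tokens."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the largest value of each embedding dimension over all tokens."""
    return [max(column) for column in zip(*token_vectors)]

# Made-up embeddings: 3 tokens (rows) x 3 dimensions (columns)
toy_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, -1.0, 0.0],
    [2.0, 4.0, 1.0],
]

print(cls_pool(toy_embeddings))   # [1.0, 0.0, 2.0]
print(mean_pool(toy_embeddings))  # [2.0, 1.0, 1.0]
print(max_pool(toy_embeddings))   # [3.0, 4.0, 2.0]
```

All three return one vector with as many entries as the embedding has dimensions, regardless of how many tokens the sentence contains. That is what makes two sentences of different lengths comparable.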

Thoughts?  
Do the similarities indicate that the sentences are paraphrases of each other?  

Let's see how representative the different measures are in a projection:

In [None]:
# Fit PCA on the original sentence and reuse the same projection for everything else
pca = PCA(n_components=2)
emb_orig_2d = pca.fit_transform(emb_orig_sent.numpy())
emb_para_2d = pca.transform(emb_para_sent.numpy())

# PCA expects 2D input, so stack the two [CLS] vectors into a (2, embedding_size) matrix
cls_2d = pca.transform(torch.stack([cls_orig_sent, cls_para_sent]).numpy())

plt.figure(figsize=(10, 8))
# Token embeddings of each sentence
plt.scatter(emb_orig_2d[:, 0], emb_orig_2d[:, 1], marker='x', color='blue', label='tokens (original)')
plt.scatter(emb_para_2d[:, 0], emb_para_2d[:, 1], marker='x', color='green', label='tokens (paraphrase)')
# [CLS] vectors of both sentences
plt.scatter(cls_2d[:, 0], cls_2d[:, 1], marker='o', color='red', label='[CLS]')
plt.legend()
plt.show()

Bonus: 
- Dependency distance: does it relate to attention? Check using `spacy`.

### First taste of masked learning

In [None]:
# Masked language modeling
# 1. Define the sentence
sentence = "The [MASK] sat on the mat"

# 2. Load a pretrained BERT tokenizer and model for embedding generation
#    (output_hidden_states=True is needed, otherwise outputs.hidden_states is None)
model_name = "prajjwal1/bert-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name, output_attentions=True, output_hidden_states=True,
    attn_implementation="eager")

# 3. Tokenize the sentence
inputs = tokenizer(sentence, return_tensors='pt')

# 4. Generate embeddings and attention weights using BERT
with torch.no_grad():
    outputs = model(**inputs)

# Extract hidden states (embeddings) and attention from the model
hidden_states = outputs.hidden_states  # tuple (embedding output + one per layer) of (batch, seq_len, hidden_size)
attentions = outputs.attentions  # tuple (one per layer) of (batch, num_heads, seq_len, seq_len)

# Convert token IDs back to tokens for visualization
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

head_view(attentions, tokens)

In [None]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model=model_name)
unmasker("The [MASK] sat on the mat")

# Building a sentence similarity benchmark

- create a dataset
    - do we need all the datapoints? 
    - the data is scored on an ordinal scale, how do we turn it into a binary problem?
- pick models to compare
- embed the texts
- evaluate accuracy
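
As a starting point for the last steps, here is a minimal sketch of turning graded similarity scores into a binary paraphrase problem and scoring a model against it. All numbers are made up, and both thresholds (3.0 on the 0 to 5 STS scale, 0.7 on cosine similarity) are assumptions to tune rather than recommended values. `binarize` and `accuracy` are hypothetical helper names.

```python
def binarize(scores, threshold):
    """Label a pair as a paraphrase (True) when its score reaches the threshold."""
    return [score >= threshold for score in scores]

def accuracy(predictions, labels):
    """Share of pairs where the prediction matches the gold label."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Made-up human similarity ratings on a 0-5 scale
gold_scores = [4.8, 0.5, 3.2, 1.0]
# Made-up cosine similarities produced by some model for the same pairs
model_similarities = [0.91, 0.35, 0.62, 0.80]

gold_labels = binarize(gold_scores, threshold=3.0)         # assumed cut-off
predicted = binarize(model_similarities, threshold=0.7)    # assumed cut-off

print(gold_labels)                     # [True, False, True, False]
print(predicted)                       # [True, False, False, True]
print(accuracy(predicted, gold_labels))  # 0.5
```

One option for the "do we need all the datapoints?" question is to drop pairs with mid-range scores before binarizing, so that only clear paraphrases and clear non-paraphrases remain. Whether that is appropriate depends on what you want the benchmark to measure.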

In [None]:
# load the datasets
raw_pairs = datasets.load_dataset("mteb/stsb_multi_mt", name="en", split="test")


# create a model registry
model_names = [
    ""
]

In [None]:
raw_pairs

In [None]:
# downsample
sample = raw_pairs.train_test_split(train_size=600, stratify_by_column="label", seed=42)

# infer
embeddings = infer_embedding()

Bonus 1:
- find edge cases (instances where the model fails)
- can edge cases be explained with dependency distance?

Bonus 2:
- compare static and contextualized embeddings on the paraphrasing task
- e.g. get representations using GloVe (Class 2, `embeddings = gensim.downloader.load("glove-wiki-gigaword-300")`)