#Section 1: Understand BERT Encoding: Tokenization and Output<br>


###Part1: Understand BERT Encoding Tokenization<br>
Comparing two approaches:
1. Calling the tokenizer directly, which returns everything at once: input_ids, token_type_ids, attention_mask <br>
2. Calling the tokenizer's helper functions step by step to reach the same tokenization result

In [19]:
import torch
from transformers import BertTokenizer, BertModel

**1. Calling the tokenizer directly, which returns input_ids, token_type_ids, and attention_mask**

In [16]:
# Load pre-trained model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
input_text = "Hello, BERT!"
inputs = tokenizer(input_text, return_tensors='pt')   # return_tensors='pt' returns PyTorch tensors instead of Python lists
print(inputs)
'''
===============================================
input_ids: the vocabulary ID of each token
attention_mask: 1 for real tokens and 0 for '[PAD]' tokens, so that attention ignores padding
token_type_ids: the segment each token belongs to (0 or 1), used in tasks with two text segments such as question answering
===============================================
'''

# If you specifically want just the initial embeddings (before any transformer layer):
initial_embeddings = model.embeddings(inputs['input_ids'])     # look up the pretrained embeddings by token ID (word + position + token-type embeddings, then LayerNorm)
print(initial_embeddings.shape)


{'input_ids': tensor([[  101,  7592,  1010, 14324,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
torch.Size([1, 6, 768])
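
The single-sentence example above never shows a 0 in attention_mask or a 1 in token_type_ids. As a rough illustration of when those values appear, here is a minimal pure-Python sketch of a padded two-segment input. The helper `encode_pair`, its `max_length` argument, and the toy token lists are hypothetical and only mirror the `[CLS] A [SEP] B [SEP]` layout; this is not the tokenizer's own implementation.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Hypothetical helper: lay out two token lists the way BERT expects a sentence pair."""
    first = ['[CLS]'] + tokens_a + ['[SEP]']    # segment 0, including [CLS] and the first [SEP]
    second = tokens_b + ['[SEP]']               # segment 1, including the final [SEP]
    tokens = first + second
    if len(tokens) > max_length:
        raise ValueError("max_length is too small for this pair")
    padding = max_length - len(tokens)

    token_type_ids = [0] * len(first) + [1] * len(second) + [0] * padding
    attention_mask = [1] * len(tokens) + [0] * padding   # 0 marks positions attention should ignore
    tokens = tokens + ['[PAD]'] * padding
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = encode_pair(
    ['who', 'is', 'bert', '?'], ['a', 'language', 'model'], max_length=12
)
for token, segment, mask in zip(tokens, token_type_ids, attention_mask):
    print(f"{token:10} segment={segment} mask={mask}")
```

With the real tokenizer, a comparable layout should come from passing two texts together with padding options, but the exact arguments are best checked against the library documentation.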


***2. Calling the tokenizer's helper functions step by step to reach the same tokenization result***

In [13]:
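tokens = tokenizer.tokenize(input_text)                # split the text into WordPiece tokens
tokens = ['[CLS]'] + tokens + ['[SEP]']                # add the special tokens BERT expects
tokens_id = tokenizer.convert_tokens_to_ids(tokens)    # map each token to its vocabulary ID
mask = [1 if ele != '[PAD]' else 0 for ele in tokens]  # 1 for real tokens, 0 for padding
print(tokens, tokens_id, mask)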

['[CLS]', 'hello', ',', 'bert', '!', '[SEP]'] [101, 7592, 1010, 14324, 999, 102] [1, 1, 1, 1, 1, 1]
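
The `tokenizer.tokenize` step above relies on WordPiece, which splits each word greedily into the longest pieces found in the vocabulary and marks continuation pieces with `##`. The following is a simplified sketch of that idea only. The function `wordpiece` and the `toy_vocab` set are invented for illustration, and lowercasing and punctuation splitting are left out.

```python
def wordpiece(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate   # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example (the real one is far larger).
toy_vocab = {'hello', 'bert', 'token', '##ization', '##izer'}

print(wordpiece('hello', toy_vocab))
print(wordpiece('tokenization', toy_vocab))
print(wordpiece('xyz', toy_vocab))
```

A word that is already in the vocabulary, such as 'hello' or 'bert' in the output above, comes back as a single piece, which is why the manual steps reproduce the tokenizer's result exactly.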


###Part2: Understand BERT Encoding Output<br>

In [None]:
# This gives you the embeddings after all transformer layers.
with torch.no_grad():              # no_grad() disables gradient tracking; nothing is trained here, so no gradients are needed
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state
    pooler_output = outputs.pooler_output
'''
===============================================
'last_hidden_state'
    The output of the last layer of the BERT model. It contains one contextualized embedding per token, each computed by attending to the other tokens in the input sequence.
    It is typically used where token-level representations are required, such as sequence labeling (e.g., named entity recognition) or other token-level classification tasks.
    dimension = [batch_size, sequence_length, hidden_size]
'pooler_output'
    A single vector for the entire input sequence, computed from the last_hidden_state of the first token, '[CLS]'.
    In BERT-family models this "pooling" does not aggregate over all tokens the way a CNN pooling layer does: the '[CLS]' vector is passed through a linear layer followed by a tanh activation.
    The purpose of the pooler_output is to provide a fixed-size representation of the entire input sequence, intended to summarize its overall content.
    This representation is often used as input to downstream tasks such as sentence classification, where a single representation of the entire sentence is required.
    dimension = [batch_size, hidden_size]
===============================================
'''
print(last_hidden_state.shape, pooler_output.shape)
print(last_hidden_state, pooler_output)
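
To make the relationship between the two outputs concrete, here is a toy re-implementation of the pooler step on plain Python lists: take the first token's vector, apply a linear layer, then tanh. The function `pooler`, the identity weight matrix, and all numbers are made up for illustration; the real model uses its own learned weights.

```python
import math

def pooler(last_hidden_state, weight, bias):
    """Toy pooler: tanh(W @ h_cls + b) for each sequence in the batch.

    last_hidden_state is a nested list shaped [batch][tokens][hidden].
    """
    pooled = []
    for sequence in last_hidden_state:
        cls_vector = sequence[0]               # only the first token ('[CLS]') is used
        pooled.append([
            math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
            for row, b in zip(weight, bias)
        ])
    return pooled


# Made-up numbers: batch of 1, 3 tokens, hidden size 2.
toy_hidden = [[[0.5, -1.0], [9.0, 9.0], [-9.0, 9.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]            # stand-in for the learned weight matrix
zero_bias = [0.0, 0.0]

toy_pooled = pooler(toy_hidden, identity, zero_bias)
print(toy_pooled)
```

The other two token vectors never influence the result, which matches the shape change from [batch_size, sequence_length, hidden_size] to [batch_size, hidden_size].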

#Section 2: How the embedding similarity of the word "bank" in different sentences changes through the layers <br>
Compare 3 stages:<br>
1. initial embedding similarity
2. last_hidden_state embedding similarity
3. embedding similarity at each layer

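**1. initial embedding similarity** <br>
The initial embedding is looked up from the token ID, so the same word should get nearly the same embedding even in different contexts. Note that model.embeddings also adds position embeddings, so the similarity is essentially 1 here because "bank" occupies the same position in both sentences below.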

In [31]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Function to get the initial embedding of a word from a sentence
def get_initial_word_embedding(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    word_id = tokenizer.convert_tokens_to_ids(word)
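    # Position of the first occurrence of the word (assumes the word is a single token in the vocabulary)
    word_position = inputs["input_ids"][0].tolist().index(word_id)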
    # Extracting the initial embeddings
    initial_embeddings = model.embeddings(inputs["input_ids"])
    return initial_embeddings[0][word_position].detach().numpy()

# Compare initial embeddings for the word 'bank' in two different contexts
sentence1 = "I sat by the river bank."
sentence2 = "I deposited money in the bank."

embedding1 = get_initial_word_embedding(sentence1, "bank")
embedding2 = get_initial_word_embedding(sentence2, "bank")

# Calculate cosine similarity to measure how close the two embeddings are
# (cosine similarity is the dot product of the length-normalized vectors)
similarity = torch.nn.functional.cosine_similarity(
    torch.tensor(embedding1).unsqueeze(0), torch.tensor(embedding2).unsqueeze(0)
)

print(f"Cosine similarity between the initial embeddings: {similarity.item()}")


Cosine similarity between the initial embeddings: 0.9999998807907104
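
Since the comparison depends on cosine similarity rather than a raw dot product, a small standard-library sketch of the difference may help. The vectors and helper names below are arbitrary examples, not BERT embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


v = [1.0, 2.0, 3.0]
v_scaled = [2.0, 4.0, 6.0]      # same direction as v, twice the length
w = [3.0, 0.0, -1.0]            # orthogonal to v

print(dot(v, v_scaled), cosine_similarity(v, v_scaled))
print(dot(v, w), cosine_similarity(v, w))
```

The dot product grows with vector length, while cosine similarity depends only on direction, which makes it the more convenient measure for comparing embeddings across layers.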


**2. last_hidden_state embedding similarity** <br>
By the final layer, the identical initial embeddings have been enriched with contextual semantics through the attention and feed-forward layers, so they should differ according to their context.

In [29]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Function to get the embedding of a word from a sentence
def get_word_embedding(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model(**inputs)
    word_id = tokenizer.convert_tokens_to_ids(word)
    word_position = inputs["input_ids"][0].tolist().index(word_id)
    return outputs["last_hidden_state"][0][word_position].detach().numpy()

# Compare embeddings for the word 'bank' in two different contexts
sentence1 = "I sat by the river bank."
sentence2 = "I deposited money in the bank."

embedding1 = get_word_embedding(sentence1, "bank")
embedding2 = get_word_embedding(sentence2, "bank")

# Calculate cosine similarity to measure how close the two embeddings are
# (cosine similarity is the dot product of the length-normalized vectors)
similarity = torch.nn.functional.cosine_similarity(
    torch.tensor(embedding1).unsqueeze(0), torch.tensor(embedding2).unsqueeze(0)
)

print(f"Cosine similarity between the embeddings: {similarity.item()}")

Cosine similarity between the embeddings: 0.5257285833358765
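
The drop in similarity comes from attention mixing each token with its neighbours. As a deliberately simplified sketch of that mechanism, the code below runs one unprojected attention step (no learned query, key, or value matrices, no feed-forward layer) over hand-picked 2-d vectors. The names `self_attention` and `toy_embeddings` and all values are assumptions made for illustration, not anything taken from BERT.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unprojected single-head attention: each output is a softmax-weighted mean of all inputs."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        outputs.append([
            sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs


# Hypothetical 2-d embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.6, 0.6],
}

river_context = self_attention([toy_embeddings["river"], toy_embeddings["bank"]])
money_context = self_attention([toy_embeddings["money"], toy_embeddings["bank"]])

bank_by_river = river_context[1]   # "bank" after attending to "river"
bank_by_money = money_context[1]   # "bank" after attending to "money"
print(bank_by_river)
print(bank_by_money)
```

Both runs start from the same vector for "bank", yet the outputs lean toward the neighbouring word, which is the same effect the cosine similarity above measures on the real model.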


**3. embedding similarity at each layer**

In [32]:
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Tokenize the sentences
sentence1 = "I sat by the river bank."
sentence2 = "I deposited money in the bank."

inputs1 = tokenizer(sentence1, return_tensors="pt")
inputs2 = tokenizer(sentence2, return_tensors="pt")

word_id = tokenizer.convert_tokens_to_ids("bank")
word_position1 = inputs1["input_ids"][0].tolist().index(word_id)
word_position2 = inputs2["input_ids"][0].tolist().index(word_id)

# Get initial embeddings
initial_embeddings1 = model.embeddings(inputs1["input_ids"])
initial_embeddings2 = model.embeddings(inputs2["input_ids"])

cosine_sim = torch.nn.functional.cosine_similarity(initial_embeddings1[0][word_position1].unsqueeze(0),
                                                   initial_embeddings2[0][word_position2].unsqueeze(0))
print(f"Layer 0 (initial embeddings) similarity: {cosine_sim.item()}")

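# Process both sentences through each BERT layer
# (calling the model with output_hidden_states=True would return the same per-layer states as outputs.hidden_states)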
hidden_states1 = [initial_embeddings1]
hidden_states2 = [initial_embeddings2]

for i, layer in enumerate(model.encoder.layer):
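    # Each layer takes the previous hidden state as input. Passing the raw 0/1 attention mask is
    # only safe here because neither sentence contains padding; model(**inputs) converts the mask
    # to an additive form before it reaches the layers.
    layer_output1 = layer(hidden_states1[-1], attention_mask=inputs1["attention_mask"])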
    hidden_states1.append(layer_output1[0])

    layer_output2 = layer(hidden_states2[-1], attention_mask=inputs2["attention_mask"])
    hidden_states2.append(layer_output2[0])

    cosine_sim = torch.nn.functional.cosine_similarity(hidden_states1[-1][0][word_position1].unsqueeze(0),
                                                       hidden_states2[-1][0][word_position2].unsqueeze(0))
    print(f"Layer {i + 1} similarity: {cosine_sim.item()}")


Layer 0 (initial embeddings) similarity: 0.9999998807907104
Layer 1 similarity: 0.7583112120628357
Layer 2 similarity: 0.6887193322181702
Layer 3 similarity: 0.6551119685173035
Layer 4 similarity: 0.5860087275505066
Layer 5 similarity: 0.5718531012535095
Layer 6 similarity: 0.5652671456336975
Layer 7 similarity: 0.5206233263015747
Layer 8 similarity: 0.4970075488090515
Layer 9 similarity: 0.49416643381118774
Layer 10 similarity: 0.5007321834564209
Layer 11 similarity: 0.5522933006286621
Layer 12 similarity: 0.5257285833358765
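
One way to read the numbers above is to check whether the similarity falls steadily from layer to layer. The short sketch below does that on the printed values, rounded to four decimals; the variable names are illustrative only.

```python
# Similarities copied from the output above, rounded to four decimals (index = layer number).
layer_similarity = [1.0000, 0.7583, 0.6887, 0.6551, 0.5860, 0.5719, 0.5653,
                    0.5206, 0.4970, 0.4942, 0.5007, 0.5523, 0.5257]

lowest_layer = min(range(len(layer_similarity)), key=layer_similarity.__getitem__)
# Change from each layer to the next; a positive value means the embeddings moved closer again.
changes = [round(b - a, 4) for a, b in zip(layer_similarity, layer_similarity[1:])]
rising_layers = [i + 1 for i, delta in enumerate(changes) if delta > 0]

print(f"Lowest similarity at layer {lowest_layer}: {layer_similarity[lowest_layer]}")
print(f"Layers where similarity rose again: {rising_layers}")
```

For this sentence pair the decline is not monotonic: the two "bank" embeddings are furthest apart in the upper-middle layers and move slightly closer again before the final layer. This is a single example, so it should not be read as a general property of BERT.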
