#### 01. Context-Aware Embeddings - BERT

- Traditional word embeddings like Word2Vec or GloVe assign a single vector to each word, regardless of context.

- This is problematic for words with multiple meanings (polysemy), such as "bank" (river bank vs. financial bank).

- Context-aware models like BERT generate different embeddings for the same word depending on its context in a sentence.
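
To make the limitation of static embeddings concrete, here is a minimal sketch of a lookup-table embedding. The table `static_table`, its 3-dimensional values, and the helper `static_embedding` are all hypothetical and chosen only for illustration; they are not taken from Word2Vec or GloVe.

```python
import numpy as np

# Hypothetical static embedding table (illustrative values only).
static_table = {
    "river": np.array([0.9, 0.1, 0.0]),
    "money": np.array([0.0, 0.2, 0.9]),
    "bank": np.array([0.4, 0.5, 0.4]),
}

def static_embedding(sentence, word):
    """Look up `word` in the table. The sentence is ignored entirely,
    which is exactly the limitation of static embeddings."""
    return static_table[word]

v_river = static_embedding("He sat by the river bank.", "bank")
v_money = static_embedding("She deposited money in the bank.", "bank")

# Both senses of "bank" receive the identical vector.
print(np.array_equal(v_river, v_money))  # True
```

Because the lookup never sees the sentence, no amount of surrounding context can separate the two senses.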


In [None]:
from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained BERT tokenizer and model.
# The tokenizer splits sentences into the tokens BERT expects,
# breaking words outside its vocabulary into subword pieces.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# The embedding (hidden) dimension is fixed by the pre-trained checkpoint:
# model.config.hidden_size == 768 for bert-base-uncased.



In [None]:
# Two sentences with "bank" in different contexts
sentences = [
    "He sat by the river bank.",
    "She deposited money in the bank."
]

def get_word_embedding(sentence, target_word):
    '''
    Tokenize the sentence and pass it through BERT.
    Return the last-layer hidden state of the first token that matches
    target_word (here "bank"), together with the list of tokens.
    Return (None, tokens) if no token matches.
    '''

    # Tokenize and get input IDs ('pt' returns PyTorch tensors; 'tf' returns TensorFlow tensors)
    inputs = tokenizer(sentence, return_tensors='pt')
    # BERT accepts at most 512 tokens, including the special tokens [CLS] and [SEP].
    # For longer inputs, truncate explicitly:
    # inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True)

    with torch.no_grad():  # inference only: no gradient tracking, so faster and less memory
        outputs = model(**inputs)  # inputs is a dict: {'input_ids': ..., 'token_type_ids': ..., 'attention_mask': ...}
    
    # Get the last hidden state, shape (batch_size, seq_len, hidden_size),
    # i.e. the output of the final layer for each input token
    last_hidden_state = outputs.last_hidden_state.squeeze(0)  # drops the batch dim (size 1 here)
    
    # Convert input IDs back to token strings so positions can be matched
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Find the index of the target word, taking the first occurrence for simplicity.
    # Exact matching avoids false hits such as "banking"; the tokenizer is uncased,
    # so target_word must be lowercase. Words split into subword pieces will not match.
    for i, token in enumerate(tokens):
        if token == target_word:
            return last_hidden_state[i].numpy(), tokens
    return None, tokens

# Get embeddings for "bank" in both sentences
vec1, tokens1 = get_word_embedding(sentences[0], "bank")
vec2, tokens2 = get_word_embedding(sentences[1], "bank")

print("Tokens in sentence 1:", tokens1)
print("Tokens in sentence 2:", tokens2)
print("Embedding for 'bank' in sentence 1 (river context):", vec1[:5])  # Show first 5 dims
print("Embedding for 'bank' in sentence 2 (money context):", vec2[:5])

Tokens in sentence 1: ['[CLS]', 'he', 'sat', 'by', 'the', 'river', 'bank', '.', '[SEP]']
Tokens in sentence 2: ['[CLS]', 'she', 'deposited', 'money', 'in', 'the', 'bank', '.', '[SEP]']
Embedding for 'bank' in sentence 1 (river context): [ 0.15994921 -0.33814338 -0.03246783 -0.08658472 -0.39891648]
Embedding for 'bank' in sentence 2 (money context): [ 0.3031039  -0.36687252 -0.35636595  0.1448596   1.0418966 ]
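
The lookup above only handles words that survive tokenization as a single token. As a rough sketch of how subword pieces could be handled, the hypothetical helper `find_word_spans` below regroups WordPiece tokens (continuation pieces start with `##`) into words and returns the index span of each match. The token list is hand-written for illustration; the real tokenizer may split "embankment" differently.

```python
def find_word_spans(tokens, target_word):
    """Return (start, end) index pairs, end exclusive, of WordPiece runs
    that spell target_word. A token starting with '##' continues the
    previous word."""
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("##"):
            i += 1  # stray continuation piece with no word start; skip it
            continue
        word = tokens[i]
        j = i + 1
        # Absorb all continuation pieces that belong to this word
        while j < len(tokens) and tokens[j].startswith("##"):
            word += tokens[j][2:]
            j += 1
        if word == target_word:
            spans.append((i, j))
        i = j
    return spans

# Hand-written tokens, assumed for illustration only.
example_tokens = ["[CLS]", "the", "em", "##bank", "##ment", "near", "the", "bank", "[SEP]"]
print(find_word_spans(example_tokens, "bank"))        # [(7, 8)]
print(find_word_spans(example_tokens, "embankment"))  # [(2, 5)]
```

Given a span, one common choice is to average the hidden states of its pieces, for example `last_hidden_state[start:end].mean(dim=0)`, to get a single vector for the word.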


In [None]:
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

similarity = cosine_similarity(vec1, vec2)
print("Cosine similarity between 'bank' in different contexts:", similarity)
# The two vectors differ, so the cosine similarity is below 1:
# same word, different context, different embedding.

Cosine similarity between 'bank' in different contexts: 0.5278751


1. MLM (Masked Language Modeling)
- Masked Language Modeling (MLM) is the main pre-training task for BERT.
- During training, about 15% of the input tokens are selected at random, and most of the selected tokens are replaced with a [MASK] token.
- The model learns to predict the original token at each selected position using the context from both sides (left and right). This teaches BERT how words relate to their surrounding context.
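
The masking step can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it assumes the rates reported in the original BERT paper (15% of tokens selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged) and samples each position independently. The function `mask_tokens` and the tiny `toy_vocab` are hypothetical names introduced here.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels). labels[i] holds the original token
    at positions selected for prediction and None everywhere else."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # position not selected
        labels[i] = tok  # the model must predict this token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

mlm_tokens = ["[CLS]", "she", "deposited", "money", "in", "the", "bank", ".", "[SEP]"]
toy_vocab = ["river", "money", "bank", "sat", "the"]  # hypothetical tiny vocabulary
masked, labels = mask_tokens(mlm_tokens, toy_vocab)
print(masked)
print(labels)
```

The training loss is computed only at positions where `labels` is not None, so the model is never rewarded for simply copying unmasked input.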

2. NSP (Next Sentence Prediction)
- Next Sentence Prediction (NSP) is another pre-training task for BERT.
- The model is given pairs of sentences and must predict whether the second sentence actually follows the first in the original text.
- 50% of the time, the second sentence is the actual next sentence; 50% of the time, it is a random sentence from the corpus. This helps BERT model relationships between sentences, which is useful for tasks like Question Answering and Natural Language Inference.
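
A minimal sketch of how such training pairs could be built, assuming the corpus is a list of documents and each document is a list of sentences. The function `make_nsp_pairs` and the toy corpus are hypothetical; here the negative sentence is drawn from a different document so that it cannot be the true continuation.

```python
import random

def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples.
    Requires at least two documents so negatives can be sampled."""
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample negatives")
    rng = random.Random(seed)
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the true next sentence
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from another document
                others = [k for k in range(len(documents)) if k != doc_idx]
                other_doc = documents[rng.choice(others)]
                pairs.append((doc[i], rng.choice(other_doc), False))
    return pairs

# Toy corpus, assumed for illustration only.
documents = [
    ["He walked to the river.", "He sat by the river bank.", "The water was calm."],
    ["She went downtown.", "She deposited money in the bank.", "Then she went home."],
]
pairs = make_nsp_pairs(documents)
for sentence_a, sentence_b, is_next in pairs:
    print(f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP] -> is_next={is_next}")
```

Each pair is fed to the model as one sequence, and the prediction is read from the [CLS] position.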

3. Bidirectional
- BERT is bidirectional, meaning it looks at the entire sentence (both left and right context) when encoding each word.
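- Unidirectional models like GPT read text left-to-right only, and bidirectional LSTMs combine separate left-to-right and right-to-left passes rather than looking at both sides at once. BERT's self-attention conditions every token on the full sentence in every layer, which is what makes its embeddings context-aware.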
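
The difference can be pictured as a visibility matrix, where entry (i, j) is 1 if token i may attend to token j. This is a conceptual sketch only: the names `bidirectional_mask` and `causal_mask` are introduced here, and these matrices are not the `attention_mask` tensor returned by the tokenizer, which marks padding.

```python
import numpy as np

seq_len = 5

# BERT-style: every token can see every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# GPT-style (left-to-right): token i sees only positions 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Visibility for the middle token (index 2)
print(bidirectional_mask[2])  # [1 1 1 1 1]
print(causal_mask[2])         # [1 1 1 0 0]
```

With the causal pattern, the token at index 2 has no access to the two tokens on its right, whereas the bidirectional pattern exposes the whole sentence.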