| | |
|:---:|:---|
| <img src="https://i.ytimg.com/vi/1W-sWmFQPZY/sddefault.jpg" width="200"/> |  <strong><font size=5>Future x Summer School 2025 </font></strong><br><br><strong><font color="#1A54A6" size=5>LLMs<br>Lab 2 Part A: Embeddings in Context (DEMO)</font></strong>|

---



**Instructor:**  
Pavlos Protopapas  

**Teaching Team:**  
Nawang Thinley Bhutia




In this notebook, we will explore how context can impact embeddings. We first use word2vec and observe **static embeddings**, and then assess how a model like BERT (Bidirectional Encoder Representations from Transformers) can give us richer **contextualised embeddings**.

**📝 Make a Copy to Edit**

This notebook is **view-only**. To edit it, follow these steps:

1. Click **File** > **Save a copy in Drive**.
2. Your own editable copy will open in a new tab.

Now you can modify and run the code freely!



## Table of Contents
 **Part A**


- <font color ='#CE6DFF'>**Using Word2Vec**</font>
    -   Extracting relevant embeddings
    -   Applying dimensionality reduction
    -   Exploring word embeddings
- <font color ='#34B086'>**Using BERT**</font>
    -   Extracting relevant embeddings
    -   Applying dimensionality reduction
    -   Exploring word embeddings

- <font color ='#1A54A6'>**Take Home Exercise**</font>🏡


---



## **Importing the required libraries**

In [1]:
pip install -q sentence-transformers

In [2]:
!pip install gensim



In [3]:
import numpy as np

#for dimensionality reduction
from sklearn.decomposition import PCA

#for plotting
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
#from plotly.subplots import make_subplots


#for text preprocessing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## **Sample Sentences**

In [4]:
sentences = [
    "I am going to the bank to withdraw money",
    "He is sitting on the river bank",
    "She went to the bank to deposit a check",
    "They had a picnic by the river bank",
    "I need to withdraw cash from the bank",
    "The kids are playing on the river bank"
]

In [5]:
#remember your preprocessing steps from lab 1
stop_words = set(stopwords.words('english'))  # typical English stop words such as "a", "an", "the"

# let's see which words remain once we remove the stop words
words_of_interest = [word for sentence in sentences for word in sentence.lower().split() if word not in stop_words]
print(words_of_interest)

['going', 'bank', 'withdraw', 'money', 'sitting', 'river', 'bank', 'went', 'bank', 'deposit', 'check', 'picnic', 'river', 'bank', 'need', 'withdraw', 'cash', 'bank', 'kids', 'playing', 'river', 'bank']
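
A quick tally makes the ambiguity concrete. The sketch below uses only the standard library, with the word list copied from the output above, and counts how often each word survives stopword removal.

```python
from collections import Counter

# Words left after stopword removal (copied from the output above)
words_of_interest = [
    'going', 'bank', 'withdraw', 'money', 'sitting', 'river', 'bank',
    'went', 'bank', 'deposit', 'check', 'picnic', 'river', 'bank',
    'need', 'withdraw', 'cash', 'bank', 'kids', 'playing', 'river', 'bank',
]

word_counts = Counter(words_of_interest)
unique_words = sorted(word_counts)

print(word_counts.most_common(3))
print(f"{len(words_of_interest)} tokens, {len(unique_words)} unique words")
```

`bank` appears six times across two different senses, which is exactly the case a single vector per word cannot separate.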



# <font color ='#CE6DFF'>**Using Word2Vec**</font>

## **Loading pretrained word2vec model**

Just like we did for Lab 1 (PartB)!


In [None]:
# Load pre-trained Word2Vec model (Google News vectors) using gensim downloader
# Note: This cell can take 10+ minutes to load the entire model
import gensim.downloader as api
if 'model_google_w2v' in locals():
  print("Model already exists, using existing model")
else:
  print("Model does not exist, loading model")
  model_google_w2v = api.load('word2vec-google-news-300')

Model does not exist, loading model


## **Extract Embeddings for all words**

In [None]:
## Collect only the words present in the model and their vectors
word_vector_pairs = [(word, model_google_w2v[word]) for word in words_of_interest if word in model_google_w2v]
word_vectors = np.array([pair[1] for pair in word_vector_pairs])

In [None]:
word_vectors.shape

### **Dimensionality reduction**

In [None]:
# We use PCA for dimensionality reduction
pca = PCA(n_components=2)
embeddings_2d_pca = pca.fit_transform(word_vectors)
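
For intuition about what `fit_transform` is doing, here is a minimal NumPy sketch of PCA via the singular value decomposition. The helper name `pca_2d` and the random `toy_vectors` are assumptions for illustration; they are not part of the lab code, and the result may differ from scikit-learn's by the sign of each axis.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(22, 300))  # hypothetical stand-in for word_vectors
toy_2d = pca_2d(toy_vectors)
print(toy_2d.shape)
```

Each 300-dimensional row is reduced to two coordinates, chosen so that the first axis captures the most variance and the second the next most.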

## **Plot the Embeddings**

In [None]:
## Plot the Embeddings
plt.figure(figsize=(10, 10))
for i, (word, _) in enumerate(word_vector_pairs):
    plt.scatter(embeddings_2d_pca[i, 0], embeddings_2d_pca[i, 1], label=word)  # Plot each point individually
    plt.annotate(word, (embeddings_2d_pca[i, 0], embeddings_2d_pca[i, 1]), textcoords="offset points", xytext=(0,10), ha='center')
plt.title('2D Visualization of Static Word Embeddings - PCA')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.grid(True)
plt.show()

### We can see that there is only one embedding per word.
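
The reason is that a static model is essentially a lookup table keyed by the word's spelling. The sketch below illustrates this with a hypothetical 4-dimensional `toy_model` (random vectors, not the real Google News vectors).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 4-dimensional lookup table standing in for the 300-d word2vec model
toy_model = {word: rng.normal(size=4) for word in ["bank", "river", "money"]}

money_sentence = ["withdraw", "money", "bank"]
river_sentence = ["sitting", "river", "bank"]

# A static model looks the word up by spelling alone; the neighbours play no role
bank_in_money_sentence = toy_model[money_sentence[-1]]
bank_in_river_sentence = toy_model[river_sentence[-1]]

print(np.array_equal(bank_in_money_sentence, bank_in_river_sentence))  # True
```

This is why every occurrence of `bank` lands on the same point in the plot, regardless of the sentence it came from.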

### However, overlapping points are hard to tell apart here, so let us try Plotly (an interactive alternative to matplotlib)

In [None]:
## Create Plotly graph
fig = go.Figure()

# Adding each point separately to include custom hover text
for i, (word, _) in enumerate(word_vector_pairs):
    fig.add_trace(go.Scatter(x=[embeddings_2d_pca[i, 0]], y=[embeddings_2d_pca[i, 1]],
                             mode='markers+text', text=[word], textposition='top center',
                             marker=dict(size=10, opacity=0.8),
                             name=word))  # Name sets the hover text

fig.update_layout(title='2D Visualization of Static Word Embeddings using PCA',
                  xaxis_title='Component 1',
                  yaxis_title='Component 2',
                  width=1600,
                  height=900,
                  showlegend=False)  # Set to True if you want a legend

fig.show()

# <font color ='#34B086'>**Using BERT**</font>

Now let us use BERT and compare the results.

In [None]:
#using the transformers library
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

### Generate Word Embeddings with BERT


In [None]:
# Load BERT Model and Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = TFBertModel.from_pretrained('bert-base-uncased')

##### (Note: You can safely ignore these warnings. We do not need an HF_TOKEN for our use case right now, and we do not need the extra weights because we are not training.)

### Tokenize sentences and get embeddings

In [None]:
#let's create a function to get our embeddings
def get_bert_embeddings(sentences):
    inputs = tokenizer(sentences, return_tensors='tf', padding=True, truncation=True, max_length=128)
    outputs = model_bert(inputs)
    embeddings = outputs.last_hidden_state  # these are the embeddings! shape: (batch, seq_len, 768)
    return embeddings
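
One detail to keep in mind when indexing into `last_hidden_state`: calling the tokenizer on a sentence adds a `[CLS]` token at the start and a `[SEP]` token at the end, and pads shorter sentences in the batch, whereas `tokenizer.tokenize` returns only the word pieces. The sketch below uses hand-written token lists (an assumption for illustration, not real tokenizer output) to show the resulting off-by-one alignment.

```python
# Hand-written stand-in for what the model sees for one padded sentence
# (assumption for illustration; run the real tokenizer to confirm on your data)
model_input_tokens = [
    "[CLS]", "he", "is", "sitting", "on", "the", "river", "bank",
    "[SEP]", "[PAD]", "[PAD]",
]

# What tokenizer.tokenize(sentence) would return: word pieces only
word_pieces = ["he", "is", "sitting", "on", "the", "river", "bank"]

# Word piece j lines up with position j + 1 of the model output
alignment = {j: j + 1 for j in range(len(word_pieces))}

for j, piece in enumerate(word_pieces):
    assert model_input_tokens[alignment[j]] == piece

bank_position = alignment[word_pieces.index("bank")]
print(bank_position)  # 7
```

In other words, the vector for word piece `j` of sentence `i` is `embeddings[i][j + 1]`, and position 0 always holds the `[CLS]` vector.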

In [None]:
# Get embeddings for each token in the sentences
embeddings = get_bert_embeddings(sentences)

In [None]:
#@title ###**Plotting 2D BERT embeddings using matplotlib**
# Extract embeddings for each word
word_vectors = []
labels = []
colors = []

# Define a color for each sentence
color_map = {
    0: 'r', 1: 'g', 2: 'b', 3: 'c', 4: 'm', 5: 'y'
}

for i, sentence in enumerate(sentences):
    tokens = tokenizer.tokenize(sentence)
    for j, token in enumerate(tokens):
        # Position 0 of each sequence is the [CLS] token, so word piece j sits at index j + 1
        word_vectors.append(embeddings[i][j + 1].numpy())
        labels.append(token)
        colors.append(color_map[i])

# Dimensionality Reduction
# Apply PCA
pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(word_vectors)

# Static Visualization
# Initialize Plot
plt.figure(figsize=(10, 7))
scatter_plots = []
# Iterate in sentence order so the legend reads Sentence 1, 2, 3, ...
for sentence_index, color in color_map.items():
    indices = [i for i, c in enumerate(colors) if c == color]
    scatter = plt.scatter(word_vectors_2d[indices, 0], word_vectors_2d[indices, 1], c=color, label=f"Sentence {sentence_index + 1}")
    scatter_plots.append(scatter)

# Add text labels for each word
for i, label in enumerate(labels):
    x, y = word_vectors_2d[i, :]
    plt.text(x + 0.03, y + 0.03, label, fontsize=9)

# Create a legend
plt.legend(loc="best")

plt.title("2D BERT Contextualised Embeddings Visualization")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")

plt.show()

### Let's try the Plotly version to see overlapping points better

In [None]:
#@title ###**Plotting 2D BERT embeddings using plotly**
# Extract embeddings for each word
word_vectors = []
labels = []
colors = []

# Define a color for each sentence
color_map = {
    0: 'red', 1: 'green', 2: 'blue', 3: 'cyan', 4: 'magenta', 5: 'yellow'
}

for i, sentence in enumerate(sentences):
    tokens = tokenizer.tokenize(sentence)
    for j, token in enumerate(tokens):
        # Position 0 of each sequence is the [CLS] token, so word piece j sits at index j + 1
        word_vectors.append(embeddings[i][j + 1].numpy())
        labels.append(token)
        colors.append(color_map[i])  # Use the full color name here

# Dimensionality Reduction
# Apply PCA
pca = PCA(n_components=2)
word_vectors_2d = pca.fit_transform(word_vectors)

# Interactive Visualization
# Initialize Plotly figure
fig = go.Figure()

# Add points for each token
for i, label in enumerate(labels):
    fig.add_trace(go.Scatter(
        x=[word_vectors_2d[i, 0]],
        y=[word_vectors_2d[i, 1]],
        mode='markers+text',
        name=label,
        text=[label],
        textposition='top center',
        marker=dict(color=colors[i])
    ))

# Add buttons to toggle visibility of each sentence (in sentence order)
buttons = []
for sentence_index, color in color_map.items():  # Iterate over full color names
    indices = {i for i, c in enumerate(colors) if c == color}
    buttons.append(
        dict(
            label=f"Sentence {sentence_index + 1}",
            method="update",
            args=[{"visible": [i in indices for i in range(len(labels))]}]
        )
    )

# Add a button to show all
buttons.append(
    dict(
        label="Show All",
        method="update",
        args=[{"visible": [True] * len(labels)}]
    )
)

# Add a button to hide all
buttons.append(
    dict(
        label="Hide All",
        method="update",
        args=[{"visible": [False] * len(labels)}]
    )
)

# Update layout with buttons
fig.update_layout(
    title="2D BERT Contextualised Embeddings Visualization",
    xaxis_title="PCA Component 1",
    yaxis_title="PCA Component 2",
    updatemenus=[dict(type="buttons", showactive=True, buttons=buttons)]
)

# Show plot
fig.show()

### <font color ='#34B086'>**What do you notice from the plot above? How does *context* improve the embeddings?**</font>

Pay close attention to the clusters that emerge. The word *bank* now has six distinct embeddings: three lie near words like river and picnic, and the other three near words like withdraw, check, and cash.
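
A 2D PCA plot can distort distances, so one way to check the clusters numerically is cosine similarity in the original embedding space. The sketch below is a minimal helper under stated assumptions: `cosine_similarity_matrix` is a hypothetical name, and the three toy vectors stand in for the 768-dimensional BERT vectors of *bank*.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    V = np.asarray(vectors, dtype=float)
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical 3-d vectors standing in for the 768-d BERT vectors of "bank"
toy_bank_vectors = np.array([
    [1.0, 0.1, 0.0],   # "bank" in a money sentence
    [0.9, 0.2, 0.0],   # "bank" in another money sentence
    [0.0, 0.1, 1.0],   # "bank" in a river sentence
])

similarity = cosine_similarity_matrix(toy_bank_vectors)
print(np.round(similarity, 2))
```

With the real data, one could collect the six *bank* rows from `embeddings` (before PCA) and pass them in. If the clusters in the plot are genuine, same-sense pairs should score higher than cross-sense pairs.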

### Here is a brief comparison of the two approaches we looked at:

- **Embedding Type**:
  - **word2vec**: Static embeddings (single vector per word).
  - **BERT**: Contextualized embeddings (different vectors for the same word based on context).

- **Model Architecture**:
  - **word2vec**: Shallow neural network.
  - **BERT**: Deep transformer architecture.

- **Contextual Understanding**:
  - **word2vec**: Ignores context; words with multiple meanings have a single representation.
  - **BERT**: Incorporates context; words with multiple meanings have different representations based on their context.

- **Training Objective**:
  - **word2vec**: Trained using Skip-gram (predict nearby words from the target word) or CBOW (predict the target word from nearby words).
  - **BERT**: Trained using masked language modeling (MLM) and next sentence prediction (NSP).

- **Directionality**:
  - **word2vec**: Uses a small window of nearby words on both sides during training, but ignores word order and yields one fixed vector per word.
  - **BERT**: Bidirectional (considers the full sentence context, both before and after the target word).

- **Performance on NLP Tasks**:
  - **word2vec**: Good for basic semantic similarity tasks.
  - **BERT**: Superior performance on a wide range of NLP tasks, including question answering and text classification.
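
To make the word2vec training objective concrete, here is a minimal sketch of how Skip-gram builds (target, context) training pairs from a window on both sides of the target. The function name `skipgram_pairs` and the window size are illustrative choices, not gensim's API.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs using a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "he is sitting on the river bank".split()
bank_pairs = [pair for pair in skipgram_pairs(tokens) if pair[0] == "bank"]
print(bank_pairs)  # [('bank', 'the'), ('bank', 'river')]
```

Pairs like these from every sentence containing *bank* all update the same single vector, which is why the river and money senses end up blended in a static embedding.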


## <font color ='#1A54A6'>**Take Home Exercise**</font>🏡


Test this notebook with your own sentences where context impacts word meanings directly.

### **BONUS**:
Refer to [this](https://github.com/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb) official word2vec notebook from TensorFlow or the BERT information page on Hugging Face [here](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM.forward) for even more details (not needed for this course).