<a href="https://colab.research.google.com/github/RDGopal/IB9LQ0-GenAI/blob/main/Embeddings_Words_to_Sentences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Embeddings
Embeddings are dense vector representations of text that capture semantic meaning. They transform words, sentences, or even documents into numerical forms that machines can process, preserving relationships like similarity or context. Embeddings are key to modern Natural Language Processing (NLP) and Generative AI because they:

* Enable models to understand linguistic patterns and semantics.
* Power tasks like text classification, sentiment analysis, machine translation, and text generation.
* Allow efficient computation of relationships (e.g., similarity) between pieces of text.

In this exercise, we'll explore three types of embeddings using a sample text:

* Word Embeddings: Static vectors for individual words (e.g., GloVe).
* Sentence Embeddings: Vectors for entire sentences (e.g., via sentence-transformers).
* Contextual Embeddings: Dynamic vectors that vary by context (e.g., BERT).

In [None]:
# Install required libraries
!pip install gensim matplotlib scikit-learn sentence-transformers transformers

**After installing the packages above, select Runtime > Restart session before continuing.**

In [None]:
from transformers import BertTokenizer, BertModel

# Sample Text
We will start with this text for word and sentence embeddings:

In [None]:
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

In [None]:
sentences = [
    "Natural language processing is a subfield of linguistics.",
    "It involves computer science and artificial intelligence.",
    "The goal is to enable computers to understand human language.",
    "Machine learning is a key component of NLP.",
    "Deep learning has revolutionized the field.",
    "Transformers are a type of deep learning model.",
    "BERT is a popular transformer model for NLP tasks."
]

In [None]:
sentences_context = [
    "I went to the bank to deposit money.",
    "The river bank was flooded."
]

# Word Embeddings
Word embeddings map individual words to fixed-size vectors based on their meanings, learned from large corpora. Popular models include Word2Vec and GloVe. Here, we'll use GloVe (Global Vectors), which captures global statistical information about word co-occurrences.

In [None]:
# Import libraries
import gensim.downloader as api
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load pre-trained GloVe model (50-dimensional vectors)
word_vectors = api.load("glove-wiki-gigaword-50")

# Select words from our text to analyze
words = ["language", "computer", "human", "machine", "intelligence", "artificial", "data", "process", "analyze"]

# Get their vectors
vectors = [word_vectors[word] for word in words]

# Compute cosine similarity between two words
similarity = word_vectors.similarity("language", "computer")
print(f"Similarity between 'language' and 'computer': {similarity:.4f}")

# Visualize

The similarity score (e.g., between "language" and "computer") will be a value between -1 and 1, with higher values indicating greater semantic similarity.
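
For reference, the score above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy. The helper name `cosine_sim` and the 3-dimensional vectors are made up for illustration and are not real GloVe values.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only (not real GloVe values)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a
c = [-1.0, -2.0, -3.0]  # opposite direction to a
d = [3.0, 0.0, -1.0]    # orthogonal to a (dot product is 0)

print(f"same direction:     {cosine_sim(a, b):.4f}")
print(f"opposite direction: {cosine_sim(a, c):.4f}")
print(f"orthogonal:         {cosine_sim(a, d):.4f}")
```

Because the vectors are divided by their lengths, only direction matters: scaling a vector does not change its similarity to any other vector.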

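The t-SNE plot projects the 50-dimensional vectors into 2D space. Words with similar meanings (e.g., "computer" and "machine") tend to appear closer together, while unrelated words (e.g., "human" and "data") may be farther apart. With only nine points, the layout is sensitive to the `perplexity` and `random_state` settings, so treat distances in the plot as a rough guide rather than an exact measure.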

In [None]:
# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42,perplexity=5)
vectors_2d = tsne.fit_transform(np.array(vectors))

# Plot the embeddings
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    plt.scatter(vectors_2d[i, 0], vectors_2d[i, 1])
    plt.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=12)
plt.title("Word Embeddings Visualization (GloVe)", fontsize=14)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()

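# Missing Context

Consider the following sentences:

    "I went to the bank to deposit money."
    "The river bank was flooded."

Let's create a vocabulary from the two sentences, get their embeddings, and plot them. Because GloVe is a static embedding, both occurrences of "bank" map to the same vector, regardless of the sentence they come from.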


In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

In [None]:
sentences = [
    "I went to the bank to deposit money.",
    "The river bank was flooded."
]
all_text = " ".join(sentences)
all_text = all_text.lower()
words = word_tokenize(all_text)
words

Get embeddings and plot them

In [None]:
# Get their vectors
vectors = [word_vectors[word] for word in words]
# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42,perplexity=5)
vectors_2d = tsne.fit_transform(np.array(vectors))

# Plot the embeddings
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    plt.scatter(vectors_2d[i, 0], vectors_2d[i, 1])
    plt.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]), fontsize=12)
plt.title("Word Embeddings Visualization (GloVe)", fontsize=14)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()

# Contextual Embeddings
Contextual embeddings, like those from BERT, provide vectors that change based on a word’s context. Unlike static word embeddings, BERT considers the surrounding words, making it ideal for disambiguating meanings (e.g., "bank" as a financial institution vs. a riverbank).

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
# Import libraries
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
# --- Contextual Embeddings with BERT ---
sentences_context = [
    "I went to the bank to deposit money.",
    "The river bank was flooded."
]
bank_embeddings = []
for sentence in sentences_context:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    # Recover the tokens from input_ids so that positions include [CLS] and [SEP]
    # and line up with the rows of last_hidden_state
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_index = tokens.index('bank')
    bank_embedding = embeddings[0, bank_index, :].numpy()
    bank_embeddings.append(bank_embedding)
    print(f"Embedding for 'bank' in '{sentence}': {bank_embedding[:5]}...")

In [None]:
similarity = cosine_similarity([bank_embeddings[0]], [bank_embeddings[1]])[0][0]
print(f"Similarity between 'bank' in different contexts: {similarity:.4f}")

In [None]:
# Tokenize the contextual sentences
inputs = [tokenizer(sentence, return_tensors='pt') for sentence in sentences_context]

# Get BERT outputs (embeddings)
with torch.no_grad():
    outputs = [model(**enc) for enc in inputs]

# Position 0 is the [CLS] token, so look up each word's index instead of hard-coding it
tokens1 = tokenizer.convert_ids_to_tokens(inputs[0]["input_ids"][0])
tokens2 = tokenizer.convert_ids_to_tokens(inputs[1]["input_ids"][0])

# Extract embeddings for specific words
bank1 = outputs[0].last_hidden_state[0, tokens1.index("bank"), :].numpy()   # "bank" in "I went to the bank to deposit money"
money = outputs[0].last_hidden_state[0, tokens1.index("money"), :].numpy()  # "money" in the same sentence
bank2 = outputs[1].last_hidden_state[0, tokens2.index("bank"), :].numpy()   # "bank" in "The river bank was flooded"
river = outputs[1].last_hidden_state[0, tokens2.index("river"), :].numpy()  # "river" in the same sentence

# Compute similarities
similarity_bank1_money = cosine_similarity([bank1], [money])[0][0]
similarity_bank2_river = cosine_similarity([bank2], [river])[0][0]
similarity_bank1_bank2 = cosine_similarity([bank1], [bank2])[0][0]

print(f"Similarity between 'bank' (finance) and 'money': {similarity_bank1_money:.4f}")
print(f"Similarity between 'bank' (river) and 'river': {similarity_bank2_river:.4f}")
print(f"Similarity between 'bank' (finance) and 'bank' (river): {similarity_bank1_bank2:.4f}")

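If BERT separates the two senses, each "bank" vector should be fairly similar to the context word from its own sentence ("money" or "river"), and the similarity between the two "bank" vectors should be clearly below 1. With a static embedding such as GloVe, the two "bank" vectors would be identical. Exact values depend on the model and the layer used.

The plot may position "bank (finance)" near "money" and "bank (river)" near "river," illustrating how BERT adapts embeddings to context. With only four points, t-SNE layouts are unstable, so rely on the cosine similarities above for the actual comparison.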

Contextual embeddings are the backbone of state-of-the-art NLP models (e.g., chatbots, question answering), as they handle polysemy and nuanced meanings effectively.

In [None]:
# Prepare embeddings and labels for visualization
embeddings_to_plot = np.array([bank1, money, bank2, river])
labels = ["bank (finance)", "money", "bank (river)", "river"]

# Reduce dimensions to 2D using t-SNE
tsne_context = TSNE(n_components=2, random_state=42,perplexity=3)
embeddings_2d = tsne_context.fit_transform(embeddings_to_plot)

# Plot the embeddings
plt.figure(figsize=(10, 8))
for i, label in enumerate(labels):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
    plt.annotate(label, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=12)
plt.title("Contextual Embeddings Visualization (BERT)", fontsize=14)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()

# Sentence Embeddings
Sentence embeddings represent entire sentences as single vectors, capturing their overall meaning. While averaging word embeddings is a simple approach, advanced models like those from sentence-transformers (built on transformer architectures) perform better by considering word interactions.
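
To make the averaging baseline concrete, here is a minimal sketch. The 3-dimensional word vectors and the helper `average_embedding` are hypothetical, chosen only to show one weakness of averaging: it ignores word order.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

s1 = average_embedding(["dog", "bites", "man"], toy_vectors)
s2 = average_embedding(["man", "bites", "dog"], toy_vectors)

print(s1)
print(np.allclose(s1, s2))  # True: both word orders give the same average
```

The two sentences mean different things but receive the same vector, which is the kind of information a transformer-based sentence encoder can retain.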



In [None]:
# Import libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence-transformer model
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
sentences = [
    "Natural language processing is a subfield of linguistics.",
    "It involves computer science and artificial intelligence.",
    "The goal is to enable computers to understand human language.",
    "Machine learning is a key component of NLP.",
    "Deep learning has revolutionized the field.",
    "Transformers are a type of deep learning model.",
    "BERT is a popular transformer model for NLP tasks."
]

In [None]:
# Generate embeddings for the sentences
sentence_embeddings = sentence_model.encode(sentences)

# Compute similarity between the first two sentences
similarity = cosine_similarity([sentence_embeddings[0]], [sentence_embeddings[1]])[0][0]
print(f"Similarity between sentence 1 and sentence 2: {similarity:.4f}")

In [None]:
# Reduce dimensions to 2D using t-SNE
tsne_sentences = TSNE(n_components=2, random_state=42,perplexity=3)
sentence_vectors_2d = tsne_sentences.fit_transform(sentence_embeddings)

# Plot the embeddings
plt.figure(figsize=(12, 8))
for i, sentence in enumerate(sentences):
    plt.scatter(sentence_vectors_2d[i, 0], sentence_vectors_2d[i, 1])
    plt.annotate(f"S{i+1}", (sentence_vectors_2d[i, 0], sentence_vectors_2d[i, 1]), fontsize=10)
plt.title("Sentence Embeddings Visualization (Sentence-Transformers)", fontsize=14)
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")
plt.show()

# Optional: Print sentences with labels for reference
for i, sentence in enumerate(sentences):
    print(f"S{i+1}: {sentence}")

# Sentence Transformers
Sentence Transformers are models that produce embeddings (numerical representations) for entire sentences or paragraphs, rather than just individual words. They’re built on transformer models, a type of neural network architecture widely used in natural language processing (NLP). Unlike word embeddings, which give each word a static vector regardless of its context, Sentence Transformers create dynamic, context-aware representations by considering all the words in a sentence together.

## How Sentence Transformers Work

### Input Encoding: Turning Words into Numbers
Tokenization: The sentence is first split into smaller pieces called tokens. These could be whole words (e.g., "cat") or subwords (e.g., "play" and "##ing" from "playing"), depending on the model. For example, "I love NLP" might become ["I", "love", "NLP"].
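
As a rough illustration of subword splitting, the sketch below implements a greedy longest-match-first tokenizer in the style of WordPiece. The function name `wordpiece_tokenize` and the tiny vocabulary are assumptions for this example. Real tokenizers use vocabularies with tens of thousands of entries and extra normalization rules.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"i", "love", "nlp", "play", "##ing", "cat"}

sentence = "I love playing"
tokens = [p for w in sentence.lower().split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)  # ['i', 'love', 'play', '##ing']
```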

Embedding Layer: Each token is mapped to an initial numerical vector using an embedding layer. This is similar to word embeddings, but these vectors are just a starting point—they’ll be refined later.
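
An embedding layer is essentially a lookup table with one row per vocabulary entry. The sketch below uses a small random table in NumPy. The vocabulary, the dimension of 8, and the random values are placeholders, whereas a trained model learns these rows. Models such as BERT also add positional information at this stage, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary mapping tokens to row indices
vocab = {"[CLS]": 0, "i": 1, "love": 2, "nlp": 3, "[SEP]": 4}
embedding_dim = 8

# One row per token; random here, learned during training in a real model
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

tokens = ["[CLS]", "i", "love", "nlp", "[SEP]"]
token_ids = [vocab[t] for t in tokens]

# Indexing with a list of ids selects one row per token
token_embeddings = embedding_table[token_ids]
print(token_embeddings.shape)  # (5, 8)
```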

### Transformer Layers: Adding Context

The magic of Sentence Transformers lies in the transformer architecture, which processes the entire sentence at once. Here’s how it works:

* Self-Attention Mechanism: Each token’s representation is updated based on its relationship with every other token in the sentence. For instance, in "The bank by the river," the word "bank" gets influenced by "river," helping the model understand it’s a riverbank, not a financial institution.
* Multiple Layers: The tokens pass through several transformer layers (e.g., 6 or 12, depending on the model). Each layer refines the embeddings, capturing deeper patterns—early layers might focus on syntax (word order), while later layers grasp semantics (meaning).
* Result: After this step, you get a set of contextualized token embeddings, one for each token, where each embedding reflects the token’s meaning in the context of the full sentence.
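
The self-attention step can be sketched in a few lines of NumPy. This is a single attention head with random, untrained weights. The names `self_attention`, `Wq`, `Wk`, and `Wv` are chosen for this example, and real transformer layers add multiple heads, residual connections, layer normalization, and a feed-forward block.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token rows of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-to-token relevance
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # each output mixes all value rows

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                     # e.g., 5 tokens, 8-dim embeddings
X = rng.normal(size=(n_tokens, d_model))     # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

contextual, weights = self_attention(X, Wq, Wk, Wv)
print(contextual.shape)     # one contextualized vector per token
print(weights.sum(axis=1))  # attention weights per token sum to 1
```

Row `i` of `weights` shows how much token `i` draws on every other token, which is how a word like "bank" can pick up information from "river."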

### Pooling: Creating a Single Sentence Embedding
The transformer layers give us embeddings for each token, but we want one vector for the whole sentence. This is done through pooling:

* Mean Pooling: Take the average of all token embeddings. This is common because it’s simple and captures the overall meaning well.
* Max Pooling: Use the maximum value across each dimension of the token embeddings to highlight key features.
* CLS Token: Some models (like BERT) add a special [CLS] token at the start of the sentence, and its final embedding represents the sentence. Sentence Transformers often tweak this approach, but mean pooling is a popular default.

For example, if "I love NLP" produces three token embeddings, mean pooling averages them into one vector that represents the entire sentence.
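
The three pooling strategies above can be compared on a small hand-made matrix. The numbers are invented for illustration. The attention mask marks the padding row so that it is excluded. Whether special tokens such as [CLS] are included in the mean varies by implementation, and here every unmasked row is used.

```python
import numpy as np

# Made-up token embeddings (rows) for one padded sentence
token_embeddings = np.array([
    [0.0, 1.0, 2.0],  # [CLS]
    [1.0, 3.0, 0.0],  # i
    [2.0, 2.0, 4.0],  # love
    [3.0, 0.0, 2.0],  # nlp
    [9.0, 9.0, 9.0],  # [PAD], must be ignored
])
attention_mask = np.array([1, 1, 1, 1, 0])  # 1 = real token, 0 = padding

real = token_embeddings[attention_mask == 1]

mean_pooled = real.mean(axis=0)     # average over real tokens
max_pooled = real.max(axis=0)       # per-dimension maximum over real tokens
cls_pooled = token_embeddings[0]    # embedding of the first ([CLS]) token

print("mean:", mean_pooled)  # [1.5 1.5 2. ]
print("max: ", max_pooled)   # [3. 3. 4.]
print("cls: ", cls_pooled)   # [0. 1. 2.]
```

Each strategy returns a vector with the same dimension as a single token embedding, so the sentence vector has a fixed size regardless of sentence length.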

### Fine-Tuning: Sharpening Semantic Understanding
To make the embeddings even better, Sentence Transformers are often fine-tuned on specific tasks:

* Natural Language Inference (NLI): The model learns to predict if one sentence entails, contradicts, or is neutral relative to another (e.g., "I love NLP" vs. "NLP is great"). This teaches it to align similar meanings in vector space.
* Semantic Textual Similarity (STS): The model is trained to score how similar two sentences are, refining its ability to capture nuanced meanings.

This fine-tuning ensures the embeddings are optimized for practical tasks like finding similar sentences or classifying text.
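
One common way to express an STS-style objective is to penalize the gap between the cosine similarity of two sentence embeddings and a human-assigned similarity score. The sketch below only evaluates such a loss on fixed toy vectors. It does not train anything, and the names `sts_loss`, `pairs`, and the labels are assumptions made for illustration.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarities and gold scores."""
    preds = np.array([cosine_sim(u, v) for u, v in pairs])
    return float(np.mean((preds - np.asarray(gold_scores, dtype=float)) ** 2))

# Toy sentence-embedding pairs (not produced by a real model)
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # same direction, cosine 1
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # orthogonal, cosine 0
]

matching_labels = [1.0, 0.0]     # gold scores that agree with the embeddings
mismatched_labels = [0.0, 1.0]   # gold scores that disagree

print(sts_loss(pairs, matching_labels))    # 0.0
print(sts_loss(pairs, mismatched_labels))  # 1.0
```

During fine-tuning, a loss of this kind would be minimized by adjusting the model weights so that embeddings of similar sentences point in similar directions.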

### Output: A Fixed-Size Sentence Embedding
The end result is a single vector (e.g., 384 or 768 dimensions) that encapsulates the sentence’s meaning. This vector can be used for:

* Semantic Search: Finding sentences with similar meanings.
* Clustering: Grouping related sentences.
* Classification: Feeding the vector into a classifier (e.g., for sentiment analysis).
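
As a sketch of the semantic search use case, the snippet below ranks a small corpus by cosine similarity to a query vector. The helper `top_k` and the 3-dimensional vectors are hypothetical. In this notebook, the vectors would instead come from `sentence_model.encode(...)`.

```python
import numpy as np

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, cosine score) for the k corpus rows most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per corpus row
    best = np.argsort(-scores)[:k]      # indices of the highest scores
    return [(int(i), float(scores[i])) for i in best]

# Placeholder corpus and toy embeddings, for illustration only
corpus = ["sentence A", "sentence B", "sentence C"]
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([1.0, 0.0, 0.0])

results = top_k(query_vec, corpus_vecs, k=2)
for idx, score in results:
    print(f"{corpus[idx]}: {score:.2f}")
```

Normalizing once and using a matrix product keeps the search to a single vectorized operation, which is adequate for small corpora. Larger collections typically rely on an approximate nearest-neighbor index.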