# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
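
Step 3 comes down to a single formula: the dot product of the two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula. It uses short made-up vectors in place of real sentence embeddings, and `cosine` is an illustrative helper, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors standing in for real embeddings.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # no shared direction
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # opposite direction

print(same_direction, orthogonal, opposite)
```

The score depends only on the direction of the vectors, not on their length, so it always falls between -1 and 1.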

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [14]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute and print the cosine similarity of each sentence pair
def print_similarities(pairs):
    for sent1, sent2 in pairs:
        # encode() returns a NumPy array by default
        emb1 = model.encode(sent1)
        emb2 = model.encode(sent2)
        # cosine_similarity expects 2-D inputs, hence the reshape to (1, dim)
        sim = cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]
        print(f"Similarity between:\n  '{sent1}'\n  and\n  '{sent2}'\n  => {sim:.4f}\n")

print_similarities(sentence_pairs)

Similarity between:
  'A dog is playing in the park.'
  and
  'A dog is running in a field.'
  => 0.5220

Similarity between:
  'I love pizza.'
  and
  'I enjoy ice cream.'
  => 0.5281

Similarity between:
  'What is AI?'
  and
  'How does a computer learn?'
  => 0.3194
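
The loop above encodes one sentence at a time. With many pairs, a common pattern is to encode all left-hand and all right-hand sentences as two batches and compute the similarity row by row. Here is a sketch under that assumption. Small hand-written arrays stand in for the two `model.encode(...)` results, and `rowwise_cosine` is a hypothetical helper name.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length, then take the row-wise dot product.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)

# Stand-ins for model.encode(left_sentences) and model.encode(right_sentences).
left = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
right = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

sims = rowwise_cosine(left, right)
print(np.round(sims, 4))
```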



### Questions:
- Which sentence pairs are the most semantically similar? Why?
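    - The highest-scoring pair is "I love pizza." + "I enjoy ice cream." (0.5281), only marginally ahead of the two dog sentences (0.5220).
    - Both sentences express liking a food and share the same structure, so their embeddings point in similar directions. A score near 0.5 still indicates moderate rather than near-identical similarity.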

- Can you think of cases where cosine similarity might fail to capture true semantic meaning?

    - Sentences that seem alike but mean the opposite
    - Sentences where the meaning depends on the situation and context


In [18]:

sentence_pairs_2 = [
    ("I like cats.", "I don’t like cats."), # Sentences that seem alike but mean the opposite
    ("After a good presentation, she said, 'Great job!'", "After a terrible presentation, she just said with sarcasm, 'Great job!'") # sentences where the meaning depends on the situation and context
]


print_similarities(sentence_pairs_2)

Similarity between:
  'I like cats.'
  and
  'I don’t like cats.'
  => 0.8046

Similarity between:
  'After a good presentation, she said, 'Great job!''
  and
  'After a terrible presentation, she just said with sarcasm, 'Great job!''
  => 0.8206



## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [19]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
embeddings = model.encode(documents)



In [25]:
# Perform KMeans clustering
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, random_state=0)
kmeans.fit(embeddings)
clusters = kmeans.labels_

In [26]:
# Print cluster assignments
for i, doc in enumerate(documents):
    print(f"Cluster {clusters[i]}: {doc}")

Cluster 1: What is the capital of France?
Cluster 3: How do I bake a chocolate cake?
Cluster 1: What is the distance between Earth and Mars?
Cluster 0: How do I change a flat tire on a car?
Cluster 2: What is the best way to learn Python?
Cluster 0: How do I fix a leaky faucet?
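
A flat listing like the one above gets hard to scan as the dataset grows. One option is to group the documents by label before printing. The sketch below hard-codes the labels from the output above so that it runs on its own. In the notebook they would come from `kmeans.labels_`.

```python
from collections import defaultdict

documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
]
# Labels copied from the printed output; normally `kmeans.labels_`.
labels = [1, 3, 1, 0, 2, 0]

# Collect documents under their cluster label.
groups = defaultdict(list)
for label, doc in zip(labels, documents):
    groups[label].append(doc)

for label in sorted(groups):
    print(f"Cluster {label}:")
    for doc in groups[label]:
        print(f"  - {doc}")
```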


### Questions:
- How many clusters make the most sense? Why?
    - Four clusters work well here: the six documents fall into four recognizable topics, with two pairs and two singletons.

- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
    - Yes, they seem to be semantically meaningful: DIY topics (cluster 0), Geography (cluster 1), Coding (cluster 2), and Cooking (cluster 3).

- Try this exercise with a larger dataset of your choice
    - Let's go!

In [35]:
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the health benefits of meditation?",
    "How to train a dog to sit?",
    "Explain the theory of relativity",
    "How can I improve my credit score?",
    "What are symptoms of COVID-19?",
    "How do airplanes fly?",
    "Best exercises to lose belly fat?",
    "What is the stock market?",
    "How to make a website with HTML?",
    "What causes climate change?",
    "How to prepare for a job interview?",
    "How does photosynthesis work?",
    "What is quantum computing?",
    "How to start investing in stocks?",
    "How do I make homemade pasta?",
    "What's a good recipe for apple pie?",
    "How do I use a for loop in Python?",
    "What is a closure in JavaScript?",
    "What is the Higgs boson?"
]

# Encode documents
embeddings = model.encode(documents)



In [36]:
#
# 4 clusters
#

# Perform KMeans clustering
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, random_state=0)
kmeans.fit(embeddings)
clusters = kmeans.labels_

# Print cluster assignments
for i, doc in enumerate(documents):
    print(f"Cluster {clusters[i]}: {doc}")

Cluster 0: What is the capital of France?
Cluster 1: How do I bake a chocolate cake?
Cluster 0: What is the distance between Earth and Mars?
Cluster 1: How do I change a flat tire?
Cluster 3: What is the best way to learn Python?
Cluster 2: How do I fix a leaky faucet?
Cluster 1: What are the health benefits of meditation?
Cluster 2: How to train a dog to sit?
Cluster 0: Explain the theory of relativity
Cluster 2: How can I improve my credit score?
Cluster 2: What are symptoms of COVID-19?
Cluster 0: How do airplanes fly?
Cluster 1: Best exercises to lose belly fat?
Cluster 0: What is the stock market?
Cluster 2: How to make a website with HTML?
Cluster 0: What causes climate change?
Cluster 2: How to prepare for a job interview?
Cluster 2: How does photosynthesis work?
Cluster 0: What is quantum computing?
Cluster 0: How to start investing in stocks?
Cluster 2: How do I make homemade pasta?
Cluster 1: What's a good recipe for apple pie?
Cluster 3: How do I use a for loop in Python?
Cl


- Cluster 0: Science & General Knowledge
- Cluster 1: Health, Fitness, and Food
- Cluster 2: Self-Improvement & Practical How-Tos
- Cluster 3: Python


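Some sentences are not clustered well (e.g. "How do I change a flat tire?" lands in cluster 1 with food and health, and the coding sentences are split, with the HTML question in cluster 2 instead of with the Python questions in cluster 3). It is worth exploring a higher number of clusters.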
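
One way to compare cluster counts without reading every assignment is the silhouette score. It measures how much closer each point is to its own cluster than to the nearest other cluster: values near 1 are good, and values below 0 suggest misassignment. The notebook does not use this metric. The sketch below is only an illustration on four hand-picked 2-d points rather than real embeddings, and `mean_silhouette` is a hypothetical helper.

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette score; assumes at least two clusters and distinct points."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # leave the point itself out of its own cluster
        if not same.any():
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        a = dists[i, same].mean()  # mean distance to own cluster
        b = min(  # mean distance to the nearest other cluster
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight groups far apart on the x-axis.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups

print(f"good split: {good:.2f}, bad split: {bad:.2f}")
```

In the notebook, the same idea would mean fitting `KMeans` for several values of `num_clusters` and comparing the scores. scikit-learn provides `sklearn.metrics.silhouette_score` for this, so the hand-written helper is only there to show what the number means.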

In [37]:
#
# 7 clusters
#

# Perform KMeans clustering
num_clusters = 7
kmeans = KMeans(n_clusters=num_clusters, random_state=0)
kmeans.fit(embeddings)
clusters = kmeans.labels_

# Print cluster assignments
for i, doc in enumerate(documents):
    print(f"Cluster {clusters[i]}: {doc}")


In general, the clustering performs well with a higher number of clusters, but it still fails to group the coding-related sentences together.

For those cases, a domain-specific model can help (e.g. `microsoft/codebert-base`, or a model trained on the CodeSearchNet dataset). Another option is to try a different clustering algorithm (e.g. HDBSCAN).

<br>


## Task 3: Semantic Search System
### Objective:
Create a semantic search engine:
A user provides a query and you search the dataset for semantically relevant documents to return. Return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [None]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
doc_embeddings = model.encode(documents)


In [None]:
# Create the search function
# It encodes the user query and prints the top N documents that most resemble it
def semantic_search(query, documents, doc_embeddings, top_n=5):
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_n]

    print("\nHere are the top documents matching your query:")
    for rank, i in enumerate(top_indices, start=1):
        print(f"{rank}. {documents[i]} ({similarities[i] * 100:.0f}% similarity)")
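
`semantic_search` prints its results, which makes them hard to reuse or test. A possible variant returns `(index, score)` pairs and leaves the formatting to the caller. The sketch below uses hand-written 2-d vectors in place of `model.encode(...)` output, and `top_k_cosine` is a hypothetical name.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=5):
    """Return [(doc_index, cosine_score), ...] for the k best matches, best first."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity of the query against every document row.
    scores = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for document and query embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.2]

results = top_k_cosine(query_vec, doc_vecs, k=2)
for rank, (idx, score) in enumerate(results, start=1):
    print(f"{rank}. doc {idx} ({score:.2f})")
```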


In [60]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)


Here are the top documents matching your query:
1. What are the best programming languages for web development? (56% similarity)
2. What is quantum computing? (44% similarity)
3. What is artificial intelligence? (42% similarity)
4. What is machine learning? (35% similarity)
5. What is the best way to learn Python? (32% similarity)


### Questions:
- What are the top-ranked results for the given queries?
    - See the output above.
- How can you improve the ranking explanation for users?
    - By displaying the similarity score as a percentage next to each result.
- Try this approach with a larger dataset
    - Let's go for it!

In [63]:
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?",
    "What are the symptoms of the flu?",
    "How can I improve my public speaking skills?",
    "What causes climate change?",
    "How do I train for a marathon?",
    "What is the fastest animal on Earth?",
    "How do I start a vegetable garden?",
    "What are the benefits of meditation?",
    "How do I create a budget plan?",
    "What is machine learning?",
    "How do I learn to play the guitar?",
    "What are the best programming languages for web development?",
    "What are the benefits of learning a second language?",
    "What is the history of the Internet?",
    "How do I prepare for a job interview?",
    "What are some healthy dinner recipes?",
    "How do I install Python on Windows?",
    "What is artificial intelligence?",
    "How do I improve my photography skills?",
    "What is the process of photosynthesis?",
    "How do I set up a Wi-Fi network at home?"
]

# Compute document embeddings
doc_embeddings = model.encode(documents)



queries = [
    "How can I start coding?",
    "Tips for healthy eating",
    "Explain how plants make food",
    "Ways to relax and reduce stress",
    "What is the future of artificial intelligence?"
]

for q in queries:
    print(f"\nQuery: {q}")
    semantic_search(q, documents, doc_embeddings)
    print("\n\n")


Query: How can I start coding?

Here are the top documents matching your query:
1. What are the best programming languages for web development? (39% similarity)
2. How do I start a vegetable garden? (38% similarity)
3. What is the best way to learn Python? (34% similarity)
4. How do I build a mobile app? (33% similarity)
5. How do I learn to play the guitar? (29% similarity)




Query: Tips for healthy eating

Here are the top documents matching your query:
1. What are some healthy dinner recipes? (62% similarity)
2. What are the benefits of meditation? (28% similarity)
3. How do I improve my photography skills? (23% similarity)
4. How do I prepare for a job interview? (22% similarity)
5. How do I train for a marathon? (21% similarity)




Query: Explain how plants make food

Here are the top documents matching your query:
1. What is the process of photosynthesis? (49% similarity)
2. How do I start a vegetable garden? (48% similarity)
3. What are some healthy dinner recipes? (37% simi

Analysis:

- We get decent results, especially for the top-1 result; however, in some cases we also get off-topic results (e.g. "How do I start a vegetable garden?" ranks second for "How can I start coding?").

Some ways to improve that:
- Use a larger and more specific dataset (i.e. more documents and/or documents that are closely related to the expected queries)
- Use larger models (e.g. `all-mpnet-base-v2`, `text-embedding-3-small`, `text-embedding-3-large`)
- Filter by a minimum threshold (e.g. discard results under, say, 30% similarity to reduce noise)
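
The threshold idea from the last bullet can be sketched in a few lines. The scores below are the rounded percentages printed above for the query "How can I start coding?", and the 0.30 cutoff is the illustrative value from the bullet, not a tuned one.

```python
MIN_SCORE = 0.30  # illustrative cutoff, not a tuned value

# (document, similarity) pairs copied from the printed results, rounded to 2 decimals.
results = [
    ("What are the best programming languages for web development?", 0.39),
    ("How do I start a vegetable garden?", 0.38),
    ("What is the best way to learn Python?", 0.34),
    ("How do I build a mobile app?", 0.33),
    ("How do I learn to play the guitar?", 0.29),
]

# Keep only results at or above the cutoff.
kept = [(doc, score) for doc, score in results if score >= MIN_SCORE]
for doc, score in kept:
    print(f"{score:.0%}  {doc}")
```

With this cutoff the guitar result is dropped, but the vegetable garden result (38%) still passes, so a threshold alone does not remove every off-topic match.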

 