# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
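
Step 3 relies on cosine similarity, which is the dot product of two vectors divided by the product of their lengths. As a minimal sketch of what that computes for a single pair, here it is written out with NumPy on made-up three-dimensional vectors (the helper name `cosine` and the vectors are illustrative, not part of the lab):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (hypothetical values).
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
```

The score depends only on the angle between the vectors, so scaling a vector does not change it. Values near 1 indicate similar direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.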

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute similarities
for sent1, sent2 in sentence_pairs:
    # Encode sentences
    embedding1 = model.encode(sent1)
    embedding2 = model.encode(sent2)

    # Compute cosine similarity
    similarity = cosine_similarity(
        [embedding1],
        [embedding2]
    )[0][0]

    print(f"Sentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Cosine Similarity: {similarity:.4f}")
    print("-" * 50)


Sentence 1: A dog is playing in the park.
Sentence 2: A dog is running in a field.
Cosine Similarity: 0.5220
--------------------------------------------------
Sentence 1: I love pizza.
Sentence 2: I enjoy ice cream.
Cosine Similarity: 0.5281
--------------------------------------------------
Sentence 1: What is AI?
Sentence 2: How does a computer learn?
Cosine Similarity: 0.3194
--------------------------------------------------


In [3]:
!pip install sentence_transformers


### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


Most similar pair in the run above: "I love pizza." vs. "I enjoy ice cream." (cosine similarity 0.5281), narrowly ahead of the dog pair (0.5220).

## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [7]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
embeddings = model.encode(documents)



In [8]:
# Perform KMeans clustering on the document embeddings
k = 3  # Number of clusters
kmeans = KMeans(n_clusters=k, random_state=42)  # Fixed seed for reproducible labels
labels = kmeans.fit_predict(embeddings)

# Group documents by their assigned cluster label
clusters = {}
for doc, label in zip(documents, labels):
    clusters.setdefault(label, []).append(doc)



In [9]:
# Print cluster assignments

for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f" - {doc}")


Cluster 2:
 - What is the capital of France?
 - What is the best way to learn Python?

Cluster 0:
 - How do I bake a chocolate cake?
 - What is the distance between Earth and Mars?

Cluster 1:
 - How do I change a flat tire on a car?
 - How do I fix a leaky faucet?


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.
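
For the first question, one common heuristic (an addition here, not something the lab prescribes) is to compare silhouette scores across candidate values of k and prefer the highest. The sketch below uses hand-made 2-D points in place of real embeddings so that it runs without downloading a model; `points`, `toy_labels`, and `silhouette_by_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight, well-separated groups of 2-D points (synthetic stand-ins for embeddings).
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = np.array([(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets])

# Fit KMeans for several k and record the mean silhouette score of each clustering.
silhouette_by_k = {}
for candidate_k in range(2, 6):
    toy_labels = KMeans(n_clusters=candidate_k, random_state=42, n_init=10).fit_predict(points)
    silhouette_by_k[candidate_k] = silhouette_score(points, toy_labels)

# Pick the k whose clustering has the highest mean silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for candidate_k, score in silhouette_by_k.items():
    print(f"k={candidate_k}: silhouette={score:.3f}")
print(f"Best k by silhouette score: {best_k}")
```

With only six documents, as in this task, silhouette scores are likely to be noisy, so treat them as a hint alongside reading the clusters yourself.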

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine: given a user query, search the dataset for semantically relevant documents and return the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [10]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings (one vector per document)
doc_embeddings = model.encode(documents)

In [11]:
# Create the search function.
# It encodes the user query and returns the top_k documents most similar to it.
def semantic_search(query, documents, doc_embeddings, top_k=5):
    # Encode query
    query_embedding = model.encode(query)

    # Compute cosine similarity
    similarities = cosine_similarity(
        [query_embedding],
        doc_embeddings
    )[0]

    # Get top-k results
    top_indices = np.argsort(similarities)[-top_k:][::-1]

    # Return results
    results = [
        (documents[i], similarities[i])
        for i in top_indices
    ]
    return results
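
The top-k step in `semantic_search` packs three operations into one expression: `np.argsort` returns indices in ascending order of score, `[-top_k:]` keeps the last `top_k` of them (the highest scores), and `[::-1]` reverses the order so the best match comes first. A small sketch on made-up scores (`toy_scores` is a hypothetical array, not output from the model):

```python
import numpy as np

# Hypothetical similarity scores for four documents (indices 0..3).
toy_scores = np.array([0.10, 0.90, 0.40, 0.70])
top_k = 2

ascending = np.argsort(toy_scores)        # indices from lowest to highest score
top_indices = ascending[-top_k:][::-1]    # keep the top_k highest, best first

print(ascending.tolist())    # [0, 2, 3, 1]
print(top_indices.tolist())  # [1, 3]
```

If `top_k` is larger than the number of documents, the slice simply returns every index, so the function still works on small datasets.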

In [12]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

[('What is quantum computing?', np.float32(0.43524778)),
 ('What is the best way to learn Python?', np.float32(0.31878257)),
 ('How do I build a mobile app?', np.float32(0.110440716)),
 ('How do I set up a local server?', np.float32(0.091126524)),
 ('What are the best travel destinations in Europe?', np.float32(0.0906478))]

### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset.
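
For the second question, one possible direction (a sketch, not the lab's required answer) is to present each hit with its rank, a rounded score, and a coarse relevance label instead of raw `np.float32` values. The function name `format_results`, the label wording, and the 0.3 cutoff are all assumptions chosen for illustration; the sample input reuses the scores from the search output above, rounded to four decimals.

```python
def format_results(results, strong_cutoff=0.3):
    """Turn (document, score) pairs into ranked, human-readable lines.

    strong_cutoff is an arbitrary illustrative threshold, not a tuned value.
    """
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        label = "strong match" if float(score) >= strong_cutoff else "weak match"
        lines.append(f"{rank}. {doc} (similarity {float(score):.2f}, {label})")
    return lines

# Scores copied from the search output above, rounded to four decimals.
sample_results = [
    ("What is quantum computing?", 0.4352),
    ("What is the best way to learn Python?", 0.3188),
    ("How do I build a mobile app?", 0.1104),
    ("How do I set up a local server?", 0.0911),
    ("What are the best travel destinations in Europe?", 0.0906),
]

formatted = format_results(sample_results)
for line in formatted:
    print(line)
```

Because the function only expects `(document, score)` pairs, it should accept the list returned by `semantic_search` directly. A sensible cutoff depends on the model and dataset, so the value here would need checking against real queries.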