# Embeddings

This lab consolidates your understanding of embeddings by applying them to three tasks: semantic similarity comparison, document clustering, and semantic search.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
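
Step 3 relies on cosine similarity, which is the dot product of two vectors divided by the product of their lengths. As a minimal sketch of what is computed for a single pair, the helper below implements that formula directly with NumPy. The name `cosine_sim` and the toy vectors are illustrative assumptions, not part of the lab.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a (dot product is 0)

print(round(cosine_sim(a, b), 6))  # 1.0
print(round(cosine_sim(a, c), 6))  # 0.0
```

Because the score depends only on direction, `a` and `b` are treated as identical even though their lengths differ.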

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]
# Flatten the pairs and encode all sentences in one batch
sentences = [s for pair in sentence_pairs for s in pair]
embeddings = model.encode(sentences)

# Compute cosine similarity for each pair
for i, (sent1, sent2) in enumerate(sentence_pairs):
    # Pair i occupies rows 2*i and 2*i + 1 of the flattened embeddings
    emb1 = embeddings[2*i].reshape(1, -1)
    emb2 = embeddings[2*i+1].reshape(1, -1)
    sim = cosine_similarity(emb1, emb2)[0][0]
    print(f"Similarity between:\n  '{sent1}'\n  '{sent2}'\n  => {sim:.3f}\n")

Similarity between:
  'A dog is playing in the park.'
  'A dog is running in a field.'
  => 0.522

Similarity between:
  'I love pizza.'
  'I enjoy ice cream.'
  => 0.528

Similarity between:
  'What is AI?'
  'How does a computer learn?'
  => 0.319



### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.
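
To make step 2 less of a black box, here is a toy sketch of the assign-then-update loop (Lloyd's algorithm) that k-means is built on. It is not scikit-learn's implementation: the helper `simple_kmeans` is hypothetical, and it initialises the centroids with the first `k` points for reproducibility, whereas scikit-learn's `KMeans` uses k-means++ initialisation by default.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=10):
    """Toy Lloyd's algorithm (hypothetical helper, not scikit-learn's KMeans)."""
    # Simplification: use the first k points as the initial centroids.
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups in 2-D stand in for document embeddings.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = simple_kmeans(X, k=2)
print(labels)  # [0 0 0 1 1 1]
```

Even though both initial centroids start inside the first group, the second centroid is pulled toward the far group after one update, and the assignments settle on the two visible groups.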

In [3]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents (reuses the model loaded in Task 1)
embeddings = model.encode(documents)
print("Embedding shape:", embeddings.shape)

Embedding shape: (6, 384)


In [16]:
# Perform KMeans clustering
import os

# Disable tokenizer parallelism to avoid the Hugging Face tokenizers fork warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Fix random_state so the cluster assignments are reproducible
kmeans = KMeans(n_clusters=3, random_state=100)
kmeans.fit(embeddings)
pred = kmeans.predict(embeddings)

In [14]:
# Print cluster assignments
import pandas as pd

# Option 1: directly from pred
print("Cluster assignments (pred):")
print(pred)

# Option 2: from the fitted KMeans object
print("Cluster assignments (kmeans.labels_):")
print(kmeans.labels_)

# Option 3: pair each clustered document with its cluster label
df = pd.DataFrame({
    "document": documents,
    "cluster": pred
})
print(df)

Cluster assignments (pred):
[1 0 1 0 2 0]
Cluster assignments (kmeans.labels_):
[1 0 1 0 2 0]
                                       document  cluster
0                What is the capital of France?        1
1               How do I bake a chocolate cake?        0
2  What is the distance between Earth and Mars?        1
3         How do I change a flat tire on a car?        0
4         What is the best way to learn Python?        2
5                  How do I fix a leaky faucet?        0


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.
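
One common way to approach the first question is to compare silhouette scores across candidate values of k. The sketch below does this on synthetic 2-D points rather than real embeddings; the data, the range of k, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings: three tight groups of ten 2-D points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

Note that `silhouette_score` requires between 2 and `n_samples - 1` distinct labels, so for the six documents above only k from 2 to 5 can be scored, and with so few points the scores are only a rough guide.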

## Task 3: Semantic Search System
### Objective:
Build a semantic search engine: given a user query, rank the documents in the dataset by semantic relevance to the query and return the top 5 results.
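
The ranking step can be sketched independently of any embedding model. Assuming the query and documents are already encoded as vectors, L2-normalising them turns cosine similarity into a plain dot product, so scoring every document is a single matrix-vector product. The helper name `top_n_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def top_n_by_cosine(query_vec, doc_vecs, top_n=5):
    """Return (index, score) pairs for the top_n most similar rows (hypothetical helper)."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort descending and keep at most top_n indices.
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors stand in for document embeddings.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 0.0])

for idx, score in top_n_by_cosine(query_vec, doc_vecs, top_n=2):
    print(idx, round(score, 3))
# 0 1.0
# 2 0.707
```

For a fixed document set, the normalised document matrix could be computed once and reused across queries.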

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [21]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

doc_embeddings = model.encode(documents)
print("Embedding shape:", doc_embeddings.shape)


Embedding shape: (10, 384)


In [22]:
# Create the search function.
# It encodes the user query and returns the top N documents most similar to it.
def semantic_search(query, documents, doc_embeddings, top_n=5):
    # 1. Encode the query with the globally loaded model
    query_embedding = model.encode([query])  # shape: (1, embedding_dim)

    # 2. Compute cosine similarity between the query and every document
    sims = cosine_similarity(query_embedding, doc_embeddings)[0]  # shape: (num_documents,)

    # 3. Get the indices of the top N most similar documents, best first
    top_indices = np.argsort(sims)[::-1][:top_n]

    # 4. Return the top documents with their similarity scores
    top_docs = [(documents[i], sims[i]) for i in top_indices]
    return top_docs

In [23]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

[('What is quantum computing?', 0.4352477),
 ('What is the best way to learn Python?', 0.3187827),
 ('How do I build a mobile app?', 0.110440776),
 ('How do I set up a local server?', 0.0911265),
 ('What are the best travel destinations in Europe?', 0.090647765)]

### Questions:
- What are the top-ranked results for the given query?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset.
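
As one possible starting point for the second question, raw similarity scores can be turned into ranked, labelled lines. This is only a sketch: the helper `explain_results` is hypothetical, and the `min_score` cut-off of 0.3 is an arbitrary illustrative value, not a calibrated threshold.

```python
def explain_results(results, min_score=0.3):
    """Format (document, score) pairs as ranked lines with a coarse relevance label."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        # min_score is an arbitrary illustrative cut-off, not a calibrated threshold.
        label = "likely relevant" if score >= min_score else "weak match"
        lines.append(f"{rank}. {doc} (similarity {score:.2f}, {label})")
    return lines

# Scores copied from the search output above.
results = [
    ("What is quantum computing?", 0.4352477),
    ("What is the best way to learn Python?", 0.3187827),
    ("How do I build a mobile app?", 0.110440776),
]
for line in explain_results(results):
    print(line)
# 1. What is quantum computing? (similarity 0.44, likely relevant)
# 2. What is the best way to learn Python? (similarity 0.32, likely relevant)
# 3. How do I build a mobile app? (similarity 0.11, weak match)
```

A label like this also makes it visible when the query has few good matches in the dataset and the lower ranks are filled with weak ones.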