# Embeddings

This lab is designed to help you solidify your understanding of embeddings by applying them to tasks like semantic similarity, clustering, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare semantic similarity between pairs of sentences using cosine similarity and embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
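
Step 3 relies on cosine similarity, which measures the angle between two vectors and ignores their length. As a minimal sketch of what that computation does (the helper name `cosine_sim` is hypothetical and is not part of `sentence_transformers` or scikit-learn), it can be written with the standard library alone:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length numeric vectors (illustrative helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Because only the direction matters, scaling a vector does not change the score, which is one reason the measure is commonly used to compare embeddings.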

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute and print the cosine similarity of each pair
for s1, s2 in sentence_pairs:
    emb1 = model.encode(s1)
    emb2 = model.encode(s2)
    sim = cosine_similarity([emb1], [emb2])[0][0]
    print(f"{sim:.4f}  {s1!r} vs. {s2!r}")


### Questions:
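- Which sentence pairs are the most semantically similar? Why?
"A dog is playing in the park." vs. "A dog is running in a field." Both sentences describe the same subject (a dog) doing a similar activity in a similar outdoor setting.
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?
Negation, where a denial can be misread as agreement because the two sentences share most of their words, and irony, where the literal wording differs from the intended meaning.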


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [2]:
from sklearn.cluster import KMeans

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents)

In [3]:
# Perform KMeans clustering

k = 3
kmeans = KMeans(n_clusters=k, random_state=2)
kmeans.fit(doc_embeddings)
labels = kmeans.labels_

In [5]:
# Print cluster assignments

for cluster in range(k):
    print(cluster)
    for i, label in enumerate(labels):
        if label == cluster:
            print(documents[i])

0
How do I bake a chocolate cake?
What is the distance between Earth and Mars?
1
How do I change a flat tire on a car?
How do I fix a leaky faucet?
2
What is the capital of France?
What is the best way to learn Python?


### Questions:
- How many clusters make the most sense? Why?
**Cluster 1 makes the most sense because it groups the practical repair questions (changing a flat tire and fixing a leaky faucet).**

- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
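**Clusters 0 and 2 are less meaningful: cluster 0 pairs baking a cake with the Earth-Mars distance, and cluster 2 pairs the capital of France with learning Python.**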
- Try this exercise with a larger dataset of your choice
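
One common way to approach the "how many clusters" question is the silhouette coefficient, which compares how close each point is to its own cluster versus the nearest other cluster. The sketch below is a plain-Python illustration on toy 2-D points, assuming Euclidean distance; the function name `silhouette` and the toy data are hypothetical. With real embeddings, scikit-learn's `silhouette_score` from `sklearn.metrics` computes the same kind of score.

```python
import math
from collections import defaultdict

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative helper)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: two tight, well-separated groups
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette(toy_points, [0, 0, 1, 1]))  # natural grouping: close to 1
print(silhouette(toy_points, [0, 1, 0, 1]))  # mixed grouping: negative
```

Running KMeans for several values of `k` and keeping the one with the highest score is a reasonable starting point, although with only six documents the score can be noisy.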

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine:
A user provides a query, and the system searches the dataset for semantically relevant documents and returns the top 5 results.

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [6]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings


model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(documents)

In [30]:
# Create the search function
# This function encodes the user query and prints the top_k documents most similar to it
def semantic_search(query, documents, doc_embeddings, top_k=5):
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]  # indices of the top_k most similar documents

    for idx in top_indices:
        print(f"- ({similarities[idx]:.4f}) {documents[idx]}")


In [31]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

- (0.4352) What is quantum computing?
- (0.3188) What is the best way to learn Python?
- (0.1104) How do I build a mobile app?
- (0.0911) How do I set up a local server?
- (0.0906) What are the best travel destinations in Europe?


### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset
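
One possible way to make the ranking easier for users to interpret is to return structured results with a rank and a plain-language label instead of a raw score alone. The sketch below is an assumption-laden illustration: the function name `explain_ranking` and the thresholds `strong=0.4` and `weak=0.2` are hypothetical and would need tuning for a real model and dataset. The example scores are the ones printed by the test query above.

```python
def explain_ranking(similarities, documents, top_k=5, strong=0.4, weak=0.2):
    """Return the top_k documents with rank, score, and a qualitative label.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    ranked = sorted(zip(similarities, documents), key=lambda pair: pair[0], reverse=True)
    results = []
    for rank, (score, doc) in enumerate(ranked[:top_k], start=1):
        if score >= strong:
            label = "strong match"
        elif score >= weak:
            label = "partial match"
        else:
            label = "weak match"
        results.append({"rank": rank, "score": round(float(score), 4),
                        "label": label, "document": doc})
    return results

# Scores and documents taken from the test query output above
example_scores = [0.4352, 0.3188, 0.1104, 0.0911, 0.0906]
example_docs = [
    "What is quantum computing?",
    "What is the best way to learn Python?",
    "How do I build a mobile app?",
    "How do I set up a local server?",
    "What are the best travel destinations in Europe?",
]
for r in explain_ranking(example_scores, example_docs):
    print(f"{r['rank']}. [{r['label']}] ({r['score']:.4f}) {r['document']}")
```

Returning a list of dictionaries rather than printing inside the function also makes it easier to filter out weak matches before showing results.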

In [None]:
## "Explain programming languages." is probably related to learning Python, building a mobile app, and setting up a server
## Maybe split the learning-related documents into categories