# Embeddings

This lab solidifies your understanding of embeddings by applying them to three tasks: comparing semantic similarity, clustering documents, and building a semantic search system.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare the semantic similarity of sentence pairs by computing the cosine similarity of their embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
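
To make step 3 concrete, here is a minimal sketch of cosine similarity written with only the standard library. The function name `cosine_similarity_1d` and the toy vectors are illustrative assumptions, not part of the lab; the lab code itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real sentence embeddings.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [-3.0, 0.0, 1.0]  # orthogonal to u: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity_1d(u, v))  # close to 1: same direction
print(cosine_similarity_1d(u, w))  # 0: no shared direction
```

Because the score depends only on direction, `u` and `v` count as maximally similar even though their lengths differ.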

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [None]:
# !pip install sentence_transformers
# !pip install tf-keras

In [38]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Embeddings
# Note: each tuple is tokenized as one joint text pair, so this call returns
# one vector per pair, not one vector per sentence.
embedding = model.encode(sentence_pairs)
print(embedding.shape)

# Compute the cosine similarity of each pair from per-sentence embeddings
for first, second in sentence_pairs:
    print(cosine_similarity(model.encode([first]), model.encode([second]))[0][0])


(3, 384)
0.52197516
0.5280682
0.3194349


### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.
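
As a rough picture of what step 2 does under the hood, the following sketch implements the basic assign-and-update loop of k-means (Lloyd's algorithm) on toy 2-D points. The function `kmeans`, its naive seeding, and the sample points are assumptions made for illustration; scikit-learn's `KMeans` uses smarter seeding (k-means++) and convergence checks.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny Lloyd's-algorithm sketch; returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy 2-D points standing in for document embeddings.
points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (5.0, 6.0), (1.0, 0.0), (6.0, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

The points near the origin end up in one cluster and the points near (5, 5) in the other; with real embeddings the same loop runs in the model's full embedding space.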

In [21]:
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer, util

# Documents to cluster
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?"
]

# Encode documents
docmodel = SentenceTransformer('all-MiniLM-L6-v2')
docembedding = docmodel.encode(documents)

In [39]:
# Perform KMeans clustering
dockmeans = KMeans(n_clusters=2, random_state=4)
dockmeans.fit(docembedding)

In [40]:
# Print cluster assignments
label = dockmeans.predict(docembedding)

labeldoc = pd.concat([pd.DataFrame(documents), pd.DataFrame(label, columns=['class'])], axis=1)
labeldoc.head()

|   | 0 | class |
|---|---|---|
| 0 | What is the capital of France? | 1 |
| 1 | How do I bake a chocolate cake? | 1 |
| 2 | What is the distance between Earth and Mars? | 1 |
| 3 | How do I change a flat tire on a car? | 0 |
| 4 | What is the best way to learn Python? | 0 |


### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine: given a user query, search the dataset for the most semantically relevant documents and return the top 5 results.
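
The ranking idea can be sketched without any model: score every corpus vector against the query vector, sort by score, and keep the best `k`. The helper names `cosine` and `top_k` and the 2-D vectors below are hypothetical stand-ins; in the lab, `util.semantic_search` from Sentence Transformers performs this step on real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus_vecs, k=5):
    """Return [(corpus_index, score), ...] for the k highest-scoring vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-D vectors; a real system would use model embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]
hits = top_k(query, corpus, k=2)
for idx, score in hits:
    print(idx, round(score, 2))
```

Sorting the full list is fine for a ten-document corpus; larger corpora usually call for a partial sort or an approximate nearest-neighbor index.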

### Dataset:
- Use the following set of documents:
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [88]:
import numpy as np

# Documents dataset
documents = [
    "What is the capital of France?",
    "How do I bake a chocolate cake?",
    "What is the distance between Earth and Mars?",
    "How do I change a flat tire on a car?",
    "What is the best way to learn Python?",
    "How do I fix a leaky faucet?",
    "What are the best travel destinations in Europe?",
    "How do I set up a local server?",
    "What is quantum computing?",
    "How do I build a mobile app?"
]

# Compute document embeddings
model1 = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model1.encode(documents)

In [89]:
query = "Explain programming languages."

# Encode the query and rank the Task 3 corpus against it
qemb = model1.encode(query)
result = util.semantic_search(query_embeddings=qemb, corpus_embeddings=doc_embeddings, top_k=5)
for i in result[0]:
    print(f"\"{documents[i['corpus_id']]}\" matches with score : {round(i['score'],2)}")

"What is quantum computing?" matches with score : 0.44
"What is the best way to learn Python?" matches with score : 0.32
"How do I build a mobile app?" matches with score : 0.11
"How do I set up a local server?" matches with score : 0.09
"What are the best travel destinations in Europe?" matches with score : 0.09


In [90]:
# Create the search function
# Encodes the user query and prints the top N documents that most resemble it
def semantic_search(query, documents, doc_embeddings, top_n=5):
    """Print the top_n documents ranked by similarity to the query."""
    qemb = model1.encode(query)
    result = util.semantic_search(query_embeddings=qemb, corpus_embeddings=doc_embeddings, top_k=top_n)
    for i in result[0]:
        print(f"\"{documents[i['corpus_id']]}\" matches with score : {round(i['score'],2)}")

In [91]:
# Test the search function
query = "Explain programming languages."
semantic_search(query, documents, doc_embeddings)

"What is quantum computing?" matches with score : 0.44
"What is the best way to learn Python?" matches with score : 0.32
"How do I build a mobile app?" matches with score : 0.11
"How do I set up a local server?" matches with score : 0.09
"What are the best travel destinations in Europe?" matches with score : 0.09


### Questions:
- What are the top-ranked results for the given queries?
- How can you improve the ranking explanation for users?
- Try this approach with a larger dataset.
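
One possible starting point for the ranking-explanation question is to show each hit's rank and attach a rough verbal label to its score. The function `describe_hits` and the thresholds `strong=0.4` and `weak=0.2` are arbitrary assumptions chosen for illustration, not calibrated values; the three sample scores are taken from the search output above.

```python
def describe_hits(hits, documents, strong=0.4, weak=0.2):
    """Format (corpus_id, score) hits as ranked lines with a rough label."""
    lines = []
    for rank, (corpus_id, score) in enumerate(hits, start=1):
        # Assumed cut-offs: tune these against real queries before relying on them.
        if score >= strong:
            label = "strong match"
        elif score >= weak:
            label = "possible match"
        else:
            label = "weak match"
        lines.append(f"{rank}. {documents[corpus_id]} (score {score:.2f}, {label})")
    return lines

docs = [
    "What is quantum computing?",
    "What is the best way to learn Python?",
    "How do I build a mobile app?",
]
hits = [(0, 0.44), (1, 0.32), (2, 0.11)]
for line in describe_hits(hits, docs):
    print(line)
```

Returning formatted lines rather than printing inside the function keeps the presentation logic separate from the search itself, so the labels can be changed without touching the ranking.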