# Embeddings

This lab reinforces your understanding of embeddings by applying them to three tasks: semantic similarity comparison, document clustering, and semantic search.

### Tasks:
- Task 1: Semantic Similarity Comparison
- Task 2: Document Clustering
- Task 3: Semantic Search System


## Task 1: Semantic Similarity Comparison
### Objective:
Compare the semantic similarity of sentence pairs by computing the cosine similarity of their embeddings.

### Steps:
1. Load a pre-trained Sentence Transformer model.
2. Encode the sentence pairs.
3. Compute cosine similarity for each pair.
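
As a reference for step 3, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on toy 2-D vectors, assuming non-zero inputs. `cosine_sim` is a hypothetical helper name used only for illustration; real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, not real embeddings
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors, 0
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])        # opposite vectors, -1

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector does not change it, which is why the first pair scores approximately 1 despite the different lengths.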

### Dataset:
- "A dog is playing in the park." vs. "A dog is running in a field."
- "I love pizza." vs. "I enjoy ice cream."
- "What is AI?" vs. "How does a computer learn?"


In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentence pairs
sentence_pairs = [
    ("A dog is playing in the park.", "A dog is running in a field."),
    ("I love pizza.", "I enjoy ice cream."),
    ("What is AI?", "How does a computer learn?")
]

# Compute similarities
for sentence1, sentence2 in sentence_pairs:
    embedding1 = model.encode([sentence1])
    embedding2 = model.encode([sentence2])
    similarity = cosine_similarity(embedding1, embedding2)
    print(f"Similarity between '{sentence1}' and '{sentence2}': {similarity[0][0]}")



Similarity between 'A dog is playing in the park.' and 'A dog is running in a field.': 0.5219751596450806
Similarity between 'I love pizza.' and 'I enjoy ice cream.': 0.5280680656433105
Similarity between 'What is AI?' and 'How does a computer learn?': 0.3194349408149719


### Questions:
- Which sentence pairs are the most semantically similar? Why?
- Can you think of cases where cosine similarity might fail to capture true semantic meaning?


## Task 2: Document Clustering
### Objective:
Cluster a set of text documents into similar groups based on their embeddings.

### Steps:
1. Encode the documents using Sentence Transformers.
2. Use KMeans clustering to group the documents.
3. Analyze the clusters for semantic meaning.

In [2]:
from sklearn.cluster import KMeans

# Documents to cluster
questions = [
    "What is the most popular sport in the world?",
    "What is the stock market?",
    "What are the three primary types of musical instruments?",
    "How many players are on a soccer team?",
    "What does 'ROI' stand for in finance?",
    "Who is known as the 'King of Pop'?",
    "Who holds the record for the most Olympic gold medals?",
    "What is a savings account?",
    "What is the difference between a symphony and a concerto?",
    "What is the Super Bowl?",
    "What is the purpose of a credit score?",
    "What are the seven musical notes?",
    "Which country has won the most FIFA World Cups?",
    "What does 'diversification' mean in investing?",
    "Who composed 'Fur Elise'?",
    "What does 'MVP' stand for in sports?",
    "What is a mutual fund?",
    "What is a chord in music?",
    "In tennis, what does 'love' mean?",
    "What is the difference between gross income and net income?",
    "What is the purpose of a metronome?",
    "How long is an NBA basketball game?",
    "How does a credit card work?",
    "What does 'tempo' mean in music?",
    "What is the main rule of golf?",
    "What is the role of a central bank?",
    "What is the most played instrument in the world?",
    "Who is considered the greatest basketball player of all time?",
    "What is cryptocurrency?",
    "What is sheet music?",
    "What are the main sections of an orchestra?",
    "What does 'a cappella' mean?",
    "What is the term for a score in baseball?",
    "What does 'inflation' mean?",
    "What is the difference between a loan and a mortgage?",
    "What is the difference between a major and minor scale?",
    "What is the fastest recorded speed for a tennis serve?",
    "How does a pension plan work?",
    "Who wrote the opera 'The Magic Flute'?",
    "What does 'offside' mean in soccer?",
    "What does 'compound interest' mean?",
    "What is a flat note in music?",
    "Who has the most home runs in MLB history?",
    "What is the purpose of a budget?",
    "What is the role of a conductor in an orchestra?",
    "What is the standard tuning for a guitar?",
    "How many rounds are there in a standard boxing match?",
    "How does a debit card differ from a credit card?",
    "Who is the best-selling music artist of all time?",
    "What is the national sport of Japan?",
    "What is the role of a financial advisor?",
    "What is the difference between a cover and an original song?",
    "What equipment is required to play cricket?",
    "What is foreign exchange (Forex)?",
    "What is a duet in music?",
    "What is a strike in bowling?",
    "What is an initial public offering (IPO)?",
    "What is the purpose of a capo on a guitar?",
    "How long is a marathon?",
    "How does a checking account differ from a savings account?",
    "What is the difference between rhythm and melody?",
    "What is the highest score possible in bowling?",
    "What is a hedge fund?",
    "Who is considered the greatest composer in classical music?",
    "What sport uses a pommel horse?",
    "What does 'liquidity' mean in finance?",
    "What is the structure of a typical pop song?",
    "What is a 'grand slam' in tennis?",
    "What is an economic recession?",
    "What is the purpose of a music producer?"
]


# Encode documents
X = model.encode(questions)


In [3]:
# Perform KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=100)
kmeans.fit(X)
pred = kmeans.predict(X)

In [4]:
# Print cluster assignments
for i, question in enumerate(questions):
    print(f"Question: {question}\nCluster: {pred[i]}\n")

Question: What is the most popular sport in the world?
Cluster: 0

Question: What is the stock market?
Cluster: 2

Question: What are the three primary types of musical instruments?
Cluster: 1

Question: How many players are on a soccer team?
Cluster: 1

Question: What does 'ROI' stand for in finance?
Cluster: 2

Question: Who is known as the 'King of Pop'?
Cluster: 0

Question: Who holds the record for the most Olympic gold medals?
Cluster: 0

Question: What is a savings account?
Cluster: 2

Question: What is the difference between a symphony and a concerto?
Cluster: 1

Question: What is the Super Bowl?
Cluster: 1

Question: What is the purpose of a credit score?
Cluster: 2

Question: What are the seven musical notes?
Cluster: 1

Question: Which country has won the most FIFA World Cups?
Cluster: 0

Question: What does 'diversification' mean in investing?
Cluster: 2

Question: Who composed 'Fur Elise'?
Cluster: 0

Question: What does 'MVP' stand for in sports?
Cluster: 1

Question: Wha

### Questions:
- How many clusters make the most sense? Why?
- Examine the documents in each cluster. Are they semantically meaningful? Can you assign a semantic "theme" to each cluster?
- Try this exercise with a larger dataset of your choice.
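
When examining the clusters, it may be easier to read the documents grouped by label rather than in their original order. The sketch below shows one way to do that in plain Python. `group_by_cluster` is a hypothetical helper, and the sample labels are copied from the first few lines of the output above rather than recomputed.

```python
from collections import defaultdict

def group_by_cluster(docs, labels):
    """Map each cluster label to the list of documents assigned to it."""
    clusters = defaultdict(list)
    for doc, label in zip(docs, labels):
        clusters[int(label)].append(doc)  # int() also handles NumPy integer labels
    return dict(sorted(clusters.items()))

# A small sample; with the full notebook you would pass `questions` and `pred`
sample_docs = [
    "What is the stock market?",
    "What are the seven musical notes?",
    "Who composed 'Fur Elise'?",
    "What is a savings account?",
]
sample_labels = [2, 1, 0, 2]

grouped = group_by_cluster(sample_docs, sample_labels)
for label, members in grouped.items():
    print(f"Cluster {label} ({len(members)} documents)")
    for doc in members:
        print(f"  - {doc}")
```

Reading each group as a block makes it easier to judge whether a cluster has a coherent theme or mixes topics.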

## Task 3: Semantic Search System
### Objective:
Create a semantic search engine: given a user query, search the dataset for the most semantically relevant documents and return the top 5 results.

### Dataset:
- Use the following set of documents (the code cell below reuses the Task 2 question list instead):
    - "What is the capital of France?"
    - "How do I bake a chocolate cake?"
    - "What is the distance between Earth and Mars?"
    - "How do I change a flat tire on a car?"
    - "What is the best way to learn Python?"
    - "How do I fix a leaky faucet?"
    - "What are the best travel destinations in Europe?"
    - "How do I set up a local server?"
    - "What is quantum computing?"
    - "How do I build a mobile app?"


In [5]:
import numpy as np

# Documents dataset
documents = [
    "What is the most popular sport in the world?",
    "What is the stock market?",
    "What are the three primary types of musical instruments?",
    "How many players are on a soccer team?",
    "What does 'ROI' stand for in finance?",
    "Who is known as the 'King of Pop'?",
    "Who holds the record for the most Olympic gold medals?",
    "What is a savings account?",
    "What is the difference between a symphony and a concerto?",
    "What is the Super Bowl?",
    "What is the purpose of a credit score?",
    "What are the seven musical notes?",
    "Which country has won the most FIFA World Cups?",
    "What does 'diversification' mean in investing?",
    "Who composed 'Fur Elise'?",
    "What does 'MVP' stand for in sports?",
    "What is a mutual fund?",
    "What is a chord in music?",
    "In tennis, what does 'love' mean?",
    "What is the difference between gross income and net income?",
    "What is the purpose of a metronome?",
    "How long is an NBA basketball game?",
    "How does a credit card work?",
    "What does 'tempo' mean in music?",
    "What is the main rule of golf?",
    "What is the role of a central bank?",
    "What is the most played instrument in the world?",
    "Who is considered the greatest basketball player of all time?",
    "What is cryptocurrency?",
    "What is sheet music?",
    "What are the main sections of an orchestra?",
    "What does 'a cappella' mean?",
    "What is the term for a score in baseball?",
    "What does 'inflation' mean?",
    "What is the difference between a loan and a mortgage?",
    "What is the difference between a major and minor scale?",
    "What is the fastest recorded speed for a tennis serve?",
    "How does a pension plan work?",
    "Who wrote the opera 'The Magic Flute'?",
    "What does 'offside' mean in soccer?",
    "What does 'compound interest' mean?",
    "What is a flat note in music?",
    "Who has the most home runs in MLB history?",
    "What is the purpose of a budget?",
    "What is the role of a conductor in an orchestra?",
    "What is the standard tuning for a guitar?",
    "How many rounds are there in a standard boxing match?",
    "How does a debit card differ from a credit card?",
    "Who is the best-selling music artist of all time?",
    "What is the national sport of Japan?",
    "What is the role of a financial advisor?",
    "What is the difference between a cover and an original song?",
    "What equipment is required to play cricket?",
    "What is foreign exchange (Forex)?",
    "What is a duet in music?",
    "What is a strike in bowling?",
    "What is an initial public offering (IPO)?",
    "What is the purpose of a capo on a guitar?",
    "How long is a marathon?",
    "How does a checking account differ from a savings account?",
    "What is the difference between rhythm and melody?",
    "What is the highest score possible in bowling?",
    "What is a hedge fund?",
    "Who is considered the greatest composer in classical music?",
    "What sport uses a pommel horse?",
    "What does 'liquidity' mean in finance?",
    "What is the structure of a typical pop song?",
    "What is a 'grand slam' in tennis?",
    "What is an economic recession?",
    "What is the purpose of a music producer?"
]

# Compute document embeddings
doc_embeddings = model.encode(documents)

# YOUR CODE HERE

In [10]:
# Create the search function.
# It encodes the user query and returns the top N documents most similar to it.
def semantic_search(query, documents, doc_embeddings, top_n=5):
    query_embedding = model.encode([query])
    similarity_scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarity_scores)[::-1][:top_n]
    top_documents = [documents[i] for i in top_indices]
    return top_documents

In [13]:
# Test the search function
query = "What is a flat note in music?"
semantic_search(query, documents, doc_embeddings)

['What is a flat note in music?',
 'What are the seven musical notes?',
 'What is sheet music?',
 'What is a chord in music?',
 'What is a duet in music?']

### Questions:
- What are the top-ranked results for the given queries?
- How can you better explain the ranking to users?
- Try this approach with a larger dataset.
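
One possible way to make a ranking easier to interpret is to return each result's similarity score alongside the document. The sketch below is an illustration on toy 2-D vectors, not the notebook's solution: `rank_with_scores` is a hypothetical function that takes precomputed, non-zero vectors, so it runs without loading a model.

```python
import numpy as np

def rank_with_scores(query_vec, doc_vecs, docs, top_n=5):
    """Return the top_n (document, cosine score) pairs, best match first."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Normalize to unit length so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it to put the highest scores first
    order = np.argsort(scores)[::-1][:top_n]
    return [(docs[i], float(scores[i])) for i in order]

# Toy data standing in for real document and query embeddings
toy_docs = ["doc a", "doc b", "doc c"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = rank_with_scores([1.0, 0.0], toy_vecs, toy_docs, top_n=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

With real data, the same function could be called with `model.encode([query])[0]`, `doc_embeddings`, and `documents`. Showing the scores lets a user see how close the runner-up results are to the best match.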