# Document similarity
1. Load the embedding model (`all-MiniLM-L6-v2`).
2. Create 10 example documents and one query.
3. Get embeddings for both.
4. Use `cosine_similarity` to compare query vs. each document.
5. Sort results and print the **most similar sentence** along with a full ranking.

## 1. Load the embedding model (`all-MiniLM-L6-v2`).

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings
from dotenv import load_dotenv




In [2]:
load_dotenv()

True

In [3]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [4]:
text = "New Delhi is the capital of the India"

vector = embeddings.embed_query(text)

In [5]:
print(str(vector))

[0.04605361074209213, -0.05806925520300865, -0.017604384571313858, 0.026150131598114967, -0.007730443030595779, -0.10276605933904648, 0.024327564984560013, 0.005463439039885998, -0.0037072852719575167, -0.00020358702749945223, 0.03943249583244324, -0.12142534554004669, -0.0005752576398663223, -0.05194535851478577, 0.07414358109235764, -0.04955495521426201, 0.07524493336677551, 0.04618397355079651, 0.05548340827226639, -0.05678314343094826, -0.014605126343667507, -0.006626819726079702, -0.02740038000047207, -0.06705169379711151, 0.009240210056304932, 0.019550032913684845, 0.041943494230508804, -0.01277570053935051, 0.04445423558354378, 0.0038903020322322845, 0.06544924527406693, 0.0006415504612959921, -0.007206382229924202, -6.107848457759246e-05, -0.034634027630090714, 0.0212282482534647, -0.04494797810912132, 0.08696726709604263, 0.13940368592739105, -0.10837884247303009, 0.05453198030591011, 0.03133875131607056, 0.05558394268155098, 0.03333265334367752, 0.05737239494919777, -0.010184

## 2. Create 10 example documents and one query.


In [6]:
document = [
    "Harish is good  boy he lives in andhrapardesh",
    "Python is a popular programming language",
    "Cricket is the most popular sport in India",
    "The Taj Mahal is located in Agra",
    "Water boils at 100 degrees Celsius",
    "The Himalayas are the tallest mountains in the world",
    "Artificial Intelligence is transforming many industries"
]

In [7]:
document_vector = embeddings.embed_documents(document)

In [8]:
print(str(document_vector))

[[0.005943941883742809, -0.00807455275207758, 0.011437086388468742, -0.030495349317789078, -0.06900510936975479, 0.06814603507518768, 0.06432916969060898, -0.02426259219646454, -0.05999157950282097, 0.001751104835420847, 0.0624106302857399, -0.12053573131561279, -0.012980078347027302, -0.03405764326453209, 0.013191316276788712, -0.010150314308702946, 0.07816095650196075, 0.034702498465776443, -0.11295846104621887, -0.11622744798660278, -0.07487251609563828, 0.01927623338997364, 0.09270021319389343, -0.06974779814481735, -0.04484916105866432, -0.07911323755979538, 0.06762999296188354, -0.020428460091352463, 0.04566741734743118, 0.026580965146422386, 0.0746195912361145, 0.015547211281955242, -0.014171327464282513, 0.013506805524230003, -0.05653249844908714, 0.03864619508385658, 0.04112870618700981, 0.05879148095846176, 0.04536650702357292, 0.0053036753088235855, 0.024914439767599106, 0.02323978953063488, -0.04321246221661568, -0.10509991645812988, -0.004477926064282656, -0.05940705165266




In [10]:
documents = [
    "New Delhi is the capital of the India",
    "Amaravathi is the capital of Andhra Pradesh",
    "Sun rises in the East",
    "The Earth revolves around the Sun",
    "Python is a popular programming language",
    "Cricket is the most popular sport in India",
    "The Taj Mahal is located in Agra",
    "Water boils at 100 degrees Celsius",
    "The Himalayas are the tallest mountains in the world",
    "Artificial Intelligence is transforming many industries"
]

## 3. Get embeddings for both.

In [11]:
# Example query
query = "Which city is the capital of India?"

In [12]:
doc_embeddings = embeddings.embed_documents(documents)
query_embeddings = embeddings.embed_query(query)

In [15]:
len(doc_embeddings)

10

In [16]:
type(doc_embeddings)

list

Each document in `documents` is embedded as a 384-dimensional vector, and so is the query.

In [20]:
len(doc_embeddings[2])

384

In [22]:
len(query_embeddings)

384


## 4. Use `cosine_similarity` to compare query vs. each document.


In [26]:
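from sklearn.metrics.pairwise import cosine_similarity
# Compute cosine similarity (inputs must be 2-D, hence the list around the query)
similarities = cosine_similarity([query_embeddings], doc_embeddings)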

In [27]:
similarities

array([[0.66796625, 0.52606786, 0.10237831, 0.06642682, 0.09172476,
        0.3984782 , 0.38767492, 0.07669379, 0.20355731, 0.11413664]])
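For intuition about what `cosine_similarity` returns: the score for a pair of vectors is their dot product divided by the product of their lengths. The sketch below reimplements that with the standard library only. The helper name `cosine` and the toy 3-dimensional vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the 384-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [2.0, 0.0, 2.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

# One score per toy document, in the same order as doc_vecs.
toy_scores = [cosine(query_vec, d) for d in doc_vecs]
print(toy_scores)
```

Applied to `query_embeddings` and each entry of `doc_embeddings`, this should reproduce the scores above up to floating-point rounding. The scikit-learn function does the same for whole matrices at once.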

Pair documents with similarity scores

In [28]:
similarities[0]

array([0.66796625, 0.52606786, 0.10237831, 0.06642682, 0.09172476,
       0.3984782 , 0.38767492, 0.07669379, 0.20355731, 0.11413664])

In [32]:
enum_to_list = list(enumerate(similarities[0]))

Sort by similarity (ascending, so the best match comes last)

In [34]:
sorted_list = sorted(enum_to_list,key=lambda x: x[1])
sorted_list

[(3, 0.06642682093768534),
 (7, 0.07669379038176113),
 (4, 0.09172475986952255),
 (2, 0.10237831103663435),
 (9, 0.11413664419797931),
 (8, 0.20355731023738027),
 (6, 0.3876749243533652),
 (5, 0.3984781986931988),
 (1, 0.5260678563722281),
 (0, 0.6679662495566434)]
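When only the best few matches are needed, a full sort is not required. As a sketch, the snippet below uses `heapq.nlargest` from the standard library on `(index, score)` pairs. The scores are copied (rounded to four decimals) from the output above, and `top_k` is a hypothetical helper name.

```python
import heapq

# (document index, similarity) pairs, rounded from the notebook output above.
pairs = [
    (0, 0.6680), (1, 0.5261), (2, 0.1024), (3, 0.0664), (4, 0.0917),
    (5, 0.3985), (6, 0.3877), (7, 0.0767), (8, 0.2036), (9, 0.1141),
]

def top_k(pairs, k):
    """Return the k highest-scoring (index, score) pairs, best first."""
    return heapq.nlargest(k, pairs, key=lambda p: p[1])

best_three = top_k(pairs, 3)
for idx, score in best_three:
    print(idx, score)
```

Each returned index can be used to look up the matching sentence in `documents`, for example `documents[best_three[0][0]]` for the top match.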

## 5. Sort results and print the **most similar sentence** along with a full ranking.

In [None]:
# Print results
print("Query:", query)
print("\nMost similar sentence:")
print(documents[sorted_list[-1][0]]) 

Query: Which city is the capital of India?

Most similar sentence:
New Delhi is the capital of the India


# All at once

In [38]:
from langchain_huggingface import HuggingFaceEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load embedding model
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Example documents (10 lines)
documents = [
    "New Delhi is the capital of the India",
    "Amaravathi is the capital of Andhra Pradesh",
    "Sun rises in the East",
    "The Earth revolves around the Sun",
    "Python is a popular programming language",
    "Cricket is the most popular sport in India",
    "The Taj Mahal is located in Agra",
    "Water boils at 100 degrees Celsius",
    "The Himalayas are the tallest mountains in the world",
    "Artificial Intelligence is transforming many industries"
]

# Example query
query = "Which city is the capital of India?"

# Get embeddings
doc_embeddings = embedding.embed_documents(documents)
query_embedding = embedding.embed_query(query)

# Convert to numpy arrays
doc_embeddings = np.array(doc_embeddings)
query_embedding = np.array(query_embedding).reshape(1, -1)

# Compute cosine similarity
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

# Pair documents with similarity scores
scored_docs = list(zip(documents, similarities))

# Sort by similarity (descending)
sorted_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)

# Print results
print("Query:", query)
print("\nMost similar sentence:")
print(sorted_docs[0][0], "-> similarity:", sorted_docs[0][1])

print("\nAll sentences ranked by similarity:")
for doc, score in sorted_docs:
    print(f"{doc} -> {score:.4f}")


Query: Which city is the capital of India?

Most similar sentence:
New Delhi is the capital of the India -> similarity: 0.6679662495566434

All sentences ranked by similarity:
New Delhi is the capital of the India -> 0.6680
Amaravathi is the capital of Andhra Pradesh -> 0.5261
Cricket is the most popular sport in India -> 0.3985
The Taj Mahal is located in Agra -> 0.3877
The Himalayas are the tallest mountains in the world -> 0.2036
Artificial Intelligence is transforming many industries -> 0.1141
Sun rises in the East -> 0.1024
Python is a popular programming language -> 0.0917
Water boils at 100 degrees Celsius -> 0.0767
The Earth revolves around the Sun -> 0.0664
