BGE (BAAI General Embedding) is a family of embedding models created by the Beijing Academy of Artificial Intelligence (BAAI), with a focus on retrieval-augmented LLMs.
Its FlagEmbedding library can map any text to a low-dimensional dense vector, which can be used for tasks such as retrieval, classification, clustering, or semantic search. The vectors can also be stored in vector databases for LLM applications.

References:
1. https://github.com/FlagOpen/FlagEmbedding
2. https://huggingface.co/BAAI/bge-large-en

`pip install -U FlagEmbedding langchain-huggingface sentence-transformers scikit-learn scipy`

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist

In [7]:
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en",
    encode_kwargs={"normalize_embeddings": True},  # return unit-length vectors
)

In [8]:
sentences_1 = "Gujarat, located on the western coast of India, is rich in ancient history. Its settlements date back to the Indus Valley Civilization, with cities like Dholavira showing advanced urban planning. Over 4,000 years ago, it became a center of maritime trade. Gujarat played a crucial role in the spread of culture across the Arabian Sea, with links to Mesopotamia, the Persian Gulf, and even Africa."
sentences_2 = "The history of human settlement in Rajasthan goes back more than 100,000 years, with the Indus Valley Civilization marking one of the earliest urban settlements. Sites like Kalibangan show evidence of fire altars, and Rajasthan's strategic location made it a hub for trade. The region's culture and history have been shaped by its rich archaeological past, which continues to be explored."
sentences_3 = "Uttar Pradesh, a state in northern India, has a diverse economy dominated by agriculture, services, and manufacturing. The state is a significant contributor to India’s agriculture sector, producing crops like wheat, rice, and sugarcane. In recent years, the economy has shifted toward industrial growth, including sectors like textiles, electronics, and infrastructure development. The state is also a key player in India’s political landscape."

In [15]:
# Generate embeddings for the three paragraphs
import numpy as np
embeddings_1 = np.array(embeddings.embed_query(sentences_1))
embeddings_2 = np.array(embeddings.embed_query(sentences_2))
embeddings_3 = np.array(embeddings.embed_query(sentences_3))

In [16]:
print(embeddings_1)
print(embeddings_2)

[ 0.03032507  0.02998742 -0.01682428 ... -0.0135049  -0.01568999
  0.00547421]
[ 0.01226222  0.02909873 -0.0154123  ... -0.01017884 -0.00809293
  0.00015997]


In [18]:
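# Calculate cosine similarity (first attempt)
# Note: reshape(-1, 1) turns each vector into n one-feature samples, so the result
# is an n x n matrix of +1/-1 values rather than a single similarity score.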
similarity = cosine_similarity(embeddings_1.reshape(-1, 1), embeddings_2.reshape(-1, 1))
print(similarity)

[[ 1.  1. -1. ... -1. -1.  1.]
 [ 1.  1. -1. ... -1. -1.  1.]
 [-1. -1.  1. ...  1.  1. -1.]
 ...
 [-1. -1.  1. ...  1.  1. -1.]
 [-1. -1.  1. ...  1.  1. -1.]
 [ 1.  1. -1. ... -1. -1.  1.]]
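
The matrix of +1 and -1 values above comes from the array shape, not from the embeddings themselves. The following is a minimal sketch of that effect, assuming toy 4-dimensional vectors in place of the real embeddings and a hand-written `cosine_matrix` helper (a hypothetical stand-in for `cosine_similarity`, not the scikit-learn implementation):

```python
import numpy as np

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy stand-ins for the real embeddings (illustrative values only)
u = np.array([0.3, -0.2, 0.5, 0.1])
v = np.array([0.1, 0.4, -0.3, 0.2])

# reshape(-1, 1): four samples with one feature each -> a 4 x 4 grid of signs
as_columns = cosine_matrix(u.reshape(-1, 1), v.reshape(-1, 1))

# reshape(1, -1): one sample with four features -> a single 1 x 1 score
as_rows = cosine_matrix(u.reshape(1, -1), v.reshape(1, -1))

print(as_columns.shape)
print(as_rows.shape)
```

With one feature per sample, normalizing a row leaves only its sign, so each entry is just the product of two signs. Treating the whole vector as a single row gives the one score that is actually wanted.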


In [19]:
# Compute cosine similarity between paragraphs
# cosine_similarity expects 2D arrays of shape (n_samples, n_features),
# so each 1D embedding is reshaped to (1, n_features).
cosine_sim_1_2 = cosine_similarity(embeddings_1.reshape(1, -1), embeddings_2.reshape(1, -1))
cosine_sim_1_3 = cosine_similarity(embeddings_1.reshape(1, -1), embeddings_3.reshape(1, -1))
cosine_sim_2_3 = cosine_similarity(embeddings_2.reshape(1, -1), embeddings_3.reshape(1, -1))

In [20]:
print("Cosine Similarity between Paragraph 1 and 2:", cosine_sim_1_2)
print("Cosine Similarity between Paragraph 1 and 3:", cosine_sim_1_3)
print("Cosine Similarity between Paragraph 2 and 3:", cosine_sim_2_3)

Cosine Similarity between Paragraph 1 and 2: [[0.88005115]]
Cosine Similarity between Paragraph 1 and 3: [[0.77623076]]
Cosine Similarity between Paragraph 2 and 3: [[0.7639011]]
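
Because the model was loaded with `normalize_embeddings` set to `True`, the vectors should have unit length, in which case cosine similarity reduces to a plain dot product. The sketch below illustrates this on hypothetical toy vectors (the `raw` values are made up for illustration and are not model output), computing all pairwise similarities with one matrix product:

```python
import numpy as np

# Hypothetical toy vectors standing in for the three paragraph embeddings
raw = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 1.0, 2.0],
    [-1.0, 0.5, 3.0],
])

# L2-normalize each row, mirroring what normalize_embeddings=True is assumed to do
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# For unit-length rows, one matrix product gives every pairwise cosine similarity
similarity = unit @ unit.T

print(np.round(similarity, 4))
```

The diagonal is 1 (each vector compared with itself), and the matrix is symmetric, so the three off-diagonal values in the upper triangle correspond to the three pairs printed above.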


In [21]:
# Compute Euclidean distance between paragraphs
# cdist from scipy.spatial.distance also expects 2D arrays of shape (n_samples, n_features).
euclidean_dist_1_2 = cdist(embeddings_1.reshape(1, -1), embeddings_2.reshape(1, -1), metric='euclidean')
euclidean_dist_1_3 = cdist(embeddings_1.reshape(1, -1), embeddings_3.reshape(1, -1), metric='euclidean')
euclidean_dist_2_3 = cdist(embeddings_2.reshape(1, -1), embeddings_3.reshape(1, -1), metric='euclidean')

In [22]:
print("Euclidean Distance between Paragraph 1 and 2:", euclidean_dist_1_2)
print("Euclidean Distance between Paragraph 1 and 3:", euclidean_dist_1_3)
print("Euclidean Distance between Paragraph 2 and 3:", euclidean_dist_2_3)

Euclidean Distance between Paragraph 1 and 2: [[0.48979355]]
Euclidean Distance between Paragraph 1 and 3: [[0.66898318]]
Euclidean Distance between Paragraph 2 and 3: [[0.68716654]]
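
For unit-length vectors the two metrics carry the same information: expanding the squared distance gives ||a - b||^2 = 2 - 2 cos(a, b). As a quick check, assuming the embeddings are exactly unit length, the sketch below recomputes the distances from the cosine similarities printed earlier (the helper name `distance_from_cosine` is made up for this example):

```python
import math

# Cosine similarities printed by the notebook above
cosine = {"1-2": 0.88005115, "1-3": 0.77623076, "2-3": 0.7639011}

def distance_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with the given cosine."""
    return math.sqrt(2.0 - 2.0 * cos_sim)

derived = {pair: distance_from_cosine(value) for pair, value in cosine.items()}

for pair, dist in derived.items():
    print(f"Paragraph {pair}: {dist:.6f}")
```

The derived values agree with the `cdist` results to within rounding, so both metrics rank the paragraph pairs identically: paragraphs 1 and 2 are closest, and paragraphs 2 and 3 are farthest apart.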
