# Exploring Embedding Models in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [1]:
!pip install sentence-transformers

# Install scipy and scikit-learn
!pip install scipy scikit-learn

# Now install the langchain components
!pip install langchain==0.3.11
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install langchain-huggingface==0.1.2



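## Enter OpenAI and HuggingFace API Tokens

The cell below reads both tokens from Colab secrets named `HF_TOKEN` and `OPENAI_API_KEY`, so add them to the notebook's secrets before running it.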

## Setup Environment Variables

In [2]:
from google.colab import userdata
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
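
`google.colab.userdata` is only available inside Colab. When running elsewhere, one option is to export the tokens in the shell first and check for them in the notebook. A minimal sketch of such a check, where the helper name `require_env` is hypothetical and not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the named environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the notebook."
        )
    return value


# Example usage (assumes the variables were exported beforehand):
# openai_key = require_env("OPENAI_API_KEY")
# hf_token = require_env("HUGGINGFACEHUB_API_TOKEN")
```

Failing early with a named variable is easier to debug than an authentication error raised later by the first embedding call.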

## Embedding models

The `Embeddings` class is LangChain's interface for text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).

In [3]:
docs = [
    "cats eat and sleep",
    "dogs eat and bark",
    "cars drive fast",
    "vehicles include trucks and cars"
]

In terms of inputs, `.embed_documents` takes a list of texts and `.embed_query` takes a single text. Their outputs differ accordingly:

- `.embed_query` returns a list of floats.
- `.embed_documents` returns a list of lists of floats, one inner list per input text.
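
To make the two return shapes concrete without calling any provider, here is a minimal sketch. `ToyEmbeddings` is a hypothetical stand-in (not a LangChain class) that mimics the two-method interface with a hashed bag-of-words vector; real embedding models return dense learned vectors instead.

```python
import hashlib


class ToyEmbeddings:
    """Hypothetical stand-in that mimics the two-method embedding interface."""

    def __init__(self, size: int = 8):
        self.size = size  # vector length, chosen arbitrarily for the demo

    def embed_query(self, text: str) -> list[float]:
        # Count each word into a bucket picked by a stable hash of the word.
        vector = [0.0] * self.size
        for word in text.lower().split():
            digest = hashlib.md5(word.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % self.size] += 1.0
        return vector

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # One vector per input text, in the same order as the input.
        return [self.embed_query(text) for text in texts]


toy = ToyEmbeddings()
query_vector = toy.embed_query("cats eat and sleep")
doc_vectors = toy.embed_documents(["cats eat and sleep", "dogs eat and bark"])

print(len(query_vector))                      # length of a single vector
print(len(doc_vectors), len(doc_vectors[0]))  # number of texts, vector length
```

The same shape checks (`len(embeddings)` and `len(embeddings[0])`) are what the cells below run against the real models.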

### OpenAI Embedding Models

LangChain gives access to OpenAI embedding models, including the `text-embedding-3` family: the smaller, highly efficient `text-embedding-3-small` and the larger, more powerful `text-embedding-3-large`.

In [4]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

In [5]:
embeddings = openai_embed_model.embed_documents(docs)

In [6]:
len(embeddings)

4

In [7]:
len(embeddings[0])

1536

In [8]:
print(embeddings[0])

[-0.014177093282341957, -0.004976172931492329, -0.009036445990204811, 0.01158419530838728, 0.020240090787410736, 0.026702988892793655, 0.023839188739657402, 0.035087984055280685, -0.0050664725713431835, 0.0032314485870301723, 0.05015517771244049, 0.0012254994362592697, 0.03108898550271988, 0.046955980360507965, 0.0034668734297156334, 0.01676999218761921, 0.008081846870481968, 0.012242094613611698, -0.006998247001320124, 0.03550078347325325, 0.0372035838663578, -0.02899918705224991, 0.03212098404765129, -0.03583618253469467, -0.029721586033701897, 0.054747574031353, -0.00852044578641653, 0.03351418673992157, 0.012125995010137558, -0.029308786615729332, 0.02483248896896839, -0.040686581283807755, -0.01138424500823021, -0.04086718335747719, 0.011397144757211208, 0.023078089579939842, 0.020214291289448738, -0.031475987285375595, -0.0067724972032010555, 0.013286993838846684, -0.009333145804703236, 0.01478339359164238, 0.021813889965415, -0.009494395926594734, -0.009068695828318596, -0.02249

In [9]:
docs

['cats eat and sleep',
 'dogs eat and bark',
 'cars drive fast',
 'vehicles include trucks and cars']

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = cosine_similarity(embeddings)
sim_matrix


array([[1.        , 0.6126697 , 0.20352739, 0.19823333],
       [0.6126697 , 1.        , 0.26190051, 0.2003429 ],
       [0.20352739, 0.26190051, 1.        , 0.39909568],
       [0.19823333, 0.2003429 , 0.39909568, 1.        ]])
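
Each entry of this matrix is the cosine similarity of two embedding vectors: their dot product divided by the product of their norms. A minimal pure-Python sketch of that formula for a single pair (the helper name `cosine` is illustrative and not part of scikit-learn):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors, 0
print(round(same_direction, 6), orthogonal)
```

In the matrix above, the diagonal is 1 because each text is compared with itself, and the two animal sentences (0.61) and the two vehicle sentences (0.40) score higher with each other than with sentences on the other topic.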

### Open Source Embedding Models on HuggingFace

The `langchain-huggingface` package exposes Hugging Face models through LangChain's standard interfaces, so they can be used in the same way as any other provider.

`HuggingFaceEmbeddings` uses `sentence-transformers` under the hood. It computes embeddings locally, using your own machine's resources, and lets you use open or open-source embedding models hosted on Hugging Face.

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings

# check out model details here: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
model_name = "mixedbread-ai/mxbai-embed-large-v1"

hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
)

In [8]:
embeddings = hf_embeddings.embed_documents(docs)

In [9]:
len(embeddings)

4

In [10]:
len(embeddings[0])

1024

In [11]:
docs

['cats eat and sleep',
 'dogs eat and bark',
 'cars drive fast',
 'vehicles include trucks and cars']

In [12]:
sim_matrix = cosine_similarity(embeddings)
sim_matrix

array([[1.        , 0.52516038, 0.3422186 , 0.33691793],
       [0.52516038, 1.        , 0.31728604, 0.33082582],
       [0.3422186 , 0.31728604, 1.        , 0.7225333 ],
       [0.33691793, 0.33082582, 0.7225333 , 1.        ]])
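
One way to read a similarity matrix like this is to look up, for each text, the other text with the highest score. The sketch below does that on the values printed above (copied by hand, so it runs without the model); the helper name `nearest_neighbours` is hypothetical.

```python
docs = [
    "cats eat and sleep",
    "dogs eat and bark",
    "cars drive fast",
    "vehicles include trucks and cars",
]

# Values copied from the Hugging Face similarity matrix above.
sim_matrix = [
    [1.0, 0.52516038, 0.3422186, 0.33691793],
    [0.52516038, 1.0, 0.31728604, 0.33082582],
    [0.3422186, 0.31728604, 1.0, 0.7225333],
    [0.33691793, 0.33082582, 0.7225333, 1.0],
]


def nearest_neighbours(matrix):
    """For each row, return the column index of the highest off-diagonal score."""
    result = []
    for i, row in enumerate(matrix):
        # Skip the diagonal: every text is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result


neighbours = nearest_neighbours(sim_matrix)
for i, j in enumerate(neighbours):
    print(f"{docs[i]!r} -> {docs[j]!r}")
```

Each animal sentence pairs with the other animal sentence and each vehicle sentence with the other vehicle sentence, which is the grouping the embeddings are expected to capture.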

## Build a small search engine!

### Load knowledge base documents

In [13]:
documents = [
    'Quantum mechanics describes the behavior of very small particles.',
    'Photosynthesis is the process by which green plants make food using sunlight.',
    "Shakespeare's plays are a testament to English literature.",
    'Artificial Intelligence aims to create machines that can think and learn.',
    'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
    'Biology is the study of living organisms and their interactions with the environment.',
    'Music therapy can aid in the mental well-being of individuals.',
    'The Milky Way is just one of billions of galaxies in the universe.',
    'Economic theories help understand the distribution of resources in society.',
    'Yoga is an ancient practice that involves physical postures and meditation.'
]

In [14]:
len(documents)

10

### Get document embeddings

In [15]:
document_embeddings = openai_embed_model.embed_documents(documents)

### Let's try to find the most similar document for one query

In [16]:
new_text = 'What is AI?'
new_text

'What is AI?'

In [17]:
query_embedding = openai_embed_model.embed_query(new_text)

In [18]:
cosine_similarities = cosine_similarity([query_embedding], document_embeddings)
cosine_similarities

array([[ 0.10212454,  0.09252234, -0.00534622,  0.6313588 ,  0.02336383,
         0.09317177,  0.10769226,  0.07003597,  0.05817442,  0.06610282]])

In [19]:
import numpy as np

documents[np.argmax(cosine_similarities[0])]

'Artificial Intelligence aims to create machines that can think and learn.'

### Create Search Engine function

In [20]:
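def semantic_search_engine(query, embedder_model):
    # NOTE: relies on the global `documents` and `document_embeddings`. Pass the same
    # model that produced `document_embeddings`, or the vectors are not comparable.
    query_embedding = embedder_model.embed_query(query)
    # Score the query against every document and return the best match.
    cos_scores = cosine_similarity([query_embedding], document_embeddings)[0]
    top_result_id = np.argmax(cos_scores)
    return documents[top_result_id]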

In [21]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, openai_embed_model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [22]:
new_sentence = 'Do you know about the pyramids?'
semantic_search_engine(new_sentence, openai_embed_model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [23]:
new_sentence = 'How do plants survive?'
semantic_search_engine(new_sentence, openai_embed_model)

'Photosynthesis is the process by which green plants make food using sunlight.'
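
The search function above returns only the single best match. A small extension, sketched here under the assumption that a ranked shortlist is wanted, sorts the scores and keeps the top `k` document indices. The helper name `top_k_indices` is hypothetical, and the scores are copied from the 'What is AI?' output earlier so the example runs without an API call.

```python
def top_k_indices(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Cosine similarities between 'What is AI?' and the ten documents (from above).
scores = [
    0.10212454, 0.09252234, -0.00534622, 0.6313588, 0.02336383,
    0.09317177, 0.10769226, 0.07003597, 0.05817442, 0.06610282,
]

best = top_k_indices(scores, k=3)
print(best)
```

The first index corresponds to the Artificial Intelligence document, matching the `np.argmax` result; the remaining entries have much lower scores, which is a useful signal that only one document is really relevant to the query.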