## Setup Environments

In [1]:
%run Setup.ipynb

## Embedding models

The `Embeddings` class is LangChain's interface to text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This lets us reason about text in a vector space and do things like semantic search, where we look for the pieces of text that are closest to a query in that space.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).

In [2]:
docs = [
    "cats eat and sleep",
    "dogs eat and bark",
    "cars drive fast",
    "vehicles include trucks and cars"
]


In LangChain these two methods are `.embed_documents`, which takes multiple texts, and `.embed_query`, which takes a single text:

- `.embed_query` returns a list of floats.
- `.embed_documents` returns a list of lists of floats.
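
To make the shape of this interface concrete, here is a minimal sketch of a class that exposes the same two methods. `ToyEmbeddings` is a hypothetical stand-in written for this illustration, not a LangChain class: it hashes words into buckets, so its vectors carry no real semantic meaning.

```python
import hashlib
import math
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in for an embedding model (not a LangChain class).

    Each word is hashed into one of `dim` buckets, so the vectors only
    illustrate the two-method interface, not real semantics.
    """

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dim
        for word in text.lower().split():
            # md5 gives a stable hash, so results are reproducible across runs.
            digest = hashlib.md5(word.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % self.dim] += 1.0
        # Scale to unit length (fall back to 1.0 for empty text).
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        return [v / norm for v in vector]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out.
        return self._embed(text)


toy = ToyEmbeddings(dim=8)
doc_vectors = toy.embed_documents(["cats eat and sleep", "cars drive fast"])
query_vector = toy.embed_query("cats eat and sleep")

print(len(doc_vectors), len(doc_vectors[0]))  # number of texts, vector size
print(len(query_vector))                      # vector size
```

In this toy version both methods share the same code path; a real provider may treat documents and queries differently, which is why the interface keeps them separate.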

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the newest ones: the smaller, highly efficient `text-embedding-3-small` and the larger, more powerful `text-embedding-3-large`.

In [12]:
from pprint import pprint

## OpenAI's `text-embedding-3-small` Model Specifications

By default, OpenAI's `text-embedding-3-small` model outputs vectors with **1536 dimensions**. This is a design choice OpenAI made when creating the model.

---

### Key Points About the 1536 Dimension Size

#### 1. Model Architecture Design
- OpenAI designed `text-embedding-3-small` to output **1536-dimensional vectors** by default.
- This size balances **performance** and **computational efficiency**.
- It is smaller than the larger `text-embedding-3-large` model, which outputs **3072 dimensions**.

#### 2. Why 1536 Specifically?
- **Multiple of a power of 2:** 1536 is not itself a power of 2, but 1536 = 3 × 512, where 512 = 2⁹ (a common size in neural networks).
- **Computational Efficiency:** Sizes that are multiples of powers of 2 tend to work well with modern GPU architectures.
- **Information Capacity:** 1536 dimensions provide enough capacity to capture semantic meaning effectively.


In [5]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model="text-embedding-3-small")      

In [6]:
embeddings = openai_embed_model.embed_documents(docs)

In [9]:
print(f"The length of the embeddings is {len(embeddings)}")

The length of the embeddings is 4


In [22]:
embeddings

[[-0.014177093282341957, -0.004976172931492329, -0.009036445990204811, 0.01158419530838728, 0.020240090787410736, 0.026702988892793655, 0.023839188739657402, 0.035087984055280685, -0.0050664725713431835, 0.0032314485870301723, 0.05015517771244049, 0.0012254994362592697, 0.03108898550271988, 0.046955980360507965, 0.0034668734297156334, 0.01676999218761921, 0.008081846870481968, 0.012242094613611698, -0.006998247001320124, 0.03550078347325325, 0.0372035838663578, -0.02899918705224991, 0.03212098404765129, -0.03583618253469467, -0.029721586033701897, 0.054747574031353, -0.00852044578641653, 0.03351418673992157, 0.012125995010137558, -0.029308786615729332, 0.02483248896896839, -0.040686581283807755, -0.01138424500823021, -0.04086718335747719, 0.011397144757211208, 0.023078089579939842, 0.020214291289448738, -0.031475987285375595, -0.0067724972032010555, 0.013286993838846684, -0.009333145804703236, 0.01478339359164238, 0.021813889965415, -0.009494395926594734, -0.009068695828318596, -0.0224

In [18]:
# Loop through each embedding vector in the embeddings list
for embedding in embeddings:
    # Print the length of the current embedding vector
    # This shows how many dimensions (features) each embedding has
    print(len(embedding))

1536
1536
1536
1536


#### Customizable Dimensions
The output size is not fixed. The `dimensions` parameter lets you request shorter vectors from the model:

In [23]:
from langchain_openai import OpenAIEmbeddings

# Default 1536 dimensions
openai_embed_default = OpenAIEmbeddings(model="text-embedding-3-small")

# Custom dimensions (can be reduced for efficiency)
openai_embed_custom = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=512  # Reduce to 512 dimensions
)

In [24]:
embeddings_custom = openai_embed_custom.embed_documents(docs)
embeddings_custom

[[-0.0207798033952713, -0.007293730042874813, -0.013244997709989548, 0.016979312524199486, 0.02966652624309063, 0.03913939371705055, 0.034941837191581726, 0.051429543644189835, -0.007426085416227579, 0.004736433736979961, 0.07351399213075638, 0.0017962524434551597, 0.045568086206912994, 0.06882482767105103, 0.00508150365203619, 0.024580296128988266, 0.011845812201499939, 0.017943615093827248, -0.010257546789944172, 0.05203459411859512, 0.054530441761016846, -0.04250500351190567, 0.04708072170615196, -0.05252620205283165, -0.0435638464987278, 0.08024521172046661, -0.012488680891692638, 0.049122776836156845, 0.017773443832993507, -0.04295879229903221, 0.036397743970155716, -0.05963557958602905, -0.01668624021112919, -0.05990029126405716, 0.016705147922039032, 0.03382626920938492, 0.029628710821270943, -0.046135324984788895, -0.009926658123731613, 0.01947515830397606, -0.01367987971752882, 0.02166847698390484, 0.03197329118847847, -0.013916228897869587, -0.013292267918586731, -0.032975412

In [25]:
# Loop through each embedding vector in the embeddings list
for embedding in embeddings_custom:
    # Print the length of the current embedding vector
    # This shows how many dimensions (features) each embedding has
    print(len(embedding))

512
512
512
512
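
The first values of the 512-dimensional vector printed above are the same as those of the 1536-dimensional vector up to a constant factor of about 1.47. That is consistent with keeping the leading values and rescaling to unit length, although this is an inference from the printed output, not a statement about OpenAI's implementation. The sketch below shows that operation on a random vector; `shorten_embedding` is a hypothetical helper written for this illustration.

```python
import numpy as np


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale to unit length."""
    truncated = np.asarray(vector, dtype=float)[:dimensions]
    return truncated / np.linalg.norm(truncated)


# Random unit vector standing in for a 1536-dimensional embedding.
rng = np.random.default_rng(seed=0)
full = rng.normal(size=1536)
full = full / np.linalg.norm(full)

short = shorten_embedding(full, 512)

print(short.shape)            # (512,)
print(np.linalg.norm(short))  # approximately 1.0
# Every kept value is scaled by the same constant factor.
print(np.allclose(short / full[:512], short[0] / full[0]))
```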


#### Embedding Model Dimension Comparison
- `text-embedding-3-small`: 1536 dimensions (default)
- `text-embedding-3-large`: 3072 dimensions
- `text-embedding-ada-002`: 1536 dimensions
- Sentence Transformers: varies (e.g., 384, 768, 1024)

#### Higher dimensions (1536+)
- ✅ Better semantic representation
- ✅ More nuanced understanding
- ❌ More storage space
- ❌ Slower similarity computations

#### Lower dimensions (256-512)
- ✅ Faster computations
- ✅ Less storage needed
- ❌ Potential loss of semantic detail
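
The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and uses a hypothetical corpus of one million vectors; it counts only the raw vector data, not any index overhead.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def index_size_mb(num_vectors: int, dimensions: int) -> float:
    """Raw storage for the vectors alone, in megabytes (10**6 bytes)."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1_000_000


num_vectors = 1_000_000  # hypothetical corpus size
for dims in (512, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_mb(num_vectors, dims):,.0f} MB")
```

Under these assumptions, 1536-dimensional vectors take three times the space of 512-dimensional ones, and a brute-force similarity scan does proportionally more multiplications per comparison.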

In [26]:
len(embeddings[0])

1536

In [27]:
docs

['cats eat and sleep', 'dogs eat and bark', 'cars drive fast', 'vehicles include trucks and cars']

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix between all pairs of document embeddings.
# The resulting sim_matrix is a square matrix where each entry [i, j] represents
# the similarity between document i and document j. Values range from -1 (opposite)
# to 1 (identical), with 1s on the diagonal (each document is identical to itself).
sim_matrix = cosine_similarity(embeddings)
sim_matrix  # Shows how similar each document is to every other document.

array([[1.        , 0.6126697 , 0.20352739, 0.19823333],
       [0.6126697 , 1.        , 0.26190051, 0.2003429 ],
       [0.20352739, 0.26190051, 1.        , 0.39909568],
       [0.19823333, 0.2003429 , 0.39909568, 1.        ]])
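
For intuition about what `cosine_similarity` computes: the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths, a·b / (‖a‖‖b‖). The sketch below reproduces that with plain NumPy on made-up 3-dimensional vectors. `cosine_similarity_matrix` is a hypothetical helper, and it assumes no row is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    matrix = np.asarray(vectors, dtype=float)
    # Scale every row to unit length; the dot product of two unit vectors
    # is then the cosine of the angle between them.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    return unit @ unit.T


# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_sim, 3))
```

The first two toy vectors are 45° apart, so their similarity is about 0.707. The third is orthogonal to both, so its similarity to them is 0. Its larger length does not matter, because cosine similarity ignores magnitude.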

## Open Source Embedding Models on Hugging Face

The `langchain-huggingface` package integrates Hugging Face models into the LangChain ecosystem.

`HuggingFaceEmbeddings` uses `sentence-transformers` embeddings. It computes embeddings locally, using your own machine's resources, and gives you access to open or open-source embedding models hosted on Hugging Face.

In [34]:
!pip install -qq langchain_huggingface

In [35]:
from langchain_huggingface import HuggingFaceEmbeddings

# check out model details here: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
model_name = "mixedbread-ai/mxbai-embed-large-v1"

hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
)


In [36]:
embeddings = hf_embeddings.embed_documents(docs)

In [37]:
len(embeddings)

4

In [38]:
len(embeddings[0])

1024

In [39]:
docs

['cats eat and sleep', 'dogs eat and bark', 'cars drive fast', 'vehicles include trucks and cars']

In [40]:
sim_matrix = cosine_similarity(embeddings)
sim_matrix

array([[1.        , 0.52515984, 0.34221797, 0.3369172 ],
       [0.52515984, 1.        , 0.31728526, 0.33082526],
       [0.34221797, 0.31728526, 1.        , 0.72253285],
       [0.3369172 , 0.33082526, 0.72253285, 1.        ]])

## Build a small search engine!

### Load Knowledgebase documents

We have 10 documents in our knowledge base.

In [41]:
documents = [
    'Quantum mechanics describes the behavior of very small particles.',
    'Photosynthesis is the process by which green plants make food using sunlight.',
    "Shakespeare's plays are a testament to English literature.",
    'Artificial Intelligence aims to create machines that can think and learn.',
    'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
    'Biology is the study of living organisms and their interactions with the environment.',
    'Music therapy can aid in the mental well-being of individuals.',
    'The Milky Way is just one of billions of galaxies in the universe.',
    'Economic theories help understand the distribution of resources in society.',
    'Yoga is an ancient practice that involves physical postures and meditation.'
]

In [42]:
len(documents)

10

In [43]:
# Get document embeddings
document_embeddings = openai_embed_model.embed_documents(documents)

In [52]:
document_embeddings

[[-0.06414780765771866, 0.015736594796180725, -0.04284387081861496, 0.018375450745224953, -0.0320095419883728, -0.016036951914429665, -0.021582841873168945, 0.0713563933968544, -0.03402622789144516, 0.03788796812295914, -0.012314663268625736, 0.0005883121048100293, -0.04827176034450531, 0.020145416259765625, 0.019394520670175552, 0.02379261516034603, 0.0025879028253257275, -0.01640167273581028, 0.05260549485683441, 0.03949702903628349, -0.01271156407892704, -0.046598341315984726, 0.031237194314599037, 0.010469608940184116, 0.012958286330103874, -0.030786657705903053, 0.027825988829135895, 0.01542550977319479, 0.037072714418172836, 0.026924915611743927, 0.07234328240156174, -0.029070327058434486, -0.009670442901551723, -0.05269131064414978, -0.06307510286569595, 0.011102505028247833, -0.0005903234123252332, -0.04509655386209488, 0.012893923558294773, 0.0005923347198404372, -0.014470801688730717, -0.047284871339797974, -0.053549475967884064, 0.021271755918860435, 0.005497617181390524, 0.

### Let's try to find the most similar document for one query

In [44]:
new_text = 'What is AI?'
new_text

'What is AI?'

In [45]:
query_embedding = openai_embed_model.embed_query(new_text)

In [54]:
len(query_embedding)

1536

#### Query vs. Documents Similarity Analysis

- Compare one query embedding: *"What is AI?"*
- Against the 10 document embeddings from the knowledge base
- Objective: find the document most similar to the query using cosine similarity

In [48]:
# Compute the cosine similarity between the query embedding and each document embedding
cosine_similarities = cosine_similarity([query_embedding], document_embeddings)
cosine_similarities

array([[ 0.10209964,  0.09248553, -0.00529486,  0.63144386,  0.02336383,
         0.09323297,  0.10764793,  0.0699893 ,  0.05817442,  0.06608819]])

In [56]:
print(cosine_similarities)

[[ 0.10209964  0.09248553 -0.00529486  0.63144386  0.02336383  0.09323297
   0.10764793  0.0699893   0.05817442  0.06608819]]


In [64]:
# From the output above, the 4th document (index 3) has the highest similarity to the query "What is AI?"

most_similar_document = documents[3]
most_similar_document

'Artificial Intelligence aims to create machines that can think and learn.'

We can also get it using `np.argmax`:

In [66]:
import numpy as np

documents[np.argmax(cosine_similarities[0])]    

'Artificial Intelligence aims to create machines that can think and learn.'
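
`np.argmax` returns only the single best match. If several candidates are wanted, one option is to rank all scores with `np.argsort`. The sketch below uses the scores copied from the `cosine_similarities` output above; `top_k` is a name chosen for this illustration.

```python
import numpy as np

# Scores copied from the `cosine_similarities` output above.
scores = np.array([
    0.10209964, 0.09248553, -0.00529486, 0.63144386, 0.02336383,
    0.09323297, 0.10764793, 0.0699893, 0.05817442, 0.06608819,
])

top_k = 3
# argsort sorts ascending, so reverse it and keep the first `top_k` indices.
top_indices = np.argsort(scores)[::-1][:top_k]

for idx in top_indices:
    print(idx, round(float(scores[idx]), 4))
```

The ranking is index 3 (the AI document), then 6 (music therapy), then 0 (quantum mechanics). The large gap between the first score and the rest suggests only the first is a meaningful match for this query.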

### Create Search Engine function

In [67]:
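def semantic_search_engine(query, embedder_model):
    # Embed the query with the same model that produced `document_embeddings`;
    # vectors from different models are not comparable.
    query_embedding = embedder_model.embed_query(query)
    # Similarity between the query and every document in the knowledge base
    cos_scores = cosine_similarity([query_embedding], document_embeddings)[0]
    # Index of the highest-scoring document
    top_result_id = np.argmax(cos_scores)
    return documents[top_result_id]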
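
The function above reads `documents` and `document_embeddings` from the notebook's global state. A variant that takes the corpus as arguments and returns scores alongside the text could look like the sketch below. `KeywordEmbedder` and `search` are hypothetical names for this illustration. The embedder only counts a few fixed keywords so that the example runs without an API key.

```python
import numpy as np


class KeywordEmbedder:
    """Hypothetical embedder: counts occurrences of a few fixed keywords."""

    vocabulary = ("ai", "plants", "pyramids")

    def embed_query(self, text):
        words = text.lower().replace("?", "").split()
        return [float(words.count(term)) for term in self.vocabulary]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


def search(query, embedder, documents, document_embeddings, top_k=1):
    """Return the `top_k` (document, score) pairs most similar to `query`."""
    query_vec = np.asarray(embedder.embed_query(query), dtype=float)
    doc_matrix = np.asarray(document_embeddings, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against zero-length vectors to avoid division by zero.
    scores = doc_matrix @ query_vec / np.where(denom == 0, 1.0, denom)
    best = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in best]


corpus = ["ai is everywhere", "plants need light", "pyramids are old"]
embedder = KeywordEmbedder()
corpus_vectors = embedder.embed_documents(corpus)

results = search("tell me about plants", embedder, corpus, corpus_vectors)
print(results)
```

Passing the corpus explicitly makes it harder to pair a query embedder with document vectors produced by a different model.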

### Try out the function

In [68]:
new_sentence = 'Tell me about AI'
semantic_search_engine(new_sentence, openai_embed_model)

'Artificial Intelligence aims to create machines that can think and learn.'

In [69]:
new_sentence = 'Do you know about the pyramids?'
semantic_search_engine(new_sentence, openai_embed_model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [70]:
new_sentence = 'How do plants survive?'
semantic_search_engine(new_sentence, openai_embed_model)

'Photosynthesis is the process by which green plants make food using sunlight.'