# Embeddings
- Text chunk to vector conversion
- OpenAI
- Ollama - Open Source
- Hugging Face - Open Source
- Google Gemini
- Claude - Anthropic - Paid

# OpenAI API key is placed as an environment variable in the .env file

In [1]:
import os
from dotenv import load_dotenv
load_dotenv() # Load all environment Variables

True

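# Keep the OPENAI_API_KEY value free of non-ASCII characters
- Python treats '' and "" string literals identically, so the quote style used in code does not cause this error
- If the key value contains a non-ASCII character (for example a typographic quote or dash pasted along with the key), requests can fail with an error like the one below, where '\u2014' is a long dash
- UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 156: ordinal not in range(128)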
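
A minimal sketch for locating non-ASCII characters such as the '\u2014' reported in the error above. `find_non_ascii` is a hypothetical helper written for illustration, not part of any library, and the sample key is made up.

```python
def find_non_ascii(value: str) -> list[tuple[int, str]]:
    """Return (position, code point) pairs for every non-ASCII character."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(value) if ord(ch) > 127]

# Hypothetical key value with a long dash pasted in by mistake
sample_key = "sk-test\u2014123"
problems = find_non_ascii(sample_key)
print(problems)  # [(7, 'U+2014')]
```

Running the same check on the value returned by `os.getenv("OPENAI_API_KEY")` should give an empty list for a clean key.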

In [2]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")  # Re-export the key loaded from .env (raises TypeError if it is missing)


# Embeddings
- A set of models that convert text into numerical form
- Embeddings are numerical representations of text that can be used to measure the relatedness between two pieces of text. They are useful for search, clustering, recommendations, anomaly detection, and classification tasks. OpenAI describes its latest embedding models in its announcement blog post.
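
Relatedness between two embeddings is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real model output, so the values only illustrate the idea; `cosine_similarity` is defined here and is not a library call.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values are invented)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```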

## Model: Output Dimension
- text-embedding-3-large: Most capable embedding model for both English and non-English tasks. Output dimension: 3,072
- text-embedding-3-small: Increased performance over the 2nd generation ada embedding model. Output dimension: 1,536
- text-embedding-ada-002: Most capable 2nd generation embedding model, replacing 16 first generation models. Output dimension: 1,536

In [3]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7fe3c2f75db0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7fe3c58bae90>, model='text-embedding-3-large', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [4]:
text="My name is Suraj Bhardwaj"
query_results = embeddings.embed_query(text)
query_results

[0.003885936224833131,
 -0.0036428512539714575,
 -0.010647807270288467,
 -0.004022885579615831,
 0.01696801744401455,
 0.022048836573958397,
 -0.013736014254391193,
 0.021172361448407173,
 -0.03618200495839119,
 0.03431949391961098,
 -0.002764663891866803,
 -0.006837193388491869,
 0.005258853081613779,
 -0.01588611863553524,
 -0.026266874745488167,
 0.025513652712106705,
 0.011709164828062057,
 -0.02591080591082573,
 0.0020268496591597795,
 0.016625644639134407,
 0.04549455642700195,
 -0.02162429317831993,
 -0.021323004737496376,
 0.024869991466403008,
 -0.03785278648138046,
 0.029635827988386154,
 -0.0028451215475797653,
 0.006361294537782669,
 -0.004824039060622454,
 -0.005193802062422037,
 0.010332824662327766,
 0.024965856224298477,
 0.04497414827346802,
 0.018666189163923264,
 0.020651953294873238,
 -0.006744752638041973,
 -0.004204343538731337,
 -0.036236781626939774,
 -0.022760972380638123,
 0.008203262463212013,
 -0.008497703820466995,
 -0.017858188599348068,
 -0.01670781336724

In [6]:
len(query_results) # Length of the embedding vector returned by the model

3072

In [7]:
embeddings_1024 = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
text="My name is Suraj Bhardwaj"
query_results = embeddings_1024.embed_query(text)
len(query_results)

1024
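
OpenAI's documentation describes shortening text-embedding-3 vectors as dropping trailing dimensions and rescaling the remainder to unit length. Assuming that description applies to the `dimensions=1024` call above, the effect can be sketched on a toy vector as follows. `shorten` is a hypothetical helper and the input is not real model output.

```python
import math

def shorten(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector
short = shorten(full, 2)
print(len(short))  # 2
```

This would also explain why the individual values in the 1,024-dimensional output differ from the 3,072-dimensional ones: the rescaling step changes their magnitude.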

In [8]:
query_results

[0.005391980987042189,
 -0.005050536245107651,
 -0.014795936644077301,
 -0.005520022474229336,
 0.0235027763992548,
 0.030559299513697624,
 -0.01899760402739048,
 0.02932630479335785,
 -0.050078555941581726,
 0.047536689788103104,
 -0.003827025881037116,
 -0.00942766759544611,
 0.007317349314689636,
 -0.02190936729311943,
 -0.036382827907800674,
 0.03533952683210373,
 0.01622810773551464,
 -0.03585169464349747,
 0.002805063035339117,
 0.023066485300660133,
 0.06301551312208176,
 -0.0300091952085495,
 -0.029553934931755066,
 0.034504882991313934,
 -0.05243073031306267,
 0.041011299937963486,
 -0.003848366206511855,
 0.008754262700676918,
 -0.006672398187220097,
 -0.007265184540301561,
 0.014293254353106022,
 0.0345238521695137,
 0.062332626432180405,
 0.02585495077073574,
 0.028472693637013435,
 -0.00935653317719698,
 -0.0058093019761145115,
 -0.05026824772357941,
 -0.031526725739240646,
 0.01134355179965496,
 -0.011732419021427631,
 -0.024735771119594574,
 -0.02310442365705967,
 0.0036

# Generating embeddings for the speech.txt file
- Load the speech.txt file using TextLoader()
- Split the loaded doc into chunks using RecursiveCharacterTextSplitter
- Generate the embeddings from the chunks
- Store the embedding vectors in a VectorDB

## Step 1: Data Ingestion

In [10]:
from langchain_community.document_loaders import TextLoader
# Load the text file
loader = TextLoader("speech.txt")
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.\n\n"Text file" refers to a type of container, \nwhile plain text refers to a type of content.')]

## Step 2: Data Transformation

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=50
)
final_documents = text_splitter.split_documents(docs)
final_documents

[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile;'),
 Document(metadata={'source': 'speech.txt'}, page_content='an old alternative name is flat file) is a kind of computer file'),
 Document(metadata={'source': 'speech.txt'}, page_content='that is structured as a sequence of lines of electronic text.'),
 Document(metadata={'source': 'speech.txt'}, page_content='A text file exists stored as data within a computer file system.'),
 Document(metadata={'source': 'speech.txt'}, page_content='"Text file" refers to a type of container, \nwhile plain text refers to a type of content.')]
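
By default the splitter measures `chunk_size` in characters, so every chunk above should be at most 100 characters long. A quick check, with the chunk texts copied from the output above rather than recomputed:

```python
CHUNK_SIZE = 100  # same value passed to RecursiveCharacterTextSplitter above

# Chunk texts copied from the split_documents output
chunks = [
    'A text file (sometimes spelled textfile;',
    'an old alternative name is flat file) is a kind of computer file',
    'that is structured as a sequence of lines of electronic text.',
    'A text file exists stored as data within a computer file system.',
    '"Text file" refers to a type of container, \nwhile plain text refers to a type of content.',
]

lengths = [len(chunk) for chunk in chunks]
print(lengths)
print(all(n <= CHUNK_SIZE for n in lengths))  # True
```

None of these chunks share text even though `chunk_overlap=50` was set, which suggests the overlap setting acts as an upper limit rather than a guarantee.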

## Steps 3 & 4: Embedding Generation and Storage in a Vector Store

## Vector Embedding and Vector Store
- We have a list of document chunks; Chroma.from_documents embeds each one with the given model and stores the vectors
- Vector store DB: Chroma

In [12]:
from langchain_community.vectorstores import Chroma

# Pass the chunked documents and the embedding model to use
db = Chroma.from_documents(final_documents, embeddings_1024)
db

<langchain_community.vectorstores.chroma.Chroma at 0x7fe3c5fb6b90>

## Step 5: Performing a Retrieval Task: Similarity Search

Perform a similarity search on a specific query.

In [None]:
# Retrieve the results for a query from the vector store
query = "A text file exists stored as data within a computer file system."
retrieved_results = db.similarity_search(query)
print(retrieved_results)

[Document(metadata={'source': 'speech.txt'}, page_content='A text file exists stored as data within a computer file system.'), Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile;'), Document(metadata={'source': 'speech.txt'}, page_content='"Text file" refers to a type of container, \nwhile plain text refers to a type of content.'), Document(metadata={'source': 'speech.txt'}, page_content='an old alternative name is flat file) is a kind of computer file')]
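
`similarity_search` returns the `k` nearest chunks, and `k` defaults to 4 in LangChain, which is why four of the five chunks come back. Conceptually, the search embeds the query and ranks the stored vectors by distance. The brute-force sketch below shows that ranking with made-up 2-dimensional vectors; Chroma's real index and distance metric may differ, and `top_k` and the `store` contents are hypothetical.

```python
import math

def top_k(query: list[float], store: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the k document ids whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda doc_id: math.dist(query, store[doc_id]))
    return ranked[:k]

# Invented vectors standing in for five stored chunk embeddings
store = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.9, 0.1],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [0.5, 0.5],
    "chunk-e": [-1.0, 0.0],
}

print(top_k([1.0, 0.0], store))  # ['chunk-a', 'chunk-b', 'chunk-d', 'chunk-c']
```

Passing a smaller `k` to `db.similarity_search(query, k=...)` narrows the result list in the same way.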
