---
**Embedding Techniques**

Converting text into vectors


![Embeddings concept](embeddings_concept-975a9aaba52de05b457a1aeff9a7393a.png)

1. **Embed text as a vector:** Embeddings transform text into a numerical vector representation.

2. **Measure similarity:** Embedding vectors can be compared using simple mathematical operations.
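One common choice for the "simple mathematical operation" in step 2 is cosine similarity. The sketch below uses plain Python and made-up 3-dimensional vectors (`cat`, `kitten`, `car` are illustrative values, not real model output) to show that vectors pointing in similar directions score higher.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, car)
print(sim_close > sim_far)  # prints: True
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.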

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()  # load variables from a local .env file into the environment

True

In [4]:
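# Re-export the key for libraries that read it from the environment.
# If the key is missing, os.getenv returns None and this assignment raises TypeError.
os.environ["OPENAI_API_KEY"]=os.getenv("OPENAI_API_KEY")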

**Embedding Models**

Embedding models are trained specifically to generate vector embeddings: long arrays of numbers that represent the semantic meaning of a given sequence of text.

![What are embeddings](https://ollama.com/public/blog/what-are-embeddings.svg)


The resulting embedding vectors can then be stored in a vector database, which compares them to find data that is similar in meaning.
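To make the storage-and-compare idea concrete, here is a minimal in-memory sketch. `TinyVectorStore` is a hypothetical class written for illustration and the vectors are made up; a real vector database indexes the vectors instead of scanning every entry.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and scans them all."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=4):
        # Never return more results than there are stored items.
        k = min(k, len(self._items))
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("cats purr", [0.9, 0.1, 0.0])
store.add("dogs bark", [0.7, 0.3, 0.1])
store.add("stock markets fell", [0.0, 0.2, 0.9])

results = store.search([0.85, 0.15, 0.0], k=2)
print(results)  # prints: ['cats purr', 'dogs bark']
```

Asking for more results than the store holds simply returns everything, ranked from most to least similar.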

Embedding models create a vector representation of a piece of text. LangChain wraps several providers, each shipped in its own package (provider -> package):

* AzureOpenAI -> langchain-openai
* Ollama -> langchain-ollama
* Fake -> langchain-core
* OpenAI -> langchain-openai
* IBM -> langchain-ibm
* NVIDIA -> langchain-nvidia
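LangChain embedding classes share two methods, `embed_query` for a single text and `embed_documents` for a list, which is what makes the providers above interchangeable. The sketch below mimics that interface without any API call. `HashEmbeddings` is a hypothetical stand-in written for this note (it is not the fake model from langchain-core), and its vectors carry no semantic meaning.

```python
import hashlib

class HashEmbeddings:
    """Hypothetical offline stand-in for a provider class; not part of LangChain."""

    def __init__(self, size=8):
        if not 0 < size <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("size must be between 1 and 32")
        self.size = size

    def embed_query(self, text):
        # Map the first `size` digest bytes to floats in [0, 1]; same text, same vector.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.size]]

    def embed_documents(self, texts):
        # One vector per input text, in the same order.
        return [self.embed_query(text) for text in texts]

fake = HashEmbeddings(size=8)
vector = fake.embed_query("Hello, how are you?")
batch = fake.embed_documents(["first chunk", "second chunk"])
print(len(vector), len(batch))  # prints: 8 2
```

A stand-in like this is only useful for testing the plumbing around an embedding model, since similar texts do not get similar vectors.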

In [5]:
from langchain_openai import OpenAIEmbeddings

In [14]:
# No dimensions argument, so the model's default vector size is used
embedding=OpenAIEmbeddings(model='text-embedding-3-small')
embedding


OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x000002307C7F1720>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x000002307C898970>, model='text-embedding-3-small', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [7]:
text="Hello, how are you?"
query_res=embedding.embed_query(text)
query_res


[0.023070387542247772,
 -0.05002918466925621,
 -0.011751905083656311,
 0.04363931715488434,
 -0.008544588461518288,
 -0.02290940284729004,
 0.011070813983678818,
 0.02125001884996891,
 -0.019194364547729492,
 -0.07529144734144211,
 -0.0121481753885746,
 -0.015665078535676003,
 0.0016624797135591507,
 -0.04864223673939705,
 0.004554017912596464,
 0.05087126046419144,
 -0.025485163554549217,
 0.057459261268377304,
 -0.009349512867629528,
 0.033683013170957565,
 0.06508747488260269,
 -0.0028481960762292147,
 0.009368088096380234,
 0.02277318574488163,
 0.04106355831027031,
 0.012878799811005592,
 0.009188528172671795,
 -0.0045942640863358974,
 0.009968685917556286,
 -0.012965483590960503,
 0.0703875944018364,
 -0.03256850317120552,
 -0.003337342757731676,
 0.02895253151655197,
 -0.015107822604477406,
 0.01378279272466898,
 -0.012265818193554878,
 0.014983988367021084,
 -0.0027506763581186533,
 -0.07122966647148132,
 0.0025850476231426,
 -0.02365241013467312,
 0.006365098990499973,
 0.0225

In [13]:
query_res[0], len(query_res)

(0.023070387542247772, 1536)

In [15]:
# Request 1024-dimensional vectors instead of the model's default size
embedding_1024=OpenAIEmbeddings(model='text-embedding-3-large',dimensions=1024)


In [16]:
embedding_1024

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x000002307C89AC20>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x000002307D26CC10>, model='text-embedding-3-large', dimensions=1024, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [17]:
text='This is an OPENAI embedding model example'
query_res_1024=embedding_1024.embed_query(text)
query_res_1024[0], len(query_res_1024)


(-0.008114777505397797, 1024)
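The `dimensions=1024` argument asks the API for a shorter vector. OpenAI's embeddings guide also describes shortening a vector by hand: keep the leading values, then rescale to unit length. The sketch below shows that manual step on a toy vector. `shorten_embedding` is a hypothetical helper, and treating its result as equivalent to what the API returns for `dimensions` is an assumption.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    truncated = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in truncated]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector, not real model output
short = shorten_embedding(full, 2)
print(len(short))  # prints: 2
```

Without the rescaling step the shortened vector would no longer have unit length, which matters for comparisons that assume normalized vectors, such as a plain dot product.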

---
**Splitting text using RecursiveCharacterTextSplitter**


In [19]:
# Load speech.txt as a list of LangChain Document objects
from langchain_community.document_loaders import TextLoader

loader=TextLoader('speech.txt')
docs=loader.load()

In [21]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Chunks of at most 500 characters; neighboring chunks may share up to 50 characters
text_splitter=RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=50)

final_docs=text_splitter.split_documents(docs)

final_docs

[Document(metadata={'source': 'speech.txt'}, page_content='My fellow dreamers and doers,\n\nToday marks not just another day, but a new beginning. Each of us carries within ourselves the power to create change, to inspire others, and to make a difference in this world.\n\nRemember that success is not measured by the heights we reach, but by the obstacles we overcome. Every setback is a setup for a comeback.'),
 Document(metadata={'source': 'speech.txt'}, page_content="In this rapidly changing world, it's not the strongest or the smartest who survive, but those most adaptable to change. Embrace uncertainty as an opportunity for growth.\n\nYour potential is limited only by your imagination and your willingness to work towards your goals. The future belongs to those who believe in the beauty of their dreams.\n\nDon't be afraid to fail. In fact, fail forward. Learn from every mistake and let it fuel your determination to succeed."),
 Document(metadata={'source': 'speech.txt'}, page_content
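To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # prints: ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk when the overlap is large enough.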

**Vector Embeddings and Vector Store DB**

In [26]:
from langchain_community.vectorstores import Chroma

# Embed each chunk with embedding_1024 and store the vectors in a Chroma vector store
db=Chroma.from_documents(final_docs,embedding_1024)
db

<langchain_community.vectorstores.chroma.Chroma at 0x2307f40e830>

Retrieve the most similar chunks from the vector store

In [28]:
query="The path ahead may not be easy but nothing worth"
retrieved_res=db.similarity_search(query)

retrieved_res

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'source': 'speech.txt'}, page_content="The path ahead may not be easy, but nothing worth having ever is. Your journey is unique, and that's what makes it valuable.\n\nTake action today. Small steps forward are better than grand plans that never begin.\n\nBe kind to others along the way, for success without compassion is an empty victory.\n\nTogether, we can turn our dreams into reality. The time is now. Thank you."),
 Document(metadata={'source': 'speech.txt'}, page_content="In this rapidly changing world, it's not the strongest or the smartest who survive, but those most adaptable to change. Embrace uncertainty as an opportunity for growth.\n\nYour potential is limited only by your imagination and your willingness to work towards your goals. The future belongs to those who believe in the beauty of their dreams.\n\nDon't be afraid to fail. In fact, fail forward. Learn from every mistake and let it fuel your determination to succeed."),
 Document(metadata={'source': 