## **Start with Embeddings**

### *HuggingFace Embedding*

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()

os.environ["HF_TOKEN"] = os.getenv("HF_TOKEN")

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embeddings

  from .autonotebook import tqdm as notebook_tqdm
No sentence-transformers model found with name sentence-transformers/all-mpnet-base-v2. Creating a new one with mean pooling.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [5]:
my_text = "I'm Al Amin"

result = embeddings.embed_query(my_text)


In [7]:
len(result)

768

In [10]:
my_docs = [
    "i am al amin",
    "i'm student of the nsu",
    "i am in cse495 class"
]

In [12]:
result_doc = embeddings.embed_documents(my_docs)
len(result_doc)

3

In [13]:
len(result_doc[1])

768

### Load Some of Document and do Embedding.

In [15]:
from langchain_community.document_loaders import TextLoader

speech = TextLoader("speech.txt").load()
speech

[Document(metadata={'source': 'speech.txt'}, page_content='**The Importance of Data Ingestion in Modern AI Systems**\n\nData ingestion is a foundational step in any data pipeline, especially in systems powered by artificial intelligence and machine learning. It involves collecting, importing, and processing data for immediate use or storage in a database. In the context of LangChain and large language models (LLMs), data ingestion becomes even more crucial. The ability to efficiently load documents, parse them into chunks, and embed them for retrieval-augmented generation (RAG) directly influences the performance of AI applications.\n\nThere are various sources and formats from which data may be ingested—PDFs, HTML, Markdown files, CSVs, JSON, SQL databases, and even real-time APIs. Each type requires a slightly different handling process. For instance, PDFs must be parsed for text extraction, while CSVs require row-based parsing for structured data ingestion.\n\nIn LangChain, document

In [17]:
speech[0].page_content

'**The Importance of Data Ingestion in Modern AI Systems**\n\nData ingestion is a foundational step in any data pipeline, especially in systems powered by artificial intelligence and machine learning. It involves collecting, importing, and processing data for immediate use or storage in a database. In the context of LangChain and large language models (LLMs), data ingestion becomes even more crucial. The ability to efficiently load documents, parse them into chunks, and embed them for retrieval-augmented generation (RAG) directly influences the performance of AI applications.\n\nThere are various sources and formats from which data may be ingested—PDFs, HTML, Markdown files, CSVs, JSON, SQL databases, and even real-time APIs. Each type requires a slightly different handling process. For instance, PDFs must be parsed for text extraction, while CSVs require row-based parsing for structured data ingestion.\n\nIn LangChain, document loaders like `PyPDFLoader`, `TextLoader`, `CSVLoader`, an

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)
docs = splitter.split_documents(documents=speech)
docs

[Document(metadata={'source': 'speech.txt'}, page_content='**The Importance of Data Ingestion in Modern AI Systems**'),
 Document(metadata={'source': 'speech.txt'}, page_content='Data ingestion is a foundational step in any data pipeline, especially in systems powered by artificial intelligence and machine learning. It involves collecting, importing, and processing data for immediate use or storage in a database. In the context of LangChain and large language models (LLMs),'),
 Document(metadata={'source': 'speech.txt'}, page_content='of LangChain and large language models (LLMs), data ingestion becomes even more crucial. The ability to efficiently load documents, parse them into chunks, and embed them for retrieval-augmented generation (RAG) directly influences the performance of AI applications.'),
 Document(metadata={'source': 'speech.txt'}, page_content='There are various sources and formats from which data may be ingested—PDFs, HTML, Markdown files, CSVs, JSON, SQL databases, and 

In [24]:
doc_embeddings = embeddings.embed_documents(docs[0].page_content)
doc_embeddings

[[-0.07100773602724075,
  0.06848253309726715,
  -0.10874670743942261,
  -0.05590743198990822,
  0.06916828453540802,
  -0.05591384693980217,
  -0.1364176869392395,
  0.06006304547190666,
  -0.05065111815929413,
  -0.02653052844107151,
  0.06550167500972748,
  -0.09969644993543625,
  0.07896089553833008,
  0.09687616676092148,
  -0.05232555791735649,
  0.07615617662668228,
  0.1567022204399109,
  0.06738951057195663,
  -0.059801217168569565,
  -0.08171814680099487,
  -0.18244163691997528,
  0.07284394651651382,
  0.09483786672353745,
  -0.01887756958603859,
  0.023262225091457367,
  0.04363490641117096,
  0.09595432132482529,
  -0.09406822174787521,
  -0.04985657334327698,
  0.0026045218110084534,
  -0.1387895792722702,
  -0.18362873792648315,
  -0.02386275678873062,
  0.14663663506507874,
  4.992854883312248e-06,
  0.05253860354423523,
  0.07173207402229309,
  -0.11360090225934982,
  -0.00675865588709712,
  0.01422024890780449,
  -0.008153020404279232,
  0.05876363441348076,
  -0.0571

## **Check Sentence Similarity**

In [25]:
embeddings.embed_query("Hello!, how are you?")

[0.05944778397679329,
 -0.010382159613072872,
 -0.04376945644617081,
 -0.03851791098713875,
 0.1202593520283699,
 -0.05798788741230965,
 -0.027522718533873558,
 0.14704278111457825,
 0.057806797325611115,
 -0.02005010098218918,
 -0.0756768137216568,
 0.0059331730008125305,
 -0.017380986362695694,
 0.10194414108991623,
 0.024727020412683487,
 -0.10742659121751785,
 0.05587827414274216,
 -0.1673566699028015,
 -0.06651634722948074,
 0.011753544211387634,
 0.006819194182753563,
 -0.03233831003308296,
 -0.019089102745056152,
 0.09405741840600967,
 0.04695425182580948,
 0.05639710649847984,
 0.05926735699176788,
 0.019214795902371407,
 -0.020675547420978546,
 0.1675911843776703,
 0.03565971553325653,
 0.05213737115263939,
 0.2081088274717331,
 0.09723994135856628,
 4.601586624630727e-06,
 0.023988567292690277,
 -0.08584705740213394,
 -0.06684308499097824,
 0.08050195127725601,
 -0.1282549649477005,
 0.0832357183098793,
 -0.15470601618289948,
 0.015243579633533955,
 0.17564232647418976,
 0.01

In [26]:
from sklearn.metrics.pairwise import cosine_similarity

In [29]:
documents = [
    "What is the capital of USA?",
    "who is the the President of USA.",
    "WHo is the prime minister of Bangladesh."
]
doc_embed = embeddings.embed_documents(documents)

In [30]:
my_query = "Dr. Younus is the prime minister of Bangladesh."
my_query_emb = embeddings.embed_query(my_query)


In [33]:
cosine_similarity([my_query_emb], doc_embed)

array([[0.15061189, 0.31117925, 0.72048594]])

In [34]:
from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances([my_query_emb], doc_embed)

array([[3.76929385, 3.39429513, 2.20812812]])

## in euclidean distance high similarity is lowest value.
- Cosine Similarity is [-1, 1] and its based on angle.
- L2 distance [0, infinity.]