### Data Ingestion using Document Loaders in LangChain_Community
Documentation for all document loaders: https://python.langchain.com/v0.2/docs/integrations/document_loaders

In [46]:
# TextLoader reads the contents of a .txt file into a Document
from langchain_community.document_loaders import TextLoader
text_load=TextLoader('speech.txt',encoding='UTF-8')
text_load

<langchain_community.document_loaders.text.TextLoader at 0x23d07bfde50>

In [47]:
text_docs=text_load.load()
print(text_docs)

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')]
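The output above shows the shape every loader returns: a list of `Document` objects, each holding `page_content` and a `metadata` dict. As a rough illustration only, the sketch below mimics that shape with the standard library. `SimpleDocument` and `load_text` are hypothetical names, not part of LangChain, and the sample file is created on the fly.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path, encoding="utf-8"):
    """Read one file and wrap it in a single-element list, like a loader's load()."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Scuderia Ferrari is a Formula One team.", encoding="utf-8")
    sample_docs = load_text(sample_path)

print(len(sample_docs))
print(sample_docs[0].page_content)
```

Keeping the source path in `metadata` is what later lets a retriever report where a chunk came from.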


In [48]:
# PyPDFLoader reads a PDF and returns one Document per page
from langchain_community.document_loaders import PyPDFLoader
pdf_loader=PyPDFLoader("Minor Project Report.pdf")
pdf_docs=pdf_loader.load()
pdf_docs

[Document(metadata={'source': 'Minor Project Report.pdf', 'page': 0}, page_content='Multi-Class Ship Classification of Commercial and\nNaval Vessels using Convolutional Neural Network\nA PROJECT REPORT\nSubmitted by\nAKASH VARMA DATLA [RA2111047010131]\nAMAN PARASHER [RA2111047010157]\nUnder the Guidance of\nDr. U.Sakthi\nAssistant Professor,\nDepartment of Computational Intelligence\nin partial fulfilment of the requirements for the degree of\nBACHELOR OF TECHNOLOGY\nin\nARTIFICIAL INTELLIGENCE\nDEPARTMENT OF COMPUTATIONAL INTELLIGENCE\nCOLLEGE OF ENGINEERING AND TECHNOLOGY\nSRM INSTITUTE OF SCIENCE AND TECHNOLOGY\nKATTANKULATHUR- 603 203\nOCTOBER 2024\n'),
 Document(metadata={'source': 'Minor Project Report.pdf', 'page': 1}, page_content='1\nDepartment of Computational Intelligence\nSRM Institute of Science & Technology\nOwn Work* Declaration Form\nThis sheet must be filled in (each box ticked to show that the condition has been met). It must\nbe signed and dated along with your stud

In [49]:
type(pdf_docs[0])

langchain_core.documents.base.Document

In [50]:
# Web based Loader
from langchain_community.document_loaders import WebBaseLoader
import bs4
web_loader=WebBaseLoader("https://en.wikipedia.org/wiki/Ferrari",
                        bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                            class_=("mw-page-title-main","mw-default-size")))
                        )

In [51]:
web_loader.load()

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Ferrari'}, page_content="FerrariThree Scuderia Ferrari cars in 1934, all Alfa Romeo P3s. Drivers, left to right: Achille Varzi, Louis Chiron, and Carlo Felice Trossi.Ferrari's factory in the early 1960s: everything in its production line was handmade by machinists, who followed technical drawings with extreme precision.[13] Much of this work is now done by industrial robots.[14]A Ferrari F2004 Formula One car, driven by Michael Schumacher. Schumacher is one of the most decorated drivers in F1 history.A 312 P, driven by Jacky Ickx, during Ferrari's final year in the World Sportscar ChampionshipFerrari 499P No. 51 at the 2023 6 Hours of Spa-Francorchamps166 Inter Touring BerlinettaEnzo FerrariFerrari Pinin1963 Ferrari 250 GTOTifosi flying Prancing Horse flags at the 2003 Italian Grand PrixA Ferrari 550 painted in rosso corsa. Both varieties of the Prancing Horse logo are present: the shield is located in front of the door, the 

In [52]:
# Arxiv paper loader
from langchain_community.document_loaders import ArxivLoader
docs=ArxivLoader(query="2411.03403",load_max_docs=2).load()
len(docs)

1

In [53]:
docs

[Document(metadata={'Published': '2024-11-05', 'Title': 'Enhancing Maritime Situational Awareness through End-to-End Onboard Raw Data Analysis', 'Authors': 'Roberto Del Prete, Manuel Salvoldi, Domenico Barretta, Nicolas Longépé, Gabriele Meoni, Arnon Karnieli, Maria Daniela Graziano, Alfredo Renga', 'Summary': 'Satellite-based onboard data processing is crucial for time-sensitive\napplications requiring timely and efficient rapid response. Advances in edge\nartificial intelligence are shifting computational power from ground-based\ncenters to on-orbit platforms, transforming the\n"sensing-communication-decision-feedback" cycle and reducing latency from\nacquisition to delivery. The current research presents a framework addressing\nthe strict bandwidth, energy, and latency constraints of small satellites,\nfocusing on maritime monitoring. The study contributes three main innovations.\nFirstly, it investigates the application of deep learning techniques for direct\nship detection and cla

In [54]:
# Wikipedia Data Loader
from langchain_community.document_loaders import WikipediaLoader
docs=WikipediaLoader(query="Gen AI",load_max_docs=2).load()
print(len(docs))
print(docs)

2
[Document(metadata={'title': 'Generative artificial intelligence', 'summary': 'Generative artificial intelligence (generative AI, GenAI, or GAI) is a subset of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models often generate output in response to specific prompts. Generative AI systems learn the underlying patterns and structures of their training data, enabling them to create new data. \nImprovements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s. These include chatbots such as ChatGPT, Copilot, Gemini and LLaMA, text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney and DALL-E, and text-to-video AI generators such as Sora. Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.\nGenerati

### Data Transformation (Data to Text Chunks) using LangChain Text Splitters

In [55]:
# PyPDF Loader for reading the text in pdf
from langchain_community.document_loaders import PyPDFLoader
pdf_loader=PyPDFLoader("Minor Project Report.pdf")
pdf_docs=pdf_loader.load()
pdf_docs


In [56]:
# Split text with RecursiveCharacterTextSplitter, which tries a list of separators in turn (the more generic choice)
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter=RecursiveCharacterTextSplitter(chunk_size=700,chunk_overlap=50)
final_pdf_docs=text_splitter.split_documents(pdf_docs)

In [57]:
print(final_pdf_docs[5])

page_content='2
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
BONAFIDE CERTIFICATE
Certified that 18CSP107L - Minor Project report titled “Multi-Class Ship
Classification of Commercial and Naval Vessels using Convolutional Neural
Network” is the bonafide work of AKASH VARMA DATLA
[RA2111047010131] and AMAN PARASHER [RA2111047010157] who
carried out the project work under my supervision. Certified further, that to the best of
my knowledge the work reported herein does not form any other project report or
dissertation on the basis of which a degree or award was conferred on an earlier
occasion on this or any other candidate.
SIGNATURE SIGNATURE
Dr. U.Sakthi Dr. R. ANNIE UTHRA
SUPERVISOR' metadata={'source': 'Minor Project Report.pdf', 'page': 2}
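To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines and spaces, whereas the hypothetical `chunk_with_overlap` below cuts purely by position.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in sample_chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one of the two neighbouring chunks when it is short enough.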


In [58]:
# Split text on a single separator ("\n\n") using CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter=CharacterTextSplitter(separator="\n\n",chunk_size=700,chunk_overlap=50)
final_pdf_docs=text_splitter.split_documents(pdf_docs)
print(final_pdf_docs[3])

page_content='3
ACKNOWLEDGEMENTS
We express our humble gratitude to Dr. C. Muthamizhchelvan, Vice-Chancellor, SRM Institute of
Science and Technology, for the facilities extended for the project work and his continued support.
We extend our sincere thanks to Dr. T. V. Gopal, Dean-CET, SRM Institute of Science and
Technology, for his invaluable support.
We wish to thank Dr. Revathi Venkataraman, Professor and Chairperson, School of Computing,
SRM Institute of Science and Technology, for her support throughout the project work.
We encompass our sincere thanks to Dr. M. Pushpalatha, Professor and Associate Chairperson,
School of Computing and Dr. C.Lakshmi, Professor and Associate Chairperson, School of
Computing, SRM Institute of Science and Technology, for their invaluable support. We are
incredibly grateful to our Head of the Department, Dr. R. Annie Uthra, Professor, Department of
Computational Intelligence, SRM Institute of Science and Technology, for her suggestions and
encouragemen

In [59]:
# Split text at HTML element level using HTMLHeaderTextSplitter
from langchain_text_splitters import HTMLHeaderTextSplitter
html_string="""
<html>
<body>
<p>The HTML <code>button</code> tag defines a clickable button.</p>
<p>The CSS <code>background-color</code> property defines the background color of an element.</p>
</body>
</html>
"""
headers_to_split_on=[("p","Header 3")]
html_splitter=HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits=html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={}, page_content='The HTML button tag defines a clickable button.  \nThe CSS background-color property defines the background color of an element.')]

In [60]:
url="https://en.wikipedia.org/wiki/24_Hours_of_Le_Mans"
headers_to_split_on=[
    ("h1","Header 1"),("h2","Header 2"),("h3","Header 3"),("h4","Header 4")
]
html_splitter=HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits=html_splitter.split_text_from_url(url)
html_header_splits


[Document(metadata={}, page_content='Main menu  \nmove to sidebar hide  \nMain menu  \nNavigation  \nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact us  \nContribute  \nHelpLearn to editCommunity portalRecent changesUpload file  \nSearch  \nSearch  \nAppearance  \nDonate Create account Log in  \nPersonal tools  \nDonate Create account Log in  \nPages for logged out editors learn more  \nContributionsTalk  \nContents move to sidebar hide  \nToggle Race subsection Toggle History subsection Toggle Innovations subsection  \n(Top)  \n1 Purpose  \n2 Race  \n2.1 Cars  \n2.1.1 Garage 56  \n2.2 Drivers  \n2.3 Traditions and unique rules  \n2.3.1 Schedule  \n2.3.2 Classification  \n2.3.3 Le Mans start  \n3 Circuit  \n4 History  \n4.1 1923–1939  \n4.2 1949–1969  \n4.3 1970–1980  \n4.4 1981–1993  \n4.5 1994–1999  \n4.6 2000–2005  \n4.7 2006–2013  \n4.8 2014–2020  \n4.9 2021–present  \n5 Innovations  \n5.1 Aerodynamics  \n5.2 Engines  \n5.3 Brakes  \n6 Winners  \n7 Accidents  \n
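The idea behind header-based splitting can be sketched with the standard library alone: walk the HTML, start a new section at every heading tag, and attach the headings seen so far as metadata. This is only an illustration under simplifying assumptions (well-formed HTML, no script or style handling); `HeaderSectionParser` is a hypothetical class and not how `HTMLHeaderTextSplitter` is implemented.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect (metadata, text) sections, starting a new section at each heading tag."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_tags = dict(headers_to_split_on)  # tag -> metadata key, in rank order
        self.sections = []
        self._meta = {}
        self._buffer = []
        self._in_header = None
        self._header_text = []

    def _flush(self):
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.sections.append((dict(self._meta), text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()  # text collected so far belongs to the previous heading
            self._in_header = tag
            self._header_text = []

    def handle_endtag(self, tag):
        if tag == self._in_header:
            ranked = list(self.header_tags)
            # A new heading invalidates headings of the same or deeper level.
            for deeper in ranked[ranked.index(tag):]:
                self._meta.pop(self.header_tags[deeper], None)
            self._meta[self.header_tags[tag]] = "".join(self._header_text).strip()
            self._in_header = None

    def handle_data(self, data):
        if self._in_header:
            self._header_text.append(data)
        else:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


sample_html = (
    "<h1>Le Mans</h1><p>Intro text.</p>"
    "<h2>Race</h2><p>Cars and drivers.</p>"
    "<h2>Circuit</h2><p>Track layout.</p>"
)
section_parser = HeaderSectionParser([("h1", "Header 1"), ("h2", "Header 2")])
section_parser.feed(sample_html)
section_parser.close()
html_sections = section_parser.sections
for meta, text in html_sections:
    print(meta, text)
```

Text that appears before the first heading ends up with empty metadata, which matches the `metadata={}` seen on navigation text at the top of a page.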

In [61]:
# Split JSON data using RecursiveJsonSplitter
import json
import requests
json_data=requests.get("https://api.smith.langchain.com/openapi.json").json()

from langchain_text_splitters import RecursiveJsonSplitter
json_splitter=RecursiveJsonSplitter(max_chunk_size=300)
json_text=json_splitter.split_json(json_data)

for chunk in json_text[:3]:
    print(chunk)
print("-"*100)
# create_documents returns Document objects instead of plain dicts
docs=json_splitter.create_documents([json_data])
for doc in docs[:3]:
    print(doc)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}
---------------------------------------------------------------------------------------------------
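The chunks above each repeat the full key path (`paths` → endpoint → `get`) so that every piece stays valid, self-describing JSON. A simplified sketch of that idea follows: flatten the nested dict into (path, leaf) pairs, then greedily pack them back into nested dicts that stay under a size limit. The helper names are hypothetical, lists are treated as indivisible leaves, the top level is assumed to be a non-empty dict, and this is not claimed to be the algorithm `RecursiveJsonSplitter` uses.

```python
import copy
import json


def flatten(obj, path=()):
    """Yield (path, leaf) pairs for every leaf value in a nested dict."""
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            yield from flatten(value, path + (key,))
    else:
        yield path, obj


def set_path(target, path, value):
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size):
    """Greedily pack leaves into nested dicts whose JSON text fits max_chunk_size."""
    chunks, current = [], {}
    for path, leaf in flatten(data):
        candidate = copy.deepcopy(current)
        set_path(candidate, path, leaf)
        if current and len(json.dumps(candidate)) > max_chunk_size:
            chunks.append(current)  # close the full chunk and start a new one
            current = {}
            set_path(current, path, leaf)
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample_json = {
    "info": {"title": "LangSmith", "version": "0.1.0"},
    "paths": {
        "/a": {"get": {"summary": "Read A"}},
        "/b": {"get": {"summary": "Read B"}},
    },
}
json_chunks = split_json_sketch(sample_json, max_chunk_size=60)
for chunk in json_chunks:
    print(chunk)
```

A single leaf larger than the limit still gets a chunk of its own in this sketch, so the limit is a target rather than a hard guarantee.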

### Embedding Techniques to convert chunks to vectors using OpenAI, Ollama, Hugging Face

#### OPENAI

In [62]:
# Importing required libraries to load env variables
import os
from dotenv import load_dotenv
load_dotenv()

True

In [63]:
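# Read the key stored as OPENAI_EMD_KEY in the .env file and expose it as OPENAI_API_KEY,
# the variable that OpenAIEmbeddings reads by default
os.environ["OPENAI_API_KEY"]=os.getenv("OPENAI_EMD_KEY")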

In [64]:
# Load OpenAI's text-embedding-3-large model, with the output reduced to 1024 dimensions
from langchain_openai import OpenAIEmbeddings
embeddings=OpenAIEmbeddings(model="text-embedding-3-large",dimensions=1024)
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x0000023D5DB75070>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x0000023D5DB76C00>, model='text-embedding-3-large', dimensions=1024, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [65]:
# testing the embedding
text="I am testing my OpenAIEmbeddings from my API"
query_result=embeddings.embed_query(text)
query_result

[-0.017282424494624138,
 0.06568903475999832,
 -0.009165221825242043,
 -0.015364352613687515,
 0.010509850457310677,
 -0.020861508324742317,
 -0.014612942934036255,
 0.03895466774702072,
 -0.038282353430986404,
 -0.0013322693994268775,
 -0.016867171972990036,
 0.012546566314995289,
 -0.023511217907071114,
 0.002398826414719224,
 0.03045187145471573,
 -0.00654023140668869,
 0.025587480515241623,
 0.049197569489479065,
 -0.03337841480970383,
 -0.05077948421239853,
 -0.035276710987091064,
 -0.004664178472012281,
 -0.03124282881617546,
 -0.013733002357184887,
 0.012981592677533627,
 -0.011389791034162045,
 -0.01171606034040451,
 0.03846031799912453,
 0.0007668581674806774,
 0.028256963938474655,
 0.008245733566582203,
 0.00039980438305065036,
 0.03549422696232796,
 0.01853807084262371,
 0.03177672624588013,
 -0.004535648040473461,
 0.05793765187263489,
 0.005049770697951317,
 -0.030847350135445595,
 -0.0020676127169281244,
 1.8798762539518066e-05,
 0.012457583099603653,
 -0.019398236647248

In [66]:
print(len(query_result))
print(query_result[0])

1024
-0.017282424494624138
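An embedding is only useful in comparison with other embeddings. A common comparison is cosine similarity, sketched below in plain Python on made-up three-dimensional vectors (real embeddings, like the 1024-dimensional one above, work the same way). The function name and the toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


vec_query = [1.0, 0.0, 1.0]
vec_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
vec_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(vec_query, vec_close))
print(cosine_similarity(vec_query, vec_far))
```

Because cosine similarity ignores vector length, `vec_close` scores as a perfect match even though its components are twice as large.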


In [67]:
# Text Loaders to read contents in txt file
from langchain_community.document_loaders import TextLoader
text_doc=TextLoader('speech.txt',encoding = 'UTF-8').load()
text_doc

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')]

In [68]:
# Split text by characters using RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter=RecursiveCharacterTextSplitter(chunk_size=700,chunk_overlap=50)
final_text_docs=text_splitter.split_documents(text_doc)
final_text_docs

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')]

Storing the embedding vectors in a vector store

In [69]:
# Embed the chunks and store the vectors in a Chroma vector store
from langchain_community.vectorstores import Chroma
db=Chroma.from_documents(final_text_docs,embeddings)
db

<langchain_community.vectorstores.chroma.Chroma at 0x23d5eb09a30>

In [70]:
# To retrieve from chromadb
query="Italian auto manufacturer Ferrari"
retrieved_res=db.similarity_search(query)
print(retrieved_res)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.'), Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and mo
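The warning above appears because `similarity_search` asks for `k=4` results by default while the index holds only 3 entries, so the request is capped at the index size. The identical results suggest the same chunk was stored more than once; one plausible cause (an assumption, not confirmed by the output) is that the cell was run several times against the same collection. The sketch below illustrates both behaviours with hypothetical helpers on a toy in-memory list rather than a real Chroma collection.

```python
def top_k(scored_items, k=4):
    """Return the k closest (text, distance) pairs; lower distance means closer."""
    k = min(k, len(scored_items))  # cap k at the index size, as the warning describes
    return sorted(scored_items, key=lambda pair: pair[1])[:k]


def dedupe_by_text(results):
    """Drop results whose text was already seen, keeping the first occurrence."""
    seen, unique = set(), []
    for text, distance in results:
        if text not in seen:
            seen.add(text)
            unique.append((text, distance))
    return unique


# Toy index: the same chunk stored three times, as the output above suggests.
toy_index = [("Scuderia Ferrari is the racing division of Ferrari.", 0.12)] * 3
capped = top_k(toy_index, k=4)
unique_hits = dedupe_by_text(capped)

print(len(capped))
print(len(unique_hits))
```

Passing an explicit `k` to `similarity_search`, or starting from a fresh collection, avoids both the warning and the duplicates.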

#### OLLAMA

In [71]:
from langchain_community.embeddings import OllamaEmbeddings
embeddings=OllamaEmbeddings(model="mxbai-embed-large")

In [72]:
r1=embeddings.embed_documents(["Alpha is first letter of Greek alphabets","Beta is second one"])
r1[1]

[0.6928829550743103,
 -0.06529523432254791,
 0.6625701189041138,
 0.009048640727996826,
 -1.0707876682281494,
 -0.30808326601982117,
 0.3649643659591675,
 0.29894670844078064,
 0.1590316891670227,
 1.0068645477294922,
 0.046925753355026245,
 0.18990951776504517,
 0.0908103883266449,
 0.24835142493247986,
 -0.2849648594856262,
 -0.2917434573173523,
 -0.5408878326416016,
 -0.5953996181488037,
 -0.3173350393772125,
 0.17620515823364258,
 0.27307000756263733,
 0.2589612901210785,
 -1.2641451358795166,
 0.04168955236673355,
 -0.5112680196762085,
 0.8599085211753845,
 -0.053098492324352264,
 0.24467432498931885,
 1.6036608219146729,
 1.0155749320983887,
 0.2424660325050354,
 0.07851175963878632,
 0.46443408727645874,
 -0.4050128757953644,
 -0.17650337517261505,
 -0.1626066118478775,
 0.9058332443237305,
 -0.20042863488197327,
 -0.4880538582801819,
 -0.7561136484146118,
 0.6609728932380676,
 -0.30094245076179504,
 1.0008721351623535,
 -0.45461252331733704,
 -1.2260957956314087,
 -0.4703350365

In [73]:
len(r1[0])

1024

In [74]:
embeddings.embed_query("What is second letter of greek alphabet?")

[0.2971070408821106,
 -0.09603878855705261,
 0.32907047867774963,
 -0.16204790771007538,
 -1.4167752265930176,
 -0.2853044271469116,
 1.1404540538787842,
 0.4384717345237732,
 -0.1398947536945343,
 0.5754501223564148,
 0.22336529195308685,
 0.3516661822795868,
 -0.6853501796722412,
 0.18279340863227844,
 -1.2385733127593994,
 -0.06594858318567276,
 -0.15118825435638428,
 -0.010719601064920425,
 -0.5971776247024536,
 -0.18852975964546204,
 1.051552653312683,
 1.0377808809280396,
 -1.384584903717041,
 0.4320550262928009,
 -0.07874493300914764,
 0.1339009404182434,
 -0.5505450963973999,
 -0.6654351949691772,
 -0.06054915487766266,
 0.9092860221862793,
 -0.5731201767921448,
 0.39792633056640625,
 0.4485471546649933,
 -0.7019790410995483,
 0.027622103691101074,
 -0.6260785460472107,
 0.7473986744880676,
 -0.6798529028892517,
 -0.08232481777667999,
 -1.033233404159546,
 0.4757238030433655,
 0.060763418674468994,
 -0.19816432893276215,
 -0.594415545463562,
 -0.7081314921379089,
 -0.0035319924

#### Hugging Face

In [75]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [76]:
os.environ['HF_TOKEN']=os.getenv("HF_TOKEN")

In [77]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [78]:
text="This is a test sentence for embeddings."
query=embeddings.embed_query(text)
print(len(query))
query

384


[0.02219323255121708,
 -0.011655949987471104,
 0.07762274146080017,
 0.048201631754636765,
 0.028246894478797913,
 0.05335577204823494,
 0.014203856699168682,
 -0.050392355769872665,
 0.029828622937202454,
 -0.03013942576944828,
 0.0846155658364296,
 -0.04450998827815056,
 0.0440790131688118,
 0.0433078408241272,
 -0.009363092482089996,
 0.014246895909309387,
 0.08916229009628296,
 0.02025790326297283,
 -0.037817858159542084,
 0.013670781627297401,
 0.007137923967093229,
 0.04653988406062126,
 0.05418972671031952,
 -0.05408455803990364,
 0.01716436631977558,
 -0.0006513539119623601,
 -0.09210391342639923,
 0.016734562814235687,
 0.08676841109991074,
 -0.01860661432147026,
 0.063850536942482,
 -0.014821152202785015,
 -0.0239939633756876,
 0.09018656611442566,
 0.04945998638868332,
 0.04422907158732414,
 0.05397539585828781,
 0.0205898005515337,
 -0.016251536086201668,
 0.011888865381479263,
 0.025042062625288963,
 -0.025132490321993828,
 0.007802823558449745,
 0.0709867924451828,
 0.008

In [79]:
doc_res_query=embeddings.embed_documents(["This is another text for the embeddings"])
doc_res_query

[[-0.043399371206760406,
  0.03153632581233978,
  0.0426400750875473,
  -0.0070740156807005405,
  0.0590025968849659,
  0.07748805731534958,
  0.05717156454920769,
  -0.015846719965338707,
  0.03532953932881355,
  -0.04691490903496742,
  0.07170525938272476,
  0.01592620648443699,
  0.011252649128437042,
  -0.024096734821796417,
  -0.08868508785963058,
  0.04808421432971954,
  0.06417981535196304,
  0.017124459147453308,
  -0.02819094993174076,
  -0.009318923577666283,
  -0.011185038834810257,
  0.060136061161756516,
  0.08445288985967636,
  -0.050475675612688065,
  0.0016540219075977802,
  0.005656891502439976,
  -0.07919064909219742,
  0.11868414282798767,
  0.07436972111463547,
  -0.0014468568842858076,
  0.003951891791075468,
  0.01218367274850607,
  0.0077368891797959805,
  0.06085770204663277,
  0.01718178763985634,
  0.044974539428949356,
  0.017898235470056534,
  0.0623304508626461,
  -0.028899140655994415,
  0.03728455677628517,
  0.027611952275037766,
  -0.04769134894013405,


### Vector StoreDB

In [80]:
# FAISS
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

docs=TextLoader("speech.txt",encoding="UTF-8").load()
splitter=CharacterTextSplitter(chunk_size=200,chunk_overlap=0)
doc=splitter.split_documents(docs)

embeddings=OllamaEmbeddings(model="mxbai-embed-large")
db=FAISS.from_documents(doc,embeddings)
db


<langchain_community.vectorstores.faiss.FAISS at 0x23d0a683b00>

In [81]:
# Query the db
query="What is Scuderia Ferrari succesful in?"
docs=db.similarity_search(query)
docs[0]

Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')

In [82]:
# Similarity search with score
docs_score=db.similarity_search_with_score(query)
docs_score

[(Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.'),
  148.14124)]
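The score returned here is a distance, not a similarity: assuming the store uses its default Euclidean (L2) strategy, a lower value means a closer match. The Ollama vectors printed earlier contain components larger than 1, so they are evidently not unit-normalised, which would explain a value as large as 148.14. The sketch below, using hypothetical helper names and toy vectors, shows how squared L2 distance behaves before and after normalisation.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length so distances depend only on direction."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


raw_a, raw_b = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitudes
raw_distance = squared_l2(raw_a, raw_b)
unit_distance = squared_l2(l2_normalize(raw_a), l2_normalize(raw_b))

print(raw_distance)
print(unit_distance)
```

For unit-length vectors the squared L2 distance always falls between 0 and 4, which makes scores easier to compare across queries.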

In [83]:
# As a retriever
ret=db.as_retriever()
ret.invoke(query)

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')]

In [84]:
# Save db in Local system
db.save_local("faiss_index")

In [85]:
# Load the db
new_db=FAISS.load_local("faiss_index",embeddings,allow_dangerous_deserialization=True)
new_docs=new_db.similarity_search("Sucerdia ferrari championship in which year?")
new_docs

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.')]

In [86]:
# ChromaDB from langchain
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

docs=TextLoader("speech.txt",encoding="UTF-8").load()
splitter=CharacterTextSplitter(chunk_size=200,chunk_overlap=0)
doc=splitter.split_documents(docs)

embeddings=OllamaEmbeddings(model="mxbai-embed-large")
db=Chroma.from_documents(doc,embeddings)
db

<langchain_chroma.vectorstores.Chroma at 0x23d61f38140>

In [87]:
query="What is Scuderia Ferrari known for?"
docs=db.similarity_search(query)
docs

[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and m

In [88]:
# Saving Chromadb
db=Chroma.from_documents(doc,embeddings,persist_directory="./chroma_db")

In [89]:
# Loading Chroma
db2=Chroma(persist_directory="./chroma_db",embedding_function=embeddings)
docs=db2.similarity_search(query)
docs

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and m

In [90]:
# As a retriever
ret=db2.as_retriever()
ret.invoke(query)

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


[Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and most successful Formula One team, having competed in every world championship since 1950.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Scuderia Ferrari (Italian: [skudeˈriːa ferˈraːri]) currently racing under Scuderia Ferrari HP is the racing division of luxury Italian auto manufacturer Ferrari and the racing team that competes in Formula One racing. The team is also known by the nickname "The Prancing Horse" (Italian: il Cavallino Rampante or simply il Cavallino), in reference to their logo. It is the oldest surviving and m