## Data Ingestion

### Reading Text

In [222]:
from langchain_community.document_loaders import TextLoader

In [223]:
txt_loader = TextLoader("./Data/Text/speech.txt")
# txt_loader

In [224]:
loaded_text = txt_loader.load()
loaded_text

[Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.\n\nFor generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.\n\nTechnology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just re

### Reading PDFs

In [225]:
# !pip install pypdf

In [7]:
from langchain_community.document_loaders import PyPDFLoader

In [226]:
pdf_loader = PyPDFLoader("./Data/PDFs/Report.pdf")
# pdf_loader

In [227]:
loaded_pdf = pdf_loader.load()
loaded_pdf

[Document(metadata={'producer': 'pdfTeX-1.40.27', 'creator': 'LaTeX with hyperref', 'creationdate': '2026-02-01T15:53:53+00:00', 'author': '', 'keywords': '', 'moddate': '2026-02-01T15:53:53+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.27 (TeX Live 2025) kpathsea version 6.4.1', 'subject': 'VGTC Special Issue Paper for TVCG', 'title': '', 'trapped': '/False', 'source': './Data/PDFs/Report.pdf', 'total_pages': 10, 'page': 0, 'page_label': '1'}, page_content='Engineering Data Intensive Systems - 2IMD10\nEDS - PROJECTREPORT\nTeam Number - 16\nFull Name Discord Username Email\nDivyansh Purohit wah shampy d.purohit@student.tue.nl\nLikhit Vesalapu likhit7. l.vesalapu@student.tue.nl\nPrathamesh Samal viper 101 p.samal@student.tue.nl\nElena Terzieva ellie218388 e.e.terzieva@student.tue.nl\nEindhoven, February 1, 2026'),
 Document(metadata={'producer': 'pdfTeX-1.40.27', 'creator': 'LaTeX with hyperref', 'creationdate': '2026-02-01T15:53:53+00:00', 'author': '', 'keyw

In [228]:
# list of documents
type(loaded_pdf)

list

In [230]:
for item in loaded_pdf:
    print(item.page_content, end="\n\n")

Engineering Data Intensive Systems - 2IMD10
EDS - PROJECTREPORT
Team Number - 16
Full Name Discord Username Email
Divyansh Purohit wah shampy d.purohit@student.tue.nl
Likhit Vesalapu likhit7. l.vesalapu@student.tue.nl
Prathamesh Samal viper 101 p.samal@student.tue.nl
Elena Terzieva ellie218388 e.e.terzieva@student.tue.nl
Eindhoven, February 1, 2026

1 ABSTRACT
Accurate cardinality estimation is fundamental to query opti-
mization in graph databases, enabling the selection of efficient
execution plans for regular path queries. In this report, we
present a hybrid cardinality estimator that combines multi-
ple statistical synopses including per-label statistics, pairwise
label correlations, and characteristic sets with a weighted
and stratified sampling strategy for complex queries. Our ap-
proach balances estimation accuracy against preparation time
and memory overhead, achieving competitive performance
on both synthetic and real-world workloads.
2 INTRODUCTION
The efficiency of query pr

### Reading from Web

In [231]:
# !pip install bs4

In [232]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

In [233]:
web_loader = WebBaseLoader(web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/"), 
                          bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                              class_=("post-title", "post-content", "post-header")
                              # Inspect Element for these classes
                          )))
# without bs_kwargs, entire content will be loaded

In [234]:
loaded_web_documents = web_loader.load()
loaded_web_documents

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake

In [235]:
for item in loaded_web_documents:
    print(item.page_content, end="\n\n")



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-term memory: I 

### Reading from Arxiv

In [None]:
!pip install arxiv pymupdf

In [27]:
from langchain_community.document_loaders import ArxivLoader

In [29]:
# each paper has its own unique arxiv query number
# example: attention is all you need -> 1706.03762v7 
arxiv_loader = ArxivLoader(query=("1706.03762v7"), laod_max_docs=2)

In [236]:
loaded_arxiv = arxiv_loader.load()
loaded_arxiv

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation 

### Reading from Wikipedia

In [237]:
# !pip install wikipedia

In [33]:
from langchain_community.document_loaders import WikipediaLoader

In [34]:
wiki_loader = WikipediaLoader(query="India", load_max_docs=2)

In [238]:
loaded_wiki_documents = wiki_loader.load()

In [239]:
for page in loaded_wiki_documents:
    print(page, end="\nPAGE_END\n")

page_content='India, officially the Republic of India, is a country in South Asia.  It is the seventh-largest country by area; the most populous country since 2023; and, since its independence in 1947, the world's most populous democracy. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is near Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Myanmar, Thailand, and Indonesia.
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago. Their long occupation, predominantly in isolation as hunter-gatherers, has made the region highly diverse. Settled life emerged on the subcontinent in the western margins of the Indus river basin 9,000 years ago, evolving gradually into the Indus Valley Civilisatio

## Splitting Documents

### Recursively splitting text into characters

In [194]:
# !pip install langchain-text-splitters

In [40]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [41]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# each chunk of text has a maximum chunk size of 500
# there can be an overlap of 50 characters

In [240]:
# split_documents
pdf_documents = text_splitter.split_documents(loaded_pdf)
# loaded_pdf is already a list of documents, therefore we are using split_documents

In [241]:
for document in pdf_documents:
    print(document.page_content, end="\n\n\n")

Engineering Data Intensive Systems - 2IMD10
EDS - PROJECTREPORT
Team Number - 16
Full Name Discord Username Email
Divyansh Purohit wah shampy d.purohit@student.tue.nl
Likhit Vesalapu likhit7. l.vesalapu@student.tue.nl
Prathamesh Samal viper 101 p.samal@student.tue.nl
Elena Terzieva ellie218388 e.e.terzieva@student.tue.nl
Eindhoven, February 1, 2026


1 ABSTRACT
Accurate cardinality estimation is fundamental to query opti-
mization in graph databases, enabling the selection of efficient
execution plans for regular path queries. In this report, we
present a hybrid cardinality estimator that combines multi-
ple statistical synopses including per-label statistics, pairwise
label correlations, and characteristic sets with a weighted
and stratified sampling strategy for complex queries. Our ap-


and stratified sampling strategy for complex queries. Our ap-
proach balances estimation accuracy against preparation time
and memory overhead, achieving competitive performance
on both synthetic an

In [242]:
# create_documents
speech = ""
with open("./Data/Text/speech.txt") as f:
    speech = f.read()
speech

'Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.\n\nFor generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.\n\nTechnology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, l

In [244]:
text_splitter_for_text = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
text_documents = text_splitter_for_text.create_documents([speech])
# speech is not a list of documents by default, therefore we use create_documents

In [245]:
for document in text_documents:
    print(document.page_content, end="\n\n")

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c

### Difference between RecursiveCharacterTextSplitter and CharacterTextSplitter

In [75]:
# CharacterTextSplitter: The dumb-but-predictable one.
# How it works
# 1.Splits text strictly by character count
# 2.Uses one separator (default: \n\n)
# 3. If a chunk is too big → it just hard-cuts it

# RecursiveCharacterTextSplitter: The smart one. Default for a reason.
# How it works
# 1. Tries multiple separators in order
# 2. Falls back recursively if chunk is still too big
# 3. Default separator priority: ["\n\n", "\n", " ", ""]

In [74]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)

basic = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

smart = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

In [246]:
character_docs = basic.split_documents(loaded_web_documents)
recursive_character_docs = smart.split_documents(loaded_web_documents)

Created a chunk of size 2671, which is longer than the specified 1000
Created a chunk of size 1373, which is longer than the specified 1000
Created a chunk of size 1281, which is longer than the specified 1000
Created a chunk of size 2352, which is longer than the specified 1000
Created a chunk of size 1731, which is longer than the specified 1000
Created a chunk of size 1067, which is longer than the specified 1000
Created a chunk of size 1475, which is longer than the specified 1000
Created a chunk of size 2881, which is longer than the specified 1000
Created a chunk of size 1980, which is longer than the specified 1000
Created a chunk of size 4145, which is longer than the specified 1000


In [247]:
for item in character_docs:
    print(item.page_content, end="\nPAGE_END\n")

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory
PAGE_END
Memory

Short-term me

In [248]:
for item in recursive_character_docs:
    print(item.page_content, end="\nPAGE_END\n")

LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory
PAGE_END
Memory

Short-term me

### Splitting HTML Header Text

In [77]:
from langchain_text_splitters import HTMLHeaderTextSplitter

In [82]:
html_string = """
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Product Overview – NovaX Pro</title>
  <meta name="description" content="NovaX Pro is a high-performance smart device designed for professionals.">
</head>
<body>

  <header>
    <h1>NovaX Pro</h1>
    <p class="tagline">Power. Precision. Performance.</p>
  </header>

  <section id="overview">
    <h2>Overview</h2>
    <p>
      NovaX Pro is a next-generation smart device built for engineers, designers,
      and data professionals who demand speed, reliability, and flexibility.
      Designed with a minimalist aesthetic and engineered for maximum performance,
      NovaX Pro adapts seamlessly to modern workflows.
    </p>
  </section>

  <section id="features">
    <h2>Key Features</h2>
  </section>
</body>
</html>
"""

In [249]:
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_documents = html_splitter.split_text(html_string)

In [250]:
for item in html_header_documents:
    print(item.page_content, end="\n\n")

NovaX Pro

Power. Precision. Performance.

Overview

NovaX Pro is a next-generation smart device built for engineers, designers,
      and data professionals who demand speed, reliability, and flexibility.
      Designed with a minimalist aesthetic and engineered for maximum performance,
      NovaX Pro adapts seamlessly to modern workflows.

Key Features



In [251]:
html_header_url_documents = html_splitter.split_text_from_url("https://lilianweng.github.io/posts/2023-06-23-agent/")

In [252]:
for item in html_header_url_documents:
    print(item.page_content, end="\n\n")

if (localStorage.getItem("pref-theme") === "dark") {
        document.body.classList.add('dark');
    } else if (localStorage.getItem("pref-theme") === "light") {
        document.body.classList.remove('dark')
    } else if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
        document.body.classList.add('dark');
    }  
MathJax = {
    tex: {
      inlineMath: [['$', '$'], ['\\(', '\\)']],
      displayMath: [['$$','$$'], ['\\[', '\\]']],
      processEscapes: true,
      processEnvironments: true
    },
    options: {
      skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
  };

  window.addEventListener('load', (event) => {
      document.querySelectorAll("mjx-container").forEach(function(x){
        x.parentElement.classList += 'has-jax'})
    });  
Lil'Log  
|  
Posts  
Archive  
Search  
Tags  
FAQ

LLM Powered Autonomous Agents

Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng  
Table of Contents  
Agent System O

### Splitting JSON Data

In [90]:
import json, requests

In [186]:
response = requests.get(url="https://api.smith.langchain.com/openapi.json").json()
# print(response)
# huge response -> splitting required

In [95]:
from langchain_text_splitters import RecursiveJsonSplitter

In [96]:
json_splitter = RecursiveJsonSplitter(max_chunk_size=500)

In [256]:
json_documents = json_splitter.split_json(response)
print(len(json_documents))

1865


In [257]:
for chunk in json_documents[0:5]:
    print(chunk, end="\n\n")

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'description': 'The LangSmith API is used to programmatically create and manage LangSmith resources.\n\n## Host\nhttps://api.smith.langchain.com\n\n## Authentication\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\n\n', 'version': '0.1.0'}}

{'paths': {'/api/v1/audit-logs': {'get': {'tags': ['audit-logs'], 'summary': 'Get Audit Logs', 'description': "Retrieve audit log records for the authenticated user's organization in OCSF format.\n\nRequires both start_time and end_time parameters to filter logs within a date range.\nSupports cursor-based pagination.\n\nReturns results in OCSF API Activity (Class UID: 6003) format,\nwhich is compatible with security monitoring and SIEM tools.\nReference: https://schema.ocsf.io/1.7.0/classes/api_activity"}}}}

{'paths': {'/api/v1/audit-logs': {'get': {'operationId': 'g

In [258]:
json_documents = json_splitter.create_documents(texts=[response])

In [259]:
for item in json_documents[:5]:
    print(item.page_content, end="\n\n")

{"openapi": "3.1.0", "info": {"title": "LangSmith", "description": "The LangSmith API is used to programmatically create and manage LangSmith resources.\n\n## Host\nhttps://api.smith.langchain.com\n\n## Authentication\nTo authenticate with the LangSmith API, set the `X-Api-Key` header\nto a valid [LangSmith API key](https://docs.langchain.com/langsmith/create-account-api-key#create-an-api-key).\n\n", "version": "0.1.0"}}

{"paths": {"/api/v1/audit-logs": {"get": {"tags": ["audit-logs"], "summary": "Get Audit Logs", "description": "Retrieve audit log records for the authenticated user's organization in OCSF format.\n\nRequires both start_time and end_time parameters to filter logs within a date range.\nSupports cursor-based pagination.\n\nReturns results in OCSF API Activity (Class UID: 6003) format,\nwhich is compatible with security monitoring and SIEM tools.\nReference: https://schema.ocsf.io/1.7.0/classes/api_activity"}}}}

{"paths": {"/api/v1/audit-logs": {"get": {"operationId": "g

## Documents to Vectors / Embeddings

In [142]:
# !pip install python-dotenv
# !pip install langchain-openai

In [118]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [121]:
from langchain_openai import OpenAIEmbeddings

In [123]:
em_model = OpenAIEmbeddings(model="text-embedding-3-large")
# has 3072 output dimensions
em_model

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x11ce5c310>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x122b55700>, model='text-embedding-3-large', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [125]:
# text = "This is my first actual interaction with openAI embeddings"
# query_result = em_model.embed_query(text)

In [126]:
query_result

[0.01431331131607294,
 0.022795815020799637,
 -0.012167046777904034,
 -0.05253585800528526,
 0.009808354079723358,
 -0.023484379053115845,
 -1.6553085515624844e-05,
 0.04368709772825241,
 0.032904502004384995,
 0.011009675450623035,
 0.003094868967309594,
 -0.012489352375268936,
 -0.03070696070790291,
 0.014547714963555336,
 -0.00912711676210165,
 0.038999009877443314,
 0.02506660670042038,
 0.05801505967974663,
 -0.03214268758893013,
 -0.02697114273905754,
 0.024290142580866814,
 -0.01194729283452034,
 -0.045620933175086975,
 -0.017682872712612152,
 0.02824571542441845,
 -0.019426254555583,
 0.012731081806123257,
 0.030912064015865326,
 -0.021330788731575012,
 0.01860583946108818,
 -0.00824077520519495,
 -0.019733909517526627,
 0.015529283322393894,
 -0.006317927967756987,
 0.022502809762954712,
 -0.009258968755602837,
 0.03598105534911156,
 0.010702020488679409,
 -0.0016856964211910963,
 0.0194116048514843,
 0.009757078252732754,
 0.030179550871253014,
 -0.01977786049246788,
 0.00955

In [127]:
len(query_result)

3072

### Custom Output Dimension

In [128]:
em_custom_model = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
em_custom_model

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x122c867c0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x122c86ac0>, model='text-embedding-3-large', dimensions=1024, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [129]:
# text = "This is my first actual interaction with openAI embeddings"
# query_result_custom = em_custom_model.embed_query(text)

In [130]:
len(query_result_custom)

1024

### Vectorizing Documents

In [135]:
text_documents

[Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.\n\nFor generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.\n\nTechnology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just re

In [137]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_text_documents = text_splitter.split_documents(text_documents)
split_text_documents

[Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.'),
 Document(metadata={'source': './Data/Text/speech.txt'}, page_content='For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.'),
 Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Technology is not neutral. Every line of code carries assumption

### Storing the vectors into Vector Store DB, example ChromaDB

In [141]:
# !pip install chromadb

In [139]:
from langchain_community.vectorstores import Chroma

In [143]:
db = Chroma.from_documents(split_text_documents, em_custom_model)
db

<langchain_community.vectorstores.chroma.Chroma at 0x1205c8970>

In [144]:
query="But today, the real challenge is not whether"
# retrieved_results = db.similarity_search(query)
print(retrieved_results)

[Document(metadata={'source': './Data/Text/speech.txt'}, page_content='For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.'), Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.'), Document(metadata={'source': './Data/Text/speech.txt'}, page_content='The future we are building will not be defined by our tools alone.

By default, similarity search ≠ “give me the best one”
It means:

“Give me the top-k most similar documents”

And k defaults to > 1 (usually k=4, depending on the vector store).

So you’re asking:

“What are the top few chunks that kind of match this sentence?”

Not:

“What is the single best match?”

## Using Open Source Models through OLLAMA

In [181]:
from langchain_community.embeddings import OllamaEmbeddings

In [160]:
embeddings = (
    OllamaEmbeddings(model="mxbai-embed-large:latest") # default -> llama2 Embeddings
)
embeddings

OllamaEmbeddings(base_url='http://localhost:11434', model='mxbai-embed-large:latest', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=False, headers=None, model_kwargs=None)

In [161]:
r1 = embeddings.embed_documents([
    "Alpha is the first letter of greek alphabets",
    "Beta is the second letter of greek aplhabets"
])

In [167]:
len(r1[1])

1024

In [168]:
r1[1]

[0.02232862263917923,
 -0.011172731406986713,
 0.05565343424677849,
 0.004261345602571964,
 -0.059006091207265854,
 -0.002101231599226594,
 0.05839128419756889,
 0.014102347195148468,
 -0.0006132923299446702,
 0.04060807451605797,
 0.003592876950278878,
 0.0008210025262087584,
 -0.01609811745584011,
 0.00807739794254303,
 -0.04263812676072121,
 -0.01741427183151245,
 -0.032701026648283005,
 -0.03561025857925415,
 -0.04910216107964516,
 -0.013773754239082336,
 0.006644190289080143,
 0.030934231355786324,
 -0.08119471371173859,
 0.014379146508872509,
 -0.012773838825523853,
 0.047786515206098557,
 0.0017566366586834192,
 0.0016951588913798332,
 0.06454253941774368,
 0.04427395015954971,
 0.01942199468612671,
 0.036683741956949234,
 0.03538158908486366,
 -0.019382568076252937,
 -5.7232424296671525e-05,
 -0.01037557702511549,
 0.05384887382388115,
 -0.02253211848437786,
 -0.00800285767763853,
 -0.04165192320942879,
 0.03943972289562225,
 -0.01664954610168934,
 0.030306879431009293,
 -0.015

In [165]:
r2_ans = embeddings.embed_query("What is the second letter of the greek alphabets")

In [166]:
r2_ans

[0.018760541453957558,
 0.0027914312668144703,
 0.01640276424586773,
 -0.008009024895727634,
 -0.08523359894752502,
 -0.016433728858828545,
 0.05806936323642731,
 0.023995280265808105,
 0.0005751606659032404,
 0.03280412033200264,
 0.005269171670079231,
 0.01563948579132557,
 -0.03185107558965683,
 0.008660389110445976,
 -0.06639841198921204,
 -0.0028611638117581606,
 -0.018696248531341553,
 0.0002917601086664945,
 -0.04092410206794739,
 -0.012840654700994492,
 0.05591011047363281,
 0.053638946264982224,
 -0.07699041813611984,
 0.019978661090135574,
 -0.003759740386158228,
 0.015828536823391914,
 -0.027374276891350746,
 -0.03970276564359665,
 0.007332733366638422,
 0.05431250482797623,
 -0.030517537146806717,
 0.019720114767551422,
 0.026561785489320755,
 -0.03649875149130821,
 -0.0012056123232468963,
 -0.03193274885416031,
 0.038359418511390686,
 -0.0350051112473011,
 -0.004734362009912729,
 -0.05996692553162575,
 0.02492077462375164,
 6.737854710081592e-05,
 -0.014212377369403839,
 -

In [170]:
load_dotenv()
os.environ["HF_TOKEN"] = os.getenv("HF_TOKEN")

In [178]:
# !pip install sentence-transformers
# !pip install langchain_huggingface
# !pip install tf-keras

In [175]:
from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
# HuggingFace sentence-transformers:
#It’s a library for generating vector embeddings from text — basically turning sentences, paragraphs, or even documents into numbers 
# that a computer can understand for semantic similarity, search, clustering, or RAG.

In [180]:
hf_embeddings = HuggingFaceEmbeddings(model="all-MiniLM-L6-v2")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [182]:
query = "This is a example test sentence to be converted into a vector embedding"

In [183]:
query_result = hf_embeddings.embed_query(query)
len(query_result)

384

## Vector Stores

In [277]:
# !pip install faiss-cpu

#### Loading Text

In [188]:
from langchain_community.document_loaders import TextLoader

In [196]:
text_loader = TextLoader("./Data/Text/speech.txt")
loaded_text = text_loader.load()

In [197]:
loaded_text

[Document(metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.\n\nFor generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.\n\nTechnology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just re

#### Splitting Text into Documents

In [193]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [195]:
rec_txt_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

In [198]:
text_documents = rec_txt_splitter.split_documents(loaded_text)

In [201]:
print(len(text_documents), type(text_documents))

5 <class 'list'>


### Creating a FAISS Vector Store

In [202]:
from langchain_community.vectorstores import FAISS

In [203]:
# get the OLLAMA embeddings
from langchain_community.embeddings import OllamaEmbeddings

In [207]:
embeddings = OllamaEmbeddings(
    model="mxbai-embed-large"
    # default llama2
)

In [212]:
# create a FAISS vector store,
# ollama has to be running
db = FAISS.from_documents(text_documents, embedding=embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x169cd2460>

In [213]:
# since we had 5 documents, faiss db actually is a 2D matrix of 5 rows and number of embeddings per document columns (here, 1024)
db.index.ntotal

5

In [214]:
db.index.d

1024

#### Interacting with the Vector Store DB

In [219]:
query="What the speaker has to tell about bigger machines"

In [220]:
similar_docs = db.similarity_search(query)

In [264]:
for item in similar_docs:
    print(item.page_content, end="\n\n")

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c

#### Converting the Vector Store to a Retriever

In [260]:
retriever = db.as_retriever()

In [262]:
similar_docs_using_retriever = retriever.invoke(query)

In [265]:
for item in similar_docs_using_retriever:
    print(item.page_content, end="\n\n")

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c

In [266]:
# both the vector store db and the retriever give the same results

#### Similarity Search with Score using Retriever

In [267]:
similar_docs_with_score = db.similarity_search_with_score(query)

In [269]:
for item in similar_docs_with_score:
    print(item, end="\n\n")

(Document(id='e2a0acc7-277a-455e-9e5b-f8be36741a97', metadata={'source': './Data/Text/speech.txt'}, page_content='For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.'), np.float32(0.75798464))

(Document(id='fab83da3-2c97-49d6-9fe7-b7d42603944d', metadata={'source': './Data/Text/speech.txt'}, page_content='Ladies and gentlemen, friends, and fellow builders,\n\nWe live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.'), np.float32(0.78366566))


#### Interacting using vectors instead of text query

In [270]:
text_embedding = embeddings.embed_query(query)

In [271]:
similar_docs_using_vectors = db.similarity_search_by_vector(text_embedding)

In [273]:
for item in similar_docs_using_vectors:
    print(item, end="\n\n")

page_content='For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.' metadata={'source': './Data/Text/speech.txt'}

page_content='Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.' metadata={'source': './Data/Text/speech.txt'}

page_content='Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while mut

#### Saving the vector db

In [274]:
db.save_local("faiss_index")

#### Loading the vector db

In [276]:
dbb = FAISS.load_local("faiss_index", embeddings=embeddings, allow_dangerous_deserialization=True)
# embeddings = OLLAMA

### Using Chroma

In [283]:
# !pip install chromadb
# !pip install langchain-chroma

In [284]:
from langchain_chroma import Chroma

In [291]:
chromadb = Chroma.from_documents(text_documents, embedding=embeddings, persist_directory="./chromadb")

In [296]:
query="What the speaker has to tell about bigger machines"

In [297]:
similar_docs = chromadb.similarity_search(query)
for item in similar_docs:
    print(item.page_content, end="\n\n")

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c

In [298]:
chromadbb = Chroma(persist_directory="./chromadb", embedding_function=embeddings)

In [299]:
similar_docs = chromadbb.similarity_search(query)
for item in similar_docs:
    print(item.page_content, end="\n\n")

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c

In [300]:
retriever_chroma = chromadb.as_retriever()

In [302]:
retrieved_documents_using_retriever_chroma = retriever_chroma.invoke(query)

In [303]:
for item in retrieved_documents_using_retriever_chroma:
    print(item.page_content, end="\n\n")

Ladies and gentlemen, friends, and fellow builders,

We live in an era where change no longer knocks on the door. It breaks it down. Every year, technology moves faster, systems grow more complex, and decisions that once took decades now take months, sometimes weeks. And in the middle of all this acceleration, there is one uncomfortable truth we must face: progress without direction is not innovation—it is chaos.

For generations, we measured success by what we could build. Bigger machines. Faster networks. Smarter systems. But today, the real challenge is not whether we can build something. It is whether we should, and if we do, whether we understand the consequences of what we are putting into the world.

Technology is not neutral. Every line of code carries assumptions. Every model reflects priorities. Every system amplifies certain voices while muting others. When we say we are “just engineers” or “just researchers,” we ignore the fact that our work shapes how people live, learn, c