# Semantic Search on PDF Documents with qFlat Index

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This example demonstrates how to use KDB.AI to run semantic search on unstructured text documents.

<div class="alert alert-block alert-info">
<b>Tip:</b> This sample uses ‘qFlat’, a new vector index option in KDB.AI. It supports the same API options as the existing ‘Flat’ index, with one significant difference: the index is stored on disk and memory-mapped as required. As a result, data inserts have negligible memory and CPU footprints. The vector index can grow and be searched as long as disk space is available, and it works well for datasets of up to 1,000,000 vectors. This makes it a good fit for memory-constrained situations such as edge devices.
</div>
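
To make the idea of an on-disk, memory-mapped flat index more concrete, here is a minimal sketch in plain Python. It is not KDB.AI's implementation, which this notebook does not show. The file layout and the names `DIMS`, `write_vectors`, and `search` are assumptions made for illustration only.

```python
import mmap
import os
import struct
import tempfile
from array import array

DIMS = 4  # toy dimensionality chosen for this sketch


def write_vectors(path, vectors):
    """Append float32 vectors to a flat binary file on disk."""
    with open(path, "ab") as f:
        for vec in vectors:
            f.write(array("f", vec).tobytes())


def search(path, query, k=1):
    """Brute-force L2 search over a memory-mapped vector file."""
    row_bytes = DIMS * 4  # four bytes per float32 value
    scored = []
    with open(path, "rb") as f:
        # The OS pages rows in on demand instead of loading the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(len(mm) // row_bytes):
                row = struct.unpack_from(f"{DIMS}f", mm, i * row_bytes)
                dist = sum((a - b) ** 2 for a, b in zip(row, query)) ** 0.5
                scored.append((dist, i))
    return sorted(scored)[:k]


path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
write_vectors(path, [[0, 0, 0, 0], [1, 1, 1, 1]])
write_vectors(path, [[5, 5, 5, 5]])  # the index grows by appending to the file
nearest = search(path, [0.9, 1.0, 1.1, 1.0], k=2)
print(nearest)  # (distance, row number) pairs, closest first
```

Inserting only appends bytes to a file, and searching reads rows through the memory map, so neither step needs the full set of vectors in process memory.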

Semantic search allows users to perform searches based on the meaning or similarity of the data rather than exact matches. It works by converting the query into a vector representation and then finding similar vectors in the database. This way, even if the query and the data in the database are not identical, the system can identify and retrieve the most relevant results based on their semantic meaning.
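
The following toy sketch illustrates that idea without calling any embedding model. The three-dimensional vectors are hand-made stand-ins for real embeddings, and `TOY_EMBEDDINGS`, `cosine_similarity`, and `semantic_search` are hypothetical names used only here.

```python
import math

# Hand-made 3-d vectors standing in for real model embeddings (illustrative only).
TOY_EMBEDDINGS = {
    "Cats are small domesticated animals.": [0.9, 0.1, 0.0],
    "The stock market fell sharply today.": [0.0, 0.2, 0.9],
    "Galaxies contain billions of stars.": [0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, k=1):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        TOY_EMBEDDINGS.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Pretend an embedding model mapped the query "feline pets" to this vector.
query_vector = [0.8, 0.2, 0.1]
best_match = semantic_search(query_vector)
print(best_match)
```

The query shares no words with the sentence about cats, yet that sentence ranks first because its vector points in nearly the same direction as the query vector.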

### Aim
In this tutorial, we'll walk you through the process of performing semantic search on documents, using PDFs as an example and KDB.AI as the vector store. We will cover the following topics:

0. Setup
1. Load PDF Data
2. KDB.AI Table Creation
3. LlamaIndex index & query_engine setup
4. Retrieve Similar Sentences & RAG
5. Delete the KDB.AI Table

---

## 0. Setup

### Install dependencies

To run this sample, follow the steps for the environment where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

### Set Environment Variables

In [None]:
!pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-readers-file llama-index-vector-stores-kdbai
!pip install kdbai_client pandas


In [2]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads the research paper PDF into your environment
!mkdir -p ./data
!wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/document_search/data/research_paper.pdf

--2024-09-05 18:14:01--  https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/document_search/data/research_paper.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1206037 (1.1M) [application/octet-stream]
Saving to: ‘./data/research_paper.pdf’


2024-09-05 18:14:01 (17.7 MB/s) - ‘./data/research_paper.pdf’ saved [1206037/1206037]



### Import Packages

In [3]:

import os
import re
import shutil
import time
import urllib
from getpass import getpass

import pandas as pd

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.kdbai import KDBAIVectorStore

import kdbai_client as kdbai

OUTDIR = "pdf"
RESET = True

In [4]:
# OpenAI API Key: https://platform.openai.com/api
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)

OpenAI API Key: ··········


In [5]:
# Set up LlamaIndex Parameters

EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = 'gpt-4o-mini'

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)

Settings.llm = llm
Settings.embed_model = embed_model

### Configure Console

In [6]:
pd.set_option("display.max_colwidth", 300)

### Define Helper Functions

In [7]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

## 1. Load PDF Data

### Read Text From PDF Document

We use LlamaIndex's `SimpleDirectoryReader` to read in our PDF file.

The PDF we are using is [this research paper](https://arxiv.org/pdf/2308.05801.pdf) presenting information on the formation of Interstellar Objects in the Milky Way.

In [8]:
reader = SimpleDirectoryReader(
    input_dir="data",
)
documents = reader.load_data()

In [9]:
print(documents)

[Document(id_='1b12ee1b-9b8d-4fa5-bc3f-2d10de41c3ce', embedding=None, metadata={'page_label': '1', 'file_name': 'research_paper.pdf', 'file_path': '/content/data/research_paper.pdf', 'file_type': 'application/pdf', 'file_size': 1206037, 'creation_date': '2024-09-05', 'last_modified_date': '2024-09-05'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\nThe Galactic Interstellar Object Population: A Framework for Prediction and Inference\nMatthew J. Hopkins\n ,1Chris Lintott\n ,1Michele T. Bannister\n ,2J. Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes\n2\n1Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK\n2School of Physical

## 2. KDB.AI Table Creation

### Define KDB.AI Session

KDB.AI comes in two offerings:

1. [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
2. [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used automatically to connect.
If they do not exist, you will be prompted to enter your KDB.AI Cloud portal session URL endpoint and API key.

In [None]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
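
The same "environment variable, else prompt" pattern appears for each credential in this notebook. One way to factor it out is sketched below. `env_or_prompt` and its `environ` and `ask` parameters are hypothetical and are not part of `kdbai_client` or any other library used here.

```python
import os
from getpass import getpass


def env_or_prompt(name, prompt, secret=False, environ=os.environ, ask=None):
    """Return environ[name] if it is set, otherwise ask the user for a value."""
    if name in environ:
        return environ[name]
    if ask is None:
        # getpass hides what is typed, which suits API keys.
        ask = getpass if secret else input
    return ask(prompt)


# A fake environment is passed in so that this example never blocks on input.
fake_env = {"KDBAI_ENDPOINT": "http://localhost:8082"}
endpoint = env_or_prompt("KDBAI_ENDPOINT", "KDB.AI endpoint: ", environ=fake_env)
print(endpoint)
```

In the notebook itself you would drop the `environ` argument so that the real `os.environ` is consulted, and pass `secret=True` for the API key.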

In [11]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [None]:
# session = kdbai.Session(endpoint="http://localhost:8082")

### Define Vector DB Table Schema

The next step is to define a schema for our KDB.AI table where we will store our embeddings. Our table will have three columns: `document_id`, `text`, and `embedding`.

At this point you will select the index and metric you want to use for searching.

In this case, we will use the qFlat index with Euclidean distance (L2) as the search metric, and we specify the number of dimensions of our embeddings (1536, the output size of `text-embedding-3-small`).

In [12]:
pdf_schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),
        dict(name="text", pytype="bytes"),
        dict(
            name="embedding",
            vectorIndex=dict(type="qFlat", metric="L2", dims=1536),
        ),
    ]
)
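
The `dims` value in the schema has to agree with the length of the vectors the embedding model produces. As a rough sketch, a small check like the one below can catch a mismatch before the table is created. `EMBEDDING_DIMS` and `check_schema_dims` are hypothetical helpers, and the schema is repeated here only so that the example is self-contained.

```python
# Default output sizes of OpenAI's text-embedding-3 models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_schema_dims(schema, model_name):
    """Raise ValueError if a vector column's dims differ from the model's output size."""
    expected = EMBEDDING_DIMS[model_name]
    for column in schema["columns"]:
        index = column.get("vectorIndex")
        if index is not None and index["dims"] != expected:
            raise ValueError(
                f"column {column['name']!r} has dims={index['dims']}, "
                f"but {model_name} produces {expected}-dimensional vectors"
            )
    return expected


example_schema = {
    "columns": [
        {"name": "document_id", "pytype": "bytes"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embedding",
            "vectorIndex": {"type": "qFlat", "metric": "L2", "dims": 1536},
        },
    ]
}
dims = check_schema_dims(example_schema, "text-embedding-3-small")
print(dims)
```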

### Create Vector DB Table

Use the KDB.AI `create_table` function to create a table that matches the defined schema in the vector database.

In [13]:
# First ensure the table does not already exist
try:
    session.table("pdf").drop()
except kdbai.KDBAIException:
    pass

In [14]:
table = session.create_table("pdf", pdf_schema)

We can use `query` to see our table exists but is empty.

In [15]:
table.query()

Unnamed: 0,document_id,text,embedding


## 3. LlamaIndex index & query_engine setup
Define the index. With KDB.AI as the vector store, LlamaIndex chunks the document, embeds each chunk, and loads the results into KDB.AI.

In [16]:
vector_store = KDBAIVectorStore(table)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=100, chunk_overlap=0)],
)
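
To get a feel for what `chunk_size` and `chunk_overlap` control, here is a simplified stand-in for the chunking step. It is not how `SentenceSplitter` works internally: the real splitter counts tokens and tries to respect sentence boundaries, while this sketch simply counts whitespace-separated words. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size, chunk_overlap=0):
    """Split text into chunks of at most chunk_size words, sharing chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven"
no_overlap = chunk_words(sample, chunk_size=3)
with_overlap = chunk_words(sample, chunk_size=3, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With no overlap, each word lands in exactly one chunk, which matches the `chunk_overlap=0` setting used above. A non-zero overlap repeats the tail of one chunk at the head of the next, trading extra storage for context that survives chunk boundaries.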

### Verify Data Has Been Inserted

Running `table.query()` should show us that data has been added.

In [17]:
show_df(table.query())

(370, 3)


Unnamed: 0,document_id,text,embedding
0,b'5a969e74-9ccf-4448-90e3-6391d6552b70',"b'Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\nThe Galactic Interstellar Object Population: A Framework for Prediction and Inference\nMatthew J. Hopkins\n ,1Chris Lintott\n ,1Michele T. Bannister\n ,2J. Ted Mackereth\n ,3, 4, 5,'","[0.011503529, 0.018526183, 0.035453927, -0.0246186, -0.0075270813, -0.00292829, 0.018146226, -0.023373913, -0.020976253, -0.034222342, -0.0044841487, 0.015800975, 0.009164827, -0.009485826, 0.012342054, 0.043865394, 0.031916395, -0.009905089, 0.020688009, 0.02982008, -0.03254529, -0.020046012, 0..."
1,b'ba18a044-01bd-4666-bd5a-7190b199ef2c',"b'\xe2\x88\x97and\nJohn C. Forbes\n2\n1Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK\n2School of Physical and Chemical Sciences\xe2\x80\x94Te Kura Mat\xc2\xaf u, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand\n3...","[0.03447592, 0.0034924955, 0.029711159, 0.013346323, 0.009011886, 0.03317871, -0.009685439, 0.006903915, 0.0079953205, -0.03205612, 0.0209425, 0.011837065, -0.00838199, 0.019932171, 0.012604167, 0.06630752, 0.0377439, 0.039190788, 0.032929245, 0.015566552, -0.033577852, -0.06376299, 0.047872137,..."
2,b'4b5d7fec-e635-4705-8100-af8af517a875',"b'Bancroft road, Reigate, Surrey RH2 7RP, UK\n4Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. George Street, Toronto, ON, M5S 3H8, Canada\n5Dunlap Institute for Astronomy and Astrophysics, University of Toronto, 50 St.'","[0.0070580486, -0.014808062, 0.046827197, 0.03716484, -0.0113482345, 0.0041863914, -0.014644507, -0.026370177, -0.039228156, -0.032987885, -0.026244367, 0.016745565, 0.015009361, -0.046097487, -0.0002655811, 0.09491251, 0.048965998, 0.043782547, -0.0015246832, 0.061899465, -0.005321844, -0.04652..."
3,b'7fb0ac0b-5234-452c-a2bc-37c00f658733',"b'George Street, Toronto, ON M5S 3H4, Canada\nABSTRACT\nThe Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering\napproximately 1015pc\xe2\x88\x923around the Sun, which are formed and shaped by a diverse set of processes\nranging from planet formation to galac...","[0.003038883, -0.00866065, 0.05570509, 0.025615819, -0.012807909, -0.025429426, 0.024271121, -0.024044788, -0.017720714, -0.02395159, -0.008560796, 0.012508349, -0.0057049785, -0.028491609, 0.0154973045, 0.046811447, -0.0069231945, 0.016322762, 0.03956872, 0.014924809, -0.00085624604, 0.01504463..."
4,b'6169368e-8ac2-437c-a98b-3f4584b33baf',"b'We define a novel framework: firstly to predict\nthe properties of this Galactic ISO population by combining models of processes across planetary\nand galactic scales, and secondly to make inferences about the processes modelled, by comparing the\npredicted population to what is observed.'","[0.023436025, 0.011106744, 0.034744486, 0.002776686, 0.033301894, -0.018154668, 0.019743964, -0.019316077, 0.0020676148, 0.0028362847, -0.033081837, -0.008301022, -0.003939624, -0.0159052, 0.021895628, 0.0002355293, -0.006503894, -0.0015258783, 0.045038246, 0.026651295, -0.016174158, -0.02338712..."


#### Set up the LlamaIndex Query Engine

In [18]:
query_engine = index.as_query_engine(
    similarity_top_k=5,
)

## 4. Retrieve Similar Sentences & RAG



Now that the embeddings are stored in KDB.AI, we can perform semantic search through the LlamaIndex query engine.

### Search 1


In [19]:
search_term1 = "number of interstellar objects in the milky way"

In [20]:
retrieved_chunks = query_engine.retrieve(search_term1)
for i in retrieved_chunks:
    print(i.node.get_text())
    print("____________________")

George Street, Toronto, ON M5S 3H4, Canada
ABSTRACT
The Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering
approximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes
ranging from planet formation to galactic dynamics.
____________________
Therefore the results of this work, based on the observed stellar population of the
Milky Way, should give a much more accurate prediction for the Milky Way’s population of ISOs.
6.2.
____________________
In this work, we develop
this method and apply it to the stellar population of the Milky Way, estimated with data from the APOGEE survey, to
predict a broader set of properties of our own Galaxy’s population of interstellar objects. We predict the distribution
of ISOs in both their spatial position in the Galaxy and their water mass fraction.
____________________
3.PREDICTING THE INTERSTELLAR OBJECT DISTRIBUTION
In the previous section, we calculated the sine morte stellar pop

We can see these sentences do reference our search term 'number of interstellar objects in the milky way' in some way.

### Now we can perform RAG, passing the retrieved chunks from above to the LLM to generate a response:

In [21]:
result = query_engine.query(search_term1)
print(result.response)

Approximately 10^15 interstellar objects (ISOs) are estimated to be present per cubic parsec around the Sun in the Milky Way.
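
Conceptually, the RAG step places the retrieved chunks into a prompt alongside the question before calling the LLM. The sketch below shows one plausible way to assemble such a prompt. The wording of the template and the `build_rag_prompt` helper are assumptions for illustration, and LlamaIndex uses its own internal prompt templates.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved text chunks and a question into a single LLM prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Two excerpts taken from the chunks retrieved for the first search.
example_chunks = [
    "The Milky Way is thought to host a huge population of interstellar objects (ISOs)",
    "We predict the distribution of ISOs in both their spatial position in the Galaxy "
    "and their water mass fraction.",
]
prompt = build_rag_prompt(
    "number of interstellar objects in the milky way", example_chunks
)
print(prompt)
```

Because the model only sees what is in the prompt, the quality of the generated answer depends directly on which chunks the retrieval step returns.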


### Search 2

Let's try another search term.

In [22]:
search_term2 = "how does planet formation occur"

In [23]:
retrieved_chunks = query_engine.retrieve(search_term2)
for i in retrieved_chunks:
    print(i.node.get_text())
    print("____________________")

A significant number of planetesimals can
also be ejected by close stellar flybys early in a planetary system’s life (e.g. Pfalzner et al. 2021). The protoplanetary
disks of other stars are therefore expected to be a source of ISOs (Stern 1990; Moro-Mart´ ın 2022).
____________________
(2020) argue that the mass of planet-forming material in a protoplanetary disk is proportional
to both the mass of the host star mass M∗and its metal mass fraction Z— well approximated by Z⊙10[Fe/H]for
small values of Z ( Z⊙= 0.0153, Caffau et al. (2011)).
____________________
Protoplanetary Disk Model
We make the foundational assertion that all ISOs we consider form as planetesimals in a protoplanetary disk (’Oumua-
mua ISSI Team et al. 2019). A protoplanetary disk has to first order the same composition as the star it forms around,
since they both form from the same molecular cloud core.
____________________
Additionally, scattering by giant planets may form ISOs with extra
fragmentation, reweighting t

Again, we can see these sentences do reference our search term 'how does planet formation occur' in some way.

In [24]:
result = query_engine.query(search_term2)
print(result.response)

Planet formation occurs within protoplanetary disks, where planetesimals form from the material present in the disk. These disks have a composition similar to that of the host star, as they originate from the same molecular cloud core. The mass of the planet-forming material is influenced by the mass of the host star and its metal mass fraction. Additionally, interactions with giant planets can lead to the fragmentation of planetesimals, resulting in a greater number of interstellar objects being produced from the same amount of material. The formation processes are particularly efficient for planetesimals located outside the water ice line, which contributes to a larger reservoir of material available for planet formation.


## 5. Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [None]:
table.drop()

True

## Take Our Survey

We hope you found this sample helpful! Your feedback is important to us, and we would appreciate it if you could take a moment to fill out our brief survey. Your input helps us improve our content.

[**Take the Survey**](https://delighted.com/t/ejgOzTpo)