# Semantic Search on PDF Documents with qFlat Index

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This example demonstrates how to use KDB.AI to run semantic search on unstructured text documents.

<div class="alert alert-block alert-info">
<b>Tip:</b> This sample uses ‘qFlat’, a new vector index option in KDB.AI. It supports the same API options as the existing ‘Flat’ index, with the significant difference that the index is stored on disk and memory-mapped as required. This means data inserts have negligible memory and CPU footprints. The vector index can grow and be searched as long as disk space is available, and it works well for datasets of up to 1,000,000 vectors. Among other cases, this makes it a good index for memory-constrained situations such as edge devices.
</div>
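To make the idea of an on-disk, memory-mapped flat index more concrete, here is a minimal sketch using NumPy. It is not how KDB.AI implements qFlat; the file layout and the helper names `append_vectors` and `search_on_disk` are assumptions made purely for illustration.

```python
import os
import tempfile

import numpy as np


def append_vectors(path, vectors):
    """Append float32 vectors to a raw binary file (hypothetical layout)."""
    with open(path, "ab") as f:
        f.write(np.asarray(vectors, dtype=np.float32).tobytes())


def search_on_disk(path, query, dims, n=3):
    """Exhaustive (flat) L2 search over a memory-mapped vector file."""
    # np.memmap maps the file lazily; pages are read from disk on access
    stored = np.memmap(path, dtype=np.float32, mode="r").reshape(-1, dims)
    distances = np.linalg.norm(stored - np.asarray(query, dtype=np.float32), axis=1)
    nearest = np.argsort(distances)[:n]
    return nearest.tolist(), distances[nearest].tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    append_vectors(path, [[0.0, 0.0], [1.0, 0.0]])
    append_vectors(path, [[0.0, 2.0]])  # later inserts only extend the file
    ids, dists = search_on_disk(path, [0.9, 0.1], dims=2, n=2)
    print(ids, dists)
```

Inserts only append bytes to a file, so they need very little memory. This toy version still computes every distance in one pass; a production index would more likely scan the file in blocks.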

Semantic search allows users to perform searches based on the meaning or similarity of the data rather than exact matches. It works by converting the query into a vector representation and then finding similar vectors in the database. This way, even if the query and the data in the database are not identical, the system can identify and retrieve the most relevant results based on their semantic meaning.
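As a rough illustration of this idea, the sketch below ranks three short texts against a query using cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings (an assumption for readability), and cosine similarity is just one common way to compare vectors.

```python
import numpy as np

# Hand-made toy "embeddings"; a real model would produce these vectors
documents = {
    "stars form in gas clouds": [0.9, 0.1, 0.0],
    "comets are icy bodies": [0.7, 0.3, 0.1],
    "bread needs yeast to rise": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # stands in for an embedded search query


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Sort texts from most to least similar to the query
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query_vector, documents[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, documents[text]):.3f}  {text}")
```

The two astronomy texts rank above the baking text even though none of them shares a word with a query; only the vectors are compared.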

### Aim
In this tutorial, we'll walk you through the process of performing semantic search on documents, using PDFs as an example and KDB.AI as the vector store. We will cover the following topics:

1. Load PDF Data
2. Create Sentence Vector Embeddings
3. Store Embeddings in KDB.AI
4. Search For Similar Sentences To A Target Sentence
5. Delete the KDB.AI Table

---

## 0. Setup

### Install dependencies

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

### Set Environment Variables

In [None]:
%pip install kdbai_client pypdf sentence_transformers langchain langchain-community


In [None]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads the research paper PDF into your environment
!mkdir -p ./data
!wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/document_search/data/research_paper.pdf

In [3]:
import os

In [4]:
### ignore tensorflow warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

### Import Packages

In [5]:
# load data
import pypdf
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

In [None]:
# embeddings
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

In [7]:
# vector DB
import os
import kdbai_client as kdbai
from getpass import getpass
import time

### Configure Console

In [8]:
pd.set_option("display.max_colwidth", 300)

### Define Helper Functions

In [9]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

## 1. Load PDF Data

### Read Text From PDF Document

We use PyPDF through LangChain's `PyPDFLoader`. The code below extracts the text content of each page of the PDF; the text is split into chunks in the next step.

The PDF we are using is [this research paper](https://arxiv.org/pdf/2308.05801.pdf) presenting information on the formation of Interstellar Objects in the Milky Way.

In [10]:
loader = PyPDFLoader("data/research_paper.pdf")
doc = loader.load()

### Split The Text Into Chunks

In [11]:
# Chunk the documents into 500-character chunks using LangChain's text splitter "RecursiveCharacterTextSplitter"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

In [12]:
# split_documents produces a list of all the chunks created; the first chunk is printed below as an example
pdf_chunks = [p.page_content for p in text_splitter.split_documents(doc)]

In [14]:
pdf_chunks[0]

'Draft version August 14, 2023\nTypeset using L ATEX default style in AASTeX631\nThe Galactic Interstellar Object Population: A Framework for Prediction and Inference\nMatthew J. Hopkins\n ,1Chris Lintott\n ,1Michele T. Bannister\n ,2J. Ted Mackereth\n ,3, 4, 5, ∗and\nJohn C. Forbes\n2\n1Department of Physics, University of Oxford, Denys Wilkinson Building, Keble Road, Oxford, OX1 3RH, UK'

In [19]:
# Replace newlines with spaces so each chunk reads as continuous text
pdf_chunks = [chunk.replace("\n", " ") for chunk in pdf_chunks]
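To show what the `chunk_size` and `chunk_overlap` settings control, here is a deliberately simplified fixed-width splitter. It is not LangChain's algorithm (the recursive splitter prefers to break on separators such as paragraph and line breaks), and `chunk_text` is a hypothetical helper written only for this illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as used in this tutorial, every character lands in exactly one chunk. A non-zero overlap repeats some text across neighbouring chunks, which can help when a relevant sentence straddles a chunk boundary.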


## 2. Create Sentence Vector Embeddings

Next, we use the Sentence Transformers library to create embeddings for our collection of sentences.

### Selecting a Sentence Transformer Model

There are 100+ different Sentence Transformers models available; see [HuggingFace](https://huggingface.co/sentence-transformers) for the full list. The diversity among these primarily stems from variations in their training data. Selecting the ideal model for your needs involves matching the domain and task closely, while also considering the benefits of models trained on larger datasets.

This tutorial uses the `all-MiniLM-L6-v2` pre-trained model. This embedding model creates sentence and document embeddings that can be used for a wide variety of tasks, including semantic search, which makes it a good choice for our needs.

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

### Generate Sentence Embeddings Using This Model

We prepare embeddings by applying the sentence transformer model to our text chunks to encode them. Then we convert the result into a DataFrame, which is the format accepted by KDB.AI.

In [22]:
# Create embeddings
embeddings_list = model.encode(np.array(pdf_chunks)).tolist()
embeddings_df = pd.DataFrame({"vectors": embeddings_list, "sentences": pdf_chunks})

In [23]:
show_df(embeddings_df)

(177, 2)


Unnamed: 0,vectors,sentences
0,"[-0.05836932361125946, -0.09570957720279694, 0.09096840023994446, 0.08964202553033829, -0.0016967009287327528, -0.007497997023165226, -0.04162970557808876, 0.052305180579423904, 0.0187312513589859, 0.04670504108071327, -0.010419736616313457, -0.08292997628450394, -0.024917449802160263, -0.046881...","Draft version August 14, 2023 Typeset using L ATEX default style in AASTeX631 The Galactic Interstellar Object Population: A Framework for Prediction and Inference Matthew J. Hopkins ,1Chris Lintott ,1Michele T. Bannister ,2J. Ted Mackereth ,3, 4, 5, ∗and John C. Forbes 2 1Department of Phys..."
1,"[-0.0750979408621788, -0.0819830521941185, 0.07326588034629822, 0.09153451770544052, 0.03397738188505173, -0.055389612913131714, -0.05017055571079254, -0.04132988676428795, 0.05165640637278557, -0.021904826164245605, -0.030745334923267365, -0.09236758947372437, -0.04359031468629837, -0.031650699...","2School of Physical and Chemical Sciences—Te Kura Mat¯ u, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand 3Just Group plc, Enterprise House, Bancroft road, Reigate, Surrey RH2 7RP, UK 4Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. Georg..."
2,"[-0.0873275175690651, -0.11289384961128235, 0.06424403935670853, 0.07887156307697296, 0.04632219672203064, -0.07177848368883133, -0.04732842743396759, 0.038410428911447525, 0.029449230059981346, 0.03104921244084835, 0.0027609155513346195, -0.10972756892442703, -0.0008616013801656663, -0.02070042...","The Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering approximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes ranging from planet formation to galactic dynamics. We define a novel framework: firstly to predict the pro..."
3,"[-0.0460510216653347, -0.0047828322276473045, 0.06719707697629929, 0.07736083120107651, -0.0483601875603199, -0.02022170089185238, 0.012577960267663002, 0.027407987043261528, 0.03376094996929169, 0.04556810110807419, 0.03773481771349907, -0.13278351724147797, 0.02866959013044834, -0.004317905288...",predicted population to what is observed. We predict the spatial and compositional distribution of the Galaxy’s population of ISOs by modelling the Galactic stellar population with data from the APOGEE survey and combining this with a protoplanetary disk chemistry model. Selecting ISO water mass...
4,"[-0.09910354763269424, 0.014781934209167957, 0.06315221637487411, 0.07686097919940948, 0.035197123885154724, -0.018099332228302956, 0.02634289488196373, 0.009751057252287865, -0.012997588142752647, -0.03471815958619118, -0.06396297365427017, -0.1404487043619156, 0.04472656175494194, 0.0024918657...",and averaged over the Galactic disk; our prediction for the Solar neighbourhood is compatible with the inferred water mass fraction of 2I/Borisov. We show that the well-studied Galactic stellar metallicity gradient has a corresponding ISO compositional gradient. We also demonstrate the inference...


It is important to note the dimension of our embeddings is 384. This will need to match the dimensions we set in the KDB.AI index in the next step. We can easily check this using `len` to count elements in our vector.

In [24]:
len(embeddings_df["vectors"][0])

384
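Checking a single vector assumes every other vector has the same length. The sketch below extends the check to the whole collection; `embedding_dims` is a hypothetical helper, and the placeholder vectors only mimic the shape of real embeddings.

```python
def embedding_dims(vectors):
    """Return the dimension shared by all vectors, or raise if they differ."""
    lengths = {len(v) for v in vectors}
    if len(lengths) != 1:
        raise ValueError(f"expected one vector length, found: {sorted(lengths)}")
    return lengths.pop()


EXPECTED_DIMS = 384  # the embedding dimension reported in this tutorial

# Placeholder data standing in for embeddings_df["vectors"]
placeholder_vectors = [[0.0] * EXPECTED_DIMS for _ in range(5)]

dims = embedding_dims(placeholder_vectors)
print(dims == EXPECTED_DIMS)
```

Running a check like this before inserting data can surface a mismatch early, rather than when the vector database rejects the rows.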

## 3. Store Embeddings in KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient searching.

### Define KDB.AI Session

KDB.AI comes in two offerings:

1. [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
2. [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system containing your KDB.AI Cloud portal details, these variables will automatically be used to connect.
If these do not exist, it will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [None]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

In [26]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [None]:
# session = kdbai.Session(endpoint="http://localhost:8082")

### Define Vector DB Table Schema

The next step is to define a schema for our KDB.AI table where we will store our embeddings. Our table will have two columns.

At this point you will select the index and metric you want to use for searching.

In this case, we will use the qFlat index, Euclidean Distance (L2) for the search metric, and we specify the number of dimensions of our embeddings (384).

In [27]:
pdf_schema = {
    "columns": [
        {"name": "sentences", "pytype": "str"},
        {
            "name": "vectors",
            "pytype": "float32",
            "vectorIndex": {"dims": 384, "metric": "L2", "type": "qFlat"},
        },
    ]
}

### Create Vector DB Table

Use the KDB.AI `create_table` function to create a table that matches the defined schema in the vector database.

In [28]:
# First ensure the table does not already exist
try:
    session.table("pdf").drop()
except kdbai.KDBAIException:
    pass

In [29]:
table = session.create_table("pdf", pdf_schema)

We can use `query` to see our table exists but is empty.

In [30]:
table.query()

Unnamed: 0,sentences,vectors


### Add Embedded Data to KDB.AI Table

In [31]:
table.insert(embeddings_df)

'Insert successful'

### Verify Data Has Been Inserted

Running `table.query()` should show us that data has been added.

In [32]:
show_df(table.query())

(177, 2)


Unnamed: 0,vectors,sentences
0,"[-0.058369324, -0.09570958, 0.0909684, 0.089642026, -0.0016967009, -0.007497997, -0.041629706, 0.05230518, 0.018731251, 0.04670504, -0.010419737, -0.08292998, -0.02491745, -0.04688102, -0.03619034, -0.010246917, 0.010703141, -0.08851221, 0.026034398, 0.046041407, 0.05991257, 0.06968561, 0.007134...","Draft version August 14, 2023 Typeset using L ATEX default style in AASTeX631 The Galactic Interstellar Object Population: A Framework for Prediction and Inference Matthew J. Hopkins ,1Chris Lintott ,1Michele T. Bannister ,2J. Ted Mackereth ,3, 4, 5, ∗and John C. Forbes 2 1Department of Phys..."
1,"[-0.07509794, -0.08198305, 0.07326588, 0.09153452, 0.033977382, -0.055389613, -0.050170556, -0.041329887, 0.051656406, -0.021904826, -0.030745335, -0.09236759, -0.043590315, -0.0316507, 0.06475058, 0.012436966, -0.037336867, -0.08578171, 0.0125643825, 0.008734101, -0.031784117, 0.0033918184, -0....","2School of Physical and Chemical Sciences—Te Kura Mat¯ u, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand 3Just Group plc, Enterprise House, Bancroft road, Reigate, Surrey RH2 7RP, UK 4Canadian Institute for Theoretical Astrophysics, University of Toronto, 60 St. Georg..."
2,"[-0.08732752, -0.11289385, 0.06424404, 0.07887156, 0.046322197, -0.07177848, -0.047328427, 0.03841043, 0.02944923, 0.031049212, 0.0027609156, -0.10972757, -0.0008616014, -0.020700429, 0.030852202, -0.10038987, 0.03392497, -0.112470984, -0.009184576, 0.003726836, 0.0033985798, 0.073350705, -0.055...","The Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering approximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes ranging from planet formation to galactic dynamics. We define a novel framework: firstly to predict the pro..."
3,"[-0.04605102, -0.004782832, 0.06719708, 0.07736083, -0.048360188, -0.0202217, 0.01257796, 0.027407987, 0.03376095, 0.0455681, 0.037734818, -0.13278352, 0.02866959, -0.0043179053, 0.02555343, -0.09984318, -0.018692864, -0.025918748, 0.06574646, 0.028952502, -0.015121564, 0.07768005, -0.036640517,...",predicted population to what is observed. We predict the spatial and compositional distribution of the Galaxy’s population of ISOs by modelling the Galactic stellar population with data from the APOGEE survey and combining this with a protoplanetary disk chemistry model. Selecting ISO water mass...
4,"[-0.09910355, 0.014781934, 0.06315222, 0.07686098, 0.035197124, -0.018099332, 0.026342895, 0.009751057, -0.012997588, -0.03471816, -0.06396297, -0.1404487, 0.04472656, 0.0024918658, 0.011178071, -0.09807016, 0.012129602, -0.024170477, 0.020960514, 0.02431601, -0.014627528, 0.07360536, -0.0052580...",and averaged over the Galactic disk; our prediction for the Solar neighbourhood is compatible with the inferred water mass fraction of 2I/Borisov. We show that the well-studied Galactic stellar metallicity gradient has a corresponding ISO compositional gradient. We also demonstrate the inference...


## 4. Search For Similar Sentences To A Target Sentence

Now that the embeddings are stored in KDB.AI, we can perform semantic search using `search`.

### Search 1

First, we embed our search term using the Sentence Transformer model as before. Then we search our index to return the three most similar vectors.

In [33]:
search_term1 = "number of interstellar objects in the milky way"

In [34]:
encoded_search_term1 = model.encode(search_term1).tolist()

In [35]:
results1 = table.search([encoded_search_term1], n=3)
results1[0]

Unnamed: 0,__nn_distance,vectors,sentences
0,0.640539,"[-0.058295775, -0.09771143, 0.06014487, 0.074923225, 0.0048555867, -0.07818227, -0.016815603, 0.0032223102, 0.015408782, -0.010816202, 0.05693562, -0.1279278, 0.004590574, -0.016816303, -0.011035758, -0.03767118, 0.012765638, -0.09253237, 0.021211633, 0.00905855, -0.017835798, 0.04149379, -0.015...","1I/‘Oumuamua (Meech et al. 2017) and 2I/Borisov1are the first two observed samples from a highly numerous population: interstellar objects (ISOs). Estimated to number ∼1015pc−3around the Sun (Engelhardt et al. 2017; Do et al. 2018), they are implied to have a spatial distribution spanning the en..."
1,0.65308,"[-0.08732752, -0.11289385, 0.06424404, 0.07887156, 0.046322197, -0.07177848, -0.047328427, 0.03841043, 0.02944923, 0.031049212, 0.0027609156, -0.10972757, -0.0008616014, -0.020700429, 0.030852202, -0.10038987, 0.03392497, -0.112470984, -0.009184576, 0.003726836, 0.0033985798, 0.073350705, -0.055...","The Milky Way is thought to host a huge population of interstellar objects (ISOs), numbering approximately 1015pc−3around the Sun, which are formed and shaped by a diverse set of processes ranging from planet formation to galactic dynamics. We define a novel framework: firstly to predict the pro..."
2,0.792991,"[-0.05955432, -0.04058243, 0.05124921, 0.024224859, -0.000807483, -0.09936961, -0.025115658, 0.04463438, -0.0040212194, 0.0103443, 0.008752798, -0.105754204, -0.041435424, -0.032096222, 0.053592417, -0.0024946833, 0.08261293, -0.034735326, 0.022979582, 0.009241539, 0.026665753, 0.04567836, 0.006...","a sample of tens of 1I/‘Oumuamua-like ISOs (e.g. Levine et al. 2021), as well as any more 2I/Borisov-like cometary ISOs that enter our Solar System. This is in addition to the continuing contributions of the NEO surveys and other observatories that found 1I and 2I in the first place (Meech et al..."


The results returned from `table.search` show the closest matches along with their nearest neighbor distances in the `__nn_distance` column.
We can see these sentences do reference our search term 'number of interstellar objects in the milky way' in some way.
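When reading the distance column, smaller values mean closer matches. As a hedged aside on how L2 distance relates to cosine similarity: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine_similarity`. The sketch below checks this numerically on random vectors. Whether the stored embeddings are unit-length, and whether the reported distance is squared, depends on the model and index and is not stated in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 384-dimensional vectors, scaled to unit length (an assumption)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

l2 = float(np.linalg.norm(a - b))  # Euclidean (L2) distance
cosine = float(a @ b)              # cosine similarity of unit vectors

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(l2 ** 2, 2 - 2 * cosine)
```

Under that assumption, ranking by L2 distance and ranking by cosine similarity give the same order of results.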

### Search 2

Let's try another search term.

In [36]:
search_term2 = "how does planet formation occur"

In [37]:
encoded_search_term2 = model.encode(search_term2).tolist()

In [38]:
results2 = table.search([encoded_search_term2], n=3)
results2[0]

Unnamed: 0,__nn_distance,vectors,sentences
0,0.928834,"[0.047059268, -0.04043316, 0.059950948, 0.04573296, -0.0039455127, -0.05942609, 0.015076651, -0.008425757, 0.06865491, 0.03147973, 0.0057636853, -0.030097032, 0.023744823, -0.04634087, -0.019034341, -0.060542025, -0.07941156, 0.023247978, 0.00544396, 0.02270429, 0.060322043, 0.012773193, -0.0442...","of ISO formation mechanisms, we use this as a reasonable proxy for the number of ISOs produced by each star. However, the number of ISOs produced by a star may not be simply proportional to the mass of planet-forming material, because ISO production also requires the ejection of planetesimals — ..."
1,0.970353,"[-0.09935083, 0.010767108, 0.0982887, 0.061452035, 0.013212075, -0.053271275, 0.022545123, 0.015341881, 0.10350827, 0.03053778, -0.11399655, -0.06025269, 0.073029354, -0.086470425, 0.030716218, -0.10944588, -0.08945113, -0.032149576, -0.010662642, -0.0027274736, 0.056740154, -0.03797836, -0.0789...","for that metallicity, exterior to the water ice line. While in reality, stars will each produce a distribution of ISOs that formed at different positions in their protoplanetary disk and thus have a range of compositions, this simplification of only modelling planetesimals which form exterior to..."
2,1.012795,"[-0.029549774, -0.012195995, 0.0057087704, 0.0638904, 0.033144396, -0.04190236, 0.059939828, -0.003044334, 0.06776189, 0.017902834, -0.08158446, -0.06551176, 0.1142783, -0.07725323, 0.037620135, -0.08376906, -0.023476476, -0.009318634, 0.0010405355, 0.06408926, 0.04694798, 0.0012622206, -0.04820...",the composition of planetesimals formed around stars of different metallicities. They do this for stars with metallicities


Again, we can see these sentences do reference our search term 'how does planet formation occur' in some way.

## 5. Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [39]:
table.drop()

True
