# Advanced RAG with temporal filters using LlamaIndex and KDB.AI vector store

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

> [KDB.AI](https://kdb.ai/) is a powerful knowledge-based vector database and search engine that allows you to build scalable, reliable AI applications, using real-time data, by providing advanced search, recommendation and personalization.

This example demonstrates how to use KDB.AI to run semantic search, summarization and analysis of financial regulations before and after a specific point in time.

To access your endpoint and API keys, [sign up to KDB.AI](https://kdb.ai/get-started).

To set up your development environment, follow the instructions on the KDB.AI pre-requisites page.

The following examples demonstrate some of the ways you can interact with KDB.AI through LlamaIndex.

## Install dependencies with Pip

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [None]:
!pip install llama-index llama-index-llms-openai llama-index-embeddings-openai llama-index-readers-file llama-index-vector-stores-kdbai
!pip install kdbai_client pandas

## Import dependencies

In [2]:
from getpass import getpass
import re
import os
import shutil
import time
import logging
import urllib.error
import urllib.request

import pandas as pd

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.kdbai import KDBAIVectorStore

import kdbai_client as kdbai

OUTDIR = "pdf"
RESET = True


#### Set OpenAI API key and choose the LLM and Embedding model to use:

In [3]:
#os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = (
    os.environ["OPENAI_API_KEY"]
    if "OPENAI_API_KEY" in os.environ
    else getpass("OpenAI API Key: ")
)

OpenAI API Key: ··········


In [4]:
EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = 'gpt-3.5-turbo'
#GENERATION_MODEL = "gpt-4o" # Expensive !!!

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)

Settings.llm = llm
Settings.embed_model = embed_model


## Create KDB.AI session and table

In [5]:
# vector DB imports
import os
from getpass import getpass
import kdbai_client as kdbai
import time

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they are used automatically to connect.
If these do not exist, it will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [None]:
#Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
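
The two lookups above follow the same pattern: read an environment variable, and prompt only if it is missing. A minimal sketch of a reusable helper is shown below. `env_or_prompt` and `DEMO_ENDPOINT` are hypothetical names introduced for illustration, and the URL is a placeholder, not a real endpoint.

```python
import os
from getpass import getpass


def env_or_prompt(name, prompt, secret=False):
    """Return the environment variable `name` if set, otherwise ask the user.

    Unlike the cell above, an empty value is treated as unset.
    """
    value = os.environ.get(name)
    if value:
        return value
    # getpass hides the typed value; input echoes it.
    ask = getpass if secret else input
    return ask(prompt)


# Placeholder value so the sketch runs without prompting.
os.environ.setdefault("DEMO_ENDPOINT", "https://example.invalid")
print(env_or_prompt("DEMO_ENDPOINT", "KDB.AI endpoint: "))
```

Passing `secret=True` would suit the API key, since it keeps the value out of the notebook output.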

In [7]:
#connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [None]:
#session = kdbai.Session()

### Create the schema for your KDB.AI table

***!!! Note:*** The 'dims' parameter in the embedding column must reflect the output dimensions of the embedding model you choose.


- OpenAI 'text-embedding-3-small' outputs 1536 dimensions.
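
A mismatch between `dims` and the embedding model is easy to miss until insertion fails, so a small check before creating the table can help. The sketch below is one possible way to do this; `EMBEDDING_DIMS`, `schema_dims` and `check_schema` are hypothetical names introduced here, and the schema is a reduced copy containing only the embedding column.

```python
# Output dimensions of some OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def schema_dims(schema):
    """Return `dims` from the first column that declares a vectorIndex."""
    for column in schema["columns"]:
        if "vectorIndex" in column:
            return column["vectorIndex"]["dims"]
    raise ValueError("schema has no vectorIndex column")


def check_schema(schema, model):
    """Raise ValueError if the schema dims do not match the model output."""
    expected = EMBEDDING_DIMS[model]
    actual = schema_dims(schema)
    if actual != expected:
        raise ValueError(
            "dims=%d does not match %s (%d)" % (actual, model, expected)
        )
    return actual


example_schema = dict(
    columns=[
        dict(
            name="embedding",
            vectorIndex=dict(type="flat", metric="L2", dims=1536),
        ),
    ]
)
print(check_schema(example_schema, "text-embedding-3-small"))
```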

In [8]:

schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),
        dict(name="text", pytype="bytes"),
        dict(
            name="embedding",
            vectorIndex=dict(type="flat", metric="L2", dims=1536),
        ),
        dict(name="title", pytype="bytes"),
        dict(name="publication_date", pytype="datetime64[ns]"),
    ]
)

In [9]:
KDBAI_TABLE_NAME = "reports"

# First ensure the table does not already exist
if KDBAI_TABLE_NAME in session.list():
    session.table(KDBAI_TABLE_NAME).drop()

#Create the table
table = session.create_table(KDBAI_TABLE_NAME, schema)

## Financial reports URLs and metadata

In [10]:
INPUT_URLS = [
    "https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf",
    "https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf",
]

METADATA = {
    "pdf/PLAW-106publ102.pdf": {
        "title": "GRAMM–LEACH–BLILEY ACT, 1999",
        "publication_date": pd.to_datetime("1999-11-12"),
    },
    "pdf/PLAW-111publ203.pdf": {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010",
        "publication_date": pd.to_datetime("2010-07-21"),
    },
}

## Download PDF files locally

In [11]:
%%time

CHUNK_SIZE = 512 * 1024


def download_file(url):
    """Stream `url` into OUTDIR in CHUNK_SIZE blocks and return the local path."""
    print("Downloading %s..." % url)
    out = os.path.join(OUTDIR, os.path.basename(url))
    try:
        # The context managers close the connection and the file even on error.
        with urllib.request.urlopen(url) as response, open(out, "wb") as f:
            shutil.copyfileobj(response, f, CHUNK_SIZE)
    except urllib.error.URLError:
        logging.exception("Failed to download %s !" % url)
        raise  # Do not return a path to a missing or partial file.
    return out


if RESET:
    if os.path.exists(OUTDIR):
        shutil.rmtree(OUTDIR)
    os.mkdir(OUTDIR)

    local_files = [download_file(x) for x in INPUT_URLS]
    local_files[:10]

Downloading https://www.govinfo.gov/content/pkg/PLAW-106publ102/pdf/PLAW-106publ102.pdf...
Downloading https://www.govinfo.gov/content/pkg/PLAW-111publ203/pdf/PLAW-111publ203.pdf...
CPU times: user 39.7 ms, sys: 19.1 ms, total: 58.8 ms
Wall time: 3.33 s


## Load local PDF files with LlamaIndex

In [12]:
%%time

def get_metadata(filepath):
    return METADATA[filepath]


documents = SimpleDirectoryReader(
    input_files=local_files,
    file_metadata=get_metadata,
)

docs = documents.load_data()
len(docs)

CPU times: user 13.8 s, sys: 89.5 ms, total: 13.9 s
Wall time: 14 s


994
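
`SimpleDirectoryReader` calls the `file_metadata` function with each file's path, so the lookup only works when that path string matches a `METADATA` key exactly. A minimal sketch of a lookup keyed by file name is shown below, assuming the path may arrive in relative or absolute form. `METADATA_BY_NAME` and `get_metadata_by_name` are hypothetical names, and `datetime.date` stands in for `pd.to_datetime` only to keep the sketch free of dependencies.

```python
import os
from datetime import date

# Hypothetical variant of METADATA, keyed by file name instead of relative path.
METADATA_BY_NAME = {
    "PLAW-106publ102.pdf": {
        "title": "GRAMM–LEACH–BLILEY ACT, 1999",
        "publication_date": date(1999, 11, 12),
    },
    "PLAW-111publ203.pdf": {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010",
        "publication_date": date(2010, 7, 21),
    },
}


def get_metadata_by_name(filepath):
    """Look up metadata by base name so relative and absolute paths both match."""
    return METADATA_BY_NAME[os.path.basename(filepath)]


print(get_metadata_by_name("pdf/PLAW-106publ102.pdf")["title"])
print(get_metadata_by_name("/tmp/work/pdf/PLAW-111publ203.pdf")["publication_date"])
```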

## Set up the LlamaIndex RAG pipeline using the KDB.AI vector store

In [13]:
%%time

#llm = OpenAI(temperature=0, model=LLM)
vector_store = KDBAIVectorStore(table)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=2048, chunk_overlap=0)],
)

CPU times: user 12.7 s, sys: 213 ms, total: 12.9 s
Wall time: 41.5 s


In [None]:
table.query()

## Set up the LlamaIndex Query Engine

In [15]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 document pages.
# With gpt-4o, the 128k-token context window fits about 100 pages.
K = 15

query_engine = index.as_query_engine(
    similarity_top_k=K,
    filter=[("<", "publication_date", "2008-09-15")],
    sort_by="publication_date",
)

CPU times: user 88.7 ms, sys: 3.89 ms, total: 92.6 ms
Wall time: 98 ms
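
The `filter` argument restricts retrieval to rows whose `publication_date` falls before 15 September 2008. The sketch below illustrates that semantics on plain Python dictionaries. It is not how KDB.AI evaluates filters, and `OPS` and `apply_filters` are hypothetical names introduced here.

```python
import operator
from datetime import date

# Map the comparison symbols used in the filter tuples to Python operators.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
}


def apply_filters(rows, filters):
    """Keep the rows that satisfy every (op, column, iso_date) condition."""
    kept = []
    for row in rows:
        if all(
            OPS[op](row[col], date.fromisoformat(value))
            for op, col, value in filters
        ):
            kept.append(row)
    return kept


rows = [
    {"title": "GRAMM–LEACH–BLILEY ACT, 1999", "publication_date": date(1999, 11, 12)},
    {
        "title": "DODD-FRANK WALL STREET REFORM AND CONSUMER PROTECTION ACT, 2010",
        "publication_date": date(2010, 7, 21),
    },
]

before = apply_filters(rows, [("<", "publication_date", "2008-09-15")])
after = apply_filters(rows, [(">=", "publication_date", "2008-09-15")])
print([row["title"] for row in before])
print([row["title"] for row in after])
```

Swapping `<` for `>=` selects the complementary set, so every row falls on exactly one side of the cutoff.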


## Before the 2008 crisis

In [16]:
%%time

result = query_engine.query(
    """
What was the main financial regulation in the US before the 2008 financial crisis ?
"""
)
print(result.response)

The main financial regulation in the US before the 2008 financial crisis was the Glass-Steagall Act, which was enacted in 1933. This act separated commercial banking from investment banking and aimed to reduce the risks associated with financial speculation. However, many of its provisions were repealed by the Gramm-Leach-Bliley Act in 1999, which allowed for the consolidation of financial services companies and is often cited as a contributing factor to the financial instability that led to the 2008 crisis.
CPU times: user 179 ms, sys: 5.84 ms, total: 185 ms
Wall time: 4.1 s


In [17]:
%%time

result = query_engine.query(
    """
Was the Gramm-Leach-Bliley Act of 1999 enough to prevent the 2008 crisis? Search the document and explain its strengths and weaknesses in regulating the US stock market.
"""
)
print(result.response)

The Gramm-Leach-Bliley Act of 1999 aimed to enhance competition in the financial services industry by allowing affiliations among banks, securities firms, and insurance companies. While it provided a framework for such affiliations and included provisions for streamlining supervision and protecting financial systems, it had both strengths and weaknesses in regulating the US stock market.

**Strengths:**
1. **Facilitating Affiliations:** The Act repealed sections of the Glass-Steagall Act, allowing banks to affiliate with securities firms and insurance companies, which could lead to more integrated financial services.
2. **Streamlining Supervision:** It aimed to streamline the supervision of bank holding companies and provided authority to state insurance regulators and the SEC, potentially leading to more efficient oversight.
3. **Prudential Safeguards:** The Act included provisions for prudential safeguards to protect the financial system and deposit funds from "too big to fail" insti

## After the 2008 crisis

In [26]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 document pages.
# With gpt-4o, the 128k-token context window fits about 100 pages.
K = 15

query_engine = index.as_query_engine(
    similarity_top_k=K,
    filter=[(">=", "publication_date", "2008-09-15")],
    sort_by="publication_date",
)

CPU times: user 613 µs, sys: 0 ns, total: 613 µs
Wall time: 622 µs


In [27]:
%%time

result = query_engine.query(
    """
What happened on the 15th of September 2008 ?
"""
)
print(result.response)

On September 15, 2008, Lehman Brothers, a major global financial services firm, filed for bankruptcy, marking one of the largest bankruptcies in U.S. history and a significant event in the global financial crisis.
CPU times: user 191 ms, sys: 10.2 ms, total: 201 ms
Wall time: 4.76 s


In [25]:
%%time

result = query_engine.query(
    """
What was the new US financial regulation enacted after the 2008 crisis to increase the market regulation and to improve consumer sentiment ?
"""
)
print(result.response)

The new US financial regulation enacted after the 2008 crisis to increase market regulation and improve consumer sentiment is the Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010.
CPU times: user 293 ms, sys: 20.5 ms, total: 313 ms
Wall time: 31.4 s


## In depth analysis

In [None]:
%%time

# With gpt-3.5-turbo, the 16k-token context window fits only about 15 document pages.
# With gpt-4o, the 128k-token context window fits about 100 pages.
K = 20

query_engine = index.as_query_engine(
    similarity_top_k=K, sort_by="publication_date"
)

CPU times: user 0 ns, sys: 396 µs, total: 396 µs
Wall time: 403 µs
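
The page counts quoted in the comments can be reproduced with rough arithmetic. The sketch below assumes about 1,000 tokens per page and 1,000 tokens reserved for the prompt and the answer. Both figures are illustrative guesses rather than measurements, and `pages_that_fit` is a hypothetical helper.

```python
def pages_that_fit(context_tokens, tokens_per_page=1000, reserved_tokens=1000):
    """Estimate how many pages fit in a context window of `context_tokens`."""
    usable = context_tokens - reserved_tokens
    return max(usable // tokens_per_page, 0)


print(pages_that_fit(16_000))   # 16k-token window
print(pages_that_fit(128_000))  # 128k-token window
```

Under these assumptions the estimate is 15 pages for a 16k-token window and 127 for a 128k-token window, so `K = 20` is above the estimate for `gpt-3.5-turbo`.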


In [None]:
%%time

result = query_engine.query(
    """
Analyse the US financial regulations before and after the 2008 crisis and produce a report of all related arguments to explain what happened, and to ensure that does not happen again.
Use both the provided context and your own knowledge but do mention explicitly which one you use.
"""
)
print(result.response)

The Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010 was a response to the 2008 financial crisis, aiming to strengthen financial regulations. Before the crisis, regulatory oversight was criticized for being insufficient in monitoring large financial institutions, leading to unchecked risky behavior. The Act introduced measures to mitigate risks, such as restrictions on mergers and acquisitions and limits on certain financial activities. It also established the Office of Financial Research and the Financial Stability Oversight Council to enhance monitoring and address systemic risks.

To prevent future crises, the Act focused on increasing transparency, strengthening oversight, and enhancing consumer protection. By implementing stricter regulations and oversight mechanisms, it aimed to prevent excessive risk-taking and promote financial stability.

In summary, the Dodd-Frank Act significantly revamped financial regulations in the US post-2008 crisis to address regulator

## Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [None]:
table.drop()

