In [1]:
from getpass import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')

OpenAI API key: ········


In [2]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import KDBAI

PyKX version: 1.6.0


In [3]:
PDF = "/opt/kx/data/HNSW.pdf"

loader = PyPDFLoader(PDF)
pages = loader.load_and_split()
len(pages)

25

In [4]:
# pages = pages[:1]  # Uncomment to test the pipeline on the first page only

In [5]:
%%time

import importlib
from langchain.vectorstores import kdbai
importlib.reload(kdbai)  # pick up local edits to the kdbai module without restarting the kernel

TEMP = 0.0  # LLM sampling temperature
K = 10      # number of chunks the retriever returns per query

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
options = dict(efConstruction=8, efSearch=8)  # M is omitted, so DEFAULT_M is used (32 in the fitParams output)
vectordb = kdbai.KDBAI.from_documents(pages,
                                      index_name="myIndex",
                                      embedding=embeddings,
                                      persist_directory='tmp',
                                      options=options)
print('fitParams of the model:-')
print(vectordb._model['fitParams'])

# vectordb.persist()
# chain_type='stuff' places all K retrieved chunks into a single prompt for the LLM.
qabot = RetrievalQA.from_chain_type(chain_type='stuff',
                                    llm=ChatOpenAI(model='gpt-3.5-turbo-16k', temperature=TEMP),
                                    retriever=vectordb.as_retriever(search_kwargs=dict(k=K)),
                                    return_source_documents=True)

PyKX version: 1.6.0
The model myIndex for algorithm 'hnsw' is now created.
Setting the table: '.kdbai.table.data' as a global table within this process

fitParams of the model:-
dims          | 1536
efConstruction| 8
efSearch      | 8
M             | 32
errors        | 1b
CPU times: user 422 ms, sys: 32.4 ms, total: 455 ms
Wall time: 3.99 s
Updating index with new vector data
Updating in-memory table: '.kdbai.table.data' with new data
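
The fitParams output reports M as 32 even though the options dict only set efConstruction and efSearch. A minimal sketch of how such a fallback might work is shown here. The helper name `with_default_m` is hypothetical, and the value 32 is taken from the output above and only assumed to be the library default; this is not the kdbai module's actual code.

```python
# Value observed in the fitParams output; assumed (not confirmed) to be the library default.
DEFAULT_M = 32

def with_default_m(options):
    """Return a copy of options with M filled in when the caller left it out."""
    merged = dict(options)             # copy so the caller's dict is not mutated
    merged.setdefault("M", DEFAULT_M)  # an explicit M always wins over the default
    return merged

print(with_default_m(dict(efConstruction=8, efSearch=8)))
print(with_default_m(dict(M=16, efConstruction=8, efSearch=8)))
```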


In [6]:
%%time

Q = 'Summarize the content of this research paper'

print(qabot(dict(query=Q))['result'])

This research paper introduces a new approach for approximate nearest neighbor search called Hierarchical Navigable Small World (HNSW). The HNSW algorithm is a fully graph-based structure that aims to improve the performance and scalability of similarity search in large datasets. The algorithm utilizes proximity graphs and a heuristic for selecting neighbors to achieve logarithmic complexity scaling. The paper presents experimental results comparing HNSW with other state-of-the-art algorithms and demonstrates its superior performance in terms of query time and recall. The HNSW algorithm is also shown to be suitable for distributed implementation.
CPU times: user 113 ms, sys: 3.63 ms, total: 117 ms
Wall time: 11.9 s
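
Because the chain was built with return_source_documents=True, the response dict should also carry the retrieved chunks under 'source_documents', alongside 'result'. The sketch below lists them with their source file and page. The helper `format_sources` is hypothetical, and the stand-in response uses placeholder text so the block runs without an API key; with the real chain the call would be `format_sources(qabot(dict(query=Q)))`.

```python
from types import SimpleNamespace

def format_sources(response, max_chars=80):
    """Return one line per retrieved chunk: source file, page, and a short preview."""
    lines = []
    for doc in response.get("source_documents", []):
        meta = doc.metadata
        # Collapse whitespace so multi-line chunks fit on one line.
        preview = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{meta.get('source', '?')} p.{meta.get('page', '?')}: {preview}")
    return lines

# Stand-in for a chain response; the chunk text is a placeholder, not real output.
fake_response = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(page_content="example  chunk\ntext",
                        metadata={"source": "/opt/kx/data/HNSW.pdf", "page": 2}),
    ],
}

for line in format_sources(fake_response):
    print(line)
```

Checking which pages were retrieved is a quick way to judge whether an answer is grounded in the paper or in the model's general knowledge.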


In [7]:
%%time 

Q = """
What are the parameters used when building the index or when searching ?
Give me their names and how to tune those to improve the output.
"""

print(qabot(dict(query=Q))['result'])

The parameters used when building the index in Hierarchical NSW are:

1. M: This parameter determines the maximum number of connections that an element can have in each layer of the index. It can be tuned to balance the trade-off between search performance and memory usage. Smaller values of M generally produce better results for lower recalls and/or lower dimensional data, while larger values of M are better for high recall and/or high dimensional data.

2. efConstruction: This parameter controls the trade-off between index construction time and index quality. A higher value of efConstruction leads to longer construction time but potentially better index quality. It can be set to achieve a desired recall during the construction process.

The parameters used when searching in Hierarchical NSW are:

1. efSearch: This parameter controls the trade-off between search time and search quality. A higher value of efSearch leads to longer search time but potentially better search quality. It ca
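
The answer above frames M, efConstruction, and efSearch as trade-offs against search quality. One common way to quantify that quality when tuning is recall@k: the fraction of the true k nearest neighbours that the approximate index returned. The sketch below is a generic illustration, not part of the kdbai API; the function name and the ID lists are made up for the example.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours found by the approximate search."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth) if truth else 0.0

# Illustrative IDs only: the approximate search missed neighbour 4.
print(recall_at_k([3, 7, 9, 1], [3, 9, 4, 1], k=4))  # 0.75
```

Under this approach, one would raise efSearch (or rebuild with a larger efConstruction or M) until recall stops improving enough to justify the extra time.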

In [8]:
%%time

Q = 'What about the K parameter ?'

print(qabot(dict(query=Q))['result'])

The K parameter in the K-Nearest Neighbor Search (K-NNS) algorithm refers to the number of nearest neighbors that the algorithm should return. It determines how many elements from the dataset will be considered as potential matches for the query. 

The choice of the K parameter depends on the specific application and the desired trade-off between accuracy and computational efficiency. A smaller value of K will result in faster search times but may lead to a higher chance of missing some relevant neighbors. On the other hand, a larger value of K will provide more accurate results but may require more computational resources and time.

It is important to note that the optimal value of K can vary depending on the dataset and the specific problem at hand. It is often determined through experimentation and performance evaluation.
CPU times: user 102 ms, sys: 1.08 ms, total: 103 ms
Wall time: 15.1 s
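
In this notebook, K is passed to the retriever as search_kwargs=dict(k=K), so it sets how many chunks are fetched for each question. To make the meaning of K concrete, here is a minimal exact (brute-force) k-nearest-neighbour search in plain Python. It is a sketch using Euclidean distance on toy 2-D points, not the HNSW search the index performs.

```python
import heapq
import math

def knn(query, vectors, k):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: math.dist(query, vectors[i]))

# Toy data for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(knn((0.1, 0.0), points, k=2))  # [0, 1]
```

A larger k returns more candidates at a higher cost; with a 'stuff' chain it also lengthens the prompt sent to the LLM.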
