## ThirdAI's NeuralDB

NeuralDB provides a high-level API for inserting different types of files and searching through their contents with natural language queries.

First, let's install the dependencies.

In [None]:
!pip3 install thirdai --upgrade

In [None]:
from thirdai import licensing, neural_db as ndb

import nltk

nltk.download("punkt")

import os

if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

Now, let's create a NeuralDB instance.

In [None]:
db = ndb.NeuralDB()

### Insert your files

Let's insert things into it!

Currently, we natively support adding CSV, PDF and DOCX files. We also support automatically scraping and parsing URLs. All other file formats have to be converted into CSV files where each row represents a paragraph/text-chunk of the document.

#### Example 1: CSV files
The first example below shows how to insert a CSV file. Please note that a CSV file is required to have a column named "DOC_ID" with rows numbered from 0 to n_rows-1.

In [None]:
insertable_docs = []
csv_files = ["data/sample_nda.csv"]

for file in csv_files:
    csv_doc = ndb.CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["passage"],
        weak_columns=["para"],
        reference_columns=["passage"],
    )
    insertable_docs.append(csv_doc)
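
For formats that are not supported natively, the conversion into the CSV layout described above can be sketched with the standard library alone. This is a minimal sketch, not part of the NeuralDB API: the helper names `paragraphs_to_csv_text` and `has_sequential_doc_ids` are hypothetical, and splitting on blank lines is only one assumed way of chunking a document. It produces a "DOC_ID" column numbered from 0 and a single "passage" column; the "para" column used as a weak column in the example above is not generated here.

```python
import csv
import io


def paragraphs_to_csv_text(document_text, text_column="passage"):
    """Split plain text on blank lines and return CSV text with a DOC_ID column."""
    paragraphs = [p.strip() for p in document_text.split("\n\n") if p.strip()]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["DOC_ID", text_column])
    for doc_id, paragraph in enumerate(paragraphs):
        # Collapse internal line breaks so each chunk stays on one CSV row.
        writer.writerow([doc_id, " ".join(paragraph.split())])
    return buffer.getvalue()


def has_sequential_doc_ids(csv_text):
    """Check that DOC_ID exists and runs 0, 1, ..., n_rows - 1 in order."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return all(row.get("DOC_ID") == str(i) for i, row in enumerate(rows))


# Illustrative text only; replace with the content extracted from your file.
sample_text = (
    "This Agreement is made by and between A and B.\n\n"
    "The term is five years."
)
csv_text = paragraphs_to_csv_text(sample_text)
print(csv_text)
print(has_sequential_doc_ids(csv_text))
```

Writing `csv_text` to a file gives a path that can be passed to `ndb.CSV` as in Example 1, with the column arguments adjusted to the columns that actually exist in the file.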

#### Example 2: PDF files

In [None]:
insertable_docs = []
pdf_files = ["data/sample_nda.pdf"]

for file in pdf_files:
    pdf_doc = ndb.PDF(file)
    insertable_docs.append(pdf_doc)

#### Example 3: DOCX files

In [None]:
insertable_docs = []
doc_files = ["data/sample_nda.docx"]

for file in doc_files:
    doc = ndb.DOCX(file)
    insertable_docs.append(doc)

#### Example 4: Parse from URLs directly

First, you can use our utility to generate a set of candidate URLs containing data of interest. You can also use your own list of URLs to extract data from.

In [None]:
valid_url_data = ndb.parsing_utils.recursive_url_scrape(
    base_url="https://www.thirdai.com/pocketllm/", max_crawl_depth=0
)

Then you can create a list of insertable documents from those URLs:

In [None]:
url_docs = []
for url, response in valid_url_data:
    try:
        url_docs.append(ndb.URL(url, response))
    except Exception:  # skip URLs that fail to parse
        continue

# The data in the URL is not relevant to the demo, this is just to illustrate how
# url data can be incorporated into NeuralDB
# insertable_docs += url_docs

These insertable docs can be inserted into NeuralDB just like any other document type that we support, as shown below.

### Insert into NeuralDB

If you wish to insert without unsupervised training, you can set 'train=False' in the insert() method.

In [None]:
# Passing a checkpoint config (optional) while inserting will checkpoint the DB state in the specified location.
db.insert(insertable_docs, train=False)

The above command is intended to be used with a base DB which already has reasonable knowledge of the domain. In general, we always recommend using 'train=True' as shown below.

#### Insert and Train

In [None]:
source_ids = db.insert(insertable_docs, train=True)

#### Checkpointing Configuration for NeuralDB (Optional)

When we insert with train=True, a checkpoint configuration can be provided, as shown below, to facilitate recovery in case of machine failures. This configuration allows users to define where and how often checkpoints are saved, and whether to resume from a checkpoint if needed.

In [None]:
checkpoint_config = ndb.CheckpointConfig(
    checkpoint_dir="./data/sample_checkpoint",  # Specify the location for storing checkpoint data
    resume_from_checkpoint=False,  # Set to True if you want to resume from a checkpoint
    checkpoint_interval=3,  # Granularity of checkpoints (lower value implies more frequent checkpoints)
)

source_ids = db.insert(insertable_docs, train=True, checkpoint_config=checkpoint_config)

If you call the insert() method multiple times, the documents will automatically be de-duplicated. If train=True, the training will still be done on every call.

### Search

Now let's start searching.

In [None]:
search_results = db.search(query="what is the termination period", top_k=2)

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print("************")
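
The same print loop is repeated for each query in this section, so it can be wrapped in a small helper. The following is a minimal sketch that relies only on the `.text` attribute used above; `format_results` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a database.

```python
from types import SimpleNamespace


def format_results(results, separator="************"):
    """Return the text of each result followed by a separator line."""
    lines = []
    for result in results:
        lines.append(result.text)
        lines.append(separator)
    return "\n".join(lines)


# Stand-in objects for illustration only; real results come from db.search(...).
fake_results = [
    SimpleNamespace(text="first passage"),
    SimpleNamespace(text="second passage"),
]
print(format_results(fake_results))
```

With a populated DB, `print(format_results(db.search(query="...", top_k=2)))` would produce the same layout as the loop above.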

We can see that the search pulled up the right passage that contains the termination period "(i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret".

In [None]:
search_results = db.search(query="made by and between", top_k=2)

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print("************")

We can see that the search pulled up the right passage again that has "made by and between".

Now let's ask a tricky question.

In [None]:
search_results = db.search(query="who are the parties involved?", top_k=2)

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print("************")

Oops! It looks like when we search for "parties involved", we do not get the correct paragraph in the first position (we should expect the first paragraph as the correct result instead of the last).

No worries, we'll show how to teach the model to correct its retrieval.

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that "parties involved" and the phrase "made by and between" are related.

In [None]:
db.associate(source="who are the parties involved", target="made by and between")

Let's search again with the same query.

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
)

for result in search_results:
    print(result.text)
    # print(result.source)
    # print(result.metadata)
    print("************")

There you go! In just a line, you taught the model to correct itself and retrieve the correct result.

Now, let's see the second option, which is text-to-result association. Let's say you know that "parties involved" should go to the paragraph with DOC_ID=0. You can simply teach the model to associate the query with the corresponding label using the following API.

In [None]:
db.text_to_result("parties involved", 0)

If you want to use the above RLHF methods in a batch instead of a single sample, you can simply use the batched versions of the APIs as shown next.

In [None]:
db.associate_batch(
    [("parties involved", "made by and between"), ("date of signing", "duly executed")]
)

In [None]:
db.text_to_result_batch([("parties involved", 0), ("date of signing", 16)])
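
In practice, the pairs passed to the batched APIs often come from collected user feedback. The sketch below shows one way to build both lists from a feedback log. The log format, its field names, and the `split_feedback` helper are hypothetical and are not part of NeuralDB; only the shape of the resulting lists follows the calls above.

```python
# Hypothetical feedback log; the field names are illustrative, not a NeuralDB format.
feedback_log = [
    {"query": "parties involved", "related_phrase": "made by and between"},
    {"query": "parties involved", "doc_id": 0},
    {"query": "date of signing", "related_phrase": "duly executed"},
    {"query": "date of signing", "doc_id": 16},
]


def split_feedback(records):
    """Separate records into (source, target) pairs and (query, label) pairs."""
    text_pairs, result_pairs = [], []
    for record in records:
        if "related_phrase" in record:
            text_pairs.append((record["query"], record["related_phrase"]))
        if "doc_id" in record:
            result_pairs.append((record["query"], record["doc_id"]))
    return text_pairs, result_pairs


text_pairs, result_pairs = split_feedback(feedback_log)
print(text_pairs)
print(result_pairs)
```

`text_pairs` has the shape passed to `db.associate_batch` and `result_pairs` has the shape passed to `db.text_to_result_batch` in the cells above.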

### Supervised Training (Optional)

If you have supervised data for a specific CSV file in your list, you can simply train the DB on that file by specifying a source_id = source_ids[*file_number_in_your_list*].

Note: The supervised file should have the query_column and id_column that you specify in the following call. The id_column should match the id_column that you specified when creating the CSV document (see "Example 1: CSV files") or default to "DOC_ID".

In [None]:
sup_files = ["data/sample_nda_sup.csv"]

db.supervised_train(
    [
        ndb.Sup(path, query_column="QUERY", id_column="DOC_ID", source_id=source_ids[0])
        for path in sup_files
    ]
)
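
To make the expected layout of a supervised file concrete, here is a minimal sketch that builds one with the standard library, using the "QUERY" and "DOC_ID" column names from the call above. The query/label pairs are illustrative and are not the contents of `data/sample_nda_sup.csv`; `supervised_csv_text` is a hypothetical helper name.

```python
import csv
import io

# Illustrative query/label pairs; not the contents of the sample supervised file.
labeled_queries = [
    ("who are the parties involved", 0),
    ("date of signing", 16),
]


def supervised_csv_text(pairs, query_column="QUERY", id_column="DOC_ID"):
    """Return CSV text with one query and its target document ID per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([query_column, id_column])
    writer.writerows(pairs)
    return buffer.getvalue()


sup_csv = supervised_csv_text(labeled_queries)
print(sup_csv)
```

Each DOC_ID value in such a file should refer to a row ID of the document identified by the `source_id` passed to `ndb.Sup`.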

### Get Answers from OpenAI using LangChain

In this section, we will show how to use LangChain and query OpenAI's QnA module to generate an answer from the references that you retrieve from the above DB. You'll have to specify your own OpenAI key for this module to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [None]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = ""  # Enter your OpenAI key here (must be a string; None raises a TypeError)

In [None]:
from examples.utils import generate_answers


def get_references(query):
    search_results = db.search(query, top_k=3)
    references = []
    for result in search_results:
        references.append(result.text)
    return references


def get_answer(query, references):
    return generate_answers(
        query=query,
        references=references,
    )
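
The implementation of `generate_answers` lives in `examples.utils` and is not shown here. If you swap in another generative model, you will need to assemble the prompt yourself. The sketch below shows one plausible way to combine the query and the retrieved references into a single prompt string. The `build_prompt` helper and the prompt wording are assumptions for illustration and may differ from what `generate_answers` actually does.

```python
def build_prompt(query, references):
    """Combine retrieved passages and the question into one prompt string."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(references, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Placeholder passages; in the notebook these would come from get_references(query).
prompt = build_prompt(
    "what is the effective date of this agreement?",
    ["Example passage one.", "Example passage two."],
)
print(prompt)
```

The resulting string can be sent to any text-generation model, which keeps the retrieval step independent of the model that writes the answer.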

In [None]:
query = "what is the effective date of this agreement?"

references = get_references(query)
print(references)

In [None]:
answer = get_answer(query, references)

print(answer)

### Saving the DB

In [None]:
db.save("sample_nda.ndb")

### Loading the saved DB

In [None]:
# Load the saved DB from disk; on_progress is an optional callback that reports loading progress
new_db = ndb.NeuralDB.from_checkpoint(
    "sample_nda.ndb",
    on_progress=lambda fraction: print(f"{fraction:.0%} done with loading."),  # assumes fraction is in [0, 1]
)