## ThirdAI's NeuralDB

First, let's import the relevant modules and create a NeuralDB instance.

In [None]:
from thirdai import licensing
licensing.activate("")

from thirdai import neural_db as ndb

# Any username works. In the future, this username will let you push models to the model hub.
db = ndb.NeuralDB(user_id="my_user")

### Initialize

At this point, the DB is uninitialized. There are two ways to initialize it.

##### Option 1: Initialize from scratch

In [None]:
db.from_scratch()

##### Option 2: Load from a base DB that we provide

First download the base DB.

In [None]:
import os

checkpoint = "contract_review"
download_link = ""

if not os.path.exists(checkpoint):
    os.system(f"wget -O {checkpoint}.zip '{download_link}'")
    os.system(f"unzip {checkpoint}.zip")
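
The cell above shells out to `wget` and `unzip`, which may not be installed on every machine. As a minimal sketch of the same step using only the standard library, the hypothetical helper `fetch_and_unpack` below downloads and extracts the archive. It assumes, as the `os.path.exists` guard above implies, that the archive unpacks to a directory named after the checkpoint.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_and_unpack(url: str, checkpoint: str, workdir: str = ".") -> Path:
    """Download `<checkpoint>.zip` from `url` and extract it into `workdir`.

    Hypothetical helper, not part of thirdai. Skips all work when
    `<workdir>/<checkpoint>` already exists, like the guard in the cell above.
    """
    workdir_path = Path(workdir)
    target = workdir_path / checkpoint
    if target.exists():
        return target

    archive = workdir_path / f"{checkpoint}.zip"
    urllib.request.urlretrieve(url, archive)  # stdlib stand-in for wget
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir_path)  # stdlib stand-in for unzip
    return target
```

With this helper, the download step would read `fetch_and_unpack(download_link, checkpoint)`.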

Then load the checkpoint.

In [None]:
db.from_checkpoint(checkpoint)

### Let's insert things into it!

In [None]:
pdf_files = ['mutual_nda.pdf']  # You can have as many paths as you want here.
pdf_docs = [ndb.SentenceLevelPDF(file) for file in pdf_files]
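
If the PDFs live in one folder, listing them by hand gets tedious. The sketch below uses a hypothetical helper, `collect_pdf_paths`, to gather every `.pdf` file directly inside a folder; the folder name in the usage note is an assumption for illustration.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(folder: str) -> List[str]:
    """Return the sorted paths of all `.pdf` files directly inside `folder`.

    Hypothetical helper, not part of thirdai. Subfolders are not searched.
    """
    return sorted(str(path) for path in Path(folder).glob("*.pdf"))
```

For example, `pdf_files = collect_pdf_paths("contracts")` could replace the hard-coded list above, where `contracts` is a made-up folder name.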

### Insert documents into NeuralDB

In [None]:
source_ids = db.insert(pdf_docs, train=False, num_buckets_to_sample=8)

### Insert and Train

In [None]:
source_ids = db.insert(pdf_docs, train=True, num_buckets_to_sample=8)

### Just train on the docs

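Calling `insert` again with `train=True` on documents that are already in the DB simply trains on them. Do not worry about files being inserted multiple times; the DB takes care of de-duplication!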

In [None]:
source_ids = db.insert(pdf_docs, train=True, num_buckets_to_sample=8)

### Search

Now let's start searching.

In [None]:
search_results = db.search(
    query="what is the termination period",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=3))
    # print(result.source)
    # print(result.metadata)
    print('************')

We can see that the search pulled up the right passage, the one containing the termination period: "(i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret".

In [None]:
search_results = db.search(
    query="made by and between",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=3))
    # print(result.source)
    # print(result.metadata)
    print('************')

We can see that the search pulled up the right passage again that has "made by and between".

Now let's ask a tricky question.

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=3))
    # print(result.source)
    # print(result.metadata)
    print('************')

Oops! It looks like when we search for "parties involved", we do not get the correct paragraph in the first position (we should expect the first paragraph as the correct result instead of the last).

No worries, we'll show how to teach the model to correct its retrieval.

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that "parties involved" and the phrase "made by and between" mean the same thing.

In [None]:
db.associate(source="parties involved", target="made by and between")

Let's search again with the same query.

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
)

for result in search_results:
    print(result.text)
    # print(result.source)
    # print(result.metadata)
    print('************')

There you go! In just a line, you taught the model to correct itself and retrieve the correct result.
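
To check a fix like this programmatically rather than by eye, one option is to compute the rank at which the expected passage appears. The sketch below is a hypothetical helper, `rank_of`, that assumes each result exposes its passage as a `.text` string attribute. The stand-in results are made up for illustration and are not output from the DB.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def rank_of(results: Iterable, phrase: str) -> Optional[int]:
    """Return the 1-based rank of the first result whose text contains `phrase`.

    Hypothetical helper, not part of thirdai. The match is case-insensitive.
    Returns None when no result contains the phrase.
    """
    needle = phrase.lower()
    for rank, result in enumerate(results, start=1):
        if needle in result.text.lower():
            return rank
    return None


# Stand-in results for illustration only; in the notebook you would pass
# the list returned by db.search(...) instead.
fake_results = [
    SimpleNamespace(text="This Agreement is made by and between the parties below."),
    SimpleNamespace(text="The term of this Agreement is five (5) years."),
]
print(rank_of(fake_results, "made by and between"))
```

Running `rank_of(search_results, "made by and between")` before and after the `associate` call would show whether the passage moved up.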

Now, let's look at the second option, text-to-result association. Say you know that the query "parties involved" should go to the paragraph with DOC_ID=0. You can teach the model to associate the query with the corresponding label using the following API.

In [None]:
def upvote(query, result):
    # ids from the same document are guaranteed to have the same offset.
    offset = result.id - result.metadata["sentence_id"]
    result_ids = [offset + rid for rid in result.metadata["sentence_ids_in_para"]]
    db.text_to_result_batch([
        (query, rid)
        for rid in result_ids
    ])
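
The offset arithmetic in `upvote` is easier to follow with concrete numbers. The sketch below isolates it in a hypothetical pure function, `paragraph_result_ids`; the ids used in the example are made up.

```python
from typing import List, Sequence


def paragraph_result_ids(
    result_id: int, sentence_id: int, sentence_ids_in_para: Sequence[int]
) -> List[int]:
    """Map a paragraph's document-local sentence ids to DB-wide result ids.

    Hypothetical helper mirroring the arithmetic inside `upvote`.
    """
    # The offset is the DB-wide id of the document's sentence 0.
    offset = result_id - sentence_id
    return [offset + sid for sid in sentence_ids_in_para]


# Made-up numbers: the retrieved sentence has DB-wide id 57 and is sentence 7
# of its document, so that document's ids start at offset 50.
print(paragraph_result_ids(57, 7, [5, 6, 7]))  # [55, 56, 57]
```

Upvoting every sentence of the paragraph, rather than only the retrieved one, is what lets the whole paragraph rank higher for the query.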

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

print([result.text for result in search_results])

# Suppose we want to upvote the second search result
upvote("who are the parties involved?", search_results[1])

If you want to use the above RLHF methods in a batch instead of a single sample, you can simply use the batched versions of the APIs as shown next.

In [None]:
db.associate_batch([("parties involved","made by and between"),("date of signing","duly executed")])
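
When several query phrasings should map to the same target phrase, the pair list can be generated instead of typed out. The sketch below uses a hypothetical helper, `build_association_pairs`, and assumes the tuple order is `(source, target)`, as the single-sample `associate` call and the batch example above suggest.

```python
from typing import Dict, List, Sequence, Tuple


def build_association_pairs(
    target_to_sources: Dict[str, Sequence[str]]
) -> List[Tuple[str, str]]:
    """Expand {target: [source, ...]} into a flat list of (source, target) pairs.

    Hypothetical helper, not part of thirdai. The (source, target) order is an
    assumption based on the examples in this notebook.
    """
    pairs = []
    for target, sources in target_to_sources.items():
        for source in sources:
            pairs.append((source, target))
    return pairs


pairs = build_association_pairs({
    "made by and between": ["parties involved", "who signed"],
    "duly executed": ["date of signing"],
})
print(pairs)
```

The resulting list could then be passed to `db.associate_batch(pairs)`.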

### Get Answers from OpenAI using LangChain

In this section, we show how to use LangChain and OpenAI's chat model to generate an answer from the references retrieved from the DB above. You'll have to specify your own OpenAI key for this to work. You can replace this segment with any other generative model of your choice; for example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [None]:
from langchain.chat_models import ChatOpenAI
from paperqa.qaprompts import qa_prompt, make_chain

your_openai_key = ""

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo', 
    temperature=0.1, 
    openai_api_key=your_openai_key,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [None]:
def get_references(query):
    search_results = db.search(query,top_k=3)
    references = []
    for result in search_results:
        references.append(result.text)
    return references

def get_answer(query, references):
    return qa_chain.run(question=query, context_str='\n\n'.join(references[:3]), length="about 50 words")
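
Retrieved passages can repeat or come back empty, and both waste prompt space. As a minimal sketch, the hypothetical helper `build_context` below deduplicates and trims the passages before joining them, using the same blank-line separator as `get_answer`.

```python
from typing import Iterable


def build_context(
    references: Iterable[str], max_refs: int = 3, separator: str = "\n\n"
) -> str:
    """Join up to `max_refs` unique, non-empty passages into one context string.

    Hypothetical helper, not part of thirdai, LangChain, or paperqa.
    """
    seen = set()
    kept = []
    for ref in references:
        if len(kept) >= max_refs:
            break
        passage = ref.strip()
        # Skip blank passages and exact repeats of an earlier passage.
        if passage and passage not in seen:
            seen.add(passage)
            kept.append(passage)
    return separator.join(kept)
```

In `get_answer`, `build_context(references)` could be passed as `context_str` in place of the inline `join`. An empty return value signals that there is nothing to ground the answer on, so the LLM call can be skipped.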

In [None]:
query = "what is the effective date of this agreement?"

references = get_references(query)
print(references)

In [None]:
answer = get_answer(query, references)

print(answer)

### Load and Save

As usual, saving and loading the DB are one-liners.

In [None]:
# save your db
db.save("sample_nda.db")

# Loading is just like we showed above, with an optional progress handler
db.from_checkpoint("sample_nda.db", on_progress=lambda fraction: print(f"{fraction}% done with loading."))
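
The progress handler above prints the raw value followed by a percent sign. If the handler receives a fraction between 0 and 1, which the parameter name suggests but this notebook does not confirm, a small formatter gives a more readable message. `format_progress` is a hypothetical helper written under that assumption.

```python
def format_progress(fraction: float) -> str:
    """Format a loading fraction as a whole-number percentage message.

    Hypothetical helper. Assumes `fraction` is in [0, 1]; values outside
    that range are clamped.
    """
    clamped = min(max(fraction, 0.0), 1.0)
    return f"{clamped:.0%} done with loading."


print(format_progress(0.5))  # 50% done with loading.
```

It could be wired in as `on_progress=lambda fraction: print(format_progress(fraction))`.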

### Clearing files

In [None]:
db.clear_sources()