## ThirdAI's NeuralDB

In this notebook, we will show:

1. How to use ThirdAI's Neural Database to build grounded, specialized AI agents with ChatGPT (Retrieval Augmented Generation), all on any CPU.

2. (Optional) How to use your OpenAI key to get retrieval augmented answers from OpenAI.

3. How to teach your Neural DB with real-time RLHF (Reinforcement Learning from Human Feedback) to correct any retrieval failures.

To unlock additional features, such as exporting the DB to ThirdAI's Playground for interactive QnA and teaching, please reach out to contact@thirdai.com.

In [None]:
!pip3 install -r requirements.txt

In [None]:
# thirdai's license activation

import thirdai
try:
    thirdai.licensing.activate("D0F869-B61466-6A28F0-14B8C6-0AC6C6-V3")
except Exception:
    print("You need a license key to use ThirdAI's library. Please request a trial license at https://www.thirdai.com/try-bolt/")

thirdai.set_seed(7)

In [2]:
from thirdai import bolt
import nltk
nltk.data.path.append("./data/")
from pathlib import Path
import pickle
from doc_utils import documents

### Display your CSV contents

In [None]:
csv_file = "sample_nda.csv"
query_column_name = "QUERY"
target_column_name = "DOC_ID"

# Visualize the dataframe and get the column names in the csv_file.
# Your target column (id_col) name has to match the target column of the model defined later in the notebook (we use target_column_name throughout).
# You will have to pick your strong_columns and weak_columns for the insert and train steps shown next.
# Strong columns are usually the most important ones, like document titles, keywords, or categories.
# Weak columns are usually the long descriptions.

import pandas as pd
pd.options.display.max_colwidth = 200

df = pd.read_csv(csv_file)
print(df.head(3))
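
One rough way to choose between strong and weak columns is to compare how long the text in each column typically is. The sketch below is only a heuristic under that assumption: `mean_text_lengths` is a hypothetical helper, and the sample rows are made up for illustration (they are not the contents of `sample_nda.csv`).

```python
import csv
import io


def mean_text_lengths(csv_text):
    """Return {column_name: mean character length of the values in that column}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    return {
        col: sum(len(row[col] or "") for row in rows) / len(rows)
        for col in rows[0]
    }


# Made-up sample data, used only to illustrate the heuristic.
sample = (
    "DOC_ID,title,body\n"
    "0,Term,This agreement remains in force for two years.\n"
    "1,Notice,Any notice must be delivered in writing to the other party.\n"
)
lengths = mean_text_lengths(sample)

# Columns listed from shortest to longest average text.
# Short text columns are candidates for strong columns, long ones for weak columns.
print(sorted(lengths, key=lengths.get))
```

The final choice still depends on what the columns mean, so treat the ordering as a starting point rather than a rule.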

### Data prep and schema

Based on the above file, this is how you can define your NeuralDB schema.

Note: Your *target_column_name* has to be "DOC_ID" to be able to use the pre-trained base models. If your target column is named something else, please rename it to "DOC_ID". Also, the "DOC_ID" column should contain the integers from 0 to n_rows - 1.

If you're creating a DB from scratch, this is not mandatory.
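
The "DOC_ID" requirement can be checked before training. This is a minimal sketch in plain Python; `doc_ids_are_contiguous` is a hypothetical helper name, not part of ThirdAI's library, and it assumes the order of the IDs does not matter.

```python
def doc_ids_are_contiguous(doc_ids):
    """Return True if doc_ids is exactly the integers 0..n-1, each appearing once."""
    ids = list(doc_ids)
    return sorted(ids) == list(range(len(ids)))


print(doc_ids_are_contiguous([0, 1, 2, 3]))  # IDs cover 0..3 exactly
print(doc_ids_are_contiguous([1, 2, 3, 4]))  # IDs do not start at 0
print(doc_ids_are_contiguous([0, 1, 1, 2]))  # duplicate ID, so 3 is missing
```

With the dataframe loaded earlier, the same check would be `doc_ids_are_contiguous(df["DOC_ID"])`.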

We currently only support CSV files. All other file formats have to be converted into CSV files where each row represents a paragraph/text-chunk of the document. 
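
As an illustration of the row-per-paragraph layout, the sketch below converts plain text into CSV text using only the standard library. It assumes paragraphs are separated by blank lines; `paragraphs_to_csv` and the `para` column name are choices made for this example, not a format required by ThirdAI.

```python
import csv
import io


def paragraphs_to_csv(text):
    """Split text on blank lines and return CSV text with one paragraph per row."""
    # Collapse internal line breaks and extra spaces inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["DOC_ID", "para"])
    # DOC_ID runs from 0 to n_rows - 1, one ID per paragraph.
    for doc_id, paragraph in enumerate(paragraphs):
        writer.writerow([doc_id, paragraph])
    return buffer.getvalue()


document = (
    "First paragraph of the contract.\n\n"
    "Second paragraph,\nwrapped over two lines."
)
print(paragraphs_to_csv(document))
```

The `csv` writer quotes any paragraph that contains a comma, so the output can be read back with `pd.read_csv`.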

In [None]:
schema = {
    'search_string_name':"QUERY", ## keep this as it is if you want to use our base models.
    'target_column_name':"DOC_ID",
    'strong_search_columns':["passage"],
    'weak_search_columns':["para"],
}
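
Before training, it can help to confirm that the columns named in the schema actually exist in the CSV. The helper below is a hypothetical sketch, not part of ThirdAI's library. It assumes the search string column ("QUERY") belongs to queries rather than to the document CSV, so it is not checked.

```python
def missing_schema_columns(schema, csv_columns):
    """Return the schema columns that do not appear in the CSV header."""
    required = [schema["target_column_name"]]
    required += schema["strong_search_columns"]
    required += schema["weak_search_columns"]
    available = set(csv_columns)
    return [col for col in required if col not in available]


# Same shape as the schema cell above, repeated so this example is self-contained.
example_schema = {
    "search_string_name": "QUERY",
    "target_column_name": "DOC_ID",
    "strong_search_columns": ["passage"],
    "weak_search_columns": ["para"],
}

print(missing_schema_columns(example_schema, ["DOC_ID", "passage", "para"]))
print(missing_schema_columns(example_schema, ["DOC_ID", "passage"]))
```

With the dataframe loaded earlier, the call would be `missing_schema_columns(schema, df.columns)`. An empty list means every schema column was found.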

### Initialize your NeuralDB
#### Option 1: Initialize from scratch

In [6]:
ndb = bolt.UniversalDeepTransformer(
    data_types = {
        schema['search_string_name']: bolt.types.text(tokenizer="char-4"),
        schema['target_column_name']: bolt.types.categorical(delimiter=":"),
    },
    target=schema['target_column_name'],
    n_target_classes=1000000, # expected number of unique paragraphs in the DB; a larger number increases training time
    integer_target=True,
    options={"neural_db": True}
)

#### Option 2: Load from a checkpoint

In [None]:
import os

## Pick your choice of base NeuralDB. We currently offer three choices
## 1. "qna_1" : pre-trained on a public QnA dataset
## 2. "qna_2" : pre-trained on a larger public QnA dataset
## 3. "contracts" : pre-trained on CUAD dataset, tailored towards tasks like contract reviewing

checkpoint = "qna_2.bolt"

if not os.path.exists(checkpoint):
    if checkpoint=="qna_1.bolt":
        os.system("wget -O qna_1.bolt 'https://www.dropbox.com/scl/fi/8i3qd9edhrm6zjviq7vvy/qna_1_0.7.7_frozen.bolt?dl=0&rlkey=raonu7dh3cy6mooucjrns49vf' ")
    elif checkpoint=="qna_2.bolt":
        os.system("wget -O qna_2.bolt 'https://www.dropbox.com/scl/fi/27psws3dcujgbma5xwsh1/qna_2_0.7.7_frozen.bolt?dl=0&rlkey=z1ivtoquspqole3i6mdmgwb9v' ")
    elif checkpoint=="contracts.bolt":
        os.system("wget -O contracts.bolt 'https://www.dropbox.com/scl/fi/dk9bw59bix245d9x49nhy/contracts_0.7.7_frozen.bolt?dl=0&rlkey=xs9uzyv65sug30oi201sy7u6v' ")
    else:
        print("please choose the checkpoint from the aforementioned list in the comment only")

ndb = bolt.UniversalDeepTransformer.load(checkpoint)
ndb.clear_index()

ndb.insert_into_neural_db()

### Train the model

In [None]:
for doc in [csv_doc, combined_pdfs, combined_docxs]:
    if doc:
        doc_config = doc.get_config()  # config of the current document
        ndb.pre_train(doc)

In [None]:
for doc in [csv_doc, combined_pdfs, combined_docxs]:
    if doc:
        doc_config = doc.get_config()  # config of the current document
        ndb.train(doc)

In [None]:
# how many search results do you want to retrieve from your files for every query
N_REFERENCES = 2

ndb.set_decode_params(min(doclist.get_n_new_ids(), N_REFERENCES), min(doclist.get_n_new_ids(), 100))

### Get Answers from OpenAI

In this section, we will show how to use LangChain and query OpenAI's QnA module to generate an answer from the references that you retrieve from the NeuralDB you just built. You'll have to specify your own OpenAI key for this module to work. You can replace this segment with any open-source model of your choice (like MPT or Dolly) for answer generation with the same prompt that you use with OpenAI.
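
If you swap in another model, the retrieved references need to be assembled into a prompt yourself. The sketch below shows one way to do that in plain Python. The template wording and the `build_prompt` helper are assumptions made for illustration; they are not the `qa_prompt` shipped with paperqa.

```python
def build_prompt(question, references, max_references=3, length="about 50 words"):
    """Join the top references into a context block and wrap it in a simple prompt."""
    # Keep only the first few references so the prompt stays short.
    context = "\n\n".join(references[:max_references])
    return (
        f"Answer the question in {length} using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up references, used only to show the resulting prompt layout.
example_references = [
    "This Agreement is entered as of May 3 2023.",
    "The term of this Agreement is two years.",
]
prompt = build_prompt("what is the effective date of this agreement?", example_references)
print(prompt)
```

The resulting string can be passed to any text-generation model that accepts a single prompt.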

In [None]:
from langchain.chat_models import ChatOpenAI
from paperqa.qaprompts import qa_prompt, make_chain

your_openai_key = ""

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo', 
    temperature=0.1, 
    openai_api_key=your_openai_key,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [None]:
def get_references(query):
    reference_ids = ndb.predict({"QUERY":query})
    reference_ids = [itm[0] for itm in reference_ids]
    references = [doclist.get_new_display_items().iloc[p] for p in reference_ids]
    return references

def get_answer(query, references):
    return qa_chain.run(question=query, context_str='\n\n'.join(references[:3]), length="about 50 words")

In [3]:
query = "what is the effective date of this agreement?"

references = get_references(query)
print(references)

['CONFIDENTIALITY AGREEMENT This Confidentiality Agreement (the “Agreement”) is made by and between ACME. dba ToTheMoon Inc. with offices at 2025 Guadalupe St. Suite 260 Austin TX 78705 and StarWars dba ToTheMars with offices at the forest moon of Endor and entered as of May 3 2023 (“Effective Date”).', 'In consideration of the business discussions disclosure of Confidential Information and any future business relationship between the parties it is hereby agreed as follows: 1. CONFIDENTIAL INFORMATION. For purposes of this Agreement the term “Confidential Information” shall mean any information business plan concept idea know-how process technique program design formula algorithm or work-in-process Request for Proposal (RFP) or Request for Information (RFI) and any responses thereto engineering manufacturing marketing technical financial data or sales information or information regarding suppliers customers employees investors or business operations and other information or materials w

In [1]:
answer = get_answer(query, references)

print(answer)

The effective date of this Confidentiality Agreement is May 3, 2023 (ACME dba ToTheMoon Inc. and StarWars dba ToTheMars, 2023).


Now, let's ask a query that the model gets wrong. Then, let's teach the model to correct itself using our RLHF methods.

In [2]:
query = "who are the parties involved in this agreement?"

references = get_references(query)
answer = get_answer(query, references)
print(answer)

The context provides insufficient information to determine the parties involved in this agreement.


### How to teach your model (RLHF)

This is one of the marquee features that we provide. Thanks to our efficient training capabilities, you can teach the DB to correct itself whenever it fails to retrieve the relevant paragraphs from the database.

Also, the RLHF teachings done on a NeuralDB will generalize beyond the current documents if we run *ndb.clear_index()* and insert new documents.

To do RLHF, we provide two functions:

1. Associate: Using this function, you can associate two phrases so that they give similar results. For example, assume you're in the contract review domain and you want to ask a question like "who are the parties involved in this contract?". However, most contracts use the phrase "made by and between" to introduce the parties (as in "this agreement is made by and between company A and company B"). In this scenario, you can simply call *ndb.teach_concept_association(["parties involved","made by and between"])* and the model will learn the relation. In subsequent documents, you're more likely to retrieve the passage containing the correct information.

2. Upvote: Let's say you searched for the query "is there a limited liability clause?" and got 5 search results (along with their passage IDs). If you know that the correct result is actually the 2nd one instead of the first one, you can simply call *ndb.upvote("is there a limited liability clause",passage_id_of_the_best_search_result)*.

### RLHF using function calls 

In the above example, the DB could not connect the phrase "parties involved" to the passage that names the parties. But if you are an expert in contracts, you know that "parties involved" usually goes with phrases like "made by and between" (for example, "this agreement is made by and between company A and company B"). So, let's teach the DB that these two phrases should retrieve similar passages.

In [None]:
rlhf_samples = [({"QUERY":"parties involved"},{"QUERY":"made by and between"})]

ndb.associate(rlhf_samples, 7)
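
When there are several phrase pairs to teach, the sample list can be built from plain tuples. The helper below is a hypothetical convenience that only reproduces the shape of `rlhf_samples` used above (a list of pairs of single-column dicts). It does not call ThirdAI's library, and the second pair is just an illustrative extra.

```python
def make_association_samples(phrase_pairs, query_column="QUERY"):
    """Turn (source_phrase, target_phrase) tuples into paired single-column dicts."""
    return [
        ({query_column: source}, {query_column: target})
        for source, target in phrase_pairs
    ]


samples = make_association_samples(
    [
        ("parties involved", "made by and between"),
        ("date of signing", "duly executed"),
    ]
)
print(samples)
```

The first element of `samples` is identical to the single entry in `rlhf_samples` above.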

Now, let's query the model again.

In [3]:
query = "who are the parties involved in this agreement?"

references = get_references(query)
answer = get_answer(query, references)
print(answer)

The parties involved in this agreement are ACME, dba ToTheMoon Inc. with offices at 2025 Guadalupe St. Suite 260 Austin TX 78705 and StarWars dba ToTheMars with offices at the forest moon of Endor (Confidentiality Agreement).


There you go!