## ThirdAI's NeuralDB

First, let's import the relevant module and create a NeuralDB instance.

In [1]:
from thirdai import licensing
licensing.activate("D0F869-B61466-6A28F0-14B8C6-0AC6C6-V3")

from thirdai import neural_db as ndb

db = ndb.NeuralDB(user_id="my_user")  # Any username works; in the future, this username will let you push models to the model hub.

### Initialize

At this point, the db is uninitialized. 

##### Option 1: Initialize from scratch

In [2]:
db.from_scratch()

##### Option 2: Load from a base DB that we provide

In [2]:
import os

checkpoint = "qna_db_1"

if not os.path.exists(checkpoint):
    os.system("wget -O qna_db_1.zip 'https://www.dropbox.com/scl/fi/s1zhxmwjpayj5jphzct0p/qna_1_db.zip?dl=0&rlkey=ftcgrzt1rpc2d6hx0iuk1lz1r'")
    os.system("unzip qna_db_1.zip -d qna_db_1")

db.from_checkpoint("qna_db_1")

### Prep CSV data

Let's insert things into it!

Currently, we support adding as many CSV files as you wish. All other file formats have to be converted into CSV files where each row represents a paragraph/text-chunk of the document. 

The file is required to have a column named "DOC_ID" with rows numbered from 0 to n_rows-1.
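
If your document starts out as plain text, a minimal sketch like the one below can produce a CSV in the shape described above. The helper names (`paragraphs_to_csv_rows`, `doc_ids_are_sequential`, `write_csv`) and the sample text are made up for illustration, and splitting on blank lines is only one possible chunking choice. It writes a single `passage` column; add further columns (such as `para`) as your data requires.

```python
import csv
import io

def paragraphs_to_csv_rows(text):
    """Split raw text on blank lines and number the chunks from 0."""
    chunks = [" ".join(block.split()) for block in text.split("\n\n")]
    chunks = [c for c in chunks if c]  # drop empty chunks
    return [{"DOC_ID": i, "passage": c} for i, c in enumerate(chunks)]

def doc_ids_are_sequential(rows):
    """Check that DOC_ID runs from 0 to n_rows-1 in order."""
    return [int(r["DOC_ID"]) for r in rows] == list(range(len(rows)))

def write_csv(rows, handle):
    """Write the rows with a DOC_ID,passage header."""
    writer = csv.DictWriter(handle, fieldnames=["DOC_ID", "passage"])
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical sample text, one paragraph per chunk.
sample_text = """Confidentiality agreement made by and between two parties.

1. Confidential information. The term covers business plans and data.

2. Term. This agreement lasts five years."""

rows = paragraphs_to_csv_rows(sample_text)
assert doc_ids_are_sequential(rows)

buffer = io.StringIO()  # swap for open("my_doc.csv", "w", newline="") to write a file
write_csv(rows, buffer)
print(buffer.getvalue())
```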

In [3]:
from utils import CSV

csv_files = ['sample_nda.csv']
csv_docs = []

for file in csv_files:
    csv_doc = CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["passage"],
        weak_columns=["para"],  
        reference_columns=["passage"])

    csv_docs.append(csv_doc)


### Insert CSV files into NeuralDB

In [None]:
source_ids = db.insert(csv_docs, train=False)

### Insert and Train

In [4]:
source_ids = db.insert(csv_docs, train=True)

loaded data | source 'Documents:
sample_nda.csv' | vectors 109 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 1 | train_hash_precision@5=0.00917431  | train_batches 1 | time 0s

loaded data | source 'Documents:
sample_nda.csv' | vectors 109 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 2 | train_hash_precision@5=0.506422  | train_batches 1 | time 0s

loaded data | source 'Documents:
sample_nda.csv' | vectors 109 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 3 | train_hash_precision@5=0.422018  | train_batches 1 | time 0s

loaded data | source 'Documents:
sample_nda.csv' | vectors 109 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 4 | train_hash_precision@5=0.388991  | train_batches 1 | time 0s

loaded data | source 'Documents:
sample_nda.csv' | vectors 109 | batches 1 | time 0s | complete

train | epoch 0 | train_steps 5 | train_hash_precision@5=0.847706  | train_batches 1 | time 0s

loaded data | source 'Documents:


### Just train on the docs

Do not worry about files being inserted multiple times; the DB takes care of de-duplication!

In [None]:
source_ids = db.insert(csv_docs, train=True)
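
To build some intuition for why repeated inserts can be harmless, here is a toy sketch of content-based de-duplication. This is an assumption for illustration only, not a description of how NeuralDB implements it: the `DocumentRegistry` class is hypothetical and simply keys each document by a hash of its content, so inserting identical content twice yields the same ID.

```python
import hashlib

class DocumentRegistry:
    """Toy registry that stores each distinct document content once."""

    def __init__(self):
        self._by_hash = {}

    def insert(self, name, content):
        # Identical content always hashes to the same digest,
        # so a repeated insert does not create a second entry.
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest not in self._by_hash:
            self._by_hash[digest] = name
        return digest

    def __len__(self):
        return len(self._by_hash)

registry = DocumentRegistry()
first_id = registry.insert("sample_nda.csv", "DOC_ID,passage\n0,hello\n")
second_id = registry.insert("sample_nda.csv", "DOC_ID,passage\n0,hello\n")
print(first_id == second_id, len(registry))
```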

### Search

Now let's start searching.

In [5]:
search_results = db.search(
    query="what is the termination period",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()
    print('************')

12. entire agreement. this agreement constitutes the entire agreement with respect to the subject matter hereof and supersedes all prior agreements and understandings between the parties (whether written or oral) relating to the subject matter and may not be amended or modified except in a writing signed by an authorized representative of both parties. the terms of this agreement relating to the confidentiality and non-use of confidential information shall continue after the termination of this agreement for a period of the longer of (i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret under applicable law.
************
4. return of confidential information. upon request of the other party termination of the discussions regarding a business relationship between the parties or termination of the current business relationship each party shall promptly destroy or deliver to the other party any and all documents notes and other physical embodim

We can see that the search pulled up the right passage, which contains the termination period "(i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret".

In [6]:
search_results = db.search(
    query="made by and between",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()
    print('************')

confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).
************
in consideration of the business discussions disclosure of confidential information and any future business relationship between the parties it is hereby agreed as follows: 1. confidential information. for purposes of this agreement the term “confidential information” shall mean any information business plan concept idea know-how process technique program design formula algorithm or work-in-process request for proposal (rfp) or request for information (rfi) and any responses thereto engineering manufacturing marketing technical financial data or sales information or information regarding suppliers customers employees investors or business operations and other information or mat

Again, the search pulled up the right passage, the one that contains "made by and between".

Now let's ask a tricky question.

In [7]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()
    print('************')

3. joint undertaking. each party agrees that it will not at any time disclose give or transmit in any manner or for any purpose the confidential information received from the other party to any person firm or corporation or use such confidential information for its own benefit or the benefit of anyone else or for any purpose other than to engage in discussions regarding a possible business relationship or the current business relationship involving both parties.
************
6. excluded information. the parties agree that confidential information of the other party shall not include any information to the extent that the information: (i) is or at any time becomes a part of the public domain through no act or omission of the receiving party; (ii) is independently discovered or developed by the receiving party without use of the disclosing party’s confidential information; (iii) is rightfully obtained from a third party without any obligation of confidentiality; or (iv) is already known 

Oops! It looks like when we search for "parties involved", we do not get the correct paragraph in the 1st position (the correct result is the first paragraph of the document, not the ones returned above).

No worries, we'll show how to teach the model to correct its retrieval.
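
To make this kind of check repeatable, a small helper can report where the expected passage lands in the results. This is a sketch with a hypothetical function name (`rank_of_expected`); the strings below are shortened stand-ins for the passages printed above. With a live DB, you could pass `[result.text() for result in search_results]` instead.

```python
def rank_of_expected(result_texts, expected_snippet):
    """Return the 1-based rank of the first result containing the snippet, or None."""
    needle = expected_snippet.lower()
    for rank, text in enumerate(result_texts, start=1):
        if needle in text.lower():
            return rank
    return None

# Shortened stand-ins for the results of "who are the parties involved?"
results_before = [
    "3. joint undertaking. each party agrees that it will not at any time disclose",
    "6. excluded information. the parties agree that confidential information",
]
# Shortened stand-ins for the results we would like to see instead
results_wanted = [
    "confidentiality agreement is made by and between acme. dba tothemoon inc.",
    "6. excluded information. the parties agree that confidential information",
]

print(rank_of_expected(results_before, "made by and between"))
print(rank_of_expected(results_wanted, "made by and between"))
```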

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that "parties involved" and the phrase "made by and between" are related.

In [8]:
db.associate(source="parties involved", target="made by and between")

Let's search again with the same query.

In [9]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
)

for result in search_results:
    print(result.text())
    # print(result.source())
    # print(result.metadata())
    # result.show()
    print('************')

confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).
************
6. excluded information. the parties agree that confidential information of the other party shall not include any information to the extent that the information: (i) is or at any time becomes a part of the public domain through no act or omission of the receiving party; (ii) is independently discovered or developed by the receiving party without use of the disclosing party’s confidential information; (iii) is rightfully obtained from a third party without any obligation of confidentiality; or (iv) is already known by the receiving party without any obligation of confidentiality prior to obtaining the confidential information from the disclosing party.
************


There you go! In just a line, you taught the model to correct itself and retrieve the correct result.

Now, let's see the 2nd option, which is text-to-result association. Let's say you know that "parties involved" should go to the paragraph with DOC_ID=0. You can simply teach the model to associate the query with the corresponding label using the following API.

In [125]:
db.text_to_result("parties involved", 0)

If you want to use the above RLHF methods in a batch instead of a single sample, you can simply use the batched versions of the APIs as shown next.

In [18]:
db.associate_batch([("parties involved","made by and between"),("date of signing","duly executed")])

In [19]:
db.text_to_result_batch([("parties involved",0),("date of signing",16)])
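
If your feedback lives in a file rather than in code, a sketch like the following can build the two lists of pairs that the batched calls above take. The CSV layout and its column names (`query`, `related_phrase`, `doc_id`) are assumptions made up for this example; only the shape of the resulting lists comes from the cells above.

```python
import csv
import io

# Hypothetical feedback log; in practice this could be a file on disk.
feedback_csv = """query,related_phrase,doc_id
parties involved,made by and between,0
date of signing,duly executed,16
"""

def load_feedback(handle):
    """Return (text-to-text pairs, text-to-result pairs) from a feedback CSV."""
    association_pairs = []
    result_pairs = []
    for row in csv.DictReader(handle):
        association_pairs.append((row["query"], row["related_phrase"]))
        result_pairs.append((row["query"], int(row["doc_id"])))
    return association_pairs, result_pairs

association_pairs, result_pairs = load_feedback(io.StringIO(feedback_csv))
print(association_pairs)
print(result_pairs)

# With a live db, these lists could then be passed on as:
# db.associate_batch(association_pairs)
# db.text_to_result_batch(result_pairs)
```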

### Supervised Training (Optional)

If you have supervised data for a specific CSV file in your list, you can simply train the DB on that file by specifying a source_id = source_ids[*file_number_in_your_list*].

Note: The supervised file should have the query_column and id_column that you specify in the following call. The id_column should match the id_column that you specified in the "Prep CSV Data" step or default to "DOC_ID".

In [None]:
sup_files = ['sample_nda_sup.csv']

db.supervised_train([ndb.Sup(path, query_column="QUERY", id_column="DOC_ID", source_id=source_ids[0]) for path in sup_files])
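
For reference, here is a minimal sketch of writing a supervised file with the two columns named in the call above. The helper `write_supervised_csv` is hypothetical, the query-to-ID pairs reuse the examples from the RLHF section, and anything about the expected file layout beyond the `QUERY` and `DOC_ID` column names is an assumption.

```python
import csv
import io

# (query, DOC_ID) pairs; the IDs must refer to rows of the inserted CSV.
labeled_queries = [
    ("who are the parties involved?", 0),
    ("date of signing", 16),
]

def write_supervised_csv(pairs, handle, query_column="QUERY", id_column="DOC_ID"):
    """Write one labeled query per row under the given column names."""
    writer = csv.writer(handle)
    writer.writerow([query_column, id_column])
    writer.writerows(pairs)

buffer = io.StringIO()  # swap for open("sample_nda_sup.csv", "w", newline="") to write a file
write_supervised_csv(labeled_queries, buffer)
print(buffer.getvalue())
```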

### Get Answers from OpenAI using LangChain

In this section, we will show how to use LangChain and query OpenAI's QnA module to generate an answer from the references that you retrieve from the above DB. You'll have to specify your own OpenAI key for this module to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [11]:
from langchain.chat_models import ChatOpenAI
from paperqa.qaprompts import qa_prompt, make_chain

your_openai_key = ""

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo', 
    temperature=0.1, 
    openai_api_key=your_openai_key,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [12]:
def get_references(query):
    search_results = db.search(query,top_k=3)
    references = []
    for result in search_results:
        references.append(result.text())
    return references

def get_answer(query, references):
    return qa_chain.run(question=query, context_str='\n\n'.join(references[:3]), length="about 50 words")
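
The context string above is just the top references joined with blank lines. If your passages are long, you may want to cap how much text goes into the prompt. The helper below is a sketch under that assumption: `build_context` and its `max_chars` budget are hypothetical and not part of LangChain or paperqa.

```python
def build_context(references, max_refs=3, max_chars=2000):
    """Join the top references with blank lines, stopping before the budget is exceeded."""
    selected = []
    used = 0  # counts reference characters only, not the separators
    for ref in references[:max_refs]:
        if used + len(ref) > max_chars:
            break
        selected.append(ref)
        used += len(ref)
    return "\n\n".join(selected)

# Hypothetical references of 10 characters each, with a 25-character budget.
demo_refs = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
print(build_context(demo_refs, max_chars=25))
```

The result could be passed as `context_str` in place of the inline `'\n\n'.join(references[:3])`.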

In [13]:
query = "what is the effective date of this agreement?"

references = get_references(query)
print(references)

['confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).', 'each party shall take all reasonable measures to preserve the confidentiality and avoid the disclosure of the other party’s confidential information including but not limited to those steps taken with respect to the party’s own confidential information of like importance. neither party shall disassemble decompile or otherwise reverse engineer any software product of the other party and to the extent any such activity may be permitted the results thereof shall be deemed confidential information subject to the requirements of this agreement.', 'in consideration of the business discussions disclosure of confidential information and any future business relationship between the parties it i

In [14]:
answer = get_answer(query, references)

print(answer)

The effective date of this confidentiality agreement is May 3, 2023 (line 3).


### Load and Save
As usual, saving and loading the DB are one-liners.

In [None]:
# save your db
db.save("sample_nda.db")

# Loading is just like we showed above, with an optional progress handler
db.from_checkpoint("sample_nda.db", on_progress=lambda fraction: print(f"{fraction:.0%} done with loading."))
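
If you want a reusable progress handler, here is a small sketch. It assumes the handler receives a fraction between 0 and 1 (suggested by the parameter name, but an assumption here); `make_progress_printer` is a hypothetical helper.

```python
def make_progress_printer(label="loading"):
    """Build a handler that prints a fraction in [0, 1] as a whole percentage."""
    def on_progress(fraction):
        message = f"{fraction:.0%} done with {label}."
        print(message)
        return message
    return on_progress

handler = make_progress_printer()
for fraction in (0.0, 0.5, 1.0):
    handler(fraction)
```

The handler could then be passed as `on_progress=handler` in the `from_checkpoint` call.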

### Export to Playground

Note: Currently, we support exporting to the Playground UI with only 1 CSV file. If you have multiple CSV files, please watch out for our next release, which will add support for exporting a NeuralDB directly into Playground.

In [17]:
from export_utils import neural_db_to_playground

neural_db_to_playground(db, './sample_nda/', csv=csv_doc)