## ThirdAI's NeuralDB

First, let's import the relevant module and create a NeuralDB instance.

In [None]:
from thirdai import neural_db as ndb

db = ndb.NeuralDB(user_id="my_user")

### Initialize
At this point, the DB is uninitialized. We can initialize it from scratch like this:

In [None]:
db.from_scratch()

Or we can build one on top of a pretrained base model checkpoint:

In [None]:
import os
from thirdai import bolt

# Pretrained checkpoints that can serve as the base model.
CHECKPOINT_URLS = {
    "qna_1.bolt": "https://www.dropbox.com/scl/fi/8i3qd9edhrm6zjviq7vvy/qna_1_0.7.7_frozen.bolt?dl=0&rlkey=raonu7dh3cy6mooucjrns49vf",
    "qna_2.bolt": "https://www.dropbox.com/scl/fi/27psws3dcujgbma5xwsh1/qna_2_0.7.7_frozen.bolt?dl=0&rlkey=z1ivtoquspqole3i6mdmgwb9v",
    "contracts.bolt": "https://www.dropbox.com/scl/fi/dk9bw59bix245d9x49nhy/contracts_0.7.7_frozen.bolt?dl=0&rlkey=xs9uzyv65sug30oi201sy7u6v",
}

checkpoint = "qna_1.bolt"

if not os.path.exists(checkpoint):
    if checkpoint in CHECKPOINT_URLS:
        # Download the checkpoint once; later runs reuse the local file.
        os.system(f"wget -O {checkpoint} '{CHECKPOINT_URLS[checkpoint]}'")
    else:
        print(f"Please choose one of these checkpoints: {', '.join(CHECKPOINT_URLS)}")

db.from_udt(
    udt=bolt.UniversalDeepTransformer.load(checkpoint),
    id_col="DOC_ID", id_delimiter=":", query_col="QUERY", 
    input_dim=50_000, hidden_dim=2048, extreme_output_dim=50_000)

### Prep CSV data

Let's insert things into it!

Currently, we support adding as many CSV files as you wish. All other file formats have to be converted into CSV files where each row represents a paragraph/text-chunk of the document. 

The file is required to have a column named "DOC_ID" with rows numbered from 0 to n_rows-1.
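
If your document starts out as plain text, a conversion along these lines may be enough. This is only a minimal sketch using the standard library: the helper name `paragraphs_to_csv`, the sample text, and the choice to split on blank lines are our own assumptions, and it writes just a "DOC_ID" column and a "passage" column (any extra columns, such as a weak column, are left to you).

```python
import csv
import os
import re
import tempfile


def paragraphs_to_csv(text: str, csv_path: str, text_column: str = "passage") -> int:
    """Write one CSV row per paragraph, with DOC_ID numbered from 0 to n_rows-1."""
    # Treat blank lines as paragraph separators and collapse internal whitespace.
    paragraphs = [" ".join(chunk.split()) for chunk in re.split(r"\n\s*\n", text)]
    paragraphs = [p for p in paragraphs if p]

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["DOC_ID", text_column])
        for doc_id, paragraph in enumerate(paragraphs):
            writer.writerow([doc_id, paragraph])
    return len(paragraphs)


# Hypothetical sample text, used only to exercise the helper.
sample_text = """First paragraph of a hypothetical document.

Second paragraph,
wrapped across two lines.

Third paragraph."""

out_path = os.path.join(tempfile.mkdtemp(), "example_doc.csv")
n_rows = paragraphs_to_csv(sample_text, out_path)

# Read the file back to confirm the layout.
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

The resulting file can then be listed in `csv_files` like any other CSV.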

In [None]:
from document_impls import CSV

csv_files = ['sample_nda.csv']
csv_docs = []

for file in csv_files:
    csv_doc = CSV(
        path=file,  # use the loop variable rather than a hard-coded file name
        id_column="DOC_ID",
        strong_columns=["passage"],
        weak_columns=["para"],
        reference_columns=["passage"])

    csv_docs.append(csv_doc)


### Insert CSV files into NeuralDB

In [None]:
source_ids = db.insert(csv_docs, train=False)

### Insert and Train

In [None]:
source_ids = db.insert(csv_docs, train=True)

### Just train on the docs

Do not worry about files being inserted multiple times; the DB takes care of de-duplication!

In [None]:
source_ids = db.insert(csv_docs, train=True)
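
To build some intuition for what de-duplication can look like, here is a minimal sketch of one common approach: keying each file by a hash of its contents. We are not claiming this is how NeuralDB does it internally; `DedupRegistry` and `file_hash` are hypothetical names made up for this illustration.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def file_hash(path: str) -> str:
    """Hash the file contents so identical files map to the same key."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DedupRegistry:
    """Hypothetical registry that keeps one source id per unique file content."""

    def __init__(self) -> None:
        self._ids_by_hash: Dict[str, str] = {}

    def insert(self, paths: List[str]) -> List[str]:
        source_ids = []
        for path in paths:
            key = file_hash(path)
            # Reuse the existing id when the same content was inserted before.
            source_ids.append(self._ids_by_hash.setdefault(key, key[:12]))
        return source_ids


# Write a tiny throwaway CSV and insert it twice.
doc_path = os.path.join(tempfile.mkdtemp(), "doc.csv")
with open(doc_path, "w", encoding="utf-8") as f:
    f.write("DOC_ID,passage\n0,hello\n")

registry = DedupRegistry()
first = registry.insert([doc_path])
second = registry.insert([doc_path])
```

Inserting the same file twice returns the same id both times, and the registry still holds a single entry.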

### Search

Now let's start searching.

In [None]:
search_results = db.search(
    query="what is the termination period",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()

We can see that the search pulled up the right passage, which contains the termination period: "(i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret".

In [None]:
search_results = db.search(
    query="made by and between",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()

We can see that the search pulled up the right passage again that has "made by and between".

Now let's ask a tricky question.

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    # print(result.context(radius=3))
    # print(result.source())
    # print(result.metadata())
    # result.show()

Oops! It looks like when we search for "parties involved", we do not get the correct paragraph in the first position.

No worries, we'll show how to teach the model to correct its retrieval.

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that "parties involved" and the phrase "made by and between" are related.

In [None]:
db.associate(source="parties involved", target="made by and between")

Let's search again with the same query.

In [None]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
)

for result in search_results:
    print(result.text())
    # print(result.source())
    # print(result.metadata())
    # result.show()

There you go! In just a line, you taught the model to correct itself and retrieve the correct result. 

### Supervised Training (Optional)

If you have supervised data for a specific CSV file in your list, you can simply train the DB on that file by specifying a source_id = source_ids[*file_number_in_your_list*].

Note: The supervised file should have the query_column and id_column that you specify in the following call.  

In [None]:
sup_files = ['sample_nda_sup.csv']

db.supervised_train([
    ndb.Sup(path, query_column="QUERY", id_column="DOC_ID", source_id=source_ids[0])
    for path in sup_files
])
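
If you do not have a supervised file yet, the sketch below shows one way to assemble one with the standard library. The helper `write_supervised_csv`, the example queries, and the label values are all hypothetical; the only things taken from this notebook are the "QUERY" and "DOC_ID" column names and the rule that document IDs run from 0 to n_rows-1.

```python
import csv
import os
import tempfile
from typing import List, Tuple


def write_supervised_csv(pairs: List[Tuple[str, int]], path: str, n_rows: int) -> None:
    """Write (query, doc_id) pairs to a CSV with QUERY and DOC_ID columns."""
    for query, doc_id in pairs:
        # Every label must point at an existing row of the source document.
        if not 0 <= doc_id < n_rows:
            raise ValueError(f"DOC_ID {doc_id} for query {query!r} is out of range")

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["QUERY", "DOC_ID"])
        writer.writerows(pairs)


# Hypothetical labels: the row numbers are invented for illustration only.
labelled_queries = [
    ("what is the termination period", 3),
    ("who are the parties involved?", 0),
]

sup_path = os.path.join(tempfile.mkdtemp(), "example_sup.csv")
write_supervised_csv(labelled_queries, sup_path, n_rows=10)

with open(sup_path, newline="", encoding="utf-8") as f:
    sup_rows = list(csv.DictReader(f))
```

Validating the IDs up front means a mislabelled row fails loudly here rather than during training.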

### Load and Save
As usual, saving and loading are one-liners.

In [None]:
# save your db
db.save("temp.db")

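# Loading works like the initializers shown earlier, with an optional progress handler.
# The format below assumes `fraction` is a value between 0 and 1, as its name suggests.
db.from_checkpoint("temp.db", on_progress=lambda fraction: print(f"{fraction:.0%} done with loading."))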
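
If the progress handler is called very often, printing on every call can flood the output. Below is a minimal sketch of a throttled handler that reports only when progress crosses a new step. It assumes the handler receives a value between 0 and 1; `make_progress_printer` is a hypothetical helper of ours, not part of the library.

```python
from typing import Callable, List


def make_progress_printer(
    step: float = 0.1, emit: Callable[[str], None] = print
) -> Callable[[float], None]:
    """Return a handler that reports progress at most once per `step`."""
    last_bucket = -1

    def handler(fraction: float) -> None:
        nonlocal last_bucket
        # Clamp so slightly out-of-range values do not produce odd output.
        fraction = min(max(fraction, 0.0), 1.0)
        # The small epsilon guards against floating-point round-off at step edges.
        bucket = int(fraction / step + 1e-9)
        if bucket > last_bucket:
            last_bucket = bucket
            emit(f"{fraction:.0%} done with loading.")

    return handler


# Collect messages in a list instead of printing, to show which calls get reported.
messages: List[str] = []
handler = make_progress_printer(step=0.25, emit=messages.append)
for value in (0.0, 0.1, 0.25, 0.3, 0.5, 0.99, 1.0):
    handler(value)
```

Under that assumption, it could be passed as `on_progress=make_progress_printer()` in the call above.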

### Export to Playground

In [None]:
from utils import bolt_and_csv_to_checkpoint

bolt_and_csv_to_checkpoint(db._savable_state.model.get_model(), csv_files[0], './playground_checkpoint/')