## ThirdAI's NeuralDB

NeuralDB, as the name suggests, is a combination of a neural network and a database. It provides a high-level API for users to insert different types of files into it and search through the file contents with natural language queries. The neural network part of it enables semantic search while the database part of it stores the paragraphs of the files that are inserted into it.

First, let's install the dependencies.

In [None]:
!pip3 install thirdai --upgrade
!pip3 install "thirdai[neural_db]"
!pip3 install langchain --upgrade
!pip3 install openai --upgrade
!pip3 install paper-qa --upgrade

In [None]:
from thirdai import licensing, neural_db as ndb

import nltk
nltk.download("punkt")

import os
if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("366E41-392A4E-A4D03D-0AAD37-7DC14A-V3")  # Enter your ThirdAI key here

Now, let's create a NeuralDB instance.

In [2]:
db = ndb.NeuralDB(user_id="my_user")  # you can use any username; in the future, this username will let you push models to the model hub

### You can even load a base DB from our Bazaar (optional but recommended)

We have a model Bazaar that provides users with domain-specific NeuralDBs that can jumpstart searching on their private documents. The Bazaar has two main types of DBs:

1. Base DBs: These come with models that have either general QnA capabilities or domain-specific capabilities, such as search on medical documents, financial documents, or contracts. They come with an empty data index into which users can insert their files.

2. Pre-Indexed DBs: These are ready-to-search DBs that come with pre-trained models and their corresponding datasets. They are meant for searching through large public datasets such as PubMed, Amazon 3MM Products, or Stack Overflow issues.

In [3]:
# Set up a cache directory
import os
if not os.path.isdir("bazaar_cache"):
    os.mkdir("bazaar_cache")

from pathlib import Path
from thirdai.neural_db import Bazaar
bazaar = Bazaar(cache_dir=Path("bazaar_cache"))


Call `fetch` to refresh the list of available DBs.

In [4]:
bazaar.fetch() # Optional arg filter="model name" to filter by model name.


Below is the list of all DBs in the Bazaar.

In [5]:
print(bazaar.list_model_names())


['Contract Review', 'Finance QnA', 'General QnA']


Finally, load the DB.

In [6]:
db = bazaar.get_model("General QnA")

### Insert your files

Let's insert things into it!

Currently, we natively support adding `CSV, PDF, DOCX, TXT, PPTX, EML, Outlook message (MSG)` files. We also support automatically scraping and parsing URLs. All other file formats have to be converted into CSV files where each row represents a paragraph/text-chunk of the document.

In addition to the above documents, if you have a database table with a substantial number of rows, or multiple native documents stored remotely, we offer support for SQLAlchemy and SharePoint. This enables you to train on data from remote locations.

Furthermore, you have the option to specify the parameter `max_in_memory_batches`, which will divide the data into manageable segments and allow you to train the database model on them, mitigating the risk of memory errors.
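To make the idea of "manageable segments" concrete, here is a rough sketch that is independent of NeuralDB. The `in_segments` helper and the sizes are hypothetical and only illustrate the principle of processing data piece by piece; this is not how NeuralDB implements `max_in_memory_batches` internally.

```python
from itertools import islice


def in_segments(rows, segment_size):
    """Yield lists of at most `segment_size` rows without loading all rows at once."""
    iterator = iter(rows)
    while True:
        segment = list(islice(iterator, segment_size))
        if not segment:
            return
        yield segment


# Toy data: 10 "rows" processed 4 at a time, so at most 4 rows are held at once.
segment_sizes = [len(segment) for segment in in_segments(range(10), segment_size=4)]
print(segment_sizes)  # [4, 4, 2]
```

The trade-off is the usual one: smaller segments lower peak memory use at the cost of more passes over the loading code.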


#### Example 1: CSV files
The first example below shows how to insert a CSV file. Please note that a CSV file is required to have a column named "DOC_ID" with rows numbered from 0 to n_rows-1.
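If your source material is not already in this shape, a small script along these lines can produce a compatible CSV. This is a minimal sketch: the paragraphs are made up, the output file name is hypothetical, and the `passage` and `para` column names are simply borrowed from the example cell that follows (what you put in them is up to you).

```python
import csv
import os
import tempfile

# Hypothetical (passage, para) pairs; in practice these come from your own document.
paragraphs = [
    ("Confidentiality agreement made by and between two parties.", "introduction"),
    ("Each party shall protect the other party's confidential information.", "obligations"),
    ("The confidentiality terms continue after termination.", "termination"),
]

# Written to a temporary directory so that nothing in data/ is overwritten.
csv_path = os.path.join(tempfile.mkdtemp(), "my_document.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["DOC_ID", "passage", "para"])
    # enumerate() numbers the rows 0 .. n_rows-1, as required for DOC_ID.
    for doc_id, (passage, para) in enumerate(paragraphs):
        writer.writerow([doc_id, passage, para])

# Read the file back to check the DOC_ID column.
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
doc_ids = [row["DOC_ID"] for row in rows]
print(doc_ids)  # ['0', '1', '2']
```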

In [7]:
insertable_docs = []
csv_files = ['data/sample_nda.csv']

for file in csv_files:
    csv_doc = ndb.CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["passage"],
        weak_columns=["para"],
        reference_columns=["passage"],
    )
    insertable_docs.append(csv_doc)


#### Example 2: PDF files

In [8]:
insertable_docs = []
pdf_files = ['data/sample_nda.pdf']

for file in pdf_files:
    pdf_doc = ndb.PDF(file)
    insertable_docs.append(pdf_doc)

#### Example 3: DOCX files

In [7]:
insertable_docs = []
doc_files = ['data/sample_nda.docx']

for file in doc_files:
    doc = ndb.DOCX(file)
    insertable_docs.append(doc)

#### Example 4: Parse from URLs directly

First you can use our utility to generate a set of candidate URLs containing data of interest. You can also use your own list of URLs to extract data from.
```python
valid_url_data = ndb.parsing_utils.recursive_url_scrape(
  base_url="https://www.thirdai.com/pocketllm/", max_crawl_depth=0
)
```
Then you can create a list of insertable documents from those URLs:
```python
insertable_docs = []
for url, response in valid_url_data:
    try:
        insertable_docs.append(ndb.URL(url, response))
    except Exception:  # skip URLs that cannot be parsed
        continue
```
These can be inserted into Neural DB just like any other document.
```python
db.insert(insertable_docs, ...)
```

#### Example 5: EML, PPTX, and TXT files

In [None]:
insertable_docs = []
doc_files = ['data/sample_nda.eml' ,'data/sample_nda.txt', 'data/sample_nda.pptx']
for file in doc_files:
    doc = ndb.Unstructured(file)
    insertable_docs.append(doc)

#### Example 6: SQLDatabase

Insert the data directly from a database. Please note that the table in question should have a simple primary key as the id column, with values ranging from 0 to n_rows - 1.
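As an illustration of a table that meets this requirement, the sketch below builds a small SQLite table with the standard library. The rows are made up, and the table and column names (`nda_table`, `id`, `passage`, `para`) are borrowed from the example cell that follows; the file is written to a temporary directory rather than `data/sample.db`.

```python
import os
import sqlite3
import tempfile

# Hypothetical (passage, para) rows; replace them with your own data.
table_rows = [
    ("Confidentiality agreement made by and between two parties.", "introduction"),
    ("Each party shall protect the other party's confidential information.", "obligations"),
    ("The confidentiality terms continue after termination.", "termination"),
]

db_path = os.path.join(tempfile.mkdtemp(), "sample.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE nda_table (id INTEGER PRIMARY KEY, passage TEXT, para TEXT)"
)
# Assign ids explicitly so that they run from 0 to n_rows - 1.
conn.executemany(
    "INSERT INTO nda_table (id, passage, para) VALUES (?, ?, ?)",
    [(i, passage, para) for i, (passage, para) in enumerate(table_rows)],
)
conn.commit()

ids = [row[0] for row in conn.execute("SELECT id FROM nda_table ORDER BY id")]
conn.close()
print(ids)  # [0, 1, 2]
```

A SQLAlchemy URL for such a file would then take the form `sqlite:///` followed by the file path.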

In [7]:
from sqlalchemy import create_engine

insertable_docs = []
db_url = "sqlite:///data/sample.db"
engine = create_engine(url = db_url)
table_name = "nda_table"

db_doc = ndb.SQLDatabase(engine = engine,
                         table_name = table_name,
                         id_col = "id",
                         strong_columns = ["passage"],
                         weak_columns=["para"],
                         reference_columns=["passage"],
                    )

insertable_docs.append(db_doc)

#### Example 7: SharePoint files

First, you need to create a `ClientContext` object for your site. There are multiple ways to create it. The simplest is to use either a `client_id` and `client_secret` pair or a `username` and `password` pair.

To make this easier, we provide a helper method:
```python
ctx = ndb.SharePoint.setup_clientContext(
  base_url="https://<domain>.sharepoint.com/sites/<site-name>", credentials = {"username": username, "password": password}
)
```
OR
```python
ctx = ndb.SharePoint.setup_clientContext(
  base_url="https://<domain>.sharepoint.com/sites/<site-name>", credentials = {"client_id": client_id, "client_secret": client_secret}
)
```
Currently, we only support creating the `ctx` object with one of the credential pairs above. The other ways of creating it are more interactive. For more details, see https://github.com/vgrem/Office365-REST-Python-Client

Then you can create a NeuralDB SharePoint document:
```python
db_doc = ndb.SharePoint(
    ctx=ctx,
    library_path="Shared Documents",  # library path where the documents are present; this is the default
)
  
insertable_docs = [db_doc]
```
Now it can be inserted into Neural DB just like any other document.
```python
db.insert(insertable_docs, ...)
```

NOTE: Since we do not store any of the remote document data, a query result for a SharePoint document can only point you to the filename and the page number (for PPTX and PDF files).

### Insert into NeuralDB

If you wish to insert without unsupervised training, you can set 'train=False' in the insert() method.

In [None]:
source_ids = db.insert(insertable_docs, train=False)

The above command is intended to be used with a base DB which already has reasonable knowledge of the domain. In general, we always recommend using 'train=True' as shown below.

#### Insert and Train

In [None]:
source_ids = db.insert(insertable_docs, train=True)

If you call the `insert()` method multiple times, the documents will automatically be de-duplicated. If `train=True`, then the training will be done multiple times.

### Search

Now let's start searching.

In [10]:
search_results = db.search(
    query="what is the termination period",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

passage: 12. entire agreement. this agreement constitutes the entire agreement with respect to the subject matter hereof and supersedes all prior agreements and understandings between the parties (whether written or oral) relating to the subject matter and may not be amended or modified except in a writing signed by an authorized representative of both parties. the terms of this agreement relating to the confidentiality and non-use of confidential information shall continue after the termination of this agreement for a period of the longer of (i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret under applicable law.
************
passage: 13. severability. each party acknowledges that should any provision of this agreement be determined to be void invalid or otherwise unenforceable by any court of competent jurisdiction such determination shall not affect the remaining provisions hereof which shall remain in full force and effect.
**********

We can see that the search pulled up the right passage, which contains the termination period "(i) five (5) years or (ii) when the confidential information no longer qualifies as a trade secret".

In [12]:
search_results = db.search(
    query="made by and between",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

passage: confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).
************
passage: in consideration of the business discussions disclosure of confidential information and any future business relationship between the parties it is hereby agreed as follows: 1. confidential information. for purposes of this agreement the term “confidential information” shall mean any information business plan concept idea know-how process technique program design formula algorithm or work-in-process request for proposal (rfp) or request for information (rfi) and any responses thereto engineering manufacturing marketing technical financial data or sales information or information regarding suppliers customers employees investors or business operations and other 

We can see that the search pulled up the right passage again that has "made by and between".

Now let's ask a tricky question.

In [13]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text)
    # print(result.context(radius=1))
    # print(result.source)
    # print(result.metadata)
    print('************')

passage: 3. joint undertaking. each party agrees that it will not at any time disclose give or transmit in any manner or for any purpose the confidential information received from the other party to any person firm or corporation or use such confidential information for its own benefit or the benefit of anyone else or for any purpose other than to engage in discussions regarding a possible business relationship or the current business relationship involving both parties.
************
passage: each party shall take all reasonable measures to preserve the confidentiality and avoid the disclosure of the other party’s confidential information including but not limited to those steps taken with respect to the party’s own confidential information of like importance. neither party shall disassemble decompile or otherwise reverse engineer any software product of the other party and to the extent any such activity may be permitted the results thereof shall be deemed confidential information sub

Oops! It looks like when we search for "parties involved", we do not get the correct paragraph in the first position. The correct result is the opening paragraph of the agreement, which names both parties.

No worries, we'll show how to teach the model to correct its retrieval.

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that "parties involved" and the phrase "made by and between" are related.

In [14]:
db.associate(source="parties involved", target="made by and between")

Let's search again with the same query.

In [15]:
search_results = db.search(
    query="who are the parties involved?",
    top_k=2,
)

for result in search_results:
    print(result.text)
    # print(result.source)
    # print(result.metadata)
    print('************')

passage: confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).
************
passage: in consideration of the business discussions disclosure of confidential information and any future business relationship between the parties it is hereby agreed as follows: 1. confidential information. for purposes of this agreement the term “confidential information” shall mean any information business plan concept idea know-how process technique program design formula algorithm or work-in-process request for proposal (rfp) or request for information (rfi) and any responses thereto engineering manufacturing marketing technical financial data or sales information or information regarding suppliers customers employees investors or business operations and other 

There you go! In just a line, you taught the model to correct itself and retrieve the correct result.

Now, let's see the second option, which is text-to-result association. Let's say you know that "parties involved" should go to the paragraph with DOC_ID=0. You can simply teach the model to associate the query with the corresponding label using the following API.

In [16]:
db.text_to_result("parties involved", 0)

If you want to use the above RLHF methods in a batch instead of a single sample, you can simply use the batched versions of the APIs as shown next.

In [17]:
db.associate_batch([("parties involved","made by and between"),("date of signing","duly executed")])

In [18]:
db.text_to_result_batch([("parties involved",0),("date of signing",16)])

### Supervised Training (Optional)

If you have supervised data for a specific CSV file in your list, you can simply train the DB on that file by specifying a source_id = source_ids[*file_number_in_your_list*].

Note: The supervised file should have the query_column and id_column that you specify in the following call. The id_column should match the id_column that you specified when preparing the CSV data (see "Example 1: CSV files"), or default to "DOC_ID".
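For reference, a supervised file of this kind can be produced with a few lines of Python. This is only a sketch: the query and DOC_ID pairs reuse the ones from the batched RLHF example above, the `QUERY` and `DOC_ID` column names match the call below, and the output path is a temporary, hypothetical one.

```python
import csv
import os
import tempfile

# (query, DOC_ID) pairs; the ids must refer to rows of the CSV that was inserted.
supervised_pairs = [
    ("parties involved", 0),
    ("date of signing", 16),
]

sup_path = os.path.join(tempfile.mkdtemp(), "my_supervised.csv")
with open(sup_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["QUERY", "DOC_ID"])  # query_column and id_column
    writer.writerows(supervised_pairs)

# Read the file back to confirm its layout.
with open(sup_path, newline="", encoding="utf-8") as f:
    sup_rows = list(csv.DictReader(f))
print(sup_rows[0])  # {'QUERY': 'parties involved', 'DOC_ID': '0'}
```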

In [19]:
sup_files = ['data/sample_nda_sup.csv']

db.supervised_train([ndb.Sup(path, query_column="QUERY", id_column="DOC_ID", source_id=source_ids[0]) for path in sup_files])

loading data | source 'Supervised training samples'
loaded data | source 'Supervised training samples' | vectors 3 | batches 1 | time 0.007s | complete

train | epoch 0 | train_steps 2479 |  | train_batches 1 | time 0.021s   

train | epoch 1 | train_steps 2480 |  | train_batches 1 | time 0.019s   

train | epoch 2 | train_steps 2481 |  | train_batches 1 | time 0.019s   



### Get Answers from OpenAI using Langchain

In this section, we will show how to use LangChain and query OpenAI's QnA module to generate an answer from the references that you retrieve from the above DB. You'll have to specify your own OpenAI key for this module to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [31]:
import os
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = ""  # Enter your OpenAI key here

In [None]:
from langchain.chat_models import ChatOpenAI
from paperqa.prompts import qa_prompt
from paperqa.chains import make_chain

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo', 
    temperature=0.1,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [23]:
def get_references(query):
    search_results = db.search(query,top_k=3)
    references = []
    for result in search_results:
        references.append(result.text)
    return references

def get_answer(query, references):
    return qa_chain.run(question=query, context='\n\n'.join(references[:3]), answer_length="about 50 words")

In [24]:
query = "what is the effective date of this agreement?"

references = get_references(query)
print(references)

['passage: confidentiality agreement this confidentiality agreement (the “agreement”) is made by and between acme. dba tothemoon inc. with offices at 2025 guadalupe st. suite 260 austin tx 78705 and starwars dba tothemars with offices at the forest moon of endor and entered as of may 3 2023 (“effective date”).', 'passage: in consideration of the business discussions disclosure of confidential information and any future business relationship between the parties it is hereby agreed as follows: 1. confidential information. for purposes of this agreement the term “confidential information” shall mean any information business plan concept idea know-how process technique program design formula algorithm or work-in-process request for proposal (rfp) or request for information (rfi) and any responses thereto engineering manufacturing marketing technical financial data or sales information or information regarding suppliers customers employees investors or business operations and other informat

In [None]:
answer = get_answer(query, references)

print(answer)

### Get Embeddings

In [33]:
def get_embeddings(db, queries):
    return db._savable_state.model.model.embedding_representation([{"query":query} for query in queries])

In [34]:
embeddings = get_embeddings(db, ['query 1', 'query 2'])
print(embeddings.shape)
print(embeddings)


(2, 2048)
[[-0.44150677  0.00377565 -0.11008609 ... -0.17087384 -0.10667692
  -0.21856037]
 [-0.43653303 -0.10459366 -0.07299916 ... -0.13396175 -0.03077642
  -0.18951523]]
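
One common use of embeddings like these is to compare two queries by cosine similarity. The sketch below uses plain Python and toy 3-dimensional vectors rather than real NeuralDB output; whether cosine similarity is the most suitable measure for these particular embeddings is an assumption, and `cosine_similarity` is a hypothetical helper, not part of the NeuralDB API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 1.0, 0.0]
print(round(cosine_similarity(toy_a, toy_b), 3))  # 0.5
```

The same function accepts rows of the array returned by `get_embeddings`, for example `cosine_similarity(embeddings[0], embeddings[1])`.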


### Load and Save
As usual, saving and loading the DB are one-liners.



In [None]:
# save your db
db.save("sample_nda.ndb")

# Load it back from the checkpoint, with an optional progress handler
db.from_checkpoint("sample_nda.ndb", on_progress=lambda fraction: print(f"{fraction}% done with loading."))

#### NOTE:
We never store data from remote documents, so the DB's functionality is limited if you save and load models that contain these remote document objects.

We do provide a mechanism for re-establishing the connection in our remote document objects. For this to work, the remote endpoints and the remote document content must not change in any way.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#### Re-establishing connection to SQLDatabase
First, find the corresponding document object:

```python
docs = db.sources()
```
It returns an ordered dict of `{doc-source: DB doc object}`.

For example: `{'16cd6625838b9d5f97e1a49a9f0a6189c90b1b2d': <thirdai.neural_db.documents.SQLDatabase at 0x7f6b04ca23e0>, ...}`

You can verify that this is the required document by iterating over the dictionary and printing each document's name:
```python
for source, doc in docs.items():
    print(f"{doc.name = } and {source = }")
```


Then re-establish the connection on that document:

```python
sql_doc = docs['16cd6625838b9d5f97e1a49a9f0a6189c90b1b2d']
engine = create_engine(url=db_url)
sql_doc.setup_connection(engine=engine)  # call it on the document fetched above
```

The document is then reconnected to the same table.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#### Re-establishing connection to SharePoint

Find the corresponding document object as shown above.

Then, provide the same `ClientContext` object:

```python
sp_doc.setup_connection(ctx = ctx)
```


