## ThirdAI's NeuralDB

First, let's import the relevant module and create a NeuralDB instance.

In [1]:
!python --version

Python 3.9.16


In [1]:
from thirdai import licensing
licensing.deactivate()
licensing.activate("")  # paste your ThirdAI license key between the quotes

from thirdai import neural_db as ndb

# Any username works; in the future, this username will let you push models to the model hub.
db = ndb.NeuralDB(user_id="my_user")

### Initialize

At this point, the db is uninitialized.

##### Option 1: Initialize from scratch

In [3]:
db.from_scratch()

##### Option 2: Load from a base DB that we provide

In [3]:
import os

checkpoint = "qna_db_1"

# Download and unpack the base DB only if it is not already on disk.
if not os.path.exists(checkpoint):
    os.system("wget -O qna_db_1.zip 'https://www.dropbox.com/scl/fi/s1zhxmwjpayj5jphzct0p/qna_1_db.zip?dl=0&rlkey=ftcgrzt1rpc2d6hx0iuk1lz1r'")
    os.system("unzip qna_db_1.zip -d qna_db_1")

db.from_checkpoint(checkpoint)

In [4]:
db.from_checkpoint("sample_faculties.db")
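If `wget` and `unzip` are not available on your machine, the download step in Option 2 can be approximated with the Python standard library. This is a minimal sketch, not part of NeuralDB: `fetch_checkpoint` is a hypothetical helper, and we have not verified that the Dropbox link above serves the raw archive to a non-browser client.

```python
import os
import urllib.request
import zipfile


def fetch_checkpoint(url, zip_path, dest_dir):
    """Download and unpack a zipped checkpoint unless dest_dir already exists.

    Hypothetical helper. Returns True if a download happened and
    False if the existing copy on disk was reused.
    """
    if os.path.exists(dest_dir):
        return False  # already unpacked; nothing to do
    urllib.request.urlretrieve(url, zip_path)  # raises on network or HTTP errors
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return True
```

After a call such as `fetch_checkpoint(url, "qna_db_1.zip", "qna_db_1")`, the unpacked folder can be passed to `db.from_checkpoint` as in the cell above.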

### Prep CSV data

Let's insert some documents into the DB!

Currently, NeuralDB accepts any number of CSV files. Every other file format must first be converted into a CSV file in which each row represents one paragraph or text chunk of the document.

You can use url_to_csv.py to convert a webpage to a CSV file.

Each CSV file must have a column named "DOC_ID" whose rows are numbered from 0 to n_rows-1.
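To make the "DOC_ID" requirement concrete, here is a small sketch that writes one row per paragraph and then checks the numbering. Both helpers are hypothetical and use only the standard library; the `text` and `url` column names are taken from the CSV cell in this notebook, and requiring the IDs to appear in ascending order is our assumption.

```python
import csv
import io


def paragraphs_to_csv(paragraphs, url, out):
    """Write one row per paragraph with DOC_ID numbered 0..n_rows-1."""
    writer = csv.DictWriter(out, fieldnames=["DOC_ID", "text", "url"])
    writer.writeheader()
    for doc_id, text in enumerate(paragraphs):
        writer.writerow({"DOC_ID": doc_id, "text": text, "url": url})


def has_valid_doc_ids(csv_file):
    """Return True if DOC_ID exists and its values are exactly 0..n_rows-1, in order."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "DOC_ID" not in reader.fieldnames:
        return False
    ids = [row["DOC_ID"] for row in reader]
    return ids == [str(i) for i in range(len(ids))]


# Write two paragraphs to an in-memory file, then validate the result.
buffer = io.StringIO()
paragraphs_to_csv(["first paragraph", "second paragraph"], "https://example.com/page", buffer)
buffer.seek(0)
print(has_valid_doc_ids(buffer))  # True
```

Running a check like this before `db.insert` should catch files whose IDs start at 1 or skip a number.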

In [27]:
from utils import CSV

csv_files = ['faculties2.csv']
csv_docs = []

for file in csv_files:
    csv_doc = CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["text"],
        weak_columns=["url"],
        reference_columns=["text"])

    csv_docs.append(csv_doc)


### Insert CSV files into NeuralDB and train

In [28]:
source_ids = db.insert(csv_docs, train=True)

loaded data | source 'Documents:
faculties2.csv' | vectors 8348 | batches 5 | time 0s | complete

train | epoch 0 | train_steps 2671 | train_hash_precision@5=0.853162  | train_batches 5 | time 3s

loaded data | source 'Documents:
faculties2.csv' | vectors 8348 | batches 5 | time 0s | complete

train | epoch 0 | train_steps 2676 | train_hash_precision@5=0.843723  | train_batches 5 | time 3s

loaded data | source 'Documents:
faculties2.csv' | vectors 8348 | batches 5 | time 0s | complete

train | epoch 0 | train_steps 2681 | train_hash_precision@5=0.851725  | train_batches 5 | time 3s

loaded data | source 'Documents:
faculties2.csv' | vectors 8348 | batches 5 | time 0s | complete

train | epoch 0 | train_steps 2686 | train_hash_precision@5=0.879133  | train_batches 5 | time 3s

loaded data | source 'Documents:
faculties2.csv' | vectors 8348 | batches 5 | time 0s | complete

train | epoch 0 | train_steps 2691 | train_hash_precision@5=0.870748  | train_batches 5 | time 3s

loaded data | s

### Search

Now let's start searching.

In [5]:
search_results = db.search(
    query="who works on machine learning and biology",
    top_k=10,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    print(result.metadata()['url'])
    # print(result.context(radius=3))
    # print(result.source())
    # result.show()
    print('************')

towards these aims, tang uses statistical mechanics, information theory, and dynamical theory to characterize emergent function. she particularly enjoys using topology and geometry to predict robust dynamics in novel and accessible platforms, from quantum to biological systems. other interests include learning and optimal navigation, as well as information flow in fluids, networks and active matter.
https://profiles.rice.edu/faculty/evelyn-tang
************
her research focus is in computational biology, where she develops machine learning and statistical methods to improve our understanding of the biological circuitry that underlies living organisms and how its dysregulation may lead to disease. more specifically, she has worked on modeling tissue and cell type specificity as well as disease progression, both by developing general methods (such as semi-supervised network integration) and in applying them to decipher the molecular underpinnings of
https://profiles.rice.edu/faculty/vick

Oops! It looks like when we search for "who works on machine learning and biology", the most relevant paragraph does not come back in the first position: the computational biology paragraph is ranked second, when we would expect it first.

No worries, we'll show how to teach the model to correct its retrieval.
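One way to turn "is the right paragraph first?" into a number is to look up the position of the expected text in the result list. The sketch below is a hypothetical helper that works on plain strings; with NeuralDB you would presumably pass `[result.text() for result in search_results]`.

```python
def rank_of_expected(result_texts, expected_phrase):
    """Return the 0-based position of the first result containing expected_phrase, or None."""
    needle = expected_phrase.lower()
    for position, text in enumerate(result_texts):
        if needle in text.lower():
            return position
    return None


# Shortened stand-ins for the two results printed above.
results = [
    "towards these aims, tang uses statistical mechanics, information theory, and dynamical theory",
    "her research focus is in computational biology, where she develops machine learning and statistical methods",
]
print(rank_of_expected(results, "computational biology"))  # 1
```

A rank of 0 means the expected paragraph came back first, so the same check can be repeated after teaching the model to see whether the position changed.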

### RLHF

Let's go over some of NeuralDB's advanced features. The first one is text-to-text association. This allows you to teach the model that two keywords, phrases, or concepts are related.

Based on the above example, let's teach the model that the phrases "machine learning" and "deep learning" are related.

In [6]:
db.associate(source="machine learning", target="deep learning")
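If you have several related phrases, the same call can be applied in a loop. The sketch below assumes only the `associate(source=..., target=...)` call used in the cell above; `associate_pairs` and `RecordingDB` are hypothetical, and the stub exists solely so the example runs without NeuralDB installed.

```python
def associate_pairs(db, pairs):
    """Call db.associate once per (source, target) pair and return how many were sent."""
    count = 0
    for source, target in pairs:
        db.associate(source=source, target=target)
        count += 1
    return count


class RecordingDB:
    """Hypothetical stand-in that only records calls; it is not NeuralDB."""

    def __init__(self):
        self.calls = []

    def associate(self, source, target):
        self.calls.append((source, target))


stub = RecordingDB()
associate_pairs(stub, [("machine learning", "deep learning"), ("ml", "machine learning")])
print(stub.calls)
```

With a real `db` object you would pass it in place of `stub`; the second pair is an illustrative example, not one taken from this notebook.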

In [7]:
search_results = db.search(
    query="who works on machine learning and biology",
    top_k=10,
    on_error=lambda error_msg: print(f"Error! {error_msg}"))

for result in search_results:
    print(result.text())
    print(result.metadata()['url'])
    # print(result.context(radius=3))
    # print(result.source())
    # result.show()
    print('************')

religion: secret religion. macmillan interdisciplinary handbooks. farmington hills: gale cengage learning, 2016. histories of the hidden god: concealment and revelation in western gnostic, esoteric, and mystical traditions. gnostica series. durham: acumen, 2013. with grant adamson
https://profiles.rice.edu/faculty/april-deconick
************
her research focus is in computational biology, where she develops machine learning and statistical methods to improve our understanding of the biological circuitry that underlies living organisms and how its dysregulation may lead to disease. more specifically, she has worked on modeling tissue and cell type specificity as well as disease progression, both by developing general methods (such as semi-supervised network integration) and in applying them to decipher the molecular underpinnings of
https://profiles.rice.edu/faculty/vicky-yao
************
such synthetic networks are highly valuable to many synthetic biology applications, such as metabol

### Get Answers from OpenAI using LangChain

In this section, we will show how to use LangChain and query OpenAI's QnA module to generate an answer from the references that you retrieve from the above DB. You'll have to specify your own OpenAI key for this module to work. You can replace this segment with any other generative model of your choice. For example, you can use an open-source model like MPT or Dolly for answer generation with the same prompt that you use with OpenAI.

In [8]:
from langchain.chat_models import ChatOpenAI
from paperqa.qaprompts import qa_prompt, make_chain

your_openai_key = ""
llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.1,
    openai_api_key=your_openai_key,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [24]:
def get_references(query):
    # Retrieve the top 5 paragraphs for the query and keep only their text.
    search_results = db.search(query, top_k=5)
    return [result.text() for result in search_results]

def get_answer(query, references):
    # Only the top 3 references are passed to the LLM as context.
    return qa_chain.run(
        question=query,
        context_str='\n\n'.join(references[:3]),
        length="about 50 words",
    )
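To illustrate how the retrieval step can be decoupled from the generator, here is a minimal sketch that builds the context string separately and accepts any callable as the model. The function names and the prompt template are hypothetical placeholders; the template is not the paperqa prompt used above.

```python
def build_context(references, max_refs=3, separator="\n\n"):
    """Join the first max_refs references into one context string."""
    return separator.join(references[:max_refs])


def answer_with(generate, query, references, max_refs=3):
    """Build a prompt from the references and hand it to any text generator.

    generate is any callable that maps a prompt string to an answer string.
    The prompt template is a hypothetical placeholder.
    """
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{build_context(references, max_refs)}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)


def count_paragraphs(prompt):
    """Dummy generator that reports how many context paragraphs it received."""
    context = prompt.split("Context:\n", 1)[1].split("\n\nQuestion:", 1)[0]
    return f"{len(context.split(chr(10) + chr(10)))} paragraphs in context"


print(answer_with(count_paragraphs, "who works on biology", ["a", "b", "c", "d"]))
```

Swapping in a different model then only means passing a different `generate` callable; the retrieval code stays unchanged.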

In [25]:
query = "who works on machine learning and biology"

references = get_references(query)
print(references)

['religion: secret religion. macmillan interdisciplinary handbooks. farmington hills: gale cengage learning, 2016. histories of the hidden god: concealment and revelation in western gnostic, esoteric, and mystical traditions. gnostica series. durham: acumen, 2013. with grant adamson', 'her research focus is in computational biology, where she develops machine learning and statistical methods to improve our understanding of the biological circuitry that underlies living organisms and how its dysregulation may lead to disease. more specifically, she has worked on modeling tissue and cell type specificity as well as disease progression, both by developing general methods (such as semi-supervised network integration) and in applying them to decipher the molecular underpinnings of', 'such synthetic networks are highly valuable to many synthetic biology applications, such as metabolic engineering and creation of biological diagnostics. finally, rna regulators are potentially highly transfera

In [26]:
answer = get_answer(query, references)

print(answer)

The researcher mentioned in the context works on computational biology and develops machine learning and statistical methods to improve our understanding of biological circuitry (no source needed).


### Load and Save
As usual, saving and loading the DB are one-liners.

In [15]:
# save your db
db.save("sample_faculties.db")

# Loading is just like we showed above, with an optional progress handler.
# The handler receives a fraction between 0 and 1, so format it as a percentage.
db.from_checkpoint("sample_faculties.db", on_progress=lambda fraction: print(f"{fraction:.0%} done with loading."))

17% done with loading.
33% done with loading.
50% done with loading.
67% done with loading.
83% done with loading.
100% done with loading.


### Export to Playground

Note: Currently, we support exporting to the Playground UI with only one CSV file. If you have multiple CSV files, please watch for our next release, which will add support for exporting a NeuralDB directly into Playground.

In [None]:
from export_utils import neural_db_to_playground

neural_db_to_playground(db, './sample_faculties/', csv=csv_doc)