# Pre-training and fine-tuning using NeuralDB

In this notebook, we pre-train ThirdAI's NeuralDB from scratch on 'Scifact', one of the datasets in the popular BEIR benchmark (https://github.com/beir-cellar/beir). The goal is to demonstrate that NeuralDB, pre-trained only on this small dataset, can outperform a T5-large model trained on a huge corpus.

This demo shows that a one-model-for-all approach is sub-optimal and that pre-training on the specific downstream dataset is required to get the best results.

Please Note: You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/neural_db/examples/scifact.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

#### Import thirdai and activate license

In [None]:
!pip3 install beir
# !pip3 install thirdai --upgrade
!pip3 install "thirdai[neural_db]" --upgrade

import os
from thirdai import licensing
licensing.deactivate()
if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("")  # Enter your ThirdAI key here

#### Download and process the dataset into a csv file.

In [None]:
from thirdai.demos import download_beir_dataset

dataset = "scifact"
unsup_file, sup_train_file, sup_test_file, n_target_classes = download_beir_dataset(dataset)

In the above step, *unsup_file* refers to the corpus file, which has one row per document with the document id (`DOC_ID`), title (`TITLE`) and text (`TEXT`). The file can contain more columns with other metadata for each document. Pre-training with NeuralDB supports two types of columns, strong and weak. For the purpose of this demo, we choose `TITLE` to be the strong column and `TEXT` to be the weak column.

A couple of sample rows of the *unsup_file* are shown below.

In [None]:
import pandas as pd

pd.options.display.max_colwidth = 700
pd.read_csv(unsup_file, nrows=2)
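
If you want to try this on your own documents, the corpus needs the same three-column layout. The following is a minimal sketch, using only the Python standard library, of writing such a file. The rows and the helper name `write_corpus` are made up for illustration; only the column names `DOC_ID`, `TITLE` and `TEXT` come from this notebook.

```python
import csv
import io

# Hypothetical toy documents, for illustration only.
toy_rows = [
    {"DOC_ID": 0, "TITLE": "First toy title", "TEXT": "Body of the first toy document."},
    {"DOC_ID": 1, "TITLE": "Second toy title", "TEXT": "Body of the second toy document."},
]


def write_corpus(rows, handle):
    """Write rows in the DOC_ID / TITLE / TEXT layout used in this notebook."""
    writer = csv.DictWriter(handle, fieldnames=["DOC_ID", "TITLE", "TEXT"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open(path, "w", newline="") so the csv module controls the line endings.
buffer = io.StringIO()
write_corpus(toy_rows, buffer)
print(buffer.getvalue())
```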

#### Define a NeuralDB from scratch

In [None]:
from thirdai import neural_db as ndb
db = ndb.NeuralDB(user_id="my_user")

#### Load the unsupervised documents and create an insertable object

In [None]:
insertable_docs = []
csv_files = [unsup_file]

for file in csv_files:
    csv_doc = ndb.CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["TITLE"],
        weak_columns=["TEXT"],
        reference_columns=["TITLE", "TEXT"],
    )
    insertable_docs.append(csv_doc)

### Pre-train on the *unsup_file*

In the following step, we pre-train on the documents created above, using the strong and weak columns they specify. For this demo, we use 'TITLE' as the strong column and 'TEXT' as the weak column; either list can contain more columns. The training time and the test accuracy are shown below. We can see that by just pre-training on the Scifact dataset, we get 40% precision@1, which beats T5-large's performance on the same dataset.

In [None]:
source_ids = db.insert(insertable_docs, train=True)

#### Evaluate after pre-training

In [None]:
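def get_precision(test_file, db):
    # Read the file passed in as an argument rather than the global sup_test_file.
    test_df = pd.read_csv(test_file)
    correct_count = 0
    for i in range(test_df.shape[0]):
        query = test_df['QUERY'][i]
        top_pred = db.search(query=query, top_k=1)[0].id
        # DOC_ID may hold several colon-separated relevant ids; str() also
        # covers the case where pandas parsed the column as integers.
        if str(top_pred) in str(test_df['DOC_ID'][i]).split(":"):
            correct_count += 1
    return correct_count / test_df.shape[0]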


In [None]:
print(get_precision(sup_test_file, db))
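
To make the metric concrete, here is a minimal, library-free sketch of the same precision@1 computation. The lookup table `toy_index` and the function `toy_search` are hypothetical stand-ins for `db.search(query=query, top_k=1)[0].id`, and the queries and labels are made up; only the colon-separated label format follows the evaluation loop above.

```python
def precision_at_1(queries, relevant_ids, search_fn):
    """Fraction of queries whose top-1 prediction is one of the relevant ids."""
    correct = 0
    for query, relevant in zip(queries, relevant_ids):
        top_pred = search_fn(query)
        # Labels may list several relevant ids separated by colons, e.g. "12:45".
        if str(top_pred) in str(relevant).split(":"):
            correct += 1
    return correct / len(queries)


# Hypothetical stand-in for the NeuralDB search call: a fixed lookup table.
toy_index = {"q1": 12, "q2": 7, "q3": 3}


def toy_search(query):
    return toy_index[query]


queries = ["q1", "q2", "q3"]
relevant_ids = ["12:45", "8", 3]  # made-up labels; one hit per query is enough

score = precision_at_1(queries, relevant_ids, toy_search)
print(score)  # two of the three toy queries are answered correctly
```

A query counts as correct as soon as the single returned document matches any of its relevant ids, so queries with several labels are not penalized for the ones that were not returned.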

### Fine-tune on supervised data (OPTIONAL)

If you have supervised data that maps queries to documents, you can further improve the model performance by fine-tuning your pre-trained model on the supervised data.

The training time to fine-tune and the final accuracy are shown below. 

In [None]:
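train_df = pd.read_csv(sup_train_file)

# If DOC_ID was read as a string (colon-separated ids), keep only the first id
# as an integer and write the file back for supervised training.
if isinstance(train_df['DOC_ID'][0], str):
    train_df['DOC_ID'] = train_df['DOC_ID'].apply(lambda x: int(x.split(":")[0]))
    train_df.to_csv(sup_train_file, index=False)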
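
The conversion in the cell above can be illustrated on a few made-up labels. This is only a sketch: `first_label` is a hypothetical helper name, and the sample values are not taken from the dataset.

```python
def first_label(doc_id):
    """Return the first id of a colon-separated label as an integer."""
    if isinstance(doc_id, str):
        return int(doc_id.split(":")[0])
    # Already numeric: nothing to split.
    return int(doc_id)


toy_labels = ["12:45", "8", 3]  # made-up examples of the label formats
converted = [first_label(label) for label in toy_labels]
print(converted)
```

Note that this keeps one label per query: any additional relevant ids after the first colon are dropped from the training file, while the evaluation function still accepts all of them.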


In [None]:
db.supervised_train(
    [
        ndb.Sup(
            sup_train_file,
            query_column="QUERY",
            id_column="DOC_ID",
            source_id=source_ids[0],
        )
    ],
    learning_rate=0.001,
    epochs=10,
)

In [None]:
print(get_precision(sup_test_file, db))

### Comparisons against T5 and other models

| Model | Precision@1 |
| --- | --- |
| NeuralDB (pre-training + fine-tuning) | 77% |
| OpenAI Ada-002 | 63% |
| NeuralDB (just pre-training) | 53% |
| Instruct-L | 52% |
| T5-large | 39.3% |
| T5-base | 34.7% |