# Pre-training and fine-tuning using NeuralDB

In this notebook, we pre-train ThirdAI's NeuralDB from scratch on a dataset from the popular BEIR benchmark (https://github.com/beir-cellar/beir). We use the 'Scifact' dataset to demonstrate how UDT, after pre-training on just this small dataset, can outperform a T5-large model trained on a huge corpus.

This demo shows that a one-model-for-all approach is sub-optimal and that pre-training on the specific downstream dataset is required to get the best results.

Please Note: You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/neural_db_examples/scifact.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

#### Import thirdai and activate license

In [10]:
!pip3 install beir
!pip3 install thirdai --upgrade
!pip3 install "thirdai[neural_db]" --upgrade

import os
from thirdai import licensing
licensing.deactivate()
if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    licensing.activate("D0F869-B61466-6A28F0-14B8C6-0AC6C6-V3")  # Enter your ThirdAI key here

#### Download and process the dataset into a csv file.

In [None]:
from thirdai.demos import download_beir_dataset

dataset = "scifact"
unsup_file, sup_train_file, sup_test_file, n_target_classes = download_beir_dataset(dataset)

In the above step, *unsup_file* refers to the corpus file, which has one row per document with a document id, a title and the text. The file can contain more columns with other metadata for each document. Pre-training with NeuralDB supports two types of columns, strong and weak. For the purpose of this demo, we choose 'TITLE' to be the strong column and 'TEXT' to be the weak column.

A couple of sample rows of the *unsup_file* are shown below.

In [3]:
import pandas as pd

pd.options.display.max_colwidth = 700
pd.read_csv(unsup_file, nrows=2)

Unnamed: 0,DOC_ID,TITLE,TEXT
0,0,Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.,Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient to calculate relative anisotropy and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coeffic...
1,1,Induction of myelodysplasia by myeloid-derived suppressor cells.,Myelodysplastic syndromes (MDS) are age-dependent stem cell malignancies that share biological features of activated adaptive immune response and ineffective hematopoiesis. Here we report that myeloid-derived suppressor cells (MDSC) which are classically linked to immunosuppression inflammation and cancer were markedly expanded in the bone marrow of MDS patients and played a pathogenetic role in the development of ineffective hematopoiesis. These clonally distinct MDSC overproduce hematopoietic suppressive cytokines and function as potent apoptotic effectors targeting autologous hematopoietic progenitors. Using multiple transfected cell models we found that MDSC expansion is driven ...
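To try the same layout on your own corpus, the file only needs the three columns shown above. The following is a minimal sketch that writes such a CSV with Python's standard library; the two documents and the helper name `corpus_to_csv` are made up for illustration and are not part of the BEIR data or the ThirdAI API.

```python
import csv
import io

# Hypothetical two-document corpus; only the column names mirror unsup_file.
rows = [
    {"DOC_ID": 0, "TITLE": "A short, descriptive title", "TEXT": "The longer body text of document 0."},
    {"DOC_ID": 1, "TITLE": "Another title", "TEXT": "The longer body text of document 1."},
]


def corpus_to_csv(rows):
    """Serialize documents into CSV text with DOC_ID, TITLE and TEXT columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["DOC_ID", "TITLE", "TEXT"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = corpus_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to disk gives a file that can be passed as `path` when the insertable document is created later in the notebook.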


#### Define a NeuralDB from scratch

In [51]:
from thirdai import neural_db as ndb
db = ndb.NeuralDB(user_id="my_user")

#### Load the unsupervised documents and create an insertable object

In [6]:
insertable_docs = []
csv_files = [unsup_file]

for file in csv_files:
    csv_doc = ndb.CSV(
        path=file,
        id_column="DOC_ID",
        strong_columns=["TITLE"],
        weak_columns=["TEXT"],
        reference_columns=["TITLE", "TEXT"],
    )
    insertable_docs.append(csv_doc)

### Pre-train on the *unsup_file*

In the following step, we pre-train on the strong and weak columns specified above. For this demo, we use 'TITLE' as the strong column and 'TEXT' as the weak column; either list can contain more columns. The training time and the test accuracies are shown below. We can see that by just pre-training on the Scifact dataset, we get 50.33% precision@1, which beats T5-large's performance on the same dataset.

In [48]:
source_ids = db.insert(insertable_docs, train=True)

loaded data | source 'Documents:
unsupervised.csv' | vectors 110213 | batches 54 | time 3.316s | complete

train | epoch 0 | train_steps 54 | train_hash_precision@5=0.287826  | train_batches 54 | time 127.104s

train | epoch 1 | train_steps 108 | train_hash_precision@5=0.894878  | train_batches 54 | time 103.665s

train | epoch 2 | train_steps 162 | train_hash_precision@5=0.991183  | train_batches 54 | time 82.819s

train | epoch 3 | train_steps 216 | train_hash_precision@5=0.998269  | train_batches 54 | time 96.998s

train | epoch 4 | train_steps 270 | train_hash_precision@5=0.999361  | train_batches 54 | time 102.805s

train | epoch 5 | train_steps 324 | train_hash_precision@5=0.999757  | train_batches 54 | time 78.578s

train | epoch 6 | train_steps 378 | train_hash_precision@5=0.999933  | train_batches 54 | time 100.060s

train | epoch 7 | train_steps 432 | train_hash_precision@5=0.99992  | train_batches 54 | time 97.880s

train | epoch 8 | train_steps 486 | train_hash_precision@5=

#### Evaluate after pre-training

In [39]:
def get_precision(test_file, db):
    # Precision@1: fraction of test queries whose top result is a relevant document.
    test_df = pd.read_csv(test_file)
    correct_count = 0
    for i in range(test_df.shape[0]):
        query = test_df['QUERY'][i]
        top_pred = db.search(query=query, top_k=1)[0].id
        # DOC_ID may list several relevant ids separated by ":".
        if str(top_pred) in str(test_df['DOC_ID'][i]).split(":"):
            correct_count += 1
    return correct_count / test_df.shape[0]


In [49]:
print(get_precision(sup_test_file, db))

0.35333333333333333
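The metric computed by `get_precision` can be checked without a trained model. Below is a small sketch on made-up rankings that computes precision@1 in the same way, and also recall@k, assuming recall@k means the fraction of a query's relevant documents that appear in the top k results (the notebook does not state the exact definition behind the Recall@100 column). All function names and data here are illustrative.

```python
def parse_relevant(doc_id_field):
    """Split a colon-separated DOC_ID field such as '3:17' into a set of string ids."""
    return set(str(doc_id_field).split(":"))


def precision_at_1(ranked_ids, relevant):
    """1.0 if the top-ranked id is relevant, else 0.0."""
    return 1.0 if ranked_ids and str(ranked_ids[0]) in relevant else 0.0


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of the relevant ids that appear among the top k ranked ids."""
    top_k = {str(doc_id) for doc_id in ranked_ids[:k]}
    return len(top_k & relevant) / len(relevant)


# Made-up (ranking, DOC_ID field) pairs standing in for search results and test labels.
examples = [
    ([3, 9, 17], "3:17"),  # top hit is relevant; one of two relevant docs is in the top 2
    ([5, 8, 2], "8"),      # top hit is not relevant; the relevant doc is ranked second
]

mean_p1 = sum(precision_at_1(r, parse_relevant(d)) for r, d in examples) / len(examples)
mean_r2 = sum(recall_at_k(r, parse_relevant(d), k=2) for r, d in examples) / len(examples)
print(mean_p1, mean_r2)
```

On this toy data the first query counts as a hit and the second as a miss for precision@1, while recall@2 credits the second query because its relevant document is ranked second.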


### Fine-tune on supervised data (OPTIONAL)

If you have supervised data that maps queries to documents, you can further improve the model performance by fine-tuning your pre-trained model on the supervised data.

The training time to fine-tune and the final accuracy are shown below. 

In [32]:
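train_df = pd.read_csv(sup_train_file)

# DOC_ID may hold several relevant ids joined by ":" (e.g. "3:17").
# Keep only the first id so that the column becomes a single integer label.
if isinstance(train_df['DOC_ID'][0], str):
    train_df['DOC_ID'] = train_df['DOC_ID'].apply(lambda x: int(x.split(":")[0]))
    train_df.to_csv(sup_train_file, index=False)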
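The cell above keeps only the first relevant id for each query and discards the rest. One possible alternative, sketched here on made-up rows, is to expand every multi-label row into one (query, document) pair per relevant id. The helper `expand_multi_label` is hypothetical, and whether training on the expanded pairs improves results with NeuralDB is an untested assumption.

```python
def expand_multi_label(rows, delimiter=":"):
    """Turn each row with ids like '3:17' into one row per relevant id."""
    expanded = []
    for row in rows:
        for doc_id in str(row["DOC_ID"]).split(delimiter):
            expanded.append({"QUERY": row["QUERY"], "DOC_ID": int(doc_id)})
    return expanded


# Made-up supervised rows in the same QUERY / DOC_ID layout as sup_train_file.
rows = [
    {"QUERY": "first example query", "DOC_ID": "3:17"},
    {"QUERY": "second example query", "DOC_ID": 8},
]

pairs = expand_multi_label(rows)
for pair in pairs:
    print(pair)
```

The expanded rows keep the same two columns, so they could be written back to a CSV in place of the first-id version.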



In [None]:
db.supervised_train(
    [
        ndb.Sup(
            sup_train_file,
            query_column="QUERY",
            id_column="DOC_ID",
            source_id=source_ids[0],
        )
    ],
    learning_rate=0.001,
    epochs=10,
)

In [46]:
print(get_precision(sup_test_file, db))

0.6966666666666667


### Save and load the model

In [None]:
# Assumes the installed thirdai version provides NeuralDB.save and NeuralDB.from_checkpoint.
db.save('./scifact.ndb')

db = ndb.NeuralDB.from_checkpoint('./scifact.ndb')

### Comparisons against T5

| Model | Precision@1 | Recall@100 |
| --- | --- | --- |
| UDT (pre-training + fine-tuning) | 58% | 90% |
| UDT (just pre-training) | 40% | 82.3% |
| T5-large | 39.3% | 82% |
| T5-base | 34.7% | 80% |