# Pre-training and fine-tuning an LLM on CPU on BEIR datasets with ThirdAI's UDT

In this notebook, we pre-train an LLM from scratch on one of the popular BEIR datasets (https://github.com/beir-cellar/beir) using ThirdAI's Universal Deep Transformer (UDT). We use the 'Scifact' dataset to demonstrate that UDT, pre-trained only on this small dataset, can outperform a T5-large model trained on a huge corpus.

This demo shows that a one-model-for-all approach is sub-optimal and that pre-training on the specific downstream dataset is required to get the best results.

While most LLMs cannot be fine-tuned even on a powerful GPU, ThirdAI's UDT can train a billion-parameter model on just a moderate CPU in a few minutes.

Please Note: You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/Scifact%20Pre-training%20Demo.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

### Import thirdai and activate license

In [None]:
!pip3 install beir
!pip3 install thirdai --upgrade

import thirdai
thirdai.licensing.activate('8E46E4-653FD6-AA2B02-65E265-A4FACA-V3')

### Download and process the dataset into a csv file.

In [None]:
from thirdai.demos import download_beir_dataset

dataset = "scifact"
unsup_file, sup_train_file, sup_test_file, n_target_classes = download_beir_dataset(dataset)

In the above step, *unsup_file* refers to the corpus file with document id, title and text. We can have even more columns with other metadata for each document. Pre-training with UDT supports two types of columns, strong and weak. For the purpose of this demo, we choose 'title' to be the strong column and 'text' to be the weak column. You can play with different settings shown in the pre-training (model.cold_start()) step.

A couple of sample rows of the *unsup_file* are shown below.

PLEASE NOTE: Currently, UDT's cold_start function requires the DOC_ID to be an integer. We will add support for other formats in a future release.
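
If your own corpus uses non-integer document ids, one option is to remap them before calling cold_start. The snippet below is a minimal sketch of that idea using only the Python standard library. The helper name `remap_doc_ids`, the sample rows, and the choice of consecutive integers starting at 0 are illustrative assumptions, not part of the thirdai package.

```python
import csv
import io


def remap_doc_ids(rows, id_column="DOC_ID"):
    """Replace arbitrary document ids with consecutive integers.

    Hypothetical helper for illustration; numbering from 0 is an assumption.
    """
    id_map = {}
    remapped = []
    for row in rows:
        original = row[id_column]
        # Assign the next free integer the first time an id is seen.
        if original not in id_map:
            id_map[original] = len(id_map)
        new_row = dict(row)
        new_row[id_column] = id_map[original]
        remapped.append(new_row)
    return remapped, id_map


# Tiny in-memory corpus standing in for a real CSV file.
sample_csv = io.StringIO(
    "DOC_ID,TITLE,TEXT\n"
    "doc-a,First title,First body\n"
    "doc-b,Second title,Second body\n"
)
rows = list(csv.DictReader(sample_csv))
remapped_rows, id_map = remap_doc_ids(rows)
print(id_map)  # {'doc-a': 0, 'doc-b': 1}
```

The same `id_map` would also have to be applied to the 'DOC_ID' column of any supervised files so that queries keep pointing at the right documents.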

In [None]:
import pandas as pd

pd.options.display.max_colwidth = 700
pd.read_csv(unsup_file, nrows=2)

### Define a UDT model

The column name 'QUERY' has to match the one in the header of the supervised files (*sup_train_file* and *sup_test_file*).
The column name 'DOC_ID' has to match the one in the header of the corpus file (*unsup_file*).
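
A quick header check can catch a column-name mismatch before training starts. This is a minimal sketch using the standard library; `missing_columns` is a hypothetical helper, and the in-memory CSV text stands in for the real files.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# In-memory stand-ins for a supervised file and a corpus file.
sup_sample = io.StringIO("QUERY,DOC_ID\nsome question,12\n")
corpus_sample = io.StringIO("DOC_ID,TITLE,TEXT\n12,A title,Some text\n")

print(missing_columns(sup_sample, ["QUERY", "DOC_ID"]))              # []
print(missing_columns(corpus_sample, ["DOC_ID", "TITLE", "TEXT"]))   # []
```

With real files, you would pass an opened file object, for example `open(sup_train_file, newline="")`, in place of the in-memory samples.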

In [None]:
from thirdai import bolt
import os

config_dir = os.path.join(os.path.abspath(""), "../configs/")

model = bolt.UniversalDeepTransformer(
    data_types={
        "QUERY": bolt.types.text(),
        "DOC_ID": bolt.types.categorical(delimiter=':'),
    },
    target="DOC_ID",
    n_target_classes=n_target_classes,
    integer_target=True,
    model_config=os.path.join(config_dir, "embeddings_and_cold_start.config"),
)
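
The `delimiter=':'` argument on the 'DOC_ID' column suggests that a single query row may list several relevant documents in one cell, separated by colons. That reading is an assumption here, not something this notebook states. Under it, a cell could be split as in the following sketch; `parse_doc_ids` is a hypothetical helper and the sample values are made up.

```python
def parse_doc_ids(cell, delimiter=":"):
    """Split a delimited DOC_ID cell into a list of integer ids.

    Assumes the cell looks like "12:45:7"; empty parts are skipped.
    """
    return [int(part) for part in cell.split(delimiter) if part]


print(parse_doc_ids("12:45:7"))  # [12, 45, 7]
print(parse_doc_ids("3"))        # [3]
```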

### Pre-train (Cold Start) on the *unsup_file*

In the following step, we pre-train the model by specifying the strong and weak columns. For this demo, we use 'TITLE' as the strong column and 'TEXT' as the weak column. Either list can contain more than one column. The training time and the test accuracies are shown below. By just pre-training on the Scifact dataset, we get 40% precision@1, which beats T5-large's 39.3% on the same dataset (see the comparison table at the end of this notebook).

In [None]:
model.cold_start(
    filename=unsup_file,
    strong_column_names=["TITLE"],
    weak_column_names=["TEXT"],
    learning_rate=0.001,
    epochs=5,
)

activations = model.evaluate(sup_test_file, metrics=['categorical_accuracy','recall@100'])
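
To make the reported numbers easier to interpret, here is a minimal sketch of the usual definitions of precision@1 and recall@k for retrieval. It is not UDT's internal implementation. The function names and toy rankings are illustrative, and treating 'categorical_accuracy' as equivalent to precision@1 is an assumption.

```python
def precision_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked document is relevant."""
    hits = sum(1 for q, docs in ranked.items() if docs and docs[0] in relevant[q])
    return hits / len(ranked)


def recall_at_k(ranked, relevant, k):
    """Mean fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for q, docs in ranked.items():
        top_k = set(docs[:k])
        total += len(top_k & relevant[q]) / len(relevant[q])
    return total / len(ranked)


# Toy data: ranked document ids per query and the ground-truth relevant sets.
ranked = {"q1": [3, 7, 1], "q2": [5, 2, 9]}
relevant = {"q1": {3, 1}, "q2": {2}}

print(precision_at_1(ranked, relevant))   # 0.5
print(recall_at_k(ranked, relevant, 2))   # 0.75
```

In the toy data, only q1 has a relevant document at rank 1, so precision@1 is 0.5. Within the top 2, q1 recovers one of its two relevant documents and q2 recovers its only one, giving a mean recall@2 of 0.75.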

### Fine-tune on supervised data (OPTIONAL)

If you have supervised data that maps queries to documents, you can further improve the model performance by fine-tuning your pre-trained model on the supervised data.

Please note that your *sup_train_file* and *sup_test_file* should have the same column names, 'QUERY' and 'DOC_ID'.

The training time to fine-tune and the final accuracy are shown below.

In [None]:
model.train(
    filename=sup_train_file,
    learning_rate=0.001,
    epochs=5,
)

activations = model.evaluate(sup_test_file, metrics=['categorical_accuracy','recall@100'])

### Save and load the model

In [None]:
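# Save the trained model to disk (named after the Scifact dataset used in this demo).
model.save('./scifact.model')

# Load it back to confirm that the saved model can be restored.
model = bolt.UniversalDeepTransformer.load('./scifact.model')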

### Comparisons against T5

| Model | Precision@1 | Recall@100 |
| --- | --- | --- |
| UDT (pre-training + fine-tuning) | 58% | 90% |
| UDT (just pre-training) | 40% | 82.3% |
| T5-large | 39.3% | 82% |
| T5-base | 34.7% | 80% |