# Pre-training and fine-tuning an LLM on CPU on AG News with ThirdAI's UDT

In this notebook, we will pre-train an LLM from scratch on the popular AG News dataset (https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset) using ThirdAI's Universal Deep Transformer (UDT). We will show how UDT, pre-trained on just this small dataset, can outperform the semantic search offering of OpenAI.

This demo shows that a one-model-for-all approach is sub-optimal and that pre-training or fine-tuning on the specific downstream dataset is required to get the best results.

While most LLMs cannot be fine-tuned even on a powerful GPU, ThirdAI's UDT can train a billion-parameter model on just a moderate CPU in a few minutes.

You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/llm_search/AgNewsDemo.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

### Import thirdai and activate license

In [1]:
!pip3 install datasets
# !pip3 install thirdai --upgrade
!pip3 install ray

import thirdai



## Ray Cluster Initialization
For the purpose of this demo, we will initialize a mock Ray cluster here.

In [2]:
from ray.cluster_utils import Cluster

mini_cluster = Cluster(
    initialize_head=True,
    head_node_args={
        "num_cpus": 3,
    },
)
mini_cluster.add_node(num_cpus=3)

Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.


<ray._private.node.Node at 0x13b326c10>

### Download and process the dataset into a csv file.

In [3]:
from datasets import load_dataset

corpus = load_dataset("ag_news")["train"]["text"]
num_datapoints = len(corpus)

train_filenames = ['agnews_train_0.csv', 'agnews_train_1.csv']

# Write the first half of the corpus to one file and the second half to the other.
with open(train_filenames[0], 'w') as file_1, open(train_filenames[1], 'w') as file_2:
    file_1.write("id,text\n")
    file_2.write("id,text\n")

    for idx, line in enumerate(corpus):
        # Commas are replaced with spaces so the text stays in a single CSV column.
        row = str(idx) + "," + line.replace(",", " ").lower() + "\n"
        if idx < num_datapoints // 2:
            file_1.write(row)
        else:
            file_2.write(row)

Found cached dataset ag_news (/Users/mjay/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 64.63it/s]
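The two-file split above can be generalized to any number of shards. The following is a minimal sketch, assuming one CSV file per shard; `write_shards`, `prefix`, and `out_dir` are illustrative names and not part of the notebook or of ThirdAI's API. It uses the standard `csv` module so that any quote characters in the text are escaped consistently.

```python
import csv
import os
import tempfile


def write_shards(texts, num_shards, out_dir, prefix="agnews_train"):
    """Write `texts` into `num_shards` CSV files with globally unique integer ids.

    All names here are hypothetical helpers for illustration only.
    """
    # Ceiling division so every row lands in exactly one shard.
    shard_size = -(-len(texts) // num_shards)
    filenames = []
    for shard in range(num_shards):
        start = shard * shard_size
        stop = min(start + shard_size, len(texts))
        path = os.path.join(out_dir, f"{prefix}_{shard}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(["id", "text"])
            for idx in range(start, stop):
                # Same normalization as the notebook: drop commas, lowercase.
                writer.writerow([idx, texts[idx].replace(",", " ").lower()])
        filenames.append(path)
    return filenames


demo_texts = [
    "Stocks rally, again",
    "Chip makers slump",
    "Oil prices climb",
    "New phone launched",
    "Rain delays match",
]
with tempfile.TemporaryDirectory() as tmp:
    shard_files = write_shards(demo_texts, num_shards=2, out_dir=tmp)
    shard_rows = []
    for path in shard_files:
        with open(path, newline="") as f:
            shard_rows.append(list(csv.DictReader(f)))
```

With an odd number of rows, this sketch puts the extra row in the first shard, whereas the notebook's loop puts it in the second; for the even-sized AG News training set the two splits coincide.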


In the above step, the two files listed in *train_filenames* together make up the corpus; each row holds a document id and its text. The files can have more columns with other metadata for each row. Pre-training with UDT supports two types of columns, strong and weak. For the purpose of this demo, we choose *text* to be the strong column and leave the weak column list empty.

A couple of sample rows from the first corpus file are shown below.

PLEASE NOTE: Currently, UDT's cold_start function requires the *id* to be an integer. We will add support for other formats in a future release.

In [4]:
import pandas as pd

pd.options.display.max_colwidth = 700
pd.read_csv(train_filenames[0], nrows=2)

| | id | text |
| --- | --- | --- |
| 0 | 0 | wall st. bears claw back into the black (reuters) reuters - short-sellers wall street's dwindling\band of ultra-cynics are seeing green again. |
| 1 | 1 | carlyle looks toward commercial aerospace (reuters) reuters - private investment firm carlyle group \which has a reputation for making well-timed and occasionally\controversial plays in the defense industry has quietly placed\its bets on another part of the market. |
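Since the ids must be integers, a quick sanity check on the corpus files can catch problems before training. The sketch below additionally assumes that ids should cover exactly 0 to n-1 across all files, which is how this notebook writes them; that contiguity requirement is an assumption made here, not something stated by ThirdAI. The helper `check_ids` is hypothetical.

```python
import csv
import io


def check_ids(csv_sources, id_column="id"):
    """Return the row count if ids are exactly 0..n-1, otherwise raise ValueError.

    `csv_sources` is an iterable of open text files (or file-like objects).
    """
    seen = set()
    for source in csv_sources:
        for row in csv.DictReader(source):
            raw = row[id_column]
            try:
                doc_id = int(raw)
            except ValueError:
                raise ValueError(f"non-integer id: {raw!r}") from None
            if doc_id in seen:
                raise ValueError(f"duplicate id: {doc_id}")
            seen.add(doc_id)
    # Assumption: ids are expected to be contiguous and start at 0.
    if seen != set(range(len(seen))):
        raise ValueError("ids are not contiguous from 0")
    return len(seen)


# In-memory stand-ins for two corpus shards.
shard_a = io.StringIO("id,text\n0,first doc\n1,second doc\n")
shard_b = io.StringIO("id,text\n2,third doc\n")
num_rows = check_ids([shard_a, shard_b])
```

In the notebook, the same check could be run on the opened files listed in `train_filenames`.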


### Define a UDT model

The column name *query* can be anything of your choice.
The column name *id* should match the one in the header of the corpus files.

In [5]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "query": bolt.types.text(),
        "id": bolt.types.categorical(delimiter=':'),
    },
    target="id",
    n_target_classes=num_datapoints,
    integer_target=True,
    model_config='../configs/embeddings_and_cold_start_0.005.config',
)

## Distributed Training

We will now train the UDT model in a distributed data-parallel fashion. Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [6]:
import thirdai.distributed_bolt as dist_bolt
import os

cluster_config = dist_bolt.RayTrainingClusterConfig(
    num_workers=2,
    cluster_address=mini_cluster.address,
    requested_cpus_per_node=3,
    communication_type="linear",
    ignore_reinit_error=True,
)

model.cold_start_distributed(
    cluster_config=cluster_config,
    filenames=train_filenames,
    strong_column_names=["text"],
    weak_column_names=[],
    learning_rate=0.001,
    epochs=5,
    metrics=['categorical_accuracy'],
)

NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2023-06-05 18:12:54,042	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 127.0.0.1:58887...
2023-06-05 18:12:54,055	INFO worker.py:1625 -- Connected to Ray cluster.


[2m[36m(ReplicaWorker pid=47352)[0m loading data | source 'agnews_train_1.csv'
[2m[36m(ReplicaWorker pid=47352)[0m loaded data | source 'agnews_train_1.csv' | vectors 60000 | batches 59 | time 0s | complete
[2m[36m(ReplicaWorker pid=47352)[0m 

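As a conceptual aside, data-parallel training can be pictured as each worker computing a gradient on its own shard, after which the gradients are averaged and every replica applies the same update. The toy sketch below illustrates only that general idea on a one-parameter model; it is not ThirdAI's or Ray's implementation and says nothing about what the `"linear"` communication type does internally.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the 1-D model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


# Toy data following y = 3x, split into two equal shards (one per worker).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

# With equal-sized shards, the averaged shard gradient equals the full-batch gradient.
full_grad_at_zero = gradient(0.0, data)
avg_grad_at_zero = sum(gradient(0.0, s) for s in shards) / len(shards)

w = 0.0
learning_rate = 0.05
for _ in range(200):
    # Each worker computes a gradient on its own shard ...
    local_grads = [gradient(w, shard) for shard in shards]
    # ... and the average is applied identically on every replica.
    avg_grad = sum(local_grads) / len(local_grads)
    w -= learning_rate * avg_grad
```

Because the shards are the same size, the averaged update matches what a single machine would compute on the full batch, which is why splitting the corpus into equal halves keeps the workers in step.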

### Save and load the model

In [None]:
model.save('./agnews.model')

model = bolt.UniversalDeepTransformer.load('./agnews.model')

## Make Predictions

### Example 1

In [None]:
import numpy as np
import pandas as pd

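# Load both corpus files into one frame; row position then equals the document id.
df = pd.concat([pd.read_csv(f) for f in train_filenames], ignore_index=True)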

activations = model.predict({'query':'BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases'})
top_preds = np.argsort(-activations)[:5]

df.iloc[top_preds]

| | id | text |
| --- | --- | --- |
| 27868 | 27868 | world briefings britain: blair warns of climate threat prime minister tony blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases. |
| 113457 | 113457 | ecological forum gets greenhouse gas report a new report on ecological damage from greenhouse gases dominated the sidelines of a un conference on global warming saturday as delegates from nearly 200 nations |
| 74462 | 74462 | nations to discuss what to do after kyoto treaty the ice is melting and the heat is on for international delegates assembling in buenos aires this week to find new ways to confront global warming under the 194-nation treaty on climate change. |
| 78602 | 78602 | arctic endangered by greenhouse gases: report washington: greenhouse gases have contributed to a gradual warming of the ecologically-fragile arctic region causing massive climate changes including melting glaciers and sea ice according to a soon-to-be-released environmental study. |
| 100819 | 100819 | group passes on addressing global warming (ap) ap - although faced with fresh evidence of global warming the united states and other members the arctic council on wednesday failed to make any recommendations to combat a problem most scientists say is causing sea ice to melt and temperatures to rise. |


For the same query, here are the top-5 results returned by OpenAI's Search and Recommendation notebook (https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb).

| text |
| --- |
| THE re-election of British Prime Minister Tony Blair would be seen as an endorsement of the military action in Iraq, Prime Minister John Howard said today |
| LONDON, England -- A US scientist is reported to have observed a surprising jump in the amount of carbon dioxide, the main greenhouse gas. |
| The anguish of hostage Kenneth Bigley in Iraq hangs over Prime Minister Tony Blair today as he faces the twin test of a local election and a debate by his Labour Party about the divisive war. |
| Israel is prepared to back a Middle East conference convened by Tony Blair early next year despite having expressed fears that the British plans were over-ambitious and designed |
| AFP - A battle group of British troops rolled out of southern Iraq on a US-requested mission to deadlier areas near Baghdad, in a major political gamble for British Prime Minister Tony Blair. |
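The retrieval step in the example above ranks all documents by activation and keeps the five highest-scoring ids. As a minimal sketch of the same top-k selection without a full sort, the standard-library `heapq` module can be used; `top_k_indices` and the toy scores are illustrative and not part of the notebook.

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Toy activations over five documents; index = document id.
toy_activations = [0.01, 0.40, 0.05, 0.30, 0.24]
toy_top = top_k_indices(toy_activations, k=3)
print(toy_top)  # [1, 3, 4]
```

For a corpus of this size, sorting all activations as the notebook does is perfectly adequate; a heap-based selection mainly matters when the number of documents is much larger than k.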


### Example 2

In [None]:
activations = model.predict({'query':'PC World - Upcoming chip set will include built-in security features for your PC'})
top_preds = np.argsort(-activations)[:5]

df.iloc[top_preds]

| | id | text |
| --- | --- | --- |
| 66178 | 66178 | nvidia puts a firewall on a motherboard (pc world) pc world - upcoming chip set will include built-in security features for your pc. |
| 110674 | 110674 | nvidia will supply graphics chip for new playstation nvidia will supply the graphics chip for the successor to the playstation 2 games console being developed by sony computer entertainment inc. |
| 71475 | 71475 | intel prepares for the next 20 years chip maker plans for smaller faster less power-hungry processors. |
| 30177 | 30177 | ibm builds in pc security safekeeper module stores passwords encryption keys in thinkcentre desktops. |
| 84918 | 84918 | mcafee unveils 2005 security suite security software maker has released updated versions of its offerings for home computer users.\ |


For the same query, here are the top-5 results returned by OpenAI's Search and Recommendation notebook (https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb).

| text |
| --- |
| PC World - Updated antivirus software for businesses adds intrusion prevention features. |
| PC World - The one-time World Class Product of the Year PDA gets a much-needed upgrade. |
| PC World - Send your video throughout your house--wirelessly--with new gateways and media adapters. |
| PC World - Symantec, McAfee hope raising virus-definition fees will move users to\  suites. |
| Gateway computers will be more widely available at Office Depot, in the PC maker #39;s latest move to broaden distribution at retail stores since acquiring rival eMachines this year. |