# Pre-training and fine-tuning an LLM on CPU on AG News with ThirdAI's UDT

In this notebook, we pre-train an LLM from scratch on the popular AG News dataset (https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset) using ThirdAI's Universal Deep Transformer (UDT). We demonstrate that UDT, pre-trained only on this small dataset, can outperform the Semantic Search offering of OpenAI.

This demo shows that a one-model-for-all approach is sub-optimal and that pre-training or fine-tuning on the specific downstream dataset is required to get the best results.

While most LLMs cannot be fine-tuned even on a powerful GPU, ThirdAI's UDT can train a billion-parameter model on just a moderate CPU in a few minutes.

You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/llm_search/AgNewsDemo.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

### Import thirdai and activate license

In [None]:
!pip3 install datasets
!pip3 install thirdai --upgrade

import thirdai
thirdai.licensing.activate('71FC4B-F20E8F-D7C39E-4E936C-404BC9-V3')

### Download and process the dataset into a csv file.

In [None]:
from thirdai.demos import download_agnews_dataset

corpus_file = './agnews.csv'
n_target_classes = download_agnews_dataset(corpus_file)

In the above step, *corpus_file* refers to the CSV file that holds a document id and text for each row. The file can also have more columns with other metadata for each row. Pre-training with UDT supports two types of columns, strong and weak. For this demo, we choose *text* as the strong column and leave the weak column list empty.

A couple of sample rows of the *corpus_file* are shown below.

PLEASE NOTE: Currently, UDT's cold_start function requires the *id* to be an integer. We will add support for other formats in a future release.

In [8]:
import pandas as pd

pd.options.display.max_colwidth = 700
pd.read_csv(corpus_file, nrows=2)

| Unnamed: 0 | id | text |
| --- | --- | --- |
| 0 | 0 | wall st. bears claw back into the black (reuters) reuters - short-sellers wall street's dwindling\band of ultra-cynics are seeing green again. |
| 1 | 1 | carlyle looks toward commercial aerospace (reuters) reuters - private investment firm carlyle group \which has a reputation for making well-timed and occasionally\controversial plays in the defense industry has quietly placed\its bets on another part of the market. |
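
If your own documents carry non-integer identifiers, one way to meet the integer *id* requirement noted above is to use each document's row position as its id and keep a separate mapping back to the original key. The following is a minimal sketch using only Python's standard library; the `docs` dictionary, the `write_corpus` helper, and the output path are hypothetical names chosen for illustration and are not part of thirdai.

```python
import csv
import os
import tempfile


def write_corpus(docs, path):
    """Write `docs` (original_key -> text) to a CSV with integer ids.

    Returns a list in which position i holds the original key of id i.
    """
    id_to_key = []
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])  # column names used in this demo
        for new_id, (key, text) in enumerate(docs.items()):
            writer.writerow([new_id, text])
            id_to_key.append(key)
    return id_to_key


# Hypothetical documents keyed by string identifiers.
docs = {
    "doc-a": "wall st. bears claw back into the black",
    "doc-b": "carlyle looks toward commercial aerospace",
}

corpus_path = os.path.join(tempfile.mkdtemp(), "my_corpus.csv")
id_to_key = write_corpus(docs, corpus_path)

# Read the file back to confirm the layout: integer id, then text.
with open(corpus_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

After prediction, `id_to_key[predicted_id]` recovers the original identifier of a document.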


### Define a UDT model

The column name *query* can be any name of your choice.
The column name *id* must match the one in the header of the *corpus_file*.

In [9]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "query": bolt.types.text(),
        "id": bolt.types.categorical(delimiter=':'),
    },
    target="id",
    n_target_classes=n_target_classes,
    integer_target=True,
    model_config='../configs/embeddings_and_cold_start_0.005.config',
)

### Pre-train (Cold Start) on the *corpus_file*

In the following step, we pre-train the model by specifying the strong and weak columns. For this demo, we use *text* as the strong column and leave the weak columns as an empty list. Either list can contain more columns. The training time and accuracy for each epoch are shown below.

In [10]:
model.cold_start(
    filename=corpus_file,
    strong_column_names=["text"],
    weak_column_names=[],
    learning_rate=0.001,
    epochs=5,
    metrics=['categorical_accuracy'],
)

loaded data | source './agnews.csv' | vectors 120000 | batches 59 | time 0s | complete

train | epoch 0 | train_steps 59 | {categorical_accuracy: 0.00118333} | train_batches 59 | time 107s | complete

train | epoch 1 | train_steps 118 | {categorical_accuracy: 0.17175} | train_batches 59 | time 82s | complete

train | epoch 2 | train_steps 177 | {categorical_accuracy: 0.490392} | train_batches 59 | time 87s | complete

train | epoch 3 | train_steps 236 | {categorical_accuracy: 0.719392} | train_batches 59 | time 87s | complete

train | epoch 4 | train_steps 295 | {categorical_accuracy: 0.868383} | train_batches 59 | time 88s | complete



{'epoch_times': [82.0, 87.0, 87.0, 88.0],
 'categorical_accuracy': [0.17175,
  0.49039166666666667,
  0.7193916666666667,
  0.8683833333333333]}
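
To put the log above in perspective, the per-epoch times can be added up to get the total training time. The sketch below copies the five `train | epoch ...` lines from the output above and extracts the epoch number, accuracy, and time with a regular expression. The pattern is an assumption based on the log format shown here, not a documented thirdai format.

```python
import re

# The five training log lines copied from the output above.
log = """
train | epoch 0 | train_steps 59 | {categorical_accuracy: 0.00118333} | train_batches 59 | time 107s | complete
train | epoch 1 | train_steps 118 | {categorical_accuracy: 0.17175} | train_batches 59 | time 82s | complete
train | epoch 2 | train_steps 177 | {categorical_accuracy: 0.490392} | train_batches 59 | time 87s | complete
train | epoch 3 | train_steps 236 | {categorical_accuracy: 0.719392} | train_batches 59 | time 87s | complete
train | epoch 4 | train_steps 295 | {categorical_accuracy: 0.868383} | train_batches 59 | time 88s | complete
"""

# Assumed line layout: "epoch N ... categorical_accuracy: X} ... time Ts".
pattern = re.compile(r"epoch (\d+) .*categorical_accuracy: ([\d.]+)\}.*time (\d+)s")

# One (epoch, accuracy, seconds) tuple per matching line.
records = [
    (int(m.group(1)), float(m.group(2)), int(m.group(3)))
    for m in map(pattern.search, log.splitlines())
    if m
]

total_seconds = sum(seconds for _, _, seconds in records)
final_accuracy = records[-1][1]

print(f"total training time: {total_seconds}s")  # total training time: 451s
print(f"final training accuracy: {final_accuracy}")
```

Summed over the five logged epochs, training took 451 seconds, or roughly seven and a half minutes, on the machine that produced this log.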

### Save and load the model

In [6]:
model.save('./agnews.model')

model = bolt.UniversalDeepTransformer.load('./agnews.model')

## Make Predictions

### Example 1

In [11]:
import numpy as np
import pandas as pd

df = pd.read_csv(corpus_file)

activations = model.predict({'query':'BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases'})
top_preds = np.argsort(-activations)[:5]

df.iloc[top_preds]

| Unnamed: 0 | id | text |
| --- | --- | --- |
| 27868 | 27868 | world briefings britain: blair warns of climate threat prime minister tony blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases. |
| 113457 | 113457 | ecological forum gets greenhouse gas report a new report on ecological damage from greenhouse gases dominated the sidelines of a un conference on global warming saturday as delegates from nearly 200 nations |
| 74462 | 74462 | nations to discuss what to do after kyoto treaty the ice is melting and the heat is on for international delegates assembling in buenos aires this week to find new ways to confront global warming under the 194-nation treaty on climate change. |
| 78602 | 78602 | arctic endangered by greenhouse gases: report washington: greenhouse gases have contributed to a gradual warming of the ecologically-fragile arctic region causing massive climate changes including melting glaciers and sea ice according to a soon-to-be-released environmental study. |
| 100819 | 100819 | group passes on addressing global warming (ap) ap - although faced with fresh evidence of global warming the united states and other members the arctic council on wednesday failed to make any recommendations to combat a problem most scientists say is causing sea ice to melt and temperatures to rise. |
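
The ranking step in the cell above, `np.argsort(-activations)[:5]`, orders the ids from highest to lowest activation and keeps the first five. Looking the results up with `df.iloc[top_preds]` relies on each id matching its row position, which holds for the rows shown here. As a toy illustration of the same idea without any dependencies, the sketch below ranks a small made-up list of scores; the `top_k` helper, the score values, and the `corpus` list are hypothetical.

```python
def top_k(scores, k):
    """Return the indices of the k largest scores, highest first."""
    # Sorting on the negated score mirrors np.argsort(-activations).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]


# Made-up activations for a five-document corpus (index == document id).
scores = [0.05, 0.60, 0.10, 0.20, 0.05]
corpus = ["doc 0", "doc 1", "doc 2", "doc 3", "doc 4"]

ranked = top_k(scores, 3)
results = [corpus[i] for i in ranked]

print(ranked)   # [1, 3, 2]
print(results)  # ['doc 1', 'doc 3', 'doc 2']
```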


For the same query, here are the top-5 results returned by OpenAI's Search and Recommendation notebook (https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb).

| text |
| --- |
| THE re-election of British Prime Minister Tony Blair would be seen as an endorsement of the military action in Iraq, Prime Minister John Howard said today |
| LONDON, England -- A US scientist is reported to have observed a surprising jump in the amount of carbon dioxide, the main greenhouse gas. |
| The anguish of hostage Kenneth Bigley in Iraq hangs over Prime Minister Tony Blair today as he faces the twin test of a local election and a debate by his Labour Party about the divisive war. |
| Israel is prepared to back a Middle East conference convened by Tony Blair early next year despite having expressed fears that the British plans were over-ambitious and designed |
| AFP - A battle group of British troops rolled out of southern Iraq on a US-requested mission to deadlier areas near Baghdad, in a major political gamble for British Prime Minister Tony Blair. |


### Example 2

In [11]:
activations = model.predict({'query':'PC World - Upcoming chip set will include built-in security features for your PC'})
top_preds = np.argsort(-activations)[:5]

df.iloc[top_preds]

| Unnamed: 0 | id | text |
| --- | --- | --- |
| 66178 | 66178 | nvidia puts a firewall on a motherboard (pc world) pc world - upcoming chip set will include built-in security features for your pc. |
| 110674 | 110674 | nvidia will supply graphics chip for new playstation nvidia will supply the graphics chip for the successor to the playstation 2 games console being developed by sony computer entertainment inc. |
| 71475 | 71475 | intel prepares for the next 20 years chip maker plans for smaller faster less power-hungry processors. |
| 30177 | 30177 | ibm builds in pc security safekeeper module stores passwords encryption keys in thinkcentre desktops. |
| 84918 | 84918 | mcafee unveils 2005 security suite security software maker has released updated versions of its offerings for home computer users.\ |


For the same query, here are the top-5 results returned by OpenAI's Search and Recommendation notebook (https://github.com/openai/openai-cookbook/blob/main/examples/Recommendation_using_embeddings.ipynb).

| text |
| --- |
| PC World - Updated antivirus software for businesses adds intrusion prevention features. |
| PC World - The one-time World Class Product of the Year PDA gets a much-needed upgrade. |
| PC World - Send your video throughout your house--wirelessly--with new gateways and media adapters. |
| PC World - Symantec, McAfee hope raising virus-definition fees will move users to\  suites. |
| Gateway computers will be more widely available at Office Depot, in the PC maker #39;s latest move to broaden distribution at retail stores since acquiring rival eMachines this year. |