## Data Parallel Training for ThirdAI's UDT

This notebook shows how to run data parallel training for ThirdAI's UDT. The demo uses the CLINC 150 small dataset for training and evaluation, but you can easily substitute your own workload.

ThirdAI's distributed data parallel training assumes that you already have a Ray cluster running. For this demo, we use a mock Ray cluster to simulate a real one. For setting up a Ray cluster, see https://docs.ray.io/en/latest/cluster/getting-started.html
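
As a rough picture of what data parallel training means, the sketch below runs gradient descent on a toy one-parameter model with two simulated workers. It is not ThirdAI's implementation: `local_gradient` and `all_reduce_mean` are hypothetical names, and the averaging step only stands in for whatever collective communication the real backend performs.

```python
# Toy illustration of data parallel SGD on a 1-D least-squares problem.
# Assumption: this mirrors the general technique only, not ThirdAI's internals.

def local_gradient(weight, shard):
    """Mean gradient of 0.5 * (weight * x - y)^2 over one worker's shard."""
    return sum((weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)


# Two simulated workers, each holding its own shard of data drawn from y = 3x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

weight = 0.0
learning_rate = 0.05
for step in range(200):
    # Every worker computes a gradient on its own shard with the same weight.
    grads = [local_gradient(weight, shard) for shard in shards]
    # Averaging the gradients keeps all model replicas identical after the update.
    weight -= learning_rate * all_reduce_mean(grads)

print(round(weight, 4))
```

Each worker only ever reads its own shard, which is why the rest of this notebook prepares one training file per worker.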

In [1]:
#!pip3 install thirdai --upgrade --no-cache-dir --force-reinstall
!pip3 install ray
!pip3 install torch

import thirdai
from thirdai import bolt
import thirdai.distributed_bolt as dist     





## Ray Cluster Initialization
For the purposes of this demo, we initialize a mock Ray cluster of 2 nodes.

In [2]:
import ray
from ray.air import ScalingConfig, session

cpus_per_node = (dist.get_num_cpus() - 1) // 2

ray.init(ignore_reinit_error=True)
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=False,
    trainer_resources={"CPU": 1},
    resources_per_worker={"CPU": cpus_per_node},
    placement_strategy="PACK",
)

2023-08-08 08:22:37,625	INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8266
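
The CPU split used above reserves one CPU for the trainer and divides the rest evenly between the two workers. A minimal sketch of that arithmetic follows, assuming `dist.get_num_cpus()` returns the number of CPUs on the machine; `cpus_per_worker` is a hypothetical helper, not part of ThirdAI or Ray.

```python
def cpus_per_worker(total_cpus, num_workers, trainer_cpus=1):
    """Evenly divide the CPUs that remain after reserving some for the trainer."""
    available = total_cpus - trainer_cpus
    if available < num_workers:
        raise ValueError("not enough CPUs for the requested number of workers")
    # Integer division: any leftover CPU simply stays unassigned.
    return available // num_workers


# On a 9-CPU machine with 2 workers, each worker gets (9 - 1) // 2 CPUs.
print(cpus_per_worker(9, 2))  # 4
```

Keeping the per-worker request within this budget matters because Ray will not schedule a worker whose resource request exceeds what the cluster has free.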


## Dataset Download

We use the demos module in the thirdai package to download the CLINC 150 small dataset, split into one training file per worker. You can replace this step and the next step with a download method and a UDT initialization specific to your dataset.

In [3]:
from thirdai.demos import download_clinc_dataset

train_filenames, test_filename, _ = download_clinc_dataset(num_training_files=2, clinc_small=True)
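
If you bring your own dataset, you will need one training file per worker. The sketch below shows one possible way to split a single CSV into shards using only the standard library. The function name `split_csv`, the shard file names, and the tiny made-up dataset are all assumptions for illustration; `download_clinc_dataset` may split its files differently.

```python
import csv
import os
import tempfile


def split_csv(source_path, num_shards, output_dir):
    """Write rows round-robin into num_shards CSV files, repeating the header."""
    with open(source_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    shard_paths = []
    for shard_id in range(num_shards):
        path = os.path.join(output_dir, f"train_shard_{shard_id}.csv")
        with open(path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            # Every num_shards-th row, starting at this shard's offset.
            writer.writerows(rows[shard_id::num_shards])
        shard_paths.append(path)
    return shard_paths


def count_rows(path):
    """Number of data rows in a CSV file, excluding the header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1


# Tiny demonstration with a made-up five-row dataset.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "train.csv")
with open(source, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "category"])
    writer.writerows([[f"query {i}", i % 3] for i in range(5)])

shards = split_csv(source, num_shards=2, output_dir=work_dir)
print([count_rows(p) for p in shards])  # [3, 2]
```

Round-robin assignment keeps the shards close in size, so no worker sits idle for long while another finishes its epoch.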

## UDT Initialization
We can now create a UDT model by passing in the type of each column in the dataset and the target column we want to predict. The same cell defines the training loop that each worker runs.

In [4]:
import os

def get_udt_model():
    model = bolt.UniversalDeepTransformer(
        data_types={
            "text": bolt.types.text(),
            "category": bolt.types.categorical(),
        },
        target="category",
        n_target_classes=151,
        integer_target=True,
    )
    return model
    
def train_loop_per_worker(config):
    # thirdai.licensing.deactivate()
    # thirdai.licensing.activate("HN7J-W79C-KN9U-WTKE-9PNM-4PVR-CNPJ-WTWE") 

    # Build the model and prepare it for distributed training.
    model = get_udt_model()
    model = dist.prepare_model(model)

    # Each worker trains on the file that matches its world rank.
    metrics = model.train_distributed_v2(
        filename=os.path.join(config["curr_dir"], train_filenames[session.get_world_rank()]),
        learning_rate=0.02,
        epochs=1,
        batch_size=256,
        metrics=["categorical_accuracy"],
        verbose=True,
    )

    session.report(
        metrics=metrics,
        checkpoint=dist.UDTCheckPoint.from_model(model),
    )
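
The training loop picks its input file by indexing `train_filenames` with the worker's rank and joining it to an absolute directory, since a Ray worker may not run in the notebook's working directory. The sketch below isolates that lookup and adds a bounds check; `shard_for_rank` is a hypothetical helper, not part of ThirdAI or Ray.

```python
import os


def shard_for_rank(base_dir, filenames, world_rank):
    """Return the absolute path of the training file assigned to one worker."""
    if not 0 <= world_rank < len(filenames):
        raise IndexError(
            f"rank {world_rank} has no training file; "
            f"only {len(filenames)} files were provided"
        )
    return os.path.join(base_dir, filenames[world_rank])


# With two files, rank 0 and rank 1 each get their own shard.
files = ["clinc_train_0.csv", "clinc_train_1.csv"]  # made-up names
print(shard_for_rank("/data", files, 1))
```

The check makes a mismatch between `num_workers` and the number of training files fail with a clear message instead of a bare `IndexError` inside a worker.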



## Distributed Training

We now train the UDT model in a distributed data parallel fashion. Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [5]:
import os
from ray.train.torch import TorchConfig

trainer = dist.BoltTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={
        "curr_dir": os.path.abspath(os.getcwd()),
        #"licensing_lambda": licensing_lambda,
    },
    scaling_config=scaling_config,
    backend_config=TorchConfig(backend="gloo"),
)

result_checkpoint_and_history = trainer.fit()



## Evaluation
Evaluating the UDT model takes just two lines: load the trained model from the checkpoint, then call `evaluate` on the test file.

In [None]:
model = result_checkpoint_and_history.checkpoint.get_model()
model.evaluate(test_filename, metrics=["categorical_accuracy"])
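
For reference, categorical accuracy is the fraction of test samples whose predicted class matches the true class. The sketch below computes it in plain Python on made-up labels; it illustrates the metric itself and is an assumption about, not a copy of, how `model.evaluate` computes it internally.

```python
def categorical_accuracy(predicted, actual):
    """Fraction of positions where the predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Three of the four predictions match the true labels.
print(categorical_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # 0.75
```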