## Data Parallel Training for ThirdAI's UDT

This notebook shows how to run data parallel training for ThirdAI's UDT. The demo trains and evaluates on the CLINC 150 small dataset, but you can easily substitute your own workload.

ThirdAI's distributed data parallel training assumes that you already have a Ray cluster running. This demo uses Ray's mock cluster utility to simulate one. To set up a real Ray cluster, see https://docs.ray.io/en/latest/cluster/getting-started.html

In [1]:
#!pip3 install thirdai --upgrade
!pip3 install ray

import thirdai
#thirdai.licensing.activate("HN7J-W79C-KN9U-WTKE-9PNM-4PVR-CNPJ-WTWE")       



## Ray Cluster Initialization
For this demo, we initialize a mock Ray cluster with a head node and one additional node, each with a single CPU.

In [2]:
from ray.cluster_utils import Cluster

mini_cluster = Cluster(
    initialize_head=True,
    head_node_args={
        "num_cpus": 1,
    },
)
mini_cluster.add_node(num_cpus=1)

Failed to bind to 127.0.0.1:8265 because it's already occupied. You can use `ray start --dashboard-port ...` or `ray.init(dashboard_port=...)` to select a different port.


<ray._private.node.Node at 0x13724b250>

## Dataset Download

We will use the demos module in the thirdai package to download the CLINC 150 small dataset. You can replace this step and the next step with a download method and a UDT initialization that are specific to your dataset.

In [3]:
from thirdai.demos import download_clinc_dataset

train_filenames, test_filename, _ = download_clinc_dataset(
    num_training_files=2, clinc_small=True
)
print(type(train_filenames))
print(train_filenames)

<class 'list'>
['clinc_train_0.csv', 'clinc_train_1.csv']
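
If your own data lives in a single CSV file, one way to prepare it for this workflow is to split it into several training files, mirroring the two CLINC files listed above. The following is a minimal sketch using only the Python standard library. The helper name `split_csv_into_shards`, the output file naming, and the round-robin row assignment are all assumptions made for illustration; this is not how `download_clinc_dataset` produces its files.

```python
import csv


def split_csv_into_shards(input_path, num_shards, output_prefix="train_shard"):
    """Split one CSV into `num_shards` files, repeating the header in each.

    Hypothetical helper for illustration; rows are dealt out round-robin.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")

    shard_paths = [f"{output_prefix}_{i}.csv" for i in range(num_shards)]
    with open(input_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        # Open every shard up front so rows can be streamed without
        # loading the whole input file into memory.
        files = [open(path, "w", newline="") for path in shard_paths]
        try:
            writers = [csv.writer(f) for f in files]
            for writer in writers:
                writer.writerow(header)
            # Deal rows out round-robin so shard sizes differ by at most one.
            for row_index, row in enumerate(reader):
                writers[row_index % num_shards].writerow(row)
        finally:
            for f in files:
                f.close()
    return shard_paths
```

The returned list of paths has the same shape as `train_filenames` above (a plain Python list of file names), so it could be passed along in its place.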


## UDT Initialization
We can now create a UDT model by passing in the type of each column in the dataset and the target column we want to predict.

In [4]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "text": bolt.types.text(),
        "category": bolt.types.categorical(),
    },
    target="category",
    n_target_classes=151,
    integer_target=True,
)
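
When adapting this cell to your own dataset, `n_target_classes` and `integer_target` need to agree with the labels actually present in the files. The sketch below is one way to check that before training, assuming the label column is named `category` as in the model definition above. `summarize_labels` is a hypothetical helper written for this notebook, not part of the thirdai package.

```python
import csv


def _is_int(value):
    """Return True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def summarize_labels(csv_paths, label_column="category"):
    """Return (number of distinct labels, whether every label is an integer).

    Hypothetical helper; reads every file fully, so it is meant for
    datasets that are small enough to scan once up front.
    """
    labels = set()
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                labels.add(row[label_column])
    return len(labels), all(_is_int(label) for label in labels)
```

For example, `summarize_labels(train_filenames)` returns a pair that can be compared against the `n_target_classes` and `integer_target` values passed to the model.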

## Distributed Training

We will now train the UDT model in a distributed, data parallel fashion. Feel free to customize the number of epochs and the learning rate; the values below were chosen to give good convergence on this dataset.

In [5]:


import thirdai.distributed_bolt as dist_bolt

# Describe the cluster to train on: two workers with one CPU each,
# connecting to the mock cluster created above.
cluster_config = dist_bolt.RayTrainingClusterConfig(
    num_workers=2,
    cluster_address=mini_cluster.address,
    requested_cpus_per_node=1,
    communication_type="linear",
    ignore_reinit_error=True,
)

# Train on the two CLINC training files for a single epoch.
model.train_distributed(
    cluster_config=cluster_config,
    filenames=train_filenames,
    batch_size=256,
    epochs=1,
    learning_rate=0.02,
    metrics=["categorical_accuracy"],
    verbose=True,
)

NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2023-06-05 22:45:35,558	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 127.0.0.1:59956...
2023-06-05 22:45:35,572	INFO worker.py:1625 -- Connected to Ray cluster.


[2m[36m(ReplicaWorker pid=49994)[0m loading data | source 'clinc_train_1.csv'
[2m[36m(ReplicaWorker pid=49994)[0m loaded data | source 'clinc_train_1.csv' | vectors 3750 | batches 30 | time 0s | complete
[2m[36m(ReplicaWorker pid=49994)[0m 
[2m[36m(PrimaryWorker pid=49993)[0m loading data | source 'clinc_train_0.csv'[32m [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)[0m
[2m[36m(PrimaryWorker pid=49993)[0m loading data | source 'clinc_train_0.csv'
[2m[36m(PrimaryWorker pid=49993)[0m loading data | source 'clinc_train_0.csv'
[2m[36m(PrimaryWorker pid=49993)[0m loaded data | source 'clinc_train_0.csv' | vectors 3750 | batches 30 | time 0s | complete
[2m[36m(PrimaryWorker pid=49993)[0m 


{'total_batches_trained': 30,
 'train_metrics': [{'categorical_accuracy': [0.6399999856948853]},
  {'categorical_accuracy': [0.6034666895866394]}],
 'validation_metrics': []}
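
To give some intuition for what data parallel training means in general, here is a toy sketch in plain Python: each worker computes a gradient on its own shard of the data, the gradients are averaged, and every model replica applies the same update. This illustrates the general technique only. It is not ThirdAI's implementation, it does not model the `communication_type` setting, and the one-parameter linear model and all function names are assumptions chosen for brevity.

```python
def mse_gradient(weight, examples):
    """Gradient of the mean squared error for the model y = weight * x."""
    total = 0.0
    for x, y in examples:
        total += 2.0 * (weight * x - y) * x
    return total / len(examples)


def data_parallel_step(weight, shards, learning_rate):
    """One synchronous step in which each shard plays the role of a worker."""
    # Each worker computes a gradient on its own shard only.
    worker_gradients = [mse_gradient(weight, shard) for shard in shards]
    # Averaging the gradients means every replica applies the same update.
    averaged = sum(worker_gradients) / len(worker_gradients)
    return weight - learning_rate * averaged


# Two equally sized shards drawn from the relationship y = 2 * x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],  # worker 0
    [(3.0, 6.0), (4.0, 8.0)],  # worker 1
]

weight = 0.0
for _ in range(50):
    weight = data_parallel_step(weight, shards, learning_rate=0.02)
print(weight)  # approaches 2.0, the slope used to generate the data
```

Because the shards have equal size, the averaged gradient equals the gradient computed over the combined data, so the workers jointly take the same step a single machine would take on the full batch.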

## Evaluation
Evaluating the UDT model on the test set takes just one line!

In [6]:
model.evaluate(test_filename, metrics=["categorical_accuracy"]);

loading data | source './clinc_test.csv'
loaded data | source './clinc_test.csv' | vectors 4500 | batches 3 | time 0s | complete

validate | epoch 0 | train_steps 30 | val_categorical_accuracy=0.854667  | val_batches 3 | time 1s
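
As commonly defined, categorical accuracy is the fraction of examples whose predicted class equals the true class. Under that definition, the value of 0.854667 on 4500 test vectors reported above corresponds to about 3846 correct predictions. The sketch below spells the definition out; the function is a hypothetical stand-in written for illustration and does not call into thirdai.

```python
def categorical_accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predicted_labels and true_labels differ in length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return correct / len(true_labels)


# Three of the four predictions match their labels.
print(categorical_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
```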

