## Data Parallel Training for ThirdAI's UDT

This notebook shows how to run data-parallel training for ThirdAI's UDT. The demo uses the CLINC 150 small dataset for training and evaluation, but you can easily replace it with your own workload.

ThirdAI's distributed data-parallel training assumes that you already have a Ray cluster running. This demo uses Ray's mock cluster utility to simulate one. To set up a real Ray cluster, see https://docs.ray.io/en/latest/cluster/getting-started.html

In [1]:
!pip3 install thirdai --upgrade
!pip3 install ray

## Ray Cluster Initialization
For this demo, we initialize a mock Ray cluster with one head node and one additional worker node.

In [2]:
from ray.cluster_utils import Cluster

mini_cluster = Cluster(
    initialize_head=True,
    head_node_args={
        "num_cpus": 1,
    },
)
mini_cluster.add_node(num_cpus=1)

<ray._private.node.Node at 0x105983fa0>

## Dataset Download

We will use the demos module in the thirdai package to download the CLINC 150 small dataset. You can replace this step and the next step with a download method and a UDT initialization that are specific to your dataset.

In [3]:
from thirdai.demos import download_clinc_dataset

train_filenames, test_filename, _ = download_clinc_dataset(
    num_training_files=2, clinc_small=True
)
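
If your own data lives in a single CSV file, one way to produce per-worker training files like `train_filenames` is to split it yourself. The following is a minimal sketch, not part of the thirdai package: the helper name `split_csv_into_shards`, the output file names, and the round-robin split are all assumptions made for illustration. The demo data reuses the `text` and `category` column names from this notebook.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def split_csv_into_shards(source: str, num_shards: int, out_dir: str) -> List[str]:
    """Hypothetical helper: split one CSV into `num_shards` files.

    Rows are assigned round-robin and the header is repeated in every shard.
    """
    with open(source, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    filenames = []
    for shard in range(num_shards):
        path = out / f"train_{shard}.csv"  # assumed naming scheme
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[shard::num_shards])  # every num_shards-th row
        filenames.append(str(path))
    return filenames


# Small demo on a throwaway file with the same column names as this notebook.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "all.csv"
    source.write_text("text,category\nhello,0\nbye,1\nhi,0\n")
    shard_files = split_csv_into_shards(str(source), 2, tmp)
    shard_contents = [Path(p).read_text() for p in shard_files]
```

The returned list plays the same role as `train_filenames` above: one file per worker.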

## UDT Initialization
We can now create a UDT model by passing in the types of each column in the dataset and the target column we want to be able to predict.

In [4]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "text": bolt.types.text(),
        "category": bolt.types.categorical(),
    },
    target="category",
    n_target_classes=151,
    integer_target=True,
)
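
The value `n_target_classes=151` matches CLINC 150's 150 intent classes plus one out-of-scope class. When you swap in your own dataset, you need a corresponding value. The sketch below assumes that `integer_target=True` means labels are integers in the range 0 to `n_target_classes - 1`; that reading, and the helper name `infer_n_target_classes`, are assumptions rather than documented thirdai behavior.

```python
import csv
import io
from typing import TextIO


def infer_n_target_classes(csv_file: TextIO, target_column: str) -> int:
    """Hypothetical helper: return (largest label + 1) for an integer-coded column.

    Assumes labels are non-negative integers starting at 0.
    """
    reader = csv.DictReader(csv_file)
    labels = {int(row[target_column]) for row in reader}
    if not labels:
        raise ValueError("no rows found")
    if min(labels) < 0:
        raise ValueError("labels must be non-negative integers")
    return max(labels) + 1


# Demo on an in-memory CSV with the notebook's column names.
sample = io.StringIO("text,category\nhello,0\nbye,2\nhi,1\n")
n_classes = infer_n_target_classes(sample, "category")
```

For a real file, pass an open file handle, for example `open(path, newline="")`, instead of the `io.StringIO` object.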

## Distributed Training

We will now train the UDT model in a distributed data-parallel fashion. Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [5]:
import os

import thirdai.distributed_bolt as dist_bolt

# Describe the Ray cluster that training should run on.
cluster_config = dist_bolt.RayTrainingClusterConfig(
    num_workers=2,
    cluster_address=mini_cluster.address,
    requested_cpus_per_node=1,
    communication_type="linear",
    runtime_env={"working_dir": os.getcwd()},
    ignore_reinit_error=True,
)

# Train on the downloaded files; each worker loads one of the training files.
model.train_distributed(
    cluster_config=cluster_config,
    filenames=train_filenames,
    batch_size=256,
    epochs=1,
    learning_rate=0.02,
    metrics=["mean_squared_error"],
    verbose=True,
)

NCCL seems unavailable. Please install Cupy following the guide at: https://docs.cupy.dev/en/stable/install.html.
2023-01-31 12:42:12,114	INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 127.0.0.1:62096...
2023-01-31 12:42:12,118	INFO worker.py:1529 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
2023-01-31 12:42:12,163	INFO packaging.py:546 -- Creating a file package for local directory '/Users/pratikqpranav/ThirdAI/Tests'.
2023-01-31 12:42:12,185	INFO packaging.py:373 -- Pushing file package 'gcs://_ray_pkg_689eafd7eeb19963.zip' (10.89MiB) to Ray cluster...
2023-01-31 12:42:12,242	INFO packaging.py:386 -- Successfully pushed file package 'gcs://_ray_pkg_689eafd7eeb19963.zip'.

(ReplicaWorker pid=58357) loaded data | source 'clinc_train_1.csv' | vectors 3750 | batches 30 | time 0s | complete
(PrimaryWorker pid=58359) loaded data | source 'clinc_train_0.csv' | vectors 3750 | batches 30 | time 0s | complete
(ReplicaWorker pid=58357) loaded data | source 'clinc_train_1.csv' | vectors 3750 | batches 30 | time 0s | complete
(PrimaryWorker pid=58359) loaded data | source 'clinc_train_0.csv' | vectors 3750 | batches 30 | time 0s | complete


{'time': 51.867286682128906, 'total_batches_trained': 30}
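
To make the idea of data-parallel training concrete, here is a toy sketch in plain Python. It is not ThirdAI's implementation and says nothing about what `communication_type="linear"` does internally; the model (a single weight fit to y = 3x), the data, and the function names are all assumptions chosen for illustration. Each simulated worker computes a gradient on its own shard, the gradients are averaged, and every worker applies the same update.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w: float, shards: List[List[Sample]], learning_rate: float) -> float:
    """One synchronous step: per-worker gradients, averaged, then a shared update."""
    grads = [local_gradient(w, shard) for shard in shards]  # one gradient per worker
    avg_grad = sum(grads) / len(grads)  # stands in for the communication step
    return w - learning_rate * avg_grad


# Two simulated workers, each holding an equal-sized shard of data from y = 3x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.02)
```

Because the shards have equal size, the averaged gradient equals the gradient computed on the combined data, so the workers together take the same step a single machine would take on the full batch.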

## Evaluation
Evaluating the performance of the UDT model is just one line!

In [6]:
model.evaluate(test_filename, metrics=["categorical_accuracy"]);

loaded data | source './clinc_test.csv' | vectors 4500 | batches 3 | time 0s | complete

evaluate | epoch 0 | train_steps 30 | {categorical_accuracy: 0.854667} | eval_batches 3 | time 407ms
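
For reference, categorical accuracy is commonly defined as the fraction of examples whose predicted class equals the true class. The sketch below implements that common definition in plain Python; the function name and inputs are illustrative assumptions, and it is not taken from thirdai's source.

```python
from typing import Sequence


def categorical_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of positions where the predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one example is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Three of the four predictions match the labels.
accuracy = categorical_accuracy([0, 2, 1, 1], [0, 2, 2, 1])
```

Under this definition, the reported 0.854667 on 4500 test vectors would correspond to 3846 correct predictions, since 3846 / 4500 rounds to 0.854667.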

