# Distributed HPO with Ray Tune and XGBoost-Ray

This demo introduces **Ray Tune's** key concepts using a classification example derived from [Hyperparameter Tuning with Ray Tune and XGBoost-Ray](https://github.com/ray-project/xgboost_ray#hyperparameter-tuning). Newcomers can get started with Ray Tune by following a basic three-step pattern.

Three simple steps:

 1. Set up your config space and define your trainable and objective function
 2. Use Tune to execute your hyperparameter sweep, supplying the appropriate arguments, including the search space and, optionally, [search algorithms](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#summary) or [trial schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers)
 3. Examine or analyse the results returned
 
 <img src="https://docs.ray.io/en/latest/_images/tune-workflow.png" height="50%" width="60%">


See also the [Understanding Hyperparameter Tuning](https://github.com/anyscale/academy/blob/main/ray-tune/02-Understanding-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 
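
The three-step pattern can be sketched without Ray at all. The following is a minimal stand-in, assuming a toy objective and plain random search. `SEARCH_SPACE`, `toy_objective`, and `run_sweep` are hypothetical names used for illustration only and are not part of Ray Tune's API.

```python
import random

# Step 1: a search space and an objective (both hypothetical stand-ins).
SEARCH_SPACE = {
    "eta": (1e-4, 1e-1),
    "subsample": (0.5, 1.0),
}

def toy_objective(config):
    """Toy loss with its minimum at eta=0.05, subsample=0.8 (not a real model)."""
    return (config["eta"] - 0.05) ** 2 + (config["subsample"] - 0.8) ** 2

# Step 2: run the sweep by sampling configs and evaluating each one.
def run_sweep(objective, space, num_samples, seed=0):
    rng = random.Random(seed)
    trials = []
    for trial_id in range(num_samples):
        config = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        trials.append({"trial_id": trial_id, "config": config, "loss": objective(config)})
    return trials

# Step 3: examine the results and pick the trial with the lowest loss.
trials = run_sweep(toy_objective, SEARCH_SPACE, num_samples=4)
best = min(trials, key=lambda t: t["loss"])
print("Best config:", best["config"], "loss:", best["loss"])
```

Ray Tune replaces the sequential loop in step 2 with distributed trials and lets you plug in search algorithms and schedulers instead of plain random sampling.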


In [5]:
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

import ray
from ray import tune
CONNECT_TO_ANYSCALE = True   # set to False to run on a local Ray instance

In [6]:
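# ray.is_initialized is a function; call it to get the current state
if ray.is_initialized():
    ray.shutdown()

# Always (re)connect, whether or not a previous session existed
if CONNECT_TO_ANYSCALE:
    ray.init("anyscale://jsd-weekly-demo")
else:
    ray.init()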

[1m[36mOutput[0m
[1m[36m(anyscale +0.3s)[0m .anyscale.yaml found in project_dir. Directory is attached to a project.
[1m[36m(anyscale +0.6s)[0m Using project (name: prj-weekly-demo, project_dir: /Users/jules/git-repos/ray-core-tutorial, id: prj_5rvR1w2ciyUs9RM27FeZ6FVB).
[1m[36m(anyscale +1.5s)[0m cluster jsd-weekly-demo is currently running, the cluster will not be restarted.


2022-02-02 15:32:10,009	INFO packaging.py:352 -- Creating a file package for local directory '/Users/jules/git-repos/ray-core-tutorial'.
2022-02-02 15:32:10,050	INFO packaging.py:221 -- Pushing file package 'gcs://_ray_pkg_e59db065a8ca6dac.zip' (6.34MiB) to Ray cluster...
2022-02-02 15:32:11,012	INFO packaging.py:224 -- Successfully pushed file package 'gcs://_ray_pkg_e59db065a8ca6dac.zip'.


[1m[36m(anyscale +12.0s)[0m Connected to jsd-weekly-demo, see: https://console.anyscale.com/projects/prj_5rvR1w2ciyUs9RM27FeZ6FVB/clusters/ses_jUg93ra8KHWTzAMZv5nig2Rb
[1m[36m(anyscale +12.0s)[0m URL for head node of cluster: https://session-jug93ra8khwtzamzv5nig2rb.i.anyscaleuserdata.com


## Step 1: Define a 'Trainable' training function to use with Ray Tune `tune.run(...)`

In [7]:
NUM_OF_ACTORS = 4           # number of XGBoost-Ray training actors per trial; each trial's data is sharded across them
NUM_OF_CPUS_PER_ACTOR = 1   # number of CPUs per actor

ray_params = RayParams(num_actors=NUM_OF_ACTORS, cpus_per_actor=NUM_OF_CPUS_PER_ACTOR)

In [8]:
def train_func_model(config:dict, checkpoint_dir=None):
    # create the dataset
    train_X, train_y = load_breast_cancer(return_X_y=True)
    # Convert to RayDMatrix data structure
    train_set = RayDMatrix(train_X, train_y)

    # Empty dictionary for the evaluation results reported back
    # to tune
    evals_result = {}

    # Train the model with XGBoost train
    bst = train(
        params=config,                       # our hyperparameter search space
        dtrain=train_set,                    # our RayDMatrix data structure
        evals_result=evals_result,           # place holder for results
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)                # distributed training settings for XGBoost-Ray

    bst.save_model("model.xgb")
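
The result columns reported later in this notebook (`train-logloss`, `train-error`) combine the evaluation set name given in `evals` with each entry of `eval_metric`. A small sketch of that naming convention follows. `tune_metric_keys` is a hypothetical helper written for illustration, not a function provided by XGBoost-Ray.

```python
def tune_metric_keys(eval_names, eval_metrics):
    """Build '<eval name>-<metric>' keys, e.g. 'train-error' (hypothetical helper)."""
    return [f"{name}-{metric}" for name in eval_names for metric in eval_metrics]

# "train" is the label used in evals=[(train_set, "train")];
# the metric names come from the "eval_metric" entry of the config.
keys = tune_metric_keys(["train"], ["logloss", "error"])
print(keys)  # ['train-logloss', 'train-error']
```

This is why `tune.run` is given `metric="train-error"`: the name has to match one of the keys reported by the trainable.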

## Step 2: Define a hyperparameter search space

In [9]:
# Specify the hyperparameter search space
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}
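
To make the three sampled distributions concrete, here is a rough standard-library equivalent. It assumes that `tune.loguniform` samples uniformly in log space and that `tune.randint` excludes its upper bound, as the Ray Tune documentation describes. `sample_config` is a hypothetical helper, not Ray Tune's implementation.

```python
import math
import random

def sample_config(rng):
    """Draw one config resembling the search space above (illustrative only)."""
    return {
        # Log-uniform: uniform in log(eta), so each decade is equally likely.
        "eta": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        # Uniform over [0.5, 1.0].
        "subsample": rng.uniform(0.5, 1.0),
        # Integer from 1 to 8; the upper bound 9 is excluded.
        "max_depth": rng.randrange(1, 9),
    }

rng = random.Random(42)
samples = [sample_config(rng) for _ in range(4)]
for s in samples:
    print(s)
```

The fixed entries of the config (`tree_method`, `objective`, `eval_metric`) are passed to every trial unchanged; only the sampled entries vary between trials.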

## Step 3: Run Ray Tune and examine the results

Ray Tune launches four trials (`num_samples=4`), each running its own instance of the trainable function. Each trial in turn creates four remote XGBoost-Ray training actors, as the log output shows.

<img src="images/ray_tune_dist_hpo.png" height="60%" width="70%"> 

In [10]:
# Run tune
analysis = tune.run(
    train_func_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    verbose=1,
    resources_per_trial=ray_params.get_tune_resources()
)

[2m[36m(run pid=None)[0m == Status ==
[2m[36m(run pid=None)[0m Current time: 2022-02-02 15:32:33 (running for 00:00:00.12)
[2m[36m(run pid=None)[0m Memory usage on this node: 15.4/62.0 GiB
[2m[36m(run pid=None)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=None)[0m Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/216.23 GiB heap, 0.0/92.38 GiB objects
[2m[36m(run pid=None)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-02-02_15-32-33
[2m[36m(run pid=None)[0m Number of trials: 4/4 (4 PENDING)
[2m[36m(run pid=None)[0m 
[2m[36m(run pid=None)[0m 


[2m[36m(ImplicitFunc pid=None, ip=172.31.122.0)[0m 2022-02-02 15:32:35,841	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(ImplicitFunc pid=None, ip=172.31.106.159)[0m 2022-02-02 15:32:35,861	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(ImplicitFunc pid=None)[0m 2022-02-02 15:32:36,042	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(ImplicitFunc pid=None, ip=172.31.120.87)[0m 2022-02-02 15:32:36,398	INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
[2m[36m(ImplicitFunc pid=None, ip=172.31.122.0)[0m 2022-02-02 15:32:37,558	INFO main.py:1024 -- [RayXGBoost] Starting XGBoost training.
[2m[36m(_RemoteRayXGBoostActor pid=None, ip=172.31.122.0)[0m [15:32:37] task [xgboost.ray]:1

[2m[36m(run pid=None)[0m 2022-02-02 15:32:38,286	WARN commands.py:269 -- Loaded cached provider configuration
[2m[36m(run pid=None)[0m 2022-02-02 15:32:38,286	WARN commands.py:270 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
[2m[36m(run pid=None)[0m [1m[36mAuthenticating[0m
[2m[36m(run pid=None)[0m Loaded Anyscale authentication token from variable.
[2m[36m(run pid=None)[0m 
[2m[36m(run pid=None)[0m 2022-02-02 15:32:39,552	INFO command_runner.py:357 -- Fetched IP: 172.31.106.159
[2m[36m(run pid=None)[0m 2022-02-02 15:32:39,552	INFO log_timer.py:25 -- NodeUpdater: ins_N4PtedRvBWP5dTunfgbVSEq1: Got IP  [LogTimer=32ms]
[2m[36m(run pid=None)[0m == Status ==
[2m[36m(run pid=None)[0m Current time: 2022-02-02 15:32:40 (running for 00:00:06.55)
[2m[36m(run pid=None)[0m Memory usage on this node: 16.0/62.0 GiB
[2m[36m(run pid=None)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=None)[0m Resour

[2m[36m(ImplicitFunc pid=None, ip=172.31.120.87)[0m 2022-02-02 15:32:43,588	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.41 seconds (5.47 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=None)[0m 2022-02-02 15:32:43,583	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.57 seconds (5.70 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=None, ip=172.31.106.159)[0m 2022-02-02 15:32:43,590	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.76 seconds (5.91 pure XGBoost training time).
[2m[36m(ImplicitFunc pid=None, ip=172.31.122.0)[0m 2022-02-02 15:32:43,607	INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 7.80 seconds (6.04 pure XGBoost training time).


[2m[36m(run pid=None)[0m == Status ==
[2m[36m(run pid=None)[0m Current time: 2022-02-02 15:32:43 (running for 00:00:09.97)
[2m[36m(run pid=None)[0m Memory usage on this node: 15.8/62.0 GiB
[2m[36m(run pid=None)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=None)[0m Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/216.23 GiB heap, 0.0/92.38 GiB objects
[2m[36m(run pid=None)[0m Current best trial: 6586b_00002 with train-error=0.01406 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0012394061273935833, 'subsample': 0.7034529139252035, 'max_depth': 4, 'nthread': 1, 'n_jobs': 1}
[2m[36m(run pid=None)[0m Result logdir: /home/ray/ray_results/train_func_model_2022-02-02_15-32-33
[2m[36m(run pid=None)[0m Number of trials: 4/4 (4 TERMINATED)
[2m[36m(run pid=None)[0m 
[2m[36m(run pid=None)[0m 


[2m[36m(run pid=None)[0m 2022-02-02 15:32:43,853	INFO tune.py:626 -- Total run time: 10.30 seconds (9.96 seconds for the tuning loop).


In [11]:
print("Best hyperparameters", analysis.best_config)

Best hyperparameters {'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0012394061273935833, 'subsample': 0.7034529139252035, 'max_depth': 4}


In [12]:
analysis.results_df.head(5)



| trial_id | train-logloss | train-error | time_this_iter_s | done | timesteps_total | episodes_total | training_iteration | experiment_id | date | timestamp | ... | iterations_since_restore | experiment_tag | config.tree_method | config.objective | config.eval_metric | config.eta | config.subsample | config.max_depth | config.nthread | config.n_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6586b_00000 | 0.671836 | 0.022847 | 0.006767 | True | | | 10 | ed8c78b7e3244f7b855d6e7f8fdc4b19 | 2022-02-02_15-32-43 | 1643844763 | ... | 10 | 0_eta=0.0026025,max_depth=5,subsample=0.5119 | approx | binary:logistic | [logloss, error] | 0.002602 | 0.511898 | 5 | 1 | 1 |
| 6586b_00001 | 0.488382 | 0.029877 | 0.005827 | True | | | 10 | 6247c6b495da44558b684e8a03847791 | 2022-02-02_15-32-43 | 1643844763 | ... | 10 | 1_eta=0.030779,max_depth=4,subsample=0.59824 | approx | binary:logistic | [logloss, error] | 0.030779 | 0.598241 | 4 | 1 | 1 |
| 6586b_00002 | 0.682543 | 0.01406 | 0.005723 | True | | | 10 | 29097d8cdfc8455db626aea817165486 | 2022-02-02_15-32-43 | 1643844763 | ... | 10 | 2_eta=0.0012394,max_depth=4,subsample=0.70345 | approx | binary:logistic | [logloss, error] | 0.001239 | 0.703453 | 4 | 1 | 1 |
| 6586b_00003 | 0.688289 | 0.02812 | 0.005581 | True | | | 10 | 70acdfc3121f4e939919b4a3b3e9b326 | 2022-02-02_15-32-43 | 1643844763 | ... | 10 | 3_eta=0.00057779,max_depth=8,subsample=0.54722 | approx | binary:logistic | [logloss, error] | 0.000578 | 0.547222 | 8 | 1 | 1 |
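
How the best trial follows from `metric` and `mode` can be checked by hand against the results above. The sketch below copies the final metrics of the four trials from the table; `best_trial` is a hypothetical helper for illustration and not the way Ray Tune computes `analysis.best_config` internally.

```python
# Final metrics per trial, copied from the results table above.
results = {
    "6586b_00000": {"train-logloss": 0.671836, "train-error": 0.022847},
    "6586b_00001": {"train-logloss": 0.488382, "train-error": 0.029877},
    "6586b_00002": {"train-logloss": 0.682543, "train-error": 0.01406},
    "6586b_00003": {"train-logloss": 0.688289, "train-error": 0.02812},
}

def best_trial(results, metric, mode):
    """Return the trial id with the best value of `metric` (hypothetical helper)."""
    pick = min if mode == "min" else max
    return pick(results, key=lambda trial_id: results[trial_id][metric])

print(best_trial(results, "train-error", "min"))    # 6586b_00002
print(best_trial(results, "train-logloss", "min"))  # 6586b_00001
```

The choice of metric changes the winner: minimizing `train-error` selects trial `6586b_00002`, as reported by Tune, while minimizing `train-logloss` would have selected `6586b_00001`.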


---

In [13]:
ray.shutdown()

## References

 * [Ray Tune: Scalable Hyperparameter Tuning](https://docs.ray.io/en/master/tune/index.html)
 * [Introducing Distributed XGBoost Training with Ray](https://www.anyscale.com/blog/distributed-xgboost-training-with-ray)
 * [How to Speed Up XGBoost Model Training](https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training)
 * [XGBoost-Ray Project](https://github.com/ray-project/xgboost_ray)
 * [Distributed XGBoost on Ray](https://docs.ray.io/en/latest/xgboost-ray.html)