# Distributed HPO with Ray Tune and XGBoost-Ray

This demo introduces **Ray Tune's** key concepts using a classification example. The example is derived from [Hyperparameter Tuning with Ray Tune and XGBoost-Ray](https://github.com/ray-project/xgboost_ray#hyperparameter-tuning). Newcomers to Ray Tune can get started with a three-step pattern:

 1. Set up your config space and define your trainable and objective function.
 2. Use Tune to execute your hyperparameter sweep, supplying the appropriate arguments, including the search space and, optionally, [search algorithms](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#summary) or [trial schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers).
 3. Examine or analyse the results returned.
 
 <img src="https://docs.ray.io/en/latest/_images/tune-workflow.png" height="50%" width="60%">


See also the [Understanding Hyperparameter Tuning](https://github.com/anyscale/academy/blob/main/ray-tune/02-Understanding-Hyperparameter-Tuning.ipynb) notebook and the [Tune documentation](http://tune.io), in particular, the [API reference](https://docs.ray.io/en/latest/tune/api_docs/overview.html). 
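
The three-step pattern can be sketched without Ray at all. The following is a minimal, standard-library-only illustration of a random-search sweep; `search_space`, `objective`, and `run_sweep` are hypothetical names made up for this sketch and are not part of the Ray Tune API.

```python
import random

# Step 1: a search space and an objective (hypothetical stand-ins, not Ray Tune objects).
search_space = {
    "x": lambda rng: rng.uniform(-5.0, 5.0),
    "y": lambda rng: rng.uniform(-5.0, 5.0),
}


def objective(config):
    """Toy loss with its minimum at x=1, y=-2."""
    return (config["x"] - 1.0) ** 2 + (config["y"] + 2.0) ** 2


# Step 2: run the sweep. Each trial samples one config and evaluates it.
def run_sweep(space, objective_fn, num_samples, seed=0):
    rng = random.Random(seed)
    trials = []
    for trial_id in range(num_samples):
        config = {name: sample(rng) for name, sample in space.items()}
        trials.append(
            {"trial_id": trial_id, "config": config, "loss": objective_fn(config)}
        )
    return trials


# Step 3: examine the results and keep the trial with the lowest loss.
trials = run_sweep(search_space, objective, num_samples=20)
best = min(trials, key=lambda t: t["loss"])
print("best config:", best["config"], "loss:", best["loss"])
```

Ray Tune plays the role of `run_sweep` here, with the important difference that it schedules the trials across a cluster instead of running them one after another.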


In [1]:
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

import ray
from ray import tune
CONNECT_TO_ANYSCALE = True  # set to False to start a local Ray instance instead

In [2]:
# ray.is_initialized is a function; it must be called to get the actual state
if ray.is_initialized():
    ray.shutdown()

if CONNECT_TO_ANYSCALE:
    ray.init("anyscale://jsd-weekly-demo")
else:
    ray.init()

[1m[36mAuthenticating[0m
Loaded Anyscale authentication token from ANYSCALE_CLI_TOKEN.

Output
(anyscale +0.3s) .anyscale.yaml found in project_dir. Directory is attached to a project.
(anyscale +0.4s) Using project (name: prj-weekly-demo, project_dir: /Users/jules/git-repos/ray-core-tutorial, id: prj_5rvR1w2ciyUs9RM27FeZ6FVB).
(anyscale +1.4s) cluster jsd-weekly-demo is currently running, the cluster will not be restarted.

2022-01-27 07:52:37,909 INFO packaging.py:352 -- Creating a file package for local directory '/Users/jules/git-repos/ray-core-tutorial'.
2022-01-27 07:52:37,941 INFO packaging.py:221 -- Pushing file package 'gcs://_ray_pkg_01e5cdbe46829718.zip' (6.30MiB) to Ray cluster...
2022-01-27 07:52:43,149 INFO packaging.py:224 -- Successfully pushed file package 'gcs://_ray_pkg_01e5cdbe46829718.zip'.

(anyscale +15.9s) Connected to jsd-weekly-demo, see: https://console.anyscale.com/projects/prj_5rvR1w2ciyUs9RM27FeZ6FVB/clusters/ses_jUg93ra8KHWTzAMZv5nig2Rb
(anyscale +15.9s) URL for head node of cluster: https://session-jug93ra8khwtzamzv5nig2rb.i.anyscaleuserdata.com


## Step 1: Define a 'Trainable' training function to use with Ray Tune `tune.run(...)`

In [3]:
NUM_OF_ACTORS = 4           # XGBoost-Ray training actors per trial; each trial's data is sharded across them
NUM_OF_CPUS_PER_ACTOR = 1   # number of CPUs per actor

ray_params = RayParams(num_actors=NUM_OF_ACTORS, cpus_per_actor=NUM_OF_CPUS_PER_ACTOR)

In [4]:
def train_func_model(config: dict, checkpoint_dir=None):
    # Load the dataset
    train_X, train_y = load_breast_cancer(return_X_y=True)
    # Convert to the RayDMatrix data structure used by XGBoost-Ray
    train_set = RayDMatrix(train_X, train_y)

    # Empty dictionary that train() fills with the evaluation results
    evals_result = {}

    # Train the model with XGBoost-Ray's distributed train()
    bst = train(
        params=config,                       # hyperparameters sampled for this trial
        dtrain=train_set,                    # our RayDMatrix data structure
        evals_result=evals_result,           # placeholder for results
        evals=[(train_set, "train")],        # evaluate on the training set under the name "train"
        verbose_eval=False,
        ray_params=ray_params)               # distributed training settings for XGBoost-Ray

    bst.save_model("model.xgb")
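
The metric name `train-error` used later in `tune.run(...)` combines the evaluation set name (`"train"`) with the XGBoost metric (`"error"`). As a rough sketch, assuming the nested `{"<eval name>": {"<metric>": [values per round]}}` layout that XGBoost uses for `evals_result`, the flattening can be pictured as follows. `latest_metrics` is a hypothetical helper written for this illustration, and the numbers are made up rather than taken from a real run.

```python
def latest_metrics(evals_result):
    """Flatten nested evaluation results into {"<eval name>-<metric>": last value}."""
    return {
        f"{eval_name}-{metric}": values[-1]
        for eval_name, metrics in evals_result.items()
        for metric, values in metrics.items()
        if values  # skip metrics that have no recorded rounds yet
    }


# Illustrative values only; not taken from a real run.
evals_result = {
    "train": {
        "logloss": [0.69, 0.66, 0.63],
        "error": [0.05, 0.03, 0.02],
    }
}

print(latest_metrics(evals_result))
# {'train-logloss': 0.63, 'train-error': 0.02}
```

These flattened names match the `train-logloss` and `train-error` columns that appear in the results table at the end of this notebook.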

## Step 2: Define a hyperparameter search space

In [5]:
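# Specify the hyperparameter search space: fixed values plus sampled ranges
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),      # learning rate, sampled uniformly in log space
    "subsample": tune.uniform(0.5, 1.0),     # fraction of rows sampled per tree
    "max_depth": tune.randint(1, 9)          # integer depth; the upper bound is exclusive
}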
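
To make the sampling semantics of the three distributions concrete, here is a standard-library approximation. It is a sketch of the intended behavior, not Ray Tune's implementation. The `sample_*` helpers are hypothetical names, and the exclusive upper bound for the integer range is stated here as an assumption about `tune.randint`.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Sample so that log(value) is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_uniform(rng, low, high):
    """Sample a float uniformly between low and high."""
    return rng.uniform(low, high)


def sample_randint(rng, low, high):
    """Sample an integer in [low, high); the upper bound is excluded."""
    return rng.randrange(low, high)


def sample_config(rng):
    """Draw one concrete trial config from the search space above."""
    return {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": sample_loguniform(rng, 1e-4, 1e-1),
        "subsample": sample_uniform(rng, 0.5, 1.0),
        "max_depth": sample_randint(rng, 1, 9),
    }


rng = random.Random(42)
configs = [sample_config(rng) for _ in range(4)]  # one config per trial
for c in configs:
    print(c["eta"], c["subsample"], c["max_depth"])
```

Sampling `eta` in log space spreads the draws evenly across orders of magnitude, so values near `1e-4` are as likely as values near `1e-1`.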

## Step 3: Run Ray Tune and examine the results

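Ray Tune launches the distributed HPO run as four trials (`num_samples=4`). Each trial runs its own instance of the trainable function, which in turn starts four remote XGBoost-Ray actors for distributed training.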

<img src="images/ray_tune_dist_hpo.png" height="60%" width="70%"> 
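
The resource request per trial can be estimated with simple arithmetic. The sketch below assumes that each trial needs one CPU for the trainable itself plus the CPUs of its training actors. That assumption is inferred from the status output of this run (15 CPUs requested while 3 trials are running on an 80-CPU cluster), not from the library documentation. `cpus_per_trial` and `max_concurrent_trials` are hypothetical helper names.

```python
def cpus_per_trial(num_actors, cpus_per_actor, trainable_cpus=1):
    """CPUs one trial asks for: the training actors plus the trainable (assumed 1 CPU)."""
    return trainable_cpus + num_actors * cpus_per_actor


def max_concurrent_trials(cluster_cpus, per_trial):
    """How many trials fit on the cluster at once, looking at CPUs only."""
    return cluster_cpus // per_trial


per_trial = cpus_per_trial(num_actors=4, cpus_per_actor=1)
print(per_trial)                             # 5
print(3 * per_trial)                         # 15, matching "15.0/80 CPUs" with 3 trials running
print(max_concurrent_trials(80, per_trial))  # 16
```

Under these assumptions the 80-CPU cluster has room for all four trials at the same time.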

In [6]:
# Run tune
analysis = tune.run(
    train_func_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    verbose=1,
    resources_per_trial=ray_params.get_tune_resources()
)

(run pid=4844) == Status ==
(run pid=4844) Current time: 2022-01-27 07:53:03 (running for 00:00:00.22)
(run pid=4844) Memory usage on this node: 2.4/62.0 GiB
(run pid=4844) Using FIFO scheduling algorithm.
(run pid=4844) Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/216.22 GiB heap, 0.0/92.18 GiB objects
(run pid=4844) Result logdir: /home/ray/ray_results/train_func_model_2022-01-27_07-53-02
(run pid=4844) Number of trials: 4/4 (4 PENDING)

(ImplicitFunc pid=4953) 2022-01-27 07:53:05,230 INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=4953) 2022-01-27 07:53:06,947 INFO main.py:1024 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=5015) [07:53:06] task [xgboost.ray]:139780947481888 got new rank 0
(_RemoteRayXGBoostActor pid=5017) [07:53:06] task [xgboost.ray]:140615337628960 got new rank 2
(_RemoteRayXGBoostActor pid=5018) [07:53:06] task [xgboost.ray]:140027524908320 got new rank 3
(_RemoteRayXGBoostActor pid=5016) [07:53:06] task [xgboost.ray]:140220007744848 got new rank 1
(ImplicitFunc pid=4953) 2022-01-27 07:53:07,863 INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 2.76 seconds (0.91 pure XGBoost training time).

(run pid=4844) == Status ==
(run pid=4844) Current time: 2022-01-27 07:53:12 (running for 00:00:09.90)
(run pid=4844) Memory usage on this node: 2.3/62.0 GiB
(run pid=4844) Using FIFO scheduling algorithm.
(run pid=4844) Resources requested: 15.0/80 CPUs, 0/0 GPUs, 0.0/216.22 GiB heap, 0.0/92.18 GiB objects
(run pid=4844) Current best trial: 3576b_00001 with train-error=0.010545 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0009316428296060249, 'subsample': 0.9212008867068575, 'max_depth': 5, 'nthread': 1, 'n_jobs': 1}
(run pid=4844) Result logdir: /home/ray/ray_results/train_func_model_2022-01-27_07-53-02
(run pid=4844) Number of trials: 4/4 (3 RUNNING, 1 TERMINATED)

(ImplicitFunc pid=445, ip=172.31.126.32) 2022-01-27 07:53:13,050 INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=445, ip=172.31.115.123) 2022-01-27 07:53:13,052 INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=444, ip=172.31.121.3) 2022-01-27 07:53:13,053 INFO main.py:979 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
(ImplicitFunc pid=445, ip=172.31.115.123) 2022-01-27 07:53:14,768 INFO main.py:1024 -- [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=481, ip=172.31.115.123) [07:53:14] task [xgboost.ray]:140145258268032 got new rank 1
(_RemoteRayXGBoostActor pid=482, ip=172.31.115.123) [07:53:14] task [xgboost.ray]:140158640801152 got new rank 2
(_RemoteRayXGB

(run pid=4844) == Status ==
(run pid=4844) Current time: 2022-01-27 07:53:14 (running for 00:00:11.90)
(run pid=4844) Memory usage on this node: 2.3/62.0 GiB
(run pid=4844) Using FIFO scheduling algorithm.
(run pid=4844) Resources requested: 15.0/80 CPUs, 0/0 GPUs, 0.0/216.22 GiB heap, 0.0/92.18 GiB objects
(run pid=4844) Current best trial: 3576b_00001 with train-error=0.010545 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0009316428296060249, 'subsample': 0.9212008867068575, 'max_depth': 5, 'nthread': 1, 'n_jobs': 1}
(run pid=4844) Result logdir: /home/ray/ray_results/train_func_model_2022-01-27_07-53-02
(run pid=4844) Number of trials: 4/4 (3 RUNNING, 1 TERMINATED)
(run pid=4844) 2022-01-27 07:53:15,799 INFO commands.py:292 -- Checking Ext

(ImplicitFunc pid=445, ip=172.31.115.123) 2022-01-27 07:53:19,775 INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 6.87 seconds (5.00 pure XGBoost training time).

(run pid=4844) 2022-01-27 07:53:19,840 WARN commands.py:269 -- Loaded cached provider configuration
(run pid=4844) 2022-01-27 07:53:19,841 WARN commands.py:270 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
(run pid=4844) 2022-01-27 07:53:20,854 INFO command_runner.py:357 -- Fetched IP: 172.31.121.3
(run pid=4844) 2022-01-27 07:53:20,854 INFO log_timer.py:25 -- NodeUpdater: ins_TQHm3Z2WMkpsAvn5s5pECCdp: Got IP  [LogTimer=32ms]
(run pid=4844) == Status ==
(run pid=4844) Current time: 2022-01-27 07:53:21 (running for 00:00:18.79)
(run pid=4844) Memory usage on this node: 2.4/62.0 GiB
(run pid=4844) Using FIFO scheduling algorithm.
(run pid=4844) Resources requested: 15.0/80 CPUs, 0/0 GPUs, 0.0/216.22 GiB heap, 0.0/92.18 GiB objects
(run pid=4844) Current best trial: 3576b_00001 with train-error=0.010545 

(ImplicitFunc pid=445, ip=172.31.126.32) 2022-01-27 07:53:23,689 INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 10.78 seconds (8.82 pure XGBoost training time).
(ImplicitFunc pid=444, ip=172.31.121.3) 2022-01-27 07:53:23,684 INFO main.py:1503 -- [RayXGBoost] Finished XGBoost training on training data with total N=569 in 10.84 seconds (8.90 pure XGBoost training time).
(run pid=4844) 2022-01-27 07:53:23,865 INFO tune.py:626 -- Total run time: 21.73 seconds (20.75 seconds for the tuning loop).

(run pid=4844) == Status ==
(run pid=4844) Current time: 2022-01-27 07:53:23 (running for 00:00:20.75)
(run pid=4844) Memory usage on this node: 2.4/62.0 GiB
(run pid=4844) Using FIFO scheduling algorithm.
(run pid=4844) Resources requested: 0/80 CPUs, 0/0 GPUs, 0.0/216.22 GiB heap, 0.0/92.18 GiB objects
(run pid=4844) Current best trial: 3576b_00001 with train-error=0.010545 and parameters={'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0009316428296060249, 'subsample': 0.9212008867068575, 'max_depth': 5, 'nthread': 1, 'n_jobs': 1}
(run pid=4844) Result logdir: /home/ray/ray_results/train_func_model_2022-01-27_07-53-02
(run pid=4844) Number of trials: 4/4 (4 TERMINATED)


In [7]:
print("Best hyperparameters", analysis.best_config)

Best hyperparameters {'tree_method': 'approx', 'objective': 'binary:logistic', 'eval_metric': ['logloss', 'error'], 'eta': 0.0009316428296060249, 'subsample': 0.9212008867068575, 'max_depth': 5}
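
The best configuration is the one whose trial reached the lowest `train-error`, because the run used `metric="train-error"` with `mode="min"`. As a small standard-library illustration of that selection rule, the sketch below ranks the four trials by their final `train-error` values copied from the results table in this notebook. `best_trial` is a hypothetical helper, not part of the Ray Tune API.

```python
# Final train-error per trial, copied from the results table in this notebook.
train_error = {
    "3576b_00000": 0.017575,
    "3576b_00001": 0.010545,
    "3576b_00002": 0.029877,
    "3576b_00003": 0.02812,
}


def best_trial(results, mode):
    """Return the trial id with the smallest ("min") or largest ("max") metric value."""
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    pick = min if mode == "min" else max
    return pick(results, key=results.get)


print(best_trial(train_error, mode="min"))  # 3576b_00001

# Full ranking, best trial first.
for trial_id in sorted(train_error, key=train_error.get):
    print(trial_id, train_error[trial_id])
```

The winner, `3576b_00001`, is the trial whose `eta`, `subsample`, and `max_depth` values appear in the best hyperparameters printed above.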


In [8]:
analysis.results_df.head(5)



| trial_id | train-logloss | train-error | time_this_iter_s | done | timesteps_total | episodes_total | training_iteration | experiment_id | date | timestamp | ... | iterations_since_restore | experiment_tag | config.tree_method | config.objective | config.eval_metric | config.eta | config.subsample | config.max_depth | config.nthread | config.n_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3576b_00000 | 0.598461 | 0.017575 | 0.003546 | True | | | 10 | 66b1b985980f4f8192ae598098438b6d | 2022-01-27_07-53-23 | 1643298803 | ... | 10 | 0_eta=0.012158,max_depth=4,subsample=0.75912 | approx | binary:logistic | [logloss, error] | 0.012158 | 0.759123 | 4 | 1 | 1 |
| 3576b_00001 | 0.684981 | 0.010545 | 0.003856 | True | | | 10 | be3ee5276c5c4ec6a181b21038a00859 | 2022-01-27_07-53-07 | 1643298787 | ... | 10 | 1_eta=0.00093164,max_depth=5,subsample=0.9212 | approx | binary:logistic | [logloss, error] | 0.000932 | 0.921201 | 5 | 1 | 1 |
| 3576b_00002 | 0.691071 | 0.029877 | 0.003715 | True | | | 10 | 82cc3b9a582e4ed6b7572248e0170484 | 2022-01-27_07-53-19 | 1643298799 | ... | 10 | 2_eta=0.00024877,max_depth=3,subsample=0.60588 | approx | binary:logistic | [logloss, error] | 0.000249 | 0.605882 | 3 | 1 | 1 |
| 3576b_00003 | 0.692246 | 0.02812 | 0.003723 | True | | | 10 | b081c547f16b427c8e484a6e149b512a | 2022-01-27_07-53-23 | 1643298803 | ... | 10 | 3_eta=0.00010676,max_depth=7,subsample=0.55468 | approx | binary:logistic | [logloss, error] | 0.000107 | 0.554683 | 7 | 1 | 1 |


---

In [9]:
ray.shutdown()

## References

 * [Tune: Scalable Hyperparameter Tuning](https://docs.ray.io/en/master/tune/index.html)
 * [Introducing Distributed XGBoost Training with Ray](https://www.anyscale.com/blog/distributed-xgboost-training-with-ray)
 * [How to Speed Up XGBoost Model Training](https://www.anyscale.com/blog/how-to-speed-up-xgboost-model-training)
 * [XGBoost-Ray Project](https://github.com/ray-project/xgboost_ray)
 * [Distributed XGBoost on Ray](https://docs.ray.io/en/latest/xgboost-ray.html)