# Task 2: Training and Tuning with Ray

## Part 1: Training with Ray Train and XGBoost
In this task, you will train a machine learning model on the preprocessed data. The goal is to train an XGBoost model that predicts the user rating for a product.

In [1]:
import ray
ray.init() # connect to existing Ray cluster
import pandas as pd
import os
import json
import random
import numpy as np
seed = 41
random.seed(seed)
np.random.seed(seed)

from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig

2024-03-13 16:44:42,369	INFO util.py:154 -- Outdated packages:
  ipywidgets==7.8.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2024-03-13 16:44:42,377	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.47.192.26:6380...
2024-03-13 16:44:42,408	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.47.192.26:8265


In [2]:
# clear out previously saved results
!rm -f res_2_2.json res_2_1.json

In [3]:
# load the preprocessed dataset as dense vectors in the parquet format
train_data_path=os.path.expanduser("~/public/pa3/ml_features_train.parquet")
train_data = ray.data.read_parquet(train_data_path)



Instantiate a Ray trainer and train an XGBoost model on the training dataset. Train the model with a regression objective that minimizes the mean squared error. Set `max_depth` to 3 and `eta` to 0.3, and leave all other model parameters at their default values.

Note: By default, Ray tries to store results in `~/ray_results`. This can raise permission errors on DataHub, so you can change the location to `~/private/ray_results` through `RunConfig`. [Docs](https://docs.ray.io/en/latest/train/api/doc/ray.train.RunConfig.html)

In [4]:
# YOUR CODE HERE
trainer = XGBoostTrainer(
    label_column="overall",
    params={
        "objective": "reg:squarederror",  # regression objective, minimizes squared error
        "max_depth": 3,
        "eta": 0.3,
    },
    scaling_config=ScalingConfig(num_workers=3, use_gpu=False, resources_per_worker={"CPU": 6}),
    datasets={"train": train_data},
)


In [5]:
result = trainer.fit()

Current time: 2024-03-13 16:45:45
Running for: 00:00:56.19
Memory: 124.5/503.6 GiB

| Trial name | status | loc | iter | total time (s) | train-rmse |
|---|---|---|---|---|---|
| XGBoostTrainer_ad2b5_00000 | TERMINATED | 10.35.0.20:22653 | 11 | 51.8878 | 0.884518 |


(XGBoostTrainer pid=22653, ip=10.35.0.20) [RayXGBoost] Created 3 new actors (3 total actors). Waiting until actors are ready for training.
(XGBoostTrainer pid=22653, ip=10.35.0.20) [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=22809, ip=10.35.0.20) [16:45:06] task [xgboost.ray]:140480536028736 got new rank 0
(XGBoostTrainer pid=22653, ip=10.35.0.20) Training in progress (30 seconds since last restart).
(_RemoteRayXGBoostActor pid=43765) [16:45:06] task [xgboost.ray]:140087680559712 got new rank 2 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(XGBoostTrainer pid=22653, ip=10.35.0.20) [RayXGBoost] Finished XGBoost training on training data with total N=18,444,174 in 51.91 seconds (40.01 pure XGBoost training time).
2024-03-13 16:45:45,666	INFO tune.py:1042 -- Total run time: 58.04 seconds (56.17 seconds for the tuning loop).

In [6]:
print(result)

Result(
  metrics={'train-rmse': 0.8845177330600246},
  path='/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-44-47/XGBoostTrainer_ad2b5_00000_0_2024-03-13_16-44-49',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-44-47/XGBoostTrainer_ad2b5_00000_0_2024-03-13_16-44-49/checkpoint_000000)
)


## Analyzing test data performance

Next, use the trained model to generate predictions on the test data. Calculate the root mean squared error (RMSE) of the test predictions and report it in the output.

For this task, we will use [`map_batches`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) to apply a stateful transformation to the test data.
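The arithmetic behind this step can be illustrated without Ray. The sketch below uses made-up labels and predictions (not values from the dataset) and a hypothetical helper `batch_squared_errors`. It shows that computing squared errors batch by batch and then taking the square root of their overall mean gives the RMSE over all rows.

```python
import math

def batch_squared_errors(y_true, y_pred):
    """Return the squared error of every entry in one batch."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Two toy batches of (labels, predictions); the numbers are illustrative only.
batches = [
    ([5.0, 3.0], [4.5, 3.5]),
    ([1.0, 4.0, 2.0], [2.0, 4.0, 1.0]),
]

# Collect the per-entry squared errors across all batches.
all_se = []
for y_true, y_pred in batches:
    all_se.extend(batch_squared_errors(y_true, y_pred))

# RMSE = sqrt(sum of squared errors / number of entries)
rmse = math.sqrt(sum(all_se) / len(all_se))
print(rmse)
```

Because only the sum and the count of squared errors are needed, the per-batch work can run in parallel and be aggregated at the end.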

In [7]:
test_data_path=os.path.expanduser("~/public/pa3/ml_features_test.parquet")
test_dataset= ray.data.read_parquet(test_data_path)


In [8]:
model = trainer.get_model(result.checkpoint)

In [9]:
import pandas as pd
from ray.train import Checkpoint
import xgboost
import math

class Predictor:

    def __init__(self, checkpoint: Checkpoint):
        self.model = XGBoostTrainer.get_model(checkpoint)
        self.label_col = "overall"

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        """
        Implement the following:
        1. Get the predictions on a batch of data for an xgboost model as you would do normally.
        2. Return the squared errors for each entry using the label column
        """
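        # YOUR CODE HERE
        # Separate the features from the label before predicting.
        X = batch.drop(columns=[self.label_col])
        y_true = batch[self.label_col]
        y_pred = self.model.predict(xgboost.DMatrix(X))
        errors = (y_true - y_pred) ** 2
        # Return a DataFrame so the output matches the declared return type.
        return pd.DataFrame({"se": errors})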

def predict_xgboost(test_dataset, result):
    """
    Obtains the predictions for a test dataset given a `ray.train.Result` object and returns the squared errors for each entry
    """
    # YOUR CODE HERE
    predictor = Predictor(result.checkpoint)
    squared_errors = test_dataset.map_batches(predictor, batch_format = "pandas")
    
    return squared_errors

In [10]:
# get the root mean squared error for the test dataset using the result.
# Save the test rmse in `test_rmse` 

# YOUR CODE HERE
squared_errors = predict_xgboost(test_dataset, result)
test_rmse = math.sqrt(squared_errors.sum()/squared_errors.count())

# write to file
res = {"test_rmse": test_rmse, 
          "train_rmse": result.metrics["train-rmse"]}
with open("res_2_1.json", "w") as f:
    json.dump(res, f)

2024-03-13 16:45:47,384	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:45:47,385	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:45:47,386	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)] -> LimitOperator[limit=1]
2024-03-13 16:45:47,388	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-13 16:45:47,389	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.


2024-03-13 16:45:50,339	INFO dataset.py:2488 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2024-03-13 16:45:50,349	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:45:50,350	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:45:50,351	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-13 16:45:50,352	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, pr


2024-03-13 16:45:56,104	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:45:56,104	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:45:56,105	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)]
2024-03-13 16:45:56,106	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-13 16:45:56,106	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().


## Part 2: Tuning with Ray Tune

We'll tune the following XGBoost hyperparameters:

1. `max_depth`
2. `subsample`
3. `eta`

You can read more about each hyperparameter in the [official docs](https://xgboost.readthedocs.io/en/stable/parameter.html). Since the overall search space is large and our compute budget is limited, we'll run 12 *trials* (12 distinct 3-tuples of hyperparameter values) with a grid search. Here are the values:

1. `max_depth`: $[3, 4, 5]$
2. `subsample`: $[0.8, 1.0]$
3. `eta`: $[0.3, 0.5]$
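To see where the 12 trials come from, the grid can be enumerated in plain Python. This is only an illustration of the Cartesian product (3 x 2 x 2 = 12); the names `search_space` and `trials` are chosen for this sketch, and Ray Tune builds the grid itself.

```python
from itertools import product

# The grid from the list above.
search_space = {
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],
    "eta": [0.3, 0.5],
}

# Every combination of one value per hyperparameter is one trial.
names = list(search_space)
trials = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(trials))  # 12
```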


Steps to implement, repeated from the problem statement:
1. Create new training and validation datasets from the original training data with a random 75/25 split.
2. Train XGBoost models with 12 hyperparameter trials over the given grid using Ray Tune. [Official Example](https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html)
3. Select the best model, that is, the one with the lowest validation RMSE.
4. Report the test RMSE for the best model and the lowest validation RMSE.

Make sure to use the same `ScalingConfig` as before. Restrict the number of concurrent trials to 1 for memory efficiency. Store the final `tune.ResultGrid` object in `result_grid` and the best result in the variable `best_result`.

In [11]:
from ray import tune
from ray.tune import Tuner
from ray.train.xgboost import XGBoostTrainer

# store your answers in these
best_result = None
result_grid = None

# Random 75/25 split of the original training data into training and validation sets
train, val = train_data.train_test_split(test_size=0.25, shuffle=True)

trainer = XGBoostTrainer(
    label_column="overall",
    params={
        "objective": "reg:squarederror",
        "max_depth": 4,
        "subsample": 0.8,
        "eta": 0.3
    },
    scaling_config=ScalingConfig(num_workers=3, resources_per_worker = {"CPU":6}, use_gpu = False),
    datasets={"train": train, "val":val}
)

# Create Tuner
tuner = Tuner(
    trainer,
    # Grid over the hyperparameters to tune (3 x 2 x 2 = 12 trials)
    param_space={
        "params": {
            "max_depth": tune.grid_search([3, 4, 5]),
            "subsample": tune.grid_search([0.8, 1.0]),
            "eta": tune.grid_search([0.3, 0.5]),
        }
    },
    # Select on validation RMSE and run one trial at a time, as the task requires
    tune_config=tune.TuneConfig(
        metric="val-rmse", mode="min", num_samples=1, max_concurrent_trials=1
    ),
)


# Save results
result_grid = tuner.fit()
best_result = result_grid.get_best_result(metric="val-rmse", mode="min")

Current time: 2024-03-13 16:56:51
Running for: 00:10:31.15
Memory: 130.9/503.6 GiB

| Trial name | status | loc | params/eta | params/max_depth | params/subsample | iter | total time (s) | train-rmse | val-rmse |
|---|---|---|---|---|---|---|---|---|---|
| XGBoostTrainer_e4665_00000 | TERMINATED | 10.47.192.26:47227 | 0.3 | 3 | 0.8 | 11 | 51.0381 | 0.884277 | 0.88485 |
| XGBoostTrainer_e4665_00001 | TERMINATED | 10.45.64.19:24886 | 0.5 | 3 | 0.8 | 11 | 46.3068 | 0.876538 | 0.877231 |
| XGBoostTrainer_e4665_00002 | TERMINATED | 10.47.192.26:49028 | 0.3 | 4 | 0.8 | 11 | 46.3773 | 0.877908 | 0.878565 |
| XGBoostTrainer_e4665_00003 | TERMINATED | 10.45.64.19:25917 | 0.5 | 4 | 0.8 | 11 | 46.087 | 0.871727 | 0.872406 |
| XGBoostTrainer_e4665_00004 | TERMINATED | 10.35.0.20:26109 | 0.3 | 5 | 0.8 | 11 | 52.2955 | 0.873965 | 0.874649 |
| XGBoostTrainer_e4665_00005 | TERMINATED | 10.35.0.20:26723 | 0.5 | 5 | 0.8 | 11 | 51.2324 | 0.867158 | 0.867836 |
| XGBoostTrainer_e4665_00006 | TERMINATED | 10.35.0.20:27343 | 0.3 | 3 | 1.0 | 11 | 47.8074 | 0.884268 | 0.884852 |
| XGBoostTrainer_e4665_00007 | TERMINATED | 10.47.192.26:51424 | 0.5 | 3 | 1.0 | 11 | 41.6642 | 0.877016 | 0.877651 |
| XGBoostTrainer_e4665_00008 | TERMINATED | 10.35.0.20:28365 | 0.3 | 4 | 1.0 | 11 | 46.221 | 0.87767 | 0.87833 |
| XGBoostTrainer_e4665_00009 | TERMINATED | 10.45.64.19:28404 | 0.5 | 4 | 1.0 | 11 | 48.3248 | 0.87133 | 0.872067 |


(XGBoostTrainer pid=47227) [RayXGBoost] Created 3 new actors (3 total actors). Waiting until actors are ready for training.
(XGBoostTrainer pid=47227) [RayXGBoost] Starting XGBoost training.
(_RemoteRayXGBoostActor pid=24644, ip=10.35.0.20) [16:46:37] task [xgboost.ray]:140012045121136 got new rank 0
(XGBoostTrainer pid=47227) Training in progress (31 seconds since last restart).
(_RemoteRayXGBoostActor pid=47508) [16:46:37] task [xgboost.ray]:139622141098544 got new rank 2 [repeated 2x across cluster]
(XGBoostTrainer pid=47227) [RayXGBoost] Finished XGBoost training on training data with total N=13,833,129 in 51.06 seconds (38.86 pure XGBoost training time).
(XGBoostTrainer pid=47227) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-46-20/XGBoostTrainer_e4665_00000_0_eta=0.3000,max_depth=3,subsample=0.8000_2024-03-13_16-46-20/checkpoint_0

In [12]:
print(best_result)

Result(
  metrics={'train-rmse': 0.8671578920302253, 'val-rmse': 0.867836184329676},
  path='/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-46-20/XGBoostTrainer_e4665_00005_5_eta=0.5000,max_depth=5,subsample=0.8000_2024-03-13_16-46-20',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-46-20/XGBoostTrainer_e4665_00005_5_eta=0.5000,max_depth=5,subsample=0.8000_2024-03-13_16-46-20/checkpoint_000000)
)


(XGBoostTrainer pid=30003, ip=10.35.0.20) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/a1jadhav/ray_results/XGBoostTrainer_2024-03-13_16-46-20/XGBoostTrainer_e4665_00011_11_eta=0.5000,max_depth=5,subsample=1.0000_2024-03-13_16-46-20/checkpoint_000000)


Now:
1. Get the root mean squared error for the test dataset using the best result from the hyperparameter tuning experiments.
2. Report the validation RMSE values for the best model as well as for the given configurations.
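One way to look up specific configurations is to index trials by their hyperparameter tuple. The sketch below uses plain dictionaries in place of Ray `Result` objects, with four validation RMSE values copied from the trial table above. The names `trials`, `by_config`, and `best` are illustrative.

```python
# Stand-ins for trial results; the val-rmse values are copied from the trial table.
trials = [
    {"params": {"max_depth": 5, "eta": 0.3, "subsample": 0.8}, "val-rmse": 0.874649},
    {"params": {"max_depth": 4, "eta": 0.3, "subsample": 1.0}, "val-rmse": 0.87833},
    {"params": {"max_depth": 3, "eta": 0.5, "subsample": 1.0}, "val-rmse": 0.877651},
    {"params": {"max_depth": 5, "eta": 0.5, "subsample": 0.8}, "val-rmse": 0.867836},
]

# Map (max_depth, eta, subsample) -> validation RMSE for direct lookup.
by_config = {
    (t["params"]["max_depth"], t["params"]["eta"], t["params"]["subsample"]): t["val-rmse"]
    for t in trials
}

# The best trial among these is the one with the lowest validation RMSE.
best = min(trials, key=lambda t: t["val-rmse"])

print(by_config[(5, 0.3, 0.8)])
print(best["params"])
```

A dictionary lookup keeps the mapping from configuration to result in one place, which can be easier to extend than a chain of `if`/`elif` comparisons.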

In [13]:
def get_task_2_2_results(result_grid: tune.ResultGrid, best_result: ray.train.Result):
    res = {
        "test_rmse": None, # test rmse for the best model
        "valid_rmse": None, # validation rmse for the best model
        "valid_depth_5_eta_0.3_subsample_0.8": None, # validation rmse for max_depth=5, eta=0.3, subsample=0.8
        "valid_depth_4_eta_0.3_subsample_1": None, # validation rmse for max_depth=4, eta=0.3, subsample=1
        "valid_depth_3_eta_0.5_subsample_1": None, # validation rmse for max_depth=3, eta=0.5, subsample=1
    }

    # YOUR CODE HERE
    #test-rmse for best model
    test_rmse_se = predict_xgboost(test_dataset, best_result)
    res["test_rmse"] = math.sqrt(test_rmse_se.sum()/test_rmse_se.count())
    
    #best model val rmse
    res["valid_rmse"] = best_result.metrics['val-rmse']
    
    for trial in result_grid:
        trial_params = trial.config["params"]
        if trial_params["max_depth"] == 5 and trial_params["eta"] == 0.3 and trial_params["subsample"] == 0.8:
            res["valid_depth_5_eta_0.3_subsample_0.8"] = trial.metrics.get("val-rmse")
        elif trial_params["max_depth"] == 4 and trial_params["eta"] == 0.3 and trial_params["subsample"] == 1.0:
            res["valid_depth_4_eta_0.3_subsample_1"] = trial.metrics.get("val-rmse")
        elif trial_params["max_depth"] == 3 and trial_params["eta"] == 0.5 and trial_params["subsample"] == 1.0:
            res["valid_depth_3_eta_0.5_subsample_1"] = trial.metrics.get("val-rmse")
    
    return res

In [14]:
res_2_2 = get_task_2_2_results(result_grid, best_result)
with open("res_2_2.json", "w") as f:
    json.dump(res_2_2, f)

2024-03-13 16:56:51,599	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:56:51,600	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:56:51,601	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)] -> LimitOperator[limit=1]
2024-03-13 16:56:51,602	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-13 16:56:51,603	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.


2024-03-13 16:56:54,726	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:56:54,728	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:56:54,728	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]
2024-03-13 16:56:54,730	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-13 16:56:54,730	INFO streaming_executor.py:115 -- Tip: For detailed pro


2024-03-13 16:57:01,145	INFO set_read_parallelism.py:115 -- Using autodetected parallelism=200 for stage ReadParquet to satisfy DataContext.get_current().min_parallelism=200.
2024-03-13 16:57:01,147	INFO set_read_parallelism.py:122 -- To satisfy the requested parallelism of 200, each read task output is split into 9 smaller blocks.
2024-03-13 16:57:01,148	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> TaskPoolMapOperator[MapBatches(Predictor)]
2024-03-13 16:57:01,149	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-03-13 16:57:01,150	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().


In [15]:
# shutdown!
ray.shutdown()