### **Let's explore Ray for an End-to-End Deep Learning Project!**

Welcome to an exploration of Ray, an open-source framework that helps us tackle the formidable challenges of training large deep learning models. You might be wondering, why Ray? Over the past four years, my curiosity has been fueled by the desire to understand how these massive models are trained. Think about it: training a vision transformer with millions of parameters on your laptop, or even conducting ablation studies, seems like an insurmountable task.

That's where Ray steps in. The framework offers distributed computing capabilities that let us train very large models quickly. It takes care of much of the infrastructure management, so deep infrastructure expertise is not a prerequisite. Moreover, transitioning from local development to a cloud environment is a breeze with Ray, with no drastic code changes required. For an in-depth understanding of the framework, I urge you to refer to the [Ray website](https://www.ray.io/).


#### **Exploring this Repository**

Before we delve into the nitty-gritty details of this implementation, I'd like to extend my heartfelt gratitude to [Goku Mohandas](https://www.linkedin.com/in/goku/) for his outstanding [Made-With-ML](https://github.com/GokuMohandas/Made-With-ML) repository, a phenomenal resource for MLOps. His dedication to advancing the field and sharing knowledge has been a guiding light for countless enthusiasts, including myself.

Now, let's briefly discuss the different modules used in this implementation.

1. **utils.py**: This module provides two functions: one reads the configuration file (i.e., config.YAML) and the other sets the tracking URI for MLflow.
2. **data_utils.py**: This module prepares the image classification data and should be adapted to your dataset as required. It covers two scenarios: data that is stored locally and data that needs to be downloaded. For simplicity, we use data from an online source.
3. **build_model.py**: This module defines the model for training. Here we fine-tune a ResNet50 model, so the module only initializes it with pre-trained weights. Custom PyTorch models can be defined in this module as well.
4. **train_utils.py**: This module contains helper functions used in the training process.
5. **train_engine.py**: This is the module where we train the model using Ray. The key concepts of Ray Train are covered in this [documentation](https://docs.ray.io/en/latest/train/key-concepts.html). Distributed training in Ray is performed by trainers. All we need to do is define a `train_loop_per_worker` function containing the training steps that each worker runs. To scale the training, we specify the Scaling Config. In addition, we specify the Checkpoint Config and Run Config, which take care of the experiment's name, where the checkpoints are stored, and any callbacks we use. In this case, we use MLflow to manage our experiments.
6. **tune_engine.py**: This module is responsible for hyperparameter tuning. It uses Ray Tune's `Tuner`. Details can be found [here](https://docs.ray.io/en/latest/tune/key-concepts.html). In addition to the TorchTrainer, Scaling Config, Run Config, and Checkpoint Config, we define the search space, search algorithms, and schedulers. The tuning is performed in a distributed manner, and the experiments can be tracked using MLflow.
7. **evaluate_engine.py**: This module evaluates the model and can also be adapted for batch inference.
8. **serve_engine.py**: Lastly, this implementation also explores model serving using Gradio with Ray.
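
As a rough illustration of how the pieces described for `train_engine.py` fit together, the sketch below wires a `train_loop_per_worker` function, a Scaling Config, a Checkpoint Config, and a Run Config (with an MLflow callback) into a `TorchTrainer`. It is not the repository's actual code: the hyperparameter values, the `build_trainer` helper, and the use of `val_loss` to rank checkpoints are assumptions, and the import paths follow recent Ray 2.x releases (older releases expose the same config classes under `ray.air`).

```python
# Hypothetical defaults; the real values live in the project's config file.
TRAIN_LOOP_CONFIG = {"lr": 1e-3, "batch_size": 32, "num_epochs": 10}


def train_loop_per_worker(config: dict) -> None:
    """Executed on every Ray Train worker with the train_loop_config dict."""
    # A project-specific loop would typically:
    #   1. build the model and data loaders,
    #   2. wrap them with ray.train.torch.prepare_model / prepare_data_loader,
    #   3. train for config["num_epochs"] epochs, reporting metrics each epoch.
    raise NotImplementedError("project-specific training loop goes here")


def build_trainer(experiment_name, storage_path, tracking_uri,
                  num_workers=2, use_gpu=True):
    """Assemble a TorchTrainer from the three configs described above."""
    # Ray is imported lazily so this sketch can be imported without Ray installed.
    from ray.air.integrations.mlflow import MLflowLoggerCallback
    from ray.train import CheckpointConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    # Keep only the best checkpoint, ranked by validation loss (assumed metric).
    checkpoint_config = CheckpointConfig(
        num_to_keep=1,
        checkpoint_score_attribute="val_loss",
        checkpoint_score_order="min",
    )
    # Experiment name, checkpoint location, and callbacks.
    run_config = RunConfig(
        name=experiment_name,
        storage_path=storage_path,
        checkpoint_config=checkpoint_config,
        callbacks=[
            MLflowLoggerCallback(
                tracking_uri=tracking_uri,
                experiment_name=experiment_name,
                save_artifact=True,
            )
        ],
    )
    return TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config=TRAIN_LOOP_CONFIG,
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        run_config=run_config,
    )
```

Calling `build_trainer(...).fit()` would start the workers and return a `Result` object holding the final metrics and the checkpoint.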

Also, the **config.YAML** file has to be updated, especially for the directories used in the project.

The code is in the form of Python scripts. To demonstrate how it works, in the notebook, these scripts are called using CLI commands. If you have any questions or comments, please do reach out!


---
**Handling Custom Datasets in Different Scenarios**

When you need to run the model on a custom dataset, you'll likely encounter two common scenarios: working with local data and working with remote data. Let's explore how to handle each scenario effectively:

**Scenario 1: Local Data**

Organizing Your Custom Dataset

If your custom dataset is stored locally, you can ensure a smooth integration by following these steps:

1. **Folder Structure**: Create a folder named `data` within your project directory. Inside this `data` folder, include subfolders for your custom dataset, such as `train`, `val` (validation), and `test`. Organize your data files accordingly.

2. **Data Loading**: Modify your data loading process in the `train_engine.py` module. Instead of importing the `build_datasets_download` method, import the `build_datasets_local` method. This adjustment allows you to read the custom dataset from the local directory structure.
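
Before pointing the training script at a local dataset, it can help to check that the folder layout matches the steps above. The sketch below is a small, standalone check and not part of the repository: the helper names are hypothetical, and the one-subfolder-per-class layout inside each split is an assumption about how the images are organized.

```python
from pathlib import Path

# Split folders expected inside the `data` directory.
EXPECTED_SPLITS = ("train", "val", "test")


def missing_splits(data_root):
    """Return the names of the split folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_dir()]


def class_counts(split_dir):
    """Count files per class subfolder (assumes one subfolder per class)."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts


if __name__ == "__main__":
    missing = missing_splits("data")
    if missing:
        print("Missing split folders:", ", ".join(missing))
    else:
        print("Images per class in train:", class_counts(Path("data") / "train"))
```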

**Scenario 2: Remote Data**

Accessing Data from a Remote Source

If your custom dataset is hosted remotely or available via a web service, you can access it as follows:

1. **Data Retrieval**: Utilize methods or libraries to fetch the data from the remote source. This could involve using APIs, downloading data from a web URL, or accessing cloud-based storage.

2. **Data Loading**: To make the data available during training and tuning, ensure that the correct function is imported from the `data_utils.py` module. In this case, import the `build_datasets_download` method.
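
For the common case where the dataset is published as an archive at a URL, the retrieval step might look like the sketch below. It uses only the standard library; `fetch_and_extract` is a hypothetical helper and not the repository's `build_datasets_download`, and it should only be pointed at archives from a trusted source.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def fetch_and_extract(url, data_dir):
    """Download the archive at `url` and unpack it under `data_dir`."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last component of the URL path.
    archive_name = Path(urllib.parse.urlparse(url).path).name
    archive_path = data_dir / archive_name

    # Download, unpack (format inferred from the file extension), then
    # delete the archive so only the extracted files remain.
    urllib.request.urlretrieve(url, str(archive_path))
    shutil.unpack_archive(str(archive_path), str(data_dir))
    archive_path.unlink()
    return data_dir
```

After extraction, the resulting folders can be organized into the same `train`, `val`, and `test` layout used for local data.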



---
**How to Use the `train_engine.py` Module**

Once you've tailored the configuration file to match your project's specifications, you can proceed with training the model using the `train_engine` module. This module can be executed via a command-line interface (CLI) command, simplifying the training process. Here's how to use it:

**CLI Command**

To initiate the training process, execute the following CLI command:

```bash
python train_engine.py
```


    %run ../src/train_engine.py

| Current time | Running for | Memory |
| --- | --- | --- |
| 2023-09-03 09:41:03 | 00:00:43.76 | 61.0/187.5 GiB |

| Trial name | status | loc | iter | total time (s) | val_loss | val_acc |
| --- | --- | --- | --- | --- | --- | --- |
| TorchTrainer_6c48f_00000 | TERMINATED | 192.168.13.177:11541 | 10 | 38.1492 | 0.142499 | 0.973684 |

    (TorchTrainer pid=11541) Starting distributed worker processes: ['11605 (192.168.13.177)', '11606 (192.168.13.177)']
    (RayTrainWorker pid=11605) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=11605) Moving model to device: cuda:0
    (RayTrainWorker pid=11605) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=11606) Epoch 0-train Loss: 0.6301 Acc: 0.6803
    (RayTrainWorker pid=11606) Epoch 2-train Loss: 0.3065 Acc: 0.9508 [repeated 8x across cluster]
    (RayTrainWorker pid=11606) Epoch 4-train Loss: 0.1939 Acc: 0.9344 [repeated 8x across cluster]
    (RayTrainWorker pid=11606) Epoch 6-train Loss: 0.1488 Acc: 0.9590 [repeated 8x across cluster]
    (RayTrainWorker pid=11606) Epoch 8-train Loss: 0.1364 Acc: 0.9508 [repeated 8x across cluster]

    2023-09-03 09:41:03,210 INFO tune.py:1148 -- Total run time: 43.81 seconds (43.76 seconds for the tuning loop).

    Result(
      metrics={'val_loss': 0.14249894532718158, 'val_acc': 0.9736842105263158, 'should_checkpoint': True, 'done': True, 'trial_id': '6c48f_00000', 'experiment_tag': '0'},
      path='/home/dhaval/Projects/NewRay/results/ray_results/tuning-resnet-1693748416/TorchTrainer_6c48f_00000_0_2023-09-03_09-40-19',
      checkpoint=TorchCheckpoint(local_path=/home/dhaval/Projects/NewRay/results/ray_results/tuning-resnet-1693748416/TorchTrainer_6c48f_00000_0_2023-09-03_09-40-19/checkpoint_000009)
    )


---
**How to Use the `tune_engine.py` Module**

After understanding how to execute the `train_engine`, let's explore the usage of the `tune_engine` module. This module allows you to perform multiple hyperparameter tuning runs based on the parameter space defined within `tune_engine.py`. In the future, consider configuring this parameter space in a separate configuration file to abstract away code details.

**Module Invocation**

The `tune_engine.py` module does not require any command-line arguments. Instead, it accesses the necessary parameters and configuration settings from a specified config file. The absence of command-line arguments simplifies usage and reduces the likelihood of errors.

**Configuration File**

To use the `tune_engine` effectively, make sure you have a configuration file in place. This file should contain the required arguments for your tuning runs.

**Hyperparameter Tuning**

The primary purpose of the `tune_engine` module is to perform hyperparameter tuning, where it systematically explores various combinations of hyperparameters to find the best configuration for your machine learning model. This process can help optimize your model's performance.
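
A tuning setup along these lines can be sketched with Ray Tune's `Tuner`. The search space below is an assumption read off the trial table in the tuning output (batch sizes of 32, 64, and 128, and learning rates spread over several orders of magnitude), and the ASHA scheduler is likewise a guess: the iteration counts of 1, 4, and 10 in that table are consistent with ASHA-style early stopping, but the source does not say which scheduler is used. `build_tuner` is a hypothetical helper.

```python
def build_tuner(trainer, num_samples=10):
    """Wrap an existing TorchTrainer in a Tuner with an assumed search space."""
    # Ray is imported lazily so this sketch can be imported without Ray installed.
    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    # Keys under "train_loop_config" override the trainer's train_loop_config.
    param_space = {
        "train_loop_config": {
            "batch_size": tune.choice([32, 64, 128]),
            "lr": tune.loguniform(1e-4, 1e-1),
        }
    }
    tune_config = tune.TuneConfig(
        metric="val_loss",
        mode="min",
        num_samples=num_samples,
        # Stop unpromising trials early; max_t matches the 10 epochs per run.
        scheduler=ASHAScheduler(max_t=10, grace_period=1),
    )
    return tune.Tuner(trainer, param_space=param_space, tune_config=tune_config)
```

With a trainer in hand, `results = build_tuner(trainer).fit()` runs the trials, and `results.get_best_result(metric="val_loss", mode="min")` returns the best one.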




    %run ../src/tune_engine.py

| Current time | Running for | Memory |
| --- | --- | --- |
| 2023-09-03 10:01:52 | 00:07:15.57 | 61.2/187.5 GiB |

| Trial name | status | loc | train_loop_config/batch_size | train_loop_config/lr | iter | total time (s) | val_loss | val_acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TorchTrainer_6b353_00000 | TERMINATED | 192.168.13.177:16786 | 32 | 0.000395467 | 10 | 38.9987 | 0.190796 | 0.960526 |
| TorchTrainer_6b353_00001 | TERMINATED | 192.168.13.177:17337 | 128 | 0.000238078 | 1 | 9.49993 | 0.660946 | 0.631579 |
| TorchTrainer_6b353_00002 | TERMINATED | 192.168.13.177:17537 | 64 | 0.000265461 | 1 | 9.22163 | 0.662777 | 0.671053 |
| TorchTrainer_6b353_00003 | TERMINATED | 192.168.13.177:17692 | 32 | 0.000387216 | 1 | 10.5228 | 0.621326 | 0.723684 |
| TorchTrainer_6b353_00004 | TERMINATED | 192.168.13.177:17856 | 128 | 0.001116 | 4 | 18.3405 | 0.437896 | 0.960526 |
| TorchTrainer_6b353_00005 | TERMINATED | 192.168.13.177:18052 | 128 | 0.000850337 | 1 | 9.41257 | 0.625424 | 0.697368 |
| TorchTrainer_6b353_00006 | TERMINATED | 192.168.13.177:18231 | 64 | 0.0103263 | 10 | 35.745 | 0.144249 | 0.947368 |
| TorchTrainer_6b353_00007 | TERMINATED | 192.168.13.177:18504 | 128 | 0.0769279 | 1 | 10.5601 | 6.28846 | 0.460526 |
| TorchTrainer_6b353_00008 | TERMINATED | 192.168.13.177:18693 | 128 | 0.0529297 | 1 | 9.41065 | 3.95463 | 0.460526 |
| TorchTrainer_6b353_00009 | TERMINATED | 192.168.13.177:18859 | 128 | 0.000290985 | 1 | 9.64748 | 0.642906 | 0.644737 |

    (TorchTrainer pid=16786) Starting distributed worker processes: ['16868 (192.168.13.177)', '16869 (192.168.13.177)']
    (RayTrainWorker pid=16868) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=16868) Moving model to device: cuda:0
    (RayTrainWorker pid=16868) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=16868) Epoch 0-train Loss: 0.6773 Acc: 0.6066
    (RayTrainWorker pid=16868) Epoch 0-val Loss: 0.5774 Acc: 0.8026
    (RayTrainWorker pid=16868) Epoch 1-train Loss: 0.5965 Acc: 0.6967
    (RayTrainWorker pid=16868) Epoch 1-val Loss: 0.4606 Acc: 0.9211
    (RayTrainWorker pid=16868) Epoch 2-train Loss: 0.4511 Acc: 0.8934
    (RayTrainWorker pid=16868) Epoch 2-val Loss: 0.3749 Acc: 0.9342
    (RayTrainWorker pid=16868) Epoch 3-train Loss: 0.3966 Acc: 0.9098
    (RayTrainWorker pid=16868) Epoch 3-val Loss: 0.3245 Acc: 0.9211
    (RayTrainWorker pid=16868) Epoch 4-train Loss: 0.3464 Acc: 0.9180
    (RayTrainWorker pid=16868) Epoch 4-val Loss: 0.2703 Acc: 0.9474
    (RayTrainWorker pid=16868) Epoch 5-train Loss: 0.2938 Acc: 0.9344
    (RayTrainWorker pid=16868) Epoch 5-val Loss: 0.2443 Acc: 0.9605
    (RayTrainWorker pid=16868) Epoch 6-train Loss: 0.27

    (TorchTrainer pid=17337) Starting distributed worker processes: ['17413 (192.168.13.177)', '17414 (192.168.13.177)']
    (RayTrainWorker pid=17413) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=17413) Moving model to device: cuda:0
    (RayTrainWorker pid=17413) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=17413) Epoch 0-train Loss: 0.6903 Acc: 0.5328
    (RayTrainWorker pid=17413) Epoch 0-val Loss: 0.6609 Acc: 0.6316

    (TorchTrainer pid=17537) Starting distributed worker processes: ['17575 (192.168.13.177)', '17576 (192.168.13.177)']
    (RayTrainWorker pid=17575) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=17575) Moving model to device: cuda:0
    (RayTrainWorker pid=17575) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=17575) Epoch 0-train Loss: 0.6539 Acc: 0.6475
    (RayTrainWorker pid=17575) Epoch 0-val Loss: 0.6628 Acc: 0.6711

    (TorchTrainer pid=17692) Starting distributed worker processes: ['17733 (192.168.13.177)', '17734 (192.168.13.177)']
    (RayTrainWorker pid=17733) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=17733) Moving model to device: cuda:0
    (RayTrainWorker pid=17733) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=17733) Epoch 0-train Loss: 0.6883 Acc: 0.5738
    (RayTrainWorker pid=17733) Epoch 0-val Loss: 0.6213 Acc: 0.7237

    (TorchTrainer pid=17856) Starting distributed worker processes: ['17905 (192.168.13.177)', '17906 (192.168.13.177)']
    (RayTrainWorker pid=17905) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=17905) Moving model to device: cuda:0
    (RayTrainWorker pid=17905) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=17905) Epoch 0-train Loss: 0.6819 Acc: 0.6230
    (RayTrainWorker pid=17905) Epoch 0-val Loss: 0.5742 Acc: 0.8947
    (RayTrainWorker pid=17905) Epoch 1-train Loss: 0.6474 Acc: 0.6721
    (RayTrainWorker pid=17905) Epoch 1-val Loss: 0.5557 Acc: 0.8158
    (RayTrainWorker pid=17905) Epoch 2-train Loss: 0.6157 Acc: 0.7377
    (RayTrainWorker pid=17905) Epoch 2-val Loss: 0.5112 Acc: 0.8816
    (RayTrainWorker pid=17905) Epoch 3-train Loss: 0.5570 Acc: 0.7623
    (RayTrainWorker pid=17905) Epoch 3-val Loss: 0.4379 Acc: 0.9605

    (TorchTrainer pid=18052) Starting distributed worker processes: ['18091 (192.168.13.177)', '18092 (192.168.13.177)']
    (RayTrainWorker pid=18091) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=18091) Moving model to device: cuda:0
    (RayTrainWorker pid=18091) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=18091) Epoch 0-train Loss: 0.6842 Acc: 0.5574
    (RayTrainWorker pid=18091) Epoch 0-val Loss: 0.6254 Acc: 0.6974

    (TorchTrainer pid=18231) Starting distributed worker processes: ['18281 (192.168.13.177)', '18282 (192.168.13.177)']
    (RayTrainWorker pid=18281) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=18281) Moving model to device: cuda:0
    (RayTrainWorker pid=18281) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=18281) Epoch 0-train Loss: 0.5917 Acc: 0.6639
    (RayTrainWorker pid=18281) Epoch 0-val Loss: 0.2760 Acc: 0.9211
    (RayTrainWorker pid=18281) Epoch 1-train Loss: 0.2538 Acc: 0.8852
    (RayTrainWorker pid=18281) Epoch 1-val Loss: 0.1451 Acc: 0.9605
    (RayTrainWorker pid=18281) Epoch 2-train Loss: 0.1636 Acc: 0.9344
    (RayTrainWorker pid=18281) Epoch 2-val Loss: 0.1516 Acc: 0.9605
    (RayTrainWorker pid=18281) Epoch 3-train Loss: 0.0660 Acc: 0.9836
    (RayTrainWorker pid=18281) Epoch 3-val Loss: 0.1756 Acc: 0.9737
    (RayTrainWorker pid=18281) Epoch 4-train Loss: 0.0874 Acc: 0.9672
    (RayTrainWorker pid=18281) Epoch 4-val Loss: 0.2181 Acc: 0.9342
    (RayTrainWorker pid=18281) Epoch 5-train Loss: 0.0413 Acc: 0.9918
    (RayTrainWorker pid=18281) Epoch 5-val Loss: 0.3416 Acc: 0.9211
    (RayTrainWorker pid=18281) Epoch 6-train Loss: 0.08

    (TorchTrainer pid=18504) Starting distributed worker processes: ['18542 (192.168.13.177)', '18543 (192.168.13.177)']
    (RayTrainWorker pid=18542) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=18542) Moving model to device: cuda:0
    (RayTrainWorker pid=18542) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=18542) Epoch 0-train Loss: 0.5698 Acc: 0.6393
    (RayTrainWorker pid=18542) Epoch 0-val Loss: 6.2885 Acc: 0.4605

    (TorchTrainer pid=18693) Starting distributed worker processes: ['18740 (192.168.13.177)', '18741 (192.168.13.177)']
    (RayTrainWorker pid=18740) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=18740) Moving model to device: cuda:0
    (RayTrainWorker pid=18740) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=18740) Epoch 0-train Loss: 0.5845 Acc: 0.5902
    (RayTrainWorker pid=18740) Epoch 0-val Loss: 3.9546 Acc: 0.4605

    (TorchTrainer pid=18859) Starting distributed worker processes: ['18937 (192.168.13.177)', '18938 (192.168.13.177)']
    (RayTrainWorker pid=18937) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=18937) Moving model to device: cuda:0
    (RayTrainWorker pid=18937) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=18937) Epoch 0-train Loss: 0.6755 Acc: 0.5902
    (RayTrainWorker pid=18937) Epoch 0-val Loss: 0.6429 Acc: 0.6447

    (TorchTrainer pid=19055) Starting distributed worker processes: ['19099 (192.168.13.177)', '19100 (192.168.13.177)']
    (RayTrainWorker pid=19099) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=19099) Moving model to device: cuda:0
    (RayTrainWorker pid=19099) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=19099) Epoch 0-train Loss: 0.6969 Acc: 0.4590
    (RayTrainWorker pid=19099) Epoch 0-val Loss: 0.7424 Acc: 0.5921

    (TorchTrainer pid=19230) Starting distributed worker processes: ['19283 (192.168.13.177)', '19288 (192.168.13.177)']
    (RayTrainWorker pid=19283) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=19283) Moving model to device: cuda:0
    (RayTrainWorker pid=19283) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=19283) Epoch 0-train Loss: 0.6555 Acc: 0.6066
    (RayTrainWorker pid=19283) Epoch 0-val Loss: 0.4576 Acc: 0.9079
    (RayTrainWorker pid=19283) Epoch 1-train Loss: 0.4108 Acc: 0.8197
    (RayTrainWorker pid=19283) Epoch 1-val Loss: 0.2451 Acc: 0.9474
    (RayTrainWorker pid=19283) Epoch 2-train Loss: 0.2629 Acc: 0.9262
    (RayTrainWorker pid=19283) Epoch 2-val Loss: 0.1940 Acc: 0.9605
    (RayTrainWorker pid=19283) Epoch 3-train Loss: 0.2229 Acc: 0.9016
    (RayTrainWorker pid=19283) Epoch 3-val Loss: 0.1691 Acc: 0.9605
    (RayTrainWorker pid=19283) Epoch 4-train Loss: 0.1446 Acc: 0.9590
    (RayTrainWorker pid=19283) Epoch 4-val Loss: 0.1637 Acc: 0.9605
    (RayTrainWorker pid=19283) Epoch 5-train Loss: 0.1475 Acc: 0.9426
    (RayTrainWorker pid=19283) Epoch 5-val Loss: 0.1513 Acc: 0.9737
    (RayTrainWorker pid=19283) Epoch 6-train Loss: 0.12

    (TorchTrainer pid=19521) Starting distributed worker processes: ['19561 (192.168.13.177)', '19562 (192.168.13.177)']
    (RayTrainWorker pid=19561) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=19561) Moving model to device: cuda:0
    (RayTrainWorker pid=19561) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=19561) Epoch 0-train Loss: 0.6264 Acc: 0.5820
    (RayTrainWorker pid=19561) Epoch 0-val Loss: 0.4593 Acc: 0.8421
    (RayTrainWorker pid=19561) Epoch 1-train Loss: 0.4188 Acc: 0.8689
    (RayTrainWorker pid=19561) Epoch 1-val Loss: 0.1879 Acc: 0.9605
    (RayTrainWorker pid=19561) Epoch 2-train Loss: 0.2318 Acc: 0.9508
    (RayTrainWorker pid=19561) Epoch 2-val Loss: 0.1508 Acc: 0.9474
    (RayTrainWorker pid=19561) Epoch 3-train Loss: 0.1639 Acc: 0.9508
    (RayTrainWorker pid=19561) Epoch 3-val Loss: 0.1368 Acc: 0.9605
    (RayTrainWorker pid=19561) Epoch 4-train Loss: 0.1079 Acc: 0.9344
    (RayTrainWorker pid=19561) Epoch 4-val Loss: 0.1418 Acc: 0.9605
    (RayTrainWorker pid=19561) Epoch 5-train Loss: 0.0926 Acc: 0.9754
    (RayTrainWorker pid=19561) Epoch 5-val Loss: 0.1734 Acc: 0.9474
    (RayTrainWorker pid=19561) Epoch 6-train Loss: 0.09

    (TorchTrainer pid=19783) Starting distributed worker processes: ['19828 (192.168.13.177)', '19829 (192.168.13.177)']
    (RayTrainWorker pid=19828) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=19828) Moving model to device: cuda:0
    (RayTrainWorker pid=19828) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=19828) Epoch 0-train Loss: 0.7147 Acc: 0.5820
    (RayTrainWorker pid=19828) Epoch 0-val Loss: 0.6231 Acc: 0.7368

    (TorchTrainer pid=19945) Starting distributed worker processes: ['19990 (192.168.13.177)', '19991 (192.168.13.177)']
    (RayTrainWorker pid=19990) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=19990) Moving model to device: cuda:0
    (RayTrainWorker pid=19990) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=19990) Epoch 0-train Loss: 0.6301 Acc: 0.6721
    (RayTrainWorker pid=19990) Epoch 0-val Loss: 0.4387 Acc: 0.9474
    (RayTrainWorker pid=19990) Epoch 1-train Loss: 0.3865 Acc: 0.8607
    (RayTrainWorker pid=19990) Epoch 1-val Loss: 0.2137 Acc: 0.9737
    (RayTrainWorker pid=19990) Epoch 2-train Loss: 0.2682 Acc: 0.9098
    (RayTrainWorker pid=19990) Epoch 2-val Loss: 0.1611 Acc: 0.9605
    (RayTrainWorker pid=19990) Epoch 3-train Loss: 0.1717 Acc: 0.9426
    (RayTrainWorker pid=19990) Epoch 3-val Loss: 0.1458 Acc: 0.9737
    (RayTrainWorker pid=19990) Epoch 4-train Loss: 0.1394 Acc: 0.9508
    (RayTrainWorker pid=19990) Epoch 4-val Loss: 0.1417 Acc: 0.9605
    (RayTrainWorker pid=19990) Epoch 5-train Loss: 0.1238 Acc: 0.9426
    (RayTrainWorker pid=19990) Epoch 5-val Loss: 0.1318 Acc: 0.9605
    (RayTrainWorker pid=19990) Epoch 6-train Loss: 0.09

    (TorchTrainer pid=20202) Starting distributed worker processes: ['20244 (192.168.13.177)', '20245 (192.168.13.177)']
    (RayTrainWorker pid=20244) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=20244) Moving model to device: cuda:0
    (RayTrainWorker pid=20244) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=20244) Epoch 0-train Loss: 1.4841 Acc: 0.7459
    (RayTrainWorker pid=20244) Epoch 0-val Loss: 264.6427 Acc: 0.5263

    (TorchTrainer pid=20425) Starting distributed worker processes: ['20468 (192.168.13.177)', '20469 (192.168.13.177)']
    (RayTrainWorker pid=20468) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=20468) Moving model to device: cuda:0
    (RayTrainWorker pid=20468) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=20468) Epoch 0-train Loss: 0.7918 Acc: 0.4918
    (RayTrainWorker pid=20468) Epoch 0-val Loss: 0.7132 Acc: 0.5789

    (TorchTrainer pid=20659) Starting distributed worker processes: ['20767 (192.168.13.177)', '20768 (192.168.13.177)']
    (RayTrainWorker pid=20767) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=20767) Moving model to device: cuda:0
    (RayTrainWorker pid=20767) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=20767) Epoch 0-train Loss: 0.5407 Acc: 0.7131
    (RayTrainWorker pid=20767) Epoch 0-val Loss: 0.5511 Acc: 0.8553
    (RayTrainWorker pid=20767) Epoch 1-train Loss: 0.4344 Acc: 0.8525
    (RayTrainWorker pid=20767) Epoch 1-val Loss: 2.9890 Acc: 0.6842
    (RayTrainWorker pid=20767) Epoch 2-train Loss: 0.5361 Acc: 0.8525
    (RayTrainWorker pid=20767) Epoch 2-val Loss: 9.9129 Acc: 0.6316
    (RayTrainWorker pid=20767) Epoch 3-train Loss: 1.3305 Acc: 0.6803
    (RayTrainWorker pid=20767) Epoch 3-val Loss: 3.8423 Acc: 0.6447

    (TorchTrainer pid=20961) Starting distributed worker processes: ['21003 (192.168.13.177)', '21004 (192.168.13.177)']
    (RayTrainWorker pid=21003) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=21003) Moving model to device: cuda:0
    (RayTrainWorker pid=21003) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=21003) Epoch 0-train Loss: 0.7006 Acc: 0.5164
    (RayTrainWorker pid=21003) Epoch 0-val Loss: 0.7488 Acc: 0.4737

    (TorchTrainer pid=21156) Starting distributed worker processes: ['21197 (192.168.13.177)', '21198 (192.168.13.177)']
    (RayTrainWorker pid=21197) Setting up process group for: env:// [rank=0, world_size=2]
    (RayTrainWorker pid=21197) Moving model to device: cuda:0
    (RayTrainWorker pid=21197) Wrapping provided model in DistributedDataParallel.

    (RayTrainWorker pid=21197) Epoch 0-train Loss: 0.6474 Acc: 0.5656
    (RayTrainWorker pid=21197) Epoch 0-val Loss: 0.4727 Acc: 0.7895
    (RayTrainWorker pid=21197) Epoch 1-train Loss: 0.4349 Acc: 0.7951
    (RayTrainWorker pid=21197) Epoch 1-val Loss: 0.1965 Acc: 0.9605
    (RayTrainWorker pid=21197) Epoch 2-train Loss: 0.2349 Acc: 0.9016
    (RayTrainWorker pid=21197) Epoch 2-val Loss: 0.1718 Acc: 0.9474
    (RayTrainWorker pid=21197) Epoch 3-train Loss: 0.1468 Acc: 0.9590
    (RayTrainWorker pid=21197) Epoch 3-val Loss: 0.1646 Acc: 0.9605

    2023-09-03 10:01:52,206 INFO tune.py:1148 -- Total run time: 435.62 seconds (435.57 seconds for the tuning loop).

    Result(
      metrics={'val_loss': 0.14027262243785357, 'val_acc': 0.9736842105263158, 'should_checkpoint': True, 'done': True, 'trial_id': '6b353_00014', 'experiment_tag': '14_batch_size=32,lr=0.0018'},
      path='/home/dhaval/Projects/NewRay/results/ray_results/tuning-resnet-1693749273/TorchTrainer_6b353_00014_14_batch_size=32,lr=0.0018_2023-09-03_09-54-36',
      checkpoint=TorchCheckpoint(local_path=/home/dhaval/Projects/NewRay/results/ray_results/tuning-resnet-1693749273/TorchTrainer_6b353_00014_14_batch_size=32,lr=0.0018_2023-09-03_09-54-36/checkpoint_000009)
    )


**Evaluation Module**

This module is designed for evaluating the best model on a test/validation dataset. Here, we use the validation dataset, which is stored locally by default. If you need to evaluate the model on the test dataset, you will need to update the `data_utils.py` module accordingly.

**Usage**

To use this module, pass the name of the experiment via the `--experiment-name` argument when invoking the script. The module looks up the best run of the specified experiment and retrieves its checkpoint. The model loaded from that checkpoint is then used to make predictions on the dataset.

Note: The accuracy on the entire validation set differs from the values seen in the tuning runs, because the checkpointed metrics come from a single worker, which is exposed to only half of the dataset.
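
The effect described in the note can be illustrated with a little arithmetic. In the sketch below, the per-worker counts are illustrative values chosen to be consistent with the figures reported in this document (about 0.9737 on one worker and about 0.9542 on the full set); the actual per-shard counts are not given, and the function names are hypothetical.

```python
def shard_accuracy(correct, total):
    """Accuracy measured by a single worker on its own shard."""
    return correct / total


def overall_accuracy(shards):
    """Accuracy over all shards: pooled correct predictions / pooled samples."""
    total_correct = sum(correct for correct, _ in shards)
    total_seen = sum(total for _, total in shards)
    return total_correct / total_seen


# Illustrative (correct, total) pairs for two workers; not taken from the logs.
shards = [(74, 76), (72, 77)]

worker0 = shard_accuracy(*shards[0])
full_set = overall_accuracy(shards)
print(f"worker 0: {worker0:.4f}, full set: {full_set:.4f}")
# prints "worker 0: 0.9737, full set: 0.9542"
```

A metric reported by one worker reflects only that worker's shard, so it can sit above or below the figure computed over the whole validation set.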


    !python ../src/evaluate_engine.py --experiment-name "tuning-resnet-1693749273"

    Accuracy:  0.954248366013072


**Serving Module**

The serving module is built using Gradio. This CLI command can be used to run the app.

    %run ../src/serve_engine.py --experiment_name "tuning-resnet-1693749273"