radT

radT (Resource Aware Data science Tracker) is an extension to MLFlow that simplifies collecting and exploring hardware metrics for machine learning and deep learning applications. Collecting and processing all the required metrics for these workloads is usually a hassle; RADT, in contrast, is easy to deploy and use, with minimal impact on both performance and developer time. The RADT codebase is documented and easy to extend.

This work was published at the SIGMOD DEEM 2023 workshop as "Data Management and Visualization for Benchmarking Deep Learning Training Systems".

pip install radt

Releases

The current release is 0.2.15. radT was released recently and receives frequent updates.

If you find any issues or bugs, feel free to message titr (at) itu.dk or open an issue in this repository.

Changelog

  • 0.2.15: RADT now runs correctly on machines that have a corrupt DCGMI installation.
  • 0.2.14: Automatically disable the DCGMI listener when DCGMI is not found.
  • 0.2.13: Enable RADT on systems without DCGMI.
  • 0.2.12: Fixed an issue with dependencies.
  • 0.2.11: Workloads are now nested to group them together; run names include the workload and letter. Improved the flexibility of parameter passthrough.
  • 0.2.10: Workload listeners now upload logs when the log file points to a different folder. The rerun argument now works correctly.
  • 0.2.9: Allow text printing while the environment is being set up.
  • 0.2.8: Resolved an issue preventing logs from being collected.
  • 0.2.7: Resolved a race condition that could sometimes disrupt collocated model execution.
  • 0.2.6: Resolved synchronisation issues with .csv runs.
  • 0.2.5: Automatically log pip and conda package lists and nvidia-smi driver info for reproducibility.
  • 0.2.4: Fixed the rerun flag; added run names to status.
  • 0.2.3: Reintroduced manual mode; fixed an issue with context attributes; max_epoch, max_time, and manual are now logged as parameters.
  • 0.2.2: Reintroduced contexts; fixed migedit not being listed as a formal requirement.
  • 0.2.1: Removed legacy print statements.
  • 0.2.0: Moved radtrun to be a subcommand of radt; reintroduced workload listeners; use migedit for MIG management; added local mode.
  • 0.1.4: Fixed several minor issues.
  • 0.1.3: Fixed several bugs that prevented correct logging.
  • 0.1.0: Initial release.

Features

  • Wide configuration support including collocation
  • Track hardware and software metrics, including Nsight
  • Handle continuous streams of data
  • Support multiple visualization use-cases
  • Filter large amounts of inconsequential data
  • Minimal code impact

Sample usage & getting started

Replace python in your training command with radt, e.g.:

>>> radt train.py --batch-size 256

or, when using virtual environments/conda:

>>> python -m radt train.py --batch-size 256

For a complete getting-started guide and examples, please visit the Examples.

Easy to use via automated tracking

radT automatically tracks hardware metrics for your application: the listeners start collecting as soon as your application is invoked.

As radT extends MLFlow, you can either use the advanced tracking described below or use MLFlow directly to track software metrics (e.g. loss).
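For example, a script launched through radT can record software metrics with plain MLFlow calls while the hardware metrics are collected automatically. A minimal sketch (the loop and loss values are illustrative placeholders, not part of radt):

import mlflow

# radT collects hardware metrics on its own; standard MLFlow calls
# record software metrics alongside them.
for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # placeholder for a real training step
    mlflow.log_metric("loss", loss, step=epoch)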

Advanced tracking options via context

If you want more control over what is logged, you can encapsulate your training loop in the RADT context. This enables logging of ML metrics, among other MLFlow functions:

import radt

with radt.run.RADTBenchmark() as run:
    # training loop goes here
    run.log_metric("Metric A", amount)  # amount holds the metric value to record
    run.log_artifact("artifact.file")   # upload a file as a run artifact

All methods and functions under mlflow are accessible this way. These functions are disabled when the code runs without radt, so the same codebase works with and without tracking.
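For instance, parameters can be logged through the same run object. A minimal sketch, assuming log_param passes through to MLFlow like the functions above:

import radt

with radt.run.RADTBenchmark() as run:
    run.log_param("batch_size", 256)  # assumed passthrough of mlflow.log_param
    for epoch in range(10):
        run.log_metric("loss", 1.0 / (epoch + 1))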

CSV syntax for larger experiments

RADT can take the hassle out of large experiments by training multiple models in succession. Models can even be trained simultaneously on different GPUs, or on the same GPU using a range of collocation schemes.

Experiment,Workload,Status,Run,Devices,Collocation,    File,    Listeners,Params
         1,       1,      ,   ,      0,          -,train.py,smi+top+dcgmi,batch-size=128
         1,       1,      ,   ,      1,          -,train.py,smi+top+dcgmi,batch-size=128
         1,       2,      ,   ,      2,    3g.20gb,train.py,smi+top+dcgmi,batch-size=128
         1,       2,      ,   ,      2,    3g.20gb,train.py,smi+top+dcgmi,batch-size=128
         1,       3,      ,   ,      1,          -,train.py,smi+top+dcgmi,batch-size=256

If interrupted for any reason, a CSV experiment can be rescheduled to continue from where it left off.
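For larger sweeps, the schedule can also be generated programmatically. A minimal sketch using Python's csv module (the file name experiment.csv and the batch-size sweep are illustrative):

import csv

# Columns follow the example schedule above.
fields = ["Experiment", "Workload", "Status", "Run", "Devices",
          "Collocation", "File", "Listeners", "Params"]

rows = [
    {"Experiment": 1, "Workload": i + 1, "Status": "", "Run": "",
     "Devices": 0, "Collocation": "-", "File": "train.py",
     "Listeners": "smi+top+dcgmi", "Params": f"batch-size={bs}"}
    for i, bs in enumerate((128, 256))
]

with open("experiment.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)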

Supported platforms

  • Linux

Citation

If you need to cite this repository in academic research:

@inproceedings{robroek2023data,
  title={Data Management and Visualization for Benchmarking Deep Learning Training Systems},
  author={Robroek, Ties and Duane, Aaron and Yousefzadeh-Asl-Miandoab, Ehsan and Tozun, Pinar},
  booktitle={Proceedings of the Seventh Workshop on Data Management for End-to-End Machine Learning},
  pages={1--5},
  year={2023}
}

Contributors

Thank You!

Contributions are welcome. (Please add yourself to the list.)
