# [Darshan-LDMS Integrator](https://ovis-hpc.readthedocs.io/en/latest/ldms/ldms-streams.html#darshans)

### A **framework** that integrates Darshan + LDMS to provide low-latency monitoring of I/O events during runtime. 

- **[Darshan](https://www.mcs.anl.gov/research/projects/darshan/)**: a lightweight I/O characterization tool used to capture I/O access information in memory during the execution of HPC applications. After a job running with Darshan finishes executing, Darshan merges I/O data from all processes and generates a profile document and, optionally, also a trace file.

- **[Lightweight Distributed Metric Service (LDMS)](https://ovis-hpc.readthedocs.io/en/latest/ldms/ldms-quickstart.html)**: a low-overhead production monitoring system for HPC machines. It can collect, transport, aggregate, and store time-series data during runtime. Examples of the system performance metrics it collects include CPU, memory, power, and storage.

![Darshan-LDMS Integration](newdarshanconnector.jpg)

### Main benefits

- Collects system and I/O traces at the same time in a **single file**
- Monitors and records data continuously **throughout the execution of the program** (as opposed to only aggregating the data at the end of execution)
- Captures **absolute time-series data** with high precision, which can be used to correlate application-specific events with system utilization activities
- Captures read, write, **open, close, and flush** operations
- Captures POSIX, MPI-IO, and stdio operations, distinguishing between **STDIN/STDOUT/STDERR**
- Stores data from all application processes in **JSON or CSV** format, which facilitates processing by most data analysis tools
- Requires no changes to the application code

### Output

Darshan-LDMS captures metrics describing the job, the I/O operation, and the timestamp of each event:

![Metrics](metrics.png)

An example of the data collected in CSV:


```
$ head -n5 ./darshan-ldms-output/19047177-IOR_pscratch_32_none.csv
uid,exe,job_id,rank,ProducerName,file,record_id,module,type,max_byte,switches,flushes,cnt,op,pt_sel,irreg_hslab,reg_hslab,ndims,npoints,off,len,start,dur,total,timestamp
12345,/projects/ovis/darshanConnector/apps/rhel9.7/ior/build/bin/ior,19047177,0,n1119,<STDIN>,9.22337E+18,STDIO,MET,-1,-1,-1,1,open,-1,-1,-1,-1,-1,-1,-1,0,0,0,1713585247.463272
12345,/projects/ovis/darshanConnector/apps/rhel9.7/ior/build/bin/ior,19047177,0,n1119,<STDOUT>,9.22337E+18,STDIO,MET,-1,-1,-1,1,open,-1,-1,-1,-1,-1,-1,-1,0,0,0,1713585247.463272
12345,/projects/ovis/darshanConnector/apps/rhel9.7/ior/build/bin/ior,19047177,0,n1119,<STDERR>,7.23826E+18,STDIO,MET,-1,-1,-1,1,open,-1,-1,-1,-1,-1,-1,-1,0,0,0,1713585247.463272
12345,N/A,19047177,0,n1119,N/A,9.22337E+18,STDIO,MOD,51,-1,0,1,write,-1,-1,-1,-1,-1,0,52,0.067659,0.000004,0.000004,1713585247.530934
```
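
Because the output is plain CSV, it can be read with standard tooling. The following is a minimal sketch using only the Python standard library. It embeds two of the sample rows shown above instead of reading the file, and it assumes that the `timestamp` column holds seconds since the Unix epoch. The helper name `load_events` is hypothetical and not part of Darshan-LDMS.

```python
import csv
import io
from datetime import datetime, timezone

# Header plus two rows copied from the sample output above (one open, one write).
SAMPLE = """uid,exe,job_id,rank,ProducerName,file,record_id,module,type,max_byte,switches,flushes,cnt,op,pt_sel,irreg_hslab,reg_hslab,ndims,npoints,off,len,start,dur,total,timestamp
12345,/projects/ovis/darshanConnector/apps/rhel9.7/ior/build/bin/ior,19047177,0,n1119,<STDIN>,9.22337E+18,STDIO,MET,-1,-1,-1,1,open,-1,-1,-1,-1,-1,-1,-1,0,0,0,1713585247.463272
12345,N/A,19047177,0,n1119,N/A,9.22337E+18,STDIO,MOD,51,-1,0,1,write,-1,-1,-1,-1,-1,0,52,0.067659,0.000004,0.000004,1713585247.530934
"""

def load_events(text):
    """Parse Darshan-LDMS CSV text into a list of dicts with a few typed fields."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        row["rank"] = int(row["rank"])
        row["len"] = int(row["len"])
        row["dur"] = float(row["dur"])
        row["timestamp"] = float(row["timestamp"])
        # Assumption: timestamps are seconds since the Unix epoch.
        row["time_utc"] = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
        events.append(row)
    return events

events = load_events(SAMPLE)
for e in events:
    print(e["time_utc"].isoformat(), e["module"], e["op"], e["file"], e["len"])
```

To process a real output file, the same function could be given the contents of the CSV file instead of `SAMPLE`.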


# Use cases

Let's explore some benefits of Darshan-LDMS. All examples use the same experimental setup for the [IOR benchmark](https://github.com/hpc/ior): 36 ranks, 2 iterations, a block size of 16 MB, a transfer size of 4 MB, and 32 segments, on a Lustre file system on the Eclipse system.

*$ ./ior -i 2 -b 16m -t 4m -s 32 -F -C -e -k -o /pscratch/user/iorTest/darshan*
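
As a rough sanity check on the parameters above, the arithmetic below works out how many transfers and how much data each rank would issue per iteration. It is a sketch under two assumptions that this document does not state: that IOR reads the `m` suffix as MiB, and that each transfer corresponds to one write (or read) call.

```python
MIB = 1024 * 1024

ranks = 36
iterations = 2             # -i 2
block_size = 16 * MIB      # -b 16m (assumed to mean MiB)
transfer_size = 4 * MIB    # -t 4m  (assumed to mean MiB)
segments = 32              # -s 32

# Each block is moved in transfer-sized pieces, and each rank handles one block per segment.
transfers_per_rank = (block_size // transfer_size) * segments
bytes_per_rank = block_size * segments
total_bytes = bytes_per_rank * ranks

print(transfers_per_rank)       # 128 transfers per rank per iteration
print(bytes_per_rank // MIB)    # 512 MiB per rank per iteration
print(total_bytes // MIB)       # 18432 MiB (18 GiB) across all ranks per iteration
```

If those assumptions hold, on the order of 128 write events per rank per iteration would be expected in the collected data, which gives a quick way to check that no events are missing.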


### Collecting read/write/open/closes with absolute timestamps

With the Darshan-LDMS data we can identify when opens and closes occur and how ranks behaved differently in the system at different times.

![Darshan-LDMS Integration](ior1.png) 
![Darshan-LDMS Integration](ior3.png)
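
A per-rank view like the one in the figures can be derived from the `rank`, `op`, and `timestamp` columns. The sketch below illustrates the idea with hypothetical `(rank, op, timestamp)` tuples rather than real measurements. The function name `open_close_spans` is illustrative only.

```python
from collections import defaultdict

# Hypothetical (rank, op, timestamp) tuples, for illustration only.
events = [
    (0, "open", 100.00), (0, "write", 100.10), (0, "close", 100.90),
    (1, "open", 100.05), (1, "write", 100.40), (1, "close", 101.60),
]

def open_close_spans(events):
    """Return {rank: (first_open, last_close)} from (rank, op, timestamp) tuples."""
    spans = defaultdict(lambda: [None, None])
    for rank, op, ts in events:
        if op == "open" and (spans[rank][0] is None or ts < spans[rank][0]):
            spans[rank][0] = ts
        elif op == "close" and (spans[rank][1] is None or ts > spans[rank][1]):
            spans[rank][1] = ts
    return {rank: tuple(span) for rank, span in spans.items()}

spans = open_close_spans(events)
# Report times relative to the earliest open so that ranks can be compared directly.
t0 = min(first for first, _ in spans.values())
for rank, (first_open, last_close) in sorted(spans.items()):
    print(f"rank {rank}: open at +{first_open - t0:.2f}s, close at +{last_close - t0:.2f}s")
```

Ranks whose open or close times differ noticeably from the others are natural candidates for a closer look.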

### Comparing multiple iterations of the same I/O pattern

We can also compare multiple iterations of the same tasks and identify synchronization points not caused by writes or reads.

![Darshan-LDMS Integration](ior-repetitions.png) 
![Darshan-LDMS Integration](ior-repetitions2.png)

### Correlating timeseries data with unexpected system behavior

We can identify unexpected behavior in real time and see its impact on each individual event, as opposed to relying on data aggregated at the end of the execution. Absolute timestamps can be used to correlate I/O events with other system metrics and identify bottlenecks at a deeper level:

![Stressors](stressors.png) 
![Stressors](stressors2.png) 
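
One simple way to line up I/O events with system metrics is to attach to each event the most recent system sample taken at or before the event's timestamp. The sketch below shows this with the standard `bisect` module. All values are hypothetical, and the function name `latest_sample_at` is illustrative and not part of Darshan or LDMS.

```python
import bisect

# Hypothetical data, for illustration only.
# System samples: (timestamp, cpu_utilization_percent), sorted by timestamp.
cpu_samples = [(1000.0, 12.0), (1001.0, 15.0), (1002.0, 93.0), (1003.0, 20.0)]
# I/O events: (timestamp, op, duration_seconds).
io_events = [(1000.2, "write", 0.004), (1002.1, "write", 0.350)]

def latest_sample_at(samples, ts):
    """Return the value of the most recent sample taken at or before ts, or None."""
    times = [t for t, _ in samples]
    i = bisect.bisect_right(times, ts) - 1
    return samples[i][1] if i >= 0 else None

# Annotate every I/O event with the system metric observed at that moment.
annotated = [(ts, op, dur, latest_sample_at(cpu_samples, ts)) for ts, op, dur in io_events]
for ts, op, dur, cpu in annotated:
    print(f"{ts:.1f} {op} took {dur:.3f}s with CPU at {cpu}%")
```

In this made-up example the slow write coincides with the CPU spike, which is the kind of relationship that absolute timestamps make visible.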


### Displaying results in dashboards

Results can be shared in real time in dashboards such as Grafana.

![Dashboard](dashboard.png) 



# Supporting material

### Video tutorial 

- [Darshan-LDMS introduction and examples](x)
- [IOR Demonstration](https://drive.google.com/file/d/13KTiYS-uq81jH0zdSaCA8_z6Ql-DV-uI/view?usp=sharing): we showcase running an IOR application on a Sandia HPC machine, with the collected I/O data visualized in real time on a Grafana dashboard.
- Installation and collection in an AWS cloud instance for [single node](https://drive.google.com/file/d/1xFmOxJpRhOOWyEAMkv6fxEGIFoTA4_YZ/view?usp=sharing) and [multi-node](https://drive.google.com/file/d/1kucLEIjtf3sB74HQ26iXd71TRH37eAOQ/view?usp=sharing).

### Others
- Documentation: https://ovis-hpc.readthedocs.io/projects/ldms/en/latest/streams/ldms_stream_apps.html#darshan
- Available in Darshan>=3.4.5: https://www.mcs.anl.gov/research/projects/darshan/download/
- Cite: *S. Walton, O. Aaziz, A. L. V. Solórzano and B. Schwaller, ["LDMS Darshan Connector: For Run Time Diagnosis of HPC Application I/O Performance"](https://ieeexplore.ieee.org/abstract/document/9912673), 2022 HPCMASPA Workshop, IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, 2022*

***
_In collaboration between:_

_- Northeastern University: Ana Solórzano, Devesh Tiwari_

_- Sandia National Laboratories: Sara Walton, Benjamin Schwaller, Jim M. Brandt, Evan Donato, Jen Green_