-
Notifications
You must be signed in to change notification settings - Fork 0
AdamDorwart/HPLdiag
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This directory contains everything that you'll need to run a single node HPL problem on Comet using 12 MPI processes and 2 threads per process. log/ Stores simplified logs of test results results/ Stores indivdual job stdout/err files compute/HPL.dat HPL input data for CPU tests compute/hpl_1n_12p_2t.comet Slurm batch script for HPL benchmarks on Comet compute nodes compute/xhpl.cpu HPL executable for hybrid (MPI+OpenMP) execution compute/ssd-test.fio FIO input data used for SSD test compute/fio Flexible IO tester. Used for SSD benchmarks gpu/HPL.dat HPL input data for GPU tests (larger problem size for longer runtime) gpu/hpl_1n_12p_4g_2t.comet Slurm batch script for HPL benchmarks on Comet gpu nodes gpu/xhpl.gpu HPL executable for CUDA GPU execution hosts List of Comet nodes with partitions HPLdiag.sh Runs diagnostics on all nodes in a provided input file (see hosts) and puts the results in log/HPLresults.MMDDYY HPLcancel.sh Cancels all running tests for the current $USER. Logs actions in log/HPLresults.MMDDYY sendStatusEmail.sh Sends a status email on the health of the system given an input log file (located in logs), uses statusMailingList for recipients statusEmail.txt The template used for sendStatusEmail script statusMailingList Line deliminated file of emails to send status reports to To run a diagnostic across all of comet use the provided hosts file like so $ ./HPLdiag.sh hosts To run a diagnostic on particular nodes and partitions $ cat > smallTest comet-xx-yy compute comet-ww-zz compute comet-ii-jj shared $ ./HPLdiag.sh smallTest 'hosts' file can be generated with sinfo -o "%n %R" -h | column -t | sort -u -k1,1 > hosts The end of the file will sometimes contain a line for a 'fat' partition. Remove it as needed. TODO: For GPU nodes, use combination of HPL and AMBER For SSD tests, look into IOR and FIO. Note that drive performance degrades as sectors get marked for TRIM and need to be periodically wiped to achevie full speed Create Health Moniter cronjob to manage automated benchmarks Analyze duration times of results Better input handling for HPLdiag - Accept pipping or input file - Options to run specific tests Useful commands: Get the status of all nodes currently enqueued sinfo -o "%n %T %E" -n `squeue -h -o %n -u $USER | sed ':a;N;$!ba;s/\n/,/g'` Create a hosts retest file from the TIMEOUTs in a result file grep HPL-*-TIMEOUT log/HPLresults.$timestamp | awk '{print $4}' | grep -f - hosts > retestHosts
About
A Cron job distributed system monitor. Performs periodic diagnostics using the High Performance Linpack benchmark to test for abnormal behavior in an HPC system.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published