Set up monitoring and automatic tools
This issue tracks the implementation of monitoring tools, automatic conversion scripts, and similar automation.
The approach is to run cron jobs locally that query the login nodes over multiplexed tmux. Admins usually cancel tmux sessions running non-whitelisted commands after a certain period of time. Because we run "headless" (essentially only an sshd process, which is assumed to be whitelisted), the process should not be killed on the login node.
Note: do not run this on a shared local machine, because it uses ssh credentials to log in to the HPC.
Some HPCs do not have internet access on the compute nodes, and MareNostrum5 does not even have it on the login nodes. Because the cron jobs run locally and only need ssh access to a login node, this approach should generalize to any such HPC cluster.
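The approach described above can be sketched roughly as follows. This is only an illustration, not the code from the repository: it assumes the multiplexing is done with OpenSSH connection sharing (`ControlMaster`, `ControlPath`, `ControlPersist`), which this issue does not confirm. The host alias `hpc-login`, the function names, and the Slurm command in the usage comment are hypothetical placeholders.

```python
import subprocess
from pathlib import Path


def build_ssh_command(host: str, remote_cmd: str, control_dir: str = "~/.ssh") -> list[str]:
    """Build an ssh invocation that reuses one shared master connection.

    Assumption: OpenSSH connection sharing is what keeps only a single
    sshd process alive on the login node.
    """
    control_path = str(Path(control_dir).expanduser() / "cm-%r@%h:%p")
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt; a cron job has no terminal
        "-o", "ControlMaster=auto",          # open a master connection if none exists
        "-o", f"ControlPath={control_path}", # socket shared by later invocations
        "-o", "ControlPersist=10m",          # keep the master alive between cron runs
        host,
        remote_cmd,
    ]


def query_login_node(host: str, remote_cmd: str, timeout: float = 60.0) -> str:
    """Run one command on the login node and return its standard output."""
    result = subprocess.run(
        build_ssh_command(host, remote_cmd),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if ssh or the remote command fails
    )
    return result.stdout


# Hypothetical usage from a script started by cron (assumes a Slurm cluster):
#   print(query_login_node("hpc-login", "squeue --me --noheader"))
```

Since the script runs locally, it can forward the result to any service that needs internet access, while the login node only ever sees an incoming ssh connection.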
The current setup runs on Thomas's Raspberry Pi 5.
Codebase is here: https://github.com/OpenEuroLLM/oellm-monitoring
Tools