Set up monitoring and automatic tools for model training #340

Description

@tvosch

Set up monitoring and automatic tools

This issue tracks the implementation of various monitoring tools, automatic conversion scripts, and similar utilities.
The approach is to run cron jobs locally that query the cluster over multiplexed tmux on the login nodes. Admins usually cancel tmux sessions running non-whitelisted commands after a certain period of time. Because we run "headless" (essentially only an sshd process, which is assumed to be whitelisted), the process should not be killed on the login node.
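
To illustrate the idea of a local job querying a login node over a reused connection, here is a minimal sketch. It is not taken from the oellm-monitoring codebase. It assumes OpenSSH connection multiplexing (`ControlMaster`, `ControlPath`, `ControlPersist`) as the reuse mechanism, and the host alias `hpc-login`, the helper names, and the remote command `hostname` are hypothetical placeholders.

```python
import os
import subprocess


def build_ssh_command(host, remote_command, control_dir="~/.ssh"):
    """Build an ssh invocation that reuses one multiplexed connection.

    Assumption: OpenSSH client options are used for multiplexing; the
    actual tooling in the repository may work differently.
    """
    # %r, %h and %p are expanded by ssh to remote user, host and port.
    control_path = os.path.join(os.path.expanduser(control_dir), "cm-%r@%h:%p")
    return [
        "ssh",
        "-o", "BatchMode=yes",          # never prompt; cron has no terminal
        "-o", "ControlMaster=auto",     # reuse the master connection if present
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=10m",     # keep the master open between runs
        host,
        remote_command,
    ]


def query_cluster(host, remote_command, timeout=60):
    """Run one short remote command and return (exit code, stdout)."""
    result = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout


# Hypothetical usage from a script started by cron:
#   code, out = query_cluster("hpc-login", "hostname")
```

A crontab entry on the local machine would then call a script like this at a fixed interval, so nothing long-running has to live on the login node itself.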

Note that this should not run on a shared machine, because it uses SSH credentials to log into the HPC.
Some HPCs have no internet access on the compute nodes, and MareNostrum5 does not even have internet access on the login nodes. Because all queries are initiated from the local machine, this approach should generalize to any HPC cluster that can be reached over SSH.

The current setup runs on Thomas's Raspberry Pi 5.
The codebase is here: https://github.com/OpenEuroLLM/oellm-monitoring

Tools

Status: In Progress
