Figure out how to do multi-node jobs and to compute the load of multi-node systems #6
I think we already have what we need to compute cross-node utilization (your first point), but sonar does not currently capture any data about communication (your second point), be it volume or topography. It is a sampling profiler, and its only means of sampling is to probe system tables. (I'll add communication volume to the set of use cases.)
Technical quirk: with the synthesized job IDs (as on the ML nodes) there's a risk that the same PID is used as the job ID on two different machines in an overlapping timeframe, even though these are two different jobs. It's important for sonalyze not to be confused by this. In the case where we're interacting with a batch queue, I think there will be a command line argument to sonalyze that identifies the system as such, e.g. by pointing to a data directory. The default, in the absence of such a switch, should be to treat hosts as independent. In a query that runs against the logs of multiple hosts, the same job ID may thus appear multiple times in a listing, but it is always relative to the host; the consumer of the data must be aware of this.
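The host-relative treatment above can be sketched as keying jobs by the pair (hostname, job ID) rather than by job ID alone. This is a minimal illustration, not sonalyze's actual code; the record type and field names are assumptions.

```rust
use std::collections::HashMap;

// Hypothetical sample record; the field names are assumptions for
// illustration, not sonar's actual schema.
#[derive(Debug)]
struct SampleRecord {
    hostname: String,
    job_id: u32,
    cpu_pct: f64,
}

/// Group samples so that the same synthesized job ID observed on two
/// different hosts is treated as two distinct jobs.
fn group_by_job(records: Vec<SampleRecord>) -> HashMap<(String, u32), Vec<SampleRecord>> {
    let mut jobs: HashMap<(String, u32), Vec<SampleRecord>> = HashMap::new();
    for r in records {
        jobs.entry((r.hostname.clone(), r.job_id)).or_default().push(r);
    }
    jobs
}

fn main() {
    // PID 4242 reused as a job ID on two hosts in the same timeframe:
    // two entries, not one.
    let recs = vec![
        SampleRecord { hostname: "ml1".to_string(), job_id: 4242, cpu_pct: 50.0 },
        SampleRecord { hostname: "ml2".to_string(), job_id: 4242, cpu_pct: 75.0 },
    ];
    let jobs = group_by_job(recs);
    println!("distinct jobs: {}", jobs.len());
}
```

With a batch queue, the key would instead collapse to the queue's global job ID, which is what the proposed command line switch would select.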
This is pretty much done now; I'm just doing some final testing and will then merge. I'll cut NordicHPC/sonar#67 loose; it doesn't need to block this bug and can come later. There are other mop-up issues too, like #54, but again, not really blocking us here.
Fixed, for now. We'll file additional things as follow-up bugs.
For the ML and light-HPC systems there's at most one node per job, but this is not true on the bigger systems: there, jobs can span multiple nodes. The sonar records will carry the same job ID (these are SLURM jobs), so we'll collect records into jobs properly. But there's the matter of filtering and printing the node names sensibly, as well as computing and presenting the cross-node load. For system-relative load data we must also, in some way, account for the capacities of the individual nodes when computing proper values; it's not enough to sum things and hope for the best.
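To make the last point concrete, here is a minimal sketch of a capacity-weighted cross-node load: sum usage and capacity over the job's nodes and divide, rather than averaging per-node percentages, so that nodes of different sizes are weighted correctly. The type and field names are assumptions for illustration, not sonalyze's actual code.

```rust
// Hypothetical per-node figures for one job; names are assumptions.
struct NodeUsage {
    cores_used: f64,  // average cores kept busy by the job on this node
    cores_total: f64, // cores available on this node
}

/// System-relative load of a multi-node job. Summing before dividing
/// weights each node by its capacity; averaging the per-node ratios
/// would overweight small nodes.
fn cross_node_load(nodes: &[NodeUsage]) -> f64 {
    let used: f64 = nodes.iter().map(|n| n.cores_used).sum();
    let total: f64 = nodes.iter().map(|n| n.cores_total).sum();
    if total == 0.0 { 0.0 } else { used / total }
}

fn main() {
    // A job spanning one 64-core node (half busy) and one 32-core node
    // (half busy): 48 of 96 cores in use overall.
    let nodes = [
        NodeUsage { cores_used: 32.0, cores_total: 64.0 },
        NodeUsage { cores_used: 16.0, cores_total: 32.0 },
    ];
    println!("load = {:.2}", cross_node_load(&nodes));
}
```

The same shape works for memory or GPU load; only the numerator and denominator change.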
Evolving task list: