
Multi-threading issues #222

Open
sophiemiddleton opened this issue Dec 10, 2022 · 12 comments

@sophiemiddleton
Collaborator

Please look through the Production fcl files for inappropriate uses of multi-threading.

We should only use multi-threading when G4 dominates the CPU time in the job. Consider the example that Alessandro showed on Wednesday, which is based on

Production/JobConfig/pileup/MuStopPileup.fcl

This uses 2 threads and 2 schedules. Please set them both to 1.
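
A minimal sketch of that change, assuming the usual art scheduler parameters (the exact parameter names, and where MuStopPileup.fcl actually sets them, may differ between art versions and fcl files):

```
# Sketch only: force single-threaded running for this mixing-dominated job.
services.scheduler.num_threads:   1
services.scheduler.num_schedules: 1
```

The same two parameters, left at 2, are what make sense for jobs where G4 dominates the CPU time.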

This job is dominated by the time spent in PileupPath:TargetStopResampler:ResamplingMixer, an average of 2.3 seconds per event out of 2.4 total seconds per event. The time spent in G4 is only 0.2 seconds per event. The only module in this job that is parallelizable is the G4 module (and maybe the event generator), so art serializes everything else. The net result is that when an instance of ResamplingMixer is running, it blocks the other thread until it completes.

If you run with 1 thread the job completes in some amount of wall clock time. If you run with 2 threads it completes in very slightly less wall clock time but it is using 2 CPUs, not one. Each CPU is idle half of the time.

Let me know if you have any questions.

Thanks,

Rob

sophiemiddleton self-assigned this Dec 10, 2022
@brownd1978
Collaborator

brownd1978 commented Dec 10, 2022 via email

@rlcee
Collaborator

rlcee commented Dec 11, 2022

I thought the way this should work is that when MT is turned on, only modules that are explicitly declared as MT-ready can be run in MT mode. Geant is our only MT-ready module; all others are declared legacy or default to legacy. The Geant MT is reproducible. Therefore this MT job should be reproducible. Which statement is wrong?

@kutschke
Collaborator

The most important thing is that Dave is right when he says that repeatability in the sense we are talking about here does not affect physics quality.

I hope that the following will answer the other questions in both Ray's and Dave's posts.

Both are right that Mu2eG4MT_module is our only MT-ready module. art assumes that all legacy modules are unsafe to run in parallel, so it serializes execution of those. It assumes that it is safe to run any number of MT-ready modules in parallel. It also assumes that it is safe to run any one legacy module in parallel with any number of MT-ready modules.

Consider a job with 2 threads. If thread 1 is running a legacy module and thread 2's next task is also a legacy module, then thread 2 blocks until the module on thread 1 is finished. If thread 1 is running a legacy module and thread 2's next task is an MT-ready module, then the two threads will run in parallel. If thread 1 is running an MT-ready module and thread 2's next task is any module, MT-ready or not, then the module on thread 2 will run. That covers the full 2x2 matrix of possibilities. If you go to 3 or more threads the analysis is similar: only one legacy module can be running at a time, and any number of MT-ready modules may be running in parallel with it.

The other important point is that there is some non-determinism from the scheduling algorithms and from race conditions between threads.

I am not 100% sure about the legacy/MT status of the input and output modules, but I do know that they can be active in only one thread at a time. They might have internal locking or they might rely on art's locking of legacy modules. In any case, they are not truly MT; that's driven by limitations of ROOT I/O.

In a typical stage 1 job the CPU time is dominated by the G4 module, often >90% of the time, and the art schedule has the form: (source module, some legacy modules, G4MT module, some more legacy modules, output module(s)). Most of the time the job will be executing two G4 threads in parallel. One of the threads will finish with G4; most of the time it will run through the rest of its schedule, start the schedule for the next event and re-enter G4. All of this time the other thread was busy in G4.
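
To make that schedule shape concrete, here is a hypothetical stage-1 layout in fcl; apart from Mu2eG4MT, the module labels and module_type names are illustrative rather than taken from Production:

```
# Illustrative sketch: which parts of a stage-1 schedule art can overlap.
source : { module_type : RootInput }              # input: effectively serialized (ROOT I/O)

physics : {
  producers : {
    generate : { module_type : SomeGenerator }    # legacy: art serializes it
    g4run    : { module_type : Mu2eG4MT }         # MT-ready: may run in parallel across schedules
    makeHits : { module_type : SomeDigitizer }    # legacy: art serializes it
  }
  simPath       : [ generate, g4run, makeHits ]
  trigger_paths : [ simPath ]
  outPath       : [ fullOutput ]
  end_paths     : [ outPath ]
}

outputs : {
  fullOutput : { module_type : RootOutput fileName : "sim.art" }  # output: effectively serialized
}
```

With two schedules, two events move through simPath at the same time, but only the two g4run instances can genuinely overlap; everything else is serialized as described above.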

From time to time both threads will be out of G4 and running through the other modules in their schedules. When that happens the legacy modules on the two threads will block each other. Roughly speaking, the threads will alternate modules until one of them gets back into G4. During this period, execution speed drops to nominally 50%. (I glossed over the fact that the order of execution of modules is not guaranteed to strictly alternate between threads: you may get 2 modules from thread 1 followed by one module from thread 2 and then back to thread 1.)

So that's the thread/schedule mechanics. How do we break the sequence of random numbers?

Depending on race conditions, events are not guaranteed to arrive at any particular legacy module in the same order on every run. When the order changes, the sequence of random numbers breaks.

We do reseed G4 every event in a deterministic way based on art::EventID. Issue Offline#849 (Mu2e/Offline#849) discusses seeding all modules this way. This would fix the non-repeatability that Alessandro found. Aside: I misread Ray's analysis the first time: in his example it would add 0.25% to the time to process an event; I agree that we can tolerate that (I had misread it as 25%, which would not be acceptable; sorry for the confusion this caused).

In the job that Alessandro commented on, G4 is only a tiny fraction of the total CPU time, so the job spends most of its wall clock time with one thread active and one blocked. It would be best to run it single-threaded.

I have not thought carefully about the intermediate case where we spend maybe 50% or 60% of the time with 2 threads both running G4. I bet that there is no clean optimal answer; I expect that it will depend on the properties of the jobs that other experiments are running.

Let me know if I missed anything in the earlier questions.

@rlcee
Collaborator

rlcee commented Dec 12, 2022

The fact that I was missing was that there is a legacy module with a random seed following the G4 module. In that case I see the problem and agree it is the same as the 849 issue. Thanks for the thorough explanation. I see you commented about the random re-seed time. If that cost is OK, then maybe I can push that issue forward soon.

@brownd1978
Collaborator

brownd1978 commented Dec 12, 2022 via email

@sophiemiddleton
Collaborator Author

Hi everyone, I'm getting back to fixing these issues now. What is the status here?

@kutschke
Collaborator

kutschke commented Jan 5, 2023 via email

@brownd1978
Collaborator

brownd1978 commented Jan 5, 2023 via email

@kutschke
Collaborator

kutschke commented Jan 5, 2023

> Hi Rob, Cori is being decommissioned this month, the replacement is Perlmutter. I haven’t yet tried to use Perlmutter but it is supposedly very similar. Dave

Thanks Dave,

Perlmutter's CPU-only nodes have 128 cores and 512 GB of memory, so 4 GB/core.  That's a great fit for our jobs. 

I looked up the specs: https://docs.nersc.gov/systems/perlmutter/architecture/ . I had thought that Perlmutter was intended to be mostly GPUs, but I see that the design is that most of the nodes are dual CPU with no GPUs. But there are indeed many nodes with 1 CPU plus 4 GPUs.

The GPU nodes also have 4 GB/core, but I imagine we would rarely, if ever, be scheduled on those nodes since we have no code that can use the GPUs.

In the future, we could target AI/DL training for the GPU nodes.

  Rob

@rlcee
Collaborator

rlcee commented Jan 5, 2023

> Cori is being decommissioned this month, the replacement is Perlmutter.

When we submit, we only say "site=NERSC", and where we land is determined by the agreements between computing and NERSC and/or matching of the job ads. So I think that to understand what is happening, we would need to talk to computing.

@kutschke
Collaborator

kutschke commented Jan 5, 2023

I have a conversation ongoing with Steve Timm and will summarize here when it converges.

@kutschke
Collaborator

kutschke commented Jan 5, 2023

Steve says that each core is hyperthreaded, so there are 256 logical cores per node and 2 GB per logical core. So our G4 jobs should continue to use 2 threads and 2 schedules for memory reasons (a sketch of those settings follows the quoted emails). Below is the rest of the thread:

Hi Rob -- I already made a Perlmutter entry for mu2e when I set it up; actually there are two entries,
one for CPU and one for GPU. Actually the CPU nodes have 256 cores each. The one challenge is that
you have only a 12-hour queue limit on Perlmutter; that will eventually go up.
As soon as the new FIFE allocation kicks in on Jan. 18 we will be glad to get you started.
There will be a different DESIRED_Sites field to set in jobsub and everything else should be the same.
Actually Perlmutter is already in "Production" and has been for a couple of months, but the startup has been
shaky to say the least. But when it's running it is a very nice machine.

Steve Timm

From: Robert K Kutschke <kutschke@fnal.gov>
Sent: Thursday, January 5, 2023 10:19 AM
To: Andrew John Norman <anorman@fnal.gov>; Steven C Timm <timm@fnal.gov>
Subject: About NERSC

Hi Guys,

Sophie Middleton is back from vacation and is restarting development towards running our next sim campaign at NERSC. I just learned that Cori will be shut down in a few weeks and that Perlmutter will soon be in production.

I checked the Perlmutter specs and see that their nodes mostly have 128 CPU cores, no GPUs, and 4 GB/core. So this is a good match to our needs; even better than a typical grid node. See https://docs.nersc.gov/systems/perlmutter/architecture/ .

What’s the status of Perlmutter access via HepCloud? Our jobs do not have code that can use the GPUs; I presume that there is a way to advertise that?

  Rob
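
For completeness, a minimal sketch of the contrasting scheduler setting for a G4-dominated stage-1 job, given the roughly 2 GB per hyperthreaded core quoted above (same assumption as earlier about the art scheduler parameter names):

```
# Sketch only: where G4 dominates the CPU time, keep two threads and two
# schedules so the job fits within the ~2 GB per logical core on Perlmutter.
services.scheduler.num_threads:   2
services.scheduler.num_schedules: 2
```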
