Multi-threading issues #222
Comments
Good job figuring this out, Rob. The reason the pileup job is multithreaded is that it invokes G4. We decided at the beginning of MDC2020 to multithread all the G4 jobs; at that time we thought this was safe. Note that irreproducibility doesn't affect the physics quality of the output.
Do you know what module is inappropriate for multi-threading? I thought all modules that can't support multithreading were run sequentially, so I don't understand how this job is irreproducible. Is it coming from the G4 module itself? If so, does this point to a problem in how multithreaded G4 jobs get their seeds?
On Fri, Dec 9, 2022 at 19:27 Sophie Middleton wrote:
Please look through the Production fcl files to look for inappropriate uses of multi-threading.
We should only use multi-threading when G4 is the dominant CPU time in the job. Consider the example that Alessandro showed on Wednesday, which is based off of
Production/JobConfig/pileup/MuStopPileup.fcl
This uses 2 threads and 2 schedules. Please set them both to 1.
This job is dominated by the time spent in PileupPath:TargetStopResampler:ResamplingMixer, an average of 2.3 seconds per event out of 2.4 total seconds per event. The time spent in G4 is only 0.2 seconds per event. The only module in this job that is parallelizable is the G4 module (and maybe the event generator). So art serializes everything else. The net result is that when an instance of ResamplingMixer gets in, it blocks the other thread until it completes.
If you run with 1 thread the job completes in some amount of wall clock time. If you run with 2 threads it completes in very slightly less wall clock time but it is using 2 CPUs, not one. Each CPU is idle half of the time.
Let me know if you have any questions.
Thanks,
Rob
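For reference, a minimal sketch of the change being requested, assuming the thread and schedule counts are set through the standard art scheduler parameters (the exact lines, and whether they live in MuStopPileup.fcl itself or in a file it includes, may differ):

```
# Illustrative only: run this resampling job single-threaded, since almost all of
# the per-event CPU time is in the serialized ResamplingMixer rather than in G4.
services.scheduler.num_threads   : 1
services.scheduler.num_schedules : 1
```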
--
David Brown, Lawrence Berkeley National Lab
|
I thought the way this should work is that when MT is turned on, only modules that are explicitly declared as MT-ready can be run in MT mode. Geant is our only MT-ready module; all others are declared legacy or default to legacy. The Geant MT is reproducible. Therefore this MT job should be reproducible. Which statement is wrong? |
The most important thing is that Dave is right when he says that repeatability in the sense we are talking about here does not affect physics quality. I hope that the following will answer the other questions in both Ray's and Dave's posts.
Both are right that Mu2eG4MT_module is our only MT-ready module. art assumes that all legacy modules are unsafe to run in parallel, so it serializes execution of those. It assumes that it is safe to run any number of MT-ready modules in parallel. It also assumes that it is safe to run any one legacy module in parallel with any number of MT-ready modules.
Consider a job with 2 threads. If thread 1 is running a legacy module and thread 2's next task is also a legacy module, then thread 2 blocks until the module on thread 1 is finished. If thread 1 is running a legacy module and thread 2's next task is an MT-enabled module, then the two threads will run in parallel. If thread 1 is running an MT-enabled module and thread 2's next task is any module, MT-enabled or not, then the module on thread 2 will run. That covers the full 2x2 matrix of possibilities. If you go to 3 or more threads the analysis is similar: only one legacy module can be running at a time, and any number of MT-ready modules may be running in parallel with it.
The other important point is that there is some non-determinism in the scheduling algorithms and in race conditions between threads.
I am not 100% sure about the legacy/MT status of the input and output modules, but I do know that they can be active in only one thread at a time. They might have internal locking or they might rely on art's locking of legacy modules. In any case, they are not true MT; that's driven by limitations of ROOT IO.
In a typical stage 1 job, the CPU time is dominated by the G4 module, often >90% of the time, and the art schedule has the form: (source module, some legacy modules, G4MT module, some more legacy modules, output module(s)). Most of the time the job will be executing two G4 threads in parallel. One of the threads will finish with G4; most of the time it will run through the rest of its schedule, start the schedule for the next event, and re-enter G4, while the other thread has been busy in G4 all along. From time to time both threads will be out of G4 and running through the other modules in their schedules. When that happens the legacy modules on the two threads will block each other. Roughly speaking, the threads will alternate modules until one of the threads gets back to G4. During this period, execution speed drops to nominally 50%. (I glossed over the fact that the order of execution of modules is not guaranteed to strictly alternate between threads - you may get 2 modules from thread 1 followed by one module from thread 2 and then back to thread 1.)
So that's the thread/schedule mechanics. How do we break the sequence of random numbers? Depending on race conditions, events are not guaranteed to arrive at any particular legacy module in the same order on every run. When that happens the sequence of random numbers breaks.
We do reseed G4 every event in a deterministic way based on art::EventID. Issue Offline#849 (Mu2e/Offline#849) discusses seeding all modules this way. This would fix the non-repeatability that Alessandro found. Aside: I misread Ray's analysis the first time: in his example it would add 0.25% to the time to process an event; I agree that we can tolerate that (I had misread it as 25%, which would not be acceptable - sorry for the confusion this caused).
In the job that Alessandro commented on, G4 is only a tiny fraction of the total CPU time, so the job spends most of its wall clock time with one thread active and one blocked. It would be best to run it single threaded. I have not thought carefully about the intermediate case where we spend maybe 50% or 60% of the time with both threads running G4. I bet that there is no clean optimal answer; I expect that it will depend on the properties of the jobs that other experiments are running. Let me know if I missed anything in the earlier questions. |
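As a rough back-of-the-envelope reading of the per-event times quoted in the issue (an illustrative estimate, not a new measurement from the thread):

```
1 thread : wall clock per event ~ total CPU per event ~ 2.4 s, with 1 CPU busy.
2 threads: the two ResamplingMixer instances cannot overlap, so two in-flight events
           still need ~ 2 x 2.3 s = 4.6 s of serialized mixer time; only the ~0.2 s
           of G4 per event can hide behind the other thread's mixer.
           Wall clock per event ~ 2.3 s, only slightly better, while 2 CPUs are
           reserved and each sits idle roughly half the time.
```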
The fact that I was missing was that there is a legacy module with a random seed following the G4 module. In this case I see the problem and agree it is the same as the 849 issue. Thanks for the thorough explanation. I see you commented about the random re-seed time. If that's OK, then maybe I can push that issue forward soon. |
Hi Rob,
On Sun, Dec 11, 2022 at 15:20 Rob Kutschke wrote:
So that's the thread/schedule mechanics. How do we break the sequence of random numbers? Depending on race conditions, events are not guaranteed to arrive at any particular legacy module in the same order on every run. When that happens the sequence of random numbers breaks.
Pileup is a resampling job, so there is no input event, and I don't fully follow your logic. I guess you are saying that any random number use in a sequential module in a G4 MT job will break reproducibility. That raises the question: if we precompute random numbers and run resampling with a 'rnd' input dataset (as we discussed at the production workshop), will that solve this problem? Or does it require a deeper fix? Note that we run every G4 job except POT as a resampler.
The simple fix is to update the global production G4 setting to not multithread. I will put in that PR today. If we decide to run MT just for POT we can do that in the POT job config.
Dave
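For concreteness, a sketch of what that split could look like, again assuming the standard art scheduler parameters; the comments below about where each fragment would live are placeholders, not the actual Production file names:

```
# Illustrative only.
# In the shared production configuration (placeholder location): default to single-threaded.
services.scheduler.num_threads   : 1
services.scheduler.num_schedules : 1

# In the POT job configuration only (placeholder location), where G4 dominates the
# CPU time, override back to two threads and two schedules:
services.scheduler.num_threads   : 2
services.scheduler.num_schedules : 2
```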
|
Hi Everyone, I am getting back to fixing these issues now. What is the status here? |
Hi Sophie,
Can you point me to the fcl that you will use for the campaign? Can you remind me whether you plan to run both stage 1 and stage 2? Also, if running at NERSC, do you plan to submit to Cori I or Cori II (i.e. the big-core machines or KNL)?
Rob
|
Hi Rob,
Cori is being decommissioned this month; the replacement is Perlmutter. I haven't yet tried to use Perlmutter, but it is supposedly very similar.
Dave
|
Thanks Dave,
The GPU nodes also have 4 GB/core, but I imagine we would rarely, if ever, be scheduled on those nodes since we have no code that can use the GPUs.
|
When we submit, we only say "site=NERSC", and where we land is determined by the agreements between computing and NERSC and/or matching of the job ads. So I think that to understand what is happening, we would need to talk to computing. |
I have a conversation ongoing with Steve Timm and will summarize here when it converges. |
Steve says that each core is hyperthreaded, so there are 256 logical cores per node and 2 GB/core. So our G4 jobs should continue to use 2 threads and 2 schedules for memory reasons. Below is the rest of the thread:
|