Incorrect derivatives when using bcasts and parallel groups connected to outer components #2883
Comments
Thanks @naylor-b, that does seem to fix the hang, and check_totals shows the correct derivatives for the issue #2884 example. The #2883 example doesn't get stuck now, but the derivatives are the same as before (e.g. 6's instead of 2's when using 10 processors, and 21's vs. 2's on 40). I assume you were referring to the other example showing correct derivatives, and not this example?
Ah, sorry, I was just looking at the 2-processor case. I'll dig into this.
Ah I see--thanks. Yeah, for this example, derivatives are correct with 1 processor per parallel group; they then increase by 1 per processor per subsystem. I've seen another, more complicated case that also scaled linearly with the number of processors, as well as one that didn't.
I played around with your example, and by removing the bcast of d_inputs the derivatives come out correct.
Yeah, I had noticed that it doesn't happen without bcasts... I guess even though d_outputs['mass'] varies across processors (unlike in a serial group), the resulting total derivative is still fine. Unfortunately our aero/structural solvers rely on bcasts in quite a few places, in both implicit and explicit components. That way our FEM can be run on a subset of processors, for example, since it doesn't make sense to use the same number as our CFD. I did try tweaking our VLM solver (run on only the first rank) knowing this fix, but haven't gotten that working so far (its use of implicit components seems to complicate things).
For what it's worth, I found a while back that this "fix" worked for the aero/structural cases I was trying to run:

d_inputs['dv_struct'] = self.comm.allreduce(d_outputs['mass'], op=MPI.SUM) / self.comm.size

I just needed to do this with the d_outputs of whatever final quantity of interest was being computed (e.g. aero coefficients, KS stress aggregate, structural mass, etc.); no changes were necessary in upstream components like the implicit aero/structural solvers.
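To show where that line would sit, here is a minimal sketch, assuming a matrix-free ExplicitComponent in which mass is just the sum of dv_struct (the component name, shapes, and that trivial function are placeholders, not the actual aero/structural code):

```python
from mpi4py import MPI
import numpy as np
import openmdao.api as om


class MassComp(om.ExplicitComponent):
    """Toy component: mass = sum(dv_struct), so d(mass)/d(dv_struct) = ones."""

    def setup(self):
        self.add_input('dv_struct', shape=3)
        self.add_output('mass', 0.0)

    def compute(self, inputs, outputs):
        outputs['mass'] = np.sum(inputs['dv_struct'])

    def compute_jacvec_product(self, inputs, d_inputs, d_outputs, mode):
        if mode == 'rev' and 'mass' in d_outputs and 'dv_struct' in d_inputs:
            # Average the reverse seed over the comm instead of using the
            # (possibly rank-dependent) local value of d_outputs['mass'].
            seed = self.comm.allreduce(d_outputs['mass'], op=MPI.SUM) / self.comm.size
            d_inputs['dv_struct'] += np.ones(3) * seed
```

The idea is simply to make the reverse seed consistent across ranks before applying it; dividing the allreduce sum by comm.size recovers the value a serial run would see whenever the ranks already agree.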
I think this issue is fixed now on my branch.
This example works as well. I also tested something similar with the MPhys example (https://github.com/Asthelen/mphys/blob/e20877696e9f280b9b56fe469e93ef5c312cd9c9/examples/aerostructural/supersonic_panel/as_opt_parallel.py), which worked too. However, that did hang when I tried this check_totals:
with this DummyOuterComp setup:
When I comment out the first C_L constraint, it doesn't hang. Maybe this is similar to #2911, which unfortunately isn't fixed.
Description
When running our model with parallel groups instead of serial groups, derivatives were observed to be incorrect for outputs that were functions of both a parallel group subsystem and an outer subsystem. Output derivatives that only depend on one of the two are correct. Derivatives are also correct when the parallel group subsystem only uses one processor.
When looking at a simpler example, this issue seems to stem from a combination of inconsistent d_outputs values across processors (within the parallel group subsystem's get_jacvec_product function) and a bcast of the computed d_inputs (which may be necessary when a solver/analysis only exists on a subset of ranks).
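For illustration, here is a minimal sketch of that pattern (not the attached example; the component name, shapes, and the trivial mass = sum(dv_struct) function are assumed), where only rank 0 of the parallel group subsystem does the work and bcasts the reverse-mode result:

```python
import numpy as np
import openmdao.api as om


class SubsetSolverComp(om.ExplicitComponent):
    """Stand-in for a solver that only really runs on rank 0 of its comm."""

    def setup(self):
        self.add_input('dv_struct', shape=3)
        self.add_output('mass', 0.0)

    def compute(self, inputs, outputs):
        # Only rank 0 "solves"; the other ranks receive the result via bcast.
        mass = np.sum(inputs['dv_struct']) if self.comm.rank == 0 else None
        outputs['mass'] = self.comm.bcast(mass, root=0)

    def compute_jacvec_product(self, inputs, d_inputs, d_outputs, mode):
        if mode == 'rev' and 'mass' in d_outputs and 'dv_struct' in d_inputs:
            # d_outputs['mass'] is not guaranteed to be consistent across the
            # ranks of the parallel group, so bcasting rank 0's product into
            # d_inputs is where the incorrect totals appear to come from.
            seed = np.ones(3) * d_outputs['mass'] if self.comm.rank == 0 else None
            d_inputs['dv_struct'] += self.comm.bcast(seed, root=0)
```

Putting such a component inside an om.ParallelGroup and taking totals of an output that also depends on a component outside the group is the situation described above.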
As a side note, the example's check_totals also seems to get stuck when using a parallel group (hence the if statement at the end of the script).
Example
OpenMDAO Version
3.25.1-dev
Relevant environment information
I am using python 3.9.5 with the following package versions (among other packages):
numpy==1.24.2
scipy==1.10.1
mpi4py==3.1.4
petsc4py==3.15.0