Bit shaving results in non-layout independent History? #1941

Closed
bena-nasa opened this issue Jan 23, 2023 · 4 comments

@bena-nasa
Collaborator

bena-nasa commented Jan 23, 2023

The GMAO-OPS team (@rlucches) reported that the GEOS-IT system was giving different diagnostic output at different layouts, even though the model itself is layout independent as far as the checkpoints are concerned. @lltakacs traced the culprit down to the bit shaving.

Indeed, I've been able to reproduce this with the "main" branch of the GEOSgcm fixture as of 1/20/2023 (that's when the model I just ran was cloned and built). At C48, I have a collection that is output on the native grid of the fields; the fields are 2D, so there's no processing. If I turn on bit shaving, I get different results at 1x12 and 4x48 for the collection at the first write. If I turn off bit shaving, the collection is layout independent.

I'll investigate. I'm perplexed, as the bit shaving should be an element-wise operation; I'm just not seeing how the layout could possibly matter, but apparently it does.

@bena-nasa bena-nasa added the 🪲 Bug Something isn't working label Jan 23, 2023
@bena-nasa bena-nasa self-assigned this Jan 23, 2023
@tclune
Collaborator

tclune commented Jan 23, 2023

Part of the scheme involves extracting the average before the bit shaving. I suspect that this is the culprit.
But surprised this is the first time it has shown up. Guess our usual tests only look at checkpoints.
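
To illustrate the suspicion above, here is a minimal sketch of how an offset computed per local chunk can make shaving depend on the decomposition. It is not the MAPL routine: `shave32`, `shave_with_offset`, and `shave_by_layout` are hypothetical names, the offset is assumed to be a plain mean of each chunk, and the field values are picked so that every step is exact in float32.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE-754 float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def shave32(x, xbits):
    """Zero the lowest `xbits` mantissa bits of x viewed as a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << xbits) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def shave_with_offset(values, xbits):
    """Assumed scheme: subtract the chunk mean, shave, add the mean back."""
    mean = to_f32(sum(values) / len(values))
    return [to_f32(shave32(to_f32(v - mean), xbits) + mean) for v in values]

def shave_by_layout(field, nchunks, xbits):
    """Shave each chunk separately, as if each chunk lived on its own rank.

    Assumes len(field) is divisible by nchunks.
    """
    size = len(field) // nchunks
    out = []
    for i in range(nchunks):
        out.extend(shave_with_offset(field[i * size:(i + 1) * size], xbits))
    return out

field = [1.0, 3.0, 33.0, 35.0]
print(shave_by_layout(field, 1, 20))  # -> [2.0, 3.0, 33.0, 34.0]
print(shave_by_layout(field, 2, 20))  # -> [1.0, 3.0, 33.0, 35.0]
```

With one chunk the mean is 18, so the residuals ±17 lose a bit and come back as ±16. With two chunks the means are 2 and 34, the residuals are ±1, and nothing is lost. Calling `shave32` on each value directly, with no offset, gives the same answer for any chunking.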

@bena-nasa
Collaborator Author

bena-nasa commented Jan 23, 2023

OK, as @tclune pointed out, the bit-shaving algorithm that we still use, which was implemented by Arlindo a long time ago, is a little more complicated than I realized (I have not looked at it in at least half a decade). It may be doing some form of math to prevent bias after the shaving, in which case doing this on the distributed arrays may be the issue. I will confirm this by seeing whether doing the bit shaving on the gathered arrays on the server side fixes it, although this brings up other thorny design issues we will have to address if it does indeed solve the problem.

@bena-nasa
Collaborator Author

@rlucches @rtodling @lltakacs
Indeed, @tclune was right: it is more complicated and not just an element-by-element operation. And here I have thought for a decade that our bit shaving was some really naive throw-away-some-bits thing.

I've confirmed that if I hack the bit shaving over onto the server, where the data is gathered, it seems to fix any bit-shaving-related layout problem (rather than doing the bit shaving on the History/griddedio side, when the data is still distributed). Now the trickier part is doing this in a clean way in the current history/griddedio/pfio output server system we have. Hopefully it is not too bad to shift this over to the server. I will let everyone know when I have a solution. I imagine whatever I do in MAPL develop will need to be backported if GEOS-IT is using the MAPL 2.8.0.X series, where we have been collecting patches as needed to avoid forcing a bigger number update.

@bena-nasa
Collaborator Author

@rlucches @rtodling @lltakacs
I figured out how to make the bit shaving layout independent with a few MPI calls. I'll shortly make a PR into MAPL develop.
Is there an older MAPL tag I should backport this into for the GEOSadas or any other older system?
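
The thread does not say which MPI calls were used, so the following is only a sketch of one way a few collective calls could remove the layout dependence: every rank agrees on a single global offset before shaving its local data. All names are hypothetical, and the reductions are simulated in plain Python rather than with real MPI. The offset here is the midrange, (min + max) / 2, standing in for the average because MIN and MAX reductions return the same result in any order, whereas a floating-point sum may not.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE-754 float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def shave32(x, xbits):
    """Zero the lowest `xbits` mantissa bits of x viewed as a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << xbits) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def split(field, nchunks):
    """Split the field into equal chunks (assumes an even division)."""
    size = len(field) // nchunks
    return [field[i * size:(i + 1) * size] for i in range(nchunks)]

def global_offset(chunks):
    """Stand-in for two allreduce calls (MIN and MAX) across all ranks."""
    lo = min(min(c) for c in chunks)
    hi = max(max(c) for c in chunks)
    return to_f32((lo + hi) / 2)

def shave_distributed(field, nchunks, xbits):
    """Each chunk shaves its own data, but with the shared global offset."""
    chunks = split(field, nchunks)
    offset = global_offset(chunks)
    out = []
    for chunk in chunks:  # each iteration plays the role of one rank
        out.extend(to_f32(shave32(to_f32(v - offset), xbits) + offset)
                   for v in chunk)
    return out

field = [1.0, 3.0, 33.0, 35.0]
for n in (1, 2, 4):
    print(n, shave_distributed(field, n, 20))  # same list for every n
```

Because the offset no longer depends on how the field is split, the shaved values are the same for 1, 2, or 4 chunks, and the data never has to be gathered onto the server first.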
