Mdsio section #359
base: master
Conversation
I have to say I don't really understand why the tutorial_plume_on_slope check is failing. I ran the testreport on the cluster at our institute, and without turning on MPI it passed, with output matching to at least 16 digits of accuracy. I believe the verification results are without MPI, but running with MPI I get the same result as Travis, where it simply says that the run failed... But even running with MPI on our cluster, diffing the STDOUT.0000 and results/output.txt files shows no differences in the monitor output. So I'm not really sure where I've gone wrong.
@timothyas First of all, I like your effort. I too am a little annoyed by the multiple files and the missing meta information. I am not too much of an MDSIO expert, but I can reproduce the failing experiment on my laptop. The numbers are OK (I guess), but it fails at the end in
@timothyas I think I now have a fix. Would you like me to check it in for you to see? Not sure that it does the right thing (whether the pickup file is correct, contains only zeros), but at least the test does not break.
Hey @mjlosch, thanks so much for the feedback and helpful debugging! Yes, I'd be interested to see your fix. I'm re-wrapping my head around all these buffers, segs, and pass routines... Now that you mention it, this does look suspect. Here I was mirroring (or, trying to mirror) the style of
Hi Tim, just added my fix, but be aware that this only fixes the runtime error. The
Thanks @mjlosch, this is super helpful. I bet this has to do with the corresponding read routines, I'll check it out.
@timothyas you are right. With
Compare d6e369e to 6efcb8f
* mdsio_read_section: don't initialize buffers; for some reason this clears the variable "dataFName". These don't need to be initialized anyway, it seems.
* remove unnecessary barriers in exf_set; also some unmaintained code needs fixing for picky compilers.
Hey @mjlosch and @jm-c, I think this is now in very good shape. I have made some fixes related to exch2 and tested all possible options (

Additionally, @antnguyen13 gave me a lightweight ASTE setup to test with. This has been a good test because it uses a variety of options. Some seaice-related details on the ASTE setup that I think @mjlosch will find interesting (big thanks to @antnguyen13 for some coaching here):

However, almost no matter what the options were, I could always get through 10 iterations with output similar to what is shown below (i.e. passing with pretty good agreement). It seems like these stability issues are a matter of seaice being a tricky beast, rather than MDSIO problems, so I feel pretty confident about the code. I'd be curious for any feedback, and I'm ready to do what I can to help get this merged.

"miniASTE" testreport output after 20 iterations (note it reads in single precision OBCS fields):
- fix so that it compiles when pkg/fizhi is compiled
- also remove unused variables and fix 1 typo in comments
@timothyas I did not check everything I wanted to, but I have a few comments:
I will continue to review the other bits, but it's taking me some time ...
Thanks so much for the feedback @jm-c. I agree with your points, and I will work on your list in reverse order, starting with 3. Hopefully this will simplify things a bit.
@timothyas I wrote a new issue #753 that proposes an alternative

Note that if we go this route (which would imply also postponing many of the changes to S/R calls to a future PR), the list of files with potential merge conflicts would go down to a single one (
What changes does this PR introduce?
(Bug fix, feature, docs update, ...)
The current mdsio read/write routines for YZ or XZ slices don't take `useSingleCPUIO` into account, and `mds_write_sec_[xz/yz]` also do not write meta files. This PR makes some additions so that both of these are addressed: the mdsio read/write routines follow the same single-CPU IO logic as `mds_write_field`, and write meta files accordingly.

Perhaps the first issue, whether `mds_[read/write]_sec` uses singleCPUIO logic or not, does not matter. I found it annoying that this was turned on, yet all the xx_obcs files always print out as tiled files (since pkg/autodiff always uses non-global files). I dug into addressing this (which turned out to be more tedious than expected...) and figured I would open up the PR to see what people think. I think that at least writing meta files would be a useful addition, although relatively minor.
What is the current behaviour?
(You can also link to an open issue here)
It does not matter whether `useSingleCPUIO` is turned on for sliced fields; they are always read/written as tiled files. I noticed this with the autodiff+obcs+ctrl packages in use.
What is the new behaviour
(if this is a feature change)?
Now if `useSingleCPUIO` is turned on, XZ or YZ slices are read/written by the master process via some new scatter/gather routines.
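Conceptually, the scatter/gather pair works like this (Python sketch with the MPI ranks modeled as a plain list; the names `scatter_xz`/`gather_xz` and the contiguous tiling are illustrative, not the actual routine signatures):

```python
import numpy as np

def scatter_xz(global_buf, n_procs):
    """Master splits the global x-slice into per-process chunks.

    Illustrative: a real implementation would split along tile
    boundaries and send each chunk to its owning MPI rank.
    """
    return np.array_split(global_buf, n_procs)

def gather_xz(chunks):
    """Master reassembles the global x-slice from per-process chunks."""
    return np.concatenate(chunks)

# round trip: scatter to 4 "processes", then gather back on the master
global_slice = np.linspace(0.0, 1.0, 8)
pieces = scatter_xz(global_slice, 4)
print(np.allclose(gather_xz(pieces), global_slice))  # True
```

The design point is that only the master process touches the disk, so a run with `useSingleCPUIO` produces one global `.data`/`.meta` pair per slice instead of one tiled file per process.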
Does this PR introduce a breaking change?
(What changes might users need to make in their application due to this PR?)
Honestly, it might. I did not wade through the exch2 logic dutifully, since it is very confusing and I'm not familiar with it. At least one thing that needs to change is the size of the `x_buffer` and `y_buffer` fields in `mdsio_[read/write]_section`, and the indexing that happens in the `[scatter/gather]_[xz/yz]` routines could be incorrect. However, I wanted to see if anyone thinks this is worth keeping before digging into that...

Another catch is with the `gather_[xz/yz]` routines. In these routines it's not clear which processor has the sliced information. For instance, if the slice is on one of the boundaries, then it would live on all the processors (and tiles) that intersect with the boundary. This would be difficult, and perhaps overly intrusive, to determine or pass as an input to the gather routine, so I solved it by simply only passing values to the global buffer if they are nonzero. Presumably, the sliced data are zero on all processors and tiles except where that slice is "active", and nonzero only where it is active, so the gather routine would then successfully grab the nonzero information. This was just a first pass, so it could be faulty (and all of this may not be worth it at all), but I'd be curious what the gurus think...
Other information:
Suggested addition to tag-index
(To avoid unnecessary merge conflicts, please don't update tag-index. One of the admins will do that when merging your pull request.)

o mdsio_[read/write]_section rewritten to include useSingleCPUIO logic, mirroring mdsio_[read/write]_field
o [scatter/gather]_[xz/yz]_[r4/r8] routines added