Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track issues and PRs that were moved to the run 2.0i release branch #165

Closed
15 tasks done
cwwalter opened this issue Sep 14, 2018 · 9 comments
Closed
15 tasks done

Comments

@cwwalter
Copy link
Member

cwwalter commented Sep 14, 2018

In order to start 2.0i quickly large numbers of issues and PRs were finalized and merged to a release branch. This all happened quickly and though it was all tested by Jim not everything was reviewed etc. The code is on a release branch so that if we find any issues after we start we can more easily rewind things. Most of the PRs still exist since they haven't been merged to master.

Here I am attempting to list all of the Issues/PRs that this happened with last week so we can better track what is going on and decide how to integrate everything. Issues always refer to imSim issues.

Please update, fix and add if I missed anything or made a mistake.

In imSim

In obs_lsstCam

in sims_GalSimInterface

Changes that should not go to master:

  • PR disable checkpoint file deletion for Run2.0i #167 turned off the deleting of checkpoint files after successfully completion. This is because the AGN instance catalogs are not yet complete and they will need to restart adding them in. Normally, we do not want this behavior.

Unclear whether they should be included:

  • 4aecf75 Add node_id to checkpoint sqilte3 db filename so that different jobs for the same visit don't try to write to the same file.

These PRs and issues have actually gone to master on simsGalSim interface :

@cwwalter cwwalter added the DC2 label Sep 14, 2018
@cwwalter
Copy link
Member Author

Added more to the list including items which should no be added to master and ones that are unclear.

@cwwalter
Copy link
Member Author

Note: PR lsst-sims/legacy_sims_GalSimInterface#63 now merged. imSim #151 still outstanding as it is in a direct commit 2786b1f that went to the release branch. So checking off the sims_galSimInterface block, but not the imSim one.

@cwwalter
Copy link
Member Author

cwwalter commented Oct 25, 2018

@jchiang87 and all..

I've been (trying) to keep this up-to-date. We should start merging these back to master before it gets more out of sync soon. Some of this never really went through a real review in the the rush for 2.0i, but we have been using it a lot so it is seems solid.

I think to keep things trackable etc, the best thing would be if we could layer these PRs merges on one-by-one.

But there are a few we should leave out (I guess PR #167 and probably 4aecf75 for example). Also there are a couple of just GH commits which we might want to turn into a PR.

We could target this for after you get back Jim.

@cwwalter
Copy link
Member Author

cwwalter commented Nov 1, 2018

In addition to tracking to track the various PRs / commits and changes which have gone into various branches during the 2.0i work (see first comment which is being kept up to date). I have tried to summarize here the versions of all packages which were used for 2.0i and then list all changes that have been made to relevant branches since that time. Suggestions for moving forward and testing will be made separately. Here I try to summarize the baseline status of all new changes since our last production run.

The script used to build the singularity container for production at Theta is here:

https://github.com/LSSTDESC/ALCF_1.2i/blob/7a59d63debb04a030f564d0b9a6301be4f1422f6/Singularity

Using: lsstdesc/stack-sims:w_2018_26-sims_2_9_0 as base

imSim (on dc2_run2.0_rc - v0.3.5-beta )(Sep 17th)

4aecf75 - Add node_id to checkpoint sqilte3 db filename so that different jobs for the same visit don't try to write to the same file.

PR #174 - Speed up object trimming and skip ingest of drawn objects for restarts
PR #176 memory fix for chip-level object down-selection
PR #181 - not yet merged - it uses galSimInterface PR #70

sims_GalSimInterface - on branch u/jchiang/uniqueId_as_string (Sep 11th)
Equivalent to PR #63 - enable uniqueId strings to be written to centroid file

New things include:

Merge branch 'u/danielsf/fix/180920’ - Sep 24 - Interface changes???
Merge branch 'u/danielsf/fix/geom' - Oct 4 - Looks like DM infrastructure change
PR #66 - Oct 9th - ‘checkpoint right after sky background rendering’
PR #67 - Oct 9th - ‘U/danielsf/rmjarvis/cut gsco’ - Use numpy to store GalSim celestial object
PR #69 - Oct 10th - ‘U/danielsf/rmjarvis/find all detectors’ - skip really dim galaxies and complicated sizing operations in some cases - also simplify code.
PR #70 - Oct 30 - ‘Simplify draw procedure for very faint objects’. -Use simple SEDs and skip silicon mode in certain circumstances. (note needs imSim #181 to be turned on)

sims_skybrightness - fdd58c7eb0414e89f5c7fa12eccf8809acabcf92 (discontinuity fix)
(what does sims_2_10_0 use?)

PR #28 - Oct 4th - use sims-util for coord transformations for speed

obs_lsstCam - imsim-0.1.0
(I think this package has now changed…)

GalSim - 1.6
sims_2_10_0 Now at v2.0.? - includes smaller memory footprint for atmospheric screens.

@cwwalter
Copy link
Member Author

cwwalter commented Nov 2, 2018

In the comment above I have listed additions and changes to imSim and associated code since our last production run on Theta. Additionally, in the top post of this issue I have been tracking what code has been going to the release branch that was made to run on Theta for 2.0i (not just from the production run period). So currently the master branch has diverged quite badly and we should fix this quickly as it is now becoming quite difficult to track.

Many of the changes to (especially) the imSim release branch were never formally tested or reviewed by multiple people because of time constraints when we started the 2.0i run. However, we did spend a lot of time running with them and we have gone over the code a lot since that time to add new features, so I think at this point it is safe to say those are tested and accepted. Our testing should be done focused on the new work since that time (as listed in the comment above this one).

Below I list each package: where what version we used and what we should do. I will addressing testing and development plans in the next comment.

imSim: Currently on release branch, and behind in many features (see above).

  • I think we should first of all apply the PRs and commits to master one by one (as opposed to just merging the branch). There are two reasons for this: it will allows us to follow and reconstruct the development history, and there are at least two Theta specific features we may not want to apply (or we may want to put them into a new branch just for theta). After we get master up-to-date we can tag the code and test it.

GalSim: using 1.6 in sims_2_9_0

  • Currently there is a 2.0.X release in sims_2_10_0. It was actually our intention to use this version but it did not happen for some reason. This version includes the fix to decrease the size of the shared atmospheric memory screens. It is possible the reason this didn't happen is that Jim ran into some problems using it during the tests at NERSC. He found when he ran on a single sensor and compared a few versions the results between the two versions seems statistically (because RNGs in GalsSm were different) identical but:

" Running in multiprocessing mode was more problematic. When I tried to run 5 or sensor-visits in parallel, sometimes I would see a sensor or two not start up with the others. I wasn’t seeing that behavior with sims 2.9.0 and galsim 1.6.0. One other difference though is that the sims 2.10.0/galsim 2.0.7 runs were using a shifter image interactively, whereas the sims 2.9.0/galsim 1.6.0 used in install in project space. Haven’t had time to understand the stalling issue with the shifter runs.”"

We should switch to the sims_2_10 version and run tests including confirming we can run at NERSC with no problems.

sims_galSimInterface: effectively pinned to a particular commit on master

The changes here are where we expect the greatest improvements in performance. See comment above for details.

We should tag the current head of master for tests.

sims_skybrightness: I think we used the version included in sims_2_9_0

It looks like an oversight in the singularity building code caused us to use the version of sims_skybrightness before we applied a fix for twilight light sky discontinuity (the newer version was built but not setup)

We should check if sims_2_10_0 contains a recent enough version to just use.

obs_lsstCam - we are using tag imsim-0.1.0

This package is being migrated to obs_lsst. We should discuss what to do about this. Ideally we will use the new version in v17+ of the stack.

@cwwalter
Copy link
Member Author

cwwalter commented Nov 2, 2018

Once we apply all of the PRs and make tags and switch base sims distributions as described above we should undertake testing both to make that things results are as expected and also to test final performance.

imSim PR #181 is not yet merged. After it is, imSim will start to use faint object drawing code implemented by Mike in sims_galSimInterface. These changes will effectively allow us to run just as before but more quickly and in a smaller memory footprint .

In addition to the GalSim version change there are a large number of changes that have been made to imSim and sims_galSimInterface (including with tunable parameters). Ideally, it might be preferable to try to test the GalSim change separately but, since there are so may changes now and building the software environments can be hard it might be better to just test and characterize the results for an up-to-date version of everything. If we find issues that we can't isolate we might need to go back and do something more complicated but most of these pieces have been tested at least individually now.

For testing:

-> I suggest trying to build a version/environment with everything up-to-date and try testing that.

Most of the performance changes will be related to Mike's new faint object code. This changes not only speed but actual output. Dim objects will now have simple SEDs and/or not use the full silicon model.

-> I would like to suggest we try using this without our previous performance fix where we skip very dim objects completely. Including both fixes will introduce three classes of rendering objects and have the side effect of very dim objects being left out of the centroid files. Is this still fast enough?

-> So, automated tests that check for completeness and any shape and color parameters which are possible to measure especially for quite dim objects will be very important I think. If there is some consequence we overlooked I suspect it will be here.

-> I'm not sure how well the interaction between this code and the randomwalk code has been tested. I don't think we should run simplified tests without the knots to asses readiness. That caught us last time

-> Timing tests based on running a large or complete instance catalog is useful since last time our simple scaling arguments for time seemed not to work as expected. At least I don't really know what our actual run time in a real environment is with all of these changes applied.

-> I'm still concerned by the KNL scale up. I wonder if we will still see such a big effect as before. I also wonder if it might be the same at NERSC...

While those tests are happening I think we can work on new updates/development:

-> We should update gains etc but this should be done in concert with the transition from obs_lsstCam to obs_lsst (since that is where the information is stored.

-> There are missing headers in the HDU (#178) . We should also (at a minimum) store the imSim version number in the headers.

-> Jim's comment about the testing of GalSim implies there may be some issues we haven't worked out for running at NERSC. For production type runnings we certainly need discussion of what approach to use.

-> There were some run choices and flags introduces that we might want to re-visit (including making them more flexible) the one that comes to mind is the choice of e-image/amp or both. (Currently both is not an option). We talked before about possibly going 'backwards'. In recent DM work we have wanted to look at some of the e-image files for understanding what was happening.

More work but I think likely worth if for workflow, speed and other reasons is:

-> Implementing the new instance catalog format.

-> Mike and Josh might be able to speed up the atmospheric ray-tracing.

@cwwalter
Copy link
Member Author

Hi Jim, with your merge of the _rc branch does that mean that all PRs above are now on master along with 27b8257, 89e28f9, 2786b1f and 4aecf75?

I'm trying to understand if master is now completely merged with all the previous changes.

Should we back out PR #167? (disabling cleanup of checkpoint files)

And the decision is to leave "Add node_id to checkpoint sqilte3 db filename so that different jobs for the same visit don't try to write to the same file" in for now correct?

@jchiang87
Copy link
Collaborator

I didn't actually merge the _rc branch. I just merged the last development branch that had changes made over the last several weeks. No need to back out #167 itself, since I made the needed change in e15cad3.

And the decision is to leave "Add node_id to checkpoint sqilte3 db filename so that different jobs for the same visit don't try to write to the same file" in for now correct?

yes.

Everything in master is now up-to-date with all of the development that was in the works over the last few weeks that we wanted to merge.

@cwwalter
Copy link
Member Author

Closing this tracking issue. Master is now back up-to-date with all development work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants