Visit metadata analysis #159
Talking to @sethdigel, we decided to try extending his analysis into the murky world of machine learning. Seth, here's the machine learning example notebook I showed you: https://github.com/drphilmarshall/StatisticalMethods/blob/master/examples/SDSScatalog/Quasars.ipynb In that repo there is also an introductory tutorial, and both are linked from lesson 9's notebook. I'd suggest that you fork and clone that repo, and try running the lesson 9 notebooks. And then we'll just need visit metadata in ...
PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)
Done. I went for descriptive rather than catchy. About 200 of the Run 1 visits are still running in the pipeline. These jobs are all up to about 4000 minutes of CPU time. By some time tomorrow all of the Run 1 jobs will have either finished or hit the 5-day run time limit, at which point a complete set of metadata for Run 1 Phosim simulated visits can be assembled.
Perfect! This should be fun :-)
About 150 of the Run 1 visits are still running. I did the math wrong last night regarding how much longer they can go before hitting the CPU time run limit, which is 7200 minutes. The remaining jobs will not hit the CPU time limit until Monday morning. Also, I am thinking that the analysis will need to fold in the type of batch host in the SLAC farm. Some generations of batch hosts do a lot more per CPU minute than others (and this may be the reason for the banding of CPU times for a given moonalt, moonphase, and filter). Probably we can at least normalize CPU times to a common scale. The table below lists the 'CPU Factors' assigned in the LSF system for the various classes of hosts that have been used in generating Run 1. These presumably relate to relative speeds. About half of the jobs ran on hequ hosts and one third on fell hosts.
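The normalization suggested above could look something like the sketch below. The host classes `hequ` and `fell` are the ones named in this thread, but the CPU Factor values here are invented placeholders; the real values would come from the LSF configuration.

```python
# Sketch: rescale raw CPU times to a common host-speed scale, assuming
# a higher CPU Factor means more work done per recorded CPU minute.
# NOTE: these factor values are hypothetical, for illustration only.
CPU_FACTORS = {
    "hequ": 1.0,   # hypothetical reference host class
    "fell": 0.75,  # hypothetical slower host class
}

def normalize_cputime(cputime_minutes, host_class, reference="hequ"):
    """Convert a raw CPU time to the equivalent time on the reference class."""
    return cputime_minutes * CPU_FACTORS[host_class] / CPU_FACTORS[reference]

print(normalize_cputime(100.0, "fell"))  # 75.0 under these made-up factors
```

With real factors in hand, this would collapse the per-host-class banding before any fitting is attempted.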
Nice! Host class CPU Factor looks like an excellent new feature to add to the model.
@sethdigel I think you'll want to double-check this. SLAC discourages the use of CPU time (or named queues) and suggests users use only wall clock time, and, in fact, the ...
Thanks, Brian. What we are looking at doing is studying the dependence of the actual CPU time for phosim runs (extracted from the pipeline log file for the runs) on some basic parameters in the phosim instance catalogs (like the altitude of the moon). In a preliminary look at the Run 1 output, posted on issue #137, for runs with similar parameters (moon altitude, moon phase, filter), the CPU times appear to have two ranges, separated by a constant factor. I have not looked into it quantitatively yet, but I was guessing that it might be due to the batch hosts not all being the same speed. I posted the CPU Factors for potential future reference because that was the closest thing that I could find that looked like a measure of relative speeds, and because it was not particularly easy to find (it involved using the bhost command for specific hosts).
About 40 phosim runs are still going. Almost all are about to reach the run limit. Three are runs that Tom restarted after they failed for some (probably transient) reason. It looks like about 105 of the runs either have timed out, or will. These represent 1.4 CPU years. The runs that finished used 3.5 CPU years.

I've made a csv file with the metadata from the phosim runs, collected from the instance catalogs and the log files from the pipeline. Here are the headings: obshistid, expmjd, filter, start, end, altitude, rawseeing, airmass, moonalt, moonphase, dist2moon, sunalt, cputime, hostname, runlimit. (start and end are the MJD starting and ending times of the runs, in case wall clock time turns out to be interesting, hostname is the first character of the batch host name, and runlimit is a flag for runs that hit the limit - which did not always occur at exactly the same CPU time.) Where in github or Confluence would be the right place to put the file?

Here is an updated plot of CPU time vs. moon altitude, with the points color coded by filter and sized according to moonphase. The horizontal dashed line is the approximate run limit (5 days) and the vertical dashed line is at 0 deg moon altitude. The histogram has a linear scale and shows the distribution of moon altitude for the runs that are still going. The points with '+' signs hit the run limit (and so produced no phosim output).
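Loading the csv and separating finished runs from timed-out ones is a one-liner in pandas. The filename and the values below are placeholders; only the column names and the meaning of the `runlimit` flag come from the comment above.

```python
import pandas as pd

# Real usage would be something like (filename is a placeholder):
#   df = pd.read_csv("run1_metadata.csv")
# For illustration, a tiny stand-in frame with invented values:
df = pd.DataFrame({
    "obshistid": [200, 201, 202],
    "moonalt":   [-10.0, 35.0, 60.0],       # deg
    "cputime":   [1500.0, 4200.0, 7200.0],  # minutes
    "runlimit":  [0, 0, 1],                 # 1 = hit the run limit
})

# Runs flagged runlimit=1 produced no phosim output, so drop them
# before fitting anything to the CPU times.
finished = df[df["runlimit"] == 0]
print(len(finished), "finished runs")
```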
Nice! K nearest neighbors or Random Forest are going to clean up on the CPU time prediction, I'd say. Just put the data file in your ...
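A minimal sketch of the Random Forest idea with scikit-learn: regress CPU time on a few of the instance catalog parameters. The synthetic data and the assumed relation between moon altitude/phase and CPU time are invented here purely so the snippet runs standalone; the real features would come from the metadata csv.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Invented stand-in features: moonalt (deg), moonphase (%), filter index.
X = np.column_stack([
    rng.uniform(-90, 90, n),
    rng.uniform(0, 100, n),
    rng.integers(0, 6, n),
])
# Invented relation: a brighter moonlit sky means longer runs, plus noise.
y = 3.0 + 0.005 * np.clip(X[:, 0], 0, None) * (X[:, 1] / 100.0) \
    + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```

Swapping in `KNeighborsRegressor` is a one-line change, so both suggestions above are easy to compare on the same train/test split.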
The metadata for the Run 1 phosim runs are in this file. Run 1 has 1227 observations of a Deep Drilling Field at RA, Dec = 53.0091, -27.4389 deg (J2000). As of this writing 4 of the simulation runs are incomplete; they are flagged as described in the table below. In each case, these runs crashed for some (probably) transient reason and were restarted by Tom.
* An x in the hostname column or a negative number for CPU time indicates that the phosim run is still executing. These jobs also have 0 as the start and end time.
Excellent! We are all set.
Oh cool! :-) This is great, Seth. Let's spend some time this morning ...
Incidentally, whatever methods are being used here can also be used to study correlations of five sigma depths with other columns in OpSim. Obviously these are simulated, but it might be a fun project to do exactly the same thing with the OpSim outputs, with the interesting variables being fivesigmadepth or Seeing. At least that exercise would give me some insight into stuff that observers have a good intuition for already. I can get the OpSim output in the form of a dataframe, and I propose we combine with the group looking into observing strategy today.
Good idea. Seth, I would bet something expensive (your bike?) that Random Forest is ...
@sethdigel @humnaawan and @tmcclintock: nice work on Friday! Reproducing your main result here: From this plot it looks to me as though you are able to predict CPU time to about +/- 0.2 dex (95% confidence, very roughly), no matter what the absolute CPU time is (although it would be good to be more precise about this). 0.2 dex corresponds to about 50% uncertainty, or ranges like "5000 to 15000 CPU hours." I think this could be useful (@TomGlanzman can comment further), and we didn't even get into extending the data with more OpSim or PhoSim parameters. The next step could be to extract the ML parts of your notebook and repackage them into a ...
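The dex arithmetic above is easy to check: +/- 0.2 dex is a multiplicative factor of 10^0.2 in each direction. The central value of 9000 CPU hours below is chosen only to illustrate how a quoted range like "5000 to 15000" arises.

```python
# +/- 0.2 dex means multiplying/dividing by 10**0.2.
factor = 10 ** 0.2          # about 1.58, i.e. roughly a +/-50-60% spread

# Illustrative central value (not from the actual fit):
center = 9000.0             # CPU hours
low, high = center / factor, center * factor
print(f"{low:.0f} to {high:.0f} CPU hours")  # roughly the quoted 5000-15000 range
```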
I hope @drphilmarshall 's quoted range of 5000 to 15000 CPU hours ... is really seconds. Or were you looking at integrated CPU hours for Twinkles-phoSim? In terms of job scheduling, it will certainly be extremely useful to know the (approximate) CPU time ranges. However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis. Otherwise, we are in the same situation as now: send all jobs to the same, maximum time queue. And/or we arbitrarily make a cut so that any phoSim that requires > NN hours of CPU is simply not attempted. |
I haven't had time to actually play with this yet, but at NERSC they have: http://slurm.schedmd.com/checkpoint_blcr.html As far as I can tell this is done at a system/kernel level, so you don't have to change the actual code.
I tried exercising checkpointing a couple of years ago at NERSC but wound up frustrated because carver did not support this feature. Now, with both a new architecture (cori) and batch system (slurm) it is probably worth trying again... |
@TomGlanzman, this feature has been built into the workflow engine since day one, and is used extensively in the EXO data processing. There should be no problem using this with your phosim task.
@tony-johnson are you referring to BLCR? Do you have any simple examples of how to use it in a slurm file? |
@cwwalter no sorry, I was referring to Tom Glanzman's post above yours concerning using the CPU time estimate to set the time required for a batch job. (GitHub needs to add threaded conversations.) BLCR does look quite interesting; I read the FAQs linked from the page you referenced above, and it certainly seems as if it might be usable.
In the PhoSim image generation master thread #137 (comment) @sethdigel is making some cracking plots showing the link between observing conditions and PhoSim run times. I think we should extend this a little bit, so we can look at how things like observed image quality, observed image depth and so on are distributed, and how they depend on each other. Seth, @jchiang87 - how should we proceed? Should we start a validation module, put some functions in it, and make some more scripts for the workflow? Or focus on collating more metadata, and making it available for analysis by more people? Is a PhoSim CPU time predictor a useful tool that we should try and put together?