
Visit metadata analysis #159

Closed · drphilmarshall opened this issue Mar 4, 2016 · 24 comments

@drphilmarshall (Contributor)

In the PhoSim image generation master thread #137 (comment), @sethdigel is making some cracking plots showing the link between observing conditions and PhoSim run times. I think we should extend this a little, so we can look at how things like observed image quality and observed image depth are distributed, and how they depend on each other. Seth, @jchiang87 - how should we proceed? Should we start a validation module, put some functions in it, and make some more scripts for the workflow? Or focus on collating more metadata and making it available for analysis by more people? Is a PhoSim CPU time predictor a useful tool that we should try to put together?

@drphilmarshall (Contributor, Author)

Talking to @sethdigel, we decided to try extending his analysis into the murky world of scikit-learn at the Hack Day next week - we'll just aim for a nice notebook to start with, and can figure out scripts later. I guess we can put the notebook in examples/notebooks, although what we are talking about is not really an example...

Seth, here's the machine learning example notebook I showed you:

https://github.com/drphilmarshall/StatisticalMethods/blob/master/examples/SDSScatalog/Quasars.ipynb

In that repo there is also an introductory tutorial, and both are linked from lesson 9's notebook. I'd suggest that you fork and clone that repo, and try running the lesson 9 notebooks. And then we'll just need visit metadata in csv format for next Friday :-)
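
To set expectations for the hack, something like the sketch below is all the notebook needs to get started - assuming pandas, a hypothetical file name, and column names that still need to be confirmed once the real metadata file exists:

```python
import pandas as pd

# Placeholder file and column names -- to be replaced by the real Run 1 metadata.
data = pd.read_csv("run1_metadata.csv")
features = data[["moonalt", "moonphase", "altitude", "filter"]]
target = data["cputime"]            # what we want to learn to predict
print(data.describe())
```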

@drphilmarshall (Contributor, Author)

PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

@sethdigel (Contributor)

> PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

Done. I went for descriptive rather than catchy.

About 200 of the Run 1 visits are still running in the pipeline. These jobs have all accumulated about 4000 minutes of CPU time so far. By some time tomorrow all of the Run 1 jobs will have either finished or hit the 5-day run time limit, at which point a complete set of metadata for the Run 1 PhoSim simulated visits can be assembled.

@drphilmarshall (Contributor, Author)

Perfect! This should be fun :-)


@sethdigel (Contributor)

About 150 of the Run 1 visits are still running. I did the math wrong last night regarding how much longer they can go before hitting the CPU time run limit, which is 7200 minutes. The remaining jobs will not hit the CPU time limit until Monday morning.

Also, I am thinking that the analysis will need to fold in the type of batch host in the SLAC farm. Some generations of batch hosts do a lot more per CPU minute than others (and this may be the reason for the banding of CPU times for a given moonalt, moonphase, and filter). Probably we can at least normalize CPU times to a common scale.

The table below lists the 'CPU Factors' assigned in the LSF system to the various classes of hosts that have been used in generating Run 1. These presumably relate to relative speeds. About half of the jobs ran on hequ hosts and one third on fell hosts.

| Host class | CPU Factor |
| --- | --- |
| bullet | 13.99 |
| dole | 15.61 |
| fell | 11.00 |
| hequ | 14.58 |
| kiso | 12.16 |
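
If the CPU Factors really do track relative host speed, one rough way to put the measured CPU times on a common (say, fell-equivalent) scale is sketched below; the linear scaling with CPU Factor is an assumption, not something checked against the LSF documentation:

```python
# LSF CPU Factors from the table above; linear scaling with host speed is assumed.
CPU_FACTOR = {"bullet": 13.99, "dole": 15.61, "fell": 11.00,
              "hequ": 14.58, "kiso": 12.16}
REFERENCE = CPU_FACTOR["fell"]      # express everything in fell-equivalent seconds

def normalized_cputime(cputime_sec, host_class):
    """Scale a measured CPU time to the reference (fell) host class."""
    return cputime_sec * CPU_FACTOR[host_class] / REFERENCE

# e.g. 4000 CPU minutes on a hequ host -> ~318,000 fell-equivalent seconds
print(normalized_cputime(4000 * 60, "hequ"))
```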

@drphilmarshall (Contributor, Author)

Nice! Host class CPU Factor looks like an excellent new feature to add to the csv file, and we can decide on Friday whether to correct the CPU times or just throw in the CPU factor as a feature.


@brianv0 commented Mar 7, 2016

@sethdigel I think you'll want to double check this. SLAC discourages the use of CPU time limits (or named queues) and suggests users only specify a wall clock time; in fact, the bsub wrapper script SLAC maintains is supposed to divide a supplied CPU time by 5 and then just set that as the wall clock time. That number was arrived at by assuming a job might run on a fell machine (CPU factor ~10) and giving it twice the amount of time to run.

@sethdigel (Contributor)

Thanks, Brian. What we are looking at doing is studying how the actual CPU time for phosim runs (extracted from the pipeline log files) depends on some basic parameters in the phosim instance catalogs (like the altitude of the Moon). In a preliminary look at the Run 1 output, posted on issue #137, the CPU times for runs with similar parameters (moon altitude, moon phase, filter) appear to fall into two ranges separated by a constant factor. I have not looked into it quantitatively yet, but I was guessing that it might be due to the batch hosts not all being the same speed. I posted the CPU Factors for potential future reference because they were the closest thing I could find that looked like a measure of relative speeds, and because they were not particularly easy to find (it involved using the bhost command for specific hosts).

@sethdigel (Contributor)

About 40 phosim runs are still going. Almost all are about to reach the run limit. Three are runs that Tom restarted after they failed for some (probably transient) reason. It looks like about 105 of the runs either have timed out, or will. These represent 1.4 CPU years. The runs that finished used 3.5 CPU years.

I've made a csv file with the metadata from the phosim runs, collected from the instance catalogs and the log files from the pipeline. Here are the headings: obshistid, expmjd, filter, start, end, altitude, rawseeing, airmass, moonalt, moonphase, dist2moon, sunalt, cputime, hostname, runlimit. (start and end are the MJD starting and ending times of the runs, in case wall clock time turns out to be interesting; hostname is the first character of the batch host name; and runlimit is a flag for runs that hit the limit, which did not always occur at exactly the same CPU time.)

Where in github or Confluence would be the right place to put the file?

Here is an updated plot of CPU time vs. moon altitude, with the points color coded by filter and sized according to moonphase. The horizontal dashed line is the approximate run limit (5 days) and the vertical dashed line is at 0 deg moon altitude. The histogram has a linear scale and shows the distribution of moon altitude for the runs that are still going. The points with '+' signs hit the run limit (and so produced no phosim output).

[image: cpu_moonalt_comb]
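
This is not how the plot above was actually made, but a rough sketch of how something similar could be regenerated from the csv, using the column names listed above (units and styling are guesses):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("run1_metadata.csv")

fig, ax = plt.subplots()
sc = ax.scatter(df["moonalt"], df["cputime"] / 3600.0,       # CPU hours (assumed units)
                c=df["filter"], s=10 + df["moonphase"], cmap="viridis")
ax.axvline(0.0, linestyle="--", color="gray")                # Moon at the horizon
ax.axhline(5 * 24.0, linestyle="--", color="gray")           # approximate 5-day run limit
ax.set_xlabel("Moon altitude (deg)")
ax.set_ylabel("CPU time (hours)")
fig.colorbar(sc, label="filter (0-5 = ugrizy)")
plt.show()
```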

@drphilmarshall (Contributor, Author)

Nice! K nearest neighbors or Random Forest are going to clean up on the CPU time prediction, I think. Can you post this plot and two bullets of text to the Twinkles slides at https://docs.google.com/presentation/d/1MdGGDrITW4-n04goJNYBAoVwnQBjRkibxuJY8EJZWXc please? Good to chat about this in our session.

I'd say just put the data file in your public_html for now, and post the URL to this thread. We can pull directly from there in our hack notebook. Thanks Seth! :-)


@sethdigel (Contributor)

The metadata for the Run 1 phosim runs are in this file: http://www.slac.stanford.edu/%7Edigel/lsst/run1_metadata.csv

Run 1 has 1227 observations of a Deep Drilling Field at RA, Dec = 53.0091, -27.4389 deg (J2000). As of this writing 4 of the simulation runs are incomplete; they are flagged as described in the table below. In each case, these runs crashed for some (probably) transient reason and were restarted by Tom.

| Column | Description |
| --- | --- |
| obshistid | OpSim designator of the visit |
| expmjd | MJD of the (simulated) observation |
| filter | 0-5 for ugrizy |
| rotskypos | angle of the sky relative to camera coordinates (deg) |
| start_run | starting time of the phosim run on the batch farm (MJD)* |
| end_run | ending time of the phosim run on the batch farm (MJD)* |
| altitude | elevation of the observing direction (deg) |
| rawseeing | seeing at 500 nm (arcsec), a phosim input |
| airmass | airmass at the altitude of the observation |
| moonalt | elevation of the Moon (deg) |
| moonphase | phase of the Moon (0-100) |
| dist2moon | angular distance of the Moon from the observing direction |
| sunalt | elevation of the Sun (deg) |
| cputime | CPU time required for the phosim run (sec)* |
| hostname | first character of the name of the batch host that ran the job (b, d, f, h, k)* |
| runlimit | flag indicating whether the phosim run was terminated at the 5-day execution time limit (1 = yes) |

* An x in the hostname column or a negative number for CPU time indicates that the phosim run is still executing. These jobs also have 0 as the start and end time.
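
For anyone following along, something like this sketch (untested; flag conventions taken from the table above) pulls the file and drops the incomplete runs:

```python
import pandas as pd

url = "http://www.slac.stanford.edu/%7Edigel/lsst/run1_metadata.csv"
df = pd.read_csv(url)

# Runs still executing have hostname 'x' and a negative cputime; runs that hit
# the 5-day limit have runlimit == 1.  Keep only the completed runs.
complete = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]
print(len(df), "visits in total,", len(complete), "with completed phosim runs")
```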

@drphilmarshall (Contributor, Author)

Excellent! We are all set.


@sethdigel (Contributor)

Phil's machine learning example notebook runs for me. This is sort of a Hello World plot showing some of the metadata for Run 1. So, yes, I think it will work.
[image: run1_everything]
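
Just to note one way such an everything-against-everything overview can be made (not necessarily how the plot above was produced): pandas' scatter_matrix does the job in a couple of lines.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("run1_metadata.csv")
cols = ["moonalt", "moonphase", "dist2moon", "altitude", "airmass", "cputime"]
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10), alpha=0.5)
plt.show()
```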

@drphilmarshall (Contributor, Author)

Oh cool! :-) This is great, Seth. Let's spend some time this morning staring at the full plot, to get some feel for what is going on. This is of course not part of the traditional machine learning development flow but sod it, we're physicists. Then I think we can just step through the notebook, editing both the markdown and the python to train the KNN model and make some predictions :-)
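
In outline, the KNN step might look something like the sketch below; the feature choice, scaling, and train/test split are all placeholders to be settled in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("run1_metadata.csv")
df = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]      # completed runs only

X = df[["moonalt", "moonphase", "dist2moon", "filter", "altitude"]].values
y = df["cputime"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("KNN R^2 on held-out visits:", knn.score(X_test, y_test))
```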


@rbiswas4 (Member)

Incidentally, whatever methods are being used here can also be used to study correlations of five sigma depths with other columns in OpSim. Obviously these are simulated, but it might be a fun project to do exactly the same thing with the OpSim outputs, with the interesting variables being fivesigmadepth or Seeing. At least that exercise would give me some insight into stuff that observers have a good intuition for already. I can get the OpSim output in the form of a dataframe, and I propose we combine with the group looking into observing strategy today.

@drphilmarshall (Contributor, Author)

Good idea.

Seth, I would bet something expensive (your bike?) that Random Forest is going to work better than KNN on this problem, especially if we don't split the data by filter (and we don't have time for such sensible things today). But still, let's get KNN working first and then unplug it and replace it.
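
Swapping regressors should then be a near one-line change; reusing the training/test split from the KNN sketch above, something like this (hyperparameters are just placeholders):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)      # same X_train, y_train as in the KNN sketch
print("Random forest R^2 on held-out visits:", rf.score(X_test, y_test))
```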


@drphilmarshall (Contributor, Author)

@sethdigel, @humnaawan, and @tmcclintock: nice work on Friday! Reproducing your main result here:

From this plot it looks to me as though you are able to predict CPU time to about +/- 0.2 dex (95% confidence, very roughly), no matter what the absolute CPU time is (although it would be good to be more precise about this). 0.2 dex corresponds to about 50% uncertainty, or ranges like "5000 to 15000 CPU hours." I think this could be useful ( @TomGlanzman can comment further ), and we didn't even get into extending the data with more OpSim or PhoSim parameters.

The next step could be to extract the ML parts of your notebook and repackage them into a PhoSimPredictor class, which could be trained and then pickled for use before every new PhoSim run to determine which queue to use. Again, we should be guided by Tom here. Let us know if you're interested in helping with this! And thanks for all your efforts on Friday - what a nice hack! :-)
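
To make the proposal concrete, here is a very rough sketch of what such a class might look like; the name PhoSimPredictor comes from the comment above, but every method and feature choice here is hypothetical.

```python
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

class PhoSimPredictor:
    """Predict PhoSim CPU time (seconds) from visit metadata."""

    FEATURES = ["moonalt", "moonphase", "dist2moon", "filter", "altitude"]

    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=200)

    def train(self, metadata_csv):
        df = pd.read_csv(metadata_csv)
        df = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]
        self.model.fit(df[self.FEATURES].values, df["cputime"].values)
        return self

    def predict(self, visit):
        """visit: dict of instance-catalog parameters for one planned run."""
        row = [[visit[name] for name in self.FEATURES]]
        return float(self.model.predict(row)[0])

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)
```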

@TomGlanzman (Contributor)

I hope @drphilmarshall's quoted range of "5000 to 15000 CPU hours ..." is really seconds. Or were you looking at integrated CPU hours for Twinkles-phoSim?

In terms of job scheduling, it will certainly be extremely useful to know the (approximate) CPU time ranges. However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis. Otherwise we are in the same situation as now: send all jobs to the same, maximum-time queue. And/or we arbitrarily make a cut so that any phoSim run that requires > NN hours of CPU is simply not attempted.

@cwwalter (Member)

I haven't had time to actually play with this yet but at NERSC they have:

http://slurm.schedmd.com/checkpoint_blcr.html

As far as I can tell this is done at a system/kernel level so you don't have to change the actual code.

@TomGlanzman (Contributor)

I tried exercising checkpointing a couple of years ago at NERSC but wound up frustrated because carver did not support this feature. Now, with both a new architecture (cori) and a new batch system (slurm), it is probably worth trying again...

@tony-johnson (Contributor)

> However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis.

@TomGlanzman, this feature has been built into the workflow engine since day one, and is used extensively in the EXO data processing. There should be no problem using this with your phosim task.

@cwwalter (Member)

@tony-johnson are you referring to BLCR? Do you have any simple examples of how to use it in a slurm file?

@tony-johnson (Contributor)

@cwwalter no, sorry - I was referring to Tom Glanzman's post above yours, about using the CPU time estimate to set the time required for a batch job. (GitHub needs to add threaded conversations.)

BLCR does look quite interesting; I read the FAQs linked from the page you referenced above, and it certainly seems as if it might be usable.

@jchiang87 (Contributor)

This seems to have been addressed by the hack day project at #177 and #178. I'll open a new issue for Phil's proposed PhoSimPredictor class.
