
Visit metadata analysis #159

Closed · drphilmarshall opened this issue Mar 4, 2016 · 24 comments

@drphilmarshall (Contributor)

In the PhoSim image generation master thread #137 (comment), @sethdigel is making some cracking plots showing the link between observing conditions and PhoSim run times. I think we should extend this a little, so we can look at how things like observed image quality and observed image depth are distributed, and how they depend on each other. Seth, @jchiang87 - how should we proceed? Should we start a validation module, put some functions in it, and make some more scripts for the workflow? Or focus on collating more metadata and making it available for analysis by more people? Is a PhoSim CPU time predictor a useful tool that we should try to put together?

@drphilmarshall (Contributor, Author)

Talking to @sethdigel, we decided to try extending his analysis into the murky world of scikit-learn at the Hack Day next week - we'll just aim for a nice notebook to start with, and can figure out scripts later. I guess we can put the notebook in examples/notebooks, although what we are talking about is not really an example...

Seth, here's the machine learning example notebook I showed you:

https://github.com/drphilmarshall/StatisticalMethods/blob/master/examples/SDSScatalog/Quasars.ipynb

In that repo there is also an introductory tutorial, and both are linked from lesson 9's notebook. I'd suggest that you fork and clone that repo, and try running the lesson 9 notebooks. And then we'll just need visit metadata in csv format for next Friday :-)
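
To set expectations for the hack, something like the sketch below is all the notebook needs to get started - assuming pandas, a hypothetical file name, and column names that still need to be confirmed once the real metadata file exists:

```python
import pandas as pd

# Placeholder file and column names -- to be replaced by the real Run 1 metadata.
data = pd.read_csv("run1_metadata.csv")
features = data[["moonalt", "moonphase", "altitude", "filter"]]
target = data["cputime"]            # what we want to learn to predict
print(data.describe())
```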

@drphilmarshall (Contributor, Author)

PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

@sethdigel (Contributor)

> PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

Done. I went for descriptive rather than catchy.

About 200 of the Run 1 visits are still running in the pipeline. These jobs have all accumulated about 4000 minutes of CPU time so far. By some time tomorrow all of the Run 1 jobs will have either finished or hit the 5-day run time limit, at which point a complete set of metadata for the Run 1 PhoSim simulated visits can be assembled.

@drphilmarshall (Contributor, Author)

Perfect! This should be fun :-)


@sethdigel (Contributor)

About 150 of the Run 1 visits are still running. I did the math wrong last night regarding how much longer they can go before hitting the CPU time run limit, which is 7200 minutes. The remaining jobs will not hit the CPU time limit until Monday morning.

Also, I am thinking that the analysis will need to fold in the type of batch host in the SLAC farm. Some generations of batch hosts do a lot more per CPU minute than others (and this may be the reason for the banding of CPU times for a given moonalt, moonphase, and filter). Probably we can at least normalize CPU times to a common scale.

The table below lists the 'CPU Factors' assigned in the LSF system to the various classes of hosts that have been used in generating Run 1. These presumably relate to relative speeds. About half of the jobs ran on hequ hosts and one third on fell hosts.

| Host class | CPU Factor |
| --- | --- |
| bullet | 13.99 |
| dole | 15.61 |
| fell | 11.00 |
| hequ | 14.58 |
| kiso | 12.16 |
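
If the CPU Factors really do track relative host speed, one rough way to put the measured CPU times on a common (say, fell-equivalent) scale is sketched below; the linear scaling with CPU Factor is an assumption, not something checked against the LSF documentation:

```python
# LSF CPU Factors from the table above; linear scaling with host speed is assumed.
CPU_FACTOR = {"bullet": 13.99, "dole": 15.61, "fell": 11.00,
              "hequ": 14.58, "kiso": 12.16}
REFERENCE = CPU_FACTOR["fell"]      # express everything in fell-equivalent seconds

def normalized_cputime(cputime_sec, host_class):
    """Scale a measured CPU time to the reference (fell) host class."""
    return cputime_sec * CPU_FACTOR[host_class] / REFERENCE

# e.g. 4000 CPU minutes on a hequ host -> ~318,000 fell-equivalent seconds
print(normalized_cputime(4000 * 60, "hequ"))
```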

@drphilmarshall (Contributor, Author)

Nice! Host class CPU Factor looks like an excellent new feature to add to the csv file, and we can decide on Friday whether to correct the CPU times or just throw in the CPU factor as a feature.


@brianv0 commented Mar 7, 2016

@sethdigel I think you'll want to double check this. SLAC discourages the use of CPU time limits (or named queues) and suggests users only specify a wall clock time; in fact, the bsub wrapper script SLAC maintains is supposed to divide a supplied CPU time by 5 and then just set that as the wall clock time. That number was arrived at by assuming a job might run on a fell machine (CPU factor ~10) and giving it twice the amount of time to run.

@sethdigel (Contributor)

Thanks, Brian. What we are looking at doing is studying how the actual CPU time for phosim runs (extracted from the pipeline log files) depends on some basic parameters in the phosim instance catalogs (like the altitude of the Moon). In a preliminary look at the Run 1 output, posted on issue #137, the CPU times for runs with similar parameters (moon altitude, moon phase, filter) appear to fall into two ranges separated by a constant factor. I have not looked into it quantitatively yet, but I was guessing that it might be due to the batch hosts not all being the same speed. I posted the CPU Factors for potential future reference because they were the closest thing I could find that looked like a measure of relative speeds, and because they were not particularly easy to find (it involved using the bhost command for specific hosts).

@sethdigel (Contributor)

About 40 phosim runs are still going. Almost all are about to reach the run limit. Three are runs that Tom restarted after they failed for some (probably transient) reason. It looks like about 105 of the runs either have timed out, or will. These represent 1.4 CPU years. The runs that finished used 3.5 CPU years.

I've made a csv file with the metadata from the phosim runs, collected from the instance catalogs and the log files from the pipeline. Here are the headings: obshistid, expmjd, filter, start, end, altitude, rawseeing, airmass, moonalt, moonphase, dist2moon, sunalt, cputime, hostname, runlimit. (start and end are the MJD starting and ending times of the runs, in case wall clock time turns out to be interesting; hostname is the first character of the batch host name; and runlimit is a flag for runs that hit the limit, which did not always occur at exactly the same CPU time.)

Where in github or Confluence would be the right place to put the file?

Here is an updated plot of CPU time vs. moon altitude, with the points color coded by filter and sized according to moonphase. The horizontal dashed line is the approximate run limit (5 days) and the vertical dashed line is at 0 deg moon altitude. The histogram has a linear scale and shows the distribution of moon altitude for the runs that are still going. The points with '+' signs hit the run limit (and so produced no phosim output).

[image: cpu_moonalt_comb]
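
This is not how the plot above was actually made, but a rough sketch of how something similar could be regenerated from the csv, using the column names listed above (units and styling are guesses):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("run1_metadata.csv")

fig, ax = plt.subplots()
sc = ax.scatter(df["moonalt"], df["cputime"] / 3600.0,       # CPU hours (assumed units)
                c=df["filter"], s=10 + df["moonphase"], cmap="viridis")
ax.axvline(0.0, linestyle="--", color="gray")                # Moon at the horizon
ax.axhline(5 * 24.0, linestyle="--", color="gray")           # approximate 5-day run limit
ax.set_xlabel("Moon altitude (deg)")
ax.set_ylabel("CPU time (hours)")
fig.colorbar(sc, label="filter (0-5 = ugrizy)")
plt.show()
```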

@drphilmarshall (Contributor, Author)

Nice! K nearest neighbors or Random Forest are going to clean up on the CPU time prediction, I think. Can you post this plot and two bullets of text to the Twinkles slides at https://docs.google.com/presentation/d/1MdGGDrITW4-n04goJNYBAoVwnQBjRkibxuJY8EJZWXc please? Good to chat about this in our session.

I'd say just put the data file in your public_html for now, and post the URL to this thread. We can pull directly from there in our hack notebook. Thanks Seth! :-)


@sethdigel (Contributor)

The metadata for the Run 1 phosim runs are in this file: http://www.slac.stanford.edu/%7Edigel/lsst/run1_metadata.csv

Run 1 has 1227 observations of a Deep Drilling Field at RA, Dec = 53.0091, -27.4389 deg (J2000). As of this writing 4 of the simulation runs are incomplete; they are flagged as described in the table below. In each case, these runs crashed for some (probably) transient reason and were restarted by Tom.

| Column | Description |
| --- | --- |
| obshistid | OpSim designator of the visit |
| expmjd | MJD of the (simulated) observation |
| filter | 0-5 for ugrizy |
| rotskypos | angle of the sky relative to camera coordinates (deg) |
| start_run | starting time of the phosim run on the batch farm (MJD)* |
| end_run | ending time of the phosim run on the batch farm (MJD)* |
| altitude | elevation of the observing direction (deg) |
| rawseeing | seeing at 500 nm (arcsec), a phosim input |
| airmass | airmass at the altitude of the observation |
| moonalt | elevation of the Moon (deg) |
| moonphase | phase of the Moon (0-100) |
| dist2moon | angular distance of the Moon from the observing direction |
| sunalt | elevation of the Sun (deg) |
| cputime | CPU time required for the phosim run (sec)* |
| hostname | first character of the name of the batch host that ran the job (b, d, f, h, k)* |
| runlimit | flag indicating whether the phosim run was terminated at the 5-day execution time limit (1 = yes) |

* An x in the hostname column or a negative number for CPU time indicates that the phosim run is still executing. These jobs also have 0 as the start and end time.
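
For anyone following along, something like this sketch (untested; flag conventions taken from the table above) pulls the file and drops the incomplete runs:

```python
import pandas as pd

url = "http://www.slac.stanford.edu/%7Edigel/lsst/run1_metadata.csv"
df = pd.read_csv(url)

# Runs still executing have hostname 'x' and a negative cputime; runs that hit
# the 5-day limit have runlimit == 1.  Keep only the completed runs.
complete = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]
print(len(df), "visits in total,", len(complete), "with completed phosim runs")
```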

@drphilmarshall (Contributor, Author)

Excellent! We are all set.


@sethdigel (Contributor)

Phil's machine learning example notebook runs for me. This is sort of a Hello World plot showing some of the metadata for Run 1. So, yes, I think it will work.
[image: run1_everything]
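
Just to note one way such an everything-against-everything overview can be made (not necessarily how the plot above was produced): pandas' scatter_matrix does the job in a couple of lines.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("run1_metadata.csv")
cols = ["moonalt", "moonphase", "dist2moon", "altitude", "airmass", "cputime"]
pd.plotting.scatter_matrix(df[cols], figsize=(10, 10), alpha=0.5)
plt.show()
```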

@drphilmarshall (Contributor, Author)

Oh cool! :-) This is great, Seth. Let's spend some time this morning staring at the full plot, to get some feel for what is going on. This is of course not part of the traditional machine learning development flow but sod it, we're physicists. Then I think we can just step through the notebook, editing both the markdown and the python to train the KNN model and make some predictions :-)
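
In outline, the KNN step might look something like the sketch below; the feature choice, scaling, and train/test split are all placeholders to be settled in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("run1_metadata.csv")
df = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]      # completed runs only

X = df[["moonalt", "moonphase", "dist2moon", "filter", "altitude"]].values
y = df["cputime"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("KNN R^2 on held-out visits:", knn.score(X_test, y_test))
```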


@rbiswas4 (Member)

Incidentally, whatever methods are being used here can also be used to study correlations of five sigma depths with other columns in OpSim. Obviously these are simulated, but it might be a fun project to do exactly the same thing with the OpSim outputs, with the interesting variables being fivesigmadepth or Seeing. At least that exercise would give me some insight into stuff that observers have a good intuition for already. I can get the OpSim output in the form of a dataframe, and I propose we combine with the group looking into observing strategy today.

@drphilmarshall (Contributor, Author)

Good idea.

Seth, I would bet something expensive (your bike?) that Random Forest is going to work better than KNN on this problem, especially if we don't split the data by filter (and we don't have time for such sensible things today). But still, let's get KNN working first and then unplug it and replace it.
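
Swapping regressors should then be a near one-line change; reusing the training/test split from the KNN sketch above, something like this (hyperparameters are just placeholders):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)      # same X_train, y_train as in the KNN sketch
print("Random forest R^2 on held-out visits:", rf.score(X_test, y_test))
```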


@drphilmarshall (Contributor, Author)

@sethdigel, @humnaawan, and @tmcclintock: nice work on Friday! Reproducing your main result here:

From this plot it looks to me as though you are able to predict CPU time to about +/- 0.2 dex (95% confidence, very roughly), no matter what the absolute CPU time is (although it would be good to be more precise about this). 0.2 dex corresponds to about 50% uncertainty, or ranges like "5000 to 15000 CPU hours." I think this could be useful ( @TomGlanzman can comment further ), and we didn't even get into extending the data with more OpSim or PhoSim parameters.

The next step could be to extract the ML parts of your notebook and repackage them into a PhoSimPredictor class, which could be trained and then pickled for use before every new PhoSim run to determine which queue to use. Again, we should be guided by Tom here. Let us know if you're interested in helping with this! And thanks for all your efforts on Friday - what a nice hack! :-)
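
To make the proposal concrete, here is a very rough sketch of what such a class might look like; the name PhoSimPredictor comes from the comment above, but every method and feature choice here is hypothetical.

```python
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

class PhoSimPredictor:
    """Predict PhoSim CPU time (seconds) from visit metadata."""

    FEATURES = ["moonalt", "moonphase", "dist2moon", "filter", "altitude"]

    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=200)

    def train(self, metadata_csv):
        df = pd.read_csv(metadata_csv)
        df = df[(df["cputime"] > 0) & (df["runlimit"] == 0)]
        self.model.fit(df[self.FEATURES].values, df["cputime"].values)
        return self

    def predict(self, visit):
        """visit: dict of instance-catalog parameters for one planned run."""
        row = [[visit[name] for name in self.FEATURES]]
        return float(self.model.predict(row)[0])

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            return pickle.load(f)
```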

@TomGlanzman (Contributor)

I hope @drphilmarshall's quoted range of "5000 to 15000 CPU hours ..." is really seconds. Or were you looking at integrated CPU hours for Twinkles-phoSim?

In terms of job scheduling, it will certainly be extremely useful to know the (approximate) CPU time ranges. However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis. Otherwise we are in the same situation as now: send all jobs to the same, maximum-time queue. And/or we arbitrarily make a cut so that any phoSim run that requires > NN hours of CPU is simply not attempted.

@cwwalter (Member)

I haven't had time to actually play with this yet but at NERSC they have:

http://slurm.schedmd.com/checkpoint_blcr.html

As far as I can tell this is done at a system/kernel level so you don't have to change the actual code.

@TomGlanzman (Contributor)

I tried exercising checkpointing a couple of years ago at NERSC but wound up frustrated because carver did not support this feature. Now, with both a new architecture (cori) and a new batch system (slurm), it is probably worth trying again...

@tony-johnson (Contributor)

> However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis.

@TomGlanzman, this feature has been built into the workflow engine since day one, and is used extensively in the EXO data processing. There should be no problem using this with your phosim task.

@cwwalter (Member)

@tony-johnson are you referring to BLCR? Do you have any simple examples of how to use it in a slurm file?

@tony-johnson (Contributor)

@cwwalter no, sorry - I was referring to Tom Glanzman's post above yours, about using the CPU time estimate to set the time required for a batch job. (GitHub needs to add threaded conversations.)

BLCR does look quite interesting; I read the FAQs linked from the page you referenced above, and it certainly seems as if it might be usable.

@jchiang87 (Contributor)

This seems to have been addressed by the hack day project at #177 and #178. I'll open a new issue for Phil's proposed PhoSimPredictor class.
