Refinements to the real galaxy calculations #168

Closed
rmandelb opened this Issue · 68 comments

7 participants

@rmandelb
Owner

Now that we have a way of simulating galaxies in HST training data, I'm opening this issue as a place to collect suggestions for changes to that part of the code. There are a number of obvious things to be done:

  • A thorough exploration of timing in realistic scenarios (shears, target PSFs, etc.), including tradeoffs between time spent in calculation vs. time spent on i/o, etc.

  • Adding more information to the training data catalogs as needed. For example, if we want to start whitening the noise in the original HST images, then we need to save some information about the noise covariances -- at minimum, a number for the diagonal variance and a few numbers for the level of correlations as a function of pixel separation.

  • Tests of the accuracy of shearing and dependence on interpolation kernel.

  • Decisions about how to handle 1000x as many training galaxies as have been included now.

  • Tests of the impact of some choices I made in putting the data into the repository. For SHERA, I saved large postage stamps that were padded with noise. Here, I've saved small postage stamps that are trimmed based on sextractor segmentation maps, and when SBProfile chooses how much to pad them, they get padded with zero, which can lead to some image artifacts that are visually jarring and that would complicate the process of dealing cleanly/consistently with the noise in the original HST images.

  • Exploration of the possibility of photon-shooting real galaxies. My preliminary tests suggest that the deconvolved galaxies look pretty bad (way too much ringing etc.), noticeably worse than the pseudo-deconvolved versions from SHERA. It might be worth my making a SBPseudoDeconvolve class that carries out pseudo-deconvolution, so we can check whether those would be easier to photon-shoot.

Other suggestions are welcome!

@rmandelb rmandelb was assigned
@rmjarvis
Owner

Another point to investigate relevant to this is how to keep the pyfits calls reasonably efficient without going crazy on the memory when you have a lot of files to read. For the use case demoed in MultiObjectDemo.py script 3, where 100 objects all come from a single file, calling pyfits.getdata for each one was very slow. I added the preload() method to read the file once and store the hdu list in memory, but this won't always be a viable option.

It would be worth figuring out exactly why pyfits.getdata was slow. e.g. is it just because of disk I/O overheads of opening and closing the file 100 times? Or is pyfits really reading the full file just to get a single HDU out, so it had to do that 100 times? etc. If the latter, then that might suggest we should save the real galaxies each in their own file, so the read command only reads the file we actually want. If the former, then we do want to store a significant number of them in multi-extension fits files and preload at least a file's worth at a time to avoid the repeated file accesses.
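To be concrete, the two access patterns being compared look something like this (the filename and HDU indices here are placeholders, not the actual RealGalaxyCatalog interface):

import pyfits

# Slow pattern: every call re-opens the file and walks through its HDUs.
def read_one_at_a_time(filename, hdu_indices):
    return [pyfits.getdata(filename, i) for i in hdu_indices]

# Preload-style pattern: open the file once, keep the HDU list, index into it.
def read_with_preload(filename, hdu_indices):
    hdu_list = pyfits.open(filename)
    try:
        return [hdu_list[i].data.copy() for i in hdu_indices]
    finally:
        hdu_list.close()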

@rmandelb
Owner

Yes, this is exactly the kind of thing I was hoping to cover with this issue. Another question is even if pyfits is reading in the whole file to get out a single HDU, perhaps there is some other way to get pyfits to not do that, so we wouldn't have to preload OR store everything in separate files. I don't know enough about pyfits to know if this is an option.

This will be especially relevant when we have 100k training galaxies!

@rmandelb
Owner

I'd like to revisit this issue since it's part of the 4th milestone, albeit at slightly lower priority than other items. We should decide which bits to prioritize for now and which could wait. I'm pinging the people who were on the telecon and/or expressed interest in working on the code for a while: @barnabytprowe @TallJimbo @rmjarvis @gbernstein @pmelchior @joezuntz @joergdietrich @rearmstr @dkirkby (with apologies to those who aren't interested in this particular issue).

Going down the list above with an emphasis on what seems important to do before a release of the code within GREAT3 and what tasks seem straightforward vs. which are open-ended:

1) timing for various shears and target PSFs: I think that for the initial release, we don't have to worry too much if the code is a factor of a few slower than we'd like for certain shears and target PSFs when doing calculations with realistic galaxies. We do have to worry about timing issues related to i/o that could be debilitating (see below), but let's treat that issue separately.

2) adding more info to the catalogs if needed: I'd say let's postpone this until we need it!

3) tests of accuracy of shearing as a function of interpolation kernel: this is super important for the challenge or for any use of the code for the basic shear tests. It's also potentially quite difficult to test and has the potential to be a complex and open-ended task.

4) decisions related to large training samples: right now we have 100 galaxies in the repo just as a basic test. Ideally I'd like to be able to handle the I<22.5 sample of 26k galaxies when we release this within the GREAT3 collaboration for testing. This will require investigation of time spent on i/o depending on how we store the sample (e.g., separate files vs. one large file) and possibly how we use pyfits to access the data. My feeling is that this is relatively important and also not as open-ended / complicated as (3). It's also a bit boring and technical, but that's life....

5) image padding: the padding of the images by zeros can lead to visually jarring artifacts in the simulated images. While this isn't an issue for accuracy of simulated galaxies, it looks kind of cruddy. So I'd say it's not as important ultimately as (3), but it is moderately important. I could investigate having SBProfile pad with a noise field, and I don't think that would be too hard (hopefully those aren't famous last words). However, the amount of padding itself can lead to accuracy issues depending on the interpolation kernel, so that is more important to check into, even if we end up leaving the padding as zeros.

6) photon-shooting investigations: this could be quick or could be a very long / drawn-out / technical investigation, depending on how things go. The advantage here is that shearing is exact, so we don't have to worry about issues related to accuracy of shearing via interpolation.

Tentatively I'd propose that for this milestone, we try to tackle (4), (5), then time-permitting (3) and (6), and not even attempt to get to (1) and (2) until after the initial release of the code in Sept. Thoughts? Suggestions?

@rmjarvis
Owner

I would say that 4 is the most important to get done soon.

About photon-shooting (6), we've determined that photon shooting gets the noise in the image wrong if the interpolation kernel has negative regions (i.e. Cubic, Quintic, Lanczos). So if we want to use photon shooting with real galaxies, we should either make sure that the sky noise dominates, or use Bilinear interpolation. Or shoot way more photons than you normally need and then add noise back in, but that kind of defeats the purpose.

@rearmstr
Collaborator

I'd be willing to investigate (4). It seems like a relatively simple task to start getting more familiar with the code.

@rmandelb
Owner

I agree that (4) is relatively simple. How comfortable / familiar are you with pyfits? (We might want to try different ways of having pyfits access the data.)

The only caveat is that to try (4), you might want a larger dataset than the current 100, and you might want that dataset in a few different forms (different HDUs in big files, lots of separate files, etc.). I have all the training data and scripts to put it in various formats. So one possible working model would be for you to request what you need from me for these tests, and I can send it to you. Another option is that you'll have to download the dataset of 26K training galaxies, and I can hand over my scripts that I used to make the catalog of 100 galaxies (IDL, I'm afraid). This dumps more of the work on you since you'd have to learn my scripts, but OTOH it means you're not limited by my ability to promptly send you data that you request. What do you think? What is your time availability like?

@rearmstr
Collaborator

I've only used the basic functionality in pyfits for reading in images and catalogs.

I think it sounds easier for me to download all the training set and learn to run your scripts. It seems better to be able to generate different formats without having to keep asking you. I should have time next week to devote to this. Where can I get the data?

@rmandelb
Owner

You can download the data from IPAC:
http://irsa.ipac.caltech.edu/data/COSMOS/images/shera_galaxy_postage_stamps/index.html

(check out the README, and download the tarball with the postage stamps, but you don't need the tarball with weight maps)

I'm on a telecon so I can't pull my scripts together, but I will try to do that for you tonight. It'll take a while for you to get all that data on your machine, anyway.

@rmandelb
Owner

Sorry, I didn't finish that thought. If you have time for this during this week, then I propose to try to get you everything you need today. In addition to downloading the data and trying the scripts I send you later, you'll need to install GalSim and play with it a bit using the existing training data that is in there already. Perhaps we could aim to have some answers to question (4) by the end of July, since having a way to handle larger training datasets will facilitate a lot of the other work on this issue?

@rmandelb
Owner

... and I just realized that what I stuck into the repo already depends on additional data that wasn't released on IPAC. I used the sextractor segmentation maps to trim the postage stamps so they wouldn't be so huge, relying on SBProfile to pad them after reading them in. So if you want to run my scripts, I'll need to package up that data too.

@rearmstr
Collaborator

How difficult is it to package the additional data? If getting everything in place for me to run the code is more effort than you generating things yourself, then it doesn't seem to make sense.

I think the end of July sounds like a reasonable time frame to have some answers.

@rmandelb
Owner
@joergdietrich
Collaborator

A free dropbox account should have more than enough storage to share these files.

@joergdietrich
Collaborator

I just realized they are a bit smaller than I remembered, but a few referrals among ourselves should add the required space. So, don't create accounts yet if you don't already have one.

@rmandelb
Owner

I already have a dropbox acct but I don't have room for 10 GB in it. And I thought my CMU webspace was limited to less than that, but now I can't find documentation that gives any limit, so I will have to try it today once I get into the office.

@rmandelb
Owner

When I estimated the size of the file, I completely forgot about the fact that segmentation maps compress much better than regular images so they take up much less space. I just stuck them in my dropbox, and sent an invitation to Bob. If you have the images from IPAC and the segmentation maps from my dropbox, you have most of what you need. The rest are just catalogs and scripts, which I will package up and send today/tonight.

@rmandelb
Owner

Bob: some more info for you - I was just packing up my scripts to send to you, plus commenting them a bit more thoroughly, when I realized that they rely on another IDL package: http://www.astro.caltech.edu/~rjm/shapelets/code/
They rely on it only because that code has tools for handling sextractor catalogs, and once I had the shapelets code installed I figured why not rely on it for this. But if you don't have that software, you would need to either install it, or put in your own way for dealing with sextractor.

Anyway, I think that everything you need is now in the dropbox along with a README, but please let me know as questions arise.

@barnabytprowe

Hi Rachel,

Apologies for coming late to this discussion. I agree with your broad designations:

Short term: (4) & (5)
Medium term (but v important): (3) & (6)
Longer term: (1) & (2)

My own feeling with (5) would be that we should maybe approach this from the other direction: restore the postage stamps to full size as our working set and see how much we can shrink them or what effects we get from replacing their noise fields with our own. I guess I'm a little unsure that I see the logic behind the statement:

5) image padding: the padding of the images by zeros can lead to visually jarring artifacts in the simulated images. While this isn't an issue for accuracy of simulated galaxies, it looks kind of cruddy.

If we can see boxy artifacts, is that not a cause for concern? It seems to me to rely on a kind of detailed balance around an assumed-zero background to be truly safe...

@rmandelb
Owner
@rearmstr
Collaborator

I have some initial results on trying to use a larger number of training galaxies. I created a sample of 1000 galaxies and split the images into different numbers of files: 10, 40, and 100. I then ran a modified version of RealDemo.py to see how long it took to create a galaxy on average. I also include how long it took if I use the preload option.

Preload = 0.03 sec/gal
N10 = 0.25 sec/gal
N40 = 0.07 sec
N100 = 0.04 sec

I used the cProfile module to see where the most time was being spent. I noticed that when it was getting data from a file, it would cycle through all the headers to potentially do a checksum. This was with a rather old version of PyFits, v2.4. I upgraded to the latest version, v3.0.8, because I could see that there had been a significant refactoring of the code and it didn't include this explicit step. Unfortunately, the times were worse with the latest version.

Preload = 0.03 sec
N10 = 0.4 sec
N40 = 0.09 sec
N100 = 0.05 sec

The newer version has more advanced python usage, but the profiler still shows a majority of the time is spent reading headers. I looked through the code to see if you could read a file without explicitly reading all the headers, but I didn't see anything. Perhaps we should write one ourselves?
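For reference, here is a rough sketch of the kind of comparison I ran, not the actual modified RealDemo.py (the filename is a placeholder, and unlike the numbers above this version folds the preload overhead into the average):

import time
import pyfits

def time_reads(filename, n_gal, preload=False):
    """Average per-galaxy read time for one multi-extension file."""
    start = time.time()
    if preload:
        hdus = pyfits.open(filename)          # headers are read once up front
        for i in range(n_gal):
            data = hdus[i + 1].data           # extension 0 is the primary HDU
        hdus.close()
    else:
        for i in range(n_gal):
            data = pyfits.getdata(filename, i + 1)  # re-opens the file every time
    return (time.time() - start) / n_gal

# To see where the time goes:
# import cProfile, pstats
# cProfile.run('time_reads("real_galaxy_images_0.fits", 100)', 'read_stats')
# pstats.Stats('read_stats').sort_stats('cumulative').print_stats(15)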

@rmandelb
Owner
@rearmstr
Collaborator

Rachel,

Your numbers for the "preload" option are if you have all 1000 galaxies in one file, just like the current training sample has all 100 galaxies in one file?

It doesn't really matter how many galaxies you have per file, if you use preload the timing is the same. For these times, I did not include the initial overhead read. For 1000 galaxies it was a few seconds.

It seems like long term we want the (b) and (c) combo. For the next milestone, if we want to include all 26k galaxies the preload time is ~30 sec. Is this acceptable? Should we try to investigate writing a new read routine or put it off till later?

@rmandelb
Owner

I just thought of one more consideration - at some point, preload won't be an option because there won't be enough RAM to have everything sitting in memory. The training sample of 100 galaxies takes up 6M (for galaxy + PSF images). If we have 26k galaxies, that would be 1.6G! My laptop is 4 years old and has 4G of RAM, which I think is sort of typical. So that would kind of work, I guess, but it's getting close to the limit. Eventually we'll have >4x as much training data, so it will no longer be a viable option to preload the entire training dataset.

Furthermore, this is with trimmed postage stamps. If our investigations suggest that we cannot trim them as severely as I've done here, then even for the 26k, preloading won't be an option.

Are there any other considerations that I'm missing?

@rearmstr
Collaborator

If I understand how it works, it actually doesn't load the data into memory, only the headers. They are fairly small so I don't think it makes a big impact.

@rmandelb
Owner

That wasn't my understanding. I thought it actually sucked everything in, hence Mike's comment in the dox that "There are memory implications to this, so we don't do this by default." @rmjarvis , have I misunderstood?

@rmjarvis
Owner

Preload definitely loads in all the image data. If you look at the getGal function, you can see that if the catalog is marked as having been preloaded, then it doesn't use any pyfits command to return the data. It's all in memory already.

@joergdietrich
Collaborator

My understanding of pyfits internals is different. Pyfits copies data into memory only when an HDU's .data is accessed. I.e., for preloaded files this only happens in getGal(). According to the pyfits user manual:

'''
When a FITS image HDU’s .data is accessed, either the whole data is copied into memory (in cases of NOT using
memory mapping or if the data is scaled) or a virtual memory space equivalent to the data size is allocated (in the case
of memory mapping of non-scaled data).
'''

A quick test loading a large fits file confirms that the resident memory of the python process only increases once the .data is accessed; pyfits.open() is not enough.

If this is still a problem, we might want to consider opening fits files with memmap=True, which is not the default and should load only those parts into memory that are actually needed. You still need enough virtual memory for the worst case scenario of accessing all data at once.
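For concreteness, the lazy pattern looks like this (the filename is a placeholder):

import pyfits

# Headers are parsed on open, but pixel data are memory-mapped rather than
# copied until an HDU's .data attribute is actually accessed.
hdu_list = pyfits.open('real_galaxy_images.fits', memmap=True)
header_only = hdu_list[3].header    # cheap: no pixel data touched yet
pixels = hdu_list[3].data           # this access triggers the (virtual) load
hdu_list.close()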

@rmjarvis
Owner

OK. You're probably right. I didn't do any tracking of the memory usage through the program or anything. I just assumed that if we were simply accessing a .data member, then it already had to be in memory. But of course, pyfits can use some OO magic to hide a file operation behind the scenes.

@rmandelb
Owner

Okay, so what you're saying is that preloading includes the time-consuming loop over all the HDUs that pyfits.open does, so that getGal or getPSF can simply read in the image without having to do that. And this means that preloading can speed everything up without taking up insane amounts of RAM.

IMO, if we're thinking about making simulations with millions of galaxies, and we have a training sample of ~100k galaxies, then taking 30s to a few minutes to pre-load is a really small price to pay if the timing per galaxy gets decreased by factors of many, which seems to be the case whenever we have a substantial number of training set galaxies grouped together into files. In that case, then when running a simple script to simulate small numbers of galaxies, we would not want to preload, but otherwise we would.

But, I would like to make a well-motivated decision about how to do this for a standard use case, which means we have to decide what is a standard use case. In my head, the case we most worry about is the one I mentioned above, where people want to test shear recovery for realistic galaxies to high precision, requiring a training sample of 100k (presumably grouped into chunks that are written to file together) that would get reused to make millions of simulated galaxies. And then for that use case, we want to make sure that time spent on i/o and the amount of required RAM are minimized. So, if we might have 100k training galaxies, could it be possible that even the small amount of memory required to preload the headers will be too much? How much space does a single one of those headers take, so we can figure out whether the training sample we'll have in a few months is too much to preload? I would like to think that far ahead.

Or, is there another standard use case that could be problematic and might require different considerations?

@barnabytprowe

A very informative and interesting discussion... I hadn't realised that Pyfits could be made to work in this 'preload' fashion, and it seems to suit our needs rather well.

So, if we might have 100k training galaxies, could it be possible that even the small amount of memory required to preload the headers will be too much?

So, FITS headers are always "a multiple of 2880 bytes (equivalent to 36 80-byte keywords) long." (from the FITS primer http://fits.gsfc.nasa.gov/fits_primer.html). Let's assume a fairly big header with 144 keywords, for sake of argument. That's ~12 kB ~ 1e4 B. Multiplying that by 1e5 for a full set of galaxies gets us to a GB.

So, it will be close to our comfort zone on a laptop. But hardly out of the question. We can also probably allow users to optionally prune away much of this unwanted header information - 144 keywords is a lot.

@rmandelb
Owner

Right now the headers are quite minimalistic, i.e. just like this:

XTENSION= 'IMAGE   '           /Image Extension created by MWRFITS v1.4a        
BITPIX  =                  -32 /                                                
NAXIS   =                    2 /                                                
NAXIS1  =                   47 /                                                
NAXIS2  =                   69 /                                                
PCOUNT  =                    0 /                                                
GCOUNT  =                    1 /                                                
END                                                                             

not 144 keywords, so even with 1e5 training galaxies we'd be okay to preload (and, duh, I could have checked this earlier but apparently I wasn't using my brain this morning... sorry).

But this actually raises another point about what information we want to carry around where. For example, I ditched the WCS that came with the original COSMOS images, figuring that it's just a training galaxy and we don't need to carry around all the information about where it lives. But, in the catalogs, I have saved the COSMOS ID so that if someone does want to know more about the training galaxy, they can look it up. At some point if we have fits from Claire, we will want to save the parameters of those, so people can easily reconstruct a model fit for the galaxy. Right now the catalog is not large... it's 17K for the 100 galaxies in the repo, which scales up to a manageable size even with 1000x as many galaxies. But if we decide we need tons of other data, it could get a little crazy, so we should be careful in considering what info must be saved.

@barnabytprowe

Yes. It will be better for us as users too if we can force ourselves to be somewhat economical with what info we carry around. Long, full catalogues are great but I find you start to get diminishing returns once you begin forgetting what information they contain :)

Just a quick caveat is that even if not a full multiple of 36 keywords are used, the FITS format nonetheless allocates blocks of 2880 bytes and not smaller, so the remaining keywords are simply padded. This means, at best, we still have ~300MB of header being carried around for 1e5 galaxies, even if each header were only 7 keywords long.
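As a quick sanity check on that number (assuming one image HDU per galaxy; this is the on-disk size, and pyfits Header objects will carry some extra overhead in memory):

BLOCK = 2880                         # FITS header unit: 36 cards x 80 bytes each
n_gal = int(1e5)
print(n_gal * BLOCK / 1e6, "MB")     # ~288 MB of header for one HDU per galaxy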

@rmandelb
Owner
@barnabytprowe

As long as we can try to have a clear, relatively easily-remembered divide between what info is in the headers and what will come from external catalogues I'd be happy.

One way to do that would be to put a selection of the COSMOS info we think is most relevant into the header, up to 36 keywords. This is sort of the background info - the fits from Claire would then be "derived" info in a separate catalogue.

Another option would be to make our header as "GalSim specific" as possible, storing all the information (sky background, noise info, PSF info, Claire's fitting info) required to allow us make a pretty good simulation of the postage stamp in question. This also captures a lot of the properties of the postage stamp, but not all that we would necessarily want from COSMOS (e.g. RA, DEC etc.)

Finally, we have the option of putting any amount of information into a FITS binary table extension, although this may then need its own overhead for I/O and while neat might be just more of a pain than having a good catalogue.

@rmandelb
Owner

Let me summarize this discussion:

a) We're going to stick with preloading, since it (i) is really fast and (ii) doesn't require us to carry around all the image data in memory, just the headers.

b) In principle, we still don't want to have just 1 huge file with images, because they would be very large, and people who want to just access a few objects (without the ~1min overhead of preloading) would then be limited by the need to find one image out of many HDUs. Plus, enormous files are just unwieldy for people to play with and pass around. We could perhaps imagine storing the images in groups of 1k, which means our current training sample would have 26 files and the eventual full one, ~100. This is very much up for discussion IMO, so other opinions are welcome.

c) I propose that we should have the FITS catalog have all the basic COSMOS information, and the FITS headers for the images will have GalSim-specific info, like galaxy model fit parameters.

d) We need a place to store these files. We're not putting 26k galaxy images into the repo! Given the current 7M for 100 galaxies, we would have 1.8G for 26k galaxies. Barney suggested that we open a GalSim-related dropbox which is fully public (at least for reading :) and tell people to get the data from there. This seems like a good solution to me, at least for now considering that we don't have the GREAT3 server set up, and probably won't for a few months. We also have some $ that could be used to maintain a non-free Dropbox site with more space if we expand the training dataset substantially (but at that point we might want to rely on people downloading from our server, which will be in place by that time).

Implementing the above (a)-(d) would take care of the most pressing order of business related to real galaxies. Well, we might also want to do something additional in the dox, like specifically recommending that people preload when they are simulating more than N real galaxy images, for some N. But that's minor.

However, we have other action items in this issue (5 of them!), some of which might involve changing the postage stamps themselves (for example, the current ones are trimmed pretty severely to contain no blank space). And we will be getting galaxy model fit parameters which we'll put into the headers. So the question is, will any of those changes take place soon enough that it's not worth our doing (a)-(d) now and getting all that data into the repo for our first release of GalSim within the GREAT3 collaboration next month, but instead waiting until we do the other action items?

I'm busy with variable shear fields, and will be for the foreseeable future. @rearmstr , my understanding was that you were available mainly in July, and we're well into August. Do you see yourself wanting to address any of the other "real galaxy" action items, or are you done at this point? (which I would completely understand...) If you don't have the time/inclination to address any other action items in this issue, then unless someone else volunteers, I propose that we should do (a)-(d) now, and take a short break on this issue. And the question for you, Bob, is do you have time to do (a)-(d) now that you already have scripts set up to make a bunch of files with training set galaxies? If not, I completely understand and will do it myself, but if you do, then it would be helpful if you could do that. In either case, I really appreciate the tests that you did, as I really hadn't realized just how ideal this preloading option was in terms of speed and memory usage! Some very useful lessons about pyfits here...

@rearmstr
Collaborator

I'd be happy to do (a)-(d).

We could perhaps imagine storing the images in groups of 1k, which means our current training sample would have 26 files and the eventual full one, ~100. This is very much up for discussion IMO, so other opinions are welcome.

c) I propose that we should have the FITS catalog have all the basic COSMOS information, and the FITS headers for the images will have GalSim-specific info, like galaxy model fit parameters.

Do we know which information we want to keep in the headers? Is all the information we need in the set of files you pointed me to, or do we need other external data?

Well, we might also want to do something additional in the dox, like specifically recommending that people preload when they are simulating more than N real galaxy images, for some N. But that's minor.

I'm not sure there is a case for not preloading if we have ~1K galaxies per file. In that case each time you want to simulate one galaxy you will have to read all 1K headers.

I'd also like to address some of the other remaining issues here. Based on the discussion above, it seems like issue (5) is the most pressing, followed by (3) and (6). It seems reasonable to at least do (5) before the next milestone and perhaps begin to investigate the remaining two.

For (5) there seem to be a couple of different directions already discussed: 1) modify SBProfile to pad with noise instead of zeros, and 2) keep more of the original postage stamp. Are there other avenues I should explore?

@rmandelb
Owner

Cool, thanks.
For the headers, I think we currently don't have anything. The main thing would be galaxy model fit parameters, which are 1/2 done but not complete, and probably won't be complete for the next ~2 months due to the fact that the person running the code is moving internationally. So for now, we'll just have the FITS catalog with the basic info.

The case for not preloading: here was my calculation -
If we want to simulate just a few galaxies (let's say M) as a test case, but we preload, then the time it takes with 26k training galaxies is
30 (preload) + 0.03M seconds

If we don't preload, then the time taken is the time per galaxy to read in the 1k headers in whichever training sample file the galaxy lives in, i.e.
(~1s)M

So for tests with <30 galaxies, preloading doesn't make sense in this scenario with 1k galaxies per file. But 1k is a number I pulled out of nowhere (based on vague considerations- that it's too annoying to have a file per galaxy since that's tons of files, but extremely huge files are unwieldy to pass around) so if we want to consider other choices of galaxies per file then we might revisit this.
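Written out as a tiny calculation, using the rough figures quoted above:

# Break-even number of simulated galaxies M between preloading and not,
# for 30 s preload, 0.03 s/gal after preloading, ~1 s/gal without it.
t_preload, t_with, t_without = 30.0, 0.03, 1.0
m_break_even = t_preload / (t_without - t_with)
print(round(m_break_even, 1))        # ~30.9, i.e. below ~30 galaxies, skip the preload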

As for the other issues: I do agree with you about the priority. For (5), those are the first things I would consider. There is a problem which is that we don't have a way of generating a correlated noise field (yet). We will need this functionality, so the question is whether it is worth developing now. For Shera, I sort of cheated and used bits of sky from COSMOS data, but this isn't a great solution, and it seems that instead of putting in a kludge like that and then revamping it later with truly random correlated noise fields once we have them, it would be better to do this properly from the start.
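Since Barney's routines aren't in the repo yet, here is a generic sketch of the matrix manipulation involved, purely as a numpy illustration rather than the routine we would actually ship (a brute-force npix x npix covariance is only feasible for small stamps):

import numpy as np

def correlated_noise_field(cov, shape, rng=None):
    """Draw one Gaussian noise field with the given pixel-pixel covariance.

    cov is the (npix x npix) covariance matrix for the flattened image.
    """
    rng = rng or np.random.RandomState()
    chol = np.linalg.cholesky(cov)             # cov = L L^T
    white = rng.normal(size=cov.shape[0])      # unit-variance white noise
    return np.dot(chol, white).reshape(shape)  # correlated noise, back to 2D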

@rearmstr
Collaborator

I think it makes sense to do things properly from the start; it seems to save time in the end. I don't have any experience with generating correlated noise fields. How big of an effort do you think this would be?

@rmandelb
Owner
@rearmstr
Collaborator

Good. It sounds like the way to proceed is to write the correlated noise routine in C++ (using TMV) and allow SBProfile to use this to pad the images.

Can you point me to Barney's python routines that do this?

@rmandelb
Owner

@barnabytprowe , can I put those correlated noise routines from python into the repo, or at least send them directly to Bob?

Bob, Barney is away now so he might not be able to reply quickly. If he can't, then I will simply write a summary of what matrix manipulations are needed.

And... sorry to keep doing this, but I realize one thing I forgot: we need to decide on a noise covariance matrix. This means that we need to decide on one for each of the HST training galaxies, which don't have exactly the same covariance matrix. So, a few thoughts:

1) we have to make sure we get the per-galaxy noise covariance matrix (they won't be quite the same, because there are several causes of correlated noise, some isotropic and some anisotropic, some depending on the point in the focal plane, etc.; of course most will be quite similar).

2) if we have postage stamps with lots of noise around the galaxies we can estimate it from the pixel values by masking out the galaxy (which i think Barney also has some code to do), but the covariance matrices will be quite large and, of course, with a finite correlation length, will have many values that are just zero. So what is the best way to carry this info around? We could imagine running a routine to get the full noise covariance for each real galaxy, then parameterizing it in terms of just a few numbers (e.g.: diagonal covariance, and then a few values of the correlation for various distances, with correlations assumed to be zero for larger distances). Those few numbers could get saved in the image header, and then we could have a routine that reconstructs the full covariance matrix from them for the generation of padded noise.

3) The operation in (2) sounds like overkill when it comes to padding with correlated noise that exactly matches the noise field near each galaxy. But, when we want to carry out noise whitening on the existing noise in the COSMOS galaxy images, which will be important for including fainter training galaxies, I believe we will need that information anyway (we have to take the original noise field, say what happens when we shear and convolve to the target PSF, and then whiten the resulting noise - so this requires carrying around the original noise field!).

Now, we could imagine doing something simpler for the purpose of padding with correlated noise - e.g., using some average level of noise correlations for the training set galaxies, so only the noise variance is allowed to vary per-galaxy - but, as in (3), we'll eventually want to do the job right. While we didn't plan to include the noise whitening and fainter training galaxies until later this fall, doing some of the preliminary work now (time permitting!) would help make the later work go faster.

@barnabytprowe
@rearmstr
Collaborator

Ok, given your response Barney I will try to implement just a diagonal covariance matrix for each galaxy. I assume the best way to calculate this is to mask out all the objects in each image using the segmentation map and find the variance of the remaining pixels.

I also have a question about the best way to add this noise to the zero padded region. Rachel suggested doing it in SBProfile itself. If I understand how the different draw routines work there, the code determines how large of an image is needed and sets all the pixel values to zero. It then will add to the pixels the values from the function itself. It seems that the noise should be added after this point by checking if any remaining pixel values are zero. To do it in SBProfile itself we would need to pass a covariance matrix to the draw command. Is this a good way to approach this?
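Concretely, I was picturing something like this for the diagonal estimate (assuming the postage stamp and its sextractor segmentation map are available as numpy arrays; the names here are placeholders):

import numpy as np

def sky_variance(image, seg_map):
    """Estimate the diagonal noise variance from sky pixels only.

    image and seg_map are 2D arrays of the same shape; seg_map is the
    sextractor segmentation map, with 0 marking pixels assigned to no object.
    """
    sky = image[seg_map == 0]
    # A median/MAD estimate is more robust to undetected faint objects
    # than a straight variance of the sky pixels.
    mad = np.median(np.abs(sky - np.median(sky)))
    return (1.4826 * mad) ** 2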

@rmandelb
Owner
@rearmstr
Collaborator

Sorry, I've been pulled away the last couple of weeks. Given the shorter time constraints it would seem your first option is a better near term solution and I don't have any experience passing things between C++ and python. I guess my question is how much you think this will help in solving the original problem?

@rmandelb
Owner

Based on my past experience: when we look at the images padded with uncorrelated noise, it will be visually quite obvious that the noise properties are different in that region. But when we look at the resulting simulated images after shearing and convolving with a ground-based PSF, the difference won't be as obvious. It will be a definite improvement over the case where the padding is with zero. So, it is up to you as to whether you have the time and inclination to do this.

If you don't want to deal with learning how to pass things between C++ and python, then if you are up for doing the job in the C++ parts of SBProfile, I could do the rest. If you do want to learn this, I can point you to relevant examples in the code that would show you how to do this.

Looking at the big picture here: you had previously seemed interested in doing steps (a)-(d) in my comment above from 20 days ago. Are you still interested in that? And in the context of the planned release of GalSim within the GREAT3 collaboration, here are my thoughts:

  • even if the real galaxies are left as-is, with just 100 objects and no image padding etc., that's enough for people to start playing with

  • then there is the possibility of giving them the 26k training galaxies. If that's easy, it would be nice, but no pressure. (For context, it's lower priority than most of the tasks Barney mentioned in his e-mail to great3-code list this week.)

  • just like getting the 26k training galaxies in the repo, image padding is also slightly lower priority for now. If it can be done easily, that would be nice, but if not, we can simply state this as a goal for the fall, so users know that we recognize the limitations of what's currently in the repo (and this is not the only limitation!).

  • as for all the other things described in this issue, I think those are definitely ones for the fall, we shouldn't even attempt them now.

Hopefully @barnabytprowe will chime in here, but I think he and I are in agreement about these priorities.

@rearmstr
Collaborator

Looking at the big picture here: you had previously seemed interested in doing steps (a)-(d) in my comment above from 20 days ago. Are you still interested in that?

Yes, those are definitely in my plans. I have generated the 26k set and am planning on putting them on Dropbox as Barney suggested.

But when we look at the resulting simulated images after shearing and convolving with a ground-based PSF, the difference won't be as obvious. It will be a definite improvement over the case where the padding is with zero. So, it is up to you as to whether you have the time and inclination to do this.

I think there is still something I don't understand about how it would be used in SBProfile. Just adding noise to the "draw" method in SBProfile doesn't seem like it would accomplish what you want, since it determines the needed size of an image, fills it with zeros, and then draws into it. For an interpolated image, like a real galaxy, it would give zero outside the bounds of the initial image. Does this mean the noise needs to be added to SBInterpolatedImage? Am I understanding things correctly?

@rmandelb
Owner

Thanks for clarifying your plans about the 26k set of training galaxies.

Yeah, I'm sorry, it wouldn't be as a keyword to draw and drawShoot (that was from a different part of our discussion). It would be in the initialization of the SBInterpolatedImage, where the image is padded by some factor. Usually that's done by making a new image which has all zeros, and putting the training set galaxy image into the center of it. So we would have to set a keyword for the constructor of SBInterpolatedImage, something like pad_variance=0.0 (default value 0), and when pad_variance is nonzero, then SBInterpolatedImage would determine which pixels are the padding and then populate them with noise with the desired variance. In practice the quickest way to do this might be to actually populate all the pixels with noise using one of our Image methods, then zero out the pixels where the smaller Image we have supplied will be located, and then add in our image.

Does that make sense to you?
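To make that concrete, in numpy the padding scheme would look roughly like the sketch below; the real change would live in the C++ SBInterpolatedImage constructor, and pad_factor here is just an illustrative parameter name:

import numpy as np

def pad_with_noise(stamp, pad_factor, pad_variance, rng=None):
    """Pad a trimmed postage stamp with uncorrelated Gaussian noise."""
    rng = rng or np.random.RandomState()
    ny, nx = stamp.shape
    big = rng.normal(scale=np.sqrt(pad_variance),
                     size=(ny * pad_factor, nx * pad_factor))
    y0 = (big.shape[0] - ny) // 2
    x0 = (big.shape[1] - nx) // 2
    big[y0:y0 + ny, x0:x0 + nx] = stamp   # the zero-out-then-add step collapses to an assignment
    return big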

@rearmstr
Collaborator

Yes, that makes sense. The changes that would need to happen now seem more straightforward. I should be able to implement them by the end of next week.

I noticed examples in the pysrc directory where images are passed to C++, for the future case of passing in a covariance matrix. Since generating the correlated noise is on hold, I think I will just focus on getting things to work for the diagonal case.

@rmjarvis
Owner

For issue #277, it became clear that it's a bit hard to use the original image and psf stored in a RealGalaxy if you want to do something with them directly (rather than combined into the deconvolved profile). e.g. the RealDemo.py script draws them and has to put in a bunch of boilerplate around the draw command to get the image ready. Stuff that the GSObject draw command does for you.

So my proposal (which Rachel suggested belongs on this branch) was to make methods for RealGalaxy that return the original image and original PSF as GSObjects rather than using the internal attributes, which are SBProfiles. So

def getOriginalImage(self):
    return galsim.GSObject(self.original_image)

However, it occurred to me afterwards that we probably don't even need to store them as SBProfiles. We can just have the self.original_image and self.original_PSF be GSObjects directly (constructed as I did above). Then we can document their existence in the doc string for RealGalaxy (rather than implicitly through RealDemo.py) and users would be able to use them the way they normally do with GalSim objects.
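i.e. something along these lines, as a sketch (the wrapping details are an assumption on my part):

import galsim

def make_gsobjects(real_gal):
    """Wrap a RealGalaxy's stored SBProfiles as GSObjects, per the proposal above."""
    orig_image = galsim.GSObject(real_gal.original_image)
    orig_psf = galsim.GSObject(real_gal.original_PSF)
    return orig_image, orig_psf

# Users could then draw them like any other GSObject, without the boilerplate
# currently in RealDemo.py, e.g.:
# image = orig_image.draw(dx=0.03)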

@rmandelb
Owner

This sadly neglected issue related to accuracy of real galaxy calculations needs an update. I'm going to post here a revised list based on the one originally posted here many months ago, plus updates that have occurred since then. Assume that the noise-padding stuff on #238 is done for the purpose of this discussion, since that's nearly true. Issues like optimization are not included, since right now the code is fairly fast and nobody is actively complaining. :) Comments on the plans and/or priority of various items are very welcome:

(1) tests of accuracy of shearing as a function of interpolation kernel: this is super important for the challenge or for any use of the code for the basic shear tests. It's also potentially quite difficult to test in a clear-cut way. Given how important this is, I think it would be nice to discuss how we can do this (pinging the usual suspects: @barnabytprowe , @rmjarvis , and also bothering @gbernstein since it's a natural extension of some of the SHERA vs. SBProfile tests we did before). I have a few ideas, but please let me know if you have more:

(a) Take some reasonably complicated real HST galaxy and maybe two target PSFs and pixel scales (one ground, one space). Simulate the HST galaxy with complications like shear, rotation, magnification, where for each choice of target PSF + shear + rotation + magnification we simulate the galaxy with a range of image padding and interpolant choices. Then, check how the observed images change for the different choices of padding and interpolants. Here we don't have ground truth but we do know which interpolants should do better than others (roughly) so we can at least look for convergence towards a particular answer with better interpolants / more appropriate choice of image padding.

(b) Take one of the HST images and apply to the observed image some operations like shearing, shifting, dilation, rotation. We know how some of the moments should change in each of those cases, so we can look for the correct change in the observed moments (a minimal sketch of such a moments check appears at the end of this comment).

(c) Make some fake HST data using some analytic profiles / sums of profiles (e.g., bulge + disk) and an OpticalPSF. Use it to simulate sheared lower-resolution data for some target PSF and pixel scale, and compare the resulting images with those we get by directly shearing and convolving the original sums of profiles with the target PSF, then resampling to the target pixel scale.

What else? Is this enough? Depending on the results of these tests, we may end up changing default interpolants / image padding for the RealGalaxy objects, and should certainly update the documentation.

(2) Getting original image / PSF more easily (per Mike's request on this thread from months ago): this should be easy and can take advantage of the InterpolatedImage. It's not super important but it is quick and useful.
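For (1)(b) above, the kind of moments check I have in mind is sketched below; the GalSim calls in the commented-out usage are illustrative only, since the exact method names for dilation etc. would need checking:

import numpy as np

def second_moments(array):
    """Unweighted second moments of a 2D image (simple numpy version)."""
    y, x = np.indices(array.shape)
    total = array.sum()
    xbar, ybar = (x * array).sum() / total, (y * array).sum() / total
    ixx = ((x - xbar) ** 2 * array).sum() / total
    iyy = ((y - ybar) ** 2 * array).sum() / total
    return ixx, iyy

# Hypothetical usage: draw the HST stamp before and after a pure dilation by a
# linear factor f, and check that Ixx and Iyy each grow by f**2.
# gal = galsim.InterpolatedImage(hst_stamp, dx=0.03)
# im0, im1 = gal.draw(dx=0.03), dilated_version_of(gal, 1.1).draw(dx=0.03)
# print(np.array(second_moments(im1.array)) / np.array(second_moments(im0.array)))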

@barnabytprowe

(1) tests of accuracy of shearing as a function of interpolation kernel: this is super important for the challenge or for any use of the code for the basic shear tests. It's also potentially quite difficult to test in a clear-cut way. Given how important this is, I think it would be nice to discuss how we can do this (pinging the usual suspects: @barnabytprowe , @rmjarvis , and also bothering @gbernstein since it's a natural extension of some of the SHERA vs. SBProfile tests we did before). I have a few ideas, but please let me know if you have more

Could I suggest a short telecon to talk through these ideas? Would anyone else be interested? I think they are important and a short discussion (~< 20mins) on the phone might be the most efficient way to plan these. If people agree I'll set up a Doodle, and I guess I'm primarily referring to the names pinged above.

For my part, I think c) is a straightforward and strenuous test (particularly if we go via an InterpolatedImage in the simulation of fake HST data). b) looks quick and very sensible, and a) is a bit sprawling but does have to be done. I do think these will be enough: I also think we might not get them all to pass right away. I cannot think of any more to add, but I've been thinking about it.

@rmandelb
Owner
@barnabytprowe

I was hoping that perhaps we could try and get some others on the line too beyond you and me (e.g. a couple of the guys from UCL, even) who share the same interests in the validation of these features, and use the time to informally allocate tasks (which, when everyone can speak, takes a few minutes only). I was maybe also hoping that it would act as a motivational aid, and that the event of talking (even briefly) about this stuff would get people to take a proper look at the proposals and maybe generate new ideas.

@barnabytprowe

That was the rationale anyhow.

@gbernstein
Collaborator
@rmandelb
Owner
@barnabytprowe

Excellent. Your perspective would be very welcome Gary!

@tomaszkacprzak, @lvoigt , & @michaelhirsch , would you be able to join a short call to discuss these tests? I know you're very interested in making sure that the real galaxy calculations in GalSim are validated in terms of their precision. These tests, if we get them right, should allow us to nail the issue.

I've set up the following Doodle poll for a 30 min (max) discussion in the next week: http://www.doodle.com/q67bhacaiv3zud7m

Please enter your availability (or let me know if I need to add more time slots). Thanks all!

@rmandelb
Owner

@msimet , I wanted to bring this topic to your attention as well (for once you're done with #306). If this sounds interesting to you, it might be most useful if you join the phone call yourself.

@barnabytprowe

OK guys, Tuesday at 5pm GMT / 12 noon EST looks like a clear winner, but there are also some possible (if not ideal) times on Monday. Thanks for responding so promptly everyone!

Let's pencil in the Tuesday time, pending confirmation first thing tomorrow am.

@barnabytprowe

I think we should assume the Tuesday date and time is now set, as we've not heard from any more potential participants. Look forward to speaking on Tuesday at 5pm GMT / 12 noon EST.

@rmandelb
Owner

All - telecon reminder, talk to you in a few hours...
Barney - are we doing this on skype? Do you know everyone's contact info?
I think it would be preferable if you could initiate the call, if it's not a problem for you. I'm teaching until 5 minutes before the telecon is supposed to start, so I could be a few minutes late.

@barnabytprowe

I was planning Skype, as I think all are Skype users. The only contact info I'm lacking is for Melanie: would you mind adding me as a contact Melanie? I'm on "barnaby_rowe" and I'll keep an eye out to approve the request... Very happy to lead the call.

@msimet
Collaborator
@barnabytprowe
@rmjarvis
Owner

I think this issue might be closable?

@rmandelb
Owner

Agree.

@rmandelb
Owner

Closing #168.

@rmandelb rmandelb closed this