Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds CADC download option, plus fixes bugs. #4

Merged
merged 17 commits into from
Jul 9, 2020
Merged

Conversation

Nat1405
Copy link
Owner

@Nat1405 Nat1405 commented May 7, 2020

Users can now specify -c (eg; runNifty nifsPipeline -c ...)
to download raw data from the Canadian Astronomy Data Centre.
This has been tested to work from the interactive config session
(runNifty nifsPipeline -i) and the fully automatic mode
(runNifty nifsPipeline -c -f ).

Users can now specify -c (eg; runNifty nifsPipeline -c ...)
to download raw data from the Canadian Astronomy Data Centre.
This has been tested to work from the interactive config session
(runNifty nifsPipeline -i) and the fully automatic mode
(runNifty nifsPipeline -c -f <PROGRAM ID>).
Copy link

@ijiraq ijiraq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes requested. Mostly around the idea of how to retrieve the file and be sure the filename is correct.

urls = cadc.get_data_urls(result)
for url, pid in zip(urls, pids):
try:
urllib.urlretrieve(url, directory+'/'+pid+'.fits')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In stead of urllib can you use 'requests' . That's a more abstracted interface that provides some extra robustness against changes in the lower library. Also, this would allow you to look in the header of the HTTP response to see what the file name should be. requests is part of the astroquery dependencies already.

Here is an example header:

Strict-Transport-Security: max-age=0
Content-MD5: d13af3725015ebd2babd09a8f2eaac3d
ETag: d13af3725015ebd2babd09a8f2eaac3d
Content-Encoding: x-fits
Last-Modified: Mon, 04 Mar 2019 08:09:22 GMT
Content-Disposition: inline; filename=2376828o.fits.fz
X-Uncompressed-Length: 786038400
X-Uncompressed-MD5: 988909a91ab429bbb8a5f51b3d35715e
X-File-CRC: 
X-Uncompressed-CRC: 
Content-Type: application/fits
Content-Length: 330462720
Vary: Origin
Connection: close```
`Content-Disposition` can then be used to name the file. 

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, see 69e03fd . Maybe this is what you're thinking.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that looks better. I made some comments in the commit.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thank you for all your feedback. Perhaps bbf9efe fixes this.

@@ -72,6 +72,8 @@ def makeConfig(self):
self.parser.add_argument('-i', '--interactive', dest = 'interactive', default = False, action = 'store_true', help = 'Create a config.cfg file interactively.')
# Ability to repeat the last data reduction
self.parser.add_argument('-r', '--repeat', dest = 'repeat', default = False, action = 'store_true', help = 'Repeat the last data reduction, loading saved reduction parameters from runtimeData/config.cfg.')
# Specify where downloads come from; either Gemini or CADC.
self.parser.add_argument('-c', '--cadc', dest = 'cadc', default = False, action = 'store_true', help = 'Download raw data from Canadian Astronomy Data Centre rather than the Gemini Science Archive.')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. perhaps --data-source with two choices 'CADC' and 'GSA' (Gemini Science Archive) would make more sense here. With GSA being the default. Eventually you could extend NIFTY with other data sources.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

74a11e9 implements this by changing the "-c" flag to a "-d/--data-source" option with 'GSA' and 'CADC" options. It also adds some better error handling in regards to your next comment.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-d is typically used for debug mode, at least we've been using it for that.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about -s/--data-source? Implemented in 9214743 but can be changed very easily in one line.


if self.cadc:
try:
with open('./' + self.configFile, 'r') as self.config_file:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here if the config file doesn't already exist?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 74a11e9 for better error handling.

else:
download_query_gemini(url, './rawData')
url = 'https://archive.gemini.edu/download/'+ str(program) + '/notengineering/NotFail/present/canonical'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is putting this URL building information into the download_query_gemini more appropriate now? To mirror the download_query_cadc?

For the CADC one, we reroute the user to the GSA if we don't have the file, because proprietary. Passing the proprietary key into download_query_cadc would enable that. (would require a try-except on the query, and a check to see if permission denied and was the file being asked for from Gemini, or the cookie could just get added to all requests).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6b976a1 adds the URL building information into the download_query_gemini() function.

Hopefully I understand what you're looking for as actionables from your second point:

  • In download_query_cadc(), if the file is not found with a CADC query (even if the user requested that), we'll retry with a proprietary Gemini download?

author_email='mbussero@gemini.edu',
version="2.0.0",
author='ncomeau',
author_email='ncomeau@uvic.ca',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we need to decide about where this project should live. Its true we don't what to bother gemini with mistakes we might be adding, but we also don't want to take credit for code they have written.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll avoid pushing it to PyPI for now. I know some Dockerfiles will eventually depend on this; perhaps we could bundle a built python wheel (.whl) with our Dockerfiles rather than having them pull the package from PyPI? I'll probably eventually have to talk to Marie at Gemini North someday too...

Copy link

@andamian andamian May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dockerfile can install from a github repo if needed.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, implemented in ijiraq/gemini_processing@a052587 . Now the Dockerfile installs from the "dev" branch of Nifty4Gemini.

if isinstance(v, collections.Mapping):
u[k] = update(u.get(k))
else:
if u[k] == 'yes':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use u[k] = u[k] == 'yes'

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, very clean. I don't this works here (if I understand what you're suggesting) because u[k] can take on more values than just 'yes' and 'no'. This function is looking to replace all occurrences of 'yes' and 'no' in the dictionary with True and False and leave all other values intact.

# https://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/GEM/N20140505S0114.fits?RUNID=mf731ukqsipqpdgk
filename = (url.split('/')[-1]).split('?')[0]
# Write the fits file
with open(filename, 'wb') as f:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably think writing to a temporary directory or an application directory. You could expect some failures while downloading the file potentially leaving incomplete files behind. Or write them straight into the destination directory with a temporary name (starting with '.' for example). This is where you'd want to compute the md5 of the received data and compare it with the content-MD5 of the file (assuming the GEMINI also supports it).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good idea. I went with writing to a '.temp-' file in the working directory for now and doing the md5 check there. I could change it to write the temp file in the download directory as well. Changes are in ff758ba and cf023d9.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a mechanism to get rid of the failed temporary files? Are they deleted when errors occur? Otherwise they might pile up consuming user's quota. You could delete the old ones on startup or tear down. Might want to do that only for older files if multiple instances of the script are executed at the same time. Alternatively, you could use something like the tempfile package to store them in a temporary directory. The cleanup is someone else's job in that case.
'.temp-' is used in multiple places and should be a const. To avoid misspelling bugs.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend using tempfile.TemporaryFile(mode='w+b', buffering=None, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)

Set the dir=$CACHEDIR and prefix=filename

On success mv the file to the correct filename ... if fail raise a exception. file cleanup is handled by python.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thoughts all around. I'm a newbie when it comes to temp files but hopefully I did it right in 977a915. Definitely let me know if there's a more standard way to do it (especially my use of a .temp-downloads directory).

Checks that a config file exists and if not, sets Nifty to use default configuration.
"""
if not os.path.exists(configFile):
shutil.copy(self.RECIPES_PATH+'defaultConfig.cfg', configFile)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good practice to use os.path.join() to create file paths. Applicable to other places.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good catch, thank you. Maybe not doing this inadvertently breaks support for Windows users... I've fixed the calls in GetConfig.py in ad9c64d and created an issue to go through the rest of the code base to fix those calls.

@Nat1405
Copy link
Owner Author

Nat1405 commented May 14, 2020

Fantastic @andamian , thank you for the review. I've pushed draft changes to all the things you've raised.

try:
server_checksum = r.headers['Content-MD5']
with open(filename, 'rb') as f:
download_checksum = hashlib.md5(f.read()).hexdigest()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GEMINI files are not that big but in general this could be an expensive (e.g. time consuming) operation. Best would be to feed the bytes to hashlib at the same time you are writing them into the file (line 1233). That's if the Content-MD5 present.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, good idea. Added in b88d045.

@Nat1405
Copy link
Owner Author

Nat1405 commented May 15, 2020

Thanks again for another stage of review. I've pushed two new commits,
b88d045 and 977a915 that address issues raised.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants