-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds CADC download option, plus fixes bugs. #4
Conversation
Users can now specify -c (eg; runNifty nifsPipeline -c ...) to download raw data from the Canadian Astronomy Data Centre. This has been tested to work from the interactive config session (runNifty nifsPipeline -i) and the fully automatic mode (runNifty nifsPipeline -c -f <PROGRAM ID>).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes requested. Mostly around the idea of how to retrieve the file and be sure the filename is correct.
nifty/pipeline/nifsUtils.py
Outdated
urls = cadc.get_data_urls(result) | ||
for url, pid in zip(urls, pids): | ||
try: | ||
urllib.urlretrieve(url, directory+'/'+pid+'.fits') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In stead of urllib can you use 'requests' . That's a more abstracted interface that provides some extra robustness against changes in the lower library. Also, this would allow you to look in the header
of the HTTP response
to see what the file name should be. requests
is part of the astroquery dependencies already.
Here is an example header:
Strict-Transport-Security: max-age=0
Content-MD5: d13af3725015ebd2babd09a8f2eaac3d
ETag: d13af3725015ebd2babd09a8f2eaac3d
Content-Encoding: x-fits
Last-Modified: Mon, 04 Mar 2019 08:09:22 GMT
Content-Disposition: inline; filename=2376828o.fits.fz
X-Uncompressed-Length: 786038400
X-Uncompressed-MD5: 988909a91ab429bbb8a5f51b3d35715e
X-File-CRC:
X-Uncompressed-CRC:
Content-Type: application/fits
Content-Length: 330462720
Vary: Origin
Connection: close```
`Content-Disposition` can then be used to name the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, see 69e03fd . Maybe this is what you're thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that looks better. I made some comments in the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thank you for all your feedback. Perhaps bbf9efe fixes this.
@@ -72,6 +72,8 @@ def makeConfig(self): | |||
self.parser.add_argument('-i', '--interactive', dest = 'interactive', default = False, action = 'store_true', help = 'Create a config.cfg file interactively.') | |||
# Ability to repeat the last data reduction | |||
self.parser.add_argument('-r', '--repeat', dest = 'repeat', default = False, action = 'store_true', help = 'Repeat the last data reduction, loading saved reduction parameters from runtimeData/config.cfg.') | |||
# Specify where downloads come from; either Gemini or CADC. | |||
self.parser.add_argument('-c', '--cadc', dest = 'cadc', default = False, action = 'store_true', help = 'Download raw data from Canadian Astronomy Data Centre rather than the Gemini Science Archive.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. perhaps --data-source with two choices 'CADC' and 'GSA' (Gemini Science Archive) would make more sense here. With GSA being the default. Eventually you could extend NIFTY with other data sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
74a11e9 implements this by changing the "-c" flag to a "-d/--data-source" option with 'GSA' and 'CADC" options. It also adds some better error handling in regards to your next comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-d
is typically used for debug mode, at least we've been using it for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about -s/--data-source? Implemented in 9214743 but can be changed very easily in one line.
|
||
if self.cadc: | ||
try: | ||
with open('./' + self.configFile, 'r') as self.config_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if the config file doesn't already exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 74a11e9 for better error handling.
nifty/pipeline/steps/nifsSort.py
Outdated
else: | ||
download_query_gemini(url, './rawData') | ||
url = 'https://archive.gemini.edu/download/'+ str(program) + '/notengineering/NotFail/present/canonical' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is putting this URL building information into the download_query_gemini more appropriate now? To mirror the download_query_cadc?
For the CADC one, we reroute the user to the GSA if we don't have the file, because proprietary. Passing the proprietary key into download_query_cadc would enable that. (would require a try-except on the query, and a check to see if permission denied and was the file being asked for from Gemini, or the cookie could just get added to all requests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6b976a1 adds the URL building information into the download_query_gemini() function.
Hopefully I understand what you're looking for as actionables from your second point:
- In download_query_cadc(), if the file is not found with a CADC query (even if the user requested that), we'll retry with a proprietary Gemini download?
author_email='mbussero@gemini.edu', | ||
version="2.0.0", | ||
author='ncomeau', | ||
author_email='ncomeau@uvic.ca', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we need to decide about where this project should live. Its true we don't what to bother gemini with mistakes we might be adding, but we also don't want to take credit for code they have written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll avoid pushing it to PyPI for now. I know some Dockerfiles will eventually depend on this; perhaps we could bundle a built python wheel (.whl) with our Dockerfiles rather than having them pull the package from PyPI? I'll probably eventually have to talk to Marie at Gemini North someday too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dockerfile
can install from a github
repo if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, implemented in ijiraq/gemini_processing@a052587 . Now the Dockerfile installs from the "dev" branch of Nifty4Gemini.
Rather than just choosing between Gemini and CADC downloads, --data-source adds support for more than just two data sources. To add support for a new archive at minimum nifsSort.py needs changed.
There were problems with finding the config file.
if isinstance(v, collections.Mapping): | ||
u[k] = update(u.get(k)) | ||
else: | ||
if u[k] == 'yes': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use u[k] = u[k] == 'yes'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, very clean. I don't this works here (if I understand what you're suggesting) because u[k] can take on more values than just 'yes' and 'no'. This function is looking to replace all occurrences of 'yes' and 'no' in the dictionary with True and False and leave all other values intact.
nifty/pipeline/nifsUtils.py
Outdated
# https://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/GEM/N20140505S0114.fits?RUNID=mf731ukqsipqpdgk | ||
filename = (url.split('/')[-1]).split('?')[0] | ||
# Write the fits file | ||
with open(filename, 'wb') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably think writing to a temporary directory or an application directory. You could expect some failures while downloading the file potentially leaving incomplete files behind. Or write them straight into the destination directory with a temporary name (starting with '.' for example). This is where you'd want to compute the md5 of the received data and compare it with the content-MD5
of the file (assuming the GEMINI also supports it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a mechanism to get rid of the failed temporary files? Are they deleted when errors occur? Otherwise they might pile up consuming user's quota. You could delete the old ones on startup or tear down. Might want to do that only for older files if multiple instances of the script are executed at the same time. Alternatively, you could use something like the tempfile
package to store them in a temporary directory. The cleanup is someone else's job in that case.
'.temp-' is used in multiple places and should be a const. To avoid misspelling bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend using tempfile.TemporaryFile(mode='w+b', buffering=None, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)
Set the dir=$CACHEDIR and prefix=filename
On success mv the file to the correct filename ... if fail raise a exception. file cleanup is handled by python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thoughts all around. I'm a newbie when it comes to temp files but hopefully I did it right in 977a915. Definitely let me know if there's a more standard way to do it (especially my use of a .temp-downloads directory).
Checks that a config file exists and if not, sets Nifty to use default configuration. | ||
""" | ||
if not os.path.exists(configFile): | ||
shutil.copy(self.RECIPES_PATH+'defaultConfig.cfg', configFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good practice to use os.path.join()
to create file paths. Applicable to other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch, thank you. Maybe not doing this inadvertently breaks support for Windows users... I've fixed the calls in GetConfig.py in ad9c64d and created an issue to go through the rest of the code base to fix those calls.
Fantastic @andamian , thank you for the review. I've pushed draft changes to all the things you've raised. |
nifty/pipeline/nifsUtils.py
Outdated
try: | ||
server_checksum = r.headers['Content-MD5'] | ||
with open(filename, 'rb') as f: | ||
download_checksum = hashlib.md5(f.read()).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GEMINI files are not that big but in general this could be an expensive (e.g. time consuming) operation. Best would be to feed the bytes to hashlib
at the same time you are writing them into the file (line 1233). That's if the Content-MD5
present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, good idea. Added in b88d045.
Users can now specify -c (eg; runNifty nifsPipeline -c ...)
to download raw data from the Canadian Astronomy Data Centre.
This has been tested to work from the interactive config session
(runNifty nifsPipeline -i) and the fully automatic mode
(runNifty nifsPipeline -c -f ).