Adds CADC download option, plus fixes bugs. #4

Nat1405 · 2020-05-07T15:58:06Z

Users can now specify -c (e.g. runNifty nifsPipeline -c ...) to download raw data from the Canadian Astronomy Data Centre. This has been tested to work from the interactive config session (runNifty nifsPipeline -i) and the fully automatic mode (runNifty nifsPipeline -c -f <PROGRAM ID>).

ijiraq

A few changes requested, mostly around how to retrieve the file and be sure the filename is correct.

ijiraq · 2020-05-07T16:34:51Z

nifty/pipeline/nifsUtils.py

+    urls = cadc.get_data_urls(result)
+    for url, pid in zip(urls, pids):
+        try:
+            urllib.urlretrieve(url, directory+'/'+pid+'.fits')


Instead of urllib, can you use 'requests'? It is a more abstracted interface that provides some extra robustness against changes in the lower-level library. It would also let you look in the headers of the HTTP response to see what the file name should be. requests is already part of the astroquery dependencies.

Here is an example header:

```
Strict-Transport-Security: max-age=0
Content-MD5: d13af3725015ebd2babd09a8f2eaac3d
ETag: d13af3725015ebd2babd09a8f2eaac3d
Content-Encoding: x-fits
Last-Modified: Mon, 04 Mar 2019 08:09:22 GMT
Content-Disposition: inline; filename=2376828o.fits.fz
X-Uncompressed-Length: 786038400
X-Uncompressed-MD5: 988909a91ab429bbb8a5f51b3d35715e
X-File-CRC:
X-Uncompressed-CRC:
Content-Type: application/fits
Content-Length: 330462720
Vary: Origin
Connection: close
```

`Content-Disposition` can then be used to name the file.

Good idea, see 69e03fd . Maybe this is what you're thinking.

Yes, that looks better. I made some comments in the commit.

Perfect, thank you for all your feedback. Perhaps bbf9efe fixes this.
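
For reference, a minimal sketch of the Content-Disposition idea discussed in this thread, assuming Python 3 and only the standard library. The helper name `filename_from_response` is hypothetical and not part of Nifty; it falls back to the last path component of the URL when the header is missing.

```python
import os
from email.message import Message
from urllib.parse import urlsplit


def filename_from_response(content_disposition, url):
    """Pick a file name from a Content-Disposition value, else from the URL path."""
    if content_disposition:
        # Let the stdlib header parser handle parameters and quoting.
        msg = Message()
        msg['Content-Disposition'] = content_disposition
        name = msg.get_filename()
        if name:
            # basename() guards against a server-supplied directory path.
            return os.path.basename(name)
    # Fallback: drop the query string and keep the last path component.
    return os.path.basename(urlsplit(url).path)


print(filename_from_response('inline; filename=2376828o.fits.fz',
                             'https://example.org/ignored'))
```

With requests, the first argument would be `r.headers.get('Content-Disposition')` for a response `r`.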

ijiraq · 2020-05-07T16:37:04Z

nifty/pipeline/objectoriented/GetConfig.py

@@ -72,6 +72,8 @@ def makeConfig(self):
        self.parser.add_argument('-i', '--interactive', dest = 'interactive', default = False, action = 'store_true', help = 'Create a config.cfg file interactively.')
        # Ability to repeat the last data reduction
        self.parser.add_argument('-r', '--repeat', dest = 'repeat', default = False, action = 'store_true', help = 'Repeat the last data reduction, loading saved reduction parameters from runtimeData/config.cfg.')
+        # Specify where downloads come from; either Gemini or CADC.
+        self.parser.add_argument('-c', '--cadc', dest = 'cadc', default = False, action = 'store_true', help = 'Download raw data from Canadian Astronomy Data Centre rather than the Gemini Science Archive.')


Hmm. perhaps --data-source with two choices 'CADC' and 'GSA' (Gemini Science Archive) would make more sense here. With GSA being the default. Eventually you could extend NIFTY with other data sources.

74a11e9 implements this by changing the "-c" flag to a "-d/--data-source" option with 'GSA' and 'CADC' choices. It also adds some better error handling with regard to your next comment.

-d is typically used for debug mode, at least we've been using it for that.

How about -s/--data-source? Implemented in 9214743 but can be changed very easily in one line.
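
A minimal sketch of where this thread ended up, assuming a plain argparse parser. The `dest` name `dataSource` and the help text are assumptions for illustration; the real definition lives in GetConfig.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='runNifty')
    # 'GSA' stays the default; argparse rejects anything outside `choices`.
    parser.add_argument(
        '-s', '--data-source', dest='dataSource', default='GSA',
        choices=['GSA', 'CADC'],
        help='Archive to download raw data from (default: GSA).')
    return parser


args = build_parser().parse_args(['-s', 'CADC'])
print(args.dataSource)
```

Supporting another archive would then mean adding one more entry to `choices`, plus the matching download code.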

ijiraq · 2020-05-07T16:38:07Z

nifty/pipeline/objectoriented/GetConfig.py

+
+        if self.cadc:
+            try:
+                with open('./' + self.configFile, 'r') as self.config_file:


What happens here if the config file doesn't already exist?

See 74a11e9 for better error handling.

ijiraq · 2020-05-07T16:41:37Z

nifty/pipeline/steps/nifsSort.py

        else:
-            download_query_gemini(url, './rawData')
+            url = 'https://archive.gemini.edu/download/'+ str(program) + '/notengineering/NotFail/present/canonical'


Is putting this URL building information into the download_query_gemini more appropriate now? To mirror the download_query_cadc?

For the CADC one, we reroute the user to the GSA if we don't have the file because it is proprietary. Passing the proprietary key into download_query_cadc would enable that. (It would require a try-except on the query and a check to see whether permission was denied and whether the file being asked for is from Gemini; alternatively, the cookie could just be added to all requests.)

6b976a1 adds the URL building information into the download_query_gemini() function.

Hopefully I understand what you're looking for as actionables from your second point:

In download_query_cadc(), if the file is not found with a CADC query (even if the user requested that), we'll retry with a proprietary Gemini download?
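
A minimal sketch of the first point, keeping URL construction in one place next to the Gemini download code. The helper name `gemini_download_url` is hypothetical, and 'PROGRAM-ID' is only a placeholder; the URL pattern is the one from the diff above.

```python
GEMINI_DOWNLOAD_BASE = 'https://archive.gemini.edu/download/'
GEMINI_DOWNLOAD_FILTERS = '/notengineering/NotFail/present/canonical'


def gemini_download_url(program):
    """Build the Gemini archive download URL for one program ID."""
    return GEMINI_DOWNLOAD_BASE + str(program) + GEMINI_DOWNLOAD_FILTERS


print(gemini_download_url('PROGRAM-ID'))
```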

ijiraq · 2020-05-07T16:43:37Z

setup.py

-    author_email='mbussero@gemini.edu',
+    version="2.0.0",
+    author='ncomeau',
+    author_email='ncomeau@uvic.ca',


Now we need to decide where this project should live. It's true we don't want to bother Gemini with mistakes we might be adding, but we also don't want to take credit for code they have written.

I'll avoid pushing it to PyPI for now. I know some Dockerfiles will eventually depend on this; perhaps we could bundle a built python wheel (.whl) with our Dockerfiles rather than having them pull the package from PyPI? I'll probably eventually have to talk to Marie at Gemini North someday too...

Dockerfile can install from a github repo if needed.

Perfect, implemented in ijiraq/gemini_processing@a052587 . Now the Dockerfile installs from the "dev" branch of Nifty4Gemini.

andamian · 2020-05-14T01:30:24Z

nifty/pipeline/nifsUtils.py

+                if isinstance(v, collections.Mapping):
+                    u[k] = update(u.get(k))
+                else:
+                    if u[k] == 'yes':


Could use u[k] = u[k] == 'yes'

Great suggestion, very clean. I don't think this works here (if I understand what you're suggesting) because u[k] can take on more values than just 'yes' and 'no'. This function is looking to replace all occurrences of 'yes' and 'no' in the dictionary with True and False and leave all other values intact.
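
A minimal sketch of the behaviour described in the reply, assuming Python 3 (where `Mapping` lives in `collections.abc`). The function name `yes_no_to_bool` is hypothetical, not the one in nifsUtils.py.

```python
from collections.abc import Mapping


def yes_no_to_bool(d):
    """Recursively replace 'yes'/'no' with True/False, leaving other values alone."""
    for k, v in d.items():
        if isinstance(v, Mapping):
            d[k] = yes_no_to_bool(v)
        elif v == 'yes':
            d[k] = True
        elif v == 'no':
            d[k] = False
    return d


cfg = {'sort': 'yes', 'download': {'enabled': 'no', 'dataSource': 'GSA'}}
print(yes_no_to_bool(cfg))
```

With `u[k] = u[k] == 'yes'` instead, a value such as 'GSA' would become False, which is why the two explicit branches are kept.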

andamian · 2020-05-14T02:15:52Z

nifty/pipeline/nifsUtils.py

+        # https://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/data/pub/GEM/N20140505S0114.fits?RUNID=mf731ukqsipqpdgk
+        filename = (url.split('/')[-1]).split('?')[0]
+    # Write the fits file
+    with open(filename, 'wb') as f:


I would probably write to a temporary directory or an application directory. You can expect some failures while downloading a file, potentially leaving incomplete files behind. Alternatively, write them straight into the destination directory with a temporary name (starting with '.', for example). This is also where you'd want to compute the md5 of the received data and compare it with the Content-MD5 of the file (assuming Gemini also supports it).

Very good idea. I went with writing to a '.temp-' file in the working directory for now and doing the md5 check there. I could change it to write the temp file in the download directory as well. Changes are in ff758ba and cf023d9.

Is there a mechanism to get rid of the failed temporary files? Are they deleted when errors occur? Otherwise they might pile up, consuming the user's quota. You could delete the old ones on startup or teardown. You might want to do that only for older files, in case multiple instances of the script are running at the same time. Alternatively, you could use something like the tempfile package to store them in a temporary directory; the cleanup is someone else's job in that case.
'.temp-' is used in multiple places and should be a constant, to avoid misspelling bugs.

I recommend using tempfile.TemporaryFile(mode='w+b', buffering=None, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)

Set the dir=$CACHEDIR and prefix=filename

On success, mv the file to the correct filename; on failure, raise an exception. File cleanup is handled by Python.

Good thoughts all around. I'm a newbie when it comes to temp files but hopefully I did it right in 977a915. Definitely let me know if there's a more standard way to do it (especially my use of a .temp-downloads directory).
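
A minimal sketch of the temp-file pattern discussed in this thread, assuming Python 3. It swaps in `tempfile.NamedTemporaryFile(delete=False)` rather than `TemporaryFile`, because on POSIX a `TemporaryFile` has no visible directory entry that could be renamed afterwards. The helper name `write_atomically` and the '.part' suffix are hypothetical.

```python
import os
import tempfile


def write_atomically(chunks, dest_path):
    """Write chunks to a temp file beside dest_path, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    tmp = tempfile.NamedTemporaryFile(
        mode='wb', dir=dest_dir,
        prefix=os.path.basename(dest_path) + '.', suffix='.part',
        delete=False)
    try:
        with tmp:  # closes the file on exit, even on error
            for chunk in chunks:
                tmp.write(chunk)
        # Same directory, so the rename stays on one filesystem.
        os.replace(tmp.name, dest_path)
    except BaseException:
        # Never leave a partial download behind.
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
        raise
    return dest_path
```

With requests, `chunks` could be `r.iter_content(chunk_size=...)` on a response opened with `stream=True`.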

andamian · 2020-05-14T03:10:29Z

nifty/pipeline/objectoriented/GetConfig.py

+        Checks that a config file exists and if not, sets Nifty to use default configuration.
+        """
+        if not os.path.exists(configFile):
+            shutil.copy(self.RECIPES_PATH+'defaultConfig.cfg', configFile)


It's good practice to use os.path.join() to create file paths. Applicable to other places.

Very good catch, thank you. Maybe not doing this inadvertently breaks support for Windows users... I've fixed the calls in GetConfig.py in ad9c64d and created an issue to go through the rest of the code base to fix those calls.
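
A minimal sketch of the same check written with os.path.join(). The function name `ensure_config` and its arguments are hypothetical stand-ins for the method and `self.RECIPES_PATH` in GetConfig.py.

```python
import os
import shutil


def ensure_config(config_file, recipes_path, default_name='defaultConfig.cfg'):
    """Copy the default config into place if no config file exists yet."""
    if not os.path.exists(config_file):
        # os.path.join() inserts the right separator for the platform.
        shutil.copy(os.path.join(recipes_path, default_name), config_file)
    return config_file
```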

Nat1405 · 2020-05-14T18:01:52Z

Fantastic @andamian, thank you for the review. I've pushed draft changes for all the things you've raised.

andamian · 2020-05-14T20:04:51Z

nifty/pipeline/nifsUtils.py

+    try:
+        server_checksum = r.headers['Content-MD5']
+        with open(filename, 'rb') as f:
+            download_checksum = hashlib.md5(f.read()).hexdigest()


Gemini files are not that big, but in general this could be an expensive (e.g. time-consuming) operation. It would be best to feed the bytes to hashlib at the same time you are writing them into the file (line 1233). That is, if the Content-MD5 header is present.

Awesome, good idea. Added in b88d045.
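
A minimal sketch of hashing while writing, so the file is never read back from disk. The names `write_with_md5` and `md5_matches` are hypothetical; the comparison assumes the server sends a hex digest, as in the example header earlier in this review.

```python
import hashlib
import io


def write_with_md5(chunks, fileobj):
    """Write chunks to fileobj and return the hex MD5 of everything written."""
    md5 = hashlib.md5()
    for chunk in chunks:
        fileobj.write(chunk)
        md5.update(chunk)  # hash each chunk as it passes through
    return md5.hexdigest()


def md5_matches(server_checksum, local_digest):
    """Return None when there is no Content-MD5 to compare against."""
    if not server_checksum:
        return None
    return server_checksum.strip().lower() == local_digest


buf = io.BytesIO()
digest = write_with_md5([b'hello ', b'world'], buf)
print(md5_matches(digest, digest))
```

With requests, `chunks` could be `r.iter_content(chunk_size=...)` and `server_checksum` could be `r.headers.get('Content-MD5')`.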

Nat1405 · 2020-05-15T15:47:08Z

Thanks again for another stage of review. I've pushed two new commits,
b88d045 and 977a915 that address issues raised.

Nat1405 mentioned this pull request May 7, 2020

Modify nifty to use the cadc-tap system to query files. ijiraq/gemini_processing#3

Closed

ijiraq suggested changes May 7, 2020

Nat1405 added 8 commits May 7, 2020 11:23

Adds support for downloads using requests.

3e7ffea

Splits out getting of file to new method.

443662e

Fixes function naming conventions to match original project.

0aae71e

Changes '-c' CADC flag to '-d/--data-source' option.

22c3f62

Rather than just choosing between Gemini and CADC downloads, --data-source adds support for more than just two data sources. To add support for a new archive, at minimum nifsSort.py needs to be changed.

Cleans up url construction code.

6dd7cdd

Fixes silly bugs introduced by 74a11e9.

ec027de

Adds better error handling to get_file().

42f70df

Fixes silly bugs in d5ead22.

453c787

There were problems with finding the config file.

andamian reviewed May 14, 2020

Nat1405 added 5 commits May 14, 2020 08:27

Fixes path separators (wasn't using os.path.join) in GetConfig.py.

e006513

Sends downloaded CADC files to a temp location first.

061fd56

Adds CADC downloads MD5 verification.

fa13680

Changes -d/--data-source flag to -s/--data-source flag

ba4b95c

Moves temp files to download directory.

7e57b99

andamian reviewed May 14, 2020

Nat1405 added 2 commits May 14, 2020 14:44

Changes md5 verification to happen at the same time as downloads.

ad6d231

Makes CADC downloads use temp files (via tempfile.TemporaryFile).

79a8f59

Adds backwards compatibility for old config files for dataSource option.

f788b67

Nat1405 force-pushed the master branch from fe224a4 to 8f12c2f Compare June 8, 2020 20:12

Nat1405 force-pushed the cadc_downloads branch from 977a915 to f788b67 Compare June 10, 2020 16:35

Nat1405 merged commit 7ec3d72 into master Jul 9, 2020

Nat1405 deleted the cadc_downloads branch July 9, 2020 18:50
