# Batch Downloads and Decompression

The FITS datacubes are compressed. This saves storage space and speeds up downloads of the full data release, but reading compressed data is relatively slow (about 15 s per cube, compared to 0.03 s for an uncompressed cube). If you are downloading cubes locally, we recommend decompressing them. This notebook shows you how to download a set of datacubes and then decompress them in place.

In [2]:
import os.path as op
from joblib import Parallel, delayed
import numpy as np
from astropy.table import Column, Table, hstack, unique

from astropy.coordinates import SkyCoord
import astropy.units as u
from astropy.io import fits

from astropy.wcs import WCS
from astropy.wcs.utils import proj_plane_pixel_scales

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

from tqdm import tqdm
import traceback
import time

In [3]:
# Follow notebook 1 to access the IFU index file and point to the directory it is located in
pdr1_dir = '/home/jovyan/Hobby-Eberly-Public/HETDEX/internal/pdr1/'
if not op.exists(pdr1_dir):
    pdr1_dir = 'pdr1/'

ifu_data = Table.read(op.join( pdr1_dir, 'ifu-index.fits'))

In [4]:
# Create SkyCoord Object array for IFU centers
ifu_coords = SkyCoord( ra=ifu_data['ra_cen']*u.deg, dec=ifu_data['dec_cen']*u.deg)

## Example Download Script for a single coordinate with multiple observations

In [4]:
coord = SkyCoord(ra=150.23189*u.deg, dec=2.363963*u.deg)

# find list of possible datacubes with coverage
sel = coord.separation( ifu_coords) < 26*u.arcsec
# this AGN is observed 9 times
ifulist=ifu_data[sel]
ifu_pairs = [(int(row['shotid']), str(row['ifuslot'])) for row in ifulist]
print(ifu_pairs)

[(20181118020, '036'), (20181119015, '036'), (20181119016, '036'), (20181120012, '036'), (20181120013, '036'), (20220130020, '024'), (20220131010, '040'), (20240310015, '070'), (20240313016, '070')]
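To make the 26 arcsec coverage cut concrete, here is a minimal pure-Python sketch of an on-sky separation calculation. It uses the haversine formula as a stand-in and is not astropy's implementation of `separation`. The helper name `separation_arcsec` and the offset test position are illustrative assumptions, not part of the data release.

```python
import math

def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for small separations
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

# Target coordinate from the cell above
ra, dec = 150.23189, 2.363963

# A hypothetical IFU center offset by 20 arcsec in declination passes the cut
sep = separation_arcsec(ra, dec, ra, dec + 20 / 3600)
print(f"{sep:.2f} arcsec, inside 26 arcsec: {sep < 26}")
```

Any IFU whose center lies within the chosen radius of the target is kept, which is why one coordinate can match several shots.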


In [5]:
# Create a list of URLs for the datacubes of interest and save it to the file urls.txt. You can then download the files from this notebook or with wget in a unix/terminal window.

In [6]:
# Base URL
base_url = "http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes"

# Write URLs to file

urls = []
with open("urls.txt", "w") as f:
    for shotid, ifuslot in ifu_pairs:
        filename = f"dex_cube_{shotid}_{ifuslot}.fits"
        url = f"{base_url}/{shotid}/{filename}"
        urls.append(url)
        f.write(url + "\n")

In [7]:
urls

['http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20181118020/dex_cube_20181118020_036.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20181119015/dex_cube_20181119015_036.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20181119016/dex_cube_20181119016_036.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20181120012/dex_cube_20181120012_036.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20181120013/dex_cube_20181120013_036.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20220130020/dex_cube_20220130020_024.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20220131010/dex_cube_20220131010_040.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes/20240310015/dex_cube_20240310015_070.fits',
 'http://web.corral.tacc.utexas.edu/hetdex/HETDEX/intern

Now that you have the URL list, you can download the files either from within this notebook or, if you prefer an ssh/terminal window, with wget. The example wget command below preserves the pdr1 data file structure, so all notebooks will work provided you point them to the top-level directory where the command was run. Ideally, this is the same directory as ifu-index.fits.

    wget --user='hetdex_internal' --ask-password --cut-dirs=4 -nH -x -i urls.txt
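As a rough illustration of where the files end up, the sketch below mimics how the `-nH`, `-x`, and `--cut-dirs=4` options map a URL to a local path. The helper `wget_local_path` is hypothetical and only reproduces the path logic; it does not download anything.

```python
from urllib.parse import urlsplit

def wget_local_path(url, cut_dirs=4):
    """Relative path wget would write to with -nH -x --cut-dirs=<cut_dirs>."""
    parts = urlsplit(url).path.strip("/").split("/")
    dirs, filename = parts[:-1], parts[-1]
    # -nH drops the host-name directory; --cut-dirs removes leading remote directories
    kept = dirs[cut_dirs:]
    return "/".join(kept + [filename])

url = ("http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/"
       "datacubes/20181118020/dex_cube_20181118020_036.fits")
print(wget_local_path(url))
```

With the four leading components (hetdex, HETDEX, internal, pdr1) removed, the cube lands at datacubes/20181118020/dex_cube_20181118020_036.fits relative to the directory where wget is run.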

If you would like to download in parallel, you can pipe the URL list through xargs:

    cat urls.txt | xargs -n 1 -P 4 wget --user='hetdex_internal' --password='your_password' --cut-dirs=4 -nH -x

Or, if you prefer, you can use this code to download from within this Jupyter notebook.

In [8]:
import os
import getpass
import subprocess
from urllib.parse import urlsplit

username = 'hetdex_internal'
password = getpass.getpass('Password: ')
output_root = 'pdr1'
base_url_path = '/hetdex/HETDEX/internal/pdr1/'  # everything before this will be cut


for url in urls:
    path = urlsplit(url).path
    rel_path = path.split(base_url_path, 1)[-1]  # Extract relative path under pdr1/
    local_path = os.path.join(output_root, rel_path)

    # Ensure directory exists
    os.makedirs(os.path.dirname(local_path), exist_ok=True)

    # Download if not present
    if os.path.exists(local_path):
        print(f"Skipping existing file: {rel_path}")
        continue

    print(f"Downloading: {rel_path}")
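    result = subprocess.run([
        'wget', '-q',
        '--user', username,
        '--password', password,
        '-O', local_path,
        url
    ])
    # wget -O creates the output file even when the download fails, so remove
    # the leftover file; otherwise the next run would skip it as "existing"
    if result.returncode != 0:
        print(f"Download failed: {rel_path}")
        if os.path.exists(local_path):
            os.remove(local_path)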

Password:  ········


Skipping existing file: datacubes/20181118020/dex_cube_20181118020_036.fits
Skipping existing file: datacubes/20181119015/dex_cube_20181119015_036.fits
Skipping existing file: datacubes/20181119016/dex_cube_20181119016_036.fits
Skipping existing file: datacubes/20181120012/dex_cube_20181120012_036.fits
Skipping existing file: datacubes/20181120013/dex_cube_20181120013_036.fits
Skipping existing file: datacubes/20220130020/dex_cube_20220130020_024.fits
Skipping existing file: datacubes/20220131010/dex_cube_20220131010_040.fits
Skipping existing file: datacubes/20240310015/dex_cube_20240310015_070.fits
Skipping existing file: datacubes/20240313016/dex_cube_20240313016_070.fits
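Because the download cell skips any file that already exists, an interrupted download may leave an empty or truncated file that is never fetched again. The sketch below flags missing or empty files so they can be deleted and re-downloaded. The helper `find_suspect_files` and its `min_bytes` threshold are illustrative assumptions; it cannot detect a file that is truncated but non-empty unless you raise the threshold.

```python
import os

def find_suspect_files(paths, min_bytes=1):
    """Return the paths that are missing or smaller than min_bytes."""
    suspect = []
    for path in paths:
        if not os.path.exists(path) or os.path.getsize(path) < min_bytes:
            suspect.append(path)
    return suspect

# Hypothetical usage: check one cube under the pdr1/ output directory
example_paths = [
    os.path.join("pdr1", "datacubes", "20181118020", "dex_cube_20181118020_036.fits"),
]
for path in find_suspect_files(example_paths):
    print(f"Missing or empty: {path}")
```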


We suggest decompressing the files if you don't mind them growing to roughly three times their compressed size, because read times become much faster. The function below does this, and the following cell runs it.

In [28]:

def decompress_ifu(shotid, ifuslot, pdr1_dir, verbose=False):
    """
    Decompress a RICE-compressed IFU FITS cube and overwrite the original file.
    Returns a tuple: (shotid, ifuslot, output_path, success_flag, error_message)
    """
    infile = op.join(pdr1_dir, 'datacubes', str(shotid), f'dex_cube_{shotid}_{ifuslot}.fits')

    if not op.exists(infile):
        return (shotid, ifuslot, None, False, f"Input file not found: {infile}")

    try:
        with fits.open(infile, memmap=False) as hdul:
            new_hdul = fits.HDUList()
            for hdu in hdul:
                if isinstance(hdu, fits.CompImageHDU):
                    new_hdul.append(fits.ImageHDU(data=hdu.data, header=hdu.header))
                else:
                    new_hdul.append(hdu.copy())
            new_hdul.writeto(infile, overwrite=True)

        if verbose:
            print(f"✓ Decompressed in place: {shotid}-{ifuslot}")
        return (shotid, ifuslot, infile, True, None)

    except Exception as e:
        err_msg = traceback.format_exc()
        if verbose:
            print(f"✗ Failed: {shotid}-{ifuslot}\n{err_msg}")
        return (shotid, ifuslot, None, False, err_msg)

In [10]:

for shotid, ifuslot in ifu_pairs:
    decompress_ifu(shotid, ifuslot, output_root, verbose=True)


✓ Decompressed in place: 20181118020-036
✓ Decompressed in place: 20181119015-036
✓ Decompressed in place: 20181119016-036
✓ Decompressed in place: 20181120012-036
✓ Decompressed in place: 20181120013-036
✓ Decompressed in place: 20220130020-024
✓ Decompressed in place: 20220131010-040
✓ Decompressed in place: 20240310015-070
✓ Decompressed in place: 20240313016-070
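Overwriting a file in place means an interruption during the write can leave a damaged cube. One common safeguard, sketched here under the assumption that you have enough space for a temporary copy, is to write to a temporary file in the same directory and then rename it over the original. The helper `replace_atomically` and the stand-in writer `write_demo` are hypothetical and not part of the notebook's own function.

```python
import os
import tempfile

def replace_atomically(target, write_func):
    """Call write_func(tmp_path) to write a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target))
    # Create the temporary file next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_func(tmp_path)
        os.replace(tmp_path, target)  # atomic rename on POSIX filesystems
    except BaseException:
        # Leave the original untouched and clean up the temporary file
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Stand-in writer for illustration; it just writes a few bytes
def write_demo(path):
    with open(path, "wb") as f:
        f.write(b"decompressed contents")

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "cube.fits")
    replace_atomically(target, write_demo)
    print(os.path.getsize(target))
```

In decompress_ifu, the writer could plausibly be something like `lambda tmp: new_hdul.writeto(tmp, overwrite=True)`, though that combination is an untested assumption here.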


## Example Download for a long coordinate list

In [11]:
catalog = Table.read('dr16q_hdr5.fits')
catalog.remove_column('shotid')  # this example might be done with a different catalog later
catalog_coords = SkyCoord(ra = catalog['ra'], dec= catalog['dec'], unit='deg')

In [12]:
idx_ifu, idx_catalog, d2d, d3d = catalog_coords.search_around_sky(ifu_coords, 35*u.arcsec)

In [13]:
# create master table that matches IFU observation coverage with catalog objects

In [14]:
table = hstack( [catalog_coords[idx_catalog], catalog[idx_catalog], ifu_data[idx_ifu]] )
table.rename_column('col0', 'coords')

In [15]:
# reduce cube list to unique IFU observations

In [16]:
ifulist = unique( table['shotid', 'ifuslot'])

In [17]:
# Make sure 'ifuslot' is a string to preserve naming like '034'
ifu_pairs = [(int(row['shotid']), str(row['ifuslot'])) for row in ifulist]
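If the ifuslot column were ever read as an integer, a slot such as '034' would become 34 and the file name would no longer match. A minimal sketch of a defensive file-name builder follows. The helper `cube_filename` is hypothetical, and the three-character slot width is an assumption based on the slot values shown in this notebook.

```python
def cube_filename(shotid, ifuslot):
    """Build the datacube file name, zero-padding ifuslot to three characters."""
    return f"dex_cube_{int(shotid)}_{str(ifuslot).zfill(3)}.fits"

# Both calls yield the same name, whether the slot arrives as a string or an integer
print(cube_filename(20181118020, "036"))
print(cube_filename(20181118020, 36))
```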

In [18]:
# Base URL
base_url = "http://web.corral.tacc.utexas.edu/hetdex/HETDEX/internal/pdr1/datacubes"

# Write URLs to file

urls = []
with open("urls.txt", "w") as f:
    for shotid, ifuslot in ifu_pairs:
        filename = f"dex_cube_{shotid}_{ifuslot}.fits"
        url = f"{base_url}/{shotid}/{filename}"
        urls.append(url)
        f.write(url + "\n")

In [27]:
print('Number of datacubes to download is {}. This is approximately {} GB.'.format( len( urls) , len(urls)*0.027))

Number of datacubes to download is 8013. This is approximately 216.351 GB.
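Before starting a download of this size, it can help to compare the expected footprint with the free disk space. The sketch below reuses the 0.027 per-cube figure from the estimate above and the roughly threefold growth on decompression quoted earlier in this notebook. The constant and function names are illustrative, and both sizes are approximations.

```python
import shutil

GB_PER_COMPRESSED_CUBE = 0.027  # per-cube size used in the estimate above
DECOMPRESSION_FACTOR = 3        # approximate growth quoted in this notebook

def estimate_storage_gb(n_cubes):
    """Return (compressed_gb, decompressed_gb) for n_cubes datacubes."""
    compressed = n_cubes * GB_PER_COMPRESSED_CUBE
    return compressed, compressed * DECOMPRESSION_FACTOR

compressed_gb, decompressed_gb = estimate_storage_gb(8013)
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Compressed: {compressed_gb:.0f} GB, decompressed: {decompressed_gb:.0f} GB")
print(f"Free space here: {free_gb:.0f} GB, enough: {free_gb > decompressed_gb}")
```

For the 8013 cubes in this example, that is roughly 216 GB to download and about 649 GB once decompressed.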


We recommend using wget for large downloads. 

    wget --user='hetdex_internal' --ask-password --cut-dirs=4 -nH -x -i urls.txt

You can also download in parallel. Increase the value in -P 4 to run more simultaneous downloads, but too many will cause server issues.

    cat urls.txt | xargs -n 1 -P 4 wget --user='hetdex_internal' --password='your_password' --cut-dirs=4 -nH -x

Once the files are downloaded, you can decompress them here in parallel with this function.

In [29]:

def batch_decompress_ifus(ifu_list, pdr1_dir, n_jobs=4, verbose=False):
    """
    Decompress a list of IFU cubes in parallel with progress and logging.

    Parameters:
        ifu_list (list): List of tuples (shotid, ifuslot).
        pdr1_dir (str): Root directory containing the 'datacubes' folder.
        n_jobs (int): Number of parallel jobs.
        verbose (bool): If True, print detailed logs.

    Returns:
        results (list): List of tuples (shotid, ifuslot, path, success, error_msg).
    """
    results = Parallel(n_jobs=n_jobs)(
        delayed(decompress_ifu)(shotid, ifuslot, pdr1_dir, verbose=verbose)
        for shotid, ifuslot in tqdm(ifu_list, desc="Decompressing IFUs", ncols=80)
    )

    failed = [r for r in results if not r[3]]
    if failed:
        print(f"\n⚠️  {len(failed)} cubes failed to decompress.")
        for f in failed:
            shotid, ifuslot, _, _, err_msg = f
            print(f" - {shotid}-{ifuslot}: {err_msg.splitlines()[-1]}")
    else:
        print("\n✅ All cubes decompressed successfully.")

    return results

In [None]:
# Execute batch decompression
results = batch_decompress_ifus(ifu_pairs, output_root, n_jobs=4, verbose=True)
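After a large batch run, you may want to retry only the cubes that failed. The sketch below pulls the failed (shotid, ifuslot) pairs out of a results list in the tuple layout returned by batch_decompress_ifus. The helper `failed_pairs` and the example results are hypothetical and shown only to illustrate the layout.

```python
def failed_pairs(results):
    """Extract (shotid, ifuslot) for every result whose success flag is False."""
    return [(shotid, ifuslot) for shotid, ifuslot, _, ok, _ in results if not ok]

# Hypothetical results in the (shotid, ifuslot, path, success, error_msg) layout
example_results = [
    (20181118020, "036", "pdr1/datacubes/20181118020/dex_cube_20181118020_036.fits", True, None),
    (20181119015, "036", None, False, "Input file not found: ..."),
]
retry = failed_pairs(example_results)
print(retry)
```

The resulting list can be passed back to batch_decompress_ifus as its ifu_list argument once the underlying problem (for example, a missing download) is resolved.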