Merge pull request #76 from andrewdelman/cloud_compatibility
minor changes to ecco_s3_retrieve and tutorials
andrewdelman committed Mar 22, 2024
2 parents f50da80 + 51f2d0d commit 88c0b95
Showing 4 changed files with 48 additions and 31 deletions.
Original file line number Diff line number Diff line change
@@ -5105,7 +5105,8 @@
"text": [
"Help on function ecco_podaac_s3_get in module ecco_s3_retrieve:\n",
"\n",
"ecco_podaac_s3_get(ShortName, StartDate, EndDate, download_root_dir=None, n_workers=6, force_redownload=False, return_downloaded_files=False)\n",
"ecco_podaac_s3_get(ShortName, StartDate, EndDate, download_root_dir=None, n_workers=6,\n",
" force_redownload=False, return_downloaded_files=False)\n",
"    This routine downloads ECCO datasets from PO.DAAC, to be stored locally on an AWS EC2 instance running in region us-west-2. \n",
" It is adapted from the ecco_podaac_download function in the ecco_download.py module, and is the AWS Cloud equivalent of \n",
" ecco_podaac_download.\n",
@@ -5316,11 +5317,15 @@
"text": [
"Help on function ecco_podaac_s3_get_diskaware in module ecco_s3_retrieve:\n",
"\n",
"ecco_podaac_s3_get_diskaware(ShortNames, StartDate, EndDate, max_avail_frac=0.5, snapshot_interval=None, download_root_dir=None, n_workers=6, force_redownload=False)\n",
" This function estimates the storage footprint of ECCO datasets, given ShortName(s), a date range, and which files (if any) are already present.\n",
" If the current instance's available storage is at least twice the footprint of the new files, they are downloaded and stored locally on the instance \n",
" using ecco_podaac_s3_get (hosting files locally typically speeds up loading and computation).\n",
" Otherwise, the files are \"opened\" using ecco_podaac_s3_open so that they can be accessed directly on S3 without occupying local storage.\n",
"ecco_podaac_s3_get_diskaware(ShortNames, StartDate, EndDate, max_avail_frac=0.5, \n",
" snapshot_interval=None, download_root_dir=None, n_workers=6, force_redownload=False)\n",
" This function estimates the storage footprint of ECCO datasets, given ShortName(s), a date range, and which \n",
" files (if any) are already present.\n",
" If the footprint of the files to be downloaded (not including files already on the instance or re-downloads) \n",
" is <= the max_avail_frac specified of the instance's available storage, they are downloaded and stored locally \n",
" on the instance (hosting files locally typically speeds up loading and computation).\n",
"    Otherwise, the files are \"opened\" using ecco_podaac_s3_open so that they can be accessed directly \n",
" on S3 without occupying local storage.\n",
" \n",
" Parameters\n",
" ----------\n",
@@ -5336,11 +5341,14 @@
" \n",
" max_avail_frac: float, maximum fraction of remaining available disk space to use in storing current ECCO datasets.\n",
" This determines whether the dataset files are stored on the current instance, or opened on S3.\n",
" Valid range is [0,0.9]. If number provided is outside this range, it is replaced by the closer endpoint of the range.\n",
" Valid range is [0,0.9]. If number provided is outside this range, it is replaced by the closer \n",
" endpoint of the range.\n",
" \n",
" snapshot_interval: ('monthly', 'daily', or None), if snapshot datasets are included in ShortNames, this determines whether\n",
" snapshots are included for only the beginning/end of each month ('monthly'), or for every day ('daily').\n",
" If None or not specified, defaults to 'daily' if any daily mean ShortNames are included and 'monthly' otherwise.\n",
" snapshot_interval: ('monthly', 'daily', or None), if snapshot datasets are included in ShortNames, \n",
" this determines whether snapshots are included for only the beginning/end of each month \n",
" ('monthly'), or for every day ('daily').\n",
" If None or not specified, defaults to 'daily' if any daily mean ShortNames are included \n",
" and 'monthly' otherwise.\n",
" \n",
" download_root_dir: str, defines parent directory to download files to.\n",
" Files will be downloaded to directory download_root_dir/ShortName/.\n",
@@ -5373,7 +5381,7 @@
"id": "887f8436-98d3-4b09-a7d1-936810717592",
"metadata": {},
"source": [
"The syntax of this function is similar to `ecco_podaac_s3_get`, but there are two arguments specific to this function: **max_avail_frac** and **snapshot_interval**. **max_avail_frac** sets the storage threshold for whether the specified dataset(s) will be downloaded to the user's instance vs. opened from S3. For example, the default max_avail_frac = 0.5 will download the datasets if they will occupy less than 50% of the instance's remaining available memory. **snapshot_interval** applies only if there are snapshot datasets included in ShortNames, e.g., it could be useful to specify snapshot_interval = 'monthly' if you want to limit the size of the potential download.\n",
"The syntax of this function is similar to `ecco_podaac_s3_get`, but there are two arguments specific to this function: **max_avail_frac** and **snapshot_interval**. **max_avail_frac** sets the storage threshold for whether the specified dataset(s) will be downloaded to the user's instance vs. opened from S3. For example, the default max_avail_frac = 0.5 will download the datasets if they will occupy <= 50% of the instance's remaining available storage. **snapshot_interval** applies only if there are snapshot datasets included in ShortNames, e.g., it could be useful to specify snapshot_interval = 'monthly' if you want to limit the size of the potential download.\n",
"\n",
"Now let's repeat the calculation that was done above by invoking this function, first removing the files if they are already on disk to replicate the previous example in Method 2 as closely as possible."
]
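The download-vs-open decision described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: `footprint_bytes` is assumed to have been estimated elsewhere (the real function derives it from the granule metadata), and the clamping of `max_avail_frac` to [0, 0.9] follows the docstring.

```python
import shutil

def diskaware_choice(footprint_bytes, download_dir=".", max_avail_frac=0.5):
    """Hypothetical sketch of the disk-aware decision in
    ecco_podaac_s3_get_diskaware: download the granules if their
    estimated footprint fits within max_avail_frac of the available
    storage, otherwise open them directly on S3."""
    # clamp max_avail_frac to the documented valid range [0, 0.9],
    # replacing out-of-range values with the closer endpoint
    max_avail_frac = min(max(max_avail_frac, 0.0), 0.9)
    avail_bytes = shutil.disk_usage(download_dir).free
    if footprint_bytes <= max_avail_frac * avail_bytes:
        return "download"   # would call ecco_podaac_s3_get
    return "open_s3"        # would call ecco_podaac_s3_open
```

For example, a zero-byte footprint always fits and yields a download, while a footprint far larger than any disk falls through to opening on S3.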
38 changes: 23 additions & 15 deletions ECCO-ACCESS/ecco_s3_retrieve.py
@@ -33,7 +33,8 @@ def ecco_podaac_s3_query(ShortName,StartDate,EndDate):
Returns
-------
s3_files_list: str or list, opened file(s) on S3 that can be passed directly to xarray (open_dataset or open_mfdataset)
s3_files_list: str or list, opened file(s) on S3 that can be passed directly to xarray
(open_dataset or open_mfdataset)
"""

@@ -149,7 +150,8 @@ def get_granules(params: dict):
# actually log in with this command:
setup_earthdata_login_auth()

# Query the NASA Common Metadata Repository to find the URL of every granule associated with the desired ECCO Dataset and date range of interest.
# Query the NASA Common Metadata Repository to find the URL of every granule associated with the desired
# ECCO Dataset and date range of interest.

# create a Python dictionary with our search criteria: `ShortName` and `temporal`
input_search_params = {'ShortName': ShortName,
@@ -440,15 +442,18 @@ def ecco_podaac_s3_get(ShortName,StartDate,EndDate,download_root_dir=None,n_work
###================================================================================================================


def ecco_podaac_s3_get_diskaware(ShortNames,StartDate,EndDate,max_avail_frac=0.5,snapshot_interval=None,download_root_dir=None,n_workers=6,\
force_redownload=False):
def ecco_podaac_s3_get_diskaware(ShortNames,StartDate,EndDate,max_avail_frac=0.5,snapshot_interval=None,\
download_root_dir=None,n_workers=6,force_redownload=False):

"""
This function estimates the storage footprint of ECCO datasets, given ShortName(s), a date range, and which files (if any) are already present.
If the current instance's available storage is at least twice the footprint of the new files, they are downloaded and stored locally on the instance
using ecco_podaac_s3_get (hosting files locally typically speeds up loading and computation).
Otherwise, the files are "opened" using ecco_podaac_s3_open so that they can be accessed directly on S3 without occupying local storage.
This function estimates the storage footprint of ECCO datasets, given ShortName(s), a date range, and which
files (if any) are already present.
If the footprint of the files to be downloaded (not including files already on the instance or re-downloads)
is <= the max_avail_frac specified of the instance's available storage, they are downloaded and stored locally
on the instance (hosting files locally typically speeds up loading and computation).
Otherwise, the files are "opened" using ecco_podaac_s3_open so that they can be accessed directly
on S3 without occupying local storage.
Parameters
----------
@@ -464,11 +469,14 @@ def ecco_podaac_s3_get_diskaware(ShortNames,StartDate,EndDate,max_avail_frac=0.5
max_avail_frac: float, maximum fraction of remaining available disk space to use in storing current ECCO datasets.
This determines whether the dataset files are stored on the current instance, or opened on S3.
Valid range is [0,0.9]. If number provided is outside this range, it is replaced by the closer endpoint of the range.
Valid range is [0,0.9]. If number provided is outside this range, it is replaced by the closer
endpoint of the range.
snapshot_interval: ('monthly', 'daily', or None), if snapshot datasets are included in ShortNames, this determines whether
snapshots are included for only the beginning/end of each month ('monthly'), or for every day ('daily').
If None or not specified, defaults to 'daily' if any daily mean ShortNames are included and 'monthly' otherwise.
snapshot_interval: ('monthly', 'daily', or None), if snapshot datasets are included in ShortNames,
this determines whether snapshots are included for only the beginning/end of each month
('monthly'), or for every day ('daily').
If None or not specified, defaults to 'daily' if any daily mean ShortNames are included
and 'monthly' otherwise.
download_root_dir: str, defines parent directory to download files to.
Files will be downloaded to directory download_root_dir/ShortName/.
Expand All @@ -484,8 +492,8 @@ def ecco_podaac_s3_get_diskaware(ShortNames,StartDate,EndDate,max_avail_frac=0.5
Returns
-------
retrieved_files: dict, with keys: ShortNames and values: downloaded or opened file(s) with path on local instance or on S3,
that can be passed directly to xarray (open_dataset or open_mfdataset).
retrieved_files: dict, with keys: ShortNames and values: downloaded or opened file(s) with path on local instance
or on S3, that can be passed directly to xarray (open_dataset or open_mfdataset).
"""

@@ -600,4 +608,4 @@ def ecco_podaac_s3_get_diskaware(ShortNames,StartDate,EndDate,max_avail_frac=0.5

retrieved_files[curr_shortname] = open_files

return retrieved_files
return retrieved_files
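The snapshot_interval defaulting rule in the docstring above can be sketched as follows. This is an illustrative helper, not code from the module, and it assumes daily-mean ECCO dataset ShortNames contain the substring 'DAILY'.

```python
def default_snapshot_interval(snapshot_interval, shortnames):
    """Hypothetical sketch of the documented default: keep an explicit
    'monthly'/'daily' choice; otherwise use 'daily' if any daily mean
    ShortNames are requested, and 'monthly' otherwise."""
    if snapshot_interval in ("monthly", "daily"):
        return snapshot_interval
    # assumption: daily-mean dataset ShortNames contain 'DAILY'
    if any("DAILY" in name for name in shortnames):
        return "daily"
    return "monthly"
```

For example, `default_snapshot_interval(None, ['ECCO_L4_TEMP_SALINITY_LLC0090GRID_MONTHLY_V4R4'])` returns 'monthly'.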
@@ -169,9 +169,12 @@
" max_avail_frac=0.5,\\\n",
" download_root_dir=ECCO_dir)\n",
" ecco_grid = xr.open_mfdataset(files_nested_list[ShortNames_list[0]])\n",
" ecco_vars_TS = xr.open_mfdataset(files_nested_list[ShortNames_list[1]],compat='override',data_vars='minimal',coords='minimal')\n",
" ecco_vars_vel = xr.open_mfdataset(files_nested_list[ShortNames_list[2]],compat='override',data_vars='minimal',coords='minimal')\n",
" ecco_vars_atm = xr.open_mfdataset(files_nested_list[ShortNames_list[3]],compat='override',data_vars='minimal',coords='minimal')\n",
" ecco_vars_TS = xr.open_mfdataset(files_nested_list[ShortNames_list[1]],\\\n",
" compat='override',data_vars='minimal',coords='minimal')\n",
" ecco_vars_vel = xr.open_mfdataset(files_nested_list[ShortNames_list[2]],\\\n",
" compat='override',data_vars='minimal',coords='minimal')\n",
" ecco_vars_atm = xr.open_mfdataset(files_nested_list[ShortNames_list[3]],\\\n",
" compat='override',data_vars='minimal',coords='minimal')\n",
"else:\n",
" ecco_grid = xr.open_mfdataset(glob.glob(join(ECCO_dir,ShortNames_list[0],'*.nc'))[0])\n",
" ecco_vars_TS = xr.open_mfdataset(glob.glob(join(ECCO_dir,ShortNames_list[1],'*2000-*.nc')),\\\n",
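The open_mfdataset calls above pass compat='override', data_vars='minimal', coords='minimal' so that non-time-varying grid variables are taken once from the first granule rather than concatenated along time. The effect can be illustrated (assuming xarray and numpy are available) with xr.concat on two small in-memory datasets standing in for granules:

```python
import numpy as np
import xarray as xr

# two single-time "granules" sharing a static grid coordinate XC
ds1 = xr.Dataset(
    {"THETA": (("time", "j", "i"), np.zeros((1, 2, 2)))},
    coords={"time": [0], "XC": (("j", "i"), np.ones((2, 2)))},
)
ds2 = ds1.assign_coords(time=[1])

# data_vars/coords='minimal' concatenate only variables that already
# have the 'time' dimension; compat='override' keeps XC from ds1
combined = xr.concat([ds1, ds2], dim="time",
                     data_vars="minimal", coords="minimal", compat="override")
assert combined["XC"].dims == ("j", "i")      # XC not stacked along time
assert combined["THETA"].sizes["time"] == 2
```

With the defaults (data_vars='all', coords='different'), the static XC would instead be compared or broadcast along time, which is slower and uses more memory for large grids.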
@@ -15,8 +15,6 @@
"\n",
"The ECCO version 4 release 4 (v4r4) files are provided as NetCDF files. This tutorial shows you how to download and open these files using Python code, and takes a look at the structure of these files. The ECCO output is available as a number of **datasets** that each contain a few variables. Each dataset consists of files corresponding to a single time coordinate (monthly mean, daily mean, or snapshot). Each dataset file that represents a single time is called a **granule**.\n",
"\n",
"or alternatively use *wget* to obtain the files.\n",
"\n",
"In this first tutorial we will start slowly, providing detail at every step. Later tutorials will assume knowledge of some basic operations introduced here.\n",
"\n",
"Let's get started.\n",