# NASA Earth Data
### Written by Minh Phan

This tutorial shows one of many ways to download data from NASA's [EarthData](https://www.earthdata.nasa.gov/) database, focusing on datasets (collections, in EarthData's terminology) hosted in the cloud.

## Authorize credentials

To download data from NASA's EarthData database, it's recommended that you set up a .netrc credential file so that you don't have to log in manually every time you run a download script. To do this, consult the 2021 Cloud Hackathon's tutorial [here](https://github.com/NASA-Openscapes/2021-Cloud-Hackathon/blob/main/tutorials/04_NASA_Earthdata_Authentication.ipynb). Make sure to register an EarthData account before following that tutorial.

Once you have set up the .netrc file, continue with the tutorial below.
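For readers who prefer to create the credential file from Python, here is a minimal sketch. It assumes the standard .netrc entry format and the Earthdata Login host `urs.earthdata.nasa.gov`; the helper name `write_netrc_entry` and its parameters are hypothetical and not part of the linked tutorial.

```python
import os
import stat
from pathlib import Path

# Earthdata Login host that the .netrc entry refers to (assumption stated above)
EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def write_netrc_entry(username, password, netrc_path=None):
    """Append an Earthdata Login entry to a .netrc file (hypothetical helper)."""
    # Default to ~/.netrc when no explicit path is given
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    entry = f"machine {EARTHDATA_HOST} login {username} password {password}\n"
    with open(path, "a") as f:
        f.write(entry)
    # Restrict the file to owner read/write, since it stores a password
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

Calling `write_netrc_entry("my_username", "my_password")` would append the entry to `~/.netrc`. Avoid committing that file or the call itself to version control.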


## Import necessary libraries

In [1]:
import xarray as xr
import earthaccess
import numpy as np
import pandas as pd
import os, glob

## Download data to your local machine using the earthaccess library

The earthaccess library streamlines searching for, slicing, and downloading granules. For cloud-hosted datasets (which this tutorial works best with), we download granules to the local machine instead of streaming them into the working Python script, because some users are not located in the us-west region, where streaming is effective. Local downloads can take up a lot of disk space, but they are consistent, and we also provide some tweaks to save as much space as possible, especially if your research needs a long temporal range but not global coverage.

In [2]:
# Log in using .netrc file
auth = earthaccess.login(strategy="netrc")

No .netrc found in /home/rstudio


FileNotFoundError: [Errno 2] No such file or directory: '/home/rstudio/.netrc'
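The error above appears when no .netrc file exists in the home directory. As a rough sketch, you could check for a usable entry with Python's standard `netrc` module before logging in. The helper name `has_earthdata_credentials` is hypothetical, and the host name `urs.earthdata.nasa.gov` is assumed to be the one your .netrc entry uses.

```python
import netrc
from pathlib import Path

# Assumed Earthdata Login host name used in the .netrc entry
EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def has_earthdata_credentials(netrc_path=None):
    """Return True if the .netrc file exists and has an entry for the host."""
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    if not path.is_file():
        return False
    try:
        entry = netrc.netrc(str(path)).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        # A malformed file is treated the same as a missing one
        return False
    return entry is not None
```

If this returns `False`, one option is to create the file first, or to log in with `earthaccess.login(strategy="interactive")` instead of the netrc strategy.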

## Download granules for an extended period of time

You can consult the [earthaccess library website](https://earthdata.readthedocs.io/en/latest/tutorials/demo/) or its [notebooks](https://github.com/nsidc/earthaccess/tree/main/notebooks) for code snippets on how to browse and look up collections. This notebook focuses mainly on downloading. First, we need to get the list of granules to download.

In [5]:
# EarthAccess's approach to collecting granules
results = earthaccess.search_data(
    short_name='OSCAR_L4_OC_FINAL_V2.0',
    cloud_hosted=True,
    bounding_box = (60, 5, 80, 25),
    temporal=("2000-01", "2001-12")
)

Granules found: 335


Since earthaccess does not support spatial slicing, we developed a method to download, slice, combine, and export data year by year, then delete the temporary downloaded files to save disk space. Assuming you already know the temporal and spatial range of your chosen dataset, we first download the data for each year into a temporary folder, then slice the data and export the combined result to another folder.
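To make the year-by-year split concrete, here is a small sketch of how a start and end month could be turned into one temporal range per year. It mirrors the string formatting used in this notebook's download function; the helper name `yearly_ranges` is hypothetical.

```python
def yearly_ranges(month_start, month_end):
    """Build one (start, end) temporal pair per year from 'YYYY-MM' strings."""
    # Only the year part (first four characters) of each string is used
    first_year = int(month_start[:4])
    last_year = int(month_end[:4])
    # Each range runs from the start of a year to January of the next year
    return [(f"{year}", f"{year + 1}-01") for year in range(first_year, last_year + 1)]


for start, end in yearly_ranges("2000-01", "2001-12"):
    print(start, end)
```

Because each range ends in January of the following year, consecutive years can share a time step, which is why duplicates are removed when the yearly files are combined.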

In [3]:
# Our approach

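def download_granules_by_year(short_name, month_start, month_end, lat1=5, lat2=25, lon1=60, lon2=80):
    """
    Download, spatially slice, and export one netCDF file per year.
    Only the year part (first four characters) of month_start and month_end is used.
    """
    for year in range(int(month_start[:4]), int(month_end[:4])+1):
        print('Collecting granules')
        # query from the start of the year to January of the next year; fetch at most 366 granules
        granules = earthaccess.granule_query().short_name(short_name).temporal(f'{year}',f'{year+1}-01').get(366)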
        
        MAIN_FOLDER = 'demonstrated data/earth_data'
        TEMP_FOLDER = 'temp'
        path_temp_folder = os.path.join(MAIN_FOLDER, TEMP_FOLDER)
        path_processed_folder = os.path.join(MAIN_FOLDER, short_name)
        # create folder to store data
        if not os.path.exists(path_temp_folder):
            os.makedirs(path_temp_folder)
        if not os.path.exists(path_processed_folder):
            os.makedirs(path_processed_folder)
        files = earthaccess.download(granules, path_temp_folder)
        
        # grab first file in directory to examine lat and lon values
        first_file = os.listdir(path_temp_folder)[0]
        
        # get bounding box
        lat1_idx, lat2_idx, lon1_idx, lon2_idx = get_bounding_box(os.path.join(path_temp_folder, first_file), lat1, lat2, lon1, lon2)
        # combine files together 
        ## for this example collection, the coordinate variables are named 'lat' and 'lon' while their underlying dimensions are 'latitude' and 'longitude', respectively
        ## this may or may not apply to other datasets on the site.
        
        ## if the dataset's coordinates are slice-able (indexed), use label-based slicing instead (both endpoints are included):
        ### data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').sel(lat=slice(lat1, lat2), lon=slice(lon1, lon2))
        
        print('Slicing...')
        data = xr.open_mfdataset(f'{path_temp_folder}/*.nc').isel(latitude=slice(lat1_idx, lat2_idx+1), longitude=slice(lon1_idx, lon2_idx+1))
        data.to_netcdf(f'{path_processed_folder}/{year}.nc')
        
        # delete files in the temporary folder
        print('Deleting temporary files...')
        files = glob.glob(f'{path_temp_folder}/*.*')
        for f in files:
            os.remove(f)

def get_bounding_box(file_path, lat1=0, lat2=30, lon1=60, lon2=80):
    """
    The dataset we experimented with did not have indexed coordinates,
    so we resorted to slicing by index position.
    """
    ds = xr.open_dataset(file_path)
    
    # modify depending on name of latitude and longitude coordinates
    lat_vals = ds.lat.values
    lon_vals = ds.lon.values
    
    lat1_idx = np.where(lat_vals==lat1)[0][0]
    lat2_idx = np.where(lat_vals==lat2)[0][0]
    lon1_idx = np.where(lon_vals==lon1)[0][0]
    lon2_idx = np.where(lon_vals==lon2)[0][0]
    
    return lat1_idx, lat2_idx, lon1_idx, lon2_idx
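`get_bounding_box` relies on exact equality, so it raises an `IndexError` if a requested value is not exactly on the grid. A possible alternative, sketched below, is to pick the nearest grid point instead. The helper name `nearest_index` and the quarter-degree grid are hypothetical, chosen only for illustration.

```python
import numpy as np


def nearest_index(values, target):
    """Return the position of the grid value closest to target."""
    values = np.asarray(values, dtype=float)
    return int(np.abs(values - target).argmin())


# Hypothetical quarter-degree latitude grid, for illustration only
lat_vals = np.arange(-89.75, 90.0, 0.25)

lat1_idx = nearest_index(lat_vals, 5.1)   # 5.1 is not on the grid
lat2_idx = nearest_index(lat_vals, 25.0)
print(lat_vals[lat1_idx], lat_vals[lat2_idx])
```

The trade-off is that a typo in the requested bounds no longer fails loudly, so it is worth printing the matched values as a check.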

In [None]:
download_granules_by_year(short_name='OSCAR_L4_OC_FINAL_V2.0',
                          month_start='2000-01', month_end='2001-12', 
                          lat1=5, lat2=25, lon1=60, lon2=80)

Collecting granules
 Getting 366 granules, approx download size: 0.0 GB


Slicing...
Deleting temporary files...
Collecting granules
 Getting 366 granules, approx download size: 0.0 GB


Slicing...
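The download function above cleans up by deleting every file that matches `*.*` in the temporary folder. One alternative, sketched here with the standard library only, is to let `tempfile.TemporaryDirectory` remove the folder automatically. The helper `process_in_temp_dir` and its `process` callback are hypothetical stand-ins for the download-and-slice step.

```python
import os
import tempfile


def process_in_temp_dir(process, parent_dir="."):
    """Run process(temp_path) in a temporary folder that is removed afterwards."""
    os.makedirs(parent_dir, exist_ok=True)
    # The folder and everything in it is deleted when the with-block exits,
    # even if process() raises an exception
    with tempfile.TemporaryDirectory(dir=parent_dir) as temp_path:
        return process(temp_path)
```

In this notebook's workflow, `process` would download one year of granules into `temp_path`, slice them, and write the yearly netCDF file to the processed folder before returning.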


## Combine files together

Now that all the netCDF4 files are in one place and spatially sliced, combining them is a piece of cake! Note that some time steps overlap between consecutive yearly files because of how the data was collected year by year earlier, so it's best practice to remove duplicates (if any).

In [19]:
combined = xr.open_mfdataset('demonstrated data/earth_data/OSCAR_L4_OC_FINAL_V2.0/*.nc')

# convert CFTimeIndex to datetimeindex
combined['time'] = combined.indexes['time'].to_datetimeindex()

unique_time_idxs = np.unique(combined.time, return_index=True)[1]
combined = combined.isel(time=unique_time_idxs)
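To see how the deduplication step works, here is a small standalone sketch using a made-up time axis. `np.unique(..., return_index=True)` returns the sorted unique values together with the index of each value's first occurrence, and those indices are what get passed to `isel`.

```python
import numpy as np

# Made-up time axis in which 2001-01-01 appears twice
times = np.array(
    ["2000-12-31", "2001-01-01", "2001-01-01", "2001-01-02"],
    dtype="datetime64[D]",
)

# The second return value holds the first-occurrence index of each unique time
unique_times, first_idxs = np.unique(times, return_index=True)

# Indexing with those positions keeps one entry per time step
deduplicated = times[first_idxs]
print(first_idxs, deduplicated)
```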


In [20]:
combined

| Shape | Number of arrays | Bytes (array = chunk) | Dask graph | Data type |
|---|---|---|---|---|
| (177,) | 1 | 1.38 kiB | 1 chunk in 2 graph layers | float64 numpy.ndarray |
| (241,) | 1 | 1.88 kiB | 1 chunk in 2 graph layers | float64 numpy.ndarray |
| (732, 241, 177) | 4 | 238.23 MiB | 1 chunk in 2 graph layers | float64 numpy.ndarray |
