# Case Study: 2017 Northern Plains Flash Drought

In [None]:
import earthaccess
import numpy as np
import xarray as xr
from matplotlib import pyplot

auth = earthaccess.login()

## Before We Get Started

For this case study, we're going to download some data from [the North American Land Data Assimilation System (NLDAS).](https://disc.gsfc.nasa.gov/datasets/NLDAS_NOAH0125_M_2.0/summary?keywords=NLDAS)

Consequently, we'll need a place to store these raw data. It's important that we have a folder in our file system reserved for these raw data so we can keep them separate from any new datasets we might create. 

**Let's create a folder called `data_raw` in our Jupyter Notebook's file system.**

We should never modify the raw data (that we're about to download). Doing so would make it hard to repeat the analysis we're going to perform as we will lose the original data values. This doesn't mean we have to keep the `data_raw` folder around forever: if it's publicly available data, we can always download it again.
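If we'd rather create the folder from code than through the Jupyter file browser, a minimal sketch might look like the one below. It assumes the notebook's current working directory is where `data_raw` should live.

```python
from pathlib import Path

# Assumption: the folder should sit in the current working directory,
# i.e., next to this notebook
raw_dir = Path('data_raw')

# exist_ok = True makes this cell safe to re-run if the folder already exists
raw_dir.mkdir(exist_ok = True)
raw_dir.is_dir()
```

Because of `exist_ok = True`, re-running the cell never raises an error or touches files already in the folder.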

---

## Downloading the Data

In [None]:
# TODO Show how to find the "short_name" and "version"
# TODO Compare to the pattern for downloading a single granule (single date)
# TODO Show how string formatting works

results = []

for year in range(2008, 2018):
    search = earthaccess.search_data(
        short_name = 'NLDAS_NOAH0125_M',
        version = '2.0',
        temporal = (f'{year}-08', f'{year}-08'))
    results.extend(search)

In [None]:
len(results)
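The search loop above builds its date strings with Python's f-strings. As a small illustration of how that string formatting works (the `month` variable and the zero-padding are additions for this sketch, not part of the search above):

```python
# An f-string replaces each {expression} with its value
for year in range(2008, 2011):
    print(f'{year}-08')

# A format specifier goes after a colon; here, "02d" zero-pads the
# month to two digits
month = 8
bounds = [f'{year}-{month:02d}' for year in range(2008, 2018)]
bounds
```

Remember that `range(2008, 2018)` stops *before* 2018, so the last string is for 2017.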

Previously, we've used `earthaccess.open()` to get access to these data. This time, we'll use `earthaccess.download()`. What's the difference?

- `earthaccess.open()` provides a file-like object that is available to be downloaded and read *only when we need it.*
- `earthaccess.download()` actually downloads the file to our file system.

**Note that, below, we're telling `earthaccess.download()` to put the downloaded files into our new `data_raw` folder.**

In [None]:
earthaccess.download(results, 'data_raw')

In [None]:
import glob

file_list = glob.glob('data_raw/*.nc')
file_list.sort()
file_list
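Why sort the file list? `glob.glob()` returns files in arbitrary order, and later we'll rely on the files being in date order. Here is a small sketch of why a plain alphabetical sort is enough, using hypothetical file names (not the real NLDAS names) that embed the date as `YYYYMM`:

```python
# Hypothetical file names that embed a year and month as YYYYMM
names = ['monthly_201708.nc', 'monthly_200808.nc', 'monthly_201208.nc']

# Because the dates are zero-padded and ordered year-then-month,
# sorting the strings alphabetically also sorts them by date
names.sort()
names
```

This only works when every file name follows the same pattern; it's worth checking the sorted list by eye, as we did above.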

In [None]:
import netCDF4

# Open just the first file
nc = netCDF4.Dataset(file_list[0])

In [None]:
# TODO Discuss file-level metadata

nc

In [None]:
# TODO Discuss variable-level metadata
# TODO Discuss "scale_factor" and "add_offset" and "missing_value"

et = nc.variables['Evap']
et
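Some netCDF4 variables carry `scale_factor`, `add_offset`, and `missing_value` attributes. Under the CF convention, stored ("packed") values are converted to physical values as `packed * scale_factor + add_offset`, and cells equal to `missing_value` hold no data. By default, both `netCDF4` and `xarray` apply this conversion for us when we read a variable. The sketch below does it by hand with NumPy; every number in it is made up for illustration and does not come from the NLDAS file.

```python
import numpy as np

# Made-up attribute values, for illustration only
scale_factor = 0.01
add_offset = 100.0
missing_value = -9999

# Packed integers, as they might be stored on disk
packed = np.array([[0, 250], [-9999, 1000]], dtype = np.int16)

# Remember which cells are missing, then unpack the values
is_missing = packed == missing_value
unpacked = packed * scale_factor + add_offset
# Missing cells get NaN instead of a (meaningless) unpacked number
unpacked[is_missing] = np.nan
unpacked
```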

In [None]:
# TODO Note the shape
# TODO Note the orientation
# TODO Discuss CF convention

pyplot.imshow(et[0])

In [None]:
pyplot.imshow(np.flipud(et[0]))
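Why does flipping help? By default, `pyplot.imshow()` draws the first row of an array at the top of the image. If a file stores its rows from south to north, the first row is the southernmost one, so the map appears upside-down. Here is a tiny sketch with made-up values showing what `np.flipud()` does:

```python
import numpy as np

# A tiny 3 x 2 "image" whose first row is the southernmost latitude
# (made-up values)
image = np.array([
    [1, 1],   # South
    [2, 2],
    [3, 3]])  # North

# np.flipud() reverses the order of the rows (the first axis)
flipped = np.flipud(image)
flipped
```

After flipping, the northernmost row comes first, which is what `imshow()` expects for a north-up map.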

In [None]:
# TODO Note data type, why we're changing it to an array

type(et)

### Opening netCDF4 Data with `xarray`

Instead of using the `netCDF4` module, we can use `xarray` to open netCDF4 files.

In [None]:
dataset = xr.open_dataset(file_list[0])
dataset

A big advantage to using `xarray` is how it organizes all the information we're interested in. Recall that `xarray` variables can be accessed using a dictionary-like indexing:

In [None]:
dataset['Evap']

Another advantage is that `xarray` already knows how these netCDF4 variables should be displayed; it's capable of figuring out, based on the coordinates, how the image should be oriented.

In [None]:
dataset['Evap'].plot()

In [None]:
arr = dataset['Evap'].to_numpy()
arr

In [None]:
arr.shape

In [None]:
pyplot.imshow(arr[0])

In [None]:
et_series = []

for filename in file_list:
    ds = xr.open_dataset(filename)
    et = ds['Evap'].to_numpy()
    # Don't forget to flip the image upside-down!
    et_series.append(np.flipud(et[0]))

et_series = np.stack(et_series, axis = 0)
et_series.shape

---

## Computing a Climatology

In [None]:
# TODO define a climatology

et_clim = et_series.mean(axis = 0)
et_clim.shape
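Here, a climatology is the long-term average for the same time of year. With one image per year stacked along the first axis, that means averaging over axis 0. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up stack: 10 "years" of a 2 x 3 grid, where every cell in
# year i has the value i
stack = np.stack([np.full((2, 3), float(i)) for i in range(10)], axis = 0)
stack.shape   # (10, 2, 3)

# Averaging over axis 0 (time) leaves one value per grid cell
clim = stack.mean(axis = 0)
clim.shape    # (2, 3)
```

The time axis disappears and each grid cell holds the mean of its ten yearly values (4.5 in this toy example).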

In [None]:
pyplot.imshow(et_clim)
pyplot.colorbar()

In [None]:
# TODO NoData

et_clim.min()

In [None]:
et_clim[et_clim < 0] = np.nan

pyplot.imshow(et_clim)
cbar = pyplot.colorbar()
cbar.set_label('Evapotranspiration [kg m-2]')
pyplot.title('Mean September ET')
pyplot.show()
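Masking the negative values *after* averaging works when a cell is NoData in every year. If a cell were missing in only some years, the large negative NoData value would contaminate its mean. An alternative sketch, assuming a NoData value of -9999, masks first and then averages with `np.nanmean()`:

```python
import numpy as np

nodata = -9999  # Assumed NoData value

# Made-up stack: 3 time steps of a 1 x 2 grid
stack = np.array([
    [[1.0, nodata]],
    [[2.0, 4.0]],
    [[3.0, 6.0]]])

# Replace NoData with NaN *before* averaging...
masked = np.where(stack == nodata, np.nan, stack)
# ...then average while ignoring the NaN values
clim = np.nanmean(masked, axis = 0)
clim
```

The second cell's mean uses only its two valid values, rather than being dragged down by -9999.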

### How Does September 2017 Compare?

In [None]:
file_list[-1]

In [None]:
et_2017_anomaly = et_series[-1] - et_clim

pyplot.imshow(et_2017_anomaly, cmap = 'RdYlBu')
cbar = pyplot.colorbar()
cbar.set_label('Evapotranspiration Anomaly [kg m-2]')
pyplot.show()
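With a diverging colormap like `RdYlBu`, the map is easier to read when zero sits at the middle of the color scale. One way to get that, sketched here with made-up values, is to compute symmetric limits and pass them to `imshow()` through its `vmin` and `vmax` arguments:

```python
import numpy as np

# Made-up anomaly values, including a NaN (NoData) cell
anomaly = np.array([[-3.0, 1.0], [np.nan, 2.0]])

# Largest absolute anomaly, ignoring NaN
limit = np.nanmax(np.abs(anomaly))

# Passing these to imshow() centers the colormap on zero, e.g.:
#   pyplot.imshow(anomaly, cmap = 'RdYlBu', vmin = -limit, vmax = limit)
vmin, vmax = -limit, limit
```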

#### Using `cartopy`

In [None]:
extent = [
    nc.variables['lon'][:].min(),
    nc.variables['lon'][:].max(),
    nc.variables['lat'][:].min(),
    nc.variables['lat'][:].max()
]
extent

In [None]:
import cartopy.crs as ccrs
import cartopy.io.shapereader as shpreader

shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution = '110m', category = 'cultural', name = shapename)

fig = pyplot.figure()
ax = fig.add_subplot(1, 1, 1, projection = ccrs.PlateCarree())
ax.imshow(et_2017_anomaly, extent = extent, cmap = 'RdYlBu')
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(), facecolor = 'none')
pyplot.show()

---

## Saving Our Reproducible Workflow

In [None]:
# TODO Outline the steps in our workflow

In [None]:
# TODO Discuss docstring

def stack_time_series(netcdf_file_list, variable, nodata = -9999):
    '''
    Generates a time series for a given variable, based on an 
    ordered list of netCDF4 files.

    Parameters
    ----------
    netcdf_file_list : list
        The list of netCDF4 files, where each file represents a date
    variable : str
        The name of the variable of interest
    nodata : int or float
        The value that marks missing data (Default: -9999)

    Returns
    -------
    numpy.ndarray
    '''
    series = []
    for filename in netcdf_file_list:
        ds = xr.open_dataset(filename)
        arr = ds[variable].to_numpy()
        # Don't forget to flip the image upside-down!
        series.append(np.flipud(arr[0]))
    
    series = np.stack(series, axis = 0)
    # Fill in the NoData values
    series[series == nodata] = np.nan
    return series

In [None]:
et = stack_time_series(file_list, 'Evap')
et.shape
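The text between the triple quotes at the top of `stack_time_series()` is a *docstring.* Python stores it on the function itself, so we can read it without opening the source code. A small sketch, using a hypothetical `double()` function:

```python
def double(values):
    '''
    Doubles each number in a list.

    Parameters
    ----------
    values : list
        The numbers to double

    Returns
    -------
    list
    '''
    return [2 * v for v in values]

# The docstring is stored on the function itself...
print(double.__doc__)
# ...and help() displays it along with the function's signature
help(double)
```

In a Jupyter Notebook, typing `double?` in a cell shows the same information.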

In [None]:
et.mean(axis = 0).shape

In [None]:
# TODO Discuss broadcasting

anomaly = et - et.mean(axis = 0)
anomaly.shape
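The subtraction above works even though the two arrays have different shapes, thanks to NumPy's *broadcasting.* Here is a sketch with made-up data:

```python
import numpy as np

# Made-up stack: 4 time steps of a 2 x 3 grid
stack = np.arange(24, dtype = float).reshape((4, 2, 3))
mean = stack.mean(axis = 0)

# Shapes are compared from the right: (4, 2, 3) vs. (2, 3).
# The trailing dimensions match, so the (2, 3) mean is subtracted
# from each of the 4 time steps
anomaly = stack - mean
anomaly.shape

# Equivalent to subtracting the mean from each time step in a loop
looped = np.stack([stack[t] - mean for t in range(4)], axis = 0)
np.array_equal(anomaly, looped)
```

Broadcasting gives the same answer as the loop, with less code and without a Python-level loop over time steps.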

In [None]:
# TODO Write docstring together with learners

def anomalies(time_series):
    '''
    Computes the anomaly (current value minus mean value) in a time series.

    Parameters
    ----------
    time_series : numpy.ndarray

    Returns
    -------
    numpy.ndarray
    '''
    clim = time_series.mean(axis = 0)
    return time_series - clim

### Putting it All Together

In [None]:
file_list = glob.glob('data_raw/*.nc')
# Sort the files so they are in date order; the last one should be 2017
file_list.sort()

et = stack_time_series(file_list, 'Evap')
et_anomaly = anomalies(et)

rad_anomaly = anomalies(stack_time_series(file_list, 'SWdown'))
sm_anomaly = anomalies(stack_time_series(file_list, 'SMAvail_0_100cm'))

In [None]:
images = [
    et_anomaly[-1],
    rad_anomaly[-1],
    sm_anomaly[-1]
]
labels = ['ET', 'Radiation', 'Soil Moisture']

fig = pyplot.figure(figsize = (12, 5))
ax = fig.subplots(1, 3)
for i in range(3):
    ax[i].imshow(images[i], cmap = 'RdYlBu')
    ax[i].set_title(labels[i] + ' Anomaly')

---

## Bringing in NASA Earth Observations

The NLDAS data we've used are a great tool for retrospective studies but, as a re-analysis dataset, NLDAS has some limitations:

- It has a relatively high latency; it may be days or weeks before data are available.
- It integrates data from multiple sources but with varying levels of accuracy and geographic coverage.

If we want to characterize flash drought or detect it in near-real time, we shouldn't use re-analysis datasets. Instead, we want some kind of direct observation of drought conditions. **Let's see what we can learn about the 2017 Flash Drought from NASA's satellite-based soil-moisture estimates.**

**We'll use data from NASA's Soil Moisture Active Passive (SMAP) Mission.** [NASA's Earth-observing missions provide data that are grouped into different processing levels:](https://www.earthdata.nasa.gov/engage/open-data-services-and-software/data-information-policy/data-levels)

- **Level 1 (Raw data):** Basically, these are data values measured directly by a satellite instrument. They may or may not be physically interpretable. Most end-users won't benefit from Level 1 data.
- **Level 2:** These are physically interpretable values that have been derived from the raw data, at the same spatial and temporal resolution as the Level 1 data. Level 2 data may be hard to use because the spatial structure of the data matches the instrument's viewing geometry.
- **Level 3:** At Level 3, the geophysical values have been standardized on a uniform spatial grid and uniform time series. While some values may be missing due to low quality, clouds, or sensor failure, gridded Level 3 data from different time steps can be easily combined and compared.
- **Level 4 (Model-enhanced data):** At Level 4, the values from Level 3 data are incorporated into some kind of model, possibly combining additional, independent datasets from other sensors in order to produce enhanced estimates or analyses of geophysical variables.

### Downloading the Data

**[We'll use the 36-km Level 3 surface soil moisture data from the SMAP mission](https://nsidc.org/data/spl3smp/versions/8)** because these are a good compromise between direct sensor observations and ease of use.

- At the website above, we can see there are multiple ways of accessing the data. [Let's use Earthdata Search;](https://search.earthdata.nasa.gov/search?q=SPL3SMP+V008) can we access the data from NASA's cloud using `earthaccess`?
- You may have noticed that the Level 3 SMAP data we want to use are *not* "Available in Earthdata Cloud." It looks like we'll have to download the data directly.
- **Where will we put the raw data we download?** Let's revisit our file tree in Jupyter Notebook.
- **Within the `data_raw` folder, let's create a new folder called `SMAP_L3`.** This is where we'll put the data we're about to download.

We've discussed the importance of having a well-documented workflow that makes it easy to understand how we obtained a particular scientific result. We assume that we can re-download the raw data we used anytime, but what if we forget where the data came from? Since the SMAP Level 3 data aren't available in the Cloud, we're about to download the data manually. It would be a good idea to document the steps we took, in case there are questions about where the data came from or what kind of processing was applied.

- In the Jupyter Notebook file tree, within the `SMAP_L3` folder, let's create a new text file using the "New File" option. Name the new text file `README.txt`.
- Double-click `README.txt` to open it. This is where we'll add some useful information about the data we're about to download. Below is an example.

```
Author: K. Arthur Endsley
Date: November 1, 2023

This folder contains Level 3 data from the SMAP Mission. It was downloaded from:

    https://search.earthdata.nasa.gov/search?q=SPL3SMP+V008

Here's some more information about this product:

    https://nsidc.org/data/spl3smp/versions/8
```

This might not seem like a lot of information but there's plenty here that we would want to know if we took a long break from this project or if someone else had to try and figure out what we were doing. And its short length is also an advantage: **documenting your project doesn't have to be hard and any amount of information is better than none.**

### Customizing an Earthdata Search Download

The SMAP satellite has two overpasses every day, a "morning" and an "afternoon" overpass (local time). Let's use soil moisture data from the afternoon (PM) overpass, because this is likely when soil moisture stress on vegetation is at its peak.

- We'll download data from August and September to study the onset and progression of the 2017 Flash Drought: **Choose a temporal subset, 2017-08-01 through 2017-09-30.**
- At the bottom right, **click the big green button that reads "Download All."**
- 1.9 GB is a lot of data! Can we make this download any smaller? We're only interested in soil moisture from the afternoon overpass. **Click "Edit Options" and under "Select a data access method," select the "Customize" option.**

![](assets/M1_Earthdata_Search_SMAP-L3_customize_order.png)

- **Scroll down to "Configure data customization options" and down to "Band subsetting."**
- **Within the text box that reads "Filter" type `soil_moisture_dca_pm`.** This will filter the available variables ("bands") to just this specific variable, which is the soil moisture estimate from the Dual-Channel Algorithm (DCA) for the afternoon (PM) overpass.
- To make sure that `soil_moisture_dca_pm` is the *only* variable we download, **you'll need to uncheck the box next to `SPL3SMP` then re-check the box next to `soil_moisture_dca_pm` (see screenshot below).**

![](assets/M1_Earthdata_Search_SMAP-L3_customize_order_variables.png)
  
- Hit "Done" at the bottom of this form then the big green button that reads "Download Data"!

#### But Wait!

Because we selected a subset of variables, we'll have to wait to get an e-mail that the order is ready. **You don't need to do these steps yourself, because I already prepared all the data granules that would be downloaded this way.** They can be downloaded directly from here:

- [SMAP_L3_SPL3SMP_V008_20170801_20170930.zip](http://files.ntsg.umt.edu/data/ScienceCore/SMAP_L3_SPL3SMP_V008_20170801_20170930.zip) (Extract this ZIP file's contents to your `data_raw/SMAP_L3` folder)
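If you'd rather extract the ZIP file from code, the sketch below uses Python's built-in `zipfile` module. The `extract_zip()` helper is hypothetical, and the sketch assumes the ZIP file was saved in the notebook's current working directory; adjust `zip_path` if you saved it elsewhere.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    'Extracts every file in a ZIP archive to dest_dir; returns the file names'
    dest = Path(dest_dir)
    # Create the destination folder (and any parent folders) if needed
    dest.mkdir(parents = True, exist_ok = True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

# Assumption: the ZIP file was saved in the current working directory
zip_path = Path('SMAP_L3_SPL3SMP_V008_20170801_20170930.zip')
if zip_path.exists():
    extract_zip(zip_path, 'data_raw/SMAP_L3')
```

If the archive contains its own top-level folder, the files will land inside that folder, so check the file tree afterwards.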

--- 

## Reading SMAP Level 3 Data

The SMAP Level 3 data we downloaded are each stored as a **Hierarchical Data Format, version 5 (HDF5)** file.

In [None]:
import h5py

hdf = h5py.File('data_raw/SMAP_L3/SMAP_L3_SM_P_20170801_R18290_001_HEGOUT.h5', 'r')
hdf

An HDF5 file is a lot like a netCDF4 file: they are both hierarchical files capable of storing multiple, diverse datasets and metadata in a single file. What do we mean by "hierarchical"? Well, an HDF5 or netCDF4 file is like a file tree, where *datasets* can be organized into different nested *groups,* as depicted below. Metadata, in the form of *attributes,* can be attached to any dataset or group throughout the file.

![](assets/hdf5-structure.jpg)

*Image courtesy of NEON Science.*
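To make the idea of groups, datasets, and attributes concrete, here is a sketch that builds a tiny HDF5 file in memory with `h5py`. Every group, dataset, and attribute name in it is made up for illustration; it is not the structure of the SMAP file.

```python
import h5py
import numpy as np

# Create a throwaway HDF5 file in memory (nothing is written to disk).
# All group, dataset, and attribute names here are made up
demo = h5py.File('demo.h5', 'w', driver = 'core', backing_store = False)

# Groups nest like folders; intermediate groups are created as needed
step = demo.create_group('Metadata/ProcessStep')
# Attributes attach metadata to a group (or a dataset)
step.attrs['softwareTitle'] = 'Demo Software'

# A dataset holds an array and can carry its own attributes
sm = demo.create_dataset('Retrievals/soil_moisture', data = np.zeros((2, 3)))
sm.attrs['units'] = 'cm3 cm-3'

# Walk the whole tree, collecting the path to every group and dataset
paths = []
demo.visit(paths.append)
sorted(paths)
```

Notice that paths look just like file paths: `'Metadata/ProcessStep'` is a group nested inside the `'Metadata'` group.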

In [None]:
hdf.keys()

In [None]:
hdf['Metadata']

In [None]:
hdf['Metadata'].keys()

In [None]:
# TODO Significance of an empty group?
hdf['Metadata/ProcessStep']

In [None]:
hdf['Metadata/ProcessStep'].attrs.keys()

In [None]:
hdf['Metadata/ProcessStep'].attrs['softwareTitle']

### Reading HDF5 Datasets

In [None]:
filename = 'data_raw/SMAP_L3/SMAP_L3_SM_P_20170801_R18290_001_HEGOUT.h5'

hdf = h5py.File(filename, 'r')
hdf.keys()

In [None]:
# TODO Explain that we'd like to open the file with xarray
# TODO Note there are no coordinates

ds = xr.open_dataset(filename, group = 'Soil_Moisture_Retrieval_Data_PM')
ds

In [None]:
# TODO Note coordinates assignment

ds = ds.assign_coords({'x': hdf['x'][:], 'y': hdf['y'][:]})
ds

In [None]:
# TODO Remark on striping

pyplot.figure(figsize = (12, 5))
ds['soil_moisture_dca_pm'].plot()

### Summary: Reading HDF5 and netCDF4 Files

|                              |  HDF5                              | netCDF4                                         |
|:-----------------------------|:-----------------------------------|:------------------------------------------------|
|Module name                   | `h5py`                             | `netCDF4`                                       |
|Files opened with...          | `hdf = h5py.File(...)`             | `nc = netCDF4.Dataset(...)`                     |
|Datasets/groups viewed with...| `hdf.keys()`                       | `nc.variables` or `nc.variables.keys()`         |
|                              | `hdf['group_name'].keys()`         | `nc.groups['group_name'].variables.keys()`      |
|Datasets accessed through...  | `hdf`                              | `nc.variables`                                  |
|Attributes listed through...  | `hdf['dataset'].attrs`             | `nc.variables['dataset'].ncattrs()`             |
|Attributes read by...         | `hdf['dataset'].attrs['attribute']`| `nc.variables['dataset'].getncattr('attribute')`|

---

## More Resources

- Curious about how to use `earthaccess.open()` along with `xarray` so that you don't have to keep any downloaded files around? Be aware that `xarray.open_dataset()` can be slow when you have a lot of files to open, as in this time-series example. [This article describes how you can speed up `xarray.open_dataset()`](https://climate-cms.org/posts/2018-09-14-dask-era-interim.html) when working with multiple cloud-hosted files.