# M1.8 - Using Re-Analysis Data to Study Drought

*Part of:* **M1: Open Climate Data**

**Contents:**

1. [Organizing our file system](#Organizing-our-file-system)
2. [Downloading NLDAS data](#Downloading-NLDAS-data)
3. [Understanding the netCDF file format](#Understanding-the-netCDF-file-format)
   - [Getting real values from netCDF datasets](#Getting-real-values-from-netCDF-datasets)
   - [Plotting netCDF4 variables](#Plotting-netCDF4-variables)
4. [Opening multiple files with `xarray`](#Opening-multiple-files-with-xarray)
5. [Computing a climatology](#Computing-a-climatology)
   - [Describing climatic extremes using anomalies](#Describing-climatic-extremes-using-anomalies)
6. [Writing re-useable code and documentation](#Writing-re-useable-code-and-documentation)
   - [Repeating our climate data analysis](#Repeating-our-climate-data-analysis)
7. [Computing vapor pressure deficit](#Computing-vapor-pressure-deficit)
   - [Saving our results](#Saving-our-results)

---

Let's use everything we've learned so far to study a real drought event in an agricultural system.

Whereas meteorological drought is characterized by a deficit of precipitation, **flash drought** is characterized by a sudden, extreme demand for water from the land surface: "anomalously high evapotranspiration rates, caused by anomalously high temperatures, winds, and/or incoming radiation" (Chen et al. 2019).

In 2017, a flash drought emerged in the Northern Plains of the United States. The U.S. Drought Monitor estimated that, on September 5, 2017, about 23% of the region experienced "extreme drought" conditions and subsequent crop losses and water shortages (He et al. 2019, *Environmental Research Letters*).

**For this case study, we're going to consider the following climate variables and data sources:**

- Evapotranspiration, radiation, and soil moisture data from [the North American Land Data Assimilation System (NLDAS)](https://disc.gsfc.nasa.gov/datasets/NLDAS_NOAH0125_M_2.0/summary?keywords=NLDAS), a re-analysis dataset.
- Air temperature, pressure, and humidity, from [the NLDAS forcing data](https://disc.gsfc.nasa.gov/datasets/NLDAS_FORA0125_M_2.0/summary?keywords=NLDAS)
- Soil moisture from the Soil Moisture Active Passive (SMAP) mission

In [None]:
import earthaccess
import numpy as np
import xarray as xr
from matplotlib import pyplot

auth = earthaccess.login()

## Organizing our file system

We'll need a place to store these raw data. It's important that we have a folder in our file system reserved for these raw data so we can keep them separate from any new datasets (outputs) we might create. 

**Create a folder called `outputs` in your Jupyter Notebook's file system.** You can do this from the Jupyter Notebook home page (the file tree) by selecting "New" and then "New Folder" as in the screenshot below.

![](./assets/M1_screenshot_Jupyter_new_folder.png)

**Let's also create a folder called `data_raw` in our Jupyter Notebook's file system. Create two sub-folders within `data_raw`:**

- `NLDAS`
- `SMAP_L3`

We should never modify the raw data (that we're about to download). Doing so would make it hard to repeat the analysis we're going to perform as we will lose the original data values. This doesn't mean we have to keep the `data_raw` folder around forever: if it's publicly available data, we can always download it again.
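If you prefer to create these folders with code rather than through the file tree, the layout above can be sketched with Python's standard library. This is only an illustration: the helper name `make_project_folders()` is hypothetical, and the demonstration runs in a throwaway temporary directory so that it doesn't touch your real file system.

```python
from pathlib import Path
import tempfile

def make_project_folders(root):
    '''Creates the outputs and data_raw folders (hypothetical helper).'''
    folders = [
        Path(root) / 'outputs',
        Path(root) / 'data_raw' / 'NLDAS',
        Path(root) / 'data_raw' / 'SMAP_L3',
    ]
    for folder in folders:
        # parents=True also creates data_raw; exist_ok=True makes re-running safe
        folder.mkdir(parents = True, exist_ok = True)
    return folders

# Demonstrate in a temporary directory that is deleted afterwards
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_folders(tmp)
    names = sorted(p.relative_to(tmp).as_posix() for p in created)
    all_exist = all(p.is_dir() for p in created)

print(names)
```

In the notebook itself, calling `make_project_folders('.')` would create the folders next to the notebook file.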

---

## Downloading NLDAS data

We'll use the function `earthaccess.search_data()` again. In this case, the `short_name` and `version` can be found on [the Goddard Earth Sciences (GES) Data and Information Services Center (DISC) website for this product.](https://disc.gsfc.nasa.gov/datasets/NLDAS_NOAH0125_M_2.0/summary?keywords=NLDAS)

**The NLDAS data we're interested in are compiled monthly and we want to download an August monthly dataset for each year.** Because the dates we want are non-consecutive, we need to call `earthaccess.search_data()` within a `for` loop. Below, we also use string formatting so that the string `f'{year}-08'` becomes, e.g.: `'2008-08'`, `'2009-08'`, and so on.
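To see what the string formatting produces on its own, without searching for any data, we can build the same strings in a short list comprehension. The variable name `months` is just for illustration.

```python
# Build the same year-month strings that the search loop uses
months = [f'{year}-08' for year in range(2008, 2018)]

# range(2008, 2018) stops before 2018, so the last string is for 2017
print(months[0], months[-1], len(months))
```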

In [None]:
# Instructor: Show how to find the "short_name" and "version"

results = []

# Get data from August for every year from 2008 up to (but not including) 2018
for year in range(2008, 2018):
    search = earthaccess.search_data(
        short_name = 'NLDAS_NOAH0125_M',
        version = '2.0',
        temporal = (f'{year}-08', f'{year}-08'))
    results.extend(search)

In [None]:
len(results)

Previously, we've used `earthaccess.open()` to get access to these data. This time, we'll use `earthaccess.download()`. What's the difference?

- `earthaccess.open()` provides a file-like object that is available to be downloaded and read *only when we need it.*
- `earthaccess.download()` actually downloads the file to our file system.

**Note that, below, we're telling `earthaccess.download()` to put the downloaded files into the `NLDAS` sub-folder of our new `data_raw` folder.**

In [None]:
earthaccess.download(results, 'data_raw/NLDAS')

---

## Understanding the netCDF file format

Although we could open these netCDF files using `xarray`, we're going to first use a different Python library called `netCDF4`. This will help us learn more about the netCDF file format.

In [None]:
import netCDF4

nc = netCDF4.Dataset('data_raw/NLDAS/NLDAS_NOAH0125_M.A200808.020.nc')

As we previously discussed, one of the advantages of the netCDF file format is that it is **self-documenting;** there is file-level and dataset-level descriptive information, or metadata, called **attributes.**

When using the `netCDF4` library to open a netCDF file, we can access **file-level attributes** this way:

In [None]:
nc.ncattrs()

And we can access **dataset-level attributes** this way:

In [None]:
nc['Evap'].ncattrs()

### Getting real values from netCDF datasets

There are some significant differences between how the `xarray` and `netCDF4` libraries represent netCDF files. One important difference is how array data are read from the files.

The `netCDF4` package reads array data from the file the way it is stored on disk.

In [None]:
nc['Evap']

**Notice the `scale_factor`, `add_offset`, and `missing_value` attributes.** These are very important to consider because of the way netCDF files sometimes store variables. If the variables are packed in a certain way to save disk space, we need to transform the packed values into real values before using the data:

$$
\text{Real value} = (\text{Packed value}\times \text{Scale factor}) + \text{Offset}
$$

**When we look at the `"Evap"` (total evapotranspiration) dataset's attributes with `xarray`, however...**

In [None]:
ds = xr.open_dataset('data_raw/NLDAS/NLDAS_NOAH0125_M.A200808.020.nc')

ds['Evap'].attrs

**Note that the attributes are different!** The `scale_factor`, `add_offset`, and `missing_value` attributes are missing.

**This is because `xarray` transforms packed values into real values automatically for us.** Compare the two examples below.

In [None]:
np.array(nc['Evap'][:])

In [None]:
np.array(ds['Evap'][:])

**In this case, the `scale_factor` is `1.0` and the `add_offset` is `0.0`, meaning the packed values are the same as the real values.** Hence, there is no difference in the numbers for valid data areas, above; but we can see that the `xarray.Dataset` replaced the `missing_value` (-9999) with `np.nan`.
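To see what unpacking would look like when the scale factor and offset are *not* trivial, here is a minimal sketch using NumPy. The packed values and attribute values below are hypothetical; they are not taken from the NLDAS files.

```python
import numpy as np

# Hypothetical packed integers and attributes (not from the NLDAS files)
packed = np.array([120, 250, -9999, 310], dtype = np.int16)
scale_factor = 0.01
add_offset = 5.0
missing_value = -9999

# Replace the missing value with NaN; elsewhere apply
#   real value = (packed value * scale factor) + offset
real = np.where(
    packed == missing_value, np.nan, (packed * scale_factor) + add_offset)
print(real)
```

The third element becomes `np.nan`, which mirrors what we saw `xarray` do with the `missing_value` in the `"Evap"` dataset.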

### Plotting netCDF4 variables

Let's plot the data from the first monthly dataset (August 2008). **Recall that, when using `pyplot.imshow()`, we have to provide a 2D array...**

In [None]:
et = nc['Evap']

et.shape

The first axis of our `et` array is trivial, as it has only one element. We can simply subset the `et` array to this "first" (and only) element using `et[0]`:

In [None]:
pyplot.imshow(et[0])

Does this look right? Why is it upside down?

This happens because of [the CF Convention](http://cfconventions.org/) that defines how netCDF files should be formatted. Part of that standard requires that the coordinate arrays (here, latitude and longitude arrays) be sorted from smallest number to largest number. **Whereas spatial coordinate systems like latitude-longitude have numbers increasing from bottom-to-top and left-to-right, image coordinate systems (for arrays) differ in that numbers increase from top-to-bottom:**

![](assets/coordinate-system-diagram.png)

When working with a coordinate system that uses latitude, that means that the vertical coordinates go from the most southern (negative) latitude to the most northern (positive) latitude. **Essentially, our image is flipped upside-down because the most negative coordinates are at the top of the image.** We can easily flip an array right-side up using `np.flipud()` ("flip upside-down"):

In [None]:
pyplot.imshow(np.flipud(et[0]))

In [None]:
type(et)
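To make the effect of `np.flipud()` concrete, here is a small sketch with a hypothetical 3-by-2 grid whose rows are ordered from south to north, the same ordering we found in the netCDF file.

```python
import numpy as np

# Hypothetical grid; each row holds the values for one latitude
grid = np.array([
    [1, 1],   # row for the southernmost latitude
    [2, 2],
    [3, 3]])  # row for the northernmost latitude

flipped = np.flipud(grid)

# The northernmost row is now first, i.e., at the top of the image
print(flipped[0])
```

Reversing the first axis with `grid[::-1]` gives the same result; `np.flipud()` simply makes the intent easier to read.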

---

## Opening multiple files with `xarray`

Now, let's switch back to `xarray`.

**Again, we have several files representing different points in time.** Instead of writing a `for` loop, this time we'll use `open_mfdataset()` ("open multi-file dataset") to collect all the files into a single dataset. 

The string `'data_raw/NLDAS/*NOAH*.nc'` describes where the files we want to open are located, where `*` is a wildcard: a symbol matching any number of text characters that may be present in a filename. In this case, we want to open *all* the netCDF files (`*.nc`) in the `data_raw/NLDAS` directory that contain the word `"NOAH"`.

In [None]:
# Open all the files as a single xarray Dataset
ds = xr.open_mfdataset('data_raw/NLDAS/*NOAH*.nc')
ds
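If you want to check which filenames a wildcard pattern would match before opening anything, the standard library's `fnmatch` module applies the same shell-style wildcard rules. The directory listing below is hypothetical; only the first filename appears earlier in this lesson.

```python
from fnmatch import fnmatch

# Hypothetical directory listing
filenames = [
    'NLDAS_NOAH0125_M.A200808.020.nc',
    'NLDAS_NOAH0125_M.A200908.020.nc',
    'NLDAS_FORA0125_M.A200808.020.nc',
    'notes.txt',
]

# Keep only the names that contain "NOAH" and end with ".nc"
matches = [name for name in filenames if fnmatch(name, '*NOAH*.nc')]
print(matches)
```

Only the two `NOAH` files match, so other netCDF files stored in the same folder would be left out of the dataset.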

**These NLDAS files contain multiple different variables...**

In [None]:
list(ds.variables.keys())

**In this case study, we're primarily interested in the variables that quantify the state of the water cycle or evaporative stress:**

- `Evap`: This is total evapotranspiration
- `SWdown`: Down-welling short-wave radiation, i.e., the amount of solar radiation directed downwards
- `SMAvail_0_100cm`: The total liquid water in the top 100 cm of soil

Now we've seen one of the big advantages of the `xarray` library: `xarray` already knows how netCDF variables should be displayed. It is capable of figuring out, based on the coordinates, how the image should be oriented.

Below, because our dataset, `ds`, has more than one time step, we use the notation `ds['Evap'][0]` to subset the data to the first (zeroth) time step.

In [None]:
# Plot the first image in the time series
ds['Evap'][0].plot()

**However, if we extract a netCDF variable as a NumPy array, it will still be upside-down:**

In [None]:
# Convert the first time step to a plain NumPy array
array = ds['Evap'][0].to_numpy()

pyplot.imshow(array)

This is why we must be careful when working with netCDF data, regardless of whether we use the `xarray` or `netCDF4` libraries.

---

## Computing a climatology

**What distinguishes a flash drought or any drought from non-drought conditions?** We know that drought is characterized by reduced precipitation, reduced soil moisture, or both, but what is the magnitude of the reduction? To answer this question, we'd need to compare "drought conditions" to "average conditions." That is, compared to a *long-term average,* what is the magnitude of the change in a meteorological condition, like monthly precipitation?

The *long-term average* of a climate variable, calculated for some recurring interval (days, months, years), is called a **climatology.** In this case study, we're interested in how severe the August 2017 drought conditions were. We could quantify that by computing an August evapotranspiration climatology, which is the average August evapotranspiration (ET) over a long period of record.

In this case, we have 10 years of monthly August ET, as indicated by the first axis of our array:

In [None]:
ds['Evap'].shape

Computing a climatology, therefore, is as easy as calling `mean()` on our array and averaging over the `'time'` axis:

In [None]:
et_clim = ds['Evap'].mean('time')
et_clim.shape

In [None]:
# NOTE: We're flipping the et_clim array upside-down because of the CF convention
pyplot.imshow(np.flipud(et_clim))
cbar = pyplot.colorbar()
cbar.set_label('Evapotranspiration [kg m-2]')
pyplot.title('Mean August ET')
pyplot.show()

### Describing climatic extremes using anomalies

To figure out how much lower August ET was in 2017, we want to subtract the climatology from the August 2017 ET, effectively removing the average ET and showing only deviations from (above or below) the mean.

When we subtract the mean from a time series, the result is often called the **anomaly** (deviation from the mean).

In [None]:
et_2017_anomaly = ds['Evap'][-1] - et_clim

pyplot.imshow(np.flipud(et_2017_anomaly), cmap = 'RdYlBu')
cbar = pyplot.colorbar()
cbar.set_label('Evapotranspiration Anomaly [kg m-2]')
pyplot.title('August 2017 ET Anomaly')
pyplot.show()

We can compute the anomalies in *every year* by simply writing:

In [None]:
et_anomaly = ds['Evap'] - et_clim
et_anomaly.shape
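Why does subtracting a 2D climatology from a 3D time series work? The climatology is repeated (broadcast) across every time step. Here is a minimal sketch of the same idea using plain NumPy and a tiny synthetic array; `xarray` matches dimensions by name rather than by position, but the arithmetic is the same.

```python
import numpy as np

# Synthetic (T x M x N) data: 3 time steps of a 2 x 2 grid
data = np.arange(12, dtype = float).reshape(3, 2, 2)

# Climatology: the mean over the time axis (axis 0), a 2 x 2 array
clim = data.mean(axis = 0)

# Broadcasting subtracts the same 2 x 2 climatology from every time step
anomaly = data - clim

print(anomaly.shape)
print(anomaly[-1])
```

By construction, the anomalies at each pixel average to zero over time, which is a handy check that the climatology was computed over the right axis.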

#### Using `cartopy`

Once again, it can be helpful to use `cartopy` to see our data in context. In this example, we use `cartopy`'s built-in support for [data from Natural Earth](https://www.naturalearthdata.com/) to see the U.S. state boundaries on top of our image.

In [None]:
extent = [
    ds['lon'].to_numpy().min(),
    ds['lon'].to_numpy().max(),
    ds['lat'].to_numpy().min(),
    ds['lat'].to_numpy().max()
]
extent

In [None]:
import cartopy.crs as ccrs
import cartopy.io.shapereader as shpreader

shapename = 'admin_1_states_provinces_lakes'
states_shp = shpreader.natural_earth(resolution = '110m', category = 'cultural', name = shapename)

fig = pyplot.figure()
ax = fig.add_subplot(1, 1, 1, projection = ccrs.PlateCarree())
ax.imshow(np.flipud(et_2017_anomaly), extent = extent, cmap = 'RdYlBu')
ax.add_geometries(shpreader.Reader(states_shp).geometries(), ccrs.PlateCarree(), facecolor = 'none')
pyplot.show()

---

## Writing re-useable code and documentation

It looks like August 2017 was characterized by anomalously low evapotranspiration (ET) rates in the Northern Plains. What about other climate variables?

We did a lot of work to create the ET anomaly plot. Is there a way we can easily apply the same workflow to our other climate variables?

**This is what we create Python functions for: to automate a task in a consistent way.** Let's create a function for our current workflow.

In [None]:
def calculate_anomaly(data):
    '''
    Calculates the anomaly in a long-term time series dataset.

    Parameters
    ----------
    data : xarray.DataArray
        The time-series data, a (T x M x N) array where T is the 
        number of time steps

    Returns
    -------
    xarray.DataArray
        The anomaly values
    '''
    clim = data.mean('time')
    # Create a sequence of 2D maps, one for each year
    return data - clim

**The multi-line Python string (beginning with three quote characters, `'''`) marks the beginning of a Python *docstring* or documentation string.** The docstring must immediately follow the first line of the function definition.

The docstring is what users see when they call for `help()` on your function:

In [None]:
help(calculate_anomaly)

**A good docstring tells the user:**

- The purpose of the function; in our example, the function "Calculates the anomaly in a long-term time series dataset."
- What input *parameters* (arguments) the function accepts.
- What the *return value* of the function is.

It might also include one or more example use cases. The format of our docstring's "Parameters" and "Returns" sections is based on a convention ("numpydoc") and [you can read about that convention and alternatives at this reference.](https://pdoc3.github.io/pdoc/doc/pdoc/#supported-docstring-formats) Under the "Parameters" heading, we indicate the name of an input parameter, its type(s), and a brief explanation of what it means:

```
Parameters
----------
param_name : type
    Indented 4 spaces, we describe the input parameter on the next line
```

The "Returns" heading is formatted in a similar way, except the return value doesn't have a name:

```
Returns
-------
type
    Indented 4 spaces, we describe the return value
```
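Putting the two templates together, here is a small, self-contained sketch of a function documented in this style. The function `kelvin_to_celsius()` is a hypothetical example, not part of the case study; `inspect.getdoc()` from the standard library retrieves the docstring with its indentation cleaned up.

```python
import inspect

def kelvin_to_celsius(temp_k):
    '''
    Converts a temperature from degrees Kelvin to degrees Celsius.

    Parameters
    ----------
    temp_k : float
        Temperature in degrees Kelvin

    Returns
    -------
    float
        Temperature in degrees Celsius
    '''
    return temp_k - 273.15

# Retrieve the docstring and show its first line (the function's purpose)
doc = inspect.getdoc(kelvin_to_celsius)
print(doc.splitlines()[0])
```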

**Most importantly, does our Python function work the way we expect?**

In [None]:
et_anomaly = calculate_anomaly(ds['Evap'])
pyplot.imshow(np.flipud(et_anomaly[-1]), cmap = 'RdYlBu')

### Repeating our climate data analysis

**Once we've written re-useable Python functions like this, we can begin to scale-up our analysis in new ways!**

In [None]:
rad_anomaly = calculate_anomaly(ds['SWdown'])
sm_anomaly = calculate_anomaly(ds['SMAvail_0_100cm'])

The same function we used for calculating an ET anomaly can be used to calculate an anomaly for any variable we're interested in! What do the anomalies in solar radiation (`"SWdown"`) and soil moisture (`"SMAvail_0_100cm"`) look like?

In [None]:
images = [
    et_anomaly[-1],
    rad_anomaly[-1],
    sm_anomaly[-1]
]
labels = ['ET', 'Radiation', 'Soil Moisture']

fig = pyplot.figure(figsize = (12, 5))
ax = fig.subplots(1, 3)
for i in range(3):
    ax[i].imshow(np.flipud(images[i]), cmap = 'RdYlBu')
    ax[i].set_title(labels[i] + ' Anomaly')

---

## Computing vapor pressure deficit

One aspect of the climate system we haven't yet examined is **vapor pressure deficit (VPD),** which is a measure of how dry the air is. VPD tells us the amount of additional water (in terms of vapor pressure) that the air could hold at its current temperature. Under high VPD, the atmosphere can act as a drinking straw, drawing water away from the Earth's surface and from plants. Did anomalously high VPD play a role in the 2017 Northern Plains flash drought?

VPD isn't available as a climate variable in the NLDAS re-analysis dataset, but we can compute it from the variables that are available. To do that, we'll need to download a slightly different NLDAS data collection, consisting of the atmospheric forcing data that drive the model.

In [None]:
results = []
# Get data from August for every year from 2008 up to (but not including) 2018
for year in range(2008, 2018):
    search = earthaccess.search_data(
        short_name = 'NLDAS_FORA0125_M',
        version = '2.0',
        temporal = (f'{year}-08', f'{year}-08'))
    results.extend(search)

In [None]:
earthaccess.download(results, 'data_raw/NLDAS')

Once again, we want to open multiple files as a single `xarray.Dataset` using `open_mfdataset()`.

In [None]:
ds = xr.open_mfdataset('data_raw/NLDAS/NLDAS_FORA*.nc')
ds

The NLDAS data we downloaded have three variables we're interested in:

- `Tair`, the air temperature in degrees Kelvin
- `Qair`, the specific humidity
- `PSurf`, the near-surface air pressure in Pascals

### Challenge: Write a function to compute VPD

VPD is defined as the difference between the saturation vapor pressure (SVP) and the actual vapor pressure (AVP). That is, it is the difference between how much water the air *could* hold at its current temperature and the actual amount of water it currently holds.

$$
\text{VPD} = \text{SVP} - \text{AVP}
$$

The August-Roche-Magnus formula is a good approximation for SVP:

$$
\text{SVP} = 610.94\times \text{exp}\left(
\frac{17.625\times T}{T + 243.04}
\right)
$$

And an approximation for AVP is given by Gates (1980, *Biophysical Ecology*):

$$
\text{AVP} = \frac{Q\times P}{0.622 + (0.379\times Q)}
$$

Where:

- $T$ is the air temperature in degrees Celsius (note that the NLDAS `Tair` variable is in degrees Kelvin)
- $Q$ is the specific humidity
- $P$ is the air pressure in Pascals
- VPD, SVP, and AVP are also in Pascals
- $\text{exp}$ refers to the exponential function and is available in NumPy as `np.exp()`

**Write a Python function called `vpd()` to compute VPD.** When writing your function, remember:

- Write an informative docstring!
- Be sure to add inline comments to describe complex or potentially confusing code.
- Consider what your variable names should be and how you might use them to communicate measurement units.
- When you've finished, compare it to the one written below.

**Hint:** You'll need to convert air temperature from degrees Kelvin to degrees Celsius by subtracting 273.15.

In [None]:
def vpd(temp_k, pressure_pa, s_humidity):
    '''
    Computes vapor pressure deficit (VPD).
    
    Parameters
    ----------
    temp_k : xarray.DataArray
        Air temperature in degrees Kelvin
    pressure_pa : xarray.DataArray
        Air pressure in Pascals
    s_humidity : xarray.DataArray
        Specific humidity (dimensionless)

    Returns
    -------
    xarray.DataArray
        VPD in Pascals
    '''
    temp_c = temp_k - 273.15
    # Saturation vapor pressure (Pa)
    svp = 610.94 * np.exp((17.625 * temp_c) / (temp_c + 243.04))
    # Actual vapor pressure (Pa)
    avp = (s_humidity * pressure_pa) / (0.622 + (0.379 * s_humidity))
    return svp - avp
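As a quick sanity check of the formulas, we can evaluate them at a single point using only the standard library. The function `vpd_point()` below is a hypothetical scalar version of `vpd()`, and the input conditions are made-up values chosen for illustration, not NLDAS data.

```python
import math

def vpd_point(temp_k, pressure_pa, s_humidity):
    '''Scalar version of the VPD formulas above; returns VPD in Pascals.'''
    temp_c = temp_k - 273.15
    # Saturation vapor pressure (Pa)
    svp = 610.94 * math.exp((17.625 * temp_c) / (temp_c + 243.04))
    # Actual vapor pressure (Pa)
    avp = (s_humidity * pressure_pa) / (0.622 + (0.379 * s_humidity))
    return svp - avp

# Hypothetical conditions: 30 deg C, sea-level pressure, and
#   10 g of water vapor per kg of air
hot_day = vpd_point(303.15, 101325.0, 0.010)
# The same air, but 10 degrees cooler
cool_day = vpd_point(293.15, 101325.0, 0.010)

print(round(hot_day), round(cool_day))
```

With the same moisture content, the warmer air has a much larger VPD, which is why anomalously high temperatures can drive evaporative demand.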

---

Now that you've written a function to compute VPD, let's apply it to our NLDAS data.

In [None]:
vpd_series = vpd(ds['Tair'], ds['PSurf'], ds['Qair'])

vpd_anomaly = calculate_anomaly(vpd_series)

From the plot below, it appears that only part of the Northern Plains experienced above-average VPD in August 2017.

In [None]:
pyplot.imshow(np.flipud(vpd_anomaly[-1]), cmap = 'RdYlBu')
cbar = pyplot.colorbar()
cbar.set_label('VPD Anomaly (Pa)')
pyplot.title('August 2017 VPD Anomaly')
pyplot.show()

### Saving our results

We've done a lot of interesting work with the NLDAS data. What if we wanted to save our results for someone else to use? What might they need to know about what we've done, and how could we communicate that?

This is where a **self-documenting** file format like netCDF can help!

Starting with our `xarray.DataArray`, let's add some **metadata** in the form of **attributes.** What are some things that people should know about these data? The measurement units are important, of course.

In [None]:
vpd_anomaly.attrs['units'] = 'Pascals'
vpd_anomaly.attrs['name'] = 'Vapor pressure deficit anomaly, relative to 2008-2017 climatology'
vpd_anomaly

We're now ready to create our new `xarray.Dataset`!

The `xr.Dataset()` constructor function accepts several optional arguments; we'll provide two:

- The data variables, usually in the form of a Python dictionary with key-value pairs representing the variable name (key) and the `DataArray` (value).
- The coordinates of the `DataArray`.

In [None]:
new_ds = xr.Dataset({'vpd_anomaly': vpd_anomaly}, ds.coords)
new_ds

We can also add some **file-level attributes.** If we ever wanted to change anything about this dataset, we might want to know what Python script was used to create it in the first place.

In [None]:
new_ds.attrs['source_file'] = 'CaseStudy_2017_Northern_Plains_Flash_Drought.ipynb'
new_ds

**Finally, we're ready to write our file to disk! But what filename should we choose?** For derived outputs, we should pick something *meaningful* that tells us information about:

- What kind of data the file contains
- What spatial locations or time periods it pertains to
- What source data were used

We'll also make sure to put the file in our `outputs` folder so that we don't mistake it for raw data.

In [None]:
new_ds.to_netcdf('outputs/NLDAS_VPD_anomalies_2008-2017.nc')
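One way to keep filenames consistent across many outputs is to assemble them from their meaningful parts. This is only a sketch; the variable names below are hypothetical and simply rebuild the filename we used above.

```python
from pathlib import Path

# Hypothetical components describing the output file
source = 'NLDAS'              # What source data were used
content = 'VPD_anomalies'     # What kind of data the file contains
first_year, last_year = 2008, 2017   # What time period it pertains to

output_path = Path('outputs') / f'{source}_{content}_{first_year}-{last_year}.nc'
print(output_path.as_posix())
```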

---

## More resources

- Curious about how to use `earthaccess.open()` along with `xarray` so that you don't have to keep any downloaded files around? Well, `xarray.open_dataset()` can be slow when you have a lot of files to open, as in this time-series example. [This article describes how you can speed up `xarray.open_dataset()`](https://climate-cms.org/posts/2018-09-14-dask-era-interim.html) when working with multiple cloud-hosted files.

### References

- Chen, L. G., J. Gottschalck, A. Hartman, D. Miskus, R. Tinker, and A. Artusa. 2019. Flash drought characteristics based on U.S. Drought Monitor. Atmosphere 10 (9):498.
- He, M., J. S. Kimball, Y. Yi, S. W. Running, K. Guan, K. Jensco, B. Maxwell, and M. Maneta. 2019. Impacts of the 2017 flash drought in the US Northern plains informed by satellite-based evapotranspiration and solar-induced fluorescence. Environmental Research Letters 14 (7):074019.