# M1.9 - Using NASA Earth Observations

*Part of:* **M1: Open Climate Data**

**Contents:**

1. [Organizing our file system](#Organizing-our-file-system)
1. [Using NASA climate observations](#Using-NASA-climate-observations)
   - [Downloading SMAP Level 3 soil moisture data](#Downloading-SMAP-Level-3-soil-moisture-data)
   - [Customizing an Earthdata Search download](#Customizing-an-Earthdata-Search-download)
1. [Understanding hierarchical data files (HDF5)](#Understanding-hierarchical-data-files-(HDF5))
   - [Reading HDF5 datasets](#Reading-HDF5-datasets)
   - [Subsetting the SMAP L3 data](#Subsetting-the-SMAP-L3-data)
1. [Creating a soil moisture time series](#Creating-a-soil-moisture-time-series)
   - [Calculating a moving average](#Calculating-a-moving-average)

---

In [None]:
import earthaccess
import numpy as np
import xarray as xr
from matplotlib import pyplot

auth = earthaccess.login()

## Organizing our file system

Again, we'll need a place to store these raw data. When we started working with Noah NLDAS data, we created the following folders in our file system:

```
data_raw/
  NLDAS
  SMAP_L3
```

Make sure there is also a `SMAP_L3` folder to receive the data we're about to download!

---

## Using NASA climate observations

The NLDAS data we've used are a great tool for retrospective studies but, as a re-analysis dataset, NLDAS has some limitations:

- It has a relatively high latency; it may be days or weeks before data are available.
- It integrates data from multiple sources but with varying levels of accuracy and geographic coverage.

If we want to characterize flash drought or detect it in near-real time, we shouldn't use re-analysis datasets. Instead, we want some kind of direct observation of drought conditions. **Let's see what we can learn about the 2017 Flash Drought from NASA's satellite-based soil-moisture estimates.**

**We'll use data from NASA's Soil Moisture Active Passive (SMAP) Mission.** [NASA's earth observing missions provide data that are grouped into different processing levels:](https://www.earthdata.nasa.gov/engage/open-data-services-and-software/data-information-policy/data-levels)

- **Level 1 (Raw data):** Basically, these are data values measured directly by a satellite instrument. They may or may not be physically interpretable. Most end-users won't benefit from Level 1 data.
- **Level 2:** These are physically interpretable values that have been derived from the raw data, at the same spatial and temporal resolution as the Level 1 data. Level 2 data may be hard to use because the spatial structure of the data matches the instrument's viewing geometry.
- **Level 3:** At Level 3, the geophysical values have been standardized on a uniform spatial grid and uniform time series. While some values may be missing due to low quality, clouds, or sensor failure, gridded Level 3 data from different time steps can be easily combined and compared.
- **Level 4 (Model-enhanced data):** At Level 4, the values from Level 3 data are incorporated into some kind of model, possibly combining additional, independent datasets from other sensors in order to produce enhanced estimates or analyses of geophysical variables.

### Downloading SMAP Level 3 soil moisture data

**[We'll use the 36-km Level 3 surface soil moisture data from the SMAP mission](https://nsidc.org/data/spl3smp/versions/8)** because these are a good compromise between direct sensor observations and ease of use.

- At the website above, we can see there are multiple ways of accessing the data. [Let's use Earthdata Search;](https://search.earthdata.nasa.gov/search?q=SPL3SMP+V008) can we access the data from NASA's cloud using `earthaccess`?
- You may have noticed that the Level 3 SMAP data we want to use are *not* "Available in Earthdata Cloud." It looks like we'll have to download the data directly.
- **Where will we put the raw data we download?** Let's revisit our file tree in Jupyter Notebook.
- **Within the `data_raw` folder, let's create a new folder called `SMAP_L3`.** This is where we'll put the data we're about to download.

We've discussed the importance of having a well-documented workflow that makes it easy to understand how we obtained a particular scientific result. We assume that we can re-download the raw data we used anytime, but what if we forget where the data came from? Since the SMAP Level 3 data aren't available in the Cloud, we're about to download the data manually, and it would be a good idea to document the steps we took, in case there are questions about where the data came from or what kind of processing was applied.

- In the Jupyter Notebook file tree, within the `SMAP_L3` folder, let's create a new text file using the "New File" option. Name the new text file `README.txt`.
- Double-click `README.txt` to open it. This is where we'll add some useful information about the data we're about to download. Below is an example.

```
Author: K. Arthur Endsley
Date: November 1, 2023

This folder contains Level 3 data from the SMAP Mission. It was downloaded from:

    https://search.earthdata.nasa.gov/search?q=SPL3SMP+V008

Here's some more information about this product:

    https://nsidc.org/data/spl3smp/versions/8
```

This might not seem like a lot of information but there's plenty here that we would want to know if we took a long break from this project or if someone else had to try and figure out what we were doing. And its short length is also an advantage: **documenting your project doesn't have to be hard and any amount of information is better than none.**
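If you'd rather create the folder and its README from Python than from the file browser, here is a minimal sketch that uses only the standard library. The helper name `write_readme` and its arguments are hypothetical (they are not part of any library); the text it writes follows the example above, with today's date filled in automatically.

```python
from datetime import date
from pathlib import Path

def write_readme(folder, author, source_url, info_url):
    '''
    Creates `folder` (if needed) and writes a short README.txt inside it.
    This is a hypothetical helper, not part of any library.
    '''
    folder = Path(folder)
    # parents = True also creates "data_raw" if it doesn't exist yet
    folder.mkdir(parents = True, exist_ok = True)
    text = '\n'.join([
        f'Author: {author}',
        f'Date: {date.today().isoformat()}',
        '',
        'This folder contains Level 3 data from the SMAP Mission. It was downloaded from:',
        '',
        f'    {source_url}',
        '',
        "Here's some more information about this product:",
        '',
        f'    {info_url}',
        '',
    ])
    readme = folder / 'README.txt'
    readme.write_text(text)
    return readme
```

For this lesson, we might call it as `write_readme('data_raw/SMAP_L3', 'Your Name', 'https://search.earthdata.nasa.gov/search?q=SPL3SMP+V008', 'https://nsidc.org/data/spl3smp/versions/8')`.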

### Customizing an Earthdata Search download

The SMAP satellite has two overpasses every day, a "morning" and an "afternoon" overpass (local time). Let's use soil moisture data from the afternoon (PM) overpass, because this is likely when soil moisture stress on vegetation is at its peak.

- [**This link will get you to the right place to start.**](https://search.earthdata.nasa.gov/search?q=SPL3SMP%20V008) Click on the one dataset that is shown on the right-hand side of the search window.
- We'll download data from June through September to study the onset and progression of the 2017 Flash Drought: **Choose a temporal subset, 2017-06-01 through 2017-09-30.**
- At the bottom right, **click the big green button that reads "Download All."**
- 3.8 GB is a lot of data! Can we make this download any smaller? We're only interested in soil moisture from the afternoon overpass. **Click "Edit Options" and under "Select a data access method," select the "Customize" option.**

![](assets/M1_Earthdata_Search_SMAP-L3_customize_order.png)

- **Scroll down to "Configure data customization options" and down to "Band subsetting."**
- **Within the text box that reads "Filter" type `soil_moisture_dca_pm`.** This will filter the available variables ("bands") to just this specific variable, which is the soil moisture estimate from the Dual-Channel Algorithm (DCA) for the afternoon (PM) overpass.
- To make sure we're only downloading the fields we want, **you'll need to uncheck the box next to `SPL3SMP` then re-check the box next to `soil_moisture_dca_pm` (see screenshot below) and each variable we want to keep.**

![](assets/M1_Earthdata_Search_SMAP-L3_customize_order_variables.png)

**We want to download the following fields (only):**

- `soil_moisture_dca_pm`
- `static_water_body_fraction_pm`
- `retrieval_qual_flag_dca_pm`
  
Hit "Done" at the bottom of this form then the big green button that reads "Download Data"!

#### But Wait!

Because we selected a subset of variables, we'll have to wait to get an e-mail that the order is ready. **You don't need to do these steps yourself, because I already prepared all the data granules that would be downloaded this way.** They can be downloaded directly from here:

- [SMAP_L3_SPL3SMP_V008_20170601_20170930.zip](http://files.ntsg.umt.edu/data/ScienceCore/SMAP_L3_SPL3SMP_V008_20170601_20170930.zip) (Extract this ZIP file's contents to your `data_raw/SMAP_L3` folder)
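If you prefer to extract the ZIP file from Python, the standard library's `zipfile` module can do it. This is a minimal sketch; the helper name `extract_zip` is hypothetical, and it assumes you have already saved the ZIP file somewhere on your computer.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest_dir):
    '''
    Extracts every file in the ZIP archive at `zip_path` into `dest_dir`
    and returns the sorted list of archive member names.
    '''
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents = True, exist_ok = True)
    with zipfile.ZipFile(zip_path, 'r') as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

For example, `extract_zip('SMAP_L3_SPL3SMP_V008_20170601_20170930.zip', 'data_raw/SMAP_L3')` would place the archive's contents in our raw data folder. It's worth checking the returned names: if the archive stores its files inside a sub-folder, the `*.h5` files will end up one level deeper than expected.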

--- 

## Understanding hierarchical data files (HDF5)

The SMAP Level 3 data we downloaded are each stored as a **Hierarchical Data File, version 5 (HDF5).**

In [None]:
import h5py

filename = 'data_raw/SMAP_L3/SMAP_L3_SM_P_20170901_R18290_001.h5'
hdf = h5py.File(filename, 'r')
hdf

An HDF5 file is a lot like a netCDF4 file: they are both hierarchical files capable of storing multiple, diverse datasets and metadata in a single file. What do we mean by "hierarchical"? Well, an HDF5 or netCDF4 file is like a file tree, where *datasets* can be organized into different nested *groups,* as depicted below. Metadata, in the form of *attributes,* can be attached to any dataset or group throughout the file.

![](assets/hdf5-structure.jpg)

*Image courtesy of NEON Science.*

We can look at the groups and datasets that are at the highest level of this hierarchy by typing:

In [None]:
hdf.keys()

The `h5py.File` object, `hdf`, is accessed like a Python dictionary. If we want to look at the `'Metadata'` group, for example, we type:

In [None]:
hdf['Metadata']

This isn't very informative, but every group in an `h5py.File` object also behaves like a Python dictionary:

In [None]:
hdf['Metadata'].keys()

The `'Metadata'` group is an example of how we might store information in an HDF5 file other than multi-dimensional arrays.

What is the significance of this empty group?

In [None]:
hdf['Metadata/ProcessStep']

Just like netCDF4 files, every group and dataset in an HDF5 file can be labeled with attributes.

In [None]:
hdf['Metadata/ProcessStep'].attrs.keys()

In [None]:
hdf['Metadata/ProcessStep'].attrs['processor']
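To make the ideas of groups, datasets, and attributes more concrete, here is a small sketch that builds a tiny HDF5 file from scratch and then walks through its hierarchy. Every name in it (`example.h5`, `Soil_Data`, `moisture`, and the attribute values) is made up for illustration; none of them come from the SMAP product.

```python
import os
import tempfile
import h5py
import numpy as np

# Build a tiny HDF5 file in a temporary folder; all names are made up
path = os.path.join(tempfile.mkdtemp(), 'example.h5')
with h5py.File(path, 'w') as f:
    group = f.create_group('Soil_Data')
    dset = group.create_dataset(
        'moisture', data = np.array([[0.1, 0.2], [0.3, 0.4]]))
    # Attributes can be attached to datasets...
    dset.attrs['units'] = 'm3 m-3'
    # ...and to groups, even groups that contain no datasets
    f.create_group('Metadata').attrs['processor'] = 'example'

# Walk the hierarchy, recording the path and type of each object
found = []
with h5py.File(path, 'r') as f:
    f.visititems(lambda name, obj: found.append((name, type(obj).__name__)))
    units = f['Soil_Data/moisture'].attrs['units']
    # Slicing a dataset reads only the requested part into a NumPy array
    first_row = f['Soil_Data/moisture'][0, :]

for name, kind in found:
    print(f'{kind:8s} {name}')
```

Note that the `Metadata` group here holds no datasets at all; its only content is an attribute, much like the `ProcessStep` group above.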

### Reading HDF5 datasets

We open an HDF5 file for reading with the `'r'` flag, below.

In [None]:
hdf = h5py.File(filename, 'r')
hdf.keys()

Again, we can access datasets hierarchically...

In [None]:
hdf['Soil_Moisture_Retrieval_Data_PM'].keys()

And if we want to read a dataset, we use NumPy's `[:]` notation to indicate we want to access an array.

In [None]:
hdf['Soil_Moisture_Retrieval_Data_PM/soil_moisture_dca_pm'][:]

Whenever we're finished working with an open HDF5 file, we should make sure to close it.

In [None]:
hdf.close()

**Let's see what is different about opening the same file using `xarray`.** In particular, look at the **Data variables.**

In [None]:
ds = xr.open_dataset(filename)
ds

The single variable that was found, `"crs"`, is not going to be very useful to us.

**`xarray` has limitations when opening HDF5 files; it isn't able to determine what groups are available.** Instead, we have to specify the group we want to open.

In [None]:
ds = xr.open_dataset(filename, group = 'Soil_Moisture_Retrieval_Data_PM')
ds

Now we have a useful variable, `"soil_moisture_dca_pm"`, but our `xarray.Dataset` has no coordinates!

One way to fix this would be to assign coordinates to our `xarray.Dataset`. This is why we need the `h5py` library, which is specialized for handling HDF5 files. We can read the `"x"` and `"y"` coordinates from our `h5py.File` and write them to the `xarray.Dataset`, as below.

In [None]:
# Instructor: Note coordinates assignment

hdf = h5py.File(filename, 'r')
ds = ds.assign_coords({'x': hdf['x'][:], 'y': hdf['y'][:]})
hdf.close()
ds

Now that we have both a **data variable** and **coordinates,** we're ready to plot the data!

In [None]:
pyplot.figure(figsize = (12, 5))
ds['soil_moisture_dca_pm'].plot()

There are two things to note about this image:

- **It's right-side up!** Unlike netCDF4 files, HDF5 files don't enforce a convention regarding the direction of spatial coordinates.
- **Notice the striping in this image.** The SMAP satellite has a revisit time of between 2 and 3 days. This means that, on a single day, the satellite's radiometer only images part of the globe. We could combine the morning, `"soil_moisture_dca_am"`, and afternoon, `"soil_moisture_dca_pm"`, overpasses for a single day, but soil moisture in many regions of the world varies quite a lot between morning and afternoon, so this might not be reasonable.

We chose the `"soil_moisture_dca_pm"` (afternoon) overpass because the afternoon is typically when soil moisture stress is highest in terrestrial ecosystems.

### Challenge: Write a function to process SMAP L3 data

Based on what we just did above, write a single function called `process_smap_l3` that:

- Accepts a file path to a SMAP L3 `*.h5` file, as a Python string
- Returns an `xr.Dataset`

When you've finished, compare it to the one written below.

In [None]:
def process_smap_l3(file_path):
    '''
    Parameters
    ----------
    file_path : str
        The file path to the SMAP L3 file

    Returns
    -------
    xarray.Dataset
    '''
    with h5py.File(file_path, 'r') as hdf:
        ds = xr.open_dataset(file_path, group = 'Soil_Moisture_Retrieval_Data_PM')
        return ds.assign_coords({'x': hdf['x'][:], 'y': hdf['y'][:]})

---

### Subsetting the SMAP L3 data

The SMAP soil moisture data are global but we're currently interested in a small study region, the Northern Plains of the U.S. How can we subset the SMAP data to our study region?

You may have noticed that the coordinates we added to our `xarray.Dataset`, above, were not latitude-longitude coordinates. The SMAP data are projected onto an EASE-Grid 2.0, where "EASE" stands for Equal-Area Scalable Earth. This unique, global projection has many advantages but the X and Y coordinates can be hard to understand when we're used to working with latitude-longitude coordinates.

**We'll use a tool from the `pyl4c` library to translate latitude-longitude (WGS84 datum) coordinates into the row-column coordinates of pixels.**

In [None]:
from pyl4c.ease2 import ease2_from_wgs84

help(ease2_from_wgs84)

Let's say that the upper-left corner of our study area is 49 degrees N latitude, 109 degrees W longitude.

In [None]:
# We want the upper-left corner coordinates
upper_left = ease2_from_wgs84((-109, 49), grid = 'M36')
upper_left

And the lower-right corner of our study area is 43 degrees N latitude, 95 degrees W longitude.

In [None]:
lower_right = ease2_from_wgs84((-95, 43), grid = 'M36')
lower_right
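To get a feel for what a conversion like this involves, here is a rough sketch of the arithmetic for the global 36-km grid, written with only the standard library. It is *not* the `pyl4c` implementation, and the function name `m36_row_col` is hypothetical. It assumes the WGS84 ellipsoid, a cylindrical equal-area projection with its latitude of true scale at 30 degrees, and the published extent and cell size of the M36 grid (964 columns by 406 rows); please check these constants against the official grid definition before relying on them.

```python
import math

# Assumed constants: WGS84 ellipsoid and the published parameters of the
# global 36-km EASE-Grid 2.0 ("M36"); verify against the grid definition
A = 6378137.0               # Semi-major axis (m)
E = 0.0818191908426         # Eccentricity
STD_PARALLEL = 30.0         # Latitude of true scale (degrees)
CELL_SIZE = 36032.220840584 # Pixel width and height (m)
X_MIN = -17367530.45        # Left edge of the grid (m)
Y_MAX = 7314540.83          # Top edge of the grid (m)

def m36_row_col(lon, lat):
    '''
    Approximate (row, column) of the M36 pixel containing a WGS84
    longitude-latitude pair, using a cylindrical equal-area projection.
    '''
    phi = math.radians(lat)
    phi_s = math.radians(STD_PARALLEL)
    k0 = math.cos(phi_s) / math.sqrt(1 - (E * math.sin(phi_s))**2)
    sin_phi = math.sin(phi)
    # Equal-area ("authalic") function of latitude for an ellipsoid
    q = (1 - E**2) * (
        sin_phi / (1 - (E * sin_phi)**2) -
        math.log((1 - E * sin_phi) / (1 + E * sin_phi)) / (2 * E))
    # Projected coordinates, in meters
    x = A * k0 * math.radians(lon)
    y = A * q / (2 * k0)
    # Rows count down from the top edge; columns count from the left edge
    row = int(math.floor((Y_MAX - y) / CELL_SIZE))
    col = int(math.floor((x - X_MIN) / CELL_SIZE))
    return (row, col)

print(m36_row_col(-109, 49))
```

Because the projection is equal-area, rows are not evenly spaced in latitude: pixels cover a wider range of latitudes near the poles than near the equator, which is part of why the X and Y coordinates are hard to interpret by eye.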

Let's get an `xarray.Dataset` using the function we wrote and plot our study area.

In [None]:
ds = process_smap_l3('data_raw/SMAP_L3/SMAP_L3_SM_P_20170802_R18290_001.h5')

ds['soil_moisture_dca_pm'][49:59,187:227].plot()

There are two things to note from this image:

- **It's apparent that the western part of our study area was missed by the satellite on this particular data-day.**
- **There is an area with very high soil moisture (bright yellow pixel) in the left-center of the image.** This pixel is almost certainly a permanent water body, so we should mask it out before doing any analysis.

Despite the missing area and water body, it's still possible to get a mean soil moisture value for the region...

In [None]:
ds['soil_moisture_dca_pm'][49:59,187:227].mean().values

But this approach may be biased because of the water body and because, depending on the day, the value reflects the soil moisture conditions in different, smaller parts of our study region.

---

## Masking out permanent water bodies

Let's take a look at the `static_water_body_fraction_pm` dataset.

In [None]:
pyplot.figure(figsize = (12, 5))
ds['static_water_body_fraction_pm'].plot()

Although the water body fraction is "static" and doesn't change over time, the SMAP L3 dataset only shows that part of the mask where data were acquired, hence the striping.

And let's focus on our study area.

In [None]:
ds['static_water_body_fraction_pm'][49:59,187:227].plot()

How could we use this to mask out permanent water bodies in our soil moisture data? We'll need to decide what threshold for fractional water coverage will be used to define permanent water bodies. Let's adopt the convention that pixels with 20% or more of their area covered by water (i.e., fractional water area $\ge 0.2$) should be masked.

The above image reminds us of another issue we need to address: each daily image may include only part of our study area. **In each daily granule, we can use the granule's `static_water_body_fraction_pm` dataset to create a binary mask where `1` indicates a permanent water body and `0` indicates anything else (valid data).**

In [None]:
np.where(ds['static_water_body_fraction_pm'][:] >= 0.2, 1, 0)[49:59,187:227]
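It can help to try the masking logic on a toy example first, where we can check every pixel by hand. The sketch below uses two made-up 2-by-3 arrays in place of a real granule; the values are invented for illustration.

```python
import numpy as np

# Toy 2-by-3 arrays standing in for a daily granule; the values are made up
water_fraction = np.array([[0.00, 0.25, 0.05],
                           [0.90, 0.10, np.nan]])
soil_moisture = np.array([[0.12, 0.55, 0.18],
                          [0.60, 0.15, np.nan]])

threshold = 0.2
# 1 where the pixel is mostly water, 0 everywhere else
mask = np.where(water_fraction >= threshold, 1, 0)

# Copy first so the original array is left untouched
masked = soil_moisture.copy()
masked[mask == 1] = np.nan

print(mask)
print(masked)
```

Notice the lower-right pixel, where the water fraction itself is missing: any comparison with `NaN` is false, so that pixel gets a `0` in the mask. This doesn't matter much in practice, because the soil moisture value is missing there, too.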

Let's update our `process_smap_l3()` function to include masking of the data. **Note that we've added a keyword argument to our function to allow users to have control over the fractional water area threshold used. By providing a default argument to `threshold`, we are also implicitly documenting our decision to use `0.2` as the threshold.**

In [None]:
def process_smap_l3(file_path, threshold = 0.2):
    '''
    Parameters
    ----------
    file_path : str
        The file path to the SMAP L3 file
    threshold : float
        The fractional water area threshold to use when masking out
        permanent water bodies

    Returns
    -------
    xarray.Dataset
    '''
    with h5py.File(file_path, 'r') as hdf:
        ds = xr.open_dataset(file_path, group = 'Soil_Moisture_Retrieval_Data_PM')
        ds = ds.assign_coords({'x': hdf['x'][:], 'y': hdf['y'][:]})

    # Create a binary (0 or 1) array based on the water fraction threshold
    mask = np.where(ds['static_water_body_fraction_pm'][:] >= threshold, 1, 0)
    
    # Read the dataset out as a NumPy array so we can mask it
    data = ds['soil_moisture_dca_pm'].to_numpy()
    
    # Write NaN into this dataset wherever there are permanent water bodies
    data[mask == 1] = np.nan
    ds['soil_moisture_dca_pm'][:] = data
    return ds

In [None]:
ds = process_smap_l3('data_raw/SMAP_L3/SMAP_L3_SM_P_20170802_R18290_001.h5')
ds['soil_moisture_dca_pm'][49:59,187:227].plot()

---

## Creating a soil moisture time series

It'd be nice if we could use `xr.open_mfdataset()` to open all these SMAP HDF5 files as a single time-series dataset. If we tried that, however, we'd find that it doesn't work because `xarray` doesn't know what the coordinates of an HDF5 dataset are, so it can't combine the datasets together.

```python
ds = xr.open_mfdataset('data_raw/SMAP_L3/*.h5', group = 'Soil_Moisture_Retrieval_Data_PM')
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[166], line 1
----> 1 ds = xr.open_mfdataset('data_raw/SMAP_L3/*.h5', group = 'Soil_Moisture_Retrieval_Data_PM')

...

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
```

**This means we'll have to open each file ourselves and stack the arrays together.**

We can use the `glob` library to get a list of all the files we want. The notation below, `'data_raw/SMAP_L3/*.h5'`, is similar to what we used with `xr.open_mfdataset()`: we're saying we want to use all the HDF5 files in a particular directory.

In [None]:
import glob

file_list = glob.glob('data_raw/SMAP_L3/*.h5')
file_list[0]

**When we use `glob.glob()` for files that represent a time series, it's very important that we make sure the files are listed in chronological order!**

As long as the filenames include a sensible timestamp, such as a date in `YYYYMMDD` (Year-Month-Day) order, we can call the `sort()` method of the Python list to get the files in alphanumeric order, which is the same as chronological order in this case.

In [None]:
file_list.sort()
file_list[0]
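If we want to be sure the sorted list really is in chronological order, we can read the date out of each filename and check. The sketch below is one way to do that with the standard library. The helper `granule_date` is hypothetical, and it assumes each filename contains exactly one 8-digit `YYYYMMDD` block set off by underscores; the first two filenames appear earlier in this lesson and the third is assumed to follow the same pattern.

```python
import re
from datetime import datetime

# Example filenames; the last one is assumed to follow the same pattern
file_list = [
    'data_raw/SMAP_L3/SMAP_L3_SM_P_20170901_R18290_001.h5',
    'data_raw/SMAP_L3/SMAP_L3_SM_P_20170802_R18290_001.h5',
    'data_raw/SMAP_L3/SMAP_L3_SM_P_20170601_R18290_001.h5',
]

def granule_date(filename):
    'Returns the date embedded in a filename as an 8-digit YYYYMMDD block'
    match = re.search(r'_(\d{8})_', filename)
    return datetime.strptime(match.group(1), '%Y%m%d').date()

file_list.sort()
dates = [granule_date(f) for f in file_list]

# Chronological order means every date is later than the one before it
assert all(a < b for a, b in zip(dates, dates[1:]))
print(dates)
```

A list of dates built this way has another use: it reflects the files we actually have, so it would reveal a missing day that a simple count of files might hide.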

Let's write a `for` loop to process each SMAP L3 granule and extract the mean within this rectangular window (indicated by the upper-left and lower-right corners).

In [None]:
upper_left = ease2_from_wgs84((-109, 49), grid = 'M36')
lower_right = ease2_from_wgs84((-97, 43), grid = 'M36')
r0, c0 = upper_left
r1, c1 = lower_right

sm_mean = []
for filename in file_list:
    ds = process_smap_l3(filename, threshold = 0.2)
    sm_mean.append(ds['soil_moisture_dca_pm'][r0:r1,c0:c1].mean().values)

sm_mean = np.hstack(sm_mean)

When we plot the data, it would be nice to show dates along the horizontal axis. We can get a sequence of dates using the `pandas.date_range()` function.

In [None]:
import pandas

dates = pandas.date_range('2017-06-01', '2017-09-30', freq = '1D')
pyplot.figure(figsize = (10, 5))
pyplot.plot(dates, sm_mean, 'k-')
pyplot.ylabel('Volumetric Soil Moisture (m3 m-3)')
pyplot.show()

**Our time series looks strange.** There are several two-day gaps (where the line is broken) and a lot of high-frequency variation (spikes). These spikes seem to occur just before or after the gaps. 

**Are these spikes due to permanent water bodies we failed to mask out? One advantage of having a `threshold` parameter in our `process_smap_l3()` function is that we can test this theory pretty quickly: change the `threshold` to `0.1` and see for yourself!**

Unfortunately, that doesn't seem to be the case. We can intuit that the gaps correspond to days where the SMAP satellite did not pass over our study area. We also know there are days when our study area is only partially observed (as we saw in the plot above) and these likely correspond to the extreme values, as wetter or drier parts of the study area are missed.

It's clear that we should not be calculating a mean value when only part of the study area is observed, as this is causing bias (and the spikes, above).
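A toy example shows how large this bias can be. In the sketch below, a made-up study region is wetter in the west than in the east, and we simulate a day when the satellite missed the western half. The helper `regional_mean` and its `min_coverage` cutoff are hypothetical: they illustrate one possible safeguard, not a recommended value.

```python
import numpy as np

# A made-up 2-by-4 study region: wetter in the west, drier in the east
full = np.array([[0.30, 0.28, 0.12, 0.10],
                 [0.32, 0.26, 0.14, 0.08]])

# Simulate a day when the satellite missed the western half
partial = full.copy()
partial[:, :2] = np.nan

def regional_mean(array, min_coverage = 0.75):
    '''
    Mean of the valid pixels, or NaN if too few pixels were observed.
    `min_coverage` is a hypothetical cutoff, not a recommended value.
    '''
    # Fraction of pixels that have a valid (non-NaN) value
    coverage = np.isfinite(array).mean()
    if coverage < min_coverage:
        return np.nan
    return np.nanmean(array)

print(np.nanmean(full))        # Mean with the whole region observed
print(np.nanmean(partial))     # Mean of the eastern half only
print(regional_mean(partial))  # NaN: only half the region was observed
```

With the whole region observed, the mean is 0.20; with only the drier eastern half, it drops to 0.11, even though nothing about the soil changed. Keep in mind that pixels masked as permanent water bodies also count as missing here, so a real cutoff would need to account for them.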

### Calculating a moving average

One way to address the gaps might be to calculate a moving average, filling in missing values for a given date with the average of the values from adjacent dates.

Below, we use two nested `for` loops to create a composite soil moisture map for each date by combining the images from the current, previous, and next days (i.e., a 3-day moving average).

In [None]:
time_series = []

for i in range(len(file_list)):
    # Skip the first and last files
    if i == 0 or i == (len(file_list) - 1):
        continue

    # For the previous, current, and next dates...
    sm_stack = []
    for j in [i-1, i, i+1]:
        ds = process_smap_l3(file_list[j])
        sm = ds['soil_moisture_dca_pm'][r0:r1,c0:c1]
        sm_stack.append(sm)

    # Take the average of the 3 values in each pixel, excluding NaNs
    sm_stack = np.nanmean(np.stack(sm_stack, axis = 0), axis = 0)
    # Then, compute the overall mean for the region of interest, again
    #   excluding NaNs (e.g., pixels that are missing on all three days)
    time_series.append(np.nanmean(sm_stack))

We now have 120 days of data, two fewer than before, because the first and last dates were skipped.

In [None]:
len(time_series)

And each date is now a composite of three daily images. We can take a look at the last one that was processed, above, by plotting the `sm_stack` array.

In [None]:
pyplot.imshow(sm_stack)

We now obtain a soil moisture time series that looks a little more reasonable. 

It's apparent from our time series that 2017 was a fairly dry summer overall but that soil moisture in the region reached a minimum during the flash drought, between `2017-09-01` and `2017-09-15`. Soil moisture also increases by a large amount 

In [None]:
pyplot.figure(figsize = (10, 5))
pyplot.plot(dates[1:-1], time_series, 'k-')
pyplot.ylabel('Volumetric Soil Moisture (m3 m-3)')
pyplot.show()
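The compositing step is easier to verify on a toy example. The sketch below stacks three made-up 2-by-2 daily images and averages them pixel by pixel, ignoring missing values, just as we did for each date in the loop.

```python
import warnings
import numpy as np

# Three made-up daily images (2-by-2); NaN marks pixels with no data
day1 = np.array([[0.10, np.nan], [0.30, np.nan]])
day2 = np.array([[np.nan, 0.20], [0.34, np.nan]])
day3 = np.array([[0.14, 0.24], [np.nan, np.nan]])

stack = np.stack([day1, day2, day3], axis = 0)  # Shape: (3, 2, 2)

with warnings.catch_warnings():
    # np.nanmean() warns when a pixel is NaN on all three days
    warnings.simplefilter('ignore', category = RuntimeWarning)
    composite = np.nanmean(stack, axis = 0)

print(composite)
# The lower-right pixel was never observed, so it is still NaN in the
#   composite and the regional mean must also skip NaNs
print(np.nanmean(composite))
```

Each pixel in the composite is the average of however many valid observations it had: two, in this example, for three of the pixels and none for the fourth. A pixel that is missing on all three days (a masked water body, for instance) stays `NaN`, so calling a plain `.mean()` on the composite would return `NaN` for the whole region.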

---

## Summary

- NASA Earthdata Search provides many different datasets for studying earth's climate system. Re-analysis datasets and NASA Level 4 datasets integrate multiple raw data sources to provide a continuous record and complete spatial coverage, without data gaps. However, they may include model biases that are not reflective of real-world conditions. Remote sensing datasets and NASA Level 3 datasets offer more direct observations.

- **Document your data!** When you download raw data, keep it separate and unmodified. Be sure to make a `README` file, placed in the same directory or the parent directory, that contains information on where the data came from and how it should be used.

- **Document your process!** When working with data, reusable Python functions can help speed up your work. Well-written functions also serve as documentation of your workflow, as they describe the steps you took to process data.

- **HDF5 and netCDF4 files can help you stay organized.** Both file formats allow you to add *attributes* to datasets they contain, which is a good way to document measurement units, original data sources, and other key *metadata.*

### Reading HDF5 and netCDF4 files

|                            |  HDF5 files                        | netCDF4 files                                   | `xarray` (for both)               |
|:---------------------------|:-----------------------------------|:------------------------------------------------|:----------------------------------|
|Module import               | `import h5py`                      | `import netCDF4`                                | `import xarray as xr`             |
|Files opened with...        | `hdf = h5py.File(...)`             | `nc = netCDF4.Dataset(...)`                     | `ds = xr.open_dataset(...)`       |
|Datasets/groups viewed...   | `hdf.keys()`                       | `nc.variables` or `nc.variables.keys()`         | `list(ds.variables.keys())`       |
|                            | `hdf['group_name'].keys()`         | `nc.groups['group_name'].variables.keys()`      |                                   |
|Datasets accessed through...| `hdf`                              | `nc.variables`                                  | `ds.variables`                    |
|Attributes listed through...| `hdf.attrs`                        | `nc.ncattrs()`                                  | `ds.attrs`                        |
|                            | `hdf['dataset'].attrs`             | `nc.variables['dataset'].ncattrs()`             | `ds['dataset'].attrs`             |
|Attributes read by...       | `hdf['dataset'].attrs['attribute']`| `nc.variables['dataset'].getncattr('attribute')`| `ds['dataset'].attrs['attribute']`|

---

## More resources

- Curious about how to use `earthaccess.open()` along with `xarray` so that you don't have to keep any downloaded files around? Well, `xarray.open_dataset()` can be slow when you have a lot of files to open, as in this time-series example. [This article describes how you can speed up `xarray.open_dataset()`](https://climate-cms.org/posts/2018-09-14-dask-era-interim.html) when working with multiple cloud-hosted files.

### References

- Chen, L. G., J. Gottschalck, A. Hartman, D. Miskus, R. Tinker, and A. Artusa. 2019. Flash drought characteristics based on U.S. Drought Monitor. Atmosphere 10 (9):498.
- He, M., J. S. Kimball, Y. Yi, S. W. Running, K. Guan, K. Jensco, B. Maxwell, and M. Maneta. 2019. Impacts of the 2017 flash drought in the US Northern plains informed by satellite-based evapotranspiration and solar-induced fluorescence. Environmental Research Letters 14 (7):074019.