# M3.2 - Analyzing a Global Precipitation Data Cube

*Part of:* [**Open Science for Water Resources**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources)

In [None]:
import datetime
import glob
import earthaccess
import numpy as np
import h5py
import xarray as xr
from matplotlib import pyplot

auth = earthaccess.login()

The terrestrial water cycle consists of **stocks** and **flows** of water. Stocks include any long-term storage of water: lakes, reservoirs, groundwater, and atmospheric water vapor. Flows include any movement of water from one stock to another.

Just like with bank accounts, we can study flows to get a good idea of the overall picture. If we see a lot of money leaving one account, we can infer that the account balance, and the available money, may be low. We might create a budget to quantify money flowing into the account and money flowing out, in order to calculate whether the balance is growing or shrinking.

**To study water availability in a basin or watershed, we can look at the water budget:**
$$
P = E + R + \Delta S
$$

Where:

- $P$ is the precipitation, assumed to be the only way that water enters a basin.
- $E$ is the **evapotranspiration,** or the sum of water that is evaporated or that is drawn up from the soil by plants (transpired).
- $R$ is the runoff, or water that leaves the basin through overland flow, usually a central river.
- $\Delta S$ is the change in storage; this generally means the change in available water, mostly groundwater.

The idea behind the water budget is that after accounting for water inputs ($P$) and outputs ($E$ and $R$), we should be able to quantify how much the basin's available water storage is growing or shrinking ($\Delta S$).
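
Rearranging the budget for the storage term gives $\Delta S = P - E - R$. As a minimal sketch of that arithmetic, the example below uses made-up annual totals (the numbers and the function name `storage_change` are hypothetical, not measurements for any real basin):

```python
def storage_change(precip_mm, et_mm, runoff_mm):
    """Return the change in storage (mm) implied by P = E + R + dS."""
    return precip_mm - et_mm - runoff_mm

# Hypothetical annual totals, expressed as mm of water depth over the basin
delta_s = storage_change(precip_mm=400.0, et_mm=320.0, runoff_mm=90.0)
print(delta_s)  # a negative value means storage shrank
```

A positive result would indicate that the basin gained stored water over the period.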

**Here's an illustration of these water flows within a watershed:**

![](./assets/water_budget.png)

[*Image courtesy of the USGS*](https://www.usgs.gov/media/images/components-a-simple-water-budget-part-a-watershed)

So, to compute a water balance, we need at least three things: precipitation, runoff, and evapotranspiration data.

---

## Calculating basin-scale precipitation

The basin we'll consider for this study is the Yellowstone River basin. The runoff of this basin is gauged in the Yellowstone River near Sidney, Montana (U.S.A.).

In [None]:
import geopandas

basin = geopandas.read_file('/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_drainage_WSG84.shp')
river = geopandas.read_file('/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_course_WSG84.shp')
states = geopandas.read_file('/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_states_WGS84.shp')
basin

Below is a plot of our study area, showing the drainage area (basin) of the Yellowstone River. Our gauging station is located where the river exits the basin, in eastern Montana.

In [None]:
ax = states.plot(edgecolor = 'black', color = 'lightgray')
basin.plot(ax = ax, edgecolor = 'darkblue', color = 'none')
river.plot(ax = ax, edgecolor = 'green', label = 'Yellowstone River')
pyplot.legend(loc = 'lower left')
pyplot.show()

### Downloading IMERG-Final precipitation data

To calculate the amount of precipitation entering the basin, we'll use [NASA's IMERG-Final dataset, which was described in detail in a previous lesson.](https://github.com/OpenClimateScience/M1-Open-Climate-Data/blob/main/notebooks/05_Earth_Observation_Data.ipynb) Specifically, we'll use a monthly version of IMERG-Final that estimates the precipitation rate on a global grid.

&#x1F449; [Read the documentation for IMERG-Final monthly data](https://dx.doi.org/10.5067/GPM/IMERG/3B-MONTH/07)

Let's download 10 years of monthly precipitation data, from 2014 through 2023.

&#x1F449; **Note that you must first create the directory you want to put the data in. Here, we assume you want to put it in a directory called `data/IMERG-Final_monthly`.**

In [None]:
results = earthaccess.search_data(
    short_name = 'GPM_3IMERGM',
    temporal = ('2014-01-01', '2023-12-31'))

In [None]:
earthaccess.download(results, 'data/IMERG-Final_monthly')

### Working with multiple HDF5 files

The IMERG-Final data are in HDF5 format, which can be difficult to work with when we are interested in the spatial coordinates. Our preferred library, `xarray`, doesn't know how to interpret the coordinate systems of HDF5 files, so let's confirm some basic details ourselves after opening one of the data granules with the `h5py` library.

In [None]:
with h5py.File('data/IMERG-Final_monthly/3B-MO.MS.MRG.3IMERG.20180701-S000000-E235959.07.V07B.HDF5', 'r') as hdf:
    longitude = hdf['Grid/lon'][:]
    latitude = hdf['Grid/lat'][:]
    print(longitude.shape)
    print(latitude.shape)
    print(hdf['Grid/precipitation'].shape)
    print(hdf['Grid/precipitation'].attrs['units'])

It seems our data are on a longitude-latitude grid, with 3600 columns and 1800 rows. This corresponds to a spatial resolution of 0.1 degrees. The units of precipitation are millimeters per hour (mm/hr).
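
The 0.1-degree figure follows from dividing the extent of the globe by the number of grid cells along each axis. A quick check, assuming the grid spans the full 360 degrees of longitude and 180 degrees of latitude:

```python
n_cols, n_rows = 3600, 1800   # grid shape reported by the HDF5 file

# Assumption: the grid covers the whole globe
res_lon = 360 / n_cols   # degrees of longitude per column
res_lat = 180 / n_rows   # degrees of latitude per row
print(res_lon, res_lat)
```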

We have 10 years of monthly data, so we should have 120 files.

In [None]:
file_list = glob.glob('data/IMERG-Final_monthly/*.HDF5')
file_list.sort()
len(file_list)

The date information for each file can be extracted from the filename.

In [None]:
file_list[0].split('.')[4][0:8]
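
Splitting the full path on `'.'` works here, but it would break if a directory name contained a dot (for example, `./data/...`). A slightly more defensive sketch parses only the file's base name; the helper name `granule_date` is hypothetical and not part of any library:

```python
import datetime
import os

def granule_date(path):
    """Parse the start date from an IMERG monthly granule filename."""
    # Use the base name so that dots elsewhere in the path don't shift the fields
    name = os.path.basename(path)
    # Fields are dot-separated; the fifth one looks like "20180701-S000000-E235959"
    stamp = name.split('.')[4][0:8]
    return datetime.datetime.strptime(stamp, '%Y%m%d')

example = 'data/IMERG-Final_monthly/3B-MO.MS.MRG.3IMERG.20180701-S000000-E235959.07.V07B.HDF5'
print(granule_date(example))
```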

Now let's see how we would open these data using `xarray`. 

#### &#x1F6A9; <span style="color:red">Reading HDF5 Files with `xarray`</span>

Again, because it's an HDF5 file, it is difficult for `xarray` to interpret the coordinate systems and structure of the file. So, when using `xr.open_dataset()`, we need to:

- Indicate that the `group` in which our datasets are stored is called `'Grid'`
- Tell `xarray` not to try to interpret any date or time information: `decode_times = False`

We're only interested in the `'precipitation'` dataset, so we use the `get()` method to subset our `xarray` Dataset to just this variable.

Furthermore, because `xarray` doesn't know much about HDF5 files, we need to use `assign_coords()` to define the coordinates of our data.

In [None]:
filename = file_list[0]

# We only care about the "precipitation" variable, but we want an xarray.Dataset,
#    so we include the name of the variable(s) we want as a list in get()
ds = xr.open_dataset(filename, group = 'Grid', decode_times = False).get(['precipitation'])

# Define the missing coordinates
date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d') # e.g., "20180101"
# NOTE: The coordinate names must match the dimension names used below ('lon', 'lat')
ds = ds.assign_coords({
    'time': [date], 'lon': longitude, 'lat': latitude
})

Finally, we're ready to visualize our data. To enhance some of the patterns in the data, let's set `vmax = 1`, so that our colorbar stretches only to 1 mm/hr at the high end.

The plot may look strange because the IMERG-Final product quantifies precipitation over both land and the oceans. However, you should be able to identify western Europe and Iceland in the top-center of the image. Also note that the tropics, as expected, receive a lot of precipitation relative to the rest of the world.

In [None]:
ds.precipitation.plot(x = 'lon', vmax = 1)

### Spatial coordinate reference systems

Spatial datasets are special. In addition to the data values they contain (e.g., precipitation), each value is associated with *spatial coordinates* that describe a point on the earth.

Because our datasets are often 2-dimensional, flat representations (like the precipitation data above), we need a way of locating each pixel in our flat dataset on the round earth. A **coordinate reference system (CRS)** describes how the flat representation relates to the earth.

**A CRS is often described using a unique identifier known as an EPSG code,** where EPSG stands for the European Petroleum Survey Group.

&#x1F449; You can learn more about EPSG codes, and look up the code for a specific CRS, at [the epsg.io website.](https://epsg.io/)

### Spatial subsetting of an `xarray` Dataset

This is a global dataset, but we're only interested in precipitation in the Yellowstone River basin. We've previously seen how to subset an `xarray` dataset to a specific point or within a rectangular bounding box. But how can we subset the data to an irregular shape, like a drainage basin?

&#x1F449; **First, we need to define the spatial coordinate system of our data. We can do this by accessing the `rio` property of a Dataset.**

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

**In order for the following example to work, we must have the module `rioxarray` installed.** (This notebook doesn't import it explicitly, but it does need to be installed. If `ds.rio` can't be found, adding `import rioxarray` registers it.)

The `rio` property of a Dataset is accessed as `ds.rio`, if the Dataset is named `ds`. `ds.rio` gives us access to all sorts of tools for manipulating the spatial coordinates and attributes of a spatial dataset. Its name comes from the `rasterio` library, which is often abbreviated as `rio`.

**Let's define the coordinate reference system (CRS) of our data using its EPSG code. For longitude-latitude data, we can usually assume it is best described by the WGS84 Geographic Coordinate System, which has the EPSG code 4326.**

In [None]:
from pyproj import CRS

# NOTE: We must have rioxarray installed to be able to access the 
#  "rio" property of xarray Datasets; here, we define the CRS of this
#  dataset, which was previously undefined

# Tell xarray that this dataset's CRS is the WGS84 Geographic Coordinate System
ds = ds.rio.write_crs(CRS.from_epsg(4326))

# Also, we need to indicate which of the coordinates correspond to X and Y dimensions
ds = ds.rio.set_spatial_dims(x_dim = 'lon', y_dim = 'lat')

**With the CRS and spatial coordinates defined, it's very easy to clip our precipitation grid to the bounds of our basin.**

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

For this to work, the CRS of our gridded data, `ds`, must be the same as the CRS of our basin, `basin`. We could verify this by asking:

```python
ds.rio.crs == basin.crs
```

In [None]:
ds_clip = ds.rio.clip(basin.geometry.values)
ds_clip.precipitation

We can already tell that the new dataset, `ds_clip`, is smaller. Let's plot it, to be sure.

In [None]:
ds_clip.precipitation[0].plot(x = 'lon')

---

## Creating a data processing pipeline

This worked great for a single Dataset, but we have several files to process. Let's create a `for` loop to deal with them.

The `concat()` function in `xarray` can be used to combine multiple datasets together into a single dataset. Here, because each of our IMERG-Final data granules represents a different date (month), we combine them together along the `time` axis.

In [None]:
datasets = []

for filename in file_list:
    date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(filename, group = 'Grid', decode_times = False).get(['precipitation'])
    ds0 = ds0.assign_coords({
        'time': [date], 'lon': longitude, 'lat': latitude
    })

    # Tell xarray that this dataset's CRS is the WGS84 Geographic Coordinate System
    ds0 = ds0.rio.write_crs(CRS.from_epsg(4326))
    
    # Also, we need to indicate which of the coordinates correspond to X and Y dimensions
    ds0 = ds0.rio.set_spatial_dims(x_dim = 'lon', y_dim = 'lat')

    ds_clip = ds0.rio.clip(basin.geometry.values)
    
    # Add the clipped dataset to our list, to be combined after the loop
    datasets.append(ds_clip)

# Combine the datasets together along the "time" axis
ds = xr.concat(datasets, dim = 'time')
ds

## Calculating total basin-wide precipitation rate

Next, we need to convert the units of measurement. Our precipitation is measured as a rate in mm/hr, but each time step represents a whole month. To convert from mm/hr to mm/month, we multiply by the number of hours in each month, so we first need to figure out how many days are in each month.

In [None]:
import calendar

calendar.mdays

In [None]:
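# calendar.mdays always lists 28 days for February, which would undercount
#   leap years (2016, 2020); the "dt" accessor gives the actual day counts
days_in_month = ds['time'].dt.days_in_month.values
days_in_month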
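
Note that `calendar.mdays` always lists 28 days for February, whereas 2016 and 2020 are leap years within our study period. As a small sketch using only the standard library, `calendar.monthrange()` returns the actual number of days for a given year and month; the helper name `hours_in_month` is hypothetical:

```python
import calendar

def hours_in_month(year, month):
    """Number of hours in a given month, accounting for leap years."""
    # monthrange() returns (weekday of the first day, number of days in the month)
    n_days = calendar.monthrange(year, month)[1]
    return 24 * n_days

print(hours_in_month(2015, 2))  # February in a common year
print(hours_in_month(2016, 2))  # February in a leap year
```

Multiplying a rate in mm/hr by this number of hours gives a total in mm/month.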

It seems easy enough to multiply our monthly data, with its hourly rate, by the number of days (and 24 hours). However, when we're working with array data, we have to pay attention to the shape of our data.

In [None]:
ds.precipitation.shape

Our precipitation data are stored as a NumPy array with three axes:

- 120 months
- 72 rows (each 0.1 degrees latitude)
- 53 columns (each 0.1 degrees longitude)

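When we multiply arrays, their shapes must be compatible. Our `days_in_month` array has one value per month, so we reshape it to have one entry along the time axis and a length-1 placeholder for each of the two spatial axes. NumPy can then apply each month's value to every pixel in that month's grid.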

In [None]:
# Converting from [mm hour-1] to [mm month-1]
ds['precip_monthly'] = ds.precipitation * 24 * days_in_month.reshape((days_in_month.size, 1, 1))
ds
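
To see why the reshape matters, here is a minimal NumPy sketch on a tiny, made-up cube (3 months on a 2 x 2 grid; all values are hypothetical). Reshaping the day counts to `(3, 1, 1)` lines them up with the time axis so that NumPy can broadcast them across the spatial axes:

```python
import numpy as np

# Stand-in for a (time, rows, columns) precipitation cube
rate = np.ones((3, 2, 2))        # mm per hour
days = np.array([31, 28, 31])    # shape (3,)

# Reshape to (3, 1, 1) so each month's day count applies to that month's whole grid
monthly = rate * 24 * days.reshape((days.size, 1, 1))
print(monthly.shape)
print(monthly[:, 0, 0])
```

Without the reshape, NumPy would try to match the 3 day counts against the last axis (length 2) and raise an error.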

**Finally, we're ready to compute the total precipitation for our basin.**

#### &#x1F6A9; <span style="color:red">Computing Total Precipitation</span>

You may be tempted to compute the sum of all the pixels in the basin in order to determine the total precipitation that fell in a given month. However, this would be grossly exaggerating the amount of precipitation, because our units of measurement (mm/hr or mm/month) are tied to a specific spatial area. That is, the vertical height of standing water (measured in mm) corresponds to a fixed area. If the area were larger, the vertical height of water would be lower as the water spreads out to cover the larger area. **We actually want to compute the mean of our basin's precipitation amounts.**

Another way to think about this is to imagine we have a tank with a removable divider, separating the tank into two regions of equal area. On the left of the divider, the water is 4 mm deep. On the right, the water is 2 mm deep. If we removed the divider, how deep would the water be after it settles? It would be 3 mm deep, which corresponds to the average of 2 and 4, not the sum.
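
The tank example can be sketched in a few lines: the settled depth is the total volume of water divided by the total area. The function name `settled_depth` and the numbers are illustrative only.

```python
def settled_depth(depths_mm, areas):
    """Depth after removing the dividers: total volume divided by total area."""
    volume = sum(d * a for d, a in zip(depths_mm, areas))
    return volume / sum(areas)

print(settled_depth([4, 2], [1, 1]))   # equal areas: the simple mean
print(settled_depth([4, 2], [3, 1]))   # unequal areas: an area-weighted mean
```

When the two regions have equal area, this reduces to the simple average described above.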

&#x1F449; For an `xarray` Dataset, when we are computing the average over the entire basin, we specify that we want to collapse both the `'lon'` and the `'lat'` axes, leaving us with only the `'time'` axis.

In [None]:
# Compute the mean over all longitude-latitude bins
precip_series = ds.precip_monthly.mean(['lon','lat']).values

# Plot the result
pyplot.plot(precip_series)
pyplot.ylabel('Basin-wide precipitation (mm per month)')
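
The simple mean treats every 0.1-degree cell as if it had the same area. On a longitude-latitude grid, cell area shrinks roughly with the cosine of latitude, so an area-weighted mean is a common refinement. The sketch below shows the idea with NumPy on a made-up grid (the values and latitudes are hypothetical); for a real Dataset, xarray's `weighted()` method offers a similar calculation.

```python
import numpy as np

# Hypothetical (lat, lon) grid of monthly precipitation in mm;
#   NaN marks cells that fall outside the basin
precip = np.array([
    [10.0, 20.0, np.nan],
    [30.0, 40.0, 50.0],
])
lat = np.array([48.0, 44.0])   # cell-center latitudes, in degrees

# Weight each row by cos(latitude), repeated across the longitude axis
weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones_like(precip)

# Ignore cells outside the basin in both the numerator and the denominator
valid = ~np.isnan(precip)
weighted_mean = np.sum(precip[valid] * weights[valid]) / np.sum(weights[valid])
simple_mean = np.nanmean(precip)

print(simple_mean, weighted_mean)
```

For a basin spanning only a few degrees of latitude, the two means are usually close, but the difference grows for larger or higher-latitude regions.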

Now, we're ready to save our dataset for later use. Let's save it to a netCDF4 file. (As with the download step, the output directory, here called `processed`, must already exist.)

In [None]:
ds.to_netcdf('./processed/IMERG-Final_precip_monthly_2014-2023.nc')