# M2.1 - Working with Gridded Climate Data

## Conceptual and computational constraints

As computer technology improves and more satellite-based datasets become available, the size, complexity, and frequency of climate data are increasing rapidly. This can lead to two problems for researchers and resource managers looking to use climate data. First, it can be difficult to conceptualize the size and scale of some climate datasets, leading to issues with data and project management (Jain et al. 2022). Second, it can be difficult to process and analyze climate datasets that are sometimes too large to fit into computer memory or too complex for tractable computation.

In this lesson, we'll explore how to manage some of these issues. Because we're learning, we'll use a dataset that isn't as large or complex as others we might encounter, but it should serve as a good illustration of more difficult datasets and how to handle them.

In [None]:
import xarray as xr
import earthaccess

auth = earthaccess.login(strategy = 'netrc')

## A terrestrial water storage (TWS) time series

To gain more experience working with gridded climate data, and particularly a time series of gridded climate data, we'll use a terrestrial water storage (TWS) dataset from the Global Land Data Assimilation System (GLDAS), which is a global version of the NLDAS we are already familiar with.

TWS includes all water on and under the land surface, in the form of snowpack, surface streams and reservoirs, and groundwater. [The TWS dataset we'll be working with from GLDAS](https://podaac.jpl.nasa.gov/dataset/TELLUS_GLDAS-NOAH-3.3_TWS-ANOMALY_MONTHLY) has already been converted to *monthly anomalies.*

Recall that *anomalies* are one way of representing variability around a long-term mean. They tell us how a particular data point (a particular day, month, or year) compares to what is typical for that time; e.g., is this an especially dry year?
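
As a quick refresher, here is a minimal sketch of how a monthly anomaly could be computed with NumPy. The numbers are synthetic (not real TWS data), and subtracting each calendar month's long-term mean is only one common definition; the GLDAS product may use a different baseline period.

```python
import numpy as np

# Synthetic example: 3 years x 12 months of a made-up variable (not real TWS data)
rng = np.random.default_rng(42)
seasonal_cycle = 10 * np.sin(2 * np.pi * np.arange(12) / 12)
values = seasonal_cycle + rng.normal(0, 1, size = (3, 12))

# Long-term mean for each calendar month (average over the "year" axis)
climatology = values.mean(axis = 0)

# Anomaly: how far each month is from the long-term mean of its calendar month
anomalies = values - climatology

print(anomalies.shape)  # (3, 12)
```

By construction, the anomalies for each calendar month average to zero, so a positive value means "more than usual for that month" and a negative value means "less than usual."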

### Performing our skills: Handling raw data

*We're about to download some raw data! You know what to do.* Create a new folder called `GLDAS` inside the `data_raw` folder.

In [None]:
results = earthaccess.search_data(
    short_name = 'TELLUS_GLDAS-NOAH-3.3_TWS-ANOMALY_MONTHLY')

# About 20 MB in total, which might take 1-2 minutes
earthaccess.download(results, 'data_raw/GLDAS')

Let's take a look at one of the datasets we just downloaded. Recall that when we use `glob.glob()`, we can easily get a list of files, but they may not be in chronological (or even alphanumeric) order.

In [None]:
import glob

files = glob.glob('data_raw/GLDAS/*.nc')
files.sort()

ds = xr.open_dataset(files[0]) # Just the first file
ds['TWS_monthly'].plot(cmap = 'RdYlBu')

In the above plot, which depicts the TWS anomaly for April 2002, cool colors show a positive TWS anomaly (i.e., more water than usual for April) and warmer colors show a negative TWS anomaly (i.e., less water than usual for April).

### Getting a TWS time series

As we've seen previously, the `xarray` library has a function, `open_mfdataset()`, that can be used to open multiple files as a single dataset. The files may represent multiple pieces of a dataset along *any axis;* in this case, each file represents a different part of the time axis. `xarray` can figure out automatically what order the files should go in because netCDF4 files are structured in a way that explicitly defines X, Y, and time axes.

In [None]:
tws_anomaly = xr.open_mfdataset('data_raw/GLDAS/*.nc')
tws_anomaly
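
To see how ordering by coordinates works, here is a small sketch that uses tiny in-memory Datasets in place of files. The helper `make_month()` and its values are hypothetical. We call `xr.combine_by_coords()` directly; in recent versions of `xarray`, `open_mfdataset()` combines files by their coordinates in a similar way by default, but treat that as an assumption to check against your installed version.

```python
import numpy as np
import xarray as xr

def make_month(date):
    # One hypothetical "file": a single time step on a 2 x 3 grid of made-up values
    return xr.Dataset(
        {'TWS_monthly': (('time', 'lat', 'lon'), np.random.rand(1, 2, 3))},
        coords = {
            'time': np.array([date], dtype = 'datetime64[ns]'),
            'lat': [38.5, 39.5],
            'lon': [-121.5, -120.5, -119.5]
        })

# Deliberately out of chronological order
pieces = [make_month('2002-06-01'), make_month('2002-04-01'), make_month('2002-05-01')]

# The pieces are ordered by their time coordinate values, not by list order
combined = xr.combine_by_coords(pieces)
print(combined['time'].values)
```

The combined Dataset has three time steps in chronological order, even though the pieces were supplied out of order.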

## Thinking about multi-dimensional arrays

As we've seen, netCDF4 and HDF5 files can be represented as `xarray` Datasets.

In [None]:
type(tws_anomaly)

A Dataset can contain more than one Variable, each represented as a multi-dimensional array.

There are different ways that we can represent arrays in Python. When we use `xarray`, the variables are represented as DataArrays.

In [None]:
type(tws_anomaly['TWS_monthly'])

An `xarray` DataArray usually just wraps a NumPy array, adding labeled dimensions and coordinates. But when we use `xarray.open_mfdataset()`, the underlying array is a different type.

In [None]:
type(tws_anomaly['TWS_monthly'].data)

**The `dask` library is automatically used to represent our underlying multi-dimensional array.** This is because the use of `open_mfdataset()` implies we have a potentially large multi-dimensional array to work with, because we are opening multiple files.

**But by using `xarray.open_mfdataset()` to stack multiple files together in a time series, we've also gained a new way of thinking about our data.** Each individual file represented essentially a 2D image: the TWS anomaly on a latitude-longitude grid. With multiple image dates stacked together, we obtain a 3D **data cube,** with X (longitude), Y (latitude), and time axes. When we display the underlying `dask` array inside a Jupyter Notebook, we get a helpful illustration of this cube.

In [None]:
tws_anomaly['TWS_monthly'].data

When analyzing data cubes, we're typically trying to answer one of these questions:

1. How does a climate variable in one or more locations vary over time?
1. How does a climate variable at a certain time vary over space?
1. How does climate variability (variation over time) compare across space?

The first two questions don't require a data cube to answer, but we often have to merge different datasets together in order to answer them. We might consider ourselves lucky to start with a data cube, in that case, because we have all of the data in one place. These questions require us to subset or *index* a multi-dimensional array.

The third question can only be answered if we start with a data cube; it's a question that requires us to aggregate or *collapse* an axis of our data cube.
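
The difference between indexing and collapsing can be sketched with a small NumPy array. The cube below is random data with an assumed axis order of (time, latitude, longitude); it is only meant to show how each question changes the shape of the result.

```python
import numpy as np

# A made-up data cube with axes (time, lat, lon): 24 months on a 4 x 5 grid
rng = np.random.default_rng(0)
cube = rng.normal(size = (24, 4, 5))

# Question 1: one location over time (index the spatial axes) -> 1D time series
series = cube[:, 2, 3]

# Question 2: one time step over space (index the time axis) -> 2D map
snapshot = cube[0, :, :]

# Question 3: variability over time at every location (collapse the time axis) -> 2D map
variability = cube.std(axis = 0)

print(series.shape, snapshot.shape, variability.shape)  # (24,) (4, 5) (4, 5)
```

Note that the second and third results have the same shape, but only the third one uses every time step to produce each value.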

---

## Indexing a multi-dimensional array

We can slice our data cube a number of different ways to answer different questions. For example, what do the past 20 years of TWS anomalies look like in the Western U.S., given the multi-decadal drought the region has experienced?

Let's start by looking at the pixel that contains Sacramento, CA. Slicing our array at this single pixel amounts to taking a thin strip along the time axis; as illustrated below, we end up with what is essentially a 1D array of 220 values (for 220 months of data).

In [None]:
tws_anomaly['TWS_monthly'].sel(lon = -121.5, lat = 38.5).data

The data could be represented as a time series, since we have one value over time.

In [None]:
tws_anomaly['TWS_monthly'].sel(lon = -121.5, lat = 38.5).plot()

**If we wanted to look at multiple pixels representing a region of interest, we're still slicing our array along the time axis, but we end up with a smaller data cube.**

Recall that the built-in `slice()` function can be used in combination with the `sel()` method of `xarray` to select a region of interest. In this case, our region is 10 pixels wide by 10 pixels tall.

In [None]:
west_us = tws_anomaly['TWS_monthly'].sel(lon = slice(-124, -114), lat = slice(32, 42))
west_us.data
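
Here is a small sketch of how `slice()` behaves inside `sel()`, using a made-up array of zeros. The 1-degree grid with cell centers at half degrees is an assumption for illustration, as is the name `flipped`. The sketch also shows a common pitfall: the slice bounds must follow the order in which the coordinate is stored.

```python
import numpy as np
import xarray as xr

# Made-up global 1-degree grid with cell centers at half degrees (an assumption)
lat = np.arange(-89.5, 90, 1.0)
lon = np.arange(-179.5, 180, 1.0)
time = np.arange(3)  # three hypothetical time steps
da = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size)),
    coords = {'time': time, 'lat': lat, 'lon': lon},
    dims = ('time', 'lat', 'lon'))

# slice() selects by coordinate label, and both end points are inclusive
region = da.sel(lon = slice(-124, -114), lat = slice(32, 42))
print(region.shape)  # (3, 10, 10)

# If latitude is stored from north to south, the slice bounds must be reversed
flipped = da.sortby('lat', ascending = False)
empty = flipped.sel(lat = slice(32, 42))   # selects nothing
north_to_south = flipped.sel(lat = slice(42, 32))
print(empty.sizes['lat'], north_to_south.sizes['lat'])
```

If a selection unexpectedly comes back empty, checking whether the latitude coordinate is ascending or descending is a good first step.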

There's no easy way to visualize the values of a 3D data cube, so when we ask `xarray` to plot the data, it just shows us a histogram, essentially pooling all the values from the data cube together.

In [None]:
west_us.plot()

## Aggregating along the axis of a multi-dimensional array

As the histogram above suggests, we need a way of transforming our data cube in order to better visualize the data and answer some of our questions.

**What's the average TWS anomaly in our study area in each month?** This can be answered by averaging over the spatial domain. We can visualize this as taking an average of each 10-by-10 pixel slice in our data cube, for each monthly time step.

In [None]:
west_us.data

If we specify we want to average over one of our spatial axes, we go from a 3D array to a 2D array; i.e., **we collapsed one of our axes.**

In [None]:
west_us.mean('lat').data

In this case, we want to average over both of our spatial axes: latitude *and* longitude. This means we go from a 3D array to a 1D array; although it is depicted, below, as a 2D array, we have a trivial axis of length one (1), so we really have just a 1D sequence of 220 values.

In [None]:
west_us.mean(['lat', 'lon']).data
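
To make the effect of collapsing axes concrete, here is a sketch that tracks the shape of a made-up cube as we average over different dimensions. The array holds random values (not TWS data); its size of 220 time steps on a 10-by-10 grid simply mirrors the study area described above.

```python
import numpy as np
import xarray as xr

# Made-up cube: 220 time steps on a 10 x 10 grid (random values, not TWS data)
cube = xr.DataArray(
    np.random.rand(220, 10, 10),
    dims = ('time', 'lat', 'lon'))

print(cube.mean('lat').shape)           # (220, 10)
print(cube.mean(['lat', 'lon']).shape)  # (220,)
print(cube.mean('time').shape)          # (10, 10)
```

Each dimension we name in `mean()` disappears from the result: averaging over both spatial dimensions leaves a time series, while averaging over time leaves a map.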

---

### References

- Jain S, Mindlin J, Koren G, Gulizia C, Steadman C, Langendijk GS, Osman M, Abid MA, Rao Y, Rabanal V. 2022. [Are we at risk of losing the current generation of climate researchers to data science?](https://doi.org/10.1029/2022AV000676) *AGU Advances.*