# M2.2 - Working with Gridded Climate Data

**Contents:**

- [Conceptual and computational constraints](#Conceptual-and-computational-constraints)
- [A terrestrial water storage (TWS) time series](#A-terrestrial-water-storage-(TWS)-time-series)
- [Thinking about multi-dimensional arrays](#Thinking-about-multi-dimensional-arrays)
  - [Updating coordinate systems](#Updating-coordinate-systems)
- [Indexing a multi-dimensional array](#Indexing-a-multi-dimensional-array)
- [Aggregating along the axis of a multi-dimensional array](#Aggregating-along-the-axis-of-a-multi-dimensional-array)
- [Resampling a time series](#Resampling-a-time-series)
  - [Handling missing data](#Handling-missing-data)
- [Visualizing variation over space](#Visualizing-variation-over-space)
- [Summary](#Summary)

## Conceptual and computational constraints

As computer technology improves and more satellite-based datasets are available, the size, complexity, and frequency of climate data are increasing rapidly. This can lead to two problems for researchers and resource managers looking to use climate data. First, it can be difficult to conceptualize the size and scale of some climate datasets, leading to issues with data and project management (Jain et al. 2022). Second, it can be difficult to process and analyze climate datasets that are sometimes too large to fit into computer memory or too complex for tractable computation.

In this lesson, we'll explore how to manage some of these issues. Because we're learning, we'll use a dataset that isn't as large or complex as others we might encounter, but it should serve as a good illustration of more difficult datasets and how to handle them.

In [None]:
import xarray as xr
import earthaccess
import numpy as np
from matplotlib import pyplot

auth = earthaccess.login(strategy = 'netrc')

## A terrestrial water storage (TWS) time series

To gain more experience working with gridded climate data, and particularly a time series of gridded climate data, we'll use a terrestrial water storage (TWS) dataset produced using data from two satellite missions: the Gravity Recovery and Climate Experiment (GRACE) mission and its successor, the GRACE Follow-On (GRACE-FO) mission.

TWS includes all water on and under the land surface, in the form of snowpack, surface streams and reservoirs, and groundwater. [The TWS dataset we'll be working with from the GRACE and GRACE-FO missions](https://podaac.jpl.nasa.gov/dataset/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3) has already been converted to *monthly anomalies.*

&#x1F449; [Read more about GRACE/GRACE-FO data here.](https://grace.jpl.nasa.gov/about/faq/)

Recall that *anomalies* are one way of representing variability around a long-term mean. They tell us how a particular data point (a particular day, month, or year) compares to that long-term mean; e.g., is this an especially dry year?
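
To make the idea of an anomaly concrete, here is a minimal sketch using a synthetic monthly series. The series, the 2004-2009 baseline window, and the names `series`, `baseline`, and `anomalies` are all hypothetical and chosen only for illustration; this is not how the GRACE/GRACE-FO product itself was processed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series for 2002-2011: a constant of 5.0 plus a seasonal cycle
time = pd.date_range('2002-01-01', periods = 120, freq = 'MS')
values = 5.0 + np.sin(2 * np.pi * np.arange(120) / 12)
series = xr.DataArray(values, coords = {'time': time}, dims = 'time')

# The long-term mean over a chosen baseline period (here, 2004 through 2009)
baseline = series.sel(time = slice('2004-01-01', '2009-12-31')).mean()

# Anomalies are the departures from that baseline mean
anomalies = series - baseline
```

By construction, the anomalies average to (approximately) zero over the baseline period, while values outside that period can drift above or below zero.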

#### &#x1F3AF; Best Practice

*We're about to download some raw data! You know what to do.* Create a new folder called `GRACE` inside the `data_raw` folder.

In [None]:
results = earthaccess.search_data(short_name = 'TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3')
results

This dataset has been prepared as a single netCDF file.

In [None]:
earthaccess.download(results, 'data_raw/GRACE')

**You may get an error message when trying to download this dataset because of a problem with NASA's servers.** If that's the case, note that there is a single URL for this dataset (included in the output above), which you can use to download the data directly. It should be:

- [https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3/GRCTellus.JPL.200204_202311.GLO.RL06.1M.MSCNv03CRI.nc](https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/TELLUS_GRAC-GRFO_MASCON_CRI_GRID_RL06.1_V3/GRCTellus.JPL.200204_202311.GLO.RL06.1M.MSCNv03CRI.nc) (Click to download)

In [None]:
ds = xr.open_dataset('data_raw/GRACE/GRCTellus.JPL.200204_202311.GLO.RL06.1M.MSCNv03CRI.nc')
ds

The GRACE/GRACE-FO data already represent *anomalies,* i.e., a change in total water storage relative to some time period (or "epoch"). What time period is that? We can find that out from the **file-level attributes** of this netCDF file.

In [None]:
ds.attrs['time_mean_removed']

We can also verify this by looking at a time series of the global average in monthly TWS anomalies. For the period 2004-2009, those anomalies appear to be centered on zero, which is what we would expect since the mean of that period was subtracted from the data.

We also see a long-term downward trend. What might explain this? The global loss of land ice is probably the chief factor driving this trend. The most recent Assessment Report (AR6) from the Intergovernmental Panel on Climate Change (IPCC) estimates that glacial and ice-sheet melt between 2006 and 2018 accounts for almost 2 centimeters of sea level rise (IPCC 2021, Chapter 9), which is consistent with what we see here (the vertical axis of our plot is also in centimeters). The warming of Earth's atmosphere also plays a role, as warmer air can hold more water vapor.

In [None]:
ds['lwe_thickness'].mean(['lon','lat']).plot()

The gap in the time series around 2017-2018 is due to the fact that GRACE's mission ended in October 2017 and GRACE-FO was not launched until May 2018; hence, there are no data values for those months.

We can look at more than a time series: this is a spatial dataset, as well. Let's take a look at the global TWS anomalies for the first date.

In [None]:
ds['lwe_thickness'][0].plot(vmin = -200, vmax = 200, cmap = 'RdBu')

In the above plot, which depicts the monthly TWS anomaly for April 2002, cool colors show a positive TWS anomaly (i.e., more water than usual for April) and warmer colors show a negative TWS anomaly (i.e., less water than usual for April).

---

## Thinking about multi-dimensional arrays

**Let's re-open our dataset using `open_mfdataset()`.** Recall that we use the `open_mfdataset()` function from `xarray` whenever we want to open multiple files as if they were a single dataset. Here, we have only one file, but we'll treat it as if it were multiple files because it contains a global TWS anomaly image for multiple dates.

In [None]:
ds = xr.open_mfdataset('data_raw/GRACE/GRCTellus.JPL.200204_202311.GLO.RL06.1M.MSCNv03CRI.nc')

As we've seen, netCDF4 and HDF5 files can be represented as `xarray` Datasets.

In [None]:
type(ds)

A Dataset can contain more than one Variable, each represented as a multi-dimensional array.

There are different ways that we can represent arrays in Python. When we use `xarray`, the variables are represented as DataArrays.

In [None]:
type(ds['lwe_thickness'])

The `xarray` DataArray is usually just a labeled wrapper around a NumPy array. But when we use `xarray.open_mfdataset()` we get a different type of underlying array.

In [None]:
type(ds['lwe_thickness'].data)

**The `dask` library is automatically used to represent our underlying multi-dimensional array.** This is because the use of `open_mfdataset()` implies we have a potentially large multi-dimensional array to work with, because we are opening multiple files.

**But by using `xarray.open_mfdataset()` to stack multiple files together in a time series, we've also gained a new way of thinking about our data.** Each individual file represented essentially a 2D image: the TWS anomaly on a latitude-longitude grid. With multiple image dates stacked together, we obtain a 3D **data cube,** with X (longitude), Y (latitude), and time axes. The underlying `dask` array displays a helpful illustration of this cube when we view it inside a Jupyter Notebook.

In [None]:
ds['lwe_thickness'].data
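
The stacking of 2D images into a 3D data cube can be sketched with a few small, made-up arrays. Everything here (the grid, the dates, and the names `images` and `cube`) is hypothetical; it only illustrates the idea of concatenating latitude-longitude images along a new `time` axis.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A tiny, hypothetical latitude-longitude grid and four monthly dates
lat = np.array([10.25, 10.75])
lon = np.array([20.25, 20.75, 21.25])
dates = pd.date_range('2020-01-01', periods = 4, freq = 'MS')

# One 2D "image" per date; image i is filled with the constant value i
images = [
    xr.DataArray(
        np.full((lat.size, lon.size), float(i)),
        coords = {'lat': lat, 'lon': lon}, dims = ('lat', 'lon'))
    for i in range(dates.size)
]

# Stack the 2D images along a new "time" dimension to form a 3D data cube
cube = xr.concat(images, dim = pd.Index(dates, name = 'time'))
```

The result has dimensions `('time', 'lat', 'lon')`, just like the `lwe_thickness` variable.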

When analyzing data cubes, we're typically trying to answer one of these questions:

1. How does a climate variable in one or more locations vary over time?
1. How does a climate variable at a certain time vary over space?
1. How does climate variability (variation over time) compare across space?

The first two questions don't require a data cube to answer, but we often have to merge different datasets together in order to answer them. We might consider ourselves lucky to start with a data cube, in that case, because we have all of the data in one place. These questions require us to subset or *index* a multi-dimensional array.

The third question can only be answered if we start with a data cube; it's a question that requires us to aggregate or *collapse* an axis of our data cube.

### Updating coordinate systems

Recall that `xarray` Datasets have coordinate systems, which encode the spatial information of the data.

In [None]:
ds.coords

We can see that our data are on a half-degree latitude-longitude grid; i.e., each pixel is 0.5 degrees by 0.5 degrees.

You may have noticed that while the latitudes, or `lat` values, span -90 degrees to 90 degrees, the longitude, or `lon` values, span 0 to 360 degrees. We're used to talking about longitudes that span -180 degrees (West) longitude to 180 degrees (East) longitude. How can we fix this?

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

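**Let's change our `lon` values so they are easier to work with.** Simply subtracting 180 degrees from every coordinate would give us the right range of values, but it would also shift every data point's label halfway around the globe. Instead, we'll wrap the longitudes: values up to 180 degrees stay as they are, and values above 180 degrees have 360 degrees subtracted, so that the smallest longitude value is approximately -180 and the largest value, which used to be approximately 360, is now approximately 180. We then sort along `lon` so that the coordinate values are increasing again.

In [None]:
# Wrap longitudes from [0, 360) to [-180, 180), then sort so "lon" is increasing
ds = ds.assign_coords(lon = ((ds.coords['lon'] + 180) % 360) - 180).sortby('lon')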

In [None]:
ds.coords
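
When converting longitudes from a 0-to-360 convention to a -180-to-180 convention, each data value should stay attached to the location it describes. Here is a minimal sketch of one common way to do that, wrapping the coordinate values and then sorting. The four-point array `da` is hypothetical, and each value simply records the longitude it started with so we can track it.

```python
import numpy as np
import xarray as xr

# Hypothetical 1D array on a 0-to-360 longitude axis; each value stores its original longitude
lon = np.array([0.5, 90.5, 180.5, 270.5])
da = xr.DataArray(lon.copy(), coords = {'lon': lon}, dims = 'lon')

# Wrap longitudes into [-180, 180), then sort so the axis is increasing again
wrapped = da.assign_coords(lon = ((da['lon'] + 180) % 360) - 180).sortby('lon')

# The value that was at 270.5 degrees (East) is now labeled -89.5 degrees (i.e., 89.5 West)
value_at_west = float(wrapped.sel(lon = -89.5))
```

Because the data are re-ordered along with their labels, a point in the Western Hemisphere is still found at its (now negative) longitude.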

**Coordinates are like labels for the data points in our data cube, while dimensions are the axes of the data cube.** We can apply multiple labels to the same dimension. For example, we have both a dimension named `time` and a coordinate set named `time`. The `time` coordinate describes the precise moment in time down to the nanosecond (ns), but we could also label the time points of our dataset by the year, month, season, or day.

**Let's add a new `year` coordinate to our dataset, so we can identify individual data points by a common year.** We first extract the numeric year from the `"time"` coordinate. Then, we use [the `assign_coords()` method](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.assign_coords.html) of an `xarray.Dataset` to create new coordinate labels along an existing dimension.

In [None]:
# Get a list of all the dates from the "time" coordinate
dates = ds.coords['time'].values

Take a look at what `dates` contains.

```python
dates
```

In [None]:
# Convert each "datetime64[ns]" object to a 4-digit year
years_list = []
for each in dates:
    years_list.append(int(str(each)[0:4]))

Take a look at what `years_list` contains.

```python
years_list
```

In [None]:
# Create a new coordinate ("year") along an existing dimension ("time")
ds = ds.assign_coords(year = ('time', years_list))
ds
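
As an alternative to the loop above, `xarray` exposes date components through the `.dt` accessor of a datetime coordinate. The sketch below uses a small, hypothetical time series (`da`) rather than the TWS dataset, and should give the same kind of `year` labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical series spanning a year boundary: Nov 2002, Dec 2002, Jan 2003, Feb 2003
time = pd.date_range('2002-11-01', periods = 4, freq = 'MS')
da = xr.DataArray(np.arange(4.0), coords = {'time': time}, dims = 'time')

# Extract the year of each time step with the ".dt" accessor, then attach it
#   as a new "year" coordinate along the existing "time" dimension
da = da.assign_coords(year = ('time', da['time'].dt.year.values))
```

Other components, such as `.dt.month` or `.dt.season`, can be used in the same way to label the time axis.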

---

## Indexing a multi-dimensional array

We can slice our data cube a number of different ways to answer different questions. For example, what do the past 20 years of TWS anomalies look like in the Western U.S., given the multi-decadal drought the region has experienced?

Let's start by looking at the pixel that contains Sacramento, CA. Slicing our array at this single pixel amounts to taking a thin strip along the time axis; as illustrated below, we end up with what is essentially a 1D array of 220 values (for 220 months of data).

In [None]:
ds['lwe_thickness'].sel(lon = -121.75, lat = 38.25).data

This is the same kind of operation as slicing our array by position along the X and Y (longitude and latitude) axes, as we can see below, where we select a different pixel using integer indices.

In [None]:
# Three dimensions: (time, lat, lon)
ds['lwe_thickness'].shape

In [None]:
# Take the first pixel along the "lat" dimension and the first pixel along the "lon" dimension
ds['lwe_thickness'][:,0,0].data

The data could be represented as a time series, since we have one value over time.

In [None]:
ds['lwe_thickness'].sel(lon = -121.75, lat = 38.25).plot(figsize = (12, 6))
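
The relationship between label-based selection with `sel()` and positional indexing can be checked on a small, made-up cube. The coordinates and the names `by_label`, `by_position`, and `nearest` below are hypothetical; the point is only that all three pick out the same strip along the time axis.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical (time, lat, lon) cube with 4 time steps on a 3-by-2 grid
time = pd.date_range('2020-01-01', periods = 4, freq = 'MS')
lat = np.array([37.75, 38.25, 38.75])
lon = np.array([-122.25, -121.75])
cube = xr.DataArray(
    np.arange(24.0).reshape(4, 3, 2),
    coords = {'time': time, 'lat': lat, 'lon': lon}, dims = ('time', 'lat', 'lon'))

# Select one pixel by its coordinate labels...
by_label = cube.sel(lon = -121.75, lat = 38.25)

# ...or by its integer position along the "lat" and "lon" axes
by_position = cube[:, 1, 1]

# If we don't know the exact pixel center, we can ask for the nearest one
nearest = cube.sel(lon = -121.7, lat = 38.3, method = 'nearest')
```

Label-based selection is usually easier to read and less error-prone, because it doesn't depend on knowing the order of the axes.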

**If we wanted to look at multiple pixels representing a region of interest, we're still slicing our array along the time axis, but we end up with a smaller data cube.**

Recall that the built-in `slice()` function can be used in combination with the `sel()` method of `xarray` to select a region of interest. In this case, our region is 10 degrees wide by 10 degrees tall; on a half-degree grid, that is 20 pixels wide by 20 pixels tall.

In [None]:
west_us = ds['lwe_thickness'].sel(lon = slice(-124, -114), lat = slice(32, 42))
west_us.data
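
To see how many pixels a bounding box covers, we can try the same `slice()` selection on an empty, hypothetical half-degree global grid. The grid below is an assumption: it is built with pixel centers at ...25 and ...75 and is not read from the TWS file.

```python
import numpy as np
import xarray as xr

# Hypothetical half-degree global grid, with pixel centers at x.25 and x.75 degrees
lat = np.arange(-89.75, 90, 0.5)
lon = np.arange(-179.75, 180, 0.5)
grid = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords = {'lat': lat, 'lon': lon}, dims = ('lat', 'lon'))

# Select a 10-degree by 10-degree bounding box using label-based slices
region = grid.sel(lon = slice(-124, -114), lat = slice(32, 42))
n_lat, n_lon = region.shape
```

On such a grid, each 10-degree span contains 20 pixel centers, so the region holds 20 x 20 = 400 pixels at each time step.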

There's no easy way to visualize the values of a 3D data cube, so when we ask `xarray` to plot the data, it just shows us a histogram, essentially pooling all the values from the data cube together.

In [None]:
west_us.plot()

Of course, we can always plot a single time step by subsetting the dataset.

In [None]:
west_us.sel(time = '2003-01-01', method = 'nearest').plot()

From the way this image looks (and if we examine the underlying array values), we can infer that while this latitude-longitude dataset has a 0.5-degree spatial resolution, the values are repeated in some areas because [the actual spatial resolution of the data is closer to 300 km](https://grace.jpl.nasa.gov/about/faq/).

---

## Aggregating along the axis of a multi-dimensional array

As the histogram above suggests, we need a way of transforming our data cube in order to better visualize the data and answer some of our questions.

**What's the average TWS anomaly in our study area in each month?** This can be answered by averaging over the spatial domain. We can visualize this as taking an average of each 20-by-20 pixel slice in our data cube, for each monthly time step.

In [None]:
west_us.data

If we specify we want to average over one of our spatial axes, we go from a 3D array to a 2D array; i.e., **we collapsed one of our axes.**

In [None]:
west_us.mean('lat').data

Specifically, we lost the `"lat"` dimension, as we can see below.

In [None]:
west_us.mean('lat').dims

Let's average over both of our spatial axes: averaging over latitude *and* longitude. This means we go from a 3D array to a 1D array; although it is depicted, below, as a 2D array, we have a trivial axis of length one (1), so we really have just a 1D sequence of 220 values.

In [None]:
west_us.mean(['lat', 'lon']).data
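
How the dimensions change when we collapse one or more axes can be verified on a small, hypothetical cube. The array below is made up, and only its shape matters.

```python
import numpy as np
import xarray as xr

# Hypothetical (time, lat, lon) cube: 6 time steps on a 4-by-5 grid
cube = xr.DataArray(np.arange(120.0).reshape(6, 4, 5), dims = ('time', 'lat', 'lon'))

# Collapsing one spatial axis leaves a 2D array
over_lat = cube.mean('lat')

# Collapsing both spatial axes leaves a 1D time series
over_space = cube.mean(['lat', 'lon'])

shapes = (cube.shape, over_lat.shape, over_space.shape)
```

Here, `shapes` goes from `(6, 4, 5)` to `(6, 5)` to `(6,)`: each aggregation removes the named axis and keeps the others.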

And since we have a 1D time series of TWS anomalies, we can visualize them as a line plot. From this plot, we can see that the multi-decadal drought in the Western U.S. has gotten worse in recent years.

In [None]:
west_us.mean(['lat', 'lon']).plot(figsize = (12, 6))

---

## Resampling a time series

Another common data analysis task we might want to perform with time-series data is to re-aggregate the data; for example, can we calculate the mean *annual* TWS anomaly, to better characterize wet and dry years?

In [None]:
west_us.data

[The `resample()` method of an `xarray.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.resample.html) is the best way to aggregate time-series data.

In [None]:
west_us_annual = west_us.resample(time = 'YS').mean()
west_us_annual.data

The syntax `time = 'YS'`, above, indicates that the *start of the year* (abbreviated `YS`) should be used as the resampling frequency; i.e., calculate the mean for each year. This syntax comes from the `pandas` library. 

&#x1F449; [**Read more about `resample()` and resampling frequencies here.**](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)
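
Here is a minimal sketch of what `resample(time = 'YS')` does, using a made-up series of 24 monthly values so the annual means are easy to check by hand. The names `monthly` and `annual` are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series: the values 0, 1, ..., 23 for Jan 2020 through Dec 2021
time = pd.date_range('2020-01-01', periods = 24, freq = 'MS')
monthly = xr.DataArray(np.arange(24.0), coords = {'time': time}, dims = 'time')

# Group the time steps into calendar years ("YS" = year start) and average each group
annual = monthly.resample(time = 'YS').mean()

# Each annual value is labeled with the first day of its year
annual_years = annual['time'].dt.year.values
```

The first year averages the values 0 through 11 (mean 5.5) and the second averages 12 through 23 (mean 17.5); the result still has a `time` dimension, now with one element per year.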

We should also use `resample()` for certain calculations we perform, such as calculating time trends, where the units of a trend are denominated by the unit of time (e.g., "per day" or "per year"). This is because the alternative, the `coarsen()` method, aggregates a fixed number of adjacent time steps but doesn't change the unit of time; in this case, it wouldn't change the units of time from days to years.

So, in general, with time-series axes, it's better to use the `resample()` method of a DataArray.

But note that when we resample along a given dimension, not all the coordinate labels are preserved. In this case, `xarray` doesn't know how to assign a `year` label to the resampled coordinates.

In [None]:
# We lost the "year" coordinate
west_us_annual.coords

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

But in this particular case, since we are interested in aggregating by year, anyway, we don't have to use `resample()`; we can use `groupby()` to indicate we want to take the mean within each `"year"` coordinate label.

In [None]:
west_us_annual = west_us.groupby('year').mean()
west_us_annual
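
The effect of `groupby()` on a non-dimension coordinate like `year` can be sketched with four made-up monthly values. The data and names below are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical series: Nov 2002, Dec 2002, Jan 2003, Feb 2003, with values 0, 1, 2, 3
time = pd.date_range('2002-11-01', periods = 4, freq = 'MS')
monthly = xr.DataArray(np.arange(4.0), coords = {'time': time}, dims = 'time')
monthly = monthly.assign_coords(year = ('time', monthly['time'].dt.year.values))

# Average all time steps that share the same "year" label
by_year = monthly.groupby('year').mean()
```

Unlike the resampled result, which keeps a `time` dimension, the grouped result has a `year` dimension with one element for each distinct label (here, 2002 and 2003).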

### Handling missing data

With satellite-based datasets like this TWS dataset, there is always a possibility of missing data. For example, there is a gap in 2017 and 2018 when the GRACE mission ended, before the GRACE-FO satellite was launched. There are also missing months in 2002, before the launch of GRACE in March, and the 2023 GRACE-FO data do not include December 2023, because it is still being processed.

&#x1F449; **Note that, because you're accessing these data from NASA's cloud, you may find that more recent data (with fewer missing months) are available!**

Below, we count how many months are available in each year.

In [None]:
counts = ds.groupby('year').count()

# Look at the count of available months, in each year, for every pixel
counts['lwe_thickness']

**However, we're getting a count of the months in each pixel!** We just want the count for one (arbitrary) pixel because, with this modeled TWS product, the data availability is the same for every pixel.

In [None]:
# Get the count at an arbitrary pixel: the top-left (0,0) corner
counts_by_year = counts['lwe_thickness'][:,0,0].values
counts_by_year
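
Counting the available months in each year can be illustrated with a made-up time axis that has a gap in it. The gap used below (July 2017 through May 2018) and the names `monthly` and `counts` are assumptions for illustration only, not values read from the TWS file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly time axis for 2017-2019, with the months Jul 2017 - May 2018 missing
time = pd.date_range('2017-01-01', '2019-12-01', freq = 'MS')
time = time[(time < '2017-07-01') | (time > '2018-05-31')]
monthly = xr.DataArray(np.ones(time.size), coords = {'time': time}, dims = 'time')
monthly = monthly.assign_coords(year = ('time', monthly['time'].dt.year.values))

# Count the (non-missing) values that fall within each year
counts = monthly.groupby('year').count()
```

In this sketch, 2017 and 2018 have 6 and 7 months of data, respectively, while 2019 has all 12.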

It's apparent we're missing several months of data, particularly in 2017 and 2018 (only 5 months available each year) between the GRACE and GRACE-FO missions.

We should first make sure that our annual averages line up with our yearly counts. Below, we compare the length of the `year` axis of `west_us_annual` with the number of yearly counts; because both were grouped by the same `year` coordinate, they should match.

In [None]:
west_us_annual.shape

In [None]:
counts_by_year.size

**Let's impose a rule: We need at least 9 months of data in each year for a reliable annual average.**

In [None]:
mask = counts_by_year < 9
mask

In [None]:
west_us_annual[mask] = np.nan
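
An alternative to in-place assignment, which can fail for some lazily loaded arrays depending on the library versions installed, is the `where()` method, which returns a new array with values replaced by NaN wherever a condition is false. This is a sketch with hypothetical annual data and month counts, not the TWS values.

```python
import numpy as np
import xarray as xr

# Hypothetical annual averages on a 2-by-2 grid for three years
years = np.array([2017, 2018, 2019])
annual = xr.DataArray(
    np.ones((3, 2, 2)), coords = {'year': years}, dims = ('year', 'lat', 'lon'))

# Hypothetical number of months available in each year
counts_by_year = np.array([6, 7, 12])

# A labeled condition along "year": True where we have at least 9 months of data
enough = xr.DataArray(counts_by_year >= 9, coords = {'year': years}, dims = 'year')

# Keep values where the condition holds; everything else becomes NaN
masked = annual.where(enough)
```

Because the condition is labeled with the `year` dimension, `xarray` broadcasts it across the `lat` and `lon` axes automatically.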

It's worth remembering that a related fix for this problem, if all the valid data points were consecutive, would be to take a slice:

In [None]:
# A related fix, if the valid data points were all consecutive
west_us_annual = west_us_annual.sel(year = slice(2003, 2023))
west_us_annual.year

---

Aggregating time intervals is just one of the many neat things an `xarray.DataArray` can do! [Check out the `xarray` documentation to learn more.](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html)

In [None]:
west_us_annual.mean(['lat','lon']).plot()

We can observe a general downward trend in the data, which is consistent with the decades-long drought the region has experienced (Liu et al. 2022).

---

## Visualizing variation over space

In the previous example, we resampled our data over the temporal axis (over time). But our dataset has spatial coordinates, too! We often want to visualize how temporal variation compares between different locations. This is simply a matter of aggregating our data cube a different way.

In [None]:
west_us_annual.data

We could calculate a simple inter-annual mean. This collapses our data cube to two spatial dimensions, latitude and longitude.

In [None]:
west_us_annual.mean('year').data

**However, it'd be much more interesting if we could calculate pixel-level trends in the annual TWS anomalies.**

Because we're working with `xarray`, a highly efficient and mature Python library for working with multi-dimensional arrays, it's a good idea to check to see if the functionality we want is already built into the `xarray.DataArray` class. Indeed, there are two functions that might help here:

- `DataArray.curvefit()` [(Documentation)](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.curvefit.html#xarray.DataArray.curvefit)
- `DataArray.polyfit()` [(Documentation)](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.polyfit.html#xarray.DataArray.polyfit)

Either one can be used to fit a function to a set of data points. However, we want to use `polyfit()` today because we are fitting a simple function (a linear trend) that should have [a closed-form solution](https://en.wikipedia.org/wiki/Closed-form_expression). A "degree 1" polynomial is a straight line because, in the function involved, $y(x) = mx + b$, $x$ is raised to the power of 1: $x^1$.

In [None]:
fit = west_us_annual.polyfit('year', deg = 1)
fit

In [None]:
fit['polyfit_coefficients']

**We obtained another data cube!** This data cube has an axis called **degree** with two (2) elements. While we asked for a 1-degree polynomial fit, recall that we need two numbers to describe a line: the y-intercept and the slope.
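
To check our understanding of the `degree` axis, we can run `polyfit()` on made-up data where the true slope at each pixel is known in advance. The 2-by-2 grid, the slopes, and the intercept of 10 below are all hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical annual data on a 2-by-2 grid, with a known linear trend at each pixel
years = np.arange(2003, 2013)
slopes = np.array([[-0.5, 1.0], [0.0, 2.0]])
values = slopes[None, :, :] * (years[:, None, None] - 2003) + 10.0
annual = xr.DataArray(
    values,
    coords = {'year': years, 'lat': [38.25, 38.75], 'lon': [-121.75, -121.25]},
    dims = ('year', 'lat', 'lon'))

# Fit a degree-1 polynomial (a straight line) along "year" at every pixel
fit = annual.polyfit('year', deg = 1)

# The coefficient for degree = 1 is the slope; degree = 0 is the y-intercept
slope = fit['polyfit_coefficients'].sel(degree = 1)
```

The recovered `slope` array should match the `slopes` we started with, in units of "per year" because the `year` coordinate is numeric.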

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

Another thing to note about our data cube is that we have a trend line (y-intercept and slope) for each of 400 pixels. Those 400 linear regressions seemed really fast! 

**Actually, we haven't computed the linear regressions yet. This is because we opened the TWS data using `xr.open_mfdataset()`, which does not load the data into the computer's memory right away.** This function automatically signals to `xarray` that we may be working with a very large dataset! Consequently, `xarray` doesn't actually do any computation until we explicitly ask it to.

So, what happened when we typed the following?

```python
fit = west_us_annual.polyfit('year', deg = 1)
```

What happened is that we obtained a **representation** of the computation we wanted `xarray` to perform. We called this representation `fit` and it looks like a data cube because that is what the result of the computation will look like when it is performed.

**When we want to actually compute the results, we need to call the `compute()` method.**

In [None]:
# Get a representation of the computation we want to do
fit = west_us_annual.polyfit('year', deg = 1)

# Actually run the computation
results = fit.compute()

# Look at the coefficients
results['polyfit_coefficients']

**There are two ways to ensure that computations on `xarray` Datasets and DataArrays actually run when you want them to:**

- Call the `compute()` method, as in the example above.
- Or, try to access the `values` property, as in the example below.

```python
fit['polyfit_coefficients'].values
```
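
The difference between a lazy result and a computed one can be seen on a small, made-up array. This sketch assumes the `dask` library is installed and uses the `chunk()` method to mimic what `open_mfdataset()` does for us; the array and names are hypothetical.

```python
import numpy as np
import xarray as xr
import dask.array  # Assumes dask is installed

# Hypothetical (time, x) array, converted to a lazy, chunked dask array
da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims = ('time', 'x')).chunk({'time': 1})

# This only builds a representation of the computation; nothing is calculated yet
lazy_mean = da.mean('time')
is_lazy = isinstance(lazy_mean.data, dask.array.Array)

# Calling compute() runs the calculation and returns an in-memory (NumPy-backed) result
result = lazy_mean.compute()
is_in_memory = isinstance(result.data, np.ndarray)
```

Until `compute()` (or `values`) is used, `lazy_mean` holds only the recipe for the calculation; afterwards, `result` holds the actual numbers.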

---

## Summary

Here are the key Python tools and techniques to remember from this lesson:

- You can use `xarray.open_mfdataset()` to open multiple files as a single `xarray` Dataset. For example, you might write
  `data = xarray.open_mfdataset("precip_in_year_Y*.nc")` to open a series of annual precipitation data files.
- The attributes (or metadata) of any file opened in `xarray` can be accessed through: `data.attrs`
- `xarray` Datasets are also sometimes called **data cubes.** You can aggregate over one or more of the **axes** (or "dimensions") of the data cube using methods like `mean()`; for example, to average over the two spatial dimensions of longitude and latitude: `data['variable'].mean(['lon','lat'])`.
- To slice a data cube along one or more axes, use the `sel()` method, for example: `data['variable'].sel(lon = -121.75, lat = 38.25)`.

---

### References

- IPCC. *Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change.* 2021. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 2391 pp. doi:10.1017/9781009157896
- Jain S, Mindlin J, Koren G, Gulizia C, Steadman C, Langendijk GS, Osman M, Abid MA, Rao Y, Rabanal V. 2022. [Are we at risk of losing the current generation of climate researchers to data science?](https://doi.org/10.1029/2022AV000676) *AGU Advances.*
- Liu, P.-W., J. S. Famiglietti, A. J. Purdy, K. H. Adams, A. L. McEvoy, J. T. Reager, R. Bindlish, D. N. Wiese, C. H. David, and M. Rodell. 2022. [Groundwater depletion in California’s Central Valley accelerates during megadrought.](https://www.nature.com/articles/s41467-022-35582-x) Nature Communications 13 (1):7825.
- xarray.DataArray (xarray Documentation). [https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html) Accessed: February 5, 2024.

### Additional Resources

- [Computation on `xarray` Datasets and DataArrays](https://docs.xarray.dev/en/stable/user-guide/computation.html)