# M1.4 - Reading MERRA-2 Data from netCDF4 Files

*Part of:* **M1: Open Climate Data**

**Now that we've seen how to access some climate data with NASA Earthdata Search, let's explore how to *use* the data in Python.**

The MERRA-2 file we downloaded has a file extension of `.nc4`. This indicates it is a type of file called netCDF4, or Network Common Data Form, version 4. We'll talk more about the netCDF4 format later. For now, you should know that you can open this type of file using a Python library called `xarray`.

## Introduction to `xarray`

`xarray` is a Python package designed for working with labeled, multi-dimensional gridded datasets, which makes it a natural fit for the netCDF4 file we downloaded.

![](./assets/xarray-dataset.png)

As shown in the figure above, a single `xarray.Dataset` can contain multiple variables like temperature, precipitation, latitude, and longitude. Each variable is stored as an array, specifically an `xarray.DataArray`. While latitude and longitude are constant over time, and are therefore represented as 2D arrays, variables like temperature and precipitation vary over both time and space, so they can be represented as 3D **data cubes.** The x, y, and time (t) **axes** (also called **dimensions**) can be used to subset the arrays to time periods or study areas of interest.

**We typically import `xarray` with a shorter name, to make it easier to use. Below, we also import the `pyplot` module from `matplotlib`.**

In [None]:
import xarray as xr
from matplotlib import pyplot

## Using a Downloaded netCDF4 File

We can open a netCDF4 file in `xarray` using the `open_dataset()` function.

In [None]:
ds = xr.open_dataset('/home/arthur.endsley/Downloads/MERRA2_400.statD_2d_slv_Nx.20230701.nc4')
ds

There is a lot to look at here. This dataset has:

- **Dimensions:** If you're working with map data, those data have at least two dimensions (e.g., latitude and longitude). If the map was generated from satellite data and the satellite contributes new observations every day, we can introduce a third dimension, time. These dimensions describe the shape of a **data cube** with three **axes:** longitude ("lon"), latitude ("lat"), and time.
- **Coordinates:** Coordinates are closely related to dimensions: they are the labels that give each position along an axis a real-world value. For longitude and latitude, these are the coordinates of the center of each pixel.
- **Data variables:** A netCDF4 file can contain different variables that are mapped on the same grid. For example, you might have both minimum and maximum daily temperature in the same file.
- **Indexes:** These are the lookup structures that `xarray` builds from the coordinates so it can select data by label. They mirror the coordinates, so we don't need to worry about them for now.
- **Attributes:** In addition to mapped data values, a netCDF4 file can contain **metadata** to help users understand the data. Metadata are recorded as attributes and describe things like the software version used to create the data or the author of the data.
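
To make these terms concrete, here is a minimal sketch that builds a tiny `xarray.Dataset` by hand. Everything in it is an assumption for illustration: the name `toy`, the 3-by-4 grid, and the temperature values are synthetic and are not taken from the MERRA-2 file.

```python
import numpy as np
import xarray as xr

# Synthetic one-day, 3 x 4 grid (illustrative values, not MERRA-2 output)
time = np.array(["2023-07-01"], dtype="datetime64[ns]")
lat = np.array([-10.0, 0.0, 10.0])
lon = np.array([-20.0, -10.0, 0.0, 10.0])
tmin_values = 280.0 + np.arange(12, dtype=float).reshape(1, 3, 4)

toy = xr.Dataset(
    # Data variables: (dimension names, values, per-variable attributes)
    data_vars={
        "T2MMIN": (("time", "lat", "lon"), tmin_values, {"units": "K"}),
        "T2MMAX": (("time", "lat", "lon"), tmin_values + 10.0, {"units": "K"}),
    },
    # Coordinates: the labels along each dimension
    coords={"time": time, "lat": lat, "lon": lon},
    # Attributes: dataset-level metadata
    attrs={"title": "Synthetic example dataset"},
)

print(dict(toy.sizes))      # dimension names and their lengths
print(list(toy.coords))     # coordinate names
print(list(toy.data_vars))  # data variable names
print(toy.attrs)            # dataset-level metadata
```

Printing `toy` on its own shows the same sections (Dimensions, Coordinates, Data variables, Attributes) that appear for the real file.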

Variables in an `xarray.Dataset` can be accessed like the keys of a Python dictionary.

In [None]:
ds['T2MMIN']

Each variable in a `Dataset` is a `DataArray` with an underlying array. The `"T2MMIN"` variable is a 3-dimensional array; we can verify the names and number of dimensions by accessing the `dims` property:

In [None]:
ds['T2MMIN'].dims

This indicates that `"T2MMIN"` values vary across time and two dimensions of space (latitude and longitude). Even though this dataset represents a single point in time, there is still a time dimension because the granule we downloaded is one of many, each representing a different time step.

**One of the things that makes netCDF4 files special is that they are able to store both data and metadata, or attributes.**

In [None]:
ds['T2MMIN'].attrs

Attributes can store vital information about data. For example, it would be hard to make use of temperature data if you didn't know the correct units for the data.

`xarray` brings some convenient built-in tools for analyzing our data, such as the ability to plot datasets.

In [None]:
ds['T2MMIN'].plot()

The underlying data arrays are just NumPy arrays, so whenever we want to work with NumPy directly, we can get the array through the `data` property:

In [None]:
ds['T2MMIN'].data

Again, the first axis of the array has a single element, `(1)`, because this file represents a single point in time.

In [None]:
ds['T2MMIN'].data.shape

---

## Working with `xarray` DataArrays

In [None]:
tmin = ds['T2MMIN']
tmin.attrs

As with NumPy arrays, we can treat a DataArray just like a number, making mathematical transformations of our data easy. For example, we might want to convert our minimum temperatures from degrees K to degrees C.

In [None]:
# Convert temperatures from deg K to deg C
tmin_c = tmin - 273.15

One thing to be aware of is that when we do this kind of operation, we lose the attributes of the original DataArray. This is because the old attributes may no longer apply; in fact, we already know the "units" of the old DataArray (degrees K) are no longer accurate.

In [None]:
tmin_c.attrs

We can assign attributes at any time, using a Python dictionary syntax.

In [None]:
tmin_c.attrs['units'] = 'degrees C'
tmin_c.attrs
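
If most of the original attributes still apply, an alternative is to ask `xarray` to carry them through the calculation with `xr.set_options(keep_attrs=True)` and then correct only the ones that changed. The sketch below uses a small synthetic array; the names `tmin_demo` and `tmin_demo_c` and the attribute values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the T2MMIN DataArray (illustrative values only)
tmin_demo = xr.DataArray(
    np.array([[270.0, 280.0], [290.0, 300.0]]),
    dims=("lat", "lon"),
    attrs={"units": "K", "long_name": "example minimum temperature"},
)

# Inside this block, arithmetic copies the attributes to the result
with xr.set_options(keep_attrs=True):
    tmin_demo_c = tmin_demo - 273.15

# The copied units are now stale, so they must be corrected explicitly
tmin_demo_c.attrs["units"] = "degrees C"
print(tmin_demo_c.attrs)
```

The trade-off is that copied attributes are not checked for you: anything the calculation invalidated, like the units here, still has to be updated by hand.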

What's the point of assigning new attributes? You should do this anytime you're going to save a Dataset or DataArray to an output file and share it with someone. Datasets and DataArrays have a method, `to_netcdf()`, that allows you to do just this.

In [None]:
tmin_c.to_netcdf('example.nc4')

---

## Subsetting Gridded Datasets Using `xarray`

Because the granule we downloaded covers a single day, there is only one 2D grid of temperatures in this dataset. That means our time dimension has a length of one.

In [None]:
tmin.shape

If we needed to subset our dataset to a specific time, we could use numeric indices, just like with a NumPy array.

In [None]:
tmin[0]

Similarly, if we wanted to get a time series of values at specific row-column coordinates, we could write:

In [None]:
# Get all values on the time axis for the position: row 50, column 100
tmin[:,50,100]
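
Positional indexing like this depends on remembering the order of the axes. A sketch of the same two selections written with `isel()`, which takes dimension names instead, is shown below. It uses a small synthetic array named `demo` (an assumption for illustration) so that it runs without the downloaded file.

```python
import numpy as np
import xarray as xr

# Synthetic cube: 2 time steps, 3 rows (lat), 4 columns (lon)
demo = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

# Same as demo[0]: the first time step, selected by dimension name
first_step = demo.isel(time=0)

# Same as demo[:, 1, 2]: all time steps at row 1, column 2
series = demo.isel(lat=1, lon=2)

print(first_step.shape)
print(series.values)
```

Because the dimensions are named, the selection keeps working even if a dataset stores its axes in a different order.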

**But we often don't know the exact row-column position(s) of locations we're interested in. How can we select values based on longitude and latitude, instead?**

We're in luck, because our DataArray has coordinates that describe where each data value is located.

In [None]:
ds.coords

We can *select* the value(s) at certain coordinates using the `sel()` method.

In [None]:
# 2-meter minimum temperature at the South Pole
ds['T2MMIN'].sel(lat = -90, lon = -180)

Note: In a different dataset, the latitude and longitude coordinates may have different names!

In [None]:
ds['T2MMIN'].sel(lat = -90, lon = -180).values

What's the minimum temperature in Algiers?

In [None]:
ds['T2MMIN'].sel(lat = 36.754, lon = 3.059)

What happened? The lookup fails because there is no exact match for the longitude and latitude pair we provided. If we examine our dataset's coordinates, we'll see that they only come in regularly spaced intervals: 0.625 degrees of longitude and 0.5 degrees of latitude.

In [None]:
ds['lon'].values[0:10]

In [None]:
ds['lat'].values[0:10]

To look up Algiers, we must specify a `method`, because the coordinates of Algiers don't exactly match the coordinates of any grid cell's center. With `method = 'nearest'`, we ask `sel()` for a nearest-neighbor lookup, which returns the value of the closest grid cell.

In [None]:
ds['T2MMIN'].sel(lat = 36.754, lon = 3.059, method = 'nearest').values
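
The behavior of exact versus nearest-neighbor lookup can be reproduced on a small synthetic grid. In the sketch below, the array `grid` and its values are assumptions for illustration; only the 0.5 by 0.625 degree spacing mimics the MERRA-2 grid. It also shows the optional `tolerance` argument, which rejects a "nearest" match that is farther away than we are willing to accept.

```python
import numpy as np
import xarray as xr

# Synthetic grid around Algiers with MERRA-2-like spacing (values are made up)
lat = np.arange(35.0, 38.5, 0.5)    # 35.0, 35.5, ..., 38.0
lon = np.arange(1.25, 5.0, 0.625)   # 1.25, 1.875, ..., 4.375
grid = xr.DataArray(
    np.arange(lat.size * lon.size, dtype=float).reshape(lat.size, lon.size),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
)

# An exact lookup fails because (36.754, 3.059) is not a grid cell center
exact_failed = False
try:
    grid.sel(lat=36.754, lon=3.059)
except KeyError:
    exact_failed = True

# A nearest-neighbor lookup returns the closest cell and its coordinates
cell = grid.sel(lat=36.754, lon=3.059, method="nearest")
print(float(cell["lat"]), float(cell["lon"]), float(cell))

# With a tolerance, a match farther than 0.01 degrees away is refused
too_far = False
try:
    grid.sel(lat=36.754, lon=3.059, method="nearest", tolerance=0.01)
except KeyError:
    too_far = True
```

Checking the returned `lat` and `lon` is a good habit: it tells you which grid cell actually answered the question.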

Another way we can get the answer we want is to use the `interp()` method. **Notice that the answer we get is slightly different from what we got above.** That's because the previous answer used a *nearest-neighbor* lookup, which returns the value of a single grid cell, whereas `interp()` uses *linear interpolation* by default, which blends the values of the surrounding grid cells. `interp()` also supports other interpolation methods.

In [None]:
ds['T2MMIN'].interp(lat = 36.754, lon = 3.059)
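
To see why the two answers differ, the sketch below works through linear interpolation in two dimensions (bilinear interpolation) by hand, using plain NumPy arithmetic on the four grid cells that surround the target point. The four temperature values are invented for illustration, and this is a conceptual sketch rather than the code `interp()` runs internally.

```python
import numpy as np
import xarray as xr

# Four synthetic cell centers surrounding Algiers (temperatures are made up)
lat = np.array([36.5, 37.0])
lon = np.array([2.5, 3.125])
values = np.array([[290.0, 291.0],
                   [292.0, 294.0]])
cells = xr.DataArray(values, coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

target_lat, target_lon = 36.754, 3.059

# Nearest neighbor: take the value of the single closest cell
nearest = float(cells.sel(lat=target_lat, lon=target_lon, method="nearest"))

# Bilinear: weight the four cells by how close the target is to each
wy = (target_lat - lat[0]) / (lat[1] - lat[0])  # 0 at south row, 1 at north row
wx = (target_lon - lon[0]) / (lon[1] - lon[0])  # 0 at west column, 1 at east column
south = (1 - wx) * values[0, 0] + wx * values[0, 1]
north = (1 - wx) * values[1, 0] + wx * values[1, 1]
bilinear = (1 - wy) * south + wy * north

print(nearest, round(bilinear, 3))
```

The nearest-neighbor answer is exactly one of the four cell values, while the bilinear answer falls between them. On a regular grid like this one, `cells.interp(lat=target_lat, lon=target_lon)` should agree with the hand-computed value.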

### Slicing Arrays

What if we want to see an area of interest that extends beyond a single longitude-latitude grid cell? We can retrieve a rectangular subset of an array using Python's built-in `slice()` function along with the `xarray` `sel()` function.

In [None]:
aoi = ds['T2MMIN'].sel(lon = slice(-50, 50), lat = slice(-50, 50))
aoi

In [None]:
aoi.plot()
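
Two details of label-based slicing are worth knowing: both endpoints are included, and the order of the slice bounds has to follow the order of the coordinate. The sketch below illustrates both on a synthetic array of zeros (named `field` here, an assumption for illustration) laid out on a 0.5 by 0.625 degree global grid.

```python
import numpy as np
import xarray as xr

# Synthetic global grid with MERRA-2-like spacing; the data are all zeros
lat = np.arange(-90.0, 90.5, 0.5)      # ascending, 361 values
lon = np.arange(-180.0, 180.0, 0.625)  # ascending, 576 values
field = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
)

# Label-based slices include both endpoints (-50 and 50 are both kept)
box = field.sel(lat=slice(-50, 50), lon=slice(-50, 50))
print(box.shape)  # (201, 161)

# Some datasets store latitude from north to south (descending)
flipped = field.isel(lat=slice(None, None, -1))

# With descending latitude, the same slice selects nothing...
empty = flipped.sel(lat=slice(-50, 50))
# ...so the bounds must be given in the coordinate's own order
fixed = flipped.sel(lat=slice(50, -50))
print(empty.sizes["lat"], fixed.sizes["lat"])
```

If a slice unexpectedly returns an empty array, checking whether the coordinate is ascending or descending is a good first step.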