# Reading MERRA-2 Data from a Downloaded netCDF4 File

In [None]:
import xarray as xr
from matplotlib import pyplot

We'll use a Python package called `xarray` to open the netCDF4 file we downloaded. `xarray` is designed for working with multi-dimensional gridded datasets.

![](./assets/xarray-dataset.png)

As shown in the figure above, a single `xarray.Dataset` can contain multiple variables like temperature, precipitation, latitude, and longitude. Each variable is stored as an array, specifically an `xarray.DataArray`. While latitude and longitude are constant over time, and are therefore represented as 2D arrays, variables like temperature and precipitation vary over both time and space, so they can be represented as 3D **data cubes.** The x, y, and time (t) **axes** (also called **dimensions**) can be used to subset the arrays to time periods or study areas of interest.
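
To make the structure concrete, here is a minimal sketch that builds a tiny `xarray.Dataset` by hand. The variable names, grid values, and random data are all hypothetical and are not taken from MERRA-2. Here latitude and longitude are stored as 1D coordinate vectors, as is common on a regular grid.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coarse grid: 2 days, 3 latitudes, 4 longitudes
time = pd.date_range("2023-07-01", periods=2, freq="D")
lat = np.array([-45.0, 0.0, 45.0])
lon = np.array([-90.0, 0.0, 90.0, 180.0])

rng = np.random.default_rng(seed=0)
toy = xr.Dataset(
    data_vars={
        # Each variable is a 3D "data cube" with (time, lat, lon) axes
        "temperature": (("time", "lat", "lon"), 280 + 10 * rng.random((2, 3, 4))),
        "precipitation": (("time", "lat", "lon"), rng.random((2, 3, 4))),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

print(toy.sizes)                 # length of each dimension
print(toy["temperature"].dims)   # axis names of one variable
```

Each entry in `data_vars` becomes an `xarray.DataArray` that shares the dataset's coordinates.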

## Using a Downloaded netCDF4 File

We can open a netCDF4 file in `xarray` using the `open_dataset()` function.

In [None]:
ds = xr.open_dataset('/home/arthur.endsley/Downloads/MERRA2_400.statD_2d_slv_Nx.20230701.nc4')
ds

We can see that this dataset has three **dimensions** or axes: longitude ("lon"), latitude ("lat"), and time. That means each variable in the dataset should have these three dimensions.

Variables in an `xarray.Dataset` can be accessed like the keys of a Python dictionary.

In [None]:
ds['T2MMIN']
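
The dictionary analogy extends a little further. The sketch below uses a small stand-in dataset (the values are made up, and `T2MMAX` is included only for illustration) to list variable names and test for membership before accessing a variable.

```python
import numpy as np
import xarray as xr

# Stand-in dataset with made-up values
toy = xr.Dataset(
    {
        "T2MMIN": (("lat", "lon"), np.full((2, 3), 250.0)),
        "T2MMAX": (("lat", "lon"), np.full((2, 3), 290.0)),
    },
    coords={"lat": [-0.5, 0.5], "lon": [0.0, 0.625, 1.25]},
)

print(list(toy.data_vars))   # names of the data variables
print("T2MMIN" in toy)       # membership test, as with a dict
tmin_toy = toy["T2MMIN"]     # key-style access returns a DataArray
```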

**One of the things that makes netCDF4 files special is that they are able to store both data and metadata, or attributes.**

In [None]:
ds['T2MMIN'].attrs

Attributes can store vital information about data. For example, it would be hard to make use of temperature data if you didn't know the correct units for the data.
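
Because `attrs` behaves like an ordinary Python dictionary, individual entries can be read with the usual dictionary methods. The following is a small sketch; the attribute values are illustrative and may not match what the MERRA-2 file actually contains.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([[250.0, 260.0]]),
    dims=("lat", "lon"),
    # Illustrative metadata, not copied from the real file
    attrs={"long_name": "2-meter minimum air temperature", "units": "K"},
)

units = da.attrs.get("units", "unknown")                  # present in attrs
missing = da.attrs.get("standard_name", "not recorded")   # falls back to the default
print(units, "|", missing)
```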

`xarray` brings some convenient built-in tools for analyzing our data, such as the ability to plot datasets.

In [None]:
ds['T2MMIN'].plot()

The underlying data are stored as NumPy arrays, so if we ever want to work with a NumPy array directly, we can use the `data` attribute.

In [None]:
ds['T2MMIN'].data

In [None]:
ds['T2MMIN'].data.shape
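
As a small sketch on a made-up array: the `values` attribute always returns a NumPy array, which can then be passed to any NumPy function. For an in-memory array like this one, `data` and `values` return the same thing.

```python
import numpy as np
import xarray as xr

# Made-up (time, lat, lon) cube holding the values 0 through 5
da = xr.DataArray(np.arange(6.0).reshape(1, 2, 3), dims=("time", "lat", "lon"))

raw = da.values               # plain numpy.ndarray, no labels or attributes
print(type(raw), raw.shape)
print(raw.mean())             # ordinary NumPy methods apply
```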

---

## Working with `xarray` DataArrays

In [None]:
tmin = ds['T2MMIN']
tmin.attrs

As with NumPy arrays, we can treat a DataArray just like a number, making mathematical transformations of our data easy. For example, we might want to convert our minimum temperatures from degrees K to degrees C.

In [None]:
# Convert temperatures from deg K to deg C
tmin_c = tmin - 273.15

One thing to be aware of is that when we do this kind of operation, we lose the attributes of the original DataArray. This is because the old attributes may no longer apply; in fact, we already know the "units" of the old DataArray (degrees K) are no longer accurate.

In [None]:
tmin_c.attrs

We can assign attributes at any time, using a Python dictionary syntax.

In [None]:
tmin_c.attrs['units'] = 'degrees C'
tmin_c.attrs
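
If the other attributes should carry over, one option is to ask `xarray` to keep them during the calculation and then correct only the entries that changed. This is a sketch on a made-up array; the attribute values are illustrative.

```python
import numpy as np
import xarray as xr

tmin_k = xr.DataArray(
    np.array([250.0, 273.15, 300.0]),
    dims="lon",
    attrs={"long_name": "minimum temperature", "units": "K"},  # illustrative
)

# Explicitly keep attributes for operations inside this block
with xr.set_options(keep_attrs=True):
    tmin_degc = tmin_k - 273.15

# The copied "units" entry is now stale, so overwrite it
tmin_degc.attrs["units"] = "degrees C"
print(tmin_degc.attrs)
```

Keeping attributes is convenient, but it puts the burden on us to update any entry, such as the units, that the calculation has invalidated.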

---

## Subsetting Gridded Datasets Using `xarray`

Because we downloaded a single day of daily MERRA-2 data, there is only one 2D grid of temperatures in this dataset. That means our time dimension has a length of one.

In [None]:
tmin.shape

If we needed to subset our dataset to a specific time, we could use numeric indices, just like with a NumPy array.

In [None]:
tmin[0]

Similarly, if we wanted to get a time series of values at specific row-column coordinates, we could write:

In [None]:
# Get all values on the time axis for the position: row 50, column 100
tmin[:,50,100]
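
Positional indexing depends on remembering the order of the axes. The `isel()` method does the same integer-based selection but names each dimension explicitly. A sketch on a small made-up cube:

```python
import numpy as np
import xarray as xr

# Made-up cube with 2 time steps, 3 rows (lat), and 4 columns (lon)
cube = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

first_day = cube.isel(time=0)        # equivalent to cube[0]
series = cube.isel(lat=1, lon=2)     # equivalent to cube[:, 1, 2]

print(first_day.shape)
print(series.values)
```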

**But we often don't know the exact row-column position(s) of locations we're interested in. How can we select values based on longitude and latitude, instead?**

In [None]:
# 2-meter minimum temperature at the South Pole
ds['T2MMIN'].sel(lat = -90, lon = -180)

Note: In a different dataset, the latitude and longitude coordinates may have different names!
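
For example, a dataset might call its coordinates `latitude` and `longitude`. The sketch below uses a hypothetical dataset (the variable name `t2m` and all values are made up) to show how to check the coordinate names and, if desired, rename them.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset whose coordinates use longer names
other = xr.Dataset(
    {"t2m": (("latitude", "longitude"), np.zeros((2, 3)))},
    coords={"latitude": [-0.5, 0.5], "longitude": [0.0, 0.625, 1.25]},
)

print(list(other.coords))   # inspect the names before calling sel()

# Optionally rename them so the same selection code works everywhere
renamed = other.rename({"latitude": "lat", "longitude": "lon"})
value = renamed["t2m"].sel(lat=0.5, lon=0.625)
```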

In [None]:
ds['T2MMIN'].sel(lat = -90, lon = -180).values

What's the minimum temperature in Algiers?

In [None]:
ds['T2MMIN'].sel(lat = 36.754, lon = 3.059)

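What happened? By default, `sel()` requires an exact match, so it raises a `KeyError`. If we examine our dataset's coordinates, we'll see that there is no exact match for the longitude and latitude pair we provided; the coordinates come only at regularly spaced intervals (on the MERRA-2 grid, 0.625 degrees of longitude and 0.5 degrees of latitude).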

In [None]:
ds['lon'].values[0:10]

In [None]:
ds['lat'].values[0:10]

Note that we must specify a `method` here because the coordinates of Algiers don't exactly match the coordinates of each grid cell's center; i.e., we must ask for a nearest-neighbor interpolation.

In [None]:
ds['T2MMIN'].sel(lat = 36.754, lon = 3.059, method = 'nearest').values
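
It can be useful to check which grid cell was actually chosen. The sketch below builds an empty grid, assuming a spacing of 0.5 degrees in latitude and 0.625 degrees in longitude to mimic the MERRA-2 layout, and reads the coordinates off the selected cell.

```python
import numpy as np
import xarray as xr

# Assumed grid spacing: 0.5 degrees latitude, 0.625 degrees longitude
lat = np.arange(-90, 90.5, 0.5)
lon = np.arange(-180, 180, 0.625)
grid = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
)

cell = grid.sel(lat=36.754, lon=3.059, method="nearest")

# The result carries the coordinates of the matched grid cell
print(float(cell["lat"]), float(cell["lon"]))   # 37.0 3.125
```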

Another way to get the answer we want is to use the `interp()` method. **Notice that the answer we get is slightly different from what we got above.** That's because the previous answer used *nearest-neighbor interpolation*, whereas `interp()` uses *linear interpolation* by default. Other interpolation schemes can be selected through its `method` argument.

In [None]:
ds['T2MMIN'].interp(lat = 36.754, lon = 3.059)
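
To see why the two answers differ, consider a one-dimensional sketch with made-up temperatures along a few latitudes. Note that it uses NumPy's `np.interp()` as a stand-in for the linear case rather than `xarray`'s own `interp()`, so it only illustrates the idea.

```python
import numpy as np
import xarray as xr

# Made-up temperatures (K) at four latitudes
lat = np.array([36.0, 36.5, 37.0, 37.5])
temps = xr.DataArray(
    [280.0, 282.0, 286.0, 288.0], coords={"lat": lat}, dims="lat"
)

target = 36.754

# Nearest neighbor: take the value at the closest grid point (37.0)
nearest = float(temps.sel(lat=target, method="nearest"))

# Linear: weight the two surrounding grid points (36.5 and 37.0) by distance
linear = float(np.interp(target, lat, temps.values))

print(nearest, linear)
```

The nearest-neighbor result is exactly one of the gridded values, while the linear result falls between the two neighboring values.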

### Slicing Arrays

What if we want to see an area of interest that extends beyond a single longitude-latitude grid cell? We can retrieve a rectangular subset of an array using Python's built-in `slice()` function along with the `xarray` `sel()` function.

In [None]:
aoi = ds['T2MMIN'].sel(lon = slice(-50, 50), lat = slice(-50, 50))
aoi

In [None]:
aoi.plot()
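
Two details of label-based slicing are worth knowing: both endpoints are included, and the bounds should follow the order of the coordinate (ascending here). The sketch below checks the size of the same subset on an empty grid, assuming a spacing of 0.5 degrees in latitude and 0.625 degrees in longitude.

```python
import numpy as np
import xarray as xr

# Assumed grid spacing: 0.5 degrees latitude, 0.625 degrees longitude
lat = np.arange(-90, 90.5, 0.5)
lon = np.arange(-180, 180, 0.625)
grid = xr.DataArray(
    np.zeros((1, lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

box = grid.sel(lon=slice(-50, 50), lat=slice(-50, 50))

print(box.shape)   # (1, 201, 161)
print(float(box["lat"].min()), float(box["lat"].max()))   # -50.0 50.0
```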