# **W1D1: Introduction to the Climate System and Xarray**

## Overview

The first three tutorials of today will introduce the basics of gridded, labeled data with [Xarray](https://xarray.dev/). Since Xarray introduces additional abstractions on top of plain arrays of data, our goal is to show why these abstractions are useful and how they frequently lead to simpler, more robust code.

We'll cover these topics:

1. Create a `DataArray`, one of the core object types in Xarray
1. Understand how to use named coordinates and metadata in a `DataArray`
1. Combine individual `DataArrays` into a `Dataset`, the other core object type in Xarray
1. Subset, slice, and interpolate the data using named coordinates
1. Open netCDF data using Xarray
1. Basic subsetting and aggregation of a `Dataset`
1. Brief introduction to plotting with Xarray

# **Tutorial 2: Selection, Interpolation and Slicing**

**Week 1, Day 1, Introduction to the Climate System**

**Content creators:** Sloane Garelick, Julia Kent

**Content reviewers:** Katrina Dobson, Danika Gupta, Maria Gonzalez, Will Gregory, Nahid Hasan, Sherry Mi, Beatriz Cosenza Muralles, Ohad Zivan

**Content editors:** Agustina Pesce

**Production editors:** Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan

**Our 2023 Sponsors:** NASA TOPS







### **Code and Data Sources**

Code and data for this tutorial are based on existing content from [Project Pythia](https://foundations.projectpythia.org/core/xarray/xarray-intro.html).

# **Tutorial Objectives**
In the previous tutorial, we learned how to use Xarray to create DataArray and Dataset objects to help us organize large climate datasets. Global climate datasets can be very large and contain multiple variables, so DataArrays and Datasets are very useful tools for organizing, comparing, and interpreting such data. But what if we want to examine data from a specific time or location, rather than the entire global dataset? For example, to assess the effect of insolation on regional temperature, we might want to compare the average incoming solar radiation in the tropics and at the poles with the average annual temperature in those same regions. To carry out such analyses, it's useful to be able to extract and compare subsets of data from a larger global dataset.

In this tutorial, we will explore multiple computational tools in Xarray that allow us to select data from a specific spatial and temporal range. In particular, we will practice using:


*   **`.sel()`:** select data based on coordinate values or date
*   **`.interp()`:** interpolate to any latitude/longitude location to extract data
*   **`slice()`:** to select a range (or slice) along one or more coordinates, we can pass a Python slice object to `.sel()`


In [2]:
# @title Video 1: Speaker Introduction
#Tech team will add code to format and display the video

# Setup

In [1]:
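# datetime is part of the Python standard library and needs no installation

!pip install numpy
!pip install pandas
!pip install xarray
!pip install scipy  # used by Xarray's .interp() method
!pip install pythia_datasets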



In [2]:
from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr
from pythia_datasets import DATASETS

To explore these Xarray tools, we'll recreate the temperature and pressure DataArrays that we generated in the previous tutorial, and combine these two DataArrays into a Dataset.

In [3]:
# @title Create the temperature and pressure Xarray Dataset we made in Tutorial 1

#Temperature data
data = 283 + 5 * np.random.randn(5, 3, 4)
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
times = pd.date_range('2018-01-01', periods=5)
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

#Pressure data
pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
    pressure_data, coords=[times, lats, lons], dims=['time', 'lat', 'lon']
)
pressure.attrs['units'] = 'hPa'
pressure.attrs['standard_name'] = 'air_pressure'

#Combine temperature and pressure DataArrays into a Dataset
ds = xr.Dataset(data_vars={'Temperature': temp, 'Pressure': pressure})

To refresh our memory from the previous tutorial, let's look at the DataArrays we created for temperature and pressure.

In [None]:
#Print the temperature DataArray
temp

In [None]:
#Print the pressure DataArray
pressure

In [5]:
# @title Figure Settings
import ipywidgets as widgets       # interactive display
import matplotlib.pyplot as plt    # provides `plt` for the style setting below
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle")


# Section 1: Subsetting and selection by coordinate values

Since Xarray allows us to label coordinates, we are able to easily select data based on coordinate names and values, rather than array indices. We'll explore this briefly here. 


## Section 1.1: NumPy-like selection

Suppose we want to extract all the spatial data for one single date: January 2, 2018. It's possible to achieve that with NumPy-like index selection:

In [4]:
indexed_selection = temp[1, :, :]  # Index 1 along axis 0 is the time slice we want...
indexed_selection

However, notice that this requires us (the user) to have detailed knowledge of the order of the axes and the meaning of the indices along those axes. By having named coordinates in Xarray, we can avoid this issue.
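As a small sketch of the point above, the cell below compares positional indexing with Xarray's `.isel()` method, which selects by integer position but lets us name the dimension. `.isel()` is not covered in this tutorial, and the seeded random data is only a stand-in for the `temp` DataArray created earlier.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Positional indexing: we must remember that axis 0 is time
by_position = temp[1, :, :]

# Integer indexing by dimension name: axis order no longer matters
by_name = temp.isel(time=1)

print(by_position.identical(by_name))
print(temp.get_axis_num('time'))  # the axis number we had to know above
```

Both selections still rely on the integer index `1`, so we need to know which date that index corresponds to.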

## Section 1.2: `.sel()`

Rather than using NumPy-like index selection, in Xarray we can select data based on coordinate values using the `.sel()` method, which takes one or more named coordinates as keyword arguments:

In [13]:
named_selection = temp.sel(time='2018-01-02')
named_selection

We got the same result as when we used the NumPy-like index selection, but 
- we didn't have to know anything about how the array was created or stored
- our code is agnostic about how many dimensions we are dealing with
- the intended meaning of our code is much clearer!
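The same call also works on a `Dataset`, where it is applied to every variable that has the named coordinate. The sketch below assumes a Dataset built like the `ds` from the setup cell, filled here with seeded random stand-in values.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `ds` Dataset (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
coords = [times, lats, lons]
dims = ['time', 'lat', 'lon']
temp = xr.DataArray(283 + 5 * rng.standard_normal((5, 3, 4)), coords=coords, dims=dims)
pressure = xr.DataArray(1000 + 5 * rng.standard_normal((5, 3, 4)), coords=coords, dims=dims)
ds = xr.Dataset(data_vars={'Temperature': temp, 'Pressure': pressure})

# One call selects the same date from both variables
day = ds.sel(time='2018-01-02')

print(day['Temperature'].dims)
print(day['Pressure'].dims)
```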

By using the `.sel()` method in Xarray, we can easily isolate data from a specific time. You can also isolate data from a specific location. Try writing a line of code to select the temperature data at the coordinates 25°N, 120°W. Because the `lat` and `lon` coordinates are stored as numbers, pass numeric values rather than strings. For example, you could achieve this with the following code:

`coordinate_selection = temp.sel(lat=25.0, lon=-120.0)`
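To see what a point selection returns, here is a minimal sketch using seeded random stand-in data in place of the tutorial's `temp`. Selecting a single latitude and longitude leaves a one-dimensional time series. Adding a date as well leaves a single value.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Fix latitude and longitude: only the time dimension remains
point_series = temp.sel(lat=25.0, lon=-120.0)
print(point_series.dims, point_series.shape)

# Fix all three coordinates: a zero-dimensional DataArray remains
single_value = temp.sel(lat=25.0, lon=-120.0, time='2018-01-02')
print(single_value.item())  # plain Python float
```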

## Section 1.3: Approximate selection and interpolation

The spatial and temporal resolution of climate data often differs between datasets, or a dataset may be incomplete. Therefore, with time and space data, we frequently want to sample "near" the coordinate points in our dataset. For example, we may want to analyze data from a specific coordinate or a specific time, but may not have a value for that exact location or date. In that case, we would want to use the data from the closest coordinate or time step. Here are a few simple ways to achieve that.

### Section 1.3.1: Nearest-neighbor sampling

Suppose we want to know the temperature from `2018-01-07`. However, the last day on our `time` axis is `2018-01-05`. We can therefore sample within two days of our desired date of `2018-01-07`. We can do this using the `.sel()` method we used earlier, with the added flexibility of nearest-neighbor sampling and an optional tolerance:

In [14]:
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))

Notice that the resulting data is from the date `2018-01-05`.
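The tolerance acts as a hard limit. The sketch below, which uses seeded random stand-in data for `temp`, repeats the selection with a one-day tolerance. As far as we know, Xarray raises a `KeyError` when no coordinate value falls within the tolerance, so the example catches that error.

```python
from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Two-day tolerance: 2018-01-05 is close enough to 2018-01-07
nearest = temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))
print(nearest.time.values)

# One-day tolerance: no date on the time axis is close enough
no_match = False
try:
    temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=1))
except KeyError as err:
    no_match = True
    print('No date within one day:', err)
```

Setting a tolerance is a useful safeguard: without it, `method='nearest'` returns the closest value no matter how far away it is.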

### Section 1.3.2: Interpolation

The latitude values of our dataset are 25°N, 40°N, 55°N, and the longitude values are 120°W, 100°W, 80°W, 60°W. But suppose we want to extract a timeseries for Boulder, Colorado, USA (40°N, 105°W). Since `lon=-105` is _not_ a point on our longitude axis, this requires interpolation between data points.

We can do this using the `.interp()` method (see the docs [here](http://xarray.pydata.org/en/stable/interpolation.html)), which works similarly to `.sel()`. Using `.interp()`, we can interpolate to any latitude/longitude location:

In [7]:
temp.interp(lon=-105, lat=40, method='linear')

In this case, we specified a linear interpolation method, but other methods are available as well (e.g., nearest, cubic, quadratic). Note that the temperature values we extracted in the code cell above are not actual values in the dataset, but are instead calculated by linear interpolation between values that are in the dataset.
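To make the linear interpolation concrete, the sketch below reproduces it by hand on seeded random stand-in data, assuming SciPy is installed for `.interp()`. Since 40°N lies on the latitude axis, only longitude needs interpolating: 105°W sits three quarters of the way from 120°W to 100°W, so the result is a weighted average with weights 0.25 and 0.75.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Interpolation done by Xarray
interpolated = temp.interp(lon=-105, lat=40, method='linear')

# The same calculation by hand, from the two neighboring grid points
west = temp.sel(lat=40, lon=-120)
east = temp.sel(lat=40, lon=-100)
weight_east = (-105 - (-120)) / (-100 - (-120))  # 0.75
by_hand = (1 - weight_east) * west + weight_east * east

print(np.allclose(interpolated.values, by_hand.values))
```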

## Section 1.4: Slicing along coordinates

Frequently we want to select a range (or _slice_) along one or more coordinate(s). For example, we may want to assess average annual temperatures only in equatorial regions. We can achieve this by passing a Python [slice](https://docs.python.org/3/library/functions.html#slice) object to `.sel()`. The calling sequence for `slice` always looks like `slice(start, stop[, step])`, where `step` is optional. In this case, let's look only at values from January 1 to January 3, 2018, between 110°W and 70°W and between 25°N and 45°N:

In [8]:
temp.sel(
    time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45)
)
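One detail worth checking is which grid points a label-based slice keeps. Unlike positional slicing in Python, slicing by label with `.sel()` includes both endpoints when they exist on the coordinate. The sketch below uses seeded random stand-in data for `temp` and prints the coordinates that survive the slice above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

subset = temp.sel(
    time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45)
)

print(subset.shape)       # three dates, two latitudes, two longitudes
print(subset.lat.values)  # 25 is kept: the start label is included
print(subset.lon.values)  # only grid points inside the range are kept
print(subset.time.values[-1])  # 2018-01-03 is kept: the stop label is included
```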

Try changing the code above to slice along a different range of coordinates!

## Section 1.5: One more selection method: `.loc`

All of these operations can also be done within square brackets on the `.loc` attribute of the `DataArray`:


In [9]:
temp.loc['2018-01-02']

This is sort of in between the NumPy-style selection
```
temp[1,:,:]
```
and the fully label-based selection using `.sel()`.

With `.loc`, we make use of the coordinate *values*, but lose the ability to specify the *names* of the various dimensions. Instead, the slicing must be done in the correct order:

In [10]:
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

One advantage of using `.loc` is that we can use NumPy-style slice notation like `25:45`, rather than the more verbose `slice(25,45)`. But of course that also works:

In [11]:
temp.loc['2018-01-01':'2018-01-03', slice(25, 45), -110:-70]

What *doesn't* work is passing the slices in a different order:

In [12]:
# This will generate an error
# temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']
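If the ordering requirement is a concern, `.loc` also accepts a dictionary keyed by dimension name, which makes the order irrelevant again. The sketch below uses seeded random stand-in data for `temp` and checks that both forms return the same result.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the tutorial's `temp` DataArray (seeded random values)
rng = np.random.default_rng(0)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=['time', 'lat', 'lon'],
)

# Positional order matters: time, then lat, then lon
by_order = temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

# With a dictionary, the dimensions can be listed in any order
by_name = temp.loc[
    dict(lon=slice(-110, -70), lat=slice(25, 45), time=slice('2018-01-01', '2018-01-03'))
]

print(by_order.equals(by_name))
```

The dictionary form is equivalent to calling `.sel()` with the same keyword arguments, so in practice `.sel()` is usually the more readable choice.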

# Summary

In this tutorial, we have explored the practical use of **`.sel()`**, **`.interp()`**, **`.loc`**, and **slicing** techniques to extract data from specific spatial and temporal ranges. These methods are valuable for comparing subsets of data from larger datasets.