![xarray_logo](https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png)

# Introduction I

&copy; Part of **_DKRZ Python Course for Geoscientists_**, licensed by DKRZ under **CC BY-NC-ND 4.0**

Xarray home page: https://xarray.pydata.org/en/stable/index.html <br>
Xarray documentation: https://docs.xarray.dev/en/stable/index.html


**Xarray** is a Python package which allows us to handle multi-dimensional datasets in a simple way. It provides a large set of functions for advanced analytics and visualization. It is part of higher-level package ecosystems like [Pangeo](https://pangeo.io/).

**Xarray’s** underlying data model is borrowed from the data format [netCDF](http://www.unidata.ucar.edu/software/netcdf). This data format in combination with the [Climate and Forecast metadata conventions](https://cfconventions.org/) (CF) is the standard for the climate science community. A large part of DKRZ’s data is available in netCDF. Therefore, **Xarray** allows fast and intuitive data analysis on this kind of data, but file formats like GRIB, HDF5, and Zarr can also be used.

**Xarray's** data structures handle scientific data by using **labels**, **attributes**, **dimensions** and **coordinates**, and extend the capabilities of **NumPy** and **Pandas**.


## <u>Overview</u>: 

### Xarray's data model

A **data model** describes how the elements of data are organized and standardizes how they relate to one another. At the code level, a graph of a data model shows the interconnections of classes, types and methods. **Xarray's** data model consists of the classes **Dataset** and **DataArray** and their building blocks **Dimensions**, **Coordinates** and **Attributes**.

----

**Dataset** ( dataset or file ): 

    Dict-like collection of DataArray objects with aligned dimensions. Variables, dimensions, coordinates, and attributes are used in the same way as for a DataArray. You can think of an xarray Dataset as a netCDF-file-like object. It holds no data itself, only references to its DataArrays.

----

**DataArray** ( data array or variable in a file ): 

    N-dimensional array with dimensions. The objects add dimension names, coordinates, and attributes to the underlying data structure (NumPy and Dask arrays).

----

**Dimensions**: 

    Named dimension axes. If no names are given, the dimensions are called dim_0, dim_1, ...

----

**Coordinates**: 

    An array which labels a dimension. Two types are defined:
    a) Dimension coordinate - a 1-dimensional coordinate array whose name is the same as the name of the dimension it labels.
    b) Non-dimension coordinate - a coordinate array whose name differs from the names of its dimensions.

----

**Attributes**: 

    Xarray allows you to attach metadata and attributes to both DataArrays and Datasets. 
    Metadata can include information about units, descriptions, and any other relevant information about the data.

----
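
As a minimal sketch of how these pieces fit together, the following cell builds a small DataArray and inspects its dimensions, coordinates, and attributes. The names and values used here (`tas`, the 2 x 3 grid, the scalar `height` coordinate) are made up for illustration only.

```python
import numpy as np
import xarray as xr

# hypothetical 2 x 3 temperature field
tas = xr.DataArray(
    np.array([[280.0, 281.0, 282.0],
              [283.0, 284.0, 285.0]]),
    dims=("lat", "lon"),
    coords={
        "lat": [50.0, 55.0],        # dimension coordinate (same name as the dimension)
        "lon": [0.0, 5.0, 10.0],    # dimension coordinate
        "height": 2.0,              # non-dimension (scalar) coordinate
    },
    name="tas",
    attrs={"units": "K"},
)

print(tas.dims)           # names of the dimensions
print(list(tas.coords))   # names of all coordinates
print(tas.attrs)          # metadata dictionary

# a Dataset is a dict-like collection of DataArrays
ds_demo = tas.to_dataset()
print(list(ds_demo.data_vars))
```

Both `lat` and `lon` share their name with a dimension, while `height` does not belong to any dimension and is therefore a non-dimension coordinate.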

<br>

<img src="https://storage.googleapis.com/jnl-up-j-jors-files/journals/1/articles/148/submission/proof/148-10-1829-1-17-20170405.png" alt="xarray data structure" border=1 width=900></img> 
<figcaption align = "center"> An overview of xarray’s main data structures. From Hoyer and Hamman (2017); DOI: 10.5334/jors.148 </figcaption>
<br>

### N-dimensional arrays

- a 1-dimensional array is of shape(n,)
- a 2-dimensional array is of shape(n,m)
- a 3-dimensional array is of shape(n,m,k)
- a 4-dimensional array is of shape(n,m,k,l)

NumPy arrays are stored in **'row-major'** order by default, which means that the `left dimension varies slowest` and the `right dimension varies fastest`. That is why geo-referenced data often have the dimension order (time, level, lat, lon).
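
The memory layout can be checked directly with NumPy. The following sketch uses a made-up array with the dimension order (time, lat, lon) and looks at the flattened order and the strides of the array.

```python
import numpy as np

# hypothetical array with dimension order (time, lat, lon)
a = np.arange(24).reshape(2, 3, 4)

# flat (memory) order: neighbouring elements differ in the last index
print(a.ravel()[:5])

# strides in bytes: the step in memory for one step along each dimension
print(a.strides)

# number of elements to skip for one step along the last and the first dimension
step_last = a.strides[-1] // a.itemsize
step_first = a.strides[0] // a.itemsize
print(step_last, step_first)
```

A step along the rightmost dimension moves to the directly adjacent element in memory, while a step along the leftmost dimension skips a whole (lat, lon) slab.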

In [None]:
%%html
<table align="left">
    <tr><td><img src="../images/x_y_array_1d.png" alt="xy_1d" border=1 width=300> </img> 1-dimensional </td>
        <td><img src="../images/var_xy_array_2d.png" alt="var_2d" border=1 width=400>  2-dimensional </img> </td></tr>
    <tr><td><img src="../images/var_xyt_array_3d.png" alt="var_3d" border=1 width=400>  3-dimensional </img> </td>
        <td><img src="../images/var_tzyx_array_4d.png" alt="var_4d" border=1 width=400>  4-dimensional </img> </td></tr>
</table>

## Importing modules

In this notebook we work with the Python libraries NumPy, Pandas, Xarray and cfgrib. 


In [None]:
import xarray as xr
import numpy as np
import pandas as pd
import cfgrib
from datetime import datetime

## DataArray

The `DataArray` of **Xarray** is the implementation of a labeled multi-dimensional array.

To see what this means, we start with the creation of a simple DataArray that is based on a NumPy ndarray.

Create NumPy _ndarray_ with shape(4,5):

In [None]:
array = np.arange(1,21).reshape(4,5)
array

Now, we can use the function `xr.DataArray()` to create a DataArray from the NumPy array above.

In [None]:
da = xr.DataArray(array)
da

As you can see, `xr.DataArray()` adds two dimensions named **dim_0** and **dim_1** to the new data array structure. When the function `xr.DataArray()` is used, it returns a data object with some presettings like Coordinates, Indexes and Attributes. In our case these are empty because we did not declare them yet. You can either add them in the `xr.DataArray()` function call or afterwards.
Also, you can specify the names of the dimensions when creating the DataArray with `xr.DataArray` or afterwards using the `rename` method. Note: `rename` returns a new DataArray object.

In [None]:
da = da.rename({'dim_0':'y','dim_1':'x'})
da

In the next step we assign the arrays x and y which we want to use as coordinates for our DataArray.

In [None]:
x = np.arange(0., 21., 5.)
y = np.arange(0., 20., 5.)

print(x, y)

Xarray allows us to do the following steps within one `xr.DataArray()` call:

- the first dimension should be 'y' and the second 'x'
- use the same names as for dims for the coords
- assign values to the coords
- define the attribute 'standard_name', see https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html;   
  we assume that our DataArray represents a variable with the standard_name 'age_of_sea_ice' 

In [None]:
da = xr.DataArray(array, 
                  dims=('y','x'), 
                  coords={'y': y, 'x': x},
                  attrs={'standard_name':'age_of_sea_ice'})
da

It is also possible to name the DataArray itself, e.g. 'var'. You can set it when the DataArray is defined or you can add it later.

In [None]:
da = xr.DataArray(array, 
                  name='var',
                  dims=('y','x'), 
                  coords={'y': y, 'x': x},
                  attrs={'standard_name':'age_of_sea_ice'})
da

Change the DataArray name of an already existing DataArray to 'var2'.

In [None]:
da.name = 'var2'

#print(da)
da

<br>

To add another attribute to the DataArray use attrs, for instance set the units attribute.

In [None]:
da.attrs['units'] = 'year'

da

### Expand dimensions

You can add a dimension, e.g. time, to the already existing DataArray with `DataArray.expand_dims()`. In the next example, we add a time dimension of length 2 with values 1 and 2 to our DataArray.

In [None]:
time = [1,2]

da.expand_dims({'time':time}, axis=0)

The time dimension and its coordinate values are added to the DataArray, but as we can see the data itself is duplicated. Our input data **array** has the shape (4,5), which corresponds to (1,4,5) with a single time step, but the result has the shape (2,4,5). The _missing_ data for the second time step is copied from the first time step.

Note that the DataArray.expand_dims() just **returns** a DataArray with this new dimension, it does not replace it.

In [None]:
da

We therefore update our variable _da_ by assigning the returned DataArray.

In [None]:
da = da.expand_dims({'time':time}, axis=0)

da

To retrieve the shape of the DataArray use the `shape` or `sizes` property.

In [None]:
da.shape

In [None]:
da.sizes

The result shows that we now have two time steps and a 3-dimensional array with the dimensions time, y and x  while our input data, before expand_dims, was a 2-dimensional array. 
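
The behaviour of `expand_dims` described above can be verified with a small self-contained sketch. The array `field` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# hypothetical 2-dimensional field
field = xr.DataArray(np.arange(6).reshape(2, 3), dims=("y", "x"))

# expand_dims returns a new object; `field` itself stays 2-dimensional
field3d = field.expand_dims({"time": [1, 2]}, axis=0)

print(field.dims, field.shape)
print(field3d.dims, field3d.shape)

# both time steps contain the same values
same = bool((field3d.isel(time=0) == field3d.isel(time=1)).all())
print(same)
```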

<br>

----

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

Make yourself familiar with `xr.DataArray`

1. generate an Xarray DataArray
2. add some attributes, including a standard_name attribute
3. change the default dimension names and add coordinate values
4. create the same DataArray with just one call of xr.DataArray

In [None]:
# 1.


In [None]:
# 2. 


In [None]:
# 3. 


In [None]:
# 4.


<br>

#### Solution


In [None]:
# 1. generate an Xarray DataArray

np.random.seed(100000)
nt   = 5

data = xr.DataArray(np.random.random((nt,4,5)))

In [None]:
# 2. add some attributes, including a standard_name attribute

data.attrs['my_attr']       = 'my new attribute'
data.attrs['standard_name'] = 'fire_temperature'
data.attrs['creation_date'] = datetime.today().strftime('%Y-%m-%d')

In [None]:
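# 3. change the default dimension names and add coordinate values

data    = data.rename({'dim_0':'t', 'dim_1':'y', 'dim_2':'x'})
# example coordinate values, chosen freely
data    = data.assign_coords(t=np.arange(1, nt+1), y=[2, 4, 6, 8], x=[1, 2, 3, 4, 5])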

In [None]:
# 4. create the same DataArray with just one call of xr.DataArray

nt      = 5
np.random.seed(100000)
data    = np.random.random((nt,4,5)) * 2000

data_xr = xr.DataArray(data, 
                      dims=('index', 'axis_x','axis_y'), 
                      coords={'index': np.arange(1,nt+1), 
                             'axis_x': [2, 4, 6, 8], 
                             'axis_y': [1,2,3,4,5]},
                      attrs={'standard_name':'fire_temperature',
                             'units':'K', 
                             'comment': 'Random data min=0., max=2000.'})
#data_xr

<br>

----

## More about DataArrays

Let's first compare a NumPy array with an Xarray DataArray. You can directly convert a NumPy array into an Xarray DataArray by using it as input for Xarray's function `xr.DataArray`. We use the _atmosphere water vapor content_ data from the file `../data/prw.dat` by loading it with NumPy.


Show the first 5 lines of the ascii input file

In [None]:
!head -5 ../data/prw.dat

Read the columns with the (zero-based) indices 1 to 3 of the input file while skipping the header line

In [None]:
prw_data = np.loadtxt('../data/prw.dat', usecols=(1,2,3), skiprows=1)
prw_data

Convert the numpy array into an Xarray DataArray

In [None]:
prw_data_xr = xr.DataArray(prw_data)
prw_data_xr

In [None]:
prw_data_xr.attrs

`prw_data_xr` has more structure and descriptive information than `prw_data`. In contrast to the `NumPy` array, the `Xarray` DataArray can separate the variable of interest, `prw`, as a *data variable* from the *coordinate* variables, since the `Xarray` DataArray has the properties:

- **dimensions** with names               (`prw_data_xr.dims`)
- **coordinates** pointing to variables   (`prw_data_xr.coords`)
- **attributes**                          (`prw_data_xr.attrs`)


This information cannot be derived from the plain NumPy array when executing `xr.DataArray()`, so we provide it in the `xr.DataArray()` call via the function parameters (arguments + keyword arguments):

```python
xr.DataArray(data,
             coords=,
             dims=,
             name=,
             attrs=,
            )
```

<div class="alert alert-info">
    <b>Note:</b> When working with <b>xarray</b>, the arguments and keyword arguments for a function are <i>in general</i> very useful and important!
</div>

The configuration of coordinate values is not only important for `Xarray` but also other software tools since **labeled geospatial** information from coordinates is required, e.g. for

- **plotting**: mapping of data on a real world grid point
- **analysis**: routines e.g. calculating area **weighted means**
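
To illustrate why labeled coordinates matter for the analysis, here is a minimal sketch of an area-weighted mean. It assumes a made-up field on a regular latitude/longitude grid and uses the cosine of the latitude as an approximation of the relative grid cell area; the weights are only available because the `lat` coordinate is attached to the data.

```python
import numpy as np
import xarray as xr

# hypothetical field on a regular lat/lon grid, constant along longitude
lat = np.array([0.0, 30.0, 60.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])
values = np.array([[30.0], [20.0], [10.0]]) * np.ones((1, lon.size))

field = xr.DataArray(values, dims=("lat", "lon"), coords={"lat": lat, "lon": lon})

# grid cells shrink towards the poles; cos(latitude) is a common approximation
weights = np.cos(np.deg2rad(field.lat))
weights.name = "weights"

plain_mean = field.mean(("lat", "lon"))
weighted_mean = field.weighted(weights).mean(("lat", "lon"))

print(float(plain_mean), float(weighted_mean))
```

The weighted mean is larger than the plain mean here because the warm values near the equator represent a larger area than the cold values at 60 degrees latitude.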


### Parsing NumPy data with labels to xarray

Let's define a clear structure for the `xarray.DataArray()` for the NumPy data first:

1. The actual **data** for the data variable is in the third column of the NumPy array.
2. The **coords** are the first (lat) and second (lon) column of the NumPy array. They have the same length as the data column.
3. We have one dimension (**dims**) which refers to the **_station_**. It is an index which runs from 0 to the length of a column minus 1.
4. The **name** of the data variable is **prw**.
5. In the **attrs**, we can store variable attributes like **_units_**. The **standard_name** of prw is **_atmosphere_mass_content_of_water_vapor_**; the corresponding canonical unit is **_kg m-2_**.

Let's bring that into context with `xr.DataArray()`:

In [None]:
prw_data_xr = xr.DataArray(prw_data[:,2],
                           coords={"lat":("Station", prw_data[:,0]),
                                   "lon":("Station", prw_data[:,1])},
                           dims=["Station"],
                           name="prw",
                           attrs={"units":"kg m-2",
                                  "standard_name":"atmosphere_mass_content_of_water_vapor"})
prw_data_xr

In [None]:
print("Variable Name: ", prw_data_xr.name)
print("Dimensions:    ", prw_data_xr.dims)
print("Coordinates:   ", prw_data_xr.coords)
print("Sizes:         ", prw_data_xr.sizes)
print("Attribute:     ", prw_data_xr.attrs)

### Dimensions

Dimensions are **indices** which run from 0 to the length of the dimension minus 1.

In our example, we only have one dimension where each index refers to one **station**. However, if we create a quick plot of the data with the function `xr.DataArray.plot()`, we only get a one dimensional view:

In [None]:
prw_data_xr.plot();

#### Create a two dimensional georeferenced plot

Our next goal is to reorganize the data so that `prw_data_xr.plot()` returns a meshed grid plot.

For that, we create a less condensed **two-dimensional** DataArray (with a lot of `NaN` values). 

<br />

<h2 style="color:red"> Exercise </h2>

1. Create a two-dimensional NumPy array with the size `len(prw_data)` x `len(prw_data)`

1. Assign `NaN` values to the entire array

1. On the diagonal of the square array, insert the values of `prw_data`

1. Show the new array

<br>

You will need:

- `np.full()` or `np.empty()`
- `np.nan`
- use a `for` loop


In [None]:
# 1.


In [None]:
# 2.


In [None]:
# 3.


In [None]:
# 4.


<br>

#### Solution

In [None]:
# 1. and 2.
prw_data_2d = np.full([len(prw_data),len(prw_data)], np.nan)

# another way to generate the prw_data_2d
#prw_data_2d = np.empty([len(prw_data),len(prw_data)]) * np.nan

In [None]:
# 3. 
for i in range(0, len(prw_data)):
    prw_data_2d[i,i] = prw_data[i,2]
    
print(prw_data_2d)

In [None]:
# 4. 
xr.DataArray(prw_data_2d).plot()

<br>

<h2 style="color:red"> Exercise </h2>

Let's pass this NumPy array to **Xarray**.

1. Reset the variable `prw_data_xr` with a `xr.DataArray()` but use `prw_data_2d` as input.<br>
   **Hint**: Set the dims to \["lat","lon"\]. Coordinate and dimension names have to be the same.
2. Plot again

<br>


In [None]:
# 1.


In [None]:
# 2.


<br>


#### Solution

In [None]:
# 1.

prw_data_xr = xr.DataArray(prw_data_2d,
                           coords={"lat": prw_data[:,0],
                                   "lon": prw_data[:,1]},
                           dims=["lat","lon"],
                           name="prw",
                           attrs={"units":"kg m-2",
                                  "standard_name":"atmosphere_mass_content_of_water_vapor"})
print(prw_data_xr)

In [None]:
# 2. This leads to an error

prw_data_xr.plot()

An error occurs because the coordinate values of our station data are neither ascending nor descending. The **ValueError** at the end of the error message gives you the hint to use `sortby` to solve the problem.

Note that **Xarray** interprets a coordinate as an **_index coordinate_** only if the name of the coordinate is the same as the name of the dimension. That is why the dimensions are named 'lat' and 'lon' here. In the earlier one-dimensional example, 'lat' and 'lon' were not index coordinates, so the plot used only the indices of the dimension for its axis.
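
The effect of `sortby` on unsorted index coordinates can be sketched with a small made-up example. The coordinate values below are invented and only mimic unsorted station positions.

```python
import numpy as np
import xarray as xr

# hypothetical station-like data with unsorted coordinate values
unsorted = xr.DataArray(
    np.arange(9.0).reshape(3, 3),
    dims=("lat", "lon"),
    coords={"lat": [40.0, 30.0, 35.0], "lon": [-90.0, -100.0, -95.0]},
)

# coordinates named like their dimension become index coordinates
print(list(unsorted.indexes))

# the index is not monotonic before sorting
print(unsorted.indexes["lat"].is_monotonic_increasing)

# sortby reorders the coordinates and the data along with them
sorted_da = unsorted.sortby(["lon", "lat"])
print(sorted_da.lat.values, sorted_da.lon.values)
```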

In [None]:
# 2. correct way

prw_data_xr = prw_data_xr.sortby(['lon','lat'])
prw_data_xr.plot()


print(prw_data_xr.lon.min().data, prw_data_xr.lon.max().data)
print(prw_data_xr.lat.min().data, prw_data_xr.lat.max().data)
print(prw_data_xr.min().data, prw_data_xr.max().data)

<br>

With only a few **Xarray** commands we created a simple plot which gives us an idea of the locations for which we have valid station data. In the session Visualization Part 2, we will learn more sophisticated plotting including e.g. *coastlines*.

Here is a taste:

<br>

In [None]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

proj    = ccrs.PlateCarree()                            # choose map projection

fig, ax = plt.subplots(figsize=(12,8), subplot_kw={'projection':proj})
ax.set_extent([-105, -80, 25, 41], proj)
ax.stock_img()                                          # add satellite image
ax.gridlines(draw_labels=True, color='None', zorder=0)  # turn on axis label, turn off gridlines
ax.coastlines()                                         # add coastlines

prw_data_xr.plot(cmap='Reds', cbar_kwargs=dict(shrink=0.6)) ; # decrease colorbar size

<h2 style="color:red"> Exercise </h2>

Play a bit with the Xarray DataArray and use the DataArray **prw_data_xr** from above

1. add a variable long_name attribute  
   (name it as you like, but be aware that many plotting routines parse the long_name to plot the labels ;))
2. change the standard_name and variable name
3. add more attributes and print them all

<br>

In [None]:
# 1.


In [None]:
# 2.


In [None]:
# 3.


<br>

#### Solution

In [None]:
# 1.
prw_data_xr.attrs['long_name'] = 'This is the variables long_name'

In [None]:
# 2.
prw_data_xr.attrs['standard_name'] = 'area_fraction'
prw_data_xr.name = 'variable_A'

In [None]:
# 3. 
prw_data_xr.attrs['created_by'] = 'DKRZ Python Course'
prw_data_xr.attrs

----
## Dataset

An Xarray `Dataset` is a dictionary-like container of data arrays with aligned dimensions. <br><br>

![xrdataset](https://docs.xarray.dev/en/stable/_images/dataset-diagram.png)

Datasets have four key properties:

     1. dims:      dict for dimension names
     2. data_vars: dict of data arrays
     3. coords:    dict of coordinates
     4. attrs:     dict for dataset (global) attributes

**Note:** <br>
If you are familiar with the **netCDF file format**: the Xarray Dataset is designed as an in-memory representation of the netCDF data model.
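
The four key properties can be inspected on any Dataset. The following sketch uses a small made-up Dataset (`ds_demo`) and reads the dimension lengths via `sizes`.

```python
import numpy as np
import xarray as xr

# hypothetical Dataset with one data variable
ds_demo = xr.Dataset(
    {"temp": (("lat", "lon"), np.zeros((2, 3)))},
    coords={"lat": [50.0, 55.0], "lon": [0.0, 5.0, 10.0]},
    attrs={"title": "demo"},
)

print(dict(ds_demo.sizes))       # dimension names and lengths
print(list(ds_demo.data_vars))   # data variables
print(list(ds_demo.coords))      # coordinates
print(ds_demo.attrs)             # global attributes
```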

You can use the already defined DataArrays to create a Dataset. Here, we use our prw_data_xr DataArray.

In [None]:
prw_data_xr

In [None]:
ds = prw_data_xr.to_dataset()
ds

In [None]:
ds = prw_data_xr.to_dataset(promote_attrs=True)
ds
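
The difference between the two calls above is the handling of the attributes: with `promote_attrs=True` the attributes of the DataArray are also copied to the global attributes of the Dataset. A minimal sketch with a made-up DataArray `arr`:

```python
import numpy as np
import xarray as xr

# hypothetical DataArray with one attribute
arr = xr.DataArray(np.zeros(3), dims="x", name="prw", attrs={"units": "kg m-2"})

ds_plain = arr.to_dataset()
ds_promoted = arr.to_dataset(promote_attrs=True)

print(ds_plain.attrs)             # global attributes without promotion
print(ds_promoted.attrs)          # global attributes with promotion
print(ds_promoted["prw"].attrs)   # the variable keeps its own attributes
```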

<br>

Next, we use NumPy arrays of random values as input data for the Dataset.

1. define two arrays for the variables temp and prec
1. define the coordinate data for lat and lon
1. define the coordinate data for a time dimension
1. create the Dataset

<br>

1a. Define the data for the variable temp (temperature)

In [None]:
temp = np.random.uniform(250,300,40).reshape((2,4,5))
temp

1b. Define random data for a variable prec (precipitation); for reproducibility we set the random seed.

In [None]:
np.random.seed(100000)

In [None]:
prec = np.random.uniform(0.001,0.015,40).reshape((2,4,5))

prec

2. Define the data for the coordinate variables lat and lon

In [None]:
lat = [45.,50.,55.,60.]
lon = [0.,5.,10.,15.,20.]

3. Define the time variable

This time we generate a time variable containing 2 time steps with daily frequency using the **Pandas** function `pd.date_range()`.

In [None]:
time = pd.date_range(start='2023-01-01', periods=2)
time

4. Define the Dataset

Use the data and coordinate variables to generate the Dataset. Add an attribute 'comment' to the Dataset.

In [None]:
ds = xr.Dataset({'temp': (['time','lat','lon'], temp),
                 'prec': (['time','lat','lon'], prec),
                 },
                 coords={'time': time,
                         'lat': (['lat'], lat),
                         'lon': (['lon'], lon),
                         },
                 attrs={'comment': 'This is a global attribute of the dataset'})

Let's look at the created Xarray Dataset

In [None]:
ds

If you want a more 'ncdump'-like view, use `Dataset.info()`.

In [None]:
ds.info()

<br>

**Let's access the data.**

_the data variable temp_

In [None]:
ds['temp']   
# alternatively:
#ds.temp

_the coordinate variable lat_

In [None]:
ds.lat

# you can use the variable coordinate lat, too
ds.temp.lat

_the coordinate variable time_

In [None]:
ds.time

### Dimensions, shape and size

To get more information about the dimensions of a **Dataset** and the shape and size of its variables, we can use the appropriate attributes.



In [None]:
dims  = ds.dims
shape = ds.temp.shape   # shape of the data variable temp
size  = ds.temp.size    # total number of elements of temp
rank  = len(shape)      # number of dimensions

print('dimensions: ', dims)
print('shape:      ', shape)
print('size:       ', size)
print('rank:       ', rank)

<br>

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

Make yourself familiar with `xr.Dataset`

1. generate an Xarray Dataset
1. try to add some attributes
1. choose a variable and print its content

In [None]:
# 1.


In [None]:
# 2.


In [None]:
# 3.


<br>

#### Solution

In [None]:
# 1.
tas = xr.DataArray(temp,
                   dims=['time','lat','lon'],
                   coords={'time': time,
                           'lat': (['lat'], lat),
                           'lon': (['lon'], lon),
                           },
                   name='tas',
                   attrs={'units': 'K', 'standard_name':'surface_temperature'})

prc = xr.DataArray(prec,
                   dims=['time','lat','lon'],
                   coords={'time': time,
                           'lat': (['lat'], lat),
                           'lon': (['lon'], lon),
                           },
                   name='prec',
                   attrs={'units': 'mm', 'standard_name':'precipitation'})


ds_new = xr.merge([tas,prc])
print(ds_new.tas.attrs)

In [None]:
# ... 1.  Use NumPy array temp and Xarray DataArray prc

ds_new = xr.Dataset({'tas': (['time','lat','lon'], temp, {'units':'K', 
                                                          'standard_name':'surface_temperature'}),
                     'prc': (['time','lat','lon'], prc.data, prc.attrs),
                     },
                     coords={'time': time,
                            'lat': (['lat'], lat),
                            'lon': (['lon'], lon),
                            },
                     attrs={'comment': 'This is a global attribute of the dataset',
                            'source':'DKRZ Python Course'})
print(ds_new.tas.attrs)
print(ds_new.prc.attrs)
print(ds_new)

In [None]:
# ... 1.  Add a DataArray to an existing Dataset

ds_new2 = xr.merge([ds_new, tas.rename('tas2')])
ds_new2

In [None]:
# 2.

ds_new.tas.attrs['long_name'] = 'near surface temperature'

print(ds_new.tas.attrs)

In [None]:
# 3.

print(ds_new.prc.data)

## Indexing and slicing data 

See also: https://docs.xarray.dev/en/stable/user-guide/indexing.html#

To demonstrate how to do DataArray indexing we create a small DataArray of shape(3,5). 

In <u>this example DataArray</u> the  dimension **x** can be seen as **row** and the dimension **y** as **columns**.

In [None]:
da = xr.DataArray(np.arange(1,16).reshape((3,5)),
                  dims=['x', 'y'],
                  coords={'x':[1,2,3], 'y':[10,20,30,40,50]})
da

You can extract data using the indices of the dimensions. There are different ways to extract data from the DataArray.

For the 2-dimensional DataArray da, using only one index means that you select a complete 'row' ('x').

In [None]:
da[0]

When using 2 indices you can extract single values. You can use

    data_array [index_of_dim_0][index_of_dim_1]
    
or 
    
    data_array [index_of_dim_0, index_of_dim_1]

In [None]:
da[1][0]

In [None]:
da[1,0]

There is a method for DataArrays and Datasets called .isel() which uses the dimension name and the integer index.

The following command does the same as the last example from above.

In [None]:
da.isel(x=1, y=0)

So far we have selected only one element, but now we want to select more. Therefore, we use slicing (as shown for NumPy).

Select some 'rows' and 'columns':

In [None]:
da[0:2, 1:3]

With the `slice` function you can extract slices from the DataArray using the `.isel()` method.

In [None]:
da.isel(x=slice(0,2), y=0)
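
The following sketch compares positional slicing with `.isel()` on a DataArray of the same layout as above (rebuilt here as `da_demo` so that the cell is self-contained). It illustrates that `.isel()` does not depend on the order of the dimensions.

```python
import numpy as np
import xarray as xr

da_demo = xr.DataArray(np.arange(1, 16).reshape((3, 5)),
                       dims=["x", "y"],
                       coords={"x": [1, 2, 3], "y": [10, 20, 30, 40, 50]})

by_position = da_demo[0:2, 0]                 # relies on the dimension order
by_name = da_demo.isel(x=slice(0, 2), y=0)    # independent of the dimension order

print(by_name.values)
print(by_position.identical(by_name))

# isel still selects the same data after the dimensions are transposed
print(da_demo.transpose("y", "x").isel(x=slice(0, 2), y=0).values)
```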

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

Make yourself familiar with the indexing and slicing of DataArrays.

## Label-based indexing

Instead of using the integer index you can also look up values by their coordinate labels.

In [None]:
da.sel(x=3)

Do you know what we mean? Let us use a better example next.

Therefore, we use the Dataset with the temperature and precipitation variables from above to demonstrate the `.sel()` method and the `.loc[]` indexer. Both can also be used with DataArrays.

In [None]:
ds

Using the .sel() method with a Dataset affects all data variables (temp, prec).

In the next example we want to extract the data of all variables for the time step '2023-01-01'.

In [None]:
ds.sel(time='2023-01-01')

Extract the temp data of a single time step.

In [None]:
ds.temp.sel(time='2023-01-01')

You can combine multiple labels at the same time to extract data. If you do not know the exact values, you can use the keyword `method='nearest'` to find the coordinate values nearest to the given values.

In [None]:
ds.temp.sel(lat=51.5, lon=2.5, method='nearest').values

Note: The keyword method can't be used with dimension slicing.

In [None]:
ds.temp.sel(time='2023-01-01', lat=slice(51.5,57.5)).values
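
One detail of label-based slicing differs from integer slicing: a label slice includes both end points if they exist as labels, while an integer slice excludes the stop index. A minimal sketch with a made-up one-dimensional DataArray `temp_demo`:

```python
import numpy as np
import xarray as xr

lat = [45.0, 50.0, 55.0, 60.0]
temp_demo = xr.DataArray(np.array([1.0, 2.0, 3.0, 4.0]), dims="lat", coords={"lat": lat})

# nearest-neighbour lookup for a value that is not on the grid
print(float(temp_demo.sel(lat=51.5, method="nearest")))

# label-based slices include both end points
print(temp_demo.sel(lat=slice(50.0, 60.0)).lat.values)

# integer slices follow the Python convention and exclude the stop index
print(temp_demo.isel(lat=slice(1, 3)).lat.values)
```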

If you prefer to work in a more pandas-like way, you can use the `.loc[]` indexer with a dictionary.

In [None]:
ds.temp.loc[{'time':'2023-01-01'}]

Overview of the four different kinds of indexing:

| Dimension lookup |  Index lookup |                 DataArray syntax                 |                  Dataset syntax                  |
|:-----------------|:--------------|:-------------------------------------------------|:-------------------------------------------------|
| Positional       | By integer    | `da[0, :, :]`                                    | not available                                    |
| Positional       | By label      | `da.loc["2001-01-01", :, :]`                     | not available                                    |
| By name          | By integer    | `da.isel(time=0)` or <br>  `da[dict(time=0)]`        | `ds.isel(time=0)` or <br>  `ds[dict(time=0)]`        |
| By name          | By label      | `da.sel(time="2001-01-01")` or <br>  `da.loc[dict(time="2001-01-01")]` | `ds.sel(time="2001-01-01")` or <br>   `ds.loc[dict(time="2001-01-01")]` |
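
The four DataArray variants of the table can be compared directly. The sketch below uses a made-up DataArray `da_demo` with a daily time axis and checks that all four selections return the same data.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2001-01-01", periods=2)
da_demo = xr.DataArray(np.arange(12).reshape(2, 2, 3),
                       dims=("time", "lat", "lon"),
                       coords={"time": time, "lat": [50.0, 55.0], "lon": [0.0, 5.0, 10.0]})

a = da_demo[0, :, :]                    # positional, by integer
b = da_demo.loc["2001-01-01", :, :]     # positional, by label
c = da_demo.isel(time=0)                # by name, by integer
d = da_demo.sel(time="2001-01-01")      # by name, by label

print(a.equals(b), a.equals(c), a.equals(d))
```

The two variants that look up the dimension by name keep working if the order of the dimensions changes, which makes them the more robust choice in scripts.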

<br>

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

1. Extract some precipitation data from the Dataset ds using
   - .isel()
   - .sel()
   - .loc[]
2. Which method do you like better `.sel()` or `.loc[]`?

In [None]:
# 1 - a


In [None]:
# 1 - b


In [None]:
# 1 - c


<br>

---- 
## Write DataArray or Dataset to file

Xarray provides an easy way to write a well-defined Dataset to a netCDF file with the method `.to_netcdf()`.

In [None]:
!rm -f ds_output_file.nc

ds.to_netcdf('ds_output_file.nc')

In [None]:
!ncdump -h ds_output_file.nc

That was really easy! But for completeness we should have added some more attributes to the dimensions and data variables, like units, standard_name, and others.

Let's see what it looks like when we write the DataArray to a netCDF file.

In [None]:
!rm -f da_output_file.nc
da.to_netcdf('da_output_file.nc')

In [None]:
!ncdump -h da_output_file.nc

It is also possible to write the Dataset to a Zarr file with the `Dataset.to_zarr()` function.

Note:
To write the data to a CSV file you can convert the Dataset to a `pandas.DataFrame` and then use the `pandas.DataFrame.to_csv()` function. An alternative is to use the **xarray_extras** package.
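
A minimal sketch of the CSV route via pandas, using a small made-up Dataset `ds_demo`. To keep the example free of side effects, `to_csv()` is called without a file name, in which case it returns the CSV text instead of writing a file.

```python
import numpy as np
import xarray as xr

# hypothetical Dataset with one data variable
ds_demo = xr.Dataset(
    {"temp": (("lat", "lon"), np.array([[280.0, 281.0], [282.0, 283.0]]))},
    coords={"lat": [50.0, 55.0], "lon": [0.0, 5.0]},
)

# one row per (lat, lon) pair, one column per data variable
df = ds_demo.to_dataframe()
print(df)

# without a file name, to_csv() returns the CSV text
csv_text = df.to_csv()
print(csv_text)
```

Passing a path, e.g. `df.to_csv('ds_output_file.csv')` with a file name of your choice, writes the same content to disk.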


<br>

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

1. Write one of the DataArrays and Dataset to a netCDF file


In [None]:
# 1.


<br>

----

## Read file

In the next step we want to read our newly created netCDF file. Xarray provides the function `xr.open_dataset()` to open a file with the file format netCDF, GRIB, HDF5, or Zarr. Default format is netCDF.

    ds_in = xr.open_dataset('infile.nc')

is the same as

    ds_in = xr.open_dataset('infile.nc', engine='netcdf4')


As the function name says, it only opens the file and reads in the metadata, not the data itself, which saves memory.

In [None]:
ds_in = xr.open_dataset('ds_output_file.nc')

ds_in

If you want to load the dataset into memory use load().

In [None]:
ds_in2 = xr.open_dataset('./ds_output_file.nc').load()

Delete this duplicate dataset.

In [None]:
del(ds_in2)

Read another netCDF file.

In [None]:
ds = xr.open_dataset('../data/tsurf.nc')
ds.info()

<br>

### Read a GRIB file

Before we can read a GRIB file, the cfgrib package has to be installed. Importing it is a quick way to check that it is available.

In [None]:
import cfgrib

Now, we can use again the xr.open_dataset() function but this time with the engine 'cfgrib'.

In [None]:
ds2 = xr.open_dataset('../data/MET9_IR108_cosmode_0909210000.grb2',
                      engine='cfgrib')

In [None]:
ds2.variables

<br>

<a class="anchor" id="read-multi"></a>
### Read multiple files at once

Sometimes you get data stored in multiple separate files but you want to have it available in only one Dataset.

In the course directory **data** are 3 example files _precip_day01.nc, precip_day02.nc, and precip_day03.nc_, each containing the data of one day in 6 hour intervals. 

**Xarray** provides the function `xr.open_mfdataset()` to read multiple files in one step as a single dataset. Before you can use `xr.open_mfdataset` make sure that the Python module **dask** is installed in your environment.

<br>


In [None]:
!ls -la ../data

<br>

One reason why **Xarray** is very fast with multiple files is that it does not **load** the data when the files are opened. This is possible by using an underlying library named `dask`. You can recognize that by checking for the `precip` variable in `dsm`.

First, we open the multiple files precip_day*.nc in the data directory.

In [None]:
dsm = xr.open_mfdataset('../data/precip_day*.nc')

dsm

In [None]:
dsm.precip[1,4,5]

The command above will not show you an exact value, only a description of what the output will be. To access one specific point of the array you would have to load the data into memory first, which is often not necessary for your workflow.

The entire array can be loaded into memory by `dsm.precip.load()`. You can also do: 
```python
dsm.precip.values[1,4,5]
```

➡️ As long as the data is not loaded, you can work on files that are **larger than memory**.

In [None]:
dsm.precip[1,4,5]

In [None]:
dsm.precip.load()

In [None]:
dsm.precip[1,4,5]

# is the same as

dsm.precip.values[1,4,5]

The `xr.open_mfdataset` function is very powerful. It contains over 10 arguments which allow users to configure how the files are combined:

- Along which dimension should the data be concatenated?
- How strictly should tests ensure that the data can be concatenated?
- What are coordinates, what are data variables?
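
The combination step can be sketched in memory without any files. The example below is only an analogue of what happens when several files are combined: it builds two made-up one-day datasets with the hypothetical helper `make_day` and joins them along the time dimension, once explicitly with `xr.concat` and once ordered by coordinate values with `xr.combine_by_coords`.

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_day(start):
    """Build a hypothetical one-day dataset with 6-hourly time steps."""
    time = pd.date_range(start, periods=4, freq=pd.Timedelta(hours=6))
    return xr.Dataset(
        {"precip": (("time", "lat"), np.zeros((4, 2)))},
        coords={"time": time, "lat": [50.0, 55.0]},
    )

day1 = make_day("2023-01-01")
day2 = make_day("2023-01-02")

# explicit concatenation along a named dimension
combined = xr.concat([day1, day2], dim="time")

# let xarray order the pieces by their coordinate values
combined_auto = xr.combine_by_coords([day2, day1])

print(combined.sizes["time"], combined_auto.sizes["time"])
```

In `xr.open_mfdataset` the keyword arguments `combine`, `concat_dim`, `compat`, `data_vars` and `coords` control this behaviour.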

<br>

<b><font size="+3" color="#ff0000">Exercise: </font></b> 

1. Read the file '../data/rectilinear_grid_2D.nc'
2. Print the file content

In [None]:
# 1.


In [None]:
# 2.


<br>

#### Solution

In [None]:
# 1.
ds_reclin = xr.open_dataset('../data/rectilinear_grid_2D.nc')

In [None]:
# 2.
ds_reclin

<br>

### Get variable coordinates and names

It is always good to have a closer look at the data, and this can be done very easily using the attributes, dimensions, and coordinates explained above.

Show the coordinates stored in file:


In [None]:
coords = ds.coords
coords

List the variables stored in the file:

In [None]:
variables = ds.variables
variables

Here we can see the time displayed in a readable way, because Xarray uses NumPy's datetime64 data type under the hood. The variable and coordinate attributes are displayed as well.
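
Because the time coordinate is stored as datetime64, calendar components and date strings can be used directly. A minimal sketch with a made-up daily time series `series`:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2023-01-30", periods=4)
series = xr.DataArray(np.arange(4.0), dims="time", coords={"time": time})

print(series.time.dtype)              # a NumPy datetime64 dtype
print(series.time.dt.month.values)    # calendar components via the .dt accessor

# select all January values with a partial date string
print(series.sel(time="2023-01").values)
```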

<h2 style="color:red"> Exercise </h2>

Use the Dataset ds from above.

1. Print the global file attributes
2. What is the difference between list(ds.keys()), list(ds.data_vars), and list(ds)?
3. Print the attributes of the variable of ds


<br />

In [None]:
# 1.


In [None]:
# 2.


In [None]:
# 3.


<br>

#### Solution

In [None]:
# 1.
print(ds.attrs)

print('------------------------------------------------')

In [None]:
# 2.
print(list(ds.keys()))
print(list(ds.data_vars))
print(list(ds))

print('------------------------------------------------')

In [None]:
# 3. 
print(ds.tsurf.attrs)
print(ds['tsurf'].attrs)