# 3.a xarray

![xarray logo](images/xarray_logo.png)
https://xarray.pydata.org/en/stable/index.html

**xarray** is a Python package that lets us handle multi-dimensional datasets in a simple way. It provides a large set of functions for advanced analytics and visualization and is part of the SciPy and Pangeo ecosystems.

**xarray** data structures describe scientific data with labels, attributes, dimensions, and coordinates, and extend the capabilities of **NumPy** and **pandas**.


## Data structures

- DataArray
- Dataset
- Dimensions
- Coordinates


DataArray: 

    N-dimensional array with named dimensions. DataArray objects add dimension names, coordinates, and attributes to the underlying data structure (NumPy or dask arrays).

Dataset: 

    Dict-like collection of DataArray objects with aligned dimensions. Variables, dimensions, coordinates, and attributes are used in the same way as for a DataArray. You can think of an xarray Dataset as a netCDF-file-like object.
 
Dimensions: 

    Named dimension axes. If no names are given, the dimensions are called dim_0, dim_1, ...

Coordinates: 

    An array that labels a dimension. Two types are defined: a) dimension coordinate: a 1-dimensional coordinate array whose name is the name of the dimension it labels; b) non-dimension coordinate: a coordinate array whose name differs from the dimension names.
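
The difference between the two coordinate types can be seen in a small sketch. The station names, elevations, and temperature values below are made up for illustration and are not part of this tutorial's data.

```python
import numpy as np
import xarray as xr

# Hypothetical data: one temperature value for each of three stations
temps = xr.DataArray(
    np.array([280.5, 285.1, 290.3]),
    dims="station",
    coords={
        # dimension coordinate: same name as the dimension it labels
        "station": ["A", "B", "C"],
        # non-dimension coordinate: lives on 'station' under another name
        "elevation": ("station", [10.0, 250.0, 1200.0]),
    },
    name="temperature",
    attrs={"units": "K"},
)

print(temps)

# The dimension coordinate is indexed, so it can be used with .sel()
print(temps.sel(station="B").item())
```

Here `station` is listed among the indexes of the array, while `elevation` is only carried along as extra information about each station.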




In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import xarray as xr

## Working with DataArrays

First, we create an array a of 20 random values with NumPy's ```random.rand()``` function.

In [None]:
a = np.random.rand(20)

print(a)

Make an xarray DataArray from the NumPy array a with ```xr.DataArray()```.

In [None]:
da_a = xr.DataArray(a)

print(da_a)

As you can see, a dimension ```dim_0``` is assigned to the array.

For n-dimensional arrays, a corresponding number of dimensions is created (```dim_0```, ```dim_1```, ...).

E.g. 3D data array:

In [None]:
data = np.random.rand(4,90,180)

print(data)

In [None]:
print(xr.DataArray(data))

<br>

The dimensions have no meaningful names or coordinates yet. We change that in the next step with the ```coords``` and ```dims``` parameters.


In [None]:
time = pd.date_range("2020-01-01", periods=4)
lat = np.linspace( -90.0, 90.0,  90) 
lon = np.linspace(-180., 180.0, 180)

da = xr.DataArray(data, coords=[time,lat,lon], dims=['time','lat','lon'])

print(da)
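
Once dimensions and coordinates are named, values can be selected by label as well as by position. The following is a minimal sketch that rebuilds the array from the cell above; the names `by_position`, `by_label`, and `near_equator` are only illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same construction as in the cell above
data = np.random.rand(4, 90, 180)
time = pd.date_range("2020-01-01", periods=4)
lat = np.linspace(-90.0, 90.0, 90)
lon = np.linspace(-180.0, 180.0, 180)
da = xr.DataArray(data, coords=[time, lat, lon], dims=["time", "lat", "lon"])

# Positional selection: second time step
by_position = da.isel(time=1)

# Label-based selection: the same time step, addressed by its date
by_label = da.sel(time=pd.Timestamp("2020-01-02"))

# Nearest-neighbour lookup: the latitude row closest to the equator
near_equator = da.sel(lat=0.0, method="nearest")

print(by_position.dims, by_label.dims, near_equator.dims)
```

```isel()``` works with integer positions, ```sel()``` with coordinate values; ```method="nearest"``` is useful here because no grid point lies exactly on latitude 0.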

<br>

Available DataArray attributes are

- values
- data
- coords
- dims
- sizes
- name
- attrs

In [None]:
print(da.values)

In [None]:
print(da.data)

In [None]:
print(da.coords)

In [None]:
print(da.dims)

In [None]:
print(da.sizes)

In [None]:
print(da.name)

da.name = 'data'

print(da.name)

In [None]:
print(da.attrs)

da.attrs['units'] = 'data units'

print(da.attrs['units'])

<br>

Like NumPy, xarray provides array methods such as ```where()```, ```min()```, and ```max()```.

In [None]:
A = xr.DataArray(np.arange(1, 26).reshape(5, 5), dims=('x', 'y'))

In [None]:
A      #-- without print, notebooks show a richer representation when available

```xr.DataArray.where()``` example

In [None]:
print(A.where(A.x > 2))

In [None]:
print(A.where(A.x + A.y > 2))

In [None]:
print(A.where(A > 10))

Use NumPy's ```where()``` function to replace values. Note that it returns a plain NumPy array, so the dimension names are lost.

In [None]:
print(np.where(A > 10, A, -9999.9))
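
As an alternative sketch, the replacement value can be passed as the second argument of xarray's own ```where()``` method. The result then stays a DataArray with its dimension names; the variable name `replaced` is only illustrative.

```python
import numpy as np
import xarray as xr

A = xr.DataArray(np.arange(1, 26).reshape(5, 5), dims=("x", "y"))

# Keep values where the condition holds, use -9999.9 everywhere else
replaced = A.where(A > 10, -9999.9)

print(type(replaced))
print(replaced.dims)
print(replaced)
```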

<br>

## Working with Datasets

A Dataset can contain multiple variables with different dimensions and coordinates.

Define two random data arrays, temp and prec, of shape (12,90,180).

In [None]:
temp = np.random.uniform(low=265, high=310, size=(12,90,180)) 
prec = np.random.uniform(low=0.0001, high=0.001, size=(12,90,180))

<br>

Now, we want to generate and add coordinate variables to the dataset.

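To create a time coordinate we use the pandas ```date_range()``` function. With ```freq='SM'``` (semi-month end) it returns 12 semi-monthly time steps, on the 15th and the last day of each month, from 15 Jan to 30 Jun 2020. Recent pandas versions spell this alias ```'SME'```.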
<br>

In [None]:
time = pd.date_range(start='2020-01-1', periods=12, freq='SM')

print(time)
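
The ```'SM'``` alias yields semi-monthly steps (the 15th and the last day of each month). If one time step per month on the 15th is wanted instead, one possible sketch is to start from the month-start frequency ```'MS'``` and shift by 14 days. The name `time_mid_month` is only illustrative.

```python
import pandas as pd

# Month starts (1 Jan ... 1 Dec 2020), shifted by 14 days to the 15th
month_starts = pd.date_range(start="2020-01-01", periods=12, freq="MS")
time_mid_month = month_starts + pd.Timedelta(days=14)

print(time_mid_month)
```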

Create the coordinate variable arrays for longitude and latitude with numpy's ```linspace()``` function.

In [None]:
lat = np.linspace(-90.0, 90.0, 90)
lon = np.linspace(-180.0, 180.0, 180)

print(lat)
print(lon)

<br>

Everything we need is now defined, so we can create the dataset. The coordinate variables and the variable temp are assigned to it.



In [None]:
ds = xr.Dataset(data_vars={'temperature':(['time','lat','lon'], temp),}, 
                coords={'time':('time', time), 
                        'lat':(['lat'], lat), 
                        'lon':(['lon'], lon)})

print(ds)
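
A second variable such as prec can be added to an existing dataset with dict-like assignment. The sketch below uses a smaller grid, daily time steps, and the name `ds_small` to keep the example light; these choices and the unit strings are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Smaller, hypothetical grid
time = pd.date_range("2020-01-01", periods=12, freq="D")
lat = np.linspace(-90.0, 90.0, 9)
lon = np.linspace(-180.0, 180.0, 18)
temp = np.random.uniform(low=265, high=310, size=(12, 9, 18))
prec = np.random.uniform(low=0.0001, high=0.001, size=(12, 9, 18))

ds_small = xr.Dataset(
    data_vars={"temperature": (["time", "lat", "lon"], temp)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Add a second variable as a tuple of (dimension names, data)
ds_small["precipitation"] = (["time", "lat", "lon"], prec)

# Unit strings are assumptions for illustration only
ds_small["temperature"].attrs["units"] = "K"
ds_small["precipitation"].attrs["units"] = "kg m-2 s-1"

print(ds_small)
```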

<br>

Instead of using the print function, the info method of xarray Datasets can be used. The result looks very similar to the output of ncdump.
<br><br>

In [None]:
ds.info()

<br>

## Read data from file

The function ```open_dataset()``` of xarray is used to read the content of the file. 
<br>

In [None]:
import xarray as xr
import numpy as np

fname = './data/tsurf.nc'

ds = xr.open_dataset(fname)

ds.info()

<br>
Printing the dataset content gives you an overview of the dimension and variable names, their sizes, and the global file attributes.
<br>

### Show variable names and coordinates

It is always good to have a closer look at your data, and this can be done very easily.

OK, show me the variables stored in that file (oops, just one :D) and the coordinate variables, too.


In [None]:
coords    = ds.coords
variables = ds.variables

print('--> coords:    \n\n', coords)
print('--> variables: \n\n', variables)

Ah, that's better. Here we can see the time displayed in a readable way, because xarray uses NumPy's ```datetime64``` data type under the hood. The variable and coordinate attributes are shown as well.

<br>


## Select variable and coordinate variables

So far we have only opened the file as a dataset that contains the coordinate variables and the variable data. Now we want to select the variable **tsurf** and the coordinate variables **lat** and **lon**.


In [None]:
tsurf = ds.tsurf
lat   = tsurf.lat
lon   = tsurf.lon

print('Variable tsurf:            \n', tsurf.data)
print('\nCoordinate variable lat: \n', lat.data)
print('\nCoordinate variable lon: \n', lon.data)

The selected variables are of type ```xr.DataArray```.

In [None]:
print(type(tsurf))

<br>

## Dimensions, shape and size

To get more information about the dimensions, shape, and size of a variable we can use the appropriate attributes.


In [None]:
dimensions = ds.dims
shape = tsurf.shape
size  = tsurf.size
rank  = len(shape)

print('dimensions: ', dimensions)
print('shape:      ', shape)
print('size:       ', size)
print('rank:       ', rank)
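
The same attributes can be tried without the data file. The sketch below uses a small, hypothetical stand-in array named `var` instead of tsurf and also shows ```ndim``` and ```sizes```, which give the rank and the length of each named dimension directly.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for tsurf with dimensions (time, lat, lon)
var = xr.DataArray(np.zeros((2, 3, 4)), dims=("time", "lat", "lon"))

print(var.dims)         # ('time', 'lat', 'lon')
print(var.shape)        # (2, 3, 4)
print(var.size)         # 24
print(var.ndim)         # 3
print(dict(var.sizes))  # {'time': 2, 'lat': 3, 'lon': 4}
```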

<br>

## Variable attributes

Variable attributes are very important for working with the data in a correct manner.


In [None]:
attributes = list(tsurf.attrs)

print('attributes: ', attributes)

Let's see how we can get the content of an attribute.

In [None]:
long_name = tsurf.long_name
units = tsurf.units

print('long_name: ', long_name)
print('units:     ', units)
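
Attribute-style access fails if the attribute does not exist. Since ```attrs``` is a dictionary, a fallback value can be requested with ```get()```. The array `var` and its attribute values below are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical variable with two attributes
var = xr.DataArray(
    np.zeros(3),
    dims="x",
    attrs={"long_name": "surface temperature", "units": "K"},
)

# Dictionary access to an existing attribute
print(var.attrs["units"])

# Missing attribute: return a default instead of raising an error
print(var.attrs.get("standard_name", "not set"))
```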

<br>

## Time

xarray decodes the time values into readable times, using NumPy's ```datetime64``` data type internally.

In [None]:
time = ds.time.data

print('timestep 0: ', time[0])
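
For time coordinates, the ```dt``` accessor gives access to individual date components and to formatted strings. This is a small sketch with made-up dates in an array named `times`, not the time axis of the file.

```python
import pandas as pd
import xarray as xr

# Hypothetical time axis with three dates
times = xr.DataArray(
    pd.to_datetime(["2020-01-15", "2020-02-15", "2020-03-15"]),
    dims="time",
)

print(times.dt.month.values)                 # month numbers
print(times.dt.day.values)                   # day of month
print(times.dt.strftime("%Y-%m-%d").values)  # formatted strings
```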

<br>

## Read a GRIB file

To read a GRIB file, xarray needs the additional module ```cfgrib```, which is used as a so-called _engine_.

In [None]:
import cfgrib

ds2 = xr.open_dataset('./data/MET9_IR108_cosmode_0909210000.grb2', engine='cfgrib')

variables2 = ds2.variables

print('--> variables2: \n\n', variables2)

<br>

## Reshaping

There are different ways to swap the dimensions of an array from (x,y) to (y,x). 


In [None]:
B = xr.DataArray(np.arange(1, 31).reshape(6, 5), dims=('x', 'y'))

print(B)

In [None]:
print(B.transpose())

In [None]:
print(B.T)
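
For arrays with more than two dimensions, the new order can be given by dimension name. The array `cube` below is a hypothetical example.

```python
import numpy as np
import xarray as xr

# Hypothetical 3D array
cube = xr.DataArray(np.zeros((2, 3, 4)), dims=("time", "lat", "lon"))

# Explicit new order by dimension name
reordered = cube.transpose("lat", "lon", "time")
print(reordered.dims, reordered.shape)

# '...' stands for all remaining dimensions in their current order
lon_first = cube.transpose("lon", ...)
print(lon_first.dims, lon_first.shape)
```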

<br>

## Computations (xarray methods)

xarray builds on the scientific Python stack, in particular NumPy and pandas. This means we can use NumPy's arithmetic functions for computations.

### Arithmetic computations with arrays

In [None]:
C = xr.DataArray(np.random.uniform(low=0, high=100, size=(10,20)) , dims=('x','y'))
D = xr.DataArray(np.random.uniform(low=0, high=100, size=(10,20)) , dims=('x','y'))

print(C)
print(D)

#### Addition of constant value

In [None]:
C_add_value = C + 5.

print('Original value: %f  new value: %f' % (C[3,0], C_add_value[3,0]))

#### Addition of two arrays of the same size

In [None]:
CD_add = C + D

print('C[3,0]: %f  D[3,0]: %f  Added: %f' % (C[3,0], D[3,0], CD_add[3,0]))
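
Arrays do not need to have the same shape to be combined: xarray broadcasts them by dimension name. A minimal sketch with two hypothetical 1D arrays `row` and `col`:

```python
import numpy as np
import xarray as xr

row = xr.DataArray(np.arange(3), dims="x")       # 0, 1, 2 along x
col = xr.DataArray(np.arange(4) * 10, dims="y")  # 0, 10, 20, 30 along y

# Different dimension names: the result spans both dimensions
grid = row + col

print(grid.dims, grid.shape)
print(grid.values)
```

Each element of the result is the sum of the matching x and y values, for example 2 + 30 for the last element.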

#### Basic methods

In [None]:
print('Minimum value = ', C.min().values)
print('Maximum value = ', C.max().values)
print('Sum           = ', C.sum().values)


#### Advanced methods

In [None]:
print(C.mean(dim='x'))
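
With a small array of known values it is easier to see what a reduction along a named dimension does. The array `M` below is a hypothetical example.

```python
import numpy as np
import xarray as xr

# Values: [[0, 1, 2],
#          [3, 4, 5]]
M = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

print(M.mean(dim="x").values)  # mean over x, one value per y: [1.5 2.5 3.5]
print(M.mean(dim="y").values)  # mean over y, one value per x: [1. 4.]
print(M.mean().item())         # mean over all elements: 2.5
```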

<br>

### Working with missing values

Set the value -9999 in the array to a missing value. NumPy's ```np.nan``` constant is used to mark missing values.

In [None]:
tarray = xr.DataArray(data=[0, 1, -9999, 3, 4, 5, 6, 7, -9999, 9, 10], dims='x')

tarray = tarray.where(tarray != -9999, np.nan)

print(tarray)

<br>
Check for missing values. ```isnull()``` returns a mask array of True/False elements.

In [None]:
print(tarray.isnull())

Now, create a mask array that is True where the values are not missing.

In [None]:
print(tarray.notnull())

Count the values that are not missing.

In [None]:
print(tarray.count())

<br>
Return all array elements that are not missing values.

In [None]:
print(tarray.dropna(dim='x'))

Set the missing values to a constant number.

In [None]:
print(tarray.fillna(0))
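
Missing values also affect reductions. By default, xarray skips NaN values for floating-point data; with ```skipna=False``` a single NaN makes the result NaN. A minimal sketch using the same values as above:

```python
import numpy as np
import xarray as xr

values = [0, 1, -9999, 3, 4, 5, 6, 7, -9999, 9, 10]
tarray = xr.DataArray(values, dims="x")

# Without a second argument, where() fills with NaN
tarray = tarray.where(tarray != -9999)

print(tarray.sum().item())               # NaN values are skipped
print(tarray.mean().item())              # mean of the 9 valid values
print(tarray.mean(skipna=False).item())  # nan
```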

<br>

### Interpolation methods

Interpolation on 1D array.

Define the 1D array data1D:

    y-axis: 10 data values
    x-axis: 10 evenly spaced coordinate values in the range 0.0 to 10.0
    

In [None]:
data1D = xr.DataArray(data=[0., 3., 2.1, 2., 1.7, 5., 5.2, 3.3, 2.5, 4.], 
                                  dims='x', 
                                  coords={"x": np.linspace(0, 10, 10)})
print(data1D)

data1D.plot()

<br>

Interpolate the 1D values from 10 elements to 30 elements:

In [None]:
data1D_interp = data1D.interp(x=np.linspace(0, 10, 30))

print(data1D_interp)

data1D_interp.plot()

<br>

Interpolation on a 2D array.

Define the 2D array data2D:
    

In [None]:
data2D = xr.DataArray(np.random.uniform(low=0, high=2, size=(15,20)), 
                                        dims=('y','x'), 
                                        coords={'y':range(15), 'x':range(20)})
print(data2D)

In [None]:
data2D.plot()

<br>

Increase the resolution of the grid.

In [None]:
#-- stay within the coordinate ranges (x: 0..19, y: 0..14); points outside become NaN
data2D_linear = data2D.interp(x=np.linspace(0, 19, 100), y=np.linspace(0, 14, 100))

data2D_linear.plot()

<br>

## Write a Dataset to netCDF file


In [None]:
temp = np.random.uniform(low=265, high=310, size=(12,90,180)) 
prec = np.random.uniform(low=0.0001, high=0.001, size=(12,90,180))

time = pd.date_range(start='2020-01-1', periods=12, freq='SM')
lat = np.linspace(-90.0, 90.0, 90)
lon = np.linspace(-180.0, 180.0, 180)

ds = xr.Dataset(data_vars={'temperature':(['time','lat','lon'], temp),}, 
                coords={'time':('time', time), 
                        'lat':(['lat'], lat), 
                        'lon':(['lon'], lon)})

ds.to_netcdf("my_data.nc")

$ ncdump -h my_data.nc
```
netcdf my_data {
dimensions:
	time = 12 ;
	lat = 90 ;
	lon = 180 ;
variables:
	double temperature(time, lat, lon) ;
		temperature:_FillValue = NaN ;
	int64 time(time) ;
		time:units = "days since 2020-01-15 00:00:00" ;
		time:calendar = "proleptic_gregorian" ;
	double lat(lat) ;
		lat:_FillValue = NaN ;
	double lon(lon) ;
		lon:_FillValue = NaN ;
}
```


<br>

## Plotting

Some additional examples of how to use the plot method.

<br>


In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6,4))

da = xr.DataArray(np.sin(np.linspace(0, 2 * np.pi, 10)), dims="x", coords={"x": np.linspace(0, 1, 10)})

da.plot.line('o', label='original')
da.interp(x=np.linspace(0, 1, 100)).plot.line(label='linear (default)')
da.interp(x=np.linspace(0, 1, 100), method='cubic').plot.line(label='cubic')
plt.legend()

In [None]:
data2D.plot()

In [None]:
xr.open_dataset('./data/tsurf.nc').tsurf[0,:,:].plot.contourf(levels=20)