# 3.a xarray

![xarray logo](images/xarray_logo.png)
https://xarray.pydata.org/en/stable/index.html

**xarray** is a Python package for working with labelled multi-dimensional datasets in a simple way. It provides a large set of functions for advanced analytics and visualization, and it is part of the SciPy and Pangeo ecosystems.

**xarray** data structures describe scientific data with labels, attributes, dimensions and coordinates, and extend the capabilities of **NumPy** and **pandas**.


## Data structures

- DataArray
- Dataset
- Dimensions
- Coordinates


DataArray: 

    N-dimensional array with named dimensions. A DataArray adds dimension names, coordinates, and attributes to the underlying data structure (NumPy or Dask arrays).

Dataset: 

    Dict-like collection of DataArray objects with aligned dimensions. It uses variables, dimensions, coordinates, and attributes in the same way as a DataArray. You can think of an xarray Dataset as a netCDF-file-like object.
 
Dimensions: 

    Named dimension axes. If no names are given, the dimensions are called dim_0, dim_1, ...

Coordinates: 

    An array which labels a dimension. Two types are defined: a) dimension coordinates: 1-dimensional coordinate arrays whose name is the same as the name of their dimension. b) non-dimension coordinates: coordinate arrays whose name differs from the names of their dimensions.
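
The four terms above can be seen together in a small example. This is only a sketch: the values, the names `temperature` and `region`, and the `units` attribute are made up for illustration.

```python
import numpy as np
import xarray as xr

# 2 x 3 array of made-up values
values = np.arange(6.0).reshape(2, 3)

da = xr.DataArray(
    values,
    dims=["lat", "lon"],                              # dimension names
    coords={
        "lat": [10.0, 20.0],                          # dimension coordinate
        "lon": [0.0, 5.0, 10.0],                      # dimension coordinate
        "region": ("lon", ["west", "mid", "east"]),   # non-dimension coordinate along lon
    },
    attrs={"units": "K"},
    name="temperature",
)

# A Dataset is a dict-like container of DataArrays
ds = xr.Dataset({"temperature": da})

print(da.dims)               # ('lat', 'lon')
print(sorted(da.coords))     # ['lat', 'lon', 'region']
print(list(ds.data_vars))    # ['temperature']
```

Only `lat` and `lon` can be used directly for label-based indexing here, because only they share their name with a dimension.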




In [1]:
%matplotlib notebook

import numpy as np
import pandas as pd
import xarray as xr

## Working with DataArrays

First, we create a random data array a with 20 values using NumPy's ```random.rand()``` function.

In [2]:
a = np.random.rand(20)

print(a)

[0.57169218 0.65319767 0.07656777 0.87298414 0.26218773 0.92400472
 0.45233262 0.76581825 0.07742864 0.43219322 0.88012044 0.30562539
 0.90834428 0.325218   0.20361452 0.90978106 0.06563712 0.55586434
 0.68717356 0.06213245]


Make an xarray DataArray from the NumPy array a with ```xarray.DataArray()```.

In [3]:
da_a = xr.DataArray(a)

print(da_a)

<xarray.DataArray (dim_0: 20)>
array([0.57169218, 0.65319767, 0.07656777, 0.87298414, 0.26218773,
       0.92400472, 0.45233262, 0.76581825, 0.07742864, 0.43219322,
       0.88012044, 0.30562539, 0.90834428, 0.325218  , 0.20361452,
       0.90978106, 0.06563712, 0.55586434, 0.68717356, 0.06213245])
Dimensions without coordinates: dim_0


As you can see, a dimension ```dim_0``` is assigned to the array.

For n-dimensional arrays, a corresponding number of dimensions is used.

E.g. 3D data array:

In [4]:
data = np.random.rand(4,90,180)

print(data)

[[[0.55650118 0.64407491 0.09264852 ... 0.45038673 0.90148364 0.97372424]
  [0.76473579 0.85914192 0.55944112 ... 0.99486477 0.95951188 0.49101152]
  [0.82284874 0.5040308  0.89029976 ... 0.9388651  0.06967998 0.81442009]
  ...
  [0.33670543 0.26803847 0.95953476 ... 0.76593967 0.56897754 0.96596041]
  [0.8361361  0.29468175 0.3697744  ... 0.35372776 0.60080751 0.10640466]
  [0.34560608 0.79951736 0.5954455  ... 0.45803551 0.78101061 0.78231313]]

 [[0.63601529 0.41753778 0.94113386 ... 0.47683513 0.41779538 0.97981877]
  [0.69212498 0.48868654 0.40780889 ... 0.19439009 0.95998833 0.3358649 ]
  [0.64299023 0.71189473 0.12707517 ... 0.78690753 0.46228408 0.59386383]
  ...
  [0.07542569 0.12333521 0.82713726 ... 0.77662548 0.03126677 0.10573628]
  [0.5429395  0.5898293  0.20813958 ... 0.55950419 0.53931044 0.16380535]
  [0.77986075 0.45109181 0.72063427 ... 0.88372338 0.88415339 0.8796321 ]]

 [[0.45701999 0.47220729 0.1494934  ... 0.26516369 0.76826824 0.57991726]
  [0.38812734 0.739342

In [5]:
print(xr.DataArray(data))

<xarray.DataArray (dim_0: 4, dim_1: 90, dim_2: 180)>
array([[[0.55650118, 0.64407491, 0.09264852, ..., 0.45038673,
         0.90148364, 0.97372424],
        [0.76473579, 0.85914192, 0.55944112, ..., 0.99486477,
         0.95951188, 0.49101152],
        [0.82284874, 0.5040308 , 0.89029976, ..., 0.9388651 ,
         0.06967998, 0.81442009],
        ...,
        [0.33670543, 0.26803847, 0.95953476, ..., 0.76593967,
         0.56897754, 0.96596041],
        [0.8361361 , 0.29468175, 0.3697744 , ..., 0.35372776,
         0.60080751, 0.10640466],
        [0.34560608, 0.79951736, 0.5954455 , ..., 0.45803551,
         0.78101061, 0.78231313]],

       [[0.63601529, 0.41753778, 0.94113386, ..., 0.47683513,
         0.41779538, 0.97981877],
        [0.69212498, 0.48868654, 0.40780889, ..., 0.19439009,
         0.95998833, 0.3358649 ],
        [0.64299023, 0.71189473, 0.12707517, ..., 0.78690753,
         0.46228408, 0.59386383],
...
        [0.63588141, 0.45026424, 0.35447692, ..., 0.43549556,
  

<br>

The dimensions have only default names and no coordinates. We change this in the next step with the ```coords``` and ```dims``` parameters.


In [6]:
time = pd.date_range("2020-01-01", periods=4)
lat = np.linspace( -90.0, 90.0,  90) 
lon = np.linspace(-180., 180.0, 180)

da = xr.DataArray(data, coords=[time,lat,lon], dims=['time','lat','lon'])

print(da)

<xarray.DataArray (time: 4, lat: 90, lon: 180)>
array([[[0.55650118, 0.64407491, 0.09264852, ..., 0.45038673,
         0.90148364, 0.97372424],
        [0.76473579, 0.85914192, 0.55944112, ..., 0.99486477,
         0.95951188, 0.49101152],
        [0.82284874, 0.5040308 , 0.89029976, ..., 0.9388651 ,
         0.06967998, 0.81442009],
        ...,
        [0.33670543, 0.26803847, 0.95953476, ..., 0.76593967,
         0.56897754, 0.96596041],
        [0.8361361 , 0.29468175, 0.3697744 , ..., 0.35372776,
         0.60080751, 0.10640466],
        [0.34560608, 0.79951736, 0.5954455 , ..., 0.45803551,
         0.78101061, 0.78231313]],

       [[0.63601529, 0.41753778, 0.94113386, ..., 0.47683513,
         0.41779538, 0.97981877],
        [0.69212498, 0.48868654, 0.40780889, ..., 0.19439009,
         0.95998833, 0.3358649 ],
        [0.64299023, 0.71189473, 0.12707517, ..., 0.78690753,
         0.46228408, 0.59386383],
...
        [0.63588141, 0.45026424, 0.35447692, ..., 0.43549556,
       

<br>
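
Once dimensions are named and have coordinates, data can be selected by label rather than by integer position. The following is a minimal sketch on a deliberately coarse grid (5 latitudes, 9 longitudes, chosen only to keep the example small); it uses the `isel()` and `sel()` methods of DataArray.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=4)
lat = np.linspace(-90.0, 90.0, 5)      # -90, -45, 0, 45, 90
lon = np.linspace(-180.0, 180.0, 9)    # -180, -135, ..., 180

data = np.random.rand(4, 5, 9)
da = xr.DataArray(data, coords=[time, lat, lon], dims=["time", "lat", "lon"])

# Positional selection: first time step
first = da.isel(time=0)

# Label-based selection: the latitude with the exact value 0.0
equator = da.sel(lat=0.0)

# Label-based selection of the grid point closest to (40N, 12E)
nearest = da.sel(lat=40.0, lon=12.0, method="nearest")

print(first.dims)                                # ('lat', 'lon')
print(equator.dims)                              # ('time', 'lon')
print(float(nearest.lat), float(nearest.lon))    # 45.0 0.0
```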

## Working with Datasets

A Dataset can contain multiple variables with different dimensions and coordinates.

Define two random data arrays, temp and prec, of shape (12,90,180).

In [7]:
temp = np.random.uniform(low=265, high=310, size=(12,90,180)) 
prec = np.random.uniform(low=0.0001, high=0.001, size=(12,90,180))

<br>

Now, we want to generate and add coordinate variables to the dataset.

To create a time coordinate we use pandas' ```date_range()``` function with the semi-month frequency ```'SM'```. This gives 12 time steps on the 15th and the last day of each month, from 15 Jan to 30 Jun 2020.
<br>

In [8]:
time = pd.date_range(start='2020-01-1', periods=12, freq='SM')

print(time)

DatetimeIndex(['2020-01-15', '2020-01-31', '2020-02-15', '2020-02-29',
               '2020-03-15', '2020-03-31', '2020-04-15', '2020-04-30',
               '2020-05-15', '2020-05-31', '2020-06-15', '2020-06-30'],
              dtype='datetime64[ns]', freq='SM-15')
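
The `'SM'` alias stands for semi-month end frequency, which is why the index contains the 15th and the last day of each month. If one time step per month on the 15th is wanted instead, one possible sketch (an alternative, not what this notebook uses) is to pass a `pd.DateOffset` as the frequency:

```python
import pandas as pd

# One time step per month, always on the 15th
monthly = pd.date_range(start="2020-01-15", periods=12, freq=pd.DateOffset(months=1))

print(monthly[0].date(), monthly[-1].date())   # 2020-01-15 2020-12-15
```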


Create the coordinate variable arrays for longitude and latitude with numpy's ```linspace()``` function.

In [9]:
lat = np.linspace(-90.0, 90.0, 90)
lon = np.linspace(-180.0, 180.0, 180)

print(lat)
print(lon)

[-90.         -87.97752809 -85.95505618 -83.93258427 -81.91011236
 -79.88764045 -77.86516854 -75.84269663 -73.82022472 -71.79775281
 -69.7752809  -67.75280899 -65.73033708 -63.70786517 -61.68539326
 -59.66292135 -57.64044944 -55.61797753 -53.59550562 -51.57303371
 -49.5505618  -47.52808989 -45.50561798 -43.48314607 -41.46067416
 -39.43820225 -37.41573034 -35.39325843 -33.37078652 -31.34831461
 -29.3258427  -27.30337079 -25.28089888 -23.25842697 -21.23595506
 -19.21348315 -17.19101124 -15.16853933 -13.14606742 -11.12359551
  -9.1011236   -7.07865169  -5.05617978  -3.03370787  -1.01123596
   1.01123596   3.03370787   5.05617978   7.07865169   9.1011236
  11.12359551  13.14606742  15.16853933  17.19101124  19.21348315
  21.23595506  23.25842697  25.28089888  27.30337079  29.3258427
  31.34831461  33.37078652  35.39325843  37.41573034  39.43820225
  41.46067416  43.48314607  45.50561798  47.52808989  49.5505618
  51.57303371  53.59550562  55.61797753  57.64044944  59.66292135
  61.68539326

<br>

Everything we need is now defined, so we can create the dataset. The coordinate variables and the variable temp are assigned to it.



In [10]:
ds = xr.Dataset(data_vars={'temperature':(['time','lat','lon'], temp),}, 
                coords={'time':('time', time), 
                        'lat':(['lat'], lat), 
                        'lon':(['lon'], lon)})

print(ds)

<xarray.Dataset>
Dimensions:      (lat: 90, lon: 180, time: 12)
Coordinates:
  * time         (time) datetime64[ns] 2020-01-15 2020-01-31 ... 2020-06-30
  * lat          (lat) float64 -90.0 -87.98 -85.96 -83.93 ... 85.96 87.98 90.0
  * lon          (lon) float64 -180.0 -178.0 -176.0 -174.0 ... 176.0 178.0 180.0
Data variables:
    temperature  (time, lat, lon) float64 276.7 268.0 286.3 ... 267.8 302.8
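
The array prec defined earlier is not part of this dataset. A further variable on the same dimensions can be added after creation with dictionary-style assignment. The sketch below uses small stand-in arrays instead of the (12,90,180) ones, a monthly time axis chosen only for the example, and assumed `units` values for the random data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in arrays (the notebook uses shape (12, 90, 180))
nt, ny, nx = 12, 4, 8
temp = np.random.uniform(low=265, high=310, size=(nt, ny, nx))
prec = np.random.uniform(low=0.0001, high=0.001, size=(nt, ny, nx))

time = pd.date_range("2020-01-15", periods=nt, freq=pd.DateOffset(months=1))
lat = np.linspace(-90.0, 90.0, ny)
lon = np.linspace(-180.0, 180.0, nx)

ds = xr.Dataset(
    data_vars={"temperature": (["time", "lat", "lon"], temp)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Add a second variable on the same dimensions
ds["precipitation"] = (["time", "lat", "lon"], prec)

# Attach attributes; these units are assumptions for the random data
ds["temperature"].attrs["units"] = "K"
ds["precipitation"].attrs["units"] = "kg m-2 s-1"

print(list(ds.data_vars))   # ['temperature', 'precipitation']
```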


<br>

Instead of using the print function, the ```info()``` method of xarray Datasets can be used. The result looks very similar to the header output of ncdump.
<br><br>

In [11]:
ds.info()

xarray.Dataset {
dimensions:
	lat = 90 ;
	lon = 180 ;
	time = 12 ;

variables:
	float64 temperature(time, lat, lon) ;
	datetime64[ns] time(time) ;
	float64 lat(lat) ;
	float64 lon(lon) ;

// global attributes:
}

<br>

## Read data from file

The function ```open_dataset()``` of xarray is used to read the content of the file. 
<br>

In [12]:
import xarray as xr
import numpy as np

fname = './data/tsurf.nc'

ds = xr.open_dataset(fname)

ds.info()

xarray.Dataset {
dimensions:
	lat = 96 ;
	lon = 192 ;
	time = 40 ;

variables:
	datetime64[ns] time(time) ;
		time:standard_name = time ;
		time:axis = T ;
	float64 lon(lon) ;
		lon:standard_name = longitude ;
		lon:long_name = longitude ;
		lon:units = degrees_east ;
		lon:axis = X ;
	float64 lat(lat) ;
		lat:standard_name = latitude ;
		lat:long_name = latitude ;
		lat:units = degrees_north ;
		lat:axis = Y ;
	float32 tsurf(time, lat, lon) ;
		tsurf:long_name = surface temperature ;
		tsurf:units = K ;
		tsurf:code = 169 ;
		tsurf:table = 128 ;

// global attributes:
	:CDI = Climate Data Interface version 1.9.6 (http://mpimet.mpg.de/cdi) ;
	:Conventions = CF-1.6 ;
	:history = Thu Oct 10 16:08:50 2019: cdo selname,tsurf rectilinear_grid_2D.nc tsurf.nc ;
	:CDO = Climate Data Operators version 1.9.6 (http://mpimet.mpg.de/cdo) ;
}

<br>
Printing the dataset content gives you an overview of the dimension and variable names, their sizes, and the global file attributes.
<br>

### Show variable names and coordinates

It is always good to have a closer look at your data, and this can be done very easily.

OK, let's show the variables stored in that file (oops, just one :D) and the coordinate variables, too.


In [13]:
coords    = ds.coords
variables = ds.variables

print('--> coords:    \n\n', coords)
print('--> variables: \n\n', variables)

--> coords:    

 Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 ... 2001-01-10T18:00:00
  * lon      (lon) float64 -180.0 -178.1 -176.2 -174.4 ... 174.4 176.2 178.1
  * lat      (lat) float64 88.57 86.72 84.86 83.0 ... -83.0 -84.86 -86.72 -88.57
--> variables: 

 Frozen({'time': <xarray.IndexVariable 'time' (time: 40)>
array(['2001-01-01T00:00:00.000000000', '2001-01-01T06:00:00.000000000',
       '2001-01-01T12:00:00.000000000', '2001-01-01T18:00:00.000000000',
       '2001-01-02T00:00:00.000000000', '2001-01-02T06:00:00.000000000',
       '2001-01-02T12:00:00.000000000', '2001-01-02T18:00:00.000000000',
       '2001-01-03T00:00:00.000000000', '2001-01-03T06:00:00.000000000',
       '2001-01-03T12:00:00.000000000', '2001-01-03T18:00:00.000000000',
       '2001-01-04T00:00:00.000000000', '2001-01-04T06:00:00.000000000',
       '2001-01-04T12:00:00.000000000', '2001-01-04T18:00:00.000000000',
       '2001-01-05T00:00:00.000000000', '2001-01-05T06:00:00.000000000',
       '2

Ah, that's better. Here we can see the time displayed in a readable way, because xarray uses NumPy's datetime64 data type under the hood. The variable and coordinate attributes are shown as well.
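
Note that ```ds.variables``` contains the coordinate variables as well as the data variables. To list only the data variables, ```ds.data_vars``` can be used. A minimal sketch on a tiny stand-in dataset (grid and values are made up, only the names follow the file):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for the file content
ds = xr.Dataset(
    data_vars={"tsurf": (["time", "lat", "lon"], np.zeros((2, 3, 4), dtype="float32"))},
    coords={
        "time": pd.date_range("2001-01-01", periods=2, freq=pd.Timedelta(hours=6)),
        "lat": [80.0, 0.0, -80.0],
        "lon": [-180.0, -90.0, 0.0, 90.0],
    },
)

print(sorted(ds.coords))      # ['lat', 'lon', 'time']
print(list(ds.data_vars))     # ['tsurf']
print(sorted(ds.variables))   # ['lat', 'lon', 'time', 'tsurf']
```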

<br>


## Select variable and coordinate variables

So far we have only opened the file, which gives us a Dataset object containing the coordinate variables and the variable data. Now we want to select the variable **tsurf** and the coordinate variables **lat** and **lon**.


In [14]:
tsurf = ds.tsurf
lat   = tsurf.lat
lon   = tsurf.lon

print('Variable tsurf:            \n', tsurf.data)
print('\nCoordinate variable lat: \n', lat.data)
print('\nCoordinate variable lon: \n', lon.data)

Variable tsurf:            
 [[[242.38832 242.35121 242.23402 ... 242.62465 242.6266  242.63051]
  [246.98988 247.12074 247.23207 ... 246.65785 246.79262 246.9059 ]
  [246.2145  246.40785 246.66566 ... 245.78285 245.78285 246.00941]
  ...
  [256.27307 256.78674 257.43127 ... 254.5895  255.06606 255.69496]
  [242.54457 242.53676 242.91371 ... 241.52113 241.91566 242.33168]
  [236.11879 235.98012 235.96059 ... 236.11488 236.09145 236.07191]]

 [[245.4956  245.50732 245.51123 ... 245.4956  245.50732 245.4956 ]
  [246.65186 246.68701 246.75146 ... 246.62256 246.63232 246.6128 ]
  [244.76709 243.81787 243.66162 ... 245.00342 245.20264 245.35693]
  ...
  [257.26514 257.70264 258.19873 ... 255.81396 256.26318 256.78076]
  [243.08154 243.18115 243.6128  ... 242.05225 242.49365 242.85107]
  [236.3374  236.19873 236.17725 ... 236.33936 236.31396 236.29443]]

 [[246.92685 246.9542  246.97568 ... 246.87021 246.88583 246.90927]
  [244.24521 244.28036 243.90536 ... 245.24911 244.33505 244.1749 ]
  [

The selected variables are of type ```xr.DataArray```.

In [15]:
print(type(tsurf))

<class 'xarray.core.dataarray.DataArray'>
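
The same holds for the coordinate variables, while the ```data``` attribute used above returns the underlying array without labels. A small sketch with a made-up stand-in for tsurf:

```python
import numpy as np
import xarray as xr

# Made-up stand-in for the variable read from the file
tsurf = xr.DataArray(
    np.full((2, 3), 273.15, dtype="float32"),
    dims=["lat", "lon"],
    coords={"lat": [10.0, 20.0], "lon": [0.0, 5.0, 10.0]},
    name="tsurf",
)

print(type(tsurf).__name__)        # DataArray
print(type(tsurf.lat).__name__)    # DataArray
print(type(tsurf.data).__name__)   # ndarray
```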


<br>

## Dimensions, shape and size

To get more information about the dimensions, shape and size of a variable, we can use the appropriate attributes.


In [16]:
dimensions = ds.dims
shape = tsurf.shape
size  = tsurf.size
rank  = len(shape)

print('dimensions: ', dimensions)
print('shape:      ', shape)
print('size:       ', size)
print('rank:       ', rank)

dimensions:  Frozen(SortedKeysDict({'time': 40, 'lon': 192, 'lat': 96}))
shape:       (40, 96, 192)
size:        737280
rank:        3
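
The same information is also available on the DataArray itself through the attributes ```dims```, ```sizes``` and ```ndim```. The sketch below uses a stand-in array with the same dimension order as tsurf but a smaller, made-up grid.

```python
import numpy as np
import xarray as xr

# Stand-in with the dimension order of tsurf, smaller grid
tsurf = xr.DataArray(np.zeros((40, 6, 12)), dims=["time", "lat", "lon"])

print(tsurf.dims)            # ('time', 'lat', 'lon')
print(dict(tsurf.sizes))     # {'time': 40, 'lat': 6, 'lon': 12}
print(tsurf.ndim)            # 3
print(tsurf.sizes["time"])   # 40
```

Unlike ```shape```, ```sizes``` ties each length to its dimension name, so the code does not depend on the order of the dimensions.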


<br>

## Variable attributes

Variable attributes are very important for working with the data correctly.


In [17]:
attributes = list(tsurf.attrs)

print('attributes: ', attributes)

attributes:  ['long_name', 'units', 'code', 'table']


Let's see how we can get the content of an attribute.

In [18]:
long_name = tsurf.long_name
units = tsurf.units

print('long_name: ', long_name)
print('units:     ', units)

long_name:  surface temperature
units:      K
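
Attributes are stored in the ```attrs``` dictionary, so dictionary-style access works as well. This is useful for attribute names that clash with a method name or that may be missing. A minimal sketch with a made-up stand-in for tsurf:

```python
import numpy as np
import xarray as xr

# Made-up stand-in carrying two of the attributes seen above
tsurf = xr.DataArray(
    np.zeros((2, 3)),
    dims=["lat", "lon"],
    attrs={"long_name": "surface temperature", "units": "K"},
)

# Dictionary-style access works for any attribute name
print(tsurf.attrs["units"])                      # K

# .get() returns a default instead of raising KeyError
print(tsurf.attrs.get("standard_name", "n/a"))   # n/a

# Attributes can be added or changed in place
tsurf.attrs["comment"] = "made-up example values"
print(sorted(tsurf.attrs))                       # ['comment', 'long_name', 'units']
```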


<br>

## Time

xarray converts the time values to readable times, using NumPy's datetime64 data type internally.

In [19]:
time = ds.time.data

print('timestep 0: ', time[0])

timestep 0:  2001-01-01T00:00:00.000000000
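
Because the time values are datetime64, their components can be read with the ```dt``` accessor, and time steps can be selected with date strings. The sketch below uses a short 6-hourly stand-in time axis that starts like the one in the file; the data values are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 6-hourly stand-in time axis
time = pd.date_range("2001-01-01", periods=8, freq=pd.Timedelta(hours=6))
da = xr.DataArray(np.arange(8.0), dims=["time"], coords={"time": time})

# Components of the time values
print(da.time.dt.hour.values.tolist())   # [0, 6, 12, 18, 0, 6, 12, 18]

# Formatted time strings
labels = da.time.dt.strftime("%Y-%m-%d %H:%M")
print(labels.values[0])                  # 2001-01-01 00:00

# All time steps of 2 January 2001
day2 = da.sel(time="2001-01-02")
print(day2.size)                         # 4
```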


<br>

## Read a GRIB file

To read a GRIB file, xarray needs the additional module ```cfgrib```, which is used as a so-called _engine_.

In [20]:
import cfgrib

ds2 = xr.open_dataset('./data/MET9_IR108_cosmode_0909210000.grb2', engine='cfgrib')

variables2 = ds2.variables

print('--> variables2: \n\n', variables2)

--> variables2: 

 Frozen({'time': <xarray.Variable ()>
array('2009-09-21T00:00:00.000000000', dtype='datetime64[ns]')
Attributes:
    long_name:      initial time of forecast
    standard_name:  forecast_reference_time, 'latitude': <xarray.Variable (y: 461, x: 421)>
[194081 values with dtype=float64]
Attributes:
    units:          degrees_north
    standard_name:  latitude
    long_name:      latitude, 'longitude': <xarray.Variable (y: 461, x: 421)>
[194081 values with dtype=float64]
Attributes:
    units:          degrees_east
    standard_name:  longitude
    long_name:      longitude, 'p260532': <xarray.Variable (y: 461, x: 421)>
[194081 values with dtype=float32]
Attributes:
    GRIB_paramId:                             500393
    GRIB_shortName:                           OBSMSG_BT_IR10.8
    GRIB_units:                               Numeric
    GRIB_name:                                Obser. Sat. Meteosat sec. gener...
    GRIB_cfVarName:                           p260532
    G