# Xarray - an introduction
![image.png](https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png)

**Purpose** [Xarray](https://docs.xarray.dev/) was created to make it easy to work with **multidimensional** arrays (or **tensors**). These n-D arrays are common in data science, machine learning, and climate science. Although it is possible to work with n-D arrays entirely in NumPy, doing so costs you transparency, code readability, and the ability to easily apply an operation to a "dataset" of choice.

## Core data structures
Xarray has 2 core data structures that extend the core strengths of `NumPy` and `Pandas`.

![](https://docs.xarray.dev/en/stable/_images/dataset-diagram.png)

 - `DataArray` - a labeled n-dimensional array. It is an n-D generalization of `pandas.Series`.
 - `Dataset` - a dict-like container of `DataArray` objects aligned along any number of **shared dimensions**. It is similar to how `pandas.DataFrame` builds on `pandas.Series`.
 
The `Dataset` object allows the user to query, extract or combine `DataArray`s over a particular dimension across all variables. This pattern quickly becomes convenient when dealing with spatio-temporal datasets.
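As a minimal sketch of this pattern, the example below builds a small made-up `Dataset` (the name `toy`, the variables `temp` and `rain`, and the `site` dimension are hypothetical and not part of the tutorial data) and shows that a single selection or reduction along a shared dimension is applied to every variable.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: two variables sharing the same (time, site) dimensions.
times = pd.date_range("2020-01-01", periods=4, freq="D")
sites = ["a", "b", "c"]
coords = {"time": times, "site": sites}

temp = xr.DataArray(np.arange(12.0).reshape(4, 3), dims=("time", "site"), coords=coords)
rain = xr.DataArray(np.ones((4, 3)), dims=("time", "site"), coords=coords)
toy = xr.Dataset({"temp": temp, "rain": rain})

# One label-based selection along "time" applies to every variable.
first_two_days = toy.sel(time=slice("2020-01-01", "2020-01-02"))

# A reduction along "site" is likewise applied to all variables at once.
site_mean = toy.mean(dim="site")
```

Both results are again `Dataset` objects, so `first_two_days["rain"]` and `site_mean["temp"]` stay aligned with each other.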

In [34]:
import numpy as np

# importing as xr is by convention
import xarray as xr
import pandas as pd

### Dataset object

In [38]:
ds = xr.tutorial.load_dataset("air_temperature")
ds

This dataset holds air temperature at `2920` time steps on a `25` x `53` lat-lon grid. `lon`, `lat` and `time` are coordinates (each one is 1-D and labels one dimension) and `air` is a data variable.
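To make the distinction between dimensions, coordinates and data variables concrete, here is a minimal sketch on a small hand-built `Dataset`. The name `mini` and its values are made up for illustration; it only mimics the layout of the tutorial data at a much smaller size.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature of the layout described above: 4 x 2 x 3 instead of 2920 x 25 x 53.
mini = xr.Dataset(
    {"air": (("time", "lat", "lon"), np.zeros((4, 2, 3), dtype="float32"))},
    coords={
        "time": np.arange(4),
        "lat": [75.0, 72.5],
        "lon": [200.0, 202.5, 205.0],
    },
)

dim_sizes = dict(mini.sizes)      # dimension name -> length
variables = list(mini.data_vars)  # names of the data variables
coord_names = sorted(mini.coords) # names of the coordinates
```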

### DataArray object

In [7]:
da = ds.air  # attribute access; ds['air'] (dict-style access) is equivalent
da

To extract just the underlying array, use `.data`

In [10]:
air_temp = da.data
print(type(air_temp))
print(air_temp.shape)

<class 'numpy.ndarray'>
(2920, 25, 53)


### Dimensions, coordinates, attributes

A `DataArray` may have dimensions that are also coordinates. It may also have dimensions without coordinates.
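The sketch below illustrates both cases on a small made-up array (the names `arr`, `x` and `y` are hypothetical): `x` is a dimension with coordinate labels, while `y` is a dimension without coordinates and can only be indexed by position.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},  # only "x" gets coordinate labels
)

has_x = "x" in arr.coords  # "x" is both a dimension and a coordinate
has_y = "y" in arr.coords  # "y" is a dimension without a coordinate

by_label = arr.sel(x=20)     # label-based lookup needs a coordinate
by_position = arr.isel(y=0)  # positional lookup works for any dimension
```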

In [11]:
da.dims

('time', 'lat', 'lon')

In [12]:
da.coords

Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

In [13]:
da.attrs

{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32)}

## Interop with Pandas

In [15]:
# to and from Pandas
air_temp_pd = da.to_series()
air_temp_pd

time                 lat   lon  
2013-01-01 00:00:00  75.0  200.0    241.199997
                           202.5    242.500000
                           205.0    243.500000
                           207.5    244.000000
                           210.0    244.099991
                                       ...    
2014-12-31 18:00:00  15.0  320.0    297.389984
                           322.5    297.190002
                           325.0    296.489990
                           327.5    296.190002
                           330.0    295.690002
Name: air, Length: 3869000, dtype: float32

In [16]:
type(air_temp_pd)

pandas.core.series.Series

The air temperature has a `3`-level index (`time`, `lat`, `lon`) when it is converted to a Pandas Series.
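The conversion also works in the other direction. As a minimal sketch, assuming a small made-up array (`small` is hypothetical and uses only two dimensions to stay short), `xr.DataArray.from_series` turns each level of the index back into a dimension.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 array standing in for the air temperature data.
small = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [15.0, 17.5], "lon": [200.0, 202.5, 205.0]},
    name="air",
)

series = small.to_series()                 # index has one level per dimension
restored = xr.DataArray.from_series(series)  # each index level becomes a dimension
```

Comparing values by label, for example `restored.sel(lat=17.5, lon=202.5)`, is a safer check for the round trip than comparing by position.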

In [17]:
da.to_dataframe()

time                 lat   lon    air
2013-01-01 00:00:00  75.0  200.0  241.199997
2013-01-01 00:00:00  75.0  202.5  242.500000
2013-01-01 00:00:00  75.0  205.0  243.500000
2013-01-01 00:00:00  75.0  207.5  244.000000
2013-01-01 00:00:00  75.0  210.0  244.099991
...                  ...   ...    ...
2014-12-31 18:00:00  15.0  320.0  297.389984
2014-12-31 18:00:00  15.0  322.5  297.190002
2014-12-31 18:00:00  15.0  325.0  296.489990
2014-12-31 18:00:00  15.0  327.5  296.190002


## Composing a DataArray and Dataset
Say you have the raw data. How do you compose a `DataArray` and a `Dataset` from it?

In [18]:
raw_data = da.data
print(type(raw_data))
print(raw_data.shape)

<class 'numpy.ndarray'>
(2920, 25, 53)


In [21]:
raw_data[0,0,1]

242.5

In [23]:
# For now, let us not expand each array
xr.set_options(display_expand_data=False)

<xarray.core.options.set_options at 0x7f9eb1e21d00>

In [24]:
# use DataArray constructor
da2 = xr.DataArray(raw_data, dims=('time','lat','lon'))
da2

The coordinates are empty although the data has `3` dimensions. You can set the coordinates using another `DataArray` object or a NumPy array. In this example, lat and lon are evenly spaced, so `np.arange` can generate them.

In [26]:
lon_array = np.arange(start=200, stop=331, step=2.5)
print(lon_array.shape)

(53,)


In [28]:
da2.coords['lon'] = lon_array
da2

Similarly, set the latitude coordinate. The time coordinate is set later, on the `Dataset`.

In [30]:
da2.coords['lat'] = np.arange(start=75, stop=14.9, step=-2.5)
da2

You can also assign attributes in a similar fashion.

In [31]:
da2.attrs['some_attribute'] = 'hello'
da2
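The step-by-step assignments above can also be done in a single constructor call. The following is a sketch under stated assumptions: `values` is random stand-in data with only 8 time steps rather than the real `raw_data`, `da3` is a hypothetical name, and `np.linspace` is swapped in for `np.arange` so that the end points are included explicitly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for raw_data: random values in the same (time, lat, lon) layout,
# shortened to 8 time steps to keep the example small.
rng = np.random.default_rng(0)
values = rng.normal(size=(8, 25, 53)).astype("float32")

da3 = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2013-01-01", periods=8, freq="6h"),
        "lat": np.linspace(75.0, 15.0, 25),    # 75.0, 72.5, ..., 15.0
        "lon": np.linspace(200.0, 330.0, 53),  # 200.0, 202.5, ..., 330.0
    },
    attrs={"some_attribute": "hello"},
    name="air",
)
```

Each coordinate array must have the same length as the dimension it labels, otherwise the constructor raises an error.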

### Composing a Dataset

In [33]:
ds2 = xr.Dataset({'air':da2, 'air2':da2})  # pass a dict-like mapping of name -> DataArray; any number of variables
ds2

In [37]:
# lowercase "6h": the uppercase "6H" alias is deprecated in recent pandas versions
ds2.coords['time'] = pd.date_range(start='2013-01-01', end="2014-12-31 18:00", freq="6h")
ds2
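The time coordinate assigned above must contain exactly as many entries as the `time` dimension, which is `2920` for this data. A quick sanity check of the date range, written as a standalone sketch (the names `times` and `n_times` are illustrative):

```python
import pandas as pd

# Same range as above: 6-hourly steps over 2013 and 2014.
times = pd.date_range(start="2013-01-01", end="2014-12-31 18:00", freq="6h")

# 2 non-leap years * 365 days * 4 steps per day
n_times = len(times)
```

If the lengths did not match, assigning the range to `ds2.coords['time']` would raise an error instead of silently misaligning the data.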