# Dask ❤️ Xarray

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dask-❤️-Xarray" data-toc-modified-id="Dask-❤️-Xarray-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dask ❤️ Xarray</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#AVISO-sea-surface-height-satellite-altimetry-dataset" data-toc-modified-id="AVISO-sea-surface-height-satellite-altimetry-dataset-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>AVISO sea-surface height satellite altimetry dataset</a></span></li><li><span><a href="#Create-and-Connect-to-Dask-Distributed-Cluster" data-toc-modified-id="Create-and-Connect-to-Dask-Distributed-Cluster-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Create and Connect to Dask Distributed Cluster</a></span></li><li><span><a href="#open_mfdataset" data-toc-modified-id="open_mfdataset-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span><code>open_mfdataset</code></a></span></li><li><span><a href="#Plot-an-absolute-dynamic-topography-Snapshot" data-toc-modified-id="Plot-an-absolute-dynamic-topography-Snapshot-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Plot an absolute dynamic topography Snapshot</a></span></li><li><span><a href="#Visualize-with-Holoviews-and-Datashader" data-toc-modified-id="Visualize-with-Holoviews-and-Datashader-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Visualize with Holoviews and Datashader</a></span></li><li><span><a href="#Compute-annual-temporal-mean" data-toc-modified-id="Compute-annual-temporal-mean-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Compute annual temporal mean</a></span></li><li><span><a href="#Visualize-annual-temporal-mean" data-toc-modified-id="Visualize-annual-temporal-mean-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Visualize annual temporal mean</a></span></li><li><span><a href="#Compute-latitudinal-variance" 
data-toc-modified-id="Compute-latitudinal-variance-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Compute latitudinal variance</a></span></li><li><span><a href="#Compute-spatial-anomaly" data-toc-modified-id="Compute-spatial-anomaly-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Compute spatial anomaly</a></span></li><li><span><a href="#Going-Further" data-toc-modified-id="Going-Further-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Going Further</a></span></li></ul></li></ul></div>

This notebook was adapted from [Ryan Abernathey's notebook](https://rabernat.github.io/research_computing/dask-for-parallel-computing-and-big-data.html).

## Introduction

Xarray can automatically wrap its data in dask arrays. This capability turns xarray into an extremely powerful tool for Big Data earth science.

<div class="alert alert-block alert-info">
The main theme throughout this notebook is that dask is almost invisible to the user, meaning that you can focus on your science, rather than the details of parallel computing.
</div>
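
As a small illustration of that theme, the sketch below applies one function to a NumPy-backed array and to a dask-backed copy of it. The data are synthetic, and the variable name, sizes, and the helper `time_range` are assumptions made up for this example rather than part of the AVISO workflow.

```python
import numpy as np
import xarray as xr

# Synthetic data; the variable name and sizes are made up for illustration.
in_memory = xr.DataArray(
    np.arange(24, dtype="float64").reshape(4, 3, 2),
    dims=("time", "lat", "lon"),
    name="adt",
)

# .chunk() swaps the NumPy array for a dask array (one chunk per time step here).
dask_backed = in_memory.chunk({"time": 1})


def time_range(arr):
    """Max minus min along time; written without any reference to dask."""
    return arr.max(dim="time") - arr.min(dim="time")


eager = time_range(in_memory)    # computed immediately
lazy = time_range(dask_backed)   # only builds a task graph
same = bool((eager == lazy.compute()).all())
```

The function body never mentions dask; only the call to `.compute()` at the end reveals that the second result was lazy.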

## AVISO sea-surface height satellite altimetry dataset

To see this in action, we will analyze a subset of a fairly large dataset. This subset consists of 1 year of daily data from the [AVISO](http://www.aviso.oceanobs.com/en/data/products/index.html) sea-surface height satellite altimetry dataset.

In [None]:
# untar our dataset 
!mkdir -p ../../../data/aviso_madt && tar -xzf ../../../data/aviso_madt_2015.tar.gz --directory ../../../data/aviso_madt

In [None]:
# Count the files in the extracted directory
!ls ../../../data/aviso_madt/2015 | wc -l

Let's load the first file as a regular xarray dataset.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import xarray as xr
from matplotlib import pyplot as plt
%matplotlib inline
import hvplot.xarray
import numpy as np

In [None]:
ds_one_day = xr.open_dataset("../../../data/aviso_madt/2015/dt_global_allsat_madt_h_20151215_20160610.nc")
ds_one_day

In [None]:
# format_bytes lives in dask.utils in current releases
from dask.utils import format_bytes

In [None]:
format_bytes(ds_one_day.nbytes)

This one file is about 8 MB, so 365 of them will be nearly 3 GB. If we had downloaded all 25 years of data, it would be about 73 GB. This is a good example of the "medium dataset" described in the [overview notebook](./01_overview.ipynb#What-is-%22Big-Data%22?).
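
The estimate can be reproduced with a few lines of arithmetic. This sketch takes the rounded figure of 8 MB per file quoted above as its only input and uses decimal units (1 GB = 1000 MB), so the results are approximate.

```python
# Rough size estimate using the rounded figure of 8 MB per daily file.
mb_per_file = 8
files_per_year = 365
years = 25

one_year_gb = mb_per_file * files_per_year / 1000  # decimal GB
all_years_gb = one_year_gb * years

print(f"one year: {one_year_gb:.2f} GB")       # one year: 2.92 GB
print(f"{years} years: {all_years_gb:.0f} GB")  # 25 years: 73 GB
```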

## Create and Connect to Dask Distributed Cluster

In [None]:
from dask.distributed import Client, LocalCluster
# Set up a local cluster.
# By default, LocalCluster chooses the number of workers and threads per worker
# from the available CPU cores.
cluster = LocalCluster()
client = Client(cluster)
cluster

☝️ Don't forget to click the link above to view the scheduler dashboard! (you may wish to have both the notebook and dashboard side-by-side)
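
If the defaults do not suit your machine, the cluster size can be set explicitly. The following is a minimal sketch, not a recommendation: the worker counts are arbitrary, the names `demo_cluster` and `demo_client` are chosen here to avoid clobbering the cluster created above, and the dashboard is disabled only so the example runs without claiming a port.

```python
from dask.distributed import Client, LocalCluster

# Illustrative settings only: two single-threaded workers running as threads
# in this process, with the dashboard disabled.
demo_cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)
demo_client = Client(demo_cluster)

# Submit a trivial task to confirm the workers are reachable.
future = demo_client.submit(sum, [1, 2, 3])
total = future.result()
n_workers = len(demo_cluster.workers)

# Shut everything down again so the demo leaves nothing running.
demo_client.close()
demo_cluster.close()
```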

## `open_mfdataset`

An incredibly useful function in xarray is `open_mfdataset`.

In [None]:
help(xr.open_mfdataset)

Using `open_mfdataset`, we can easily open all the NetCDF files as a single `Dataset` object.

In [None]:
ds = xr.open_mfdataset("../../../data/aviso_madt/2015/*.nc", combine="by_coords")
ds

In [None]:
ssh = ds.adt
ssh

<div class="alert alert-block alert-info">
    Note that the values in our xarray DataArray consist of a lazy, chunked dask array. The number of chunks corresponds to the total number of files (365 in our case).
</div>

In [None]:
ssh.data
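
To see what "one chunk per file" looks like without the AVISO files at hand, the sketch below builds a dask-backed `DataArray` with the same layout by hand. The grid size and the number of "files" are made-up values, and the array is filled with random numbers rather than read from disk.

```python
import dask.array as da
import xarray as xr

# Hypothetical miniature grid: 5 "daily files" of 4 x 8 points each.
n_days, n_lat, n_lon = 5, 4, 8

# chunks=(1, n_lat, n_lon) mimics the default open_mfdataset layout when each
# file holds one time step: one chunk per file along the time dimension.
values = da.random.random((n_days, n_lat, n_lon), chunks=(1, n_lat, n_lon))
demo = xr.DataArray(values, dims=("time", "lat", "lon"), name="adt")

print(demo.data.numblocks)  # (5, 1, 1): chunk count along each dimension
print(demo.data.chunksize)  # (1, 4, 8): shape of a single chunk
```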

## Plot an absolute dynamic topography Snapshot

First we use the standard matplotlib approach.

In [None]:
ssh[0].load().plot(figsize=(15, 8))

<div class="alert alert-block alert-info">

For this case, the matplotlib approach is fine. However, if your data is very high-resolution, rendering it with matplotlib can be slow.

</div>

## Visualize with Holoviews and Datashader

Here we create an interactive browser for the data that dynamically resamples the image resolution as we zoom in and out.

In [None]:
from holoviews.operation.datashader import regrid

In [None]:
%output holomap='scrubber' fps=30
p1 = ssh.hvplot(x="lon", y="lat", cmap='RdBu_r', dynamic=True, 
               width=800, height=450, colorbar=True)
regrid(p1, precompute=True)

## Compute annual temporal mean 

In [None]:
ssh_2015_mean = ssh.mean(dim='time')
ssh_2015_mean.load()
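
The difference between building the reduction and running it can be checked on a small synthetic array. In this sketch `demo_ssh` is a made-up stand-in for `ssh`, with arbitrary sizes and one chunk per time step.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Synthetic stand-in for `ssh`: 6 time steps on a 3 x 4 grid, one chunk per step.
raw = np.random.default_rng(0).normal(size=(6, 3, 4))
demo_ssh = xr.DataArray(
    da.from_array(raw, chunks=(1, 3, 4)), dims=("time", "lat", "lon")
)

# Building the reduction is cheap: the result is still backed by a dask array.
demo_mean = demo_ssh.mean(dim="time")
still_lazy = isinstance(demo_mean.data, da.Array)

# .load() runs the task graph and stores the NumPy result in place.
demo_mean.load()
now_in_memory = isinstance(demo_mean.data, np.ndarray)
```

Because `.load()` modifies the object in place, later cells that reuse the mean do not trigger the computation again.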

## Visualize annual temporal mean

In [None]:
p2 = ssh_2015_mean.hvplot(x="lon", y="lat", cmap='RdBu_r', dynamic=True, 
               width=800, height=450, colorbar=True)
regrid(p2, precompute=True)

## Compute latitudinal variance

In [None]:
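# Anomaly relative to the 2015 temporal mean (still lazy)
ssh_anom = ssh - ssh_2015_mean
# Mean squared anomaly over longitude and time, leaving a function of latitude
ssh_variance_lonmean = (ssh_anom**2).mean(dim=('lon', 'time')).load()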

In [None]:
ssh_variance_lonmean.hvplot(width=800, height=450)
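
The mean squared anomaly computed above is the population variance in time (no degrees-of-freedom correction), averaged over longitude. The sketch below checks that equivalence on synthetic data; `demo_ssh` and its sizes are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for `ssh` (sizes are arbitrary), one chunk per time step.
rng = np.random.default_rng(42)
demo_ssh = xr.DataArray(
    rng.normal(size=(10, 4, 6)), dims=("time", "lat", "lon")
).chunk({"time": 1})

# Same two steps as the cell above: remove the time mean, then average
# the squared anomaly over longitude and time.
demo_anom = demo_ssh - demo_ssh.mean(dim="time")
variance_by_lat = (demo_anom**2).mean(dim=("lon", "time")).compute()

# Cross-check: variance in time (ddof=0 by default), averaged over longitude.
reference = demo_ssh.var(dim="time").mean(dim="lon").compute()
```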

## Compute spatial anomaly

In [None]:
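# Grid cells shrink towards the poles, so weight each latitude by cos(lat)
weight = np.cos(np.deg2rad(ds.lat))
# Normalize so the weights average to 1
weight /= weight.mean()
(ssh_anom * weight).mean(dim=('lon', 'lat')).hvplot(width=800, height=450)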
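
Xarray also provides weighted reductions through `DataArray.weighted`, which can express the same area-weighted mean. The following sketch compares the two approaches on a synthetic field without missing values; the grid, the name `field`, and the sizes are made up for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic field on a coarse, hypothetical latitude grid.
rng = np.random.default_rng(1)
field = xr.DataArray(
    rng.normal(size=(3, 9, 5)),
    dims=("time", "lat", "lon"),
    coords={"lat": np.linspace(-80, 80, 9)},
)

# Approach used above: scale by normalized cos(lat) weights, then a plain mean.
weight = np.cos(np.deg2rad(field.lat))
manual = (field * weight / weight.mean()).mean(dim=("lon", "lat"))

# Built-in alternative: xarray's weighted reductions.
builtin = field.weighted(weight).mean(dim=("lon", "lat"))
```

The two results agree when there are no missing values. With NaNs in the data (for example over land), `weighted` normalizes by the weights of the valid points only, so the results can differ.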

## Going Further

- [Xarray Documentation](https://xarray.pydata.org/en/stable/)
- [hvPlot Documentation](https://hvplot.pyviz.org/user_guide/Gridded_Data.html)

<div class="alert alert-block alert-success">
  <p>Previous: <a href="03_distributed.ipynb">Distributed</a></p>
  <p>Next: <a href="05_dask_hpc.ipynb">Dask + HPC</a></p>
</div>