# Sea Surface Altimetry Data Analysis

For this example we will use gridded sea level data for the Great Barrier Reef to demonstrate how Dask handles expensive calculations.

- Examine dataset and its variables
- Timeseries of mean surface elevation in this region
- Sea level variability
---

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask array, Sea Level, Great Barrier Reef
- Creation Date: 2020-Sep
-----

In [2]:
import numpy as np
import netCDF4 as nc
import xarray as xr
import dask.array as da
import hvplot.xarray
import hvplot.pandas

### About this dataset

The data used in this exercise is from the "eReefs GBR4 Hydro All v1.85" model output. The eReefs model includes near-real-time and hindcast hydrodynamic components as well as ecological and sediment processes. The models are of varying resolution and incorporate boundary data from global and regional models as well as observed stream flow data. More information about this collection can be found at 
https://research.csiro.au/cem/software/ems/ and 
https://research.csiro.au/ereefs/models/model-outputs/access-to-raw-model-output/.

See data availability details in our [Geonetwork catalogue](https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f0538_9654_5729_1740).

In [3]:
!du -sh /g/data/fx3/gbr4_1.85/

1.3T	/g/data/fx3/gbr4_1.85/


In [4]:
!ls -l /g/data/fx3/gbr4_1.85/ | wc -l

72
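The two shell commands above can be mirrored in Python. Note that `ls -l` prints a leading `total` line and lists every entry, so `wc -l` gives one more than the number of entries and does not distinguish netCDF files from anything else. The sketch below counts only `*.nc` files and sums their apparent sizes (which can differ from the disk usage that `du` reports). The helper name `count_netcdf` and the throwaway directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def count_netcdf(directory):
    """Return (number of *.nc files, their total size in bytes) in a directory."""
    files = sorted(Path(directory).glob("*.nc"))
    return len(files), sum(f.stat().st_size for f in files)


# Demonstrate on a throwaway directory so the sketch runs anywhere;
# on NCI you would pass '/g/data/fx3/gbr4_1.85/' instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.nc", "b.nc", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"\0" * 10)
    n_files, total_bytes = count_netcdf(tmp)

print(n_files, total_bytes)  # the .txt file is ignored
```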


### Access data from filesystem

It will take a little while to scan the data directory as it has 1.3TB of data!

In [5]:
filenames = '/g/data/fx3/gbr4_1.85/*.nc'
ds = xr.open_mfdataset(filenames,combine='by_coords')

In [6]:
ds

Dataset display (Dask array summaries, in display order):

| Shape | Chunk shape | Array size | Chunk size | Tasks | Chunks | Type | Chunk type |
|---|---|---|---|---|---|---|---|
| (47,) | (47,) | 376 B | 376 B | 345 | 1 | float64 | numpy.ndarray |
| (180, 600) | (180, 600) | 864.00 kB | 864.00 kB | 345 | 1 | float64 | numpy.ndarray |
| (180, 600) | (180, 600) | 864.00 kB | 864.00 kB | 345 | 1 | float64 | numpy.ndarray |
| (50521, 180, 600) | (744, 180, 600) | 43.65 GB | 642.82 MB | 280 | 70 | float64 | numpy.ndarray |
| (50521, 180, 600) | (744, 180, 600) | 21.83 GB | 321.41 MB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 47, 180, 600) | (744, 47, 180, 600) | 1.03 TB | 15.11 GB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 47, 180, 600) | (744, 47, 180, 600) | 1.03 TB | 15.11 GB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 47, 180, 600) | (744, 47, 180, 600) | 1.03 TB | 15.11 GB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 47, 180, 600) | (744, 47, 180, 600) | 1.03 TB | 15.11 GB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 180, 600) | (744, 180, 600) | 21.83 GB | 321.41 MB | 210 | 70 | float32 | numpy.ndarray |
| (50521, 180, 600) | (744, 180, 600) | 21.83 GB | 321.41 MB | 210 | 70 | float32 | numpy.ndarray |


### Examine Metadata

To refer to the data variables explicitly, you can list them using the `.data_vars` property.

In [7]:
for v in ds.data_vars:
    print(v)

botz
eta
u
v
salt
temp
wspeed_u
wspeed_v


### Create and Connect to Dask Distributed Cluster

Choose the appropriate one of the following two scenarios: (1) local or VDI, or (2) Gadi HPC with the pangeo module.
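The choice between the two scenarios can also be made in code. The sketch below is only an illustration: the helper names `client_kwargs` and `connect` are hypothetical, and the default path `../scheduler.json` is taken from the cell further down. If a scheduler file exists we connect to that running cluster; otherwise `Client()` with no arguments starts a local one.

```python
import os


def client_kwargs(scheduler_file="../scheduler.json"):
    """Keyword arguments for dask.distributed.Client.

    Returns {'scheduler_file': path} when the file exists (an existing
    cluster, e.g. on Gadi), otherwise {} so that Client() starts a local one.
    """
    if os.path.exists(scheduler_file):
        return {"scheduler_file": scheduler_file}
    return {}


def connect(scheduler_file="../scheduler.json"):
    """Create a Client for whichever scenario applies."""
    # Imported here so client_kwargs can be used without Dask installed
    from dask.distributed import Client

    return Client(**client_kwargs(scheduler_file))
```

Calling `client = connect()` would then replace the two alternative cells below.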

In [8]:
from dask.distributed import Client
client = Client()
print(client)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39377 instead


<Client: 'tcp://127.0.0.1:42669' processes=4 threads=8, memory=33.56 GB>


In [10]:
from dask.distributed import Client, LocalCluster
client = Client(scheduler_file='../scheduler.json')
print(client)

<Client: 'tcp://10.6.76.35:8729' processes=96 threads=96, memory=322.12 GB>



numpy
+------------------------+---------+
|                        | version |
+------------------------+---------+
| client                 | 1.19.2  |
| scheduler              | 1.19.2  |
| tcp://10.6.76.35:32921 | 1.17.2  |
| tcp://10.6.76.35:33073 | 1.17.2  |
| tcp://10.6.76.35:33083 | 1.17.2  |
| tcp://10.6.76.35:33129 | 1.17.2  |
| tcp://10.6.76.35:33661 | 1.17.2  |
| tcp://10.6.76.35:34593 | 1.17.2  |
| tcp://10.6.76.35:34745 | 1.17.2  |
| tcp://10.6.76.35:34917 | 1.17.2  |
| tcp://10.6.76.35:34959 | 1.17.2  |
| tcp://10.6.76.35:35037 | 1.17.2  |
| tcp://10.6.76.35:35509 | 1.17.2  |
| tcp://10.6.76.35:35753 | 1.17.2  |
| tcp://10.6.76.35:35911 | 1.17.2  |
| tcp://10.6.76.35:35975 | 1.17.2  |
| tcp://10.6.76.35:36029 | 1.17.2  |
| tcp://10.6.76.35:36277 | 1.17.2  |
| tcp://10.6.76.35:36459 | 1.17.2  |
| tcp://10.6.76.35:36711 | 1.17.2  |
| tcp://10.6.76.35:36723 | 1.17.2  |
| tcp://10.6.76.35:37005 | 1.17.2  |
| tcp://10.6.76.35:37203 | 1.17.2  |
| tcp://10.6.76.35:37551 | 1.17

<div class="alert alert-info">
<b>Warning: Please make sure you specify the correct path to the scheduler.json file within your environment.</b>  
</div>

Starting the Dask Client provides a dashboard, which is useful for gaining insight into the computation. The link to the dashboard becomes visible when you create the Client. We recommend having the dashboard open on one side of your screen and your notebook open on the other, which is useful for learning purposes.

## Visually Examine Some of the Data

Let's do a sanity check that the data looks reasonable:

In [9]:
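# Note: ds.eta is already Dask-backed, so this call emits the warning below;
# ds.eta.data would return the underlying Dask array directly
da.from_array(ds.eta)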

  "Passing an object to dask.array.from_array which is already a "


| Shape | Chunk shape | Array size | Chunk size | Tasks | Chunks | Type | Chunk type |
|---|---|---|---|---|---|---|---|
| (50521, 180, 600) | (372, 90, 300) | 21.83 GB | 40.18 MB | 545 | 544 | float32 | xarray.DataArray |


In [1]:
ds.eta.coords


The surface elevation variable `eta` has three dimensions. It is a Dask array that concatenates all the files matched in this directory, with a total size of 21.83 GB, and it is recorded hourly. As opened from the files, each chunk holds up to one month of data (744 hourly steps / 24 = 31 days) over the full 180 x 600 grid. After `da.from_array`, each chunk covers 372 time steps and one spatial quarter of the grid (90 x 300), giving a chunk size of 40.18 MB.
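The sizes quoted above follow directly from the array shapes and the 4-byte `float32` item size. The short calculation below reproduces them; the shapes are copied from the displays above, while the dimension order `(time, j, i)` and the helper name `nbytes` are assumptions for illustration.

```python
import math


def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with this shape and item size."""
    return math.prod(shape) * itemsize


FLOAT32 = 4  # bytes per float32 value

eta_shape = (50521, 180, 600)   # assumed order (time, j, i)
file_chunk = (744, 180, 600)    # up to one 31-day month of hourly steps
rechunked = (372, 90, 300)      # chunk shape reported after da.from_array

eta_total_gb = nbytes(eta_shape, FLOAT32) / 1e9
file_chunk_mb = nbytes(file_chunk, FLOAT32) / 1e6
rechunked_mb = nbytes(rechunked, FLOAT32) / 1e6

# Number of chunks = product over axes of ceil(dimension / chunk length)
n_chunks = math.prod(math.ceil(n / c) for n, c in zip(eta_shape, rechunked))

print(f"eta total:       {eta_total_gb:.2f} GB")   # 21.83 GB
print(f"per-file chunk:  {file_chunk_mb:.2f} MB")  # 321.41 MB
print(f"rechunked chunk: {rechunked_mb:.2f} MB")   # 40.18 MB
print(f"rechunked count: {n_chunks}")              # 544
```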

In [11]:
ds.sel(time='2010-08-31').eta.hvplot(colormap='RdBu_r', width=900, height=550, rasterize=True)



## Timeseries of Mean Surface Elevation in this region

Here we make a simple yet fundamental calculation: the spatially averaged sea surface elevation at each time step, which is the starting point for estimating how mean sea level changes over the modelled period.

In [20]:
# the number of MB involved in the reduction
ds.eta.nbytes/1e6

21825.072

In [7]:
# the computationally intensive step
# It took about 1 hour to run on VDI using 8 cores and 33GB memory!
%time eta_timeseries = ds.eta.mean(dim=('j', 'i')).load()

CPU times: user 1.75 s, sys: 238 ms, total: 1.99 s
Wall time: 18.9 s


With 96 cores and 192GB memory on Gadi, the same calculation took only ~20 seconds (18.9 s wall time above), compared with about 1 hour on VDI.
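One reason this reduction scales so well is that the spatial mean at a given time step depends only on the data for that time step, so chunks along time can be processed independently. The NumPy sketch below illustrates the idea on a small synthetic array; it is a stand-in for what happens conceptually, not Dask's actual implementation, and the array sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for eta with dims (time, j, i); sizes are illustrative
eta = rng.normal(size=(48, 6, 10)).astype("float32")

# Split along time into two blocks of 24 steps, mimicking chunks
blocks = np.array_split(eta, 2, axis=0)

# Each block's spatial mean needs only that block, so blocks could be
# handled by different workers and the results concatenated afterwards
chunked = np.concatenate([b.mean(axis=(1, 2)) for b in blocks])

# Reference: the same reduction on the whole array at once
direct = eta.mean(axis=(1, 2))

print(np.allclose(chunked, direct, atol=1e-6))
```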

In [22]:
eta_timeseries

In [23]:
eta_full = eta_timeseries.hvplot(label='full data', grid=True,
                          title='Sea surface height above sea level', 
                          width=800, height=400)
eta_full
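A timeseries like this can be reduced to a single rate of change by fitting a straight line. The sketch below does so with `np.polyfit` on a synthetic hourly series, since the real values are not reproduced here; the oscillation, its 12.42-hour period, and the deliberately exaggerated drift of 0.1 m per year are all assumptions chosen for illustration, not properties of the eReefs data.

```python
import numpy as np

hours_per_year = 24 * 365
hours = np.arange(hours_per_year)            # one year of hourly steps

# Synthetic series: tide-like oscillation plus an exaggerated linear drift
true_rate = 0.1 / hours_per_year             # 0.1 m per year, in m per hour
series = 0.5 * np.sin(2 * np.pi * hours / 12.42) + true_rate * hours

# Least-squares straight line; the slope is the rate of change per hour
slope, intercept = np.polyfit(hours, series, deg=1)
rate_mm_per_year = slope * hours_per_year * 1000

print(f"estimated rate: {rate_mm_per_year:.1f} mm/year")
```

With the real data, `eta_timeseries.values` and the number of hours since the first time step would take the place of `series` and `hours`. Note that the oscillation biases the fitted slope slightly, so short records give less reliable rates.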

Now let's take a closer look at the sea surface variation during the first 24 hours. If you place the mouse on the plot, the value of that point will show up automatically.

In [24]:
eta_timeseries[0:24*1].hvplot(rasterize=True, colormap='RdBu_r', width=900, height=400,clim=(-2,2))



## Sea Level Variability

We can examine natural variability by looking at the standard deviation in time. The cell below does this for the temperature variable (`temp`) over the first week of January 2011.

This is another expensive calculation. On a single node (e.g. VDI) it will fail with an "out of memory" error, but it completes on Gadi.

Depending on how much memory is allocated to this instance, a Dask worker that exceeds its memory limit is restarted automatically, and this repeats until enough memory becomes available.
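A standard deviation over time does not require the whole array in memory at once: it can be accumulated block by block from a count, a sum and a sum of squares. The NumPy sketch below shows this on synthetic data. It is a simplified illustration of a chunk-wise reduction, not a description of Dask's internals, and this naive formula can lose precision when the mean is large relative to the spread.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in with dims (time, j, i); sizes and values are illustrative
data = rng.normal(loc=25.0, scale=2.0, size=(168, 4, 5))

# Running totals per grid cell: count, sum and sum of squares
n = 0
s = np.zeros(data.shape[1:])
ss = np.zeros(data.shape[1:])

for block in np.array_split(data, 7, axis=0):  # only one block in use at a time
    n += block.shape[0]
    s += block.sum(axis=0)
    ss += (block ** 2).sum(axis=0)

mean = s / n
std_streamed = np.sqrt(ss / n - mean ** 2)     # population std (ddof=0)

# Reference: standard deviation computed on the full array
std_direct = data.std(axis=0)

print(np.allclose(std_streamed, std_direct))
```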

In [8]:
%%time
temp_std = ds.sel(time=slice("2011-01-01", "2011-01-07")).temp.std(dim='time').load()
temp_std.name = 'Sea Surface Temperature Variability [C]'

CPU times: user 2.01 s, sys: 438 ms, total: 2.45 s
Wall time: 32.3 s


In [9]:
temp_std.hvplot(colormap='viridis', width=900, height=550, rasterize=True)



### Close the client

Before moving on to the next exercise, make sure to close your client or stop this kernel.

In [12]:
client.close()