> **Data used**
>
> If teaching this lesson in a classroom, keep a copy of the dataset on hand on external media, such as a USB drive, in case of wifi limitations. For remote teaching, note that the data used here is quite large; if network issues arise, participants should instead use the smaller precipitation data files from the previous chapters.

So far we've been working with small, individual data files that can be comfortably read into memory on a modern laptop. What if we wanted to process a bigger dataset that consists of many files and/or much larger file sizes? For instance, now that we've plotted a global map showing the ACCESS-ESM1-5 precipitation climatology, we might want to create a similar global map showing the daily maximum precipitation over the 1850-2014 period.

Rather than download all the ACCESS-CM2 daily precipitation files to our laptop, we are going to make use of the fact that the National Computational Infrastructure (NCI) in Canberra, Australia has made its archive of CMIP6 data remotely available via a "THREDDS" (or TDS) server. A THREDDS server can serve data using OPeNDAP, a protocol for accessing remote netCDF data over a network as though it were a local file on your computer. In Python, we use the `siphon` library to query THREDDS catalogues to find available files.


In [14]:
from siphon.catalog import TDSCatalog

In [19]:
cat = TDSCatalog("http://dapds00.nci.org.au/thredds/catalog/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/latest/catalog.xml")

In [3]:
print("\n".join(cat.datasets.keys()))

pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_18500101-18991231.nc
pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19000101-19491231.nc
pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19500101-19991231.nc
pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_20000101-20141231.nc
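
Before stitching files together, it can be worth checking that the date ranges encoded in the file names cover the whole period without gaps. The following is a minimal sketch using only the standard library. The helper name `date_range_from_name` is hypothetical, and the sketch assumes the naming pattern seen in the listing above, where each name ends in `_YYYYMMDD-YYYYMMDD.nc`, and a standard Gregorian calendar.

```python
from datetime import date, timedelta

# File names copied from the catalogue listing above.
file_names = [
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_18500101-18991231.nc",
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19000101-19491231.nc",
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19500101-19991231.nc",
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_20000101-20141231.nc",
]


def date_range_from_name(name):
    """Return (start, end) dates parsed from a '..._YYYYMMDD-YYYYMMDD.nc' name."""
    stem = name.rsplit(".", 1)[0]  # drop the '.nc' extension
    start_str, end_str = stem.rsplit("_", 1)[1].split("-")

    def to_date(stamp):
        return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))

    return to_date(start_str), to_date(end_str)


ranges = sorted(date_range_from_name(name) for name in file_names)

# Each file should start the day after the previous file ends.
gaps = [
    (prev_end, next_start)
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
    if next_start - prev_end != timedelta(days=1)
]

total_days = sum((end - start).days + 1 for start, end in ranges)
print("gaps:", gaps)
print("total days:", total_days)
```

An empty `gaps` list means the four files are contiguous, and `total_days` can be compared with the length of the `time` dimension once the dataset is opened.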


We can see that the daily precipitation data for ACCESS-CM2 is spread across four data files. To access those files, we just need to prepend the appropriate OPeNDAP root URL to each file name:

In [24]:
file_list=list(cat.datasets.keys())
DAProot='https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108/'
accessesm15_pr_file_list = [ DAProot+f for f in file_list ]
print(accessesm15_pr_file_list)

['https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_18500101-18991231.nc', 'https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19000101-19491231.nc', 'https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19500101-19991231.nc', 'https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_20000101-20141231.nc']
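
Plain string concatenation only works if the root URL ends with a `/`. As a small sketch of a more forgiving approach, the hypothetical helper `opendap_urls` below joins a root and a list of file names with exactly one slash between them. The root URL and file names are copied from the cell above; nothing here contacts the server.

```python
def opendap_urls(root, file_names):
    """Join an OPeNDAP root URL and file names with exactly one '/' between them."""
    base = root.rstrip("/")
    return [f"{base}/{name.lstrip('/')}" for name in file_names]


# Root URL as used above, here deliberately written without the trailing slash.
root = (
    "https://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO-ARCCSS/"
    "ACCESS-CM2/historical/r1i1p1f1/day/pr/gn/v20191108"
)
names = [
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_18500101-18991231.nc",
    "pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_19000101-19491231.nc",
]

urls = opendap_urls(root, names)
for url in urls:
    print(url)
```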


Now that we have our list of files, our first instinct might be to proceed just as we did for the earlier precipitation climatology calculations. The first problem we run into, however, is that `xr.open_dataset` only accepts a single input file. What's more, even if we wrote a loop to process the files one at a time, each individual file is so large that a request for all of its data fails:

In [34]:
import xarray as xr

dset = xr.open_dataset(accessesm15_pr_file_list[0])
#daily_max = dset['pr'].max('time', keep_attrs=True)
#daily_max.data = daily_max.data * 86400

In [35]:
dset

In [36]:
dset['pr'].max('time', keep_attrs=True)

RuntimeError: NetCDF: Access failure

(That error occurs because the request exceeds the size limit of the THREDDS server.)
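
A rough size estimate suggests why a request for a whole file is a problem. The sketch below assumes the first file holds one value per day from 1850 to 1899 on a 144 x 192 grid of 4-byte floats (the grid size and `float32` type reported for `pr` later in this section). The actual limit is set in the server's configuration and is not stated here.

```python
from datetime import date

# Number of daily time steps in the 1850-1899 file (Gregorian calendar assumed).
n_days = (date(1899, 12, 31) - date(1850, 1, 1)).days + 1

n_lat, n_lon = 144, 192  # grid size reported by xarray for this dataset
bytes_per_value = 4      # float32

request_bytes = n_days * n_lat * n_lon * bytes_per_value
print(f"{n_days} days -> {request_bytes / 1e9:.2f} GB uncompressed")
```

Taking the maximum over `time` for one file means pulling every one of those values across the network, which is why the data needs to be requested in smaller pieces.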

We can use `xarray` to open a "multifile" dataset as though it were a single file. We'll load a few libraries we might need here.

In [6]:
import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np

Recall that when we first open data in xarray it simply ("lazily") loads the metadata associated with the data and shows summary information about the contents of the dataset.
**This may take a little time for a large multifile dataset!**

In [7]:
dset = xr.open_mfdataset(accessesm15_pr_file_list, combine='by_coords', chunks={'time':'100MB'})
print(dset)

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 144, lon: 192, time: 60265)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-01T12:00:00 ... 2014-12-31T12:00:00
  * lat        (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon        (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(18262, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(18262, 144, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(18262, 192, 2), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(794, 144, 192), meta=np.ndarray>
Attributes:
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           0.0
    ...

We can see that our `dset` object is an `xarray.Dataset`, but notice now that each variable has type `dask.array`, meaning that xarray is aware of the netCDF "chunks" (how the data is packed in the files), and we'll be able to parallelise across these if we need/want to.

In this case, we are interested in the precipitation (`pr`) variable contained within that xarray Dataset:

In [8]:
print(dset['pr'])

<xarray.DataArray 'pr' (time: 60265, lat: 144, lon: 192)>
dask.array<concatenate, shape=(60265, 144, 192), dtype=float32, chunksize=(904, 144, 192), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1850-01-01T12:00:00 ... 2014-12-31T12:00:00
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    comment:        includes both liquid and solid phases
    units:          kg m-2 s-1
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-11-09T02:13:28Z altered by CMOR: replaced missing va...
    _ChunkSizes:    [  1 144 192]


Notice that we now have an attribute `_ChunkSizes` listed. It has the value `[1 144 192]`, while the `dask.array` itself has shape (60265, 144, 192) and chunksize (904, 144, 192).
This means that the underlying data is structured to be most efficiently accessed for the whole lat/lon range at each time step, but dask will load up to 904 of these "slices" at once, out of a combined dataset of 60265 time steps.
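
The dask chunk length along `time` can be reproduced with simple arithmetic. This sketch assumes that the `'100MB'` target is interpreted as 100 x 10^6 bytes and that each dask chunk spans the full lat/lon grid of `float32` values. Under those assumptions it gives the chunksize of 904 time steps shown in the `pr` summary above. Chunk boundaries in the real dataset also depend on where one file ends and the next begins, so the chunk count here is only a lower bound.

```python
import math

target_bytes = 100 * 10**6  # assumed meaning of '100MB'
n_lat, n_lon = 144, 192     # grid size from the dataset summary
bytes_per_value = 4         # float32
n_time = 60265              # total number of daily time steps

bytes_per_step = n_lat * n_lon * bytes_per_value
steps_per_chunk = target_bytes // bytes_per_step
min_chunks = math.ceil(n_time / steps_per_chunk)

print("bytes per time step:", bytes_per_step)
print("time steps per dask chunk:", steps_per_chunk)
print("at least this many chunks:", min_chunks)
```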

So far we have not loaded any data, only metadata. Operating on this data is likely to be slow! But let's try calculating the daily maximum precipitation at each grid point, in the same way that we calculated the precipitation climatology in the Visualisation episode.

In [9]:
pr_max = dset['pr'].max('time', keep_attrs=True)
print(pr_max)

<xarray.DataArray 'pr' (lat: 144, lon: 192)>
dask.array<nanmax-aggregate, shape=(144, 192), dtype=float32, chunksize=(144, 192), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38
  * lon      (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1
Attributes:
    standard_name:  precipitation_flux
    long_name:      Precipitation
    comment:        includes both liquid and solid phases
    units:          kg m-2 s-1
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-11-09T02:13:28Z altered by CMOR: replaced missing va...
    _ChunkSizes:    [  1 144 192]


But wait! That was very fast! Why is that?
(**Hint:** consider lazy loading and xarray operations. What have we actually done in the step above?)

We can investigate how chunks affect how quickly we can actually read the data. To move from metadata objects to actual data, we call `.load()` or `.compute()`, which tell dask to carry out the work it has deferred so far.
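
The idea behind lazy evaluation can be illustrated without dask. The toy class below is hypothetical and is not how dask is implemented: it only records where the chunks come from when it is created, and the chunk-by-chunk maximum runs when `compute()` is called. This loosely mirrors the `nanmax-aggregate` step named in the output above, where a maximum is taken per chunk and the results are then combined.

```python
class LazyMax:
    """Toy lazy reduction: nothing is read until compute() is called."""

    def __init__(self, chunk_source):
        self._chunk_source = chunk_source  # callable that yields chunks
        self.chunks_read = 0

    def compute(self):
        """Read the chunks one at a time and combine their maxima."""
        running_max = None
        for chunk in self._chunk_source():
            self.chunks_read += 1
            chunk_max = max(chunk)
            if running_max is None or chunk_max > running_max:
                running_max = chunk_max
        return running_max


def fake_chunks():
    """Stand-in for reading chunks of a time series from disk or a server."""
    yield [0.1, 2.5, 0.0]
    yield [7.2, 1.1, 3.3]
    yield [0.4, 0.9, 6.8]


lazy = LazyMax(fake_chunks)
print("chunks read before compute:", lazy.chunks_read)
result = lazy.compute()
print("max:", result, "| chunks read after compute:", lazy.chunks_read)
```

Creating the object is instant because no chunk is touched; all of the cost is paid inside `compute()`.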

> ## Changing chunks
> If we decide to change chunking to improve performance, note we
> can control the size of dask chunks used, but they *must* align
> with the netCDF file chunks or we will certainly make performance worse!
{: .callout}

> ## Investigating chunks
>
> Time how long it takes to load the precipitation data for `'2014-01-01T12:00:00'`
> and then time how long it takes to load the full time series at the grid point
> nearest the equator and 180E.
> How much difference in time is there when using these different
> (time slice vs time series) access methods?
> **Hint:** Use the `%%time` magic to get a single timing, or `%%timeit`
> to get an average time,
> but note that an initial load will be much slower than subsequent calls!
>
> > ## Solution
> > ~~~
> > %%time
> > # Run in its own notebook cell: one time step, all lat/lon
> > dset.pr.sel(time='2014-01-01T12:00:00').load()
> >
> > %%time
> > # Run in a separate cell: all time steps at a single grid point
> > dset.pr.sel(lat=0, lon=180, method='nearest').load()
> > ~~~
> > {: .language-python}
> >
> > We see that the first call (all lat/lon at a single time step)
> > is orders of magnitude faster than extracting all time steps at
> > a single point location with the current dataset chunking
> > (for me it was ~1 sec vs ~5 min using a single core).
> {: .solution}
{: .challenge}
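
The gap between the two access patterns can be estimated by counting how many netCDF file chunks each request touches. The sketch below assumes the `_ChunkSizes` value of `[1 144 192]` and the 60265 time steps reported above. The helper `chunks_touched` is hypothetical and assumes the selection starts on a chunk boundary.

```python
import math


def chunks_touched(selection_shape, chunk_shape):
    """Count file chunks overlapped by a selection aligned to chunk boundaries."""
    return math.prod(
        math.ceil(sel / chunk) for sel, chunk in zip(selection_shape, chunk_shape)
    )


file_chunk = (1, 144, 192)   # (time, lat, lon) chunk stored in the netCDF file
time_slice = (1, 144, 192)   # one time step, whole grid
time_series = (60265, 1, 1)  # every time step, one grid point

print("time slice touches:", chunks_touched(time_slice, file_chunk), "chunk(s)")
print("time series touches:", chunks_touched(time_series, file_chunk), "chunk(s)")
```

A single map needs one file chunk, while a single-point time series has to read a chunk for every time step even though it keeps only one value from each.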

Now let's look at that daily maximum. What type of data is it?

In [10]:
type(pr_max.data)

dask.array.core.Array

Let's start a `dask` "client" to allow the next calculation to be handled in parallel.

In [11]:
from dask.distributed import Client
client = Client()
client

Client:   Scheduler: tcp://127.0.0.1:52296   Dashboard: http://127.0.0.1:8787/status
Cluster:  Workers: 4   Cores: 4   Memory: 17.18 GB


In [12]:
%%time
pr_max.compute()

CPU times: user 1min 15s, sys: 10.5 s, total: 1min 26s
Wall time: 18min 40s
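
The computed result is still a precipitation flux in kg m-2 s-1. One kilogram of water spread over one square metre is 1 mm deep, so multiplying by the number of seconds in a day (86400, the factor in the commented-out code earlier) gives mm/day. A minimal sketch on plain numbers, with a hypothetical helper name and an example value chosen purely for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def flux_to_mm_per_day(flux_kg_m2_s):
    """Convert a precipitation flux in kg m-2 s-1 to mm/day."""
    return flux_kg_m2_s * SECONDS_PER_DAY


example_flux = 0.001  # illustrative value only, not taken from the dataset
print(flux_to_mm_per_day(example_flux), "mm/day")
```

With the xarray objects above, the same factor could be applied to the computed array, remembering to set the `units` attribute to match.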
