# NCAR JupyterHub Large Data Example Notebook

*Note: If you do not have access to the NCAR machine, please look at the AWS-LENS example notebook instead.*

This notebook demonstrates how to compare large datasets on glade with ldcpy.  In particular, we will look at data from the CESM-LENS1 project (http://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html). In doing so, we will start a Dask client from Jupyter.  This notebook is meant to be run on NCAR's JupyterHub (https://jupyterhub.ucar.edu). We will use a subset of the CESM-LENS1 data on glade, located in /glade/p/cisl/asap/ldcpy_sample_data.

We assume that you have a copy of the ldcpy code on NCAR's glade filesystem, obtained via:<br>
    `git clone https://github.com/NCAR/ldcpy.git`

When you launch your NCAR JupyterHub session, you will need to indicate a machine (Cheyenne or Casper) and then your charge account.  You can then launch the session and navigate to this notebook.  

Here's a good resource for using NCAR's JupyterHub:<br>
https://ncar-hackathons.github.io/jupyterlab-tutorial/jhub.html


**You need to run your notebook with the "cmip6-201910" kernel (choose from the dropdown in the upper left).**



Note that the compressed data that we are using was generated for this paper:

Allison H. Baker, Dorit M. Hammerling, Sheri A. Mickelson, Haiying Xu, Martin B. Stolpe, Philippe Naveau, Ben Sanderson, Imme Ebert-Uphoff, Savini Samarasinghe, Francesco De Simone, Francesco Carbone, Christian N. Gencarelli, John M. Dennis, Jennifer E. Kay, and Peter Lindstrom, “Evaluating Lossy Data Compression on Climate Simulation Data within a Large Ensemble.”  Geoscientific Model Development, 9, pp. 4381-4403, 2016
(https://gmd.copernicus.org/articles/9/4381/2016/)




## Setup
First, let's set up our environment.  You will need to modify the path to indicate where you have cloned ldcpy.

In [5]:
# Make sure you are using the cmip6-201910 kernel (see above)

# Add ldcpy root to system path (MODIFY FOR YOUR LDCPY CODE)
import sys
sys.path.insert(0,'/glade/u/home/abaker/repos/ldcpy')
import ldcpy

# Display output of plots directly in Notebook
%matplotlib inline

#silence warnings
import warnings
warnings.filterwarnings("ignore")
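If `import ldcpy` fails in the cell above, the most likely cause is a path that does not point at your clone. The sketch below shows one way to check the directory before adding it to `sys.path`. The helper name `add_repo_to_path` is hypothetical (it is not part of ldcpy), and a temporary directory stands in for the clone so the example runs anywhere.

```python
import os
import sys
import tempfile


def add_repo_to_path(repo_dir):
    """Insert repo_dir at the front of sys.path if it is an existing directory.

    Hypothetical helper (not part of ldcpy). Returns True if the directory
    exists and is now on sys.path, False otherwise.
    """
    repo_dir = os.path.abspath(os.path.expanduser(repo_dir))
    if not os.path.isdir(repo_dir):
        return False
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return True


# Demonstration: a temporary directory stands in for your ldcpy clone.
with tempfile.TemporaryDirectory() as demo_dir:
    found = add_repo_to_path(demo_dir)
    missing = add_repo_to_path(os.path.join(demo_dir, "no_such_clone"))
    # Undo the demo insertion so sys.path is left as we found it.
    sys.path.remove(os.path.abspath(demo_dir))

print(found, missing)
```

In the notebook itself you would pass the path to your own clone and only run `import ldcpy` if the helper returns `True`.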

Create and connect to cluster:

In [7]:
#create cluster with 4 workers
from dask.distributed import Client
client = Client(n_workers=4)
client

Client
  Scheduler: tcp://127.0.0.1:43275
  Dashboard: https://jupyterhub.ucar.edu/ch/user/abaker/proxy/36887/status
Cluster
  Workers: 4
  Cores: 72
  Memory: 1.07 GB



*Note: You can click on the Dashboard link above to see your Dask dashboard!*

## Get data

In /glade/p/cisl/asap/ldcpy_sample_data on glade, we have TS (surface temperature), PRECT (precipitation rate), and PS (surface pressure) data from CESM-LENS1. These are all 2D variables.  TS and PRECT have daily output, and PS has monthly output.  We have both the compressed and the original versions of all these variables, and we would like to compare them.

First we list what is in this directory (two subdirectories):

In [3]:
#list directory contents
import os
os.listdir('/glade/p/cisl/asap/ldcpy_sample_data')

['lossy', 'orig']

Now we look at the contents of each subdirectory.  We have 6 files in each, consisting of 2 different timeseries files for each variable (1920-2005 and 2006-2080).

In [16]:
#list lossy directory contents
lossy_files = os.listdir('/glade/p/cisl/asap/ldcpy_sample_data/lossy')
lossy_files

['c.TS.daily.19200101-20051231.nc',
 'c.PRECT.daily.20060101-20801231.nc',
 'c.PS.monthly.200601-208012.nc',
 'c.PRECT.daily.19200101-20051231.nc',
 'c.TS.daily.20060101-20801231.nc',
 'c.PS.monthly.192001-200512.nc']
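The names in this listing appear to follow the pattern `c.<variable>.<frequency>.<start>-<end>.nc`, where the leading `c.` marks a lossy (compressed) file. That pattern is inferred from the listing, not from any documented convention. Assuming it holds, a small hypothetical helper such as `parse_sample_name` (not part of ldcpy) can split a name into its parts:

```python
def parse_sample_name(filename):
    """Split a sample file name into variable, frequency, and date range.

    Hypothetical helper; the naming pattern is inferred from the directory
    listing, with an optional leading "c." marking a lossy file.
    """
    parts = filename.split(".")
    is_lossy = parts[0] == "c"
    if is_lossy:
        parts = parts[1:]
    if len(parts) != 4 or parts[3] != "nc":
        raise ValueError(f"unexpected file name: {filename}")
    variable, frequency, date_range, _ = parts
    start, end = date_range.split("-")
    return {
        "variable": variable,
        "frequency": frequency,
        "start": start,
        "end": end,
        "lossy": is_lossy,
    }


info = parse_sample_name("c.TS.daily.19200101-20051231.nc")
print(info)
```

This makes it easy to select, for example, only the daily files or only the 1920-2005 period before opening anything.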

In [17]:
#list orig (i.e., uncompressed) directory contents
orig_files = os.listdir('/glade/p/cisl/asap/ldcpy_sample_data/orig')
orig_files

['PRECT.daily.20060101-20801231.nc',
 'PS.monthly.192001-200512.nc',
 'PRECT.daily.19200101-20051231.nc',
 'TS.daily.20060101-20801231.nc',
 'PS.monthly.200601-208012.nc',
 'TS.daily.19200101-20051231.nc']
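Because `os.listdir` returns names in arbitrary order, the two listings do not line up entry by entry. The sketch below pairs each original file with its lossy counterpart, assuming (as the listings suggest) that the lossy name is the original name with a `c.` prefix. The function `pair_files` is a hypothetical helper, and the file names are copied from the outputs above so the example runs without access to glade.

```python
orig_names = [
    'PRECT.daily.20060101-20801231.nc',
    'PS.monthly.192001-200512.nc',
    'PRECT.daily.19200101-20051231.nc',
    'TS.daily.20060101-20801231.nc',
    'PS.monthly.200601-208012.nc',
    'TS.daily.19200101-20051231.nc',
]
lossy_names = [
    'c.TS.daily.19200101-20051231.nc',
    'c.PRECT.daily.20060101-20801231.nc',
    'c.PS.monthly.200601-208012.nc',
    'c.PRECT.daily.19200101-20051231.nc',
    'c.TS.daily.20060101-20801231.nc',
    'c.PS.monthly.192001-200512.nc',
]


def pair_files(orig_names, lossy_names, prefix="c."):
    """Map each original file name to its lossy counterpart, if one exists.

    Hypothetical helper; assumes lossy name == prefix + original name.
    """
    lossy_set = set(lossy_names)
    pairs = {}
    for name in sorted(orig_names):
        candidate = prefix + name
        if candidate in lossy_set:
            pairs[name] = candidate
    return pairs


pairs = pair_files(orig_names, lossy_names)
unmatched = sorted(set(orig_names) - set(pairs))

for orig_name, lossy_name in pairs.items():
    print(orig_name, "<->", lossy_name)
print("unmatched originals:", unmatched)
```

In the notebook you could pass `orig_files` and `lossy_files` directly instead of the hard-coded lists.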

How big are these files (in GB)?

In [33]:
# Print the size of each file in GB (1 GB = 1e9 bytes)
data_dir = '/glade/p/cisl/asap/ldcpy_sample_data'
print('Original files')
for f in orig_files:
    print(f, " ", os.stat(os.path.join(data_dir, 'orig', f)).st_size / 1e9)
print('-----------')
print('Lossy files')
for f in lossy_files:
    print(f, " ", os.stat(os.path.join(data_dir, 'lossy', f)).st_size / 1e9)

Original files
PRECT.daily.20060101-20801231.nc   4.999594482
PS.monthly.192001-200512.nc   0.129124682
PRECT.daily.19200101-20051231.nc   5.732874059
TS.daily.20060101-20801231.nc   3.504791459
PS.monthly.200601-208012.nc   0.112432632
TS.daily.19200101-20051231.nc   4.042316511
-----------
Lossy files
c.TS.daily.19200101-20051231.nc   6.966712275
c.PRECT.daily.20060101-20801231.nc   6.075658372
c.PS.monthly.200601-208012.nc   0.199824373
c.PRECT.daily.19200101-20051231.nc   6.966712311
c.TS.daily.20060101-20801231.nc   6.075658336
c.PS.monthly.192001-200512.nc   0.229107956
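The size report above can also be wrapped in a reusable function that sorts the names and aligns the columns. This is only a sketch: `size_in_gb` and `format_sizes` are hypothetical helpers, and a temporary directory with a small dummy file stands in for the glade directories so the example runs anywhere. Sizes use decimal gigabytes (1e9 bytes), matching the cell above.

```python
import os
import tempfile


def size_in_gb(path):
    """Return the size of a file in decimal gigabytes (1 GB = 1e9 bytes)."""
    return os.path.getsize(path) / 1e9


def format_sizes(directory):
    """Return one aligned 'name  size GB' line per regular file, sorted by name."""
    lines = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            lines.append(f"{name:<40s} {size_in_gb(full_path):10.6f} GB")
    return lines


# Demonstration with a 2,000,000-byte dummy file in a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "demo.nc"), "wb") as fh:
        fh.write(b"\0" * 2_000_000)
    report = format_sizes(demo_dir)

for line in report:
    print(line)
```

On glade, calling `format_sizes` on the `orig` and `lossy` subdirectories would produce the same numbers as the loop above, in sorted order.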


Let's look at the surface pressure (PS) data.

In [38]:
# Load the first 86 years of surface pressure (PS) data
ds_PS = ldcpy.open_datasets(
    ["PS"],
    [
        '/glade/p/cisl/asap/ldcpy_sample_data/orig/PS.monthly.192001-200512.nc',
        '/glade/p/cisl/asap/ldcpy_sample_data/lossy/c.PS.monthly.192001-200512.nc',
    ],
    ['orig', 'lossy'],
)

chucks set to (default) {'time', 50}
dataset size in GB 0.46



Let's look at the comparison statistics at the first timeslice:

In [37]:
ldcpy.print_stats(ds_PS, 'PS', 'orig', 'lossy', time=0)

Comparing orig data (set1) to lossy data (set2)


KeyboardInterrupt: 

Now let's look at surface temperature (TS) data.

In [None]:
# Load the first 86 years of surface temperature (TS) data
# (TS is daily output, so we use the daily files listed above)
ds_TS = ldcpy.open_datasets(
    ["TS"],
    [
        '/glade/p/cisl/asap/ldcpy_sample_data/orig/TS.daily.19200101-20051231.nc',
        '/glade/p/cisl/asap/ldcpy_sample_data/lossy/c.TS.daily.19200101-20051231.nc',
    ],
    ['orig', 'lossy'],
)

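And again, we can look at the comparison statistics at the first timeslice with the same `ldcpy.print_stats` call used for PS, substituting `ds_TS` and `'TS'`.

When you are finished, close the Dask client: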

In [None]:
client.close()