# Large Data Example Notebook

This notebook demonstrates how to compare large datasets on glade with ldcpy. In particular, we will look at data from the CESM-LENS1 project (http://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html). In doing so, we will start a Dask client from Jupyter. This notebook is meant to be run on NCAR's JupyterHub (https://jupyterhub.ucar.edu). We will use a subset of the CESM-LENS1 data on glade, located in /glade/p/cisl/asap/ldcpy_sample_data. If you do not have access to the NCAR machines, please see the AWS-LENS example notebook.

This compressed data was generated for this paper:

Allison H. Baker, Dorit M. Hammerling, Sheri A. Mickelson, Haiying Xu, Martin B. Stolpe, Philippe Naveau, Ben Sanderson, Imme Ebert-Uphoff, Savini Samarasinghe, Francesco De Simone, Francesco Carbone, Christian N. Gencarelli, John M. Dennis, Jennifer E. Kay, and Peter Lindstrom, “Evaluating Lossy Data Compression on Climate Simulation Data within a Large Ensemble.” Geoscientific Model Development, 9, pp. 4381-4403, 2016
(https://gmd.copernicus.org/articles/9/4381/2016/)




First, let's set up our environment:

In [1]:
# Add ldcpy root to system path
import sys
sys.path.insert(0,'/glade/u/home/abaker/repos/ldcpy')

# Import the ldcpy package
# Reload ldcpy automatically before each cell executes, so edits to the source are
# reflected in the notebook (this applies when the sys.path.insert(...) line above is active).
%load_ext autoreload
%autoreload 2
import ldcpy

# Display output of plots directly in Notebook
%matplotlib inline

# Silence warnings
import warnings
warnings.filterwarnings("ignore")

ModuleNotFoundError: No module named 'xrft'
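
The `ModuleNotFoundError` above means that a package needed by the import (here `xrft`) is not installed in the active kernel. One way to check for this before running the setup cell is sketched below. The helper name `missing_modules` is hypothetical, and the list of names passed to it is only illustrative, taken from packages this notebook mentions.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment.

    Hypothetical helper for illustration; it only checks whether a module
    can be located, it does not import it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Illustrative list: xrft appears in the error above, dask is used later on.
not_installed = missing_modules(["xrft", "dask"])
if not_installed:
    print("Install these packages before importing ldcpy:", not_installed)
else:
    print("All listed modules were found.")
```

If the list is not empty, installing the named packages into the kernel's environment and restarting the kernel should let the import proceed.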

Create and connect to cluster:

In [None]:
# Create a local Dask cluster with 4 workers and connect a client to it
from dask.distributed import Client

client = Client(n_workers=4)


In /glade/p/cisl/asap/ldcpy_sample_data on glade, we have TS (surface temperature), PRECT (precipitation rate), and PS (surface pressure) data from CESM-LENS1. These are all 2D variables. TS and PRECT have daily output, and PS has monthly output. We have compressed and original versions of all these variables that we would like to compare.

TO DO: Can I list what is in this directory?

In [None]:
# List the contents of the sample data directory
import os
os.listdir('/glade/p/cisl/asap/ldcpy_sample_data')

In [None]:
# Print both listings; without print(), a cell displays only its last expression
print(os.listdir('/glade/p/cisl/asap/ldcpy_sample_data/lossy'))
print(os.listdir('/glade/p/cisl/asap/ldcpy_sample_data/orig'))
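
Since we want to compare each original file with its compressed counterpart, it can help to pair the two listings by name. The sketch below assumes, based only on the two paths used later in this notebook, that a lossy file is named `c.` followed by the original file name. The helper `pair_orig_lossy` is hypothetical and is not part of ldcpy.

```python
def pair_orig_lossy(orig_names, lossy_names, prefix="c."):
    """Map each original file name to its lossy counterpart, if one exists.

    Assumption: a lossy file is named prefix + <original name>. This is
    inferred from the paths in this notebook, not from documentation.
    """
    lossy_set = set(lossy_names)
    pairs = {}
    for name in sorted(orig_names):
        candidate = prefix + name
        if candidate in lossy_set:
            pairs[name] = candidate
    return pairs


# Example using a file name that appears in this notebook. On glade, the two
# lists would instead come from os.listdir() on the orig and lossy directories.
example_pairs = pair_orig_lossy(
    ["PS.monthly.192001-200512.nc"],
    ["c.PS.monthly.192001-200512.nc"],
)
print(example_pairs)
```

Original files without a matching lossy file are simply left out of the result, so the dictionary can be used directly to build the file lists for a comparison.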

First let's look at the monthly surface pressure (PS) data:

TO DO: Can I concatenate the two time periods?

In [None]:
# Load the first 86 years (1920-2005) of monthly surface pressure (PS)
ds_PS = ldcpy.open_datasets(
    ["PS"],
    [
        '/glade/p/cisl/asap/ldcpy_sample_data/orig/PS.monthly.192001-200512.nc',
        '/glade/p/cisl/asap/ldcpy_sample_data/lossy/c.PS.monthly.192001-200512.nc',
    ],
    ['orig', 'lossy'],
)

Let's look at the comparison statistics for the first time slice:

In [None]:
ldcpy.print_stats(ds_PS, 'PS', 'orig', 'lossy', time=0)
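
To give a feel for what a comparison between original and lossy values involves, here is a minimal pure-Python sketch of three common error measures. It is not ldcpy's implementation and does not reproduce the full set of statistics that `print_stats` reports. The function name `compare_values` is hypothetical, and the input numbers are toy values, not CESM data.

```python
import math


def compare_values(orig, lossy):
    """Compute a few simple error measures between two equal-length sequences.

    Illustrative only: the error is defined here as orig - lossy.
    """
    if len(orig) != len(lossy) or len(orig) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [o - l for o, l in zip(orig, lossy)]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "max_abs_error": max(abs(e) for e in errors),
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


# Toy values chosen so the errors are easy to follow by hand.
stats = compare_values([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 3.5])
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```

For real fields of this size, array libraries would be the natural choice; plain lists are used here only so the sketch runs without extra packages.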

In [None]:
# Load the first 86 years (1920-2005) of surface temperature (TS) data
ds_TS = ldcpy.open_datasets(
    ["TS"],
    [
        '/glade/p/cisl/asap/ldcpy_sample_data/orig/TS.monthly.192001-200512.nc',
        '/glade/p/cisl/asap/ldcpy_sample_data/lossy/c.TS.monthly.192001-200512.nc',
    ],
    ['orig', 'lossy'],
)

In [None]:
client.close()