# Using Intake-ESM to Analyze Data from CESM2-LE

In mid-June, the [CESM2 Large Ensemble](https://www.cesm.ucar.edu/projects/community-projects/LENS2/) dataset was made available to the public. The ensemble was run as a collaboration between the [IBS Center for Climate Physics](https://ibsclimate.org/) and the [National Center for Atmospheric Research](https://ncar.ucar.edu/). The dataset includes 100 ensemble members at one-degree spatial resolution, with each member covering 1850 to 2100. If you are interested in learning more about how this ensemble was set up, be sure to check out [the main webpage](https://www.cesm.ucar.edu/projects/community-projects/LENS2/) or read the pre-print of [Rodgers et al. 2021](https://esd.copernicus.org/preprints/esd-2021-50/), which describes the dataset in detail.

## Main Challenge

One of the main challenges with this dataset is dealing with the massive amount of output. The data are available through the [NCAR Climate Data Gateway](https://www.cesm.ucar.edu/projects/community-projects/LENS2/data-sets.html) and via the [IBS OpenDAP Server](https://climatedata.ibs.re.kr/data/cesm2-lens). A subset of the dataset is also available on the GLADE file system on NCAR HPC resources, within the directory `/glade/campaign/cgd/cesm/CESM2-LE/timeseries/`.

Through these traditional file-access methods, one would need to scroll through and find one or a few of the **millions** of files produced by the model. Fortunately, this dataset has been catalogued and is available through the [`intake-esm`](https://intake-esm.readthedocs.io/en/latest/) package, which lets you query the data and read the results into a dictionary of `xarray.Dataset` objects, so you do not have to set up the file search and concatenation yourself.

Within the [intake-esm FAQ](https://intake-esm.readthedocs.io/en/latest/supplemental-guide/faq.html) section, there is a list of existing catalogs, as shown below:

![intake-esm catalogs](images/intake_esm_catalogs.png)

At the top of the ["Is there a list of existing catalogs"](https://intake-esm.readthedocs.io/en/latest/supplemental-guide/faq.html) section, you can see the `CMIP6-GLADE` catalog, which includes:
* Description of the catalog
* Platform
* **Catalog path or url** - this is the path you will use when reading in the catalog via `intake`
* Data format
* Documentation page

---

For the CESM2-LE Catalog, we see:

**CESM2-LE-GLADE**

* Description: ESM collection for the CESM2 LENS data stored on GLADE in /glade/campaign/cgd/cesm/CESM2-LE/timeseries

* Platform: NCAR-GLADE

* Catalog path or url: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json

* Data Format: netCDF

* Documentation Page: https://www.cesm.ucar.edu/projects/community-projects/LENS2/

We are **most interested** in the catalog path here.
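
The catalog path points to a JSON file that follows the ESM collection specification used by intake-esm. As a rough sketch of what such a file holds, the snippet below parses a small hand-written stand-in. The `id`, the `catalog_file` value, and the short column list are made up for illustration and are not copied from `glade-cesm2-le.json`; the `groupby_attrs` are the ones this catalog reports when the data are read in later in this post.

```python
import json

# Hand-written stand-in for an ESM collection file (hypothetical values)
collection_text = """
{
  "esmcat_version": "0.1.0",
  "id": "example-cesm2-le",
  "description": "Illustrative ESM collection",
  "catalog_file": "example-cesm2-le.csv.gz",
  "attributes": [
    {"column_name": "component"},
    {"column_name": "stream"},
    {"column_name": "variable"}
  ],
  "assets": {"column_name": "path", "format": "netcdf"},
  "aggregation_control": {
    "variable_column_name": "variable",
    "groupby_attrs": [
      "component", "experiment", "stream",
      "forcing_variant", "control_branch_year", "variable"
    ]
  }
}
"""

collection = json.loads(collection_text)

# The table named here lists one row per file (asset) on disk
print("catalog table:", collection["catalog_file"])

# Column that stores the file path for each asset
print("path column:", collection["assets"]["column_name"])

# Attributes used to group individual files into datasets
groupby_attrs = collection["aggregation_control"]["groupby_attrs"]
print("dataset key template:", ".".join(groupby_attrs))
```

The JSON file is only the description; the actual list of files lives in the table it points to, which is why opening the catalog gives access to every asset without walking the directory tree.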

## Using the Catalog

Now that we found the catalog file, and see that it is the dataset we are interested in, we can start our analysis!

### Imports

In [2]:
# Import intake-esm
import intake

# Import dask related things, since we will need some larger resources to deal with the size of this dataset
from distributed import Client
from ncar_jobqueue import NCARCluster

# Visualization
import matplotlib.pyplot as plt

### Spin up our Dask Cluster

In [3]:
# Create our NCAR Cluster - which uses PBSCluster under the hood 
cluster = NCARCluster()

# Spin up 20 workers
cluster.scale(20)

# Assign the cluster to our Client
client = Client(cluster)

In [4]:
client

Client: connection method Cluster object, cluster type PBSCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/8787/status
Scheduler comm: tcp://10.12.206.59:33842
Workers: 9, Total threads: 18, Total memory: 209.52 GiB
Each worker: 2 threads, 23.28 GiB memory


### Read in the data catalog

As mentioned before, we use the catalog path `/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json`.

In [5]:
cat = intake.open_esm_datastore('/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json')


Let's take a second to investigate the catalog. Each file on disk represents an `asset`, which means that there are over 4 **million** files within the `/glade/campaign/cgd/cesm/CESM2-LE/timeseries/` directory.

In [6]:
cat

| | unique |
|---|---|
| component | 6 |
| stream | 25 |
| case | 36 |
| member_id | 90 |
| experiment | 2 |
| forcing_variant | 2 |
| control_branch_year | 14 |
| variable | 1868 |
| start_time | 184 |
| end_time | 185 |


### Querying for our desired variable
In this case, we are interested in the surface temperature in Boulder, Colorado. There are **numerous** temperature variables in the dataset, so we search through all the `long_name` values from the atmospheric output.

We loop through them and print the ones that include 'temperature' in the `long_name`.

In [7]:
for var in cat.search(component='atm').df.long_name.unique():
    
    if 'temperature' in var.lower():
        print(var)

sea surface temperature
Minimum reference height temperature over output period
Tropopause Temperature
Reference height temperature
Total temperature tendency
Minimum surface temperature over output period
Temperature tendency
Maximum reference height temperature over output period
Temperature
Maximum surface temperature over output period
Potential Temperature
Temperature Variance
Surface temperature (radiative)
Temperature at  700 mbar pressure surface
Temperature at 700 mbar pressure surface
Temperature at 200 mbar pressure surface
Temperature at  200 mbar pressure surface
Temperature at 1000 mbar pressure surface
Temperature at  100 mbar pressure surface
Temperature at   50 mbar pressure surface
Temperature at 500 mbar pressure surface
Temperature at  500 mbar pressure surface
Lowest model level temperature
Temperature at   10 mbar pressure surface
Temperature at 850 mbar pressure surface
Temperature at  850 mbar pressure surface
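
Several names in this list differ only in spacing (for example, the two `Temperature at 700 mbar pressure surface` entries). Here is a minimal sketch of collapsing those duplicates, using a few of the names printed above rather than the full catalog. The helper name `normalize` is ours, and the `Surface pressure` entry is added only for contrast.

```python
# A few long names copied from the output above, including spacing variants
long_names = [
    "sea surface temperature",
    "Temperature at  700 mbar pressure surface",
    "Temperature at 700 mbar pressure surface",
    "Lowest model level temperature",
    "Surface pressure",  # hypothetical non-temperature entry for contrast
]


def normalize(name):
    """Collapse runs of whitespace so spacing variants compare equal."""
    return " ".join(name.split())


# Case-insensitive match on 'temperature', then de-duplicate while keeping order
matches = []
for name in long_names:
    cleaned = normalize(name)
    if "temperature" in cleaned.lower() and cleaned not in matches:
        matches.append(cleaned)

for name in matches:
    print(name)
```

When querying the real catalog, the `long_name` has to match the stored string exactly, so the normalized form is only useful for reading the list, not for searching.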


### Query and Subset our Catalog
Let's go with `Lowest model level temperature`, since this is the closest level to the surface without being at the **actual** surface of the earth. We pass the `component` and `long_name` into the query, which reduces the number of datasets to 8!

In [8]:
subset = cat.search(component='atm', long_name='Lowest model level temperature')

In [9]:
subset

| | unique |
|---|---|
| component | 1 |
| stream | 1 |
| case | 8 |
| member_id | 40 |
| experiment | 2 |
| forcing_variant | 1 |
| control_branch_year | 4 |
| variable | 1 |
| start_time | 26 |
| end_time | 26 |
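
Conceptually, a search keeps only the catalog rows whose columns match every keyword given. The sketch below imitates that idea with plain Python on a handful of made-up rows. It is not how intake-esm is implemented; the `search` and `unique_counts` helpers and the row values are illustrative assumptions.

```python
# Made-up catalog rows; the real catalog has one row per file on disk
rows = [
    {"component": "atm", "stream": "cam.h1", "variable": "TBOT",
     "long_name": "Lowest model level temperature"},
    {"component": "atm", "stream": "cam.h1", "variable": "TREFHT",
     "long_name": "Reference height temperature"},
    {"component": "ocn", "stream": "pop.h", "variable": "TEMP",
     "long_name": "Potential Temperature"},
    {"component": "atm", "stream": "cam.h1", "variable": "TBOT",
     "long_name": "Lowest model level temperature"},
]


def search(rows, **query):
    """Keep rows where every queried column equals the requested value."""
    return [
        row for row in rows
        if all(row.get(col) == value for col, value in query.items())
    ]


def unique_counts(rows):
    """Count distinct values per column, like the summary table above."""
    columns = rows[0].keys() if rows else []
    return {col: len({row[col] for row in rows}) for col in columns}


subset_rows = search(
    rows, component="atm", long_name="Lowest model level temperature"
)

print(len(subset_rows))
print(unique_counts(subset_rows))
```

Every column in the filtered rows collapses to a single unique value here, which mirrors how `component`, `stream`, and `variable` each drop to 1 in the subset summary.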


## Read in using `.to_dataset_dict()`

We wrap the call in `dask.config.set()` with `array.slicing.split_large_chunks` enabled to help split up large chunks when reading in.

In [None]:
import dask
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = subset.to_dataset_dict(cdf_kwargs={"decode_times": True, "use_cftime": True})


--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream.forcing_variant.control_branch_year.variable'


Let's take a look at the keys - these are defined by the `groupby` attributes in the catalog. The groupby attributes in this case are:

`component.experiment.stream.forcing_variant.control_branch_year.variable`
* Component - which component this output is from (e.g., `atm` represents the atmosphere)
* Experiment - which experiment this is from, in this case, this is `ssp370` which is one of the CMIP6 future experiments
* Stream - which stream this output is from, in this case, this is `cam.h1`, which represents daily output
* Forcing Variant - which biomass burning forcing the member uses, either `cmip6` or `smbb` (smoothed biomass burning)
* Control Branch Year - which year the ensemble branched off from, these are described within the [CESM2-LE documentation page](https://www.cesm.ucar.edu/projects/community-projects/LENS2/)
* Variable - which variable you are working with
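
Since the stream name itself contains a dot (`cam.h1`), splitting a key on every `.` gives the wrong number of fields. Here is a small sketch of building and unpacking a key, assuming the attribute order listed above. The example values (`cmip6`, `1231`) are illustrative, and the helper names `build_key` and `parse_key` are ours, not part of intake-esm.

```python
GROUPBY_ATTRS = [
    "component", "experiment", "stream",
    "forcing_variant", "control_branch_year", "variable",
]


def build_key(attrs):
    """Join attribute values in groupby order, separated by dots."""
    return ".".join(str(attrs[name]) for name in GROUPBY_ATTRS)


def parse_key(key):
    """Split a key back into fields, allowing dots inside the stream name."""
    parts = key.split(".")
    component, experiment = parts[0], parts[1]
    forcing_variant, control_branch_year, variable = parts[-3:]
    stream = ".".join(parts[2:-3])  # whatever is left in the middle
    return {
        "component": component,
        "experiment": experiment,
        "stream": stream,
        "forcing_variant": forcing_variant,
        "control_branch_year": control_branch_year,
        "variable": variable,
    }


# Illustrative attribute values for one dataset
example = {
    "component": "atm",
    "experiment": "ssp370",
    "stream": "cam.h1",
    "forcing_variant": "cmip6",
    "control_branch_year": "1231",
    "variable": "TBOT",
}

key = build_key(example)
print(key)
print(parse_key(key)["stream"])
```

This relies on the other fields never containing a dot, which holds for the example values but is an assumption about the catalog as a whole.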


In [None]:
import xarray as xr

In [None]:
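def subset_ds(ds):
    # Subset for TBOT at the grid cell nearest Boulder, Colorado.
    # CESM longitudes run from 0 to 360, so convert -105.2705 with a modulo.
    da = ds.TBOT.sel(lat=40.015, lon=-105.2705 % 360, method='nearest')

    # groupby alone returns a GroupBy object, so reduce it;
    # an annual mean is assumed here
    da = da.groupby('time.year').mean()

    # Store the year labels as strings
    da['year'] = da.year.astype(str)

    return da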
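
One thing to watch when selecting a location: CESM atmosphere output uses longitudes from 0 to 360 degrees east, so a negative longitude such as -105.2705 sits nearest to the 0 degree grid column rather than to Boulder. Here is a minimal sketch of the conversion in plain Python, assuming a nominal one-degree grid with 288 longitudes spaced 1.25 degrees apart. The helpers `to_0_360` and `nearest` are ours, and `nearest` ignores wraparound at the dateline for simplicity.

```python
def to_0_360(lon):
    """Convert a longitude from the -180..180 convention to 0..360."""
    return lon % 360


def nearest(values, target):
    """Return the grid value closest to target (no wraparound handling)."""
    return min(values, key=lambda v: abs(v - target))


# Assumed nominal one-degree grid: 288 longitudes, 1.25 degrees apart
lons = [i * 1.25 for i in range(288)]

boulder_lon = -105.2705

# Without conversion, the closest grid longitude is the prime meridian
print(nearest(lons, boulder_lon))

# After conversion, the closest grid longitude is near Boulder
print(nearest(lons, to_0_360(boulder_lon)))
```

The same modulo can be applied to the `lon` argument passed to `.sel(..., method='nearest')` in xarray.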