# Intake Plugin for CMIP5

The `intake_cmip5` directory contains the complete plugin package:

In [1]:
!find ../intake_cmip5/

../intake_cmip5/
../intake_cmip5//generate_database.py
../intake_cmip5//_version.py
../intake_cmip5//__init__.py
../intake_cmip5//__pycache__
../intake_cmip5//__pycache__/_version.cpython-36.pyc
../intake_cmip5//__pycache__/generate_database.cpython-36.pyc
../intake_cmip5//__pycache__/source.cpython-36.pyc
../intake_cmip5//__pycache__/__init__.cpython-36.pyc
../intake_cmip5//source.py


`intake-cmip5` contains just one DataSource class, although a plugin package could provide several related sources.
The [source code](https://github.com/NCAR/intake-cmip5) defines a `CMIP5DataSource` class, which subclasses `intake_xarray.base.DataSourceMixin`. `CMIP5DataSource` has the following class-level attributes:

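```python
import intake_xarray.base


class CMIP5DataSource(intake_xarray.base.DataSourceMixin):
    container = 'xarray'      # output data type
    version = '0.0.1'         # plugin version
    partition_access = True   # data is loaded in partitions
    name = 'cmip5'            # plugin name, as listed in the Intake registry
```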

These attributes:
 - name the new plugin (`name`),
 - give it a version (`version`),
 - declare its output data type, an xarray container (`container`),
 - and specify that the data will be loaded in partitions (`partition_access`).

In [2]:
# Install the package
!cd .. && python setup.py develop

running develop
running egg_info
writing intake_cmip5.egg-info/PKG-INFO
writing dependency_links to intake_cmip5.egg-info/dependency_links.txt
writing entry points to intake_cmip5.egg-info/entry_points.txt
writing requirements to intake_cmip5.egg-info/requires.txt
writing top-level names to intake_cmip5.egg-info/top_level.txt
file intake_cmip5.py (for module intake_cmip5) not found
reading manifest file 'intake_cmip5.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'intake_cmip5.egg-info/SOURCES.txt'
running build_ext
Creating /Users/abanihi/opt/miniconda3/envs/pangeo/lib/python3.6/site-packages/intake-cmip5.egg-link (link to .)
intake-cmip5 0+untagged.46.g1490232.dirty is already the active version in easy-install.pth

Installed /Users/abanihi/devel/ncar/intake-cmip5
Processing dependencies for intake-cmip5==0+untagged.46.g1490232.dirty
Searching for intake-xarray==0.2.4
Best match: intake-xarray 0.2.4
Adding intake-xarray 0.2.4 to easy-install.pth f

Because the class is available at the top level of the package (i.e. in `__init__.py`) and the package name starts with `intake_`, the package is scanned when Intake is imported. The plugin then appears automatically in the set of known plugins in the Intake registry, and an associated `intake.open_cmip5` function is created at import time.
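
The discovery mechanism described above can be illustrated with a toy sketch in plain Python. This is not Intake's actual implementation: `find_plugin_packages`, `register`, `FakeCMIP5Source`, and `toy_intake` are hypothetical names invented for this example. The sketch only shows the general idea of filtering packages by the `intake_` prefix and exposing an `open_<name>` callable for each driver's class-level `name`.

```python
import pkgutil
from types import SimpleNamespace


def find_plugin_packages(prefix="intake_", names=None):
    """Return the sorted top-level module names that start with `prefix`.

    If `names` is not given, every importable top-level module is considered.
    """
    if names is None:
        names = (module.name for module in pkgutil.iter_modules())
    return sorted(name for name in names if name.startswith(prefix))


class FakeCMIP5Source:
    """Hypothetical stand-in for a driver; only `name` matters here."""

    name = "cmip5"

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def register(drivers, namespace):
    """Build a name -> class registry and attach open_<name> to `namespace`."""
    registry = {}
    for cls in drivers:
        registry[cls.name] = cls
        setattr(namespace, f"open_{cls.name}", cls)
    return registry


toy_intake = SimpleNamespace()  # plays the role of the `intake` module
toy_registry = register([FakeCMIP5Source], toy_intake)

source = toy_intake.open_cmip5(model="CanESM2")
print("cmip5" in toy_registry)  # True
print(find_plugin_packages(names=["intake_cmip5", "numpy", "intake_xarray"]))
```

The class-level `name` attribute is what ties the registry key to the generated `open_cmip5` function.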

In [3]:
import intake

In [4]:
'cmip5' in intake.registry

True

## Generate fake CMIP5 data and a CMIP5 database

For `intake-cmip5` to generate catalogs, we first need a database of the available files. To build it, we use the [CMIP5 Data Reference Syntax](https://cmip.llnl.gov/cmip5/docs/cmip5_data_reference_syntax_v1-00_clean.pdf) to interpret the directory structure and walk the directory tree with `os.walk()`, collecting the necessary information for each file. The database is then persisted to disk as a CSV file. Its columns include:
- ensemble
- experiment
- file_basename
- file_fullpath
- frequency
- realm 
- institution
- model
- varname
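
As a rough illustration of the approach described above, the following sketch walks a directory tree and derives a few of these columns from each file's location. It is not the code in `generate_database.py`: the helper names `describe_file` and `scan`, the `DRS_LEVELS` list, and the extra `product` and `mip_table` keys are assumptions made for this example, with the directory order taken from the fake test paths used later in this notebook.

```python
import os

# Assumed directory levels below the root, in the order used by the
# fake test paths in this notebook (not read from intake-cmip5 itself).
DRS_LEVELS = [
    "product", "institution", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble",
]


def describe_file(root, file_fullpath):
    """Derive database fields for one file from its path below `root`."""
    rel_dir = os.path.relpath(os.path.dirname(file_fullpath), root)
    parts = rel_dir.split(os.sep)
    if len(parts) < len(DRS_LEVELS):
        raise ValueError(f"{file_fullpath} is not nested deeply enough")

    # zip() stops at the shorter sequence, so deeper levels are ignored
    row = dict(zip(DRS_LEVELS, parts))
    basename = os.path.basename(file_fullpath)
    row["file_basename"] = basename
    row["file_fullpath"] = file_fullpath
    # The variable name is the first underscore-separated token of the name
    row["varname"] = basename.split("_")[0]
    return row


def scan(root):
    """Walk `root` and describe every netCDF file found below it."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".nc"):
                rows.append(describe_file(root, os.path.join(dirpath, name)))
    return rows
```

The resulting list of dictionaries could then be written out with `csv.DictWriter` or turned into a `pandas.DataFrame`.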

In [5]:
import os
import pandas as pd
import xarray as xr
import shutil
import tempfile
from intake_cmip5.generate_database import create_CMIP5Database

In [6]:
CMIP5_TEST_DIR = tempfile.mkdtemp()
DB_DIR = tempfile.mkdtemp()
file_names = [
    "Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc",
    "Tair_OImon_CSIRO-Mk3-6-0_historical_r2i1p1_200601-203512.nc",
]

In [7]:
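def setup():
    """Write a tiny sample dataset into a fake CMIP5 directory tree."""
    # One directory per entry in file_names, laid out like a CMIP5 archive
    test_paths = [
        f"{CMIP5_TEST_DIR}/output1/CCCma/CanESM2/rcp85/mon/atmos/Amon/r2i1p1",
        f"{CMIP5_TEST_DIR}/output2/CSIRO-QCCCE/CSIRO-Mk3-6-0/historical/mon/seaIce/OImon/r2i1p1/v1/sic",
    ]

    # Small subset of xarray's "rasm" tutorial dataset, loaded into memory
    ds = (
        xr.tutorial.open_dataset("rasm")
        .load()
        .isel(time=slice(0, 2), x=slice(0, 5), y=slice(0, 3))
    )

    # Save the same dataset under each fake CMIP5 path
    for path, file_name in zip(test_paths, file_names):
        os.makedirs(path, exist_ok=True)
        file_path = f"{path}/{file_name}"
        ds.to_netcdf(file_path, mode="w")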

In [8]:
# Generate fake cmip5 database
setup()
create_CMIP5Database(CMIP5_TEST_DIR, DB_DIR)

**** Persisting CMIP5 database in /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x/T/tmpkwyvtz9y ****


| Unnamed: 0 | ensemble | experiment | file_basename | file_fullpath | frequency | institution | model | realm | root | varname | version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | r2i1p1 | historical | Tair_OImon_CSIRO-Mk3-6-0_historical_r2i1p1_200... | /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x... | mon | CSIRO-QCCCE | CSIRO-Mk3-6-0 | seaIce | /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x... | Tair | v2 |
| 1 | r2i1p1 | rcp85 | Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc | /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x... | mon | CCCma | CanESM2 | atmos | /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x... | Tair | v2 |


In [9]:
!ls /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x/T/tmp5o57dy80

clean_cmip5_database.csv raw_cmip5_database.csv


## Use `intake-cmip5` Plugin

In [10]:
db_file_path = f"{DB_DIR}/clean_cmip5_database.csv"

In [11]:
source = intake.open_cmip5(database_file=db_file_path, model="CanESM2", experiment="rcp85",
                           frequency="mon", realm="atmos", ensemble="r2i1p1",
                           varname="Tair")
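
The keyword arguments in the call above mirror column names in the database file. The notebook does not show how `CMIP5DataSource` uses them internally, so the following is only a guess at the general idea: keep the database rows whose columns equal the requested values. The `select_rows` helper and the `/data/...` paths are hypothetical and exist only for this sketch.

```python
import csv
import io


def select_rows(csv_file, **query):
    """Return the rows of an open CSV file whose columns match `query`."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if all(row.get(column) == value for column, value in query.items())
    ]


# A two-row stand-in for clean_cmip5_database.csv (paths are made up)
sample = io.StringIO(
    "ensemble,experiment,file_fullpath,frequency,model,realm,varname\n"
    "r2i1p1,historical,/data/a.nc,mon,CSIRO-Mk3-6-0,seaIce,Tair\n"
    "r2i1p1,rcp85,/data/b.nc,mon,CanESM2,atmos,Tair\n"
)

matches = select_rows(
    sample,
    model="CanESM2",
    experiment="rcp85",
    frequency="mon",
    realm="atmos",
    ensemble="r2i1p1",
    varname="Tair",
)
print([row["file_fullpath"] for row in matches])  # ['/data/b.nc']
```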

In [12]:
source

<intake_cmip5.source.CMIP5DataSource at 0x11c4bea58>

In [13]:
source.discover()

{'datashape': None,
 'dtype': None,
 'shape': None,
 'npartitions': None,
 'metadata': {'dims': {'ensemble': 1, 'time': 2, 'x': 5, 'y': 3},
  'data_vars': {'Tair': ['time', 'xc', 'yc', 'ensemble']},
  'coords': ('time', 'xc', 'yc', 'ensemble'),
  'title': '/workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc',
  'institution': 'U.W.',
  'source': 'RACM R1002RBRxaaa01a',
  'output_frequency': 'daily',
  'output_mode': 'averaged',
  'convention': 'CF-1.4',
  'references': 'Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.',
  'comment': 'Output from the Variable Infiltration Capacity (VIC) model.',
  'nco_openmp_thread_number': 1,
  'NCO': '"4.6.0"',
  'history': 'Tue Dec 27 14:15:22 2016: ncatted -a dimensions,,d,, rasm.nc rasm.nc\nTue Dec 27 13:38:40 2016: ncks -3 rasm.nc rasm.nc\nhistory deleted for brevity'}}

In [14]:
out = source.to_dask()

In [15]:
out

<xarray.Dataset>
Dimensions:   (ensemble: 1, time: 2, x: 5, y: 3)
Coordinates:
  * time      (time) object 1980-09-16 12:00:00 1980-10-17 00:00:00
    xc        (y, x) float64 189.2 189.4 189.6 189.7 ... 188.9 189.0 189.2 189.4
    yc        (y, x) float64 16.53 16.78 17.02 17.27 ... 17.1 17.34 17.59 17.84
  * ensemble  (ensemble) <U6 'r2i1p1'
Dimensions without coordinates: x, y
Data variables:
    Tair      (ensemble, time, y, x) float64 dask.array<shape=(1, 2, 3, 5), chunksize=(1, 2, 3, 5)>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:  

In [16]:
print(source.yaml(True))

plugins:
  source:
  - module: intake_cmip5.source
sources:
  cmip5:
    args:
      database_file: /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x/T/tmpkwyvtz9y/clean_cmip5_database.csv
      ensemble: r2i1p1
      experiment: rcp85
      frequency: mon
      model: CanESM2
      realm: atmos
      varname: Tair
    description: ''
    driver: cmip5
    metadata:
      NCO: '"4.6.0"'
      comment: Output from the Variable Infiltration Capacity (VIC) model.
      convention: CF-1.4
      coords: !!python/tuple
      - time
      - xc
      - yc
      - ensemble
      data_vars:
        Tair:
        - time
        - xc
        - yc
        - ensemble
      dims:
        ensemble: 1
        time: 2
        x: 5
        y: 3
      history: "Tue Dec 27 14:15:22 2016: ncatted -a dimensions,,d,, rasm.nc rasm.nc\n\
        Tue Dec 27 13:38:40 2016: ncks -3 rasm.nc rasm.nc\nhistory deleted for brevity"
      institution: U.W.
      nco_openmp_thread_number: !!python/object/apply:numpy.core.mu