# Quick overview

The `intake_cmip` directory contains the full Python package that implements the `intake-cmip` plugin.

## Intake-cmip structure

In [1]:
!find ../../../intake_cmip/

../../../intake_cmip/
../../../intake_cmip//config.py
../../../intake_cmip//_version.py
../../../intake_cmip//cmip5.csv
../../../intake_cmip//database.py
../../../intake_cmip//__init__.py
../../../intake_cmip//__pycache__
../../../intake_cmip//__pycache__/database.cpython-36.pyc
../../../intake_cmip//__pycache__/_version.cpython-36.pyc
../../../intake_cmip//__pycache__/config.cpython-36.pyc
../../../intake_cmip//__pycache__/cmip5.cpython-36.pyc
../../../intake_cmip//__pycache__/__init__.cpython-36.pyc
../../../intake_cmip//cmip5.py


Intake-cmip currently contains a single DataSource class, although it could hold several related sources.
The [`cmip5` module](https://github.com/NCAR/intake-cmip) defines the `CMIP5DataSource` class, which subclasses [`intake_xarray.base.DataSourceMixin`](https://github.com/ContinuumIO/intake-xarray/blob/master/intake_xarray/base.py) from the `intake-xarray` plugin.


`CMIP5DataSource` has the following class-level attributes:

```python

class CMIP5DataSource(intake_xarray.base.DataSourceMixin):
    container = 'xarray'
    version = '0.0.1'
    partition_access = True
    name = 'cmip5'
```

These attributes specify:

 - a name for the new plugin,
 - a version,
 - an output data type (the `xarray` container), and
 - that the data will always be loaded in partitions.

In [2]:
!ls ../../../

LICENSE              docs                 setup.cfg
MANIFEST.in          environment-dev.yml  setup.py
README.rst           intake_cmip          tests
__pycache__          intake_cmip.egg-info versioneer.py
ci                   readthedocs.yml


In [3]:
# Install the package
!cd ../../../ && pip install -e . > /dev/null

Since the class is exposed at the top level of the package (i.e., in `__init__.py`) and the package name starts with `intake_`, the package is scanned when Intake is imported. The plugin then appears automatically in the set of known plugins in the Intake registry, and an associated `intake.open_cmip5` function is created at import time.
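
The naming-convention scan described above can be illustrated with the standard library alone. This is only a sketch of the idea, not Intake's actual discovery code, which may differ between versions; the helper name `find_prefixed_modules` is hypothetical.

```python
import pkgutil


def find_prefixed_modules(prefix="intake_"):
    """Return sorted names of importable top-level modules starting with `prefix`.

    Hypothetical helper: it mimics a prefix-based plugin scan and is not
    part of Intake.
    """
    return sorted(
        module.name
        for module in pkgutil.iter_modules()
        if module.name.startswith(prefix)
    )


# With intake-cmip installed, this list would be expected to include "intake_cmip".
candidates = find_prefixed_modules()
print(candidates)
```

If `intake_cmip` is missing from the printed list, the package is probably not installed in the active environment.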

In [4]:
import intake

In [5]:
'cmip5' in intake.registry

True

##  Intake-cmip Database

For `intake-cmip` to generate catalogs, we first need a database of netCDF files and their corresponding attributes. To build this database, we use the [CMIP5 Data Reference Syntax](https://cmip.llnl.gov/cmip5/docs/cmip5_data_reference_syntax_v1-00_clean.pdf) to interpret the directory structure: walking the tree with `os.walk()`, we extract the necessary information for each file. The database is then persisted to disk as a CSV file.
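
A minimal sketch of this approach follows. It is not the implementation behind `create_cmip5_database`. It assumes the directory levels below the root follow the order `product/institution/model/experiment/frequency/realm/mip_table/ensemble`, and the names `DRS_FIELDS`, `build_records`, and `write_database` are hypothetical.

```python
import csv
import os
import tempfile

# Assumed order of directory levels below the root (an illustration only).
DRS_FIELDS = [
    "product", "institution", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble",
]


def build_records(root):
    """Walk `root` and return one dict of attributes per netCDF file found."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) < len(DRS_FIELDS):
            continue  # not deep enough in the tree to hold data files
        for filename in sorted(filenames):
            if not filename.endswith(".nc"):
                continue
            # zip() stops at the shorter sequence, so deeper levels are ignored.
            record = dict(zip(DRS_FIELDS, parts))
            record["file_basename"] = filename
            record["file_fullpath"] = os.path.join(dirpath, filename)
            # Assumption: the file name starts with the variable name.
            record["varname"] = filename.split("_")[0]
            records.append(record)
    return records


def write_database(records, csv_path):
    """Persist the records to a CSV file, one row per netCDF file."""
    fieldnames = DRS_FIELDS + ["file_basename", "file_fullpath", "varname"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


# Demo on a throwaway tree containing a single empty placeholder file.
demo_root = tempfile.mkdtemp()
leaf = os.path.join(
    demo_root, "output1", "CCCma", "CanESM2", "rcp85", "mon", "atmos", "Amon", "r2i1p1"
)
os.makedirs(leaf)
with open(os.path.join(leaf, "Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc"), "w"):
    pass

records = build_records(demo_root)
write_database(records, os.path.join(demo_root, "cmip5.csv"))
print(records[0]["model"], records[0]["experiment"], records[0]["varname"])
```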

This file has the following columns:

- **ensemble** : `r<N>i<M>p<L>`. This triad of integers (N, M, L), formatted as shown (e.g., "r3i1p21"), distinguishes among closely related simulations by a single model. All three are required even if only a single simulation is performed.
- **experiment** : identifies either the experiment or both the experiment family and a specific type within that experiment family.
- **file_basename** : indicates the name of the file.
- **file_fullpath** : indicates the absolute file path.
- **frequency** : indicates the interval between individual time-samples in the atomic dataset. For CMIP5, the following are the only options:

   - yr
   - mon
   - day
   - 6hr
   - 3hr
   - subhr
   - monClim
   - fx
   
- **realm** : indicates which high-level modeling component is of particular relevance for the dataset. For CMIP5, permitted values are:

   - atmos
   - ocean 
   - land 
   - landIce 
   - seaIce
   - aerosol
   - atmosChem 
   - ocnBgchem
   
- **institution** : identifies the institute responsible for the model results (e.g. NCAR).
- **model** : identifies the model used (e.g. HADCM3, HADCM3-233). 
- **varname** : identifies the simulated physical quantity.
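
As an illustration of the column conventions above, the sketch below parses an ensemble label of the form `r<N>i<M>p<L>` and checks a frequency against the permitted values listed earlier. The names `parse_ensemble` and `is_valid_frequency` are hypothetical and not part of `intake-cmip`. Reading the three integers as realization, initialization method, and physics version follows the usual CMIP5 interpretation.

```python
import re

# r<N>i<M>p<L>: three non-negative integers, all required.
ENSEMBLE_PATTERN = re.compile(r"^r(\d+)i(\d+)p(\d+)$")

# Permitted CMIP5 frequencies, as listed above.
FREQUENCIES = {"yr", "mon", "day", "6hr", "3hr", "subhr", "monClim", "fx"}


def parse_ensemble(member):
    """Split an ensemble label such as 'r3i1p21' into its integer triad (N, M, L)."""
    match = ENSEMBLE_PATTERN.match(member)
    if match is None:
        raise ValueError(f"not a valid ensemble member label: {member!r}")
    realization, initialization, physics = (int(group) for group in match.groups())
    return realization, initialization, physics


def is_valid_frequency(frequency):
    """Return True if `frequency` is one of the permitted CMIP5 values."""
    return frequency in FREQUENCIES


print(parse_ensemble("r3i1p21"))   # (3, 1, 21)
print(is_valid_frequency("mon"))   # True
```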

## Generate fake data and fake intake-cmip5 database

In [6]:
# Import Packages
import os
import pandas as pd
import xarray as xr
import shutil
import tempfile
from intake_cmip.database import create_cmip5_database
import intake_cmip

In [7]:
CMIP5_TEST_DIR = tempfile.mkdtemp()
DB_DIR = tempfile.mkdtemp()
file_names = [
    "Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc",
    "Tair_OImon_CSIRO-Mk3-6-0_historical_r2i1p1_200601-203512.nc",
]

In [8]:
def setup():
    test_paths = [
        f"{CMIP5_TEST_DIR}/output1/CCCma/CanESM2/rcp85/mon/atmos/Amon/r2i1p1",
        f"{CMIP5_TEST_DIR}/output2/CSIRO-QCCCE/CSIRO-Mk3-6-0/historical/mon/seaIce/OImon/r2i1p1/v1/sic",
    ]

    ds = (
        xr.tutorial.open_dataset("rasm")
        .load()
        .isel(time=slice(0, 2), x=slice(0, 5), y=slice(0, 3))
    )

    for idx, path in enumerate(test_paths):
        os.makedirs(path, exist_ok=True)
        file_path = f"{path}/{file_names[idx]}"
        ds.to_netcdf(file_path, mode="w")

In [9]:
# Generate fake cmip5 database
setup()
create_cmip5_database(CMIP5_TEST_DIR, DB_DIR)

**** Persisting CMIP5 database: /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x/T/tmphff3n7cm/cmip5.csv ****


Unnamed: 0,ensemble,experiment,file_basename,file_fullpath,frequency,institution,model,realm,root,varname,version
0,r2i1p1,historical,Tair_OImon_CSIRO-Mk3-6-0_historical_r2i1p1_200...,/var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x...,mon,CSIRO-QCCCE,CSIRO-Mk3-6-0,seaIce,/var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x...,Tair,v2
1,r2i1p1,rcp85,Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc,/var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x...,mon,CCCma,CanESM2,atmos,/var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x...,Tair,v2


In [10]:
!ls {DB_DIR}

cmip5.csv     raw_cmip5.csv


## Using `intake-cmip5` Plugin

In [11]:
# Specify path to database file
db_file_path = f"{DB_DIR}/cmip5.csv"

In [12]:
source = intake.open_cmip5(database=db_file_path, model="CanESM2", experiment="rcp85",
                           frequency="mon", realm="atmos", ensemble="r2i1p1",
                           varname="Tair")
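
In terms of the database file, the keyword arguments passed to `intake.open_cmip5` can be read as filters on the CSV columns. The sketch below shows that idea on a small in-memory table. It is an assumption about the general approach, not the plugin's code, and `select_rows` and `SAMPLE_DATABASE` are hypothetical names.

```python
import csv
import io

# A reduced copy of the fake database generated above (illustrative columns only).
SAMPLE_DATABASE = """\
ensemble,experiment,frequency,model,realm,varname,file_basename
r2i1p1,historical,mon,CSIRO-Mk3-6-0,seaIce,Tair,Tair_OImon_CSIRO-Mk3-6-0_historical_r2i1p1_200601-203512.nc
r2i1p1,rcp85,mon,CanESM2,atmos,Tair,Tair_Amon_CanESM2_rcp85_r2i1p1_200601-203512.nc
"""


def select_rows(csv_text, **query):
    """Return the rows whose columns equal every key/value pair in `query`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row[column] == value for column, value in query.items())
    ]


matches = select_rows(
    SAMPLE_DATABASE,
    model="CanESM2", experiment="rcp85", frequency="mon",
    realm="atmos", ensemble="r2i1p1", varname="Tair",
)
print([row["file_basename"] for row in matches])
```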

In [13]:
source

<intake_cmip.cmip5.CMIP5DataSource at 0x11a4a8e48>

In [14]:
source.discover()

{'datashape': None,
 'dtype': None,
 'shape': None,
 'npartitions': None,
 'metadata': {'dims': {'ensemble': 1, 'time': 2, 'x': 5, 'y': 3},
  'data_vars': {'Tair': ['time', 'xc', 'yc', 'ensemble']},
  'coords': ('time', 'xc', 'yc', 'ensemble'),
  'title': '/workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc',
  'institution': 'U.W.',
  'source': 'RACM R1002RBRxaaa01a',
  'output_frequency': 'daily',
  'output_mode': 'averaged',
  'convention': 'CF-1.4',
  'references': 'Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.',
  'comment': 'Output from the Variable Infiltration Capacity (VIC) model.',
  'nco_openmp_thread_number': 1,
  'NCO': '"4.6.0"',
  'history': 'Tue Dec 27 14:15:22 2016: ncatted -a dimensions,,d,, rasm.nc rasm.nc\nTue Dec 27 13:38:40 2016: ncks -3 rasm.nc rasm.nc\nhistory deleted for brevity'}}

In [15]:
out = source.to_xarray()

In [16]:
out

<xarray.Dataset>
Dimensions:   (ensemble: 1, time: 2, x: 5, y: 3)
Coordinates:
  * time      (time) object 1980-09-16 12:00:00 1980-10-17 00:00:00
    xc        (y, x) float64 189.2 189.4 189.6 189.7 ... 188.9 189.0 189.2 189.4
    yc        (y, x) float64 16.53 16.78 17.02 17.27 ... 17.1 17.34 17.59 17.84
  * ensemble  (ensemble) <U6 'r2i1p1'
Dimensions without coordinates: x, y
Data variables:
    Tair      (ensemble, time, y, x) float64 dask.array<shape=(1, 2, 3, 5), chunksize=(1, 2, 3, 5)>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:  

In [17]:
print(source.yaml(True))

plugins:
  source:
  - module: intake_cmip.cmip5
sources:
  cmip5:
    args:
      database: /var/folders/z7/sdhzbbr96bv2wjrsb92qsm3dwz5p3x/T/tmphff3n7cm/cmip5.csv
      ensemble: r2i1p1
      experiment: rcp85
      frequency: mon
      model: CanESM2
      realm: atmos
      varname: Tair
    description: ''
    driver: cmip5
    metadata:
      NCO: '"4.6.0"'
      comment: Output from the Variable Infiltration Capacity (VIC) model.
      convention: CF-1.4
      coords: !!python/tuple
      - time
      - xc
      - yc
      - ensemble
      data_vars:
        Tair:
        - time
        - xc
        - yc
        - ensemble
      dims:
        ensemble: 1
        time: 2
        x: 5
        y: 3
      history: "Tue Dec 27 14:15:22 2016: ncatted -a dimensions,,d,, rasm.nc rasm.nc\n\
        Tue Dec 27 13:38:40 2016: ncks -3 rasm.nc rasm.nc\nhistory deleted for brevity"
      institution: U.W.
      nco_openmp_thread_number: !!python/object/apply:numpy.core.multiarray.scalar
      

In [18]:
%load_ext watermark

In [19]:
%watermark --iversion -g -v -u -d

intake      0.2.9
pandas      0.23.4
xarray      0.11.0
intake_cmip  0+untagged.85.gd38875e.dirty
last updated: 2018-12-27 

CPython 3.6.7
IPython 7.0.1
Git hash: d38875e518ff2dcb9d30369342a4d84d2b26f217
