# Discovering data

This notebook shows how to find out what data is available locally as well as on ESGF. It also shows how to download the data from ESGF.

In [None]:
from esmvalcore.config import CFG
from esmvalcore.dataset import Dataset

Configure ESMValCore so that it always searches ESGF for data:

In [2]:
CFG["search_data"] = "complete"
CFG["projects"].pop("CMIP6", None)  # Clear existing CMIP6 configuration
CFG.nested_update(
    {
        "projects": {
            "CMIP6": {
                "data": {
                    "intake-esgf": {
                        "type": "esmvalcore.io.intake_esgf.IntakeESGFDataSource",
                        "priority": 2,
                        "facets": {
                            "activity": "activity_drs",
                            "dataset": "source_id",
                            "ensemble": "member_id",
                            "exp": "experiment_id",
                            "institute": "institution_id",
                            "grid": "grid_label",
                            "mip": "table_id",
                            "project": "project",
                            "short_name": "variable_id",
                        },
                    },
                },
            },
        },
    },
)
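The `pop` call before `nested_update` matters if the update is merged recursively into the existing configuration, as the method name suggests. The following stand-alone sketch illustrates that behaviour with plain dictionaries. The `deep_merge` helper and the `local` data source are hypothetical and are not part of ESMValCore.

```python
from copy import deepcopy


def deep_merge(base: dict, update: dict) -> dict:
    """Return a copy of ``base`` with ``update`` merged in recursively."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dictionaries: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the new value replaces the old one.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical pre-existing configuration with a local data source.
existing = {"projects": {"CMIP6": {"data": {"local": {"priority": 1}}}}}
update = {"projects": {"CMIP6": {"data": {"intake-esgf": {"priority": 2}}}}}

# Merging without clearing keeps the old "local" entry next to the new one.
kept = deep_merge(existing, update)

# Clearing the project first leaves only the new data source.
cleared = deepcopy(existing)
cleared["projects"].pop("CMIP6", None)
replaced = deep_merge(cleared, update)

print(sorted(kept["projects"]["CMIP6"]["data"]))
print(sorted(replaced["projects"]["CMIP6"]["data"]))
```

In this sketch, clearing the project entry first is what guarantees that only the newly configured data source is used.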

We define a dataset template to search for all CMIP6 datasets that provide surface air temperature (`tas`) at monthly resolution for the historical experiment. Note that ESMValCore uses its own facet names so that naming is uniform across CMIP phases and other projects. The mapping to the facet names used on ESGF can be found in [esmvalcore.esgf.facets.FACETS](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/api/esmvalcore.esgf.html#esmvalcore.esgf.facets.FACETS).
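To make the facet translation concrete, here is a small sketch that renames ESMValCore facets to their ESGF counterparts using the mapping from the `facets` section of the configuration above. The `to_esgf_facets` helper is hypothetical and only illustrates the idea; it is not how ESMValCore performs the translation internally.

```python
# Mapping copied from the "facets" section of the configuration above.
ESMVALCORE_TO_ESGF = {
    "activity": "activity_drs",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "institute": "institution_id",
    "grid": "grid_label",
    "mip": "table_id",
    "project": "project",
    "short_name": "variable_id",
}


def to_esgf_facets(facets: dict) -> dict:
    """Rename ESMValCore facets to ESGF names, skipping unknown facets."""
    return {
        ESMVALCORE_TO_ESGF[name]: value
        for name, value in facets.items()
        if name in ESMVALCORE_TO_ESGF
    }


template_facets = {
    "short_name": "tas",
    "mip": "Amon",
    "project": "CMIP6",
    "exp": "historical",
}
esgf_query = to_esgf_facets(template_facets)
print(esgf_query)
```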

In [3]:
dataset_template = Dataset(
    short_name="tas",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="*",
    grid="*",
)

Next, we use the `Dataset.from_files` method to build a list of datasets from the available files. This may take a while, because searching ESGF for many files is slow. The search results are cached for a [configurable duration](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/quickstart/configure.html#esgf-configuration), so subsequent searches will be faster.

In [4]:
datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]

Found 906 datasets, showing the first 10:


[Dataset:
 {'dataset': 'TaiESM1',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AS-RCEC'},
 Dataset:
 {'dataset': 'TaiESM1',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r2i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AS-RCEC'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r2i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'tas',
  'ensemble': 'r3i1p1f1',
  'exp': 'historical',
  'grid': 'gn',
  'institute': 'AWI'},
 Dataset:
 {'dataset': 'AWI-CM-1-1-MR',
  'project': '
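With several hundred results, a summary per model is often easier to read than the raw list. The sketch below counts ensemble members per model. To stay runnable without ESGF access, it uses plain dictionaries holding the facets of the first five results shown above as stand-ins for the `Dataset` objects.

```python
from collections import Counter

# Stand-ins for the first five search results shown above
# (only the facets needed for the summary are included).
results = [
    {"dataset": "TaiESM1", "ensemble": "r1i1p1f1"},
    {"dataset": "TaiESM1", "ensemble": "r2i1p1f1"},
    {"dataset": "AWI-CM-1-1-MR", "ensemble": "r1i1p1f1"},
    {"dataset": "AWI-CM-1-1-MR", "ensemble": "r2i1p1f1"},
    {"dataset": "AWI-CM-1-1-MR", "ensemble": "r3i1p1f1"},
]

# Count how many ensemble members were found for each model.
members_per_model = Counter(result["dataset"] for result in results)

for model, count in members_per_model.items():
    print(f"{model}: {count} ensemble member(s)")
```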

Let's look at the first dataset in more detail. We can print the facets describing the dataset:

In [5]:
dataset = datasets[0]
dataset

Dataset:
{'dataset': 'TaiESM1',
 'project': 'CMIP6',
 'mip': 'Amon',
 'short_name': 'tas',
 'ensemble': 'r1i1p1f1',
 'exp': 'historical',
 'grid': 'gn',
 'institute': 'AS-RCEC'}

and see what files are available:

In [6]:
dataset.files

[IntakeESGFDataset(name='CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn')]
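The name of the returned object appears to follow the CMIP6 dataset identifier layout, with the facets separated by dots. As a rough sketch, assuming that layout and assuming no facet value contains a dot, the name can be split back into ESMValCore facets. The `parse_dataset_name` helper is hypothetical.

```python
# ESMValCore facet names in the order they appear in the dataset name
# (assumed to follow the CMIP6 dataset identifier layout).
NAME_FACETS = (
    "project",
    "activity",
    "institute",
    "dataset",
    "exp",
    "ensemble",
    "mip",
    "short_name",
    "grid",
)


def parse_dataset_name(name: str) -> dict:
    """Split a dot-separated dataset name into a facet dictionary."""
    parts = name.split(".")
    if len(parts) != len(NAME_FACETS):
        msg = f"Expected {len(NAME_FACETS)} parts, got {len(parts)}: {name}"
        raise ValueError(msg)
    return dict(zip(NAME_FACETS, parts))


facets = parse_dataset_name(
    "CMIP6.CMIP.AS-RCEC.TaiESM1.historical.r1i1p1f1.Amon.tas.gn"
)
print(facets)
```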

Load a single file as an `iris.cube.CubeList`:

In [7]:
cubes = dataset.files[0].to_iris()
cubes

| Air Temperature (K) | time | latitude | longitude |
| --- | --- | --- | --- |
| Shape | 1980 | 192 | 288 |
| Dimension coordinates | | | |
| time | x | - | - |
| latitude | - | x | - |
| longitude | - | - | x |
| Scalar coordinates | | | |
| height | 2.0 m | | |
| Cell methods | | | |
| 0 | area: time: mean | | |
| Attributes | | | |


`Dataset.from_files` can also handle derived variables properly:

In [8]:
dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)

In [9]:
datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]

Found 37 datasets, showing the first 10:


[Dataset:
 {'dataset': 'GISS-E2-2-G',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NASA-GISS'},
 Dataset:
 {'dataset': 'FGOALS-g3',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'CAS'},
 Dataset:
 {'dataset': 'CESM2-WACCM-FV2',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NCAR'},
 Dataset:
 {'dataset': 'GISS-E2-1-H',
  'project': 'CMIP6',
  'mip': 'Amon',
  'short_name': 'lwcre',
  'derive': True,
  'ensemble': 'r1i1p1f1',
  'exp': 'historical',
  'force_derivation': True,
  'grid': 'gn',
  'institute': 'NASA-GISS'},
 Dataset:
 {'dataset': 'BCC-CSM2-MR',
  '

The facet `force_derivation=True` ensures that the variable is always derived. If it is omitted and files that provide the variable `lwcre` directly (without derivation) are present, only those are returned.
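The rule described in the previous paragraph can be written down as a small decision function. This is only one reading of that description, not ESMValCore's implementation, and the `will_derive` helper and the example file name are hypothetical.

```python
def will_derive(derive: bool, force_derivation: bool, direct_files: list) -> bool:
    """Decide whether a variable is derived from its input variables.

    ``direct_files`` are files that provide the variable without derivation.
    """
    if not derive:
        # Derivation was not requested at all.
        return False
    # Derive when forced, or when no files provide the variable directly.
    return force_derivation or not direct_files


# Files providing ``lwcre`` directly exist, but derivation is forced.
print(will_derive(True, True, ["lwcre_example.nc"]))
# Files providing ``lwcre`` directly exist and derivation is not forced.
print(will_derive(True, False, ["lwcre_example.nc"]))
# No direct files are available, so derivation is the only option.
print(will_derive(True, False, []))
```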

If variable derivation is necessary (which is always the case when `force_derivation=True` is used), the `files` attribute of the datasets may be empty. In that case, the files of the input variables needed for the derivation can be accessed via the `Dataset.required_datasets` attribute:

In [10]:
dataset = datasets[0]
dataset.files

[]

In [11]:
for d in dataset.required_datasets:
    print(d["short_name"])
    print(d.files)

rlut
[IntakeESGFDataset(name='CMIP6.CMIP.NASA-GISS.GISS-E2-2-G.historical.r1i1p1f1.Amon.rlut.gn')]
rlutcs
[IntakeESGFDataset(name='CMIP6.CMIP.NASA-GISS.GISS-E2-2-G.historical.r1i1p1f1.Amon.rlutcs.gn')]
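The two required variables are the all-sky (`rlut`) and clear-sky (`rlutcs`) outgoing longwave radiation at the top of the atmosphere. As a minimal sketch, assuming the conventional definition of the longwave cloud radiative effect as clear-sky minus all-sky flux, the derivation amounts to an element-wise difference. The numbers below are made up for illustration and are not taken from any dataset.

```python
def derive_lwcre(rlut: list, rlutcs: list) -> list:
    """Compute lwcre as clear-sky minus all-sky outgoing longwave flux."""
    if len(rlut) != len(rlutcs):
        raise ValueError("rlut and rlutcs must have the same length")
    return [clear - all_sky for all_sky, clear in zip(rlut, rlutcs)]


# Made-up fluxes in W m-2 for three grid points.
rlut = [240.0, 235.5, 250.0]
rlutcs = [265.5, 262.0, 270.0]

lwcre = derive_lwcre(rlut, rlutcs)
print(lwcre)
```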
