### 🌍 Script Description: Filtering Terrain-Related Variables from Digital Earth Catalog

This script accesses and analyzes geospatial models from the Digital Earth catalog using the `intake` library. It filters models and variables based on specific keywords and physical units to identify terrain-related datasets.

#### 🧩 Modules Used
- `intake`: Opens and queries data catalogs.
- `pandas`: Used for tabular data manipulation.
- `warnings`: Suppresses unnecessary warnings for cleaner output.

#### 🔍 Steps Performed

1. **Suppress Warnings**  
   Future warnings from libraries (like `xarray`) are ignored to reduce output clutter.

2. **Load Catalog**  
   Opens a remote intake catalog hosted on GitHub and selects the `"online"` section containing geospatial models.

3. **Model Selection**  
   Retrieves the model names from the catalog. All models are used by default; slicing the list (e.g. `[0:15]`) restricts the run to a subset for quick tests.

4. **Keyword Filtering Setup**  
   - `keywords`: A list of terms that suggest the variable is related to terrain (e.g., "elevation", "orography").
   - `no_keywords`: Terms that would exclude a variable (e.g., "cloud", "boundary").

5. **Model Loop**  
   For each selected model:
   - Attempts to extract allowed zoom levels from the model metadata.
   - Tries to load the dataset starting from the lowest zoom level for efficiency.
   - If successful, loops through all data variables in the model.

6. **Variable Filtering**  
   For each variable:
   - Extracts its name, long description (`long_name`), and units.
   - Checks if the description contains **any keyword**, **none of the exclusion terms**, and the **unit is meters (`m`)**.
   - If matched, stores the result (model name, zoom level, variable info).

7. **Result Compilation**  
   A `DataFrame` is created with all matched variables, showing:
   - Model name
   - Lowest zoom level used
   - Variable name
   - Human-readable description
   - Units

8. **Formatted Output**  
   The resulting table is displayed with **left-aligned headers and content** using a pandas `Styler`, which keeps it readable in Jupyter environments.
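
Steps 3 to 5 can be sketched without network access. The snippet below is a minimal illustration, not the notebook's code: `describe_output` only mimics the shape of `model.describe()` that the script relies on (a `user_parameters` list with `name` and `allowed` fields), and `fake_loader` is a hypothetical stand-in for `model(zoom=...).to_dask()`. Unlike the script, it keeps the error raised at each zoom level instead of discarding it.

```python
def allowed_zoom_levels(description):
    """Return the allowed zoom levels from a describe()-style dict, ascending."""
    for param in description.get("user_parameters", []):
        if param.get("name") == "zoom":
            return sorted(param["allowed"])
    raise KeyError("no 'zoom' user parameter found")


def load_lowest_zoom(loader, zoom_levels):
    """Try each zoom level in order; return (dataset, zoom, errors)."""
    errors = {}
    for zoom in zoom_levels:
        try:
            return loader(zoom), zoom, errors
        except Exception as exc:  # keep the reason instead of discarding it
            errors[zoom] = repr(exc)
    return None, None, errors


# Hypothetical input, shaped like the metadata the script reads.
describe_output = {
    "user_parameters": [
        {"name": "time", "allowed": ["PT1H", "P1D"]},
        {"name": "zoom", "allowed": [2, 0, 1]},
    ]
}


def fake_loader(zoom):
    """Stand-in for model(zoom=zoom).to_dask(); fails below zoom 1."""
    if zoom < 1:
        raise OSError(f"zoom {zoom} unavailable")
    return {"zoom": zoom}


levels = allowed_zoom_levels(describe_output)
dataset, used_zoom, errors = load_lowest_zoom(fake_loader, levels)
print(levels, used_zoom, errors)
```

Returning the collected errors alongside the result makes it possible to report why a model could not be loaded, rather than only that it could not.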

---



#### 🔎 Variable Filtering Logic

This block loops through all data variables (`ds.data_vars`) within a successfully loaded dataset and applies filtering criteria to identify relevant terrain-related variables.

Here's a breakdown:

- **`name = var`**: Stores the raw variable name (as it appears in the dataset).

- **`desc = ds[var].attrs.get('long_name', var).replace('_', ' ').title()`**: 
  - Attempts to get the `long_name` attribute (a human-readable description of the variable).
  - If `long_name` is not available, it defaults to the variable name.
  - Then, it formats the description by replacing underscores with spaces and applying `.title()` to capitalize each word for better readability.

- **`unit = ds[var].attrs.get('units', '').strip()`**: 
  - Fetches the unit of the variable (e.g., `"m"` for meters).
  - `.strip()` removes any leading/trailing whitespace to ensure clean comparison.

- **`desc_lower = desc.lower()`**: 
  - Converts the description to lowercase for consistent keyword matching, regardless of capitalization.

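- **Filter Condition**:
  ```python
  if (any(kw in desc_lower for kw in keywords)
          and all(nk not in desc_lower for nk in no_keywords)
          and unit == 'm'):
      data.append([model_name, used_zoom, name, desc, unit])
  ```
  A variable is stored only when all three conditions hold.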
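
To see how the three conditions interact, here is a small standalone sketch. The helper name `is_terrain_variable` and the sample attribute pairs are illustrative and do not come from the catalog; the keyword lists are abbreviated from the script.

```python
KEYWORDS = ["elevation", "altitude", "height", "orography", "surface"]
NO_KEYWORDS = ["boundary", "cloud", "roughness", "snow", "water"]


def is_terrain_variable(long_name, units):
    """Apply the same three checks as the script to one variable's attributes."""
    desc_lower = long_name.replace("_", " ").lower()
    has_keyword = any(kw in desc_lower for kw in KEYWORDS)
    is_excluded = any(nk in desc_lower for nk in NO_KEYWORDS)
    return has_keyword and not is_excluded and units.strip() == "m"


# Made-up attribute pairs for illustration.
samples = [
    ("surface_altitude", "m"),          # keyword and metres: kept
    ("surface temperature", "K"),       # keyword, but wrong unit: dropped
    ("boundary layer height", "m"),     # excluded by 'boundary': dropped
    ("surface snow thickness", " m "),  # excluded by 'snow': dropped
    ("orography", "meters"),            # unit is not exactly 'm': dropped
]

for long_name, units in samples:
    print(f"{long_name!r:28} {units!r:10} {is_terrain_variable(long_name, units)}")
```

Matching is by substring, so a broad keyword such as `surface` depends on the exclusion list and the unit check to stay precise. The exact comparison with `'m'` also means that variables whose units are spelled out (for example `meters`) are not picked up.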


In [7]:
import intake
import pandas as pd
import warnings

# Suppress FutureWarning from xarray
warnings.filterwarnings("ignore", category=FutureWarning)

# Open catalog
cat = intake.open_catalog("https://digital-earths-global-hackathon.github.io/catalog/catalog.yaml")["online"]

# Select all models; slice the list (e.g. [0:15]) for a quicker test run
model_names = list(cat)

# List to store matched results
data = []

# Keywords to include/exclude
keywords = ['elevation', 'altitude', 'height', 'orography', 'surface', 'terrain', 'topography', 'geopotential height']
no_keywords = ['boundary', 'cloud', 'roughness', 'plane', 'snow', 'water', 'geometric', 'cell', 'layer', 'runoff', 'edge']

# Loop over models
for model_name in model_names:
    model = cat[model_name]
    print(model_name)
    ds = None
    used_zoom = None

    try:
        # Get allowed zoom levels from model description
        params_df = pd.DataFrame(model.describe()["user_parameters"])
        zoom_levels = params_df[params_df["name"] == "zoom"]["allowed"].values[0]
        zoom_levels = sorted(zoom_levels)  # ascending order
    except Exception as e:
        print(f"Could not extract zoom levels for model '{model_name}': {e}")
        continue

    # Try each zoom level starting from lowest
    for zoom_level in zoom_levels:
        try:
            ds = model(zoom=zoom_level).to_dask()
            used_zoom = zoom_level
            break
        except Exception:
            continue

    if ds is None:
        print(f"\033[91mCould not load model '{model_name}' at any allowed zoom level.\033[0m") # Print in red
        continue

    # Search for matching variables
    for var in ds.data_vars:
        name = var
        desc = ds[var].attrs.get('long_name', var).replace('_', ' ').title()
        unit = ds[var].attrs.get('units', '').strip()
        desc_lower = desc.lower()

        if any(kw in desc_lower for kw in keywords) and all(nk not in desc_lower for nk in no_keywords) and unit == 'm':
            data.append([model_name, used_zoom, name, desc, unit])

# Create and display final DataFrame
df = pd.DataFrame(data, columns=["Model Name", "Lowest Zoom Level", "Variable Name", "Description", "Unit"])

# Pretty output with left-aligned headers and content
display(df.style.set_properties(**{'text-align': 'left'})
            .set_table_styles([{'selector': 'th', 'props': [('text-align', 'left')]}]))


ERA5
IR_IMERG
JRA3Q
MERRA2
casesm2_10km_nocumulus
icon_d3hp003
icon_d3hp003aug
icon_d3hp003feb
icon_ngc4008
ifs_tco3999-ng5_deepoff
Could not load model 'ifs_tco3999-ng5_deepoff' at any allowed zoom level.
ifs_tco3999-ng5_rcbmf
Could not load model 'ifs_tco3999-ng5_rcbmf' at any allowed zoom level.
ifs_tco3999-ng5_rcbmf_cf
nicam_gl11
scream-dkrz
um_Africa_km4p4_RAL3P3_n1280_GAL9_nest
um_CTC_km4p4_RAL3P3_n1280_GAL9_nest
um_SAmer_km4p4_RAL3P3_n1280_GAL9_nest
um_SEA_km4p4_RAL3P3_n1280_GAL9_nest
um_glm_n1280_CoMA9_TBv1p2
um_glm_n1280_GAL9
um_glm_n2560_RAL3p3


| | Model Name | Lowest Zoom Level | Variable Name | Description | Unit |
|---|---|---|---|---|---|
| 0 | icon_d3hp003 | 0 | orog | Surface Altitude | m |
| 1 | icon_d3hp003feb | 0 | orog | Surface Altitude | m |
| 2 | um_glm_n1280_CoMA9_TBv1p2 | 0 | orog | Surface Altitude | m |
| 3 | um_glm_n1280_CoMA9_TBv1p2 | 0 | orography | Orography | m |
| 4 | um_glm_n1280_GAL9 | 0 | orog | Surface Altitude | m |
| 5 | um_glm_n2560_RAL3p3 | 0 | orog | Surface Altitude | m |


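## The cell below does not work as intended
Intent: read only the metadata from the files, without downloading the data, to speed up the inquiry. As the output shows, every model is reported as unloadable and the result table is empty. The inner `except` block discards the underlying exception, so the actual cause of the failure is not visible.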

In [26]:
import intake
import pandas as pd
import warnings
import xarray as xr

# Suppress FutureWarning from xarray
warnings.filterwarnings("ignore", category=FutureWarning)

# Open catalog
cat = intake.open_catalog("https://digital-earths-global-hackathon.github.io/catalog/catalog.yaml")["online"]

# Select first 15 models
model_names = list(cat)[0:15]

# List to store matched results
data = []

# Keywords to include/exclude
keywords = ['elevation', 'altitude', 'height', 'orography', 'surface', 'terrain', 'topography', 'geopotential height']
no_keywords = ['boundary', 'cloud', 'roughness', 'plane', 'snow', 'water', 'geometric', 'cell', 'layer']

# Loop over models
for model_name in model_names:
    model = cat[model_name]
    ds = None
    used_zoom = None

    try:
        # Get allowed zoom levels from model description
        params_df = pd.DataFrame(model.describe()["user_parameters"])
        zoom_levels = params_df[params_df["name"] == "zoom"]["allowed"].values[0]
        zoom_levels = sorted(zoom_levels)  # ascending order
    except Exception as e:
        print(f"Could not extract zoom levels for model '{model_name}': {e}")
        continue

    # Try each zoom level starting from lowest (without loading data, just metadata)
    for zoom_level in zoom_levels:
        try:
            # Get the Zarr URL for the current zoom level
            zarr_url = model(zoom=zoom_level).describe()['urlpath']

            # Open the Zarr file using xarray, but avoid loading full data by limiting to metadata
            ds = xr.open_zarr(zarr_url, consolidated=True)

            # Access the Zarr file's variable metadata (no data loading)
            ds_metadata = ds.attrs

            # Check if the necessary metadata for this dataset exists
            if 'variables' in ds_metadata:
                variables = ds_metadata['variables']  # Variables metadata in Zarr format
                used_zoom = zoom_level
                break  # If successful, exit loop
        except Exception as e:
            continue

    if ds is None:
        print(f"Could not load model '{model_name}' at any allowed zoom level.")
        continue

    # Search for matching variables based on description and unit
    for var in ds.data_vars:
        name = var
        desc = ds[var].attrs.get('long_name', var).replace('_', ' ').title()
        unit = ds[var].attrs.get('units', '').strip()
        desc_lower = desc.lower()

        # Check if description matches keywords and unit is in meters
        if any(kw in desc_lower for kw in keywords) and all(nk not in desc_lower for nk in no_keywords) and unit == 'm':
            data.append([model_name, used_zoom, name, desc, unit])

# Create and display final DataFrame
df = pd.DataFrame(data, columns=["Model Name", "Lowest Zoom Level", "Variable Name", "Description", "Unit"])

# Pretty output with left-aligned headers and content
display(df.style.set_properties(**{'text-align': 'left'})
            .set_table_styles([{'selector': 'th', 'props': [('text-align', 'left')]}]))


Could not load model 'ERA5' at any allowed zoom level.
Could not load model 'IR_IMERG' at any allowed zoom level.
Could not load model 'JRA3Q' at any allowed zoom level.
Could not load model 'MERRA2' at any allowed zoom level.
Could not load model 'casesm2_10km_nocumulus' at any allowed zoom level.
Could not load model 'icon_d3hp003' at any allowed zoom level.
Could not load model 'icon_d3hp003aug' at any allowed zoom level.
Could not load model 'icon_d3hp003feb' at any allowed zoom level.
Could not load model 'icon_ngc4008' at any allowed zoom level.
Could not load model 'ifs_tco3999-ng5_deepoff' at any allowed zoom level.
Could not load model 'ifs_tco3999-ng5_rcbmf' at any allowed zoom level.
Could not load model 'ifs_tco3999-ng5_rcbmf_cf' at any allowed zoom level.
Could not load model 'nicam_gl11' at any allowed zoom level.
Could not load model 'scream-dkrz' at any allowed zoom level.
Could not load model 'um_Africa_km4p4_RAL3P3_n1280_GAL9_nest' at any allowed zoom level.


| | Model Name | Lowest Zoom Level | Variable Name | Description | Unit |
|---|---|---|---|---|---|
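
One possible route to the stated intent, sketched under assumptions: if a store is Zarr v2 with consolidated metadata, the attributes of every array are collected in a single `.zmetadata` JSON document, under keys of the form `<variable>/.zattrs`. The example below parses such a document with the standard library only. The content of `zmetadata_text` is made up for illustration, the helper name `variable_attrs` is hypothetical, and fetching the real file is left out. Whether the catalog's stores follow this layout is not confirmed by the notebook.

```python
import json

# Made-up example of a consolidated-metadata document (Zarr v2 layout).
zmetadata_text = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {"title": "example store"},
        "orog/.zarray": {"shape": [12], "dtype": "<f4"},
        "orog/.zattrs": {"long_name": "surface altitude", "units": "m"},
        "tas/.zarray": {"shape": [4, 12], "dtype": "<f4"},
        "tas/.zattrs": {"long_name": "air temperature", "units": "K"},
    },
})


def variable_attrs(zmetadata):
    """Map each variable name to its attribute dict from consolidated metadata."""
    attrs = {}
    for key, value in zmetadata["metadata"].items():
        name, sep, leaf = key.rpartition("/")
        # Skip the group-level '.zattrs' (no '/' in the key) and '.zarray' entries.
        if sep and leaf == ".zattrs":
            attrs[name] = value
    return attrs


attrs_by_var = variable_attrs(json.loads(zmetadata_text))
metre_vars = [name for name, a in attrs_by_var.items() if a.get("units") == "m"]
print(metre_vars)
```

The resulting `long_name` and `units` values could then go through the same keyword and unit filter as in the working cell. It is also worth checking whether the extra step is needed at all: as far as xarray's documented behavior goes, `xr.open_zarr` opens arrays lazily, so variable values are normally read only when something requests them.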
