# Part 2: Intake - Data Cataloging Made Simple

## 1. Learning Objectives

By the end of this section, you will understand how to:

1. **Create and manage data catalogs** using Intake
2. **Access datasets** through standardized interfaces
3. **Handle different data formats** seamlessly
4. **Share data access patterns** across teams

## 2. The Problem: Data Access Chaos

In typical data science workflows, teams face these challenges:

### Traditional Pain Points:
- **Scattered Data:** Files stored in different locations with inconsistent naming
- **Format Confusion:** Different file formats (NetCDF, Zarr, CSV, Parquet) require different loading code
- **Path Dependencies:** Hard-coded file paths break when data moves
- **Knowledge Silos:** Only certain team members know where data is and how to load it
- **Inconsistent Loading:** Different team members load the same data differently

### The Intake Solution:
Intake provides a **unified catalog interface** that abstracts away the complexity of data access.

| Problem | Traditional Approach | Intake Solution |
|---------|---------------------|-----------------|
| **Multiple formats** | Different loading code for each format | Unified `.read()` and `.to_dask()` methods |
| **Path management** | Hard-coded paths in code | Centralized catalog with logical names |
| **Documentation** | Scattered README files | Rich metadata in catalog |
| **Sharing** | Email file paths and instructions | Share catalog YAML file |
| **Reproducibility** | "Works on my machine" | Consistent access patterns |

## 3. What is Intake?

**Intake** is a lightweight package for finding, investigating, loading and disseminating data. It is built around four key concepts.

### Key Concepts:

1. **Data Catalogs**: YAML files that describe where data lives and how to access it
2. **Data Sources**: Individual datasets with metadata and loading instructions  
3. **Drivers**: Plugins that know how to load specific data formats
4. **Parameters**: Dynamic configuration for flexible data access

### Catalog Anatomy:

```yaml
sources:                    # Collection of datasets
  my_dataset:              # Logical name for the dataset
    driver: zarr           # How to load the data (zarr, csv, netcdf, etc.)
    args:                  # Arguments passed to the driver
      urlpath: "/path/to/data.zarr"
    description: "..."     # Human-readable description
    metadata:              # Additional information
      tags: ["weather", "gridded"]
```
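To make the roles of catalog, source, and driver concrete, here is a toy model in plain Python. It is only a sketch of the idea and not how Intake is implemented: the `DRIVERS` registry, the `read_source` helper, and the `stations` and `site_info` sources are hypothetical names invented for this illustration.

```python
import csv
import json
import os
import tempfile


def load_csv(urlpath):
    """Toy 'csv' driver: return the rows as a list of dicts."""
    with open(urlpath, newline="") as f:
        return list(csv.DictReader(f))


def load_json(urlpath):
    """Toy 'json' driver: return the parsed document."""
    with open(urlpath) as f:
        return json.load(f)


# Hypothetical driver registry: driver name -> loader function.
DRIVERS = {"csv": load_csv, "json": load_json}


def read_source(catalog, name):
    """Look up a source by logical name and load it with its driver."""
    entry = catalog["sources"][name]
    return DRIVERS[entry["driver"]](**entry["args"])


# Write two small demo files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "stations.csv")
json_path = os.path.join(tmpdir, "site_info.json")
with open(csv_path, "w", newline="") as f:
    f.write("station,max_dbz\nA,41.5\nB,37.0\n")
with open(json_path, "w") as f:
    json.dump({"site": "demo", "levels": [1, 2, 3]}, f)

# The catalog mirrors the YAML anatomy: sources -> name -> driver/args/description.
catalog = {
    "sources": {
        "stations": {
            "driver": "csv",
            "args": {"urlpath": csv_path},
            "description": "Demo station table",
        },
        "site_info": {
            "driver": "json",
            "args": {"urlpath": json_path},
            "description": "Demo site metadata",
        },
    }
}

# Callers use logical names only; paths and formats stay in the catalog.
stations = read_source(catalog, "stations")
site_info = read_source(catalog, "site_info")
print(stations)
print(site_info)
```

Because callers refer to a dataset only by its logical name, moving a file means editing one `urlpath` in the catalog rather than every script that reads it.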

### Benefits:
- 🔍 **Discoverability**: Browse available datasets
- 📖 **Documentation**: Rich metadata and descriptions
- 🔄 **Reproducibility**: Consistent data access patterns
- 🤝 **Collaboration**: Share data access through catalog files
- ⚡ **Performance**: Lazy loading and optimization hints

## 4. Hands-On: Working with Our Data Catalog

Let's explore how to use Intake with a real radar dataset. We'll start by loading and exploring our catalog.

### Step 1: Loading a Data Catalog

First, let's load our radar data catalog and explore what datasets are available.

**What this does:**
- Opens a YAML catalog file containing dataset definitions
- Creates a catalog object that provides unified access to multiple datasets
- Allows us to explore available data sources without loading the actual data

**Command to run:**

In [2]:
import intake 
# Load the radar data catalog
catalog = intake.open_catalog('../catalogs/radar_intake_catalog.yaml')

# Explore what's in the catalog
print("📁 Available datasets in catalog:")
print("-" * 40)
for name in catalog:
    print(f"  • {name}")
    
print(f"\n📊 Total datasets: {len(list(catalog))}")

# Let's look at the catalog object itself
print(f"\nCatalog type: {type(catalog)}")
print(f"Catalog path: {catalog.path}")

📁 Available datasets in catalog:
----------------------------------------
  • QPSUMS_tw

📊 Total datasets: 1

Catalog type: <class 'intake.catalog.local.YAMLFileCatalog'>
Catalog path: ../catalogs/radar_intake_catalog.yaml


**Now let's open the actual radar dataset as a Dask-backed xarray Dataset:**

**What this does:**
- Accesses the QPSUMS_tw dataset from our catalog
- Converts it to a Dask-backed xarray Dataset for lazy loading
- Enables us to work with large datasets that don't fit in memory

**Command to run:**

In [3]:
radar_tw_ds = catalog.QPSUMS_tw.to_dask()
print(radar_tw_ds)

radar_tw_ds

<xarray.Dataset> Size: 553GB
Dimensions:    (time: 558420, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 4MB 2013-01-01 ... 2023-08-31T23:50:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 553GB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>


| | Array | Chunk |
|---|---|---|
| **Bytes** | 514.66 GiB | 0.94 MiB |
| **Shape** | (558420, 561, 441) | (1, 561, 441) |
| **Dask graph** | 558420 chunks in 23 graph layers | |
| **Data type** | float32 numpy.ndarray | |
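The two sizes in the output above (553GB in the text repr, 514.66 GiB in the Dask summary) describe the same array in decimal and binary units. The short calculation below reproduces both from the array shape and the 4-byte `float32` item size; the variable names are chosen for this illustration only.

```python
# Shape and dtype taken from the dataset repr above.
n_time, n_lat, n_lon = 558420, 561, 441
bytes_per_value = 4  # float32

# Each chunk holds one time step: shape (1, 561, 441).
chunk_bytes = 1 * n_lat * n_lon * bytes_per_value
total_bytes = n_time * chunk_bytes

total_gb = total_bytes / 1e9      # decimal gigabytes (10**9 bytes)
total_gib = total_bytes / 2**30   # binary gibibytes (2**30 bytes)
chunk_mib = chunk_bytes / 2**20   # binary mebibytes (2**20 bytes)

print(f"chunk: {chunk_bytes} bytes = {chunk_mib:.2f} MiB")  # chunk: 989604 bytes = 0.94 MiB
print(f"total: {total_gb:.0f} GB = {total_gib:.2f} GiB")    # total: 553 GB = 514.66 GiB
```

One chunk per time step keeps each piece under 1 MiB, at the cost of a graph with 558,420 chunks.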


### Step 2: Lazy Data Processing - Subsetting and Resampling

Now let's demonstrate Dask's **lazy evaluation** by building a data processing pipeline that includes subsetting and resampling operations.

**What we'll do:**
- Analyze the Dask array structure to understand chunking strategy
- Select a subset of data (single day) without loading into memory
- Apply filtering operations (removing invalid values)
- Resample from 10-minute values to hourly maximum values
- All operations remain **lazy** until we explicitly trigger computation
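To see what the filtering and hourly-maximum steps compute, here is a minimal NumPy sketch on a tiny synthetic array. It assumes a complete, regular 10-minute series (six steps per hour) and uses a made-up `-999` flag for invalid values; the grid size and variable names are illustrative, and the real pipeline uses xarray's `where` and `resample` on Dask-backed data instead.

```python
import numpy as np

steps_per_hour = 6                      # 10-minute data
n_steps, n_lat, n_lon = 24 * steps_per_hour, 2, 3

# Synthetic field: every grid cell at time step t has the value t.
dbz = np.tile(np.arange(n_steps, dtype="float32")[:, None, None], (1, n_lat, n_lon))
dbz[5] = -999.0                         # hypothetical invalid flag at one time step

# Filtering: keep values >= 0 and turn the rest into NaN.
valid = np.where(dbz >= 0, dbz, np.nan)

# Hourly maximum: group the six steps of each hour and take a NaN-aware max.
hourly_max = np.nanmax(valid.reshape(24, steps_per_hour, n_lat, n_lon), axis=1)

print(hourly_max.shape)        # (24, 2, 3)
print(hourly_max[:3, 0, 0])    # first three hourly maxima at one grid cell
```

In the first hour the invalid step is ignored, so the maximum comes from the remaining five steps rather than from the flag value.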

**Key Benefits of Lazy Evaluation:**
- 🧩 **Chunking**: Data is divided into manageable pieces for parallel processing
- 💾 **Memory Efficiency**: Only load chunks you're working with
- ⚡ **Optimization**: Dask optimizes the entire computational graph
- 🔄 **Scalability**: Process datasets larger than available memory
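As a rough analogy for lazy evaluation, the sketch below uses a plain Python generator: the plan is described first, and chunks are read only when results are requested. This is not how Dask works internally, and `load_chunk` is a hypothetical stand-in for real file I/O.

```python
from itertools import islice

loaded = []  # records which chunks were actually read


def load_chunk(i):
    """Pretend to read chunk i from disk (hypothetical stand-in for real I/O)."""
    loaded.append(i)
    return [i * 10 + k for k in range(6)]


# Build the plan: the generator expression describes the work but runs none of it.
plan = (max(load_chunk(i)) for i in range(1_000_000))
nothing_loaded_yet = len(loaded) == 0

# Trigger computation for the first three chunks only.
first_three = list(islice(plan, 3))

print(nothing_loaded_yet)  # True
print(first_three)         # [5, 15, 25]
print(loaded)              # [0, 1, 2]
```

Dask goes further than this analogy: it records the operations as a task graph and can run independent chunks in parallel, whereas a generator is strictly sequential.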

**Command to run:**

In [4]:
# 🔍 Explore Dask Array Structure
print("🔍 Dask Array Analysis:")
print("=" * 50)

# Examine MaxDBZ variable's Dask array
maxdbz_dask = radar_tw_ds.MaxDBZ.data
print(f"📊 Array type: {type(maxdbz_dask)}")
print(f"📏 Array shape: {maxdbz_dask.shape}")
print(f"🧩 Chunk structure: {maxdbz_dask.chunks}")
print(f"💾 Chunk size: {maxdbz_dask.chunksize}")
print(f"🔢 Total chunks: {maxdbz_dask.npartitions}")

# Calculate theoretical memory usage
total_size_gb = radar_tw_ds.MaxDBZ.nbytes / 1e9
chunk_size_mb = (maxdbz_dask.chunksize[0] * maxdbz_dask.chunksize[1] * maxdbz_dask.chunksize[2] * 4) / 1e6  # 4 bytes per float32

print(f"\n💾 Memory Information:")
print(f"  • Total data size: {total_size_gb:.1f} GB")
print(f"  • Single chunk size: ~{chunk_size_mb:.1f} MB")

🔍 Dask Array Analysis:
📊 Array type: <class 'dask.array.core.Array'>
📏 Array shape: (558420, 561, 441)
🧩 Chunk structure: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... [output truncated]

In [29]:
print("Building a plan to create a subset for analysis...")
print("=" * 50)

# --- Step 1 of our plan: Select a single day ---
daily_ds = radar_tw_ds.sel(time='2023-08-01')

# --- Step 2 of our plan: Filter the data ---
# This keeps all values where MaxDBZ is >= 0, and turns the rest into NaN.
interactive_ds = daily_ds.where(daily_ds.MaxDBZ >= 0)

print("The result is still a lazy Dask object, and no computation has happened yet:")
print("-" * 50)
print(interactive_ds)

# --- Step 3: Resample to Hourly Maximum ---
hourly_max_dbz = interactive_ds.MaxDBZ.resample(time='h').max(dim='time')
print(hourly_max_dbz)
hourly_max_dbz


Building a plan to create a subset for analysis...
The result is still a lazy Dask object, and no computation has happened yet:
--------------------------------------------------
<xarray.Dataset> Size: 143MB
Dimensions:    (time: 144, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 1kB 2023-08-01 ... 2023-08-01T23:50:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 143MB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>
<xarray.DataArray 'MaxDBZ' (time: 24, latitude: 561, longitude: 441)> Size: 24MB
dask.array<stack, shape=(24, 561, 441), dtype=float32, chunksize=(1, 561, 441), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time  

| | Array | Chunk |
|---|---|---|
| **Bytes** | 22.65 MiB | 0.94 MiB |
| **Shape** | (24, 561, 441) | (1, 561, 441) |
| **Dask graph** | 24 chunks in 123 graph layers | |
| **Data type** | float32 numpy.ndarray | |


### Step 3: Adding CF-Compliant Metadata and Validation

Now we'll focus on adding scientific metadata following **CF (Climate and Forecast) conventions** and validating our data product.

**What we'll do:**
- Convert our DataArray to a proper Dataset structure
- Add comprehensive CF-compliant metadata (global, coordinate, and variable attributes)
- Save as NetCDF format with proper encoding
- Validate compliance using `compliance-checker`
- Verify the final data product
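For the time coordinate, "proper encoding" means storing each timestamp as a number relative to a reference date. The sketch below works through that arithmetic with the standard library, assuming the units string `days since 1970-01-01 00:00:00` used in the save step of this notebook; `encode_days` and `decode_days` are illustrative helpers, not xarray functions.

```python
from datetime import datetime, timedelta

# Reference date from the units string 'days since 1970-01-01 00:00:00'.
EPOCH = datetime(1970, 1, 1)


def encode_days(ts):
    """Convert a timestamp to fractional days since the reference date."""
    return (ts - EPOCH) / timedelta(days=1)


def decode_days(value):
    """Convert fractional days since the reference date back to a timestamp."""
    return EPOCH + timedelta(days=value)


midnight = datetime(2023, 8, 1, 0, 0)
last_hour = datetime(2023, 8, 1, 23, 0)

print(encode_days(midnight))                  # 19570.0
print(encode_days(last_hour))                 # 19570 plus 23/24 of a day
print(decode_days(encode_days(last_hour)))    # round trip back to a datetime
```

Python's `datetime` uses the proleptic Gregorian calendar, which matches the `calendar` value in that encoding. Hourly steps are multiples of 1/24 day, so a `float64` value has ample precision for them.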

**Key Concepts - CF Conventions:**
- **Global Attributes**: Describe the entire dataset (title, history, conventions)
- **Coordinate Attributes**: Properly describe time, latitude, longitude dimensions
- **Variable Attributes**: Scientific metadata for data variables (units, standard_name, etc.)
- **Compliance Checking**: Automated validation against CF standards

**Scientific Value:**
- Ensures data interoperability across different tools and institutions
- Maintains full provenance and traceability
- Follows international metadata standards for scientific data
- Enables automatic discovery and understanding by analysis tools

**Command to run:**

In [None]:
from datetime import datetime, timezone
import os
import xarray as xr

# --- Step 4: Convert to Dataset and Add Full Metadata ---
print("Converting to Dataset and adding metadata...")

# Convert the DataArray to a Dataset.
# This creates the structure needed for full CF compliance.
# We'll name our main variable 'MaxDBZ'.
ds_level_silver = hourly_max_dbz.to_dataset(name='MaxDBZ')
print(hourly_max_dbz)

# --- 1: Add GLOBAL attributes to the Dataset ---
# These describe the file as a whole.
iso_timestamp = datetime.now(timezone.utc).isoformat()
ds_level_silver.attrs['title'] = 'Level-Silver QPSUMS Hourly 2D Maximum Radar Reflectivity over Taiwan'
ds_level_silver.attrs['history'] = f'Created on {iso_timestamp}. Derived from Bronze data by taking the hourly maximum.'
ds_level_silver.attrs['Conventions'] = 'CF-1.8' # This fixes a warning
ds_level_silver.attrs['source'] = 'Data processed from catalog at https://github.com/Isongzhe/dataset-catalog/blob/master/catalogs/radar_intake_catalog.yaml'

# --- 2: Add attributes to the COORDINATES ---
ds_level_silver['time'].attrs['long_name'] = 'time'
ds_level_silver['time'].attrs['standard_name'] = 'time'

ds_level_silver['latitude'].attrs['long_name'] = 'latitude'
ds_level_silver['latitude'].attrs['standard_name'] = 'latitude'
ds_level_silver['latitude'].attrs['units'] = 'degrees_north'

ds_level_silver['longitude'].attrs['long_name'] = 'longitude'
ds_level_silver['longitude'].attrs['standard_name'] = 'longitude'
ds_level_silver['longitude'].attrs['units'] = 'degrees_east'


# --- 3: Add attributes to the DATA VARIABLE ---
ds_level_silver['MaxDBZ'].attrs['standard_name'] = 'equivalent_reflectivity_factor' # The checker's suggestion
ds_level_silver['MaxDBZ'].attrs['long_name'] = 'Hourly Maximum Equivalent Radar Reflectivity Factor'
ds_level_silver['MaxDBZ'].attrs['units'] = 'dBZ'
ds_level_silver['MaxDBZ'].attrs['comment'] = 'Maximum radar reflectivity observed within each one-hour interval.'
ds_level_silver['MaxDBZ'].attrs['coordinates'] = 'longitude latitude'


print("\n--- Final, Lazy Level-Silver Data Product ---")
print(ds_level_silver)

Converting to Dataset and adding metadata...
<xarray.DataArray 'MaxDBZ' (time: 24, latitude: 561, longitude: 441)> Size: 24MB
dask.array<stack, shape=(24, 561, 441), dtype=float32, chunksize=(1, 561, 441), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 192B 2023-08-01 ... 2023-08-01T23:00:00
Attributes:
    processing_notes:  NODATA_VALUE (<None) in the original QPESUMS data is kept as its original value

--- Final, Lazy Level-Silver Data Product ---
<xarray.Dataset> Size: 24MB
Dimensions:    (latitude: 561, longitude: 441, time: 24)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 192B 2023-08-01 ... 2023-08-01T23:00:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 24MB da

In [31]:
# --- Final Step: Save the fully compliant Dataset ---
print("\n💾 Saving CF-compliant NetCDF file...")

output_dir = '../data'
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'qpsums_radar_hourly_max_2023_08_01_CF.nc')

# Define time encoding to ensure it's saved correctly
time_encoding = {
    'units': 'days since 1970-01-01 00:00:00',
    'calendar': 'proleptic_gregorian',
    'dtype': 'float64'
}

# Now save the Dataset object
ds_level_silver.to_netcdf(
    output_path,
    mode='w',
    encoding={'time': time_encoding}
)

print(f"\nResults saved to: {output_path}")


💾 Saving CF-compliant NetCDF file...


PermissionError: [Errno 13] Permission denied: '/home/sungche/dataset-catalog/data/qpsums_radar_hourly_max_2023_08_01_CF.nc'
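The `PermissionError` above means the process was not allowed to write the target file. One way to make the save step more robust is to check the output directory first and fall back to a temporary directory. The `writable_output_dir` helper below is a hypothetical sketch, not part of xarray or the original notebook, and it checks only the directory, so an existing read-only file at the target path would still fail.

```python
import os
import tempfile


def writable_output_dir(preferred):
    """Return `preferred` if it can be created and written to, else a new temp dir."""
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK):
            return preferred
    except OSError:
        # Covers permission problems and invalid parent paths.
        pass
    return tempfile.mkdtemp(prefix="radar_output_")


# Demo with a location that is normally writable (path chosen for illustration).
demo_dir = writable_output_dir(os.path.join(tempfile.gettempdir(), "radar_demo_output"))
output_path = os.path.join(demo_dir, "qpsums_radar_hourly_max_2023_08_01_CF.nc")
print(f"Would save to: {output_path}")
```

Printing the resolved path before writing makes it obvious when the fallback directory was used.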

**Validate CF Compliance with `compliance-checker`:**

Now let's verify that our dataset meets CF-1.8 standards using the IOOS `compliance-checker` tool.

**What this does:**
- Installs the compliance-checker tool if needed
- Runs automated validation against CF-1.8 conventions
- Reports any issues or confirms compliance
- Ensures our data meets international scientific standards

**Commands to run in terminal:**
```bash
uv add compliance-checker
uv run compliance-checker --test cf:1.8 <nc-file>
```

**Final Verification - Loading the CF-Compliant Dataset:**

Let's load our saved CF-compliant dataset to confirm everything was processed correctly.

**What this verifies:**
- The NetCDF file was saved correctly with all metadata
- All CF attributes are preserved and accessible
- The data structure matches our expectations
- Our scientific workflow produced a valid, standardized data product

**Command to run:**

In [22]:
import xarray as xr
ds = xr.open_dataset('../data/qpsums_radar_hourly_max_2023_08_01_CF.nc')
print(ds)

ds

<xarray.Dataset> Size: 24MB
Dimensions:    (time: 24, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 192B 2023-08-01 ... 2023-08-01T23:00:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 24MB ...
Attributes:
    title:        Level-3 Hourly Maximum Radar Reflectivity over Taiwan
    history:      Created on 2025-08-19T17:41:27.837171+00:00. Derived from L...
    Conventions:  CF-1.8
    source_data:  QPSUMS Level-2 Radar Data

## 5. Recap: What We Accomplished with Intake + Dask + CF Standards

### Our Three-Step Workflow

1. **📁 Data Discovery & Access** (Step 1): Used Intake catalogs for unified data access
2. **⚡ Lazy Data Processing** (Step 2): Demonstrated subsetting and resampling with Dask's lazy evaluation
3. **📖 Scientific Metadata** (Step 3): Added CF-compliant metadata and validated compliance

### Key Benefits Achieved

* **📁 Unified Access**: One interface for multiple data formats through Intake catalogs
* **⚡ Lazy Processing**: Efficiently handled large datasets through intelligent chunking and lazy evaluation
* **🔄 Scalability**: Processed data larger than memory without loading everything at once
* **📖 Scientific Standards**: Created CF-compliant datasets with rich, standardized metadata
* **🤝 Reproducibility**: Established a traceable, standardized scientific workflow
* **✅ Quality Assurance**: Validated output against international CF conventions

### What Each Step Delivered

1. **Data Access:** Found and loaded datasets through a centralized catalog without format-specific code
2. **Efficient Processing:** Built complex processing pipelines that remain lazy until computation is needed
3. **Scientific Compliance:** Created properly documented, standards-compliant scientific data products

### Traditional vs Modern Scientific Data Workflow

| Task | Traditional Way | Our Intake + Dask + CF Way |
|------|----------------|------------------------|
| **Find data** | Browse file systems, ask colleagues | Query centralized Intake catalog |
| **Load large files** | Hope it fits in memory, or chunk manually | Automatic lazy loading with Dask |
| **Process data** | Write loops, risk memory crashes | Lazy evaluation with automatic optimization |
| **Add metadata** | Manual, inconsistent documentation | Standardized CF-compliant metadata |
| **Validate quality** | Manual checks, if any | Automated compliance checking |
| **Share results** | Email files with separate documentation | Self-documenting, standards-compliant datasets |

### Next Steps for Scientific Data Workflows

- Create custom Intake catalogs for your research data
- Explore Dask's distributed computing for even larger datasets
- Learn more CF convention patterns for your scientific domain
- Implement automated compliance checking in your data pipelines
- Build reusable processing workflows that follow these patterns