# Part 2: Intake - Data Cataloging Made Simple

## Learning Objectives

By the end of this section, you will understand how to:

1. **Create and manage data catalogs** using Intake
2. **Access datasets** through standardized interfaces
3. **Handle different data formats** seamlessly
4. **Share data access patterns** across teams



## 1. The Problem: Data Access Chaos

In typical data science workflows, teams face these challenges:

### Traditional Pain Points:
- **Scattered Data:** Files stored in different locations with inconsistent naming
- **Format Confusion:** Different file formats (NetCDF, Zarr, CSV, Parquet) require different loading code
- **Path Dependencies:** Hard-coded file paths break when data moves
- **Knowledge Silos:** Only certain team members know where data is and how to load it
- **Inconsistent Loading:** Different team members load the same data differently

### The Intake Solution:
Intake provides a **unified catalog interface** that abstracts away the complexity of data access.

| Problem | Traditional Approach | Intake Solution |
|---------|---------------------|-----------------|
| **Multiple formats** | Different loading code for each format | Unified `.read()` and `.to_dask()` methods |
| **Path management** | Hard-coded paths in code | Centralized catalog with logical names |
| **Documentation** | Scattered README files | Rich metadata in catalog |
| **Sharing** | Email file paths and instructions | Share catalog YAML file |
| **Reproducibility** | "Works on my machine" | Consistent access patterns |

## 2. What is Intake?

**Intake** is a lightweight package for finding, investigating, loading, and disseminating data. It is built around four key concepts.

### Key Concepts:

1. **Data Catalogs**: YAML files that describe where data lives and how to access it
2. **Data Sources**: Individual datasets with metadata and loading instructions  
3. **Drivers**: Plugins that know how to load specific data formats
4. **Parameters**: Dynamic configuration for flexible data access
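
To make the relationship between catalogs, sources, and drivers concrete, here is a toy sketch in plain Python. It is not how Intake is implemented: `TOY_DRIVERS`, `toy_catalog`, and `load_source` are hypothetical names invented for this illustration, and the catalog is a plain dictionary rather than a YAML file.

```python
import csv
import json
import tempfile
from pathlib import Path


# Hypothetical "drivers": each one knows how to load a single format.
def load_csv(urlpath):
    with open(urlpath, newline="") as f:
        return list(csv.DictReader(f))


def load_json(urlpath):
    with open(urlpath) as f:
        return json.load(f)


TOY_DRIVERS = {"csv": load_csv, "json": load_json}


def load_source(catalog, name):
    """Look up a logical name and dispatch to the matching driver."""
    entry = catalog["sources"][name]
    loader = TOY_DRIVERS[entry["driver"]]
    return loader(entry["args"]["urlpath"])


# Create two small files so the example is self-contained.
tmp = Path(tempfile.mkdtemp())
(tmp / "stations.csv").write_text("id,name\n1,Taipei\n2,Tainan\n")
(tmp / "config.json").write_text(json.dumps({"units": "dBZ"}))

# The "catalog": logical names mapped to a driver and its arguments.
toy_catalog = {
    "sources": {
        "stations": {"driver": "csv", "args": {"urlpath": str(tmp / "stations.csv")}},
        "config": {"driver": "json", "args": {"urlpath": str(tmp / "config.json")}},
    }
}

# Callers only need the logical name, not the path or the file format.
stations = load_source(toy_catalog, "stations")
config = load_source(toy_catalog, "config")
print(stations[0]["name"], config["units"])
```

The caller never touches a file path or a format-specific reader. If a file moves, only the catalog entry changes, which is the property the catalog approach relies on.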

### Catalog Anatomy:

```yaml
sources:                    # Collection of datasets
  my_dataset:              # Logical name for the dataset
    driver: zarr           # How to load the data (zarr, csv, netcdf, etc.)
    args:                  # Arguments passed to the driver
      urlpath: "/path/to/data.zarr"
    description: "..."     # Human-readable description
    metadata:              # Additional information
      tags: ["weather", "gridded"]
```

### Benefits:
- 🔍 **Discoverability**: Browse available datasets
- 📖 **Documentation**: Rich metadata and descriptions
- 🔄 **Reproducibility**: Consistent data access patterns
- 🤝 **Collaboration**: Share data access through catalog files
- ⚡ **Performance**: Lazy loading and optimization hints

## 3. Hands-On: Working with Our Data Catalog

Let's explore how to use Intake with a real radar dataset. We'll start by loading and exploring our catalog.

### Step 1: Loading a Data Catalog

First, let's load our radar data catalog and explore what datasets are available.

In [2]:
import intake 
# Load the radar data catalog
catalog = intake.open_catalog('../catalogs/radar_intake_catalog.yaml')

# Explore what's in the catalog
print("📁 Available datasets in catalog:")
print("-" * 40)
for name in catalog:
    print(f"  • {name}")
    
print(f"\n📊 Total datasets: {len(list(catalog))}")

# Let's look at the catalog object itself
print(f"\nCatalog type: {type(catalog)}")
print(f"Catalog path: {catalog.path}")

📁 Available datasets in catalog:
----------------------------------------
  • QPSUMS_tw

📊 Total datasets: 1

Catalog type: <class 'intake.catalog.local.YAMLFileCatalog'>
Catalog path: ../catalogs/radar_intake_catalog.yaml


In [3]:
radar_tw_ds = catalog.QPSUMS_tw.to_dask()
print(radar_tw_ds)

radar_tw_ds

<xarray.Dataset> Size: 553GB
Dimensions:    (time: 558420, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 4MB 2013-01-01 ... 2023-08-31T23:50:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 553GB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>


| | Array | Chunk |
|---|---|---|
| **Bytes** | 514.66 GiB | 0.94 MiB |
| **Shape** | (558420, 561, 441) | (1, 561, 441) |
| **Dask graph** | 558420 chunks in 23 graph layers | |
| **Data type** | float32 numpy.ndarray | |


### Step 2: Process the Data with Dask

In [4]:
# 🔍 Explore Dask Array Structure
print("🔍 Dask Array Analysis:")
print("=" * 50)

# Examine MaxDBZ variable's Dask array
maxdbz_dask = radar_tw_ds.MaxDBZ.data
print(f"📊 Array type: {type(maxdbz_dask)}")
print(f"📏 Array shape: {maxdbz_dask.shape}")
print(f"🧩 Chunk structure: {maxdbz_dask.chunks}")
print(f"💾 Chunk size: {maxdbz_dask.chunksize}")
print(f"🔢 Total chunks: {maxdbz_dask.npartitions}")

# Calculate theoretical memory usage
total_size_gb = radar_tw_ds.MaxDBZ.nbytes / 1e9
chunk_size_mb = (maxdbz_dask.chunksize[0] * maxdbz_dask.chunksize[1] * maxdbz_dask.chunksize[2] * 4) / 1e6  # 4 bytes per float32

print(f"\n💾 Memory Information:")
print(f"  • Total data size: {total_size_gb:.1f} GB")
print(f"  • Single chunk size: ~{chunk_size_mb:.1f} MB")

🔍 Dask Array Analysis:
📊 Array type: <class 'dask.array.core.Array'>
📏 Array shape: (558420, 561, 441)
🧩 Chunk structure: ((1, 1, 1, 1, 1, 1, 1, 1, ... (output truncated)
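
The dataset summary reports 553GB while the Dask table reports 514.66 GiB. Both describe the same array in different units. The sketch below reproduces the arithmetic in plain Python from the shape and chunk shape shown above. The variable names are illustrative and nothing here touches the real data.

```python
import math

# Shape and chunk shape reported for MaxDBZ above.
shape = (558420, 561, 441)
chunk_shape = (1, 561, 441)
bytes_per_value = 4  # float32

total_bytes = math.prod(shape) * bytes_per_value
chunk_bytes = math.prod(chunk_shape) * bytes_per_value
n_chunks = math.prod(shape) // math.prod(chunk_shape)

# Decimal units (GB, MB) divide by powers of 1000;
# binary units (GiB, MiB) divide by powers of 1024.
print(f"Total: {total_bytes / 1e9:.1f} GB = {total_bytes / 2**30:.2f} GiB")
print(f"Chunk: {chunk_bytes / 1e6:.2f} MB = {chunk_bytes / 2**20:.2f} MiB")
print(f"Chunks: {n_chunks}")
```

With one chunk per time step, printing `.chunks` yields a tuple with hundreds of thousands of entries. The Dask array attribute `numblocks`, which gives the number of chunks along each dimension, is a more compact way to inspect the layout.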

In [5]:
print("Building a plan to create a subset for analysis...")
print("=" * 50)

# --- Define Step 1 of our plan: Select a single day ---
daily_ds = radar_tw_ds.sel(time='2023-08-01')

# --- Define Step 2 of our plan: Filter the data ---
# This keeps all values where MaxDBZ is >= 0, and turns the rest into NaN.
interactive_ds = daily_ds.where(daily_ds.MaxDBZ >= 0)

print("The result is still a lazy Dask object, and no computation has happened yet:")
print("-" * 50)
print(interactive_ds)

Building a plan to create a subset for analysis...
The result is still a lazy Dask object, and no computation has happened yet:
--------------------------------------------------
<xarray.Dataset> Size: 143MB
Dimensions:    (time: 144, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 1kB 2023-08-01 ... 2023-08-01T23:50:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 143MB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>
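
The two steps above can be tried without the full radar archive. The sketch below builds a small synthetic dataset (the name `toy_ds`, the 3x4 grid, and the random values are all made up for illustration) and applies the same `.sel()` and `.where()` calls. It is NumPy-backed, so the operations run eagerly here; with the Dask-backed catalog data the same calls only extend the task graph.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the radar data: two days at 10-minute steps on a 3x4 grid.
times = pd.date_range("2023-08-01", periods=288, freq="10min")
rng = np.random.default_rng(0)
values = rng.uniform(-10, 60, size=(288, 3, 4)).astype("float32")

toy_ds = xr.Dataset(
    {"MaxDBZ": (("time", "latitude", "longitude"), values)},
    coords={
        "time": times,
        "latitude": [20.0, 20.5, 21.0],
        "longitude": [118.0, 118.5, 119.0, 119.5],
    },
)

# Partial-string selection keeps every timestamp that falls on that date.
toy_daily = toy_ds.sel(time="2023-08-01")

# .where() keeps values that satisfy the condition and replaces the rest with NaN.
toy_filtered = toy_daily.where(toy_daily.MaxDBZ >= 0)

# Every negative value, and nothing else, should now be NaN.
n_masked = int(toy_filtered.MaxDBZ.isnull().sum())
n_negative = int((toy_daily.MaxDBZ < 0).sum())
print(toy_daily.sizes["time"], n_masked == n_negative)
```

Selecting a date string on a 10-minute series returns all 144 time steps of that day, which matches the `time: 144` dimension in the output above.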


### Step 3: Create an Hourly Maximum Product with CF Metadata

In [None]:
from datetime import datetime, timezone

# --- Define Step 3 of our plan: Resample to Hourly Maximum ---

# Use Xarray's .resample() method. This groups the data into 1-hour bins along the time dimension.
# Then, we apply .max() to find the maximum value within each bin.
# This entire operation remains lazy!
# Note: the lowercase 'h' alias avoids the pandas FutureWarning raised for the deprecated 'H'.
hourly_max_dbz = interactive_ds.MaxDBZ.resample(time='1h').max(dim='time')


# --- Define Step 4 of our plan: Add CF-Compliant Metadata ---

print("Processing plan defined. Now adding rich metadata...")

# Add descriptive attributes that follow CF (Climate and Forecast) conventions
hourly_max_dbz.attrs['standard_name'] = 'radar_equivalent_reflectivity_factor'
hourly_max_dbz.attrs['long_name'] = 'Hourly Maximum Equivalent Radar Reflectivity Factor'
hourly_max_dbz.attrs['units'] = 'dBZ'
hourly_max_dbz.attrs['comment'] = 'Maximum radar reflectivity observed within each one-hour interval.'

# Add provenance and history, using the ISO 8601 standard for the timestamp
# This records exactly how the data was created.
iso_timestamp = datetime.now(timezone.utc).isoformat()
hourly_max_dbz.attrs['history'] = f'Created on {iso_timestamp}. Derived from Level 2 data by taking the hourly maximum.'

# Explicitly list the coordinates, which is good practice for CF compliance
hourly_max_dbz.attrs['coordinates'] = 'longitude latitude'


print("\n--- Final, Lazy Level-3 Data Product ---")
print(hourly_max_dbz)


✅ Processing plan defined. Now adding rich metadata...

--- Final, Lazy Level-3 Data Product ---
<xarray.DataArray 'MaxDBZ' (time: 24, latitude: 561, longitude: 441)> Size: 24MB
dask.array<stack, shape=(24, 561, 441), dtype=float32, chunksize=(1, 561, 441), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 192B 2023-08-01 ... 2023-08-01T23:00:00
Attributes:
    processing_notes:  NODATA_VALUE (<None) in the raw QPESUMS data is kept as the original value
    standard_name:     radar_equivalent_reflectivity_factor
    long_name:         Hourly Maximum Equivalent Radar Reflectivity Factor
    units:             dBZ
    comment:           Maximum radar reflectivity observed within each one-ho...
    history:           Created on 2025-08-19T16:34:41.361252+00:00. Derived f...
    coordinates:       longitude latitude
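
To see what the hourly resampling computes, here is a plain-Python illustration of the binning logic for a single grid point. It is a conceptual sketch only, not how Xarray or Dask perform the operation, and the `series` values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical single-pixel series: 144 ten-minute values for one day.
start = datetime(2023, 8, 1)
series = [(start + timedelta(minutes=10 * i), float(i % 7)) for i in range(144)]

# Group values by the start of the hour they fall in.
bins = defaultdict(list)
for timestamp, value in series:
    hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
    bins[hour_start].append(value)

# Take the maximum within each hourly bin.
hourly_max = {hour: max(vals) for hour, vals in sorted(bins.items())}

print(len(hourly_max))                   # number of hourly bins
print(min(hourly_max), max(hourly_max))  # first and last bin labels
```

Six 10-minute steps fall into each hour, so 144 steps collapse to 24 bins labeled by the start of each hour. This is why the `time` coordinate above runs from `2023-08-01` to `2023-08-01T23:00:00`.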


In [11]:
print("\nSaving data... This will TRIGGER the Dask computation now!")
output_path = '../data/qpsums_radar_hourly_max_2023_08_01.zarr'

# The .to_zarr() action tells Dask to compute everything and write the result to disk.
hourly_max_dbz.to_zarr(output_path, mode='w')

print(f"\n🎉 Success! Computation is complete. Results saved to: {output_path}")


Saving data... This will TRIGGER the Dask computation now!

🎉 Success! Computation is complete. Results saved to: ../data/qpsums_radar_hourly_max_2023_08_01.zarr


In [None]:
import xarray as xr 
# Re-open the store written in the previous step to verify the result
test_ds = xr.open_zarr('../data/qpsums_radar_hourly_max_2023_08_01.zarr')

print(test_ds)

<xarray.Dataset> Size: 24MB
Dimensions:    (time: 24, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 192B 2023-08-01 ... 2023-08-01T23:00:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 24MB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>


In [10]:
test_ds

| | Array | Chunk |
|---|---|---|
| **Bytes** | 22.65 MiB | 0.94 MiB |
| **Shape** | (24, 561, 441) | (1, 561, 441) |
| **Dask graph** | 24 chunks in 2 graph layers | |
| **Data type** | float32 numpy.ndarray | |
