# Part 2: Intake - Data Cataloging Made Simple

## 🎯 Learning Objectives

By the end of this section, you will understand how to:

1. **Create and manage data catalogs** using Intake
2. **Access datasets** through standardized interfaces
3. **Handle different data formats** seamlessly
4. **Share data access patterns** across teams
5. **Debug common data loading issues**

## 🤔 The Problem: Data Access Chaos

In typical data science workflows, teams face these challenges:

### Traditional Pain Points:
- **Scattered Data:** Files stored in different locations with inconsistent naming
- **Format Confusion:** Different file formats (NetCDF, Zarr, CSV, Parquet) require different loading code
- **Path Dependencies:** Hard-coded file paths break when data moves
- **Knowledge Silos:** Only certain team members know where data is and how to load it
- **Inconsistent Loading:** Different team members load the same data differently

### The Intake Solution:
Intake provides a **unified catalog interface** that abstracts away the complexity of data access.

| Problem | Traditional Approach | Intake Solution |
|---------|---------------------|-----------------|
| **Multiple formats** | Different loading code for each format | Unified `.read()` and `.to_dask()` methods |
| **Path management** | Hard-coded paths in code | Centralized catalog with logical names |
| **Documentation** | Scattered README files | Rich metadata in catalog |
| **Sharing** | Email file paths and instructions | Share catalog YAML file |
| **Reproducibility** | "Works on my machine" | Consistent access patterns |

## 1. What is Intake?

**Intake** is a lightweight Python package for finding, investigating, loading, and disseminating data. It is organized around a few core concepts.

### Key Concepts:

1. **Data Catalogs**: YAML files that describe where data lives and how to access it
2. **Data Sources**: Individual datasets with metadata and loading instructions  
3. **Drivers**: Plugins that know how to load specific data formats
4. **Parameters**: Dynamic configuration for flexible data access
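
To make the first three concepts concrete, here is a toy sketch in plain Python. It is not how Intake is implemented: the `DRIVERS` registry, the `CATALOG` dictionary, the `load_source` helper, and the sample rows are all hypothetical names and made-up values chosen for illustration, and only the standard library is used.

```python
import csv
import io
import json

# Hypothetical loader functions standing in for format-specific drivers.
def load_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    return json.loads(text)

# Hypothetical driver registry: maps a driver name to a loader function.
# Real Intake drivers are installed plugins; this only mimics the idea.
DRIVERS = {"csv": load_csv, "json": load_json}

# Toy "catalog": logical names mapped to a driver, its arguments, and a
# description. In-memory strings stand in for files so the sketch runs anywhere.
CATALOG = {
    "stations": {
        "driver": "csv",
        "args": {"text": "name,lat\nTaipei,25.03\nTainan,22.99\n"},
        "description": "Station locations (made-up sample rows)",
    },
    "settings": {
        "driver": "json",
        "args": {"text": '{"units": "dBZ"}'},
        "description": "Plot settings (made-up sample)",
    },
}

def load_source(name):
    """Look up a source by logical name and dispatch to its driver."""
    entry = CATALOG[name]
    loader = DRIVERS[entry["driver"]]
    return loader(**entry["args"])

stations = load_source("stations")
settings = load_source("settings")
print(stations[0]["name"], settings["units"])
```

The caller only ever uses the logical name (`"stations"`); which loader runs and which arguments it receives are decided by the catalog entry, which is the separation a catalog is meant to provide.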

### Catalog Anatomy:

```yaml
sources:                    # Collection of datasets
  my_dataset:              # Logical name for the dataset
    driver: zarr           # How to load the data (zarr, csv, netcdf, etc.)
    args:                  # Arguments passed to the driver
      urlpath: "/path/to/data.zarr"
    description: "..."     # Human-readable description
    metadata:              # Additional information
      tags: ["weather", "gridded"]
```
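
Because a catalog is plain structured data, it can be checked and searched before any data is loaded. The sketch below mirrors the YAML structure above as a Python dictionary and uses only the standard library. The `broken_entry` source and the `check_entry` and `find_by_tag` helpers are hypothetical, and the check assumes file-based drivers that expect a `urlpath` argument, which may not hold for every driver.

```python
# Python mirror of the YAML structure shown above. The second entry is a
# hypothetical, deliberately incomplete source used to trigger the checks.
catalog_spec = {
    "sources": {
        "my_dataset": {
            "driver": "zarr",
            "args": {"urlpath": "/path/to/data.zarr"},
            "description": "...",
            "metadata": {"tags": ["weather", "gridded"]},
        },
        "broken_entry": {
            "args": {},
            "metadata": {"tags": ["weather"]},
        },
    }
}

def check_entry(name, entry):
    """Return a list of human-readable problems for one catalog entry."""
    problems = []
    if "driver" not in entry:
        problems.append(f"{name}: missing 'driver'")
    # Assumption: the drivers in this sketch are file-based and need args.urlpath.
    if "urlpath" not in entry.get("args", {}):
        problems.append(f"{name}: missing 'args.urlpath'")
    return problems

def find_by_tag(spec, tag):
    """List source names whose metadata tags include `tag`."""
    return [
        name
        for name, entry in spec["sources"].items()
        if tag in entry.get("metadata", {}).get("tags", [])
    ]

issues = [
    problem
    for name, entry in catalog_spec["sources"].items()
    for problem in check_entry(name, entry)
]
gridded = find_by_tag(catalog_spec, "gridded")
print(issues)
print(gridded)
```

A check like this can catch a missing `driver` or path before a load fails with a less obvious error, and the tag search shows how `metadata` supports discovery.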

### Benefits:
- 🔍 **Discoverability**: Browse available datasets
- 📖 **Documentation**: Rich metadata and descriptions
- 🔄 **Reproducibility**: Consistent data access patterns
- 🤝 **Collaboration**: Share data access through catalog files
- ⚡ **Performance**: Lazy loading and optimization hints

## 2. Hands-On: Working with Our Radar Data Catalog

Let's explore how to use Intake with a real radar dataset. We'll start by loading and exploring our catalog.

### Step 1: Loading a Data Catalog

First, let's load our radar data catalog and explore what datasets are available.

```python
import intake

# Load the radar data catalog
catalog = intake.open_catalog('../catalogs/radar_intake_catalog.yaml')

# Explore what's in the catalog
print("📁 Available datasets in catalog:")
print("-" * 40)
for name in catalog:
    print(f"  • {name}")

print(f"\n📊 Total datasets: {len(list(catalog))}")

# Let's look at the catalog object itself
print(f"\nCatalog type: {type(catalog)}")
print(f"Catalog path: {catalog.path}")
```

```text
📁 Available datasets in catalog:
----------------------------------------
  • QPSUMS_tw

📊 Total datasets: 1

Catalog type: <class 'intake.catalog.local.YAMLFileCatalog'>
Catalog path: ../catalogs/radar_intake_catalog.yaml
```


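Next, open the `QPSUMS_tw` source. Calling `.to_dask()` returns a lazily loaded, dask-backed xarray Dataset, so no radar data is read into memory at this point.

```python
# Open the QPSUMS_tw source lazily as a dask-backed xarray Dataset
radar_tw_ds = catalog.QPSUMS_tw.to_dask()
print(radar_tw_ds)

# In a notebook, a bare expression also renders the rich HTML summary
radar_tw_ds
```

```text
<xarray.Dataset> Size: 553GB
Dimensions:    (time: 558420, latitude: 561, longitude: 441)
Coordinates:
  * latitude   (latitude) float64 4kB 20.0 20.01 20.02 ... 26.98 26.99 27.0
  * longitude  (longitude) float64 4kB 118.0 118.0 118.0 ... 123.5 123.5 123.5
  * time       (time) datetime64[ns] 4MB 2013-01-01 ... 2023-08-31T23:50:00
Data variables:
    MaxDBZ     (time, latitude, longitude) float32 553GB dask.array<chunksize=(1, 561, 441), meta=np.ndarray>
```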


Dask array summary for `MaxDBZ`:

| | Array | Chunk |
|---|-------|-------|
| **Bytes** | 514.66 GiB | 0.94 MiB |
| **Shape** | (558420, 561, 441) | (1, 561, 441) |
| **Dask graph** | 558420 chunks in 23 graph layers | |
| **Data type** | float32 numpy.ndarray | |
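
The dataset repr above reports both `553GB` and `514.66 GiB` for the same array. A quick back-of-the-envelope check, using only the shape, chunk shape, and dtype shown in that output, suggests these are the same size in decimal and binary units. The variable names below are illustrative.

```python
import math

# Shape, chunk shape, and dtype size taken from the dataset repr above.
shape = (558420, 561, 441)    # (time, latitude, longitude)
chunk_shape = (1, 561, 441)   # one time step per chunk
itemsize = 4                  # float32 uses 4 bytes per value

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunk_shape) * itemsize

# Number of chunks = product of ceil(dim / chunk_dim) along each axis.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))

print(f"total: {total_bytes / 1e9:.0f} GB = {total_bytes / 2**30:.2f} GiB")
print(f"chunk: {chunk_bytes / 2**20:.2f} MiB, chunks: {n_chunks}")
```

This gives about 553 GB (514.66 GiB) in total, 0.94 MiB per chunk, and 558420 chunks, matching the summary. With one chunk per time step, any operation that spans the full time axis touches every one of those chunks, which is worth keeping in mind when planning computations on this dataset.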
