# Extra Example: Parquet Data with Intake Catalog

This notebook demonstrates the basic workflow of loading tabular data (Parquet format) using Intake catalogs.

## Key Learning Points:

1. **Loading Parquet data through Intake catalogs**
2. **Basic difference between Pandas and Dask DataFrames**
3. **When to use Dask for larger datasets**

## Important Notes:

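- Intake provides a single catalog interface for data stored in different formats
- Pandas DataFrames (`.read()`): load the entire dataset into memory at once
- Dask DataFrames (`.to_dask()`): split the data into partitions and load them lazily, which suits datasets too large for memory
- Both are created from the same catalog entry, so switching between them is a one-line change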
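
The eager versus lazy distinction above can be sketched without any external libraries. The example below is a plain-Python toy, not how Pandas or Dask are implemented: `PARTITIONS`, `read_partition`, and `plan` are hypothetical names made up for illustration, and two small lists stand in for the chunks of a Parquet dataset. It only shows the idea that an eager load reads everything immediately, while a lazy plan reads nothing until it is executed.

```python
from typing import Callable, List

# Hypothetical stand-in for a partitioned dataset: two chunks of fare values.
PARTITIONS = [[10.0, 12.5], [7.5, 20.0]]
reads: List[int] = []  # records which partitions have actually been read


def read_partition(i: int) -> List[float]:
    """Pretend to read partition i from disk and log the access."""
    reads.append(i)
    return PARTITIONS[i]


# Eager (Pandas-like): every partition is read up front and held in memory.
eager_rows = [fare for i in range(len(PARTITIONS)) for fare in read_partition(i)]
eager_total = sum(eager_rows)
eager_reads = list(reads)

# Lazy (Dask-like): build a plan of per-partition tasks; nothing is read yet.
reads.clear()
plan: List[Callable[[], float]] = [
    (lambda i=i: sum(read_partition(i))) for i in range(len(PARTITIONS))
]
reads_before_compute = list(reads)

# "Compute": run the tasks one partition at a time and combine partial sums.
lazy_total = sum(task() for task in plan)

print(f"Eager total: {eager_total}, partitions read at load time: {eager_reads}")
print(f"Lazy total:  {lazy_total}, partitions read before compute: {reads_before_compute}")
```

Both approaches produce the same total. The difference is when the data is read and how much of it has to be held in memory at once, which is the trade-off the notes above describe.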

In [None]:
import intake

# Load catalog
catalog = intake.open_catalog('../catalogs/station_intake_catalog.yaml')

# Explore available datasets
print("📁 Available datasets:")
for name in catalog:
    print(f"  • {name}")

print(f"\n📊 Total datasets: {len(list(catalog))}")

In [None]:
# Method 1: Load as Pandas DataFrame (loads entire dataset into memory)
print("Loading data as Pandas DataFrame...")
df = catalog.nyc_taxi_sample.read()

print("\nDataset Information:")
print(f"- Shape: {df.shape}")
print(f"- Total rows: {len(df):,}")
print(f"- Total columns: {len(df.columns)}")
print(f"- Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\nColumns: {list(df.columns)}")

# Show first few rows
print("\nFirst 5 rows:")
df.head()

In [None]:
# Method 2: Load as Dask DataFrame (lazy loading for larger datasets)
print("Loading data as Dask DataFrame...")
ddf = catalog.nyc_taxi_sample.to_dask()

print("\nDataset Information:")
print(f"- Number of partitions: {ddf.npartitions}")
print(f"- Total columns: {len(ddf.columns)}")
# Note: .compute() below reads every partition in order to measure memory usage
print(f"- Estimated memory usage: {ddf.memory_usage(deep=True).sum().compute() / 1024**2:.2f} MB")

print(f"\nColumns: {list(ddf.columns)}")

# Show first few rows (head() triggers computation, by default on the first partition only)
print("\nFirst 5 rows:")
ddf.head(5)

In [None]:
# 📋 Comparison Summary
print("=" * 60)
print("PANDAS vs DASK DATAFRAME COMPARISON")
print("=" * 60)

print("\nPandas DataFrame:")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("   • Loading: All data loaded into memory immediately")
print("   • Operations: Immediate execution")
print("   • Best for: Small to medium datasets")

print("\nDask DataFrame:")
print(f"   • Partitions: {ddf.npartitions}")
print("   • Loading: Lazy evaluation (operations planned, not executed)")
print("   • Operations: Execute with .compute()")
print("   • Best for: Large datasets that don't fit in memory")

print("\n" + "=" * 60)
print("✅ Both methods access the same data through Intake catalog!")
print("💡 Choose based on your dataset size and memory constraints.")
print("=" * 60)