# Extra Example: Parquet Data with Intake Catalog

This notebook demonstrates the basic workflow of loading tabular data (Parquet format) using Intake catalogs.

## Key Learning Points:

1. **Loading Parquet data through Intake catalogs**
2. **Basic difference between Pandas and Dask DataFrames**
3. **When to use Dask for larger datasets**

## Important Notes:

- Intake provides unified access to different data formats
- Pandas DataFrames: load the entire dataset into memory at once
- Dask DataFrames: load lazily, partition by partition, which suits datasets too large for memory
- Both are available from the same catalog entry (`.read()` for Pandas, `.to_dask()` for Dask)

In [54]:
import intake

# Load catalog
catalog = intake.open_catalog('../catalogs/station_intake_catalog.yaml')

# Explore available datasets
print("📁 Available datasets:")
for name in catalog:
    print(f"  • {name}")

print(f"\n📊 Total datasets: {len(list(catalog))}")

📁 Available datasets:
  • nyc_taxi_sample

📊 Total datasets: 1


In [55]:
# Method 1: Load as Pandas DataFrame (loads entire dataset into memory)
print("Loading data as Pandas DataFrame...")
df = catalog.nyc_taxi_sample.read()

print("\nDataset Information:")
print(f"- Shape: {df.shape}")
print(f"- Total rows: {len(df):,}")
print(f"- Total columns: {len(df.columns)}")
print(f"- Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\nColumns: {list(df.columns)}")

# Show first few rows
print("\nFirst 5 rows:")
df.head()

Loading data as Pandas DataFrame...

Dataset Information:
- Shape: (12741035, 19)
- Total rows: 12,741,035
- Total columns: 19
- Memory usage: 2843.29 MB

Columns: ['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'airport_fee']

First 5 rows:


| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2015-01-01 00:11:33 | 2015-01-01 00:16:48 | 1 | 1.0 | 1 | N | 41 | 166 | 1 | 5.7 | 0.5 | 0.5 | 1.4 | 0.0 | 0.0 | 8.4 | | |
| 1 | 1 | 2015-01-01 00:18:24 | 2015-01-01 00:24:20 | 1 | 0.9 | 1 | N | 166 | 238 | 3 | 6.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 7.3 | | |
| 2 | 1 | 2015-01-01 00:26:19 | 2015-01-01 00:41:06 | 1 | 3.5 | 1 | N | 238 | 162 | 1 | 13.2 | 0.5 | 0.5 | 2.9 | 0.0 | 0.0 | 17.4 | | |
| 3 | 1 | 2015-01-01 00:45:26 | 2015-01-01 00:53:20 | 1 | 2.1 | 1 | N | 162 | 263 | 1 | 8.2 | 0.5 | 0.5 | 2.37 | 0.0 | 0.0 | 11.87 | | |
| 4 | 1 | 2015-01-01 00:59:21 | 2015-01-01 01:05:24 | 1 | 1.0 | 1 | N | 236 | 141 | 3 | 6.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 7.3 | | |
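
The figures reported for the Pandas load (12,741,035 rows occupying 2843.29 MB) give a rough per-row cost, which helps judge how much of a larger file would fit in memory. The sketch below only does that arithmetic. The 16 GB budget and the `rows_that_fit` helper are hypothetical, and the estimate assumes the per-row cost stays roughly constant across the dataset.

```python
# Figures copied from the notebook output for the Pandas load.
ROWS = 12_741_035
MEMORY_MB = 2843.29


def bytes_per_row(memory_mb: float, rows: int) -> float:
    """Average in-memory size of one row, in bytes."""
    return memory_mb * 1024**2 / rows


def rows_that_fit(budget_gb: float, per_row_bytes: float) -> int:
    """Rough number of rows that fit in a memory budget (hypothetical helper)."""
    return int(budget_gb * 1024**3 / per_row_bytes)


per_row = bytes_per_row(MEMORY_MB, ROWS)
print(f"Average bytes per row: {per_row:.0f}")
# 16 GB is an assumed budget, not a property of this dataset.
print(f"Rows fitting in 16 GB: {rows_that_fit(16, per_row):,}")
```

This works out to roughly 234 bytes per row. Operations on the DataFrame usually need additional working memory, so treat the row count as an upper bound.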


In [None]:
# Method 2: Load as Dask DataFrame (lazy loading for larger datasets)
print("Loading data as Dask DataFrame...")
ddf = catalog.nyc_taxi_sample.to_dask()

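print("\nDataset Information:")
print(f"- Number of partitions: {ddf.npartitions}")
print(f"- Total columns: {len(ddf.columns)}")
# Note: measuring memory usage calls .compute(), which reads every partition,
# so this line is as expensive as loading the full dataset.
print(f"- Estimated memory usage: {ddf.memory_usage(deep=True).sum().compute() / 1024**2:.2f} MB")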

print(f"\nColumns: {list(ddf.columns)}")

# Show first few rows (this triggers computation on the first partition)
print("\nFirst 5 rows:")
ddf.head(5)

Loading data as Dask DataFrame...

Dataset Information:
- Number of partitions: 1
- Total columns: 19


In [None]:
# 📋 Comparison Summary
print("=" * 60)
print("PANDAS vs DASK DATAFRAME COMPARISON")
print("=" * 60)

print("\nPandas DataFrame:")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("   • Loading: All data loaded into memory immediately")
print("   • Operations: Immediate execution")
print("   • Best for: Small to medium datasets")

print("\nDask DataFrame:")
print(f"   • Partitions: {ddf.npartitions}")
print("   • Loading: Lazy evaluation (operations planned, not executed)")
print("   • Operations: Execute with .compute()")
print("   • Best for: Large datasets that don't fit in memory")

print("\n" + "=" * 60)
print("✅ Both methods access the same data through Intake catalog!")
print("💡 Choose based on your dataset size and memory constraints.")
print("=" * 60)

🐼 PANDAS vs ⚡ DASK DATAFRAME COMPARISON

🐼 Pandas DataFrame:
   • Memory usage: 2843.29 MB
   • Loading: All data loaded into memory immediately
   • Operations: Immediate execution
   • Best for: Small to medium datasets

⚡ Dask DataFrame:
   • Partitions: 1
   • Loading: Lazy evaluation (operations planned, not executed)
   • Operations: Execute with .compute()
   • Best for: Large datasets that don't fit in memory

✅ Both methods access the same data through Intake catalog!
💡 Choose based on your dataset size and memory constraints.
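
The closing advice can be turned into a rough rule of thumb. The sketch below is one possible heuristic, not a recommendation from Intake, Pandas, or Dask: the `choose_loader` helper and its `headroom` factor (spare memory for intermediate copies) are assumptions to tune for your own workload.

```python
def choose_loader(dataset_mb: float, available_mb: float, headroom: float = 3.0) -> str:
    """Return 'pandas' if the dataset plus working headroom fits in memory, else 'dask'.

    headroom is a hypothetical safety factor: many Pandas operations create
    temporary copies, so the dataset alone fitting in RAM is not enough.
    """
    if dataset_mb <= 0 or available_mb <= 0 or headroom < 1:
        raise ValueError("sizes must be positive and headroom must be >= 1")
    return "pandas" if dataset_mb * headroom <= available_mb else "dask"


# 2843.29 MB is the Pandas memory usage reported in this notebook;
# the two machine sizes are hypothetical.
print(choose_loader(2843.29, available_mb=16 * 1024))  # 16 GB machine
print(choose_loader(2843.29, available_mb=4 * 1024))   # 4 GB machine
```

Under these assumptions the 16 GB machine gets `pandas` (use `.read()`) and the 4 GB machine gets `dask` (use `.to_dask()`).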
