# Parquet Deep Dive: Advanced Usage Patterns

This notebook covers advanced Parquet features that are essential for production ML systems:

1. **Chunked reading/writing** - Process large files without loading everything into memory
2. **Partitioning** - Organize data by columns (e.g., by date, region) for faster queries
3. **Predicate pushdown** - Filter data during read (at the file level, not in Python)
4. **Reading multiple files** - Work with partitioned datasets
5. **Row group optimization** - Fine-tune internal structure for query performance
6. **Statistics and metadata** - Inspect file contents without reading data

In [None]:
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
from pathlib import Path
import time
import shutil

# Create output directories
output_dir = Path('../fixtures/output')
partitioned_dir = output_dir / 'partitioned_data'
output_dir.mkdir(exist_ok=True, parents=True)

print(f"PyArrow version: {pa.__version__}")

## 1. Create Large Sample Dataset

We'll create a realistic time-series dataset with multiple dimensions for partitioning.

In [None]:
np.random.seed(42)

# Create 1 million rows of e-commerce transaction data
n_rows = 1_000_000

# Generate timestamps at 30-second intervals; 1M rows span about 347 days (into mid-December 2023)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='30s')[:n_rows]

df = pd.DataFrame({
    'transaction_id': range(1, n_rows + 1),
    'timestamp': dates,
    'customer_id': np.random.randint(1000, 50000, n_rows),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books', 'Sports'], n_rows),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'amount': np.round(np.random.uniform(5, 500, n_rows), 2),
    'quantity': np.random.randint(1, 10, n_rows),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Cash'], n_rows),
})

# Extract date components for partitioning
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day

print(f"Dataset shape: {df.shape}")
print(f"Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
df.head()

## 2. Chunked Writing (Streaming Large DataFrames)

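When a dataset is too large to fit in memory, write it incrementally. PyArrow supports **chunked writing** using `ParquetWriter`: each table passed to `write_table` is appended to the same file as one or more row groups. The DataFrame below is already in memory; it stands in for chunks that would arrive from a database cursor or a file stream.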

In [None]:
# Write data in chunks
chunk_size = 100_000
output_file = output_dir / 'large_data.parquet'

# Derive the schema from the first chunk; every later chunk must match it
schema = pa.Table.from_pandas(df.iloc[:chunk_size], preserve_index=False).schema

# Write chunks; the context manager closes the writer even if a write fails
print(f"Writing {len(df):,} rows in chunks of {chunk_size:,}...")
with pq.ParquetWriter(output_file, schema, compression='snappy') as writer:
    for i in range(0, len(df), chunk_size):
        chunk = df.iloc[i:i + chunk_size]
        table = pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
        writer.write_table(table)
        print(f"  Wrote chunk {i//chunk_size + 1}: rows {i:,} to {min(i+chunk_size, len(df)):,}")

print(f"\nFile size: {output_file.stat().st_size / 1024**2:.2f} MB")

## 3. Chunked Reading (Processing Large Files)

Read a large Parquet file in bounded-size batches instead of loading it all at once. `iter_batches` yields `RecordBatch` objects of at most `batch_size` rows.

In [None]:
# Read file in batches
parquet_file = pq.ParquetFile(output_file)

print(f"File metadata:")
print(f"  Total rows: {parquet_file.metadata.num_rows:,}")
print(f"  Number of row groups: {parquet_file.num_row_groups}")
print(f"  Columns: {len(parquet_file.schema)}")

# Process data in batches
batch_size = 250_000
total_amount = 0

print(f"\nProcessing in batches of {batch_size:,} rows...")
for batch in parquet_file.iter_batches(batch_size=batch_size):
    # Convert to pandas for processing (or use PyArrow compute)
    batch_df = batch.to_pandas()
    total_amount += batch_df['amount'].sum()
    print(f"  Processed batch: {len(batch_df):,} rows")

print(f"\nTotal amount across all transactions: ${total_amount:,.2f}")

## 4. Partitioned Datasets

**Partitioning** organizes data into subdirectories based on column values. This is critical for query performance on large datasets.

Benefits:
- Skip entire partitions when filtering
- Parallel processing of partitions
- Better organization for time-series data

Common partitioning strategies:
- **Time-based**: year/month/day
- **Geographic**: region/country/city
- **Categorical**: product_type/category

In [None]:
# Clean up previous partitioned data
if partitioned_dir.exists():
    shutil.rmtree(partitioned_dir)

# Write partitioned dataset (by year and month)
print("Writing partitioned dataset (partitioned by year and month)...")
df.to_parquet(
    partitioned_dir,
    partition_cols=['year', 'month'],
    compression='snappy',
    index=False
)

print("\nPartitioned directory structure:")
# Show directory structure
for year_dir in sorted(partitioned_dir.glob('year=*')):
    print(f"  {year_dir.name}/")
    for month_dir in sorted(year_dir.glob('month=*')):
        files = list(month_dir.glob('*.parquet'))
        total_size = sum(f.stat().st_size for f in files)
        print(f"    {month_dir.name}/ ({len(files)} files, {total_size/1024**2:.2f} MB)")

### Reading Partitioned Datasets

In [None]:
# Read entire partitioned dataset
print("Reading entire partitioned dataset...")
start = time.time()
df_all = pd.read_parquet(partitioned_dir)
print(f"Read {len(df_all):,} rows in {time.time()-start:.3f}s")
print(f"Shape: {df_all.shape}")
df_all.head()

In [None]:
# Read only specific partitions (using filters)
print("Reading data only for year=2023, month=6...")
start = time.time()
df_june = pd.read_parquet(
    partitioned_dir,
    filters=[('year', '=', 2023), ('month', '=', 6)]
)
print(f"Read {len(df_june):,} rows in {time.time()-start:.3f}s")
print(f"Date range: {df_june['timestamp'].min()} to {df_june['timestamp'].max()}")
df_june.head()

## 5. Predicate Pushdown (Filtering During Read)

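**Predicate pushdown** applies filters during the read. The reader compares each filter with the row-group min/max statistics, skips row groups that cannot match, and drops the remaining non-matching rows before the data reaches pandas.

The alternative is usually slower when the filter is selective:
1. Reading entire file
2. Then filtering in Python

How much you gain depends on data layout. `region` is randomly distributed in this dataset, so nearly every row group contains every region and few, if any, can be skipped. Sorted or clustered columns benefit the most.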

In [None]:
# WITHOUT predicate pushdown (read then filter)
print("Method 1: Read entire file, then filter in pandas")
start = time.time()
df_full = pd.read_parquet(output_file)
df_filtered_python = df_full[df_full['region'] == 'North']
time_without_pushdown = time.time() - start
print(f"  Time: {time_without_pushdown:.3f}s")
print(f"  Rows: {len(df_filtered_python):,}")

# WITH predicate pushdown (filter during read)
print("\nMethod 2: Filter during read (predicate pushdown)")
start = time.time()
df_filtered_pushdown = pd.read_parquet(
    output_file,
    filters=[('region', '=', 'North')]
)
time_with_pushdown = time.time() - start
print(f"  Time: {time_with_pushdown:.3f}s")
print(f"  Rows: {len(df_filtered_pushdown):,}")

print(f"\nSpeedup: {time_without_pushdown/time_with_pushdown:.2f}x faster")

### Complex Filters

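PyArrow's `filters` argument takes `(column, operator, value)` tuples. Tuples in a flat list are combined with AND. Supported operators are `=`, `==`, `!=`, `<`, `>`, `<=`, `>=`, `in`, and `not in`: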

In [None]:
# Multiple conditions (AND)
df_complex = pd.read_parquet(
    output_file,
    filters=[
        ('region', '=', 'North'),
        ('amount', '>', 100),
        ('product_category', 'in', ['Electronics', 'Books'])
    ]
)
print(f"Complex filter (North region, amount > 100, Electronics or Books):")
print(f"  Rows: {len(df_complex):,}")
print(f"  Average amount: ${df_complex['amount'].mean():.2f}")

# Show sample
df_complex.head()

## 6. Reading Multiple Parquet Files

Common pattern: Read all Parquet files from a directory as a single DataFrame.

In [None]:
# Create multiple small files
multi_file_dir = output_dir / 'multi_files'
multi_file_dir.mkdir(exist_ok=True)

# Split data into 4 files by region
for region in df['region'].unique():
    region_df = df[df['region'] == region]
    region_df.to_parquet(multi_file_dir / f'data_{region}.parquet', index=False)
    print(f"Wrote {region}: {len(region_df):,} rows")

print(f"\nFiles created:")
for f in sorted(multi_file_dir.glob('*.parquet')):
    print(f"  {f.name} ({f.stat().st_size/1024**2:.2f} MB)")

In [None]:
# Read all files in directory
print("Reading all Parquet files in directory...")
df_combined = pd.read_parquet(multi_file_dir)
print(f"Total rows: {len(df_combined):,}")
print(f"Regions: {df_combined['region'].unique()}")
print(f"\nRows per region:")
print(df_combined['region'].value_counts())

### Reading Specific Files with Glob Patterns

In [None]:
# Read only North and South regions
# The [NS] character class matches file names starting with data_N or data_S
north_south_files = sorted(multi_file_dir.glob('data_[NS]*.parquet'))

df_north_south = pd.concat(
    [pd.read_parquet(f) for f in north_south_files],
    ignore_index=True  # avoid duplicate index labels across files
)

print(f"Read {len(df_north_south):,} rows from North and South")
print(df_north_south['region'].value_counts())

## 7. Row Groups and File Structure

Parquet files are organized into **row groups**:
- Each row group contains a chunk of rows
- Stores min/max statistics per column
- Enables predicate pushdown at row group level

Smaller row groups = better filtering, larger overhead  
Larger row groups = less overhead, coarser filtering

In [None]:
# Write with custom row group size
file_small_groups = output_dir / 'small_row_groups.parquet'
file_large_groups = output_dir / 'large_row_groups.parquet'

# Small row groups (50k rows each)
table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    file_small_groups,
    row_group_size=50_000,
    compression='snappy'
)

# Large row groups (500k rows each)
pq.write_table(
    table,
    file_large_groups,
    row_group_size=500_000,
    compression='snappy'
)

# Compare metadata
print("Small row groups file:")
pf_small = pq.ParquetFile(file_small_groups)
print(f"  Total rows: {pf_small.metadata.num_rows:,}")
print(f"  Row groups: {pf_small.num_row_groups}")
print(f"  File size: {file_small_groups.stat().st_size/1024**2:.2f} MB")

print("\nLarge row groups file:")
pf_large = pq.ParquetFile(file_large_groups)
print(f"  Total rows: {pf_large.metadata.num_rows:,}")
print(f"  Row groups: {pf_large.num_row_groups}")
print(f"  File size: {file_large_groups.stat().st_size/1024**2:.2f} MB")

## 8. Metadata and Statistics

Parquet stores rich metadata in the file footer, so you can inspect the schema, row counts, and column statistics without reading the actual data.

In [None]:
# Inspect file metadata
parquet_file = pq.ParquetFile(output_file)

print("File Metadata:")
print(f"  Created by: {parquet_file.metadata.created_by}")
print(f"  Total rows: {parquet_file.metadata.num_rows:,}")
print(f"  Number of row groups: {parquet_file.num_row_groups}")
print(f"  Number of columns: {parquet_file.metadata.num_columns}")

print("\nSchema:")
print(parquet_file.schema)

In [None]:
# Row group statistics (min/max values per column)
print("Row Group Statistics (first row group):")
rg = parquet_file.metadata.row_group(0)

print(f"  Total rows in this row group: {rg.num_rows:,}")
print(f"  Uncompressed byte size: {rg.total_byte_size/1024**2:.2f} MB")

print("\n  Column Statistics:")
for i in range(min(5, rg.num_columns)):  # Show first 5 columns
    col = rg.column(i)
    print(f"    {col.path_in_schema}:")
    if col.statistics:
        stats = col.statistics
        print(f"      Min: {stats.min}")
        print(f"      Max: {stats.max}")
        print(f"      Null count: {stats.null_count}")

## 9. PyArrow Compute Functions

Process data directly on Arrow tables. This often uses less memory because it avoids copying the data into pandas objects.

In [None]:
# Read as PyArrow table
table = pq.read_table(output_file)

print(f"PyArrow Table:")
print(f"  Rows: {table.num_rows:,}")
print(f"  Columns: {table.num_columns}")

# Compute statistics using PyArrow (without pandas)
amount_col = table['amount']

print("\nColumn statistics (computed with PyArrow):")
print(f"  Sum: ${pc.sum(amount_col).as_py():,.2f}")
print(f"  Mean: ${pc.mean(amount_col).as_py():.2f}")
print(f"  Min: ${pc.min(amount_col).as_py():.2f}")
print(f"  Max: ${pc.max(amount_col).as_py():.2f}")

In [None]:
# Filter using PyArrow compute
filtered_table = table.filter(
    (pc.field('region') == 'North') & (pc.field('amount') > 200)
)

print(f"Filtered table (North region, amount > 200):")
print(f"  Rows: {filtered_table.num_rows:,}")

# Convert to pandas only at the end
filtered_df = filtered_table.to_pandas()
print(f"\nSample data:")
filtered_df.head()

## 10. Best Practices Summary

### Partitioning Strategy
1. **Partition by columns frequently used in filters** (e.g., date, region)
2. **Avoid high cardinality** - don't partition by columns with millions of unique values
3. **Aim for partition sizes of 100MB-1GB** - not too small, not too large
4. **Hierarchical partitioning**: year/month/day (coarse to fine)

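### Row Group Size
1. **PyArrow sizes row groups in rows, not bytes** - `row_group_size` is a maximum row count, as in section 7
2. **The common 128MB guideline** comes from engines that size row groups in bytes (e.g., parquet-mr as used by Spark); convert it to a row count using your average row width
3. **Smaller row groups (around 50MB)** = better filtering, more overhead
4. **Larger row groups (256MB+)** = less overhead, coarser filtering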

### Compression
1. **Snappy**: Default, good balance (fast compression/decompression)
2. **Zstd**: Better compression ratio, moderate speed
3. **Gzip**: Good compression, slower than Snappy/Zstd
4. **None**: Only for temporary files or when speed is critical

### Reading
1. **Always use column selection** - only read what you need
2. **Use filters/predicate pushdown** - filter during read, not after
3. **Process in batches** for large files - use `iter_batches()`
4. **Leverage partitioning** - read only relevant partitions

### Writing
1. **Use chunked writing** for large datasets - `ParquetWriter`
2. **Set appropriate row group size** based on query patterns
3. **Partition strategically** - balance between too many/too few partitions
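4. **Keep partition columns in the DataFrame you write** - the writer moves them into directory names (e.g., `year=2023/month=6`) and leaves them out of the data files; readers rebuild them from the paths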

In [None]:
# Summarize the files created by this notebook (nothing is deleted here)
print("Notebook complete!")
print(f"\nCreated files in: {output_dir}")
print(f"Total disk usage: {sum(f.stat().st_size for f in output_dir.rglob('*.parquet'))/1024**2:.2f} MB")