# Loading Large/Nested Contest Analyze Data

This notebook demonstrates how to load contest_analyze data that has:
- Large nested JSON structures
- Objects exceeding DuckDB's default 16MB limit
- Deeply nested dictionaries and lists

**Problem:** Contest analyze data has very large nested objects (~33MB+) that exceed DuckDB's default `maximum_object_size`.

**Solution:** Configure DuckDB with larger limits and use memory-efficient queries.

## Setup with Large Object Support
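
The setup cell below reads credentials with `os.getenv`, which returns `None` for an unset variable, so a missing key would be passed to DuckDB as the literal string `'None'`. A minimal pre-flight check could look like the sketch below; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here, not part of the notebook.

```python
import os

# Variable names taken from the setup cell. WASABI_ENDPOINT has a default
# there, so it is not treated as required in this sketch.
REQUIRED_VARS = ("WASABI_ACCESS_KEY", "WASABI_SECRET_KEY", "WASABI_BUCKET_NAME")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Wasabi credentials are set")
```

Running this before the connection is configured turns a confusing S3 authentication failure into an explicit message.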

In [None]:
import duckdb
import os
import json
import pandas as pd

# Get Wasabi credentials
wasabi_endpoint = os.getenv('WASABI_ENDPOINT', 's3.us-east-2.wasabisys.com')
wasabi_access_key = os.getenv('WASABI_ACCESS_KEY')
wasabi_secret_key = os.getenv('WASABI_SECRET_KEY')
bucket_name = os.getenv('WASABI_BUCKET_NAME')

# Create DuckDB connection
con = duckdb.connect()

# Configure S3 settings
con.execute(f"""
    SET s3_endpoint='{wasabi_endpoint}';
    SET s3_access_key_id='{wasabi_access_key}';
    SET s3_secret_access_key='{wasabi_secret_key}';
    SET s3_url_style='path';
""")

print("✓ DuckDB configured with S3 credentials")
print(f"Endpoint: {wasabi_endpoint}")
print(f"Bucket: {bucket_name}")

## Configuration

In [None]:
# Update these to match your data
SPORT = "NFL"  # e.g., "NFL", "NBA", etc.
DATE = "2025-10-02"  # Format: YYYY-MM-DD
GAME_TYPE = "dk_single_game"  # e.g., "classic", "dk_single_game", etc.

# Maximum JSON object size (default is 16MB, we're setting to 64MB)
MAX_OBJECT_SIZE = 67108864  # 64MB in bytes

# Construct path
contest_analyze_path = f"s3://{bucket_name}/staging/{SPORT}/contest_analyze/{GAME_TYPE}/{DATE}/data.json.gz"

print(f"Loading from: {contest_analyze_path}")
print(f"Max object size: {MAX_OBJECT_SIZE / 1024 / 1024:.0f}MB")

## Understanding the Error

**Error you saw:**
```
InvalidInputException: "maximum_object_size" of 16777216 bytes exceeded
while reading file (>33554428 bytes)
```

**What this means:**
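- At least one JSON object in your file is larger than ~33.5MB (the `>` in the message marks a lower bound, not the exact size)
- DuckDB's default limit is 16MB (16,777,216 bytes)
- Your contest_analyze data has deeply nested structures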

**Why this happens:**
- Contest analyze responses contain massive nested dicts
- Each contest has detailed analysis data with many fields
- Pandas saved these as-is (preserving structure)

**The fix:**
- Set `maximum_object_size` parameter in `read_json()`
- Use memory-efficient queries (filter early, don't load all at once)
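
Because every query in this notebook repeats the same `read_json(...)` arguments, one way to keep the `maximum_object_size` setting in a single place is a small string-building helper. This is a sketch only: `read_json_source` is a hypothetical name, and the bucket path in the usage line is a placeholder.

```python
def read_json_source(path, max_object_size=64 * 1024 * 1024):
    """Build the read_json(...) table expression (hypothetical helper)."""
    escaped = path.replace("'", "''")  # escape quotes for a SQL string literal
    return (
        "read_json("
        f"'{escaped}', "
        "format='newline_delimited', "
        "compression='gzip', "
        f"maximum_object_size={int(max_object_size)}"
        ")"
    )


# Placeholder path; in this notebook you would pass contest_analyze_path.
sql = f"SELECT COUNT(*) AS total_records FROM {read_json_source('s3://my-bucket/data.json.gz')}"
print(sql)
```

The resulting string can be passed to `con.execute(...)` in place of the inline `read_json(...)` calls used in the examples.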

## Example 1: Load Contest Analyze Data (with large object support)

In [None]:
# Load with increased maximum_object_size
result = con.execute(f"""
    SELECT * 
    FROM read_json(
        '{contest_analyze_path}', 
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
    LIMIT 5
""").df()

print(f"✓ Successfully loaded {len(result)} rows")
print(f"\nColumns: {list(result.columns)}")
result

In [None]:
# Count total records
count = con.execute(f"""
    SELECT COUNT(*) as total_records
    FROM read_json(
        '{contest_analyze_path}',
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
""").df()

print(f"Total records: {count['total_records'][0]}")

## Example 2: Inspect Data Structure

In [None]:
# Get schema
schema = con.execute(f"""
    DESCRIBE 
    SELECT * FROM read_json(
        '{contest_analyze_path}',
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
""").df()

print("Schema of contest_analyze data:")
print("=" * 70)
for _, row in schema.iterrows():
    print(f"{row['column_name']:40s} : {row['column_type']}")

In [None]:
# Examine one record in detail
sample = con.execute(f"""
    SELECT *
    FROM read_json(
        '{contest_analyze_path}',
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
    LIMIT 1
""").df()

print("\nDetailed structure of one record:")
print("=" * 70)

for col in sample.columns:
    value = sample[col].iloc[0]
    print(f"\n{col}:")
    print(f"  Type: {type(value).__name__}")
    
    if isinstance(value, dict):
        keys = list(value.keys())
        print(f"  Dict with {len(keys)} keys")
        print(f"  First 10 keys: {keys[:10]}")
        if len(keys) > 10:
            print(f"  ... and {len(keys) - 10} more keys")
    elif isinstance(value, list):
        print(f"  List with {len(value)} items")
        if len(value) > 0:
            print(f"  First item type: {type(value[0]).__name__}")
            if isinstance(value[0], dict):
                print(f"  First item keys: {list(value[0].keys())[:5]}")
    elif isinstance(value, str) and len(value) > 100:
        print(f"  String (length: {len(value)} chars)")
        print(f"  Preview: {value[:100]}...")
    else:
        print(f"  Value: {value}")

## Example 3: Memory-Efficient Queries

**IMPORTANT:** Don't load all data at once! Use filters and aggregations.

In [None]:
# Example: Get just the columns you need (not all columns)
# Adjust column names based on your actual schema

# selected_cols = con.execute(f"""
#     SELECT 
#         contest_id,
#         -- Add other specific columns you need
#         load_ts
#     FROM read_json(
#         '{contest_analyze_path}',
#         format='newline_delimited',
#         compression='gzip',
#         maximum_object_size={MAX_OBJECT_SIZE}
#     )
#     LIMIT 100
# """).df()

print("Best practice: SELECT only columns you need")
print("This reduces memory usage significantly!")

In [None]:
# Example: Filter data before loading
# filtered = con.execute(f"""
#     SELECT *
#     FROM read_json(
#         '{contest_analyze_path}',
#         format='newline_delimited',
#         compression='gzip',
#         maximum_object_size={MAX_OBJECT_SIZE}
#     )
#     WHERE contest_id = 12345  -- Filter early!
# """).df()

print("Add WHERE clauses to filter BEFORE loading into pandas")
print("DuckDB still scans the file, but only matching records reach pandas")

In [None]:
# Example: Aggregate without loading all data
# stats = con.execute(f"""
#     SELECT 
#         COUNT(*) as total_contests,
#         COUNT(DISTINCT contest_id) as unique_contests,
#         MIN(load_ts) as earliest_load,
#         MAX(load_ts) as latest_load
#     FROM read_json(
#         '{contest_analyze_path}',
#         format='newline_delimited',
#         compression='gzip',
#         maximum_object_size={MAX_OBJECT_SIZE}
#     )
# """).df()

print("Aggregate in SQL - much faster and uses less memory")
print("Let DuckDB do the heavy lifting!")

## Example 4: Working with Nested Fields

If your data has nested dicts/lists, here's how to access them:

In [None]:
# Method 1: JSON operators (if stored as JSON strings)
# result = con.execute(f"""
#     SELECT 
#         contest_id,
#         nested_column->>'$.some_key' as extracted_value,
#         nested_column->'$.another_key' as extracted_json
#     FROM read_json(
#         '{contest_analyze_path}',
#         format='newline_delimited',
#         compression='gzip',
#         maximum_object_size={MAX_OBJECT_SIZE}
#     )
#     LIMIT 10
# """).df()

print("JSON operators for nested access:")
print("  column->'$.key'      -- Returns JSON")
print("  column->>'$.key'     -- Returns string")
print("  column['key']        -- For STRUCT types")
print("  column[1]            -- For LIST types (1-indexed)")

In [None]:
# Method 2: Load into pandas and process there
# (Only for small filtered datasets!)

# sample_df = con.execute(f"""
#     SELECT *
#     FROM read_json(
#         '{contest_analyze_path}',
#         format='newline_delimited',
#         compression='gzip',
#         maximum_object_size={MAX_OBJECT_SIZE}
#     )
#     LIMIT 1
# """).df()

# # Access nested fields in pandas
# for col in sample_df.columns:
#     value = sample_df[col].iloc[0]
#     if isinstance(value, dict):
#         print(f"{col} keys: {list(value.keys())}")

print("Process nested data in pandas if needed")
print("But ONLY after filtering to small dataset!")

## Example 5: Check Data Quality

In [None]:
# Verify load timestamp
ts_check = con.execute(f"""
    SELECT 
        MIN(load_ts) as earliest_load,
        MAX(load_ts) as latest_load,
        COUNT(*) as record_count,
        COUNT(DISTINCT load_ts) as unique_timestamps
    FROM read_json(
        '{contest_analyze_path}',
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
""").df()

print("Load timestamp verification:")
ts_check

## Alternative: Process in Chunks

If even filtered queries use too much memory, process data in chunks:

In [None]:
# Example: Process in chunks using LIMIT/OFFSET
# Note: each query re-reads the gzipped file from the start, so later chunks cost more
chunk_size = 10
offset = 0

results = []

# Get total count first
total = con.execute(f"""
    SELECT COUNT(*) as cnt
    FROM read_json(
        '{contest_analyze_path}',
        format='newline_delimited',
        compression='gzip',
        maximum_object_size={MAX_OBJECT_SIZE}
    )
""").fetchone()[0]

print(f"Total records: {total}")
print(f"Processing in chunks of {chunk_size}...")

# Process first 2 chunks as example
for i in range(min(2, (total + chunk_size - 1) // chunk_size)):
    chunk = con.execute(f"""
        SELECT *
        FROM read_json(
            '{contest_analyze_path}',
            format='newline_delimited',
            compression='gzip',
            maximum_object_size={MAX_OBJECT_SIZE}
        )
        LIMIT {chunk_size}
        OFFSET {offset}
    """).df()
    
    print(f"  Chunk {i+1}: {len(chunk)} records")
    
    # Process chunk here
    # results.append(chunk)
    
    offset += chunk_size

print("\nChunked processing prevents memory overflow!")

## Troubleshooting

### Common Issues

**1. Still getting "maximum_object_size" error:**
- Increase `MAX_OBJECT_SIZE` further (try 128MB or 256MB)
- Your objects might be even larger than 64MB
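
Rather than guessing a new limit, you can measure the largest record directly. The sketch below uses only the standard library and assumes the `.json.gz` file has been downloaded to a local path (`gzip.open` cannot read `s3://` URLs). `largest_record_bytes` and `suggested_limit` are hypothetical helpers, and the 2x headroom is an arbitrary safety factor.

```python
import gzip

MIB = 1024 * 1024


def largest_record_bytes(path):
    """Return the length in bytes of the longest line, newline included."""
    largest = 0
    with gzip.open(path, "rb") as handle:
        for line in handle:  # streams one decompressed line at a time
            largest = max(largest, len(line))
    return largest


def suggested_limit(largest, headroom=2):
    """Round largest * headroom up to a whole number of MiB."""
    needed = largest * headroom
    return (needed + MIB - 1) // MIB * MIB


# Using the lower bound reported in the error message as an example input:
print(suggested_limit(33_554_428))  # → 67108864
```

The measured value is only a guide: set `MAX_OBJECT_SIZE` comfortably above it.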

**2. Out of memory when loading:**
- Don't use `.df()` on large result sets
- Add WHERE clauses to filter first
- Use aggregations instead of loading raw data
- Process in chunks (see example above)

**3. Slow queries:**
- Gzipped JSONL is decompressed and re-parsed on every query, so it is inherently slower than columnar Parquet
- Consider converting to Parquet for repeated analysis
- But keep JSONL in staging for raw data preservation

**4. Can't access nested fields:**
- Check if nested data is stored as strings or native types
- Use `type(sample[col].iloc[0])` to check
- For strings: parse with `json.loads()` in pandas
- For STRUCT/LIST: use DuckDB's nested operators
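
The string-versus-native check above can be wrapped in a small normalizer so downstream pandas code does not need to care which form a column arrived in. This is a sketch; `as_nested` is a hypothetical helper name.

```python
import json


def as_nested(value):
    """Return dicts/lists unchanged and parse strings holding a JSON object or array."""
    if isinstance(value, (dict, list)):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return value  # ordinary string, not JSON
        # Only treat the result as nested data if it is a container;
        # a string such as "123" stays a string.
        return parsed if isinstance(parsed, (dict, list)) else value
    return value


print(as_nested('{"entries": [1, 2, 3]}')["entries"])  # → [1, 2, 3]
print(as_nested({"entries": [1]}))                     # → {'entries': [1]}
```

Apply it only to small, already-filtered samples, for example `as_nested(sample[col].iloc[0])`.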

### Memory Guidelines

**Rule of thumb:**
- Each 64MB JSON object → ~100-200MB in pandas DataFrame
- 10 records of 64MB each → 1-2GB memory needed
- Always filter/limit before calling `.df()`
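
The arithmetic behind the rule of thumb above can be written out as a quick estimator. The expansion factors are simply the 100MB and 200MB figures divided by 64MB; they are rough guidance from this notebook, not measurements, and `estimate_pandas_memory_mb` is a hypothetical helper.

```python
def estimate_pandas_memory_mb(n_records, object_mb, low_factor=100 / 64, high_factor=200 / 64):
    """Return a (low, high) estimate in MB for loading n_records into pandas."""
    raw_mb = n_records * object_mb
    return raw_mb * low_factor, raw_mb * high_factor


low, high = estimate_pandas_memory_mb(n_records=10, object_mb=64)
print(f"{low:.0f}-{high:.0f} MB")  # → 1000-2000 MB
```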

**Best practices:**
1. Filter with WHERE in SQL
2. Select only needed columns
3. Aggregate in SQL when possible
4. Process in chunks for large datasets
5. Don't call `.df()` on raw rows without a LIMIT or WHERE

## Next Steps

**If this works well:**
1. Use filtered queries to extract specific data you need
2. Consider creating processed views in DDS layer
3. Convert commonly-queried data to Parquet for speed

**If still having memory issues:**
1. Restructure contest_analyze data upstream (in scraper)
2. Save as multiple files instead of one large file
3. Flatten nested structures before saving
4. Use streaming processing (chunk by chunk)
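
The streaming option in the list above can be sketched with the standard library alone, which reads the file once instead of rescanning it for every `LIMIT`/`OFFSET` window. This assumes the `.json.gz` file has been downloaded to a local path; `iter_chunks` is a hypothetical helper.

```python
import gzip
import json
from itertools import islice


def iter_chunks(path, chunk_size):
    """Yield lists of up to chunk_size parsed records from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Lazily parse one line at a time, skipping blank lines.
        records = (json.loads(line) for line in handle if line.strip())
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                return
            yield chunk


# Usage (placeholder path):
# for chunk in iter_chunks("data.json.gz", chunk_size=10):
#     print(len(chunk))
```

Only one chunk is held in memory at a time, so peak usage is bounded by `chunk_size` times the largest record.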

**For production pipelines:**
```
# Keep staging as raw JSONL
Staging: contest_analyze/*.json.gz  (raw, nested)

# Process to DDS as flattened Parquet
DDS: contest_analyze_processed/*.parquet  (clean, flat, fast)

# Aggregate to DDM for analytics
DDM: contest_summaries/*.parquet  (aggregated, business-ready)
```

In [None]:
# Close connection
con.close()
print("✓ DuckDB connection closed")