# Test Loading JSONL Staging Data from S3

This notebook demonstrates how to:
- Load gzipped JSONL files saved by `DuckDBStagingProcessor`
- Query `contest_analyze` data with nested structures
- Inspect schemas and verify data integrity
- Work with memory-efficient queries

## Setup: Configure DuckDB with S3 Credentials

In [1]:
import duckdb
import os
import json
import pandas as pd

# Get Wasabi credentials from environment
wasabi_endpoint = os.getenv('WASABI_ENDPOINT', 's3.us-east-2.wasabisys.com')
wasabi_access_key = os.getenv('WASABI_ACCESS_KEY')
wasabi_secret_key = os.getenv('WASABI_SECRET_KEY')
bucket_name = os.getenv('WASABI_BUCKET_NAME')

# Create DuckDB connection
con = duckdb.connect()

# Configure S3 settings
con.execute(f"""
    SET s3_endpoint='{wasabi_endpoint}';
    SET s3_access_key_id='{wasabi_access_key}';
    SET s3_secret_access_key='{wasabi_secret_key}';
    SET s3_url_style='path';
""")

print("✓ DuckDB configured with S3 credentials")
print(f"Endpoint: {wasabi_endpoint}")
print(f"Bucket: {bucket_name}")

✓ DuckDB configured with S3 credentials
Endpoint: s3.us-east-2.wasabisys.com
Bucket: dfscrunch-data-lake
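
Interpolating credentials straight into the `SET` statements means a missing environment variable silently becomes the string `None`, and a quote character inside a secret would break the SQL. The sketch below guards against both. `s3_settings_sql` is a hypothetical helper written for this notebook, not part of DuckDB or `DuckDBStagingProcessor`, and the key values are placeholders.

```python
def s3_settings_sql(endpoint, access_key, secret_key, url_style="path"):
    """Build the SET statements used above, failing fast on missing values.

    Hypothetical helper for illustration only.
    """
    settings = {
        "s3_endpoint": endpoint,
        "s3_access_key_id": access_key,
        "s3_secret_access_key": secret_key,
        "s3_url_style": url_style,
    }
    # An unset environment variable arrives here as None
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing S3 settings: {', '.join(missing)}")

    statements = []
    for name, value in settings.items():
        escaped = str(value).replace("'", "''")  # escape single quotes for SQL
        statements.append(f"SET {name}='{escaped}';")
    return "\n".join(statements)


# Placeholder credentials, not real ones
example_sql = s3_settings_sql("s3.us-east-2.wasabisys.com", "demo-key", "demo'secret")
print(example_sql)
```

In the setup cell, the result could be passed along as `con.execute(s3_settings_sql(wasabi_endpoint, wasabi_access_key, wasabi_secret_key))`.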


## Configuration: Set your sport and date

Update these values to match the data you've saved:

In [3]:
# Update these to match your data
SPORT = "NFL"  # must match the S3 prefix exactly (S3 keys are case-sensitive)
DATE = "2025-10-02"  # Format: YYYY-MM-DD
GAME_TYPE = "dk_single_game"  # game-type folder name as written by the staging job

# Construct base path
base_path = f"s3://{bucket_name}/staging/{SPORT}/"

print(f"Base path: {base_path}")

Base path: s3://dfscrunch-data-lake/staging/NFL/
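
The later cells build each file path by hand from `base_path`, the dataset name, `GAME_TYPE`, and `DATE`. A minimal sketch of a single builder that also rejects malformed dates is shown here. `staging_path` is a hypothetical helper, and the layout it encodes is only the one visible in this notebook's paths.

```python
from datetime import datetime


def staging_path(bucket, sport, dataset, date, game_type=None):
    """Return the data.json.gz path for one partition (hypothetical helper)."""
    # Round-trip the date so values like "2025-10-2" are rejected too
    parsed = datetime.strptime(date, "%Y-%m-%d")
    if parsed.strftime("%Y-%m-%d") != date:
        raise ValueError(f"Date must be zero-padded YYYY-MM-DD, got {date!r}")

    parts = [f"s3://{bucket}", "staging", sport, dataset]
    if game_type is not None:
        parts.append(game_type)  # draft_groups has no game-type folder
    parts += [date, "data.json.gz"]
    return "/".join(parts)


print(staging_path("dfscrunch-data-lake", "NFL", "contest_analyze",
                   "2025-10-02", game_type="dk_single_game"))
```

Omitting `game_type` gives the shorter layout used for `draft_groups` in Example 4.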


## Example 1: Load and Inspect contest_analyze Data

This is the most important test because `contest_analyze` records contain deeply nested, map-like structures.

In [5]:
# Path to contest_analyze data
contest_analyze_path = f"{base_path}contest_analyze/{GAME_TYPE}/{DATE}/data.json.gz"

print(f"Loading from: {contest_analyze_path}")

# Load the data
result = con.execute(f"""
    SELECT * 
    FROM read_json('{contest_analyze_path}', 
                   format='newline_delimited',
                   compression='gzip',
                   maximum_object_size=67108864)
    LIMIT 5
""").df()

print(f"\nLoaded {len(result)} rows (showing first 5)")
result

Loading from: s3://dfscrunch-data-lake/staging/NFL/contest_analyze/dk_single_game/2025-10-02/data.json.gz

Loaded 1 rows (showing first 5)


Unnamed: 0,contest,players,users,salaries,flexUsage,cptUsage,cptBreakdown,favoriteUsage,homeUsage,exposures,teamStacks,gameStacks,cuts,load_ts
0,"{'contestId': 182808315, 'contestName': 'NFL S...","{'926:0': {'playerId': 926, 'firstName': 'Dava...","{'gfriedmann': {'userId': 'gfriedmann', 'total...",{'100': {'salaryCounts': {'49100': {'salaryCt'...,{'100': {'flexCounts': {'RB': {'flexUsageCt': ...,{'100': {'cptCounts': {'RB': {'cptUsageCt': 29...,"{'100': {'favorite': 56159, 'underdog': 31932,...",{'100': {'favoriteCounts': {'2:4': {'favoriteC...,{'100': {'homeCounts': {'2:4': {'homeCt': 1144...,{'100': {'exposureCounts': {'119535:1': {'expo...,{'100': {'teamStacksObject': {'119535:1:6957:0...,{'100': {'gameStacksObject': {'119535:33213:69...,"{'madeCutCounts': {'0': {'cutsCt': 88111, 'cut...",1759813350


In [None]:
# Check total record count
count = con.execute(f"""
    SELECT COUNT(*) as total_records
    FROM read_json('{contest_analyze_path}',
                   format='newline_delimited',
                   compression='gzip',
                   maximum_object_size=67108864)
""").df()

print(f"Total records in contest_analyze: {count['total_records'][0]}")
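
Every query in this notebook repeats the same `read_json` options. A small builder keeps them in one place, so a setting such as `maximum_object_size` (raised to 64 MiB in Example 1) cannot drift between cells. `read_json_source` is a hypothetical helper; the option names are the ones already used above.

```python
def read_json_source(path, maximum_object_size=67108864):
    """Return a read_json(...) table expression for a gzipped JSONL file.

    Hypothetical helper; 67108864 bytes (64 MiB) mirrors Example 1.
    """
    escaped_path = path.replace("'", "''")  # escape single quotes for SQL
    options = [
        f"'{escaped_path}'",
        "format='newline_delimited'",
        "compression='gzip'",
        f"maximum_object_size={int(maximum_object_size)}",
    ]
    return "read_json(" + ", ".join(options) + ")"


source = read_json_source("s3://my-bucket/staging/NFL/contests/data.json.gz")
print(f"SELECT COUNT(*) AS total_records FROM {source}")
```

With the connection from the setup cell, the count query would become `con.execute(f"SELECT COUNT(*) AS total_records FROM {read_json_source(contest_analyze_path)}")`.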

## Example 2: Inspect Schema

Use DuckDB's schema inspection to see column types and structure.

In [None]:
# Method 1: DESCRIBE - see column types
schema = con.execute(f"""
    DESCRIBE 
    SELECT * FROM read_json('{contest_analyze_path}',
                           format='newline_delimited',
                           compression='gzip',
                           maximum_object_size=67108864)
""").df()

print("Schema of contest_analyze data:")
print("=" * 60)
for _, row in schema.iterrows():
    print(f"{row['column_name']:30s} : {row['column_type']}")

In [None]:
# Method 2: SUMMARIZE - schema + statistics
summary = con.execute(f"""
    SUMMARIZE 
    SELECT * FROM read_json('{contest_analyze_path}',
                           format='newline_delimited',
                           compression='gzip',
                           maximum_object_size=67108864)
""").df()

print("Summary statistics:")
summary

## Example 3: Query Nested Fields

If your data has nested JSON structures, here's how to access them:

In [None]:
# First, let's see what one record looks like in detail
sample = con.execute(f"""
    SELECT *
    FROM read_json('{contest_analyze_path}',
                   format='newline_delimited',
                   compression='gzip',
                   maximum_object_size=67108864)
    LIMIT 1
""").df()

print("Sample record structure:")
print("=" * 60)
for col in sample.columns:
    value = sample[col].iloc[0]
    print(f"\n{col}:")
    print(f"  Type: {type(value)}")
    if isinstance(value, dict):
        print(f"  Keys: {list(value.keys())[:10]}...")  # Show first 10 keys
    elif isinstance(value, list):
        print(f"  Length: {len(value)}")
        if len(value) > 0:
            print(f"  First item: {value[0]}")
    else:
        print(f"  Value: {value}")

In [None]:
# Example: Access nested fields with JSON operators
# Adjust column names based on your actual schema

# If you have nested JSON columns, use ->> or -> operators:
# nested_query = con.execute(f"""
#     SELECT 
#         id,
#         nested_column->>'$.some_key' as extracted_value
#     FROM read_json('{contest_analyze_path}',
#                    format='newline_delimited',
#                    compression='gzip')
#     LIMIT 10
# """).df()
# nested_query

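print("For nested access, the syntax depends on the column type DuckDB inferred:")
print("  column->'$.key'      -- JSON column, returns JSON")
print("  column->>'$.key'     -- JSON column, returns VARCHAR")
print("  column.key           -- STRUCT column (column['key'] also works)")
print("  column['key']        -- MAP column")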
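
In the Example 1 output, columns such as `players` and `users` are objects keyed by an identifier (`'926:0'`, `'gfriedmann'`) rather than arrays, so each key behaves like a row. Once a record is in Python, such an object can be turned into rows as sketched below. `rows_from_keyed_object` is a hypothetical helper, and the field values are made up to match the shape of the `players` column.

```python
def rows_from_keyed_object(keyed, key_name="key"):
    """Turn {'id1': {...}, 'id2': {...}} into a list of flat row dicts."""
    rows = []
    for key, fields in keyed.items():
        row = {key_name: key}  # keep the original object key as a column
        row.update(fields)
        rows.append(row)
    return rows


# Made-up data shaped like the `players` column (not real values)
players = {
    "926:0": {"playerId": 926, "position": "RB"},
    "1001:0": {"playerId": 1001, "position": "WR"},
}

player_rows = rows_from_keyed_object(players, key_name="playerKey")
for row in player_rows:
    print(row)
```

Passing `player_rows` to `pd.DataFrame` would then give one row per player. This assumes the column arrives in pandas as a Python dict, as it does in the Example 1 output.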

## Example 4: Load Other Datasets

Test loading contests, events, draft_groups, and lineups data.

In [None]:
# Load contests data
contests_path = f"{base_path}contests/{GAME_TYPE}/{DATE}/data.json.gz"

contests = con.execute(f"""
    SELECT * 
    FROM read_json('{contests_path}',
                   format='newline_delimited',
                   compression='gzip')
    LIMIT 5
""").df()

print(f"Contests data ({len(contests)} rows shown):")
contests

In [None]:
# Load events data
events_path = f"{base_path}events/{GAME_TYPE}/{DATE}/data.json.gz"

events = con.execute(f"""
    SELECT * 
    FROM read_json('{events_path}',
                   format='newline_delimited',
                   compression='gzip')
    LIMIT 5
""").df()

print(f"Events data ({len(events)} rows shown):")
events

In [None]:
# Load draft_groups data
draft_groups_path = f"{base_path}draft_groups/{DATE}/data.json.gz"

draft_groups = con.execute(f"""
    SELECT * 
    FROM read_json('{draft_groups_path}',
                   format='newline_delimited',
                   compression='gzip')
    LIMIT 5
""").df()

print(f"Draft Groups data ({len(draft_groups)} rows shown):")
draft_groups

In [None]:
# Load lineups data
lineups_path = f"{base_path}lineups/{GAME_TYPE}/{DATE}/data.json.gz"

lineups = con.execute(f"""
    SELECT * 
    FROM read_json('{lineups_path}',
                   format='newline_delimited',
                   compression='gzip')
    LIMIT 5
""").df()

print(f"Lineups data ({len(lineups)} rows shown):")
lineups

## Example 5: Verify load_ts Timestamp

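Check that `load_ts` was added to every record. In the Example 1 output it is an integer (`1759813350`), which looks like a Unix epoch timestamp in seconds.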

In [None]:
# Check load_ts across all datasets
timestamps = con.execute(f"""
    SELECT 
        'contest_analyze' as dataset,
        MIN(load_ts) as min_timestamp,
        MAX(load_ts) as max_timestamp,
        COUNT(*) as record_count
    FROM read_json('{contest_analyze_path}',
                   format='newline_delimited',
                   compression='gzip',
                   maximum_object_size=67108864)
    
    UNION ALL
    
    SELECT 
        'contests' as dataset,
        MIN(load_ts) as min_timestamp,
        MAX(load_ts) as max_timestamp,
        COUNT(*) as record_count
    FROM read_json('{contests_path}',
                   format='newline_delimited',
                   compression='gzip')
""").df()

print("Timestamp verification:")
timestamps
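
The raw `MIN`/`MAX` values are hard to judge by eye. The sketch below converts a `load_ts` value to a readable UTC time and checks that the load did not happen before the partition date. It assumes `load_ts` is Unix epoch seconds, which the notebook does not state explicitly; both function names are hypothetical.

```python
from datetime import date, datetime, timezone


def load_ts_to_utc(load_ts):
    """Interpret load_ts as Unix epoch seconds and return an aware UTC datetime."""
    return datetime.fromtimestamp(int(load_ts), tz=timezone.utc)


def loaded_on_or_after(load_ts, partition_date):
    """True if the load happened on or after the partition's calendar date (UTC)."""
    return load_ts_to_utc(load_ts).date() >= date.fromisoformat(partition_date)


# Value taken from the Example 1 output
loaded_at = load_ts_to_utc(1759813350)
print(loaded_at.isoformat())  # 2025-10-07T05:02:30+00:00
print(loaded_on_or_after(1759813350, "2025-10-02"))  # True
```

A `load_ts` that decodes to a date far outside the expected range would suggest the value is in milliseconds or was written incorrectly.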

## Example 6: Query Across Multiple Dates

Use wildcards to query multiple date partitions at once.

In [None]:
# Query all contest_analyze data for this game type (all dates)
all_dates_path = f"{base_path}contest_analyze/{GAME_TYPE}/*/data.json.gz"

all_data_count = con.execute(f"""
    SELECT 
        COUNT(*) as total_records,
        COUNT(DISTINCT load_ts) as unique_load_batches
    FROM read_json('{all_dates_path}',
                   format='newline_delimited',
                   compression='gzip',
                   maximum_object_size=67108864)
""").df()

print(f"Querying all dates with wildcard: {all_dates_path}")
print(all_data_count)
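
With a wildcard, rows from every date arrive mixed together, and the partition date lives only in the file path. If the source path is available per row (DuckDB's `read_json` has a `filename` option for this, as far as its documentation describes), the date can be recovered from it. The sketch below shows the parsing step only; `partition_date` is a hypothetical helper that assumes the `.../<date>/data.json.gz` layout used in this notebook.

```python
import re

# Matches the date folder directly above data.json.gz
PARTITION_DATE = re.compile(r"/(\d{4}-\d{2}-\d{2})/data\.json\.gz$")


def partition_date(path):
    """Extract the YYYY-MM-DD partition folder from a staging file path."""
    match = PARTITION_DATE.search(path)
    return match.group(1) if match else None


example_path = ("s3://dfscrunch-data-lake/staging/NFL/contest_analyze/"
                "dk_single_game/2025-10-02/data.json.gz")
print(partition_date(example_path))
```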

## Example 7: Memory-Efficient Filtering

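DuckDB streams the file and materializes only the rows that pass your filters, so add `WHERE` clauses to keep memory low. A gzipped JSONL file still has to be read and decompressed as it is scanned, so filtering reduces memory use rather than the bytes transferred from S3.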

In [None]:
# Example: Filter before loading into memory
# Adjust column names based on your schema

# filtered = con.execute(f"""
#     SELECT *
#     FROM read_json('{contest_analyze_path}',
#                    format='newline_delimited',
#                    compression='gzip')
#     WHERE contest.contestId = 182808315  -- assumes contest was inferred as a STRUCT
#     LIMIT 100
# """).df()

print("Best practice: Add WHERE clauses to filter data BEFORE loading into pandas")
print("This keeps memory usage low, even with huge S3 files!")
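
The same principle can be seen in plain Python, outside DuckDB: read the gzipped file one line at a time and keep only the records that match, so memory grows with the number of matches rather than the file size. This is a sketch on a tiny local fixture with made-up records; `iter_matching_records` is a hypothetical helper.

```python
import gzip
import json
import os
import tempfile


def iter_matching_records(path, predicate):
    """Yield matching records from a gzipped JSONL file, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if predicate(record):
                yield record


# Build a tiny local fixture with made-up records (not real contest data)
fixture = [
    {"contest": {"contestId": 1}, "load_ts": 1759813350},
    {"contest": {"contestId": 2}, "load_ts": 1759813350},
]
fixture_path = os.path.join(tempfile.mkdtemp(), "data.json.gz")
with gzip.open(fixture_path, "wt", encoding="utf-8") as handle:
    for record in fixture:
        handle.write(json.dumps(record) + "\n")

matches = list(iter_matching_records(
    fixture_path, lambda r: r["contest"]["contestId"] == 2))
print(matches)
```

Every line is still decompressed and parsed; only the matching records are kept.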

## Example 8: Convert to Parquet for Faster Queries (Optional)

If you query the same data often, convert JSONL to Parquet for better performance.

In [None]:
# Optional: Convert JSONL to Parquet for faster subsequent queries
# parquet_path = f"{base_path}contest_analyze_parquet/{GAME_TYPE}/{DATE}/data.parquet"

# con.execute(f"""
#     COPY (
#         SELECT * 
#         FROM read_json('{contest_analyze_path}',
#                        format='newline_delimited',
#                        compression='gzip')
#     )
#     TO '{parquet_path}'
#     (FORMAT PARQUET, COMPRESSION 'SNAPPY')
# """)

# print(f"Converted to Parquet: {parquet_path}")
# print("Parquet is columnar, so repeated queries usually run much faster than on gzipped JSONL")

print("Parquet conversion commented out - uncomment to use")

## Troubleshooting

Common issues and solutions:

In [None]:
print("""
Common Issues & Solutions:
========================

1. "No files found" error:
   - Check SPORT, DATE, GAME_TYPE variables
   - Verify bucket_name is correct
   - Check S3 credentials in .env

2. "Out of memory" when loading:
   - Add WHERE clause to filter first
   - Use LIMIT to test with small samples
   - Don't call .df() on huge results

3. Nested JSON showing as strings:
   - This can happen when nested dicts were serialized to JSON strings before the file was written
   - Use DuckDB's JSON operators to extract:
     - column->>'$.key' for string values
     - column->'$.key' for JSON values

4. Slow queries:
   - JSONL is slower than Parquet
   - Consider converting to Parquet (see Example 8)
   - For repeated filters, load the data into a DuckDB table first (CREATE TABLE ... AS SELECT ...); files read via read_json cannot be indexed

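5. Schema detection issues:
   - DuckDB infers the schema from a sample of records (sample_size, 20480 by default)
   - If types are inconsistent, raise sample_size (-1 scans the whole file) or specify the schema manually:
     read_json(..., columns={'col1': 'VARCHAR', 'col2': 'INTEGER'})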
""")

## Summary

✅ **Successfully demonstrated:**
1. Loading gzipped JSONL files from S3
2. Inspecting schemas and data structure
3. Querying nested/unstructured data
4. Working with multiple datasets
5. Memory-efficient filtering
6. Querying across date partitions

**Key takeaways:**
- DuckDB reads gzipped JSONL files natively ✓
- No need to load everything into memory ✓
- Use WHERE clauses for memory efficiency ✓
- Nested data preserved and queryable ✓
- Same S3 path structure as before ✓

**Your staging pipeline is working correctly!** 🎉

In [None]:
# Close connection
con.close()
print("✓ DuckDB connection closed")