# 00 - Setup and Validation

**Purpose**: Verify environment setup, establish DuckDB connection, and validate data access.

**Outputs**: 
- Confirmed working environment
- DuckDB connection established
- Basic dataset information (row count, columns)

**For Google Colab**: Uncomment the Drive-mounting cell in section 2 and update its paths to match your Drive folder.

## 1. Environment Check

In [None]:
# Check Python version and installed packages
import sys
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Import key libraries
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("\n✓ All required libraries imported successfully")
print(f"  - DuckDB version: {duckdb.__version__}")
print(f"  - Pandas version: {pd.__version__}")
print(f"  - NumPy version: {np.__version__}")

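The imports above stop at the first library that is missing. As a minimal sketch, the check below reports every missing package at once instead. The helper name `find_missing_packages` and the `REQUIRED` list are illustrative assumptions, not part of the project code.

```python
import importlib.util


# Hypothetical helper: the name and the package list are illustrative.
def find_missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


REQUIRED = ["duckdb", "pandas", "numpy", "matplotlib", "seaborn"]
missing = find_missing_packages(REQUIRED)

if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages are installed")
```

`find_spec` only locates a module without importing it, so this check stays fast even for heavy libraries.
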
## 2. Configure File Paths

### For LOCAL development:

In [None]:
# Local paths (default)
import os

# Get project root (one level up; assumes the working directory is notebooks/)
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
CSV_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')
ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')

# Add src/ to Python path for importing custom modules
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

print(f"Project root: {PROJECT_ROOT}")
print(f"CSV path: {CSV_PATH}")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

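The local path setup assumes the notebook is launched from `notebooks/`. If that may not hold, one possible alternative is to walk upward until a marker file is found. This is a sketch only: `find_project_root` is a hypothetical helper, and using `battles.csv` as the marker is an assumption.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="battles.csv"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None  # marker not found anywhere above `start`


# Demo on a throwaway layout: <tmp>/battles.csv and <tmp>/notebooks/
demo_root = Path(tempfile.mkdtemp()).resolve()
(demo_root / "battles.csv").write_text("arena.id\n1\n")
(demo_root / "notebooks").mkdir()

print(find_project_root(demo_root / "notebooks") == demo_root)
```

In the notebook, `find_project_root(os.getcwd())` could replace the hard-coded `'..'`, with a fallback when it returns `None`.
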
### For GOOGLE COLAB (uncomment if using Colab):

In [None]:
# # Mount Google Drive
# import os
# from google.colab import drive
# drive.mount('/content/drive')

# # Set paths (UPDATE THESE to match your Drive folder structure)
# PROJECT_ROOT = '/content/drive/MyDrive/DataRoyale'
# CSV_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')
# ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')

# # Add src/ to path
# sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

# print(f"✓ Google Drive mounted")
# print(f"Project root: {PROJECT_ROOT}")
# print(f"CSV path: {CSV_PATH}")

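Instead of commenting and uncommenting cells, the environment could be detected at runtime. The sketch below uses a common heuristic (checking whether `google.colab` is already loaded). The helper names are hypothetical, and the Drive path is the placeholder from the cell above, so it still needs updating. Mounting the Drive remains a separate step.

```python
import sys


def running_in_colab():
    """True when the google.colab module is already loaded in this session."""
    return "google.colab" in sys.modules


def choose_project_root(local_root, colab_root="/content/drive/MyDrive/DataRoyale"):
    """Pick the Drive path on Colab and the local path everywhere else."""
    return colab_root if running_in_colab() else local_root


print(choose_project_root(".."))
```
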
## 3. Verify File Access

In [None]:
# Check if battles.csv exists
if os.path.exists(CSV_PATH):
    file_size_gb = os.path.getsize(CSV_PATH) / (1024**3)
    print("✓ battles.csv found!")
    print(f"  File size: {file_size_gb:.2f} GB")
else:
    print(f"❌ battles.csv NOT FOUND at {CSV_PATH}")
    print("\nPlease ensure battles.csv is in the correct location:")
    print("  - Local: Place in project root")
    print("  - Colab: Upload to Google Drive and update CSV_PATH above")

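An existence check does not confirm that the file is readable or has a header. As a rough sketch, the first line can be read with the standard `csv` module without touching the rest of the file. `peek_csv_header` is a hypothetical helper, and UTF-8 encoding is an assumption about `battles.csv`.

```python
import csv
import os
import tempfile


def peek_csv_header(path, encoding="utf-8"):
    """Return the column names from the first line of a CSV file."""
    with open(path, newline="", encoding=encoding) as f:
        return next(csv.reader(f), [])  # [] for an empty file


# Demo on a small temporary file instead of battles.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    f.write("arena.id,average.startingTrophies\n1,4000\n")

header = peek_csv_header(demo_path)
print(f"{len(header)} columns: {header}")
```
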
## 4. Create DuckDB Connection

In [None]:
# Import custom utility functions
from duckdb_utils import get_connection, create_battles_view, get_schema

# Create in-memory DuckDB connection
con = get_connection()

print("✓ DuckDB connection created")

## 5. Create Battles View

This creates a **view** (not a table): DuckDB reads from the CSV each time a query runs instead of loading the whole file into memory up front. The trade-off is that every query rescans the file.

In [None]:
# Create view over battles.csv
create_battles_view(con, CSV_PATH, view_name='battles', sample_size=-1)

print("\n✓ View 'battles' created successfully")
print("  You can now query with: con.sql('SELECT * FROM battles LIMIT 10').df()")

## 6. Basic Data Validation

In [None]:
# Get exact row count (COUNT(*) over a CSV view scans the whole file, so this can take a while)
row_count = con.sql("SELECT COUNT(*) as count FROM battles").df()['count'][0]

print(f"Total battles: {row_count:,}")

In [None]:
# Get schema (column names and types)
schema = get_schema(con, 'battles')

# Display schema
schema

In [None]:
# Preview first 10 rows
preview = con.sql("SELECT * FROM battles LIMIT 10").df()

print("Preview of first 10 battles:")
preview

## 7. Test Queries

Run two aggregate queries to confirm that grouping and column statistics work against the view.

In [None]:
# Test query: Count battles by arena
arena_counts = con.sql("""
    SELECT 
        "arena.id" as arena,
        COUNT(*) as battle_count
    FROM battles
    GROUP BY "arena.id"
    ORDER BY battle_count DESC
    LIMIT 10
""").df()

print("Top 10 arenas by battle count:")
arena_counts

In [None]:
# Test query: Trophy statistics (average, minimum, maximum)
avg_trophies = con.sql("""
    SELECT 
        AVG("average.startingTrophies") as avg_trophies,
        MIN("average.startingTrophies") as min_trophies,
        MAX("average.startingTrophies") as max_trophies
    FROM battles
""").df()

print("Trophy statistics:")
avg_trophies

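The test queries above confirm that results come back, but they do not measure how long each query takes. A minimal sketch for timing any call follows. `timed_call` is a hypothetical helper, shown here on a plain Python function so it runs without the dataset.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


total, seconds = timed_call(sum, range(1_000_000))
print(f"sum finished in {seconds:.4f} s")
```

In this notebook, a query could be wrapped the same way, for example `timed_call(lambda: con.sql("SELECT COUNT(*) FROM battles").df())`.
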
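## 8. Reuse the Setup in Later Notebooks

**Note**: The in-memory connection is not saved or shared between notebooks, so each notebook must recreate the connection and view. Copy the setup code from sections 2, 4, and 5 (paths, connection, view) into later notebooks.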

In [None]:
print("\n" + "="*60)
print("SETUP COMPLETE!")
print("="*60)
print("\nYour environment is ready for analysis.")
print("\nQuick reference:")
print("  - DuckDB connection: con")
print("  - View name: battles")
print("  - Query syntax: con.sql('SELECT...').df()")
print(f"  - Total rows: {row_count:,}")
print(f"  - Total columns: {len(schema)}")
print("\nNext steps:")
print("  1. Open 01-data-profiling.ipynb to explore data quality")
print("  2. Or start exploratory analysis in notebooks 02-04")