# 00 - Setup and Validation

**Purpose**: Verify environment setup, establish DuckDB connection, and validate data access.

**Outputs**: 
- Confirmed working environment
- DuckDB connection established
- Basic dataset information (row count, columns)

**For Google Colab**: Uncomment the Drive mounting section below.

## 1. Environment Check

In [1]:
# Check Python version and installed packages
import sys
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Import key libraries
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("\n✓ All required libraries imported successfully")
print(f"  - DuckDB version: {duckdb.__version__}")
print(f"  - Pandas version: {pd.__version__}")
print(f"  - NumPy version: {np.__version__}")

Python version: 3.13.2 (tags/v3.13.2:4f8bb39, Feb  4 2025, 15:23:48) [MSC v.1942 64 bit (AMD64)]
Python executable: c:\Users\Danny\AppData\Local\Programs\Python\Python313\python.exe

✓ All required libraries imported successfully
  - DuckDB version: 1.4.1
  - Pandas version: 2.3.3
  - NumPy version: 2.3.4
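
The import cell above stops at the first missing package. As a minimal sketch, assuming the same five packages are required, the check below reports everything that is missing in one pass. `REQUIRED` and `find_missing` are illustrative names, not part of the project.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed to match the imports above).
REQUIRED = ["duckdb", "pandas", "numpy", "matplotlib", "seaborn"]

def find_missing(packages):
    """Return the subset of `packages` that is not installed."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed")
```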


## 2. Configure File Paths

### For LOCAL development:

In [2]:
# Local paths (default)
import os

# Get project root (one level up from notebooks/)
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Use Parquet if available (faster), fallback to CSV
DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.parquet')
if not os.path.exists(DATA_PATH):
    DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')
    print("Note: Using battles.csv (Parquet not found. Run 'python convert_to_parquet.py' for faster queries)")

ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')

# Add src/ to Python path for importing custom modules
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

print(f"Project root: {PROJECT_ROOT}")
print(f"Data path: {DATA_PATH}")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

Project root: c:\Users\Danny\Documents\GitHub\HeHeHaHa_DataRoyale
Data path: c:\Users\Danny\Documents\GitHub\HeHeHaHa_DataRoyale\battles.parquet
Artifacts dir: c:\Users\Danny\Documents\GitHub\HeHeHaHa_DataRoyale\artifacts
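
The cell above falls back to `battles.csv` without checking that the CSV exists; that check only happens in section 3. A hedged alternative, sketched with `pathlib`, resolves the path and fails immediately when neither file is present. `resolve_data_path` is a hypothetical helper, not something the project defines.

```python
from pathlib import Path

def resolve_data_path(project_root, stem="battles"):
    """Return the Parquet file if present, else the CSV; raise if neither exists."""
    root = Path(project_root)
    # Parquet is tried first because it is the faster format to query.
    for suffix in (".parquet", ".csv"):
        candidate = root / f"{stem}{suffix}"
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"No {stem}.parquet or {stem}.csv under {root}")

# Usage (assuming PROJECT_ROOT is defined as in the cell above):
# DATA_PATH = str(resolve_data_path(PROJECT_ROOT))
```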


### When using Google Colab:

In [3]:
# # Mount Google Drive
# import os  # needed below if the local-paths cell was skipped
# from google.colab import drive
# drive.mount('/content/drive')

# # Set paths (UPDATE THESE to match your Drive folder structure)
# PROJECT_ROOT = '/content/drive/MyDrive/DataRoyale'
# # Use Parquet if available (faster), fallback to CSV
# DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.parquet')
# if not os.path.exists(DATA_PATH):
#     DATA_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')
# ARTIFACTS_DIR = os.path.join(PROJECT_ROOT, 'artifacts')

# # Add src/ to path
# sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

# print(f"✓ Google Drive mounted")
# print(f"Project root: {PROJECT_ROOT}")
# print(f"Data path: {DATA_PATH}")
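
Instead of commenting and uncommenting cells by hand, the environment can be detected at runtime. The sketch below assumes that an importable `google.colab` package means the notebook is running on Colab. `in_colab` is an illustrative name.

```python
import importlib.util

def in_colab():
    """Return True when the google.colab package is importable (assumed to mean Colab)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent package `google` is not installed at all.
        return False

# Usage: pick the Colab or local branch without editing the notebook.
# if in_colab():
#     from google.colab import drive
#     drive.mount('/content/drive')
print(f"Running on Colab: {in_colab()}")
```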

## 3. Verify File Access

In [4]:
# Check if dataset exists
if os.path.exists(DATA_PATH):
    file_size_gb = os.path.getsize(DATA_PATH) / (1024**3)
    file_type = "Parquet" if DATA_PATH.endswith('.parquet') else "CSV"
    print(f"✓ Dataset found ({file_type})!")
    print(f"  File: {os.path.basename(DATA_PATH)}")
    print(f"  File size: {file_size_gb:.2f} GB")
    if file_type == "CSV":
        print(f"  💡 Tip: Convert to Parquet:")
        print(f"     python convert_to_parquet.py")
else:
    print(f"❌ Dataset NOT FOUND at {DATA_PATH}")
    print(f"\nPlease ensure battles.parquet or battles.csv is in the correct location:")
    print(f"  - Local: Place in project root")
    print(f"  - Colab: Upload to Google Drive and update DATA_PATH above")

✓ Dataset found (Parquet)!
  File: battles.parquet
  File size: 2.10 GB


## 4. Create DuckDB Connection

In [5]:
# Import custom utility functions
from duckdb_utils import get_connection, create_battles_view, get_schema

# Create in-memory DuckDB connection
con = get_connection()

print("✓ DuckDB connection created")

✓ DuckDB connection created


## 5. Create Battles View

This creates a **view** (not a table), so DuckDB scans the dataset file (Parquet or CSV) at query time instead of loading it all into memory up front.

In [6]:
# Create view over dataset
create_battles_view(con, DATA_PATH, view_name='battles', sample_size=-1)

print("\n✓ View 'battles' created successfully")
print("  You can now query with: con.sql('SELECT * FROM battles LIMIT 10').df()")

✓ Created view 'battles' from Parquet: c:\Users\Danny\Documents\GitHub\HeHeHaHa_DataRoyale\battles.parquet

✓ View 'battles' created successfully
  You can now query with: con.sql('SELECT * FROM battles LIMIT 10').df()


## 6. Basic Data Validation

In [7]:
# Get row count (COUNT(*) is exact, not approximate)
row_count = con.sql("SELECT COUNT(*) as count FROM battles").df()['count'][0]

print(f"Total battles: {row_count:,}")

Total battles: 16,795,958


In [8]:
# Get schema (column names and types)
schema = get_schema(con, 'battles')

# Display schema
schema

Schema for 'battles':
  74 columns


| | column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|---|
| 0 | column00 | BIGINT | YES | | | |
| 1 | battleTime | TIMESTAMP WITH TIME ZONE | YES | | | |
| 2 | arena.id | DOUBLE | YES | | | |
| 3 | gameMode.id | DOUBLE | YES | | | |
| 4 | average.startingTrophies | DOUBLE | YES | | | |
| ... | ... | ... | ... | ... | ... | ... |
| 69 | loser.common.count | BIGINT | YES | | | |
| 70 | loser.rare.count | BIGINT | YES | | | |
| 71 | loser.epic.count | BIGINT | YES | | | |
| 72 | loser.legendary.count | BIGINT | YES | | | |


In [9]:
# Preview first 10 rows
preview = con.sql("SELECT * FROM battles LIMIT 10").df()

print(f"Preview of first 10 battles:")
preview

Preview of first 10 battles:


| | column00 | battleTime | arena.id | gameMode.id | average.startingTrophies | winner.tag | winner.startingTrophies | winner.trophyChange | winner.crowns | winner.kingTowerHitPoints | ... | loser.cards.list | loser.totalcard.level | loser.troop.count | loser.structure.count | loser.spell.count | loser.common.count | loser.rare.count | loser.epic.count | loser.legendary.count | loser.elixir.average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-12-06 23:00:00-08:00 | 54000049.0 | 72000201.0 | 6590.0 | #28RR8PJP0 | 6581.0 | 31.0 | 2.0 | 4768.0 | ... | [26000000, 26000026, 26000030, 26000041, 27000... | 104 | 4 | 1 | 3 | 4 | 1 | 1 | 2 | 3.125 |
| 1 | 1 | 2020-12-06 23:00:00-08:00 | 54000049.0 | 72000201.0 | 5582.5 | #YV9VQUVP | 5592.0 | 28.0 | 3.0 | 2014.0 | ... | [26000000, 26000003, 26000007, 26000011, 26000... | 104 | 6 | 0 | 2 | 2 | 3 | 3 | 0 | 4.125 |
| 2 | 2 | 2020-12-06 23:00:02-08:00 | 54000049.0 | 72000201.0 | 5684.0 | #LPR2G0Q9L | 5678.0 | 31.0 | 3.0 | 5304.0 | ... | [26000011, 26000026, 26000030, 26000041, 27000... | 103 | 4 | 1 | 3 | 3 | 2 | 2 | 1 | 2.875 |
| 3 | 3 | 2020-12-06 23:00:03-08:00 | 54000049.0 | 72000201.0 | 6031.0 | #2GL899VCJ | 6035.0 | 29.0 | 2.0 | 3368.0 | ... | [26000032, 26000040, 26000041, 26000049, 26000... | 104 | 6 | 1 | 1 | 3 | 2 | 1 | 2 | 3.375 |
| 4 | 4 | 2020-12-06 23:00:06-08:00 | 54000049.0 | 72000201.0 | 5140.0 | #9Y2YJPGG2 | 5140.0 | 30.0 | 3.0 | 1507.0 | ... | [26000012, 26000024, 26000045, 26000056, 26000... | 93 | 5 | 1 | 2 | 3 | 1 | 4 | 0 | 3.875 |
| 5 | 5 | 2020-12-06 23:00:06-08:00 | 54000031.0 | 72000201.0 | 7036.0 | #GVGYG89Y | 7026.0 | 31.0 | 1.0 | 5832.0 | ... | [26000008, 26000029, 26000032, 26000039, 26000... | 104 | 6 | 1 | 1 | 3 | 3 | 0 | 2 | 4.125 |
| 6 | 6 | 2020-12-06 23:00:08-08:00 | 54000049.0 | 72000201.0 | 6117.0 | #Y200Y28 | 6138.0 | 26.0 | 1.0 | 5832.0 | ... | [26000004, 26000005, 26000036, 26000042, 26000... | 104 | 6 | 0 | 2 | 2 | 1 | 2 | 3 | 3.75 |
| 7 | 7 | 2020-12-06 23:00:10-08:00 | 54000049.0 | 72000201.0 | 6706.5 | #288200CPC | 6727.0 | 26.0 | 2.0 | 4536.0 | ... | [26000003, 26000012, 26000033, 26000039, 26000... | 104 | 5 | 0 | 3 | 2 | 2 | 2 | 2 | 4.0 |
| 8 | 8 | 2020-12-06 23:00:12-08:00 | 54000049.0 | 72000201.0 | 5304.5 | #9U002GJU2 | 5322.0 | 26.0 | 3.0 | 4414.0 | ... | [26000011, 26000012, 26000017, 26000021, 26000... | 99 | 5 | 0 | 3 | 1 | 3 | 4 | 0 | 3.625 |
| 9 | 9 | 2020-12-06 23:00:15-08:00 | 54000049.0 | 72000201.0 | 5299.5 | #P9R8QLV0 | 5300.0 | 29.0 | 2.0 | 4091.0 | ... | [26000007, 26000012, 26000021, 26000037, 26000... | 96 | 6 | 0 | 2 | 1 | 2 | 3 | 2 | 4.375 |


## 7. Test Queries

Verify we can run queries efficiently.

In [10]:
# Test query: Count battles by arena
arena_counts = con.sql("""
    SELECT 
        "arena.id" as arena,
        COUNT(*) as battle_count
    FROM battles
    GROUP BY "arena.id"
    ORDER BY battle_count DESC
    LIMIT 10
""").df()

print("Top 10 arenas by battle count:")
arena_counts

Top 10 arenas by battle count:


| | arena | battle_count |
|---|---|---|
| 0 | 54000050.0 | 15828386 |
| 1 | 54000011.0 | 391043 |
| 2 | 54000006.0 | 100926 |
| 3 | 54000024.0 | 96298 |
| 4 | 54000004.0 | 64890 |
| 5 | 54000010.0 | 53590 |
| 6 | 54000007.0 | 48457 |
| 7 | 54000009.0 | 46623 |
| 8 | 54000003.0 | 41739 |
| 9 | 54000008.0 | 37646 |
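
Column names in this dataset contain dots (`arena.id`), so the query wraps them in double quotes; unquoted, DuckDB would parse the name as a qualified `table.column` reference. A small sketch of a quoting helper for building such queries programmatically follows. `quote_ident` is an illustrative name, not a project function.

```python
def quote_ident(name):
    """Wrap a column name in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'

column = "arena.id"
query = (
    f"SELECT {quote_ident(column)} AS arena, COUNT(*) AS battle_count "
    "FROM battles GROUP BY 1 ORDER BY battle_count DESC LIMIT 10"
)
print(query)
```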


In [11]:
# Test query: Average trophy count
avg_trophies = con.sql("""
    SELECT 
        AVG("average.startingTrophies") as avg_trophies,
        MIN("average.startingTrophies") as min_trophies,
        MAX("average.startingTrophies") as max_trophies
    FROM battles
""").df()

print("Trophy statistics:")
avg_trophies

Trophy statistics:


| | avg_trophies | min_trophies | max_trophies |
|---|---|---|---|
| 0 | 4596.092182 | 13.5 | 8220.0 |


## 8. Save Connection for Next Notebooks

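**Note**: The DuckDB connection is in-memory and is not persisted, so each later notebook must recreate the connection and view. Copy the setup code from sections 2, 4, and 5 (file paths, connection, view) into those notebooks.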

In [12]:
print("\n" + "="*60)
print("SETUP COMPLETE!")
print("="*60)
print("\nYour environment is ready for analysis.")
print("\nQuick reference:")
print("  - DuckDB connection: con")
print("  - View name: battles")
print("  - Query syntax: con.sql('SELECT...').df()")
print(f"  - Total rows: {row_count:,}")
print(f"  - Total columns: {len(schema)}")
print("\nNext steps:")
print("  1. Open 01-data-profiling.ipynb to explore data quality")
print("  2. Or start exploratory analysis in notebooks 02-04")


SETUP COMPLETE!

Your environment is ready for analysis.

Quick reference:
  - DuckDB connection: con
  - View name: battles
  - Query syntax: con.sql('SELECT...').df()
  - Total rows: 16,795,958
  - Total columns: 74

Next steps:
  1. Open 01-data-profiling.ipynb to explore data quality
  2. Or start exploratory analysis in notebooks 02-04
