# 01 - Data Profiling

**Purpose**: Assess data quality by identifying missing values, unexpected data types, outliers, and other issues that could affect later analysis.

**Outputs**:
- Null value analysis
- Data type validation
- Outlier detection
- Data quality report

**Key Questions**:
- What percentage of values are missing in each column?
- Are there any unexpected data types or values?
- Do we have sufficient data for all arenas/game modes?
- Are there data quality issues to address before analysis?

## Setup

In [None]:
# Standard imports
import sys
import os
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure paths
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
CSV_PATH = os.path.join(PROJECT_ROOT, 'battles.csv')
sys.path.insert(0, os.path.join(PROJECT_ROOT, 'src'))

# Import custom utilities
from duckdb_utils import (
    get_connection, create_battles_view, get_schema, 
    get_null_counts, query_to_df
)

# Create connection and view
con = get_connection()
create_battles_view(con, CSV_PATH)

print("✓ Setup complete")

## 1. Schema Overview

In [None]:
# Get full schema
schema = get_schema(con, 'battles')
print(f"Total columns: {len(schema)}")
schema
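
The "Data type validation" output listed at the top can be sketched as a comparison between the types DuckDB inferred and the types you expect. This is only a minimal sketch and not part of `duckdb_utils`: the `find_type_mismatches` helper, the expected-type mapping, and the idea that the schema can be reduced to a column-name-to-type dictionary are all assumptions made for illustration.

```python
def find_type_mismatches(actual_types, expected_types):
    """Return {column: (expected, actual)} for columns whose type differs.

    Columns missing from actual_types are reported with actual=None.
    """
    mismatches = {}
    for column, expected in expected_types.items():
        actual = actual_types.get(column)
        # Compare case-insensitively so 'bigint' and 'BIGINT' count as equal
        if actual is None or actual.upper() != expected.upper():
            mismatches[column] = (expected, actual)
    return mismatches


# Hypothetical values; replace them with types taken from the real schema
actual = {"arena.id": "BIGINT", "average.startingTrophies": "VARCHAR"}
expected = {
    "arena.id": "BIGINT",
    "average.startingTrophies": "DOUBLE",
    "gameMode.id": "BIGINT",
}
type_issues = find_type_mismatches(actual, expected)
print(type_issues)
```

Any column that appears in `type_issues` is a candidate for an explicit cast or a closer look at the raw CSV values.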

## 2. Missing Values Analysis

In [None]:
# Get null counts for all columns
null_analysis = get_null_counts(con, 'battles')

# Show columns with missing data, sorted so that head(20) returns the worst offenders
columns_with_nulls = (
    null_analysis[null_analysis['null_percentage'] > 0]
    .sort_values('null_percentage', ascending=False)
)

print(f"Columns with missing values: {len(columns_with_nulls)} / {len(null_analysis)}")
print("\nTop 20 columns by missing percentage:")
columns_with_nulls.head(20)
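
`get_null_counts` is a project utility whose implementation is not shown in this notebook. As a rough illustration of what such a table contains, the pandas sketch below computes per-column missing counts and percentages on a tiny made-up frame. The `null_summary` function and the `null_count` column name are assumptions; only `column_name` and `null_percentage` are taken from the cells above.

```python
import pandas as pd


def null_summary(df):
    """Summarize missing values per column, sorted by missing percentage (descending)."""
    null_count = df.isna().sum()
    summary = pd.DataFrame({
        "column_name": null_count.index,
        "null_count": null_count.values,
        "null_percentage": (null_count.values / len(df) * 100).round(2),
    })
    return summary.sort_values("null_percentage", ascending=False).reset_index(drop=True)


# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, 2, 3, 4],
})
result = null_summary(sample)
print(result)
```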

In [None]:
# Visualize missing data
plt.figure(figsize=(12, 8))
top_missing = columns_with_nulls.head(20)
sns.barplot(data=top_missing, x='null_percentage', y='column_name')
plt.title('Top 20 Columns by Missing Data Percentage', fontsize=14)
plt.xlabel('Missing %')
plt.ylabel('Column')
plt.tight_layout()
plt.show()

## 3. Data Distribution Checks

### 3.1 Trophy Distribution

In [None]:
# Get trophy statistics
trophy_stats = query_to_df(con, """
    SELECT 
        MIN("average.startingTrophies") as min_trophies,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "average.startingTrophies") as q1,
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY "average.startingTrophies") as median,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "average.startingTrophies") as q3,
        MAX("average.startingTrophies") as max_trophies,
        AVG("average.startingTrophies") as mean_trophies,
        STDDEV("average.startingTrophies") as std_trophies
    FROM battles
""")

print("Trophy Distribution Statistics:")
trophy_stats
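
The quartiles returned above can be turned into data-driven outlier bounds using the conventional interquartile-range (IQR) rule. The sketch below is one possible approach, not a method prescribed by this project: `iqr_fences` is a hypothetical helper, the multiplier `k=1.5` is the common default, and the quartile values are made up.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up quartiles; in the notebook they would come from trophy_stats
lower, upper = iqr_fences(4000.0, 5000.0)
print(lower, upper)
```

With real data, the arguments could be read from the single row of `trophy_stats`, for example `trophy_stats.loc[0, 'q1']` and `trophy_stats.loc[0, 'q3']`.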

### 3.2 Arena Distribution

In [None]:
# Count battles by arena
arena_dist = query_to_df(con, """
    SELECT 
        "arena.id" as arena,
        COUNT(*) as battle_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM battles
    WHERE "arena.id" IS NOT NULL
    GROUP BY "arena.id"
    ORDER BY arena
""")

print("Battle distribution by arena:")
arena_dist

### 3.3 Game Mode Distribution

In [None]:
# Count battles by game mode
gamemode_dist = query_to_df(con, """
    SELECT 
        "gameMode.id" as game_mode,
        COUNT(*) as battle_count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM battles
    WHERE "gameMode.id" IS NOT NULL
    GROUP BY "gameMode.id"
    ORDER BY battle_count DESC
""")

print("Battle distribution by game mode:")
gamemode_dist
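
To answer the key question of whether every arena or game mode has sufficient data, the distribution tables can be filtered for groups below a minimum battle count. This is a minimal sketch: `flag_sparse_groups` is a hypothetical helper, the threshold of 1000 battles is an arbitrary placeholder, and the sample frame is made up.

```python
import pandas as pd


def flag_sparse_groups(dist, min_battles=1000):
    """Return the rows of a distribution table whose battle_count is below min_battles."""
    return dist[dist["battle_count"] < min_battles].reset_index(drop=True)


# Made-up distribution, shaped like gamemode_dist
sample_dist = pd.DataFrame({
    "game_mode": [101, 102, 103],
    "battle_count": [50000, 800, 12],
})
sparse = flag_sparse_groups(sample_dist, min_battles=1000)
print(sparse)
```

The same call works for `arena_dist`, since both tables share the `battle_count` column. Choose the threshold based on the sample size your later analyses need.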

## 4. Outlier Detection

### 4.1 Extreme Trophy Values

In [None]:
# Check for extreme trophy values
extreme_trophies = query_to_df(con, """
    SELECT 
        "winner.startingTrophies",
        "loser.startingTrophies",
        "average.startingTrophies"
    FROM battles
    WHERE "average.startingTrophies" > 10000 OR "average.startingTrophies" < 0
    LIMIT 100
""")

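print(f"Battles with extreme trophy values (sample capped at 100 by LIMIT): {len(extreme_trophies)}")
if len(extreme_trophies) > 0:
    # An expression inside an `if` block is not rendered automatically, so print it
    print(extreme_trophies.head(10))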

### 4.2 Invalid Card Levels

In [None]:
# Check for invalid card levels (should be 1-14 typically)
# TODO: Add query to check card level ranges
# Example:
# invalid_levels = query_to_df(con, """
#     SELECT ... WHERE "winner.card1.level" < 1 OR "winner.card1.level" > 15
# """)

print("TODO: Add card level validation")
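
One way to fill in the TODO above is to generate the range check for every card-level column rather than writing each condition by hand. The sketch below only builds the SQL string. It assumes the columns follow the `"winner.card1.level"` pattern from the comment above, that each deck has 8 cards, and that matching `loser.` columns exist; none of this is confirmed by this notebook. The `build_card_level_check` helper is hypothetical, and `max_level` is left as a parameter because the valid upper bound may differ for your dataset.

```python
def build_card_level_check(sides=("winner", "loser"), cards_per_deck=8,
                           min_level=1, max_level=14):
    """Build a SQL query counting rows with any card level outside [min_level, max_level]."""
    conditions = []
    for side in sides:
        for i in range(1, cards_per_deck + 1):
            # Column naming is an assumption based on "winner.card1.level"
            column = f'"{side}.card{i}.level"'
            conditions.append(f"{column} < {min_level} OR {column} > {max_level}")
    where_clause = "\n   OR ".join(conditions)
    return f"SELECT COUNT(*) AS invalid_rows\nFROM battles\nWHERE {where_clause}"


sql = build_card_level_check()
print(sql)
```

If the assumed columns exist, the result could be passed to the existing helper as `query_to_df(con, sql)`. Rows where a level is NULL are not counted by this check, because comparisons with NULL do not evaluate to true.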

## 5. Data Quality Summary

**Document your findings here:**

### Issues Found:
1. [List any data quality issues]
2. [Missing values in specific columns]
3. [Outliers or anomalies]

### Recommendations:
1. [How to handle missing data]
2. [Whether to filter outliers]
3. [Data cleaning steps needed]

### Impact on Analysis:
- [How these issues might affect your insights]
- [Limitations to acknowledge in presentation]

## Next Steps

Based on data quality findings, proceed to:
- **02-eda-battle-metadata.ipynb**: Explore battle-level patterns
- **03-eda-card-analysis.ipynb**: Analyze card usage and win rates
- **04-eda-player-progression.ipynb**: Study trophy progression patterns