# 01 - Data Exploration

**Urban Flood Risk Prediction - CSE 572**

This notebook explores the Kaggle Playground Series S4E5 flood prediction dataset.

## Contents
1. Load and inspect data
2. Statistical summary
3. Target variable analysis
4. Feature distributions
5. Correlation analysis
6. Missing value check

In [None]:
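# Import libraries
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

# Make sure the directory used by the plt.savefig() calls exists
os.makedirs('../results/figures', exist_ok=True)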

print("Libraries loaded successfully!")

## 1. Load Data

In [None]:
# Load training data
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

In [None]:
# Display first few rows
train_df.head()

## 2. Statistical Summary

In [None]:
# Statistical summary
train_df.describe()

In [None]:
# Data types and missing values
train_df.info()
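
Beyond `describe()` and `info()`, it can help to confirm that rows and ids are unique before trusting the summary statistics. The cell below is a minimal sketch, not part of the original analysis: `basic_integrity_report` is a hypothetical helper, and the toy frame with its `feature_a` column only stands in for `train_df`.

```python
import pandas as pd


def basic_integrity_report(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Return simple integrity counts for a DataFrame (hypothetical helper)."""
    has_id = id_col in df.columns
    return {
        "n_rows": len(df),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # Ids that repeat an earlier id (None if there is no id column)
        "duplicate_ids": int(df[id_col].duplicated().sum()) if has_id else None,
        # Columns that pandas did not parse as numeric
        "non_numeric_cols": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Toy data standing in for train_df (column names are illustrative)
toy = pd.DataFrame({
    "id": [0, 1, 1],
    "feature_a": [5, 3, 3],
    "FloodProbability": [0.45, 0.50, 0.50],
})
report = basic_integrity_report(toy)
print(report)
```

In the notebook, the same call would be `basic_integrity_report(train_df)`.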

## 3. Target Variable Analysis

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(train_df['FloodProbability'], bins=50, edgecolor='white')
axes[0].set_xlabel('Flood Probability')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Target Variable Distribution')

# Box plot
axes[1].boxplot(train_df['FloodProbability'])
axes[1].set_ylabel('Flood Probability')
axes[1].set_title('Target Variable Box Plot')

plt.tight_layout()
plt.savefig('../results/figures/target_distribution.png', dpi=300)
plt.show()

print(f"Target mean: {train_df['FloodProbability'].mean():.4f}")
print(f"Target std: {train_df['FloodProbability'].std():.4f}")
print(f"Target range: [{train_df['FloodProbability'].min():.4f}, {train_df['FloodProbability'].max():.4f}]")
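
The mean, standard deviation, and range above can be complemented with a few percentiles and the skewness, which describe the shape seen in the histogram. This is a sketch under assumptions: `summarize_target` is a hypothetical helper, the percentile choices are arbitrary, and the toy series stands in for `train_df['FloodProbability']`.

```python
import pandas as pd


def summarize_target(target: pd.Series) -> pd.Series:
    """Percentiles, skewness, and distinct-value count (hypothetical helper)."""
    probs = [0.01, 0.25, 0.50, 0.75, 0.99]
    labels = ["p01", "p25", "p50", "p75", "p99"]
    quantiles = target.quantile(probs)
    quantiles.index = labels
    extra = pd.Series({
        "skew": target.skew(),           # 0 for a symmetric distribution
        "n_unique": target.nunique(),    # few distinct values suggests a discretized target
    })
    return pd.concat([quantiles, extra])


# Toy data standing in for train_df['FloodProbability']
toy_target = pd.Series([0.30, 0.40, 0.50, 0.60, 0.70])
target_summary = summarize_target(toy_target)
print(target_summary)
```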

## 4. Feature Distributions

In [None]:
# Feature columns (excluding id and target)
feature_cols = [col for col in train_df.columns if col not in ['id', 'FloodProbability']]
print(f"Number of features: {len(feature_cols)}")
print(f"Features: {feature_cols}")

In [None]:
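# Plot feature distributions
n_cols = 5
n_rows = -(-len(feature_cols) // n_cols)  # ceiling division, so the grid fits any feature count
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 4 * n_rows))
axes = np.atleast_1d(axes).flatten()

for ax, col in zip(axes, feature_cols):
    ax.hist(train_df[col], bins=30, edgecolor='white', alpha=0.7)
    ax.set_title(col, fontsize=10)
    ax.set_xlabel('')

# Hide any unused axes in the last row
for ax in axes[len(feature_cols):]:
    ax.set_visible(False)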

plt.tight_layout()
plt.savefig('../results/figures/feature_distributions.png', dpi=300)
plt.show()
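
Histograms alone do not always show whether a feature is continuous or takes only a handful of integer values. The sketch below tabulates dtype, distinct-value count, and an integer-valued flag per feature. `describe_feature_types` is a hypothetical helper, and the toy columns `feature_a` and `feature_b` stand in for the real `feature_cols`.

```python
import pandas as pd


def describe_feature_types(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Tabulate dtype, distinct values, and integer-valuedness per feature."""
    rows = []
    for col in feature_cols:
        values = df[col].dropna()
        rows.append({
            "feature": col,
            "dtype": str(df[col].dtype),
            "n_unique": int(values.nunique()),
            # True when every non-missing value has no fractional part
            "integer_valued": bool((values % 1 == 0).all()),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy data standing in for train_df[feature_cols]
toy = pd.DataFrame({
    "feature_a": [1, 2, 2, 3],
    "feature_b": [0.5, 1.25, 2.0, 3.75],
})
type_report = describe_feature_types(toy, ["feature_a", "feature_b"])
print(type_report)
```

Features flagged as integer-valued with a small `n_unique` may be better viewed with bar charts than with 30-bin histograms.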

## 5. Correlation Analysis

In [None]:
# Correlation matrix
corr_matrix = train_df[feature_cols + ['FloodProbability']].corr()

# Plot heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdBu_r', center=0, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('../results/figures/correlation_matrix.png', dpi=300)
plt.show()

In [None]:
# Correlation with target
target_corr = corr_matrix['FloodProbability'].drop('FloodProbability').sort_values(ascending=False)
print("Correlation with FloodProbability:")
print(target_corr)
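
The heatmap is dense to read, so ranking feature pairs by absolute correlation can make any multicollinearity easier to spot. This is a sketch, not the notebook's method: `top_feature_pairs` is a hypothetical helper, and the toy frame stands in for `train_df[feature_cols]`.

```python
import numpy as np
import pandas as pd


def top_feature_pairs(df: pd.DataFrame, feature_cols: list, k: int = 5) -> pd.DataFrame:
    """Return the k feature pairs with the largest absolute Pearson correlation."""
    corr = df[feature_cols].corr()
    # Strict upper triangle: each pair once, diagonal excluded
    rows, cols = np.triu_indices(len(feature_cols), k=1)
    pairs = pd.DataFrame({
        "feature_1": corr.index[rows],
        "feature_2": corr.columns[cols],
        "corr": corr.to_numpy()[rows, cols],
    })
    order = pairs["corr"].abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(k).reset_index(drop=True)


# Toy data standing in for train_df[feature_cols]
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [2, 4, 6, 8, 10],   # perfectly correlated with feature_a
    "feature_c": [5, 3, 4, 1, 2],
})
pairs = top_feature_pairs(toy, ["feature_a", "feature_b", "feature_c"], k=5)
print(pairs)
```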

## 6. Missing Values

In [None]:
# Check missing values
missing = train_df.isnull().sum()
missing_pct = (missing / len(train_df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
})

# Show only the columns that actually contain missing values
missing_cols = missing_df[missing_df['Missing Count'] > 0]

if missing_cols.empty:
    print("✅ No missing values found!")
else:
    print("Missing values summary:")
    print(missing_cols)
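
The check above covers only `train_df`. A small function makes it easy to run the same check on the test set as well. This is a sketch: `missing_summary` is a hypothetical helper, and the toy frames stand in for `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": missing,
        "Missing %": missing / len(df) * 100,
    })
    return summary[summary["Missing Count"] > 0]


# Toy data standing in for train_df and test_df
toy_train = pd.DataFrame({"feature_a": [1.0, np.nan, 3.0, 4.0], "feature_b": [1, 2, 3, 4]})
toy_test = pd.DataFrame({"feature_a": [1.0, 2.0], "feature_b": [5, 6]})

for name, frame in {"train": toy_train, "test": toy_test}.items():
    result = missing_summary(frame)
    if result.empty:
        print(f"{name}: no missing values")
    else:
        print(f"{name}:")
        print(result)
```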

## Summary

Key findings from exploratory data analysis:

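1. **Dataset size**: ~1.1M training instances and ~745K test instances
2. **Features**: 20 numerical features; confirm from the `train_df.info()` dtypes and the histograms whether they are continuous or integer-valued
3. **Target**: `FloodProbability`, ranging from ~0.29 to ~0.73
4. **Missing values**: none expected; the check in section 6 lists any affected columns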
5. **Correlations**: [Add findings]

Next steps:
- Proceed to preprocessing (notebook 02)
- Apply feature scaling
- Train baseline models
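
As a preview of the feature-scaling step, the sketch below standardizes features using statistics computed on the training set only and then applies them to the test set, which avoids leaking test information. The choice of standardization (zero mean, unit variance) is an assumption, since the notes above do not say which scaler notebook 02 uses. `standardize` is a hypothetical helper, and the toy frames stand in for `train_df` and `test_df`.

```python
import pandas as pd


def standardize(train: pd.DataFrame, test: pd.DataFrame, feature_cols: list):
    """Scale features to zero mean and unit variance using training statistics."""
    mean = train[feature_cols].mean()
    # Population std; constant columns get a divisor of 1 to avoid division by zero
    std = train[feature_cols].std(ddof=0).replace(0, 1.0)

    train_scaled = train.copy()
    test_scaled = test.copy()
    train_scaled[feature_cols] = (train[feature_cols] - mean) / std
    test_scaled[feature_cols] = (test[feature_cols] - mean) / std
    return train_scaled, test_scaled


# Toy data standing in for train_df and test_df
toy_train = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "FloodProbability": [0.4, 0.5, 0.6]})
toy_test = pd.DataFrame({"feature_a": [2.0, 4.0]})

train_scaled, test_scaled = standardize(toy_train, toy_test, ["feature_a"])
print(train_scaled)
print(test_scaled)
```

The target column is left untouched, and the test set is scaled with the training mean and standard deviation rather than its own.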