# Exploratory Data Analysis

This notebook provides an exploratory analysis of the **Diabetes 130-US Hospitals (1999-2008)** dataset.

**Problem Setting**: The goal of this project is to predict **30-day hospital readmissions** using an anomaly detection approach. Encounters followed by a readmission within 30 days (`readmitted == "<30"`) are treated as the positive class (anomalies); all other encounters (`"NO"` and `">30"`) are treated as normal cases.

## Setup

In [None]:
from pathlib import Path
import sys

CWD = Path.cwd().resolve()
if CWD.name == "notebooks":
    PROJECT_ROOT = CWD.parent
else:
    PROJECT_ROOT = CWD

sys.path.insert(0, str(PROJECT_ROOT))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.preprocessing import load_raw_data

sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

## Load Data

In [None]:
data_path = PROJECT_ROOT / 'data' / 'raw' / 'diabetic_data.csv'

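if not data_path.exists():
    # Fail fast: every later cell needs `df`, so printing a message and
    # continuing would only defer the failure to a confusing NameError.
    raise FileNotFoundError(
        f"Data file not found at {data_path}. "
        "Please ensure the dataset is in the data/raw/ directory."
    )

df = load_raw_data(str(data_path))
print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")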

In [None]:
df.head()

## Dataset Overview

In [None]:
print("Dataset shape:", df.shape)
print("\nColumn dtypes:")
print(df.dtypes.value_counts())

In [None]:
df.info()

## Target Variable: Readmission Status

The `readmitted` column indicates whether a patient was readmitted:
- `"NO"`: Not readmitted
- `">30"`: Readmitted after 30 days
- `"<30"`: **Readmitted within 30 days** (our positive class for anomaly detection)
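
For modeling, the three-level column is usually collapsed into a binary label. The following is a minimal sketch on a small made-up frame standing in for `df`; the column name `is_anomaly` is an assumption for illustration and is not part of the dataset or of `src.preprocessing`.

```python
import pandas as pd

# Toy stand-in for df['readmitted']; the values mirror the three levels above.
toy = pd.DataFrame({"readmitted": ["NO", ">30", "<30", "NO", ">30"]})

# 1 = readmitted within 30 days (anomaly), 0 = "NO" or ">30" (normal).
toy["is_anomaly"] = (toy["readmitted"] == "<30").astype(int)

# The mean of a 0/1 label is the positive-class prevalence.
prevalence = toy["is_anomaly"].mean()
print(toy["is_anomaly"].tolist(), f"prevalence={prevalence:.2f}")
```

The prevalence is worth keeping at hand: for a random ranking, average precision is expected to sit near this value, so it serves as a natural baseline for any anomaly scorer.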

In [None]:
print("Readmission distribution:")
print(df['readmitted'].value_counts())
print(f"\nPercentage readmitted <30 days: {(df['readmitted'] == '<30').mean() * 100:.2f}%")

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='readmitted', order=['NO', '>30', '<30'], palette='viridis')
plt.title('Distribution of Readmission Status', fontsize=14, fontweight='bold')
plt.xlabel('Readmission Status')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Missing Data Analysis

In [None]:
# Check for missing values
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)

if len(missing_data) > 0:
    print("Columns with missing values:")
    print(missing_data)
else:
    print("No missing values detected (NaN)")

In [None]:
# Check for '?' placeholder values (common in this dataset)
question_mark_cols = []
for col in df.select_dtypes(include=['object']).columns:
    # Count exact '?' entries; NaN values compare as False and are ignored.
    count = (df[col] == '?').sum()
    if count > 0:
        question_mark_cols.append((col, count))

if question_mark_cols:
    print("Columns with '?' placeholder:")
    for col, count in sorted(question_mark_cols, key=lambda x: x[1], reverse=True):
        print(f"  {col}: {count:,} ({count/len(df)*100:.2f}%)")
else:
    print("No '?' placeholders found")
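
Because `'?'` is an ordinary string, `df.isnull()` does not count it as missing. One common way to unify the two checks is to convert the placeholder to `NaN` first. The sketch below uses a made-up frame in place of `df`, and it is an assumption that `load_raw_data` does not already perform this conversion.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df, with '?' used as the missing-value placeholder.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "?"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 5],
})

# Replace the placeholder so pandas' missing-value tooling can see it.
cleaned = toy.replace("?", np.nan)

# Percentage of missing entries per column, highest first.
missing_pct = cleaned.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns that turn out to be mostly missing are candidates for dropping rather than imputing; where to draw that line is a modeling decision this notebook does not make.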

## Numerical Features

In [None]:
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
print(f"Numerical features ({len(numerical_cols)}): {numerical_cols}")

In [None]:
df[numerical_cols].describe()
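
Not every numeric column is a measurement. In the published dataset, `encounter_id` and `patient_nbr` are identifiers and the `*_id` columns (such as `admission_type_id`) are coded categories, so their means and standard deviations carry no meaning. The sketch below separates them with a simple name-based heuristic on a made-up frame; the heuristic itself is an assumption, and the columns may already be handled by `load_raw_data`.

```python
import pandas as pd

# Toy stand-in for df with a mix of identifiers, codes, and measurements.
toy = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "patient_nbr": [7, 8, 7],
    "admission_type_id": [1, 3, 1],
    "time_in_hospital": [2, 5, 1],
    "num_medications": [10, 18, 7],
})

numeric = toy.select_dtypes(include="number").columns

# Heuristic (assumption): names ending in "_id", plus patient_nbr,
# are identifiers or category codes rather than quantities.
id_like = [c for c in numeric if c.endswith("_id") or c == "patient_nbr"]
measurements = [c for c in numeric if c not in id_like]

print("ID-like:", id_like)
print("Measurements:", measurements)
```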

## Key Feature Distributions

In [None]:
# Select important features for visualization
key_features = ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 
                'num_medications', 'number_outpatient', 'number_emergency', 
                'number_inpatient', 'number_diagnoses']

available_features = [f for f in key_features if f in df.columns]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

# Pair each feature with an axis; zip stops at the shorter sequence.
for ax, feature in zip(axes, available_features):
    df[feature].hist(bins=30, ax=ax, edgecolor='black', alpha=0.7)
    ax.set_title(feature, fontsize=10)
    ax.set_xlabel('')
    ax.set_ylabel('Frequency')

# Hide unused subplots (also safe when no features are available)
for ax in axes[len(available_features):]:
    ax.axis('off')

plt.suptitle('Distribution of Key Numerical Features', fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Categorical Features

In [None]:
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical features ({len(categorical_cols)}):")
print(categorical_cols)

In [None]:
# Show cardinality for each categorical feature
cat_cardinality = {col: df[col].nunique() for col in categorical_cols}
cat_cardinality_sorted = sorted(cat_cardinality.items(), key=lambda x: x[1], reverse=True)

print("Categorical feature cardinality:")
for col, count in cat_cardinality_sorted[:15]:  # Show top 15
    print(f"  {col}: {count} unique values")

## Correlation Analysis

In [None]:
if len(numerical_cols) > 1:
    plt.figure(figsize=(12, 10))
    corr_matrix = df[numerical_cols].corr()
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', 
                linewidths=0.5, cbar_kws={'label': 'Correlation'})
    plt.title('Correlation Matrix of Numerical Features', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
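
A heatmap with many columns is hard to read, so it can help to list the strongest pairs explicitly. The sketch below does this on synthetic data rather than on `df`. Note that it uses Spearman rank correlation, which is a swapped-in choice better suited to skewed count features, whereas the heatmap above uses the pandas default (Pearson).

```python
import numpy as np
import pandas as pd

# Synthetic data: b is a noisy copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr(method="spearman")

# Take the strict upper triangle so each unordered pair appears once
# and the diagonal (self-correlation) is excluded.
cols = corr.columns
i_idx, j_idx = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i_idx, j_idx],
    index=pd.MultiIndex.from_arrays([cols[i_idx], cols[j_idx]]),
).sort_values(key=np.abs, ascending=False)  # rank by absolute strength

print(pairs)
```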

## Summary

**Key Findings**:
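- The dataset exhibits **class imbalance**: encounters readmitted within 30 days (`<30`) are the minority class (the exact share is printed in the Target Variable section)
- Several columns use `'?'` as a placeholder for missing values, so a plain `df.isnull()` check undercounts missing data
- Candidate clinical features include prior hospital utilization (`number_outpatient`, `number_emergency`, `number_inpatient`) and medication/procedure counts (`num_medications`, `num_procedures`, `num_lab_procedures`); this notebook does not yet quantify their association with readmission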
- High cardinality in some categorical features (e.g., diagnosis codes) may require careful encoding
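
One simple way to tame high-cardinality columns is to collapse rare levels into a single bucket before encoding. The sketch below illustrates the idea on a made-up series of codes; the helper `collapse_rare`, the `min_count` threshold, and the `"Other"` label are all hypothetical and are not part of the project's preprocessing code.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace levels seen fewer than min_count times with a single label.

    Missing values are not in the kept set, so they are mapped to `other` too.
    """
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other)


# Toy codes: two frequent levels and two singletons.
diag = pd.Series(["250", "250", "428", "428", "428", "V27", "786"])
collapsed = collapse_rare(diag, min_count=2)
print(collapsed.value_counts())
```

For diagnosis codes specifically, grouping by clinical category is a common alternative to frequency-based collapsing, but it needs a code-to-category mapping that is outside the scope of this notebook.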

**Next Steps**: The preprocessing pipeline will handle missing values, encode categorical features, and scale numerical features before feeding data to anomaly detection models.
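
The steps named above can be sketched with scikit-learn as follows. This is an illustration on a made-up frame, not the project's actual pipeline: the column lists, the median and constant imputation strategies, and the `"Unknown"` fill value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for df, with missing values already converted to NaN.
toy = pd.DataFrame({
    "time_in_hospital": [1.0, 3.0, np.nan, 5.0],
    "num_medications": [10.0, 18.0, 7.0, 12.0],
    "race": ["Caucasian", np.nan, "AfricanAmerican", "Caucasian"],
})
num_cols = ["time_in_hospital", "num_medications"]
cat_cols = ["race"]

# Numerical branch: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: make missingness its own level, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])

X = preprocess.fit_transform(toy)
print(X.shape)
```

Wrapping the steps in a single transformer means the imputation values and scaling statistics are learned from the training split only and then reused unchanged on held-out data.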