# Exploratory Data Analysis (EDA)

This notebook conducts exploratory data analysis on the harmonized and model-ready emergency incident datasets for Toronto and New York City. The objective of this analysis is to examine the distribution and variability of emergency response times, identify temporal and operational patterns associated with peak demand and delayed responses, and assess service-level performance beyond simple averages. Particular attention is given to tail delays and response-time threshold breaches, which are critical for understanding operational risk in emergency response systems. The findings from this EDA are used to guide feature engineering, model selection, and comparative analysis in subsequent stages of the project.



## 0. Import Libraries

In [0]:
# PySpark core
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    sum as spark_sum,
    count,
    when,
    hour,
    dayofweek,
    date_format
)
from pyspark.sql import functions as F

# Optional: for local conversion & plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Plot settings
plt.style.use("default")
sns.set_context("notebook")

## 1. Sanity & Structure Check
Goal: Make sure the tables are truly “model-ready”.

- Row count (compare Toronto vs NYC scale)
- Column list & data types
- Missing values per column
- Duplicate incidents (by `incident_id`)

### 1.1 Load Tables

In [0]:
toronto_df = spark.table("workspace.capstone_project.toronto_model_ready")
nyc_df = spark.table("workspace.capstone_project.nyc_model_ready")

### 1.2 Row Count (Scale Comparison)

In [0]:
toronto_count = toronto_df.count()
nyc_count = nyc_df.count()

toronto_count, nyc_count

**Row Count Summary**

The Toronto dataset contains 349,198 emergency incidents, while the New York City dataset contains 1,060,771 incidents. The substantially larger volume of incidents in New York City is expected due to differences in population size, urban density, and emergency service demand. These scale differences are taken into account during exploratory analysis and modeling, particularly when comparing response-time distributions and service-level risk across cities.

### 1.3 Column List & Data Types

In [0]:
toronto_df.printSchema()

In [0]:
nyc_df.printSchema()

In [0]:
set(toronto_df.columns) - set(nyc_df.columns), set(nyc_df.columns) - set(toronto_df.columns)

**Schema Consistency Check**

A comparison of column names across the Toronto and New York City datasets shows no differences in schema. Both datasets contain identical sets of analytical features, confirming that the data harmonization process successfully aligned the structure of the two datasets and enables direct cross-city comparison.
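
Matching column names do not guarantee matching data types, so a type-level comparison can complement the name check above. The sketch below assumes only that each dataset's types are available as a list of `(column_name, type_string)` pairs, which is the shape of a Spark DataFrame's `dtypes` attribute. The helper name `dtype_mismatches` and the example type lists are hypothetical, not taken from the project tables.

```python
def dtype_mismatches(left_dtypes, right_dtypes):
    """Return {column: (left_type, right_type)} for shared columns whose types differ.

    Both arguments are lists of (column_name, type_string) pairs, the shape
    exposed by the `dtypes` attribute of a Spark DataFrame.
    """
    left = dict(left_dtypes)
    right = dict(right_dtypes)
    shared = sorted(set(left) & set(right))
    return {c: (left[c], right[c]) for c in shared if left[c] != right[c]}


# Hypothetical type lists for illustration only; in the notebook these would
# be toronto_df.dtypes and nyc_df.dtypes.
toronto_types = [("incident_id", "string"), ("response_minutes", "double")]
nyc_types = [("incident_id", "bigint"), ("response_minutes", "double")]

mismatches = dtype_mismatches(toronto_types, nyc_types)
print(mismatches)
```

An empty dictionary would indicate that the shared columns agree on type as well as name.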

### 1.4 Missing Values per Column

In [0]:
def missing_value_summary(df):
    """Return a one-row Spark DataFrame holding the null count of every column."""
    return df.select([
        spark_sum(col(c).isNull().cast("int")).alias(c)
        for c in df.columns
    ])

In [0]:
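def missing_table(df):
    """Return a pandas table of missing counts and percentages, worst column first."""
    total = df.count()
    # Transpose the one-row summary so that each column becomes a row
    m = missing_value_summary(df).toPandas().T.reset_index()
    m.columns = ["column_name", "missing_count"]
    m["missing_pct"] = m["missing_count"] / total * 100
    return m.sort_values("missing_count", ascending=False)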

display(missing_table(toronto_df))
display(missing_table(nyc_df))

### 1.5 Duplicate Values Check

In [0]:
toronto_dupes = (
    toronto_df
    .groupBy("incident_id")
    .count()
    .filter(F.col("count") > 1)
)

print("Toronto duplicate incident_id count:", toronto_dupes.count())
display(toronto_dupes.orderBy(F.desc("count")).limit(20))

In [0]:
nyc_dupes = (
    nyc_df
    .groupBy("incident_id")
    .count()
    .filter(F.col("count") > 1)
)

print("NYC duplicate incident_id count:", nyc_dupes.count())
display(nyc_dupes.orderBy(F.desc("count")).limit(20))


**NYC Data Sanity Summary**

The NYC model-ready dataset contains no duplicate incident records. All feature variables are fully populated, while the target variable `response_minutes` has 422,625 missing values (28.49%). These missing values correspond to incidents without an observed response completion time and are retained to support censoring-aware survival analysis. No unintended row filtering or imputation was applied during data preparation.

## 2. Target Variable Exploration
(Assuming response time or delay-based target)

- Distribution (histogram / KDE)
- Summary stats (mean, median, P90, P95)
- Skewness & outliers
- % of incidents breaching SLA thresholds (e.g. > X minutes)

Distributions and summary statistics below use completed incidents only, i.e. `response_minutes IS NOT NULL`.
<br>Censored cases are handled separately in the survival analysis.

In [0]:
toronto_complete = toronto_df.filter(F.col("response_minutes").isNotNull())
nyc_complete     = nyc_df.filter(F.col("response_minutes").isNotNull())

In [0]:
print("Toronto completed:", toronto_complete.count(), "/", toronto_df.count())
print("NYC completed:", nyc_complete.count(), "/", nyc_df.count())
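
The summary statistics and SLA breach share listed for this section could be computed along the following lines. This is a minimal sketch on a small pandas series of completed response times, not the project's implementation: the helper name `response_time_summary`, the toy values, and the 8-minute threshold are all placeholders chosen for illustration.

```python
import pandas as pd


def response_time_summary(minutes, sla_minutes):
    """Summarize completed response times and the share breaching an SLA threshold."""
    # Missing values (censored incidents) are dropped before summarizing
    s = pd.Series(minutes, dtype="float64").dropna()
    return {
        "n": int(s.size),
        "mean": float(s.mean()),
        "median": float(s.median()),
        "p90": float(s.quantile(0.90)),
        "p95": float(s.quantile(0.95)),
        "skewness": float(s.skew()),
        "breach_pct": float((s > sla_minutes).mean() * 100),
    }


# Toy values for illustration only; None stands in for a censored incident.
sample = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0, 12.0, 30.0, None]

# 8.0 is a placeholder threshold, not an SLA taken from the project.
summary = response_time_summary(sample, sla_minutes=8.0)
print(summary)
```

A mean well above the median together with positive skewness points to a long right tail, which is why P90 and P95 are reported alongside the averages. On the full Spark tables, approximate percentiles could instead be obtained with `F.percentile_approx`, assuming a Spark version that provides it.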

## 3. Temporal Patterns
Create / validate:
- Hour of day
- Day of week
- Month / season
- Weekend vs weekday

Explore:
- Avg & P90 response time by hour
- Incident volume by hour
- Heatmap: hour × day_of_week
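
The hourly profile and the hour × day-of-week grid above might be built as follows. The sketch assumes a pandas frame of completed incidents with columns named `hour`, `day_of_week`, and `response_minutes`; the first two names are assumptions about the model-ready schema, and the toy rows are invented for illustration.

```python
import pandas as pd


def hourly_profile(df):
    """Incident count, mean and P90 response time for each hour of day."""
    return (
        df.groupby("hour")["response_minutes"]
        .agg(
            incidents="count",
            mean_minutes="mean",
            p90_minutes=lambda s: s.quantile(0.90),
        )
        .reset_index()
    )


def hour_by_day_matrix(df):
    """Median response time arranged as day_of_week rows by hour columns."""
    return df.pivot_table(
        index="day_of_week",
        columns="hour",
        values="response_minutes",
        aggfunc="median",
    )


# Toy rows for illustration only.
toy = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 17],
    "day_of_week": [1, 1, 2, 1, 2, 2],
    "response_minutes": [5.0, 7.0, 6.0, 9.0, 11.0, 13.0],
})

profile = hourly_profile(toy)
matrix = hour_by_day_matrix(toy)
print(profile)
print(matrix)
```

The resulting matrix can be passed to `sns.heatmap` for plotting. The median is used in the grid here as one reasonable choice; a P90 grid would highlight tail delays instead.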

## 4. Spatial / Operational Signals
**Toronto**
- Ward / Station Area
- Alarm level
- Call source

**NYC**
- Borough
- Incident type
- Alarm level

Explore:
- Response time by area (mean + tail)
- Volume vs delay by area
- Whether high-volume areas also respond quickly (high volume ≠ fast response; important insight)
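
One way to set volume against delay by area is to rank each area on both measures and look for areas where the two ranks diverge. The sketch below assumes a pandas frame of completed incidents; the helper name `area_volume_vs_delay`, the column name `area`, and the toy rows are hypothetical. In practice `area_col` would be whichever ward, station-area, or borough column the tables actually carry.

```python
import pandas as pd


def area_volume_vs_delay(df, area_col):
    """Per-area incident count, mean and P90 response time, with ranks for each."""
    out = (
        df.groupby(area_col)["response_minutes"]
        .agg(
            incidents="count",
            mean_minutes="mean",
            p90_minutes=lambda s: s.quantile(0.90),
        )
        .reset_index()
    )
    # Rank 1 = highest volume / slowest tail
    out["volume_rank"] = out["incidents"].rank(ascending=False, method="min").astype(int)
    out["delay_rank"] = out["p90_minutes"].rank(ascending=False, method="min").astype(int)
    return out.sort_values("incidents", ascending=False).reset_index(drop=True)


# Toy rows for illustration only: area A is busier, area B is slower.
toy = pd.DataFrame({
    "area": ["A", "A", "A", "A", "B", "B"],
    "response_minutes": [4.0, 5.0, 5.0, 6.0, 10.0, 14.0],
})

by_area = area_volume_vs_delay(toy, "area")
print(by_area)
```

Areas with a low `volume_rank` number but a high `delay_rank` number are busy yet fast, while the reverse pattern marks quieter areas with slow tails.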

## 5. Incident Characteristics
- Incident type vs response time
- Alarm level vs response time
- Rare but high-risk categories
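
A possible way to surface rare but high-risk categories is to flag those that make up a small share of incidents yet have a tail response time above the overall tail. The sketch below is an illustration under assumptions: the helper name `rare_high_risk`, the column name `incident_type`, the 10% share cutoff, and the toy rows are all hypothetical.

```python
import pandas as pd


def rare_high_risk(df, category_col, max_share=0.10, tail_q=0.90):
    """Categories with a small incident share but a tail above the overall tail."""
    overall_tail = df["response_minutes"].quantile(tail_q)
    stats = (
        df.groupby(category_col)["response_minutes"]
        .agg(incidents="count", tail_minutes=lambda s: s.quantile(tail_q))
        .reset_index()
    )
    stats["share"] = stats["incidents"] / stats["incidents"].sum()
    flagged = stats[
        (stats["share"] <= max_share) & (stats["tail_minutes"] > overall_tail)
    ]
    return flagged.sort_values("tail_minutes", ascending=False).reset_index(drop=True)


# Toy rows for illustration only: one common category, one rare and slow one.
toy = pd.DataFrame({
    "incident_type": ["common"] * 38 + ["hazmat"] * 2,
    "response_minutes": [5.0] * 38 + [20.0, 30.0],
})

flagged = rare_high_risk(toy, "incident_type")
print(flagged)
```

Tail estimates for categories with very few incidents are noisy, so a flagged category is a prompt for closer inspection rather than a conclusion.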

## 6. Cross-City Comparability Check (critical)
Before modeling:
- Are response-time definitions aligned?
- Same units? (seconds vs minutes)
- Similar feature engineering logic?

Create:
- Normalized response time distributions
- Percentile comparison (Toronto vs NYC)
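
The unit check and the percentile comparison can be combined: convert each city's response times to minutes, then tabulate the same percentiles side by side. This is a minimal sketch; the helper names `to_minutes` and `percentile_table`, the chosen percentiles, and the toy values (including the premise that one source is recorded in seconds) are assumptions made for illustration.

```python
import pandas as pd

PERCENTILES = [0.50, 0.75, 0.90, 0.95, 0.99]


def to_minutes(values, unit):
    """Convert response times to minutes; `unit` must be 'minutes' or 'seconds'."""
    s = pd.Series(values, dtype="float64")
    if unit == "minutes":
        return s
    if unit == "seconds":
        return s / 60.0
    raise ValueError(f"unsupported unit: {unit!r}")


def percentile_table(series_by_city):
    """Side-by-side percentiles (rows) for each city (columns), ignoring missing values."""
    table = pd.DataFrame({
        city: s.dropna().quantile(PERCENTILES)
        for city, s in series_by_city.items()
    })
    table.index = [f"P{int(round(q * 100))}" for q in PERCENTILES]
    return table


# Toy values for illustration only; the NYC values are given in seconds here
# purely to exercise the unit conversion.
toronto_minutes = to_minutes([4.0, 6.0, 8.0], unit="minutes")
nyc_minutes = to_minutes([240.0, 360.0, 480.0], unit="seconds")

table = percentile_table({"Toronto": toronto_minutes, "NYC": nyc_minutes})
print(table)
```

If the two columns differ by a factor close to 60 at every percentile, a unit mismatch is a more likely explanation than a genuine performance gap.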

## 7. Correlation & Leakage Scan
- Correlation matrix (numeric only)
- Watch for:
  - Features derived from response time
  - Post-arrival timestamps
- Flag anything suspicious early
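
A simple first pass at the leakage scan is to correlate every numeric feature with the target and flag near-perfect correlations, which often indicate a feature derived from the response time itself. The sketch below assumes a pandas frame of completed incidents; the helper name `leakage_scan`, the 0.95 cutoff, and the toy columns `response_seconds` and `hour` are hypothetical.

```python
import pandas as pd


def leakage_scan(df, target="response_minutes", threshold=0.95):
    """Absolute Pearson correlation of each numeric feature with the target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    report = corr.abs().sort_values(ascending=False).rename("abs_corr").to_frame()
    # Near-perfect correlation suggests the feature may be derived from the target
    report["suspicious"] = report["abs_corr"] >= threshold
    return report


# Toy rows for illustration only: response_seconds is a deliberately leaky
# column derived from the target, hour is an ordinary feature.
toy = pd.DataFrame({
    "response_minutes": [5.0, 7.0, 9.0, 11.0, 30.0],
    "response_seconds": [300.0, 420.0, 540.0, 660.0, 1800.0],
    "hour": [3, 14, 9, 22, 7],
})

report = leakage_scan(toy)
print(report)
```

Correlation only catches numeric, roughly linear leakage. Post-arrival timestamps and categorical fields populated after the response still need to be reviewed by column name and definition.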