Toronto–NYC Fire Incident Gold Data Harmonization & EDA

This notebook reviews and harmonizes Toronto and New York City fire incident **Gold-layer outputs** into a unified, model-ready dataset. It begins with exploratory data analysis (EDA), including schema validation, missingness checks, timestamp consistency, and distributional comparisons across both cities.

The notebook then applies harmonization logic to standardize incident identifiers, timestamps, boroughs, and incident types. The final Gold dataset is incident-level (first arrival) and enriched with response-time labels, survival analysis fields, and temporal features to support comparative EDA, predictive modeling, and survival analysis.


## 1. Import and Load Tables

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql import DataFrame

# --- GOLD TABLE NAMES (catalog.schema.table) ---
TORONTO_GOLD_TABLE = "workspace.capstone_project.tfs_incidents_gold"
NYC_GOLD_TABLE     = "workspace.capstone_project.nyc_fire_incidents_gold"



### 1.1 Load Toronto Table

In [0]:
toronto_gold = spark.table(TORONTO_GOLD_TABLE)

print("Toronto Gold count:", toronto_gold.count())

display(toronto_gold.limit(5))

toronto_gold.printSchema()

### 1.2 Load NYC Table

In [0]:
nyc_gold = spark.table(NYC_GOLD_TABLE)

print("NYC Gold count:", nyc_gold.count())

display(nyc_gold.limit(5))

nyc_gold.printSchema()


## 2. EDA (Gold Tables)
We run the same checks for both cities:
- schema + missingness
- key categorical distributions
- response-time distribution (minutes)
- time feature sanity (hour/day/month)
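
The time-feature sanity check listed above can be expressed as simple range predicates. The following is a minimal sketch, assuming the features are integer-coded; the ranges in `EXPECTED_RANGES` and the helper `out_of_range_filter` are hypothetical and not taken from the Gold tables (the day-of-week coding in particular is an assumption).

```python
# Hypothetical valid ranges for integer-coded time features (assumptions,
# not confirmed by the Gold tables; adjust to the actual encoding).
EXPECTED_RANGES = {
    "hour": (0, 23),
    "day_of_week": (1, 7),
    "month": (1, 12),
}

def out_of_range_filter(column: str, low: int, high: int) -> str:
    """Build a SQL predicate matching non-null values outside [low, high]."""
    return f"{column} IS NOT NULL AND ({column} < {low} OR {column} > {high})"

# One predicate per feature, keyed by column name.
sanity_filters = {
    name: out_of_range_filter(name, low, high)
    for name, (low, high) in EXPECTED_RANGES.items()
}

for name, predicate in sanity_filters.items():
    print(f"{name}: {predicate}")
```

Each predicate is a plain SQL string, so it could be passed to `DataFrame.filter` followed by `.count()`; a non-zero count would point to values worth inspecting before modeling.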

#### Quick profiling helpers

In [0]:
def null_profile(df: DataFrame, cols: list) -> DataFrame:
    """Null count and null fraction per column, most-missing first."""
    total = df.count()
    exprs = [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in cols]
    res = df.agg(*exprs).collect()[0].asDict()
    rows = [(c, res[c], res[c] / total if total else None) for c in cols]
    # Explicit schema: type inference fails if every null_pct is None (empty table).
    schema = "`column` string, null_count long, null_pct double"
    return spark.createDataFrame(rows, schema).orderBy(F.desc("null_pct"))

def top_values(df: DataFrame, col: str, n=20) -> DataFrame:
    """The n most frequent values of a column with their counts."""
    return df.groupBy(col).count().orderBy(F.desc("count")).limit(n)

def approx_distinct(df: DataFrame, cols: list) -> DataFrame:
    """Approximate distinct count per column, highest cardinality first."""
    exprs = [F.approx_count_distinct(F.col(c)).alias(c) for c in cols]
    res = df.agg(*exprs).collect()[0].asDict()
    rows = [(c, int(res[c]) if res[c] is not None else None) for c in cols]
    return spark.createDataFrame(rows, ["column", "approx_distinct"]).orderBy(F.desc("approx_distinct"))

### 2.1 EDA on Toronto Gold

In [0]:
TOR_EDA_COLS = [
    "INCIDENT_NUMBER",
    "Final_Incident_Type",
    "Event_Alarm_Level",
    "Call_Source",
    "Incident_Station_Area",
    "Incident_Ward",
    "alarm_time",
    "arrival_time",
    "clear_time",
    "response_time_minutes",
    "incident_hour",
    "day_of_week",
    "season",
    "calls_past_30m",
    "calls_past_60m",
]

#### 2.1.1 Toronto Data Distinct Value Counts

In [0]:
display(approx_distinct(toronto_gold, [c for c in TOR_EDA_COLS if c in toronto_gold.columns]))

#### 2.1.2 Toronto Data Missing Value Profile

In [0]:
display(null_profile(toronto_gold, [c for c in TOR_EDA_COLS if c in toronto_gold.columns]))

#### 2.1.3 Toronto Data Incident Classification Distribution
This table displays the most frequent incident classification groups. It provides insight into dominant incident types and helps validate category consistency for modeling and cross-city comparison.

In [0]:
display(top_values(toronto_gold, "Final_Incident_Type", 20))

#### 2.1.4 Toronto Data Incident Station Area Distribution
This distribution summarizes incident counts by station area (`Incident_Station_Area`), the spatial field profiled here for Toronto. It supports spatial EDA and highlights geographic differences in incident volume.

In [0]:
display(top_values(toronto_gold, "Incident_Station_Area", 20))

#### 2.1.5 Toronto Data Day-of-week and Seasonal Distribution

In [0]:
display(top_values(toronto_gold, "day_of_week", 10))

In [0]:
display(top_values(toronto_gold, "season", 10))

#### 2.1.6 Toronto Data Response Time Distribution Summary (EDA)
The following percentile-based statistics summarize the distribution of incident response times (in minutes). Unlike averages, percentiles capture both typical performance and tail behavior, which is critical for understanding delay risk.

- **p50 (median)**: Typical response time; 50% of incidents are responded to within this duration.
- **p90 / p95**: Upper-tail performance; 90% / 95% of incidents are responded to within these durations, so they show how slow the slower responses get (for example, under high demand).
- **p99**: Extreme delay risk; only 1% of incidents take longer, which highlights rare but critical cases with very long response times.
- **Min / Max**: Used to identify potential data quality issues (e.g., negative or implausibly large values).

These statistics support comparative EDA across cities and motivate the use of survival analysis and tail-focused modeling approaches.
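
To make the contrast between the mean and the percentiles concrete, the sketch below uses a small hypothetical sample of response times. It applies the exact nearest-rank method, chosen only for illustration; it is not the algorithm behind Spark's `percentile_approx`, and the helper name `nearest_rank_percentile` and the sample values are assumptions.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile (1..100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Nearest rank: the smallest rank covering at least pct% of the sample.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in minutes (illustrative, not from the Gold tables).
sample_minutes = [3.0, 4.0, 4.5, 5.0, 5.0, 5.5, 6.0, 6.5, 7.0, 45.0]

mean_minutes = sum(sample_minutes) / len(sample_minutes)
p50 = nearest_rank_percentile(sample_minutes, 50)
p90 = nearest_rank_percentile(sample_minutes, 90)
p99 = nearest_rank_percentile(sample_minutes, 99)

print(f"mean={mean_minutes:.2f} p50={p50} p90={p90} p99={p99}")
```

In this sample a single 45-minute incident pulls the mean above p90, while p50 and p90 still describe what most incidents experience and p99 isolates the extreme case.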

In [0]:
# Response time distribution quick percentiles (rounded to match the NYC summary)
display(toronto_gold.select(
    F.round(F.expr("percentile_approx(response_time_minutes, 0.5)"), 2).alias("p50_min"),
    F.round(F.expr("percentile_approx(response_time_minutes, 0.9)"), 2).alias("p90_min"),
    F.round(F.expr("percentile_approx(response_time_minutes, 0.95)"), 2).alias("p95_min"),
    F.round(F.expr("percentile_approx(response_time_minutes, 0.99)"), 2).alias("p99_min"),
    F.round(F.max("response_time_minutes"), 2).alias("max_min"),
    F.round(F.min("response_time_minutes"), 2).alias("min_min"),
))

### 2.2 EDA on NYC Gold

In [0]:
NYC_EDA_COLS = [
    "incident_id",
    "response_minutes",
    "event_indicator",
    "hour",
    "day_of_week",
    "month",
    "season",
    "incident_classification_group",
    "alarm_level_index_description",
    "alarm_source_description_tx",
    "incident_borough",
    "calls_past_30min",
    "calls_past_60min",
]

#### 2.2.1 NYC Data Distinct Value Counts

In [0]:
display(approx_distinct(nyc_gold, [c for c in NYC_EDA_COLS if c in nyc_gold.columns]))

#### 2.2.2 NYC Data Missing Value Profile

In [0]:
display(null_profile(nyc_gold, [c for c in NYC_EDA_COLS if c in nyc_gold.columns]))

#### 2.2.3 NYC Data Incident Classification Distribution

In [0]:
display(top_values(nyc_gold, "incident_classification_group", 20))

#### 2.2.4 NYC Data Incident Borough Distribution

In [0]:
display(top_values(nyc_gold, "incident_borough", 20))

#### 2.2.5 NYC Data Day-of-week and Seasonal Distribution

In [0]:
display(top_values(nyc_gold, "day_of_week", 10))

In [0]:
display(top_values(nyc_gold, "season", 10))

#### 2.2.6 NYC Data Response Time Distribution Summary (EDA)

In [0]:
display(nyc_gold.select(
    F.round(F.expr("percentile_approx(response_minutes, 0.5)"),2).alias("p50_min"),
    F.round(F.expr("percentile_approx(response_minutes, 0.9)"),2).alias("p90_min"),
    F.round(F.expr("percentile_approx(response_minutes, 0.95)"),2).alias("p95_min"),
    F.round(F.expr("percentile_approx(response_minutes, 0.99)"),2).alias("p99_min"),
    F.round(F.max("response_minutes"),2).alias("max_min"),
    F.round(F.min("response_minutes"),2).alias("min_min"),
))

## 3. Harmonization (Gold → Unified Model Dataset)
We harmonize Toronto and NYC Gold outputs to a shared schema:
- incident_id
- response_minutes
- event_indicator
- hour, day_of_week (int), month (int), season
- incident_borough, incident_classification_group
- alarm_level_index_description, alarm_source_description_tx
- calls_past_30min, calls_past_60min
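
One way to drive the renaming toward the shared schema above is a column mapping. The sketch below is illustrative only: the Toronto-to-unified pairings in `TORONTO_TO_UNIFIED` are assumptions (for example, treating station area as the borough analogue), and the helpers `build_select_exprs` and `unmapped_targets` are hypothetical names, not part of the notebook.

```python
# Hypothetical Toronto -> unified column mapping. The pairings are assumptions
# made for illustration, not the notebook's confirmed harmonization rules.
TORONTO_TO_UNIFIED = {
    "INCIDENT_NUMBER": "incident_id",
    "response_time_minutes": "response_minutes",
    "incident_hour": "hour",
    "day_of_week": "day_of_week",
    "season": "season",
    "Incident_Station_Area": "incident_borough",
    "Final_Incident_Type": "incident_classification_group",
    "Event_Alarm_Level": "alarm_level_index_description",
    "Call_Source": "alarm_source_description_tx",
    "calls_past_30m": "calls_past_30min",
    "calls_past_60m": "calls_past_60min",
}

# Target schema as listed in this section.
UNIFIED_COLUMNS = [
    "incident_id", "response_minutes", "event_indicator",
    "hour", "day_of_week", "month", "season",
    "incident_borough", "incident_classification_group",
    "alarm_level_index_description", "alarm_source_description_tx",
    "calls_past_30min", "calls_past_60min",
]

def build_select_exprs(mapping: dict) -> list:
    """Turn a source->target mapping into SQL 'source AS target' expressions."""
    return [f"`{src}` AS `{dst}`" for src, dst in mapping.items()]

def unmapped_targets(mapping: dict, targets: list) -> list:
    """List unified columns that no source column maps to (must be derived)."""
    mapped = set(mapping.values())
    return [t for t in targets if t not in mapped]

select_exprs = build_select_exprs(TORONTO_TO_UNIFIED)
to_derive = unmapped_targets(TORONTO_TO_UNIFIED, UNIFIED_COLUMNS)
print(to_derive)
```

The generated expressions are plain SQL strings, so they could be passed to `DataFrame.selectExpr`. Under this assumed mapping, `event_indicator` and `month` have no direct Toronto source column and would need to be derived separately.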
