# Toronto–NYC Gold Merge & EDA (Unified Dataset)

This notebook loads Gold-layer outputs for Toronto and NYC fire incidents, harmonizes them to a shared schema, merges them into a unified dataset with a `city` indicator, and performs comparative EDA (distribution, missingness, categorical profiles, and tail-risk percentiles).



## 1. Import and Load Tables

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark.sql import DataFrame

# --- Fully qualified Gold table names ---
TORONTO_GOLD_TABLE = "workspace.capstone_project.tfs_incidents_gold"
NYC_GOLD_TABLE     = "workspace.capstone_project.nyc_fire_incidents_gold"



### 1.1 Load Toronto Table

In [0]:
toronto_gold = spark.table(TORONTO_GOLD_TABLE)

print("Toronto Gold count:", toronto_gold.count())

display(toronto_gold.limit(5))

toronto_gold.printSchema()

### 1.2 Load NYC Table

In [0]:
nyc_gold     = spark.table(NYC_GOLD_TABLE)

print("NYC Gold count:", nyc_gold.count())

display(nyc_gold.limit(5))

nyc_gold.printSchema()


## 2. Incident Type Harmonization

Incident type definitions differ substantially between Toronto and NYC. Toronto reports fine-grained incident types, while NYC reports a small number of aggregated categories. To enable valid cross-city comparison, incident types are harmonized into shared high-level categories.


### 2.1 Pre-Inspection: Raw Incident Type Distributions

Before harmonization, we inspect all distinct incident type values and their frequencies in each city. This ensures the harmonization logic is grounded in observed data and covers dominant categories.
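
One way to make "covers dominant categories" concrete is to ask how many of the most frequent raw labels are needed to reach a given share of all incidents. The sketch below does this in plain Python on made-up counts. The helper name `labels_for_coverage`, the 85% target, and the sample data are illustrative assumptions and not part of this notebook; in practice the counts would come from a collected `groupBy(...).count()` result.

```python
from collections import Counter

def labels_for_coverage(counts, target=0.95):
    """Return the most frequent labels whose cumulative share first reaches `target`."""
    total = sum(counts.values())
    if total == 0:
        return []
    covered, selected = 0, []
    for label, n in Counter(counts).most_common():
        selected.append(label)
        covered += n
        if covered / total >= target:
            break
    return selected

# Hypothetical label counts, for illustration only
sample_counts = {"LABEL A": 700, "LABEL B": 200, "LABEL C": 60, "LABEL D": 40}
dominant = labels_for_coverage(sample_counts, target=0.85)
print(dominant)  # ['LABEL A', 'LABEL B']
```

Labels outside the returned list form the long tail; they still need a sensible fallback category, but they influence the cross-city comparison far less.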



#### 2.1.1 Toronto Data Inspection

Check NULL or Empty Incident Types

In [0]:
total_rows = toronto_gold.count()

null_or_empty = (
    toronto_gold
    .filter(
        F.col("Final_Incident_Type").isNull() |
        (F.trim(F.col("Final_Incident_Type")) == "")
    )
    .count()
)

print(f"Toronto Null or empty Final_Incident_Type: {null_or_empty} ({null_or_empty/total_rows:.2%})")

Get Distinct Count

In [0]:
toronto_distinct_count = (
    toronto_gold
    .select("Final_Incident_Type")
    .distinct()
    .count()
)

print("Toronto distinct Final_Incident_Type count:", toronto_distinct_count)

List Distinct Values

In [0]:
# Full distribution (all categories, sorted by frequency)
toronto_incident_dist = (
    toronto_gold
    .groupBy("Final_Incident_Type")
    .count()
    .orderBy(F.desc("count"))
)

display(toronto_incident_dist)

#### 2.1.2 NYC Data Inspection

Check NULL or Empty Incident Types

In [0]:
total_rows_nyc = nyc_gold.count()

null_or_empty_nyc = (
    nyc_gold
    .filter(
        F.col("INCIDENT_CLASSIFICATION").isNull() |
        (F.trim(F.col("INCIDENT_CLASSIFICATION")) == "")
    )
    .count()
)

print(f"NYC null or empty INCIDENT_CLASSIFICATION: {null_or_empty_nyc} ({null_or_empty_nyc/total_rows_nyc:.2%})")

Get Distinct Count

In [0]:
nyc_distinct_count = (
    nyc_gold
    .select("INCIDENT_CLASSIFICATION")
    .distinct()
    .count()
)

print("NYC distinct INCIDENT_CLASSIFICATION count:", nyc_distinct_count)

List Distinct Values

In [0]:
nyc_incident_dist = (
    nyc_gold
    .groupBy("INCIDENT_CLASSIFICATION")
    .count()
    .orderBy(F.desc("count"))
)

display(nyc_incident_dist)

# Reuse the distinct count computed in the previous cell instead of rescanning the table
print("NYC distinct INCIDENT_CLASSIFICATION count:", nyc_distinct_count)

### 2.2 Incident Coding: Unified Category Design

Based on the pre-inspection, raw incident types are grouped into the following unified categories:

- Medical  
- Fire – Structural  
- Fire – Non-Structural  
- Rescue / Entrapment  
- Hazardous / Utility  
- False Alarm / No Action  
- Other / Assistance  

This taxonomy preserves operational meaning while accommodating differences in reporting granularity between cities.
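
Because the category names are plain strings, a small typo (for example a hyphen in place of the en dash in `Fire – Structural`) would silently create an eighth category. The sketch below keeps the taxonomy in one constant and flags any value outside it. The names `UNIFIED_CATEGORIES` and `unexpected_categories` are assumptions introduced here for illustration.

```python
# The seven unified categories, spelled exactly as in the list above
UNIFIED_CATEGORIES = (
    "Medical",
    "Fire – Structural",
    "Fire – Non-Structural",
    "Rescue / Entrapment",
    "Hazardous / Utility",
    "False Alarm / No Action",
    "Other / Assistance",
)

def unexpected_categories(observed):
    """Return the observed category values that are not in the unified taxonomy."""
    allowed = set(UNIFIED_CATEGORIES)
    return sorted({c for c in observed if c not in allowed})

# "Fire - Structural" uses a hyphen instead of the en dash, so it is flagged
print(unexpected_categories(["Medical", "Fire - Structural"]))  # ['Fire - Structural']
```

A check like this could be run on the distinct values of a mapped category column after collecting them to the driver.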


#### 2.2.1 Define Mapping Function

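The two functions below standardize raw incident-type labels from Toronto and NYC into the seven unified categories. The Toronto function applies explicit overrides for known edge cases first and general pattern-matching rules second. The NYC function applies its rules in a fixed order and can fall back to the coarser classification group when one is supplied. Because a `F.when` chain returns the first branch that matches, rule order decides how a label that matches several patterns is classified, and any label that matches no rule falls back to `Other / Assistance`.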
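
To see why overrides have to come before the general rules, here is a minimal plain-Python sketch of first-match-wins classification. It uses Python's `re.search` as a stand-in for Spark's `rlike` (both look for the pattern anywhere in the string). The `classify` helper and the sample label are illustrative assumptions; the two patterns are a cut-down subset of the Toronto rules.

```python
import re

def classify(label, rules, default="Other / Assistance"):
    """Return the category of the first rule whose pattern matches the normalized label."""
    s = label.strip().upper()
    for pattern, category in rules:
        if re.search(pattern, s):
            return category
    return default

# One explicit override and one simplified general rule
override = (r"^\s*49\s*-\s*RUPTURED WATER,\s*STEAM PIPE", "Hazardous / Utility")
general = (r"COOKING|TOASTING|SMOKE|STEAM", "Fire – Non-Structural")

label = "49 - Ruptured Water, Steam Pipe"  # illustrative raw label
print(classify(label, [override, general]))  # Hazardous / Utility
print(classify(label, [general, override]))  # Fire – Non-Structural
```

With the order reversed, the generic `STEAM` pattern captures the label before the override is ever tested, which is the misclassification the overrides exist to prevent.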

In [0]:
# 1. Define a function to map the incident category to the unified incident category
# Toronto Data Mapping Function
def map_toronto_incident_category(col_expr):
    """
    Unified 7-category incident taxonomy for Toronto + NYC.

    Categories:
    - Medical
    - Fire – Structural
    - Fire – Non-Structural
    - Rescue / Entrapment
    - Hazardous / Utility
    - False Alarm / No Action
    - Other / Assistance

    Explicit overrides are included to match the user's "Correct Classification" table.
    """
    s = F.upper(F.trim(col_expr))

    return (
        # =========================================================
        # 0) NYC explicit mapping (since NYC is already aggregated)
        # =========================================================
        F.when(s.rlike(r"^MEDICAL EMERGENCIES$|^MEDICAL MFAS$"), F.lit("Medical"))
         .when(s.rlike(r"^STRUCTURAL FIRES$"), F.lit("Fire – Structural"))
         .when(s.rlike(r"^NONSTRUCTURAL FIRES$"), F.lit("Fire – Non-Structural"))
         .when(s.rlike(r"^NONMEDICAL MFAS$"), F.lit("Rescue / Entrapment"))
         .when(s.rlike(r"^NONMEDICAL EMERGENCIES$"), F.lit("Other / Assistance"))

        # =========================================================
        # 1) Toronto explicit overrides (to match Correct Classification)
        # =========================================================

        # Must be Other / Assistance (NOT false alarm)
         .when(s.rlike(r"^\s*98\s*-\s*ASSISTANCE NOT REQUIRED"), F.lit("Other / Assistance"))

        # Must be Other / Assistance (NOT Fire – Non-Structural)
         .when(s.rlike(r"^\s*23\s*-\s*OPEN AIR BURNING/UNAUTHORIZED CONTROLLED BURNING"), F.lit("Other / Assistance"))
         .when(s.rlike(r"^\s*21\s*-\s*OVERHEAT"), F.lit("Other / Assistance"))
         .when(s.rlike(r"^\s*29\s*-\s*OTHER PRE FIRE CONDITIONS"), F.lit("Other / Assistance"))

        # Must be Hazardous / Utility
         .when(s.rlike(r"^\s*49\s*-\s*RUPTURED WATER,\s*STEAM PIPE"), F.lit("Hazardous / Utility"))
         .when(s.rlike(r"^\s*11\s*-\s*OVERPRESSURE RUPTURE"), F.lit("Hazardous / Utility"))
         .when(s.rlike(r"^\s*13\s*-\s*OVERPRESSURE RUPTURE\s*-\s*GAS PIPE"), F.lit("Hazardous / Utility"))
         .when(s.rlike(r"^\s*53\s*-\s*CO INCIDENT,\s*CO PRESENT"), F.lit("Hazardous / Utility"))
         .when(s.rlike(r"^\s*48\s*-\s*RADIO-?ACTIVE MATERIAL PROBLEM"), F.lit("Hazardous / Utility"))

        # Must be Rescue / Entrapment
         .when(s.rlike(r"^\s*69\s*-\s*OTHER RESCUE"), F.lit("Rescue / Entrapment"))
         .when(s.rlike(r"^\s*605\s*-\s*ANIMAL RESCUE"), F.lit("Rescue / Entrapment"))
         .when(s.rlike(r"^\s*68\s*-\s*WATER ICE RESCUE"), F.lit("Rescue / Entrapment"))

        # Must be Other / Assistance
         .when(s.rlike(r"^\s*54\s*-\s*SUSPICIOUS SUBSTANCE"), F.lit("Other / Assistance"))
         .when(s.rlike(r"^\s*26\s*-\s*FIREWORKS\s*\(NO FIRE\)"), F.lit("Other / Assistance"))

        # =========================================================
        # 2) General rules (Toronto + any remaining strings)
        # =========================================================

        # Medical
         .when(s.rlike(r"\bMEDICAL\b|\bEMS\b"), F.lit("Medical"))

        # False Alarm / No Action (alarm/cancelled/not found/CO false alarm/prank)
         .when(
             s.rlike(
                 r"\bALARM\b|ALARM SYSTEM|ALARM EQUIPMENT|MALFUNCTION|ACCIDENTAL ACTIVATION|"
                 r"PERCEIVED EMERGENCY|PRANK|MALICIOUS|"
                 r"CO FALSE ALARM|NO CO PRESENT|"
                 r"INCIDENT NOT FOUND|CANCELLED ON ROUTE|CANCELLED|"
                 r"PUBLIC HAZARD CALL FALSE ALARM|PUBLIC HAZARD NO ACTION REQUIRED|"
                 r"RESCUE FALSE ALARM|RESCUE NO ACTION REQUIRED|"
                 r"NO ACTION REQUIRED"
             ),
             F.lit("False Alarm / No Action")
         )

        # Fire – Structural
         .when(s.rlike(r"\b01\s*-\s*FIRE\b|STRUCTURAL FIRE|STRUCTURE FIRE"), F.lit("Fire – Structural"))

        # Fire – Non-Structural (keep cooking/smoke/pot on stove/outdoor fire codes; exclude the overridden ones above)
         .when(
             s.rlike(
                 r"NONSTRUCTURAL FIRE|NO LOSS OUTDOOR FIRE|OUTDOOR FIRE|"
                 r"COOKING|TOASTING|SMOKE|STEAM|"
                 r"POT ON STOVE|STOVE"
             ),
             F.lit("Fire – Non-Structural")
         )

        # Rescue / Entrapment
         .when(
             s.rlike(
                 r"PERSONS TRAPPED|ENTRAPMENT|ELEVATOR|EXTRICATION|"
                 r"WATER RESCUE|HIGH ANGLE|LOW ANGLE|CONFINED SPACE|TRENCH|"
                 r"\b691\b\s*-\s*PERSONAL/INDUSTRIAL ENTRAPMENT"
             ),
             F.lit("Rescue / Entrapment")
         )

        # Hazardous / Utility (gas leaks, spills, power lines, CO present, radiation, etc.)
         .when(
             s.rlike(
                 r"GAS LEAK|NATURAL GAS|PROPANE|REFRIGERATION|"
                 r"\bCO INCIDENT\b|CO PRESENT|CARBON MONOXIDE|"
                 r"HAZMAT|HAZARDOUS|"
                 r"SPILL|TOXIC CHEMICAL|GASOLINE|FUEL|"
                 r"POWER LINES|ARCI?NG|"
                 r"RADIO-?ACTIVE|RADIATION|"
                 r"OVERPRESSURE RUPTURE|GAS PIPE|"
                 r"RUPTURED WATER|STEAM PIPE"
             ),
             F.lit("Hazardous / Utility")
         )

        # Fallback
         .otherwise(F.lit("Other / Assistance"))
    )

In [0]:
# NYC Data Mapping Function
def map_nyc_incident_category(incident_classification_col, incident_group_col=None):
    """
    Map NYC (FDNY) incident classification into unified incident categories.

    Preference order:
    1) incident_classification (detailed, e.g., "Utility Emergency - Gas")
    2) incident_classification_group (coarse, e.g., "NonMedical Emergencies") if provided

    Output categories (7):
    - Medical
    - Fire – Structural
    - Fire – Non-Structural
    - Rescue / Entrapment
    - Hazardous / Utility
    - False Alarm / No Action
    - Other / Assistance
    """
    s = F.upper(F.trim(incident_classification_col))
    g = F.upper(F.trim(incident_group_col)) if incident_group_col is not None else None

    # ---------- 1) Medical ----------
    medical = s.rlike(r"^MEDICAL\s*-|^MEDICAL MFA\s*-|^MEDICAL MFA\b")

    # ---------- 2) False Alarm / No Action ----------
    # FDNY alarm system / sprinkler / defective/testing/unnecessary
    false_alarm = s.rlike(
        r"ALARM SYSTEM\s*-|"
        r"NON-MEDICAL 10-91\s*\(UNNECESSARY ALARM\)|"
        r"SPRINKLER SYSTEM\s*-|"
        r"DEFECTIVE OIL BURNER|"
        r"UNNECESSARY|TESTING|DEFECTIVE"
    )

    # ---------- 3) Hazardous / Utility ----------
    # Utilities + CO (investigation/incident/emergency) + odor (non-smoke) + steam
    hazardous_utility = s.rlike(
        r"UTILITY EMERGENCY\s*-|"
        r"CARBON MONOXIDE\s*-|"
        r"\bCO\b|"
        r"ODOR\s*- OTHER THAN SMOKE|"
        r"ODOR\s*- OTHER\b|"
        r"MANHOLE FIRE\s*-|"
        r"STEAM\b"
    )

    # ---------- 4) Rescue / Entrapment ----------
    rescue = s.rlike(
        r"ELEVATOR EMERGENCY\s*- OCCUPIED|"
        r"VEHICLE ACCIDENT\s*- WITH EXTRICATION|"
        r"MARITIME EMERGENCY|"
        r"REMOVE CIVILIAN\s*- NON-FIRE|"
        r"ASSIST CIVILIAN\s*- NON-MEDICAL"
    ) | s.rlike(r"NON-MEDICAL MFA\s*-")  # MFAs are typically assistance/rescue-type dispatches

    # ---------- 5) Fire – Structural ----------
    # Structural building fires (dwellings, commercial, public buildings, schools, etc.)
    fire_structural = s.rlike(
        r"PRIVATE DWELLING FIRE|"
        r"MULTIPLE DWELLING\s*'A'\s*-.*FIRE|"
        r"MULTIPLE DWELLING\s*'B'\s*FIRE|"
        r"OTHER COMMERCIAL BUILDING FIRE|"
        r"STORE FIRE|SCHOOL FIRE|HOSPITAL FIRE|CHURCH FIRE|FACTORY FIRE|"
        r"OTHER PUBLIC BUILDING FIRE|THEATER OR TV STUDIO FIRE|"
        r"UNDER CONSTRUCTION / VACANT FIRE|"
        r"MANHOLE FIRE\s*- EXTENDED TO BUILDING|"
        r"TRANSIT SYSTEM\s*-\s*STRUCTURAL"
    )

    # ---------- 6) Fire – Non-Structural ----------
    # Rubbish/brush/auto fires, compactor fire, manhole fire (non-building extension), transit nonstructural
    fire_nonstructural = s.rlike(
        r"BRUSH FIRE|"
        r"DEMOLITION DEBRIS OR RUBBISH FIRE|"
        r"AUTOMOBILE FIRE|"
        r"ABANDONED DERELICT VEHICLE FIRE|"
        r"OTHER TRANSPORTATION FIRE|"
        r"MULTIPLE DWELLING\s*'A'\s*-\s*FOOD ON THE STOVE FIRE|"
        r"MULTIPLE DWELLING\s*'A'\s*-\s*COMPACTOR FIRE|"
        r"MANHOLE FIRE\s*-(?!\s*EXTENDED TO BUILDING)|"
        r"TRANSIT SYSTEM\s*-\s*NONSTRUCTURAL|"
        r"UNDEFINED NONSTRUCTURAL FIRE"
    )

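    # ---------- Apply rules (order matters: the first matching branch wins) ----------
    # fire_nonstructural is checked before fire_structural because the broad
    # "MULTIPLE DWELLING 'A' - ... FIRE" structural pattern would otherwise capture
    # the food-on-the-stove and compactor labels listed as non-structural.
    # hazardous_utility is checked before both fire rules, so any "MANHOLE FIRE -" label
    # maps to Hazardous / Utility and never reaches the manhole patterns in the fire rules.
    mapped = (
        F.when(medical, F.lit("Medical"))
         .when(false_alarm, F.lit("False Alarm / No Action"))
         .when(hazardous_utility, F.lit("Hazardous / Utility"))
         .when(fire_nonstructural, F.lit("Fire – Non-Structural"))
         .when(fire_structural, F.lit("Fire – Structural"))
         .when(rescue, F.lit("Rescue / Entrapment"))
    )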

    # ---------- Fallback using group if available ----------
    if incident_group_col is not None:
        mapped = (
            mapped.otherwise(
                F.when(g == "MEDICAL EMERGENCIES", F.lit("Medical"))
                 .when(g == "MEDICAL MFAS", F.lit("Medical"))
                 .when(g == "STRUCTURAL FIRES", F.lit("Fire – Structural"))
                 .when(g == "NONSTRUCTURAL FIRES", F.lit("Fire – Non-Structural"))
                 .when(g == "NONMEDICAL MFAS", F.lit("Rescue / Entrapment"))
                 .otherwise(F.lit("Other / Assistance"))
            )
        )
    else:
        mapped = mapped.otherwise(F.lit("Other / Assistance"))

    return mapped
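
The non-structural rule relies on a negative lookahead, `(?!...)`, to exclude manhole fires that extended to a building. The sketch below checks that one pattern in isolation with Python's `re` module, which treats this construct the same way as the Java regex engine behind `rlike`. The constant name and the second label are hypothetical, chosen only to show a non-excluded case. Which category a label finally receives in the function also depends on the rules evaluated before this one.

```python
import re

# Matches "MANHOLE FIRE -" unless it is followed by "EXTENDED TO BUILDING"
NON_BUILDING_MANHOLE = re.compile(r"MANHOLE FIRE\s*-(?!\s*EXTENDED TO BUILDING)")

labels = [
    "MANHOLE FIRE - EXTENDED TO BUILDING",  # excluded by the lookahead
    "MANHOLE FIRE - OTHER",                 # hypothetical label, not excluded
]
for label in labels:
    print(label, "->", bool(NON_BUILDING_MANHOLE.search(label)))
```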


#### 2.2.2 Apply the Mapping

Toronto Data

In [0]:
# 2. Apply Mapping
toronto_gold = toronto_gold.withColumn(
    "incident_category",
    map_toronto_incident_category(F.col("Final_Incident_Type"))
)




Mapping Validation

In [0]:
# Validate Mapping for Toronto
display(
    toronto_gold
    .select(
        F.col("Final_Incident_Type").alias("raw_incident_type"),
        F.col("incident_category")
    )
    .groupBy("raw_incident_type", "incident_category")
    .count()
    .orderBy(F.desc("count"))
)


NYC Data

In [0]:
nyc_gold = nyc_gold.withColumn(
    "incident_category",
    map_nyc_incident_category(F.col("INCIDENT_CLASSIFICATION"))
)

In [0]:
# Validate Mapping for NYC
display(
    nyc_gold
    .select(
        F.col("INCIDENT_CLASSIFICATION").alias("raw_incident_type"),
        F.col("incident_category")
    )
    .groupBy("raw_incident_type", "incident_category")
    .count()
    .orderBy(F.desc("count"))
)
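
When reading the validation tables, the rows that deserve the closest look are the ones that ended up in `Other / Assistance`, since that category holds both deliberate assignments and labels that matched no rule. A minimal sketch, assuming the validation output has been collected into `(raw_label, category, count)` tuples; the helper name `fallback_labels` and the sample rows are hypothetical.

```python
def fallback_labels(rows, fallback="Other / Assistance"):
    """Return (raw_label, count) pairs that landed in the fallback category, largest first.

    rows: iterable of (raw_label, category, count) tuples.
    """
    hits = [(raw, n) for raw, cat, n in rows if cat == fallback]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical validation rows, for illustration only
rows = [
    ("LABEL A", "Medical", 120),
    ("LABEL B", "Other / Assistance", 15),
    ("LABEL C", "Other / Assistance", 40),
]
print(fallback_labels(rows))  # [('LABEL C', 40), ('LABEL B', 15)]
```

High-count labels near the top of this list are the best candidates for a new explicit rule, unless they were assigned to the category on purpose.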

Unified Categorization of Incident Types for Toronto and NYC Data 

In [0]:
# Label each city's rows, then count incidents per unified category
toronto_counts = (
    toronto_gold
    .withColumn("city", F.lit("Toronto"))
    .groupBy("city", "incident_category")
    .count()
)

nyc_counts = (
    nyc_gold
    .withColumn("city", F.lit("NYC"))
    .groupBy("city", "incident_category")
    .count()
)

combined_counts = toronto_counts.unionByName(nyc_counts)
category_city_table = (
    combined_counts
    .groupBy("incident_category")
    .pivot("city", ["Toronto", "NYC"])
    .agg(F.first("count"))
    .fillna(0)
    .orderBy("incident_category")
)

display(category_city_table)
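
Raw counts are hard to compare when the two cities differ in total incident volume, so it can help to convert each city's counts into within-city shares. The sketch below shows the arithmetic in plain Python. The helper `category_shares` and the numbers are hypothetical and are not results from this dataset.

```python
def category_shares(counts_by_city):
    """Convert {city: {category: count}} into {city: {category: share of that city's total}}."""
    shares = {}
    for city, counts in counts_by_city.items():
        total = sum(counts.values())
        shares[city] = {
            cat: (n / total if total else 0.0) for cat, n in counts.items()
        }
    return shares

# Hypothetical counts, for illustration only
example = {
    "Toronto": {"Medical": 50, "Other / Assistance": 50},
    "NYC": {"Medical": 300, "Other / Assistance": 100},
}
shares = category_shares(example)
print(shares["Toronto"]["Medical"], shares["NYC"]["Medical"])  # 0.5 0.75
```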