### Introduction

Fire department response times are a critical metric for the efficiency of emergency services, especially in large urban environments like New York City. The time it takes for fire departments to respond to an incident can significantly impact the severity of damage and the number of lives saved. Understanding the factors that influence these response times can help improve operational efficiency and allocate resources more effectively.

This project leverages the Fire Incident Dispatch Data from NYC Open Data, which contains detailed information about over 11 million fire incidents in New York City. The data includes timestamps for when incidents are created and closed, the type of incidents, the resources assigned, and the response times of the fire department. By analyzing this dataset, we aim to predict whether a fire incident's response time is valid (i.e., within an acceptable time frame). This prediction will help inform decisions on resource allocation and response optimization, ultimately improving the fire department's efficiency.

We will use machine learning classification techniques to predict whether the response time is valid, based on various features such as the location of the incident, the type of emergency, and the resources assigned. The dataset’s large size and variety of features make it a valuable resource for understanding patterns in response times across New York City.

### Data Description

The Fire Incident Dispatch Data from NYC Open Data tracks fire incidents across New York City, providing over 11.1 million rows of data across 29 columns. This dataset includes information about when and where an incident occurs, the type of emergency, and how long it took for the fire department to respond. For this analysis, we focused on the following relevant features, which directly affect the fire department's response time and the validity of that response.

Relevant Features:

1. INCIDENT_BOROUGH:

    Description: Indicates the borough where the fire incident occurred (e.g., Manhattan, Brooklyn, Queens, Bronx, Staten Island).

    Relevance: The borough where an incident takes place can significantly affect the response time, as factors such as traffic, proximity to fire stations, and the density of buildings influence how quickly emergency services can respond.

    Cleaning: This column was retained after removing redundant columns like ALARM_BOX_BOROUGH, which contained the same information.

2. CALL_TYPE:

    Description: The type of emergency call (e.g., fire, medical emergency, medical first aid).

    Relevance: The type of emergency determines the resources required and can influence how quickly those resources are dispatched. For example, medical emergencies may have a different urgency compared to fire-related incidents.

    Cleaning: This categorical feature was retained to understand how different types of calls impact response times.

3. RESPONSE_TIME:

    Description: The time taken by the fire department to respond to the incident (in seconds).

    Relevance: This is the quantity whose validity we want to predict. The classification target itself is VALID_INCIDENT_RSPNS_TIME_INDC (described next), which indicates whether this response time is valid (i.e., within the required time frame).

    Cleaning: This numerical feature was retained for modeling purposes.

4. VALID_INCIDENT_RSPNS_TIME_INDC:

    Description: A binary feature indicating whether the incident's response time was valid (1 for valid, 0 for invalid).

    Relevance: This is the target variable for our classification model. We aim to predict whether the response time meets the required standards for a valid response.

    Cleaning: This feature was mapped from categorical ('Y', 'N') values to binary (1, 0) values for classification purposes.

5. RESOURCE_ASSIGNMENT:

    Description: Describes the number and type of resources assigned to the incident (e.g., engines, trucks, ambulances).

    Relevance: The quantity and type of resources dispatched to an incident can directly affect the response time. More resources may lead to a faster and more efficient response.

    Cleaning: These features were retained as numerical variables for modeling.

Additional Features:

6. ALARM_SOURCE_DESCRIPTION_TX:

    Description: Indicates the source of the fire alarm (e.g., phone call, private fire alarm).

    Relevance: The source of the alarm can influence how quickly the fire department is alerted and how rapidly they can respond. Some sources may provide more accurate or urgent information, leading to quicker responses.

    Cleaning: This categorical feature was encoded numerically for use in machine learning models.

7. INCIDENT_CLASSIFICATION_GROUP:

    Description: Categorizes the type of incident (e.g., medical emergencies, non-medical emergencies, structural fires).

    Relevance: Different types of incidents may require different types of responses. This feature helps to segment the data by incident type and understand how these types affect response times.

    Cleaning: This feature was retained and encoded numerically for use in the classification model.

8. ENGINES_ASSIGNED_QUANTITY:

    Description: The number of fire engines assigned to the incident.

    Relevance: The number of engines dispatched can significantly influence the time it takes for the fire department to respond, especially for larger fires or incidents requiring more equipment.

    Cleaning: This numerical feature was retained for modeling.

9. LADDERS_ASSIGNED_QUANTITY:

    Description: The number of ladder trucks assigned to the incident.

    Relevance: Ladder trucks are used for accessing high-rise buildings or performing rescues. Their assignment can influence the efficiency of the response and the speed at which operations are carried out.

    Cleaning: This feature was retained as part of the resource assignment variables.

10. OTHER_UNITS_ASSIGNED_QUANTITY:

    Description: The number of other units (e.g., ambulances, rescue units) assigned to the incident.

    Relevance: The number of additional units assigned to an incident could affect the speed of response, especially if the incident involves multiple types of emergencies or requires a large number of responders.

    Cleaning: This feature was retained as part of the resource assignment data.

### Data Cleaning

The dataset required several cleaning steps to prepare it for analysis. These steps ensured that the dataset was relevant, complete, and ready for modeling.

To start, we imported the required dependencies and read the CSV file into a DataFrame.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
df = pd.read_csv(r"C:\Users\Jason\Downloads\Fire_Incident_Dispatch_Data_20250408.csv")

1. Removing Irrelevant Columns:

    Columns like ZIPCODE, POLICEPRECINCT, CITYCOUNCILDISTRICT, and others were removed because they did not contribute meaningful information to the prediction of response time validity.

In [None]:
# Dropping unnecessary columns that contained no useful information
df.drop(columns=['STARFIRE_INCIDENT_ID', 'ZIPCODE', 'POLICEPRECINCT', 'CITYCOUNCILDISTRICT', 'COMMUNITYDISTRICT', 'COMMUNITYSCHOOLDISTRICT', 'CONGRESSIONALDISTRICT', 'ALARM_BOX_NUMBER', 'ALARM_BOX_LOCATION', 'INCIDENT_CLASSIFICATION', 'ALARM_LEVEL_INDEX_DESCRIPTION', 'HIGHEST_ALARM_LEVEL', 'VALID_DISPATCH_RSPNS_TIME_INDC'], inplace=True)

2. Dropping DateTime Columns:

    Columns with datetime values, such as FIRST_ASSIGNMENT_DATETIME and FIRST_ACTIVATION_DATETIME, were dropped because they were not useful for predicting response time validity. The only datetime column kept was INCIDENT_DATETIME, which is converted to a categorical variable later and used as a feature.

In [None]:
# Dropping most DateTime columns; INCIDENT_DATETIME is kept and converted to a categorical variable later
df.drop(columns=['FIRST_ASSIGNMENT_DATETIME', 'FIRST_ACTIVATION_DATETIME', 'FIRST_ON_SCENE_DATETIME', 'INCIDENT_CLOSE_DATETIME'], inplace=True)

3. Handling Missing Data:

    Rows with missing values (about 4% of the dataset) were dropped to maintain data integrity, as this small amount of missing data would not significantly affect the analysis.

In [None]:
# Record the row count before dropping so the retained percentage can be computed
total_rows = df.shape[0]
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
# Rows with missing values account for only about 4% of the total, so dropping them loses little data.
df.dropna(inplace=True)
print(f"Percentage of remaining rows: {df.shape[0] / total_rows * 100:.2f}%")
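
As a small illustration of the check above, the share of rows that `dropna()` would remove can be measured before anything is dropped. This is only a sketch: the toy DataFrame, its values, and the helper name `missing_row_share` are assumptions made for the example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_row_share(frame: pd.DataFrame) -> float:
    """Fraction of rows with at least one missing value (hypothetical helper)."""
    return float(frame.isnull().any(axis=1).mean())

# Toy data standing in for the real dataset (assumed values)
toy = pd.DataFrame({
    "INCIDENT_BOROUGH": ["BRONX", "QUEENS", None, "BROOKLYN"],
    "INCIDENT_RESPONSE_SECONDS_QY": [250.0, np.nan, 310.0, 190.0],
})

share = missing_row_share(toy)
print(f"Rows with missing values: {share:.0%}")
```

Measuring the share per row rather than per column matters here because `dropna()` removes a row if any of its columns is missing.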

4. Mapping Categorical Variables:

    The VALID_INCIDENT_RSPNS_TIME_INDC column, which indicates whether the response time was valid, was mapped from categorical values ('Y', 'N') to binary (1, 0) for classification modeling.

In [None]:
# Remapping the values in VALID_INCIDENT_RSPNS_TIME_INDC
df['VALID_INCIDENT_RSPNS_TIME_INDC'] = df['VALID_INCIDENT_RSPNS_TIME_INDC'].map({'Y': 1, 'N': 0})
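
One thing worth checking after a dictionary-based `map` is that no value fell outside the mapping, since pandas turns unmapped values into NaN rather than raising an error. The sketch below uses an assumed toy Series with a deliberately unexpected lowercase `'y'`; it is not taken from the original notebook.

```python
import pandas as pd

# Assumed sample values; the lowercase 'y' is intentionally not in the mapping
flags = pd.Series(["Y", "N", "Y", "y"])
mapped = flags.map({"Y": 1, "N": 0})

# Values that were not covered by the mapping show up as NaN after map()
unmapped = flags[mapped.isna()].unique().tolist()
print(unmapped)
```

If the list is empty, every value was mapped and the column can safely be used as a binary target.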

5. Combining Redundant Columns:

    The ALARM_BOX_BOROUGH and INCIDENT_BOROUGH columns were found to be identical, so ALARM_BOX_BOROUGH was removed to avoid redundancy.

In [None]:
# Checking whether ALARM_BOX_BOROUGH and INCIDENT_BOROUGH are equal enough to be treated as a single column; they are completely equal, so ALARM_BOX_BOROUGH is dropped
print((df['ALARM_BOX_BOROUGH'] == df['INCIDENT_BOROUGH']).value_counts())
df.drop(columns=['ALARM_BOX_BOROUGH'], inplace=True)

#True    9866972

6. Converting the INCIDENT_DATETIME Column to a Categorical Variable:

    To better understand how the time of an incident affects the validity of the response time, the INCIDENT_DATETIME column was converted into a categorical variable representing the 4-hour period in which the incident falls. The hour was first extracted from the datetime, and the number of incidents in each hour was counted. These hourly counts were then used to choose where the 4-hour groups start: of the candidate groupings, the one whose group totals have the lowest standard deviation is kept, so that the six groups contain similar numbers of incidents.


In [None]:
# Convert string to datetime
df['INCIDENT_DATETIME'] = pd.to_datetime(df['INCIDENT_DATETIME'], format="%m/%d/%Y %I:%M:%S %p")

# Extract hour in 24-hour format
df['INCIDENT_HOUR'] = df['INCIDENT_DATETIME'].dt.hour
df.drop(columns=['INCIDENT_DATETIME'], inplace=True)

# Function to convert 24-hour format to AM/PM format for labels
def hour_to_ampm(h):
    suffix = "AM" if h < 12 or h == 24 else "PM"
    hour12 = h % 12
    if hour12 == 0:
        hour12 = 12
    return f"{hour12} {suffix}"

def assign_optimal_hour_groups(df, hour_col='INCIDENT_HOUR', group_col='HOUR GROUP'):

    # Function to create the hour groupings that will be checked
    def get_hour_groups(start):
        hours = [(start + i) % 24 for i in range(24)]
        return [hours[i:i+4] for i in range(0, 24, 4)]

    # Precompute value counts of each hour
    hour_counts = df[hour_col].value_counts().reindex(range(24), fill_value=0)

    best_std = np.inf
    best_groups = None

    # Checking every start offset to find the grouping whose group totals have the lowest standard deviation
    for start in range(24):
        groups = get_hour_groups(start)
        group_totals = [hour_counts[group].sum() for group in groups]
        std = np.std(group_totals, ddof=0)
        if std < best_std:
            best_std = std
            best_groups = groups

    # Create mapping from hour to group index + label
    hour_to_group = {}
    hour_to_label = {}

    # Creating and assigning the labels for each group
    for i, group in enumerate(best_groups):
        for hour in group:
            hour_to_group[hour] = i
            start_hour = group[0]
            end_hour = group[-1]
            label = f"{hour_to_ampm(start_hour)} - {hour_to_ampm((end_hour + 1) % 24)}"
            hour_to_label[hour] = label

    df[group_col] = df[hour_col].map(hour_to_group)
    df[f"{group_col} LABEL"] = df[hour_col].map(hour_to_label)

    print(f"Best grouping: {best_groups} with Std Dev: {best_std:.2f}")

    return df, best_groups

df, best_groups = assign_optimal_hour_groups(df)
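
To make the grouping step more concrete, the block index for a given hour and start offset can be written as a single arithmetic expression. The sketch below is an illustration only, and the helper names `hour_group` and `partition` are hypothetical. It also shows that, because the blocks are 4 hours wide, start offsets that differ by a multiple of 4 produce the same partition of the day, so the 24 offsets tried in the loop above correspond to 4 distinct groupings.

```python
def hour_group(hour: int, start: int, width: int = 4) -> int:
    """Index of the width-hour block containing `hour`, with blocks starting at `start`."""
    return ((hour - start) % 24) // width

def partition(start: int, width: int = 4) -> frozenset:
    """The set of hour blocks produced by a given start offset."""
    blocks = {}
    for h in range(24):
        blocks.setdefault(hour_group(h, start, width), []).append(h)
    return frozenset(frozenset(b) for b in blocks.values())

# Collect the distinct partitions across all 24 possible start offsets
distinct = {partition(s) for s in range(24)}

print(hour_group(23, start=2))  # 11 PM falls in the last block when blocks start at 2 AM
print(len(distinct))
```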

7. Target Encoding for Categorical Variables:

    For the categorical features INCIDENT_BOROUGH, ALARM_SOURCE_DESCRIPTION_TX, INCIDENT_CLASSIFICATION_GROUP, and HOUR GROUP LABEL, target encoding was applied to convert them into numerical formats suitable for modeling. Any category with fewer than 5,000 occurrences (found only in ALARM_SOURCE_DESCRIPTION_TX) was dropped, because such categories made up an insignificant share of the dataset and would not be target encoded as accurately; every remaining category has a count over 5,000. Stratified K-fold cross-validation was used in the encoding function, so each row is encoded with statistics computed from the other folds, which helps prevent overfitting.
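
The filtering of rare categories described here is not shown in the notebook cells, so the following is only one possible way it might be done. The helper name `drop_rare_categories`, the toy data, and the small `min_count` used in the demonstration are assumptions for illustration.

```python
import pandas as pd

def drop_rare_categories(frame: pd.DataFrame, col: str, min_count: int = 5000) -> pd.DataFrame:
    """Keep only rows whose value in `col` occurs at least `min_count` times (hypothetical helper)."""
    # Map each row's category to how often that category occurs in the column
    counts = frame[col].map(frame[col].value_counts())
    return frame[counts >= min_count]

# Toy data (assumed values); a small threshold is used so the effect is visible
toy = pd.DataFrame({
    "ALARM_SOURCE_DESCRIPTION_TX": ["PHONE", "PHONE", "EMS", "EMS", "EMS", "RARE"],
})
filtered = drop_rare_categories(toy, "ALARM_SOURCE_DESCRIPTION_TX", min_count=2)
print(filtered["ALARM_SOURCE_DESCRIPTION_TX"].value_counts().to_dict())
```

On the real dataset the same call would be made with the default threshold of 5,000 before target encoding is applied.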

In [None]:
# Target Encoding for Categorical Variables using Stratified K-Folds

def target_encode(df, col, target='VALID_INCIDENT_RSPNS_TIME_INDC', n_splits=10, alpha=10):

    df = df.copy()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    encoded = pd.Series(index=df.index, dtype=float)
    
    global_mean = df[target].mean()

    for train_idx, val_idx in skf.split(df, df[target]):
        train, val = df.iloc[train_idx], df.iloc[val_idx]

        # Compute smoothed means
        category_stats = train.groupby(col)[target].agg(['mean', 'count'])
        smooth = (category_stats['count'] * category_stats['mean'] + alpha * global_mean) / (category_stats['count'] + alpha)

        # Map to validation fold
        encoded.iloc[val_idx] = val[col].map(smooth).fillna(global_mean)

    return encoded

# Target Encoding for categorical variables
df['Borough num'] = target_encode(df, 'INCIDENT_BOROUGH')
df['Alarm Source num'] = target_encode(df, 'ALARM_SOURCE_DESCRIPTION_TX')
df['Incident Classification num'] = target_encode(df, 'INCIDENT_CLASSIFICATION_GROUP')
df['Time num'] = target_encode(df, 'HOUR GROUP LABEL')
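
The smoothing formula used inside `target_encode` can be checked by hand on a couple of numbers. The sketch below repeats that formula in a standalone helper, `smoothed_mean` (a hypothetical name), with assumed inputs: a global mean of 0.5 and two categories that both have a raw mean of 1.0 but very different counts.

```python
def smoothed_mean(count: int, mean: float, global_mean: float, alpha: float = 10) -> float:
    """Blend a category mean with the global mean, weighted by the category's count."""
    return (count * mean + alpha * global_mean) / (count + alpha)

# A category seen 3 times is pulled strongly toward the global mean
rare = smoothed_mean(count=3, mean=1.0, global_mean=0.5)

# A category seen 100,000 times keeps almost exactly its own mean
common = smoothed_mean(count=100_000, mean=1.0, global_mean=0.5)

print(round(rare, 3), round(common, 3))
```

With `alpha=10`, the rare category ends up at (3 + 5) / 13, while the common one is essentially unchanged, which is why smoothing mainly matters for low-count categories.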

8. Renaming and Reordering Columns:

    To make the columns more readable and easier to work with, we renamed the columns with more intuitive names and reordered them.

In [None]:
# Renaming columns to make them easier to read and work with
df.rename(columns={'INCIDENT_BOROUGH': 'Borough', 'ALARM_SOURCE_DESCRIPTION_TX': 'Alarm Source', 'INCIDENT_CLASSIFICATION_GROUP': 'Incident Classification', 'DISPATCH_RESPONSE_SECONDS_QY': 'Dispatch Response Time', 'VALID_INCIDENT_RSPNS_TIME_INDC': 'Valid Response Time', 'INCIDENT_RESPONSE_SECONDS_QY': 'Incident Response Time', 'INCIDENT_TRAVEL_TM_SECONDS_QY': 'Incident Travel Time', 'ENGINES_ASSIGNED_QUANTITY': 'Engines Assigned', 'LADDERS_ASSIGNED_QUANTITY': 'Ladders Assigned', 'OTHER_UNITS_ASSIGNED_QUANTITY': 'Other Units Assigned', 'HOUR GROUP LABEL' : 'Time'}, inplace=True)
order = ['Borough', 'Borough num', 'Time', 'Time num',  'Alarm Source', 'Alarm Source num', 'Incident Classification', 'Incident Classification num', 'Dispatch Response Time', 'Incident Response Time', 'Incident Travel Time', 'Engines Assigned', 'Ladders Assigned', 'Other Units Assigned', 'Valid Response Time']
df = df[order]

After completing the data cleaning steps, the dataset was reduced to 10 relevant features, all of which contribute to predicting the validity of fire department response times. The dataset is now prepared for Exploratory Data Analysis (EDA) and model development.

In [None]:
#Cleaned Dataset Size
print(df.shape)
df.head()

#(9861256, 10)