# An Analysis of the City of Toronto’s Fire Response from 2011 to 2019
## by The City of Toronto Department of Pandas

The goal of this project is to create a predictive model capable of determining response time given inputs such as day, time, or location. Because exploratory data analysis is iterative, we will not know exactly which features the model will use, or which type of prediction model is appropriate (e.g., linear regression, logistic regression, or another suitable approach), until that analysis has been conducted. However, an example of our anticipated model can be seen below.

###  Objectives
The questions we hope to answer with our exploratory data analysis are: 
- Are there temporal trends in factors such as response time, damages, or other relevant factors? 

- Identify the characteristics of fires with extensive damage/casualties. What factors make a fire most/least likely to cause extensive damage/casualties? 

- Analyze and intersect our datasets to find the closest TFS station to each fire. Stations that frequently fail to respond in a timely manner to nearby fires may require more resources from the city. 

- Identify the common fire incidents and causes in each ward. 

- Optional: Is there a correlation between the resources TFS deploys (number of personnel, vehicles, etc.) during incidents and damages (such as casualties/financial)?  

- Optional: Overlay incidents with socio-economic profiles of each neighborhood and analyze whether TFS services are equally and equitably distributed across the city. Are TFS services biased when responding to certain incidents or wards of the city?

## Setup Notebook

In [1]:
pip install geopandas

Note: you may need to restart the kernel to use updated packages.


In [127]:
# Import standard library modules
import os
import warnings

# Import 3rd party libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
from scipy.stats import chi2_contingency

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
warnings.filterwarnings('ignore')  # suppress warnings to keep cell output short

# Overview

The primary dataset for this project contains information about Toronto’s fire incidents (https://open.toronto.ca/dataset/fire-incidents/). The data is a .csv file from the City of Toronto’s open data portal with details of over 17,000 fire incidents between January 1, 2011, and June 30, 2019. The dataset contains 43 columns that describe the geographic location of the fire, the TFS response, the impact of the fire, the suspected cause, and details of the site (e.g., building condition and presence of sprinklers or alarms). Many columns contain categorical variables, string values, and null values, so extensive data cleaning and wrangling are needed, along with feature engineering techniques such as categorical encoding, datetime extraction, and imputation. The appendix shows the data types and categories of each column in the dataset.

The rest of the datasets described in this paragraph are secondary datasets that will only be used if time allows. The second dataset is the City of Toronto Wards data which includes spatial boundaries that we can overlay with the fire incident data for visualization purposes (https://open.toronto.ca/dataset/city-wards/). The third dataset is a .GeoJSON file with Toronto Fire station locations (https://open.toronto.ca/dataset/fire-station-locations/). Finally, if time allows, we will use a fourth dataset of Toronto’s neighbourhoods which includes information on variables such as income, population, household type, and education (https://open.toronto.ca/dataset/neighbourhood-profiles/).

# Import Data
Let's import the training datasets.

### Toronto Fire Incidents

In [3]:
fire_data = pd.read_csv('Fire Incidents Data.csv')

In [4]:
fire_data.head(10)

Unnamed: 0,_id,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Dollar_Loss,Estimated_Number_Of_Persons_Displaced,Exposures,Ext_agent_app_or_defer_time,...,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Alarm_Time,TFS_Arrival_Time,TFS_Firefighter_Casualties
0,578689,81 - Engine Area,,,0,0,15000.0,,,2018-02-24T21:12:00,...,,,,,,,"7 - Fully involved (total structure, vehicle, ...",2018-02-24T21:04:29,2018-02-24T21:10:11,0
1,578690,"75 - Trash, rubbish area (outside)",,,0,0,50.0,,,2018-02-24T21:29:42,...,,,,,,,2 - Fire with no evidence from street,2018-02-24T21:24:43,2018-02-24T21:29:31,0
2,578691,,,,0,0,,,,,...,,,,,,,,2018-02-25T13:29:59,2018-02-25T13:36:49,0
3,578692,"75 - Trash, rubbish area (outside)",01 - Normal (no change),1 - No business interruption,0,0,0.0,0.0,,2018-02-25T14:19:25,...,98 - Not applicable: Alarm operated OR presenc...,9 - Type undetermined,"8 - Not applicable: No alarm, no persons present",99 - Undetermined,8 - Not applicable - no sprinkler system present,9 - Undetermined,3 - Fire with smoke showing only - including v...,2018-02-25T14:13:39,2018-02-25T14:18:07,0
4,578693,,,,0,0,,,,,...,,,,,,,,2018-02-25T18:20:43,2018-02-25T18:26:19,0
5,578694,81 - Engine Area,,,0,0,1500.0,,,2018-02-25T18:38:00,...,,,,,,,4 - Flames showing from small area (one storey...,2018-02-25T18:31:19,2018-02-25T18:35:17,0
6,578695,22 - Sleeping Area or Bedroom (inc. patients r...,01 - Normal (no change),1 - No business interruption,0,0,2000.0,0.0,,2018-02-26T18:28:00,...,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,7 - Not applicable: Occupant(s) first alerted ...,2 - Confined to part of room/area of origin,8 - Not applicable - no sprinkler system present,3 - No sprinkler system,2 - Fire with no evidence from street,2018-02-26T18:18:55,2018-02-26T18:24:47,0
7,578696,55 - Mechanical/Electrical Services Room,01 - Normal (no change),2 - May resume operations within a week,0,0,100000.0,0.0,,2018-02-27T10:57:32,...,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,2 - Some persons (at risk) self evacuated as a...,"7 - Spread to other floors, confined to building",8 - Not applicable - no sprinkler system present,3 - No sprinkler system,2 - Fire with no evidence from street,2018-02-27T10:28:12,2018-02-27T10:35:13,0
8,578697,28 - Office,01 - Normal (no change),1 - No business interruption,0,0,5000.0,0.0,,2018-02-25T15:57:00,...,98 - Not applicable: Alarm operated OR presenc...,2 - Hardwired (standalone),1 - All persons (at risk of injury) self evacu...,"4 - Spread beyond room of origin, same floor",3 - Did not activate: fire too small to trigge...,1 - Full sprinkler system present,4 - Flames showing from small area (one storey...,2018-02-25T15:48:34,2018-02-25T15:52:04,0
9,578698,,,,0,0,,,,,...,,,,,,,,2018-02-26T15:32:11,2018-02-26T15:37:40,0


### Toronto Wards

For now, I have commented out the map plot until we need it again.

In [5]:
# Import the ward boundaries shapefile
ward = gpd.read_file('25-ward-model-december-2018-wgs84-latitude-longitude')

# View GeoDataFrame
ward.head()
# ward.plot(figsize=(15, 8), edgecolor='w', alpha=0.75);

Unnamed: 0,AREA_ID,AREA_TYPE,AREA_S_CD,AREA_L_CD,AREA_NAME,X,Y,LONGITUDE,LATITUDE,geometry
0,2551040,WD18,16,16,Don Valley East,318237.29,4844000.0,-79.33298,43.739716,"POLYGON ((-79.31335 43.71699, -79.31950 43.715..."
1,2551044,WD18,3,3,Etobicoke-Lakeshore,303099.474,4831000.0,-79.52087,43.621646,"POLYGON ((-79.49777 43.65198, -79.49725 43.651..."
2,2551048,WD18,15,15,Don Valley West,314825.876,4843000.0,-79.37536,43.728396,"POLYGON ((-79.35232 43.71573, -79.35209 43.715..."
3,2551052,WD18,23,23,Scarborough North,324522.149,4852000.0,-79.25467,43.809672,"POLYGON ((-79.22591 43.83960, -79.22556 43.839..."
4,2551056,WD18,11,11,University-Rosedale,313306.543,4837000.0,-79.39432,43.671139,"POLYGON ((-79.39004 43.69050, -79.39004 43.690..."


# Data Cleaning

In [6]:
fire_data.head()

Unnamed: 0,_id,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Dollar_Loss,Estimated_Number_Of_Persons_Displaced,Exposures,Ext_agent_app_or_defer_time,...,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Alarm_Time,TFS_Arrival_Time,TFS_Firefighter_Casualties
0,578689,81 - Engine Area,,,0,0,15000.0,,,2018-02-24T21:12:00,...,,,,,,,"7 - Fully involved (total structure, vehicle, ...",2018-02-24T21:04:29,2018-02-24T21:10:11,0
1,578690,"75 - Trash, rubbish area (outside)",,,0,0,50.0,,,2018-02-24T21:29:42,...,,,,,,,2 - Fire with no evidence from street,2018-02-24T21:24:43,2018-02-24T21:29:31,0
2,578691,,,,0,0,,,,,...,,,,,,,,2018-02-25T13:29:59,2018-02-25T13:36:49,0
3,578692,"75 - Trash, rubbish area (outside)",01 - Normal (no change),1 - No business interruption,0,0,0.0,0.0,,2018-02-25T14:19:25,...,98 - Not applicable: Alarm operated OR presenc...,9 - Type undetermined,"8 - Not applicable: No alarm, no persons present",99 - Undetermined,8 - Not applicable - no sprinkler system present,9 - Undetermined,3 - Fire with smoke showing only - including v...,2018-02-25T14:13:39,2018-02-25T14:18:07,0
4,578693,,,,0,0,,,,,...,,,,,,,,2018-02-25T18:20:43,2018-02-25T18:26:19,0


## Irrelevant Data
First, we can eliminate the columns that we know for sure we do not need. At the moment, these are the "Incident_Number" and "Exposures" columns: "Incident_Number" because it is essentially equivalent to "_id", and "Exposures" because 98% of the data in this column are null.

In [7]:
fire_data = fire_data.drop(['Exposures', 'Incident_Number'], axis = 1)
fire_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17536 entries, 0 to 17535
Data columns (total 41 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   _id                                                            17536 non-null  int64  
 1   Area_of_Origin                                                 15623 non-null  object 
 2   Building_Status                                                11216 non-null  object 
 3   Business_Impact                                                11214 non-null  object 
 4   Civilian_Casualties                                            17536 non-null  int64  
 5   Count_of_Persons_Rescued                                       17536 non-null  int64  
 6   Estimated_Dollar_Loss                                          15627 non-null  float64
 7   Estimated_Number_Of_Persons_Displaced                     

## Missing Data
Next, it is important to understand the number of null values and how to approach eliminating or changing those values. The columns fall into approximately five categories by number of null values.

First, we have columns missing about 36% of their rows (n ≈ 6,300). Most of these columns pertain to fire alarms, smoke alarms, or sprinklers.

Second, we have columns missing about 10% of their rows (n ≈ 1,900).

Third, we have one column, "Incident_Ward", missing 85 rows of data.

Fourth, we have four columns each missing 1 value: "Intersection", "Longitude", "Latitude", and "Property_Use".

Fifth, we have 12 columns with zero null values.
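
The grouping described above can also be produced programmatically. The following is a minimal sketch on a small stand-in frame (the `toy` DataFrame and its column names are hypothetical, not the real data): it counts the nulls in each column and then lists which columns share each null count.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fire_data; the real frame has 41 columns.
toy = pd.DataFrame({
    "alarm": [np.nan, np.nan, "1 - present", "2 - none"],
    "sprinkler": [np.nan, np.nan, "3 - none", "9 - undetermined"],
    "ward": [1.0, np.nan, 3.0, 4.0],
    "_id": [10, 11, 12, 13],
})

# Number of nulls in each column.
null_counts = toy.isnull().sum()

# Map each distinct null count to the columns that share it.
groups = {}
for col, n in null_counts.items():
    groups.setdefault(int(n), []).append(col)

# Share of rows missing in each column, as a percentage.
null_pct = (null_counts / len(toy) * 100).round(1)

print(groups)
print(null_pct)
```

Columns that land in the same group are likely to be missing for the same reason, which is what motivates treating them together.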

In [8]:
fire_data.isnull().sum().sort_values(ascending = False)

Sprinkler_System_Presence                                        6322
Fire_Alarm_System_Impact_on_Evacuation                           6322
Smoke_Alarm_at_Fire_Origin                                       6322
Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation    6322
Smoke_Spread                                                     6322
Level_Of_Origin                                                  6322
Sprinkler_System_Operation                                       6322
Smoke_Alarm_at_Fire_Origin_Alarm_Type                            6322
Fire_Alarm_System_Presence                                       6322
Fire_Alarm_System_Operation                                      6322
Smoke_Alarm_at_Fire_Origin_Alarm_Failure                         6322
Business_Impact                                                  6322
Extent_Of_Fire                                                   6322
Estimated_Number_Of_Persons_Displaced                            6321
Building_Status     

In [9]:
# Find the rows with the 85 null Incident_Ward values. Is there a way for us to fill these? Why are they null? A lot seem to include "Steeles Ave".
fire_data[fire_data['Incident_Ward'].isnull()]

Unnamed: 0,_id,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Dollar_Loss,Estimated_Number_Of_Persons_Displaced,Ext_agent_app_or_defer_time,Extent_Of_Fire,...,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Alarm_Time,TFS_Arrival_Time,TFS_Firefighter_Casualties
211,578900,"44 - Trash, Rubbish Storage (inc garbage chute...",,,0,0,0.0,,2018-01-20T06:14:00,,...,,,,,,,3 - Fire with smoke showing only - including v...,2018-01-20T06:06:46,2018-01-20T06:13:44,0
214,578903,,,,0,0,,,,,...,,,,,,,,2018-03-04T08:59:06,2018-03-04T09:05:54,0
565,579254,83 - Electrical Systems,,,0,0,20000.0,,2018-02-12T09:10:50,,...,,,,,,,3 - Fire with smoke showing only - including v...,2018-02-12T09:01:02,2018-02-12T09:07:01,0
1417,580106,42 - Garage,01 - Normal (no change),9 - Undetermined,0,0,0.0,0.0,2019-01-25T18:57:07,1 - Confined to object of origin,...,98 - Not applicable: Alarm operated OR presenc...,8 - Not applicable - no smoke alarm or presenc...,"8 - Not applicable: No alarm, no persons present",5 - Multi unit bldg: spread beyond suite of or...,1 - Sprinkler system activated,1 - Full sprinkler system present,1 - Fire extinguished prior to arrival,2019-01-25T18:55:30,2019-01-25T18:56:07,0
2274,580963,,,,0,0,,,,,...,,,,,,,,2019-04-21T22:16:18,2019-04-21T22:22:22,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16858,595547,"31 - Process Manufacturing (inc manf, prod ass...",08 - Not Applicable,1 - No business interruption,0,0,5000.0,0.0,2017-08-21T17:32:21,1 - Confined to object of origin,...,98 - Not applicable: Alarm operated OR presenc...,9 - Type undetermined,9 - Undetermined,"4 - Spread beyond room of origin, same floor",9 - Activation/operation undetermined,1 - Full sprinkler system present,4 - Flames showing from small area (one storey...,2017-08-21T17:17:01,2017-08-21T17:23:08,0
16925,595614,29 - Electronic Equipment,01 - Normal (no change),2 - May resume operations within a week,0,0,1000.0,0.0,2017-11-02T00:24:30,2 - Confined to part of room/area of origin,...,4 - Remote from fire – smoke did not reach alarm,9 - Type undetermined,"8 - Not applicable: No alarm, no persons present","4 - Spread beyond room of origin, same floor",9 - Activation/operation undetermined,9 - Undetermined,3 - Fire with smoke showing only - including v...,2017-11-02T00:15:43,2017-11-02T00:22:44,0
17034,595723,24 - Cooking Area or Kitchen,01 - Normal (no change),8 - Not applicable (not a business),1,0,500.0,0.0,2017-12-01T17:50:59,2 - Confined to part of room/area of origin,...,98 - Not applicable: Alarm operated OR presenc...,4 - Interconnected,1 - All persons (at risk of injury) self evacu...,"4 - Spread beyond room of origin, same floor",8 - Not applicable - no sprinkler system present,3 - No sprinkler system,1 - Fire extinguished prior to arrival,2017-12-01T17:40:21,2017-12-01T17:46:49,0
17203,595892,24 - Cooking Area or Kitchen,01 - Normal (no change),8 - Not applicable (not a business),1,0,4000.0,0.0,2014-05-26T17:33:27,2 - Confined to part of room/area of origin,...,98 - Not applicable: Alarm operated OR presenc...,2 - Hardwired (standalone),3 - No one (at risk) evacuated as a result of ...,"4 - Spread beyond room of origin, same floor",9 - Activation/operation undetermined,9 - Undetermined,3 - Fire with smoke showing only - including v...,2014-05-26T17:22:50,2014-05-26T17:26:59,0
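
To check the impression that many of the null-ward rows involve "Steeles Ave", one option is to count how many of them mention it in the "Intersection" column. This is only a sketch: the `toy` rows below are made up for illustration, and the real intersection strings may be formatted differently.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for fire_data; values are illustrative only.
toy = pd.DataFrame({
    "Incident_Ward": [np.nan, 12.0, np.nan, np.nan],
    "Intersection": ["Steeles Ave E / Kennedy Rd", "Queen St W / Bay St",
                     "Steeles Ave W / Jane St", None],
})

missing_ward = toy[toy["Incident_Ward"].isnull()]

# na=False treats a missing Intersection as "no match" instead of propagating NaN.
on_steeles = missing_ward["Intersection"].str.contains("Steeles", case=False, na=False)

print(f"{on_steeles.sum()} of {len(missing_ward)} null-ward rows mention Steeles")
```

If most of the 85 rows match, the missing wards may simply reflect incidents on the city boundary rather than a data entry problem.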


In [10]:
# Find the single row where Latitude, Longitude, and Intersection are null
fire_data[fire_data['Latitude'].isnull()]

Unnamed: 0,_id,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Dollar_Loss,Estimated_Number_Of_Persons_Displaced,Ext_agent_app_or_defer_time,Extent_Of_Fire,...,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Alarm_Time,TFS_Arrival_Time,TFS_Firefighter_Casualties
17500,596189,81 - Engine Area,,,0,0,3000.0,,2011-09-26T19:03:00,,...,,,,,,,3 - Fire with smoke showing only - including v...,2011-09-26T18:55:12,2011-09-26T19:01:00,0


In [None]:
# Find the single row where Property_Use is null
fire_data[fire_data['Property_Use'].isnull()]

Unnamed: 0,_id,Area_of_Origin,Building_Status,Business_Impact,Civilian_Casualties,Count_of_Persons_Rescued,Estimated_Dollar_Loss,Estimated_Number_Of_Persons_Displaced,Ext_agent_app_or_defer_time,Extent_Of_Fire,...,Smoke_Alarm_at_Fire_Origin_Alarm_Failure,Smoke_Alarm_at_Fire_Origin_Alarm_Type,Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation,Smoke_Spread,Sprinkler_System_Operation,Sprinkler_System_Presence,Status_of_Fire_On_Arrival,TFS_Alarm_Time,TFS_Arrival_Time,TFS_Firefighter_Casualties
17276,595965,79 - Other Outside Area,,,0,0,,,2013-04-17T18:37:00,,...,,,,,,,3 - Fire with smoke showing only - including v...,2013-04-17T18:29:19,2013-04-17T18:35:26,0


## Category 1 Data: Columns with ~36% of Values Missing

Select columns with a high proportion (~36%) of null values. Most of these features have the exact same number of null values and relate to the presence of a fire alarm.

In [71]:
# Collect the columns with more than 6000 null values
category_1 = []
for column in fire_data.columns:
    if fire_data[column].isnull().sum() > 6000:
        category_1.append(column)
        # print(column + ' Unique Values: ' + str(fire_data[column].unique()))
        
category_1

['Building_Status',
 'Business_Impact',
 'Estimated_Number_Of_Persons_Displaced',
 'Extent_Of_Fire',
 'Fire_Alarm_System_Impact_on_Evacuation',
 'Fire_Alarm_System_Operation',
 'Fire_Alarm_System_Presence',
 'Level_Of_Origin',
 'Smoke_Alarm_at_Fire_Origin',
 'Smoke_Alarm_at_Fire_Origin_Alarm_Failure',
 'Smoke_Alarm_at_Fire_Origin_Alarm_Type',
 'Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation',
 'Smoke_Spread',
 'Sprinkler_System_Operation',
 'Sprinkler_System_Presence']

The following features relate to fire alarms, smoke alarms, smoke spread, or sprinklers:
 ['Fire_Alarm_System_Impact_on_Evacuation',
 'Fire_Alarm_System_Operation',
 'Fire_Alarm_System_Presence',
 'Smoke_Alarm_at_Fire_Origin',
 'Smoke_Alarm_at_Fire_Origin_Alarm_Failure',
 'Smoke_Alarm_at_Fire_Origin_Alarm_Type',
 'Smoke_Alarm_Impact_on_Persons_Evacuating_Impact_on_Evacuation',
 'Smoke_Spread',
 'Sprinkler_System_Operation',
 'Sprinkler_System_Presence']
 
We care about all fires, not just ones where a smoke alarm was present. There are four categories under 'Fire_Alarm_System_Presence':

In [61]:
fire_data['Fire_Alarm_System_Presence'].unique()

array([nan, '9 - Undetermined',
       '8 - Not applicable (bldg not classified by OBC OR detached/semi/town home)',
       '1 -  Fire alarm system present', '2 - No Fire alarm system'],
      dtype=object)

Since the column already has categories for whether or not a fire alarm system is present, we can't just assume that a blank row means there was no alarm. Instead, we can decide to use code '10 - No ' to indicate we do not know. There might be another feature that explains why these rows are empty. Let's start by looking at all the other features in rows where Fire_Alarm_System_Presence is null:

In [135]:
# Initialize a new column indicating whether Fire_Alarm_System_Presence is null
fire_data['null_fire_alarm_system'] = fire_data['Fire_Alarm_System_Presence'].isnull()

We can check whether the other columns are dependent on null_fire_alarm_system using a chi-squared test. If the p-value is below 0.05, the column may be associated with the missing values, and therefore might help us explain what's going on with all this missing data.

In [138]:
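cols_related_fire_alarm = []

for column in fire_data.columns:
    # Cross-tabulate the null indicator against the column (margins=True adds an 'All' row and column)
    chisqt = pd.crosstab(fire_data['null_fire_alarm_system'], fire_data[column], margins=True)
    # Keep the False and True rows, limited to the first five columns of the table
    value = np.array([chisqt.iloc[0][0:5].values,
                      chisqt.iloc[1][0:5].values])
    result = chi2_contingency(value)  # (statistic, p-value, degrees of freedom, expected)
    p_score = result[1]
    if p_score < 0.05:
        cols_related_fire_alarm.append(column)
        print(column)
        print(result[0:3])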

Area_of_Origin
(15.56932941553631, 0.003655007428709224, 4)
Civilian_Casualties
(457.05725131923896, 1.2945408976018708e-97, 4)
Count_of_Persons_Rescued
(146.77558742721197, 9.990513918189449e-31, 4)
Final_Incident_Type
(3813.7225889754536, 0.0, 3)
Ignition_Source
(15.633912324234906, 0.0035518756674518936, 4)
Incident_Station_Area
(10.860166400008136, 0.028181325900777585, 4)
Incident_Ward
(34.13775262335736, 6.982522608085054e-07, 4)
Initial_CAD_Event_Type
(34.03175673208996, 7.340968523904089e-07, 4)
Latitude
(11.822222222222221, 0.01872340958246071, 4)
Method_Of_Fire_Control
(1147.9237551397152, 3.0985806084407864e-247, 4)
Number_of_responding_apparatus
(3766.500736266901, 0.0, 4)
Number_of_responding_personnel
(19.16567036364711, 0.0007291713426130122, 4)
Status_of_Fire_On_Arrival
(1693.9377367726977, 0.0, 4)
TFS_Firefighter_Casualties
(77.25712215303466, 6.634945866593336e-16, 4)
null_fire_alarm_system
(17536.0, 0.0, 2)
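
Note that the loop above passes only the first five columns of each crosstab to the test, and for columns with fewer than five categories this includes the 'All' margin. One possible refinement, sketched here on made-up data, is to build the crosstab without margins and pass the full table to `chi2_contingency`. The `toy` frame and its column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: whether the alarm field is null, and a coarse incident type.
toy = pd.DataFrame({
    "null_alarm": [True] * 40 + [False] * 60,
    "incident": ["vehicle"] * 30 + ["building"] * 10
                + ["vehicle"] * 10 + ["building"] * 50,
})

# Full contingency table without margins, so every category is counted exactly once.
table = pd.crosstab(toy["null_alarm"], toy["incident"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3g}, dof={dof}")
```

With many sparse categories the chi-squared approximation becomes unreliable, so on the real data it may be worth grouping rare categories before testing.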


Now let's look at the types of values we get in these low p-value features when Fire_Alarm_System_Presence is null.

In [142]:
for column in cols_related_fire_alarm:
    print(column)
    print(fire_data[fire_data['Fire_Alarm_System_Presence'].isnull()][column].unique())

Area_of_Origin
['81 - Engine Area' '75 - Trash, rubbish area (outside)' nan
 '73 - Parking Area, Parking Lot'
 '44 - Trash, Rubbish Storage (inc garbage chute room, garbage/industri'
 '86 - Passenger Area' '99 - Undetermined  (formerly 98)'
 '85 - Operator/Control Area' '83 - Electrical Systems'
 '59 - Utility Shaft (eg. electrical wiring/phone, etc.)'
 '89 - Other Vehicle Area' '87 - Trunk/Cargo Area'
 '91 - Multiple Areas of Origin' '53 - Chimney/Flue Pipe'
 '79 - Other Outside Area' '97 - Other - unclassified'
 '64 - Porch or Balcony'
 '71 - Open Area (inc lawn, field, farmyard, park, playing field, pier,'
 '84 - Fuel Systems (eg. fuel tank, etc.)'
 '25 - Washroom or Bathroom (toilet,restroom/locker room)'
 '29 - Electronic Equipment' '47 - Shipping/Receiving/Loading Platform'
 '72 - Court, Patio, Terrace'
 '82 - Running Gear (inc wheels and braking systems, transmission syste'
 '42 - Garage'
 '31 - Process Manufacturing (inc manf, prod assembly, repair)'
 '11 - Lobby, Entranceway' 

This isn't too helpful! There are a lot of categories in each of these columns, and it's hard to distinguish which ones are actually related to whether Fire_Alarm_System_Presence is null. Stopping here (Jeff, Nov 23). I think it's probably best to just assign these rows something like "Undetermined".

# Preview only: fillna returns a new DataFrame, so fire_data itself is unchanged here
fire_data[category_1].fillna('Undetermined')
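
Because `fillna` returns a new object rather than modifying the frame in place, the result has to be assigned back for the change to persist. The sketch below shows one way to do that while leaving numeric columns alone; the `toy` frame is a hypothetical stand-in, and skipping the numeric columns is our assumption rather than a decision already made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fire_data with one text and one numeric column.
toy = pd.DataFrame({
    "Smoke_Spread": ["99 - Undetermined", np.nan, "8 - Entire Structure"],
    "Estimated_Number_Of_Persons_Displaced": [0.0, np.nan, 2.0],
})
category_1 = list(toy.columns)

# Only fill non-numeric columns; a string placeholder would change a numeric column's dtype.
text_cols = toy[category_1].select_dtypes(exclude="number").columns

# Assign the result back so the change persists.
toy[text_cols] = toy[text_cols].fillna("Undetermined")
print(toy)
```

The numeric column keeps its null, which can then be imputed separately if needed.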

In addition to null values, some columns have data that includes "undetermined", "not applicable", or something related. It would be useful to explore these. An example is below.

In [12]:
fire_data.groupby('Smoke_Spread').size().sort_values(ascending = False)

Smoke_Spread
2 - Confined to part of room/area of origin                                         2888
4 - Spread beyond room of origin, same floor                                        2746
7 - Spread to other floors, confined to building                                    1823
3 - Spread to entire room of origin                                                  928
8 - Entire Structure                                                                 742
5 - Multi unit bldg: spread beyond suite of origin but not to separated suite(s)     547
99 - Undetermined                                                                    538
9 - Confined to roof/exterior structure                                              459
6 - Multi unit bldg: spread to separate suite(s)                                     296
10 - Spread beyond building of origin                                                247
dtype: int64

One thing we noticed is that some of the fire incidents data does not actually pertain to fire incidents. Fire trucks often respond to non-fire emergencies, such as medical calls. It would be useful for us to explore how much of this data does not actually pertain to fires. Perhaps this will help explain why there are specific groupings of null values.

"Initial_CAD_Event_Type" appears to be the column that tells us what each call is for. There are 115 unique values in this column. Let's look at what they are.

UPDATE: It is very difficult to tell what some of the abbreviations in this column mean (e.g., VEF, FIHR, FICI, FIG, etc.). For now I will move on to other data cleaning until we find a document that will help us with this. If we cannot find documentation - @Jeff, is this something your Dad could help with?

In [13]:
fire_data['Initial_CAD_Event_Type'].nunique()

115

In [14]:
fire_data.groupby('Final_Incident_Type').size().sort_values(ascending = False).head()

Final_Incident_Type
01 - Fire                                                                                  15516
03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vandal,child playing,recycling or dump fires)     1914
02 - Explosion (including during Fire, excluding Codes 3 & 11-13)                            106
dtype: int64

In [84]:
fire_data.groupby('Initial_CAD_Event_Type').size().sort_values(ascending = False).head(20)

Initial_CAD_Event_Type
FIR                             3929
Fire - Grass/Rubbish            1698
VEF                             1652
FIHR                            1617
FICI                            1303
FIG                              917
Fire - Residential               898
FAHR                             787
VEFH                             548
Vehicle Fire                     478
Fire -  Highrise Residential     379
Fire - Commercial/Industrial     375
FACI                             303
Alarm Highrise Residential       235
FIHRD                            199
FAR                              175
Vehicle Fire - Highway           157
FAHRD                            154
FITP                             113
FIS                              107
dtype: int64

## Parsing DateTimes
We have 5 columns that should be DateTimes but are currently objects. Let's convert these.

In [16]:
datetime_columns = ['Ext_agent_app_or_defer_time', 'Fire_Under_Control_Time', 'Last_TFS_Unit_Clear_Time', 'TFS_Alarm_Time', 'TFS_Arrival_Time']

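# The raw strings use an ISO 8601 'T' separator (e.g. 2018-02-24T21:04:29), so the format includes it
fire_data[datetime_columns] = fire_data[datetime_columns].apply(pd.to_datetime, format = '%Y-%m-%dT%H:%M:%S')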
fire_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17536 entries, 0 to 17535
Data columns (total 41 columns):
 #   Column                                                         Non-Null Count  Dtype         
---  ------                                                         --------------  -----         
 0   _id                                                            17536 non-null  int64         
 1   Area_of_Origin                                                 15623 non-null  object        
 2   Building_Status                                                11216 non-null  object        
 3   Business_Impact                                                11214 non-null  object        
 4   Civilian_Casualties                                            17536 non-null  int64         
 5   Count_of_Persons_Rescued                                       17536 non-null  int64         
 6   Estimated_Dollar_Loss                                          15627 non-null  float64       
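
Once the columns are parsed, a response-time target and simple calendar features can be derived from them. The sketch below assumes that response time means arrival time minus alarm time; the column names `response_minutes`, `alarm_hour`, and `alarm_dayofweek` are hypothetical. The two timestamp pairs are copied from the preview rows shown earlier.

```python
import pandas as pd

# Two timestamp pairs copied from the first rows of the fire_data preview.
toy = pd.DataFrame({
    "TFS_Alarm_Time": ["2018-02-24T21:04:29", "2018-02-24T21:24:43"],
    "TFS_Arrival_Time": ["2018-02-24T21:10:11", "2018-02-24T21:29:31"],
})
for col in ["TFS_Alarm_Time", "TFS_Arrival_Time"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%dT%H:%M:%S")

# Response time in minutes (assumed definition: arrival minus alarm).
toy["response_minutes"] = (
    (toy["TFS_Arrival_Time"] - toy["TFS_Alarm_Time"]).dt.total_seconds() / 60
)

# Calendar features extracted from the alarm time.
toy["alarm_hour"] = toy["TFS_Alarm_Time"].dt.hour
toy["alarm_dayofweek"] = toy["TFS_Alarm_Time"].dt.dayofweek  # Monday = 0

print(toy[["response_minutes", "alarm_hour", "alarm_dayofweek"]])
```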


## Check for Duplicates
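
A minimal sketch of a duplicate check, assuming `_id` is meant to be unique per incident. The `toy` frame is a hypothetical stand-in for `fire_data`.

```python
import pandas as pd

# Hypothetical stand-in for fire_data with one repeated incident.
toy = pd.DataFrame({
    "_id": [578689, 578690, 578690, 578691],
    "TFS_Alarm_Time": ["2018-02-24T21:04:29", "2018-02-24T21:24:43",
                       "2018-02-24T21:24:43", "2018-02-25T13:29:59"],
})

# Rows that are identical to an earlier row in every column.
n_duplicate_rows = int(toy.duplicated().sum())

# Rows that repeat an earlier _id, even if other columns differ.
n_duplicate_ids = int(toy["_id"].duplicated().sum())

# Keep the first occurrence of each _id.
deduplicated = toy.drop_duplicates(subset="_id", keep="first")

print(n_duplicate_rows, n_duplicate_ids, len(deduplicated))
```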

## Splitting "## - Description" Columns
Below I created a function that could take in the columns with the format "## - Description" and split them into two separate columns: one for the number and one for the text.

I tested it and it works, but I think we should wait to apply this to our DataFrame until we have selected the columns we are interested in because this function will almost double the number of columns we have.

In [17]:
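def column_split(df, column, number_column, string_column):
    # Split "## - Description" values on the first hyphen only
    parts = df[column].str.split('-', n=1, expand=True)
    df[number_column] = parts[0].str.strip()
    df[string_column] = parts[1].str.strip()

    return df[number_column], df[string_column]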

In [18]:
# column_split(fire_data, 'Area_of_Origin', 'Area_Of_Origin_No', 'Area_Of_Origin_Descr')
# fire_data
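
To illustrate the splitting step on its own, here is a small sketch applied to a standalone Series. The sample strings are copied from outputs earlier in the notebook, with a null added; converting the code to a number with `pd.to_numeric` is an optional extra step we are assuming would be useful.

```python
import numpy as np
import pandas as pd

# Sample values copied from earlier outputs, plus a null.
codes = pd.Series([
    "9 - Undetermined",
    "8 - Not applicable - no sprinkler system present",
    "1 -  Fire alarm system present",
    np.nan,
])

# n=1 splits on the first hyphen only, so hyphens inside the description survive.
parts = codes.str.split("-", n=1, expand=True)

# Strip the padding around each piece; nulls stay null.
number = pd.to_numeric(parts[0].str.strip())
description = parts[1].str.strip()

print(pd.DataFrame({"number": number, "description": description}))
```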