In this notebook I start with the complete merged dataset (data3) and compare the period before 9 January 2014 with the period since then. On 9 January 2014 ten London fire stations closed (Belsize, Bow, Clerkenwell, Downham, Kingsland, Knightsbridge, Silvertown, Southwark, Westminster and Woolwich), leaving 102 stations serving London.
The purpose of this exercise is simple: I want to check whether the closure of 10 fire stations had an impact on LFB attendance time, based on all of the data I found on the London Datastore. This is the full population of LFB incidents and includes fires, special services (such as road traffic collisions) and false alarms.

Main findings:
 - This simple comparison shows that for all incidents the mean attendance time of pump 1 was 4.3 seconds slower post closure than pre closure. The median attendance time was 9 seconds slower post closure.
 - This simple comparison masks the fact that LFB attend a wide range of incidents; the majority are not fires.

In [1]:
import pandas as pd
pd.options.display.max_rows = 100  # To avoid displaying lots of truncated output
import numpy as np
from datetime import datetime

In [2]:
data = pd.read_csv("C:/Users/sonja/Desktop/MSc/data/LFB/data3.csv", low_memory = False)

In [3]:
print(data.iloc[0])  # Look at the first record to see the data fields in the full dataset.
# Note that some fields contain null values; these will need to be dropped.

Unnamed: 0                                                                    0
IncidentNumber                                                        235138081
DateOfCall                                                           2009-01-01
CalYear                                                                    2009
TimeOfCall                                                             00:00:37
HourOfCall                                                                    0
IncidentGroup                                                   Special Service
StopCodeDescription                                             Special Service
SpecialServiceType                                                          RTC
PropertyCategory                                                   Road Vehicle
PropertyType                                                               Car 
AddressQualifier                          In street close to gazetteer location
Postcode_full                           

Essentially I want to split the dataframe into two: the period up to 8 January 2014 and the period since then. Also, there are lots of variables that I don't need to answer the question I am interested in. Specifically, I need: date of incident, IncidentGroup, FirstPumpArriving_AttendanceTime, FirstPumpArriving_DeployedFromStation, SecondPumpArriving_AttendanceTime and SecondPumpArriving_DeployedFromStation. The cell below keeps only the date, the incident group and the two attendance times.

In [4]:
#Selecting the relevant variables
data = data [['DateOfCall','IncidentGroup', 
              'FirstPumpArriving_AttendanceTime', 'SecondPumpArriving_AttendanceTime']]

#This leaves a much smaller number of columns; also, earlier data has null values. The
#incident recording system was upgraded in 2009, so data quality improves after that.

In [5]:
#Indexing by date of call
data = data.set_index(data['DateOfCall'])
data = data.sort_index()

In [6]:
#Creating pre_closure and post_closure dataframes
pre_closure = data['2009-01-01':'2014-01-08']
pre_closure.shape

(590496, 4)

In [7]:
post_closure = data ['2014-01-09':]
post_closure.shape

(474368, 4)
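
The split above slices a string index, which works here because the dates are stored as ISO-formatted text (YYYY-MM-DD) and therefore sort chronologically. As a minimal sketch, assuming a small made-up frame (`toy`, `pre_toy`, `post_toy` and `closure_date` are illustrative names, not part of the notebook), the same split could be done on a parsed datetime index with an explicit boundary:

```python
import pandas as pd

# Toy stand-in for data3.csv; only the column names come from the notebook.
toy = pd.DataFrame({
    "DateOfCall": ["2014-01-09", "2013-12-31", "2014-01-08", "2015-06-01"],
    "FirstPumpArriving_AttendanceTime": [301.0, 292.0, 310.0, 280.0],
})

# Parse the date strings so that comparisons are made on dates, not text.
toy["DateOfCall"] = pd.to_datetime(toy["DateOfCall"], format="%Y-%m-%d")
toy = toy.set_index("DateOfCall").sort_index()

closure_date = pd.Timestamp("2014-01-09")  # day the stations closed

# Boolean masks make the boundary explicit: every row lands in exactly one half.
pre_toy = toy[toy.index < closure_date]
post_toy = toy[toy.index >= closure_date]

print(len(pre_toy), len(post_toy))
```

A datetime index also guards against date strings in other formats (for example DD/MM/YYYY), which would not sort chronologically as text.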

I have split the dataframe into two: pre_closure and post_closure. Looking at the pre_closure dataset, the main column of interest is FirstPumpArriving_AttendanceTime. Here we have 404,916 non-null values (out of 590,496). In the post_closure dataframe there are 438,944 non-null values (out of 474,368).

In [8]:
pre_closure.info()

<class 'pandas.core.frame.DataFrame'>
Index: 590496 entries, 2009-01-01 to 2014-01-08
Data columns (total 4 columns):
DateOfCall                           590496 non-null object
IncidentGroup                        590496 non-null object
FirstPumpArriving_AttendanceTime     404916 non-null float64
SecondPumpArriving_AttendanceTime    3 non-null float64
dtypes: float64(2), object(2)
memory usage: 22.5+ MB


Here we also see that the data on second pump attendance time is very sparse pre closure (only 3 values). This is likely to be a data quality issue, given that LFB has a variety of attendance rules across the city and turning up with two pumps is common.


In [9]:
post_closure.info()
#Here we see a larger proportion of values for second pump attendance time.
#However, a before-and-after comparison for the second pump won't be possible,
#because only 3 pre-closure values are recorded.

<class 'pandas.core.frame.DataFrame'>
Index: 474368 entries, 2014-01-09 to 2018-08-31
Data columns (total 4 columns):
DateOfCall                           474368 non-null object
IncidentGroup                        474366 non-null object
FirstPumpArriving_AttendanceTime     438944 non-null float64
SecondPumpArriving_AttendanceTime    64808 non-null float64
dtypes: float64(2), object(2)
memory usage: 18.1+ MB
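
The non-null counts reported by `info()` can also be expressed as shares: from the counts above, roughly 69% of pre-closure rows and 93% of post-closure rows have a first-pump time. A minimal sketch of that calculation, assuming a made-up frame (`toy`, `share_recorded` and `first_pump` are illustrative names; only the column names come from the notebook):

```python
import numpy as np
import pandas as pd

# Toy values in seconds; NaN marks a missing attendance time.
toy = pd.DataFrame({
    "FirstPumpArriving_AttendanceTime": [292.0, np.nan, 301.0, 250.0],
    "SecondPumpArriving_AttendanceTime": [np.nan, np.nan, 369.0, np.nan],
})

# notna() flags recorded values; the mean of that mask is the share recorded.
share_recorded = toy.notna().mean()
print(share_recorded)

# Keep only rows with a first-pump time before summarising that column.
first_pump = toy.dropna(subset=["FirstPumpArriving_AttendanceTime"])
print(len(first_pump))
```

Note that `describe()`, `mean()` and `median()` skip nulls by default, so dropping them mainly makes the sample size explicit.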


In [10]:
pre_closure.describe()

,FirstPumpArriving_AttendanceTime,SecondPumpArriving_AttendanceTime
count,404916.0,3.0
mean,318.946144,428.0
std,145.83116,190.197266
min,1.0,243.0
25%,227.0,330.5
50%,292.0,418.0
75%,378.0,520.5
max,1200.0,623.0


In [11]:
post_closure.describe()

,FirstPumpArriving_AttendanceTime,SecondPumpArriving_AttendanceTime
count,438944.0,64808.0
mean,323.273292,394.386094
std,140.97465,149.913919
min,1.0,1.0
25%,235.0,298.0
50%,301.0,369.0
75%,384.0,458.0
max,1200.0,1200.0


In [12]:
pre_closure ['FirstPumpArriving_AttendanceTime'].median()

292.0

In [13]:
post_closure ['FirstPumpArriving_AttendanceTime'].median()

301.0

This simple comparison shows that for all incidents the mean attendance time of pump 1 was 318.9 seconds
pre closure compared with 323.3 seconds post closure - 4.3 seconds slower.
The median attendance time was 292 seconds pre closure and 301 seconds post closure -
9 seconds slower.
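
The differences can be computed rather than read off the tables. The sketch below is illustrative only: `compare_attendance` is a hypothetical helper, the two short series are made-up values rather than LFB data, and the last lines simply repeat the subtraction on the means and medians printed by `describe()` above.

```python
import pandas as pd

def compare_attendance(pre: pd.Series, post: pd.Series) -> pd.DataFrame:
    """Mean and median for each period, plus post minus pre (hypothetical helper)."""
    summary = pd.DataFrame({
        "pre": pre.agg(["mean", "median"]),
        "post": post.agg(["mean", "median"]),
    })
    # A positive difference means slower attendance after the closures.
    summary["difference"] = summary["post"] - summary["pre"]
    return summary

# Toy attendance times in seconds, not LFB data.
pre_toy = pd.Series([200.0, 300.0, 400.0])
post_toy = pd.Series([210.0, 320.0, 400.0])
summary = compare_attendance(pre_toy, post_toy)
print(summary)

# The same subtraction on the figures reported by describe() above.
mean_diff = 323.273292 - 318.946144
median_diff = 301.0 - 292.0
print(round(mean_diff, 1), median_diff)
```

With the real data, the call would presumably be `compare_attendance(pre_closure['FirstPumpArriving_AttendanceTime'], post_closure['FirstPumpArriving_AttendanceTime'])`.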