# Capstone 1: Data Wrangling (from CSV)

In [1]:
# Import packages and modules
import pandas as pd
import numpy as np
import datetime

### Data Cleansing: Define functions

REMOVE UNNECESSARY VALUES (**drop_invalid** and **drop_immaterial**)

For the purpose of this analysis, any observation with a missing value for 'incident_disposition_code' is omitted because the target variable is derived from this feature. Observations with a missing zip code or missing duration metrics, and observations with a dispatch time of zero seconds, are dropped as well. In addition, the following outliers, errors, and immaterial data are removed from the dataset:
+ incidents created to transport a patient from one facility to another
+ incidents where units were assigned to stand by in case they were needed
+ incidents that pertain to special events
+ incidents that were once closed but later reopened
+ incidents with calculation errors for duration metrics
+ features that contain redundant geographic information about the incident
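
The row-level criteria above can be sketched with boolean masks on a toy dataframe. This is a minimal, non-mutating illustration rather than the functions defined in this notebook; the sample values and the helper name `keep_material_rows` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using a few of the dataset's column names
toy = pd.DataFrame({
    'incident_disposition_code': [82.0, np.nan, 83.0, 93.0, 82.0],
    'valid_incident_rspns_time_indc': ['Y', 'Y', 'N', 'Y', 'Y'],
    'transfer_indicator': ['N', 'N', 'N', 'Y', 'N'],
    'standby_indicator': ['N', 'N', 'N', 'N', 'N'],
})

def keep_material_rows(frame):
    """Return a filtered copy; the input frame is left unchanged."""
    valid_cols = [c for c in frame.columns if 'valid' in c]
    flag_cols = [c for c in frame.columns
                 if 'indicator' in c and c != 'held_indicator']
    # Rows without a disposition code cannot yield a target value
    mask = frame['incident_disposition_code'].notna()
    # Any 'N' in a validation column marks a duration calculation error
    mask &= ~(frame[valid_cols] == 'N').any(axis=1)
    # Any 'Y' in an indicator column marks a transfer, standby, etc.
    mask &= ~(frame[flag_cols] == 'Y').any(axis=1)
    return frame.loc[mask].copy()

kept = keep_material_rows(toy)
print(kept.index.tolist())
```

Building one combined mask and filtering once avoids repeated in-place drops, and the caller's dataframe stays intact.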

In [2]:
def drop_invalid(dfObj):
    """
    This function drops rows within the DataFrame object that 
    would confound future analyses. The object is modified in place.
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Drop all rows with missing value for select features
    dfObj.dropna(subset=['incident_disposition_code',
                         'zipcode',
                         'dispatch_response_seconds_qy',
                         'incident_travel_tm_seconds_qy',
                         'incident_response_seconds_qy'],
                 inplace=True)
    
    # Identify all columns that validate duration metrics
    list_of_validation_cols = [name for name in list(dfObj.columns)
                               if 'valid' in str(name)]
    
    # Drop all rows with invalid duration metrics
    for name in list_of_validation_cols:
        invalid_idx = dfObj[dfObj[name]=='N'].index
        dfObj.drop(invalid_idx, inplace=True)
    
    # Drop all rows where EMS were not dispatched
    no_disp_idx = dfObj[dfObj.dispatch_response_seconds_qy==0].index
    dfObj.drop(no_disp_idx, inplace=True)
    
    return dfObj

In [3]:
def drop_immaterial(dfObj):
    """
    This function drops rows that pertain to outlier incidents and 
    columns that hold indicator or redundant geographic data
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Identify all columns with outlier event indicators
    list_of_indicator_cols = [name for name in list(dfObj.columns) 
                              if 'indicator' in str(name) and name !='held_indicator']
    
    # Drop all rows that pertain to outlier incidents
    for name in list_of_indicator_cols:
        outlier_idx = dfObj[dfObj[name]=='Y'].index
        dfObj.drop(outlier_idx, inplace=True)
    
    # Remove columns that contain incident indicator data
    dfObj.drop(list_of_indicator_cols,axis=1,inplace=True)
    
    # Identify and remove all columns that contain redundant geographic data
    list_of_zone_cols = [name for name in list(dfObj.columns) 
                         if ('district' in str(name) or name=='policeprecinct')]
    dfObj.drop(list_of_zone_cols,axis=1,inplace=True)
    
    return dfObj

REDUCE SIZE OF DATAFRAME (**reduce_memory**)

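Converting select columns to more compact data types substantially reduces the memory usage of the dataframe object. For example, in the memory reports further down, 'valid_dispatch_rspns_time_indc' occupies roughly 531 MB as an object column in the raw dataframe but roughly 8 MB as a category column after cleaning.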
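
The effect of the category dtype can be seen on a small synthetic series. This is a sketch under assumed data (one million repeated borough labels), not a measurement of the EMS dataset.

```python
import pandas as pd

# Hypothetical sample: one million borough labels stored as Python strings
labels = pd.Series(['BRONX', 'QUEENS', 'BROOKLYN', 'MANHATTAN'] * 250000,
                   dtype='object')

# deep=True counts the string objects themselves, not just the pointers
as_object = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print(f'object:   {as_object:,} bytes')
print(f'category: {as_category:,} bytes')
```

An object column stores one Python string per row, whereas a category column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.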

In [4]:
def reduce_memory(dfObj):
    """
    This function changes the dtypes for specific columns in 
    the DataFrame object in order to reduce its memory usage
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Truncate name for borough label: 'RICHMOND / STATEN ISLAND'
    dfObj['borough'] = dfObj.borough.replace('RICHMOND / STATEN ISLAND',
                                             'STATEN ISLAND')
    
    # Create list of all columns that contain datetime strings
    list_of_datetime_cols = [name for name in list(dfObj.columns) 
                             if 'datetime' in str(name)]

    # Convert dtypes for each element in list to datetime
    for name in list_of_datetime_cols:
        dfObj[name] = pd.to_datetime(dfObj[name],errors='coerce')
       
    # Create list of all columns that contain numeric values
    # (durations, severity levels, and incident IDs)
    list_of_numeric_cols = [name for name in list(dfObj.columns) 
                            if ('seconds' in str(name) or
                                'severity' in str(name) or
                                'cad' in str(name))]

    # Convert dtypes for each element in list to numeric
    for name in list_of_numeric_cols:
        dfObj[name] = pd.to_numeric(dfObj[name],errors='coerce')
        
    # Convert columns to category dtypes to reduce size of dataframe object
    dfObj['borough'] = dfObj.borough.astype('category')
    dfObj['zipcode'] = dfObj.zipcode.astype('category')
    dfObj['held_indicator'] = dfObj.held_indicator.astype('category')
    dfObj['valid_dispatch_rspns_time_indc'] = dfObj.valid_dispatch_rspns_time_indc.astype('category')
    dfObj['valid_incident_rspns_time_indc'] = dfObj.valid_incident_rspns_time_indc.astype('category')
    dfObj['incident_dispatch_area'] = dfObj.incident_dispatch_area.astype('category')
    dfObj['incident_disposition_code'] = dfObj.incident_disposition_code.astype('category')
    
    return dfObj
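
The datetime conversion in `reduce_memory` lets pandas infer the timestamp layout. As a hedged alternative, the layout seen in the raw data (for example `01/01/2013 12:00:04 AM`) can be passed explicitly through the `format` argument; the sample series below, including its malformed entry, is made up for illustration.

```python
import pandas as pd

# Timestamps in the layout used by the raw data; the last entry is a
# deliberately malformed value (hypothetical)
raw = pd.Series(['01/01/2013 12:00:04 AM',
                 '01/01/2013 12:01:45 PM',
                 'not a timestamp'])

# %I is the hour on a 12-hour clock and %p the AM/PM marker;
# errors='coerce' turns unparseable entries into NaT
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p', errors='coerce')
print(parsed)
```

Stating the format removes any ambiguity between month-first and day-first dates and typically speeds up parsing on large columns.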

REDESIGN THE DATAFRAME (**format_df**)

Construct a boolean series that represents the target variable (fatality) from the corresponding values in 'incident_disposition_code': codes 83 and 96 are treated as fatalities. Also, rename and reorder columns to improve the readability of the dataframe object.
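
The derived columns can be illustrated on two made-up rows. This sketch uses `Series.isin` and the `.dt` accessor as a compact alternative to the approach in `format_df`; the sample values are assumptions, while the codes (83, 96) and severity levels (1 to 3) are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical sample rows
sample = pd.DataFrame({
    'incident_datetime': pd.to_datetime(['2013-01-01 00:00:04',
                                         '2013-01-01 00:01:04']),
    'final_severity_level': [4, 1],
    'incident_disposition_code': [82.0, 83.0],
})

# Severity levels 1-3 are flagged as life-threatening
sample['life_threatening'] = sample['final_severity_level'].isin([1, 2, 3])
# Disposition codes 83 and 96 define the target variable
sample['fatality'] = sample['incident_disposition_code'].isin([83, 96])
# Day of the week as a label, e.g. 'Tuesday'
sample['weekday'] = sample['incident_datetime'].dt.day_name()

print(sample[['life_threatening', 'fatality', 'weekday']])
```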

In [5]:
def format_df(dfObj):
    """
    This function renames select columns, creates Pandas series for a 
    feature variable (life_threatening), the target variable (fatality), 
    and the time components of the incident, and re-orders the columns
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Rename select columns
    dfObj.rename(columns={'initial_severity_level_code':'initial_severity_level',
                          'final_severity_level_code':'final_severity_level',
                          'dispatch_response_seconds_qy':'dispatch_time',
                          'incident_travel_tm_seconds_qy':'travel_time',
                          'incident_response_seconds_qy':'total_response_time'},inplace=True)
    
    # Create a series for a new feature variable: life_threatening
    dfObj['life_threatening'] = [True if ((val ==1)|
                                         (val == 2)|
                                         (val == 3)) else False 
                                 for val in dfObj['final_severity_level'].astype('int64')]
    
    # Create a series for the target variable: fatality
    dfObj['fatality'] = np.logical_or(dfObj.incident_disposition_code.astype('int64') == 83,
                                      dfObj.incident_disposition_code.astype('int64') == 96)

    # Create separate columns for time components of the incident
    dfObj['year'] = pd.DatetimeIndex(dfObj.incident_datetime).year
    dfObj['month'] = pd.DatetimeIndex(dfObj.incident_datetime).month
    dfObj['weekday'] = pd.DatetimeIndex(dfObj.incident_datetime).day_name().astype('category')
    dfObj['hour'] = pd.DatetimeIndex(dfObj.incident_datetime).hour
    
    # Reorder dataframe columns
    col_order = ['cad_incident_id','incident_datetime',
                 'year','month','hour','weekday','borough','zipcode',
                 'initial_call_type','initial_severity_level',
                 'final_call_type','final_severity_level',
                 'held_indicator','first_assignment_datetime',
                 'incident_dispatch_area','valid_dispatch_rspns_time_indc',
                 'dispatch_time','first_activation_datetime',
                 'first_on_scene_datetime','travel_time',
                 'valid_incident_rspns_time_indc','total_response_time',
                 'first_to_hosp_datetime','first_hosp_arrival_datetime',
                 'incident_close_datetime','incident_disposition_code',
                 'life_threatening','fatality']
    dfObj=dfObj[col_order]
    
    return dfObj

### Data Acquisition: Inspect sample of source data

In [6]:
# Import sample of dataset
"""
    The original CSV file can be exported from 
    https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj
"""
file_path = '../data/EMS_Incident_Dispatch_Data.csv'

# Read CSV data into a Pandas dataframe
preview_df = pd.read_csv(file_path,header=0,nrows=1000)

In [7]:
preview_df.shape

(1000, 32)

In [8]:
preview_df.head()

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,ZIPCODE,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR
0,130010001,01/01/2013 12:00:04 AM,RESPIR,4,RESPIR,4,01/01/2013 12:01:45 AM,Y,101,01/01/2013 12:01:51 AM,...,10472.0,43.0,18.0,209.0,12.0,15.0,N,N,N,N
1,130010002,01/01/2013 12:00:19 AM,CARD,3,CARD,3,01/01/2013 12:01:18 AM,Y,59,01/01/2013 12:02:08 AM,...,10454.0,40.0,8.0,201.0,7.0,15.0,N,N,N,N
2,130010004,01/01/2013 12:01:04 AM,ARREST,1,ARREST,1,01/01/2013 12:01:33 AM,Y,29,01/01/2013 12:01:58 AM,...,11418.0,102.0,29.0,409.0,27.0,5.0,N,N,N,N
3,130010005,01/01/2013 12:01:16 AM,SICK,6,SICK,6,01/01/2013 12:02:12 AM,Y,56,01/01/2013 12:02:55 AM,...,10453.0,46.0,14.0,205.0,10.0,15.0,N,N,N,N
4,130010006,01/01/2013 12:01:26 AM,INJURY,5,INJURY,5,01/01/2013 12:01:58 AM,Y,32,01/01/2013 12:02:55 AM,...,10457.0,48.0,15.0,206.0,10.0,15.0,N,N,N,N


In [9]:
preview_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
CAD_INCIDENT_ID                   1000 non-null int64
INCIDENT_DATETIME                 1000 non-null object
INITIAL_CALL_TYPE                 1000 non-null object
INITIAL_SEVERITY_LEVEL_CODE       1000 non-null int64
FINAL_CALL_TYPE                   1000 non-null object
FINAL_SEVERITY_LEVEL_CODE         1000 non-null int64
FIRST_ASSIGNMENT_DATETIME         935 non-null object
VALID_DISPATCH_RSPNS_TIME_INDC    1000 non-null object
DISPATCH_RESPONSE_SECONDS_QY      1000 non-null int64
FIRST_ACTIVATION_DATETIME         932 non-null object
FIRST_ON_SCENE_DATETIME           890 non-null object
VALID_INCIDENT_RSPNS_TIME_INDC    1000 non-null object
INCIDENT_RESPONSE_SECONDS_QY      889 non-null float64
INCIDENT_TRAVEL_TM_SECONDS_QY     890 non-null float64
FIRST_TO_HOSP_DATETIME            522 non-null object
FIRST_HOSP_ARRIVAL_DATETIME       513 non-null object
INCIDENT_CLOSE_DATETIME

### Data Acquisition: Obtain complete source data from CSV file

In [10]:
# Import full dataset
"""
    The original CSV file can be exported from 
    https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj
"""
file_path = '../data/EMS_Incident_Dispatch_Data.csv'

# Read CSV data into a Pandas dataframe
df = pd.read_csv(file_path,header=0)


In [11]:
# Convert all column header names to lowercase
col_names = [str(name).lower() for name in list(df.columns)]
df.columns = col_names
df.columns

Index(['cad_incident_id', 'incident_datetime', 'initial_call_type',
       'initial_severity_level_code', 'final_call_type',
       'final_severity_level_code', 'first_assignment_datetime',
       'valid_dispatch_rspns_time_indc', 'dispatch_response_seconds_qy',
       'first_activation_datetime', 'first_on_scene_datetime',
       'valid_incident_rspns_time_indc', 'incident_response_seconds_qy',
       'incident_travel_tm_seconds_qy', 'first_to_hosp_datetime',
       'first_hosp_arrival_datetime', 'incident_close_datetime',
       'held_indicator', 'incident_disposition_code', 'borough', 'atom',
       'incident_dispatch_area', 'zipcode', 'policeprecinct',
       'citycouncildistrict', 'communitydistrict', 'communityschooldistrict',
       'congressionaldistrict', 'reopen_indicator', 'special_event_indicator',
       'standby_indicator', 'transfer_indicator'],
      dtype='object')

A downloadable description of each dataset field is available at https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj in the _Attachments_ section under the file name **EMS_incident_dispatch_data_description.xlsx**. 

### Inspect dataframe

In [12]:
df.shape

(8557848, 32)

In [13]:
df.head()

Unnamed: 0,cad_incident_id,incident_datetime,initial_call_type,initial_severity_level_code,final_call_type,final_severity_level_code,first_assignment_datetime,valid_dispatch_rspns_time_indc,dispatch_response_seconds_qy,first_activation_datetime,...,zipcode,policeprecinct,citycouncildistrict,communitydistrict,communityschooldistrict,congressionaldistrict,reopen_indicator,special_event_indicator,standby_indicator,transfer_indicator
0,130010001,01/01/2013 12:00:04 AM,RESPIR,4,RESPIR,4,01/01/2013 12:01:45 AM,Y,101,01/01/2013 12:01:51 AM,...,10472.0,43.0,18.0,209.0,12.0,15.0,N,N,N,N
1,130010002,01/01/2013 12:00:19 AM,CARD,3,CARD,3,01/01/2013 12:01:18 AM,Y,59,01/01/2013 12:02:08 AM,...,10454.0,40.0,8.0,201.0,7.0,15.0,N,N,N,N
2,130010004,01/01/2013 12:01:04 AM,ARREST,1,ARREST,1,01/01/2013 12:01:33 AM,Y,29,01/01/2013 12:01:58 AM,...,11418.0,102.0,29.0,409.0,27.0,5.0,N,N,N,N
3,130010005,01/01/2013 12:01:16 AM,SICK,6,SICK,6,01/01/2013 12:02:12 AM,Y,56,01/01/2013 12:02:55 AM,...,10453.0,46.0,14.0,205.0,10.0,15.0,N,N,N,N
4,130010006,01/01/2013 12:01:26 AM,INJURY,5,INJURY,5,01/01/2013 12:01:58 AM,Y,32,01/01/2013 12:02:55 AM,...,10457.0,48.0,15.0,206.0,10.0,15.0,N,N,N,N


In [14]:
df.info(verbose=True,null_counts=True,memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8557848 entries, 0 to 8557847
Data columns (total 32 columns):
cad_incident_id                   8557848 non-null int64
incident_datetime                 8557848 non-null object
initial_call_type                 8557848 non-null object
initial_severity_level_code       8557848 non-null int64
final_call_type                   8557848 non-null object
final_severity_level_code         8557848 non-null int64
first_assignment_datetime         8498783 non-null object
valid_dispatch_rspns_time_indc    8557848 non-null object
dispatch_response_seconds_qy      8557848 non-null int64
first_activation_datetime         8483588 non-null object
first_on_scene_datetime           8284662 non-null object
valid_incident_rspns_time_indc    8557848 non-null object
incident_response_seconds_qy      8283276 non-null float64
incident_travel_tm_seconds_qy     8284535 non-null float64
first_to_hosp_datetime            6031607 non-null object
first_hosp_arrival_

In [15]:
df.memory_usage(deep=True)

Index                                    80
cad_incident_id                    68462784
incident_datetime                 676069992
initial_call_type                 529894033
initial_severity_level_code        68462784
final_call_type                   529993248
final_severity_level_code          68462784
first_assignment_datetime         673293937
valid_dispatch_rspns_time_indc    530586576
dispatch_response_seconds_qy       68462784
first_activation_datetime         672579772
first_on_scene_datetime           663230250
valid_incident_rspns_time_indc    530586576
incident_response_seconds_qy       68462784
incident_travel_tm_seconds_qy      68462784
first_to_hosp_datetime            557336665
first_hosp_arrival_datetime       555408114
incident_close_datetime           675918323
held_indicator                    530586576
incident_disposition_code          68462784
borough                           554812429
atom                              478931352
incident_dispatch_area          

### Data Cleansing: Apply functions

In [16]:
# Apply drop_invalid to dataframe object
df = drop_invalid(df)

In [17]:
# Apply drop_immaterial to dataframe object
df = drop_immaterial(df)

In [18]:
# Apply reduce_memory to dataframe object
df = reduce_memory(df)

In [19]:
# Apply format_df to dataframe object
df = format_df(df)

In [20]:
# Set index to 'cad_incident_id'
df.set_index(['cad_incident_id'],inplace=True)

### Inspect clean dataframe

In [21]:
df.shape

(8011592, 27)

In [22]:
df.head()

cad_incident_id,incident_datetime,year,month,hour,weekday,borough,zipcode,initial_call_type,initial_severity_level,final_call_type,...,first_on_scene_datetime,travel_time,valid_incident_rspns_time_indc,total_response_time,first_to_hosp_datetime,first_hosp_arrival_datetime,incident_close_datetime,incident_disposition_code,life_threatening,fatality
130010001,2013-01-01 00:00:04,2013,1,0,Tuesday,BRONX,10472.0,RESPIR,4,RESPIR,...,2013-01-01 00:13:21,696.0,Y,797.0,2013-01-01 00:28:49,2013-01-01 00:38:15,2013-01-01 01:04:56,82.0,False,False
130010002,2013-01-01 00:00:19,2013,1,0,Tuesday,BRONX,10454.0,CARD,3,CARD,...,2013-01-01 00:14:30,792.0,Y,851.0,NaT,NaT,2013-01-01 00:55:34,93.0,True,False
130010004,2013-01-01 00:01:04,2013,1,0,Tuesday,QUEENS,11418.0,ARREST,1,ARREST,...,2013-01-01 00:08:13,400.0,Y,429.0,NaT,NaT,2013-01-01 00:38:05,83.0,True,True
130010005,2013-01-01 00:01:16,2013,1,0,Tuesday,BRONX,10453.0,SICK,6,SICK,...,2013-01-01 00:15:04,772.0,Y,828.0,2013-01-01 00:34:54,2013-01-01 00:53:02,2013-01-01 01:20:28,82.0,False,False
130010006,2013-01-01 00:01:26,2013,1,0,Tuesday,BRONX,10457.0,INJURY,5,INJURY,...,2013-01-01 00:15:42,824.0,Y,856.0,2013-01-01 00:27:42,2013-01-01 00:31:13,2013-01-01 00:53:12,82.0,False,False


In [23]:
df.info(verbose=True,null_counts=True,memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8011592 entries, 130010001 to 183654386
Data columns (total 27 columns):
incident_datetime                 8011592 non-null datetime64[ns]
year                              8011592 non-null int64
month                             8011592 non-null int64
hour                              8011592 non-null int64
weekday                           8011592 non-null category
borough                           8011592 non-null category
zipcode                           8011592 non-null category
initial_call_type                 8011592 non-null object
initial_severity_level            8011592 non-null int64
final_call_type                   8011592 non-null object
final_severity_level              8011592 non-null int64
held_indicator                    8011592 non-null category
first_assignment_datetime         8011592 non-null datetime64[ns]
incident_dispatch_area            8011592 non-null category
valid_dispatch_rspns_time_indc    8011592 no

In [24]:
df.memory_usage(deep=True)

Index                              64092736
incident_datetime                  64092736
year                               64092736
month                              64092736
hour                               64092736
weekday                             8012361
borough                             8012142
zipcode                            16035320
initial_call_type                 496071904
initial_severity_level             64092736
final_call_type                   496169104
final_severity_level               64092736
held_indicator                      8011796
first_assignment_datetime          64092736
incident_dispatch_area              8014996
valid_dispatch_rspns_time_indc      8011734
dispatch_time                      64092736
first_activation_datetime          64092736
first_on_scene_datetime            64092736
travel_time                        64092736
valid_incident_rspns_time_indc      8011734
total_response_time                64092736
first_to_hosp_datetime          

### Export dataframe to CSV

In [25]:
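# Export dataframe to CSV
# Note: the file is gzip-compressed even though its name ends in '.csv', and
# index=False leaves the 'cad_incident_id' index out of the exported file
output_path = '../data/clean_EMS_data_from_csv.csv'
print('Exporting dataframe to CSV...')
df.to_csv(output_path,index=False,compression='gzip')
print("Dataframe successfully exported to CSV using 'gzip' compression.")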

Exporting dataframe to CSV...
Dataframe successfully exported to CSV using 'gzip' compression.
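
If the incident IDs need to survive the export, one option is to write the index and restore it on load. The sketch below round-trips a made-up two-row frame through a temporary gzip-compressed file; the file name and sample values are illustrative assumptions, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame indexed by incident ID
frame = pd.DataFrame({'fatality': [False, True]},
                     index=pd.Index([130010001, 130010004],
                                    name='cad_incident_id'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv.gz')  # illustrative file name
    # index=True (the default) writes the incident IDs as the first column
    frame.to_csv(path, index=True, compression='gzip')
    # index_col restores the IDs as the index when reading the file back
    restored = pd.read_csv(path, index_col='cad_incident_id',
                           compression='gzip')

print(restored)
```

Using a `.csv.gz` suffix also makes the compression visible in the file name, and lets pandas infer it when `compression` is left at its default.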


### Summary
The original dataset comprised 8,557,848 observations of mixed data types. Its source file occupied approximately 1.98 GB of hard disk space, and the loaded dataframe used 10.9 GB of system memory. The target variable ("fatality") was created by applying a boolean filter to the "incident_disposition_code" column, which indicates the outcome of an EMS incident.

After all data pre-processing was complete, the resulting clean dataset comprised 8,011,592 observations of mixed data types, with a clear target variable and 26 predictor variables. Its output file occupied 286 MB of hard disk space, and the dataframe used 2.0 GB of system memory.
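
The reductions implied by these figures can be worked out directly. The snippet below only restates the numbers quoted in this summary; the variable names are chosen for illustration.

```python
# Figures quoted in the summary
rows_before = 8557848
rows_after = 8011592
mem_before_gb = 10.9
mem_after_gb = 2.0

rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before
mem_reduction = 1 - mem_after_gb / mem_before_gb

print(f'rows removed: {rows_removed:,} ({share_removed:.1%})')
print(f'memory reduction: {mem_reduction:.1%}')
```

Cleaning removed 546,256 observations, roughly 6.4% of the original rows, while in-memory size fell by about 82%, so most of the saving comes from the dtype changes and dropped columns rather than from the dropped rows.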