# Capstone 1: Data Wrangling (from CSV)

<a id='TOC'></a>
**Table of Contents**
1. Preliminaries
    1. [Import Modules](#Sec01A)
2. Define Functions
    1. [Remove Unnecessary Values](#Sec02A)
    2. [Reduce Size of DataFrame](#Sec02B)
    3. [Redesign the DataFrame](#Sec02C)
3. Data Acquisition
    1. [Inspect Sample of EMS Dataset](#Sec03A)
    2. [Import Full EMS Dataset](#Sec03B)
    3. [Import Geographical Dataset](#Sec03C)
    4. [Inspect Raw, Merged Dataset](#Sec03D)
4. Data Wrangling
    1. [Apply Custom Functions](#Sec04A)
    2. [Inspect Clean Dataset](#Sec04B)
    3. [Export Clean Dataset](#Sec04C)
    4. [Process Summary](#Sec04D)

The goal of this project is to develop machine learning models that predict whether an EMS incident will result in a fatality. This is a supervised, binary classification problem. Analyses will be performed on nearly 8 million records of documented incidents spanning the six-year period from January 2013 through December 2018. The dataset contains several feature variables, of mixed data types, that describe both the attributes of each incident and the responsive action taken by the FDNY. All of these factors affect an individual’s survivability once a response is initiated.

Data wrangling will be performed on two datasets for this analysis. One dataset contains EMS incident data spanning a six-year period. The second dataset contains geographical information for all ZIP Code Tabulation Areas within the City of New York.

***

## 1. Preliminaries

<a id='Sec01A'></a>
#### 1A: Import modules

In [1]:
# Import packages and modules
import pandas as pd
import numpy as np
import datetime

[TOC](#TOC)

***

## 2. Define Functions

<a id='Sec02A'></a>
#### 2A: Remove unnecessary values (`drop_invalid` and `drop_immaterial`)

For the purpose of this analysis, any observation with a missing value for `incident_disposition_code` must be omitted, since the target variable is derived from this feature. Observations with a missing `zipcode` or missing duration metrics are dropped as well. In addition, the following outliers, errors, and immaterial information must be removed from the dataset:
+ incidents created to transport a patient from one facility to another
+ incidents where units were assigned to stand by in case they were needed
+ incidents that pertain to special events
+ incidents that were once closed but later reopened
+ incidents with calculation errors for duration metrics
+ features that contain redundant geographic information for the incident

In [2]:
def drop_invalid(dfObj):
    """
    This function drops rows within the DataFrame
    object that would confound future analyses
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Drop all rows with missing value for select features
    dfObj.dropna(subset=['incident_disposition_code',
                         'zipcode',
                         'dispatch_response_seconds_qy',
                         'incident_travel_tm_seconds_qy',
                         'incident_response_seconds_qy'],
                 inplace=True)
    
    # Identify all columns that validate duration metrics
    list_of_validation_cols = [name for name in list(dfObj.columns)
                               if 'valid' in str(name)]
    
    # Drop all rows with invalid duration metrics
    for name in list_of_validation_cols:
        invalid_idx = dfObj[dfObj[name]=='N'].index
        dfObj.drop(invalid_idx, inplace=True)
    
    # Drop all rows where EMS were not dispatched
    no_disp_idx = dfObj[dfObj.dispatch_response_seconds_qy==0].index
    dfObj.drop(no_disp_idx, inplace=True)
    
    return dfObj

In [3]:
def drop_immaterial(dfObj):
    """
    This function removes all observations that pertain to
    outlier incidents. It also removes columns within the
    DataFrame object that contain erroneous or redundant
    information
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Identify all columns with outlier event indicators
    list_of_indicator_cols = [name for name in list(dfObj.columns) 
                              if 'indicator' in str(name) and name !='held_indicator']
    
    # Drop all rows that pertain to outlier incidents
    for name in list_of_indicator_cols:
        outlier_idx = dfObj[dfObj[name]=='Y'].index
        dfObj.drop(outlier_idx, inplace=True)
    
    # Remove columns that contain incident indicator data
    dfObj.drop(list_of_indicator_cols,axis=1,inplace=True)
    dfObj.drop([name for name in list(dfObj.columns) 
                if '_indc' in str(name)],axis=1,inplace=True)
    
    # Identify and remove all columns that contain redundant geographic data
    list_of_zone_cols = [name for name in list(dfObj.columns)
                         if ('district' in str(name)
                             or name == 'policeprecinct'
                             or name == 'geoid')]
    dfObj.drop(list_of_zone_cols,axis=1,inplace=True)
    
    return dfObj
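
The row filters in `drop_invalid` and `drop_immaterial` can also be expressed as a single boolean mask rather than a loop of in-place drops. The following is a minimal sketch on a made-up four-row frame; the column names mimic the EMS dataset, but the values and the names `toy`, `is_outlier`, `is_invalid`, and `kept` are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy frame: column names mimic the EMS dataset, values are made up
toy = pd.DataFrame({
    'cad_incident_id': [1, 2, 3, 4],
    'valid_dispatch_rspns_time_indc': ['Y', 'N', 'Y', 'Y'],
    'standby_indicator': ['N', 'N', 'Y', 'N'],
    'transfer_indicator': ['N', 'N', 'N', 'N'],
    'held_indicator': ['N', 'Y', 'N', 'Y'],
})

# Columns that flag outlier incidents (held_indicator is kept as a feature)
indicator_cols = [c for c in toy.columns
                  if 'indicator' in c and c != 'held_indicator']
# Columns that validate duration metrics
validation_cols = [c for c in toy.columns if 'valid' in c]

# One mask per rule, evaluated row-wise across the matching columns
is_outlier = toy[indicator_cols].eq('Y').any(axis=1)
is_invalid = toy[validation_cols].eq('N').any(axis=1)

# Keep rows that trip neither rule, then drop the indicator columns
kept = toy.loc[~(is_outlier | is_invalid)].drop(columns=indicator_cols)
print(kept)
```

Building the mask first makes it easy to count how many rows each rule removes (for example `is_outlier.sum()`) before anything is dropped.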

[TOC](#TOC)

<a id='Sec02B'></a>
#### 2B: Reduce size of DataFrame (`reduce_memory`)

Modifying the data types for values contained within select columns will drastically reduce the memory usage of the dataframe object.

In [4]:
def reduce_memory(dfObj):
    """
    This function changes the dtypes for specific columns in 
    the DataFrame object in order to reduce its memory usage
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Truncate name for borough label: 'RICHMOND / STATEN ISLAND'
    dfObj['borough'] = dfObj.borough.replace('RICHMOND / STATEN ISLAND',
                                             'STATEN ISLAND')
    
    # Create list of all columns that contain ISO8601 datetime
    list_of_datetime_cols = [name for name in list(dfObj.columns) 
                             if 'datetime' in str(name)]

    # Convert dtypes for each element in list to datetime
    for name in list_of_datetime_cols:
        dfObj[name] = pd.to_datetime(dfObj[name],errors='coerce')
       
    # Create list of all columns that contain numeric data
    # (time durations, severity levels, and CAD incident IDs)
    list_of_numeric_cols = [name for name in list(dfObj.columns) 
                            if (('seconds' in str(name))|
                                ('severity' in str(name))|
                                ('cad' in str(name)))]

    # Convert dtypes for each element in list to numeric
    for name in list_of_numeric_cols:
        dfObj[name] = pd.to_numeric(dfObj[name],errors='coerce')
        
    # Convert columns to category dtypes to reduce size of dataframe object
    dfObj['borough'] = dfObj.borough.astype('category')
    dfObj['zipcode'] = dfObj.zipcode.astype('category')
    dfObj['held_indicator'] = dfObj.held_indicator.astype('category')
    dfObj['incident_dispatch_area'] = dfObj.incident_dispatch_area.astype('category')
    dfObj['incident_disposition_code'] = dfObj.incident_disposition_code.astype('category')
    
    return dfObj
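
To illustrate why the `category` conversion in `reduce_memory` saves space, the sketch below compares the memory footprint of a repetitive label column stored as `object` and as `category`. The data are synthetic (five borough labels repeated many times) and the variable names are assumptions made for this example.

```python
import pandas as pd

# Synthetic column: five borough labels repeated 20,000 times each
labels = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']
as_object = pd.Series(labels * 20_000, dtype='object')

# A categorical stores each distinct label once plus a small integer code per row
as_category = as_object.astype('category')

bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)
print(f'object:   {bytes_object:,} bytes')
print(f'category: {bytes_category:,} bytes')
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, which is why only low-cardinality columns are converted.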

[TOC](#TOC)

<a id='Sec02C'></a>
#### 2C: Redesign the DataFrame (`format_df`)

Construct a boolean series that represents the target variable (`fatality`) using the corresponding values in `incident_disposition_code`. Also, apply aesthetic changes to help improve the readability of the dataframe object.

In [5]:
def format_df(dfObj):
    """
    This function creates a Pandas series for the target 
    variable (fatality), parses datetime information from 
    existing columns into separate feature variables, and 
    re-orders the columns within the DataFrame object 
    
    Parameter(s)
    ------------
    dfObj: Pandas DataFrame object
        The object to be modified by the function
        
    Returns
    -------
    dfObj: Pandas DataFrame object
    """
    # Rename select columns
    dfObj.rename(columns={'initial_severity_level_code':'initial_severity_level',
                          'final_severity_level_code':'final_severity_level',
                          'dispatch_response_seconds_qy':'dispatch_time',
                          'incident_travel_tm_seconds_qy':'travel_time',
                          'incident_response_seconds_qy':'response_time',
                          'intptlat':'latitude',
                          'intptlong':'longitude'},inplace=True)
    
    # Create a series for a new feature variable: life_threatening
    dfObj['life_threatening'] = [True if ((val ==1)|
                                         (val == 2)|
                                         (val == 3)) else False 
                                 for val in dfObj['final_severity_level'].astype('int64')]
    
    # Create a series for the target variable: fatality
    dfObj['fatality'] = np.logical_or(dfObj.incident_disposition_code.astype('int64') == 83,
                                      dfObj.incident_disposition_code.astype('int64') == 96)

    # Create separate columns for time components of the incident
    dfObj['year'] = pd.DatetimeIndex(dfObj.incident_datetime).year
    dfObj['month'] = pd.DatetimeIndex(dfObj.incident_datetime).month
    # weekday_name was removed in pandas 1.0; day_name() is its replacement
    dfObj['weekday'] = pd.DatetimeIndex(dfObj.incident_datetime).day_name().astype('category')
    dfObj['hour'] = pd.DatetimeIndex(dfObj.incident_datetime).hour
    
    # Reorder columns of DataFrame object
    col_order = ['cad_incident_id','incident_datetime',
                 'year','month','hour','weekday','borough',
                 'zipcode','latitude','longitude','aland_sqmi','awater_sqmi',
                 'initial_call_type','initial_severity_level',
                 'final_call_type','final_severity_level',
                 'held_indicator','first_assignment_datetime',
                 'incident_dispatch_area',
                 'dispatch_time','first_activation_datetime',
                 'first_on_scene_datetime','travel_time','response_time',
                 'first_to_hosp_datetime','first_hosp_arrival_datetime',
                 'incident_close_datetime','incident_disposition_code',
                 'life_threatening','fatality']
    dfObj=dfObj[col_order]
    
    return dfObj
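
The two derived columns in `format_df` can be sketched with `Series.isin`, which expresses "value is one of these codes" directly. This is an illustrative alternative on a made-up frame, not the function above; the disposition codes 83 and 96 and severity levels 1 through 3 are taken from `format_df`, while the frame `toy` and its values are assumptions.

```python
import pandas as pd

# Made-up rows using the post-rename column names from format_df
toy = pd.DataFrame({
    'final_severity_level': [1, 4, 3, 7],
    'incident_disposition_code': [82.0, 83.0, 96.0, 93.0],
})

# Severity levels 1-3 are treated as life-threatening in format_df
toy['life_threatening'] = toy['final_severity_level'].isin([1, 2, 3])

# Disposition codes 83 and 96 define the target variable in format_df
toy['fatality'] = toy['incident_disposition_code'].isin([83, 96])

print(toy)
```

Both new columns are plain boolean Series, so they can be summed to get class counts or passed straight to a classifier as labels.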

[TOC](#TOC)

***

## 3. Data Acquisition

<a id='Sec03A'></a>
#### 3A: Inspect sample of EMS incident dataset

In [6]:
# Import sample of EMS incident dataset
"""
    The original CSV file can be exported from 
    https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj
"""
file_path = '../data/EMS_Incident_Dispatch_Data.csv'

# Read CSV data into a Pandas DataFrame
preview_df = pd.read_csv(file_path,header=0,nrows=1000)

In [7]:
preview_df.shape

(1000, 32)

In [8]:
preview_df.head()

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,ZIPCODE,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR
0,130010001,01/01/2013 12:00:04 AM,RESPIR,4,RESPIR,4,01/01/2013 12:01:45 AM,Y,101,01/01/2013 12:01:51 AM,...,10472.0,43.0,18.0,209.0,12.0,15.0,N,N,N,N
1,130010002,01/01/2013 12:00:19 AM,CARD,3,CARD,3,01/01/2013 12:01:18 AM,Y,59,01/01/2013 12:02:08 AM,...,10454.0,40.0,8.0,201.0,7.0,15.0,N,N,N,N
2,130010004,01/01/2013 12:01:04 AM,ARREST,1,ARREST,1,01/01/2013 12:01:33 AM,Y,29,01/01/2013 12:01:58 AM,...,11418.0,102.0,29.0,409.0,27.0,5.0,N,N,N,N
3,130010005,01/01/2013 12:01:16 AM,SICK,6,SICK,6,01/01/2013 12:02:12 AM,Y,56,01/01/2013 12:02:55 AM,...,10453.0,46.0,14.0,205.0,10.0,15.0,N,N,N,N
4,130010006,01/01/2013 12:01:26 AM,INJURY,5,INJURY,5,01/01/2013 12:01:58 AM,Y,32,01/01/2013 12:02:55 AM,...,10457.0,48.0,15.0,206.0,10.0,15.0,N,N,N,N


In [9]:
preview_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 32 columns):
CAD_INCIDENT_ID                   1000 non-null int64
INCIDENT_DATETIME                 1000 non-null object
INITIAL_CALL_TYPE                 1000 non-null object
INITIAL_SEVERITY_LEVEL_CODE       1000 non-null int64
FINAL_CALL_TYPE                   1000 non-null object
FINAL_SEVERITY_LEVEL_CODE         1000 non-null int64
FIRST_ASSIGNMENT_DATETIME         935 non-null object
VALID_DISPATCH_RSPNS_TIME_INDC    1000 non-null object
DISPATCH_RESPONSE_SECONDS_QY      1000 non-null int64
FIRST_ACTIVATION_DATETIME         932 non-null object
FIRST_ON_SCENE_DATETIME           890 non-null object
VALID_INCIDENT_RSPNS_TIME_INDC    1000 non-null object
INCIDENT_RESPONSE_SECONDS_QY      889 non-null float64
INCIDENT_TRAVEL_TM_SECONDS_QY     890 non-null float64
FIRST_TO_HOSP_DATETIME            522 non-null object
FIRST_HOSP_ARRIVAL_DATETIME       513 non-null object
INCIDENT_CLOSE_DATETIME

[TOC](#TOC)

<a id='Sec03B'></a>
#### 3B: Import full EMS incident dataset (from CSV file)

In [10]:
# Import full EMS incident dataset
"""
    The original CSV file can be exported from 
    https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj
"""
file_path1 = '../data/EMS_Incident_Dispatch_Data.csv'

# Read CSV data into a Pandas DataFrame object
ems_df = pd.read_csv(file_path1,header=0)
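
With default settings, pandas infers every column's type while reading the file. As a hedged sketch of one alternative, types can be declared at read time and timestamps parsed with an explicit format. The three-row CSV text below is made up to resemble a few of the EMS columns, and the chosen `usecols` and `dtype` values are assumptions for illustration only.

```python
import io
import pandas as pd

# Three made-up rows laid out like a few of the EMS columns
csv_text = (
    "CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,ZIPCODE,HELD_INDICATOR\n"
    "1,01/01/2013 12:00:04 AM,RESPIR,10472,N\n"
    "2,01/01/2013 12:00:19 AM,CARD,,N\n"
    "3,01/01/2013 12:01:04 AM,ARREST,11418,Y\n"
)

# Read only the columns of interest and declare their dtypes up front
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['CAD_INCIDENT_ID', 'INCIDENT_DATETIME', 'ZIPCODE', 'HELD_INDICATOR'],
    dtype={'ZIPCODE': 'float64', 'HELD_INDICATOR': 'category'},
)

# An explicit format avoids per-value format guessing on the timestamp strings
sample['INCIDENT_DATETIME'] = pd.to_datetime(
    sample['INCIDENT_DATETIME'], format='%m/%d/%Y %I:%M:%S %p')

print(sample.dtypes)
```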



In [11]:
# Convert all column header names to lowercase
ems_df.rename(columns={i:i.lower().strip() for i in ems_df.columns},inplace=True)
ems_df.columns

Index(['cad_incident_id', 'incident_datetime', 'initial_call_type',
       'initial_severity_level_code', 'final_call_type',
       'final_severity_level_code', 'first_assignment_datetime',
       'valid_dispatch_rspns_time_indc', 'dispatch_response_seconds_qy',
       'first_activation_datetime', 'first_on_scene_datetime',
       'valid_incident_rspns_time_indc', 'incident_response_seconds_qy',
       'incident_travel_tm_seconds_qy', 'first_to_hosp_datetime',
       'first_hosp_arrival_datetime', 'incident_close_datetime',
       'held_indicator', 'incident_disposition_code', 'borough', 'atom',
       'incident_dispatch_area', 'zipcode', 'policeprecinct',
       'citycouncildistrict', 'communitydistrict', 'communityschooldistrict',
       'congressionaldistrict', 'reopen_indicator', 'special_event_indicator',
       'standby_indicator', 'transfer_indicator'],
      dtype='object')

In [12]:
# Obtain shape of DataFrame w/ EMS incident data
ems_df.shape

(8557848, 32)

A downloadable description of each dataset field is available at [NYC Open Data](https://data.cityofnewyork.us/Public-Safety/EMS-Incident-Dispatch-Data/76xm-jjuj) in the _Attachments_ section under the file name **EMS_incident_dispatch_data_description.xlsx**.

[TOC](#TOC)

<a id='Sec03C'></a>
#### 3C: Import geographical dataset (from tab-delimited file)

In [13]:
# Import geographic data for all U.S. ZIP Code Tabulation Areas
"""
    The original TXT file can be exported from
    https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2019_Gazetteer/2019_Gaz_zcta_national.zip
"""
file_path2 = '../data/2019_Gaz_zcta_national.txt'

# Read tab-delimited text data into a Pandas DataFrame object
geo_df = pd.read_table(file_path2)

In [14]:
# Convert all column header names to lowercase
geo_df.rename(columns={i:i.lower().strip() for i in geo_df.columns},inplace=True)
geo_df.columns

Index(['geoid', 'aland', 'awater', 'aland_sqmi', 'awater_sqmi', 'intptlat',
       'intptlong'],
      dtype='object')

In [15]:
# Obtain shape of DataFrame w/ geographic data
geo_df.shape

(33144, 7)

The text file contains geographic coordinates (latitude and longitude) for all ZIP Code Tabulation Areas (ZCTAs) within the United States. A downloadable file is available from the [U.S. Census Bureau](https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html) under the file name **ZIP Code Tabulation Areas**.

[TOC](#TOC)

<a id='Sec03D'></a>
#### 3D: Inspect raw, merged dataset

In [16]:
# Join the two DataFrame objects (pd.merge defaults to an inner join)
df = pd.merge(ems_df,geo_df,left_on='zipcode',right_on='geoid')

In [17]:
df.shape

(8368586, 39)
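
The merged frame has fewer rows than `ems_df` because an inner join keeps only rows whose key appears in both frames. One way to audit which rows fall out is a left join with `indicator=True`. The sketch below uses two made-up frames; `ems_toy`, `geo_toy`, and their values are assumptions, and the key columns are cast to a shared dtype purely to keep the example simple.

```python
import numpy as np
import pandas as pd

# Made-up incident rows: one missing ZIP code, one ZIP code with no ZCTA match
ems_toy = pd.DataFrame({
    'cad_incident_id': [1, 2, 3, 4],
    'zipcode': [10472.0, 10454.0, np.nan, 99999.0],
})
# Made-up geographic rows (area values are placeholders)
geo_toy = pd.DataFrame({
    'geoid': [10472, 10454],
    'aland_sqmi': [1.0, 2.0],
})
# Cast so that both key columns share the same dtype
geo_toy['geoid'] = geo_toy['geoid'].astype('float64')

# A left join keeps every incident; the _merge column records whether it matched
audit = pd.merge(ems_toy, geo_toy, how='left',
                 left_on='zipcode', right_on='geoid', indicator=True)
unmatched = audit[audit['_merge'] == 'left_only']
print(unmatched[['cad_incident_id', 'zipcode']])

# The default inner join silently drops those same rows
inner = pd.merge(ems_toy, geo_toy, left_on='zipcode', right_on='geoid')
print(len(ems_toy) - len(inner), 'rows dropped by the inner join')
```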

In [18]:
df.head()

Unnamed: 0,cad_incident_id,incident_datetime,initial_call_type,initial_severity_level_code,final_call_type,final_severity_level_code,first_assignment_datetime,valid_dispatch_rspns_time_indc,dispatch_response_seconds_qy,first_activation_datetime,...,special_event_indicator,standby_indicator,transfer_indicator,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong
0,130010001,01/01/2013 12:00:04 AM,RESPIR,4,RESPIR,4,01/01/2013 12:01:45 AM,Y,101,01/01/2013 12:01:51 AM,...,N,N,N,10472,2729341,0,1.054,0.0,40.829556,-73.86931
1,130010022,01/01/2013 12:05:52 AM,EDP,7,EDP,7,01/01/2013 12:06:04 AM,Y,12,01/01/2013 12:06:24 AM,...,N,N,N,10472,2729341,0,1.054,0.0,40.829556,-73.86931
2,130010086,01/01/2013 12:20:37 AM,SICK,6,SICK,6,01/01/2013 12:21:29 AM,Y,52,01/01/2013 12:21:50 AM,...,N,N,N,10472,2729341,0,1.054,0.0,40.829556,-73.86931
3,130010615,01/01/2013 01:53:11 AM,INJURY,4,INJURY,4,01/01/2013 01:53:36 AM,Y,25,01/01/2013 01:53:52 AM,...,N,N,N,10472,2729341,0,1.054,0.0,40.829556,-73.86931
4,130010624,01/01/2013 01:54:28 AM,SICK,4,SICK,4,01/01/2013 01:55:10 AM,Y,42,01/01/2013 01:55:32 AM,...,N,N,N,10472,2729341,0,1.054,0.0,40.829556,-73.86931


In [19]:
# null_counts was renamed to show_counts in pandas 1.2
df.info(verbose=True,show_counts=True,memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8368586 entries, 0 to 8368585
Data columns (total 39 columns):
cad_incident_id                   8368586 non-null int64
incident_datetime                 8368586 non-null object
initial_call_type                 8368586 non-null object
initial_severity_level_code       8368586 non-null int64
final_call_type                   8368586 non-null object
final_severity_level_code         8368586 non-null int64
first_assignment_datetime         8312407 non-null object
valid_dispatch_rspns_time_indc    8368586 non-null object
dispatch_response_seconds_qy      8368586 non-null int64
first_activation_datetime         8297820 non-null object
first_on_scene_datetime           8106919 non-null object
valid_incident_rspns_time_indc    8368586 non-null object
incident_response_seconds_qy      8105959 non-null float64
incident_travel_tm_seconds_qy     8106798 non-null float64
first_to_hosp_datetime            5935067 non-null object
first_hosp_arrival_

In [20]:
df.memory_usage(deep=True)

Index                              66948688
cad_incident_id                    66948688
incident_datetime                 661118294
initial_call_type                 518136144
initial_severity_level_code        66948688
final_call_type                   518235008
final_severity_level_code          66948688
first_assignment_datetime         658477881
valid_dispatch_rspns_time_indc    518852332
dispatch_response_seconds_qy       66948688
first_activation_datetime         657792292
first_on_scene_datetime           648819945
valid_incident_rspns_time_indc    518852332
incident_response_seconds_qy       66948688
incident_travel_tm_seconds_qy      66948688
first_to_hosp_datetime            546742901
first_hosp_arrival_datetime       544847767
incident_close_datetime           660970902
held_indicator                    518852332
incident_disposition_code          66948688
borough                           542579701
atom                              468511012
incident_dispatch_area          

[TOC](#TOC)

***

## 4. Data Wrangling

<a id='Sec04A'></a>
#### 4A: Apply custom functions to dataset

In [21]:
# Apply drop_invalid to DataFrame object
df = drop_invalid(df)

In [22]:
# Apply drop_immaterial to DataFrame object
df = drop_immaterial(df)

In [23]:
# Apply reduce_memory to DataFrame object
df = reduce_memory(df)

In [24]:
# Apply format_df to DataFrame object
df = format_df(df)

In [25]:
# Set index to 'cad_incident_id'
df.set_index(['cad_incident_id'],inplace=True)

[TOC](#TOC) 

<a id='Sec04B'></a>
#### 4B: Inspect clean dataset

In [26]:
df.shape

(7988028, 29)

In [27]:
df.head()

cad_incident_id,incident_datetime,year,month,hour,weekday,borough,zipcode,latitude,longitude,aland_sqmi,...,first_activation_datetime,first_on_scene_datetime,travel_time,response_time,first_to_hosp_datetime,first_hosp_arrival_datetime,incident_close_datetime,incident_disposition_code,life_threatening,fatality
130010001,2013-01-01 00:00:04,2013,1,0,Tuesday,BRONX,10472.0,40.829556,-73.86931,1.054,...,2013-01-01 00:01:51,2013-01-01 00:13:21,696.0,797.0,2013-01-01 00:28:49,2013-01-01 00:38:15,2013-01-01 01:04:56,82.0,False,False
130010022,2013-01-01 00:05:52,2013,1,0,Tuesday,BRONX,10472.0,40.829556,-73.86931,1.054,...,2013-01-01 00:06:24,2013-01-01 00:14:46,522.0,534.0,2013-01-01 00:48:57,2013-01-01 01:02:02,2013-01-01 01:46:14,82.0,False,False
130010086,2013-01-01 00:20:37,2013,1,0,Tuesday,BRONX,10472.0,40.829556,-73.86931,1.054,...,2013-01-01 00:21:50,2013-01-01 00:32:14,645.0,697.0,NaT,NaT,2013-01-01 01:03:50,93.0,False,False
130010615,2013-01-01 01:53:11,2013,1,1,Tuesday,BRONX,10472.0,40.829556,-73.86931,1.054,...,2013-01-01 01:53:52,2013-01-01 01:56:54,198.0,223.0,2013-01-01 02:12:28,2013-01-01 02:26:09,2013-01-01 03:03:36,82.0,False,False
130010624,2013-01-01 01:54:28,2013,1,1,Tuesday,BRONX,10472.0,40.829556,-73.86931,1.054,...,2013-01-01 01:55:32,2013-01-01 01:59:26,256.0,298.0,2013-01-01 02:14:34,2013-01-01 02:23:06,2013-01-01 02:44:27,82.0,False,False


In [28]:
# null_counts was renamed to show_counts in pandas 1.2
df.info(verbose=True,show_counts=True,memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7988028 entries, 130010001 to 183530054
Data columns (total 29 columns):
incident_datetime              7988028 non-null datetime64[ns]
year                           7988028 non-null int64
month                          7988028 non-null int64
hour                           7988028 non-null int64
weekday                        7988028 non-null category
borough                        7988028 non-null category
zipcode                        7988028 non-null category
latitude                       7988028 non-null float64
longitude                      7988028 non-null float64
aland_sqmi                     7988028 non-null float64
awater_sqmi                    7988028 non-null float64
initial_call_type              7988028 non-null object
initial_severity_level         7988028 non-null int64
final_call_type                7988028 non-null object
final_severity_level           7988028 non-null int64
held_indicator                 7988028 

In [29]:
df.memory_usage(deep=True)

Index                           63904224
incident_datetime               63904224
year                            63904224
month                           63904224
hour                            63904224
weekday                          7988797
borough                          7988578
zipcode                         15987960
latitude                        63904224
longitude                       63904224
aland_sqmi                      63904224
awater_sqmi                     63904224
initial_call_type              494612046
initial_severity_level          63904224
final_call_type                494709109
final_severity_level            63904224
held_indicator                   7988232
first_assignment_datetime       63904224
incident_dispatch_area           7991432
dispatch_time                   63904224
first_activation_datetime       63904224
first_on_scene_datetime         63904224
travel_time                     63904224
response_time                   63904224
first_to_hosp_da

[TOC](#TOC) 

<a id='Sec04C'></a>
#### 4C: Export clean dataset (to CSV)

In [30]:
output_path = '../data/clean_EMS_data_from_csv.csv'
print('Exporting DataFrame to CSV...')
df.to_csv(output_path,index=False,compression='gzip')
print('DataFrame successfully exported to CSV using \'gzip\' compression.')

Exporting DataFrame to CSV...
DataFrame successfully exported to CSV using 'gzip' compression.
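
Two details of the export are worth checking when the file is read back. First, the path ends in `.csv` although the content is gzip-compressed, so the reader must be told the compression explicitly. Second, `index=False` omits the index from the file, which means the `cad_incident_id` values set as the index earlier are not written. The sketch below is a round trip on a made-up two-row frame that keeps the index instead; the file name and values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up two-row frame indexed by incident ID, like the clean dataset
toy = pd.DataFrame(
    {'response_time': [797.0, 534.0], 'fatality': [False, False]},
    index=pd.Index([130010001, 130010022], name='cad_incident_id'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')  # hypothetical file name

    # index=True writes cad_incident_id as the first column
    toy.to_csv(path, index=True, compression='gzip')

    # The .csv extension hides the compression, so state it explicitly
    restored = pd.read_csv(path, compression='gzip',
                           index_col='cad_incident_id')

print(restored)
```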


[TOC](#TOC)

<a id='Sec04D'></a>
#### 4D: Process summary

The original EMS incident dataset comprised 8,557,848 observations with 32 variables of mixed data types, and its source file occupied approximately 1.98 GB of hard disk space. The original geographic dataset comprised 33,144 observations with 7 variables, and its source file occupied 6.7 MB in system memory. The DataFrame objects generated from both source files were joined on ZIP code, which is labeled `zipcode` in the EMS data and `geoid` in the geographic data.

The resulting raw dataset consists of 8,368,586 observations with 39 variables and occupies 11.1 GB in system memory. Because the merge was an inner join, the 189,262 EMS records without a matching ZIP Code Tabulation Area were excluded. The target variable (`fatality`) was created by applying a boolean filter to the `incident_disposition_code` column within the DataFrame, which indicates the outcome of each EMS incident.

After all data pre-processing is complete, the resulting clean dataset consists of 7,988,028 observations of mixed data types, with a clear target variable and 26 feature variables. Its output file occupies 329 MB of hard disk space, and the DataFrame occupies 2.2 GB in system memory.

[TOC](#TOC)