# Maryland Crash Analysis Prototyping

This Jupyter Notebook contains prototyping of workflows to explore the Maryland crash data for 2016. 

The purpose is to explore the nature of crashes and related spatial and non-spatial variables. The intended output is a set of engineered attributes for each crash feature that can be passed to a Machine Learning model, with the goal of accurately predicting crashes at a given intersection in a given time period. 

The initial prototyping plan includes converting the crash locations to GIS-ready formats so we can explore the data in Insights for ArcGIS and with the Spatial Statistics tools. Some outputs of the Spatial Statistics tools may become candidate features during the feature engineering phase.

Additionally, the workflow may be used in Spatial Statistics workshops to show how the tools help reveal the characteristics of, and important information in, the Maryland crash data. 

Let's get started.

## Data Sources

Maryland 2016 Crash Data:
https://esri.box.com/s/m7dyfet29gtyb9wy02rkrx21j817llr8 

Data Dictionary:
https://data.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crash-Data-Dictionary/7xpx-5fte/data 

Maryland Road Network Layer (with Average Daily Traffic):
http://data.imap.maryland.gov/datasets/maryland-annual-average-daily-traffic-annual-average-daily-traffic-sha-statewide-aadt-lines?geometry=-86.553%2C37.336%2C-67.986%2C40.331

Maryland Road Network Layer (alternate version; the applicant can choose to work with either this or the previous road layer):
https://data.maryland.gov/Transportation/MD-iMAP-Maryland-Road-Centerlines-Local-and-Other-/c6up-awfw


# Phase 1: "Get Data" (Data Exploration and Prototyping)

#### Reason: We want to explore the data in Insights for ArcGIS and run spatial statistics data mining tools such as Hot Spot Analysis (Getis-Ord Gi*) and Emerging Hot Spot Analysis. 

#### Pseudocode

- Create a reference to the folder containing the Excel files
- Write a helper function to convert each Excel file to a feature class
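
The helper described in the pseudocode could look roughly like the sketch below. The function names (`output_names`, `excel_to_feature_class`) and the `_xy_layer` suffix are hypothetical and are not used elsewhere in this notebook; `arcpy` is imported inside the conversion function so the naming helper can be tried without an ArcGIS installation.

```python
import os


def output_names(excel_path):
    """Derive the FGDB table name and feature class name from an Excel path."""
    base = os.path.splitext(os.path.basename(excel_path))[0]
    return base + "_table", base


def excel_to_feature_class(excel_path, fgdb, x_field="LONGITUDE", y_field="LATITUDE"):
    """Convert one Excel file to a point feature class inside fgdb (sketch)."""
    import arcpy  # imported lazily; requires ArcGIS Pro's Python environment

    table_name, fc_name = output_names(excel_path)
    # Excel sheet -> FGDB table
    table = arcpy.ExcelToTable_conversion(
        excel_path, os.path.join(fgdb, table_name)).getOutput(0)
    # Table -> temporary XY event layer built from the coordinate fields
    layer = arcpy.MakeXYEventLayer_management(
        table, x_field, y_field, fc_name + "_xy_layer").getOutput(0)
    # Temporary layer -> permanent feature class
    return arcpy.FeatureClassToFeatureClass_conversion(
        layer, fgdb, fc_name).getOutput(0)


# The naming step can be checked on its own
print(output_names(os.path.join("Inputs", "Crash_Qtr01_2016.xlsx")))
```

Wrapping the three geoprocessing calls in one function keeps the per-file loop short and makes it easier to reuse the conversion for other years of crash data.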

## Approach A: Convert the Maryland Crash data into Local GIS formats (Feature Class, FGDB) and Explore Locally

In [2]:
# Import needed modules
import os
import glob
import arcpy
import arcgis

In [3]:
# Set reference directories
inputs_dir = r"C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Inputs\MarylandData"
workspace_dir = r"C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Work"
outputs_dir = r"C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Outputs"

In [4]:
# Set arcpy config to overwrite outputs by default
arcpy.env.overwriteOutput = True

In [5]:
# Create workspace, checking if one already exists
if arcpy.Exists(os.path.join(workspace_dir, "MarylandCrashData.gdb")):
    print("Workspace found... Using it.")
    fgdb = os.path.join(workspace_dir, "MarylandCrashData.gdb")
else:
    print("Creating workspace...")
    fgdb = arcpy.CreateFileGDB_management(out_folder_path=workspace_dir, out_name="MarylandCrashData").getOutput(0)
fgdb

Workspace found... Using it.


'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb'

In [6]:
# Retrieve Excels from inputs directory
in_excels = glob.glob(os.path.join(inputs_dir, "*.xlsx"))
in_excels

['C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Inputs\\MarylandData\\Crash_Qtr01_2016.xlsx',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Inputs\\MarylandData\\Crash_Qtr02_2016.xlsx',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Inputs\\MarylandData\\Crash_Qtr03_2016.xlsx',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Inputs\\MarylandData\\Crash_Qtr04_2016.xlsx']

In [7]:
# Set reference to the location attributes
in_excel_x_field = "LONGITUDE"
in_excel_y_field = "LATITUDE"

In [8]:
# Iterate on each input excel to convert to feature class format, adding them to a python list
in_fcs_temp_list = []

for in_excel in in_excels:
    # Get the Excel file name without its folder or extension
    excel_name = os.path.splitext(os.path.basename(in_excel))[0]
    print("Converting {0}...".format(excel_name))
    # Convert excel to FGDB table
    fgdb_table = arcpy.ExcelToTable_conversion(Input_Excel_File=in_excel, 
                                               Output_Table=os.path.join(fgdb, excel_name+"_table")).getOutput(0)
    # Build a temporary point layer from the table's longitude/latitude fields
    temp_layer = arcpy.MakeXYEventLayer_management(table=fgdb_table, in_x_field=in_excel_x_field, in_y_field=in_excel_y_field).getOutput(0)
    # Save the temporary layer as a feature class in the FGDB
    data_fc = arcpy.FeatureClassToFeatureClass_conversion(in_features=temp_layer, out_path=fgdb, out_name=excel_name).getOutput(0)
    print("Excel converted.\n")
    in_fcs_temp_list.append(data_fc)
    del temp_layer, data_fc

Converting Crash_Qtr01_2016...
Excel converted.

Converting Crash_Qtr02_2016...
Excel converted.

Converting Crash_Qtr03_2016...
Excel converted.

Converting Crash_Qtr04_2016...
Excel converted.



In [9]:
in_fcs_temp_list

['C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\Crash_Qtr01_2016',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\Crash_Qtr02_2016',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\Crash_Qtr03_2016',
 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\Crash_Qtr04_2016']

In [21]:
# Merge the feature classes into a single feature class for exploration via emerging hot spots
crashes_fc = arcpy.Merge_management(inputs=in_fcs_temp_list, output=os.path.join(fgdb, "MarylandCrashData")).getOutput(0)
crashes_fc

'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\MarylandCrashData'

In [23]:
# Add an incident count field for the crashes_fc to pass as the analysis field in Optimized Hot Spots analysis
arcpy.AddField_management(crashes_fc, "INCIDENT_COUNT", field_type="DOUBLE")

<Result 'C:\\Users\\albe9057\\Documents\\ANieto_SolutionEngineering\\Projects\\MachineLearning\\MarylandCrashPrediction\\Work\\MarylandCrashData.gdb\\MarylandCrashData'>

In [24]:
# Assign an incident count of 1 to every record
with arcpy.da.UpdateCursor(crashes_fc, "INCIDENT_COUNT") as cursor:
    for row in cursor:
        row[0] = 1
        cursor.updateRow(row)

In [25]:
# Quick check
with arcpy.da.SearchCursor(crashes_fc, "INCIDENT_COUNT") as cursor:
    for row in cursor:
        print(row[0])
        break

1.0


In [26]:
# Use a projected coordinate system (NAD 1983 UTM Zone 18N) for spatial analysis
out_coor_system = (
    "PROJCS['NAD_1983_UTM_Zone_18N',"
    "GEOGCS['GCS_North_American_1983',DATUM['D_North_American_1983',"
    "SPHEROID['GRS_1980',6378137.0,298.257222101]],"
    "PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]],"
    "PROJECTION['Transverse_Mercator'],"
    "PARAMETER['False_Easting',500000.0],PARAMETER['False_Northing',0.0],"
    "PARAMETER['Central_Meridian',-75.0],PARAMETER['Scale_Factor',0.9996],"
    "PARAMETER['Latitude_Of_Origin',0.0],UNIT['Meter',1.0]]"
)
in_coor_system = (
    "GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',"
    "SPHEROID['WGS_1984',6378137.0,298.257223563]],"
    "PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]"
)
crashes_proj_fc = arcpy.management.Project(
    crashes_fc,
    os.path.join(fgdb, "MarylandCrashData_Projected"),
    out_coor_system,
    "WGS_1984_(ITRF00)_To_NAD_1983",
    in_coor_system,
    "NO_PRESERVE_SHAPE",
    None,
    "NO_VERTICAL",
)

We are now ready to run Spatial Stats tools on this data!

<img src="doc/img/MarylandCrash_Points.JPG"></img>

In [27]:
# Run Optimized Hot Spots analysis
ohs_firstrun = arcpy.stats.OptimizedHotSpotAnalysis(crashes_proj_fc, os.path.join(fgdb, "MarylandCrash_OHS_01"), None, "COUNT_INCIDENTS_WITHIN_HEXAGON_POLYGONS").getOutput(0)

Our first run of Optimized Hot Spots analysis does not yield a lot of information...

<img src="doc/img/OHS_01.JPG"></img>

Let's take a look at the tool messaging to see if we can refine the run using Getis-Ord Gi* Hot Spot Analysis...

#### Optimized Hot Spots - Messaging at First Run
##### Messages
Start Time: Tuesday, October 31, 2017 3:51:20 PM
Running script OptimizedHotSpotAnalysis...
************************** Initial Data Assessment ***************************
Making sure there are enough incidents for analysis....
- There are 117977 valid input features.
Looking for locational outliers....
- There were 663 outlier locations; these will not be used to compute the hexagon size.
**************************** Incident Aggregation ****************************
Creating hexagon mesh to use for aggregating incidents....
- Using a hexagon of width 4763.1397 Meters and height 4125.0000 Meters
Counting the number of incidents in each hexagon....
- Analysis is performed on all hexagons containing at least one incident.
Evaluating incident counts and number of polygons....
- The aggregation process resulted in 1726 weighted polygons.
- Incident Count Properties:
        Min:          1.0000
        Max:       5147.0000
        Mean:        68.3528
        Std. Dev.:  225.6602
***************************** Scale of Analysis ******************************
Looking for an optimal scale of analysis by assessing the intensity of clustering at increasing distances....
- The optimal fixed distance band is based on peak clustering found at 15099.5727 Meters
***************************** Hot Spot Analysis ******************************
Finding statistically significant clusters of high and low incident counts....
- There are 226 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
- 1.9% of features had less than 8 neighbors based on the distance band of 15099.5727 Meters
*********************************** Output ***********************************
Creating output feature class: C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Work\MarylandCrashAnalysis\MarylandCrashAnalysis.gdb\MarylandCrashData_OHS
- Red output features represent hot spots where high incident counts cluster.
- Blue output features represent cold spots where low incident counts cluster.
Completed script OptimizedHotSpotAnalysis...
Succeeded at Tuesday, October 31, 2017 3:51:41 PM (Elapsed Time: 21.36 seconds)
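
The hot spot step in the messages above is built on the Getis-Ord Gi* statistic. As a rough illustration of what that statistic measures, the sketch below computes Gi* z-scores in plain Python with binary fixed-distance weights. It is not the ArcGIS implementation: it applies no FDR correction, and the function name, sample coordinates, and counts are all hypothetical.

```python
import math


def getis_ord_gi_star(coords, values, distance_band):
    """Return a Gi* z-score per feature using binary fixed-distance weights.

    coords: list of (x, y) tuples in projected units (for example meters).
    values: list of numbers, such as incident counts per hexagon.
    distance_band: features within this distance, including the feature
        itself, get weight 1; all others get weight 0.
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of the analysis field
    s = math.sqrt(max(sum(v * v for v in values) / n - mean * mean, 0.0))
    z_scores = []
    for xi, yi in coords:
        weights = [1.0 if math.hypot(xi - xj, yi - yj) <= distance_band else 0.0
                   for xj, yj in coords]
        sum_w = sum(weights)
        sum_w2 = sum(w * w for w in weights)
        weighted_sum = sum(w * v for w, v in zip(weights, values))
        numerator = weighted_sum - mean * sum_w
        spread = max((n * sum_w2 - sum_w ** 2) / (n - 1), 0.0)
        denominator = s * math.sqrt(spread)
        # Undefined when every feature is a neighbor or all values are equal
        z_scores.append(numerator / denominator if denominator > 0 else float("nan"))
    return z_scores


# Hypothetical bins along a line: high counts on the left, low on the right
demo_coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
demo_counts = [10, 12, 11, 1, 2, 1]
for coord, z in zip(demo_coords, getis_ord_gi_star(demo_coords, demo_counts, 1.0)):
    print(coord, round(z, 3))
```

Bins surrounded by high counts receive positive scores and bins surrounded by low counts receive negative ones. Widening `distance_band` pulls more neighbors into each local sum, which is why the scale of analysis reported in the messages matters so much for the result.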


The item that stands out to me (visually and through the tool messaging) is the size of the hexagonal bins. 
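
To get a feel for how bin size scales, note that the width and height reported in the tool messaging are consistent with a regular hexagon whose flat-to-flat height is 4125 meters. The sketch below derives side length, width, and area from a height; the function name and the smaller candidate heights are hypothetical values chosen only for illustration.

```python
import math


def hexagon_dimensions(height):
    """Side, width, and area of a regular hexagon from its flat-to-flat height."""
    side = height / math.sqrt(3)
    width = 2 * side
    area = 1.5 * math.sqrt(3) * side ** 2
    return {"side": side, "width": width, "area": area}


# 4125 m comes from the tool messaging; the other heights are hypothetical candidates
for height in (4125.0, 2000.0, 1000.0, 500.0):
    dims = hexagon_dimensions(height)
    print("height {0:.0f} m: width {1:.4f} m, area {2:.3f} sq km".format(
        height, dims["width"], dims["area"] / 1e6))
```

Because area grows with the square of the height, halving the bin height produces bins with a quarter of the area, so the number of bins needed to cover the state rises quickly as the bins shrink.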

Let's test with different bin sizes.

<img src="doc/img/OHS_02.JPG"></img>

<img src="doc/img/OHS_03.JPG"></img>

Let's refine this further.

#### Baltimore Tests

##### Messages
Start Time: Tuesday, October 31, 2017 4:18:00 PM
Running script OptimizedHotSpotAnalysis...
************************** Initial Data Assessment ***************************
Making sure there are enough incidents for analysis....
- There are 25763 valid input features.
Looking for locational outliers....
- There were 27 outlier locations; these will not be used to compute the hexagon size.
**************************** Incident Aggregation ****************************
Creating hexagon mesh to use for aggregating incidents....
- Using a hexagon of width 3922.5177 Meters and height 3397.0000 Meters
Counting the number of incidents in each hexagon....
- Analysis is performed on all hexagons containing at least one incident.
Evaluating incident counts and number of polygons....
- The aggregation process resulted in 78 weighted polygons.
- Incident Count Properties:
        Min:          1.0000
        Max:       3424.0000
        Mean:       330.2949
        Std. Dev.:  670.8273
***************************** Scale of Analysis ******************************
Looking for an optimal scale of analysis by assessing the intensity of clustering at increasing distances....
- No optimal distance was found using this method.
Determining an optimal distance using the spatial distribution of features....
- The optimal fixed distance band is based on the average distance to 3 nearest neighbors: 13834.0000 Meters
***************************** Hot Spot Analysis ******************************
Finding statistically significant clusters of high and low incident counts....
- There are 48 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
- 23.1% of features had less than 8 neighbors based on the distance band of 13834.0000 Meters
*********************************** Output ***********************************
Creating output feature class: C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Work\MarylandCrashAnalysis\MarylandCrashAnalysis.gdb\BaltimoreCrashData_Projected_OHS_01
- Red output features represent hot spots where high incident counts cluster.
- Blue output features represent cold spots where low incident counts cluster.
Completed script OptimizedHotSpotAnalysis...
Succeeded at Tuesday, October 31, 2017 4:18:15 PM (Elapsed Time: 15.19 seconds)
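
In this run no clustering peak was found, so the tool fell back to a neighbor-based distance band and then reported the share of features with fewer than 8 neighbors. The messages do not show the exact computation. The sketch below assumes one plausible reading: the band is the mean, over all features, of the distance to each feature's k-th nearest neighbor. All function names and sample points are hypothetical.

```python
import math


def kth_neighbor_distances(coords, k):
    """Distance from each point to its k-th nearest other point."""
    result = []
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(coords) if j != i)
        result.append(dists[k - 1])
    return result


def average_kth_neighbor_distance(coords, k):
    """Mean k-th nearest neighbor distance, used here as a candidate distance band."""
    dists = kth_neighbor_distances(coords, k)
    return sum(dists) / len(dists)


def share_with_few_neighbors(coords, distance_band, min_neighbors=8):
    """Fraction of points with fewer than min_neighbors other points inside the band."""
    few = 0
    for i, (xi, yi) in enumerate(coords):
        count = sum(1 for j, (xj, yj) in enumerate(coords)
                    if j != i and math.hypot(xi - xj, yi - yj) <= distance_band)
        if count < min_neighbors:
            few += 1
    return few / len(coords)


# Hypothetical bin centers spaced one unit apart along a line
demo_points = [(0, 0), (1, 0), (2, 0), (3, 0)]
band = average_kth_neighbor_distance(demo_points, k=2)
print("distance band:", band)
print("share with fewer than 2 neighbors:",
      share_with_few_neighbors(demo_points, band, min_neighbors=2))
```

With only 78 aggregated polygons in the Baltimore run, a large share of features with few neighbors is a hint that the bins are too coarse for the study area, which supports trying smaller bins.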

<img src="doc/img/Baltimore_OHS01.jpg"></img>

<img src="doc/img/Baltimore_OHS02.jpg"></img>

<img src="doc/img/Baltimore_OHS03.jpg"></img>

## Spatiotemporal Trend Exploration with Emerging Hot Spots

#### Run 01: Maryland

##### Parameters
| Parameter | Value |
|---|---|
| Input Features | Maryland Analysis\MarylandCrashData_Projected |
| Output Space Time Cube | C:\Users\albe9057\Documents\ANieto_SolutionEngineering\Projects\MachineLearning\MarylandCrashPrediction\Work\STCs\Maryland_STC_01.nc |
| Time Field | ACCIDENT_DATE |
| Template Cube | |
| Time Step Interval | |
| Time Step Alignment | END_TIME |
| Reference Time | |
| Distance Interval | |
| Summary Fields | |
| Aggregation Shape Type | HEXAGON_GRID |
| Defined Polygon Locations | |
| Location ID | |
 
##### Messages
Start Time: Tuesday, October 31, 2017 4:33:28 PM
Running script CreateSpaceTimeCube...
 WARNING 110035: The default Distance Interval is 4125 meters.
 WARNING 110013: The default Time Step Interval is 4 days.
The space time cube has aggregated 117977 points into 7371 hexagon grid locations over 92 time step intervals.  Each location has a height of 4125 meters, a width of 4763.14 meters, sides of 2381.57 meters, and an area of 14735963.51 square meters.  The entire space time cube spans an area 419156.3 meters west to east and 259875 meters north to south.  Each of the time step intervals is 4 days in duration so the entire time period covered by the space time cube is 368 days.  Of the 7371 total locations, 1726 (23.42%) contain at least one point for at least one time step interval.  These 1726 locations comprise 158792 space time bins of which 37613 (23.69%) have point counts greater than zero.  There is a statistically significant increase in point counts over time.

---------- Space Time Cube Characteristics -----------
Input feature time extent          2016-01-01 00:00:00
                                to 2016-12-31 00:00:00
                                                      
Number of time steps                                92
Time step interval                              4 days
Time step alignment                                End
                                                      
First time step temporal bias                   75.00%
First time step interval                         after
                                   2015-12-29 00:00:00
                                       to on or before
                                   2016-01-02 00:00:00
                                                      
Last time step temporal bias                     0.00%
Last time step interval                          after
                                   2016-12-27 00:00:00
                                       to on or before
                                   2016-12-31 00:00:00
                                                      
Cube extent across space       (coordinates in meters)
Min X                                       83206.5504
Min Y                                     4171536.3384
Max X                                      502362.8459
Max Y                                     4433473.8384
Rows                                                63
Columns                                            117
Total bins                                      678132

------------- Overall Data Trend - COUNT -------------
Trend direction                             Increasing
Trend statistic                                 3.7341
Trend p-value                                   0.0002
Completed script CreateSpaceTimeCube...
Succeeded at Tuesday, October 31, 2017 4:33:35 PM (Elapsed Time: 6.69 seconds)
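
The cube figures in the messages above can be cross-checked with a little arithmetic. The sketch below reproduces the bin counts from the reported rows, columns, and time steps, and shows one way the 92 end-aligned time steps and the 75% first-step temporal bias can arise. The function name is hypothetical, and the assumption that each step covers "after start, up to and including end" is taken from the wording of the messages rather than from the tool's source.

```python
from datetime import datetime, timedelta


def end_aligned_time_steps(first, last, step_days):
    """Bin a time extent into fixed steps aligned to the last timestamp.

    Each step is assumed to cover (start, end], so one extra step is added
    to make sure the first timestamp falls inside the cube.
    """
    step = timedelta(days=step_days)
    n_steps = (last - first) // step + 1
    cube_start = last - n_steps * step
    # Share of the first step that falls before any data exists
    first_step_bias = (first - cube_start) / step
    return n_steps, cube_start, first_step_bias


n_steps, cube_start, bias = end_aligned_time_steps(
    datetime(2016, 1, 1), datetime(2016, 12, 31), step_days=4)
print("time steps:", n_steps)
print("cube starts after:", cube_start)
print("first step temporal bias: {0:.0%}".format(bias))

# Values below are copied from the tool messages
rows, columns = 63, 117
locations = rows * columns
total_bins = locations * n_steps
locations_with_data = 1726
bins_at_data_locations = locations_with_data * n_steps
bins_with_points = 37613
print("locations:", locations, "total bins:", total_bins)
print("share of locations with data: {0:.2%}".format(locations_with_data / locations))
print("share of non-empty bins: {0:.2%}".format(bins_with_points / bins_at_data_locations))
```

Less than a quarter of the bins at locations with data actually contain crashes, which is worth keeping in mind when choosing the Distance Interval and Time Step Interval for later runs.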


## Approach B: Convert the Maryland Crash data into Distributed GIS Formats (Feature Service, Hosted Layer)

# Phase 2: Feature Engineering (Clean, Prepare, & Manipulate Data)

# Phase 3: Train Model

# Phase 4: Test Model Performance

# Phase 5: Document, Iterate, and Improve