# Introduction: Business Problem & Project Aims

Each year about 1.25 million people die in road traffic accidents, and a further 20-50 million are injured or disabled. Being able to predict where accidents are likely to happen could help reduce those numbers. For example, routing software could steer vehicles away from the most dangerous areas, which becomes more relevant as driverless cars arrive. Predictions could also help insurers estimate risk, and help governments and local road authorities plan road maintenance and improvements more efficiently. The aim of this project is to predict where traffic accidents are likely to occur.

For this problem, we didn’t want to look only at traditional structured data and machine learning models. Instead, we wanted to find out whether satellite imagery could be combined with other datasets to improve our ability to predict where traffic accidents are likely to occur. To test this, we built four models:

*Model 1* uses a combination of accident, population density and traffic data from the UK, focusing on accidents in London. Several machine learning models were built to see whether the level of accident severity could be predicted.

*Model 2* uses satellite images of London that were scraped using the Google Maps Static API and fed into a Convolutional Neural Network (CNN) to predict where traffic accidents are likely to occur.

*Model 3* then uses the Keras functional API to combine the top features from model 1 with image features extracted by a CNN (similar to model 2), creating a mixed-input (mixed-data) model. Each data type is fed into its own deep learning model, and their outputs are combined in the final layers to predict whether a given area is likely to have traffic accidents or not.

*Model 4* uses the same model architecture as model 3, but applies it to the task of distinguishing between areas with no traffic accidents and areas with serious or fatal traffic accidents, in order to predict the locations of the worst traffic accidents.
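As a rough illustration of the image collection step behind Model 2, the sketch below builds a request URL for one satellite tile from the Google Maps Static API. The helper name `satellite_tile_url`, the zoom level, the image size, the example coordinate and the placeholder API key are all assumptions made for illustration; the values actually used in the project are not stated here.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Maps Static API.
STATIC_MAPS_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def satellite_tile_url(lat, lon, api_key, zoom=18, size_px=640):
    """Build the URL for one square satellite image centred on (lat, lon).

    zoom and size_px are illustrative defaults, not the project's settings.
    """
    params = {
        "center": f"{lat:.6f},{lon:.6f}",
        "zoom": zoom,
        "size": f"{size_px}x{size_px}",
        "maptype": "satellite",
        "key": api_key,
    }
    return f"{STATIC_MAPS_ENDPOINT}?{urlencode(params)}"


# Hypothetical point in west London and a placeholder key.
url = satellite_tile_url(51.505538, -0.198465, api_key="YOUR_API_KEY")
print(url)
```

Downloading the image is then an ordinary HTTP GET on that URL, repeated over a grid of coordinates covering the study area.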

# The Data

The dataset consists of traffic accidents in England and Wales from 2013-2017 and has 32 columns.
The source files provide detailed road safety data about the circumstances of personal injury road accidents in GB from 1979, the types of vehicles involved and the consequential casualties. The statistics relate only to personal injury accidents on public roads that are reported to the police and subsequently recorded. All variables are stored as codes rather than text strings.

Source: https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data
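Because the variables are stored as codes, it can help to map them to readable labels when exploring the data. The sketch below does this for `Accident_Severity`. The mapping (1 = Fatal, 2 = Serious, 3 = Slight) is an assumption taken from the lookup tables published alongside the road safety data and should be checked against them; the helper `decode_severity` is hypothetical.

```python
import pandas as pd

# Assumed code-to-label mapping for Accident_Severity, based on the lookup
# tables published with the road safety data (not shown in this notebook).
SEVERITY_LABELS = {1: "Fatal", 2: "Serious", 3: "Slight"}


def decode_severity(codes: pd.Series) -> pd.Series:
    """Map numeric severity codes to labels; unknown codes become NaN."""
    return codes.map(SEVERITY_LABELS)


# Small stand-in for accidents_df["Accident_Severity"].
example = pd.Series([3, 3, 2, 1], name="Accident_Severity")
decoded = decode_severity(example)
print(decoded.tolist())
```

The same pattern applies to the other coded columns, given their own lookup dictionaries.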

In [13]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
import os
import urllib
import glob
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, auc, roc_curve, classification_report
import warnings
warnings.filterwarnings("ignore")

In [14]:
# Read every semicolon-delimited accidents file and stack them into one DataFrame
accidents_df = pd.concat([pd.read_csv(f, delimiter=';') for f in glob.glob('data/accidents/Accidents*.csv')], ignore_index=True)

In [15]:
len(accidents_df)

316542

In [16]:
accidents_df.head()

Unnamed: 0,Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,...,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location
0,201501BS70001,525130.0,180050.0,-0.198465,51.505.538,1,3,1.0,1.0,12/01/2015,...,0.0,0.0,4.0,1.0,1.0,0.0,0.0,1.0,1.0,E01002825
1,201501BS70002,526530.0,178560.0,-0.178838,51.491.836,1,3,1.0,1.0,12/01/2015,...,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,E01002820
2,201501BS70004,524610.0,181080.0,-0.20559,51.514.910,1,3,1.0,1.0,12/01/2015,...,0.0,1.0,4.0,2.0,2.0,0.0,0.0,1.0,1.0,E01002833
3,201501BS70005,524420.0,181080.0,-0.208327,51.514.952,1,3,1.0,1.0,13/01/2015,...,0.0,0.0,1.0,1.0,2.0,0.0,0.0,1.0,2.0,E01002874
4,201501BS70008,524630.0,179040.0,-0.206022,51.496.572,1,2,2.0,1.0,09/01/2015,...,0.0,5.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,E01002814
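In the preview above, `Latitude` values such as `51.505.538` contain two dots, so pandas cannot parse them as numbers. One possible repair is sketched below. It assumes the first dot is the true decimal point and that any later dot is a stray separator introduced when the files were exported. That assumption is not confirmed by the source and is worth checking against the easting and northing columns. The helper `repair_coordinate` is hypothetical.

```python
import numpy as np
import pandas as pd


def repair_coordinate(value):
    """Keep the first dot as the decimal point and drop any later dots.

    Assumption: extra dots are export artefacts, not meaningful digits.
    """
    if pd.isna(value):
        return np.nan
    whole, dot, fraction = str(value).partition(".")
    return float(whole + dot + fraction.replace(".", ""))


# Small stand-in for accidents_df["Latitude"] / accidents_df["Longitude"].
sample = pd.Series(["51.505.538", "51.491.836", "-0.198465", None])
repaired = sample.map(repair_coordinate)
print(repaired.tolist())
```

Values that are already well formed, such as the longitudes shown above, pass through unchanged apart from the conversion to float.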


In [17]:
accidents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316542 entries, 0 to 316541
Data columns (total 32 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   Accident_Index                               316542 non-null  object 
 1   Location_Easting_OSGR                        316526 non-null  float64
 2   Location_Northing_OSGR                       316526 non-null  float64
 3   Longitude                                    316526 non-null  object 
 4   Latitude                                     316526 non-null  object 
 5   Police_Force                                 316542 non-null  int64  
 6   Accident_Severity                            316542 non-null  int64  
 7   Number_of_Vehicles                           316541 non-null  float64
 8   Number_of_Casualties                         316541 non-null  float64
 9   Date                                         316541 non-nul