# Road Safety Data in the UK

#### The Data
The files provide detailed road safety data about the circumstances of personal injury road accidents in GB, the types (including Make and Model) of vehicles involved and the consequential casualties. The statistics relate only to personal injury accidents on public roads that are reported to the police, and subsequently recorded, using the STATS19 accident reporting form. The files used here span 2013 to 2017.

#### The Task
The purpose of the analysis is to:
- Summarize the main characteristics of the data and highlight interesting facts.
- Identify and quantify associations (if any) between the number of casualties (in the Accidents table) and other variables in the data set.
- Explore whether it is possible to predict accident hotspots based on the data.

#### The OSEMiN-approach

OSEMiN is an acronym that rhymes with “awesome” and stands for **Obtain, Scrub, Explore, Model, and iNterpret**. It can be used as a blueprint for working on data problems with machine learning tools. Preprocessing covers scrubbing (also called cleaning) and exploring the data. Building, evaluating, and optimizing the model make up the machine learning part of the process.

# Table of Contents
<a id='Table of Contents'></a>

### <a href='#1. Obtaining and Viewing the Data'>1. Obtaining and Viewing the Data</a>

### <a href='#2. Preprocessing the Data'>2. Preprocessing the Data</a>

### <a href='#3. Data Visualization'>3. Data Visualization</a>

### <a href='#4. Modeling the Data'>4. Modeling the Data</a>

* <a href='#4.1. Recoding Categorical Features'>4.1. Recoding Categorical Features</a>
* <a href='#4.2. Training a Logistic Regression'>4.2. Training a Logistic Regression</a>

### <a href='#5. Interpreting the Data'>5. Interpreting the Data</a>

### 1. Obtaining and Viewing the Data
<a id='1. Obtaining and Viewing the Data'></a>

In [1]:
# import libraries
import pandas as pd
import numpy as np

**Accidents Table**

In [3]:
df1 = pd.read_csv('data/dftRoadSafetyData_Accidents_2017.zip', compression='zip')
print(df1.shape)
df1.head()

(129982, 32)


| | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017010001708 | 532920.0 | 196330.0 | -0.080107 | 51.650061 | 1 | 1 | 2 | 3 | 05/08/2017 | ... | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01001450 |
| 1 | 2017010009342 | 526790.0 | 181970.0 | -0.173845 | 51.522425 | 1 | 3 | 2 | 1 | 01/01/2017 | ... | 0 | 0 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | E01004702 |
| 2 | 2017010009344 | 535200.0 | 181260.0 | -0.052969 | 51.514096 | 1 | 3 | 3 | 1 | 01/01/2017 | ... | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01004298 |
| 3 | 2017010009348 | 534340.0 | 193560.0 | -0.060658 | 51.624832 | 1 | 3 | 2 | 1 | 01/01/2017 | ... | 0 | 4 | 4 | 2 | 2 | 0 | 0 | 1 | 1 | E01001429 |
| 4 | 2017010009350 | 533680.0 | 187820.0 | -0.072372 | 51.573408 | 1 | 2 | 1 | 1 | 01/01/2017 | ... | 0 | 5 | 4 | 1 | 2 | 0 | 0 | 1 | 1 | E01001808 |


In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129982 entries, 0 to 129981
Data columns (total 32 columns):
Accident_Index                                 129982 non-null object
Location_Easting_OSGR                          129963 non-null float64
Location_Northing_OSGR                         129963 non-null float64
Longitude                                      129953 non-null float64
Latitude                                       129953 non-null float64
Police_Force                                   129982 non-null int64
Accident_Severity                              129982 non-null int64
Number_of_Vehicles                             129982 non-null int64
Number_of_Casualties                           129982 non-null int64
Date                                           129982 non-null object
Day_of_Week                                    129982 non-null int64
Time                                           129979 non-null object
Local_Authority_(District)                     129

The accidents table contains almost 130,000 records and 32 columns, with very few missing values. If we decide to work with date and/or time, we will likely need to convert the string values into datetime format. Apart from that, almost all data is already stored as numeric data.
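The datetime conversion mentioned above could be sketched as follows. This is a minimal example on toy rows, not the notebook's actual preprocessing: the `Date` values are taken from the preview, while the `Time` values and their `HH:MM` format are assumptions, since the preview truncates that column.

```python
import pandas as pd

# Toy rows mirroring the Date column shown in the preview.
# The Time values and the HH:MM format are assumptions for illustration.
sample = pd.DataFrame({
    "Date": ["05/08/2017", "01/01/2017"],
    "Time": ["03:12", "01:30"],
})

# Parse the day-first date strings with an explicit format
# to avoid any month/day ambiguity.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Combine date and time into one timestamp column;
# values that cannot be parsed become NaT instead of raising.
sample["Datetime"] = pd.to_datetime(
    sample["Date"].dt.strftime("%Y-%m-%d") + " " + sample["Time"],
    format="%Y-%m-%d %H:%M",
    errors="coerce",
)

print(sample.dtypes)
```

Stating the format explicitly matters here because a date such as `05/08/2017` would otherwise be ambiguous between 5 August and 8 May.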

In [23]:
# Combine the accident files of all available years into a single DataFrame.
# Note: `compression` is an argument of pd.read_csv, not of pd.concat.
import glob
df_test = pd.concat([pd.read_csv(f, compression='zip') for f in glob.glob('data/*Accidents*.zip')])
df_test.shape


(691641, 32)
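When several yearly files are stacked like this, two details may be worth controlling: the resulting index and the type of the identifier column. The sketch below uses two tiny in-memory CSV strings as stand-ins for the real zip files (an assumption made so the example runs on its own). It reads `Accident_Index` as a string, because the previews show purely numeric identifiers for 2017 and alphanumeric ones for 2015.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the yearly accident CSVs
# (hypothetical stand-ins; the values are taken from the previews).
csv_2015 = "Accident_Index,Number_of_Casualties\n201501BS70001,1\n201501BS70002,1\n"
csv_2017 = "Accident_Index,Number_of_Casualties\n2017010001708,3\n2017010009342,1\n"

frames = [
    # Reading the identifier as str keeps numeric-looking and
    # alphanumeric IDs in one consistent type.
    pd.read_csv(io.StringIO(text), dtype={"Accident_Index": str})
    for text in (csv_2015, csv_2017)
]

# ignore_index=True produces one continuous 0..n-1 index
# instead of repeating each file's own row numbers.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Without `ignore_index=True`, every file contributes its own 0-based row labels, so label-based lookups on the combined frame can return several rows.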

In [25]:
df_test.head()

| | Accident_Index | Location_Easting_OSGR | Location_Northing_OSGR | Longitude | Latitude | Police_Force | Accident_Severity | Number_of_Vehicles | Number_of_Casualties | Date | ... | Pedestrian_Crossing-Human_Control | Pedestrian_Crossing-Physical_Facilities | Light_Conditions | Weather_Conditions | Road_Surface_Conditions | Special_Conditions_at_Site | Carriageway_Hazards | Urban_or_Rural_Area | Did_Police_Officer_Attend_Scene_of_Accident | LSOA_of_Accident_Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201501BS70001 | 525130.0 | 180050.0 | -0.198465 | 51.505538 | 1 | 3 | 1 | 1 | 12/01/2015 | ... | 0 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | E01002825 |
| 1 | 201501BS70002 | 526530.0 | 178560.0 | -0.178838 | 51.491836 | 1 | 3 | 1 | 1 | 12/01/2015 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | E01002820 |
| 2 | 201501BS70004 | 524610.0 | 181080.0 | -0.20559 | 51.51491 | 1 | 3 | 1 | 1 | 12/01/2015 | ... | 0 | 1 | 4 | 2 | 2 | 0 | 0 | 1 | 1 | E01002833 |
| 3 | 201501BS70005 | 524420.0 | 181080.0 | -0.208327 | 51.514952 | 1 | 3 | 1 | 1 | 13/01/2015 | ... | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 | 2 | E01002874 |
| 4 | 201501BS70008 | 524630.0 | 179040.0 | -0.206022 | 51.496572 | 1 | 2 | 2 | 1 | 09/01/2015 | ... | 0 | 5 | 1 | 2 | 2 | 0 | 0 | 1 | 2 | E01002814 |


In [19]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138660 entries, 0 to 138659
Data columns (total 32 columns):
Accident_Index                                 138660 non-null object
Location_Easting_OSGR                          138660 non-null int64
Location_Northing_OSGR                         138660 non-null int64
Longitude                                      138660 non-null float64
Latitude                                       138660 non-null float64
Police_Force                                   138660 non-null int64
Accident_Severity                              138660 non-null int64
Number_of_Vehicles                             138660 non-null int64
Number_of_Casualties                           138660 non-null int64
Date                                           138660 non-null object
Day_of_Week                                    138660 non-null int64
Time                                           138652 non-null object
Local_Authority_(District)                     138660 

**Casualties Table**

In [4]:
df2 = pd.read_csv('data/dftRoadSafetyData_Casualties_2017.zip', compression='zip')
print(df2.shape)
df2.head()

(170993, 16)


| | Accident_Index | Vehicle_Reference | Casualty_Reference | Casualty_Class | Sex_of_Casualty | Age_of_Casualty | Age_Band_of_Casualty | Casualty_Severity | Pedestrian_Location | Pedestrian_Movement | Car_Passenger | Bus_or_Coach_Passenger | Pedestrian_Road_Maintenance_Worker | Casualty_Type | Casualty_Home_Area_Type | Casualty_IMD_Decile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017010001708 | 1 | 1 | 2 | 2 | 18 | 4 | 3 | 0 | 0 | 1 | 0 | 0 | 9 | 1 | 2 |
| 1 | 2017010001708 | 2 | 2 | 1 | 1 | 19 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | -1 | -1 |
| 2 | 2017010001708 | 2 | 3 | 2 | 1 | 18 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | -1 | -1 |
| 3 | 2017010009342 | 1 | 1 | 2 | 2 | 33 | 6 | 3 | 0 | 0 | 1 | 0 | 0 | 9 | 1 | 5 |
| 4 | 2017010009344 | 3 | 1 | 1 | 2 | 31 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 9 | 1 | 5 |


In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170993 entries, 0 to 170992
Data columns (total 16 columns):
Accident_Index                        170993 non-null object
Vehicle_Reference                     170993 non-null int64
Casualty_Reference                    170993 non-null int64
Casualty_Class                        170993 non-null int64
Sex_of_Casualty                       170993 non-null int64
Age_of_Casualty                       170993 non-null int64
Age_Band_of_Casualty                  170993 non-null int64
Casualty_Severity                     170993 non-null int64
Pedestrian_Location                   170993 non-null int64
Pedestrian_Movement                   170993 non-null int64
Car_Passenger                         170993 non-null int64
Bus_or_Coach_Passenger                170993 non-null int64
Pedestrian_Road_Maintenance_Worker    170993 non-null int64
Casualty_Type                         170993 non-null int64
Casualty_Home_Area_Type               170993 non

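The casualties table has almost 171,000 records and 16 columns providing detailed information about the casualties. The data contains no null values and, apart from the accident index, is stored in a numeric format. Note, however, that the preview shows -1 entries (for example in Casualty_Home_Area_Type), which STATS19 uses as a code for missing or out-of-range data.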
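Since `info()` only counts true nulls, a separate check for the -1 code can be useful. The following is a minimal sketch on rows copied from the casualties preview (three of the sixteen columns). Treating every -1 as missing is an assumption that should be verified per column against the STATS19 code lists.

```python
import numpy as np
import pandas as pd

# Rows copied from the casualties preview (subset of columns).
casualties = pd.DataFrame({
    "Age_of_Casualty": [18, 19, 18, 33, 31],
    "Casualty_Home_Area_Type": [1, -1, -1, 1, 1],
    "Casualty_IMD_Decile": [2, -1, -1, 5, 5],
})

# Count the -1 placeholders per column.
placeholder_counts = (casualties == -1).sum()
print(placeholder_counts)

# Optionally convert the placeholders into real missing values,
# so that isna() and describe() treat them as missing.
cleaned = casualties.replace(-1, np.nan)
print(cleaned.isna().sum())
```

Leaving the placeholders in place would pull summary statistics such as the mean age slightly downward, which is one reason to recode them before modeling.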

**Vehicles Table**

In [5]:
df3 = pd.read_csv('data/dftRoadSafetyData_Vehicles_2017.zip', compression='zip')
print(df3.shape)
df3.head()

(238926, 23)


| | Accident_Index | Vehicle_Reference | Vehicle_Type | Towing_and_Articulation | Vehicle_Manoeuvre | Vehicle_Location-Restricted_Lane | Junction_Location | Skidding_and_Overturning | Hit_Object_in_Carriageway | Vehicle_Leaving_Carriageway | ... | Journey_Purpose_of_Driver | Sex_of_Driver | Age_of_Driver | Age_Band_of_Driver | Engine_Capacity_(CC) | Propulsion_Code | Age_of_Vehicle | Driver_IMD_Decile | Driver_Home_Area_Type | Vehicle_IMD_Decile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017010001708 | 1 | 9 | 0 | 18 | 0 | 0 | 0 | 0 | 0 | ... | 6 | 1 | 24 | 5 | 1997 | 2 | 1 | -1 | -1 | -1 |
| 1 | 2017010001708 | 2 | 2 | 0 | 18 | 0 | 0 | 1 | 0 | 0 | ... | 6 | 1 | 19 | 4 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 2017010009342 | 1 | 9 | 0 | 18 | 0 | 1 | 0 | 0 | 0 | ... | 6 | 1 | 33 | 6 | 1797 | 8 | 8 | 9 | 1 | 9 |
| 3 | 2017010009342 | 2 | 9 | 0 | 18 | 0 | 1 | 1 | 0 | 0 | ... | 6 | 1 | 40 | 7 | 2204 | 2 | 12 | 2 | 1 | 2 |
| 4 | 2017010009344 | 1 | 9 | 0 | 18 | 0 | 1 | 0 | 0 | 0 | ... | 6 | 3 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


In [9]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238926 entries, 0 to 238925
Data columns (total 23 columns):
Accident_Index                      238926 non-null object
Vehicle_Reference                   238926 non-null int64
Vehicle_Type                        238926 non-null int64
Towing_and_Articulation             238926 non-null int64
Vehicle_Manoeuvre                   238926 non-null int64
Vehicle_Location-Restricted_Lane    238926 non-null int64
Junction_Location                   238926 non-null int64
Skidding_and_Overturning            238926 non-null int64
Hit_Object_in_Carriageway           238926 non-null int64
Vehicle_Leaving_Carriageway         238926 non-null int64
Hit_Object_off_Carriageway          238926 non-null int64
1st_Point_of_Impact                 238926 non-null int64
Was_Vehicle_Left_Hand_Drive?        238926 non-null int64
Journey_Purpose_of_Driver           238926 non-null int64
Sex_of_Driver                       238926 non-null int64
Age_of_Driver     

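The vehicles table is the largest of the three and contains roughly 239,000 records and 23 columns with detailed information about each vehicle and its driver. There are no null values, although the preview shows -1 placeholders in several columns (for example Engine_Capacity_(CC) and Age_of_Driver).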

In [17]:
df3.Accident_Index.iloc[1]

2017010001708
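All three tables carry an `Accident_Index` column, so it can serve as the join key when relating vehicle or casualty attributes to accident-level variables. The sketch below is one possible way to do this on toy rows copied from the previews; it is not a step taken in the notebook itself. Casting the key to `str` on both sides is a precaution in case one table stores it as a number.

```python
import pandas as pd

# Toy rows copied from the accidents and vehicles previews.
accidents = pd.DataFrame({
    "Accident_Index": ["2017010001708", "2017010009342"],
    "Number_of_Casualties": [3, 1],
})
vehicles = pd.DataFrame({
    "Accident_Index": ["2017010001708", "2017010001708",
                       "2017010009342", "2017010009342"],
    "Vehicle_Reference": [1, 2, 1, 2],
    "Vehicle_Type": [9, 2, 9, 9],
})

# Make sure both keys share one dtype before joining.
accidents["Accident_Index"] = accidents["Accident_Index"].astype(str)
vehicles["Accident_Index"] = vehicles["Accident_Index"].astype(str)

# Each vehicle row receives the attributes of its accident.
# validate="many_to_one" raises if an accident appears twice on the right side.
merged = vehicles.merge(
    accidents, on="Accident_Index", how="left", validate="many_to_one"
)
print(merged)
```

Note that accident-level values such as `Number_of_Casualties` are repeated for every vehicle after the join, so aggregates over the merged frame should account for that duplication.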

*Back to: <a href='#Table of Contents'>Table of Contents</a>*
### 2. Preprocessing the Data
<a id='2. Preprocessing the Data'></a>

#### 2.1. Renaming Columns 
<a id='2.1. Renaming Columns'></a>

In [11]:
df1.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Location_Easting_OSGR | 129963.0 | 451170.256719 | 95152.629739 | 73639.0 | 387278.5 | 457594.0 | 528910.0 | 655391.0 |
| Location_Northing_OSGR | 129963.0 | 283578.410194 | 153491.812607 | 12107.0 | 176000.0 | 224126.0 | 388828.5 | 1177531.0 |
| Longitude | 129953.0 | -1.268385 | 1.395881 | -7.40955 | -2.190772 | -1.149752 | -0.141685 | 1.759641 |
| Latitude | 129953.0 | 52.439387 | 1.382508 | 49.929558 | 51.470399 | 51.900636 | 53.393024 | 60.48092 |
| Police_Force | 129982.0 | 28.527996 | 25.064407 | 1.0 | 5.0 | 23.0 | 45.0 | 98.0 |
| Accident_Severity | 129982.0 | 2.800849 | 0.430441 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| Number_of_Vehicles | 129982.0 | 1.838147 | 0.722479 | 1.0 | 1.0 | 2.0 | 2.0 | 23.0 |
| Number_of_Casualties | 129982.0 | 1.315513 | 0.765469 | 1.0 | 1.0 | 1.0 | 1.0 | 42.0 |
| Day_of_Week | 129982.0 | 4.105245 | 1.930446 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
| Local_Authority_(District) | 129982.0 | 328.899286 | 258.587181 | 1.0 | 91.0 | 303.0 | 513.0 | 941.0 |


In [13]:
df2.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Vehicle_Reference | 170993.0 | 1.482166 | 0.656579 | 1.0 | 1.0 | 1.0 | 2.0 | 101.0 |
| Casualty_Reference | 170993.0 | 1.397285 | 1.125848 | 1.0 | 1.0 | 1.0 | 1.0 | 201.0 |
| Casualty_Class | 170993.0 | 1.499155 | 0.726935 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Sex_of_Casualty | 170993.0 | 1.406642 | 0.492172 | -1.0 | 1.0 | 1.0 | 2.0 | 2.0 |
| Age_of_Casualty | 170993.0 | 36.503921 | 19.283721 | -1.0 | 22.0 | 33.0 | 50.0 | 100.0 |
| Age_Band_of_Casualty | 170993.0 | 6.298246 | 2.377378 | -1.0 | 5.0 | 6.0 | 8.0 | 11.0 |
| Casualty_Severity | 170993.0 | 2.833812 | 0.399427 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| Pedestrian_Location | 170993.0 | 0.755733 | 2.119174 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| Pedestrian_Movement | 170993.0 | 0.59875 | 1.926425 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| Car_Passenger | 170993.0 | 0.250782 | 0.576786 | -1.0 | 0.0 | 0.0 | 0.0 | 2.0 |
