# Group 10 - First Year Project
### Data Science at ITU
## Project 1 - Road collisions analysis 2021 

The first project of Data Science's second semester required us to inform various city leaders around the United Kingdom about traffic fatalities and injuries, and to give some insight for urban transport planning. This Jupyter notebook presents Group 10's exploration of 2019 UK road collision data, with complementary analysis, visualisation and findings in relation to the city of Sheffield.


# Task 0 - Data filtering and cleaning
## Loading data
Our first task involves loading our three CSV files (downloaded from https://data.gov.uk/dataset/road-accidents-safety-data on 4 January 2021) into a traversable data structure we can start working with, such as a dictionary. We will then start filtering and cleaning the data.

In [1]:
import numpy as np
import pandas as pd

In [8]:
raw_data = {}

TITLES = ['accidents', 'casualties', 'vehicles']

Pandas has a convenient method for reading CSV files, and storing the data in a DataFrame gives us flexibility.

In [17]:
raw_data['accidents'] = pd.read_csv("../data/raw/Road Safety Data - Accidents 2019.csv")
raw_data['casualties'] = pd.read_csv("../data/raw/Road Safety Data - Casualties 2019.csv")
raw_data['vehicles'] = pd.read_csv("../data/raw/Road Safety Data- Vehicles 2019.csv") #file name missing a space before hyphen 

In [44]:
for TITLE in TITLES:
    print(TITLE + "\tDataFrame shape: " + str(raw_data[TITLE].shape))

accidents	DataFrame shape: (117536, 32)
casualties	DataFrame shape: (153158, 16)
vehicles	DataFrame shape: (216381, 23)


### Data type insight?

In [24]:
#todo
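A minimal sketch of what this cell could contain, assuming we only want to see how pandas parsed each column. It runs on a toy stand-in for `raw_data['accidents']` with invented values; `Number_of_Vehicles` and `Longitude` are assumed column names, while `Accident_Index` comes from the data itself.

```python
import pandas as pd

# Toy stand-in for raw_data['accidents']; the values are invented for illustration.
toy_accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],   # identifiers are kept as text
    "Number_of_Vehicles": [2, 1, 3],        # assumed column name
    "Longitude": [-1.47, -1.46, -1.48],     # assumed column name
})

# Map each column name to the dtype pandas inferred for it.
dtype_summary = {col: str(dtype) for col, dtype in toy_accidents.dtypes.items()}
print(dtype_summary)
```

On the real tables the same dictionary comprehension can be applied to each DataFrame in `raw_data` to spot columns that were read as text when numbers were expected.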

## Locating our city
With our data loaded into a dictionary of DataFrames, we needed to locate the accidents that occurred in Sheffield. To do this we made a mask identifying which accidents occurred in local authority district 215.


In [28]:
sheffield_mask = (raw_data['accidents']["Local_Authority_(District)"] == 215)

In [31]:
sheffield_data = {}

sheffield_data['accidents'] = raw_data['accidents'][sheffield_mask]
sheffield_data['accidents'].shape

(909, 32)

We can now see that there are 909 records of road traffic accidents occurring in Sheffield in 2019. The next step was to use the accidents' primary key (Accident_Index) to find the casualties and vehicles linked to those Sheffield accidents, in order to complete our sheffield_data dictionary.

In [34]:
sheffield_accident_index = sheffield_data['accidents']["Accident_Index"]
sheffield_accident_index[:5]

44673    2019140805407
44674    2019140805409
44676    2019140805414
44677    2019140805567
44680    2019140805955
Name: Accident_Index, dtype: object

In [38]:
sheffield_data['casualties'] = raw_data['casualties'][raw_data['casualties']['Accident_Index'].isin(sheffield_accident_index)]
sheffield_data['vehicles'] = raw_data['vehicles'][raw_data['vehicles']['Accident_Index'].isin(sheffield_accident_index)]

In [45]:
for TITLE in TITLES:
    print("Sheffield " + TITLE + "\tDataFrame shape: " + str(sheffield_data[TITLE].shape))

Sheffield accidents	DataFrame shape: (909, 32)
Sheffield casualties	DataFrame shape: (1191, 16)
Sheffield vehicles	DataFrame shape: (1648, 23)


## Relational data table model 
In terms of the structure of the three data tables, the accidents table's primary key is used as a foreign key by both the casualties and vehicles tables, creating a one-to-many link from the accidents table to each of them. From this structure we can infer that multiple casualties and vehicles can be involved in a single accident, which matches our pre-existing knowledge of how accidents happen.
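The one-to-many link can be checked rather than assumed. The sketch below uses two toy tables with invented rows; `Accident_Severity` and `Casualty_Reference` are assumed column names. The `validate="one_to_many"` argument of `merge` raises an error if the key is not unique on the accidents side.

```python
import pandas as pd

# Toy tables with invented rows; only Accident_Index is taken from the notebook.
toy_accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Accident_Severity": [3, 2],          # assumed column name
})
toy_casualties = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2"],
    "Casualty_Reference": [1, 2, 1],      # assumed column name
})

# validate="one_to_many" raises MergeError if Accident_Index repeats in toy_accidents.
joined = toy_accidents.merge(
    toy_casualties, on="Accident_Index", how="left", validate="one_to_many"
)

# Number of casualty rows attached to each accident.
casualties_per_accident = toy_casualties.groupby("Accident_Index").size()
print(casualties_per_accident)
```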

## Sanity check
A sanity check on our Sheffield road accident data is a critical first step before we analyse the data further or draw conclusions from it.

In [None]:
#Check for empty strings in categorical fields
#Check for NULL values
#Check duplicate records
#Check non-linked records in secondary tables
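The four checks listed above could look roughly like this. It is a sketch on toy tables with invented rows, not the result of running the checks on the Sheffield data; `Road_Type` and `Casualty_Class` are assumed column names and the text labels are illustrative.

```python
import pandas as pd

# Toy tables with invented rows that deliberately contain each kind of problem.
toy_accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A2", "A3"],
    "Road_Type": ["Single", "", "", None],   # assumed column name
})
toy_casualties = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A9"],    # "A9" has no matching accident
    "Casualty_Class": [1, 2, 3],             # assumed column name
})

# Empty strings per column.
empty_strings = toy_accidents.eq("").sum()

# NULL (missing) values per column.
null_counts = toy_accidents.isna().sum()

# Fully duplicated rows (every column identical to an earlier row).
duplicate_rows = int(toy_accidents.duplicated().sum())

# Casualty rows whose Accident_Index does not appear in the accidents table.
orphans = toy_casualties[
    ~toy_casualties["Accident_Index"].isin(toy_accidents["Accident_Index"])
]

print(empty_strings, null_counts, duplicate_rows, orphans, sep="\n")
```

The same orphan check applies to the vehicles table by swapping in its DataFrame.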

## Numerical summary 


In [46]:
import matplotlib.pyplot as plt

In [None]:
#Five Number Summary

In [None]:
#Box Plots

In [None]:
#Frequency Histogram
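As a sketch of the numbers behind these cells, the five-number summary and the frequency counts that a box plot and a histogram would draw can be computed directly with pandas. The series below is invented, and `Number_of_Casualties` is an assumed column name.

```python
import pandas as pd

# Invented values standing in for a numeric column of the accidents table.
toy_counts = pd.Series([1, 1, 2, 2, 2, 3, 5], name="Number_of_Casualties")

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
five_number = toy_counts.quantile([0, 0.25, 0.5, 0.75, 1.0])

# Frequency of each distinct value, ordered by value (the histogram bar heights).
frequency = toy_counts.value_counts().sort_index()

print(five_number)
print(frequency)
```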

# Task 1 - Single variable analysis
### Report the frequency of road collisions in your city for 

1. different age groups 
2. in different times (during the day, the week, or year)
3. and for differences in one other condition.
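For the time-based frequencies, one possible approach is to combine the date and time columns into timestamps and count by hour and by weekday. This is a sketch on invented rows; it assumes the accidents table has `Date` and `Time` columns formatted as `dd/mm/yyyy` and `HH:MM`, which should be verified against the real data.

```python
import pandas as pd

# Invented rows; the Date/Time column names and formats are assumptions.
toy = pd.DataFrame({
    "Date": ["18/02/2019", "19/02/2019", "23/02/2019"],
    "Time": ["17:50", "08:05", "17:10"],
})

# Build one timestamp per accident from the two text columns.
timestamps = pd.to_datetime(toy["Date"] + " " + toy["Time"], format="%d/%m/%Y %H:%M")

# Collisions per hour of day, ordered by hour.
per_hour = timestamps.dt.hour.value_counts().sort_index()

# Collisions per weekday name.
per_weekday = timestamps.dt.day_name().value_counts()

print(per_hour)
print(per_weekday)
```

Counting by age group follows the same `value_counts()` pattern on the relevant casualty column.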


# Task 2 - Associations
### Research whether there is a significant statistical association in your city, either
1. between some vehicle attributes and accident circumstances, or 
2. between some casualty attributes and accident circumstances. 

Report whether there is a statistically significant association between such variables or not, together with the appropriate statistical metric(s).
 
Discuss why this association, or the lack of this association, is relevant for urban/transport planning.
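One common way to measure association between two categorical variables is a chi-square statistic on their contingency table, with Cramér's V as an effect size. The sketch below computes both by hand (without continuity correction) on an invented table; the labels and the `Casualty_Class` and `Accident_Severity` column names are assumptions, and eight rows are far too few for a valid test. A p-value would additionally need the chi-square distribution, for example from a statistics library.

```python
import numpy as np
import pandas as pd

# Invented rows; labels and column names are illustrative assumptions.
toy = pd.DataFrame({
    "Casualty_Class": ["Driver"] * 4 + ["Pedestrian"] * 4,
    "Accident_Severity": ["Slight", "Slight", "Slight", "Serious",
                          "Serious", "Serious", "Serious", "Slight"],
})

# Observed counts for every combination of the two variables.
table = pd.crosstab(toy["Casualty_Class"], toy["Accident_Severity"])
observed = table.to_numpy().astype(float)
n = observed.sum()

# Expected counts if the two variables were independent.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Pearson chi-square statistic and its degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cramér's V: 0 means no association, 1 means perfect association.
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(table)
print(chi2, dof, cramers_v)
```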

# Task 3 - Map visualization
Visualize the reported collisions on a map of your city. Make a visual distinction between different classes of casualty severity.

# Task 4 - Open question
Use the data to formulate, motivate, answer, and discuss another research question of your choice.
For example, compare your city to the whole UK, or investigate different collision participants (bicycle
vs car, bicycle vs motored two-wheeler, pedestrian vs car, car vs car,…), or compare the age distribution
of casualties to the existing age distribution, or rank/compare how problematic different driver
demographics are (young/old male/female),…