# 01 - Data importing, decoding, joining, and saving.
___

Step 1! Let's get our hands on the data and select only the collisions in South Yorkshire, since that's what the project focuses on. We'll decode the data to make it human-readable, so that we can do some exploratory data analysis, get familiar with the data, and have a bit more direction when training our model. We'll then join the Casualty and Collision datasets into one big dataframe and export it to a .csv to read into our code for Step 2: Data cleaning.

In [1]:
import pandas as pd

Here I'm reading in the information we need to convert the numeric codes in the downloaded data tables into their human-readable equivalents. This lookup is stored in the 'stats19_schema.csv' file.

In [2]:
schema = pd.read_csv('stats19_schema.csv')
coll_schema = schema[schema.table=='Accident']
cas_schema = schema[schema.table=='Casualty']
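
To make the shape of the schema a bit more concrete, here is a minimal sketch on made-up rows. The four column names (`table`, `variable`, `code`, `label`) are the ones this notebook relies on, but `toy_schema` and its contents are purely illustrative and not the real file.

```python
import pandas as pd

# Toy stand-in for stats19_schema.csv (illustrative rows only)
toy_schema = pd.DataFrame({
    "table": ["Accident", "Accident", "Accident", "Casualty"],
    "variable": ["accident_severity"] * 3 + ["casualty_class"],
    "code": [1, 2, 3, 1],
    "label": ["Fatal", "Serious", "Slight", "Driver or rider"],
})

# Same boolean-mask split as in the cell above
toy_coll_schema = toy_schema[toy_schema.table == "Accident"]

# Code-to-label lookup for a single variable
severity_lookup = (
    toy_coll_schema[toy_coll_schema.variable == "accident_severity"]
    .set_index("code")
    .label
)
print(severity_lookup.to_dict())
```

Each variable gets its own little code-to-label Series, which is what the decoding step later feeds into `Series.map`.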

Now it's time to download the data. I'm only downloading the last five years of data because that's nice and fast to do (the gov.uk website has already compiled it into a single file per table for us!). This cell will take a little while to run whilst it downloads everything.

In [3]:
coll_url = r"https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv"
cas_url = r"https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-last-5-years.csv"

# low_memory=False reads each file in one pass so column dtypes are inferred consistently
coll_df = pd.read_csv(coll_url, low_memory=False)
cas_df = pd.read_csv(cas_url, low_memory=False)
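
If re-running the download every time gets tedious, one option is to keep a local copy. The sketch below is only a suggestion: `read_csv_cached` and the cache file names are hypothetical helpers, not part of this project, and the demo uses a tiny local file standing in for the remote CSV so it runs without a network connection.

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_csv_cached(source, cache_path):
    """Read cache_path if it exists; otherwise read source and save a copy."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path, low_memory=False)
    df = pd.read_csv(source, low_memory=False)
    df.to_csv(cache_path, index=False)
    return df


# Demo: a small local file plays the role of the remote CSV
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "remote.csv"
    cache = Path(tmp) / "cache.csv"
    pd.DataFrame({"accident_index": ["a1", "a2"]}).to_csv(source, index=False)

    first = read_csv_cached(source, cache)   # reads the source, writes the cache
    source.unlink()                          # the "remote" file is now gone
    second = read_csv_cached(source, cache)  # served from the cache instead
```

In this notebook the call would look something like `read_csv_cached(coll_url, 'collisions_cache.csv')`, where the cache file name is again just a placeholder.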

Below is a neat little piece of code that I'm proud of! In just one line, we replace the coded values in every column that appears in the schema with their corresponding strings, and leave all other columns untouched. 😎

In [4]:
coll_df = pd.concat([coll_df[name].map(coll_schema[coll_schema.variable == name].set_index('code').label).copy() if name in coll_schema.variable.unique() else coll_df[name] for name in coll_df.columns], axis=1)
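
For anyone who finds the one-liner hard to read, here is the same idea unpacked into a loop and run on toy data. `decode_columns`, its `skip` parameter, and the toy frames are hypothetical names made up for this sketch. It also shows two things worth knowing about `Series.map`: a code with no entry in the lookup comes out as `NaN`, and a genuinely numeric column such as an age would be almost entirely wiped out if it were decoded (assuming the schema only lists a "missing" code for it, as in the toy rows here).

```python
import pandas as pd


def decode_columns(df, schema, skip=()):
    """Replace coded values with labels for every column listed in schema."""
    decoded = {}
    for name in df.columns:
        if name in schema.variable.unique() and name not in skip:
            lookup = schema[schema.variable == name].set_index("code").label
            decoded[name] = df[name].map(lookup)  # unknown codes become NaN
        else:
            decoded[name] = df[name]              # pass through unchanged
    return pd.DataFrame(decoded)


# Illustrative schema and data (not the real STATS19 files)
toy_schema = pd.DataFrame({
    "variable": ["accident_severity"] * 3 + ["age_of_casualty"],
    "code": [1, 2, 3, -1],
    "label": ["Fatal", "Serious", "Slight", "Data missing or out of range"],
})
toy_df = pd.DataFrame({
    "accident_index": ["a1", "a2", "a3"],
    "accident_severity": [3, 1, 9],   # 9 has no label in the toy schema
    "age_of_casualty": [34, -1, 7],
})

decoded = decode_columns(toy_df, toy_schema, skip=("age_of_casualty",))
naive = decode_columns(toy_df, toy_schema)  # decodes the ages too

print(decoded)
print(naive["age_of_casualty"].isna().sum(), "ages lost when decoded naively")
```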

In [5]:
cas_df = pd.concat([
    # age_of_casualty stays numeric; every other column found in the schema is decoded
    cas_df[name].map(cas_schema[cas_schema.variable == name].set_index('code').label).copy()
    if (name in cas_schema.variable.unique() and name != 'age_of_casualty')
    else cas_df[name]
    for name in cas_df.columns
], axis=1)
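
The project only needs South Yorkshire, so the collision table has to be narrowed down before the join. A minimal sketch of one way to do that is below. It assumes the decoded collision table has a `police_force` column whose label for this area is exactly `'South Yorkshire'`; both the column choice and the label are assumptions to check against the real data, and `toy_coll` is made up for the example.

```python
import pandas as pd

# Illustrative decoded collision rows (not real data)
toy_coll = pd.DataFrame({
    "accident_index": ["a1", "a2", "a3"],
    "police_force": ["South Yorkshire", "West Yorkshire", "South Yorkshire"],
})

# Keep only the rows recorded by the assumed South Yorkshire label
south_yorks = toy_coll[toy_coll["police_force"] == "South Yorkshire"].copy()
print(south_yorks["accident_index"].tolist())
```

On the real table the equivalent would be something like `coll_df = coll_df[coll_df['police_force'] == 'South Yorkshire']`, run after decoding and before the merge.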

Now we'll merge the two tables of data. A left join keeps every row of the collision table, so we drop any casualty data that doesn't correspond to an accident in South Yorkshire.

In [6]:
joined = coll_df.merge(cas_df, on='accident_index', how='left')
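
Here is a small sketch of what the left join does, using made-up rows. The toy frames are illustrative, and `validate="one_to_many"` is an optional extra (not used in the cell above) that makes pandas raise an error if an `accident_index` appears more than once in the collision table.

```python
import pandas as pd

# Illustrative tables: a9 is a casualty whose collision we did not keep
toy_coll = pd.DataFrame({
    "accident_index": ["a1", "a2"],
    "accident_severity": ["Slight", "Serious"],
})
toy_cas = pd.DataFrame({
    "accident_index": ["a1", "a1", "a9"],
    "casualty_class": ["Driver or rider", "Passenger", "Pedestrian"],
})

toy_joined = toy_coll.merge(
    toy_cas,
    on="accident_index",
    how="left",
    validate="one_to_many",  # each collision may match many casualties
)
print(toy_joined)
```

Collision `a1` appears twice (once per casualty), `a2` is kept with a missing `casualty_class` because it has no casualty rows in the toy table, and casualty `a9` disappears because its collision isn't in the left-hand table.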

Finally, we export it to a .csv for us to read in later on. Passing `index=False` leaves the dataframe's row index out of the file.

In [7]:
joined.to_csv('dft_statistics_collision_and_casualty_last_5_years.csv', index=False)
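
A quick sanity check after saving is to read the file straight back in and compare its shape and columns with the dataframe in memory. The sketch below does this on a toy frame written to a temporary directory; the frame and file name are illustrative only.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Illustrative joined table with one missing value
toy_joined = pd.DataFrame({
    "accident_index": ["a1", "a1", "a2"],
    "casualty_class": ["Driver or rider", "Passenger", None],
    "age_of_casualty": [34, 7, -1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_joined.csv"
    toy_joined.to_csv(path, index=False)
    reloaded = pd.read_csv(path, low_memory=False)

# Same number of rows and columns, same column order
same_shape = reloaded.shape == toy_joined.shape
same_columns = list(reloaded.columns) == list(toy_joined.columns)
print(same_shape, same_columns)
```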