# Required Libraries

In [24]:
import pandas as pd
import numpy as np

import random

# Set the random seed for the random module
random.seed(42)

# Set the random seed for NumPy
np.random.seed(42)

# Datasets

Data retrieved from: https://www.data.qld.gov.au/dataset/crash-data-from-queensland-roads/resource/18ee2911-992f-40ed-b6ae-e756859786e6?inner_span=True

There are 4 crash datasets to be preprocessed. They involve:
1. Road crash locations
2. Driver demographics
3. Vehicle types
4. Road crash factors

These 4 were chosen as they share multiple columns which will allow for joining. The road casualties and seatbelt restraints datasets were excluded as they did not share columns.

In [25]:
df_crash_locations = pd.read_csv("crash_data_queensland_1_crash_locations.csv")
df_crash_driver_demographics = pd.read_csv("crash_data_queensland_b_driver_involvement.csv")
df_crash_vehicle_types = pd.read_csv("crash_data_queensland_d_vehicle_involvement.csv")
df_crash_factors = pd.read_csv("crash_data_queensland_e_alcohol_speed_fatigue_defect.csv")

Property damage crashes were recorded only up to 2010, therefore crashes that resulted in these will be excluded

In [26]:
df_locations = df_crash_locations[df_crash_locations['Crash_Severity'] != 'Property damage only']
df_driver_demographics = df_crash_driver_demographics[df_crash_driver_demographics['Crash_Severity'] != 'Property damage only']
df_vehicle_types = df_crash_vehicle_types[df_crash_vehicle_types["Crash_Severity"] != "Property damage"]
df_factors = df_crash_factors[df_crash_factors["Crash_Severity"] != "Property damage"]

Now to get an initial look at the four datasets

In [27]:
df_locations.info()

<class 'pandas.core.frame.DataFrame'>
Index: 279798 entries, 0 to 367229
Data columns (total 52 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Crash_Ref_Number                 279798 non-null  int64  
 1   Crash_Severity                   279798 non-null  object 
 2   Crash_Year                       279798 non-null  int64  
 3   Crash_Month                      279798 non-null  object 
 4   Crash_Day_Of_Week                279798 non-null  object 
 5   Crash_Hour                       279798 non-null  int64  
 6   Crash_Nature                     279798 non-null  object 
 7   Crash_Type                       279798 non-null  object 
 8   Crash_Longitude                  279798 non-null  float64
 9   Crash_Latitude                   279798 non-null  float64
 10  Crash_Street                     279791 non-null  object 
 11  Crash_Street_Intersecting        121005 non-null  object 
 12  State_R

In [28]:
df_driver_demographics.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16713 entries, 0 to 19895
Data columns (total 16 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Crash_Year                          16713 non-null  int64 
 1   Crash_Police_Region                 16713 non-null  object
 2   Crash_Severity                      16713 non-null  object
 3   Involving_Male_Driver               16713 non-null  object
 4   Involving_Female_Driver             16713 non-null  object
 5   Involving_Young_Driver_16-24        16713 non-null  object
 6   Involving_Senior_Driver_60plus      16713 non-null  object
 7   Involving_Provisional_Driver        16713 non-null  object
 8   Involving_Overseas_Licensed_Driver  16713 non-null  object
 9   Involving_Unlicensed_Driver         16713 non-null  object
 10  Count_Crashes                       16713 non-null  int64 
 11  Count_Casualty_Fatality             16713 non-null  int64 


In [29]:
df_vehicle_types.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3246 entries, 0 to 3245
Data columns (total 12 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   Crash_Year                       3246 non-null   int64 
 1   Crash_Police_Region              3246 non-null   object
 2   Crash_Severity                   3246 non-null   object
 3   Involving_Motorcycle_Moped       3246 non-null   object
 4   Involving_Truck                  3246 non-null   object
 5   Involving_Bus                    3246 non-null   object
 6   Count_Crashes                    3246 non-null   int64 
 7   Count_Casualty_Fatality          3246 non-null   int64 
 8   Count_Casualty_Hospitalised      3246 non-null   int64 
 9   Count_Casualty_MedicallyTreated  3246 non-null   int64 
 10  Count_Casualty_MinorInjury       3246 non-null   int64 
 11  Count_Casualty_All               3246 non-null   int64 
dtypes: int64(7), object(5)
memory usag

In [30]:
df_factors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4799 entries, 0 to 4798
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Crash_Year                   4799 non-null   int64 
 1   Crash_Police_Region          4799 non-null   object
 2   Crash_Severity               4799 non-null   object
 3   Involving_Drink_Driving      4799 non-null   object
 4   Involving_Driver_Speed       4799 non-null   object
 5   Involving_Fatigued_Driver    4799 non-null   object
 6   Involving_Defective_Vehicle  4799 non-null   object
 7   Count_Crashes                4799 non-null   int64 
 8   Count_Fatality               4799 non-null   int64 
 9   Count_Hospitalised           4799 non-null   int64 
 10  Count_Medically_Treated      4799 non-null   int64 
 11  Count_Minor_Injury           4799 non-null   int64 
 12  Count_All_Casualties         4799 non-null   int64 
dtypes: int64(7), object(6)
memory usa

From the above, one can see that they share some columns such as:

- Crash Year
- Crash Severity
- Crash Police Region (excluding locations dataset)
- Count variables (excluding the factors dataset)

The locations and factors datasets will have to be preprocessed as their variable names do not match with the other datasets.

For the factors datasets, a column mapping was defined, and the columns were renamed based on this column mapping

In [31]:
column_mapping = {'Count_Fatality': 'Count_Casualty_Fatality',
                  'Count_Hospitalised': 'Count_Casualty_Hospitalised',
                  'Count_Medically_Treated': 'Count_Casualty_MedicallyTreated',
                  'Count_Minor_Injury': 'Count_Casualty_MinorInjury',
                  'Count_All_Casualties': 'Count_Casualty_All',
}

# renaming columns based on the dictionary mapping
df_factors_renamed = df_factors.rename(columns=column_mapping)

df_factors_renamed.columns

Index(['Crash_Year', 'Crash_Police_Region', 'Crash_Severity',
       'Involving_Drink_Driving', 'Involving_Driver_Speed',
       'Involving_Fatigued_Driver', 'Involving_Defective_Vehicle',
       'Count_Crashes', 'Count_Casualty_Fatality',
       'Count_Casualty_Hospitalised', 'Count_Casualty_MedicallyTreated',
       'Count_Casualty_MinorInjury', 'Count_Casualty_All'],
      dtype='object')

For the locations dataset, there were three different police location variables (District, Division, Region). To determine which column corresponded with the Crash_Police_Region column in the other three datasets, the unique values of these location variables were analysed.

In [32]:
# Unique district values
df_locations["Loc_Police_District"].unique()

array(['South Brisbane', 'Unknown', 'North Brisbane', 'Ipswich', 'Logan',
       'Moreton', 'Gold Coast', 'Sunshine Coast', 'Wide Bay Burnett',
       'Darling Downs', 'Capricornia', 'South West', 'Mackay',
       'Mount Isa', 'Townsville', 'Far North'], dtype=object)

In [33]:
# unique division values
df_locations["Loc_Police_Division"].unique()

array(['Acacia Ridge', 'Moorooka', 'Upper Mount Gravatt', 'Calamvale',
       'Dutton Park', 'Annerley', 'Unknown', 'Coorparoo', 'Holland Park',
       'Sherwood', 'The Gap', 'Ferny Grove', 'Indooroopilly',
       'Brisbane City', 'Carina', 'Wynnum', 'Morningside', 'Capalaba',
       'Fortitude Valley', 'West End', 'Inala', 'Mount Ommaney', 'Goodna',
       'Browns Plains', 'Sandgate', 'Carseldine',
       'Mango Hill North Lakes', 'Albany Creek', 'Petrie', 'Stafford',
       'Boondall', 'Logan Central', 'Springwood', 'Hendra', 'Dayboro',
       'Beaudesert', 'Kalbar', 'Beenleigh', 'Coomera', 'North Tamborine',
       'Caboolture', 'Burpengary', 'Deception Bay', 'Bribie Island',
       'Woodford', 'Beerwah', 'Cleveland', 'Redland Bay',
       'Russell Island', 'Macleay Island', 'Loganholme', 'Esk', 'Dunwich',
       'Karana Downs', 'Ipswich', 'Redcliffe', 'Crestmead', 'Jimboomba',
       'Yamanto', 'Booval', 'Springfield', 'Marburg', 'Rosewood',
       'Broadbeach', 'Robina', 'Southpor

In [34]:
# unique region values
df_locations["Loc_Police_Region"].unique()

array(['Brisbane', 'Unknown', 'Southern', 'South Eastern', 'North Coast',
       'Central', 'Northern', 'Far Northern'], dtype=object)

In [35]:
# unique Crash_Police_Region values in the other three datasets
print(df_driver_demographics["Crash_Police_Region"].unique())
print(df_vehicle_types["Crash_Police_Region"].unique())
print(df_factors_renamed["Crash_Police_Region"].unique())

['Brisbane' 'Central' 'Far Northern' 'North Coast' 'Northern'
 'South Eastern' 'Southern' 'Unknown']
['Brisbane' 'Central' 'Far Northern' 'North Coast' 'Northern'
 'South Eastern' 'Southern' 'Unknown']
['Brisbane' 'Central' 'Far Northern' 'North Coast' 'Northern'
 'South Eastern' 'Southern' 'Unknown']


From the above data exploration, it would seem that Loc_Police_Region matches the Crash_Police_Region column in the other three datasets.

As such, the Loc_Police_Region will be renamed to follow the naming scheme of the other datasets. Note that the Count_Casualty_Total column will also be renamed to Count_Casualty_All for the same reason.

In [36]:
df_locations_renamed = df_locations.rename(columns = {"Loc_Police_Region": "Crash_Police_Region",
                                                      "Count_Casualty_Total": "Count_Casualty_All"})

df_locations.columns

Index(['Crash_Ref_Number', 'Crash_Severity', 'Crash_Year', 'Crash_Month',
       'Crash_Day_Of_Week', 'Crash_Hour', 'Crash_Nature', 'Crash_Type',
       'Crash_Longitude', 'Crash_Latitude', 'Crash_Street',
       'Crash_Street_Intersecting', 'State_Road_Name', 'Loc_Suburb',
       'Loc_Local_Government_Area', 'Loc_Post_Code', 'Loc_Police_Division',
       'Loc_Police_District', 'Loc_Police_Region',
       'Loc_Queensland_Transport_Region', 'Loc_Main_Roads_Region',
       'Loc_ABS_Statistical_Area_2', 'Loc_ABS_Statistical_Area_3',
       'Loc_ABS_Statistical_Area_4', 'Loc_ABS_Remoteness',
       'Loc_State_Electorate', 'Loc_Federal_Electorate',
       'Crash_Controlling_Authority', 'Crash_Roadway_Feature',
       'Crash_Traffic_Control', 'Crash_Speed_Limit',
       'Crash_Road_Surface_Condition', 'Crash_Atmospheric_Condition',
       'Crash_Lighting_Condition', 'Crash_Road_Horiz_Align',
       'Crash_Road_Vert_Align', 'Crash_DCA_Code', 'Crash_DCA_Description',
       'Crash_DCA_Group_De

Now to verify that the police region and count columns match across all three datasets

In [37]:
# Get the column names for each DataFrame
columns_locations = set(df_locations_renamed.columns)
columns_driver_demographics = set(df_driver_demographics.columns)
columns_vehicle_types = set(df_vehicle_types.columns)
columns_factors = set(df_factors_renamed.columns)

# Find the shared columns using .intersection
shared_columns = list(columns_locations.intersection(columns_driver_demographics,
                                                    columns_vehicle_types,
                                                    columns_factors))


print(shared_columns)

['Crash_Severity', 'Count_Casualty_Hospitalised', 'Count_Casualty_MedicallyTreated', 'Crash_Police_Region', 'Count_Casualty_Fatality', 'Crash_Year', 'Count_Casualty_All', 'Count_Casualty_MinorInjury']


Before merging the datasets by performing a join, duplicate rows based on the datasets' shared columns should be inspected

In [38]:
# Function to calculate the number of duplicates in a DataFrame
def count_duplicates(df, key_columns):
    return df.duplicated(subset=key_columns).sum()

# Define the key columns to compare duplicates
key_columns = ['Crash_Year', 
               'Crash_Police_Region', 
               'Crash_Severity', 
               'Count_Casualty_All', 
               'Count_Casualty_Fatality', 
               'Count_Casualty_Hospitalised', 
               'Count_Casualty_MedicallyTreated', 
               'Count_Casualty_MinorInjury']

# Calculate the number of duplicates for each dataset
num_duplicates_locations = count_duplicates(df_locations_renamed, key_columns)
num_duplicates_driver = count_duplicates(df_driver_demographics, key_columns)
num_duplicates_vehicle = count_duplicates(df_vehicle_types, key_columns)
num_duplicates_factors = count_duplicates(df_factors_renamed, key_columns)

# Print the number of duplicates for each dataset
print("Number of duplicates in df_locations:", num_duplicates_locations)
print("Number of duplicates in df_driver_demographics:", num_duplicates_driver)
print("Number of duplicates in df_vehicle_types:", num_duplicates_vehicle)
print("Number of duplicates in df_factors:", num_duplicates_factors)


Number of duplicates in df_locations: 272562
Number of duplicates in df_driver_demographics: 5068
Number of duplicates in df_vehicle_types: 353
Number of duplicates in df_factors: 920


From the above, it would seem that there are a large number of duplicates for these shared columns. Because of this, performing an inner join would be computationally expensive, as the resulting merged dataframe will contain all the possible combinations of the duplicated rows.

For the purposes of this notebook,the join method will be a __left join__. While left joins will result in a large amount of NaN values, the data that remains will be the most accurate in representing the population.

In [39]:
# Perform inner join on df_crash_locations and df_crash_driver_demographics
df_merged = pd.merge(df_locations_renamed, df_driver_demographics, on=key_columns, how='left')

print(df_locations_renamed.shape)
print(df_driver_demographics.shape)
print(df_merged.shape)
print(df_merged.isna().sum())

(279798, 52)
(16713, 16)
(1228181, 60)
Crash_Ref_Number                           0
Crash_Severity                             0
Crash_Year                                 0
Crash_Month                                0
Crash_Day_Of_Week                          0
Crash_Hour                                 0
Crash_Nature                               0
Crash_Type                                 0
Crash_Longitude                            0
Crash_Latitude                             0
Crash_Street                              39
Crash_Street_Intersecting             694103
State_Road_Name                       674420
Loc_Suburb                                 0
Loc_Local_Government_Area                  0
Loc_Post_Code                              0
Loc_Police_Division                        0
Loc_Police_District                        0
Crash_Police_Region                        0
Loc_Queensland_Transport_Region            0
Loc_Main_Roads_Region                      0
Loc_ABS_Statisti

In [40]:
df_merged.columns

Index(['Crash_Ref_Number', 'Crash_Severity', 'Crash_Year', 'Crash_Month',
       'Crash_Day_Of_Week', 'Crash_Hour', 'Crash_Nature', 'Crash_Type',
       'Crash_Longitude', 'Crash_Latitude', 'Crash_Street',
       'Crash_Street_Intersecting', 'State_Road_Name', 'Loc_Suburb',
       'Loc_Local_Government_Area', 'Loc_Post_Code', 'Loc_Police_Division',
       'Loc_Police_District', 'Crash_Police_Region',
       'Loc_Queensland_Transport_Region', 'Loc_Main_Roads_Region',
       'Loc_ABS_Statistical_Area_2', 'Loc_ABS_Statistical_Area_3',
       'Loc_ABS_Statistical_Area_4', 'Loc_ABS_Remoteness',
       'Loc_State_Electorate', 'Loc_Federal_Electorate',
       'Crash_Controlling_Authority', 'Crash_Roadway_Feature',
       'Crash_Traffic_Control', 'Crash_Speed_Limit',
       'Crash_Road_Surface_Condition', 'Crash_Atmospheric_Condition',
       'Crash_Lighting_Condition', 'Crash_Road_Horiz_Align',
       'Crash_Road_Vert_Align', 'Crash_DCA_Code', 'Crash_DCA_Description',
       'Crash_DCA_Group_

After the left join, multiple rows will have duplicates.

In [41]:
duplicates_key_columns = count_duplicates(df_merged, key_columns)

# Crash_Ref_Number, which is a unique ID
duplicates_CRN = count_duplicates(df_merged, 'Crash_Ref_Number')

print("No. duplicates based on key columns: ", duplicates_key_columns)
print("No. duplicates based on CRN: ", duplicates_CRN)

No. duplicates based on key columns:  1220945
No. duplicates based on CRN:  948383


To address this, duplicates will be dropped based on Crash_Ref_Number. This will allow for 1 unique row per 1 unique CRN.

In [42]:
df_merged = df_merged.drop_duplicates(subset = 'Crash_Ref_Number')
df_merged.shape

(279798, 60)

Afterwards, another merge is done with the current merged dataset and the vehicle types dataset

In [43]:
# Perform inner join on df_merged and df_crash_vehicle_types

# Added Count_Crashes as it is a shared column with current merged dataset (locations, demographics) and vehicle types
key_columns = ['Crash_Year', 
               'Crash_Police_Region', 
               'Crash_Severity', 
               'Count_Casualty_All',
               'Count_Casualty_Fatality', 
               'Count_Casualty_Hospitalised', 
               'Count_Casualty_MedicallyTreated', 
               'Count_Casualty_MinorInjury',
               'Count_Crashes']

print(df_merged.shape)
print(df_vehicle_types.shape)

df_merged = pd.merge(df_merged, df_vehicle_types, on=key_columns, how='outer')

print(df_merged.shape)
print(df_merged.isna().sum())

(279798, 60)
(3246, 12)
(299964, 63)
Crash_Ref_Number                 2766
Crash_Severity                      0
Crash_Year                          0
Crash_Month                      2766
Crash_Day_Of_Week                2766
                                ...  
Involving_Unlicensed_Driver     30290
Count_Crashes                   27524
Involving_Motorcycle_Moped     173076
Involving_Truck                173076
Involving_Bus                  173076
Length: 63, dtype: int64


In [44]:
df_merged.columns

Index(['Crash_Ref_Number', 'Crash_Severity', 'Crash_Year', 'Crash_Month',
       'Crash_Day_Of_Week', 'Crash_Hour', 'Crash_Nature', 'Crash_Type',
       'Crash_Longitude', 'Crash_Latitude', 'Crash_Street',
       'Crash_Street_Intersecting', 'State_Road_Name', 'Loc_Suburb',
       'Loc_Local_Government_Area', 'Loc_Post_Code', 'Loc_Police_Division',
       'Loc_Police_District', 'Crash_Police_Region',
       'Loc_Queensland_Transport_Region', 'Loc_Main_Roads_Region',
       'Loc_ABS_Statistical_Area_2', 'Loc_ABS_Statistical_Area_3',
       'Loc_ABS_Statistical_Area_4', 'Loc_ABS_Remoteness',
       'Loc_State_Electorate', 'Loc_Federal_Electorate',
       'Crash_Controlling_Authority', 'Crash_Roadway_Feature',
       'Crash_Traffic_Control', 'Crash_Speed_Limit',
       'Crash_Road_Surface_Condition', 'Crash_Atmospheric_Condition',
       'Crash_Lighting_Condition', 'Crash_Road_Horiz_Align',
       'Crash_Road_Vert_Align', 'Crash_DCA_Code', 'Crash_DCA_Description',
       'Crash_DCA_Group_

Here, duplicates are again dropped based on Crash_Ref_Number

In [45]:
df_merged = df_merged.drop_duplicates(subset = 'Crash_Ref_Number')
df_merged.shape

(279799, 63)

Below is the final left join

In [46]:
# Perform inner join on df_merged and df_crash_factors
print(df_merged.shape)
print(df_factors_renamed.shape)

df_merged = pd.merge(df_merged, df_factors_renamed, on=key_columns, how='outer')

print(df_merged.shape)

(279799, 63)
(4799, 13)


(362974, 67)


and the final deduplication

In [47]:
df_merged = df_merged.drop_duplicates(subset = 'Crash_Ref_Number')
df_merged.shape

(279799, 67)

The left joins have resulted in 279k rows and 67 columns. However, a large proportion of these rows will have NaN values, due to the right datasets containing columns that the left datasets did not have at the time of the left join. 

To address this, __all rows with NaN values will be dropped__.

This is to ensure that the remaining columns are as accurate as possible.

In [48]:
df_dropped_na = df_merged.dropna()

df_dropped_na.shape

(11978, 67)

Even after dropping all rows with NaNs, we are still left with almost almost 12k samples. This is more than sufficient for analyses with machine learning and deep learning models.

In [53]:
df_dropped_na.columns

Index(['Crash_Ref_Number', 'Crash_Severity', 'Crash_Year', 'Crash_Month',
       'Crash_Day_Of_Week', 'Crash_Hour', 'Crash_Nature', 'Crash_Type',
       'Crash_Longitude', 'Crash_Latitude', 'Crash_Street',
       'Crash_Street_Intersecting', 'State_Road_Name', 'Loc_Suburb',
       'Loc_Local_Government_Area', 'Loc_Post_Code', 'Loc_Police_Division',
       'Loc_Police_District', 'Crash_Police_Region',
       'Loc_Queensland_Transport_Region', 'Loc_Main_Roads_Region',
       'Loc_ABS_Statistical_Area_2', 'Loc_ABS_Statistical_Area_3',
       'Loc_ABS_Statistical_Area_4', 'Loc_ABS_Remoteness',
       'Loc_State_Electorate', 'Loc_Federal_Electorate',
       'Crash_Controlling_Authority', 'Crash_Roadway_Feature',
       'Crash_Traffic_Control', 'Crash_Speed_Limit',
       'Crash_Road_Surface_Condition', 'Crash_Atmospheric_Condition',
       'Crash_Lighting_Condition', 'Crash_Road_Horiz_Align',
       'Crash_Road_Vert_Align', 'Crash_DCA_Code', 'Crash_DCA_Description',
       'Crash_DCA_Group_

Currently, we are dealing with a dataset with 11978 rows and 67 columns. 

In [50]:
df_dropped_na.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11978 entries, 21 to 357113
Data columns (total 67 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Crash_Ref_Number                    11978 non-null  float64
 1   Crash_Severity                      11978 non-null  object 
 2   Crash_Year                          11978 non-null  int64  
 3   Crash_Month                         11978 non-null  object 
 4   Crash_Day_Of_Week                   11978 non-null  object 
 5   Crash_Hour                          11978 non-null  float64
 6   Crash_Nature                        11978 non-null  object 
 7   Crash_Type                          11978 non-null  object 
 8   Crash_Longitude                     11978 non-null  float64
 9   Crash_Latitude                      11978 non-null  float64
 10  Crash_Street                        11978 non-null  object 
 11  Crash_Street_Intersecting           11978 no

Below is a final check for duplicates

In [54]:
df_deduplicated = df_dropped_na.drop_duplicates(subset = ['Crash_Ref_Number'])

df_deduplicated.shape

(11978, 67)

And a final check for missing values

In [64]:
df_deduplicated.isna().sum()

Crash_Ref_Number               0
Crash_Severity                 0
Crash_Year                     0
Crash_Month                    0
Crash_Day_Of_Week              0
                              ..
Involving_Bus                  0
Involving_Drink_Driving        0
Involving_Driver_Speed         0
Involving_Fatigued_Driver      0
Involving_Defective_Vehicle    0
Length: 67, dtype: int64

Looking at the first five rows, we can see that Crash_Ref_Number is no longer in increments of 1. This would be due to the numerous drops based on both the key columns and Crash_Ref_Number itself

In [52]:
df_deduplicated.head()

Unnamed: 0,Crash_Ref_Number,Crash_Severity,Crash_Year,Crash_Month,Crash_Day_Of_Week,Crash_Hour,Crash_Nature,Crash_Type,Crash_Longitude,Crash_Latitude,...,Involving_Overseas_Licensed_Driver,Involving_Unlicensed_Driver,Count_Crashes,Involving_Motorcycle_Moped,Involving_Truck,Involving_Bus,Involving_Drink_Driving,Involving_Driver_Speed,Involving_Fatigued_Driver,Involving_Defective_Vehicle
21,54.0,Hospitalisation,2001,March,Wednesday,7.0,Angle,Multi-Vehicle,153.027251,-27.588532,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
60,165.0,Hospitalisation,2001,September,Sunday,21.0,Hit object,Single Vehicle,153.01674,-27.557919,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
69,205.0,Hospitalisation,2001,November,Wednesday,14.0,Angle,Multi-Vehicle,153.03614,-27.595905,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
72,217.0,Hospitalisation,2001,December,Tuesday,22.0,Angle,Multi-Vehicle,153.02075,-27.558559,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
147,4944.0,Hospitalisation,2001,March,Thursday,16.0,Rear-end,Multi-Vehicle,152.989263,-27.445968,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes


Changing Crash_Ref_Number back to int (became float due to dropna)

In [55]:
df_deduplicated['Crash_Ref_Number'] = df_deduplicated['Crash_Ref_Number'].astype('int64')

In [61]:
df_deduplicated.head()

Unnamed: 0,Crash_Ref_Number,Crash_Severity,Crash_Year,Crash_Month,Crash_Day_Of_Week,Crash_Hour,Crash_Nature,Crash_Type,Crash_Longitude,Crash_Latitude,...,Involving_Overseas_Licensed_Driver,Involving_Unlicensed_Driver,Count_Crashes,Involving_Motorcycle_Moped,Involving_Truck,Involving_Bus,Involving_Drink_Driving,Involving_Driver_Speed,Involving_Fatigued_Driver,Involving_Defective_Vehicle
21,54,Hospitalisation,2001,March,Wednesday,7.0,Angle,Multi-Vehicle,153.027251,-27.588532,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
60,165,Hospitalisation,2001,September,Sunday,21.0,Hit object,Single Vehicle,153.01674,-27.557919,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
69,205,Hospitalisation,2001,November,Wednesday,14.0,Angle,Multi-Vehicle,153.03614,-27.595905,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
72,217,Hospitalisation,2001,December,Tuesday,22.0,Angle,Multi-Vehicle,153.02075,-27.558559,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes
147,4944,Hospitalisation,2001,March,Thursday,16.0,Rear-end,Multi-Vehicle,152.989263,-27.445968,...,No,No,1.0,Yes,No,Yes,No,Yes,No,Yes


The last step of pre-processing is to simply output to a csv file that can be used by other scripts/notebooks for analyses.

In [65]:
df_qld_crash_data = df_deduplicated
df_qld_crash_data.to_csv('qld_crash_data_merged_processed.csv', index = False)