# Association Rule Mining on Australian Road Fatality Data

This notebook explores patterns in fatal road crashes using association rule mining. By applying the Apriori algorithm to real-world data, we find key patterns across road user types, age groups, speed zones, times of day, and states. The goal is to uncover rules that could help inform targeted road safety policies.


In [36]:
#  Install necessary libraries
!pip install mlxtend

# Import libraries
import pandas as pd
import numpy as np
import mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import os
import matplotlib.pyplot as plt
import seaborn as sns




In [2]:
# Upload the data file (Google Colab)
from google.colab import files
uploaded = files.upload()

Saving joined_data.csv to joined_data.csv


##  Explore the Dataset

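We inspect the dataset to understand column names, data types, and missing values. Missing values are stored as the literal string "Missing". Four location columns (SA4 name, LGA name, road type, and remoteness area) are more than 80% missing and are left out of the analysis.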
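
The cells shown here do not include a step that removes sparse columns. A minimal sketch of one way to do it, assuming missing values are either NaN or the string "Missing"; the helper name `drop_sparse_columns` and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd


def drop_sparse_columns(frame, threshold=0.80, marker="Missing"):
    """Return a copy of `frame` without columns whose missing share exceeds `threshold`.

    A cell counts as missing if it is NaN or equals `marker`.
    """
    missing_share = (frame.isna() | frame.eq(marker)).mean()
    sparse = missing_share[missing_share > threshold].index
    return frame.drop(columns=sparse)


# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Road User": ["Driver", "Passenger", "Driver", "Pedestrian", "Driver"],
    "SA4 Name": ["Missing", "Missing", "Missing", "Missing", "Riverina"],
    "Speed Limit": [100, 80, 50, 100, 0],
})
kept = drop_sparse_columns(toy, threshold=0.5)
print(kept.columns.tolist())
```

On the real data the call would be `drop_sparse_columns(df)`. The threshold is a judgment call; 0.80 simply mirrors the cut-off mentioned above.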


In [18]:
# Read the CSV file
df = pd.read_csv('joined_data.csv')

# Print the DataFrame
print(df)

       FactFatalityID  Crash ID  Road User  Gender  Age Age Group Time of day  \
0                   1  20241115     Driver    Male   74  65_to_74       Night   
1                   2  20241125     Driver  Female   19  17_to_25         Day   
2                   3  20246013     Driver  Female   33  26_to_39         Day   
3                   4  20241002     Driver  Female   32  26_to_39         Day   
4                   5  20242261  Passenger    Male   62  40_to_64         Day   
...               ...       ...        ...     ...  ...       ...         ...   
56869           56870  19896006  Passenger  Female   13   0_to_16       Night   
56870           56871  19896006  Passenger    Male   13   0_to_16       Night   
56871           56872  19896006     Driver    Male   18  17_to_25       Night   
56872           56873  19896006  Passenger  Female   14   0_to_16       Night   
56873           56874  19895133     Driver    Male   70  65_to_74       Night   

      State_crash  Month_cr

In [19]:
#Show the data head
df.head()


| | FactFatalityID | Crash ID | Road User | Gender | Age | Age Group | Time of day | State_crash | Month_crash | Year_crash | ... | Articulated Truck Involvement_crash | Speed Limit_crash | National Remoteness Areas_crash | SA4 Name 2021_crash | National LGA Name 2021_crash | National Road Type_crash | Christmas Period_crash | Easter Period_crash | Day of week_crash | Time of Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20241115 | Driver | Male | 74 | 65_to_74 | Night | NSW | 12 | 2024 | ... | No | 100 | Inner Regional Australia | Riverina | Wagga Wagga | Arterial Road | Yes | No | Weekday | Night |
| 1 | 2 | 20241125 | Driver | Female | 19 | 17_to_25 | Day | NSW | 12 | 2024 | ... | No | 80 | Inner Regional Australia | Sydney - Baulkham Hills and Hawkesbury | Hawkesbury | Local Road | No | No | Weekday | Day |
| 2 | 3 | 20246013 | Driver | Female | 33 | 26_to_39 | Day | Tas | 12 | 2024 | ... | No | 50 | Inner Regional Australia | Launceston and North East | Northern Midlands | Local Road | Yes | No | Weekday | Day |
| 3 | 4 | 20241002 | Driver | Female | 32 | 26_to_39 | Day | NSW | 12 | 2024 | ... | No | 100 | Outer Regional Australia | New England and North West | Armidale Regional | National or State Highway | No | No | Weekday | Day |
| 4 | 5 | 20242261 | Passenger | Male | 62 | 40_to_64 | Day | Vic | 12 | 2024 | ... | Missing | 0 | Missing | Missing | Missing | Missing | No | No | Weekday | Day |


In [20]:
#Check the data types in the dataframe
df.dtypes

FactFatalityID     int64
Crash ID           int64
Road User         object
Gender            object
Age                int64
Age Group         object
Time of day       object
State_crash       object
Month_crash        int64
Year_crash         int64


In [21]:
#The columns of the dataframe
df.columns

Index(['FactFatalityID', 'Crash ID', 'Road User', 'Gender', 'Age', 'Age Group',
       'Time of day', 'State_crash', 'Month_crash', 'Year_crash',
       'Dayweek_crash', 'Time_crash', 'Crash Type_crash', 'Number Fatalities',
       'Bus Involvement_crash', 'Heavy Rigid Truck Involvement_crash',
       'Articulated Truck Involvement_crash', 'Speed Limit_crash',
       'National Remoteness Areas_crash', 'SA4 Name 2021_crash',
       'National LGA Name 2021_crash', 'National Road Type_crash',
       'Christmas Period_crash', 'Easter Period_crash', 'Day of week_crash',
       'Time of Day'],
      dtype='object')

In [45]:
#Check the number of rows and columns
df.shape

(56874, 27)

In [62]:
# Check for missing values

print("\nChecking for 'Missing' string values:")
missing_str_counts = {}
for col in df.columns:
    missing_str_count = (df[col] == "Missing").sum()
    if missing_str_count > 0:
        missing_str_counts[col] = missing_str_count

for col, count in sorted(missing_str_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{col}: {count} ({count/len(df)*100:.2f}%)")




Checking for 'Missing' string values:
SA4 Name 2021_crash: 45851 (80.62%)
National LGA Name 2021_crash: 45849 (80.62%)
National Road Type_crash: 45846 (80.61%)
National Remoteness Areas_crash: 45520 (80.04%)
Heavy Rigid Truck Involvement_crash: 20557 (36.14%)
Speed Category: 1493 (2.63%)
Age Group: 117 (0.21%)
Bus Involvement_crash: 68 (0.12%)
Articulated Truck Involvement_crash: 62 (0.11%)
Time of day: 44 (0.08%)
Time of Day: 44 (0.08%)
Time_crash: 43 (0.08%)
Gender: 34 (0.06%)
Day of week_crash: 13 (0.02%)
Road User: 11 (0.02%)


##  Preprocess Key Features

We:
- Create a `Speed Category` column that bins the numeric speed limit into "Low" (≤50 km/h), "Medium" (51-80 km/h), or "High" (>80 km/h).
- Treat a speed limit of 0, NaN, or "Missing" as unknown and leave its category empty.
- Focus on 5 variables: Age Group, State, Time of Day, Speed Category, and Road User.
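
The binning can also be done in one vectorized step. The sketch below is an alternative using `pd.cut`, not the notebook's own implementation; the function name `categorize_speed_vectorized` and the sample values are hypothetical. Unlike the row-wise function in the next cell, it also treats negative values as unknown.

```python
import numpy as np
import pandas as pd

SPEED_LABELS = ["Low (≤50 km/h)", "Medium (51-80 km/h)", "High (>80 km/h)"]


def categorize_speed_vectorized(speed_limits):
    """Bin speed limits into three categories; unknown values become NaN."""
    # Non-numeric entries such as "Missing" are coerced to NaN
    numeric = pd.to_numeric(speed_limits, errors="coerce")
    # Zero (and anything below) is treated as an unknown speed limit
    numeric = numeric.where(numeric > 0)
    # Right-closed bins: (0, 50], (50, 80], (80, inf)
    return pd.cut(numeric, bins=[0, 50, 80, np.inf], labels=SPEED_LABELS)


# Illustrative values only
sample = pd.Series([100, 80, 50, 0, 110, 60, "Missing"])
result = categorize_speed_vectorized(sample)
print(result.value_counts())
```

`pd.cut` returns an ordered categorical, which keeps Low < Medium < High when sorting or plotting.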


In [90]:
# Create speed limit categories
def categorize_speed(speed):
    """Categorize speed limits into meaningful groups"""
    if pd.isna(speed) or speed == 0 or speed == "Missing":
        return None

    if speed <= 50:
        return "Low (≤50 km/h)"
    elif speed <= 80:
        return "Medium (51-80 km/h)"
    else:
        return "High (>80 km/h)"

In [91]:
# Apply speed categorization
df['Speed Category'] = df['Speed Limit_crash'].apply(categorize_speed)


In [93]:
# Explore the distribution of key variables
print(df['Road User'].value_counts())


Road User
Driver                          25681
Passenger                       12918
Pedestrian                       8757
Motorcycle rider                 7456
Pedal cyclist                    1546
Motorcycle pillion passenger      384
Other/-9                          121
Missing                            11
Name: count, dtype: int64


In [94]:
print(df['State_crash'].value_counts())


State_crash
NSW    17326
Vic    12436
Qld    11440
WA      6842
SA      4853
NT      1787
Tas     1677
ACT      513
Name: count, dtype: int64


In [99]:
print(df['Age Group'].value_counts())

Age Group
40_to_64       14642
17_to_25       14542
26_to_39       13239
75_or_older     5612
65_to_74        4443
0_to_16         4279
Missing          117
Name: count, dtype: int64


In [96]:
print(df['Speed Category'].value_counts())


Speed Category
High (>80 km/h)        27328
Medium (51-80 km/h)    24217
Low (≤50 km/h)          3836
Name: count, dtype: int64


In [98]:

print(df['Time of Day'].value_counts())


Time of Day
Day        32393
Night      24437
Missing       44
Name: count, dtype: int64


In [100]:
# Select focused columns for analysis - removed less important columns
focus_columns = [
    'Road User',
    'Age Group',
    'State_crash',
    'Speed Category',
    'Time of Day'
]

## Transaction Encoding and Rule Mining

We:
- Encode selected feature values into item-style transactions.
- Use Apriori with minimum support = 1% to find frequent patterns.
- Generate association rules with minimum confidence = 10%.
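
To make the thresholds concrete, the sketch below computes support, confidence, and lift by hand for a single rule on toy transactions. The helper names `support` and `rule_metrics` and the data are hypothetical, for illustration only; mlxtend computes the same quantities internally.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent => consequent.

    Assumes both sides occur at least once, so no division by zero.
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    support_ac = support(transactions, a | c)
    confidence = support_ac / support(transactions, a)
    lift = confidence / support(transactions, c)
    return {"support": support_ac, "confidence": confidence, "lift": lift}


# Toy transactions (illustrative values, not taken from the dataset)
toy = [
    ["Age Group=0_to_16", "Road User=Passenger"],
    ["Age Group=0_to_16", "Road User=Passenger"],
    ["Age Group=26_to_39", "Road User=Driver"],
    ["Age Group=26_to_39", "Road User=Passenger"],
]
metrics = rule_metrics(toy, ["Age Group=0_to_16"], ["Road User=Passenger"])
print(metrics)
```

In this toy set the rule holds in 2 of 4 transactions (support 0.5) and in every transaction with the antecedent (confidence 1.0). Passengers make up 3 of 4 transactions overall, so the lift is 1.0 / 0.75, about 1.33.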


In [101]:
# Create transactions for rule mining
transactions = []
for _, row in df.iterrows():
    transaction = []

    for col in focus_columns:
        if col in df.columns:
            value = row[col]
            if pd.notna(value) and value != "Missing":
                # Encode the item as "attribute=value"
                transaction.append(f"{col}={value}")

    # Only include transactions with at least 3 items
    if len(transaction) >= 3:
        transactions.append(transaction)

print(len(transactions))
print(transactions[0])

56873
['Road User=Driver', 'Age Group=65_to_74', 'State_crash=NSW', 'Speed Category=High (>80 km/h)', 'Time of Day=Night']
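
mlxtend's `apriori` expects a one-hot boolean DataFrame (one column per item, one row per transaction) rather than a list of item lists. The pandas-only sketch below illustrates that shape on toy data; the helper `one_hot_transactions` is hypothetical and only stands in for what `TransactionEncoder` produces.

```python
import pandas as pd


def one_hot_transactions(transactions):
    """Build a boolean DataFrame with one column per distinct item."""
    items = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in items] for t in transactions]
    return pd.DataFrame(rows, columns=items, dtype=bool)


# Toy transactions (illustrative only)
toy = [
    ["Road User=Driver", "Time of Day=Night"],
    ["Road User=Passenger", "Time of Day=Day"],
    ["Road User=Driver"],
]
encoded = one_hot_transactions(toy)
print(encoded)
```

The column means of such a frame are the supports of the single items, which is the starting point for Apriori.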


In [103]:
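# One-hot encode the transactions into the boolean DataFrame that apriori expects
te = TransactionEncoder()
df_encoded = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
# Apply Apriori algorithm to find frequent itemsets
min_support = 0.01  # 1% support threshold
frequent_itemsets = apriori(df_encoded, min_support=min_support, use_colnames=True)
print(len(frequent_itemsets))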


581


In [104]:
# Generate association rules
min_confidence = 0.1  # 10% confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
print(len(rules))


2608


##  Filter and Focus on Road User Rules

We keep only rules whose consequent contains a road user type (e.g., Driver, Pedestrian). These rules show which combinations of conditions are associated with a higher share of a given road user type among fatalities. They describe co-occurrence in the data, not causation.


In [105]:
# Filter rules to have "Road User" as consequent
road_user_rules = rules[rules['consequents'].apply(lambda x: any(item.startswith('Road User=') for item in x))]
print(len(road_user_rules))


722


In [106]:
# Reduce each consequent to its Road User item.
# Caveat: support, confidence and lift are NOT recomputed, so they still describe
# the original (possibly multi-item) consequent. This is why the same displayed
# rule can appear more than once with different metrics.
filtered_rules = []
for _, rule in road_user_rules.iterrows():
    consequents = set(rule['consequents'])

    # Extract only Road User items from consequents
    road_user_consequents = {item for item in consequents if item.startswith('Road User=')}

    if len(road_user_consequents) == 1:
        # Copy the rule, keeping just the Road User item as its consequent
        new_rule = rule.copy()
        new_rule['consequents'] = frozenset(road_user_consequents)
        filtered_rules.append(new_rule)

filtered_rules_df = pd.DataFrame(filtered_rules)
print(len(filtered_rules_df))


722
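
Trimming a multi-item consequent down to its Road User item leaves the rule's metrics unchanged, so they no longer match the rule as displayed. A stricter alternative, sketched here as an assumption rather than the notebook's method, keeps only rules whose consequent is exactly one Road User item. The function name `single_road_user_consequent` and the toy rules are hypothetical.

```python
import pandas as pd


def single_road_user_consequent(rules, prefix="Road User="):
    """Keep rules whose consequent is exactly one item starting with `prefix`."""
    mask = rules["consequents"].apply(
        lambda c: len(c) == 1 and next(iter(c)).startswith(prefix)
    )
    return rules[mask]


# Toy rules in the same layout as mlxtend's output (values are illustrative)
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"Age Group=0_to_16"}),
        frozenset({"Age Group=0_to_16"}),
        frozenset({"Road User=Driver"}),
    ],
    "consequents": [
        frozenset({"Road User=Passenger"}),
        frozenset({"Road User=Passenger", "Time of Day=Day"}),
        frozenset({"Time of Day=Day"}),
    ],
    "lift": [3.4, 4.0, 1.1],
})
strict = single_road_user_consequent(toy_rules)
print(strict)
```

With this filter every printed metric refers to the rule exactly as shown, at the cost of a smaller rule set than the 722 rules above.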


In [109]:
#  Sort and analyze rules
# Sort by lift (primary) and confidence (secondary)
filtered_rules_df = filtered_rules_df.sort_values(['lift', 'confidence'], ascending=[False, False])

# Define function to format rules
def format_rule(row):
    antecedents_str = ', '.join(list(row['antecedents']))
    consequents_str = ', '.join(list(row['consequents']))
    return f"{antecedents_str} => {consequents_str} (Support: {row['support']:.4f}, Confidence: {row['confidence']:.4f}, Lift: {row['lift']:.4f})"

# Display top rules
for i, rule in enumerate(filtered_rules_df.head(10).apply(format_rule, axis=1)):
    print(f"Rule {i+1}: {rule}")


Rule 1: Age Group=75_or_older, Speed Category=Medium (51-80 km/h) => Road User=Pedestrian (Support: 0.0175, Confidence: 0.3290, Lift: 4.3907)
Rule 2: Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.5109, Lift: 4.0247)
Rule 3: Time of Day=Day, Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.7811, Lift: 3.4387)
Rule 4: Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0263, Confidence: 0.7781, Lift: 3.4258)
Rule 5: Age Group=75_or_older => Road User=Pedestrian (Support: 0.0175, Confidence: 0.1773, Lift: 3.3986)
Rule 6: Age Group=75_or_older => Road User=Pedestrian (Support: 0.0244, Confidence: 0.2470, Lift: 3.2956)
Rule 7: Speed Category=Low (≤50 km/h) => Road User=Pedestrian (Support: 0.0103, Confidence: 0.1533, Lift: 2.8953)
Rule 8: Age Group=26_to_39, Time of Day=Day, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Suppo

##  Analyze Rules by Age, Speed, Time, and State

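We break down the rules and rank them by lift within each:
- Age group (e.g., 0–16, 75+)
- Speed zone
- Time of day (Day vs Night)
- State (e.g., NSW, Vic, WA)

A lift above 1 means the road user type occurs more often under the rule's conditions than in the data overall; confidence is the share of matching records with that road user type. The top two rules per group are printed.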
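
The four breakdowns follow the same pattern: tag each rule with a feature value, then take the highest-lift rules per value. A generic sketch of that pattern, assuming a rules DataFrame that already has the tag column; the helper `top_rules_by_group` and the toy data are hypothetical:

```python
import pandas as pd


def top_rules_by_group(rules, group_col, n=2, metric="lift"):
    """Return the `n` highest-`metric` rules for each value of `group_col`."""
    ranked = rules.dropna(subset=[group_col]).sort_values(
        [group_col, metric], ascending=[True, False]
    )
    # After sorting, the first n rows of each group are its top n rules
    return ranked.groupby(group_col).head(n)


# Toy rules (illustrative values only)
toy_rules = pd.DataFrame({
    "age_group": ["0_to_16", "75_or_older", "0_to_16", None, "0_to_16"],
    "lift": [3.4, 4.4, 4.0, 1.2, 2.7],
})
top = top_rules_by_group(toy_rules, "age_group", n=2)
print(top)
```

The same call works for speed category, time of day, or state by changing `group_col`.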


In [110]:
# Helper used by the breakdowns below
def extract_feature(items, prefix):
    """Return the value of the first item that starts with prefix, or None"""
    for item in items:
        if item.startswith(prefix):
            # Slice off the prefix so values containing '=' are kept whole
            return item[len(prefix):]
    return None

# Age Group Analysis
filtered_rules_df['age_group'] = filtered_rules_df['antecedents'].apply(
    lambda x: extract_feature(x, 'Age Group='))
filtered_rules_df['road_user'] = filtered_rules_df['consequents'].apply(
    lambda x: extract_feature(x, 'Road User='))

age_rules = filtered_rules_df[filtered_rules_df['age_group'].notna()]
print(len(age_rules))


382


In [111]:
# Display top rules for each age group
for age in sorted(age_rules['age_group'].unique()):
    age_specific = age_rules[age_rules['age_group'] == age].sort_values('lift', ascending=False)
    print(f"\n{age}:")
    for i, rule in enumerate(age_specific.head(2).apply(format_rule, axis=1)):
        print(f"  Rule {i+1}: {rule}")



0_to_16:
  Rule 1: Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.5109, Lift: 4.0247)
  Rule 2: Time of Day=Day, Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.7811, Lift: 3.4387)

17_to_25:
  Rule 1: Time of Day=Day, Age Group=17_to_25, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Support: 0.0117, Confidence: 0.3016, Lift: 2.3004)
  Rule 2: Age Group=17_to_25, Speed Category=Medium (51-80 km/h) => Road User=Passenger (Support: 0.0216, Confidence: 0.1886, Lift: 1.8843)

26_to_39:
  Rule 1: Age Group=26_to_39, Time of Day=Day, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Support: 0.0155, Confidence: 0.3759, Lift: 2.8669)
  Rule 2: Age Group=26_to_39, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Support: 0.0108, Confidence: 0.1139, Lift: 2.5221)

40_to_64:
  Rule 1: Age Group=40_to_64, Speed Category=Medium (51

In [114]:
# Speed Category analysis
filtered_rules_df['speed_category'] = filtered_rules_df['antecedents'].apply(
    lambda x: extract_feature(x, 'Speed Category='))

speed_rules = filtered_rules_df[filtered_rules_df['speed_category'].notna()]
print(len(speed_rules))

# Display top rules for each speed category
for speed in sorted(speed_rules['speed_category'].unique()):
    speed_specific = speed_rules[speed_rules['speed_category'] == speed].sort_values('lift', ascending=False)
    print(f"\n{speed}:")
    for i, rule in enumerate(speed_specific.head(2).apply(format_rule, axis=1)):
        print(f"  Rule {i+1}: {rule}")


298

High (>80 km/h):
  Rule 1: Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.5109, Lift: 4.0247)
  Rule 2: Time of Day=Day, Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.7811, Lift: 3.4387)

Low (≤50 km/h):
  Rule 1: Speed Category=Low (≤50 km/h) => Road User=Pedestrian (Support: 0.0103, Confidence: 0.1533, Lift: 2.8953)
  Rule 2: Speed Category=Low (≤50 km/h) => Road User=Pedestrian (Support: 0.0145, Confidence: 0.2145, Lift: 2.8630)

Medium (51-80 km/h):
  Rule 1: Age Group=75_or_older, Speed Category=Medium (51-80 km/h) => Road User=Pedestrian (Support: 0.0175, Confidence: 0.3290, Lift: 4.3907)
  Rule 2: Age Group=26_to_39, Time of Day=Day, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Support: 0.0155, Confidence: 0.3759, Lift: 2.8669)


In [115]:
# Time of Day analysis
filtered_rules_df['time_of_day'] = filtered_rules_df['antecedents'].apply(
    lambda x: extract_feature(x, 'Time of Day='))

time_rules = filtered_rules_df[filtered_rules_df['time_of_day'].notna()]
print(len(time_rules))

# Display top rules for each time of day
for time in sorted(time_rules['time_of_day'].unique()):
    time_specific = time_rules[time_rules['time_of_day'] == time].sort_values('lift', ascending=False)
    print(f"\n{time}:")
    for i, rule in enumerate(time_specific.head(2).apply(format_rule, axis=1)):
        print(f"  Rule {i+1}: {rule}")

302

Day:
  Rule 1: Time of Day=Day, Speed Category=High (>80 km/h), Age Group=0_to_16 => Road User=Passenger (Support: 0.0172, Confidence: 0.7811, Lift: 3.4387)
  Rule 2: Age Group=26_to_39, Time of Day=Day, Speed Category=Medium (51-80 km/h) => Road User=Motorcycle rider (Support: 0.0155, Confidence: 0.3759, Lift: 2.8669)

Night:
  Rule 1: Age Group=0_to_16, Time of Day=Night => Road User=Passenger (Support: 0.0174, Confidence: 0.6242, Lift: 2.7482)
  Rule 2: Age Group=40_to_64, Speed Category=Medium (51-80 km/h), Time of Day=Night => Road User=Pedestrian (Support: 0.0135, Confidence: 0.3406, Lift: 2.2118)


In [113]:
# State analysis
filtered_rules_df['state'] = filtered_rules_df['antecedents'].apply(
    lambda x: extract_feature(x, 'State_crash='))

state_rules = filtered_rules_df[filtered_rules_df['state'].notna()]
print(len(state_rules))

# Display top rules for major states
major_states = ['NSW', 'Vic', 'Qld', 'WA']
for state in major_states:
    state_specific = state_rules[state_rules['state'] == state].sort_values('lift', ascending=False)
    print(f"\n{state}:")
    for i, rule in enumerate(state_specific.head(2).apply(format_rule, axis=1)):
        print(f"  Rule {i+1}: {rule}")

369

NSW:
  Rule 1: State_crash=NSW, Age Group=0_to_16 => Road User=Passenger (Support: 0.0125, Confidence: 0.5666, Lift: 2.4947)
  Rule 2: State_crash=NSW, Speed Category=Low (≤50 km/h) => Road User=Pedestrian (Support: 0.0103, Confidence: 0.3661, Lift: 2.3778)

Vic:
  Rule 1: Speed Category=Medium (51-80 km/h), State_crash=Vic => Road User=Pedestrian (Support: 0.0137, Confidence: 0.1414, Lift: 1.8870)
  Rule 2: Time of Day=Day, Speed Category=Medium (51-80 km/h), State_crash=Vic => Road User=Pedestrian (Support: 0.0137, Confidence: 0.2761, Lift: 1.7932)

Qld:
  Rule 1: Time of Day=Day, Speed Category=Medium (51-80 km/h), State_crash=Qld => Road User=Motorcycle rider (Support: 0.0121, Confidence: 0.2400, Lift: 1.8304)
  Rule 2: Age Group=26_to_39, State_crash=Qld => Road User=Motorcycle rider (Support: 0.0116, Confidence: 0.2373, Lift: 1.8099)

WA:
  Rule 1: Speed Category=High (>80 km/h), State_crash=WA => Road User=Passenger (Support: 0.0116, Confidence: 0.1826, Lift: 1.4384)
  Rule