# Road Traffic Accidents Dataset Analysis

This project aims to analyze Addis Ababa road accident data, uncover patterns, and present the findings in a clear and accessible way.
<br>
A combination of numerical, univariate, and multivariate analysis methods has been applied to derive meaningful insights.

---

In [9]:
import pandas as pd
import plotly.express as px
import warnings
from utils.dataset_download import download_kaggle_dataset
from utils.plot_config import setup_summary_chart, setup_horizontal_bar
pd.options.mode.chained_assignment = None
warnings.simplefilter("ignore")

download_kaggle_dataset()
accidents = pd.read_csv('data/RTA Dataset.csv')  # forward slash avoids the invalid '\R' escape and works on every OS

Dataset URL: https://www.kaggle.com/datasets/saurabhshahane/road-traffic-accidents


---

# Numerical Analysis

In [10]:
display(accidents['Number_of_casualties'].describe())
display(accidents['Number_of_casualties'].value_counts())

count    12316.000000
mean         1.548149
std          1.007179
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          8.000000
Name: Number_of_casualties, dtype: float64

Number_of_casualties
1    8397
2    2290
3     909
4     394
5     207
6      89
7      22
8       8
Name: count, dtype: int64

The number of casualties is strongly right-skewed. About 68% of accidents have a single casualty, and the long tail reaching eight casualties pulls the mean (1.55) above the median (1). Accidents with many casualties are therefore rare. The standard deviation (1.01) is large relative to the mean, which reflects that tail rather than a wide spread among typical accidents.
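
The skew can be checked directly from the value counts printed above. The sketch below rebuilds the column from those counts instead of reading the CSV, then compares the mean, the median, and the sample skewness. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Casualty counts copied from the value_counts() output above.
counts = {1: 8397, 2: 2290, 3: 909, 4: 394, 5: 207, 6: 89, 7: 22, 8: 8}

# Rebuild the column by repeating each value as often as it occurred.
casualties = pd.Series(
    [value for value, n in counts.items() for _ in range(n)],
    name="Number_of_casualties",
)

mean = casualties.mean()
median = casualties.median()
skewness = casualties.skew()  # a positive value means a longer right tail

print(f"mean={mean:.3f}, median={median:.0f}, skewness={skewness:.2f}")
```

A mean above the median together with positive skewness is the numerical signature of the right tail described above.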

---

In [11]:
display(accidents['Number_of_vehicles_involved'].describe())
display(accidents['Number_of_vehicles_involved'].value_counts().sort_index())

count    12316.000000
mean         2.040679
std          0.688790
min          1.000000
25%          2.000000
50%          2.000000
75%          2.000000
max          7.000000
Name: Number_of_vehicles_involved, dtype: float64

Number_of_vehicles_involved
1    1996
2    8340
3    1568
4     363
6      42
7       7
Name: count, dtype: int64

The number of vehicles involved is concentrated at two (8,340 of 12,316 accidents, about 68%). The counts on either side of that mode are nearly balanced: 1,996 single-vehicle accidents against 1,980 accidents involving three or more vehicles. The shape is not strictly symmetric, since the right tail extends to seven vehicles, and frequencies fall off quickly as the count moves away from two (no accident in the data involves exactly five vehicles).
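
The balance around the mode of two can be verified from the printed counts alone. This is a minimal sketch using the numbers from the output above; the variable names are illustrative.

```python
import pandas as pd

# Vehicle counts copied from the value_counts().sort_index() output above.
vehicle_counts = pd.Series(
    {1: 1996, 2: 8340, 3: 1568, 4: 363, 6: 42, 7: 7},
    name="count",
)

total = vehicle_counts.sum()
below_mode = vehicle_counts[vehicle_counts.index < 2].sum()  # single-vehicle accidents
above_mode = vehicle_counts[vehicle_counts.index > 2].sum()  # three or more vehicles

print(f"below two: {below_mode} ({below_mode / total:.1%})")
print(f"above two: {above_mode} ({above_mode / total:.1%})")
```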

---

# Univariate Analysis

In [12]:
fig = setup_summary_chart(accidents)
fig.show()

- Two spikes in the number of accidents are observed throughout the day, likely corresponding to the morning and afternoon rush hours, when people are commuting.
- The probability of accidents remains fairly consistent throughout the week, with a slight increase on Fridays and a small decrease over the weekend.
- A total of 118 people have been involved in accidents without possessing a driver's license.
- Lorries and automobiles account for approximately 55% of accidents, likely reflecting their widespread use and prevalence on the road.
- The majority of accidents occur while vehicles are moving straight.
- Around 8% of accidents involve pedestrians.

---

# Multivariate Analysis

In [13]:
lighting_df = accidents.copy()
lighting_df.loc[lighting_df['Light_conditions'].str.contains('Darkness', case=False, na=False), 'Light_conditions'] = 'Darkness'

# Select the fatal share by label; a positional index would depend on the sort order of value_counts().
daylight_fatal = lighting_df[lighting_df['Light_conditions'] == 'Daylight'].Accident_severity.value_counts(normalize=True)['Fatal injury']
darkness_fatal = lighting_df[lighting_df['Light_conditions'] == 'Darkness'].Accident_severity.value_counts(normalize=True)['Fatal injury']

joint_fatal = pd.DataFrame({'Daylight': [daylight_fatal], 'Darkness': [darkness_fatal]})
melt_fatal = pd.melt(joint_fatal, var_name='Light_conditions', value_name='Probability')
melt_fatal = melt_fatal.sort_values(by='Probability', ascending=False)

fig = setup_horizontal_bar(melt_fatal, 'Probability', 'Light_conditions', '#4A5A6F', 'tan',
                           'Probability of a fatal injury', 'Light conditions', 'Fatal Injury Probability by Daylight vs Darkness')
fig.add_annotation(text='', x=0.017, y=0.95, ax=-300, ay=0, arrowhead=2)
fig.add_annotation(text='Twice as frequent', x=0.0145, y=1.05, ax=0, ay=0, showarrow=False, font=dict(size=14))
fig.show()

1. The probability of a fatal injury in darkness is about twice that in daylight (roughly 2% versus 1%). Reduced visibility and driver fatigue are plausible contributors: hazards are harder to notice in low light, which can make accidents more severe.
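
A row-normalised crosstab is another way to obtain these conditional probabilities in one step. The sketch below runs on hypothetical toy data: only the column names and the `'Fatal injury'` label come from the notebook, while the rows and the other severity labels are placeholders.

```python
import pandas as pd

# Hypothetical toy data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Light_conditions": ["Daylight"] * 6 + ["Darkness"] * 4,
    "Accident_severity": [
        "Slight Injury", "Slight Injury", "Slight Injury", "Slight Injury",
        "Serious Injury", "Fatal injury",
        "Slight Injury", "Slight Injury", "Serious Injury", "Fatal injury",
    ],
})

# normalize="index" makes every row sum to 1, so each cell is
# P(severity | light condition).
severity_share = pd.crosstab(
    toy["Light_conditions"], toy["Accident_severity"], normalize="index"
)

# Pick the fatal column by label rather than by position.
fatal_share = severity_share["Fatal injury"]
print(fatal_share)
```

Selecting the column by label keeps the result correct even if the severity categories appear in a different order.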

---

In [14]:
pede_df = accidents.copy()
pede_df.loc[pede_df['Pedestrian_movement'].str.contains('masked', case=False, na=False), 'Pedestrian_movement'] = 'Masked'
pede_df.loc[pede_df['Pedestrian_movement'].str.contains('Crossing|carriageway', case=False, na=False), 'Pedestrian_movement'] = 'Visible'

# Select the fatal share by label; a positional index would depend on the sort order of value_counts().
vis_fatal = pede_df[pede_df['Pedestrian_movement'] == 'Visible'].Accident_severity.value_counts(normalize=True)['Fatal injury']
mask_fatal = pede_df[pede_df['Pedestrian_movement'] == 'Masked'].Accident_severity.value_counts(normalize=True)['Fatal injury']

joint_fatal = pd.DataFrame({'Visible': [vis_fatal], 'Masked by vehicle': [mask_fatal]})
melt_fatal = pd.melt(joint_fatal, var_name='Pedestrian_visibility', value_name='Probability')
melt_fatal = melt_fatal.sort_values(by='Probability', ascending=False)

fig = setup_horizontal_bar(melt_fatal, 'Probability', 'Pedestrian_visibility', '#8A8D8F', '#4A90E2', 'Probability of a fatal injury',
                           'Pedestrian visibility', 'Fatal Injury Probability by Pedestrian Visibility to the Driver')
fig.add_annotation(text='', x=0.014, y=0.95, ax=-300, ay=0, arrowhead=2)
fig.add_annotation(text='4 times as frequent', x=0.0116, y=1.05, ax=0, ay=0, showarrow=False, font=dict(size=14))
fig.show()

2. An accident is about four times as likely to result in a fatal injury when the pedestrian emerges from behind a vehicle as when they are clearly visible to the driver (1.9% versus 0.5%).

---

In [15]:
val_list = []
for key, value in accidents[accidents['Accident_severity'] == 'Fatal injury']['Day_of_week'].value_counts().items():
    val_list.append([key, value / accidents['Day_of_week'].value_counts().loc[key]])

val_df = pd.DataFrame(val_list, columns=['Day_of_week', 'probability'])
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
val_df['Day_of_week'] = pd.Categorical(val_df['Day_of_week'], categories=order, ordered=True)
val_df = val_df.sort_values(by='Day_of_week').reset_index(drop=True)

color_main = 'salmon'
color_second = 'slategray'
colors = [color_main if i in [5, 6] else color_second for i in range(7)]
fig = px.bar(val_df, x='Day_of_week', y='probability', color=colors, color_discrete_map={color_main: color_main, color_second: color_second},
             labels={'Day_of_week': 'Day of the week', 'probability': 'Probability of a fatal injury'},
             title="Fatal Injury Probability by Day of the Week", width=750)
fig.update_layout(xaxis=dict(fixedrange=True),
                  yaxis=dict(fixedrange=True, range=[0, max(val_df['probability'])*1.2], tickformat='.1%', showgrid=False),
                  title_x=0.5, showlegend=False)
fig.update_traces(texttemplate='%{y}', textposition='outside')

fig.show()

3. Fatal injuries are more likely on weekends (around 2.3%, compared with about 1% on weekdays), possibly because of more leisure travel, social activities, and factors such as alcohol consumption that lead to riskier driving. Friday is the most accident-prone day (16.6% of accidents, against an average of 14.3% per day), yet it has the second-lowest fatal injury probability, which suggests that the additional Friday accidents are mostly non-fatal.
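
The two quantities compared in this finding, each day's share of all accidents and its fatal injury probability, can be put side by side with a boolean mask and a `groupby`. The sketch uses hypothetical toy rows; only the column names and the `'Fatal injury'` label are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Monday", "Friday", "Friday", "Friday",
                    "Saturday", "Saturday", "Sunday"],
    "Accident_severity": ["Slight Injury", "Fatal injury", "Slight Injury",
                          "Slight Injury", "Serious Injury", "Fatal injury",
                          "Slight Injury", "Fatal injury"],
})
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# The mean of a boolean mask within each group is the fatal share for that day.
is_fatal = toy["Accident_severity"].eq("Fatal injury")
fatal_probability = is_fatal.groupby(toy["Day_of_week"]).mean().reindex(order)

# Share of all accidents that happened on each day.
accident_share = toy["Day_of_week"].value_counts(normalize=True).reindex(order)

summary = pd.DataFrame({
    "accident_share": accident_share,
    "fatal_probability": fatal_probability,
})
print(summary)
```

Unlike a loop over the fatal subset only, this keeps days with accidents but no fatal injuries in the result with a probability of 0. Days with no accidents at all show up as NaN after `reindex`.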

---

In [16]:
val_list = []
for key, value in accidents[(accidents['Accident_severity'] == 'Fatal injury') & (accidents['Vehicle_movement'] != 'Unknown')]['Vehicle_movement'].value_counts().items():
    val_list.append([key, value / accidents['Vehicle_movement'].value_counts().loc[key]])

val_df = pd.DataFrame(val_list, columns=['Vehicle_movement', 'probability']).sort_values(by='probability', ascending=False)

color_main = 'salmon'
color_second = 'slategray'
colors = [color_main if i == 0 else color_second for i in range(len(val_df))]
fig = px.bar(val_df, x='Vehicle_movement', y='probability', color=colors, color_discrete_map={color_main: color_main, color_second: color_second},
             labels={'Vehicle_movement': 'Vehicle movement type', 'probability': 'Probability of a fatal injury'},
             title="Fatal Injury Probability by Vehicle Movement Type", width=1000)
fig.update_layout(xaxis=dict(fixedrange=True),
                  yaxis=dict(fixedrange=True, range=[0, max(val_df['probability'])*1.2], tickformat='.1%', showgrid=False), title_x=0.5, showlegend=False)
fig.update_traces(texttemplate='%{y}', textposition='outside')

fig.show()

4. Overtaking has by far the highest probability of a fatal injury (5.2%) of all movement types, possibly because of the elevated risk of high-speed head-on collisions.

---

In [17]:
val_list = []
for key, value in accidents[accidents['Accident_severity'] == 'Fatal injury']['Type_of_collision'].value_counts().items():
    val_list.append([key, value / accidents['Type_of_collision'].value_counts().loc[key]])

val_df = pd.DataFrame(val_list, columns=['Type_of_collision', 'probability']).sort_values(by='probability', ascending=False)

color_main = 'salmon'
color_second = 'slategray'
colors = [color_main if i == 0 else color_second for i in range(len(val_df))]

fig = px.bar(val_df, x='Type_of_collision', y='probability', color=colors, color_discrete_map={color_main: color_main, color_second: color_second},
             labels={'Type_of_collision': 'Type of collision', 'probability': 'Probability of a fatal injury'},
             title="Fatal Injury Probability by Type of Collision", width=750)
fig.update_layout(xaxis=dict(fixedrange=True),
                  yaxis=dict(fixedrange=True, range=[0, max(val_df['probability'])*1.2], tickformat='.1%', showgrid=False), title_x=0.5, showlegend=False)
fig.update_traces(texttemplate='%{y}', textposition='outside')

fig.show()

5. Collisions with pedestrians have the highest fatal injury probability (2.5%), most likely because pedestrians have no physical protection and the human body is highly vulnerable to the impact.

---

In [18]:
severity = accidents.groupby('Accident_severity')[['Number_of_vehicles_involved', 'Number_of_casualties']].mean()
severity.columns = ['Number of vehicles involved','Number of casualties']
fig = px.bar(severity, 
             title="Average Vehicles and Casualties by Accident Severity",
             barmode='group',
             labels={'value': 'Average', 'variable': '','Accident_severity': 'Accident Severity'},
             color_discrete_map={'Number of vehicles involved': 'cornflowerblue', 'Number of casualties': 'lightcoral'},
             width=800
             )

fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.11,
        xanchor="left",
        x=0.4))

fig.update_layout(xaxis=dict(fixedrange=True),
                  yaxis=dict(fixedrange=True, range=[0, 2.7], tickformat='.1f', showgrid=False), title_x=0.5)
fig.update_traces(texttemplate='%{y}', textposition='outside')

fig.show()

6. The average number of vehicles involved is higher the less serious the injury. The average number of casualties is similar for slight and serious injuries but noticeably higher for fatal ones.

---

In [19]:
value_counts_df = accidents.groupby(['Number_of_casualties', 'Number_of_vehicles_involved']).size().reset_index(name='counts')
total_accidents = accidents.shape[0]
value_counts_df['probability'] = value_counts_df['counts'] / total_accidents

custom_color_scale = [
    [0, "royalblue"],
    [0.35, "orange"],
    [1, "salmon"]
]

fig = px.scatter(value_counts_df, 
                 x="Number_of_casualties", 
                 y="Number_of_vehicles_involved", 
                 size="counts",
                 color="probability",
                 title="Number of Casualties vs Number of Vehicles Involved with Accident Probability",
                 labels={"Number_of_casualties": "Number of Casualties", 
                         "Number_of_vehicles_involved": "Number of Vehicles Involved", 
                         "probability": "Probability"},
                 color_continuous_scale=custom_color_scale,
                 hover_data=["counts", "probability"],
                 size_max=40,
                 width=900,
                 height=700)

fig.update_layout(xaxis=dict(fixedrange=True),
                yaxis=dict(fixedrange=True), title_x=0.5)
fig.show()

7. Most accidents involve few casualties and few vehicles. The most frequent combination, 1 casualty and 2 vehicles, accounts for 47.1% of all accidents. The probability falls as either number increases, so large-scale accidents are rare, although potentially more severe.
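
The joint probabilities behind the bubble chart can also be inspected as plain numbers. The following sketch repeats the `groupby(...).size()` approach on hypothetical toy rows (the column names match the dataset, the values do not) and derives the most frequent pair and the marginal distribution of casualties.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Number_of_casualties":        [1, 1, 1, 1, 2, 2, 1, 3],
    "Number_of_vehicles_involved": [2, 2, 2, 1, 2, 1, 3, 2],
})

# Joint probability of each (casualties, vehicles) pair.
pair_prob = (
    toy.groupby(["Number_of_casualties", "Number_of_vehicles_involved"]).size()
    / len(toy)
)

# Most frequent combination, returned as a (casualties, vehicles) tuple.
most_common_pair = pair_prob.idxmax()

# Marginal distribution of casualties: sum the joint over vehicle counts.
casualty_marginal = pair_prob.groupby(level="Number_of_casualties").sum()

print(most_common_pair, pair_prob.max())
print(casualty_marginal)
```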

---

In [20]:
accidents['Hour'] = pd.to_datetime(accidents['Time'],format='%H:%M:%S', errors='coerce').dt.hour
hours = accidents.groupby('Hour')[['Number_of_vehicles_involved', 'Number_of_casualties']].mean()
hours.columns = ['Number of vehicles involved', 'Number of casualties']
fig = px.line(hours, 
              title="Average Vehicles and Casualties by Hour of the Day",
              labels={'value': 'Average', 'variable': ' ', 'Hour': 'Hour of the day'},
              color_discrete_map={'Number of vehicles involved': 'cornflowerblue', 'Number of casualties': 'lightcoral'},)

fig.update_layout(
    legend=dict(
        orientation="h",
        yanchor="top",
        y=1.12,
        xanchor="left",
        x=0.71))

# dt.hour yields values from 0 to 23, so place one tick on each of those hours.
fig.update_layout(xaxis=dict(fixedrange=True, tickvals=list(range(24)), ticktext=list(range(24))),
                  yaxis=dict(fixedrange=True, range=[0, 2.5]), title_x=0.5)

fig.show()

8. The average numbers of vehicles and casualties are both stable throughout the day, except at 4 a.m., when the average number of casualties is visibly higher. It is also the only hour in which the average number of casualties exceeds the average number of vehicles involved.
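
An hourly average can be dominated by a handful of accidents when few occur at that hour, so it may help to report the number of accidents next to each mean. The sketch below shows one way to do that on hypothetical toy rows; the column names follow the dataset, while the values and the `per_hour` table are illustrative.

```python
import pandas as pd

# Hypothetical toy data; the last row has an unparseable time on purpose.
toy = pd.DataFrame({
    "Time": ["04:10:00", "04:45:00", "17:02:00", "17:30:00", "17:55:00", "unknown"],
    "Number_of_vehicles_involved": [1, 2, 2, 2, 3, 2],
    "Number_of_casualties": [3, 2, 1, 1, 2, 1],
})

# Unparseable times become NaT, and groupby drops their rows by default.
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M:%S", errors="coerce").dt.hour

per_hour = toy.groupby("Hour").agg(
    accidents=("Number_of_casualties", "count"),
    avg_vehicles=("Number_of_vehicles_involved", "mean"),
    avg_casualties=("Number_of_casualties", "mean"),
)

# Flag the hours in which casualties outnumber vehicles on average.
per_hour["more_casualties_than_vehicles"] = (
    per_hour["avg_casualties"] > per_hour["avg_vehicles"]
)
print(per_hour)
```

If the flagged hour rests on very few accidents, the spike is better treated as a hint than as a stable pattern.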

---

# Summary

This analysis highlights key patterns in road accidents in Addis Ababa. Large accidents are rare: most involve a small number of vehicles and casualties. Fatal injuries are more likely at night, on weekends, and during overtaking maneuvers, and pedestrian collisions have the highest fatal injury probability of all collision types. Friday has the highest accident frequency but one of the lowest fatal injury probabilities. Notably, 118 people involved in accidents lacked a driver’s license. The data points to visibility and pedestrian safety as factors to address in order to reduce fatalities.