**Saudi Arabian road accident mortality and traffic safety interventions dataset (2010–2020)**

# **Import libraries**

In [202]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime

In [203]:
# Load the .xlsx file (read_excel reads the first sheet by default)
df = pd.read_excel("/content/EP-traffic-mortality-and-policy-interventions-dataset.xlsx")
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,1. Dataset Description,
1,,This file provides two primary datasets on roa...,
2,,,
3,,,
4,,,


In [204]:
# Load Excel file
excel_file_path = "/content/EP-traffic-mortality-and-policy-interventions-dataset.xlsx"

# Check the sheets
datafile = pd.ExcelFile(excel_file_path)
print(datafile.sheet_names)

['Data Description', 'Raw Accident Mortality Data', 'Perodic Accid. Mortality Data', 'Govern. Accid. Mortality Data', 'Policy Interventions', 'EP Map', 'Governerate & Cities', 'Credits']


In [205]:
# Load the second sheet Raw Accident Mortality Data
df = pd.read_excel(excel_file_path, sheet_name=datafile.sheet_names[1])
df.head()

Unnamed: 0,Home,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,,1. Traffic Accident Mortality Primary dataset,,,,,,,,,,,,,,,,2. Data Source
1,,death date (Gregorian),death date (Hijri),Place of Death,Age,,,Gender,Nationality,Hospital Code,City,Governorate,Population,,Total Records,7351.0,,This data is collected from Ministry of Health...
2,,,,,Day,Month,Year,,,,,,,,,,,
3,,2020-01-01 00:00:00,05/05/1441,At Hospital,0,0,60,Male,Saudi,10002,Al-Qaṭīf,Al-Qaṭīf,118327,,,,,
4,,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,11,3,35,Male,Palestinian,10039,Ar Rafiah,Qaryat al-'Ulyā,<1000,,,,,


In [206]:
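# Drop the first (empty) column and the last five columns (record count and data-source notes)
df = df.drop(df.columns[0], axis=1)
df = df.iloc[:, :-5]

# Drop the 4th and 5th columns (the Day and Month parts of Age), keeping only Year
df = df.drop(df.columns[[3, 4]], axis=1)

# Drop the first row (the sheet title)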
df = df.drop(df.index[0])
df.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
1,death date (Gregorian),death date (Hijri),Place of Death,,Gender,Nationality,Hospital Code,City,Governorate,Population
2,,,,Year,,,,,,
3,2020-01-01 00:00:00,05/05/1441,At Hospital,60,Male,Saudi,10002,Al-Qaṭīf,Al-Qaṭīf,118327
4,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,35,Male,Palestinian,10039,Ar Rafiah,Qaryat al-'Ulyā,<1000
5,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,44,Female,Saudi,10022,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788


In [207]:
# Use the first row as the column names
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df.head()

1,death date (Gregorian),death date (Hijri),Place of Death,NaN,Gender,Nationality,Hospital Code,City,Governorate,Population
2,,,,Year,,,,,,
3,2020-01-01 00:00:00,05/05/1441,At Hospital,60,Male,Saudi,10002.0,Al-Qaṭīf,Al-Qaṭīf,118327
4,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,35,Male,Palestinian,10039.0,Ar Rafiah,Qaryat al-'Ulyā,<1000
5,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,44,Female,Saudi,10022.0,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6,2019-12-29 00:00:00,02/05/1441,Before Reaching Hospital,25,Male,Saudi,10007.0,Al-Jubaīl,Al-Jubaīl,337778


In [208]:
# Remove the leftover sub-header row (it only holds the "Year" label under Age)
df = df.drop(df.index[0])

# Rename the NaN column to "Age"
df = df.rename(columns={np.nan: 'Age'})

df.head()

1,death date (Gregorian),death date (Hijri),Place of Death,Age,Gender,Nationality,Hospital Code,City,Governorate,Population
3,2020-01-01 00:00:00,05/05/1441,At Hospital,60,Male,Saudi,10002,Al-Qaṭīf,Al-Qaṭīf,118327
4,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,35,Male,Palestinian,10039,Ar Rafiah,Qaryat al-'Ulyā,<1000
5,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,44,Female,Saudi,10022,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6,2019-12-29 00:00:00,02/05/1441,Before Reaching Hospital,25,Male,Saudi,10007,Al-Jubaīl,Al-Jubaīl,337778
7,2019-12-29 00:00:00,02/05/1441,Before Reaching Hospital,65,Male,Bengali,10022,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788


In [209]:
df

1,death date (Gregorian),death date (Hijri),Place of Death,Age,Gender,Nationality,Hospital Code,City,Governorate,Population
3,2020-01-01 00:00:00,05/05/1441,At Hospital,60,Male,Saudi,10002,Al-Qaṭīf,Al-Qaṭīf,118327
4,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,35,Male,Palestinian,10039,Ar Rafiah,Qaryat al-'Ulyā,<1000
5,2019-12-30 00:00:00,03/05/1441,Before Reaching Hospital,44,Female,Saudi,10022,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6,2019-12-29 00:00:00,02/05/1441,Before Reaching Hospital,25,Male,Saudi,10007,Al-Jubaīl,Al-Jubaīl,337778
7,2019-12-29 00:00:00,02/05/1441,Before Reaching Hospital,65,Male,Bengali,10022,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
...,...,...,...,...,...,...,...,...,...,...
7350,2011-04-05 00:00:00,02/05/1432,Before Reaching Hospital,21,Male,Saudi,10007,Al-Jubaīl,Al-Jubaīl,337778
7351,2011-02-05 00:00:00,02/03/1432,Before Reaching Hospital,47,Male,Saudi,34,Al-Khafjī,Al-Khafjī,67012
7352,2010-08-11 00:00:00,01/09/1431,Before Reaching Hospital,18,Male,Saudi,34,Al-Khafjī,Al-Khafjī,67012
7353,2010-08-06 00:00:00,25/08/1431,Before Reaching Hospital,17,Male,Saudi,10001,Ad-Dammām,Ad-Dammām,903312


In [210]:
# The last row needs to be removed
df = df.drop(df.index[-1])
df.tail()

1,death date (Gregorian),death date (Hijri),Place of Death,Age,Gender,Nationality,Hospital Code,City,Governorate,Population
7349,2011-04-12 00:00:00,09/05/1432,Before Reaching Hospital,23,Male,Saudi,10057,Al-'Aḥsā',Al-'Aḥsā',1067691
7350,2011-04-05 00:00:00,02/05/1432,Before Reaching Hospital,21,Male,Saudi,10007,Al-Jubaīl,Al-Jubaīl,337778
7351,2011-02-05 00:00:00,02/03/1432,Before Reaching Hospital,47,Male,Saudi,34,Al-Khafjī,Al-Khafjī,67012
7352,2010-08-11 00:00:00,01/09/1431,Before Reaching Hospital,18,Male,Saudi,34,Al-Khafjī,Al-Khafjī,67012
7353,2010-08-06 00:00:00,25/08/1431,Before Reaching Hospital,17,Male,Saudi,10001,Ad-Dammām,Ad-Dammām,903312
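
The header cleanup above is spread over several cells. As a rough sketch only, the same idea (promote one row to column names, skip the sub-header row beneath it) can be wrapped in a single helper and tried on a toy frame. The function name `promote_header` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def promote_header(raw: pd.DataFrame, header_row: int, n_skip_after: int = 0) -> pd.DataFrame:
    """Use one row of `raw` as column names and keep only the rows below it.

    header_row   : positional index of the row that holds the column names
    n_skip_after : number of extra rows (e.g. sub-headers) to skip after it
    """
    header = raw.iloc[header_row].to_list()
    body = raw.iloc[header_row + 1 + n_skip_after:].copy()
    body.columns = header
    return body.reset_index(drop=True)


# Toy sheet that mimics the layout: title row, header row, sub-header row, data
toy = pd.DataFrame([
    ["1. Title", np.nan, np.nan],
    ["death date", "Age", "Gender"],
    [np.nan, "Year", np.nan],
    ["2020-01-01", 60, "Male"],
    ["2019-12-30", 35, "Male"],
])

clean = promote_header(toy, header_row=1, n_skip_after=1)
print(clean)
```

Resetting the index here is a design choice: it avoids the 3-to-7353 index seen above, at the cost of losing the original sheet row numbers.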


# **Data Exploration**

In [211]:
# Shape and columns
print("Rows:", df.shape[0])
print("Columns:", df.shape[1])
print("Column names:", df.columns.tolist())

Rows: 7351
Columns: 10
Column names: ['death date (Gregorian)', 'death date (Hijri)', 'Place of Death', 'Age', 'Gender', 'Nationality', 'Hospital Code', 'City', 'Governorate', 'Population']


In [212]:
# Check missing values
df.isna().sum()

column,missing_count
death date (Gregorian),0
death date (Hijri),0
Place of Death,0
Age,0
Gender,0
Nationality,0
Hospital Code,31
City,0
Governorate,0
Population,0


In [213]:
# Basic info and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7351 entries, 3 to 7353
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   death date (Gregorian)  7351 non-null   object
 1   death date (Hijri)      7351 non-null   object
 2   Place of Death          7351 non-null   object
 3   Age                     7351 non-null   object
 4   Gender                  7351 non-null   object
 5   Nationality             7351 non-null   object
 6   Hospital Code           7320 non-null   object
 7   City                    7351 non-null   object
 8   Governorate             7351 non-null   object
 9   Population              7351 non-null   object
dtypes: object(10)
memory usage: 574.4+ KB


In [214]:
# Summary statistics
df.describe(include='all')


The behavior of value_counts with object-dtype is deprecated. In a future version, this will *not* perform dtype inference on the resulting index. To retain the old behavior, use `result.index = result.index.infer_objects()`



1,death date (Gregorian),death date (Hijri),Place of Death,Age,Gender,Nationality,Hospital Code,City,Governorate,Population
count,7351,7351,7351,7351,7351,7351,7320,7351,7351,7351
unique,2879,2892,2,152,2,41,60,24,12,21
top,2016-12-29 00:00:00,04/02/1437,Before Reaching Hospital,25,Male,Saudi,10001,Ad-Dammām,Al-'Aḥsā',903312
freq,13,13,5762,277,6594,4265,1311,1387,1921,1387


In [215]:
# duplicates
df.duplicated().sum()

np.int64(12)

In [216]:
# check the duplicate rows that are true
df[df.duplicated()]

1,death date (Gregorian),death date (Hijri),Place of Death,Age,Gender,Nationality,Hospital Code,City,Governorate,Population
2187,2016-12-29 00:00:00,30/03/1438,Before Reaching Hospital,17,Male,Saudi,10001.0,Ad-Dammām,Ad-Dammām,903312
2188,2016-12-29 00:00:00,30/03/1438,Before Reaching Hospital,17,Male,Saudi,10001.0,Ad-Dammām,Ad-Dammām,903312
2189,2016-12-29 00:00:00,30/03/1438,Before Reaching Hospital,17,Male,Saudi,10001.0,Ad-Dammām,Ad-Dammām,903312
4224,2014-10-05 00:00:00,12/12/1435,Before Reaching Hospital,31,Male,Saudi,10036.0,Urayarah,Al-'Aḥsā',<1000
5367,2013-03-20 00:00:00,08/05/1434,Before Reaching Hospital,19,Male,Saudi,34.0,Al-Khafjī,Al-Khafjī,67012
5464,2013-01-22 00:00:00,10/03/1434,Before Reaching Hospital,43,Male,Saudi,34.0,Al-Khafjī,Al-Khafjī,67012
5628,2012-10-28 00:00:00,12/12/1433,Before Reaching Hospital,19,Male,Egypt,10022.0,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6095,2012-01-18 00:00:00,23/02/1433,Before Reaching Hospital,17,Male,Anonymous,10022.0,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6096,2012-01-18 00:00:00,23/02/1433,Before Reaching Hospital,17,Male,Anonymous,10022.0,Al-Hufūf - Al-Mubarraz,Al-'Aḥsā',660788
6268,2011-11-03 00:00:00,08/12/1432,Before Reaching Hospital,20,Male,Saudi,10014.0,An-Nu'ayriyah,An-Nu'ayriyah,26470
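
`df.duplicated()` flags only the second and later copies of a row, so the first copy of each group is not shown above. The sketch below, on hypothetical toy data, uses `keep=False` to list every copy and a `groupby` to count copies per group. One caveat to keep in mind: the sheet has no unique record ID, so identical rows could in principle be separate victims with the same attributes rather than data-entry duplicates.

```python
import pandas as pd

# Toy records: three identical rows plus one unique row (illustrative only)
toy = pd.DataFrame({
    "death_date": ["2016-12-29"] * 3 + ["2014-10-05"],
    "age": [17, 17, 17, 31],
    "city": ["Ad-Dammām"] * 3 + ["Urayarah"],
})

# Default: only the 2nd and later copies are flagged
n_flagged = int(toy.duplicated().sum())

# keep=False flags every member of a duplicated group
all_copies = toy[toy.duplicated(keep=False)]

# Number of copies per duplicated group
copy_counts = (toy.groupby(list(toy.columns))
                  .size()
                  .reset_index(name="n_copies")
                  .query("n_copies > 1"))

print(n_flagged)
print(all_copies)
print(copy_counts)
```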


# **Data Cleaning**

In [217]:
# Rename columns
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
df.columns

Index(['death_date_(gregorian)', 'death_date_(hijri)', 'place_of_death', 'age',
       'gender', 'nationality', 'hospital_code', 'city', 'governorate',
       'population'],
      dtype='object', name=1)

In [218]:
# rename death_date_(gregorian) > death_date_gregorian & death_date_(hijri) > death_date_hijri
df = df.rename(columns={'death_date_(gregorian)': 'death_date_gregorian', 'death_date_(hijri)': 'death_date_hijri'})
df.columns

Index(['death_date_gregorian', 'death_date_hijri', 'place_of_death', 'age',
       'gender', 'nationality', 'hospital_code', 'city', 'governorate',
       'population'],
      dtype='object', name=1)

In [219]:
# Drop duplicates
df = df.drop_duplicates()
df.duplicated().sum()

np.int64(0)

In [220]:
df.shape

(7339, 10)

12 duplicate rows were removed, reducing the dataset from 7351 to 7339 rows.

In [221]:
# Handle missing values in hospital_code by filling them with "Unknown"
df['hospital_code'] = df['hospital_code'].fillna('Unknown')
df.isna().sum()

column,missing_count
death_date_gregorian,0
death_date_hijri,0
place_of_death,0
age,0
gender,0
nationality,0
hospital_code,0
city,0
governorate,0
population,0


In [222]:
# Convert dates to datetime
df['death_date_gregorian'] = pd.to_datetime(df['death_date_gregorian'], errors='coerce', dayfirst = True)

# Extract year, month, and day
df['greg_year'] = df['death_date_gregorian'].dt.year
df['greg_month'] = df['death_date_gregorian'].dt.month
df['greg_day'] = df['death_date_gregorian'].dt.day

In [225]:
# Split Hijri dates (DD/MM/YYYY) into day, month, and year
df[['hijri_day', 'hijri_month', 'hijri_year']] = df['death_date_hijri'].str.split('/', expand=True)

In [226]:
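# Change dtypes
# Note: errors='coerce' turns non-numeric entries into NaN, so values such as
# the 'Unknown' hospital codes and '<1000' populations become missing again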

num_cols = ['age', 'population', 'hospital_code', 'hijri_day', 'hijri_month', 'hijri_year', 'greg_year', 'greg_month', 'greg_day']

for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

cat_cols = ['place_of_death', 'gender', 'nationality', 'city', 'governorate']

for col in cat_cols:
    df[col] = df[col].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7339 entries, 3 to 7353
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   death_date_gregorian  7339 non-null   datetime64[ns]
 1   death_date_hijri      7339 non-null   object        
 2   place_of_death        7339 non-null   category      
 3   age                   7339 non-null   int64         
 4   gender                7339 non-null   category      
 5   nationality           7339 non-null   category      
 6   hospital_code         7309 non-null   float64       
 7   city                  7339 non-null   category      
 8   governorate           7339 non-null   category      
 9   population            6748 non-null   float64       
 10  greg_year             7339 non-null   int32         
 11  greg_month            7339 non-null   int32         
 12  greg_day              7339 non-null   int32         
 13  hijri_day             7
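
The `info()` output above shows fewer non-null values in `population` and `hospital_code` than there are rows. That is consistent with `errors='coerce'` converting text entries such as `<1000` to NaN. A minimal sketch, on made-up values, of how that happens and how a flag column could preserve the information (the name `below_1000` is hypothetical):

```python
import pandas as pd

# Toy population column mixing numbers and the "<1000" marker seen in the sheet
population = pd.Series([118327, "<1000", 660788, "<1000"], dtype="object")

# Remember which rows carried the "<1000" marker before coercing
below_1000 = population.astype(str).str.strip().eq("<1000")

# Coercion replaces anything non-numeric with NaN
numeric = pd.to_numeric(population, errors="coerce")

print(pd.DataFrame({"raw": population, "numeric": numeric, "below_1000": below_1000}))
```

Whether to keep such rows as NaN, flag them, or substitute a placeholder value depends on the analysis; any per-capita rate computed later will silently skip the NaN rows.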

In [227]:
# Hijri month names
hijri_months = {
    1: 'Muharram', 2: 'Safar', 3: 'Rabi al-Awwal', 4: 'Rabi al-Thani',
    5: 'Jumada al-Awwal', 6: 'Jumada al-Thani', 7: 'Rajab', 8: 'Shaaban',
    9: 'Ramadan', 10: 'Shawwal', 11: 'Dhul-Qadah', 12: 'Dhul-Hijjah'
}
df['hijri_month_name'] = df['hijri_month'].map(hijri_months)

# Gregorian month names
greg_months = {
    1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
    7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'
}
df['greg_month_name'] = df['greg_month'].map(greg_months)
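
Month names stored as plain strings sort alphabetically, which is why the plots later need `category_orders`. As an alternative sketch (not what the notebook does), the names can be stored as an ordered categorical so that sorting and counting follow calendar order. The toy month numbers are made up.

```python
import pandas as pd

GREG_ORDER = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Toy month numbers (illustrative only)
months = pd.Series([12, 1, 3, 1])

# Map 1..12 to names, then fix the calendar order with an ordered categorical
names = months.map(dict(enumerate(GREG_ORDER, start=1)))
names = pd.Series(pd.Categorical(names, categories=GREG_ORDER, ordered=True))

# sort_index() on a categorical index follows the category order, not the alphabet
counts = names.value_counts().sort_index()
observed = [m for m in counts.index if counts[m] > 0]
print(observed)
```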

In [229]:
# Add an age_group column
age_groups = {
    '0-14': (0, 14), '15-24': (15, 24), '25-34': (25, 34), '35-44': (35, 44),
    '45-54': (45, 54), '55-64': (55, 64), '65-74': (65, 74), '75-84': (75, 84),
    '85+': (85, float('inf')) # 85 to infinity
}
df['age_group'] = df['age'].apply(lambda x: next((k for k, v in age_groups.items() if v[0] <= x <= v[1]), None))
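
The same binning can also be expressed with `pd.cut`, which is vectorised and returns an ordered categorical. This is an alternative sketch on toy ages, not the code the notebook ran; the bin edges are chosen to reproduce the groups defined above.

```python
import numpy as np
import pandas as pd

# Toy ages covering the bin boundaries (illustrative only)
ages = pd.Series([0, 14, 15, 24, 25, 60, 85, 102])

# Left-closed bins: [0, 15), [15, 25), ..., [85, inf)
bins = [0, 15, 25, 35, 45, 55, 65, 75, 85, np.inf]
labels = ['0-14', '15-24', '25-34', '35-44', '45-54',
          '55-64', '65-74', '75-84', '85+']

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group.tolist())
```

With `right=False` each bin includes its lower edge and excludes its upper edge, so an age of 15 falls in `15-24` and not in `0-14`.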

In [181]:
# save the cleaned data
df.to_csv('Traffic_Accident_Mortality_Primary_cleaned.csv', index=False)

# **EDA**

In [230]:
# Deaths by Gender
fig = px.histogram(df, x='gender', color='gender', title='Deaths by Gender', text_auto=True)
fig.show()

The gender distribution is heavily skewed: male deaths far outnumber female deaths, at roughly 90% of the records (6,594 of 7,351 before deduplication). This suggests either that men are far more exposed to road risks here, or that the collected data is imbalanced.
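
To put a number on the imbalance, `value_counts(normalize=True)` gives the share of each category. The sketch below rebuilds a gender column from the counts reported by `describe()` earlier (6,594 male out of 7,351 records, before duplicates were dropped); on the real data the call would simply be applied to `df['gender']`.

```python
import pandas as pd

# Rebuild a gender column from the pre-deduplication counts in describe()
n_total, n_male = 7351, 6594
gender = pd.Series(["Male"] * n_male + ["Female"] * (n_total - n_male))

# Share of each gender, in percent
share = gender.value_counts(normalize=True).mul(100).round(1)
print(share)
```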

In [231]:
# Deaths by nationality (top 5)
death_by_nationality = df['nationality'].value_counts().head(5).reset_index()
death_by_nationality.columns = ['nationality', 'count']

fig = px.histogram(death_by_nationality, x='nationality', y='count', color='nationality', title='Deaths by Nationality (Top 5)', text_auto=True)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

Most deaths are of Saudi nationals, which makes sense since the data comes from Saudi Arabia, where Saudis make up the largest share of the population. The next-highest groups are Indians and Pakistanis, both common in jobs related to driving or transportation. Still, it made me wonder whether the counts mainly reflect which nationalities are most numerous in Saudi Arabia rather than any difference in risk.

In [232]:
# Place of death
fig = px.pie(df, names='place_of_death', title='Place of death', hole=0.35, color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textinfo='percent+label'); fig.show()

The majority of deaths, around 78%, happen before the victim even reaches the hospital. That is a huge share. It points to two possible explanations: either the accidents are extremely severe, or emergency response is too slow. The data cannot confirm either one, though. Because the dataset records only deaths, people who reached the hospital and survived are not counted, which may be why the "At Hospital" share looks low.

In [233]:
# Death by city
death_by_city = df['city'].value_counts().head(10).reset_index()
death_by_city.columns = ['city', 'count']

fig = px.histogram(death_by_city, x='city', y='count', color='city', title='Deaths by City', text_auto=True)
fig.update_layout(xaxis_tickangle=-45)
fig.show()

Dammam, Hafar al-Batin, and Al-Hufuf recorded the most deaths. These are major cities in the Eastern Province with heavy traffic and large populations, so this is understandable.

In [234]:
# Deaths by age (the 5 most frequent ages only)
death_by_age = df['age'].value_counts().head(5).reset_index()
death_by_age.columns = ['age', 'count']

fig = px.pie(death_by_age, names='age', values='count', title='Deaths by Age (5 most frequent ages)')
fig.show()

In [235]:
# Death By AGE GROUP (taking age groups from age columns)
death_by_age_group = df['age_group'].value_counts().reset_index()
death_by_age_group.columns = ['age_group', 'count']

fig = px.pie(death_by_age_group, names='age_group', values='count', title='Deaths by Age Group')
fig.show()

Most deaths fall between age 15 and the mid-thirties (the 15-24 and 25-34 groups), representing almost 55% of road accident deaths.

**Time patterns**

In [236]:
# month orderings
HIJRI_ORDER = ['Muharram','Safar','Rabi al-Awwal','Rabi al-Thani','Jumada al-Awwal','Jumada al-Thani','Rajab','Shaaban','Ramadan','Shawwal','Dhul-Qadah','Dhul-Hijjah']
GREG_ORDER  = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

In [237]:
# Monthly trend (Gregorian)
monthly = (df
           .groupby(['greg_year','greg_month','greg_month_name'])
           .size().reset_index(name='deaths'))
monthly = monthly.sort_values(['greg_year','greg_month'])

fig = px.line(monthly, x='greg_month_name', y='deaths', color='greg_year',
              category_orders={'greg_month_name': GREG_ORDER},
              markers=True, title='Monthly deaths by Gregorian year')

fig.update_layout(xaxis_title='Gregorian month', yaxis_title='Deaths')
fig.show()
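
The chart above overlays one line per year. To see the long-run trend as a single continuous series, the dates can instead be resampled by calendar month. This is a sketch on four made-up dates; on the real data the index would be `df['death_date_gregorian']`.

```python
import pandas as pd

# Toy death dates (illustrative only)
dates = pd.to_datetime(["2019-12-29", "2019-12-30", "2020-01-01", "2020-03-15"])

# One row per death, then count per calendar month ("MS" = month start).
# Months with no deaths appear as 0 instead of being skipped.
monthly_series = pd.Series(1, index=dates).resample("MS").sum()
print(monthly_series)
```

Unlike the `groupby` on year and month, resampling keeps empty months in the series, which matters if the result feeds a rolling average or a line plot.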

In [249]:
# Monthly trend (Hijri)

monthly = (df.groupby(['hijri_year','hijri_month','hijri_month_name']).size().reset_index(name='deaths').sort_values(['hijri_year','hijri_month']))
monthly['hijri_month_name'] = pd.Categorical(monthly['hijri_month_name'],categories=HIJRI_ORDER,ordered=True)

fig = px.line(monthly, x='hijri_month_name', y='deaths', color='hijri_year',
              category_orders={'hijri_month_name': HIJRI_ORDER},
              markers=True, title='Monthly deaths by Hijri year')

fig.update_layout(xaxis_title='Hijri month', yaxis_title='Deaths')
fig.show()

In [240]:
# Deaths by Hijri month (all years combined)
hijri_counts = (df.groupby(['hijri_month','hijri_month_name']).size()
                  .reset_index(name='deaths')
                  .sort_values('hijri_month'))
fig = px.bar(hijri_counts, x='hijri_month_name', y='deaths',
             category_orders={'hijri_month_name': HIJRI_ORDER},
             color='deaths', color_continuous_scale='YlGnBu',
             title='Deaths by Hijri month (all years)')
fig.update_layout(xaxis_title='Hijri month', yaxis_title='Deaths'); fig.show()

In [241]:
# Hijri month x Hijri year (heatmap)
hijri_heat = (df.groupby(['hijri_year','hijri_month_name']).size()
                .reset_index(name='deaths'))
fig = px.density_heatmap(hijri_heat, x='hijri_month_name', y='hijri_year', z='deaths',
                         category_orders={'hijri_month_name': HIJRI_ORDER},
                         color_continuous_scale='Blues',
                         title='Heatmap: deaths by Hijri year × Hijri month')
fig.update_layout(xaxis_title='Hijri month', yaxis_title='Hijri year')
fig.show()

In [242]:
# Gender × Place of death
fig = px.histogram(df, x='gender', color='place_of_death', barmode='stack',
                   title='Deaths by gender and place of death',
                   color_discrete_sequence=px.colors.qualitative.Set3)
fig.update_layout(xaxis_title='Gender', yaxis_title='Deaths')
fig.show()

In [243]:
# Nationality × Gender (top 5 nationalities)
top_nat = df['nationality'].value_counts().nlargest(5).index
tmp = df[df['nationality'].isin(top_nat)]
fig = px.histogram(tmp, x='nationality', color='gender', barmode='group',
                   title='Top 5 nationalities by gender',
                   color_discrete_sequence=px.colors.qualitative.Set3)
fig.update_layout(xaxis_title='Nationality', yaxis_title='Deaths')
fig.show()

In [244]:
# City × Place of death (top 12 cities)
top_cities = df['city'].value_counts().nlargest(12).index
tmp = df[df['city'].isin(top_cities)]
fig = px.histogram(tmp, x='city', color='place_of_death', barmode='stack',
                   title='Top cities: deaths by place of death',
                   color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(xaxis_title='City', yaxis_title='Deaths'); fig.show()

**Categorical with Numeric**

In [245]:
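# Aggregate by city. city and governorate are categorical, so observed=True
# keeps only the combinations that actually occur in the data.
# Rows whose population is NaN are excluded by groupby.
city_agg = (df.groupby(['city','governorate','population'], observed=True)
              .size().reset_index(name='deaths'))
# Deaths per 100,000 residents over the whole study period (not an annual rate)
city_agg['deaths_per_100k'] = city_agg['deaths'] / city_agg['population'] * 1e5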

# Top cities by deaths
fig = px.bar(city_agg.sort_values('deaths', ascending=False).head(15),
             x='city', y='deaths', color='deaths', color_continuous_scale='Viridis',
             title='Top 15 cities by total deaths')
fig.update_layout(xaxis_title='City', yaxis_title='Deaths'); fig.show()
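
The cell above computes `deaths_per_100k` but only plots raw totals. A small sketch, on hypothetical toy data, of how the per-capita rate can reorder cities, and of why `observed=True` matters when a grouping column is categorical:

```python
import pandas as pd

# Toy data: city "C" is a declared category with no records (illustrative only)
toy = pd.DataFrame({
    "city": pd.Categorical(["A", "A", "B"], categories=["A", "B", "C"]),
    "population": [1000.0, 1000.0, 2000.0],
})

# observed=True drops category combinations that never occur in the data
agg = (toy.groupby(["city", "population"], observed=True)
          .size()
          .reset_index(name="deaths"))

# Deaths per 100,000 residents over the whole period covered by the data
agg["deaths_per_100k"] = agg["deaths"] / agg["population"] * 1e5

print(agg.sort_values("deaths_per_100k", ascending=False))
```

Because the deaths span several years while the population is a single figure, this is a cumulative rate. Dividing by the number of years covered would give a rough annual rate.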






In [246]:
# Hijri month × place of death
tab = pd.crosstab(df['hijri_month_name'], df['place_of_death']).reindex(HIJRI_ORDER)
fig = px.imshow(tab, text_auto=True, color_continuous_scale='Blues',
                title='Hijri month × Place of death')
fig.update_layout(xaxis_title='Place of death', yaxis_title='Hijri month'); fig.show()
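
Raw counts in the heatmap are dominated by how many deaths each month has overall. To compare months fairly, `pd.crosstab` can return row shares with `normalize='index'`. A sketch on made-up records:

```python
import pandas as pd

# Toy records (illustrative only)
toy = pd.DataFrame({
    "hijri_month_name": ["Ramadan"] * 4 + ["Shawwal"] * 2,
    "place_of_death": ["Before Reaching Hospital"] * 3 + ["At Hospital"]
                      + ["Before Reaching Hospital", "At Hospital"],
})

# Each row sums to 1: the share of each place of death within that month
share = pd.crosstab(toy["hijri_month_name"], toy["place_of_death"],
                    normalize="index")
print(share)
```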

In [247]:
# Governorate × gender (top 5 governorates)
top_gov = df['governorate'].value_counts().nlargest(5).index
tab = pd.crosstab(df[df['governorate'].isin(top_gov)]['governorate'], df['gender'])
fig = px.imshow(tab, text_auto=True, color_continuous_scale='Teal',
                title='Governorate × Gender (top 5)')
fig.update_layout(xaxis_title='Gender', yaxis_title='Governorate')
fig.show()

# **Final Overall Insights**

After finishing the EDA, I realized that the dataset covers only the Eastern Province, not all of Saudi Arabia as I first thought. So while it gives good insight into accidents in that area, it cannot represent the whole country. It is more useful for understanding patterns and possible causes than for prediction or generalization.

From what I saw, most deaths were male, which may mean the data is somewhat imbalanced. Saudis, followed by Indians and Pakistanis, make up most of the cases. What stood out most to me is that around 78% of deaths happened before reaching the hospital.

The cities with the most deaths were Ad-Dammam, Al-Hufuf, and Hafar al-Batin, which are also the most populated. That matches what national reports say: Riyadh, Makkah, and the Eastern Region consistently record the most driving licenses and accidents, simply because they have more people and vehicles.

Overall, this data helps a lot in understanding how and where accidents happen, especially in the Eastern Province. But it is not the best dataset for building an AI model or prediction system, since it is imbalanced and limited to one region. Still, it was really useful for seeing the patterns.

# **Data Structure**

	•	The dataset used in this notebook is the Raw Accident Mortality Data sheet, one of the sheets in the original file.
	•	It contains 7351 rows and 10 columns; after cleaning and feature extraction it has 7339 rows and 19 columns.

[Data Source](https://data.mendeley.com/datasets/f5t4kvmn8g/2)


