---
title: "Crash Course: What's Behind New Jersey's Fatal Road Puzzle?"
subtitle: "Exploring Road Safety Factors And Accident Patterns"
author: "Sai Harshitha Dalli"
bibliography: references.bib
number-sections: false
format:
  html:
    theme: default
    embed-resources: true
    code-fold: true
    code-tools: true
    toc: true
  #pdf: default
jupyter: python3
---


![Photo downloaded from Unsplash: Credits - Karl Solano](https://images.unsplash.com/photo-1608694385922-2a5173401a2e?q%3D80%26w%3D1031%26auto%3Dformat%26fit%3Dcrop%26ixlib%3Drb-4.0.3%26ixid%3DM3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D){fig-alt="A photo of a car crash"}

Road accidents happen often, and some of them result in fatalities. This is a major, ongoing public safety and public health issue. Many of us travel on the road daily, and you have probably encountered a car crash at least once. When it comes to New Jersey, drivers have a [bad reputation](https://nj1015.com/you-give-all-nj-drivers-a-bad-reputation-if-you-drive-like-this/).

I have lived in New Jersey for half my life, and when I tell people that fact, they always bring up the bad driving. I've had my fair share of seeing drivers not using their signals, speeding, or going below the speed limit, and I have seen many do the "Jersey slide". For those who are not familiar with it, a Jersey slide is when a driver goes from the far left lane all the way over to the exit ramp in one swift move without using a blinker. It mostly happens on highways, and it happens quite often. This kind of driving behavior can lead to accidents on the road.

My curiosity about these patterns arises from witnessing many car crashes on the road while driving, which is why I want to examine some of the factors that could potentially cause fatal accidents in New Jersey.

# Data

### Dataset

The dataset utilized for this analysis is sourced from [The Fatality Analysis Reporting System (FARS)](https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/), a comprehensive nationwide database maintained by the National Highway Traffic Safety Administration (NHTSA) in the United States. FARS provides detailed information on fatal injuries sustained in motor vehicle traffic crashes, offering valuable insights for research and policymaking.

The dataset, spanning from 1975 to 2021, includes various files related to accidents, persons involved, vehicles, weather conditions, distractions, damages, and more. While data is available for multiple years, I have chosen to focus specifically on 2021 because it is the most recent year in the dataset. Using pandas, a Python library, I will merge key datasets on the identifier `ST_CASE`, a variable that identifies each crash and appears in every CSV file.

The selected datasets include accident.csv, drimpair.csv, vehicle.csv, weather.csv, and person.csv, each containing essential variables pertinent to addressing research questions.
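Because these files store different units (one row per crash, per vehicle, or per person), a merge on `ST_CASE` can repeat crash-level rows. The sketch below uses small made-up frames, not FARS records, to show how the `validate` argument of `pd.merge` could confirm the expected one-to-many relationship before any totals are computed. The frame names ending in `_toy` are placeholders for illustration.

```python
import pandas as pd

# Toy stand-ins for accident.csv (one row per crash) and vehicle.csv
# (one row per vehicle); the values are made up for illustration.
accident_toy = pd.DataFrame({"ST_CASE": [340001, 340002], "FATALS": [1, 2]})
vehicle_toy = pd.DataFrame({
    "ST_CASE": [340001, 340001, 340002],
    "MAKENAME": ["Honda", "Ford", "Toyota"],
})

# validate="one_to_many" raises a MergeError if ST_CASE is duplicated
# on the crash side, which would signal an unexpected file layout.
merged_toy = pd.merge(accident_toy, vehicle_toy, on="ST_CASE",
                      how="inner", validate="one_to_many")

# Crash 340001 involved two vehicles, so its crash-level row appears twice.
rows_per_crash = merged_toy.groupby("ST_CASE").size()
print(rows_per_crash)
```

Checking `rows_per_crash` before summing a crash-level column such as `FATALS` makes it easier to notice when a total is being repeated once per vehicle or per person.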

::: {.callout-note title="Libraries and packages" collapse="false"}

In [None]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
import geopandas as gpd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

:::

### Preprocessing


In [None]:
# Reading the data
accident_df = pd.read_csv('C:\\Users\\saiha\\OneDrive\\Desktop\\Data Science\\Capstone\\FARS2021NationalCSV\\accident.csv', encoding='latin1')
drimpair_df = pd.read_csv('C:\\Users\\saiha\\OneDrive\\Desktop\\Data Science\\Capstone\\FARS2021NationalCSV\\drimpair.csv', encoding='latin1')
vehicle_df = pd.read_csv('C:\\Users\\saiha\\OneDrive\\Desktop\\Data Science\\Capstone\\FARS2021NationalCSV\\vehicle.csv', encoding='latin1')
person_df = pd.read_csv('C:\\Users\\saiha\\OneDrive\\Desktop\\Data Science\\Capstone\\FARS2021NationalCSV\\person.csv', encoding='latin1')
weather_df = pd.read_csv('C:\\Users\\saiha\\OneDrive\\Desktop\\Data Science\\Capstone\\FARS2021NationalCSV\\weather.csv', encoding='latin1')
# Filtering rows with only New Jersey
accident_nj = accident_df[accident_df['STATE'] == 34 ]
drimpair_nj = drimpair_df[drimpair_df['STATE'] == 34]
vehicle_nj = vehicle_df[vehicle_df['STATE'] == 34]
person_nj = person_df[person_df['STATE'] == 34]
weather_nj = weather_df[weather_df['STATE'] == 34]
# Filtering for wanted columns
accident_nj_selected = accident_nj[['ST_CASE', 'COUNTYNAME', 'PEDS', 'RUR_URB', 'FUNC_SYS', 'FUNC_SYSNAME', 'MONTHNAME', 'DAY', 'DAY_WEEKNAME', 'HOUR', 'WEATHER', 'LATITUDE', 'LONGITUD', 'FATALS']]
drimpair_nj_selected = drimpair_nj[['ST_CASE', 'DRIMPAIR', 'DRIMPAIRNAME']]
person_nj_selected = person_nj[['ST_CASE', 'AGE', 'SEX', 'INJ_SEV', 'INJ_SEVNAME']]
vehicle_nj_selected = vehicle_nj[['ST_CASE', 'HARM_EVNAME', 'MAKENAME', 'MOD_YEAR']]
weather_nj_selected = weather_nj[['ST_CASE', 'WEATHER', 'WEATHERNAME']]

I've narrowed down my dataset to focus only on fatal accident data within New Jersey. I believe it's important to analyze each state's data separately because driving laws vary significantly from one state to another. Once I've filtered the dataset to include only New Jersey's data, I further refine it by selecting the specific variables required for my analysis.
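The filter-then-select step repeats the same pattern for every file, so it could be wrapped in a small helper. This is a minimal sketch on made-up rows; `filter_state` and `NJ_STATE_CODE` are hypothetical names introduced here, and 34 is the state code used in the filtering code above.

```python
import pandas as pd

NJ_STATE_CODE = 34  # state code used for New Jersey in the filters above


def filter_state(df, columns, state_code=NJ_STATE_CODE):
    """Return a copy of df limited to one state and the requested columns."""
    return df.loc[df["STATE"] == state_code, columns].copy()


# Made-up rows standing in for a national FARS file
national_toy = pd.DataFrame({
    "STATE": [34, 36, 34],
    "ST_CASE": [340001, 360001, 340002],
    "FATALS": [1, 1, 2],
    "HOUR": [21, 10, 9],
})

nj_toy = filter_state(national_toy, ["ST_CASE", "FATALS"])
print(nj_toy)
```

Returning a copy means later column assignments on the result do not trigger pandas' `SettingWithCopyWarning`.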

The accident.csv file is my main dataset. It includes variables such as the number of pedestrians involved, rural/urban area, county, functional system, month, day, day of the week, hour, longitude, latitude, and number of fatalities.


In [None]:
accident_nj_selected.head(5)

The drimpair.csv dataset has variables describing each driver's impairment at the time of the accident. This variable is self-reported by the drivers after the accident.


In [None]:
drimpair_nj_selected.head(5)

The person.csv data set has demographic variables such as the age and sex of the drivers involved in accidents.


In [None]:
person_nj_selected.head(5)

The vehicle.csv data set consists of variables such as make name and model year of the vehicles involved in accidents.


In [None]:
vehicle_nj_selected.head(5)

The weather.csv has data on the weather at the time of the accident.


In [None]:
weather_nj_selected.head(5)

# County Analysis

### What counties are pedestrian accidents more prevalent in?

I wanted to examine all the counties in NJ and analyze which of them have the highest number of pedestrian accidents. Understanding the distribution of pedestrian accidents across counties is crucial for identifying areas with higher risk levels and potential hotspots. This insight could aid in implementing targeted road safety strategies and plans to mitigate pedestrian accidents. To visualize the high-risk areas for pedestrian accidents, I created a choropleth heatmap.


In [None]:
# Filtering out the important columns for this question
accident_nj_peds = accident_nj[['ST_CASE', 'PEDS', 'RUR_URB', 'FUNC_SYS', 'COUNTYNAME','ROUTENAME','LATITUDE', 'LONGITUD', 'FATALS']]
# Filter only the pedestrian accidents from the dataset
pedestrian_accidents = accident_nj_peds[accident_nj_peds['PEDS'] > 0]
# Rename the column COUNTYNAME to COUNTY
pedestrian_accidents_df = pedestrian_accidents.rename(columns={'COUNTYNAME': 'COUNTY'})
# Removes everything after the parenthesis
pedestrian_accidents_df['COUNTY'] = pedestrian_accidents_df['COUNTY'].str.split('(').str[0].str.strip()
# Aggregate the accident counts by county
county_accident_counts = pedestrian_accidents_df.groupby('COUNTY')['PEDS'].sum().reset_index(name='Total_Accidents')

# Load geospatial data for New Jersey counties
nj_counties = gpd.read_file(r'C:\Users\saiha\OneDrive\Desktop\Data Science\Capstone\capstone\data\County_Boundaries_of_NJ\County_Boundaries_of_NJ.shp')

# Merge pedestrian accident data with geospatial data
merged_data = nj_counties.merge(county_accident_counts, on='COUNTY', how='left')

# Repeat the county clean-up for all accidents and total the fatalities per county
rename_county = accident_nj_peds.rename(columns={'COUNTYNAME': 'COUNTY'})
rename_county['COUNTY'] = rename_county['COUNTY'].str.split('(').str[0].str.strip()
accident_counts = rename_county.groupby('COUNTY')['FATALS'].sum().reset_index(name='Total_Accidents')
# nj_counties was already loaded above, so it is reused here
merged_data_total = nj_counties.merge(accident_counts, on='COUNTY', how='left')

# Used https://github.com/RomanDataLab/Data_science_analysis_realestate_BCN/blob/main/DS3_FIN_2%2B3.ipynb as reference for this part of the code
'''
This is the part I used for my data:
# Set the plot size to full size
fig, ax = plt.subplots(figsize=(15, 10))
totcol = 'Wistia'
# Plot gdfb with the "AdvIn" column
gdfb.plot(column="AdvIn", cmap=totcol, ax=ax)

# Add names of 'BARRI' in the centers of polygons
for idx, row in gdfb.iterrows():
    centroid = row['geometry'].centroid
    ax.text(centroid.x, centroid.y, row['BARRI'], ha='center', fontsize=6, color='gray')
'''
# Create choropleth map
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
merged_data.plot(column='Total_Accidents', cmap='Purples', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)

# Add county names
for idx, row in merged_data.iterrows():
    ax.text(row.geometry.centroid.x, row.geometry.centroid.y, row['COUNTY'], fontsize=4, ha='center', va='center', color='black')

ax.set_title('Total Pedestrian Accidents by County')

# Removes the axes
ax.set_axis_off()

plt.show()

::: callout-caution
The map counts pedestrians involved in fatal crashes (the `PEDS` column), not only pedestrians who were killed.
:::

Looking at the map, we can see that Essex and Camden counties have high pedestrian accident counts. A few reasons for this could be that both counties are among the most densely populated in NJ, which could correlate with increased pedestrian activity and, consequently, pedestrian accidents. Newark, located in Essex County, and Camden City, situated in Camden County, are major urban areas with heavy pedestrian traffic as people walk to colleges, work, and other places. Densely populated urban areas with heavy traffic can increase the risk of pedestrian accidents.

Salem and Hunterdon counties show no pedestrian accidents. This may be due to their lower population density compared to Essex and Camden counties. Additionally, they may have less busy streets and lower pedestrian activity. Another possibility for the absence of pedestrian accidents could be underreporting or non-reporting, as these counties are smaller jurisdictions.
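Counties with no matching rows come out of the left merge as missing values rather than zeros, which affects how they are shaded on the map. The sketch below reproduces the name clean-up and merge on made-up rows. The plain table `counties_toy` stands in for the shapefile's attributes, and its upper-case county names are an assumption about that file.

```python
import pandas as pd

# Made-up county labels in the "NAME (code)" style that the clean-up above expects
crashes_toy = pd.DataFrame({
    "COUNTYNAME": ["ESSEX (13)", "ESSEX (13)", "CAMDEN (7)"],
    "PEDS": [1, 2, 1],
})
# Stand-in for the boundary file's attribute table (no geometry needed here)
counties_toy = pd.DataFrame({"COUNTY": ["ESSEX", "CAMDEN", "SALEM"]})

# Strip the trailing code so the names match the boundary file
crashes_toy["COUNTY"] = crashes_toy["COUNTYNAME"].str.split("(").str[0].str.strip()
counts_toy = (crashes_toy.groupby("COUNTY")["PEDS"].sum()
              .reset_index(name="Total_Accidents"))

# A left merge keeps every county; counties without rows come back as NaN
merged_toy = counties_toy.merge(counts_toy, on="COUNTY", how="left")
missing_before_fill = int(merged_toy["Total_Accidents"].isna().sum())

# Filling with 0 makes "no recorded pedestrians" explicit
merged_toy["Total_Accidents"] = merged_toy["Total_Accidents"].fillna(0).astype(int)
print(merged_toy)
```

Whether to fill with 0 or leave the value missing is a judgment call: 0 asserts that nothing happened, while a missing value leaves room for a name mismatch between the two files.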

### What counties have the most number of fatal accidents?

Examining fatal accidents at a county level is crucial because it enables us to visualize areas with high and low accident rates, providing insights into which counties should implement stricter policing measures for drivers exhibiting dangerous behaviors, including speeding, impaired driving, distracted driving, and failure to follow traffic rules. I created a choropleth heatmap to visualize the total fatal accidents by counties. This heatmap provides a clear overview of the distribution of fatal accidents across New Jersey, highlighting areas with higher concentrations of accidents.


In [None]:
# Used https://github.com/RomanDataLab/Data_science_analysis_realestate_BCN/blob/main/DS3_FIN_2%2B3.ipynb as reference for this part of the code
'''
This is the part I used for my data:
# Set the plot size to full size
fig, ax = plt.subplots(figsize=(15, 10))
totcol = 'Wistia'
# Plot gdfb with the "AdvIn" column
gdfb.plot(column="AdvIn", cmap=totcol, ax=ax)

# Add names of 'BARRI' in the centers of polygons
for idx, row in gdfb.iterrows():
    centroid = row['geometry'].centroid
    ax.text(centroid.x, centroid.y, row['BARRI'], ha='center', fontsize=6, color='gray')
'''
# Create choropleth map for total fatal accidents
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
merged_data_total.plot(column='Total_Accidents', cmap='Purples', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)

# Add county names
for idx, row in merged_data_total.iterrows():
    ax.text(row.geometry.centroid.x, row.geometry.centroid.y, row['COUNTY'], fontsize=4, ha='center', va='center', color='black')

# Removes the axes
ax.set_axis_off()
ax.set_title('Total Fatal Accidents by County')

#plt.savefig('choropleth_map.png', dpi=800, bbox_inches='tight')

plt.show()

The heatmap once again highlights Essex and Camden counties as having the highest number of fatal accidents in NJ, a count that includes both pedestrian and vehicle accident deaths. This trend may be due to the population density in these counties and the prevalence of reckless driving behaviors among drivers. Factors such as speeding, distracted driving, and driving under the influence of alcohol and drugs likely contribute to the high number of fatal accidents observed in these areas.

# Vehicle Analysis

### What vehicle make has the most and least number of fatal accidents?

When shopping for a car, one important factor people consider is the safety and reputation of the vehicle. For many, the brand name carries significant weight and influences their purchasing decisions. Understanding how different vehicle brands perform in terms of fatal accidents is crucial in helping consumers make informed choices. I've created a bar graph to visualize the performance of various vehicle brands in fatal accidents. This information could have a significant impact on the decisions of individuals looking to buy a car, as it provides insight into the safety records of different brands.


In [None]:
# Merge the accident and vehicle data
accident_vehicle_merged = pd.merge(accident_nj_selected, vehicle_nj_selected, on='ST_CASE', how='inner')
# Keep rows with at least one fatality
fatal_accidents_df = accident_vehicle_merged[accident_vehicle_merged['FATALS'] > 0]

# Group by Vehicle Make and Fatals
grouped_df = fatal_accidents_df.groupby('MAKENAME')['FATALS'].sum().reset_index()
grouped_df = grouped_df.sort_values(by='FATALS', ascending=False)

# Visualize with a horizontal bar graph
plt.figure(figsize=(12,10))
plt.barh(grouped_df['MAKENAME'], grouped_df['FATALS'], color='purple', label='Total Fatal Accidents')
plt.xlabel('Number of Fatal Accidents')
plt.ylabel('Vehicle Make')
plt.title('Total Fatal Accidents by Vehicle Make')
plt.xticks(rotation=90)
plt.legend()
plt.tight_layout()
plt.show()

According to the analysis, Honda, Ford, and Toyota have the highest number of fatal accidents, while GMC and Jaguar have the lowest. These results may be attributed to several factors, but one possible explanation is the global popularity and widespread presence of Honda, Ford, and Toyota vehicles. With these brands being highly recognizable and commonly seen on roads worldwide, the increased number of vehicles in circulation may contribute to a higher probability of accidents involving them.
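One detail worth checking in this kind of tally is that a crash involving several vehicles contributes its fatalities to every make involved. The sketch below, on made-up rows, contrasts summing `FATALS` after the merge with counting distinct `ST_CASE` values per make. It illustrates the difference and is not a re-run of the analysis above.

```python
import pandas as pd

# Made-up crashes: crash 1 has one death and two vehicles, crash 2 has two deaths
accident_toy = pd.DataFrame({"ST_CASE": [1, 2], "FATALS": [1, 2]})
vehicle_toy = pd.DataFrame({
    "ST_CASE": [1, 1, 2],
    "MAKENAME": ["Honda", "Ford", "Honda"],
})
merged_toy = accident_toy.merge(vehicle_toy, on="ST_CASE", how="inner")

# Summing FATALS credits a crash's deaths to every make involved in it
fatals_by_make = merged_toy.groupby("MAKENAME")["FATALS"].sum()

# Counting distinct ST_CASE values gives the number of fatal crashes per make
crashes_by_make = merged_toy.groupby("MAKENAME")["ST_CASE"].nunique()

print(fatals_by_make)
print(crashes_by_make)
```

In this toy example the per-make sums add up to 4 even though only 3 people died, because the single death in crash 1 is counted for both Honda and Ford.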

### What vehicle model year has the highest and lowest fatal accidents?

I also wanted to explore the impact of vehicle model year on fatal accidents, as I had a hunch that newer cars might be safer due to advancements in technology and safety features. By analyzing fatal accident data by model year, I wanted to see if there is a correlation between the age of a vehicle and its involvement in fatal accidents. If newer model years are associated with fewer fatal accidents, it suggests that investing in a newer vehicle with updated safety features may contribute to enhanced road safety.


In [None]:
# Filter the years to include only those in the range of 1983 to 2021
fatal_accidents_df = fatal_accidents_df[fatal_accidents_df['MOD_YEAR'].between(1983, 2021)]

# Group by model year and total the fatalities
modyear_grouped_df = fatal_accidents_df.groupby('MOD_YEAR')['FATALS'].sum().reset_index()

# Create an interactive bar graph using Plotly
fig = go.Figure(data=[go.Bar(x=modyear_grouped_df['MOD_YEAR'], y=modyear_grouped_df['FATALS'], 
                             marker_color='#800080')])
fig.update_layout(title='Total Fatal Accidents by Model Year',
                  xaxis_title='Model Year',
                  yaxis_title='Number of Fatal Accidents',
                  xaxis_tickangle=-45)
fig.show()

To my surprise, the analysis revealed that cars from model years 2015 and 2017 had the highest number of fatal accidents, while those from before the 2000s had lower counts. One possible explanation is how many vehicles from each model year are still on the road. Cars made before the 2000s may comprise a smaller proportion of the total vehicle population, leading to fewer fatal accidents involving these older vehicles. On the other hand, the larger number of 2015 and 2017 cars on the road may contribute to their higher counts.

# Road Type Analysis

### Are accidents more common in urban or rural areas, and how does road type affect accident rates?

![A Photo By Alex Cecchini](https://streetsmn.s3.us-east-2.amazonaws.com/wp-content/uploads/2013/12/Road__Functional_Classification-500x414.png)

Building upon the insights gained from the county analysis, which revealed higher fatal accident rates in counties with more urbanized areas, I wanted to explore whether urban areas indeed experience more accidents compared to rural areas. Additionally, I wanted to investigate how road types contribute to these differences in accident rates across urban and rural settings. I created a grouped bar chart comparing the number of fatal accidents across different road types in both urban and rural areas to examine the relationship.


In [None]:
# Keep only crashes coded as rural (1) or urban (2); .copy() avoids SettingWithCopyWarning
filtered_data = accident_nj[accident_nj['RUR_URB'].isin([1, 2])].copy()

# Shorten the two longest road type names
filtered_data['FUNC_SYSNAME'] = filtered_data['FUNC_SYSNAME'].replace({
    'Principal Arterial - Other Freeways and Expressways': 'Other Freeways',
    'Principal Arterial - Other': 'Principal Arterial'})

# FARS codes RUR_URB as 1 = Rural and 2 = Urban
filtered_data['Urban/Rural'] = filtered_data['RUR_URB'].map({1: 'Rural', 2: 'Urban'})

# Count the accidents for each road type and area type
melted_data = (filtered_data.groupby(['FUNC_SYSNAME', 'Urban/Rural'])
               .size().reset_index(name='Number of Accidents'))

# Create an interactive grouped bar plot using plotly express
fig = px.bar(melted_data, y='FUNC_SYSNAME', x='Number of Accidents', color='Urban/Rural',
             color_discrete_map={'Urban': '#800080', 'Rural': '#FF8C00'},
             labels={'FUNC_SYSNAME': 'Road Types'},
             barmode='group')

# Road types are on the y-axis, so order that axis by total accidents
fig.update_yaxes(categoryorder="total ascending")

# Update legend title
fig.update_layout(legend_title_text='Urban/Rural')

fig.show()

Through my analysis, I found that urban areas had a higher number of fatal accidents than rural areas, which is consistent with the county maps above. FARS codes `RUR_URB` as 1 for rural and 2 for urban, and the chart labels follow that coding. This result is plausible because New Jersey is the most densely populated state, so much of its traffic runs on urban roads. For the road types, Principal Arterial and Minor Arterial roads have the highest accidents in urban areas, while Local and Major Collector roads have the highest accidents in rural areas. This could be because urban arterials carry heavy traffic at relatively high speeds while also passing through areas with frequent intersections, pedestrians, and cyclists, which creates many points where conflicting movements occur. In rural areas, Local and Major Collector roads may lack safety features such as medians, guardrails, and lighting, and lighter traffic may lead drivers to be more inclined to speed on these roads.

# Demographic Analysis

### What age group are the number of accidents most and least prevalent in?

It is crucial to look at demographics such as age and sex of the people involved in accidents. Age plays a significant role in determining whether targeted interventions are needed for drivers within certain age groups or for passengers involved in accidents. By identifying which age groups have the highest number of accidents, public health officials can focus more on programs aimed at preventing injuries, providing emergency medical assistance, and offering trauma care. I created a histogram displaying the frequencies of age groups present in the dataset.


In [None]:
# Keep ages 97 and under, which drops the placeholder codes used for unknown ages
age_drop = person_nj_selected['AGE'] <= 97
Age = person_nj_selected[age_drop]
# Histogram for Age
plt.figure(figsize=(10, 6))
plt.hist(Age['AGE'], bins=10, edgecolor='black', alpha=0.7, color='purple')
plt.title('Age Distribution of Accidents')
plt.xlabel('Age')
plt.ylabel('Number of Accidents')
plt.show()

::: callout-caution
This graph doesn't count fatalities only; it includes every person recorded in person.csv for these crashes, including people who survived.
:::

The age group with the highest number of accidents falls between 20 and 30 years old. This could be due to several factors. Firstly, individuals in their early 20s often start driving more frequently for college or work, and their lack of experience behind the wheel may increase the likelihood of accidents. Additionally, as age increases, there is a decrease in the number of accidents. This trend could be due to factors such as drivers becoming more experienced over time or fewer individuals in older age groups engaging in activities that involve driving.
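With `bins=10`, matplotlib picks ten equal-width bins between the youngest and oldest age, so the bar edges do not necessarily fall on round decades. A minimal sketch of explicit decade bins with `pd.cut` is shown below on made-up ages. The edges and labels are choices made for this illustration.

```python
import pandas as pd

# Made-up ages, including one placeholder code for an unknown age
ages_toy = pd.Series([17, 22, 25, 29, 34, 41, 58, 73, 998], name="AGE")

# Drop the placeholder codes the same way as the histogram code above
known_ages = ages_toy[ages_toy <= 97]

edges = list(range(0, 101, 10))                    # 0, 10, ..., 100
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]   # "0-9", "10-19", ...

# right=False makes each bin include its lower edge, so 20 falls in "20-29"
age_groups = pd.cut(known_ages, bins=edges, labels=labels, right=False)
group_counts = age_groups.value_counts().sort_index()
print(group_counts)
```

Labeled bins make a statement like "between 20 and 30" checkable against a single row of `group_counts`.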

### How does driver impairment relate to fatal accidents differently for male and female drivers?

I wanted to explore two other interesting factors which are driver impairment while driving and the sex of the driver. Examining driving impairments is crucial due to the numerous warnings against driving under the influence. Whether emphasized in driver education programs or through road signs, individuals are consistently reminded to refrain from driving when impaired. This highlights the significance of understanding the impact of driving impairments on road safety. Analyzing the distribution of driving impairments among both males and females provides valuable insights into gender-specific driving behaviors. Understanding these patterns can inform policy-making processes aimed at enhancing road safety and addressing gender-specific concerns. By looking at how driver impairments vary between male and female drivers, we can identify patterns and risk factors that contribute to road accidents.

::: callout-caution
The driver impairments are self-reported by the drivers involved in an accident.
:::

Before beginning the analysis of driving impairment and its correlation with the number of fatal accidents, I decided to examine the distribution of both male and female drivers involved in these accidents.


In [None]:
# Merge the accident and drimpair tables on ST_CASE
accident_impair = pd.merge(drimpair_nj_selected, accident_nj_selected, on='ST_CASE', how='inner')

# Merge the New Jersey person table with accident_impair
sex_accident_impair = pd.merge(accident_impair, person_nj_selected[['ST_CASE', 'SEX']], on='ST_CASE', how='inner')

# Filter out rows where SEX is values other than 1 and 2
sex_accident_impair = sex_accident_impair[sex_accident_impair['SEX'].isin([1, 2])]

# Rename values for sex
sex_accident_impair['SEX'] = sex_accident_impair['SEX'].replace({1: 'Male', 2: 'Female'})

# Group by gender and count the number of fatal accidents for each gender
gender_counts = sex_accident_impair['SEX'].value_counts()

# Set colors for males and females
colors = {'Male': 'purple', 'Female': 'orange'}

# Create pie chart
plt.figure(figsize=(8, 6))
plt.pie(gender_counts, labels=['', ''], autopct='%1.1f%%', startangle=140, colors=[colors[gender] for gender in gender_counts.index])
plt.title('Fatal Accidents by Gender')

# Legend labels follow the order of the pie slices
plt.legend(labels=list(gender_counts.index), loc='upper right')
plt.axis('equal')
plt.show()

Male drivers were involved in 67.8% of the fatal accidents and female drivers were involved in 32.2%. This gender disparity in fatal accidents could help policymakers design more targeted driving policies and regulations. I want to further examine sex together with the driving impairments of the drivers who were involved in fatal car accidents.


In [None]:
# Group by driver impairment and sex, then calculate total fatalities
group_sex_impair = sex_accident_impair.groupby(['DRIMPAIRNAME', 'SEX'])['FATALS'].sum().unstack()

# Sort the DataFrame by the sum of fatalities in ascending order
group_sex_impair_sorted = group_sex_impair.sum(axis=1).sort_values().index
group_sex_impair = group_sex_impair.loc[group_sex_impair_sorted]

# Make data frame into plotly
data = []
colors = {'Male': 'rgb(148,0,211)', 'Female': 'rgb(255,140,0)'}
for col in group_sex_impair.columns:
    data.append(go.Bar(name=col, x=group_sex_impair.index, y=group_sex_impair[col], marker=dict(color=colors[col])))

# layout for the plot
layout = go.Layout(
    title='Fatalities by Driver Impairment and Sex',
    xaxis=dict(title='Driver Impairment'),
    yaxis=dict(title='Total Fatalities'),
    barmode='stack'
)

# Create the figure
fig = go.Figure(data=data, layout=layout)

# Show the interactive plot
fig.show()

The graph shows that the largest proportion of fatal accidents were reported by drivers who claimed no impairment or were uncertain about their impairment status. Following closely behind were incidents involving drivers under the influence of alcohol, drugs, or medication, which accounted for a significant number of fatalities. There were fewer instances where drivers reported other physical impairments, physical fatigue, or falling asleep at the wheel. This informs us that it is important to take initiatives when it comes to addressing alcohol and drug related impairment.

::: callout-warning
One important thing to note is that since these are self-reported, there might be potential biases in their response since drivers might not provide accurate information. The number of drivers reporting themselves as normal could be because they don't want to disclose their impairment at the time of the accident.
:::

Each bar represents one driver impairment category and is divided between males and females. Across all reported impairments, males accounted for a higher number of fatal accidents than females. This finding was unexpected: even though males account for more fatal accidents overall, I expected at least one impairment category to have more females than males.
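Since person.csv lists every person in a crash, including passengers, while drimpair.csv has one or more rows per driver, a join on `ST_CASE` alone pairs each impairment record with every person in that crash. A driver-level alternative could match on both `ST_CASE` and `VEH_NO` and keep only rows where `PER_TYP` is 1 (drivers). The sketch below uses made-up rows, and the impairment labels are illustrative. It shows a possible refinement rather than the method used for the charts above.

```python
import pandas as pd

# Made-up rows; column names follow FARS (ST_CASE, VEH_NO, PER_TYP, SEX)
person_toy = pd.DataFrame({
    "ST_CASE": [1, 1, 1, 2],
    "VEH_NO":  [1, 1, 2, 1],
    "PER_TYP": [1, 2, 1, 1],   # 1 = driver, 2 = passenger
    "SEX":     [1, 2, 2, 1],   # 1 = male, 2 = female
})
drimpair_toy = pd.DataFrame({
    "ST_CASE": [1, 1, 2],
    "VEH_NO":  [1, 2, 1],
    "DRIMPAIRNAME": ["None/Apparently Normal", "Asleep or Fatigued",
                     "None/Apparently Normal"],
})

# Keep drivers only, one row per (crash, vehicle)
drivers_toy = person_toy.loc[person_toy["PER_TYP"] == 1, ["ST_CASE", "VEH_NO", "SEX"]]

# Matching on crash AND vehicle pairs each impairment record with its own driver
driver_level = drimpair_toy.merge(drivers_toy, on=["ST_CASE", "VEH_NO"],
                                  how="inner", validate="many_to_one")

# Matching on crash only pairs each impairment record with every person in the crash
case_level = drimpair_toy.merge(person_toy[["ST_CASE", "SEX"]], on="ST_CASE", how="inner")

print(len(driver_level), len(case_level))
```

In the toy data the driver-level join returns one row per impairment record, while the crash-level join inflates crash 1 to six rows (two impairment records times three people).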

# Dashboard

![A Photo By National Centers for Environmental Information](https://www.ncei.noaa.gov/sites/g/files/anmtlf171/files/styles/max_1300x1300/public/sites/default/files/collage-of-weather-phenomena_1200x480.png?itok%3DPEeyMVbC)

There are numerous factors to consider when analyzing the causes of fatal accidents. To explore these factors comprehensively, I developed a dashboard using Dash, a Python framework for building web applications. The dashboard incorporates variables such as weather conditions, day of the week, month, and time of day. These factors are believed to influence traffic patterns and the occurrence of fatal accidents on the road.

[Open Dashboard](http://127.0.0.1:8050/)

::: {.callout-tip collapse="false"}
-   There is a dropdown option on the top of the dashboard where you can choose your desired variable.
-   All graphs are interactive
    -   You can scroll your cursor on the specific areas on the graphs
    -   You can take a look at each specific area of the graph by zooming into it
:::

-   Weather: A higher number of fatal accidents occur when the weather is clear, while the lowest number occurs during freezing rain/drizzle and snow. This surprised me because I expected more accidents in adverse weather conditions, such as rain and snow. One possible reason is that I did not normalize my weather data, so the observed pattern might simply reflect how many clear days there were in 2021 compared with rainy or snowy days. Another reason could be that when the weather is inclement, drivers tend to exercise more caution and drive slower, whereas they may be more reckless in clear weather. Additionally, traffic tends to slow down during bad weather, decreasing the likelihood of fatal accidents.

-   Day of the week: Fatal accidents are more frequent on weekends than on weekdays. Initially, this seemed counterintuitive, as one might expect more traffic during weekdays due to school, college, and work commutes. However, weekends often see an increase in recreational activities, travel, and events, resulting in higher volumes of vehicles on the roads. Another factor to consider is the possibility of social gatherings and parties on weekends, where alcohol consumption may be more prevalent. This could lead to reckless driving behavior and an elevated risk of accidents on the roads.

-   Month: In 2021, October recorded the highest number of fatal accidents, while February had the lowest. This correlation aligns with the findings of the weather pie chart, where snowy weather was associated with fewer fatal accidents. Given that February typically experiences more snowfall, this could explain the lower accident rates during this month. Additionally, February 2021 witnessed record-breaking snowfall, with 36.9 inches recorded in seven northern counties, making it the snowiest February in 126 years. The substantial snowfall likely resulted in reduced road travel and contributed to the decrease in fatal accidents during that month [(Scott Fallon)](https://www.northjersey.com/story/news/environment/2021/03/10/north-jersey-february-snowiest-month-record-2021-winter-storm-nj/6938192002/).

-   Hour of the day: Fatal accidents show an uptick between 8pm and 9pm, with the peak occurring around 9pm. The first evening rush hour, typically between 5pm and 7pm, is when many individuals finish work and traffic density increases, raising the likelihood of accidents. However, there's also a secondary rush later in the night, usually after 8pm, when additional individuals may finish work. During this time, factors such as driver fatigue and reduced visibility as darkness falls may contribute to a higher accident rate than during the earlier rush hour. The combination of increased traffic density and these factors may explain the rise in fatal accidents around 9pm [(Are Car Accidents at Night More Common?)](https://finzfirm.com/blog/are-car-accidents-at-night-more-common/#:~:text=Fatigue%20often%20sets%20in%20during,at%20night%20significantly%20more%20hazardous.). On the flip side, the fewest fatal accidents occur at 10am. This could be because the morning rush hour has diminished by 10am, resulting in fewer cars on the road and therefore fewer possibilities for crashes.
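The weather note above mentions that the counts were not normalized. One way this could be done is to divide the number of fatal crashes in each weather category by the number of days on which that weather occurred. The numbers below are hypothetical and chosen only to show the arithmetic, and `days_observed` is an assumed column that would have to come from an outside weather record.

```python
import pandas as pd

# Hypothetical numbers for illustration only; they are not FARS or weather-service values
weather_toy = pd.DataFrame({
    "WEATHERNAME": ["Clear", "Rain", "Snow"],
    "fatal_crashes": [400, 60, 10],
    "days_observed": [250, 90, 25],   # assumed count of days with each condition
})

# Raw share of crashes, which is what an unnormalized pie chart shows
weather_toy["share_of_crashes"] = (
    weather_toy["fatal_crashes"] / weather_toy["fatal_crashes"].sum()
)

# Crashes per day of exposure to each condition
weather_toy["crashes_per_day"] = (
    weather_toy["fatal_crashes"] / weather_toy["days_observed"]
)
print(weather_toy)
```

Comparing `share_of_crashes` with `crashes_per_day` separates "clear weather is common" from "clear weather is riskier per day", which the raw counts alone cannot do.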

# Future Work

In the future, my primary goal is to develop a predictive model that estimates the probability of a vehicle being involved in a fatal car accident. This approach is interesting because we have gained knowledge about the individual factors contributing to fatal car accidents. By combining these factors, the model aims to offer a comprehensive understanding of accident risk, encouraging drivers to exercise greater caution and make informed decisions on the road.

Furthermore, I plan to include additional variables in the analysis, encompassing aspects such as vehicle safety features like the utilization of airbags and seat belts, as well as road conditions such as the presence of potholes and insufficient lighting. These factors, which were not explored in the current analysis due to constraints such as time or data availability, are essential components in understanding the complex dynamics of road safety. By including them in future analyses, I hope to refine the predictive accuracy of the model and provide deeper insights into the factors influencing accident outcomes.

::: {.callout-tip title="Citations" collapse="true"}
[@Cecchini] [@Clark] [@NHTSA] [@Solano] [@Fallon] [@Finz] [@NCEI] [@github]
:::