# Analysis of Weather Conditions and their Relation to Traffic Accidents in the US

City planners, civil engineers and car manufacturers have worked for decades to make driving a safe, comfortable and accessible way to travel. Safer roads, smarter rules and cutting-edge technologies have been implemented to reduce the frequency of motor vehicle accidents, to great success: according to the IIHS, accidents on the road have dropped tremendously since the 1960s. Aside from poor driving skills and faulty vehicles, one of the greatest hindrances to road safety has been the unpredictable nature of the weather. In the US, roughly forty-two thousand people died in motor vehicle crashes in 2022 (IIHS).

Some would argue that weather is primarily responsible for the number of motor vehicle accidents. Weather phenomena such as hurricanes, heavy rain, icy roads and thick fog can severely impair a person's ability to drive and pose a threat to their own safety and that of the drivers around them. Others state that the weather is merely a small issue and that accidents have many other causes. They argue that factors such as inexperienced drivers, unmaintained roads or sloppy city planning cause just as many accidents on the road, if not more.

We set out to research the correlation between weather and accidents in the US. To achieve this we looked at three datasets: ‘US Accidents (2016 - 2023)’, ‘Traffic Violations in USA’ and ‘Historical Hourly Weather Data 2012-2017’. ‘US Accidents (2016 - 2023)’ (Moosavi et al.) is a detailed record of vehicle accidents in the US between 2016 and 2023, which records the time, place and severity of each accident, among many other variables. ‘Traffic Violations in USA’ (Gutierrez) contains a large amount of information about traffic violations and the accidents caused by them. ‘Historical Hourly Weather Data 2012-2017’ (Beniaguev) records the hourly weather conditions of many large cities across the United States between 2012 and 2017. With these three datasets we can analyze the link between severe weather and car accidents across the US.


## The First Perspective
Driving through heavy rain or other severe weather conditions can be disorienting and can lead to dangerous situations. This is why weather has a lot of influence on the frequency of accidents. Freezing temperatures and bad weather are a danger to traffic. Cities with severe weather are more prone to traffic accidents than cities with milder weather. To combat this issue, more precautions should be taken to prevent accidents caused by weather.

### Accidents per Month
Weather data can seem very sporadic when looked at day by day. One way to overcome this unpredictability is to look at monthly accident data, because the weather follows a general pattern from month to month. For example, the summer months (June, July and August) are hotter and drier than the winter months (December, January and February).
To begin supporting the first perspective, the data should show some difference in the number of accidents per month. Because the weather differs drastically from month to month, this is where the story should begin. The following graph shows the number of accidents per month in the US.

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

# Import the dataset
df = pd.read_csv('../resources/dataset_weather/weather_accidents.csv')

# Filter the df to detroit
df_detroit = df[df['City'] == 'Detroit']

# Drop irrelevant columns
df_detroit = df_detroit.drop(columns=['Severity', 'City'])

# Count the entries per day
accident_count = df_detroit['Date'].value_counts().sort_index()

df_accidents = df.drop(columns=['Severity', 'City'])
df_accidents['Date'] = pd.to_datetime(df_accidents['Date'])
df_accidents['Date'] = df_accidents['Date'].dt.month_name()
total_accidents = df_accidents['Date'].value_counts()

# Merge count with df
df_detroit_count = df_detroit.merge(accident_count.rename('count'), left_on='Date', right_index=True)

# Group dataframe by date
df_detroit_count = df_detroit_count.groupby('Date').mean()
df_detroit_count = df_detroit_count.reset_index()

# Convert the dates to the names of the months
df_detroit_count['Date'] = pd.to_datetime(df_detroit_count['Date'])
df_detroit_count['Date'] = df_detroit_count['Date'].dt.month_name()

# Bin the values so the plot shows clear categories
df_detroit_count['Temp_bins'] = pd.cut(df_detroit_count['Temperature'], [0,275,1000], labels=['Freezing', 'Not Freezing'])
df_detroit_count['Humidity_bins'] = pd.qcut(df_detroit_count['Humidity'], q=3, labels=['Low','Medium','High'])
df_detroit_count['Wind_Speed_bins'] = pd.qcut(df_detroit_count['Wind_Speed_km'], q=3, labels=['Low','Medium','High'])
df_detroit_count['count_bins'] = pd.qcut(df_detroit_count['count'], q=3, labels=['Low','Medium','High'])

fig = px.histogram(df_accidents, x='Date', category_orders=dict(Date=["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]), title='Accidents per month in the US', labels={'Date':'Month', 'count' : 'Accidents'}, height=400)

fig.add_annotation(x=-0.05, y=-0.28,
                   xref="paper", yref="paper",
                   showarrow=False,
                   align='left',
                   xanchor='left', yanchor='bottom',
                   text='The sum of accidents in the US per month over the span of two years')

fig.show()

Notably, there is a clear increase in accidents during August, September, October and November. One possible explanation is that more people are on the road during these months, which could be related to the summer holidays in July and August.
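
One way to check that such a peak is not simply caused by some months being covered by more years of data is to average each month over the years in which it appears. The following is a minimal sketch in plain Python, so it runs without the notebook's data files; the function name `average_accidents_per_month` and the sample dates are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict
from datetime import date


def average_accidents_per_month(dates):
    """Return {month_number: mean accidents per year in which that month occurs}.

    Each calendar month is averaged over the distinct years in which it
    appears, so a month covered by more years is not over-counted.
    A year in which a month has no accidents at all is not counted.
    """
    per_year_month = Counter((d.year, d.month) for d in dates)
    totals = defaultdict(int)
    years_seen = defaultdict(set)
    for (year, month), n in per_year_month.items():
        totals[month] += n
        years_seen[month].add(year)
    return {m: totals[m] / len(years_seen[m]) for m in sorted(totals)}


# Hypothetical sample: August appears in two years, January in one.
sample_dates = [
    date(2016, 8, 1), date(2016, 8, 15),
    date(2017, 8, 3), date(2017, 8, 20),
    date(2017, 1, 5), date(2017, 1, 9), date(2017, 1, 30),
]
monthly_average = average_accidents_per_month(sample_dates)
print(monthly_average)
```

In this made-up sample August has the higher raw total, but January has the higher average per year, which is the kind of difference the raw histogram cannot show.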

### Effect of freezing 

When analysing the box plot below, a noticeable feature is that the accident counts cluster around a few values per data point rather than being spread out evenly. This is because the data is split by whether the temperature was freezing and covers only one city (Detroit), so the counts are much lower than in the previous graph, which combined all cities.

In [37]:
box_plot = px.box(df_detroit_count, x="Temp_bins", y="count", title='Average Accidents when Freezing', labels={'Temp_bins': 'Temperature condition', 'count': 'Number of Accidents'}, height=600)
box_plot.add_annotation(x=-0.05, y=-0.14,
                   xref="paper", yref="paper",
                   showarrow=False,
                   align='left',
                   xanchor='left', yanchor='bottom',
                   text='Average number of accidents when freezing or not freezing in the city of Detroit')

box_plot.show()
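
The `Temp_bins` column splits the days at 275 K, slightly above the freezing point of water (273.15 K). As a rough sketch of the same comparison in plain Python, the helper below averages accident counts on either side of a configurable threshold. The function name, the `threshold_k` parameter and the sample values are illustrative assumptions, not part of the notebook or the dataset.

```python
from statistics import mean

FREEZING_POINT_K = 273.15  # freezing point of water in kelvin


def mean_accidents_by_freezing(records, threshold_k=FREEZING_POINT_K):
    """Split (temperature_k, accidents) pairs at threshold_k and average each group.

    Temperatures at or below the threshold count as freezing, matching the
    right-inclusive bins that pd.cut produces by default.
    """
    freezing = [acc for temp, acc in records if temp <= threshold_k]
    not_freezing = [acc for temp, acc in records if temp > threshold_k]
    return {
        "Freezing": mean(freezing) if freezing else None,
        "Not Freezing": mean(not_freezing) if not_freezing else None,
    }


# Hypothetical daily records: (mean temperature in K, number of accidents)
sample_days = [(268.0, 4), (272.5, 6), (274.0, 3), (290.0, 2), (295.5, 1)]
print(mean_accidents_by_freezing(sample_days))
print(mean_accidents_by_freezing(sample_days, threshold_k=275))
```

Running the comparison with both thresholds shows how sensitive the group averages are to where the boundary is placed.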

### Number of accidents and Weather scores
The two bubble maps are meant to put the correlation between our weather score and the number of accidents in a city into perspective. In the final version we want to make this one big plot instead of two smaller ones, but we ran out of time. We will also look into other variables that might affect the number of accidents (such as the total amount of roads in an area).

In [3]:
import pandas as pd
import plotly.graph_objects as go

# Import two files: one with the weather score for each state and one with accidents per city with coordinates
pf = pd.read_csv('../resources/Code_website/W_PlotData.csv')
df = pd.read_csv('../resources/Code_website/A_AccidentsNewData.csv')

# This code determines the size of the circles
groottes = [(0, 5), (5, 20), (20, 50), (50, 250), (250, 1000)]
scale = 0.5

#Dates we show
dates = ['2016-01', '2016-04', '2016-07', '2016-10', '2017-01', '2017-04', '2017-07', '2017-10']
fig = go.Figure()

# Loop for the dates that we showcase
for date in dates:
    avg_score_per_state = pf[pf['Date'] == date].groupby('State')['score'].mean().reset_index()
    # The base code for the color of the states that represent weather score
    fig.add_trace(go.Choropleth(
        visible=(date == dates[3]),
        locations=avg_score_per_state['State'],
        z=avg_score_per_state['score'],
        locationmode='USA-states',
        colorscale='PuRd',
        zmin=0.7,
        zmax=1.6,
        colorbar=dict(
            title='Average Weather Score for each state:',
            x=1,
            y=0.3,
            len=0.6,
            thickness=15,
            
        ),
        showscale=True,
        name=f'Weather Score {date}'
    ))

# The base code for the bubbles that represent cities
for lim in groottes:
    scatter_traces = [] 
    for date in dates:
        df_sub = df[(df['Accidents'] >= lim[0]) & (df['Accidents'] < lim[1]) & (df['Date'] == date)]
        scatter_traces.append(go.Scattergeo(
            visible=(date == dates[3]),
            locationmode='USA-states',
            lon=df_sub['lng'],
            lat=df_sub['lat'],
            text=df_sub['City'] + ': ' + df_sub['Accidents'].astype(str) + ' Accidents',
            mode='markers',
            marker=dict(
                size=df_sub['Accidents'] * scale,
                color='#3182bd',
                line_color='rgb(40,40,40)',
                line_width=0.5,
                sizemode='area'
            ),
            name=f'{lim[0]} - {lim[1]} Accidents {date}'
        ))
    fig.add_traces(scatter_traces)

#The code for the slider that selects the date
steps = []
for i, date in enumerate(dates):
    step = dict(
        method='update',
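        # One flag per trace: the choropleths come first, followed by one block of scatter traces per size class
        args=[{'visible': [date == d for d in dates] * (1 + len(groottes))}],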
        label=date
    )
    steps.append(step)

sliders = [dict(
    active=3,
    pad={'t': 20, 'l': 20, 'b':20},
    steps=steps
)]

# Rearrange the layout
fig.update_layout(
    sliders=sliders,
    title='Total of accidents for each city in the USA for 2016 and 2017',
    title_x=0.5,
    title_y=0.95,
    showlegend=True,
    legend_title_text='Amount of Accidents:',
    geo=dict(scope='usa', landcolor='rgb(217, 217, 217)'),
    geo_center=dict(lon=-105, lat=38),
    margin=dict(l=0, r=0, t=50, b=0),
    coloraxis_colorbar=dict(lenmode='fraction', len=0.5),
    dragmode=False,
)

# Buttons to zoom in and out
zoom_buttons = [
    dict(args=[{"geo.projection.scale": 1, "geo.center.lon": -105, "geo.center.lat": 38}], label='Back', method="relayout"),
    dict(args=[{"geo.projection.scale": 1.5, "geo.center.lon": -117.5, "geo.center.lat": 44}], label='Zoom Left Top', method="relayout"),
    dict(args=[{"geo.projection.scale": 1.5, "geo.center.lon": -117.5, "geo.center.lat": 31.5}], label='Zoom Left Bottom', method="relayout"),
    dict(args=[{"geo.projection.scale": 1.5, "geo.center.lon": -92.5, "geo.center.lat": 44}], label='Zoom Right Top', method="relayout"),
    dict(args=[{"geo.projection.scale": 1.5, "geo.center.lon": -92.5, "geo.center.lat": 31.5}], label='Zoom Right Bottom', method="relayout")
]

fig.update_layout(
    updatemenus=[dict(type="buttons", showactive=False, buttons=zoom_buttons, x=0.2, y=0.85)],
)

fig.show(config={'scrollZoom': False})

This visualization combines elements of a choropleth and a bubble plot, featuring a time slider and zoom controls. It aims to illustrate variations in accident frequency across cities throughout the four seasons of 2016 and 2017. However, the data shows significant disparities in total accidents between seasons, suggesting possible gaps in the dataset or substantial seasonal differences in accident occurrence. Additionally, the weather score does not appear to consistently influence accident numbers. The map colors fluctuate frequently, which is particularly noticeable in states like Louisiana (on the right side), where accident rates remain high regardless of weather scores. These factors can potentially lead to misleading interpretations of the data.
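
The suspicion of gaps in the dataset can be checked by counting records per month and flagging the months whose total falls far below the typical level. The following is a rough sketch, assuming a list of 'YYYY-MM' strings with one entry per accident; the function name, the cut-off of 50% of the median and the sample counts are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def flag_sparse_months(year_months, ratio=0.5):
    """Return the months whose record count is below ratio * median monthly count.

    Months with no records at all never appear in the input, so they have to
    be checked separately against the expected date range.
    """
    counts = Counter(year_months)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(m for m, n in counts.items() if n < ratio * typical)


# Hypothetical counts: April has far fewer records than the other months.
sample = ["2016-01"] * 10 + ["2016-04"] * 2 + ["2016-07"] * 12 + ["2016-10"] * 11
sparse = flag_sparse_months(sample)
print(sparse)
```

A flagged month is not proof of missing data, since a real seasonal dip would look the same, but it marks where the source dataset deserves a closer look.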

In [4]:
import pandas as pd
from plotnine import ggplot, aes, geom_point, facet_wrap, labs, theme_light, theme, scale_x_continuous, ggtitle, geom_smooth
import ipywidgets as widgets
from IPython.display import display, clear_output
from ipywidgets import interactive_output

# Data containing all the info
data = pd.read_csv('../resources/Code_website/F_PlotData.csv').dropna(subset=['Average_Severity', 'Weather_Score', 'Region'])
data['Date'] = pd.to_datetime(data['Date'])

# Dates
dates = ['2016-04', '2016-07', '2016-10', '2017-01', '2017-04', '2017-07', '2017-10']

# Dropdown widget
date_dropdown = widgets.Dropdown(
    options=dates,
    value='2017-01',
    description='Date:'
)
# Plot function
def update_figure(selected_date):
    filtered_data = data[data['Date'].dt.to_period('M') == selected_date]
    
    p = ggplot(filtered_data, aes(x='Average_Severity', y='Weather_Score', color='Region')) + \
        geom_point() + \
        geom_smooth(method='lm', se=False, color='black') + \
        facet_wrap('~Region', ncol=3) + \
        labs(x="Average Severity", y="Weather Score") + \
        ggtitle('The impact of weather on the severity of traffic accidents, in the regions of the US, during a specific timeframe') + \
        theme_light() + \
        theme(legend_position='none') + \
        theme(panel_spacing_y=0, panel_spacing_x=0.03) + \
        scale_x_continuous(limits=(2, 4), breaks=range(1, 5)) + \
        theme(figure_size=(13, 6)) 
    
    #Clear, so new data gets loaded
    clear_output(wait=True)
    display(date_dropdown)
    display(p)

# Interactive dropdown with dates
interactive_plot = widgets.interactive_output(update_figure, {'selected_date': date_dropdown})
display(interactive_plot)



This facetgrid illustrates how weather scores impact accident severity across various US regions, each characterized by distinct weather patterns. Each subplot represents a different region, showcasing the relationship between weather scores and accident severity. In each subplot, regression lines are included to assess correlation, yet none of these lines show a discernible pattern, suggesting no clear relationship between weather conditions and accident severity.
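
The visual impression of the regression lines can be backed up with a number per region. The sketch below computes the Pearson correlation coefficient between severity and weather score for each region in plain Python. The helper names and the sample rows are hypothetical, and the values are not taken from `F_PlotData.csv`.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns None if it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_by_region(rows):
    """rows: iterable of (region, average_severity, weather_score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for region, severity, score in rows:
        grouped[region][0].append(severity)
        grouped[region][1].append(score)
    return {region: pearson_r(xs, ys) for region, (xs, ys) in grouped.items()}


# Hypothetical values for two regions
sample_rows = [
    ("Northeast", 2.0, 1.0), ("Northeast", 2.5, 1.5), ("Northeast", 3.0, 2.0),
    ("South", 2.0, 1.2), ("South", 3.0, 1.2),
]
result = correlation_by_region(sample_rows)
print(result)
```

Values close to zero for every region would support the reading that there is no clear linear relationship, while a value near 1 or -1 would contradict it.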

## The Second Perspective
Accidents are not caused by the weather, but rather by other factors such as the city, the state and the road conditions. Weather can contribute, but it is a minor factor, as most vehicles are built to withstand most weather events. Urban infrastructure, regional traffic laws and the maintenance of roadways play significant roles in accident occurrence. Therefore, addressing these factors is crucial for improving road safety and reducing accidents.

### Surrounding conditions
In the following graph you can see that most accidents take place with no relevant infrastructure, such as traffic stops or traffic lights, nearby. It is notable, however, that apart from the no-infrastructure category, most accidents take place at junctions or traffic signals. These are also the places that are most susceptible to human error. This suggests that the place where a traffic accident happens is relevant to its cause.

In [5]:
import plotly.express as px
import pandas as pd


df = pd.read_csv('../resources/us_accidents_filtered_rows.csv')

infrastructures = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 
                   'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 
                   'Traffic_Signal', 'Turning_Loop']

# Check if all infrastructure columns are False for each row
def all_false(row):
    return all(not row[infra] for infra in infrastructures)

# Apply the function row-wise to create a boolean mask
all_false_mask = df.apply(all_false, axis=1)

# Count rows where all infrastructure columns are False
no_infrastructure_count = all_false_mask.sum()

# Count occurrences of True values for each infrastructure type
counts = [df[infra].sum() for infra in infrastructures]

# If there are rows with all False values, add a category for "No Infrastructure Near Accidents"
if no_infrastructure_count > 0:
    infrastructures.append('No Infrastructure')
    counts.append(no_infrastructure_count)

# Create lists for labels and values
labels = infrastructures
values = counts

# Calculate total number of accidents
total_accidents = sum(values)

# Calculate percentages
percentages = [count / total_accidents * 100 for count in counts]

threshold = 1.3
combined_labels = []
combined_values = []

for label, value, percent in zip(labels, values, percentages):
    if percent < threshold:
        combined_labels.append('Other')
        combined_values.append(value)
    else:
        combined_labels.append(label)
        combined_values.append(value)

fig = px.pie(names=combined_labels, values=combined_values, title='Distribution of Traffic Accidents by Infrastructure Proximity',
             hole=0.8,
             labels={'label': 'percent'},
             width=800, height=500)


fig.update_traces(textposition='outside', textinfo='label+percent')
fig.update_layout(showlegend=False)

fig.show()
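
One thing to keep in mind when reading the chart is that a single accident can be near several infrastructure features, so the slices are shares of all recorded flags rather than shares of accidents. The sketch below computes the share of accidents near each feature instead, and merges small categories into 'Other' in the same way as the cell above. The function name and the sample rows are hypothetical; only the 1.3 threshold and the column names are taken from the notebook.

```python
def infrastructure_shares(rows, features, threshold=1.3):
    """Percentage of accidents near each feature, small slices merged into 'Other'.

    rows: list of dicts mapping feature name -> bool.
    Shares are relative to the number of accidents, so they can add up to more
    than 100% when one accident is near several features.
    """
    total = len(rows)
    if total == 0:
        return {}
    shares = {f: 100 * sum(bool(r[f]) for r in rows) / total for f in features}
    shares["No Infrastructure"] = (
        100 * sum(not any(r[f] for f in features) for r in rows) / total
    )
    merged = {}
    for name, pct in shares.items():
        key = name if pct >= threshold else "Other"
        merged[key] = merged.get(key, 0.0) + pct
    return merged


# Hypothetical accidents; the second one is near two features at once.
features = ["Junction", "Traffic_Signal", "Bump"]
sample_rows = [
    {"Junction": True, "Traffic_Signal": False, "Bump": False},
    {"Junction": True, "Traffic_Signal": True, "Bump": False},
    {"Junction": False, "Traffic_Signal": False, "Bump": False},
    {"Junction": False, "Traffic_Signal": False, "Bump": False},
]
shares = infrastructure_shares(sample_rows, features)
print(shares)
```

In this sample the shares add up to 125%, because one accident is counted for both the junction and the traffic signal.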



## Traffic violations
Another variable that could influence the number of traffic accidents is traffic violations. Inexperienced drivers or drivers under the influence could account for the majority of traffic accidents.
The plot below shows the number of traffic violations alongside the weather score. The number of violations is shown on the left y-axis and the weather score on the right y-axis. The data is grouped by month.

### The Drivers

In [52]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv('../resources/traffic_violations_filtered.csv')
weather_df = pd.read_csv('../resources/dataset_weather/weather_accidents.csv')

# Convert 'Date Of Stop' to datetime
df['Date Of Stop'] = pd.to_datetime(df['Date Of Stop'])
weather_df['Date'] = pd.to_datetime(weather_df['Date'], errors='coerce')

# Extract year and month from 'Date Of Stop'
df['YearMonth'] = df['Date Of Stop'].dt.to_period('M')
weather_df['YearMonth'] = weather_df['Date'].dt.to_period('M')

# Create binary columns
df['Contributed To Accident'] = df['Contributed To Accident'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Personal Injury'] = df['Personal Injury'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Belts'] = df['Belts'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Alcohol'] = df['Alcohol'].apply(lambda x: 1 if x == 'Yes' else 0)

# Group by YearMonth and sum the violations
monthly_data = df.groupby('YearMonth')[['Contributed To Accident', 'Personal Injury', 'Belts', 'Alcohol']].sum().reset_index()

# Aggregate weather scores by month
monthly_weather_scores = weather_df.groupby('YearMonth')['weather_score'].mean().reset_index()

# Merge monthly_data and monthly_weather_scores on YearMonth
monthly_merged = pd.merge(monthly_data, monthly_weather_scores, on='YearMonth', how='left')

monthly_merged['YearMonth'] = monthly_merged['YearMonth'].astype(str)

# Plotting the data
fig = px.line(monthly_merged, x='YearMonth', y=['Contributed To Accident', 'Personal Injury', 'Belts', 'Alcohol'], 
              labels={'value':'Number of Violations', 'YearMonth':'Date (Month)'},
              title='Monthly Violations and Average Weather Score')

# Adding weather score data as a line plot with a separate y-axis
fig.add_trace(go.Scatter(x=monthly_merged['YearMonth'].astype(str), y=monthly_merged['weather_score'], name='Weather Score', 
                         mode='lines', yaxis='y2'))

# Update layout for dual y-axes
fig.update_layout(
    xaxis_title='Date (Month)',
    yaxis_title='Number of Violations',
    yaxis2=dict(
        title='Weather Score',
        overlaying='y',
        side='right'
    ),
    legend=dict(
        x=1.1,
        y=1
    )
)

fig.show()
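
Because the two y-axes use different scales, the apparent closeness of the lines depends on how each axis happens to be drawn. One possible alternative, sketched here under the assumption that only the shape of each series matters, is to rescale every series to the range 0 to 1 before comparing them. The function name and the monthly values are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers to the 0-1 range; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly values
violations = [120, 150, 90, 180]
weather_score = [0.9, 1.2, 0.8, 1.5]

scaled_violations = min_max_scale(violations)
scaled_weather = min_max_scale(weather_score)
print(scaled_violations)
print(scaled_weather)
```

After scaling, both series can share a single axis, at the cost of no longer showing the absolute number of violations.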


## Sources
Fatality facts 2022: Yearly snapshot. IIHS. (2024, June). https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot 

Moosavi, S. (2023, May 28). US accidents (2016 - 2023). Kaggle. https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents 

Beniaguev, D. (2017, December 28). Historical hourly weather data 2012-2017. Kaggle. https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data?select=wind_speed.csv 

Gutierrez, F. (2017, October 31). Traffic violations in USA. Kaggle. https://www.kaggle.com/datasets/felix4guti/traffic-violations-in-usa 