# Analysis of weather phenomena and traffic accidents in the US

## Introduction
City planners, civil engineers and car manufacturers have worked tirelessly for decades to make driving a safe, comfortable and accessible option for travel. Safer roads, smarter rules and cutting-edge technologies have been implemented to reduce the frequency of motor vehicle accidents, to great success: accidents on the road have dropped tremendously since the 1960s (https://www.iihs.org/topics/fatality-statistics/detail/yearly-snapshot). Even so, roughly forty-two thousand people died in motor vehicle crashes in the US in 2022. Aside from individuals with poor driving skills and faulty vehicles, one of the greatest hindrances to motor safety has been the chaotic nature of the weather.

Some would argue that weather is primarily responsible for the number of motor vehicle accidents. Weather phenomena such as hurricanes, heavy rain, icy roads and thick fog can severely impair a person's driving and threaten their own safety and that of other drivers around them. Others state that the weather is merely a minor issue and that accidents have many other causes. They argue that factors such as inexperienced drivers, unmaintained roads or sloppy city planning cause just as many accidents, if not more.

We set out to research the correlation between weather and accidents in the US. To achieve this we looked at two datasets: 'US Accidents (2016 - 2023)' and 'Historical Hourly Weather Data 2012-2017'. 'US Accidents (2016 - 2023)' is a detailed record of vehicle accidents in the US between 2016 and 2023, covering the time, place and severity of each accident, among many other variables. 'Historical Hourly Weather Data 2012-2017' records the hourly weather in many large cities across the United States between 2012 and 2017. With these two datasets we can analyze the link between severe weather and car accidents across the US.

## Dataset and Preprocessing

The first dataset used in this data story is US Accidents (2016-2023), found on Kaggle. It contains traffic accidents in the US from February 2016 to March 2023. The datapoints are collected through multiple APIs that record and report traffic accidents. These APIs broadcast traffic details captured by various entities, such as US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset contains 7,728,394 records and 46 variables. These describe where and when the accidents took place, whether any infrastructure was nearby, and the severity of the accident. <br>
For this data story only the USA will be researched, so we removed all states and cities that were not in the USA. We also removed all variables that were not used, such as ID, Source and Timezone. To find overlap between this dataset and the dataset described in the next paragraph, only the years 2016 and 2017 were used, as these are the years both datasets cover.
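
The filtering described above could look roughly like the sketch below. It is a minimal illustration and not the code used for this story: `filter_accidents` is a hypothetical helper, the column names `ID`, `Source`, `Timezone` and `Start_Time` are taken from the text and the later cells, and the small in-memory frame with made-up rows stands in for the real CSV.

```python
import pandas as pd

def filter_accidents(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only 2016-2017 accidents and drop columns that are not analysed."""
    out = df.copy()
    out["Start_Time"] = pd.to_datetime(out["Start_Time"])
    # Only 2016 and 2017 overlap with the weather dataset
    in_overlap = out["Start_Time"].dt.year.isin([2016, 2017])
    # errors="ignore" skips columns that are already absent
    return out.loc[in_overlap].drop(columns=["ID", "Source", "Timezone"], errors="ignore")

# Made-up stand-in rows; the real data would come from pd.read_csv on the Kaggle file
sample = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3"],
    "Source": ["Source1", "Source2", "Source1"],
    "Timezone": ["US/Eastern", "US/Pacific", "US/Central"],
    "Start_Time": ["2016-02-08 05:46:00", "2019-06-01 12:00:00", "2017-12-31 23:10:00"],
    "City": ["Dayton", "Los Angeles", "Chicago"],
    "Severity": [3, 2, 2],
})
filtered = filter_accidents(sample)
print(filtered)
```

Working on a copy keeps the original frame untouched, which makes it easy to rerun the filter with a different year range.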

The second dataset used in this data story is Historical Hourly Weather Data 2012-2017, also found on Kaggle. It contains weather data for various US cities as well as some Canadian and Israeli cities. The measurements are hourly and cover variables such as temperature, humidity, air pressure and wind speed. There are roughly 45,300 records in the dataset.
The only country that will be researched is the US, so all records of the other countries have been removed. Each variable is stored in its own CSV file, which made it more difficult to find correlations between the two datasets. To overcome this obstacle, all the CSV files were melted and filtered on year, and the results were then merged into one CSV file.
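
The melt, filter and merge steps can be sketched as follows. This is an illustration under assumptions rather than the exact preprocessing script: `melt_variable` is a hypothetical helper, the wide layout (one `datetime` column plus one column per city) is assumed from the melt call in the freezing cell, and the two small frames with made-up values stand in for the per-variable CSV files.

```python
import pandas as pd

def melt_variable(wide: pd.DataFrame, name: str) -> pd.DataFrame:
    """Turn one wide table (a column per city) into long rows of datetime, city, value."""
    tidy = wide.melt(id_vars="datetime", var_name="city", value_name=name)
    tidy["datetime"] = pd.to_datetime(tidy["datetime"])
    # Keep the years that overlap with the accident data
    return tidy[tidy["datetime"].dt.year.isin([2016, 2017])]

# Made-up stand-ins for two of the per-variable CSV files
temperature = pd.DataFrame({
    "datetime": ["2015-12-31 23:00:00", "2016-01-01 00:00:00"],
    "Chicago": [268.1, 267.5],
    "Miami": [297.0, 296.4],
})
humidity = pd.DataFrame({
    "datetime": ["2015-12-31 23:00:00", "2016-01-01 00:00:00"],
    "Chicago": [80.0, 82.0],
    "Miami": [70.0, 71.0],
})

# One row per (datetime, city) with a column for every variable
weather = melt_variable(temperature, "temperature").merge(
    melt_variable(humidity, "humidity"), on=["datetime", "city"], how="outer"
)
print(weather)
```

An outer merge keeps hours for which only one of the variables was measured, at the cost of missing values in the other column.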

## The First Perspective
Driving through heavy rain or severe weather conditions can be disorienting and can lead to dangerous situations. This is why weather has a lot of influence on the frequency of accidents. Freezing temperatures and bad weather are a danger to traffic. Cities with severe weather are more prone to traffic accidents than cities with milder weather. To combat this issue, more precautions should be taken to prevent accidents caused by weather.

### Accidents per day
The following graph shows the number of accidents per day. There is a clear increase in accidents during workdays: when zooming in, a pattern appears of five days with more accidents followed by two days (the weekend) with fewer accidents. It is also noteworthy that there is a slight increase in traffic accidents around January and October. This could be because of freezing temperatures and holidays respectively. This is worth further investigation and will be discussed later on.

In [1]:
import plotly.express as px
import pandas as pd
accidents = pd.read_csv('../resources/us_accidents.csv')

accidents['Start_Time'] = pd.to_datetime(accidents['Start_Time'])
date_counts = accidents['Start_Time'].dt.date.value_counts().sort_index()

# Create a bar chart with Plotly
fig = px.bar(
    x=date_counts.index,
    y=date_counts.values,
    labels={'x': 'Date', 'y': 'Number of Accidents'},
    title='Number of Accidents per Day',
    color_discrete_sequence= ['red']
)

# Update the layout to improve appearance
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Number of Accidents',
    xaxis_tickangle=-90
)

# Show the plot
fig.show()
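
The workday and weekend pattern can also be checked numerically. The sketch below is only an illustration: the daily counts are made-up stand-in values, and in the notebook the series would be `date_counts` with its index converted through `pd.to_datetime`.

```python
import pandas as pd

# Made-up stand-in daily counts; 2016-02-08 is a Monday
daily = pd.Series(
    [120, 130, 125, 128, 135, 60, 55, 118, 131],
    index=pd.date_range("2016-02-08", periods=9, freq="D"),
)

# Average number of accidents for each day of the week
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
per_weekday = daily.groupby(daily.index.day_name()).mean().reindex(weekday_order)

# Compare the workday average with the weekend average (dayofweek 5 and 6)
is_weekend = daily.index.dayofweek >= 5
workday_mean = daily[~is_weekend].mean()
weekend_mean = daily[is_weekend].mean()

print(per_weekday)
print(f"workdays: {workday_mean:.1f}, weekend: {weekend_mean:.1f}")
```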

### Number of accidents and Weather scores
The two bubble maps are meant to put the correlation between our weather score and the number of accidents in a city into perspective. In the final version we want to combine them into one large plot instead of two smaller ones, but we ran out of time. We will also look into other variables that might affect the number of accidents (such as the total amount of roads in an area).

In [2]:
# import pandas as pd
# import plotly.graph_objects as go

# df = pd.read_csv('../resources/BubbleMapData.csv')
# groottes = [(0, 5), (5, 50), (50, 500), (500, 5000), (5000, 500000)]
# colors = ["Black", "Yellow", "Red", "Green", "Blue"]

# scale = 0.01

# fig = go.Figure()

# for i in range(len(groottes)):
#     lim = groottes[i]
#     df_sub = df[(df['ongelukken'] >= lim[0]) & (df['ongelukken'] < lim[1])]
#     fig.add_trace(go.Scattergeo(
#         locationmode='USA-states',
#         lon=df_sub['lng'],
#         lat=df_sub['lat'],
#         text=df_sub['City'] + ': ' + df_sub['ongelukken'].astype(str) + ' accidents',
#         mode='markers',
#         marker=dict(
#             size=df_sub['ongelukken'] * scale,
#             color=colors[i],
#             line_color='rgb(40,40,40)',
#             line_width=0.5,
#             sizemode='area'
#         ),
#         name=f'{lim[0]} - {lim[1]} accidents'
#     ))

# fig.update_layout(
#     title_text='Number of accidents in the US for each city',
#     showlegend=True,
#     geo=dict(
#         scope='usa',
#         landcolor='rgb(217, 217, 217)',
#     )
# )

# fig.show()

In [3]:
# import pandas as pd
# import plotly.express as px
# import plotly.graph_objects as go

# df = pd.read_csv('../resources/dataset_weather/WeatherBubbleData.csv')

# grootte = [(0,0.30),(0.30,0.35),(0.35,0.40),(0.40,0.45),(0.45,0.50),(0.50,1)]
# color_scale = px.colors.diverging.Tealrose

# def assign_size(score):
#     for i, (min_range, max_range) in enumerate(grootte):
#         if min_range <= score < max_range:
#             return (i + 1) * 5 
#     return 5 


# hover_text = []
# for index, row in df.iterrows():
#     hover_text.append(f"City: {row['city']}<br>Score: {row['score']}")


# fig = go.Figure()

# fig.add_trace(go.Scattergeo(
#     locationmode='USA-states',
#     lon=df['lng'],
#     lat=df['lat'],
#     mode='markers',
#     marker=dict(
#         size=[assign_size(score) for score in df['score']],
#         color=df['score'],
#         colorscale=color_scale,  
#         colorbar=dict(title='Weather Score'),  
#         line_color='rgb(40, 40, 40)',
#         line_width=0.5,
#         sizemode='diameter', 
#         opacity=0.8
#     ),
#     text=hover_text,  
#     hoverinfo='text',  
# ))

# fig.update_layout(
#     title='Weather scores for the big American cities',
#     geo=dict(
#         scope='usa',
#         landcolor='rgb(217, 217, 217)',
#     )
# )


# fig.show()
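
Besides comparing the two maps by eye, the relationship can be summarised in a single number. The sketch below computes a Pearson correlation coefficient between the accident count and the weather score per city. It reuses the column names from the two cells above (`City`, `ongelukken`, `city`, `score`), but the rows are made-up illustration values, so the printed coefficient says nothing about the real data.

```python
import pandas as pd

# Made-up stand-in rows using the column names from the two cells above
accidents_per_city = pd.DataFrame({
    "City": ["Chicago", "Miami", "Denver", "Seattle"],
    "ongelukken": [5200, 8100, 3100, 2700],
})
weather_scores = pd.DataFrame({
    "city": ["Chicago", "Miami", "Denver", "Seattle"],
    "score": [0.47, 0.31, 0.44, 0.38],
})

# Join on city name, then compute the Pearson correlation coefficient
combined = accidents_per_city.merge(weather_scores, left_on="City", right_on="city")
r = combined["ongelukken"].corr(combined["score"])
print(f"Pearson r = {r:.2f}")
```

A coefficient close to 0 would support the second perspective, while a value close to 1 would support the first. With only a few dozen cities the estimate remains uncertain either way.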

### Effect of freezing 

A noticeable feature of this graph is that the accidents are clustered on a limited number of dates rather than spread out evenly over the year. This is because the graphed dataset only contains days with freezing temperatures (below 270 K, roughly -3 °C, in the code below). The datapoints also only cover the cities that appear in the weather dataset, so the counts are lower than in the earlier graph with all cities combined.

In [4]:
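# Load the hourly temperatures (one column per city) and the accident records
temp = pd.read_csv('../resources/dataset_weather/us_temperature.csv')
accidents = pd.read_csv('../resources/us_accidents.csv')

# Reshape from wide (one column per city) to long (one row per datetime and city)
melted_df = pd.melt(temp, id_vars=['datetime'], var_name='City', value_name='Temperature')

# Keep only hours colder than 270 (assumed to be kelvin, about -3 °C) and reduce timestamps to dates
melted_df = melted_df[melted_df['Temperature'] < 270].copy()
melted_df['datetime'] = pd.to_datetime(melted_df['datetime']).dt.date
accidents['Start_Time'] = pd.to_datetime(accidents['Start_Time']).dt.date

# Keep one row per freezing day and city, so an accident is not counted once per freezing hour
melted_df = melted_df.drop_duplicates(subset=['datetime', 'City'])

# Match accidents to freezing days on the same date and in the same city
merged_data = pd.merge(melted_df, accidents, left_on=['datetime', 'City'], right_on=['Start_Time', 'City'], how='inner')

# Count the matched accidents per day
date_counts = merged_data['datetime'].value_counts().sort_index()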

# Create a bar chart with Plotly
fig = px.bar(
    x=date_counts.index,
    y=date_counts.values,
    labels={'x': 'Date', 'y': 'Number of Accidents'},
    title='Number of Accidents per Day when Freezing'
)

# Update the layout to improve appearance
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Number of Accidents',
    xaxis_tickangle=-90
)

# Show the plot
fig.show()
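
A graph of freezing days alone cannot show whether freezing raises the number of accidents. For that, the freezing days need to be compared with the other days in the same cities. The sketch below shows one possible way to do this. The frames `daily_min` and `daily_accidents` and all values in them are made-up stand-ins, and the threshold of 270 is assumed to be in kelvin, as in the cell above.

```python
import pandas as pd

# Made-up daily minimum temperatures in kelvin for one city
daily_min = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-05", "2017-01-06", "2017-01-07", "2017-01-08"]),
    "City": ["Chicago"] * 4,
    "min_temp": [265.0, 268.5, 275.0, 280.2],
})
# Made-up accident counts for the same city and days
daily_accidents = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-05", "2017-01-06", "2017-01-07", "2017-01-08"]),
    "City": ["Chicago"] * 4,
    "accidents": [14, 11, 9, 8],
})

# A left merge keeps days without any recorded accident; those count as 0
days = daily_min.merge(daily_accidents, on=["date", "City"], how="left").fillna({"accidents": 0})
days["condition"] = days["min_temp"].lt(270).map({True: "freezing", False: "not freezing"})

# Mean accidents per day, split by whether the day was freezing
mean_by_condition = days.groupby("condition")["accidents"].mean()
print(mean_by_condition)
```

Comparing means per day rather than totals matters here, because there are far fewer freezing days than non-freezing days in most cities.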





## The Second Perspective
Accidents are not caused by the weather, but rather by other factors such as the city, the state and the road conditions. Weather can contribute, but it is a minor factor, as most vehicles are built to withstand most weather events. Urban infrastructure, regional traffic laws, and the maintenance of roadways play significant roles in accident occurrence. Therefore, addressing these factors is crucial for improving road safety and reducing accidents.

### Surrounding conditions
The following graph shows that most accidents take place with no relevant infrastructure, such as stop signs or traffic lights, nearby. It is notable, however, that apart from the "No Infrastructure" category, most accidents take place at junctions or traffic signals. These are also the places that are most susceptible to human error. From this graph we can conclude that the location of a traffic accident is relevant to its cause.

In [5]:
import plotly.express as px
import pandas as pd


df = pd.read_csv('../resources/us_accidents_filtered_rows.csv')

infrastructures = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 
                   'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 
                   'Traffic_Signal', 'Turning_Loop']

# Boolean mask of rows where no infrastructure column is True
# (vectorised, which avoids a slow row-wise apply on a large dataframe)
all_false_mask = ~df[infrastructures].any(axis=1)

# Count rows where all infrastructure columns are False
no_infrastructure_count = all_false_mask.sum()

# Count occurrences of True values for each infrastructure type
counts = [df[infra].sum() for infra in infrastructures]

# If there are rows with all False values, add a category for "No Infrastructure Near Accidents"
if no_infrastructure_count > 0:
    infrastructures.append('No Infrastructure')
    counts.append(no_infrastructure_count)

# Create lists for labels and values
labels = infrastructures
values = counts

# Calculate total number of accidents
total_accidents = sum(values)

# Calculate percentages
percentages = [count / total_accidents * 100 for count in counts]

threshold = 1.3
combined_labels = []
combined_values = []

for label, value, percent in zip(labels, values, percentages):
    if percent < threshold:
        combined_labels.append('Other')
        combined_values.append(value)
    else:
        combined_labels.append(label)
        combined_values.append(value)

fig = px.pie(names=combined_labels, values=combined_values, title='Distribution of Traffic Accidents by Infrastructure Proximity',
             hole=0.8,
             labels={'label': 'percent'},
             width=800, height=500)


fig.update_traces(textposition='outside', textinfo='label+percent')
fig.update_layout(showlegend=False)

fig.show()



### The Drivers

In [6]:
df_violations = pd.read_csv('../resources/traffic_violations_filtered.csv')
df_accidents = pd.read_csv('../resources/us_accidents.csv')

# Convert the date columns to datetime
df_accidents['date'] = pd.to_datetime(df_accidents['Start_Time'])
df_violations['date'] = pd.to_datetime(df_violations['Date Of Stop'])

# Extract year and month, and aggregate by month
df_accidents['month'] = df_accidents['date'].dt.to_period('M')
df_violations['month'] = df_violations['date'].dt.to_period('M')

accidents_per_month = df_accidents.groupby('month').size().reset_index(name='traffic accidents')
violations_per_month = df_violations.groupby('month').size().reset_index(name='traffic violations')

# Merge the aggregated data on the month
df_merged = pd.merge(accidents_per_month, violations_per_month, on='month', how='outer').fillna(0)

# Convert 'month' back to string for plotting
df_merged['month'] = df_merged['month'].astype(str)

# Melt the dataframe for Plotly Express
df_melted = df_merged.melt(id_vars='month', value_vars=['traffic accidents', 'traffic violations'], 
                           var_name='type', value_name='count')

# Create the line plot
fig = px.line(df_melted, x='month', y='count', color='type', 
              labels={'month': 'Date (Months)', 'count': 'Count', 'type': ''},
              title='Monthly Traffic Accidents and Traffic Violations')

# Show the plot
fig.show()


### Reflection

Adjust the height of the parquet. <br>
Indicate in the bar plot which days it froze, instead of using a separate graph for that.


### References
https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents <br>
https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data/data?select=weather_description.csv