# Car Crash Data Analysis in Monroe County (2003-2015)

## Context and Purpose
This project explores patterns in car crash data from 2003 to 2015 in Monroe County, Indiana (the latitude and longitude values in the dataset fall around 39.16, -86.53). The primary goal is to answer the question: "How do factors such as time of day, day of the week, and month impact the frequency and severity of car accidents?" The project is also an opportunity to sharpen my data science skills, with a focus on the visualization capabilities of Plotly.

## Methodology
The analysis involved the following steps:
1. **Data Exploration**: Understanding the structure and contents of the dataset.
2. **Data Visualization**: Using Plotly to create various interactive visualizations that focus on different aspects of the data, such as time of day, day of the week, and month.
3. **Interpretation of Findings**: Analyzing the visualizations to draw meaningful insights and conclusions about the patterns and trends in the car crash data.

## Summary of Findings
- **Injury Severity**: The majority of accidents in the dataset fall into the "No injury/unknown" category, which is a broadly reassuring finding regarding road safety in Monroe County, although the category also includes crashes whose outcome was not recorded.
- **Worst Time for Accidents**: Mondays in 2003 had the highest count of any day-of-week and year combination (715 accidents), making it the most accident-prone combination in the dataset.
- **Accident Trends Over Time**: Contrary to my initial hypothesis that accident rates would decrease over time due to advancements in technology, this trend was not consistently observed in the data.
- **Time of Day Analysis**: The data indicated that the time frame from 12 PM to 8 PM, encompassing lunch hours and evening rush hours, was the most prone to accidents.

The insights derived from this analysis provide valuable information that can be used for improving road safety measures and informing the public about high-risk times for driving.

In [12]:
#Imports
import pandas as pd

import plotly.express as px
import plotly.offline as py
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import calendar
from datetime import datetime

import warnings

warnings.filterwarnings('ignore')


#Read in data and check shape
df = pd.read_excel('new dataset.xlsx')
print(df.shape)
display(df.head())

#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

(53943, 11)


Unnamed: 0,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude
0,2015,1,5,Weekday,0.0,2-Car,No injury/unknown,OTHER (DRIVER) - EXPLAIN IN NARRATIVE,1ST & FESS,39.159207,-86.525874
1,2015,1,6,Weekday,1500.0,2-Car,No injury/unknown,FOLLOWING TOO CLOSELY,2ND & COLLEGE,39.16144,-86.534848
2,2015,1,6,Weekend,2300.0,2-Car,Non-incapacitating,DISREGARD SIGNAL/REG SIGN,BASSWOOD & BLOOMFIELD,39.14978,-86.56889
3,2015,1,7,Weekend,900.0,2-Car,Non-incapacitating,FAILURE TO YIELD RIGHT OF WAY,GATES & JACOBS,39.165655,-86.575956
4,2015,1,7,Weekend,1100.0,2-Car,No injury/unknown,FAILURE TO YIELD RIGHT OF WAY,W 3RD,39.164848,-86.579625


In [2]:
#Drop duplicates and null values
df.drop_duplicates(inplace=True)
df.dropna(subset=['Hour'], inplace=True)

#Check for null values, data types, and duplicates
print(df.isnull().sum())
print()
print(df.dtypes)
print()
print(df.describe())
print()
print(df.duplicated().sum())

# Find and display duplicates
duplicates = df[df.duplicated()]
display(duplicates)

Year                    0
Month                   0
Day                     0
Weekend?                0
Hour                    0
Collision Type          6
Injury Type             0
Primary Factor       1085
Reported_Location      28
Latitude               30
Longitude              30
dtype: int64

Year                   int64
Month                  int64
Day                    int64
Weekend?              object
Hour                 float64
Collision Type        object
Injury Type           object
Primary Factor        object
Reported_Location     object
Latitude             float64
Longitude            float64
dtype: object

               Year         Month           Day          Hour      Latitude  \
count  53654.000000  53654.000000  53654.000000  53654.000000  53624.000000   
mean    2008.988705      6.665486      4.197394   1347.316137     35.585441   
std        3.783493      3.512871      1.908706    531.631694     11.285154   
min     2003.000000      1.000000      1.000000   

Unnamed: 0,Year,Month,Day,Weekend?,Hour,Collision Type,Injury Type,Primary Factor,Reported_Location,Latitude,Longitude


# Plotly Insights and Tips: Accidents by Injury Type Over Time (Line Graph)

In the course of this project, I've had the opportunity to dive deep into Plotly, a powerful and versatile plotting library. Here are some key takeaways and helpful tips from my experience, particularly from creating the 'Accidents by Injury Type Over Time' line plot:

## Similarities to Matplotlib
- **Familiar Ground for Matplotlib Users**: One of the first things I noticed is that Plotly operates quite similarly to Matplotlib in many respects. This made the transition smoother, as I could apply some of my existing knowledge about data visualization.

## Customization and Styling
- **Enhancing Visual Appeal**: A standout feature in Plotly is how easily you can switch themes and styles. For example, `template='plotly_dark'` (my personal favorite) instantly gives the plot a sleek, modern look. This flexibility allows quick adjustments to the visual presentation, making data more engaging and accessible to different audiences.

If you want to see all of the available Plotly templates, you can run:

```python
import plotly.io as pio
pio.templates
```

In [3]:
# Group by year and injury type, then count the number of accidents
injury_data = df.groupby(['Year', 'Injury Type']).size().reset_index(name='Accidents')

# Create a Plotly line plot
fig = px.line(injury_data, x='Year', y='Accidents', color='Injury Type',
              title='Accidents by Injury Type Over Time',
              labels={'Year': 'Year', 'Accidents': 'Number of Accidents', 'Injury Type': 'Type of Injury'})

# Customize the layout
fig.update_layout(xaxis_title='Year', yaxis_title='Number of Accidents', height=1000, template='plotly_dark')

# Show the plot
fig.show()

#Save the plot to html file
#py.plot(fig, filename='injury_type_plot.html')

# Advanced Plotly Features: Number of Accidents by Day of the Week & Month (Bar Charts)

In creating the 'Number of Accidents by Day of the Week' and 'Number of Accidents by Month' visualizations, I delved into some of Plotly's more advanced features. These features not only enhanced the aesthetic appeal of the plots but also improved their functionality and user interactivity. Here are some key takeaways:

## Custom Tooltips from DataFrame Columns
- **Dynamic Tooltip Content**: Plotly allows for the creation of a custom tooltip column in the DataFrame, which can then be seamlessly integrated into the plot. This feature offers a convenient way to control and customize the information displayed in tooltips. While it's not always necessary (as seen in my polar subplots for time of day), it's a handy option for more controlled scenarios.
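
A minimal sketch of the tooltip-column idea, using made-up counts rather than the real crash data. It builds the strings with vectorized concatenation instead of `apply`, which is just one possible way to do it.

```python
import pandas as pd

# Toy counts; the notebook derives the real ones from the crash data
counts = pd.DataFrame({
    "Year": [2003, 2003, 2004],
    "DayOfWeek": ["Monday", "Tuesday", "Monday"],
    "Accidents": [715, 607, 629],
})

# Rank 1 = most accidents, then build one human-readable string per row
counts["Rank"] = counts["Accidents"].rank(method="min", ascending=False).astype(int)
counts["Tooltip"] = (
    counts["DayOfWeek"] + "s in " + counts["Year"].astype(str)
    + " ranked #" + counts["Rank"].astype(str) + " for accidents"
)
print(counts["Tooltip"].tolist())
```

The resulting `Tooltip` column can then be handed to a trace, for example through its `hovertext` argument.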

## Layout Customization
- **Extensive Control over Layout**: The level of customization available for the layout in Plotly is impressive. Everything about the plot’s title, including font, positioning, size, and color, can be adjusted. This flexibility is crucial for tailoring the visualization to specific presentation or reporting needs.

## Sizing Adjustments
- **Dimension Control**: Unlike Matplotlib, where you adjust `figsize` (in inches), Plotly sets figure dimensions through the `width` and `height` arguments of `update_layout`, both in pixels. This keeps sizing alongside the rest of the layout settings.

## Tooltip Customization
- **Enhanced Tooltip Functionality**: The `hovertemplate` argument of a trace controls exactly what the tooltip shows. It takes a string that mixes HTML tags with placeholders such as `%{x}` and `%{y}`, and a trailing `<extra></extra>` hides the secondary box that would otherwise show the trace name. This is particularly useful for enriching the interactive experience of the plot, giving users detailed, context-specific information on hover.

## Diverse Themes and Templates
- **Aesthetic Variety**: Plotly includes a variety of themes and templates, including a Seaborn-like theme. These pre-set styles can quickly give plots a professional and polished look, akin to popular visualization libraries.

In conclusion, these advanced capabilities of Plotly have not only augmented my data visualization toolkit but also opened up new avenues for creatively displaying data insights.

In [4]:
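# Function to get the day of the week given the year, month, and day.
# Caveat: datetime() reads `Day` as a day of the month. The describe() output above
# (mean of about 4.2) suggests `Day` may instead be a 1-7 weekday code; if so, this
# mapping does not return the true weekday and should be replaced by a direct lookup.
def get_day_of_week(row):
    return calendar.day_name[datetime(row['Year'], row['Month'], row['Day']).weekday()]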

# Apply the function to the dataframe
df['DayOfWeek'] = df.apply(get_day_of_week, axis=1)

# Check if the new column is added correctly
df[['Year', 'Month', 'Day', 'DayOfWeek']].head()

Unnamed: 0,Year,Month,Day,DayOfWeek
0,2015,1,5,Monday
1,2015,1,6,Tuesday
2,2015,1,6,Tuesday
3,2015,1,7,Wednesday
4,2015,1,7,Wednesday
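
Since `datetime` interprets its third argument as a day of the month, it can be worth checking the range of `Day` before building dates from it. The following is a sketch on toy values, and the helper name `looks_like_weekday_code` is hypothetical:

```python
import pandas as pd

def looks_like_weekday_code(day_col: pd.Series) -> bool:
    """Return True when every value lies in 1..7, which hints at a day-of-week code."""
    return bool(day_col.between(1, 7).all())

toy = pd.DataFrame({"Day": [5, 6, 6, 7, 7]})  # same values as the first rows above
print(looks_like_weekday_code(toy["Day"]))
```

If the check returns `True` on the full column, the values are more likely weekday codes than calendar days, and the exact code-to-weekday mapping would need to be confirmed against the dataset's documentation.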


In [5]:
# Aggregate the data to get the count of accidents for each day of the week across each year
accidents_by_day_week = df.groupby(['Year', 'DayOfWeek']).size().unstack(fill_value=0)

# Reorder the columns to have the days in the correct order
ordered_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
accidents_by_day_week = accidents_by_day_week[ordered_days]

# Display the aggregated data
accidents_by_day_week.head()

Year,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
2003,715,607,677,613,679,575,662
2004,629,656,613,551,599,626,528
2005,498,535,566,600,583,567,563
2006,539,557,572,569,576,552,551
2007,607,617,551,550,612,577,533


In [6]:
# Convert the accidents_by_day_week dataframe to a long format for Plotly
accidents_long_df = accidents_by_day_week.reset_index().melt(id_vars='Year', var_name='DayOfWeek', value_name='Accidents')

# Let's create a new 'Rank' column in the long format dataframe that ranks each row based on 'Accidents' (1 being the highest number of accidents) 
accidents_long_df['Rank'] = accidents_long_df['Accidents'].rank(method='min', ascending=False).astype(int)

# Create custom tooltip text that includes the ranking information
accidents_long_df['Tooltip'] = accidents_long_df.apply(
    lambda row: f"{row['DayOfWeek']}s in {row['Year']} were the #{int(row['Rank'])} worst day for accidents", axis=1)

accidents_long_df.head()

Unnamed: 0,Year,DayOfWeek,Accidents,Rank,Tooltip
0,2003,Monday,715,1,Mondays in 2003 were the #1 worst day for acci...
1,2004,Monday,629,16,Mondays in 2004 were the #16 worst day for acc...
2,2005,Monday,498,90,Mondays in 2005 were the #90 worst day for acc...
3,2006,Monday,539,82,Mondays in 2006 were the #82 worst day for acc...
4,2007,Monday,607,35,Mondays in 2007 were the #35 worst day for acc...


In [7]:
# Create a figure with plotly.graph_objects
fig = go.Figure()


# Add one bar for each day of the week
for idx, day in enumerate(ordered_days):
    day_data = accidents_long_df[accidents_long_df['DayOfWeek'] == day]
    fig.add_trace(go.Bar(
        x=day_data['Year'],
        y=day_data['Accidents'],
        name=day,
        text=day_data['Accidents'],  # Number of accidents displayed inside the bar
        hoverinfo='text',
        hovertemplate='<b>Day of the week</b>: ' + day + '<br>' +
                      '<b>Year</b>: %{x}<br>' +
                      '<b>Accidents</b>: %{y}<br><br>' +
                      '%{hovertext}<extra></extra>',
        # Custom hover text; regex=True is required in pandas 2.x, where the default is a literal match
        hovertext=day_data['Tooltip'].str.replace(r"(#\d+)", r"<b>\1</b>", regex=True),
    ))

# Update the layout for a grouped bar chart
fig.update_layout(
    height=900,
    barmode='group',
    title={
        'text': 'Number of Accidents by Day of the Week (2003-2015)',
        'y':0.92,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(
            family="Arial, sans-serif",
            size=24,
            color="black",
        ),
    },
    xaxis=dict(title='Year'),
    yaxis=dict(title='Number of Accidents'),
    legend_title='Day of the Week',
    uniformtext_minsize=8,  # Ensures text size inside bars is readable
    uniformtext_mode='hide',  # Hides text if the bar is too small
    template='seaborn', # nice!
)

# Show the figure
fig.show()

# Save the figure
#py.plot(fig, filename='day_of_the_week_plot.html')

In [8]:
# Using the same methodology here that I used for day_of_the_week

df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[x])


accidents_by_month = df.groupby(['Year', 'Month']).size().unstack(fill_value=0)


ordered_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
accidents_by_month = accidents_by_month[ordered_months]


accidents_long_df_month = accidents_by_month.reset_index().melt(id_vars='Year', var_name='Month', value_name='Accidents')

accidents_long_df_month['Rank'] = accidents_long_df_month['Accidents'].rank(method='min', ascending=False).astype(int)


accidents_long_df_month['Tooltip'] = accidents_long_df_month.apply(
    lambda row: f"{row['Month']} {row['Year']} was the #{int(row['Rank'])} worst month for accidents", axis=1)

accidents_long_df_month


Unnamed: 0,Year,Month,Accidents,Rank,Tooltip
0,2003,Jan,370,52,Jan 2003 was the #52 worst month for accidents
1,2004,Jan,286,135,Jan 2004 was the #135 worst month for accidents
2,2005,Jan,384,33,Jan 2005 was the #33 worst month for accidents
3,2006,Jan,303,122,Jan 2006 was the #122 worst month for accidents
4,2007,Jan,318,100,Jan 2007 was the #100 worst month for accidents
...,...,...,...,...,...
151,2011,Dec,263,149,Dec 2011 was the #149 worst month for accidents
152,2012,Dec,362,57,Dec 2012 was the #57 worst month for accidents
153,2013,Dec,380,36,Dec 2013 was the #36 worst month for accidents
154,2014,Dec,385,30,Dec 2014 was the #30 worst month for accidents


In [9]:
# Same methodology as day_of_the_week, but with a few slight tweaks

fig = go.Figure()


for idx, month in enumerate(ordered_months):
    month_data = accidents_long_df_month[accidents_long_df_month['Month'] == month]
    fig.add_trace(go.Bar(
        x=month_data['Year'],
        y=month_data['Accidents'],
        name=month,
        text=month_data['Accidents'],  
        hoverinfo='text',
        hovertemplate='<b>Month</b>: ' + month + '<br>' +
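                      '<b>Year</b>: %{x}<br>' +  # %{x} is the year; %{text} would show the accident count
                      '<b>Accidents</b>: %{y}<br><br>' +
                      '%{hovertext}<extra></extra>',
        # regex=True is required in pandas 2.x, where the default is a literal match
        hovertext=month_data['Tooltip'].str.replace(r"(#\d+)", r"<b>\1</b>", regex=True),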
    ))


fig.update_layout(
    barmode='stack', # Stacked instead of grouped
    title={
        'text': 'Number of Accidents by Month (2003-2015)',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(
            family="Helvetica, sans-serif",
            size=24,
            color="black",
        ),
    },
    xaxis=dict(title='Year'),
    yaxis=dict(title='Number of Accidents'),
    legend_title='Month',
    uniformtext_minsize=8,
    uniformtext_mode='hide',
    template='gridon' # Using the gridon template adds helpful grid lines to the plot
)


fig.show()

#py.plot(fig, filename='month_plot.html')

# More Advanced Features: Number of Accidents by Time of Day (Polar Subplots)

Creating the time of day polar plots provided some unique insights into the capabilities of Plotly, especially in terms of layout design and customization. Here's what I learned:

## Uniformly Spaced Subplots
- **Structured Visualization**: The `make_subplots` feature in Plotly is extremely powerful for creating complex, multi-plot layouts. For the time of day analysis, it allowed me to generate a uniform grid of polar plots, one for each year, ensuring consistency in presentation and making comparisons easier.

## Flexible Title Positioning
- **Creative Freedom with Titles**: A surprising and useful feature I discovered is the ability to place plot titles anywhere on the canvas. Moving the title to the bottom right corner was an experiment that showcased Plotly's flexibility in customizing plot components.

## Additional Insights
- **Interactive Elements on Subplots**: One of the strengths of Plotly is its interactivity, and this extends to subplots as well. Each polar plot maintained its interactive features, like hover data, enhancing the user experience.
- **Customizing Polar Charts**: Adjusting the aesthetics and functionality of polar charts, such as the color scheme, radial and angular ticks, and hover information, provided a deeper understanding of Plotly's customization options.
- **Learning Curve**: While creating multiple polar plots required a bit more coding effort, it was a valuable learning experience in understanding the structure and syntax specific to Plotly, particularly for more complex visualizations.

In summary, the process of creating these polar plots was not only a lesson in data visualization but also an exploration of the artistic possibilities within data science. This experience has broadened my understanding of how interactive and engaging data presentations can be crafted.

In [10]:
# First, we'll need to create a new column for the 4-hour time increments
def get_time_period(hour):
    if 0 <= hour < 4:
        return 'Midnight to Early Morning'
    elif 4 <= hour < 8:
        return 'Early to Late Morning'
    elif 8 <= hour < 12:
        return 'Morning to Noon'
    elif 12 <= hour < 16:
        return 'Afternoon'
    elif 16 <= hour < 20:
        return 'Late Afternoon to Evening'
    else:  # from 20 to 24
        return 'Night'

# Apply the function to the 'Hour' column (making sure to convert to proper hour format first)
df['Hour'] = df['Hour'] // 100  # Convert HHMM values (e.g. 1500.0) to the hour of day (15.0)
df['TimePeriod'] = df['Hour'].apply(get_time_period)

# Now, let's group and aggregate the data by year and the new 'TimePeriod'
accidents_by_time_period = df.groupby(['Year', 'TimePeriod']).size().reset_index(name='Accidents')

# Next, we calculate the global ranking for these time increments across all years
accidents_by_time_period['Rank'] = accidents_by_time_period['Accidents'].rank(method='min', ascending=False).astype(int)

# Let's take a look at the dataframe
accidents_by_time_period.head()


Unnamed: 0,Year,TimePeriod,Accidents,Rank
0,2003,Afternoon,1304,1
1,2003,Early to Late Morning,322,55
2,2003,Late Afternoon to Evening,1216,6
3,2003,Midnight to Early Morning,206,77
4,2003,Morning to Noon,923,27
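
The 4-hour bucketing above can also be expressed with `pd.cut`. This is a sketch on toy values, not a drop-in replacement: unlike the `else` branch in `get_time_period`, values outside 0 to 24 would become `NaN` here rather than `'Night'`.

```python
import pandas as pd

# HHMM floats as stored in the raw 'Hour' column (toy values)
hhmm = pd.Series([0.0, 359.0, 400.0, 1500.0, 1900.0, 2000.0, 2300.0])
hour = hhmm // 100  # e.g. 1500.0 -> 15.0

labels = ["Midnight to Early Morning", "Early to Late Morning", "Morning to Noon",
          "Afternoon", "Late Afternoon to Evening", "Night"]
# right=False makes each bin closed on the left: [0, 4), [4, 8), ..., [20, 24)
period = pd.cut(hour, bins=[0, 4, 8, 12, 16, 20, 24], labels=labels, right=False)
print(period.tolist())
```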


In [11]:
# Unique years and time periods for creating the subplots
unique_years = sorted(accidents_by_time_period['Year'].unique())
unique_time_periods = accidents_by_time_period['TimePeriod'].unique()

# Determine the layout for subplots
rows = 3 
cols = len(unique_years) // rows + (len(unique_years) % rows > 0)

# Create subplots - each subplot will be a polar chart
fig = make_subplots(rows=rows, cols=cols, specs=[[{'type': 'polar'}]*cols]*rows,
                    subplot_titles=[str(year) for year in unique_years])

# Color palette: one color per time period, reused across all subplots
time_period_colors = px.colors.qualitative.Plotly

# Loop through each year and time period to create polar subplots
for i, year in enumerate(unique_years):
    row = (i // cols) + 1
    col = (i % cols) + 1
    year_data = accidents_by_time_period[accidents_by_time_period['Year'] == year]
    
    # Add a polar subplot for each year
    for j, time_period in enumerate(unique_time_periods):
        data = year_data[year_data['TimePeriod'] == time_period]
        fig.add_trace(
            go.Barpolar(
                r=data['Accidents'],
                theta=[time_period for _ in data['Accidents']],
                name=time_period,
                marker_color=time_period_colors[j % len(time_period_colors)],
                hoverinfo='text',
                hovertext=f"<b>{time_period}</b><br>Number of Accidents=<b>{data['Accidents'].iloc[0]}</b><br>This is the #{data['Rank'].iloc[0]} worst time for accidents",
                showlegend=(i == 0)  # Only show legend for the first subplot
            ),
            row=row, col=col
        )

# Hide radial axis tick labels and update other polar properties as needed
fig.update_polars(radialaxis_showticklabels=False)

# Ensure that the layout for each polar subplot is uniform and that the colors are visible
max_accidents = accidents_by_time_period['Accidents'].max()
fig.update_polars(radialaxis_range=[0, max_accidents])

# Set the height and width of the figure to increase the size of the subplots
fig.update_layout(
    height=1200,  # Increase as needed
    width=2700,  # Increase as needed
    title={
        'text': 'Number of Accidents by <br> Time of Day (2003-2015)',
        'y':0.1,
        'x':.8,
        'xanchor': 'right',
        'yanchor': 'bottom',
        'font': dict(
            family="Helvetica, sans-serif",
            size=35,
            color="white",
        ),
    },
    # title_text='Number of Accidents by Time of Day (2003-2015)',
    # title_font=dict(size=24, color='white', family='Helvetica, sans-serif'),
    template='plotly_dark', 
)

# fig.update_layout(
#     annotations=[
#         go.layout.Annotation(
#             text='Number of Accidents by <br> Time of Day (2003-2015)',
#             x=.8,  # x=1 and xref='paper' places it at the far right
#             y=0,  # y=0 and yref='paper' places it at the bottom
#             xref='paper',
#             yref='paper',
#             showarrow=False,
#             font=dict(
#                 size=40,
#                 family='Helvetica, sans-serif',
#                 color="white"
#             ),
#             align='right'
#         )
#     ]
# )

# Show the figure
fig.show()

#py.plot(fig, filename='time_of_day_plot.html')

## Additional Tips and Tricks
- **Interactive Elements**: Plotly's interactivity is a game-changer, especially for exploratory data analysis. The ability to zoom, pan, and hover for more information adds a dynamic layer to data exploration.
- **Customization Options**: The level of customization with Plotly is extensive. From changing axis labels to modifying the color palette, the library provides numerous options to tailor your plots.
- **Documentation and Community**: Plotly's documentation is comprehensive, offering examples and guidance for almost any type of plot. Additionally, the community around Plotly is active, providing a wealth of information and support.

In summary, my journey with Plotly has been enlightening. The library's blend of simplicity, power, and aesthetic flexibility makes it an excellent tool for any data scientist looking to present data in an informative and visually appealing way.

Thanks for reading!