In [None]:
import pandas as pd

In [None]:
accidents = pd.read_csv('data/Traffic_Accidents__2019_.csv')

In [None]:
accidents.head()

First, let's convert the `Date and Time` column to the datetime type. Specifying the format explicitly means pandas does not have to infer it, which speeds up the conversion.

In [None]:
accidents['Date and Time'] = pd.to_datetime(accidents['Date and Time'], format = '%m/%d/%Y %I:%M:%S %p')
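
To make the format string less opaque, here is a minimal sketch on two made-up timestamps (the `sample` values are assumptions for illustration, not rows from the dataset). `%I` is the hour on a 12-hour clock and `%p` is the AM/PM marker, so 12:05 AM should parse to five minutes past midnight.

```python
import pandas as pd

# Two made-up timestamps in the same layout as the 'Date and Time' column
sample = pd.Series(['01/01/2019 12:05:00 AM', '11/12/2019 07:30:00 PM'])

# %m/%d/%Y is month/day/4-digit year; %I:%M:%S %p is a 12-hour clock time with AM/PM
parsed = pd.to_datetime(sample, format='%m/%d/%Y %I:%M:%S %p')

print(parsed)
```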

Once the column is in datetime format, it becomes easy to answer questions like "How many accidents were there per month?".

In [None]:
(accidents
 .assign(month = accidents['Date and Time'].dt.month_name())
 .month
 .value_counts(sort = False)
)
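
The row order returned by `value_counts(sort = False)` can differ between pandas versions, so the months may not come out in calendar order. One possible way to force calendar order, sketched here on made-up dates (the `times` values and the `month_order` helper are assumptions for illustration), is to reindex the counts against a list of month names.

```python
import pandas as pd

# Made-up dates standing in for the 'Date and Time' column
times = pd.Series(pd.to_datetime(['2019-03-05', '2019-01-10', '2019-03-20', '2019-02-14']))

# Count rows per month name (order not guaranteed to be chronological)
counts = times.dt.month_name().value_counts()

# Build January..December in order, then reindex; months with no rows get 0
month_order = pd.date_range('2019-01-01', periods=12, freq='MS').month_name()
counts = counts.reindex(month_order, fill_value=0)

print(counts)
```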

Let's look at how the number of accidents varies by hour of the day.

One method we could try is to extract the date and hour portions and do a `groupby` + `count`.

In [None]:
(accidents
 .assign(date = accidents['Date and Time'].dt.date, 
         hour = accidents['Date and Time'].dt.hour)     # Create a date and hour column so that we can group
 .groupby(['date', 'hour'])
 ['Accident Number']
 .count()
 .reset_index()
 .head(10)
)

There is a big problem with this, which can be seen if you look carefully at the output above. There are no rows for 6:00, 7:00, or 8:00 on January 1. This is because there were no accidents during these hours, so there were no rows to count.

A better method (and one that will require less work to pull off) is to use a `Grouper` to group by hour.

In [None]:
(accidents
 .groupby(pd.Grouper(key = 'Date and Time',     # the datetime column to group on
                     freq = '1h',               # the width of each time bin
                     origin = 'epoch'           # Anchor the bins at midnight of 1970-01-01, which ensures
                                                # that the first grouped period starts on the hour
                    ))
 ['Accident Number']
 .count()
 .reset_index()
 .head(10)
)
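
The difference between the two approaches is easiest to see on a tiny example. The sketch below uses three made-up accidents (the `toy` frame is an assumption for illustration): two in the midnight hour and one in the 3:00 hour. Grouping on extracted date and hour columns returns only the hours that have rows, while the `Grouper` also returns the empty hours in between with a count of 0.

```python
import pandas as pd

# Three made-up accidents: two in the 0:00 hour, one in the 3:00 hour
toy = pd.DataFrame({
    'Date and Time': pd.to_datetime(['2019-01-01 00:10',
                                     '2019-01-01 00:40',
                                     '2019-01-01 03:15']),
    'Accident Number': ['A1', 'A2', 'A3'],
})

# Method 1: group on extracted date and hour columns (empty hours are missing)
by_date_hour = (toy
 .assign(date = toy['Date and Time'].dt.date,
         hour = toy['Date and Time'].dt.hour)
 .groupby(['date', 'hour'])
 ['Accident Number']
 .count()
)

# Method 2: group with a Grouper (empty hours appear with a count of 0)
by_grouper = (toy
 .groupby(pd.Grouper(key = 'Date and Time', freq = '1h', origin = 'epoch'))
 ['Accident Number']
 .count()
)

print(by_date_hour)
print(by_grouper)
```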

Late on the night of November 11, 2019, [Nashville received a rare November snow](https://fox17.com/news/local/nws-nashville-sees-rare-early-november-snow). Let's investigate to see if we can detect any effect on the number of accidents the following morning.

In [None]:
# First, filter down to the following day
snow_day = accidents[(accidents['Date and Time'] >= '2019-11-12') & 
                     (accidents['Date and Time'] < '2019-11-13')]
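
The filter above is a half-open interval: it keeps midnight at the start of November 12 and drops midnight at the start of November 13. A minimal sketch on made-up boundary timestamps (the `toy` frame is an assumption for illustration) shows which rows survive.

```python
import pandas as pd

# Made-up timestamps sitting right on the edges of November 12
toy = pd.DataFrame({
    'Date and Time': pd.to_datetime(['2019-11-11 23:59:59',
                                     '2019-11-12 00:00:00',
                                     '2019-11-12 23:59:59',
                                     '2019-11-13 00:00:00'])
})

# The date strings are interpreted as midnight on those dates
mask = ((toy['Date and Time'] >= '2019-11-12') &
        (toy['Date and Time'] < '2019-11-13'))
one_day = toy[mask]

print(one_day)
```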

Now, let's apply a grouper to our snow day.

In [None]:
(snow_day
 .groupby(pd.Grouper(key = 'Date and Time',
                     freq = '1h',
                     origin = 'epoch'
                    ))
 ['Accident Number']
 .count()
 .plot(figsize = (10,5))
);

It does look like there were quite a few accidents in the morning, but in isolation, it is hard to know if what we are seeing is unusual. Let's do a comparison with the rest of the data.

In [None]:
accidents_grouped = (accidents
                     .assign(weekday = accidents['Date and Time'].dt.day_name())
                     .query('weekday != "Saturday" and weekday != "Sunday"')   # remove the weekends
                     .groupby(pd.Grouper(key = 'Date and Time',
                                         freq = '1h',
                                         origin = 'epoch'
                                        ))
                     ['Accident Number']
                     .count()
                     .reset_index()       # convert the Date and Time index back to a regular column
)

In [None]:
(accidents_grouped
 .assign(hour = accidents_grouped['Date and Time'].dt.hour)
 .groupby('hour')
 ['Accident Number']
 .agg(['mean', 'std', 'median','max'])
)

From this, we can see that in the morning, the 7:00 hour is usually the worst, with an average of more than 5 accidents.

Looking back at the snow day, we can see that while the morning hours did have a high number of crashes, none of them reached the 2019 maximum for its hour.

However, the hours of 6, 7, 8, and 9 all had an above-average number of crashes. Maybe we can compare this block of time on the snow day to the same block of time across the whole dataset.

One way to accomplish this is to change our grouping frequency to 4 hours. With `origin = 'epoch'`, 4-hour bins would start at 0:00, 4:00, 8:00, and so on, which would split our window, so we also need to adjust the origin value so that 6:00 - 10:00 get grouped together.

In [None]:
(snow_day
 .groupby(pd.Grouper(key = 'Date and Time',
                     freq = '4h',
                     origin = '2018-12-31 02:00:00'  # Shift the bin edges so that 6:00 AM - 10:00 AM falls in a single bin
                    ))
 ['Accident Number']
 .count()
)
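
To see what the origin does, here is a minimal sketch on four made-up timestamps around the edges of the 6:00 - 10:00 window (the `toy` frame and the `count_by_4h` helper are assumptions for illustration). With bins anchored at midnight the window is split across two bins; with the origin shifted to 2:00 it lands in a single bin.

```python
import pandas as pd

# Made-up accidents just before, inside, and just after the 6:00 - 10:00 window
toy = pd.DataFrame({
    'Date and Time': pd.to_datetime(['2019-11-12 05:59',
                                     '2019-11-12 06:00',
                                     '2019-11-12 09:59',
                                     '2019-11-12 10:00']),
    'Accident Number': ['A1', 'A2', 'A3', 'A4'],
})

def count_by_4h(origin):
    """Count rows in 4-hour bins whose edges are aligned to the given origin."""
    return (toy
     .groupby(pd.Grouper(key = 'Date and Time', freq = '4h', origin = origin))
     ['Accident Number']
     .count()
    )

default_bins = count_by_4h('epoch')                # edges at 0:00, 4:00, 8:00, ...
shifted_bins = count_by_4h('2018-12-31 02:00:00')  # edges at 2:00, 6:00, 10:00, ...

print(default_bins)
print(shifted_bins)
```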

Let's also regroup the full dataset.

In [None]:
accidents_grouped = (accidents
                     .assign(weekday = accidents['Date and Time'].dt.day_name())
                     .query('weekday != "Saturday" and weekday != "Sunday"')
                     .groupby(pd.Grouper(key = 'Date and Time',
                                         freq = '4h',
                                         origin = '2018-12-31 02:00:00'
                                        ))
                     ['Accident Number']
                     .count()
                     .reset_index()
)

In [None]:
(accidents_grouped
 .assign(hour = accidents_grouped['Date and Time'].dt.hour)
 .groupby('hour')
 ['Accident Number']
 .agg(['mean', 'std', 'median', 'max'])
)

Comparing the snow day to the overall average for the 4-hour period starting at 6:00 AM, we can see that it had an above-average number of accidents. It wasn't the worst day in the whole year, but let's investigate and see where it lands.

In [None]:
(accidents_grouped
 .assign(hour = accidents_grouped['Date and Time'].dt.hour,
         weekday = accidents_grouped['Date and Time'].dt.day_name())
 .query('hour == 6')
 .nlargest(5, 'Accident Number')
)

For the 6:00 - 10:00 AM block, the snow day was the worst Tuesday in 2019 and is tied for the 4th worst weekday in the whole year.
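
Because `nlargest` keeps a fixed number of rows by default, ties at the cutoff can be easy to miss. A minimal sketch on made-up counts (the `toy_counts` frame and its values are assumptions for illustration, not results from the dataset) shows two ways to make ties visible: ranking with `method = 'min'`, and passing `keep = 'all'` to `nlargest`.

```python
import pandas as pd

# Made-up 4-hour accident counts for six hypothetical days
toy_counts = pd.DataFrame({
    'day': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Accident Number': [30, 25, 25, 22, 22, 18],
})

# method='min' gives tied rows the same (best) rank, so D and E are both 4th
toy_counts['rank'] = (toy_counts['Accident Number']
                      .rank(method = 'min', ascending = False)
                      .astype(int))

# keep='all' returns every row tied with the last value that makes the cut
top_four_with_ties = toy_counts.nlargest(4, 'Accident Number', keep = 'all')

print(toy_counts)
print(top_four_with_ties)
```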