This is a walkthrough of the US_Accidents dataset. It is an attempt at an end-to-end project.
A data dictionary for the dataset can be found at: https://smoosavi.org/datasets/us_accidents

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# read in the dataset and take a quick peek at it
df = pd.read_csv("/kaggle/input/us-accidents-may19/US_Accidents_May19.csv")
df.sample(10)

1. A quick peek at the data shows all sorts of data types (string, datetime, float, boolean, and integer), but also some NaN values.
2. Next I will look at the shape of the df and how many NaN values each column contains.

In [None]:
print('The DataFrame has {} rows and {} columns'.format(df.shape[0],df.shape[1]))
print('\n')
missing = df.isnull().sum().sort_values(ascending=False)
percent_missing = ((missing/df.isnull().count())*100).sort_values(ascending=False)
missing_df = pd.concat([missing,percent_missing], axis=1, keys=['Total', 'Percent'],sort=False)
missing_df[missing_df['Total']>=1]

Our dataset has 49 columns and over 2.2 million rows, and a lot of the data is missing. We will now determine how to treat the missing values, using the data dictionary to understand what each column holds.
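
The missing-value table above is rebuilt several times later in this notebook, so it could be wrapped in a small helper. The sketch below is one way to do that; `missing_summary` and the toy frame are hypothetical names and made-up values used only for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count, percentage and dtype for every column with at least one NaN."""
    total = frame.isnull().sum()
    summary = pd.DataFrame({
        'Total': total,
        'Percent': total / len(frame) * 100,
        'Data Types': frame.dtypes,
    })
    # keep only columns that actually have gaps, worst first
    return summary[summary['Total'] >= 1].sort_values('Total', ascending=False)

# toy frame standing in for the accidents data (values are made up)
toy = pd.DataFrame({
    'City': ['Dayton', None, 'Akron', None],
    'Temperature(F)': [36.9, np.nan, 40.0, 35.1],
    'Severity': [2, 3, 2, 4],
})
print(missing_summary(toy))
```

On the real data the call would simply be `missing_summary(df)` after each imputation step.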

### Fill NaNs with zero (0)
I will fill these columns with 0 because a value of zero is plausible for them. For example, precipitation is zero on a day when no rain fell.

In [None]:
lst = ['Humidity(%)','Precipitation(in)','Wind_Chill(F)','Wind_Speed(mph)','Visibility(mi)']
for l in lst:
    df[l] = df[l].fillna(0)

### Fill NaNs with the mean
I will fill the following columns with the column average, as filling them with zero doesn't make much sense.

In [None]:
lst = ['Temperature(F)','Pressure(in)']
for l in lst:
    df[l]=df[l].fillna(df[l].mean())
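
To make the difference between the two strategies concrete, here is a minimal sketch on made-up readings (the numbers are placeholders, not values from the dataset). It shows that mean imputation leaves the column mean unchanged, while zero-filling a column such as temperature would pull the mean down.

```python
import numpy as np
import pandas as pd

# made-up readings with one gap in each column
toy = pd.DataFrame({
    'Precipitation(in)': [0.0, 0.2, np.nan, 0.0],
    'Temperature(F)': [60.0, 70.0, np.nan, 80.0],
})

# zero is a plausible reading for precipitation
zero_filled = toy['Precipitation(in)'].fillna(0)
# zero is not a sensible default for temperature, so use the column mean
mean_filled = toy['Temperature(F)'].fillna(toy['Temperature(F)'].mean())

# mean imputation leaves the column mean unchanged
print(mean_filled.mean(), toy['Temperature(F)'].mean())
# zero-filling temperature instead would drag the mean down
print(toy['Temperature(F)'].fillna(0).mean())
```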

In [None]:
'''
This is a good time to take a look at our missing values again. I have added a third column showing the respective data types
'''
missing = df.isnull().sum().sort_values(ascending=False)
percent_missing = ((missing/df.isnull().count())*100).sort_values(ascending=False)
missing_df = pd.concat([missing,percent_missing,df[missing.index].dtypes], axis=1, keys=['Total', 'Percent','Data Types'],sort=False)
missing_df[missing_df['Total']>=1]

In [None]:
missing_copy = missing_df[missing_df['Total']>=1].copy()

In [None]:
object_columns = missing_copy[missing_copy['Data Types']=='object'].index
df[object_columns].head()

### Fill NaNs with the most frequent entry
The data in these columns is categorical in nature, so I will fill the missing values with the most frequently occurring value in each column.

1. Filling the 'City' column. Since we have a 'State' column, I'll fill each missing city with the most frequent city of the state it belongs to.

In [None]:
df['City'] = df.groupby('State')['City'].transform(lambda grp: grp.fillna(grp.value_counts().index[0]))
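
The group-wise fill above assumes every state has at least one non-missing city; `value_counts().index[0]` raises an `IndexError` for a group that is entirely NaN. The sketch below shows the same idea with a guard for that case. `fill_with_group_mode` is a hypothetical helper and the toy rows are made up. It uses `mode()` instead of `value_counts()`, which may pick a different value when two entries are tied.

```python
import pandas as pd

def fill_with_group_mode(frame, group_col, target_col):
    """Fill NaNs in target_col with the most frequent value within each group."""
    def _fill(grp):
        modes = grp.mode()  # empty when the whole group is NaN
        return grp.fillna(modes.iloc[0]) if not modes.empty else grp
    return frame.groupby(group_col)[target_col].transform(_fill)

# made-up rows: WY has no known city at all
toy = pd.DataFrame({
    'State': ['OH', 'OH', 'OH', 'CA', 'CA', 'WY'],
    'City': ['Dayton', 'Dayton', None, 'Los Angeles', None, None],
})
toy['City'] = fill_with_group_mode(toy, 'State', 'City')
print(toy)
```

The WY row stays missing rather than raising, so it can be handled separately.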

2. Next, I want to impute the Day/Night columns. To do that, I will take the hour from the 'Start_Time' column and impute Day or Night based on its value. This requires converting 'Start_Time' to a datetime, and while we're at it, we'll do the same for the 'End_Time' and 'Weather_Timestamp' columns.

In [None]:
df['Start_Time'] = pd.to_datetime(df['Start_Time']) # convert Start_Time to datetime
df['End_Time'] = pd.to_datetime(df['End_Time']) # convert End_Time to datetime
df['Weather_Timestamp'] = pd.to_datetime(df['Weather_Timestamp']) # convert Weather_Timestamp to datetime

In [None]:
# fill the Nautical_Twilight column with Day/Night, inferred from the hour of Start_Time

def filler(df,column):
    # rows where the column is missing
    missing_rows = df[column].isna()
    hours = df.loc[missing_rows,'Start_Time'].dt.hour
    # 06:00-17:59 counts as Day, anything else as Night; each missing row gets its own value
    df.loc[missing_rows,column] = np.where((hours>=6) & (hours<18),'Day','Night')

filler(df,'Nautical_Twilight')
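
As a quick check of the hour-based rule, the sketch below applies it row by row to a handful of made-up timestamps around the 06:00 and 18:00 boundaries. The cut-off hours are the ones used in this notebook, not an astronomical definition of twilight.

```python
import numpy as np
import pandas as pd

# made-up timestamps on either side of the two boundaries
toy = pd.DataFrame({
    'Start_Time': pd.to_datetime([
        '2019-01-05 05:59:00', '2019-01-05 06:00:00',
        '2019-01-05 17:59:00', '2019-01-05 18:00:00',
    ]),
    'Nautical_Twilight': [None, None, 'Day', None],
})

missing = toy['Nautical_Twilight'].isna()
hour = toy.loc[missing, 'Start_Time'].dt.hour
# only the missing rows are touched; existing labels are left alone
toy.loc[missing, 'Nautical_Twilight'] = np.where((hour >= 6) & (hour < 18), 'Day', 'Night')
print(toy['Nautical_Twilight'].tolist())
```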

In [None]:
# Another, easier option is to impute the Day/Night values with the mode, since ['Sunrise_Sunset','Civil_Twilight','Astronomical_Twilight']
# vary with the time of year and are difficult to infer from the hour of day alone.

def mode_imputer(x):
    # assign back rather than calling fillna(inplace=True) on a column selection
    df[x] = df[x].fillna(df[x].mode()[0])

mode_impute = ['Sunrise_Sunset','Civil_Twilight','Astronomical_Twilight','Wind_Direction','Weather_Condition']
for col in mode_impute:
    mode_imputer(col)

In [None]:
# impute the timezone based on the State column

df['Timezone'] = df.groupby('State')['Timezone'].transform(lambda tz: tz.fillna(tz.value_counts().index[0]))

In [None]:
# impute the Weather_Timestamp with the value at Start_Time. This column records the time the weather was taken (we won't really need it)

df.loc[(pd.isnull(df.Weather_Timestamp)), 'Weather_Timestamp'] = df.Start_Time

In [None]:
'''
This is a good time to take a look at our missing values again.
'''
missing = df.isnull().sum().sort_values(ascending=False)
percent_missing = ((missing/df.isnull().count())*100).sort_values(ascending=False)
missing_df = pd.concat([missing,percent_missing,df[missing.index].dtypes], axis=1, keys=['Total', 'Percent','Data Types'],sort=False)
missing_df[missing_df['Total']>=1]

In [None]:
# we do for Zipcode and Airport_Code what we did for columns like Timezone
df['Zipcode'] = df.groupby('State')['Zipcode'].transform(lambda zc: zc.fillna(zc.value_counts().index[0]))
df['Airport_Code'] = df.groupby('State')['Airport_Code'].transform(lambda ac: ac.fillna(ac.value_counts().index[0]))

In [None]:
# we will fill the one record in Description with 'Accident'

df.Description = df.Description.fillna('Accident')

### Missing Value treatment by dropping values
I will drop the End_Lat and End_Lng columns. They record the latitude and longitude where the accident's impact ended, for accidents that affected a long stretch of road, and they would be difficult to impute. One option would be to copy Start_Lat and Start_Lng into them, but since about 77% of the values in these columns would then just duplicate the start coordinates, that adds little over removing them. Dropping them is the logical choice.

In [None]:
df.drop(labels=['End_Lat', 'End_Lng'],axis=1,inplace=True)

For the two remaining columns: I will fill Number (which records the street number) with the most common street number for accidents in the same State, and fill TMC with code 201, since all records represent accidents.
<b> Another possible way to deal with the Number column would be to fill its NaNs with 0, since not every accident occurs on a numbered street. But how would that affect our model? </b>

In [None]:
df['Number'] = df.groupby('State')['Number'].transform(lambda n: n.fillna(n.value_counts().index[0]))
df.TMC = df.TMC.fillna(201.0)

In [None]:
'''
This is a good time to take a look at our missing values again.
'''
missing = df.isnull().sum().sort_values(ascending=False)
percent_missing = ((missing/df.isnull().count())*100).sort_values(ascending=False)
missing_df = pd.concat([missing,percent_missing,df[missing.index].dtypes], axis=1, keys=['Total', 'Percent','Data Types'],sort=False)
missing_df[missing_df['Total']>=1]

In [None]:
df.sample(10)

In [None]:
# write and store the cleaned file to a pickle file
df.to_pickle('US_Accidents_Cleaned.pkl')

# Exploratory Data Analysis

In [None]:
# import libraries for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

In [None]:
df = pd.read_pickle('US_Accidents_Cleaned.pkl')

In [None]:
# create new features for timeseries analysis.
df['Hour'] = df['Start_Time'].dt.hour
df['Day'] = df['Start_Time'].dt.day
df['Day_Name'] = df['Start_Time'].dt.day_name()
df['Week'] = df['Start_Time'].dt.isocalendar().week # Series.dt.week was removed in pandas 2.0; isocalendar() needs pandas >= 1.1
df['Month'] = df['Start_Time'].dt.month
df['Count'] = 1

In [None]:
df.groupby('Month')['Count'].value_counts()
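
Because every row is one accident, the monthly count can also be obtained from the group size, without the helper `Count` column. The sketch below shows this on made-up dates and reindexes to all twelve months, so that a month with no accidents appears as 0 instead of silently shifting the tick labels in a plot.

```python
import pandas as pd

# made-up accident dates
toy = pd.DataFrame({'Start_Time': pd.to_datetime([
    '2018-01-03', '2018-01-20', '2018-02-11',
    '2018-08-01', '2018-08-02', '2018-08-30',
])})
toy['Month'] = toy['Start_Time'].dt.month

# one row per accident, so the group size is the accident count
monthly = toy.groupby('Month').size()
# make sure all 12 months are present, in order
monthly = monthly.reindex(range(1, 13), fill_value=0)
print(monthly)
```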

In [None]:
import calendar
df.groupby('Month')['Count'].value_counts().plot(kind='bar')
df.groupby('Month')['Count'].value_counts().plot(color='k',linestyle='-',marker='.',linewidth=0.4)
plt.xticks(np.arange(12),calendar.month_name[1:13],rotation=45)
plt.xlabel('Month')
plt.title('Monthly Accident Count')

May registers the lowest accident count of the year, and August the highest, probably because it's summer and people travel a lot.
<br> Why are accident counts relatively low in April, May, June, and July compared to the other months of the year?

In [None]:
plt.figure(figsize=(10,6))
df.groupby('Week')['Count'].value_counts().plot(linewidth=1,marker='.')
plt.xticks(np.arange(52),np.arange(1,53),rotation = 90)
plt.xlabel('Week of Year')
plt.title('Accident Count by Week of Year')
plt.show()

Week 22 registered the lowest accident count. <br> The question is why. Why is the accident count in May so low compared to other months?

In [None]:
plt.figure(figsize=(10,6))
df.groupby('State')['Count'].value_counts().plot(kind='bar')
plt.xticks(np.arange(df['State'].nunique()),sorted(df['State'].unique()),rotation = 90) # tick count follows the number of states in the data
plt.xlabel('State')
plt.title('Accident Count by State')
plt.show()

CA, FL, NC, NY, and TX register the highest accident counts in the country.
<br> Why are these states so high? One follow-up would be to compare accident count against state size. I suspect the bigger the state, the more cars, and therefore the more accidents.
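
One way to test the state-size idea would be to normalise the counts by population. The sketch below only illustrates the arithmetic: the state codes, accident counts, and populations are placeholders, not real figures, and a real analysis would need an external population table that this notebook does not load.

```python
import pandas as pd

# accident counts and populations below are placeholders, not real figures
counts = pd.Series({'AA': 1200, 'BB': 300, 'CC': 900}, name='Accidents')
population = pd.Series({'AA': 4_000_000, 'BB': 500_000, 'CC': 9_000_000}, name='Population')

# align the two series on the state code
per_state = pd.concat([counts, population], axis=1)
# accidents per 100,000 residents
per_state['Per_100k'] = per_state['Accidents'] / per_state['Population'] * 100_000
print(per_state.sort_values('Per_100k', ascending=False))
```

In this toy example the state with the fewest accidents has the highest rate, which is the kind of reordering a per-capita view can reveal.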

In [None]:
by_severity = df.groupby('Severity')['Count'].sum()

In [None]:
sns.countplot(x='Severity',data=df)

In [None]:
# Bivariate visualization of categorical variables

#create a frequency table of state against severity
cat_var = pd.crosstab(columns=df['Severity'],
    index=df['State'])

#plot a stacked plot
cat_var.plot(kind='bar',stacked=True,figsize=(16,8),color=['purple','orange','blue','red','green'])
plt.title('Stacked plot of Accident Severity in respective State')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x='Severity',y='Wind_Speed(mph)',data=df,hue='Severity')
plt.ylim(0,100)

In [None]:
# I used the median here because the boxplot shows so many outliers that I felt the mean would be skewed

df.groupby('Severity')['Wind_Speed(mph)'].median().plot(kind='bar')
plt.ylabel('Wind_Speed(mph)')
plt.title("Median 'Wind_Speed(mph)' by Severity")
plt.show()

<b>Is Wind_Speed a factor that influences accident Severity?</b> <br/>Honestly, from the plot it's inconclusive. The distribution looks similar across the severities. There actually seem to be more accidents at Severity 2 and 3 than at Severity 4.

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x='Severity',y='Wind_Chill(F)',data=df,hue='Severity')
plt.legend(loc='best')
plt.show()

In [None]:
df.groupby('Severity')['Wind_Chill(F)'].mean().plot(kind='bar')
plt.ylabel('Wind_Chill(F)')
plt.title("Average 'Wind_Chill(F)' by Severity")
plt.show()

Similar to Wind_Speed, the Wind_Chill distributions are inconclusive, except that most accidents happen when the wind chill is above 0˚F.

### Analysis of Boolean Columns

Bar plots of accidents by Severity for ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit',
       'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming',
       'Traffic_Signal', 'Turning_Loop']

In [None]:
def catplotter(col):
    x = df.groupby([col, 'Severity'])['Count'].sum().reset_index()
    sns.catplot(x="Severity", y="Count", col=col, data=x, kind="bar")  # keyword x/y: newer seaborn rejects positional data arguments
    plt.show()
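
Raw counts for these flags tend to be dominated by the `False` panel, which can make the two panels hard to compare. As a complementary view, the sketch below loops over a list of boolean columns and computes the share of each severity within each flag value. The toy rows are made up, and on the real data `bool_cols` would be the list of columns named above.

```python
import pandas as pd

# made-up rows standing in for the accidents data
toy = pd.DataFrame({
    'Severity':       [2, 2, 3, 3, 4, 2],
    'Junction':       [True, False, True, False, False, False],
    'Traffic_Signal': [False, True, False, False, False, True],
})

bool_cols = ['Junction', 'Traffic_Signal']
tables = {}
for col in bool_cols:
    # rows: flag value, columns: severity, cells: share of that flag's accidents
    tables[col] = pd.crosstab(toy[col], toy['Severity'], normalize='index')
    print(tables[col], end='\n\n')
```

Each row of a table sums to 1, so the `True` and `False` rows can be compared directly.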

In [None]:
catplotter('Roundabout')

In [None]:
catplotter('Bump')

In [None]:
catplotter('Amenity')

In [None]:
catplotter('Crossing')

In [None]:
catplotter('Give_Way')

In [None]:
catplotter('Junction')

In [None]:
catplotter('No_Exit')

In [None]:
catplotter('Railway')

In [None]:
catplotter('Station')

In [None]:
catplotter('Stop')

In [None]:
catplotter('Traffic_Signal')

In [None]:
catplotter('Turning_Loop')

In [None]:
catplotter('Side')

Most accidents happen on the right-hand side. That makes sense, since the U.S. drives on the right side of the road.

In [None]:
catplotter('Sunrise_Sunset')

In [None]:
catplotter('Civil_Twilight')

In [None]:
catplotter('Nautical_Twilight')

In [None]:
catplotter('Astronomical_Twilight')

Most accidents happen during the day time. 

In [None]:
df.sample(10)

In [None]:
# Severity Impact by Temperature
plt.figure(figsize = (16, 6))
sns.violinplot(y="Temperature(F)", x="Severity", data=df,width=0.6,linewidth=0.5)
plt.show()

In [None]:
# Severity Impact by Humidity 
plt.figure(figsize = (16, 6))
sns.violinplot(y="Humidity(%)", x="Severity", data=df,width=0.6,linewidth=0.5)
plt.show()

In [None]:
# Severity Impact by Precipitation(in) 
plt.figure(figsize = (16, 6))
sns.violinplot(y='Precipitation(in)', x="Severity", data=df,width=0.6,linewidth=0.5)
plt.show()

In [None]:
# Severity Impact by Pressure(in)
plt.figure(figsize = (16, 6))
sns.violinplot(y='Pressure(in)', x="Severity", data=df,width=0.6,linewidth=0.5)
plt.show()

In [None]:
# Top 10 weather condition
plt.figure(figsize = (15, 6))
df[df['Weather_Condition'] != 0]['Weather_Condition'].value_counts().iloc[:10].plot(
    kind='bar',color=['b','k','g','r','c','violet','lime','y','m','purple'])
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Hour',data=df)
df.groupby('Hour')['Count'].value_counts().plot(color='k',linestyle='-',marker='.',linewidth=0.6)
plt.title('Count of Accidents by Hour')
plt.xticks(np.arange(0,24),np.arange(0,24),rotation=90)
plt.xlabel('Hour')
plt.show()

In [None]:
x = pd.crosstab(index=df['Hour'],columns=df['Severity'])
x.plot(kind='bar',stacked=True, color=['b','k','g','r','c'],figsize=(12,6))
plt.show()

In [None]:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
severity_2 = df[df['Severity']==2]['Description']
severity_3 = df[df['Severity']==3]['Description']
severity_4 = df[df['Severity']==4]['Description']

In [None]:
desc_2 = severity_2.str.split("(").str[0].value_counts().keys()
wc_desc_2 = WordCloud(scale=5,max_words=100,colormap="rainbow",background_color="white").generate(" ".join(desc_2))

desc_3 = severity_3.str.split("!").str[0].value_counts().keys()
wc_desc_3 = WordCloud(scale=5,max_words=100,colormap="rainbow",background_color="white").generate(" ".join(desc_3))

desc_4 = severity_4.str.split("!").str[0].value_counts().keys()
wc_desc_4 = WordCloud(scale=5,max_words=100,colormap="rainbow",background_color="white").generate(" ".join(desc_4))

In [None]:
fig, axs = plt.subplots(1,3,sharey=True,figsize=(17,14))

axs[0].imshow(wc_desc_2,interpolation="bilinear")
axs[1].imshow(wc_desc_3,interpolation="bilinear")
axs[2].imshow(wc_desc_4,interpolation="bilinear")

axs[0].axis("off")
axs[1].axis("off")
axs[2].axis("off")

axs[0].set_title('Severity 2 Accidents')
axs[1].set_title('Severity 3 Accidents')
axs[2].set_title('Severity 4 Accidents')

plt.show()

For one thing, these word clouds show how accident severity relates to the impact on road usage. Severity 2 and 3 accidents have a broadly similar impact (blocked lanes or shoulder), while Severity 4 accidents lead to the closure of the road entirely.
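
A word cloud is hard to read precisely, so the same comparison can be backed up with plain word counts per severity. The sketch below is a minimal version on made-up descriptions; `top_words` and its small stopword set are hypothetical choices for illustration, not what the `wordcloud` package does internally.

```python
from collections import Counter
import pandas as pd

# made-up descriptions in the style of the dataset
toy = pd.DataFrame({
    'Severity': [2, 2, 3, 4, 4],
    'Description': [
        'Right lane blocked due to accident',
        'Accident on shoulder',
        'Two lanes blocked due to accident',
        'Road closed due to accident',
        'Closed road due to accident',
    ],
})

def top_words(descriptions, n=3, stopwords=frozenset({'to', 'on', 'due', 'accident'})):
    """Count lower-cased words across descriptions, ignoring a small stopword set."""
    words = (w for text in descriptions for w in text.lower().split())
    return Counter(w for w in words if w not in stopwords).most_common(n)

for severity, group in toy.groupby('Severity'):
    print(severity, top_words(group['Description']))
```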

### Plotting Long and Lat using Folium

In [None]:
import folium

In [None]:
df.sample(3)

In [None]:
w = df.groupby(['State'])['Count'].sum().reset_index()

In [None]:
state_geo = '/kaggle/input/usa-states/usa-states.json'

In [None]:
n = folium.Map(location=[39.381266, -97.922211],zoom_start=5)
folium.Choropleth(
 geo_data=state_geo,
 data=w,
 columns=['State', 'Count'],
 key_on='feature.id',
 fill_color='YlOrRd',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Accidents'
).add_to(n)
n

Choropleth Map of the US showing what we've seen with Barplots earlier - US states according to accident count.
- California has the highest count
- Texas is second.

Two things I have learned from all this:
- folium.CircleMarker seems to run into problems when plotting a lot of data points, and the limit seems to differ from machine to machine. On my machine, I couldn't plot more than 40k data points, and that was when plotting California data only.
- I am a bit torn about the Choropleth map. I feel the State view is too high-level, but I don't see how to plot a more granular map, which I feel would be more appropriate when analyzing at State level rather than country level. Unless someone has any ideas.
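
One possible way around both points is to aggregate accidents into a coarse latitude/longitude grid before plotting, so the map needs one marker per cell rather than one per accident. The sketch below shows only the aggregation step, on made-up coordinates. The `Start_Lat` and `Start_Lng` column names and the one-decimal cell size (about 11 km north to south) are assumptions to adjust. folium's `HeatMap` or `MarkerCluster` plugins may also be worth trying, though I have not tested them on this data.

```python
import pandas as pd

# made-up coordinates: three points in one cell, two in another
toy = pd.DataFrame({
    'Start_Lat': [34.06, 34.08, 34.12, 37.774, 37.779],
    'Start_Lng': [-118.243, -118.241, -118.244, -122.419, -122.421],
})

# round to one decimal place and count accidents per grid cell
cells = (
    toy.assign(lat=toy['Start_Lat'].round(1), lng=toy['Start_Lng'].round(1))
       .groupby(['lat', 'lng'])
       .size()
       .reset_index(name='Count')
)
print(cells)
```

Each row of `cells` could then be drawn as a single marker sized or coloured by `Count`.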


If you have any suggestions, or you like this, please let me know. I would really appreciate it!