---

<p style="font-family:Papyrus ;font-size:2em; color:MediumSlateBlue" >
US Accidents Exploratory Analysis on New York Map

</p>

---

In this kernel, I explore the US accidents data from Feb 2016 to Dec 2020 and work through the objectives listed below. If you like the work here, please upvote; I would love to hear comments and suggestions for improvement. Thanks!


***Special thanks to the contributors of this dataset, Sobhan Moosavi and team, for compiling and sharing this amazing dataset, which offers so many excellent real-world features around US accidents.***

<br>

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:mediumVioletRed; font-style:bold" >
Objectives : 
</p>

The primary goal of the project is to analyze and generate insights on the traffic accidents that took place in the USA between 2016 and 2020. The first part of the analysis examines country-wide aspects. In the second part, New York data is examined more closely. Specifically - 

**Country-wide Analysis :**
    1. Identify top 10 States and top 10 Cities with most accidents
    2. Analyze accident trends over time (month over month, year over year, day of the week)
    3. What time do most of the accidents occur ?
    4. Is there a change in accident severity levels over the years ? 
    5. Do weather conditions have an effect on accidents ?
    6. When do most delays happen after the accidents ?
    7. Examine accident severity correlations with available features (like temperature, visibility, day or night etc.)
    
**New York City Analysis :**
    1. Compare New York Accidents for week days and hours of the day
    2. Which streets in New York have the most accidents ?
    3. Present the Accidents on the NY City map
    


---

<p style="font-family:Segoe Print ;font-size:1.5em; color:mediumVioletRed; font-style:bold" >
Understanding Source Dataset
</p>


The source data for the analysis is a collection of real-time traffic accidents reported by a number of traffic monitoring APIs over the period from Feb 2016 to Dec 2020. The dataset contains about 4.2 million accident records and 49 columns describing each accident.

To understand the data better and to support further analysis, I have classified all the features into the subject categories below:


**Record/source API identifiers :**
            
        ID, Source, TMC 

**Accident properties :**
        
        Severity, Start_Time, End_Time, Start_Lat, Start_Lng, End_Lat, End_Lng, Distance(mi)

**Location properties :**
        
        Description, Number, Street, Side, City, County, State, Zipcode, Country, Timezone, Airport_Code

**Weather Condition Properties :**
        
        Weather_Timestamp, Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in), Weather_Condition, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight

**Nearby landmark properties :**
        
        Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop


<br>



**Dataset Link** : https://www.kaggle.com/sobhanmoosavi/us-accidents

------


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Initial Cleanup of the data
</p>

---

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go
import warnings

init_notebook_mode(connected=True)

warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
data_file = r'/kaggle/input/us-accidents/US_Accidents_Dec20.csv'
df = pd.read_csv(data_file)
df.columns

This is a big dataset with plenty of fields to consider in the analysis. However, some of the fields are more than this analysis needs.
Before going further, I want to remove some of the fields to lighten the dataframe. The fields I am dropping are :

* 'End_Lat', 'End_Lng' - Latitude and longitude at the end of the accident (we already have the start coordinates to use for this purpose).

* 'Number' - Street number of the address.

* 'Airport_Code' - Nearest airport code.

* 'Country' - All data is for the USA.

* 'Weather_Timestamp' , 'TMC' , 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'ID', 'Source'

* Timezone 
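
As a rough illustration of why trimming columns helps, here is a minimal sketch on a tiny made-up frame. Only the column names come from the dataset; the values are invented, and `errors="ignore"` is an optional safeguard that is an assumption on my part, not something the notebook relies on.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are invented for illustration.
toy = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3"],
    "Severity": [2, 3, 2],
    "Country": ["US", "US", "US"],
    "Timezone": ["US/Eastern", "US/Pacific", "US/Central"],
})

before = toy.memory_usage(deep=True).sum()

# errors="ignore" skips names that are absent ('TMC' here) instead of raising KeyError.
slim = toy.drop(columns=["ID", "Country", "Timezone", "TMC"], errors="ignore")

after = slim.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

On the full dataset the saving should be far larger, since the text columns dominate memory use.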




In [None]:
df.drop(columns=['End_Lat', 'End_Lng' ,'Number', 'Airport_Code' ,'Weather_Timestamp' , 'TMC' , 
                 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight',
                 'Country','ID', 'Source','Timezone'], inplace=True)


I am creating a reusable function that I can run on any dataframe to perform a quick, basic sanity check of the data.

In [None]:
from pprint import pprint
def sanity_check(df):
    pprint('-'*70)
    pprint('No. of Row : {0[0]}        No. of Columns : {0[1]}'.format(df.shape))
    pprint('-'*70)
    data_profile = pd.DataFrame(df.dtypes.reset_index()).rename(columns = {'index' : 'Attribute' ,
                                                                           0 : 'DataType'}).set_index('Attribute')
    data_profile = pd.concat([data_profile,df.isnull().sum()], axis=1).rename(columns = {0 : 'Missing Values'})
    data_profile = pd.concat([data_profile,df.nunique()], axis=1).rename(columns = {0 : 'Unique Values'})
    pprint(data_profile)
    pprint('-'*70)

#### Our analysis focuses on the locations and times of the accidents. The 'City' column is null in 137 rows (very low compared to the total row count), so we will remove these rows from the dataset.

Similarly, the Sunrise_Sunset and Description columns each have fewer than 10 null rows, so those rows are dropped as well.
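
One way to back up a "very low" claim like this is to look at the share of missing values per column before dropping anything. A minimal sketch on invented data (the column names match the dataset, the rows do not):

```python
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "City": ["Houston", None, "Dallas", "Austin"],
    "Sunrise_Sunset": ["Day", "Night", "Day", "Day"],
    "Severity": [2, 3, 2, 4],
})

# Share of missing values per column, as a percentage of all rows.
null_pct = toy.isnull().mean().mul(100).round(2)
print(null_pct)

# Dropping rows costs little when the missing share is tiny.
cleaned = toy.dropna(subset=["City", "Sunrise_Sunset"])
print(len(toy), "->", len(cleaned), "rows")
```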

In [None]:
df.dropna(subset=['City','Sunrise_Sunset','Description'], inplace=True)

#### Converting the date columns from object datatypes to date datatype

In [None]:
df['Start_Time'] = pd.to_datetime(df['Start_Time'])
df['End_Time'] = pd.to_datetime(df['End_Time'])

#### It's time to run our reusable sanity check function on the cleaned dataframe.

In [None]:
sanity_check(df)

All the initial setup of data looks ready for our analysis now. 

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Top 10 States and Top 10 Cities with most accidents
</p>

---

In [None]:
top_10_state = df[['City','State' , 'Severity']].groupby('State').agg({'City' : 'count' , 
                                                       'Severity' : 'mean' }).sort_values(
    by='City',ascending=False).head(10)

In [None]:
df_state_city = df[['State' , 'City','Severity']].groupby(['State' , 'City']).count().rename(columns = {'Severity' : 'Count'})

top_10_city = df_state_city.sort_values(by='Count' , ascending = False).head(10)

In [None]:
fig , (ax1, ax2) = plt.subplots(1,2,figsize=(14,4))

bar = sns.barplot(x=top_10_state.index , y=top_10_state['City'],
                  palette='nipy_spectral_r' , 
#                   palette='pastel' , 
                  edgecolor = 'black',
                  ax=ax1 )
sns.despine(left = True )
ax1.set_xlabel("State")
ax1.set_ylabel("No. of Accidents" , fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
ax1.set_title('Top 10 Accident States in US', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
# ax3=ax1.twinx()
# ax3.plot(top_10_state['Severity'] ,'o-', color='lightgray')
# ax3.set_ylabel('Severity')


bar = sns.barplot(x=top_10_city.index.get_level_values(1) , y=top_10_city['Count'],
                  palette='nipy_spectral' , 
#                   palette='pastel' , 
                  edgecolor = 'black',
                  ax=ax2
                 )
sns.despine(left = True )
ax2.set_xlabel("City" )
ax2.set_ylabel("No. of Accidents")
ax2.set_title('Top 10 Accident Cities in US', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
plt.xticks(rotation = 45)


# Working to get labels for percentages
total_accidents = len(df)

# for state
for p in ax1.patches :
    height = p.get_height()
    ax1.text(p.get_x() + p.get_width()/2,
            height + 20000,
            '{:.2f}%'.format(height/total_accidents*100),
            ha = "center",
            fontsize = 8, color='indianred')

    
# for City
for p in ax2.patches :
    height = p.get_height()
    ax2.text(p.get_x() + p.get_width()/2,
            height + 3000,
            '{:.2f}%'.format(height/total_accidents*100),
            ha = "center",
            fontsize = 8, color='indianred')
    
    
fig.show()



>We can see that CA is the state with the most accidents over the period covered (Feb 2016 to Dec 2020) - close to 23% of all the accidents in the country. 

>Texas has the second highest number of accidents. 

>When it comes to cities, Houston and Dallas (both TX) are in the list of top 5 cities by reported accidents.
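
The percentage labels on the bars can also be obtained directly from pandas. A small sketch, using an invented list of states rather than the real data:

```python
import pandas as pd

# Invented sample of state codes, for illustration only.
states = pd.Series(["CA", "CA", "TX", "CA", "FL", "TX", "NY", "CA"], name="State")

# normalize=True returns fractions of the total instead of raw counts.
share = states.value_counts(normalize=True).mul(100).round(1)
top_2 = share.head(2)
print(top_2)
```

Because `value_counts` sorts in descending order, `head(n)` gives the top `n` categories together with their share of all rows.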


---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Accidents trends on time series
</p>

---


#### We will set up a few time series here and use them later to build various views of the accidents
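
One point worth keeping in mind when deriving durations from two timestamps: on a timedelta column, `.dt.seconds` returns only the sub-day remainder, while `.dt.total_seconds()` includes whole days. A minimal sketch with two invented records:

```python
import pandas as pd

# Two invented accidents; the second one spans more than a day.
toy = pd.DataFrame({
    "Start_Time": pd.to_datetime(["2020-01-06 08:15:00", "2020-01-11 23:30:00"]),
    "End_Time": pd.to_datetime(["2020-01-06 09:45:00", "2020-01-13 01:30:00"]),
})

toy["Hour"] = toy["Start_Time"].dt.hour
toy["DayName"] = toy["Start_Time"].dt.day_name()

diff = toy["End_Time"] - toy["Start_Time"]
# .dt.seconds ignores the days component; .dt.total_seconds() does not.
toy["RemainderHours"] = diff.dt.seconds / 3600
toy["DurationHours"] = diff.dt.total_seconds() / 3600
print(toy[["DayName", "RemainderHours", "DurationHours"]])
```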

In [None]:
# Creating Date Time series attributes


df['Year'] = df['Start_Time'].dt.year
df['Month'] = df['Start_Time'].dt.month  # .dt.month_name()
df['Hour'] = df['Start_Time'].dt.hour
diff = df['End_Time'] - df['Start_Time']
df['DelayTime'] = round(diff.dt.total_seconds()/3600,1)  # total_seconds() keeps whole days; .dt.seconds would drop them
year = df['Year'].value_counts()
month = df['Month'].value_counts().sort_index()
month_map = {1:'Jan' , 2:'Feb' , 3:'Mar' , 4:'Apr' , 5:'May' , 6:'Jun', 7:'Jul' , 8:'Aug' 
             , 9:'Sep',10:'Oct' , 11:'Nov' , 12:'Dec'}

hour_severity = df[['Hour' , 'Severity']].groupby('Hour').agg({'Hour' : 'count' , 'Severity' : 'mean'})

df['Day'] = df['Start_Time'].dt.dayofweek
day_severity = df[['Day' , 'Severity']].groupby('Day').agg({'Day' : 'count' , 'Severity' : 'mean'})
day_map = {0:'Monday' , 1:'Tuesday' , 2:'Wednesday' , 3:"Thursday" , 4:'Friday' , 5:"Saturday" , 6:'Sunday'}


# df['Month'].head()



#### Visualizing the time series plots


In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(14,5))


# plot for year

light_palette = sns.color_palette(palette='pastel')

year_color_map = ['Lavender' for _ in range(5)]
year_color_map[0] = 'LightCoral' #light_palette[0]
year_color_map[4] = 'PaleGreen' #light_palette[2]

years = ax1.bar(year.index.values , year, color=year_color_map , edgecolor = 'black')
ax1.spines[('top')].set_visible(False)
ax1.spines[('right')].set_visible(False)
ax1.set_xlabel("Years", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax1.set_ylabel("No. of Accidents")
ax1.set_title('Accidents per Years', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})

for p in ax1.patches :
    height = p.get_height()
    ax1.text(p.get_x() + p.get_width()/2,
            height + 20000,
            '{:.2f}%'.format(height/total_accidents*100),
            ha = "center",
            fontsize = 8, color='Blue')

    
# plot for month


month_color_map = ['Lavender' for _ in range(12)]
month_color_map[11] = 'LightCoral' #light_palette[0]
month_color_map[6] = 'PaleGreen' #light_palette[2]

m = sns.barplot( x= month.index.map(month_map), y=month,  ax = ax2, palette=month_color_map , edgecolor='black' )
plt.xticks(rotation=60)
ax2.set_xlabel("Months", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax2.set_ylabel("No. of Accidents")
ax2.set_title('Accidents per Months', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
sns.despine(left=True)

for p in ax2.patches :
    height = p.get_height()
    ax2.text(p.get_x() + p.get_width()/2,
            height + 8000,
            '{:.2f}%'.format(height/total_accidents*100),
            ha = "center",
            fontsize = 8, color='blue')

ax1.grid(axis='y', linestyle='-', alpha=0.4)    
ax2.grid(axis='y', linestyle='-', alpha=0.4) 
    
plt.show()


#### Continuing to view the data on different time scale - Week days

In [None]:
fig, (ax , ax2, ax3) = plt.subplots(1,3,figsize = (16,6))

sns.set_context('paper')

# f = sns.lineplot(x=day_severity['Day'].index.map(day_map) , y=day_severity['Severity'], 
#                  ax = ax,  label='Severity', legend = 'full' , dashes=True, palette=light_palette, color='red')


ax.plot(day_severity['Severity'] ,  color='Turquoise', label=day_map,linewidth=3,
           linestyle='solid',marker='.',markersize=18, markerfacecolor='w',markeredgecolor='b',markeredgewidth='2')


ax.set_xlabel("Days of the week", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax.set_ylabel("Severity Level")
ax.set_title('Severity by day of week', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})


ax2.plot(day_severity['Day'] ,  color='Turquoise', label=day_map,linewidth=3,
           linestyle='solid',marker='.',markersize=18, markerfacecolor='w',markeredgecolor='b',markeredgewidth='2')

ax2.set_xlabel("Days of the week", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax2.set_ylabel("No. of Accidents")
ax2.set_title('Accidents Count by day', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})

f2 = sns.barplot(x=day_severity['Day'].index.map(day_map) , y=day_severity['Day'], ax = ax3, palette = 'nipy_spectral_r')
plt.xticks(rotation=60)
ax3.set_xlabel("Days of the week", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax3.set_title('Accidents count on days of week', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})

sns.despine(left=True)

fig.show()

>Saturday and Sunday usually see fewer accidents. However, the severity of the accidents occurring on Saturday and Sunday is comparatively higher.
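
The count and mean severity per weekday behind this observation can be computed in one step with named aggregation. A minimal sketch on invented rows; `Accidents` and `MeanSeverity` are illustrative column names of my own:

```python
import pandas as pd

# Invented records: two on a Monday, one on a Tuesday, one on a Saturday.
toy = pd.DataFrame({
    "Start_Time": pd.to_datetime([
        "2020-01-06 08:00", "2020-01-06 17:00", "2020-01-07 09:00", "2020-01-11 02:00",
    ]),
    "Severity": [2, 2, 3, 4],
})

by_day = (
    toy.assign(DayName=toy["Start_Time"].dt.day_name())
       .groupby("DayName")
       .agg(Accidents=("Severity", "size"), MeanSeverity=("Severity", "mean"))
)
print(by_day)
```

Using `day_name()` also avoids maintaining a hand-written mapping from day numbers to names.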

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
What time do most of the accidents occur ?
</p>

---

In [None]:

fig, ax = plt.subplots(1,1,figsize = (14,6))

sns.set_context('paper')

# ax.plot(hour_severity['Hour'], color='Salmon' , linewidth=3, linestyle='solid',
#         marker='*',markersize=18, markerfacecolor='w',markeredgecolor='m',markeredgewidth='2',
#         label = 'No. of Accidents'
#        )


f = sns.barplot(x=hour_severity['Hour'].index , y=hour_severity['Hour'], ax = ax, palette='Pastel2')

ax2 = ax.twinx()

ax2.plot(hour_severity['Severity'] , color='CornFlowerBlue', label='Severity',linewidth=3,
           linestyle='solid',marker='.',markersize=18, markerfacecolor='w',markeredgecolor='b',markeredgewidth='2')

sns.despine(left=True)
# ax.spines[('top')].set_visible(False)
# ax.spines[('right')].set_visible(False)
# ax.spines[('left')].set_visible(False)
ax2.spines[('top')].set_visible(False)
ax2.spines[('right')].set_visible(False)
ax2.spines[('left')].set_visible(False)
ax.set_xlabel("Hours of the Day", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax.set_ylabel("No. of Accidents")
ax2.set_ylabel("Severity of Accidents", rotation=270 ,labelpad=20)
ax.set_title('Accidents and Severity per Hour of the day', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
# ax.legend(loc=(0,1))
ax2.legend(loc=(0,0.8))

ax.annotate('Morning office rush' , xytext=(3,150000) , xy=(7,5000),arrowprops={'arrowstyle':'fancy' , 'color':'Red'})
ax.annotate('Office Returning rush' , xytext=(19,150000),xy=(16,5000),arrowprops={'arrowstyle':'fancy', 'color':'Red'})

fig.show()


>**7 AM to 9 AM in the morning and 4 PM to 6 PM in the evening are the prime hours when most of the accidents happened.**  (Looks like when the pandemic is over, I will have to change my office commute start time, just to be safer :-) )

>**Even so, the accidents occurring between 3 AM and 5 AM tend to be considerably more severe. Likewise, accidents occurring between 8 PM and 9 PM tend to be of high severity.**




---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Is there a change in accident severity levels over the years ?
</p>

---


In [None]:
sev_4_mean = df[df['Severity'] == 4][['Severity','Year']].groupby('Year').count().mean()
sev_4_mean.iloc[0]

In [None]:
f , (ax1,ax2) = plt.subplots(1,2,figsize=(16,6))

# Build the labels from the counts' own index so they always match the slices
sev_counts = df['Severity'].value_counts()
sev_counts.plot.pie(autopct = '%1.1f%%' , ax=ax1, colors =sns.color_palette(palette='Pastel1') ,
                    pctdistance = 0.8, explode = [.03] * len(sev_counts), 
                    textprops = {'fontsize' : 12 , 'color' : 'DarkSlateBlue'},
                    labels=[f'Severity {s}' for s in sev_counts.index]
                   )

ax1.set_title("Accidents Severity", fontdict = {'fontsize':16 , 'color':'MidnightBlue'} )


s = sns.countplot(data=df[['Severity','Year']] , x = 'Year' , hue='Severity' , ax=ax2, palette = 'rainbow' 
                  , edgecolor='black')
ax2.axhline(sev_4_mean.iloc[0] ,color='Blue', linewidth=1, linestyle='dashdot')
ax2.annotate(f"Severity 4 mean : {sev_4_mean.iloc[0]:,.0f}",
            va = 'center', ha='center',
            color='#4a4a4a',
            bbox=dict(boxstyle='round', pad=0.4, facecolor='Wheat', linewidth=0),
            xy=(1,80000))

ax2.set_title("Severity levels by years", fontdict = {'fontsize':16 , 'color':'MidnightBlue'} )

sns.despine(left=True)

>**Most of accidents fall in Severity 2 category (71%)**

>Examining the severity levels over the years, Severity 4 accidents have stayed in roughly the same range. However, Severity 2 and Severity 1 accidents have increased sharply year over year, while Severity 3 accidents are decreasing. **One possible reading is that measures taken by road governance departments across the USA in the last two years are helping shift accidents from Severity 3 down to Severity 1 and Severity 2 (kudos), although the counts alone cannot confirm the cause.**
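
Since the total number of recorded accidents changes from year to year, it can help to compare each severity level as a share of that year's accidents rather than as a raw count. A sketch with invented numbers:

```python
import pandas as pd

# Invented records: 4 accidents in 2019 and 8 in 2020.
toy = pd.DataFrame({
    "Year":     [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    "Severity": [2,    2,    3,    4,    2,    2,    2,    2,    2,    2,    3,    4],
})

# normalize="index" turns each year's row into proportions that sum to 1.
share = pd.crosstab(toy["Year"], toy["Severity"], normalize="index").mul(100).round(1)
print(share)
```

In this toy example the Severity 4 count is the same in both years, yet its share halves because the yearly total doubles.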


---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Examining the Severity association with Temperature, Humidity and Pressure
</p>

---

In [None]:
pair = sns.pairplot(df[['Severity','Temperature(F)','Humidity(%)','Pressure(in)']].dropna(), hue='Severity', palette='nipy_spectral')
# pair = sns.pairplot(df[['Severity','Temperature(F)']].dropna(), hue='Severity', palette='nipy_spectral')

pair.fig.suptitle('Distribution of Temp , Humidity and Pressure with Severity', y =1.08 
                  , fontsize = 16 , color = 'MidnightBlue' , ha = 'center' , va='top')

plt.show()



>Most of the accidents (irrespective of severity level) occur in the temperature range of 50 to 80 F. This does not offer much insight on its own.

>**When the humidity histogram is reviewed, we can see that high humidity (80 to 100) is common for Severity 1, 2 and 3 accidents. This suggests that weather associated with rainy conditions is more prone to accidents.**

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Do Weather Conditions have effect on Accidents ?
</p>

---

We will generalize the wide variety of available weather conditions into broader categories. That helps us avoid micro-noise and focus on the weather conditions that contribute most to accidents.
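
The same kind of keyword grouping can be written as a small rule table, which makes the precedence between categories easy to see. This is only a sketch of the idea: `RULES` and `generalize` are hypothetical names of my own, and plain substring matching is an assumption that can over-match (for example, "ice" would also match inside unrelated words).

```python
# Ordered (keywords, category) rules: the first rule that matches wins.
RULES = [
    (("snow", "ice"), "Snow Situation"),
    (("rain", "drizzle", "thunder"), "Rainy Situation"),
    (("storm",), "Storm Situation"),
    (("cloud",), "Cloudy"),
    (("fog",), "Fog"),
    (("dust",), "Dust"),
    (("wind",), "Windy"),
]


def generalize(condition: str) -> str:
    """Return the broad category for a raw weather string (hypothetical helper)."""
    lowered = condition.lower()
    for keywords, category in RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return condition  # unmatched conditions keep their original label


for raw in ["Light Rain", "Mostly Cloudy", "Heavy Snow", "Smoke"]:
    print(raw, "->", generalize(raw))
```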

In [None]:
# Generalization of Weather condition

conditions = df['Weather_Condition'].dropna().unique().tolist()

condition_map = dict()


for x in conditions :
    c = x.lower()  # lower-case once; `in` also matches a keyword at position 0
    if 'snow' in c or 'ice' in c:
        condition_map[x] = 'Snow Situation'
    elif 'rain' in c or 'drizzle' in c or 'thunder' in c:
        # thunder conditions are grouped with rain
        condition_map[x] = 'Rainy Situation'
    elif 'storm' in c:
        condition_map[x] = 'Storm Situation'
    elif 'cloud' in c:
        condition_map[x] = 'Cloudy'
    elif 'fog' in c:
        condition_map[x] = 'Fog'
    elif 'dust' in c:
        condition_map[x] = 'Dust'
    elif 'wind' in c:
        condition_map[x] = 'Windy'
    else:
        condition_map[x] = x  # keep any unmatched condition as-is


df['Weather'] = df['Weather_Condition'].map(condition_map)
# df['Weather'].value_counts().sort_values(ascending=False).head(20)
total = len(df['Weather'])
# total
top_10_weather = df['Weather'].value_counts()[:10]
top_15_weather = df['Weather'].value_counts()[:13]

top_10_weather
# condition_map




In [None]:
def check_exist(x):
    if x in top_15_weather :
        return x
    else :
        return 'Other'

df['Weather2'] = df['Weather'].apply(check_exist)

cmap = {x:y for (x,y) in zip(top_10_weather.index , sns.color_palette('pastel'))}

Now analysing the weather conditions

In [None]:
# Analysing the 'Weather_Condition' attribute

fig,(ax,ax2) = plt.subplots(1,2,figsize = (16, 6))

sns.countplot(y='Weather', data=df[['Weather','Severity']], order=df['Weather'].value_counts()[:10].index, 
              palette=cmap , edgecolor = 'black' , 
              ax= ax)


ax.set_xlabel("Accident Counts", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax.set_ylabel("Top 10 Weather Conditions")
ax.set_title('Comparison of Weather Conditions', fontdict = {'fontsize':16 , 'color':'MidnightBlue'}, pad=15)


# Restrict the data to Severity 4 so the counts match the panel title
sev_4 = df[df['Severity'] == 4]
sns.countplot(y='Weather', data=sev_4, order=sev_4['Weather'].value_counts()[:10].index, 
              palette=cmap, edgecolor = 'black' ,  ax= ax2)

ax2.set_xlabel("Accident Counts", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
ax2.set_ylabel("Severity 4 Weather Conditions")
ax2.set_title('Severity-4 Weather distribution', fontdict = {'fontsize':16 , 'color':'MidnightBlue'}, pad=15)


# ax.grid(axis='y', linestyle='-', alpha=0.4) 
sns.despine(left=True)

plt.show()


>**The highest number of accidents occurred when weather conditions were cloudy. Rainy situations are among the top 5 conditions, but they are not the leading condition for accidents.**

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
When do most delays happen after the accidents ?
</p>

---


In [None]:
fig, ax = plt.subplots(1,1,figsize=(16,5))

w = sns.pointplot(y='DelayTime',x='Weather2',data=df[['Weather2','DelayTime','Severity']],
                  hue = 'Severity'
                  ,ci=None  , 
               order= top_10_weather.index,
               palette='nipy_spectral', ax= ax)  # figure size is set by plt.subplots above

ax.grid(axis='y', linestyle='-', alpha=0.4)  

# w = sns.lineplot(x='Weather2', y='DelayTime' , data=df[['Weather2','DelayTime']] , hue_order= top_15_weather.index)

plt.xlabel("Weather conditions", fontdict = {'fontsize':12 , 'color':'MidnightBlue'} )
plt.xticks(fontsize=12 , rotation = 45)
plt.ylabel("Delay Times (in Hours)")

ax.set_title('Delay times for different Weather Conditions', fontdict = {'fontsize':16 , 'color':'MidnightBlue'}, pad=15)

plt.show()


>**The longest traffic delays due to accidents occur when conditions are smoky (likely wildfire or other fire hazard scenarios).**

>**Delays in Rainy and Snow Situations fall within the usual range.**
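
Average delays can be pulled up by a handful of very long incidents, so it may be worth comparing the mean with the median per weather category. A minimal sketch on invented delay values:

```python
import pandas as pd

# Invented delays in hours; one long outlier in the 'Smoke' group.
toy = pd.DataFrame({
    "Weather2": ["Fair", "Fair", "Fair", "Smoke", "Smoke", "Smoke"],
    "DelayTime": [0.5, 0.7, 0.6, 0.5, 0.6, 12.0],
})

summary = toy.groupby("Weather2")["DelayTime"].agg(["mean", "median", "count"]).round(2)
print(summary)
```

In this toy data both groups have the same median, and the gap in the means comes from a single record.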

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
What are common factors associated with high severity of the accidents ?
</p>

---

In [None]:
df['Severity'] = df['Severity'].astype('int')

In [None]:
# plotting correlations on a heatmap

features = ['Severity','Temperature(F)', 'Humidity(%)', 
       'Pressure(in)', 'Visibility(mi)', 'Wind_Direction', 'Wind_Speed(mph)',
       'Precipitation(in)', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 
       'Sunrise_Sunset']

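# Keep numeric/boolean columns only; text columns such as 'Wind_Direction' and
# 'Sunrise_Sunset' cannot be correlated directly.
corr = df[features].select_dtypes(include=['number', 'bool']).corr()

mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias is not available in every NumPy release
mask[np.triu_indices_from(mask)] = True

# [df['Severity'] == 4]

plt.figure(figsize=(16,12))
sns.heatmap(corr, cmap=sns.diverging_palette(240, 10, as_cmap=True), square=True, 
            annot=True, fmt='.2f', center=0, linewidth=2, cbar=True , mask = mask)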


plt.show()

>Features like Traffic_Signal and Crossing have negative correlations with accident severity.

>Surprisingly, visibility does not have any significant correlation with accident severity.
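
Text columns such as Sunrise_Sunset only take part in a correlation once they are encoded as numbers. One simple option is a 0/1 flag. The sketch below uses invented rows, and `Is_Night` is a hypothetical helper column of my own:

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4, 3, 2],
    "Traffic_Signal": [True, False, True, False, False, True],
    "Sunrise_Sunset": ["Day", "Night", "Day", "Night", "Night", "Day"],
})

encoded = toy.assign(
    Traffic_Signal=toy["Traffic_Signal"].astype(int),
    Is_Night=(toy["Sunrise_Sunset"] == "Night").astype(int),  # hypothetical 0/1 flag
).drop(columns="Sunrise_Sunset")

corr_with_severity = encoded.corr()["Severity"].drop("Severity")
print(corr_with_severity)
```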

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Zeroing in on New York City accidents
</p>

---


In [None]:
df_nyc = df[df['City'] == 'New York']

year = df_nyc['Year'].value_counts()
month = df_nyc['Month'].value_counts().sort_index()
month_map = {1:'Jan' , 2:'Feb' , 3:'Mar' , 4:'Apr' , 5:'May' , 6:'Jun', 7:'Jul' , 8:'Aug' 
             , 9:'Sep',10:'Oct' , 11:'Nov' , 12:'Dec'}
hour = df_nyc['Hour'].value_counts().sort_index()

hour_severity = df_nyc[['Hour' , 'Severity']].groupby('Hour').agg({'Hour' : 'count' , 'Severity' : 'mean'})

# df_nyc['Day'] = df_nyc['Start_Time'].dt.dayofweek
day_severity = df_nyc['Day'].value_counts().sort_index()
# day_severity = df_nyc[['Day' , 'Severity']].groupby('Day').agg({'Day' : 'count' , 'Severity' : 'mean'})
day_map = {0:'Monday' , 1:'Tuesday' , 2:'Wednesday' , 3:"Thursday" , 4:'Friday' , 5:"Saturday" , 6:'Sunday'}
year_map = {x:x for x in year.index}
hour_map = {x:x for x in hour.index}

light_palette = sns.color_palette(palette='pastel')


#### Accidents in NYC on time series

Finally, since the same plot is needed repeatedly for different time scales, I created this reusable function to plot the accidents against a time parameter. The function takes the frequency parameter (day, month, year or hour) and handles the plotting accordingly.
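
One possible variation on this idea is to derive the highlighted bars from the data instead of passing their positions by hand. The sketch below is an assumption of mine, not part of the notebook's function: `highlight_colors` is a hypothetical helper and the counts are invented.

```python
import pandas as pd


def highlight_colors(counts: pd.Series, base="AliceBlue", high="Crimson", low="SpringGreen"):
    """Return one colour per bar, marking the largest and smallest values (hypothetical helper)."""
    colors = [base] * len(counts)
    colors[counts.values.argmax()] = high
    colors[counts.values.argmin()] = low
    return colors


# Invented accident counts for Monday (0) through Sunday (6).
day_counts = pd.Series([120, 110, 115, 118, 130, 60, 75], index=range(7))
colors = highlight_colors(day_counts)
print(colors)
```

The resulting list could be passed as the `palette` of a bar plot, provided the bars are drawn in the same order as the series.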

In [None]:
fig,([ax1,ax2],[ax3,ax4]) = plt.subplots(2,2,figsize=(16,9))

def plot_dist(kind  , text ,axis,  red, green  ) :
    '''
    Reusable function to plot distribution based on input time criteria
    Usage : plot_dist(kind, text, axis, red , green) - all params mandatory
    
        kind : 'd' for day, 'm' for month , 'y' for year, 'h' for hour
        red  : list of item to be rendered red (max)
        green : list of item to be rendered green (min)
        text : Text to be showns as part of Title
        axis : Axis to plot on  
    '''
    # label_map (rather than `map`) avoids shadowing the built-in
    if kind == 'd' :
        tot, ser, label_map = 7, day_severity ,  day_map
    elif kind == 'm':
        tot, ser, label_map = 12, month ,  month_map
    elif kind == 'y':
        tot, ser, label_map = 5, year ,  year_map
    elif kind == 'h':
        tot, ser, label_map = 24, hour ,  hour_map
    else:
        raise ValueError(f"Unknown kind: {kind!r}")
    
    day_color_map = ['AliceBlue' for _ in range(tot)]
    for r in red:
        day_color_map[r] = 'Crimson' 
    for g in green:
        day_color_map[g] = 'SpringGreen' 
    
    d = sns.barplot(x=ser.index.map(label_map) , y=ser, ax = axis, palette = day_color_map, edgecolor='black' )
    # Rotate the tick labels of this subplot; plt.xticks only affects the current axes
    axis.tick_params(axis='x', rotation=60)
    axis.set_xlabel(text, fontdict = {'fontsize':12 , 'color':'MediumVioletRed'} )
    axis.set_title(f'Accidents count on {text}', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
    axis.grid(axis='y', linestyle='-', alpha=0.4) 
    
plt.subplots_adjust(wspace=0.2 , hspace = 0.4)
plt.suptitle("New York City Accidents on Timeseries" , fontsize = 18 , color="RosyBrown")

plot_dist('d' ,"Days of the week", ax3,[0],[5])
plot_dist('y' ,"Years", ax1,[4],[0])
plot_dist('m' ,"Months", ax2, [10],[0])
plot_dist('h' ,"Hours", ax4,red=[7,16],green=[2])
plt.show()

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Which streets in NYC have the most accidents ?
</p>

---

In [None]:
top_st = df_nyc['Street'].value_counts().sort_values(ascending=False).head(10).index.tolist()

In [None]:
top_st_severity = df_nyc[df_nyc['Street'].isin(top_st)][['Street' , 'Severity']] .groupby('Street').mean()

top_st_delay = df_nyc[df_nyc['Street'].isin(top_st)][['Street' , 'DelayTime']] .groupby('Street').mean()

In [None]:
fig, (ax,ax2,ax3) = plt.subplots(3,1,figsize=(16,10), sharex=True)

fig.subplots_adjust(hspace=0)

# order=top_st keeps the bars in the same street order as the two line plots
sns.countplot(data = df_nyc[df_nyc['Street'].isin(top_st)][['Street' , 'Severity']] ,
              x='Street' , order=top_st, ax=ax3, palette='Pastel2',edgecolor = 'Black')
plt.xticks(rotation=30)

ax2.plot( top_st, top_st_severity.loc[top_st, 'Severity'], color='CornFlowerBlue', label='Severity',linewidth=3,
           linestyle='solid',marker='.',markersize=18, markerfacecolor='w',markeredgecolor='b',markeredgewidth=2)

ax.plot( top_st, top_st_delay.loc[top_st, 'DelayTime'], color='LightCoral', label='Delay',linewidth=3,
           linestyle='solid',marker='*',markersize=18, markerfacecolor='w',markeredgecolor='b',markeredgewidth=2)

ax.spines[('top')].set_visible(False)
ax.spines[('right')].set_visible(False)
ax2.spines[('right')].set_visible(False)
ax3.spines[('right')].set_visible(False)
ax3.set_xlabel("NYC Streets", fontdict = {'fontsize':14 , 'color':'Teal'} )
ax3.set_ylabel("No. of Accidents", fontdict = {'fontsize':12 , 'color':'MidnightBlue'})
ax2.set_ylabel("Severity of Accidents", fontdict = {'fontsize':12 , 'color':'MidnightBlue'})
ax.set_ylabel("Avg. Delay Times (Hours)", fontdict = {'fontsize':12 , 'color':'MidnightBlue'})
ax.set_title('Top NYC Streets - Accident ,  Severity and Delay', fontdict = {'fontsize':16 , 'color':'MidnightBlue'})
ax2.legend(loc=(0.01,0.8))
ax.legend(loc=(0.01,0.8))
ax.grid(axis='x', linestyle='-', alpha=0.4) 
ax2.grid(axis='x', linestyle='-', alpha=0.4) 
ax3.grid(axis='x', linestyle='-', alpha=0.4) 

plt.show()

---

<p style="font-family:Segoe Print ;font-size:1.5em; color:MidnightBlue; font-style:bold" >
Present the Accidents on the NY City map
</p>

---

In [None]:
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode,iplot,plot
init_notebook_mode(connected=True)

In [None]:
state_counts = df['State'].value_counts()  # count once and reuse for both arguments

fig = go.Figure(
    data=go.Choropleth(
        locations = state_counts.index, 
        z = state_counts.values.astype(float), 
        locationmode = 'USA-states', 
        colorscale = 'reds', 
        colorbar_title = " Accident Counts"), 
    
    layout=go.Layout(
        title_text='Accident Counts by State (Feb 2016 to Dec 2020)', 
        title_x=0.5, 
        font=dict(family='Calibri', size=14, color='MidnightBlue'), 
        geo_scope='usa'))

fig.show()

In [None]:
fig = px.density_mapbox(df_nyc, lat='Start_Lat', lon='Start_Lng', z='Severity', hover_name='Street', radius=5,
                        center=dict(lat=40.730610, lon=-73.935242), zoom=12,
                        mapbox_style="open-street-map", height=900)

fig.show()