# US Accidents Exploratory Data Analysis

This notebook explores the US Accidents dataset from Kaggle. The dataset contains information about accidents in the United States, and analyzing it can reveal patterns that are useful for preventing accidents.

In [None]:
pip install opendatasets --upgrade --quiet

In [None]:
import opendatasets as od

download_url = "https://www.kaggle.com/sobhanmoosavi/us-accidents"

od.download(download_url)

In [None]:
data_filename = "./us-accidents/US_Accidents_Dec20_updated.csv"

## Data Preparation and Cleaning
1) Load the file using Pandas

2) Look at some information about the data and the columns

3) Fix any missing or incorrect values

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("darkgrid")
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv(data_filename)
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
# Finding the number of numeric columns in this dataset

numerics = ["int16", "int32", "int64", "float16", "float32", "float64"]

numeric_df = df.select_dtypes(include = numerics)

len(numeric_df.columns)
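The explicit dtype list above does not cover every numeric dtype (for example `int8` or unsigned integers). A minimal sketch of an alternative, using a made-up toy frame whose column names and values are purely illustrative, is to pass the `"number"` selector to `select_dtypes`:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; these are not the real dataset columns.
toy_df = pd.DataFrame({
    "severity": np.array([2, 3, 2], dtype="int8"),  # not covered by the explicit dtype list
    "distance": [0.1, 1.5, 0.0],                     # float64
    "city": ["Dayton", "Miami", "Austin"],           # non-numeric
})

# "number" selects every numeric dtype, whatever its width.
toy_numeric = toy_df.select_dtypes(include="number")
numeric_column_count = len(toy_numeric.columns)
print(numeric_column_count)
```

Whether this changes the count for the accidents data depends on the dtypes pandas infers when reading the CSV.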

In [None]:
# Percentage of missing values per column

missing_percentages = round(100*df.isna().sum().sort_values(ascending = False)/len(df), 2)

missing_percentages

In [None]:
missing_percentages[missing_percentages != 0].plot.barh()
plt.show()
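The same percentages can also be computed with `isna().mean()`, and they can then be used to flag columns that are mostly empty. The sketch below runs on a made-up toy frame; the column names, values, and the 50% threshold are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy_df = pd.DataFrame({
    "city": ["Dayton", None, "Miami", "Austin"],
    "street_number": [np.nan, np.nan, np.nan, 12.0],
    "severity": [2, 3, 2, 4],
})

# isna().mean() is the fraction of missing values per column.
toy_missing = (100 * toy_df.isna().mean()).round(2).sort_values(ascending=False)

# Hypothetical cut-off: flag columns with more than half of their values missing.
MISSING_THRESHOLD = 50
mostly_missing = toy_missing[toy_missing > MISSING_THRESHOLD].index.tolist()

print(toy_missing)
print(mostly_missing)
```

Columns flagged this way are candidates for dropping or for closer inspection before any further analysis.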

## Exploratory Analysis and Visualization

Columns which are analyzed below:

1) City

2) Start Time

3) Start Lat, Start Lng


In [None]:
df.columns

### Cities

In [None]:
# Finding the number of unique cities 

cities = df.City.unique()
len(cities)

In [None]:
# Number of accidents per city

cities_by_accident = df.City.value_counts()
cities_by_accident.head()

In [None]:
# Top 20 cities by number of accidents

cities_by_accident[:20].plot.barh()
plt.show()

In [None]:
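# distplot is deprecated in seaborn; histplot with kde = True draws the same kind of plot
sns.histplot(cities_by_accident, kde = True, stat = "density")
plt.show()

# This distribution plot shows that most cities have fewer than 2000 accidents.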

In [None]:
sns.histplot(cities_by_accident, log_scale= True)
plt.show()

In [None]:
cities_by_accident[cities_by_accident == 1]

In [None]:
# Creating the buckets for the number of accidents

high_accident_cities = cities_by_accident[cities_by_accident >= 1000]
low_accident_cities = cities_by_accident[cities_by_accident < 1000]

In [None]:
len(high_accident_cities)

In [None]:
100 * len(high_accident_cities) / len(cities)

In [None]:
sns.histplot(high_accident_cities, kde = True, stat = "density")
plt.show()

In [None]:
sns.histplot(low_accident_cities, kde = True, stat = "density")
plt.show()
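Besides the share of cities in each bucket, it can be informative to see what share of all accidents the high bucket accounts for. The following is a minimal sketch on made-up counts; the city labels and numbers are hypothetical, and only the 1000 cut-off is taken from the cells above.

```python
import pandas as pd

# Made-up accident counts per city, for illustration only.
toy_counts = pd.Series(
    {"A": 5000, "B": 1200, "C": 999, "D": 40, "E": 1},
    name="accidents",
)

THRESHOLD = 1000  # same cut-off as the buckets above

high = toy_counts[toy_counts >= THRESHOLD]
low = toy_counts[toy_counts < THRESHOLD]

# Share of cities in the high bucket, and share of accidents they account for.
high_city_share = 100 * len(high) / len(toy_counts)
high_accident_share = 100 * high.sum() / toy_counts.sum()

print(high_city_share, round(high_accident_share, 2))
```

A small share of cities paired with a large share of accidents would be consistent with the heavily skewed distribution seen in the plots.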

### Start Time

In [None]:
df.Start_Time

In [None]:
df.Start_Time = pd.to_datetime(df.Start_Time)

In [None]:
df.Start_Time[0]

In [None]:
sns.histplot(df.Start_Time.dt.hour, bins = 24, stat = "density")
plt.show()

# A high percentage of accidents occur between 3 PM and 6 PM (probably because people are in a hurry to get home from work).

In [None]:
# Checking on days of the week

sns.histplot(df.Start_Time.dt.dayofweek, bins = 7, stat = "density")
plt.show()

#### Is the distribution of accidents by hour the same on weekends as on weekdays?

In [None]:
Sunday_Start_Time = df.Start_Time[df.Start_Time.dt.dayofweek == 6]

sns.histplot(Sunday_Start_Time.dt.hour, bins = 24, stat = "density")
plt.show()

# On Sundays, the peak occurs in the evening, between 4 PM and 12 AM.
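The Sunday plot covers only one of the two weekend days. A sketch of a direct weekday versus weekend comparison is shown below, assuming that Saturday and Sunday (`dayofweek` values 5 and 6) count as the weekend. The timestamps are made up for illustration and say nothing about the real data.

```python
import pandas as pd

# Made-up timestamps, for illustration only.
toy_times = pd.Series(pd.to_datetime([
    "2020-01-06 08:15",  # Monday
    "2020-01-07 08:40",  # Tuesday
    "2020-01-08 17:05",  # Wednesday
    "2020-01-11 14:30",  # Saturday
    "2020-01-12 23:10",  # Sunday
    "2020-01-12 14:45",  # Sunday
]))

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend.
is_weekend = toy_times.dt.dayofweek >= 5

# Normalised hourly shares, so the two groups are comparable despite different sizes.
weekday_share = toy_times[~is_weekend].dt.hour.value_counts(normalize=True).sort_index()
weekend_share = toy_times[is_weekend].dt.hour.value_counts(normalize=True).sort_index()

print(weekday_share)
print(weekend_share)
```

Applied to `df.Start_Time`, the two resulting series could be plotted side by side to answer the question in the heading.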

In [None]:
# Checking the months

sns.histplot(df.Start_Time.dt.month, bins = 12, stat = "density")
plt.show()

# It seems that the fewest accidents occur in summer and the most in winter.
# This might be due to low visibility on the roads caused by fog, or to slippery roads caused by snow.

In [None]:
# Checking individual years to inspect the plot above.

df_2019 = df[df.Start_Time.dt.year == 2019]
sns.histplot(df_2019.Start_Time.dt.month, bins = 12, stat = "density")
plt.show()

In [None]:
df_2017 = df[df.Start_Time.dt.year == 2017]
sns.histplot(df_2017.Start_Time.dt.month, bins = 12, stat = "density")
plt.show()

In [None]:
df_2016 = df[df.Start_Time.dt.year == 2016]
sns.histplot(df_2016.Start_Time.dt.month, bins = 12, stat = "density")
plt.show()

# Some data is missing for 2016, which could be the reason the plot over all years is inaccurate.
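Rather than plotting each year separately, the gaps can be made visible in a single year-by-month table of row counts. This is a minimal sketch on made-up dates; the values are illustrative and do not reflect which months are actually missing in the dataset.

```python
import pandas as pd

# Made-up start dates, for illustration only.
toy_start = pd.Series(pd.to_datetime([
    "2016-02-10", "2016-03-05", "2016-03-20",
    "2017-01-15", "2017-02-01", "2017-03-09",
]))

years = toy_start.dt.year.rename("year")
months = toy_start.dt.month.rename("month")

# Rows per (year, month); combinations with no rows show up as 0.
monthly_counts = pd.crosstab(years, months)

# Months of 2016 with no rows at all.
row_2016 = monthly_counts.loc[2016]
months_without_rows = row_2016[row_2016 == 0].index.tolist()

print(monthly_counts)
print(months_without_rows)
```

Only months that occur in at least one year appear as columns, so a month missing from every year would not be listed.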

### Start Latitude & Longitude

In [None]:
df.Start_Lat.head()

In [None]:
df.Start_Lng.head()

In [None]:
sns.scatterplot(x = df.Start_Lng, y = df.Start_Lat)
plt.show()

In [None]:
import folium
from folium.plugins import HeatMap

In [None]:
# Displaying on a heatmap
# (accident_map avoids shadowing Python's built-in map)
accident_map = folium.Map()

HeatMap(list(zip(df.Start_Lat, df.Start_Lng))).add_to(accident_map)
accident_map
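Rendering every row as a heatmap point can make the notebook slow and the output very large. One common workaround is to plot a random sample instead. The sketch below prepares such a sample on randomly generated coordinates; the 10% fraction and the `dropna` step are assumptions added as a precaution, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Randomly generated coordinates, for illustration only.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Start_Lat": rng.uniform(25, 49, size=1000),
    "Start_Lng": rng.uniform(-124, -67, size=1000),
})
toy_df.loc[::100, "Start_Lat"] = np.nan  # simulate a few missing coordinates

SAMPLE_FRACTION = 0.1  # hypothetical value; tune it to the size of the data

sample = (
    toy_df[["Start_Lat", "Start_Lng"]]
    .dropna()                                      # drop rows without a usable location
    .sample(frac=SAMPLE_FRACTION, random_state=42)  # reproducible random subset
)

# List of [lat, lng] pairs, the shape a heatmap layer expects.
lat_lng_pairs = sample.to_numpy().tolist()
print(len(lat_lng_pairs))
```

The resulting list can be passed to `HeatMap` in place of the full set of coordinates.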

### Ask & answer questions
**1) Are there more accidents in warmer or colder months?**

Answer: There are more accidents in colder months. This might be due to snow or to low visibility caused by fog.

**2) Does New York show up in the data? If yes, why is the count lower if this is the most populated city?**

Answer: No, New York does not appear in the data.
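A check like this can be done with an exact match and with a case-insensitive substring match, since a city may be recorded under a slightly different name. The sketch below uses a made-up list of city names, so its counts say nothing about the real dataset.

```python
import pandas as pd

# Made-up city names, for illustration only.
toy_cities = pd.Series(
    ["Los Angeles", "New York", "Miami", None, "new york city"],
    name="City",
)

# Exact match on the city name.
exact_matches = int((toy_cities == "New York").sum())

# Case-insensitive substring match; na=False treats missing names as non-matching.
partial_matches = int(
    toy_cities.str.contains("new york", case=False, na=False).sum()
)

print(exact_matches, partial_matches)
```

Running the same two expressions on `df.City` would show whether the name is absent altogether or only spelled differently.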

**3) At what time of day are accidents most frequent?**

Answer: A high percentage of accidents occur between 3 PM and 6 PM (probably because people are in a hurry to get home from work).

**4) Which days of the week have the most accidents?**

Answer: Most accidents happen on weekdays. The likely reason is that people travel to work on weekdays, so there is more traffic and a higher chance of accidents.

**5) Which months have the most accidents?**

Answer: November and December have the most accidents. The reason could be low visibility in peak winter or snowfall.



**Note:** There is scope to find much more in this data; the analysis is not limited to these findings. We can also look at weather conditions, temperature, and many other columns.
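As a starting point for such a follow-up, the share of accidents per weather condition could be computed with `value_counts`. The sketch below assumes a column named `Weather_Condition` and uses made-up values; both are assumptions to verify against `df.columns`.

```python
import pandas as pd

# Made-up values; the column name is an assumption, check df.columns first.
toy_df = pd.DataFrame({
    "Weather_Condition": ["Clear", "Rain", "Clear", None, "Fog", "Clear"],
})

# Share of rows per weather condition; missing values are excluded by default.
weather_share = toy_df["Weather_Condition"].value_counts(normalize=True)

print(weather_share)
```

A high share for a condition does not by itself mean it is riskier, because it may simply be the most common weather.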

### Summary and Conclusion

Insights:
    
1) No data from New York.

2) The number of accidents per city decreases exponentially.

3) Less than 3% of cities have 1000 or more accidents in the dataset.

4) Over 1100 cities have reported just one accident (need to investigate more).

5) Los Angeles is at the top in terms of number of accidents.