**NOTE -**
The dataset used in this notebook was recently updated on Kaggle:
- The dataset filename has changed. The new filename is **'./us-accidents/US_Accidents_Dec20_updated.csv'**. Change the filename if you get errors while creating the dataframe.
- The **"Source"** column has been removed from the updated dataset. Any mention of "Source" in this notebook can be ignored when using the updated dataset.
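
One way to make the notebook tolerant of both dataset versions is to pick whichever CSV file exists on disk. The helper below is a minimal sketch: `pick_existing` and `found_filename` are hypothetical names (not part of `opendatasets` or pandas), and the two paths are the ones mentioned in this notebook.

```python
from pathlib import Path

# Filenames mentioned in this notebook: the updated one first, then the original.
CANDIDATE_FILES = [
    './us-accidents/US_Accidents_Dec20_updated.csv',
    './us-accidents/US_Accidents_Dec20.csv',
]

def pick_existing(paths):
    """Return the first path that exists on disk, or None if none do."""
    for p in paths:
        if Path(p).is_file():
            return p
    return None

# None here means the dataset has not been downloaded yet.
found_filename = pick_existing(CANDIDATE_FILES)
```

If `found_filename` is not `None`, it could be passed to `pd.read_csv` in place of a hard-coded path.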

# US Accidents Exploratory Data Analysis

TODO - talk about EDA

TODO - talk about the dataset (source, what it contains, how it will be useful)
  - Kaggle
  - information about accidents
  - can be useful to prevent accidents
  - mention that this does not contain data about New York


In [None]:
pip install opendatasets --upgrade --quiet

In [None]:
import opendatasets as od

download_url = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'

od.download(download_url)

In [None]:
data_filename = './us-accidents/US_Accidents_Dec20.csv'

## Data Preparation and Cleaning

1. Load the file using Pandas
2. Look at some information about the data & the columns
3. Fix any missing or incorrect values

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(data_filename)

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_df = df.select_dtypes(include=numerics)
len(numeric_df.columns)

Share of missing values per column (computed as a fraction between 0 and 1; multiply by 100 for a percentage)

In [None]:
missing_percentages = df.isna().sum().sort_values(ascending=False) / len(df)
missing_percentages

In [None]:
type(missing_percentages)

In [None]:
missing_percentages[missing_percentages != 0].plot(kind='barh')

Remove columns that you don't want to use.
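
A common way to choose which columns to drop is by their share of missing values. The sketch below uses a small made-up DataFrame and an arbitrary 50% cutoff. Both are assumptions for illustration, `drop_sparse_columns` is a hypothetical helper name, and the right threshold for this dataset is a judgment call.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    missing_fraction = frame.isna().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return frame[keep]

# Made-up data: one column is 75% missing, another 25% missing.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Mostly_Missing': [np.nan, np.nan, np.nan, 1.0],
    'Some_Missing': [70.0, np.nan, 65.0, 80.0],
})
cleaned = drop_sparse_columns(toy)
```

With the real data, the equivalent call would be `df = drop_sparse_columns(df)`.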

## Exploratory Analysis and Visualization

Columns we'll analyze:

1. City
2. Start Time
3. Start Lat, Start Lng
4. Temperature
5. Weather Condition

In [None]:
df.columns

### City

In [None]:
df.City

In [None]:
cities = df.City.unique()
len(cities)

In [None]:
cities_by_accident = df.City.value_counts()
cities_by_accident

In [None]:
cities_by_accident[:20]

In [None]:
type(cities_by_accident)

In [None]:
cities_by_accident[:20].plot(kind='barh')

In [None]:
import seaborn as sns
sns.set_style("darkgrid")

In [None]:
sns.histplot(cities_by_accident, log_scale=True)

In [None]:
cities_by_accident[cities_by_accident == 1]

### Start Time

In [None]:
df.Start_Time

In [None]:
df.Start_Time = pd.to_datetime(df.Start_Time)

- Figure out how to show percentages
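
One possible way to get percentages is to normalize the counts directly with `value_counts(normalize=True)`. The sketch below uses a few made-up timestamps rather than the real dataset.

```python
import pandas as pd

# Made-up timestamps standing in for df.Start_Time.
start_times = pd.to_datetime(pd.Series([
    '2019-01-07 08:15:00',
    '2019-01-07 08:45:00',
    '2019-01-07 17:30:00',
    '2019-01-08 08:05:00',
]))

# Share of accidents per hour of day, expressed as a percentage.
hourly_pct = (
    start_times.dt.hour
    .value_counts(normalize=True)
    .sort_index()
    * 100
)
```

The resulting Series is indexed by hour, so `hourly_pct.plot(kind='bar')` would show percentages on the y-axis.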

In [None]:
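# distplot is deprecated in recent seaborn releases; histplot with
# stat='probability' shows the share of accidents in each hour.
sns.histplot(df.Start_Time.dt.hour, bins=24, stat='probability')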

- A high percentage of accidents occur between 6 am and 10 am (probably people in a hurry to get to work).
- The next highest percentage is between 3 pm and 6 pm.

In [None]:
# dayofweek: Monday is 0, Sunday is 6
sns.histplot(df.Start_Time.dt.dayofweek, bins=7, stat='probability')

Is the distribution of accidents by hour the same on weekends as on weekdays?

In [None]:
sundays_start_time = df.Start_Time[df.Start_Time.dt.dayofweek == 6]
sns.histplot(sundays_start_time.dt.hour, bins=24, stat='probability')

In [None]:
monday_start_time = df.Start_Time[df.Start_Time.dt.dayofweek == 0]
sns.histplot(monday_start_time.dt.hour, bins=24, stat='probability')

On Sundays, the peak occurs between 10 am and 3 pm, unlike on weekdays.
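
To compare weekdays and weekends side by side, the hourly shares can be computed per group. This is a minimal sketch on made-up timestamps. It treats `dayofweek >= 5` as the weekend because pandas numbers Monday as 0, and the `day_type` labels are an assumption introduced here.

```python
import pandas as pd

# Made-up timestamps standing in for df.Start_Time.
start_times = pd.to_datetime(pd.Series([
    '2019-01-07 08:00:00',  # Monday
    '2019-01-08 08:30:00',  # Tuesday
    '2019-01-09 17:00:00',  # Wednesday
    '2019-01-12 13:00:00',  # Saturday
    '2019-01-13 13:30:00',  # Sunday
    '2019-01-13 11:00:00',  # Sunday
]))

# Label each accident as falling on a weekday or a weekend.
day_type = start_times.dt.dayofweek.map(
    lambda d: 'weekend' if d >= 5 else 'weekday'
).rename('day_type')

# Rows are hours, columns are day types; each column sums to 1.
hourly_share = pd.crosstab(
    start_times.dt.hour.rename('hour'),
    day_type,
    normalize='columns',
)
```

Each column of `hourly_share` is a distribution over hours, so the two columns can be plotted together or compared row by row.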

In [None]:
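df_2019 = df[df.Start_Time.dt.year == 2019]
# Keep only MapQuest-reported accidents. This needs the 'Source' column,
# which is absent from the updated dataset.
df_2019_mapquest = df_2019[df_2019.Source == 'MapQuest']
sns.histplot(df_2019_mapquest.Start_Time.dt.month, bins=12, stat='probability')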

Can you explain the month-wise trend of accidents?

- Much data is missing for 2016. Maybe even 2017.
- There seems to be some issue with the Bing data

In [None]:
df.Source.value_counts().plot(kind='pie')

- Consider excluding Bing data, seems to have issues.
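
Excluding one source could look like the sketch below. It uses made-up rows, and it guards on the `Source` column because the updated dataset no longer has it. `exclude_source` is a hypothetical helper name.

```python
import pandas as pd

def exclude_source(frame, source_name):
    """Return rows not reported by source_name; no-op if 'Source' is absent."""
    if 'Source' not in frame.columns:
        return frame
    return frame[frame['Source'] != source_name]

# Made-up rows standing in for the accidents dataframe.
toy = pd.DataFrame({
    'Source': ['MapQuest', 'Bing', 'MapQuest'],
    'City': ['A', 'B', 'C'],
})
without_bing = exclude_source(toy, 'Bing')
```

Returning the frame unchanged when the column is missing keeps the same cell usable with either version of the dataset.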

### Start Latitude & Longitude


In [None]:
df.Start_Lat

In [None]:
df.Start_Lng

In [None]:
sample_df = df.sample(int(0.1 * len(df)))

In [None]:
# `s` sets the marker size; `size` is meant for mapping a data variable.
sns.scatterplot(x=sample_df.Start_Lng, y=sample_df.Start_Lat, s=1)

In [None]:
import folium

In [None]:
lat, lon = df.Start_Lat[0], df.Start_Lng[0]
lat, lon

In [None]:
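# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement.
# It iterates over columns as (name, Series) pairs, not over rows.
for x in df[['Start_Lat', 'Start_Lng']].sample(100).items():
    print(x[1])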

In [None]:
zip(list(df.Start_Lat), list(df.Start_Lng))

In [None]:
from folium.plugins import HeatMap

In [None]:
sample_df = df.sample(int(0.001 * len(df)))
lat_lon_pairs = list(zip(list(sample_df.Start_Lat), list(sample_df.Start_Lng)))

In [None]:
# Avoid naming the variable `map`, which shadows the Python built-in.
accident_map = folium.Map()
HeatMap(lat_lon_pairs).add_to(accident_map)
accident_map

## Ask & answer questions

1. Are there more accidents in warmer or colder areas?
2. Which 5 states have the highest number of accidents? How about per capita?
3. Does New York show up in the data? If yes, why is the count lower if this is the most populated city?
4. Among the top 100 cities by number of accidents, which states do they belong to most frequently?
5. At what time of day are accidents most frequent? - ANSWERED
6. Which days of the week have the most accidents?
7. Which months have the most accidents?
8. What is the trend of accidents year over year (decreasing/increasing?)
9. When are accidents per unit of traffic the highest?
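
Questions 6 to 8 can all be approached by counting accidents per calendar unit of `Start_Time`. The following is a minimal sketch on made-up timestamps, not results from the dataset.

```python
import pandas as pd

# Made-up timestamps standing in for df.Start_Time.
start_times = pd.to_datetime(pd.Series([
    '2018-03-05 09:00:00',  # Monday
    '2019-03-11 09:00:00',  # Monday
    '2019-07-19 16:00:00',  # Friday
]))

# Question 6: accidents per day of the week, most frequent first.
by_day = start_times.dt.day_name().value_counts()

# Question 7: accidents per month (1 = January), in calendar order.
by_month = start_times.dt.month.value_counts().sort_index()

# Question 8: accidents per year, in chronological order.
by_year = start_times.dt.year.value_counts().sort_index()
```

Because the notes above mention missing data for some years, raw yearly counts from the real dataset would need care before being read as a trend.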

## Summary and Conclusion


Insights:
- No data from New York
- The number of accidents per city drops off roughly exponentially when cities are ranked by accident count
- Fewer than 5% of cities have more than 1000 yearly accidents
- Over 1200 cities have reported just one accident (need to investigate)
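
Claims like these can be checked directly against the per-city counts. The sketch below uses a made-up `cities_by_accident` Series in place of `df.City.value_counts()`. It counts over the whole toy dataset rather than per year, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up per-city accident counts, indexed by city name.
cities_by_accident = pd.Series(
    {'A': 5000, 'B': 1200, 'C': 40, 'D': 1, 'E': 1}
)

# Percentage of cities with more than 1000 accidents.
share_over_1000 = (cities_by_accident > 1000).mean() * 100

# Number of cities that reported exactly one accident.
single_accident_cities = int((cities_by_accident == 1).sum())

# Whether a city named 'New York' appears at all.
has_new_york = 'New York' in cities_by_accident.index
```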


In [None]:
import jovian

In [None]:
jovian.commit()