# Exploratory Data Analysis Project: US Accidents (1.5 million records)
##### A Countrywide Traffic Accident Dataset (2016 - 2020)

<img style="float:center" src="https://image.freepik.com/free-vector/traffic-accident-abstract-concept-vector-illustration-road-accident-report-traffic-laws-violation-single-car-crash-investigation-injury-statistics-multi-vehicle-collision-abstract-metaphor_335657-1800.jpg" alt="image" style="width:435px;height:366px;">

## Project Goals
* Analyze columns such as City, State, Start_Time and more.
* Answer questions like:

  - Where are traffic accidents most common?

  - At what time of day are traffic accidents most likely?

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
file_path = "/kaggle/input/us-accidents/US_Accidents_Dec20_updated.csv"

df = pd.read_csv(file_path)

In [3]:
print(f"Number of columns: {df.shape[1]}")
print(f"Number of rows: {df.shape[0]}")

In [4]:
df.info()

In [5]:
df.describe()

In [6]:
df.isnull().sum()

Several columns have many missing values. Wind chill and precipitation, for example, each have more than 400k missing values.
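
Raw counts are easier to compare when expressed as a share of all rows. A minimal sketch on a tiny made-up frame (the `sample` data and the `missing_share` helper are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

# Made-up sample standing in for the accidents dataframe
sample = pd.DataFrame({
    "City": ["Dayton", "Miami", None, "Houston"],
    "Temperature(F)": [36.9, np.nan, np.nan, 71.0],
    "State": ["OH", "FL", "CA", "TX"],
})

shares = missing_share(sample)
print(shares)
```

In the notebook, the same helper could be called as `missing_share(df)` before the columns are selected.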

In [7]:
df = df[["Start_Time", "City", "State", "Temperature(F)", "Start_Lng", "Start_Lat"]]

In [8]:
df.head()

**Let's start with geographical data**

In [9]:
sns.set_style("darkgrid")
sns.scatterplot(x="Start_Lng", y="Start_Lat", data=df, alpha=0.4)

In [10]:
df["City"].nunique()

In [11]:
df["State"].nunique()

In [12]:
count_city = df["City"].value_counts()
count_city

In [13]:
count_city[:20]

- Some cities have only 1 accident record.
- New York is not among the top cities, despite being the most populous city in the U.S.

In [14]:
only_ny = df.loc[df["City"] == "New York"]
only_ny.head()

In [15]:
len(only_ny)

New York City is included in the dataset, but it has only 4000 recorded accidents.

In [16]:
# cities with at least 1000 records
high_accidents = count_city[count_city >= 1000]
high_accidents.head()

In [17]:
sns.histplot(high_accidents,  log_scale=True)

In [18]:
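# .copy() makes new_df an independent dataframe, so later column assignments do not act on a slice of df
new_df = df[df["City"].isin(high_accidents.index)].copy()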

In [19]:
print(f"new_df has {(1 - len(new_df) / len(df)) * 100:.4f}% fewer rows than the original dataframe")

**A bit more than 40% of the rows in the original dataframe belong to cities with fewer than 1000 records**
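
The share of rows dropped and the share of cities dropped are different quantities, and both can be read off the value counts. A small sketch with made-up city counts (the `toy` frame and the threshold of 3 are assumptions chosen to keep the example tiny):

```python
import pandas as pd

# Made-up data: one large city and three small ones
toy = pd.DataFrame({"City": ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]})
threshold = 3  # stand-in for the 1000-record cutoff

counts = toy["City"].value_counts()
kept_cities = counts[counts >= threshold].index
kept_rows = toy[toy["City"].isin(kept_cities)]

# Share of rows removed by the filter
rows_dropped_pct = (1 - len(kept_rows) / len(toy)) * 100
# Share of distinct cities removed by the filter
cities_dropped_pct = (1 - len(kept_cities) / counts.size) * 100

print(f"{rows_dropped_pct:.1f}% of rows dropped")
print(f"{cities_dropped_pct:.1f}% of cities dropped")
```

In this toy case most cities are dropped while most rows are kept, which is why the two percentages should be reported separately.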


**Number of accidents by state**

In [20]:
accidents_by_state = new_df.groupby("State").City.count().sort_values(ascending=False)
accidents_by_state[:10]

In [21]:
sns.barplot(x=accidents_by_state[:10].index , y=accidents_by_state[:10])

- California far surpasses the other states in the number of traffic accidents.

- After Florida (2nd place), no state reaches the 100 000 mark.

**Parsing dates**

In [22]:
# Convert the Start_Time strings to datetime so the .dt accessor can be used
new_df["Start_Time"] = pd.to_datetime(new_df["Start_Time"])
new_df["Start_Time"].dt.year

In [23]:
sns.countplot(x=new_df["Start_Time"].dt.year)

2020 was the year with the most traffic accident records.

In [24]:
# Assign the result of set_xticklabels to a variable to hide the returned text objects
p = sns.countplot(x=new_df["Start_Time"].dt.hour)
# dt.hour ranges from 0 to 23, so the tick labels use the same range
var = p.set_xticklabels(list(range(24)))

Many of the accidents happen between 2pm and 7pm.
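
The hourly distribution can also be tabulated without a plot. Note that `Series.dt.hour` returns values from 0 to 23, where 0 is midnight. A sketch on made-up timestamps (the `times` series is invented for illustration):

```python
import pandas as pd

# Made-up timestamps standing in for the Start_Time column
times = pd.to_datetime(pd.Series([
    "2020-12-01 00:15:00",
    "2020-12-01 08:05:00",
    "2020-12-01 17:30:00",
    "2020-12-02 17:45:00",
]))

# Count records per hour; hours with no records get 0,
# so the index always covers the full 0-23 range
per_hour = times.dt.hour.value_counts().reindex(range(24), fill_value=0)

busiest_hour = per_hour.idxmax()
print(per_hour[per_hour > 0])
print(f"Busiest hour: {busiest_hour}:00")
```

Reindexing to the full range keeps the positions aligned with the hour values even when some hours have no records.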

In [25]:
g = sns.countplot(x=new_df["Start_Time"].dt.dayofweek)
var = g.set_xticklabels(rotation=30, labels=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])

Weekdays have the most accidents, possibly because people are in a hurry to get to work from Monday to Friday.

In [26]:
month_vc = new_df["Start_Time"].dt.month.value_counts().sort_index()
month_vc

In [27]:
g = sns.countplot(x=new_df["Start_Time"].dt.month)
var = g.set_xticklabels(rotation=50, labels=["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])

September to December are the months with the most accidents. This may be correlated with lower temperatures and frozen roads.

In [28]:
# Mean temperature per month, converted from Fahrenheit to Celsius
temp_f = new_df.groupby(new_df["Start_Time"].dt.month)["Temperature(F)"].mean()
temp_c = (temp_f - 32) * 5 / 9
print("Temperature in Celsius:")
list(temp_c)

In [29]:
cor = month_vc.corr(temp_c)
cor

There is a negative correlation between the number of accidents each month and the mean monthly temperature. No luck.
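
`Series.corr` computes the Pearson coefficient by default and aligns the two series on their index. A sketch with invented monthly values (both series below are made up, not taken from the dataset) that also computes a rank-based coefficient by correlating the ranks, which is less sensitive to outliers:

```python
import pandas as pd

months = range(1, 13)
# Invented monthly accident counts and mean temperatures (Celsius)
accidents = pd.Series([50, 48, 45, 40, 38, 36, 35, 37, 55, 60, 62, 70], index=months)
temps = pd.Series([2, 4, 9, 14, 19, 24, 27, 26, 22, 15, 9, 4], index=months)

# Pearson correlation (the default method of Series.corr)
pearson = accidents.corr(temps)
# Pearson correlation of the ranks, which equals the Spearman coefficient
rank_corr = accidents.rank().corr(temps.rank())

print(f"Pearson:    {pearson:.3f}")
print(f"Rank-based: {rank_corr:.3f}")
```

With only 12 monthly points, either coefficient is a rough indication at best.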

***Conclusions:***

* California has the most traffic accident records (293 395), almost 3 times as many as Florida in second place (107 215).

* Not surprisingly, Los Angeles, Miami and many other cities in those two states have the most accident records.

* Weekdays in December, from 7am to 10am and from 2pm to 7pm, are the most dangerous times to drive, especially in California.

* Lastly, the dataset is missing data from many cities and/or has gathered most of its data from a few states.
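
The combined weekday and hour claim can be checked with a cross-tabulation instead of reading separate plots. A minimal sketch on made-up timestamps (the `times` series is invented; in the notebook it would be `new_df["Start_Time"]`):

```python
import pandas as pd

# Invented timestamps; 2020-12-07 is a Monday, 2020-12-12 a Saturday
times = pd.to_datetime(pd.Series([
    "2020-12-07 08:10:00",
    "2020-12-07 08:40:00",
    "2020-12-07 17:20:00",
    "2020-12-12 13:00:00",
]))

# Rows: day of week (0 = Monday), columns: hour of day (0-23)
table = pd.crosstab(times.dt.dayofweek, times.dt.hour,
                    rownames=["dayofweek"], colnames=["hour"])

# Locate the (day, hour) cell with the most records
day, hour = table.stack().idxmax()
print(table)
print(f"Most records: dayofweek={day}, hour={hour}")
```

Filtering `times` to a single month or state first would narrow the table to that subset.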