# US Accidents Exploratory Data Analysis 

This is a countrywide car accident dataset that covers 49 states of the USA.

This dataset contains data for US accidents from Feb 2016 to March 2023.

The dataset contains approximately 7.7 million accident records.

We will employ data analysis and visualization techniques to uncover patterns, trends, and important factors associated with accidents.
        

In [None]:
%pip install opendatasets --upgrade --quiet


#### Libraries used for this project


In [None]:
import opendatasets as od
import pandas as pd
import seaborn as sns
sns.set_style('darkgrid')
import folium 
from folium import plugins
from folium.plugins import HeatMap
from folium.plugins import MarkerCluster
import matplotlib.pyplot as plt

#### Get the data

In [None]:
download_url='https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents'
od.download(download_url)
data_filename='./us-accidents/US_Accidents_March23.csv'
df=pd.read_csv(data_filename)


#### Understanding our data 





In [None]:
#Display 5 rows from the DataFrame
df.head(5)

In [None]:
# Get the rows and columns
rows,columns = df.shape
print(f'This dataset contains rows: {rows} and columns :{columns}')

In [None]:
# Get the data type of each column
df.info()

In [None]:
# Get the columns with a numeric data type, for statistical analysis
df.select_dtypes(include='number').columns.tolist()

In [None]:
# Check if all the cases are unique
unique_ids = df['ID'].nunique()
print(f"Number of unique cases (based on ID): {unique_ids}")
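To turn the uniqueness check into a yes/no answer, one option is to test the `ID` column directly. The following is a minimal sketch on made-up data; `ids_are_unique` and the toy frames are hypothetical names, not part of the dataset or of pandas.

```python
import pandas as pd

def ids_are_unique(frame: pd.DataFrame, id_col: str = "ID") -> bool:
    """Return True when every row has a distinct, non-missing value in id_col."""
    ids = frame[id_col]
    return bool(ids.notna().all() and ids.is_unique)

# Toy frames for illustration only (not real accident records)
toy_ok = pd.DataFrame({"ID": ["A-1", "A-2", "A-3"]})
toy_dup = pd.DataFrame({"ID": ["A-1", "A-2", "A-2"]})

ok_result = ids_are_unique(toy_ok)    # every ID appears once
dup_result = ids_are_unique(toy_dup)  # "A-2" appears twice
```

On the real data the call would be `ids_are_unique(df)`.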

#### Data Cleaning

##### Handle Missing Values

In [None]:
#Percentage of missing values per column
Missing_percentage=(df.isnull().sum().sort_values(ascending=False) / len(df)) * 100
Missing_percentage[Missing_percentage!=0].plot(kind='barh')
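The same percentages can be computed with `isna().mean()`, which makes it easy to list the columns above a chosen cut-off. This is a sketch on a toy frame: `missing_share` is a hypothetical helper, and the 50% cut-off is an arbitrary assumption rather than the rule used to pick the dropped columns in this notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first, zeros omitted."""
    share = frame.isna().mean() * 100
    return share[share > 0].sort_values(ascending=False)

# Toy frame for illustration only
toy = pd.DataFrame({
    "City": ["Miami", None, "Dallas", "Austin"],
    "End_Lat": [np.nan, np.nan, np.nan, 30.2],
    "Severity": [2, 2, 3, 2],
})

toy_missing = missing_share(toy)
# Assumed cut-off: flag columns that are more than half empty
mostly_empty = toy_missing[toy_missing > 50].index.tolist()
```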

In [None]:
# Drop columns with missing values that will not add value to our analysis
df.drop(columns=['End_Lat','End_Lng','Precipitation(in)','Street','Wind_Chill(F)','Timezone','Zipcode'], inplace=True)

In [None]:
# Check if the data has duplicates
df.duplicated().sum()

###### There are no duplicates in our dataset

In [None]:
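# Re-check the number of missing values left in each column after dropping columns
Missing_count=df.isnull().sum()
Missing_count[Missing_count!=0]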

#### Exploratory analysis

    We will consider the columns below for our analysis
        1. City - Shows the city in the address field.
        2. Start Time - Shows the start time of the accident in the local time zone.
        3. Start Lat, Start Lng - Show the latitude and longitude (GPS coordinates) of the start point.
        4. Temperature - Shows the temperature (in Fahrenheit).
        5. Weather Condition - Shows the weather condition (rain, snow, thunderstorm, fog, etc.)

##### Plot a bar graph to analyse the top 10 cities with the highest number of accidents

In [None]:
cities_by_accident=df.City.value_counts()
cities_by_accident[:10].plot(kind='bar', title='Accidents by City', ylabel = 'Accidents Count',xlabel='City Name')


##### Among the top 100 cities by number of accidents, which states do they belong to?

In [None]:
Top100=df.City.value_counts().head(100).index

In [None]:
filt=df['City'].isin(Top100)

In [None]:
Stats=df.loc[filt,['State']].value_counts().head(20).plot(kind='bar')

##### Which 5 states have the highest number of accidents?

In [None]:
# Count accidents per state (not per city) and keep the 5 largest
df.State.value_counts().nlargest(5)

##### Percentage of high and low accident cities

In [None]:
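# Shares are fractions of distinct cities, so divide by the number of cities, not the number of rows
high_accident_city=cities_by_accident[cities_by_accident>=10000]
high_accident_city_per=len(high_accident_city)/len(cities_by_accident)
low_accident_city=cities_by_accident[cities_by_accident<10000]
low_accident_city_per=len(low_accident_city)/len(cities_by_accident)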

In [None]:
high_accident_city_per

In [None]:
low_accident_city_per

In [None]:
Accident=sns.histplot(high_accident_city)
Accident.set_xlabel("Accident Count")
Accident.set_title("Cities with more than 10000 Accidents")

Less than 5% of cities have 10,000 or more accidents over the period covered by the dataset.
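The share of cities at or above a threshold is a fraction of distinct cities, so the denominator is the number of entries in the per-city counts rather than the number of accident rows. The following is a small sketch with made-up city counts; `share_of_cities_at_or_above` is a hypothetical helper.

```python
import pandas as pd

def share_of_cities_at_or_above(city_counts: pd.Series, threshold: int) -> float:
    """Fraction of distinct cities whose accident count is >= threshold."""
    # The mean of a boolean Series is the fraction of True values
    return float((city_counts >= threshold).mean())

# Toy data: 4 distinct cities, only one of them with 5 or more accidents
toy_cities = pd.Series(["Miami"] * 5 + ["Dallas"] * 2 + ["Austin"] + ["Reno"])
toy_counts = toy_cities.value_counts()

toy_high_share = share_of_cities_at_or_above(toy_counts, threshold=5)
```

With the notebook's variables, the equivalent call would be `share_of_cities_at_or_above(cities_by_accident, 10000)`.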

##### Count of Cities with only one accident 

In [None]:
# Count the cities with only one accident (1203 in this dataset)
cities_by_accident[cities_by_accident==1].count()

In [None]:
df.Start_Time=pd.to_datetime(df.Start_Time,format="ISO8601")
Hour=df.Start_Time.dt.hour
day_of_week=df.Start_Time.dt.dayofweek
month=df.Start_Time.dt.month
year=df.Start_Time.dt.year

##### What time of the day are accidents most frequent in?

In [None]:

sns.histplot(Hour, bins=24)
plt.xlabel("Hour")
plt.ylabel("Number of Occurrences")
plt.title('Accidents Count By Time of Day')


We observe a high number of accidents from 6 am to 10 am and from 3 pm to 10 pm, which could be due to people travelling to and from work. We will further check whether the trend is the same on weekends.
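Because weekdays have far more accidents in total, one way to compare the hourly pattern across weekdays and weekends is to look at each hour's share within its own group. The sketch below does this on made-up timestamps; `hourly_share` is a hypothetical helper, and the `weekday`/`weekend` labels are names chosen for illustration.

```python
import pandas as pd

def hourly_share(start_times: pd.Series) -> pd.DataFrame:
    """Percentage of accidents per hour, computed separately for weekdays and weekends."""
    times = pd.to_datetime(start_times)
    hour = times.dt.hour.rename("hour")
    # dayofweek: Monday=0 ... Saturday=5, Sunday=6
    is_weekend = times.dt.dayofweek >= 5
    day_type = is_weekend.map({True: "weekend", False: "weekday"}).rename("day_type")
    # normalize="columns" makes each column sum to 1 before scaling to percent
    return pd.crosstab(hour, day_type, normalize="columns") * 100

# Toy timestamps: 2023-01-02 is a Monday, 2023-01-07 and 2023-01-08 are a weekend
toy_times = pd.Series([
    "2023-01-02 08:15", "2023-01-03 08:40", "2023-01-04 17:05", "2023-01-05 17:30",
    "2023-01-07 13:00", "2023-01-07 13:45", "2023-01-08 13:20", "2023-01-08 20:00",
])
toy_share = hourly_share(toy_times)
```

On the real data, `hourly_share(df.Start_Time).plot()` would draw both profiles on one chart.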

##### Which days of the week have the most accidents

In [None]:
sns.histplot(day_of_week, bins=7)
plt.xlabel("Day of the Week")
plt.title('Accidents by day of the week')


Weekends have fewer accidents than weekdays, which could be because they are non-working days.

##### What is the trend of accidents year over year (decreasing/increasing?)

In [None]:
sns.histplot(df.Start_Time.dt.year, bins=8)

We see that the number of accidents increases year over year. (This dataset contains only the first 3 months of data for 2023, so the count for that year is incomplete.)
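To judge how fast the counts grow, one option is to compute the year-over-year percentage change after setting aside years with incomplete data. This is a sketch on made-up dates; `yearly_growth` and its `partial_years` parameter are hypothetical names.

```python
import pandas as pd

def yearly_growth(start_times: pd.Series, partial_years=()) -> pd.DataFrame:
    """Accident counts per year with year-over-year percentage change."""
    years = pd.to_datetime(start_times).dt.year
    counts = years.value_counts().sort_index()
    # Drop years known to be incomplete so they do not distort the trend
    counts = counts.drop(labels=list(partial_years), errors="ignore")
    return pd.DataFrame({"accidents": counts, "pct_change": counts.pct_change() * 100})

# Toy dates: 2 accidents in 2020, 3 in 2021, 6 in 2022, and a partial 2023
toy_dates = pd.Series(
    ["2020-03-01"] * 2 + ["2021-06-15"] * 3 + ["2022-09-30"] * 6 + ["2023-01-10"]
)
toy_growth = yearly_growth(toy_dates, partial_years=(2023,))
```

On the real data, 2023 would be passed as a partial year, and possibly 2016 as well, since the dataset starts in Feb 2016. A roughly constant `pct_change` would point to exponential growth, while a shrinking one would point to a slower trend.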

##### Which months have the most accidents?

In [None]:
sns.histplot(month, bins=12)
plt.xlabel("Month")
plt.title('Accidents by Month of the year')


We see a spike in accidents during the winter season, with December contributing the highest number of accidents, possibly due to foggy weather, slippery road conditions, and the holiday season (longer travel hours).

##### What is the trend of accidents over the weekends?

In [None]:
Weekend=df.Start_Time[(day_of_week == 6) | (day_of_week == 5)]
sns.histplot(Weekend.dt.hour, bins=24)
plt.xlabel("Hour")
plt.title('Accident count on Weekends')

On weekends, accidents are high between 10 am and 9 pm, unlike on weekdays. This could be due to leisure travel, or to drivers being less cautious when there is less road traffic.

In [None]:
# value_counts() already sorts in descending order, so no extra sort is needed
df.Weather_Condition.value_counts().head(20).plot(kind='bar')
plt.xlabel("Weather Condition")
plt.title('Weather conditions')

Most accidents took place when the weather was 'Fair', which suggests that weather is not the major factor in accidents. Fair weather is also the most common condition, so the raw counts alone cannot settle this.
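Raw counts mostly reflect how often each condition occurs. A rough complementary check is to compare the severity mix within each of the most frequent conditions. The sketch below uses made-up rows with the dataset's `Weather_Condition` and `Severity` column names; `severity_mix_by_weather` is a hypothetical helper.

```python
import pandas as pd

def severity_mix_by_weather(frame: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Row-wise percentage of each severity level for the top_n most frequent conditions."""
    top = frame["Weather_Condition"].value_counts().head(top_n).index
    subset = frame[frame["Weather_Condition"].isin(top)]
    # normalize="index" makes each weather condition's row sum to 1 before scaling to percent
    return pd.crosstab(subset["Weather_Condition"], subset["Severity"], normalize="index") * 100

# Toy rows for illustration only
toy = pd.DataFrame({
    "Weather_Condition": ["Fair", "Fair", "Fair", "Fair", "Rain", "Rain", "Fog"],
    "Severity": [2, 2, 2, 3, 2, 3, 4],
})
toy_mix = severity_mix_by_weather(toy, top_n=2)
```

If the rows of `severity_mix_by_weather(df)` look alike, weather has little visible link to severity in this data; large differences would be worth a closer look.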

In [None]:
df['Severity'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Accidents by Severity')


79.7% of accidents are of low severity, meaning they had relatively little impact on traffic.

In [None]:
# Plot a 0.1% random sample; drawing all ~7.7 million points would be very slow
sample=df.sample(int(0.001*len(df)))
sns.scatterplot(x='Start_Lng',y='Start_Lat',data=sample,size='Severity',hue='Severity')

In [None]:
# Use a name other than `map` so the Python built-in is not shadowed
accident_map=folium.Map(location=[39.8283, -98.5795], width="100%", height="100%")
sample=df.sample(int(0.001*len(df)))
heat_data=list(zip(sample['Start_Lat'],sample['Start_Lng']))
HeatMap(heat_data).add_to(accident_map)
accident_map

Coastal regions account for a higher number of accidents.

### Summary and Conclusion
            Insights:
            The number of accidents increases year over year (2023 covers only the first 3 months).
            Less than 5% of cities have 10,000 or more accidents over the period covered by the dataset.
            Coastal regions account for a higher number of accidents.
            More accidents take place on weekdays than on weekends.
            Accidents are more frequent during the winter season.
            New York, despite being the most populated city, has a relatively low number of accidents. This needs to be checked.
            Jan 2016 does not contain complete data, which might lead to incorrect analysis.
            Over 1200 cities have reported just one accident. This needs further analysis.