# 🚗 US Accidents Analysis
## 📊 By Sarthak Patil<br>
Accidents are unpredictable, yet they leave behind patterns that can help us better understand, prevent, and respond to them. This project explores the US Accidents dataset, a rich, countrywide collection of car accident records covering 49 U.S. states from February 2016 to March 2023. The dataset contains over 7.7 million accident reports, making it one of the most comprehensive sources available for transportation and safety analysis.

The data was aggregated via various APIs that stream live traffic incidents, sourcing information from state departments of transportation, law enforcement agencies, traffic cameras, and road sensors. As such, it provides an authentic, multi-source view of traffic incidents across the nation.<br>
_________________________________________________________________________________
## 🔍 Why This Analysis Matters <br>
Understanding traffic accidents isn't just about numbers—it's about saving lives, improving infrastructure, and designing smarter cities. This dataset can be leveraged for:

🚨 Real-time accident prediction and alert systems

🗺️ Hotspot detection to identify high-risk areas

🧠 Casualty and severity analysis

🌧️ Studying the impact of weather, visibility, and other environmental factors

📈 Modeling cause-and-effect relationships in accident occurrence

_________________________________________________________________________________
## 🚧 Note from Author
This notebook is a basic and initial version of the analysis. Many more insights, visualizations, and predictive modeling components will be added in future updates.
Stay tuned as I enhance this project step-by-step!




# **US Accidents Exploratory Data Analysis**

TODO - talk about EDA

TODO - talk about the dataset (source, what it contains, how it will be useful)

1. Source: Kaggle<br>
2. Information about accidents<br>
3. Can be useful for preventing accidents<br>
4. Mention that this does not contain data about **New York**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
%pip install opendatasets --upgrade --quiet

In [None]:
import opendatasets as od

download_url = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'

od.download(download_url)

In [None]:
data_filename = './us-accidents/US_Accidents_March23.csv'

## **Data Preparation and Cleaning**

1. Load the file using Pandas
2. Look at some information about the data & the columns
3. Fix any missing or incorrect values
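The three steps above can be sketched on a tiny stand-in table before running them on the full file. This is only an illustration: the rows are made up, and filling with the median or dropping rows without a city are assumed choices, not necessarily the cleaning this notebook will end up using.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; the column names match the dataset,
# but the rows are made up purely for illustration.
csv_text = """ID,City,Start_Lat,Start_Lng,Temperature(F)
A-1,Dayton,39.86,-84.05,36.9
A-2,Miami,25.76,-80.19,
A-3,,34.05,-118.24,64.0
"""

# 1. Load the file
demo_df = pd.read_csv(io.StringIO(csv_text))

# 2. Look at the columns and their dtypes
print(demo_df.dtypes)

# 3. Handle missing values (assumed strategy):
#    fill the numeric gap with the column median, drop rows without a city
median_temp = demo_df["Temperature(F)"].median()
demo_df["Temperature(F)"] = demo_df["Temperature(F)"].fillna(median_temp)
clean_df = demo_df.dropna(subset=["City"])
print(clean_df)
```

Which strategy is appropriate depends on the column; for the real data it is worth checking the missing percentages first.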


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(data_filename)

In [None]:
df

In [None]:
df.columns


In [None]:
len(df.columns)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_df = df.select_dtypes(include=numerics)
len(numeric_df.columns)

**Percentage of missing values per column**

In [None]:
missing_percentages = df.isna().sum().sort_values(ascending=False) / len(df) * 100  # percent of rows missing per column
missing_percentages

In [None]:
type(missing_percentages)

In [None]:
missing_percentages[missing_percentages!=0]

In [None]:
missing_percentages[missing_percentages!=0].plot(kind='barh')

**Remove columns that you don't want to use.**
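One possible way to do this is to drop every column whose share of missing values is above some cutoff. The helper name `drop_sparse_columns` and the 40% cutoff below are assumptions made for illustration, and the demo table is made up; the right cutoff for the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.4):
    """Return a copy of `frame` without columns whose missing fraction exceeds `max_missing`."""
    missing_fraction = frame.isna().mean()  # fraction of NaN values per column
    keep = missing_fraction[missing_fraction <= max_missing].index
    return frame[keep].copy()

# Made-up demo data that reuses a few column names from the dataset
demo = pd.DataFrame({
    "City": ["Dayton", "Miami", None, "Austin"],   # 25% missing, kept
    "End_Lat": [np.nan, np.nan, np.nan, 30.27],    # 75% missing, dropped
    "Severity": [2, 3, 2, 1],                      # complete, kept
})

trimmed = drop_sparse_columns(demo, max_missing=0.4)
print(list(trimmed.columns))
```

On the real data the same call would be `drop_sparse_columns(df)`, which leaves `df` itself unchanged.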

# **Exploratory Analysis and Visualization**

Columns we'll analyze:

1. City<br>
2. Start_Lat, Start_Lng<br>


In [None]:
df.columns

## City

In [None]:
df.City

In [None]:
cities=df.City.unique()
len(cities)

In [None]:
cities_by_accident=df.City.value_counts()
cities_by_accident


In [None]:
cities_by_accident[:10]

In [None]:
cities_by_accident[:20].plot(kind='barh')

In [None]:
import seaborn as sns
sns.set_style('darkgrid')

In [None]:
sns.histplot(cities_by_accident, log_scale=True)

In [None]:
cities_by_accident[cities_by_accident == 1]

## Start Latitude & Longitude

In [None]:
df.Start_Lat

In [None]:
df.Start_Lng

In [None]:
sample_df = df.sample(int(0.1 * len(df)))

In [None]:
sns.scatterplot(x=sample_df.Start_Lng, y=sample_df.Start_Lat, s=1)  # s sets the marker size; size= would map a data variable

In [None]:
import folium

In [None]:
lat, lon = df.Start_Lat[0], df.Start_Lng[0]
lat, lon

In [None]:
# Note: .items() iterates column by column, so this prints the two sampled columns
for x in df[['Start_Lat', 'Start_Lng']].sample(100).items():
    print(x[1])


In [None]:
zip(list(df.Start_Lat), list(df.Start_Lng))

In [None]:
from folium.plugins import HeatMap

In [None]:
sample_df = df.sample(int(0.001 * len(df)))
lat_lon_pairs = list(zip(list(sample_df.Start_Lat), list(sample_df.Start_Lng)))

In [None]:
accident_map = folium.Map()  # avoid shadowing the built-in map()
HeatMap(lat_lon_pairs).add_to(accident_map)
accident_map
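The sampling and pairing steps used for the heatmap can be wrapped in a small helper. This is a sketch under assumptions: the name `sample_lat_lon_pairs`, the fixed seed, and the `dropna()` guard against missing coordinates are additions for illustration, and the demo coordinates are made up.

```python
import pandas as pd

def sample_lat_lon_pairs(frame, fraction=0.001, seed=42):
    """Sample rows and return [lat, lon] pairs, skipping rows with missing coordinates."""
    coords = frame[["Start_Lat", "Start_Lng"]].dropna()
    sampled = coords.sample(frac=fraction, random_state=seed)  # seed makes the sample repeatable
    return sampled.values.tolist()

# Made-up coordinates; the last row has a missing latitude on purpose
demo = pd.DataFrame({
    "Start_Lat": [30.0 + i for i in range(10)] + [None],
    "Start_Lng": [-100.0 + i for i in range(11)],
})

pairs = sample_lat_lon_pairs(demo, fraction=0.5)
print(len(pairs))
```

The resulting list has the same shape as `lat_lon_pairs` and can be passed to `HeatMap` in the same way.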

In [61]:
df.columns


Index(['ID', 'Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat',
       'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
       'Street', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
       'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
       'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
       'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
       'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')

# **Summary and Conclusion**

Insights:

1. No data for New York City.
2. Fewer than 8% of cities have more than 1000 yearly accidents.
3. Over 1000 cities have reported just one accident (needs investigation).
4. The number of accidents per city decreases exponentially.
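The figures behind insights 2 and 3 can be recomputed from a per-city count series such as `cities_by_accident`. The helper below is a rough sketch: `summarize_city_counts` is a hypothetical name, and the demo counts are made up. Note that `value_counts()` gives totals over the whole period covered by the data, so yearly figures would need an extra step, such as dividing by the number of years.

```python
import pandas as pd

def summarize_city_counts(counts, threshold=1000):
    """Summarize a Series of accident counts indexed by city."""
    total = len(counts)
    return {
        "n_cities": total,
        "pct_above_threshold": float(100 * (counts > threshold).sum() / total),
        "n_single_accident": int((counts == 1).sum()),
    }

# Made-up counts, standing in for df.City.value_counts()
demo_counts = pd.Series(
    {"Miami": 5000, "Houston": 1200, "Dayton": 40, "Smallville": 1, "Tinytown": 1}
)

summary = summarize_city_counts(demo_counts)
print(summary)
```

On the real data the call would be `summarize_city_counts(cities_by_accident)`.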