## Project 1: US Traffic Accidents

### Overview

This project is a diagnostic analysis of traffic accident data from across the United States. The dataset contains roughly 7.7 million accident records, so the study uses Python (Pandas, PySpark) for data processing and Matplotlib/Seaborn for exploratory visualization.

The goal is to transition from descriptive analytics (what happened) to prescriptive insights (what should be done). The final deliverables include a reproducible computational notebook, an interactive Tableau Dashboard for spatial exploration, and a strategic briefing for DOT stakeholders.

### Business Understanding

1. **The Challenge**

    Traffic fatalities and road accidents represent a significant public health crisis and a massive economic burden, costing billions in property damage, healthcare, and lost productivity. The Department of Transportation (DOT) faces the challenge of allocating limited resources—such as highway patrols, infrastructure upgrades, and emergency response teams—across a vast national network. Without data-driven prioritization, interventions are often reactive rather than proactive.

2. **Primary Objective**

    The objective of this analysis is to identify the high-risk variables that contribute to accident frequency and severity. By uncovering hidden patterns in temporal trends, weather impacts, and infrastructure flaws, we aim to provide the DOT with three high-impact, data-driven recommendations to:

    - **Reduce Accident Frequency**: By identifying "hotspots" and high-risk time windows.

    - **Mitigate Severity**: By understanding which environmental or infrastructural factors lead to fatal outcomes versus minor collisions.

    - **Optimize Resource Allocation**: By providing a blueprint for where the DOT should implement safety measures (e.g., improved lighting, signage, or traffic calming).

3. **Key Stakeholders**

    - Department of Transportation (DOT) executives

    - Urban planners and engineers

    - Public safety officials

### Data Understanding

In [1]:
from pyspark.sql import SparkSession

# 'local[*]' uses every available core; plain 'local' runs Spark on a single worker thread.
spark = SparkSession.builder.master('local[*]').getOrCreate()

In [None]:
import os

print("Current Directory:", os.getcwd(), '\n')
print("Files in directory:", os.listdir('.'))

Current Directory: /home/jovyan 

Files in directory: ['.bash_logout', '.bashrc', '.profile', '.ipython', '.npm', '.local', '.conda', '.config', '.cache', '.jupyter', '.wget-hsts', 'work']


In [20]:
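# header=True takes column names from the first line; inferSchema=True asks Spark to detect column types.
spark_df = spark.read.csv('./work/data/US_Accidents_March23.csv', inferSchema=True, header=True)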

In [23]:
spark_df.dtypes

[('ID', 'string'),
 ('Source', 'string'),
 ('Severity', 'int'),
 ('Start_Time', 'timestamp'),
 ('End_Time', 'timestamp'),
 ('Start_Lat', 'double'),
 ('Start_Lng', 'double'),
 ('End_Lat', 'double'),
 ('End_Lng', 'double'),
 ('Distance(mi)', 'double'),
 ('Description', 'string'),
 ('Street', 'string'),
 ('City', 'string'),
 ('County', 'string'),
 ('State', 'string'),
 ('Zipcode', 'string'),
 ('Country', 'string'),
 ('Timezone', 'string'),
 ('Airport_Code', 'string'),
 ('Weather_Timestamp', 'timestamp'),
 ('Temperature(F)', 'double'),
 ('Wind_Chill(F)', 'double'),
 ('Humidity(%)', 'double'),
 ('Pressure(in)', 'double'),
 ('Visibility(mi)', 'double'),
 ('Wind_Direction', 'string'),
 ('Wind_Speed(mph)', 'double'),
 ('Precipitation(in)', 'double'),
 ('Weather_Condition', 'string'),
 ('Amenity', 'boolean'),
 ('Bump', 'boolean'),
 ('Crossing', 'boolean'),
 ('Give_Way', 'boolean'),
 ('Junction', 'boolean'),
 ('No_Exit', 'boolean'),
 ('Railway', 'boolean'),
 ('Roundabout', 'boolean'),
 ('Station
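
The inferred types above could also be pinned down as an explicit schema, which would spare Spark the extra pass over the file that `inferSchema=True` requires. The following is a minimal sketch, assuming only the first five columns for brevity (the dtypes output above is truncated, so a real schema would need every column in file order). `to_ddl` is a hypothetical helper, not part of PySpark.

```python
# First five (name, type) pairs copied from the dtypes output above.
# Assumption: a real schema would list every column, in file order.
inferred_dtypes = [
    ("ID", "string"),
    ("Source", "string"),
    ("Severity", "int"),
    ("Start_Time", "timestamp"),
    ("End_Time", "timestamp"),
]


def to_ddl(dtypes):
    """Build a DDL schema string, backtick-quoting every column name."""
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in dtypes)


schema_ddl = to_ddl(inferred_dtypes)
print(schema_ddl)
```

A string in this DDL form can be passed to `spark.read.csv` through its `schema` argument in place of `inferSchema=True`. The backticks keep names such as `Distance(mi)` parseable.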

In [24]:
spark_df.count()

7728394

In [25]:
spark_df.head()

Row(ID='A-1', Source='Source2', Severity=3, Start_Time=datetime.datetime(2016, 2, 8, 5, 46), End_Time=datetime.datetime(2016, 2, 8, 11, 0), Start_Lat=39.865147, Start_Lng=-84.058723, End_Lat=None, End_Lng=None, Distance(mi)=0.01, Description='Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4.', Street='I-70 E', City='Dayton', County='Montgomery', State='OH', Zipcode='45424', Country='US', Timezone='US/Eastern', Airport_Code='KFFO', Weather_Timestamp=datetime.datetime(2016, 2, 8, 5, 58), Temperature(F)=36.9, Wind_Chill(F)=None, Humidity(%)=91.0, Pressure(in)=29.68, Visibility(mi)=10.0, Wind_Direction='Calm', Wind_Speed(mph)=None, Precipitation(in)=0.02, Weather_Condition='Light Rain', Amenity=False, Bump=False, Crossing=False, Give_Way=False, Junction=False, No_Exit=False, Railway=False, Roundabout=False, Station=False, Stop=False, Traffic_Calming=False, Traffic_Signal=False, Turning_Loop=False, Sunrise_Sunset='Night', Civil_Twilight='Night', Nautical_
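
The first record already shows missing values (`End_Lat`, `End_Lng`, `Wind_Chill(F)`, `Wind_Speed(mph)`). As a rough sketch of how such gaps could be listed for a single record, the snippet below uses a hand-copied subset of that row as a plain dictionary, the kind of mapping `Row.asDict()` returns. `missing_fields` is a hypothetical helper introduced only for illustration.

```python
# Subset of the first record shown above, copied by hand as a plain dict.
first_row = {
    "ID": "A-1",
    "Severity": 3,
    "Start_Lat": 39.865147,
    "Start_Lng": -84.058723,
    "End_Lat": None,
    "End_Lng": None,
    "Temperature(F)": 36.9,
    "Wind_Chill(F)": None,
    "Wind_Speed(mph)": None,
    "Precipitation(in)": 0.02,
}


def missing_fields(record):
    """Return the names of fields whose value is None, in record order."""
    return [name for name, value in record.items() if value is None]


print(missing_fields(first_row))
```

A single row says nothing about how widespread these gaps are; a per-column null count over the full DataFrame would be the natural follow-up during data preparation.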

In [26]:
spark.stop()
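
Since the objective calls out temporal trends and high-risk time windows, the `Start_Time` and `End_Time` columns are likely inputs for derived features. The sketch below illustrates the idea on the timestamps of the first record shown above, using only the standard library. `temporal_features` and its output keys are hypothetical names, not part of the notebook.

```python
from datetime import datetime

# Timestamps copied from the first record shown above.
start_time = datetime(2016, 2, 8, 5, 46)
end_time = datetime(2016, 2, 8, 11, 0)


def temporal_features(start, end):
    """Derive hour of day, weekday (Monday=0), and duration in minutes."""
    return {
        "hour": start.hour,
        "weekday": start.weekday(),
        "duration_min": (end - start).total_seconds() / 60,
    }


features = temporal_features(start_time, end_time)
print(features)
```

On the full dataset the same three features would be computed column-wise rather than per record.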

### Data Preparation

### Analysis

### Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

##### Tableau Dashboard link

### Conclusion and Next Steps