# CS 163 Final Project EDA Summary

In [None]:
# ================== IMPORT LIBRARIES ==================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from fpdf import FPDF
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import HeatMap

In [None]:
# ================== SECTION 1: Load Dataset ==================
file_path = "C:/Users/sherv/Desktop/SP25/CS163/US_Accidents_March23.csv"

# Load dataset (keep original unchanged)
df_original = pd.read_csv(file_path, low_memory=False)
df = df_original.copy()  # Work on a copy

In [None]:
# ================== SECTION 2: Data Cleaning & Preprocessing ==================
# Select relevant columns for analysis
columns_to_keep = ["Severity", "Start_Time", "State", "Temperature(F)", "Humidity(%)",
                   "Visibility(mi)", "Precipitation(in)", "Weather_Condition"]
df = df[columns_to_keep].copy()  # Explicit copy avoids SettingWithCopyWarning on later column assignments

# Convert 'Start_Time' to datetime format
df["Start_Time"] = pd.to_datetime(df["Start_Time"], errors='coerce')

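# Identify missing values per column
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Drop rows with any missing value (including unparseable Start_Time) for simplicity
df = df.dropna()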
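
Before dropping rows, it can help to see how much data each column would cost. The sketch below is one possible way to summarize the `missing_values` step as counts and percentages; `missing_report` is a hypothetical helper (not part of the original notebook), and the small `toy` frame only stands in for the real dataset.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)


# Toy stand-in for the accident data (values are illustrative only)
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4],
    "Precipitation(in)": [0.0, np.nan, np.nan, 0.1],
    "Visibility(mi)": [10.0, 5.0, np.nan, 2.0],
})

report = missing_report(toy)
print(report)

# Compare row counts before and after dropping incomplete rows
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"dropna() keeps {rows_after} of {rows_before} rows")
```

If one column accounts for most of the missing values, dropping or imputing that column alone may preserve far more rows than a blanket `dropna()`.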

In [None]:
# ================== SECTION 3: Descriptive Statistics ==================
# Summary statistics
summary_stats = df.describe()
print("Summary Statistics:\n", summary_stats)

# Correlation Matrix (only numerical columns)
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
print("\nCorrelation Matrix:\n", corr_matrix)


Across the recorded accidents, the average temperature is 61.4°F, humidity averages 65.9%, and precipitation is low (mean = 0.0085 inches). The correlation matrix shows only weak linear correlations between severity and environmental factors (e.g., precipitation, temperature, humidity), suggesting that these variables alone are not strong predictors of accident severity. However, humidity and visibility are moderately negatively correlated (-0.41): higher humidity tends to coincide with lower visibility, which could indirectly affect accident rates.
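
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association. Because Severity is an ordinal 1-4 scale, a rank-based (Spearman) correlation is a reasonable cross-check. The sketch below uses a small illustrative frame, not the real data, to show both calls side by side.

```python
import pandas as pd

# Toy values for illustration only; column names mirror the notebook's
toy = pd.DataFrame({
    "Severity": [1, 2, 2, 3, 4],
    "Visibility(mi)": [10.0, 8.0, 7.0, 3.0, 0.5],
    "Humidity(%)": [40.0, 55.0, 60.0, 80.0, 95.0],
})

pearson = toy.corr(method="pearson")    # linear association (the default)
spearman = toy.corr(method="spearman")  # rank-based, suits ordinal Severity

print("Pearson:\n", pearson.round(2))
print("\nSpearman:\n", spearman.round(2))
```

If the two matrices disagree noticeably on the real data, that would hint at monotonic but non-linear relationships that the Pearson values understate.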



In [None]:
# ================== SECTION 4: Data Visualizations ==================
# 1️⃣ Accident Severity Distribution Bar Chart
# Count the number of accidents per severity level
severity_counts = df["Severity"].value_counts().sort_index()

# Create a bar chart
plt.figure(figsize=(8, 5))
sns.barplot(x=severity_counts.index, y=severity_counts.values, palette="Blues")

# Annotate bars with exact counts
for i, count in enumerate(severity_counts.values):
    plt.text(i, count + 10000, f"{count:,}", ha="center", fontsize=12)

# Labels and title
plt.title("Accident Severity Distribution", fontsize=14)
plt.xlabel("Severity Level", fontsize=12)
plt.ylabel("Number of Accidents", fontsize=12)
plt.xticks(ticks=[0, 1, 2, 3], labels=["1 (Low)", "2 (Moderate)", "3 (High)", "4 (Severe)"])

# Show the plot
plt.show()


The Accident Severity Distribution bar chart reveals that Severity Level 2 (Moderate) accounts for the majority of accidents (~3.9 million), followed by Severity Level 3 (High), with much smaller shares for Severity Level 1 (Low) and Level 4 (Severe). This suggests that most accidents cause moderate disruptions to traffic flow rather than extreme consequences.
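
Raw counts can be complemented with each level's share of the total, which makes the class imbalance explicit. A minimal sketch, using a made-up Severity column rather than the real data:

```python
import pandas as pd

# Illustrative severity values only (not drawn from the dataset)
severity = pd.Series([2, 2, 2, 3, 2, 1, 4, 2, 3, 2], name="Severity")

# normalize=True returns fractions of the total instead of counts
shares = severity.value_counts(normalize=True).sort_index()

for level, share in shares.items():
    print(f"Severity {level}: {share:.1%}")
```

On the real DataFrame the same call would be `df["Severity"].value_counts(normalize=True)`.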

In [None]:
# 2️⃣ Heatmap of Feature Correlations
plt.figure(figsize=(8,5))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()


## Key Heatmap Insights:
### Severity vs. Other Factors:

Severity has weak correlations with all other features (-0.03 to 0.04), meaning accident severity isn't strongly dependent on temperature, humidity, visibility, or precipitation.

### Temperature(F) vs. Humidity(%) (-0.34)

Moderate negative correlation: as temperature increases, relative humidity tends to decrease (warmer air can hold more moisture, so the same amount of water vapor yields a lower relative humidity).

### Humidity(%) vs. Visibility(mi) (-0.41)

Moderate negative correlation: higher humidity coincides with lower visibility (likely due to fog, mist, or rain).

### Visibility(mi) vs. Precipitation(in) (-0.12)

Slight negative correlation: increased precipitation (rain/snow) slightly reduces visibility, but the effect is not strong.

### Precipitation(in) vs. Severity (0.02)

No meaningful correlation: higher precipitation is not associated with noticeably higher accident severity.

### What This Means for our EDA
None of these weather variables is a strong predictor of accident severity.

Humidity and visibility are the most closely related pair, which makes sense since fog and precipitation affect visibility.

We may need to explore other factors (e.g., road conditions, time of day, traffic volume) to better predict accident severity.
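
Of the follow-up factors listed, time of day can be derived directly from the existing `Start_Time` column. The sketch below shows one possible approach on a handful of invented timestamps; the `Hour` column name is an assumption introduced here for illustration.

```python
import pandas as pd

# Invented timestamps and severities, for illustration only
toy = pd.DataFrame({
    "Start_Time": pd.to_datetime([
        "2022-01-03 07:15:00", "2022-01-03 07:45:00",
        "2022-01-03 17:30:00", "2022-01-04 17:05:00",
        "2022-01-04 23:50:00",
    ]),
    "Severity": [2, 3, 2, 2, 4],
})

# Extract the hour (0-23) from the datetime column
toy["Hour"] = toy["Start_Time"].dt.hour

# Accident count and mean severity for each hour of the day
by_hour = toy.groupby("Hour")["Severity"].agg(["count", "mean"])
print(by_hour)
```

The `count` column speaks to accident frequency by hour, while `mean` gives a rough view of whether severity shifts at night or during rush hours.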


In [None]:
# 3️⃣ Scatter plot: Precipitation vs. Severity

plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Precipitation(in)"], y=df["Severity"], alpha=0.5, color="blue")

# Apply log scale to precipitation
plt.xscale("log")
plt.xticks([0.1, 1, 5, 10, 20, 50], labels=["0.1", "1", "5", "10", "20", "50"])  # Custom tick labels

# Improve readability
plt.title("Precipitation vs Severity (Log Scale)", fontsize=14)
plt.xlabel("Precipitation (inches, log scale)", fontsize=12)
plt.ylabel("Severity", fontsize=12)

plt.grid(True, which="both", linestyle="--", linewidth=0.5)
plt.show()

The Precipitation vs. Severity (Log Scale) scatter plot shows that most plotted accidents occur at low precipitation levels (<1 inch), with a small number of cases at higher levels. Because Severity takes only four discrete values (1-4), the points fall into horizontal bands, and the log scale spreads out the small precipitation values within each band. Note that a log axis cannot display zero values, so accidents recorded with no precipitation do not appear in this plot. The lack of a clear pattern suggests that precipitation alone does not strongly determine accident severity.
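
An alternative view that keeps dry-weather records is to bin precipitation and compare the severity mix within each bin. The sketch below uses `pd.cut` and `pd.crosstab` on invented values; the bin edges and labels are assumptions chosen for illustration, not thresholds taken from the dataset.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "Precipitation(in)": [0.0, 0.0, 0.02, 0.3, 1.5, 0.0, 0.08, 2.0],
    "Severity": [2, 2, 3, 2, 4, 2, 2, 3],
})

# Hypothetical bin edges; the first bin (-0.001, 0] captures exactly-zero readings
bins = [-0.001, 0.0, 0.1, 1.0, float("inf")]
labels = ["none", "light (<=0.1)", "moderate (<=1)", "heavy (>1)"]
toy["Precip_Bin"] = pd.cut(toy["Precipitation(in)"], bins=bins, labels=labels)

# Share of each severity level within each precipitation bin (rows sum to 1)
table = pd.crosstab(toy["Precip_Bin"], toy["Severity"], normalize="index")
print(table.round(2))
```

If the rows of this table look alike on the real data, that supports the reading that precipitation has little bearing on severity.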

In [None]:
# 4️⃣ Bar Chart: Accidents by State
# Count the number of accidents per state
import matplotlib.ticker as mticker
state_counts = df["State"].value_counts().sort_values(ascending=False)

# Create a bar plot
plt.figure(figsize=(15, 6))
sns.barplot(x=state_counts.index, y=state_counts.values, palette="viridis")

# Improve Y-axis scale (increments of 100,000)
plt.yticks(range(0, max(state_counts.values) + 100000, 100000))  # Set tick intervals

# Format Y-axis labels with commas instead of scientific notation
plt.gca().yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{int(x):,}'))

# Rotate x-axis labels for readability
plt.xticks(rotation=90)

# Add vertical labels on bars
for i, count in enumerate(state_counts.values):
    plt.text(i, count + 10000, f"{count:,}", ha="center", va="bottom", fontsize=9, rotation=90)

# Labels and Title
plt.title("Accident Distribution by State", fontsize=14)
plt.xlabel("State", fontsize=12)
plt.ylabel("Number of Accidents", fontsize=12)

# Show the plot
plt.show()

The Accident Distribution by State bar chart highlights that California (CA), Florida (FL), and Texas (TX) have the highest number of recorded accidents, with California exceeding 1 million accidents. This pattern is consistent with large populations and congested urban areas producing more recorded accidents. Conversely, states like Vermont (VT), South Dakota (SD), and Wyoming (WY) have significantly lower accident counts, likely due to smaller populations and fewer urban roadways.
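
Raw counts favor populous states, so a per-capita rate is one way to compare states on a more equal footing. The sketch below illustrates the arithmetic only: the state codes `AA`, `BB`, `CC` and all numbers are placeholders, and a real analysis would need an external population table, which this notebook does not load.

```python
import pandas as pd

# Placeholder accident counts and populations (fictional states, made-up numbers)
accidents = pd.Series({"AA": 1_000_000, "BB": 200_000, "CC": 50_000})
population = pd.Series({"AA": 40_000_000, "BB": 4_000_000, "CC": 500_000})

# Accidents per 100,000 residents; pandas aligns the two Series on state code
rate = (accidents / population * 100_000).sort_values(ascending=False)
print(rate)
```

In this toy case the ranking reverses once population is accounted for, which is why per-capita rates are worth checking before drawing conclusions from the bar chart alone.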

In [None]:

# ================== Data Cleaning ==================
# Keep only relevant columns and drop missing values
df_map = df_original[["Start_Lat", "Start_Lng"]].dropna()

# Convert DataFrame to a list of [lat, lon] for HeatMap
heat_data = df_map.values.tolist()

# ================== Create Folium Heatmap ==================
# Initialize the map centered in the U.S.
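# The attribution should match the tile provider (CartoDB tiles are built on OpenStreetMap data)
m = folium.Map(location=[37.8, -96], zoom_start=5, tiles="CartoDB Voyager", attr="© OpenStreetMap contributors © CARTO")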


# Add heatmap layer
HeatMap(
    heat_data,
    radius=8,    # Adjust size of heat points
    blur=4,      # Adjust blur effect
    max_zoom=10  # Improve zoom visibility
).add_to(m)

# ================== Save and Display Map ==================
# Save map to an HTML file
m.save("US_Accident_Heatmap.html")

# Display map (if running in Jupyter Notebook)
m
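
Passing every row of a multi-million-record dataset to `HeatMap` can produce a very large HTML file that is slow to render. One common workaround is to plot a random sample. The sketch below demonstrates the sampling step on synthetic coordinates; `max_points` is a hypothetical cap chosen for illustration, and the random points stand in for `df_map`.

```python
import numpy as np
import pandas as pd

# Synthetic coordinates roughly spanning the contiguous U.S. (not real accident data)
rng = np.random.default_rng(seed=0)
n_points = 50_000
coords = pd.DataFrame({
    "Start_Lat": rng.uniform(25.0, 49.0, n_points),
    "Start_Lng": rng.uniform(-124.0, -67.0, n_points),
})

# Hypothetical cap on the number of points sent to the map
max_points = 10_000
sample = coords.sample(n=min(max_points, len(coords)), random_state=42)

# Same [lat, lon] list format that HeatMap expects
heat_data = sample[["Start_Lat", "Start_Lng"]].values.tolist()
print(f"Plotting {len(heat_data):,} of {len(coords):,} points")
```

A fixed `random_state` keeps the sampled map reproducible between runs; the resulting `heat_data` list can be passed to `HeatMap` exactly as in the cell above.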

## General Conclusion from initial EDA

Overall, these findings show that weather conditions alone explain little of the variation in accident severity, while the state-level counts point to urban density, traffic congestion, and other external factors as likely drivers of accident frequency. Further analysis could explore time-of-day trends, road conditions, and traffic volume to refine predictive insights. 🚗📊







