![Add a relevant banner image here](path_to_image)

# Project Title

## Overview

Short project description. Your bottom line up front (BLUF) insights.

## Business Understanding

This data is relevant because many accidents occur each day, harming both people and property. By analyzing it, we can gain insight into the factors that correlate with accidents. With these insights in mind, efforts can go towards building or repairing infrastructure, modifying safety measures, and adjusting insurance policies.

How relevant is the day of the week when it comes to the severity of crashes?

What potentially unseen factors play a larger role in causing accidents than one may assume?

How can one use the statistical knowledge gained here in practical application to help prevent accidents in the future?

These are important questions that can be answered through careful analysis of the data provided.

## Data Understanding

Text here

In [None]:
# Load relevant imports here
import pandas as pd
import seaborn as sns
import numpy as np
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay


In [None]:
df = pd.read_csv('/Users/Joe/Downloads/archive/US_Accidents_March23.csv')
df.head()

In [None]:
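# Only the last bare expression in a cell is displayed, so print each summary explicitly
df.info()
print(df.describe())
print(df.shape)
print(df.isnull().sum())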
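
Raw null counts are hard to compare on a dataset this size, so a percentage view can make it easier to decide which columns to impute and which to drop. The following is a minimal sketch under stated assumptions: `missing_summary` is a hypothetical helper (not part of the original notebook), and the toy DataFrame uses made-up values in place of the accidents data.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Toy data standing in for the accidents DataFrame (values are made up)
toy = pd.DataFrame({
    "Temperature(F)": [70.0, None, 55.0, None],
    "Weather_Condition": ["Clear", "Rain", None, "Fog"],
    "Severity": [2, 3, 2, 1],
})

report = missing_summary(toy)
print(report)
```

In the notebook, the same helper could be called as `missing_summary(df)` once the CSV has been loaded.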

## Data Preparation
Text here

In [None]:
numerical_cols_to_impute = ['Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)']
categorical_cols_to_impute = ['Weather_Condition', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight']

for col in numerical_cols_to_impute:
    if col in df.columns:
        median_val = df[col].median()
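        # Assign back instead of using inplace on a column selection (deprecated chained assignment in pandas 2.x)
        df[col] = df[col].fillna(median_val)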
        print(f"Filled missing values in '{col}' with median value ({median_val:.2f}).")

for col in categorical_cols_to_impute:
    if col in df.columns:
        mode_val = df[col].mode()[0]
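        # Assign back instead of using inplace on a column selection (deprecated chained assignment in pandas 2.x)
        df[col] = df[col].fillna(mode_val)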
        print(f"Filled missing values in '{col}' with mode value ('{mode_val}').")

# Dropping columns with a very high percentage of missing values that are not critical for our analysis.
# Example: 'Number' often has many missing values.
if 'Number' in df.columns:
    df.drop('Number', axis=1, inplace=True)
    print("Dropped 'Number' column due to high number of missing values.")

# Step 4.2: Handling Outliers
print("\nStep 4.2: Handling Outliers...")
# We'll use the Interquartile Range (IQR) method to identify and cap outliers
# for the 'Temperature(F)' column as an example.
if 'Temperature(F)' in df.columns:
    Q1 = df['Temperature(F)'].quantile(0.25)
    Q3 = df['Temperature(F)'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Capping the outliers
    original_outliers = df[(df['Temperature(F)'] < lower_bound) | (df['Temperature(F)'] > upper_bound)].shape[0]
    df['Temperature(F)'] = np.where(df['Temperature(F)'] > upper_bound, upper_bound, df['Temperature(F)'])
    df['Temperature(F)'] = np.where(df['Temperature(F)'] < lower_bound, lower_bound, df['Temperature(F)'])
    print(f"Capped {original_outliers} outliers in 'Temperature(F)' using IQR method.")
    print(f"Temperature(F) values are now capped between {lower_bound:.2f} and {upper_bound:.2f}.")

# Step 4.3: Converting Data Types
print("\nStep 4.3: Converting Data Types...")
# Convert time-related columns from object to datetime
for col in ['Start_Time', 'End_Time']:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')
        print(f"Converted '{col}' to datetime objects.")

# Drop rows where conversion might have failed (resulted in NaT)
df.dropna(subset=['Start_Time', 'End_Time'], inplace=True)

# Step 4.4: Feature Engineering
print("\nStep 4.4: Feature Engineering...")
# Create derived features from the 'Start_Time' column
if 'Start_Time' in df.columns:
    df['Hour'] = df['Start_Time'].dt.hour
    df['DayOfWeek'] = df['Start_Time'].dt.dayofweek # Monday=0, Sunday=6
    df['Month'] = df['Start_Time'].dt.month
    print("Created 'Hour', 'DayOfWeek', and 'Month' features from 'Start_Time'.")

# Calculate the duration of the accident in minutes
if 'Start_Time' in df.columns and 'End_Time' in df.columns:
    df['Duration(min)'] = (df['End_Time'] - df['Start_Time']).dt.total_seconds() / 60
    print("Created 'Duration(min)' feature.")


# --- 5. Final Data Inspection ---
print("\n--- Final Data Inspection ---")
print("Dataset Information after cleaning and preprocessing:")
df.info()

print("\nChecking for any remaining missing values:")
print(df.isnull().sum())


# --- 6. Descriptive Statistics (Post-Processing) ---
print("\n--- Descriptive Statistics (Post-Processing) ---")
print("Summary statistics for numerical columns after cleaning:")
print(df.describe())

print("\nFirst 5 rows of the final preprocessed dataset:")
print(df.head())
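
The IQR capping applied to `Temperature(F)` in the cell above can also be written as a small reusable function, which may be convenient if other numeric columns need the same treatment. This is only a sketch: `cap_outliers_iqr` is a hypothetical name, the multiplier `k=1.5` mirrors the value used in the cell, and the sample temperatures are made up.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN values pass through unchanged."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Series.clip replaces values beyond each bound with the bound itself
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up temperatures with one obvious outlier
temps = pd.Series([50.0, 52.0, 54.0, 56.0, 58.0, 200.0])
capped = cap_outliers_iqr(temps)
print(capped.tolist())
```

Because the function returns a new Series, the result would be assigned back in the notebook, for example `df['Temperature(F)'] = cap_outliers_iqr(df['Temperature(F)'])`.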

## Analysis

Text here

In [None]:
# --- Visualization: Accident Hotspots by City ---

# Get the top 10 cities with the most accidents
top_cities = df['City'].value_counts().nlargest(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_cities.values, y=top_cities.index, palette="plasma", orient='h')

plt.title('Top 10 Cities by Number of Accidents', fontsize=16)
plt.xlabel('Number of Accidents')
plt.ylabel('City')
plt.show()

In [None]:
from scipy import stats


print("--- 1. Chi-Square Test: Day of Week vs. Sunrise/Sunset ---")

# H₀ (Null Hypothesis): There is no association between DayOfWeek and Sunrise_Sunset.
# H₁ (Alternative Hypothesis): There is an association between DayOfWeek and Sunrise_Sunset.

# Create a contingency table
contingency_table = pd.crosstab(df['DayOfWeek'], df['Sunrise_Sunset'])
print("\nContingency Table:")
print(contingency_table)

# Perform the Chi-Square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi-Square Statistic: {chi2:.2f}")
print(f"P-value: {p_value}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("\nConclusion: We reject the null hypothesis (p < 0.05).")
    print("There is a statistically significant association between the day of the week and whether an accident occurs during the day or night.")
else:
    print("\nConclusion: We fail to reject the null hypothesis (p >= 0.05).")
    print("There is no statistically significant association between the day of the week and Sunrise_Sunset.")

# --- 2. ANOVA: Temperature vs. Accident Severity ---
# We want to test if the mean temperature is significantly different across various accident severity levels.

print("\n\n--- 2. ANOVA: Temperature vs. Accident Severity ---")

# H₀ (Null Hypothesis): The mean temperature is the same for all severity levels.
# H₁ (Alternative Hypothesis): At least one severity level has a different mean temperature.

# Prepare data for ANOVA
severity_groups = [df['Temperature(F)'][df['Severity'] == i] for i in sorted(df['Severity'].unique())]

# Assumption Check 1: Homogeneity of Variances (Levene's Test)
levene_stat, levene_p = stats.levene(*severity_groups)
print(f"\nLevene's Test for Homogeneity of Variances: P-value = {levene_p:.3f}")
if levene_p < alpha:
    print("Levene's test is significant (p < 0.05), suggesting variances are not equal.")
    use_kruskal = True
else:
    print("Levene's test is not significant (p >= 0.05), variances are assumed to be equal.")
    use_kruskal = False

# Assumption Check 2: Normality (Shapiro-Wilk Test)
# Note: For very large samples, this test is almost always significant. We proceed with caution.
# We'll check one group as an example.
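first_level = sorted(df['Severity'].unique())[0]  # severity_groups[0] holds the lowest severity level
sample_size = min(5000, len(severity_groups[0]))
shapiro_stat, shapiro_p = stats.shapiro(severity_groups[0].sample(sample_size, random_state=42))
print(f"Shapiro-Wilk Test for Normality (Severity {first_level} sample): P-value = {shapiro_p:.3f}")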
if shapiro_p < alpha:
    print("Shapiro-Wilk test is significant (p < 0.05), suggesting the data may not be normally distributed.")
    use_kruskal = True
else:
    print("Shapiro-Wilk test is not significant (p >= 0.05), normality is assumed.")


# Perform the appropriate test
if use_kruskal:
    print("\nAssumptions for ANOVA were not met. Performing Kruskal-Wallis H-test (non-parametric alternative)...")
    kruskal_stat, kruskal_p = stats.kruskal(*severity_groups)
    print(f"Kruskal-Wallis H-test statistic: {kruskal_stat:.2f}")
    print(f"P-value: {kruskal_p}")
    p_value_anova = kruskal_p
else:
    print("\nPerforming One-Way ANOVA...")
    f_stat, p_value_anova = stats.f_oneway(*severity_groups)
    print(f"F-statistic: {f_stat:.2f}")
    print(f"P-value: {p_value_anova}")

# Interpretation
# Worded without "mean" because the Kruskal-Wallis branch compares rank distributions, not means
if p_value_anova < alpha:
    print("\nConclusion: We reject the null hypothesis (p < 0.05).")
    print("There is a statistically significant difference in temperature across accident severity levels.")
else:
    print("\nConclusion: We fail to reject the null hypothesis (p >= 0.05).")
    print("There is no statistically significant difference in temperature across accident severity levels.")


# --- 3. Correlation Analysis ---
# We will analyze the correlation between key numerical features to understand their relationships.

print("\n\n--- 3. Correlation Analysis ---")

# Select numerical columns for correlation analysis
numerical_cols = ['Severity', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Duration(min)']
correlation_matrix = df[numerical_cols].corr()

# Visualize the correlation matrix with a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()
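
The chi-square test in the cell above pairs `DayOfWeek` with `Sunrise_Sunset`, while the Business Understanding section asks how the day of the week relates to crash severity. One way to approach that question is to run the same test on `DayOfWeek` against `Severity` and report an effect size alongside the p-value, since with very large samples even tiny associations tend to come out as statistically significant. The sketch below assumes Cramér's V as the effect size; `cramers_v` is a hypothetical helper and the toy data is made up purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V effect size for a contingency table (0 = no association, 1 = perfect)."""
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up data: Monday=0 ... Sunday=6, matching the DayOfWeek encoding used in this notebook
toy = pd.DataFrame({
    "DayOfWeek": [0, 0, 1, 1, 5, 5, 6, 6] * 25,
    "Severity": [2, 2, 2, 3, 2, 3, 3, 3] * 25,
})

table = pd.crosstab(toy["DayOfWeek"], toy["Severity"])
chi2, p_value, dof, _ = stats.chi2_contingency(table)
v = cramers_v(table)

print(f"Chi-square: {chi2:.2f}, dof: {dof}, p-value: {p_value:.4g}")
print(f"Cramér's V: {v:.3f}")
```

On the real data, `table` would be built with `pd.crosstab(df['DayOfWeek'], df['Severity'])`. A small V alongside a tiny p-value would suggest the association is real but weak in practical terms.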

## Evaluation

### Business Insight/Recommendation 1

### Business Insight/Recommendation 2

### Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here