0.1 Research Questions

Mohamed Khafagy – Data Cleaning & EDA

    How do monthly crash counts differ across NYC boroughs between 2015 and 2024, and which borough shows the fastest growth in crashes over time?
    What are the top contributing factors associated with crashes that involve at least one injured person?
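One way the first question could be approached is sketched below: count crashes per month and borough, then fit a straight line to each borough's monthly series and compare slopes. The helper names `monthly_counts_by_borough` and `growth_per_month` are hypothetical, the toy rows are synthetic, and a linear slope is only one possible definition of "fastest growth".

```python
import numpy as np
import pandas as pd

def monthly_counts_by_borough(df):
    """Return a month x borough table of crash counts."""
    dates = pd.to_datetime(df["crash_date"], errors="coerce")
    valid = df.assign(month=dates.dt.to_period("M")).dropna(subset=["month", "borough"])
    return valid.groupby(["month", "borough"]).size().unstack(fill_value=0)

def growth_per_month(counts):
    """Least-squares slope (crashes per month) for each borough column."""
    x = np.arange(len(counts))
    slopes = {b: np.polyfit(x, counts[b].to_numpy(dtype=float), 1)[0]
              for b in counts.columns}
    return pd.Series(slopes).sort_values(ascending=False)

# Synthetic rows for illustration only (not real crash data)
toy = pd.DataFrame({
    "crash_date": ["2015-01-05", "2015-02-10", "2015-02-11",
                   "2015-03-01", "2015-03-02", "2015-03-03",
                   "2015-01-07", "2015-02-08", "2015-03-09"],
    "borough": ["BROOKLYN"] * 6 + ["QUEENS"] * 3,
})

counts = monthly_counts_by_borough(toy)
slopes = growth_per_month(counts)
print(counts)
print(slopes)
```

Months with no crashes in any borough would be absent from the index, so on real data the table may need reindexing over a full monthly range before fitting.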


In [None]:
# ============================================
# 1. Load Required Libraries
# ============================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Step 2: Load the NYC crash datasets (Crashes + Persons)

url_crash = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"
url_persons = "https://data.cityofnewyork.us/resource/f55k-p6yu.csv"

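# Raw pull with no query parameters, used only for exploration.
# Note: Socrata endpoints return 1,000 rows by default unless $limit is set,
# so these frames may not hold the full datasets.
df_crash_raw = pd.read_csv(url_crash)
df_persons_raw = pd.read_csv(url_persons)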


# 3. Preview Shapes & First Rows
print("Raw Crash dataset shape:", df_crash_raw.shape)
print("Raw Persons dataset shape:", df_persons_raw.shape)

df_crash_raw.head()


2. Raw Dataset Structure (info())

Before any cleaning, we inspect the structure of the raw datasets:

    Check column names
    Check for incorrect data types (CSV fields arrive as text, so pandas infers the types and date columns stay as strings)
    Check missing values
    Understand which fields require cleaning
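The checks listed above can be combined into a single per-column report. The following is a minimal sketch on a synthetic frame; `summarize_columns` is a hypothetical helper, not part of pandas, and the toy column names only mirror the ones used in this notebook.

```python
import pandas as pd

def summarize_columns(df):
    """One row per column: inferred dtype, missing count, missing percentage."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing data come first
    return summary.sort_values("pct_missing", ascending=False)

# Synthetic rows for illustration only
toy = pd.DataFrame({
    "crash_date": ["2021-01-01", "2021-01-02", None, "2021-01-04"],
    "borough": ["BROOKLYN", None, None, "QUEENS"],
    "number_of_persons_injured": [0, 1, 2, 0],
})

report = summarize_columns(toy)
print(report)
```

Calling `summarize_columns(df_crash_raw)` would give the same view for the raw crash data, which makes it easier to see which fields need type conversion or imputation.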


In [None]:
print("---- CRASHES RAW INFO ----")
df_crash_raw.info()

print("\n---- PERSONS RAW INFO ----")
df_persons_raw.info()


In [None]:
missing_crash = df_crash_raw.isna().sum().sort_values(ascending=False)
missing_persons = df_persons_raw.isna().sum().sort_values(ascending=False)

print("Top 10 columns by missing values (CRASH dataset):")
print(missing_crash.head(10))

print("\nTop 10 columns by missing values (PERSONS dataset):")
print(missing_persons.head(10))


In [None]:
# Step 6: Quick look at main columns
df_crash_raw[['crash_date', 'borough', 'vehicle_type_code1', 'contributing_factor_vehicle_1']].head(10)


In [None]:
# --- Quick dataset overview
print("Crashes shape:", df_crash_raw.shape)
print("Persons shape:", df_persons_raw.shape)

print("\n--- Crashes Info ---")
df_crash_raw.info()

print("\n--- Persons Info ---")
df_persons_raw.info()

print("\nMissing value percentages (Crashes):")
print(df_crash_raw.isna().mean().sort_values(ascending=False).head(10))

Load Cleanable Subset (50,000 rows)

We use a 50,000-row subset for cleaning, because:

    full dataset has millions of rows
    cleaning millions of rows is slow
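    patterns should stay broadly consistent if the subset is representative (the first 50,000 rows returned are not a random sample, so this is worth checking)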
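When requesting a subset, note that Socrata endpoints return only 1,000 rows by default, so the row limit has to be passed in the URL itself. The sketch below builds such a URL with the SoQL `$limit` and `$order` parameters; `socrata_csv_url` is a hypothetical helper, and ordering by `crash_date` is an assumption about what a useful subset looks like.

```python
from urllib.parse import urlencode

def socrata_csv_url(base_url, limit, order=None):
    """Append SoQL paging parameters to a Socrata CSV endpoint."""
    params = {"$limit": limit}
    if order is not None:
        params["$order"] = order
    # safe="$" keeps the leading dollar sign of SoQL parameter names unescaped
    return f"{base_url}?{urlencode(params, safe='$')}"

base = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"
url_subset = socrata_csv_url(base, 50000)
url_recent = socrata_csv_url(base, 10, order="crash_date DESC")
print(url_subset)
print(url_recent)
```

The resulting string can be passed directly to `pd.read_csv`. For a subset closer to a random sample, one option is to download a larger block and call `DataFrame.sample` on it.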


In [None]:
#  Step 1: imports
import pandas as pd

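# Load a 50,000-row subset from NYC Open Data.
# $limit is set explicitly because the endpoint returns only 1,000 rows by default;
# nrows alone cannot fetch more rows than the server sends.
url_crash = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=50000"
url_persons = "https://data.cityofnewyork.us/resource/f55k-p6yu.csv?$limit=50000"

crash = pd.read_csv(url_crash, nrows=50000)
persons = pd.read_csv(url_persons, nrows=50000)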

print("Crash sample shape:", crash.shape)
print("Persons sample shape:", persons.shape)

crash.head()


In [None]:
#  Step 4: inspect columns and data types
crash.info()

In [None]:
#  Step 5: select key columns
important_cols = ['crash_date', 'borough', 'vehicle_type_code1', 'contributing_factor_vehicle_1']
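# .copy() makes an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning
df_selected = crash[important_cols].copy()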

# Show first 5 rows
df_selected.head()

In [None]:
#  Step 6: count crashes per borough
borough_counts = df_selected['borough'].value_counts()

#  Step 7: visualize
import matplotlib.pyplot as plt

plt.figure(figsize=(7,4))
borough_counts.plot(kind='bar', color='skyblue', edgecolor='black')

plt.title('Crashes per Borough', fontsize=14)
plt.xlabel('Borough')
plt.ylabel('Number of Crashes')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


In [None]:
#  Step 8: prepare crash_date as datetime
df_selected['crash_date'] = pd.to_datetime(df_selected['crash_date'], errors='coerce')

#  Step 9: group by month
monthly_crashes = df_selected.groupby(df_selected['crash_date'].dt.to_period('M')).size()

#  Step 10: visualize monthly trend
plt.figure(figsize=(10,5))
monthly_crashes.plot(kind='line', marker='o', color='orange')
plt.title('Crashes Over Time', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Number of Crashes')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


In [None]:
#  Step 11: find top 10 crash causes
top_factors = df_selected['contributing_factor_vehicle_1'].value_counts().head(10)

#  Step 12: visualize them
plt.figure(figsize=(8,5))
top_factors.plot(kind='barh', color='lightcoral', edgecolor='black')
plt.title('Top 10 Contributing Factors')
plt.xlabel('Number of Crashes')
plt.ylabel('Contributing Factor')
plt.gca().invert_yaxis()  # so the top factor appears at the top
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.show()




Analysis Summary

Most crashes occur in Brooklyn, followed by Queens and the Bronx.

The dataset shows a visible peak in crashes around 2021–2022, which could reflect either real-world trends or differences in data completeness within the loaded subset.

The top contributing factors are Unspecified and Driver Inattention/Distraction. Unspecified is a placeholder rather than a cause, so distracted driving is the leading identified factor.

These patterns will inform later cleaning, integration, and dashboard design.
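The factor ranking above covers all crashes, while the second research question concerns only crashes with at least one injured person. A minimal sketch of that filter follows. It assumes the crash table has a `number_of_persons_injured` column; `top_injury_factors` is a hypothetical helper and the rows are synthetic.

```python
import pandas as pd

def top_injury_factors(df, n=10, drop_unspecified=True):
    """Rank contributing factors among crashes with at least one injury."""
    injured = pd.to_numeric(df["number_of_persons_injured"], errors="coerce").fillna(0)
    factors = df.loc[injured > 0, "contributing_factor_vehicle_1"].dropna()
    if drop_unspecified:
        # "Unspecified" is a placeholder value, not an actual cause
        factors = factors[factors.str.strip().str.lower() != "unspecified"]
    return factors.value_counts().head(n)

# Synthetic rows for illustration only
toy = pd.DataFrame({
    "number_of_persons_injured": [0, 1, 2, 1, 0, 3],
    "contributing_factor_vehicle_1": [
        "Unspecified", "Driver Inattention/Distraction", "Unspecified",
        "Driver Inattention/Distraction", "Unsafe Speed", "Unsafe Speed",
    ],
})

top = top_injury_factors(toy)
top_all = top_injury_factors(toy, drop_unspecified=False)
print(top)
```

Dropping the placeholder is a design choice: keeping it (`drop_unspecified=False`) shows how much of the injury data lacks a recorded cause.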


  - Initial dataset loading and overview.
  - Implemented EDA for crashes dataset (borough counts, time trends, top contributing factors).  
  - Documented the main crash-related insights.