# CitiBike Accident Prediction

We need to transform the CitiBike and NYPD datasets into a structured format where we can predict the likelihood of a CitiBike accident. Below is a step-by-step breakdown of how we should engineer meaningful features.

# Step 1: Define the Prediction Goal
- We want to predict the probability of an accident for a CitiBike trip.

- Target Variable (accident_risk): Binary (1 = Accident, 0 = No Accident).
- How to define an accident? → If a CitiBike trip starts or ends near an accident location, it's classified as an accident-prone ride.

 
# Step 2: Feature Engineering
We will extract features from both datasets to enrich CitiBike trip records.

## CitiBike Features
Time-Based Features 

- hour_of_day: Extracted from started_at
- day_of_week: Extracted from started_at (0 = Monday, 6 = Sunday)
- weekend: Boolean flag (1 = Weekend, 0 = Weekday)
- season: Winter, Spring, Summer, Fall (Derived from started_at)
- rush_hour: Boolean flag (1 = Peak commuting hours, 0 = Off-peak)

Trip-Based Features

- trip_duration: Difference between ended_at and started_at
- trip_distance: Approximate using Haversine formula between start_lat, start_lng and end_lat, end_lng
- member_casual: Encoded (0 = Casual, 1 = Member)
- rideable_type: One-hot encode different bike types (e.g., classic, electric)

Geographical Features

- start_accident_count: Number of accidents within 100m radius of start_lat, start_lng
- end_accident_count: Number of accidents within 100m radius of end_lat, end_lng
- start_high_risk: Boolean flag (1 = High accident zone, 0 = Safe zone)
- end_high_risk: Boolean flag (1 = High accident zone, 0 = Safe zone)
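
The trip_distance feature above refers to the Haversine formula. A minimal vectorized sketch is shown here, assuming coordinates in decimal degrees and a spherical Earth; the function name `haversine_km` and the radius constant are illustrative choices, not part of the original notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km; accepts scalars or array-likes in degrees."""
    lat1, lng1, lat2, lng2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lng1, lat2, lng2)
    )
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine term: squared half-chord length between the two points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because it works on whole arrays, it can be called once on the four coordinate columns instead of row by row, which may matter for a table with millions of trips.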

## NYPD Features
Accident Severity Features 

- cyclist_injured: Number of cyclists injured (NUMBER OF CYCLIST INJURED)
- cyclist_killed: Number of cyclist fatalities (NUMBER OF CYCLIST KILLED)
- total_injuries: Sum of all injuries in the accident (NUMBER OF PERSONS INJURED)
- total_fatalities: Sum of all fatalities in the accident (NUMBER OF PERSONS KILLED)

Location-Based Features 

- borough: Categorical encoding (One-hot encoding)
- zip_code: Categorical encoding (One-hot encoding)
- high_risk_area: Boolean flag (1 = More than X accidents in the region)

Time-Based Features

- crash_hour: Extracted from CRASH TIME
- crash_weekend: Boolean flag (1 = Weekend, 0 = Weekday)
- crash_rush_hour: Boolean flag (1 = Peak traffic hours)

Accident Cause Features

- contributing_factor_vehicle_1: One-hot encoding of CONTRIBUTING FACTOR VEHICLE 1
- contributing_factor_vehicle_2: One-hot encoding of CONTRIBUTING FACTOR VEHICLE 2
- vehicle_type_code_1: One-hot encoding of VEHICLE TYPE CODE 1


# Step 3: Data Merging

We need to merge CitiBike data with NYPD accident data using:

- Time proximity: CitiBike started_at should be within X minutes of an accident's CRASH DATE + CRASH TIME
- Location proximity: CitiBike start_lat, start_lng should be within 100m of an accident's LATITUDE, LONGITUDE

How?

- Use KDTree for fast geospatial lookups (like before).
- Filter by date and time to ensure the accident is relevant.
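
The two proximity rules can be combined in one pass: a KDTree finds accidents within the radius, and a time filter keeps only those close to the trip start. The sketch below is one possible way to do this, assuming a 30-minute window for the unspecified X, a combined `CRASH_DATETIME` column, and a simple equirectangular projection to metres; the function names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_008.8


def to_local_metres(lat, lng, ref_lat):
    """Equirectangular projection; a reasonable approximation over a city-sized area."""
    lat = np.asarray(lat, dtype=float)
    lng = np.asarray(lng, dtype=float)
    x = np.radians(lng) * EARTH_RADIUS_M * np.cos(np.radians(ref_lat))
    y = np.radians(lat) * EARTH_RADIUS_M
    return np.column_stack([x, y])


def count_matching_accidents(trips, accidents, radius_m=100.0, window_minutes=30):
    """Count accidents within radius_m of each trip start and within the time window."""
    ref_lat = float(accidents["LATITUDE"].mean())
    tree = cKDTree(to_local_metres(accidents["LATITUDE"], accidents["LONGITUDE"], ref_lat))
    trip_xy = to_local_metres(trips["start_lat"], trips["start_lng"], ref_lat)
    neighbours = tree.query_ball_point(trip_xy, r=radius_m)

    crash_times = accidents["CRASH_DATETIME"].to_numpy()
    start_times = trips["started_at"].to_numpy()
    window = np.timedelta64(window_minutes, "m")  # assumed value for "X minutes"

    counts = np.zeros(len(trips), dtype=int)
    for i, idx in enumerate(neighbours):
        if idx:  # spatial candidates exist, now apply the time filter
            counts[i] = int((np.abs(crash_times[idx] - start_times[i]) <= window).sum())
    return counts
```

Projecting to metres first lets the radius be stated directly as 100 m rather than as an approximate number of degrees.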


# CitiBike Dataset


In [15]:
import pandas as pd
import numpy as np
from datetime import datetime
from geopy.distance import geodesic

# Load CitiBike data
citibike_df = pd.read_csv("/Users/shashankhmg/Documents/AXA-Casestudy/Data-Science-Challenge/data/processed/cleaned_citibike_data.csv")

# Convert 'started_at' and 'ended_at' to datetime format
citibike_df['started_at'] = pd.to_datetime(citibike_df['started_at'])
citibike_df['ended_at'] = pd.to_datetime(citibike_df['ended_at'])

# Time-Based Features
citibike_df["hour_of_day"] = citibike_df["started_at"].dt.hour
citibike_df["day_of_week"] = citibike_df["started_at"].dt.dayofweek  # Monday = 0, Sunday = 6
citibike_df["weekend"] = (citibike_df["day_of_week"] >= 5).astype(int)  # 1 = Weekend, 0 = Weekday
# Classify rush hour: 1 = Rush hour, 0 = Non-rush hour
# Rush hour is defined here by the hour in which the trip starts (bounds inclusive):
# - Morning: hours 7-9, i.e. 07:00 to 09:59 (commuters going to work)
# - Evening: hours 16-19, i.e. 16:00 to 19:59 (commuters returning home)

citibike_df["rush_hour"] = np.where(
    ((citibike_df["hour_of_day"] >= 7) & (citibike_df["hour_of_day"] <= 9)) | 
    ((citibike_df["hour_of_day"] >= 16) & (citibike_df["hour_of_day"] <= 19)), 
    1,  # Rush Hour
    0   # Non-Rush Hour
)

# Trip-Based Features
citibike_df["trip_duration"] = (citibike_df["ended_at"] - citibike_df["started_at"]).dt.total_seconds() / 60  # Convert to minutes

# Compute distance using geodesic distance
def compute_distance(row):
    return geodesic((row["start_lat"], row["start_lng"]), (row["end_lat"], row["end_lng"])).km

citibike_df["trip_distance"] = citibike_df.apply(compute_distance, axis=1)

# Encode binary categorical features
citibike_df["member_casual"] = citibike_df["member_casual"].map({"member": 1, "casual": 0})  # 1 = member, 0 = casual
# Convert 'rideable_type' into a single binary column (1 = electric, 0 = classic);
# any bike type missing from the mapping becomes NaN
citibike_df["rideable_type"] = citibike_df["rideable_type"].map({"electric_bike": 1, "classic_bike": 0})

# Geographical Features
citibike_df["start_coordinates"] = list(zip(citibike_df["start_lat"], citibike_df["start_lng"]))
citibike_df["end_coordinates"] = list(zip(citibike_df["end_lat"], citibike_df["end_lng"]))

# The selected dataset is CitiBike 01/2025, so every trip falls in winter and a season column
# would be constant. It is omitted here but should be added if the complete dataset is used.

# Drop unnecessary columns
citibike_features = citibike_df.drop(columns=["ride_id", "started_at", "ended_at", 
                                              "start_station_name", "end_station_name", 
                                              "start_station_id", "end_station_id" ])



In [16]:
citibike_features.columns

Index(['rideable_type', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual', 'hour_of_day', 'day_of_week', 'weekend', 'rush_hour',
       'trip_duration', 'trip_distance', 'start_coordinates',
       'end_coordinates'],
      dtype='object')
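
The feature columns listed above do not include season, since the selected month lies entirely in winter. If the full-year data is used, season could be derived from started_at along these lines. This is a sketch assuming meteorological seasons for the Northern Hemisphere (December to February = Winter, and so on); the mapping and the function name are assumptions.

```python
import pandas as pd

# Assumed month-to-season mapping (meteorological seasons, Northern Hemisphere)
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season_from_timestamps(timestamps):
    """Map each timestamp to a season label via its calendar month."""
    months = pd.to_datetime(pd.Series(timestamps)).dt.month
    return months.map(SEASON_BY_MONTH)
```

The resulting labels would still need to be one-hot encoded before modeling.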

In [17]:
citibike_features

Unnamed: 0,rideable_type,start_lat,start_lng,end_lat,end_lng,member_casual,hour_of_day,day_of_week,weekend,rush_hour,trip_duration,trip_distance,start_coordinates,end_coordinates
0,1,40.752149,-73.989539,40.744876,-73.995299,1,22,2,0,0,4.442917,0.942805,"(40.752149, -73.989539)","(40.74487634, -73.99529885)"
1,1,40.738290,-73.990060,40.744876,-73.995299,1,15,3,0,0,5.585767,0.854840,"(40.73829, -73.99006)","(40.74487634, -73.99529885)"
2,1,40.745248,-73.947333,40.763359,-73.928647,1,12,2,0,0,8.616567,2.556361,"(40.74524768, -73.94733276)","(40.7633589, -73.9286471)"
3,1,40.738290,-73.990060,40.744876,-73.995299,1,13,1,0,0,5.089867,0.854840,"(40.73829, -73.99006)","(40.74487634, -73.99529885)"
4,1,40.812299,-73.920370,40.792327,-73.938300,1,7,0,0,1,8.680650,2.684817,"(40.812299, -73.92037)","(40.7923272, -73.9383)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2119676,0,40.759604,-73.927144,40.761149,-73.917007,1,14,4,0,0,6.917867,0.872972,"(40.759604471387945, -73.92714411020279)","(40.7611488, -73.9170071)"
2119677,1,40.712628,-73.960575,40.719156,-73.948854,1,20,2,0,0,4.015833,1.227322,"(40.712628, -73.960575)","(40.71915571696044, -73.94885390996933)"
2119678,1,40.735324,-73.998004,40.733424,-74.008515,0,15,1,0,0,2.683867,0.912599,"(40.73532427, -73.99800419)","(40.73342399437081, -74.00851495563984)"
2119679,1,40.794990,-73.933330,40.773914,-73.954395,1,9,6,1,1,10.173150,2.939312,"(40.79499, -73.93333)","(40.77391390238118, -73.9543953537941)"


In [18]:
# Save cleaned CitiBike dataset
citibike_features.to_csv("/Users/shashankhmg/Documents/AXA-Casestudy/Data-Science-Challenge/data/model_data/citibike_features.csv", index=False)


# NYPD Dataset


In [20]:
import pandas as pd
import numpy as np

# Load NYPD accident data
nypd_df = pd.read_csv("/Users/shashankhmg/Documents/AXA-Casestudy/Data-Science-Challenge/data/processed/nypd_data_cleaned.csv")  # Replace with actual path

# Convert 'CRASH DATE' and 'CRASH TIME' to datetime format
nypd_df["CRASH_DATETIME"] = pd.to_datetime(nypd_df["CRASH DATE"] + " " + nypd_df["CRASH TIME"])

# Time-Based Features
nypd_df["hour_of_day"] = nypd_df["CRASH_DATETIME"].dt.hour
nypd_df["day_of_week"] = nypd_df["CRASH_DATETIME"].dt.dayofweek  # Monday = 0, Sunday = 6
nypd_df["weekend"] = (nypd_df["day_of_week"] >= 5).astype(int)  # 1 = Weekend, 0 = Weekday

# Define Rush Hour (Same logic as CitiBike)
nypd_df["rush_hour"] = np.where(
    ((nypd_df["hour_of_day"] >= 7) & (nypd_df["hour_of_day"] <= 9)) | 
    ((nypd_df["hour_of_day"] >= 16) & (nypd_df["hour_of_day"] <= 19)), 
    1,  # Rush Hour
    0   # Non-Rush Hour
)

# Location-Based Features
# Encode borough names into numeric values
nypd_df["borough_encoded"] = pd.factorize(nypd_df["BOROUGH"])[0]  # -1 means missing value
nypd_df["accident_coordinates"] = list(zip(nypd_df["LATITUDE"], nypd_df["LONGITUDE"]))

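# Accident Severity Features
# Note: in the rows displayed below, NUMBER OF PERSONS INJURED already equals the sum of the
# pedestrian, cyclist and motorist counts, so these totals count each person twice.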
nypd_df["total_injuries"] = (
    nypd_df["NUMBER OF PERSONS INJURED"] + 
    nypd_df["NUMBER OF PEDESTRIANS INJURED"] + 
    nypd_df["NUMBER OF CYCLIST INJURED"] + 
    nypd_df["NUMBER OF MOTORIST INJURED"]
)

nypd_df["total_fatalities"] = (
    nypd_df["NUMBER OF PERSONS KILLED"] + 
    nypd_df["NUMBER OF PEDESTRIANS KILLED"] + 
    nypd_df["NUMBER OF CYCLIST KILLED"] + 
    nypd_df["NUMBER OF MOTORIST KILLED"]
)

nypd_df["cyclist_injuries_ratio"] = np.where(nypd_df["total_injuries"] > 0, 
                                             nypd_df["NUMBER OF CYCLIST INJURED"] / nypd_df["total_injuries"], 
                                             0)

nypd_df["cyclist_fatalities_ratio"] = np.where(nypd_df["total_fatalities"] > 0, 
                                               nypd_df["NUMBER OF CYCLIST KILLED"] / nypd_df["total_fatalities"], 
                                               0)

In [21]:
# Drop unnecessary columns
nypd_features = nypd_df.drop(columns=["CRASH DATE", "CRASH TIME", "CRASH_DATETIME", 
                                      "ZIP CODE", "ON STREET NAME","LOCATION"])

In [53]:
nypd_features

Unnamed: 0,BOROUGH,LATITUDE,LONGITUDE,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,...,hour_of_day,day_of_week,weekend,rush_hour,borough_encoded,accident_coordinates,total_injuries,total_fatalities,cyclist_injuries_ratio,cyclist_fatalities_ratio
0,BROOKLYN,40.621790,-73.970024,1.0,0.0,0,0,0,0,1,...,1,2,0,0,0,"(40.62179, -73.970024)",2.0,0.0,0.0,0.0
1,BROOKLYN,40.667202,-73.866500,0.0,0.0,0,0,0,0,0,...,9,5,1,1,0,"(40.667202, -73.8665)",0.0,0.0,0.0,0.0
2,BROOKLYN,40.683304,-73.917274,0.0,0.0,0,0,0,0,0,...,8,1,0,1,0,"(40.683304, -73.917274)",0.0,0.0,0.0,0.0
3,BRONX,40.868160,-73.831480,2.0,0.0,0,0,0,0,2,...,8,1,0,1,1,"(40.86816, -73.83148)",4.0,0.0,0.0,0.0
4,BROOKLYN,40.671720,-73.897100,0.0,0.0,0,0,0,0,0,...,21,1,0,0,0,"(40.67172, -73.8971)",0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1437812,QUEENS,40.712820,-73.903770,0.0,0.0,0,0,0,0,0,...,0,3,0,0,3,"(40.71282, -73.90377)",0.0,0.0,0.0,0.0
1437813,BRONX,40.826202,-73.821680,1.0,0.0,0,0,0,0,1,...,10,4,0,0,1,"(40.826202, -73.82168)",2.0,0.0,0.0,0.0
1437814,QUEENS,40.765380,-73.911430,0.0,0.0,0,0,0,0,0,...,0,2,0,0,3,"(40.76538, -73.91143)",0.0,0.0,0.0,0.0
1437815,QUEENS,40.765377,-73.889050,1.0,0.0,1,0,0,0,0,...,8,4,0,1,3,"(40.765377, -73.88905)",2.0,0.0,0.0,0.0
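
In the sample rows above, NUMBER OF PERSONS INJURED matches the sum of the pedestrian, cyclist and motorist columns (for example 1.0 with one motorist injured), while total_injuries shows 2.0. A minimal sketch of totals built from the category columns only is given below, assuming that relationship holds across the dataset; the helper name and the `persons_matches_parts` check column are illustrative.

```python
import pandas as pd

INJURED_PARTS = [
    "NUMBER OF PEDESTRIANS INJURED",
    "NUMBER OF CYCLIST INJURED",
    "NUMBER OF MOTORIST INJURED",
]


def injury_totals(df):
    """Total injuries from the category columns, plus a consistency flag and cyclist share."""
    out = df.copy()
    out["total_injuries"] = out[INJURED_PARTS].sum(axis=1)
    # Flag rows where the reported persons total disagrees with the category sum
    out["persons_matches_parts"] = out["NUMBER OF PERSONS INJURED"] == out["total_injuries"]
    # Avoid division by zero: rows without injuries get a ratio of 0
    share = out["NUMBER OF CYCLIST INJURED"] / out["total_injuries"].where(out["total_injuries"] > 0)
    out["cyclist_injuries_ratio"] = share.fillna(0.0)
    return out
```

The same pattern would apply to the fatality columns.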


In [23]:
# Save cleaned nypd_features dataset
nypd_features.to_csv("/Users/shashankhmg/Documents/AXA-Casestudy/Data-Science-Challenge/data/model_data/nypd_features.csv", index=False)


# Step 1: Train the NYPD Accident Prediction Model
We'll create a model that predicts accident risk based on location, time, and accident history.

## Preprocessing the NYPD Data
- Convert date & time to datetime.
- Extract hour, day of the week, and weekend flag.
- Categorize accidents into "High-Risk", "Medium-Risk", and "Low-Risk" zones.
- Use spatial clustering (DBSCAN) to group high-risk accident locations.
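
The DBSCAN step listed above could look roughly like the following. This is a sketch under assumptions: the 100 m neighbourhood radius and the minimum of 5 accidents per cluster are illustrative values, and the function name is hypothetical. scikit-learn's haversine metric expects [latitude, longitude] in radians, so the radius is converted to an angle.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_008.8


def cluster_accident_hotspots(lat, lng, eps_m=100.0, min_samples=5):
    """Label each accident with a cluster id; -1 marks points outside any dense cluster."""
    coords = np.radians(np.column_stack([lat, lng]))  # haversine metric needs radians
    model = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,  # radius expressed as an angle in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords)
```

Cluster sizes could then be binned into High, Medium and Low risk zones. On the full accident table this may need a lot of memory, so testing on a sample first is advisable.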


In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Load preprocessed NYPD accident data
nypd_accidents = nypd_features.copy()

# Selecting relevant features
features = [
    "LATITUDE", "LONGITUDE", "hour_of_day", "day_of_week", "weekend", "rush_hour",
    "borough_encoded", "total_injuries", "total_fatalities",
    "cyclist_injuries_ratio", "cyclist_fatalities_ratio"
]

target = "accident_severity"  # Define accident severity as the target

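# Define severity as binary classification (1 = High-Risk if total_injuries > 0, else 0 = Low-Risk)
# Note: the target is derived from total_injuries, which is also listed in `features`,
# so the model can read the label directly from that input.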
nypd_accidents["accident_severity"] = np.where(nypd_accidents["total_injuries"] > 0, 1, 0)

# Drop NaN values for modeling
nypd_accidents = nypd_accidents.dropna(subset=features)


In [28]:
# Standardize numerical features (zero mean, unit variance)
# Note: this overwrites the columns in nypd_accidents, including LATITUDE and LONGITUDE
scaler = StandardScaler()
nypd_accidents[features] = scaler.fit_transform(nypd_accidents[features])



In [29]:
nypd_accidents[features]

Unnamed: 0,LATITUDE,LONGITUDE,hour_of_day,day_of_week,weekend,rush_hour,borough_encoded,total_injuries,total_fatalities,cyclist_injuries_ratio,cyclist_fatalities_ratio
0,-1.301753,-0.619143,-2.152210,-0.471371,-0.585638,-0.804348,-1.207403,1.035105,-0.034374,-0.179498,-0.011039
1,-0.716753,0.667031,-0.748623,1.067128,1.707538,1.243242,-1.207403,-0.446849,-0.034374,-0.179498,-0.011039
2,-0.509327,0.036219,-0.924072,-0.984204,-0.585638,1.243242,-1.207403,-0.446849,-0.034374,-0.179498,-0.011039
3,1.871996,1.102116,-0.924072,-0.984204,-0.585638,1.243242,-0.432286,2.517059,-0.034374,-0.179498,-0.011039
4,-0.658552,0.286859,1.356756,-0.984204,-0.585638,-0.804348,-1.207403,-0.446849,-0.034374,-0.179498,-0.011039
...,...,...,...,...,...,...,...,...,...,...,...
1437812,-0.129100,0.203991,-2.327658,0.041462,-0.585638,-0.804348,1.117950,-0.446849,-0.034374,-0.179498,-0.011039
1437813,1.331491,1.223871,-0.573175,0.554295,-0.585638,-0.804348,-0.432286,1.035105,-0.034374,-0.179498,-0.011039
1437814,0.547980,0.108824,-2.327658,-0.471371,-0.585638,-0.804348,1.117950,-0.446849,-0.034374,-0.179498,-0.011039
1437815,0.547941,0.386871,-0.924072,0.554295,-0.585638,1.243242,1.117950,1.035105,-0.034374,-0.179498,-0.011039


In [32]:
nypd_accidents[target]

0          1
1          0
2          0
3          1
4          0
          ..
1437812    0
1437813    1
1437814    0
1437815    1
1437816    0
Name: accident_severity, Length: 1437817, dtype: int64

In [30]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    nypd_accidents[features], nypd_accidents[target], test_size=0.2, random_state=42
)

# Train a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42, class_weight="balanced")
rf_model.fit(X_train, y_train)

In [31]:
# Predict & evaluate
y_pred = rf_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    221156
           1       1.00      1.00      1.00     66408

    accuracy                           1.00    287564
   macro avg       1.00      1.00      1.00    287564
weighted avg       1.00      1.00      1.00    287564



In [None]:
# New plan for CitiBike risk scoring
# Use KDTree for efficient nearest-neighbor search.
# Count the number of accidents within 100m (approx. 0.001 degrees) of each CitiBike station.
# Apply a weighted scoring system:
#   - Give higher weights to accidents with fatalities.
#   - Give medium weights to accidents with serious injuries.
#   - Give lower weights to minor accidents.
# Normalize risk scores so they are in a usable range (e.g., 0-10 or 0-100).
# Assign the final risk score to each CitiBike station.
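
The weighted scoring idea in the plan above could be sketched as follows. The weights (10, 3, 1) are placeholder assumptions, and because the NYPD columns do not distinguish serious from minor injuries, this sketch treats any injury as the medium tier and injury-free crashes as the minor tier. Both function names are hypothetical.

```python
import numpy as np
import pandas as pd


def accident_weights(accidents, w_fatal=10.0, w_injury=3.0, w_minor=1.0):
    """Assign one severity weight per accident (weights are assumed values)."""
    killed = accidents["NUMBER OF PERSONS KILLED"].fillna(0)
    injured = accidents["NUMBER OF PERSONS INJURED"].fillna(0)
    weights = np.where(killed > 0, w_fatal, np.where(injured > 0, w_injury, w_minor))
    return pd.Series(weights, index=accidents.index, name="weight")


def scale_to_range(scores, upper=100.0):
    """Min-max scale scores to the range 0..upper; constant input maps to 0."""
    scores = pd.Series(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return pd.Series(0.0, index=scores.index)
    return upper * (scores - scores.min()) / span
```

Summing the weights of the accidents found near each station and passing the sums through `scale_to_range` would give a score on a 0-100 scale.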


In [65]:
import numpy as np
from scipy.spatial import cKDTree

# Filter only cyclist-related accidents
nypd_cyclist_accidents = nypd_df[
    (nypd_df["NUMBER OF CYCLIST INJURED"] > 0) | 
    (nypd_df["NUMBER OF CYCLIST KILLED"] > 0)
]

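# Convert coordinates to numpy arrays for KDTree
# Note: accident_coords uses all accidents in nypd_features (not the cyclist-only subset above),
# and station_coords holds one row per trip rather than one row per station
accident_coords = np.array(list(zip(nypd_features["LATITUDE"], nypd_features["LONGITUDE"])))
station_coords = np.array(list(zip(citibike_features["start_lat"], citibike_features["start_lng"])))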

# Build KDTree for fast lookup
accident_tree = cKDTree(accident_coords)

# Define search radius (in degrees, approx. 100m = 0.001 degrees)
search_radius = 0.001

# Find nearby accidents for each CitiBike station
accidents_near_station = accident_tree.query_ball_point(station_coords, search_radius)

accidents_near_station

array([list([864, 18317, 22406, 25949, 36038, 39369, 43710, 47332, 54004, 58713, 59200, 65749, 66915, 67604, 72629, 80737, 81617, 82941, 87502, 87700, 88242, 89627, 89814, 95844, 96986, 101603, 107200, 116273, 117882, 118499, 118764, 119271, 120025, 121609, 127497, 130192, 133095, 134713, 136942, 142823, 146161, 150383, 151068, 156539, 157279, 163431, 164643, 166024, 167425, 174267, 178489, 185171, 188932, 199101, 202360, 202420, 203919, 210606, 226249, 236081, 241204, 262648, 271539, 273908, 282837, 285633, 293775, 307482, 311511, 318547, 319632, 321247, 323230, 323596, 328528, 329897, 332797, 333723, 334038, 340563, 350110, 353288, 355325, 356541, 358499, 362913, 365701, 382563, 382806, 383491, 384101, 385645, 385844, 386918, 387953, 392623, 396128, 401168, 410408, 410980, 411450, 416108, 419972, 420603, 421033, 423290, 428734, 436277, 436962, 443981, 447334, 448730, 450637, 455995, 456104, 458652, 461359, 470988, 474134, 479166, 493663, 497200, 497662, 500867, 504096, 508254, 512441

In [66]:
# Count the number of accidents near each station
accident_counts = [len(acc) for acc in accidents_near_station]

# Assign raw accident count to CitiBike stations
citibike_df["accident_count"] = accident_counts


In [71]:
# Group by station name and sum the accident count
top_risky_stations = citibike_df.groupby("start_station_name", as_index=False)["accident_count"].sum()


In [72]:
top_risky_stations

Unnamed: 0,start_station_name,accident_count
0,1 Ave & E 110 St,318375
1,1 Ave & E 118 St,409952
2,1 Ave & E 16 St,801735
3,1 Ave & E 18 St,974389
4,1 Ave & E 30 St,1404586
...,...,...
2149,Wyckoff Ave & Stanhope St,247065
2150,Wyckoff St & 3 Ave,161628
2151,Wythe Ave & Metropolitan Ave,427110
2152,Wythe Ave & N 13 St,208010


In [75]:
# Risk score is calculated by penalizing stations with more than 10 accidents.
# Each accident above 10 adds a penalty weight (default = 2) to the risk score.


In [73]:
# Define penalty weight per accident (adjustable)
penalty_weight = 2  # Each extra accident above 10 increases risk score

# Apply the formula
top_risky_stations["risk_score"] = (top_risky_stations["accident_count"] - 10) * penalty_weight
top_risky_stations["risk_score"] = top_risky_stations["risk_score"].clip(lower=0)  # No negative scores


In [74]:
top_risky_stations

Unnamed: 0,start_station_name,accident_count,risk_score
0,1 Ave & E 110 St,318375,636730
1,1 Ave & E 118 St,409952,819884
2,1 Ave & E 16 St,801735,1603450
3,1 Ave & E 18 St,974389,1948758
4,1 Ave & E 30 St,1404586,2809152
...,...,...,...
2149,Wyckoff Ave & Stanhope St,247065,494110
2150,Wyckoff St & 3 Ave,161628,323236
2151,Wythe Ave & Metropolitan Ave,427110,854200
2152,Wythe Ave & N 13 St,208010,416000


Min-max normalization of the risk score:

$$\text{normalized risk score} = \frac{\text{risk score} - \min(\text{risk score})}{\max(\text{risk score}) - \min(\text{risk score})}$$

In [76]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the risk scores to a scale of 0 to 1 for better comparability
scaler = MinMaxScaler()
top_risky_stations["normalized_risk_score"] = scaler.fit_transform(top_risky_stations[["risk_score"]])


In [81]:
# Apply logarithmic transformation for better risk differentiation
top_risky_stations["log_risk_score"] = np.log1p(top_risky_stations["accident_count"])


In [82]:
top_risky_stations

Unnamed: 0,start_station_name,accident_count,risk_score,normalized_risk_score,log_risk_score
0,1 Ave & E 110 St,318375,636730,0.028814,12.670988
1,1 Ave & E 118 St,409952,819884,0.037102,12.923798
2,1 Ave & E 16 St,801735,1603450,0.072560,13.594535
3,1 Ave & E 18 St,974389,1948758,0.088186,13.789567
4,1 Ave & E 30 St,1404586,2809152,0.127121,14.155254
...,...,...,...,...,...
2149,Wyckoff Ave & Stanhope St,247065,494110,0.022360,12.417411
2150,Wyckoff St & 3 Ave,161628,323236,0.014627,11.993059
2151,Wythe Ave & Metropolitan Ave,427110,854200,0.038655,12.964799
2152,Wythe Ave & N 13 St,208010,416000,0.018825,12.245346


In [None]:
# P(accident at station)= Accident count at station / Total accidents across all stations

In [83]:
# Compute total accidents across all CitiBike stations
total_accidents = top_risky_stations["accident_count"].sum()

# Compute probability of an accident at each station
top_risky_stations["accident_probability"] = top_risky_stations["accident_count"] / total_accidents


In [84]:
top_risky_stations

Unnamed: 0,start_station_name,accident_count,risk_score,normalized_risk_score,log_risk_score,accident_probability
0,1 Ave & E 110 St,318375,636730,0.028814,12.670988,0.000541
1,1 Ave & E 118 St,409952,819884,0.037102,12.923798,0.000697
2,1 Ave & E 16 St,801735,1603450,0.072560,13.594535,0.001363
3,1 Ave & E 18 St,974389,1948758,0.088186,13.789567,0.001656
4,1 Ave & E 30 St,1404586,2809152,0.127121,14.155254,0.002388
...,...,...,...,...,...,...
2149,Wyckoff Ave & Stanhope St,247065,494110,0.022360,12.417411,0.000420
2150,Wyckoff St & 3 Ave,161628,323236,0.014627,11.993059,0.000275
2151,Wythe Ave & Metropolitan Ave,427110,854200,0.038655,12.964799,0.000726
2152,Wythe Ave & N 13 St,208010,416000,0.018825,12.245346,0.000354


In [86]:
top_risky_stations.sort_values(by=["normalized_risk_score"], ascending=False).head(10)


Unnamed: 0,start_station_name,accident_count,risk_score,normalized_risk_score,log_risk_score,accident_probability
38,11 Ave & W 41 St,11049108,22098196,1.0,16.21786,0.018781
2006,W 41 St & 8 Ave,7339520,14679020,0.664263,15.808784,0.012476
1065,E 58 St & 3 Ave,5498808,10997596,0.497669,15.520042,0.009347
1972,W 21 St & 6 Ave,5473477,10946934,0.495377,15.515425,0.009304
1992,W 31 St & 7 Ave,5304838,10609656,0.480114,15.48413,0.009017
2010,W 43 St & 10 Ave,5290296,10580572,0.478798,15.481385,0.008992
668,Broadway & W 58 St,4518963,9037906,0.408988,15.323793,0.007681
447,8 Ave & W 38 St,4261928,8523836,0.385725,15.265232,0.007244
1284,Hanson Pl & Ashland Pl,3710630,7421240,0.33583,15.126713,0.006307
2014,W 45 St & 8 Ave,3681840,7363660,0.333224,15.118923,0.006258


In [88]:
# Merge the normalized risk scores into the CitiBike dataset
# (the right-hand frame must include the join key, and the key is 'start_station_name')
citibike_df = citibike_df.merge(
    top_risky_stations[["start_station_name", "normalized_risk_score"]],
    on="start_station_name",
    how="left"
)

In [None]:
# Fill NaN values (if a station has no recorded risk, assign 0)
citibike_df["normalized_risk_score"] = citibike_df["normalized_risk_score"].fillna(0)

# Display the updated CitiBike dataset
citibike_df.head()


# Step 2: Assign Risk Scores to CitiBike Stations


In [37]:
from scipy.spatial import cKDTree

# Convert CitiBike station coordinates to numpy array
station_coords = np.array(list(zip(citibike_features["start_lat"], citibike_features["start_lng"])))

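# Use KDTree for fast spatial lookup
# Note: LATITUDE/LONGITUDE in nypd_accidents were standardized in place earlier, so they are no
# longer in degrees; unscaled coordinates are needed for the radius search to find matches
accident_tree = cKDTree(nypd_accidents[["LATITUDE", "LONGITUDE"]].values)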

# Define search radius (100m = ~0.001 degrees)
search_radius = 0.001

# Count accidents near each CitiBike station
accident_counts = accident_tree.query_ball_point(station_coords, search_radius)

# Assign risk scores to stations
citibike_features["risk_score"] = [len(accidents) for accidents in accident_counts]

In [41]:
citibike_features["risk_score"].value_counts()

risk_score
0    2119681
Name: count, dtype: int64
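
A risk score of 0 for every trip is what one would expect if the tree was built from standardized coordinates while the stations were queried in degrees. The sketch below reproduces the effect on three made-up points and shows one way to avoid it: keep the scaled features in a separate frame. The helper names and the sample coordinates are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from sklearn.preprocessing import StandardScaler


def scaled_feature_frame(df, columns):
    """Return standardized features as a new frame, leaving df untouched."""
    values = StandardScaler().fit_transform(df[columns])
    return pd.DataFrame(values, columns=columns, index=df.index)


def neighbours_within(points, queries, radius):
    """Number of points within radius of each query location."""
    tree = cKDTree(np.asarray(points, dtype=float))
    return [len(h) for h in tree.query_ball_point(np.asarray(queries, dtype=float), r=radius)]


# Made-up coordinates for illustration only
accidents = pd.DataFrame({"LATITUDE": [40.70, 40.75, 40.80], "LONGITUDE": [-74.00, -73.99, -73.95]})
station = [[40.7501, -73.9901]]

raw_hits = neighbours_within(accidents[["LATITUDE", "LONGITUDE"]], station, 0.001)
scaled = scaled_feature_frame(accidents, ["LATITUDE", "LONGITUDE"])
scaled_hits = neighbours_within(scaled, station, 0.001)
print(raw_hits, scaled_hits)
```

With the scaled values kept apart, the original frame still holds degrees and can be used for any later spatial lookup.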

In [None]:
# Normalize risk scores (0 to 1)
citibike_stations["risk_score"] = citibike_stations["risk_score"] / citibike_stations["risk_score"].max()

# Merge risk scores with CitiBike dataset
citibike_df = citibike_df.merge(
    citibike_stations[["start_station_name", "risk_score"]],
    on="start_station_name",
    how="left"
)


In [1]:
pip install pygwalker


In [1]:
import pandas as pd
import json

# Load station data
citybikes_data = pd.read_csv('/Users/shashankhmg/Documents/AXA-Casestudy/Data-Science-Challenge/data/model_data/model_data.csv')
citybikes_data.columns



Index(['Unnamed: 0', 'ride_id', 'rideable_type', 'started_at', 'ended_at',
       'start_station_name', 'start_station_id', 'end_station_name',
       'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual', 'accident_count', 'normalized_risk_score',
       'risk_level'],
      dtype='object')

In [2]:
# Create a dictionary mapping station names to all relevant info
# (one entry per station; the last row seen for each station name wins)
station_cols = ["start_station_id", "start_lat", "start_lng", "normalized_risk_score"]
station_dict = (
    citybikes_data
    .drop_duplicates(subset="start_station_name", keep="last")
    .set_index("start_station_name")[station_cols]
    .to_dict(orient="index")
)

# Save to JSON
with open("station_data.json", "w") as f:
    json.dump(station_dict, f, indent=4)

