# Uber Hot-Zones Recommendation System

## Project Overview

One of the main challenges faced by Uber is the spatial mismatch between driver availability and user demand.

Users expect a waiting time between 5 and 7 minutes. Beyond this threshold, the probability of ride cancellation significantly increases.

The objective of this project is to identify the geographical hot-zones in New York City where drivers should position themselves, depending on the day of the week.

This project uses unsupervised machine learning algorithms in order to:

- Detect pickup demand patterns
- Segment pickup locations into geographical clusters
- Recommend optimal positioning areas for drivers

Two clustering algorithms will be compared:

- KMeans
- DBSCAN

## Imports

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

from datetime import datetime

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.pipeline import Pipeline


## Load Data

In [2]:
uber_df = pd.read_csv("data/uber-raw-data-apr14.csv")
uber_df.head()

          Date/Time      Lat       Lon    Base
0  4/1/2014 0:11:00   40.769  -73.9549  B02512
1  4/1/2014 0:17:00  40.7267  -74.0345  B02512
2  4/1/2014 0:21:00  40.7316  -73.9873  B02512
3  4/1/2014 0:28:00  40.7588  -73.9776  B02512
4  4/1/2014 0:33:00  40.7594  -73.9722  B02512


In [3]:
print("Number of rows : {}".format(uber_df.shape[0]))
print()

print(" Basic statistics: ")
df_desc = uber_df.describe(include='all')
display(df_desc)
print()

print("Percentage of missing values : ")
display(100*uber_df.isnull().sum()/uber_df.shape[0])

Number of rows : 564516

 Basic statistics: 


                Date/Time        Lat         Lon    Base
count              564516   564516.0    564516.0  564516
unique              41999        NaN         NaN       5
top     4/7/2014 20:21:00        NaN         NaN  B02682
freq                   97        NaN         NaN  227808
mean                  NaN  40.740005  -73.976817     NaN
std                   NaN   0.036083    0.050426     NaN
min                   NaN    40.0729    -74.7733     NaN
25%                   NaN    40.7225    -73.9977     NaN
50%                   NaN    40.7425    -73.9848     NaN
75%                   NaN    40.7607      -73.97     NaN



Percentage of missing values : 


Date/Time    0.0
Lat          0.0
Lon          0.0
Base         0.0
dtype: float64

In [4]:
uber_sample_df = uber_df.sample(10000, random_state=42)

uber_sample_df.shape

(10000, 4)

A random sample of 10,000 pickups (about 1.8% of the 564,516 rows) was drawn to reduce computational cost while tuning the clustering algorithms. This allows faster iteration, on the assumption that a uniform random sample preserves the overall spatial demand patterns.
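The assumption that sampling leaves the spatial distribution largely unchanged can be checked rather than taken for granted. The sketch below is one possible check, run on synthetic coordinates (the values and the names `full_df`, `sample_df` and `max_gap` are illustrative assumptions, not the notebook's data): it compares a few quantiles of each coordinate between the full table and the sample.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pickup table (hypothetical values, not the real data).
rng = np.random.default_rng(42)
full_df = pd.DataFrame({
    "Lat": rng.normal(40.74, 0.036, size=50_000),
    "Lon": rng.normal(-73.98, 0.050, size=50_000),
})

# Same sampling call as in the notebook, applied to the synthetic table.
sample_df = full_df.sample(10_000, random_state=42)

# Compare a few quantiles of each coordinate between the full table and the sample.
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]
max_gap = (
    full_df.quantile(quantiles) - sample_df.quantile(quantiles)
).abs().max()

# Largest absolute quantile difference per coordinate, in degrees.
print(max_gap)
```

Small gaps relative to the spread of the coordinates suggest the sample is representative. The same comparison could be applied to `uber_df` and `uber_sample_df`.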

## Dataset Overview

The dataset contains historical Uber pickup records in New York City.

Each observation represents a pickup event and includes:

- Date and time of the pickup
- Latitude of the pickup location
- Longitude of the pickup location
- Dispatching base number

In this analysis, latitude and longitude will be used to identify geographical demand clusters.

## Feature Engineering

In [5]:
uber_sample_df = uber_sample_df.rename(columns={"Date/Time": "pickup_datetime"})
uber_sample_df = uber_sample_df.rename(columns={"Lat": "pickup_latitude", "Lon": "pickup_longitude"})

In [6]:
uber_sample_df["pickup_datetime"] = pd.to_datetime(
    uber_sample_df["pickup_datetime"]
)

uber_sample_df["hour"] = (
    uber_sample_df["pickup_datetime"]
    .dt.hour
)

uber_sample_df["week_day"] = (
    uber_sample_df["pickup_datetime"]
    .dt.day_name()
)
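The parsing step above lets pandas infer the timestamp layout. As a hedged alternative, the format can be stated explicitly, which avoids any ambiguity between month-first and day-first dates. The sketch below assumes the layout is month/day/year followed by hour:minute:second, as the dataset preview suggests, and uses two timestamps that appear in the outputs above.

```python
import pandas as pd

# Two timestamps copied from the dataset preview and summary shown earlier.
raw = pd.Series(["4/1/2014 0:11:00", "4/7/2014 20:21:00"])

# Assumed layout: month/day/year hour:minute:second.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

# Same derived features as in the notebook.
hours = parsed.dt.hour.tolist()
week_days = parsed.dt.day_name().tolist()

print(hours)
print(week_days)
```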

## Pick One Day

In [7]:
monday_18_df = uber_sample_df[
    (uber_sample_df["week_day"] == "Monday") &
    (uber_sample_df["hour"] == 18)
]

monday_18_df = monday_18_df.copy()


In [8]:
fig = px.scatter_map(
    monday_18_df,
    lat="pickup_latitude",
    lon="pickup_longitude",
    color="Base",
    map_style="carto-positron"
)

# Render the map (the assignment alone does not display it)
fig.show()

## Preprocessing Pipeline

In [9]:
numeric_features = ["pickup_latitude", "pickup_longitude"]

numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            numeric_transformer,
            numeric_features
        )
    ]
)

X_monday_18 = monday_18_df[numeric_features]

X_monday_18_scaled = preprocessor.fit_transform(X_monday_18)

## KMeans Clustering

In [10]:
kmeans = KMeans(
    n_clusters=10,
    random_state=42
)

kmeans.fit(X_monday_18_scaled)

monday_18_df["cluster_kmeans"] = (
    kmeans.predict(X_monday_18_scaled)
)

In [11]:
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score
)

In [12]:
print("KMeans Silhouette:",
      silhouette_score(
          X_monday_18_scaled,
          monday_18_df["cluster_kmeans"]
      ))

print("KMeans DBI:",
      davies_bouldin_score(
          X_monday_18_scaled,
          monday_18_df["cluster_kmeans"]
      ))

print("KMeans CHI:",
      calinski_harabasz_score(
          X_monday_18_scaled,
          monday_18_df["cluster_kmeans"]
      ))

KMeans Silhouette: 0.40859686819192836
KMeans DBI: 0.4680805316154748
KMeans CHI: 121.19798022427268


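A Silhouette Score of about 0.41 indicates moderately separated clusters, and the Davies–Bouldin Index of 0.47 (lower is better) points in the same direction. The Calinski–Harabasz Index of 121.20 has no absolute scale and is mainly useful for comparing models fitted on the same data.

Overall, KMeans groups the pickup locations into reasonably distinct spatial clusters. Note that the notebook does not show how n_clusters=10 was chosen.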
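One common way to choose the number of clusters is to compare the Silhouette Score across several values of k. The sketch below illustrates the idea on three synthetic blobs (the coordinates, the range of k and the names `scores` and `best_k` are assumptions made for illustration). It is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three synthetic, well-separated pickup blobs (hypothetical coordinates).
rng = np.random.default_rng(0)
centers = np.array([[40.75, -73.98], [40.71, -74.01], [40.77, -73.87]])
points = np.vstack([c + rng.normal(scale=0.005, size=(60, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(points)

# Fit KMeans for each candidate k and record the Silhouette Score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Keep the k with the highest score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real pickup data the curve is usually flatter than on synthetic blobs, so the score is best read alongside the map rather than as the only criterion.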

## DBSCAN Clustering

In [13]:
dbscan = DBSCAN(
    eps=0.3,
    min_samples=15
)

dbscan.fit(X_monday_18_scaled)

monday_18_df["cluster_dbscan"] = (
    dbscan.labels_
)

In [14]:
dbscan_mask = monday_18_df["cluster_dbscan"] != -1

if len(set(monday_18_df.loc[dbscan_mask, "cluster_dbscan"])) > 1:

    print(
        "DBSCAN Silhouette:",
        silhouette_score(
            X_monday_18_scaled[dbscan_mask],
            monday_18_df.loc[
                dbscan_mask,
                "cluster_dbscan"
            ]
        )
    )
else:
    print(
        "DBSCAN Silhouette cannot be computed: "
        "less than 2 clusters detected."
    )

DBSCAN Silhouette cannot be computed: less than 2 clusters detected.


#### DBSCAN Clustering Limitation

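With eps=0.3 and min_samples=15, DBSCAN did not detect more than one valid cluster for this time slot, so separation-based metrics such as the Silhouette Score cannot be computed.

With these parameters, the algorithm was unable to segment pickup demand into multiple spatial zones. Other eps and min_samples values were not explored here, so this result reflects this configuration rather than DBSCAN in general.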
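Because eps is expressed in standardized units, a fixed value may not suit every time slot. A common heuristic, not used in this notebook, is to inspect the sorted distance from each point to its k-th nearest neighbour and pick eps near the bend of that curve. The sketch below applies a crude version of this idea to synthetic data. The 75th-percentile cut-off, the value of `min_samples` and the coordinates are all assumptions for illustration. In practice the bend is usually read from a plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Two dense synthetic areas plus scattered background pickups (hypothetical).
rng = np.random.default_rng(1)
dense_a = rng.normal([40.75, -73.98], 0.004, size=(40, 2))
dense_b = rng.normal([40.71, -74.01], 0.004, size=(40, 2))
scattered = rng.uniform([40.60, -74.10], [40.85, -73.80], size=(20, 2))
X_scaled = StandardScaler().fit_transform(np.vstack([dense_a, dense_b, scattered]))

min_samples = 5

# Distance from each point to its min_samples-th nearest neighbour.
# The query point itself is returned as the first neighbour (distance 0).
distances, _ = (
    NearestNeighbors(n_neighbors=min_samples).fit(X_scaled).kneighbors(X_scaled)
)
k_distances = np.sort(distances[:, -1])

# Crude stand-in for reading the bend of the curve: take a high percentile.
eps = float(np.percentile(k_distances, 75))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})
noise_share = float(np.mean(labels == -1))

print(round(eps, 3), n_clusters, round(noise_share, 2))
```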

## Operational Coverage Metric

In [15]:
kmeans_coverage = (
    monday_18_df["cluster_kmeans"]
    .notna()
    .sum()
    / len(monday_18_df)
)

print(kmeans_coverage)

1.0


KMeans achieved an Actionable Coverage Rate of 100%: every pickup location was assigned to a cluster. This holds by construction, because KMeans assigns each point to its nearest centroid and has no noise label.

In [16]:
dbscan_coverage = (
    monday_18_df["cluster_dbscan"] != -1
).sum() / len(monday_18_df)

print(dbscan_coverage)

0.22388059701492538


DBSCAN achieved a much lower Actionable Coverage Rate of about 22%, because roughly 78% of the pickup locations were classified as noise (cluster = -1).

These unassigned pickups cannot be translated into driver positioning recommendations, limiting the operational usability of the model for real-time fleet allocation.
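The two coverage computations above can be expressed as a single helper that works on any label vector. This is a minimal sketch: the function name `actionable_coverage_rate` and the example label lists are hypothetical, and it assumes noise is marked with -1 as in scikit-learn's DBSCAN.

```python
import numpy as np


def actionable_coverage_rate(labels, noise_label=-1):
    """Share of points assigned to a real cluster (noise label excluded)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        raise ValueError("labels must not be empty")
    return float(np.mean(labels != noise_label))


# Hypothetical label vectors: KMeans never emits -1, DBSCAN may.
kmeans_like = [0, 1, 2, 1, 0, 2, 1, 0]
dbscan_like = [0, -1, -1, 0, -1, -1, -1, -1]

print(actionable_coverage_rate(kmeans_like))
print(actionable_coverage_rate(dbscan_like))
```

Applied to the notebook's columns, the helper would take `monday_18_df["cluster_kmeans"]` and `monday_18_df["cluster_dbscan"]` as input.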

## Algorithm Comparison

Both KMeans and DBSCAN were evaluated to identify geographical pickup demand zones.

Internal clustering validation metrics were computed to assess structural segmentation quality, while an Actionable Coverage Rate (ACR) metric was introduced to evaluate operational usability.

KMeans achieved moderate internal clustering quality, with a Silhouette Score of 0.41, a Davies–Bouldin Index of 0.47, and a Calinski–Harabasz Index of 121.20.

In contrast, DBSCAN did not detect more than one valid cluster for this time slot, which prevented the computation of separation-based metrics such as the Silhouette Score, and it achieved a much lower ACR of 22%.

This indicates that, with the parameters tested, DBSCAN can identify dense pickup areas but does not provide sufficient spatial segmentation or coverage for operational driver positioning.

Therefore, KMeans was selected as the final model based on both clustering performance and actionable recommendation capability.

## Hot-Zones Visualization using KMeans

In [17]:
fig = px.scatter_map(
    monday_18_df,
    lat="pickup_latitude",
    lon="pickup_longitude",
    color="cluster_kmeans",
    zoom=10,
    height=600
)

fig.update_layout(
    map_style="open-street-map"
)

fig.show()

## Business Interpretation - Monday 6PM Demand Pattern

Pickup clusters at 6PM on Mondays show a strong concentration of ride requests in Midtown and Lower Manhattan.

This spatial pattern reflects typical evening commuting behavior, with increased demand in business districts and near major transit hubs after working hours.

Drivers positioned in these high-density areas are more likely to receive ride requests quickly, while those in peripheral boroughs may experience longer idle times.

Recommending these Midtown hot-zones to drivers before peak hours could:

- Reduce rider waiting time below the 5–7 minute threshold
- Increase driver utilization rate
- Improve driver-passenger spatial matching
- Decrease ride cancellation probability

## Weekday-Based Hot-Zones Detection

In [18]:
weekdays = uber_sample_df["week_day"].unique()

In [19]:
weekday_clusters = {}
weekday_models = {}

In [20]:
numeric_features = [
    "pickup_latitude",
    "pickup_longitude"
]

numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            numeric_transformer,
            numeric_features
        )
    ]
)

In [21]:
for day in weekdays:

    day_df = uber_sample_df[
        uber_sample_df["week_day"] == day
    ].copy()

    X_day = day_df[numeric_features]

    X_day_scaled = (
        preprocessor.fit_transform(X_day)
    )

    kmeans = KMeans(
        n_clusters=8,
        random_state=42
    )

    kmeans.fit(X_day_scaled)

    day_df["kmeans_cluster"] = (
        kmeans.predict(X_day_scaled)
    )

    weekday_clusters[day] = day_df
    weekday_models[day] = kmeans

## Friday Hot-Zones Visualization

In [22]:
friday_df = weekday_clusters["Friday"]

fig = px.scatter_map(
    friday_df,
    lat="pickup_latitude",
    lon="pickup_longitude",
    color="kmeans_cluster",
    zoom=10,
    height=600
)

fig.update_layout(
    map_style="open-street-map"
)

fig.show()

## Business Interpretation - Weekday Demand Patterns

Clustering pickup locations by weekday reveals distinct spatial demand patterns across New York City.

These variations suggest that optimal driver positioning strategies should be dynamically adapted depending on the day of the week.

For instance, pickup demand on working days tends to concentrate in business districts, reflecting commuting-related mobility patterns, while weekend demand may be more spatially distributed.

By leveraging weekday-specific hot-zones derived from KMeans clustering, Uber can proactively recommend positioning areas to drivers in anticipation of daily demand fluctuations.

This enables more efficient driver-passenger matching and supports dynamic fleet allocation aligned with temporal demand patterns across the city.
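To turn clusters into positioning recommendations, the cluster centroids need to be expressed in latitude and longitude rather than in standardized units. The sketch below shows one way to do this on synthetic data, using the scaler's `inverse_transform`. The coordinates and the names `hot_zones` and `n_pickups` are illustrative assumptions. Note that the weekday loop above refits a single shared `preprocessor` for every day, so applying this to the notebook would require storing one fitted scaler per day.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic pickups around three hypothetical demand centres.
rng = np.random.default_rng(7)
centers = np.array([[40.75, -73.98], [40.71, -74.01], [40.77, -73.87]])
coords = np.vstack([c + rng.normal(scale=0.004, size=(50, 2)) for c in centers])

# Keep the fitted scaler so centroids can be mapped back later.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(coords)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Centroids live in standardized space; invert the scaling to get degrees.
centroids = scaler.inverse_transform(kmeans.cluster_centers_)

hot_zones = pd.DataFrame(
    centroids, columns=["pickup_latitude", "pickup_longitude"]
)
# Number of pickups per cluster, used here to rank the zones.
hot_zones["n_pickups"] = np.bincount(kmeans.labels_, minlength=3)
hot_zones = hot_zones.sort_values("n_pickups", ascending=False).reset_index(drop=True)

print(hot_zones)
```

Ranking by pickup count is only one possible ordering. Pickup density per cluster area could be used instead if clusters differ greatly in size.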

## Conclusion

This analysis aimed to identify geographical hot-zones in New York City where Uber drivers should position themselves in order to better match rider demand and reduce waiting time.

Unsupervised clustering techniques were applied to segment pickup locations into spatial demand zones. KMeans and DBSCAN algorithms were evaluated using both internal clustering validation metrics and an operational Actionable Coverage Rate (ACR) metric.

While DBSCAN was able to detect dense pickup areas, with the parameters tested it did not provide sufficient spatial segmentation and achieved a much lower coverage rate (22%), limiting its operational usability for driver positioning.

In contrast, KMeans ensured full spatial assignment and showed moderate clustering quality, enabling the identification of actionable hot-zones through cluster centroids.

By leveraging weekday-specific hot-zones derived from KMeans clustering, Uber can dynamically recommend optimal positioning areas to drivers depending on temporal demand patterns.

This approach supports proactive fleet distribution, improves driver-passenger spatial matching, and contributes to reducing rider waiting time below the acceptable 5–7 minute threshold.