# From Cancellations to Optimization: A Three-Stage Data-Driven Study of Uber Ride Dynamics in 2024
**Jessie Zhao**

## 1. Abstract
This project presents a three-stage analytical framework for studying ridesharing operations using the Uber Ride Analytics Dataset 2024, which contains 148,770 detailed booking records across multiple vehicle types. The study first develops behavioral models to identify and quantify the drivers of cancellations, distinguishing between customer- and driver-initiated terminations. Next, it advances to spatio-temporal demand forecasting, leveraging time-series models (SARIMA, Prophet, and LSTM) and clustering techniques to predict booking demand and identify cancellation hotspots. Finally, the analysis culminates in policy simulation and optimization, evaluating counterfactual scenarios such as reducing driver cancellations or shifting payment methods, and quantifying their potential impacts on revenue and user satisfaction. Together, these stages form a comprehensive, data-driven approach that bridges statistical modeling, machine learning, and causal inference. The findings provide not only academic insights into ridesharing dynamics but also actionable strategies for enhancing operational efficiency and user experience in urban mobility platforms.

## 2. Introduction
The rapid expansion of ridesharing platforms has reshaped urban transportation, offering flexible mobility at scale but also introducing persistent operational challenges. Chief among these are cancellations, which create inefficiencies, degrade customer experience, and reduce driver earnings. Equally problematic are imbalances in supply and demand that fluctuate across time and geography, and policy trade-offs that must balance incentives for drivers with affordability for customers. Addressing these challenges requires a rigorous, data-driven approach that can move beyond description toward actionable solutions.

The Uber Ride Analytics Dataset 2024 provides a rare opportunity to study these dynamics with granularity. Comprising nearly 150,000 bookings across multiple vehicle types, payment methods, and ride outcomes, the dataset captures a wide spectrum of behaviors: completed rides, customer- and driver-initiated cancellations, reasons for service failure, trip distances and durations, and reciprocal satisfaction ratings. Unlike smaller or aggregated datasets, this source enables a comprehensive view of the platform’s operational ecosystem, from individual booking behavior to systemic demand patterns.

This project is motivated by the recognition that no single analytic lens is sufficient. Descriptive summaries cannot explain why cancellations occur; predictive models without behavioral insight cannot guide interventions; and policy recommendations without forecasting are speculative. To bridge these gaps, the study adopts a three-stage framework. The first stage investigates the behavioral determinants of cancellations. The second stage builds predictive models of demand and cancellation risk over time and space. The final stage evaluates potential interventions through simulation, quantifying their impacts on revenue and satisfaction. By integrating statistical inference, machine learning, and causal reasoning, this framework offers both explanatory insight and actionable strategies, contributing to the academic study of urban mobility as well as the practical optimization of ridesharing platforms.

## 3. Method
### 3.1 Behavioral Modeling (Step 1)
The first stage of the analysis models the behavioral determinants of booking cancellations, distinguishing between customer- and driver-initiated terminations. Three techniques are employed: logistic regression, random forests, and survival analysis. Logistic regression serves as the baseline because it is interpretable and supports statistical inference on the likelihood of cancellation. Random forests are added to capture nonlinear relationships and higher-order interactions between variables such as time of day, vehicle type, and payment method. Survival analysis introduces a temporal dimension by modeling the duration from booking to cancellation, showing not only whether cancellations occur but also when they are most likely to happen. Two outcome definitions are modeled: a binary outcome (cancelled vs. completed) and a multi-class outcome that separates customer cancellations from driver cancellations. Predictor variables include booking time, vehicle type, ride distance, average driver arrival time (VTAT), average trip duration (CTAT), payment method, and ratings. Models are evaluated using AUC and precision-recall metrics, and results are interpreted both statistically and in terms of business relevance to highlight the most influential behavioral drivers of cancellations.
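To make the evaluation step concrete, the sketch below computes ROC AUC and average precision (a single-number summary of the precision-recall curve) from predicted cancellation probabilities, using only the standard library. The labels and scores are toy values, and `roc_auc` and `average_precision` are hypothetical helper names rather than the study's code; it also assumes scores are not tied, since tie handling for average precision differs between implementations. In practice a library such as scikit-learn would normally be used instead.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y for _, y in ranked)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` bookings
    return total / n_pos


# Toy example: 1 = cancelled, 0 = completed (illustrative values only).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.90, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"AUC={auc:.3f}  average precision={ap:.3f}")
```

Reporting both numbers is useful when cancellations are a minority class: AUC measures ranking quality overall, while average precision is more sensitive to how well the highest-risk bookings are identified.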

### 3.2 Spatio-Temporal Forecasting (Step 2)
The second stage of the study advances from behavioral analysis to spatio-temporal forecasting, with the objective of predicting ride demand and identifying high-risk cancellation scenarios across both time and geography. Three complementary forecasting methods are considered: SARIMA, Prophet, and LSTM. SARIMA provides a statistically rigorous baseline for modeling temporal dependencies and seasonality, while Prophet, developed by Facebook, offers an interpretable decomposition of trend, weekly cycles, and holiday effects, making it particularly suited for business communication. LSTM, as a deep learning approach, is used to capture long-range dependencies and nonlinear dynamics in sequential data, with the potential to outperform classical models when complex patterns are present. In this project, booking volumes are aggregated at hourly and daily intervals and stratified by pickup location to introduce spatial granularity. Autocorrelation and partial autocorrelation functions (ACF/PACF) guide the specification of SARIMA orders, while Prophet models incorporate both weekly seasonality and holiday indicators. LSTM models are trained on sliding windows of historical bookings and incorporate exogenous variables such as cancellation rates, VTAT, CTAT, and payment distributions. Forecast accuracy is assessed using RMSE and MAE on held-out periods, and visual comparisons between predicted and actual demand highlight model performance. To complement the time-series analysis, clustering techniques such as ST-DBSCAN are applied to identify spatial hotspots where high demand coincides with elevated cancellation risk, thereby linking the temporal and spatial dimensions of the problem.
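Two mechanical pieces of this stage can be sketched without any forecasting library: scoring a forecast with RMSE and MAE, and building the sliding windows that a sequence model such as an LSTM would be trained on. The example below is a minimal sketch under stated assumptions. The hourly series is synthetic, the seasonal-naive forecast is a stand-in reference point rather than one of the study's models, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def seasonal_naive(history: Sequence[float], horizon: int, season: int) -> List[float]:
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


def sliding_windows(series: Sequence[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (input window, next value) pairs for sequence models."""
    n_samples = len(series) - window
    inputs = [list(series[i:i + window]) for i in range(n_samples)]
    targets = [series[i + window] for i in range(n_samples)]
    return inputs, targets


# Synthetic hourly bookings: a 24-hour cycle plus a small repeating offset.
hourly = [100 + 40 * math.sin(2 * math.pi * (t % 24) / 24) + (t % 7) for t in range(24 * 8)]
train, test = hourly[:-24], hourly[-24:]  # hold out the final day

forecast = seasonal_naive(train, horizon=24, season=24)
print(f"RMSE={rmse(test, forecast):.2f}  MAE={mae(test, forecast):.2f}")

# One day of history per sample, predicting the following hour.
X, y = sliding_windows(train, window=24)
print(len(X), "training windows of length", len(X[0]))
```

A simple reference forecast like this gives the SARIMA, Prophet, and LSTM error figures something to be compared against: a model that cannot beat "same hour yesterday" on RMSE and MAE is not yet adding value.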

### 3.3 Policy Simulation and Optimization (Step 3)
The third stage of the framework focuses on policy simulation and optimization, translating insights from the behavioral and forecasting models into actionable recommendations. This step evaluates counterfactual scenarios designed to test the potential impact of interventions on platform outcomes. While causal inference methods such as Difference-in-Differences and Synthetic Control are considered conceptually, the primary approach relies on counterfactual simulations informed by predictive models from Steps 1 and 2. Baseline scenarios are constructed using observed data, and alternative scenarios are generated by systematically modifying key variables. Examples include reducing driver cancellation probabilities by 20 percent, shifting a portion of cash payments toward UPI, or increasing vehicle supply during high-risk time windows. The modified datasets are then reintroduced into the behavioral and forecasting models to estimate their downstream effects on ride completion rates, revenue generation, and satisfaction metrics. By comparing outcomes across simulated scenarios, this stage provides a quantitative assessment of the relative effectiveness of different intervention strategies. The emphasis is on bridging prediction with prescription, offering ridesharing platforms a framework for not only understanding their current dynamics but also evaluating the potential consequences of targeted operational policies.
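As an illustration of the counterfactual mechanics, the sketch below applies the "reduce driver cancellation probabilities by 20 percent" scenario to a handful of bookings and compares expected completions and revenue against the baseline. It is a simplified sketch, not the study's implementation: the `Booking` class, the function names, and the toy probabilities are all hypothetical, the outcomes are assumed to be mutually exclusive, and the probability mass removed from driver cancellations is assumed to shift entirely to completed rides.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Booking:
    p_driver_cancel: float    # predicted probability of a driver cancellation
    p_customer_cancel: float  # predicted probability of a customer cancellation
    booking_value: float      # fare collected if the ride completes


def p_complete(b: Booking) -> float:
    """Completion probability, assuming cancellation is the only failure mode."""
    return 1.0 - b.p_driver_cancel - b.p_customer_cancel


def expected_outcomes(bookings: List[Booking]) -> Dict[str, float]:
    """Expected number of completed rides and expected revenue."""
    return {
        "expected_completions": sum(p_complete(b) for b in bookings),
        "expected_revenue": sum(p_complete(b) * b.booking_value for b in bookings),
    }


def reduce_driver_cancellations(bookings: List[Booking], reduction: float = 0.20) -> List[Booking]:
    """Scenario: scale every driver-cancellation probability down by `reduction`."""
    return [replace(b, p_driver_cancel=b.p_driver_cancel * (1.0 - reduction)) for b in bookings]


# Toy bookings with made-up probabilities and fares.
baseline_bookings = [
    Booking(p_driver_cancel=0.20, p_customer_cancel=0.10, booking_value=300.0),
    Booking(p_driver_cancel=0.10, p_customer_cancel=0.05, booking_value=500.0),
    Booking(p_driver_cancel=0.30, p_customer_cancel=0.10, booking_value=200.0),
]

baseline = expected_outcomes(baseline_bookings)
scenario = expected_outcomes(reduce_driver_cancellations(baseline_bookings, reduction=0.20))
uplift = scenario["expected_revenue"] - baseline["expected_revenue"]
print(f"baseline revenue={baseline['expected_revenue']:.1f}  "
      f"scenario revenue={scenario['expected_revenue']:.1f}  uplift={uplift:.1f}")
```

In the full framework the per-booking probabilities would come from the Step 1 models and the booking volumes from the Step 2 forecasts; the comparison logic between baseline and scenario stays the same.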

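```python
import pandas as pd

# Load the raw booking records from the local copy of the dataset.
data_path = '/Users/jessiezhao/Desktop/Lucky Jessie/前行计划/uber-ride-analytics-2024/data/ncr_ride_bookings.csv'
df = pd.read_csv(data_path)

# Preview the first five rows to check column names and missing values.
df.head()
```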

| | Date | Time | Booking ID | Booking Status | Customer ID | Vehicle Type | Pickup Location | Drop Location | Avg VTAT | Avg CTAT | ... | Reason for cancelling by Customer | Cancelled Rides by Driver | Driver Cancellation Reason | Incomplete Rides | Incomplete Rides Reason | Booking Value | Ride Distance | Driver Ratings | Customer Rating | Payment Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-03-23 | 12:29:38 | "CNR5884300" | No Driver Found | "CID1982111" | eBike | Palam Vihar | Jhilmil | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2024-11-29 | 18:01:39 | "CNR1326809" | Incomplete | "CID4604802" | Go Sedan | Shastri Nagar | Gurgaon Sector 56 | 4.9 | 14.0 | ... | NaN | NaN | NaN | 1.0 | Vehicle Breakdown | 237.0 | 5.73 | NaN | NaN | UPI |
| 2 | 2024-08-23 | 08:56:10 | "CNR8494506" | Completed | "CID9202816" | Auto | Khandsa | Malviya Nagar | 13.4 | 25.8 | ... | NaN | NaN | NaN | NaN | NaN | 627.0 | 13.58 | 4.9 | 4.9 | Debit Card |
| 3 | 2024-10-21 | 17:17:25 | "CNR8906825" | Completed | "CID2610914" | Premier Sedan | Central Secretariat | Inderlok | 13.1 | 28.5 | ... | NaN | NaN | NaN | NaN | NaN | 416.0 | 34.02 | 4.6 | 5.0 | UPI |
| 4 | 2024-09-16 | 22:08:00 | "CNR1950162" | Completed | "CID9933542" | Bike | Ghitorni Village | Khan Market | 5.3 | 19.6 | ... | NaN | NaN | NaN | NaN | NaN | 416.0 | 48.21 | 4.1 | 4.3 | UPI |
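Before any of the three stages can run, the raw columns in the preview above need to be turned into modeling features and outcome labels. The sketch below shows one plausible preparation step on a small toy frame: combining `Date` and `Time` into a timestamp, extracting hour and day of week, stripping the literal quote characters from `Booking ID`, and mapping `Booking Status` to an outcome label. The function name `prepare_bookings`, the derived column names, and the outcome labels are hypothetical. The status strings "Cancelled by Driver" and "Cancelled by Customer" do not appear in the preview and are assumed here; they should be checked against `df["Booking Status"].unique()` before use.

```python
import pandas as pd

# The first three rows copy values from the preview; the last two rows and the
# two "Cancelled by ..." status strings are made up for illustration.
toy = pd.DataFrame(
    {
        "Date": ["2024-03-23", "2024-11-29", "2024-08-23", "2024-01-01", "2024-01-02"],
        "Time": ["12:29:38", "18:01:39", "08:56:10", "09:15:00", "21:40:00"],
        "Booking ID": ['"CNR5884300"', '"CNR1326809"', '"CNR8494506"', '"CNR0000001"', '"CNR0000002"'],
        "Booking Status": [
            "No Driver Found",
            "Incomplete",
            "Completed",
            "Cancelled by Driver",
            "Cancelled by Customer",
        ],
    }
)

# Assumed mapping from raw status strings to modeling outcomes; any status not
# listed here (e.g. "No Driver Found", "Incomplete") is labeled "other".
OUTCOME_MAP = {
    "Completed": "completed",
    "Cancelled by Customer": "customer_cancel",
    "Cancelled by Driver": "driver_cancel",
}


def prepare_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Add timestamp features and outcome labels without modifying the input."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["Date"] + " " + out["Time"])
    out["hour"] = out["timestamp"].dt.hour
    out["day_of_week"] = out["timestamp"].dt.dayofweek  # Monday = 0
    out["Booking ID"] = out["Booking ID"].str.strip('"')
    out["outcome"] = out["Booking Status"].map(OUTCOME_MAP).fillna("other")
    out["is_cancelled"] = out["outcome"].isin(["customer_cancel", "driver_cancel"]).astype(int)
    return out


prepared = prepare_bookings(toy)
print(prepared[["Booking ID", "hour", "day_of_week", "outcome", "is_cancelled"]])
```

Keeping "No Driver Found" and "Incomplete" in a separate "other" class is a design choice for this sketch: neither is a customer or driver cancellation, so folding them into either the cancelled or the completed class would blur the outcomes that Step 1 is trying to explain.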
