# Project description

Sweet Lift Taxi company has collected historical data on taxi orders at airports. To attract more drivers during peak hours, we need to predict the number of taxi orders for the next hour. Build a model for such a prediction.

The RMSE metric on the test set should not be more than 48.

## Project instructions

1. Download the data and resample it by one hour.
2. Analyze the data.
3. Train different models with different hyperparameters. The test sample should be 10% of the initial dataset. 
4. Test the data using the test sample and provide a conclusion.

## Data description

The data is stored in file `taxi.csv`. The number of orders is in the '*num_orders*' column.

## Preparation

In [None]:
# Loading all the libraries
import pandas as pd
import numpy as np
import time
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_squared_error

In [None]:
# Loading the data file:
taxi_orders = pd.read_csv('/datasets/taxi.csv')
taxi_orders.head()

In [None]:
# Data overview:
taxi_orders.info()

In [None]:
taxi_orders.describe()

In [None]:
# Converting datetime type:
taxi_orders['datetime'] = pd.to_datetime(taxi_orders['datetime'])
taxi_orders.set_index('datetime', inplace=True)

In [None]:
# Resampling the data by 1 hour:
hourly_orders = taxi_orders.resample('1H').sum()
hourly_orders.head()

In [None]:
# Checking for missing values:
hourly_orders.isna().sum()

In [None]:
# Checking for duplicates:
hourly_orders.index.duplicated().sum()
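
As a small illustration of what the hourly resampling step does, the sketch below builds a hypothetical series of 10-minute observations and aggregates it by hour. The 10-minute interval and the values are assumptions made for the example, not properties read from `taxi.csv`.

```python
import pandas as pd

# Hypothetical 10-minute observations (assumed interval, made-up values).
idx = pd.date_range("2018-03-01 00:00", periods=12, freq="10min")
demo = pd.DataFrame({"num_orders": list(range(1, 13))}, index=idx)

# Each hourly bin sums the six 10-minute counts that fall inside it.
# "60min" is one hour, spelled in minutes so the alias works across pandas versions.
demo_hourly = demo.resample("60min").sum()

# Resampling is easiest to reason about when the index is in chronological order.
is_sorted = demo.index.is_monotonic_increasing

print(demo_hourly)
print("Index sorted:", is_sorted)
```

Summing (rather than averaging) is the natural aggregation here because the target is a count of orders per hour.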

## Analysis

In [None]:
# Plotting hourly taxi orders:
hourly_orders.plot(figsize=(12, 4))
plt.title('Hourly Taxi Orders Over Time', fontweight='bold')
plt.xlabel('Datetime', fontweight='bold')
plt.ylabel('Number of Orders', fontweight='bold')
plt.grid(True)
plt.show()

The hourly taxi order data from March to August 2018 shows a steady upward trend, with demand rising sharply during the summer months. Frequent spikes, some exceeding 400 orders per hour, suggest periods of intense airport activity or special events. The growing volatility over time calls for deeper analysis. To gain clearer insights, we’ll explore rolling averages to smooth short-term noise, analyze seasonality and long-term trends, and engineer time-based features to capture both predictable cycles and sudden changes in order volume.

In [None]:
# Calculating 24-hour rolling mean:
hourly_orders['rolling_mean_24h'] = hourly_orders['num_orders'].rolling(window=24).mean()

# Plotting original and smoothed series:
hourly_orders[['num_orders', 'rolling_mean_24h']].plot(figsize=(12, 5))
plt.title('Hourly Taxi Orders with 24-Hour Rolling Average', fontweight='bold')
plt.xlabel('Datetime', fontweight='bold')
plt.ylabel('Number of Orders', fontweight='bold')
plt.grid(True)
plt.show()


The plot of hourly taxi orders with a 24-hour rolling average reveals a clear upward trend in demand, with smoother fluctuations that highlight long-term growth, especially from June through August. The rolling average effectively filters out short-term volatility, making underlying patterns more visible. As the next step, we'll analyze average demand by hour of day and day of week to uncover recurring cycles and inform feature engineering.

In [None]:
# Adding 'hour' column:
hourly_orders['hour'] = hourly_orders.index.hour

# Grouping by hour and calculating average orders:
avg_by_hour = hourly_orders.groupby('hour')['num_orders'].mean()

# Converting 24-hour index to AM/PM format:
hour_labels = [f'{h % 12 or 12}{"am" if h < 12 else "pm"}' for h in avg_by_hour.index]

# Plotting average orders by hour of day:
ax = avg_by_hour.plot(kind='bar', figsize=(14, 5))
plt.title('Average Taxi Orders by Hour of Day', fontweight='bold')
plt.xlabel('Hour of Day', fontweight='bold')
plt.ylabel('Average Number of Orders', fontweight='bold')
plt.grid(axis='y')
ax.set_xticklabels(hour_labels, rotation=45, ha='center')
plt.show()

The plot shows that taxi demand peaks around midnight `(12am–2am)`, likely due to late-night arrivals or airport activity. After a dip in the early morning `(5am–7am)`, demand gradually rises again through the afternoon and evening, with a smaller peak around `5pm` and a steady increase from `8pm onward`. These patterns suggest a clear daily cycle, with strong nighttime activity and a secondary evening rise, which makes the hour of day a useful feature for modeling hourly demand.

In [None]:
# Adding day of week column:
hourly_orders['day_of_week'] = hourly_orders.index.dayofweek

# Grouping by day of week and calculating average orders:
avg_by_day = hourly_orders.groupby('day_of_week')['num_orders'].mean()

# Weekday labels:
weekday_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Plotting average orders by day of Week:
ax = avg_by_day.plot(kind='bar', figsize=(8, 4))
plt.title('Average Taxi Orders by Day of Week', fontweight='bold')
plt.xlabel('Day of Week', fontweight='bold')
plt.ylabel('Average Number of Orders', fontweight='bold')
ax.set_xticklabels(weekday_labels, rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

The plot of average taxi orders by day of week shows that demand is highest on `Fridays`, followed closely by `Mondays`. `Weekends` see slightly lower activity, suggesting reduced airport traffic or traveler volume. `Tuesday` appears to have the lowest average, while `midweek` demand stays relatively consistent. These weekly patterns suggest that incorporating the day of week as a feature may improve predictive performance.

# `Seasonal Decomposition`:

In [None]:
# Decomposing hourly orders:
decomposition = seasonal_decompose(hourly_orders['num_orders'], model='additive', period=24)
decomposition.plot()
plt.tight_layout()
plt.show()

The seasonal decomposition of the hourly taxi order data revealed three key patterns:

- A gradually rising trend, indicating a steady increase in demand over time

- A strong daily seasonal pattern (repeating every 24 hours), showing consistent hourly fluctuations

- Relatively stable residuals, with some increase in volatility toward the end of the period

These insights confirm that the time series has both trend and seasonality, making it essential to extract informative time-based patterns before modeling. Based on this, we proceed with feature engineering—including lag features, rolling averages, and time-based indicators—to help the model learn from these temporal dynamics.

# `Feature Engineering`:

In this section, we’ll create lag features, rolling averages, and time-based features (hour, day of week, is_weekend) to help the model learn from temporal patterns.

In [None]:
# Creating a new DataFrame to hold features:
features = hourly_orders.copy()

features.head()

# `Adding Time-Based Features`:

In [None]:
# Time-based features:
features['hour'] = features.index.hour
features['day_of_week'] = features.index.dayofweek
features['is_weekend'] = features['day_of_week'].isin([5, 6]).astype(int)

# `Adding Lag Features`:

In [None]:
# Lag features:
features['lag_1'] = features['num_orders'].shift(1)     
features['lag_24'] = features['num_orders'].shift(24)   
features['lag_168'] = features['num_orders'].shift(168) 

# `Adding Shifted Rolling Mean Features`:

In [None]:
# Dropping the unshifted EDA rolling mean (it includes the current hour) to avoid data leakage:
features.drop(columns=['rolling_mean_24h'], inplace=True)

In [None]:
# Rolling mean features:
features['rolling_mean_3'] = features['num_orders'].shift(1).rolling(window=3).mean()
features['rolling_mean_24h'] = features['num_orders'].shift(1).rolling(window=24).mean()
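
To make the reason for the `shift(1)` concrete, here is a minimal sketch on a toy series (made-up values). It compares an unshifted rolling mean, which includes the current observation, with the shifted version, which only uses earlier observations.

```python
import pandas as pd

# Toy series with made-up values, used only to show which rows each window covers.
s = pd.Series([10, 20, 30, 40, 50], name="num_orders")

# Unshifted window: the mean at position t includes the value at t itself,
# so the feature would contain part of the target.
leaky = s.rolling(window=3).mean()

# Shifted window: the mean at position t covers positions t-3..t-1 only.
safe = s.shift(1).rolling(window=3).mean()

comparison = pd.DataFrame({"num_orders": s, "leaky": leaky, "safe": safe})
print(comparison)
```

The shifted feature is missing for one more leading row than the unshifted one, which is part of why the rows with missing values are dropped afterwards.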

In [None]:
# Dropping rows with missing values:
features.dropna(inplace=True)

features.head()

In [None]:
# Plotting num_orders vs 24-hour rolling mean:
features[['num_orders', 'rolling_mean_24h']].plot(
    figsize=(12, 4),
    linewidth=2
)
plt.title('Taxi Orders vs 24-Hour Rolling Mean', fontweight='bold')
plt.xlabel('Datetime', fontweight='bold')
plt.ylabel('Number of Orders', fontweight='bold')
plt.grid(True)
plt.show()

We can clearly see an upward trend in the number of orders over time, with the rolling average line capturing the gradual increase in baseline demand. While actual orders fluctuate greatly from hour to hour, the 24-hour rolling mean shows a steady climb—indicating a growing need for taxis over the months. 

In [None]:
# Plotting num_orders vs 24-hour lag
features[['num_orders', 'lag_24']].plot(
    figsize=(12, 4),
    linewidth=2
)
plt.title('Taxi Orders vs 24-Hour Lag', fontweight='bold')
plt.xlabel('Datetime', fontweight='bold')
plt.ylabel('Number of Orders', fontweight='bold')
plt.grid(True)
plt.show()

The close alignment between the num_orders and `lag_24` lines suggests a strong daily autocorrelation in the data—meaning taxi demand at a given hour tends to resemble demand at that same hour the day before.
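
The visual impression of daily autocorrelation can also be checked numerically. The sketch below uses a synthetic hourly series with a built-in 24-hour cycle (an assumption for illustration, not the taxi data) and computes the autocorrelation at a few lags with `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series: a 24-hour sine cycle plus noise (made-up data).
rng = np.random.default_rng(0)
hours = np.arange(24 * 28)  # four weeks of hourly points
synthetic = pd.Series(
    50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size),
    index=pd.date_range("2018-03-01", periods=hours.size, freq="60min"),
)

# Pearson correlation between the series and a copy of itself shifted by `lag` hours.
autocorr_by_lag = {lag: synthetic.autocorr(lag=lag) for lag in (1, 12, 24, 168)}

for lag, value in autocorr_by_lag.items():
    print(f"lag {lag:>3}: {value:.2f}")
```

On the notebook's data, the same dictionary comprehension could be applied to `features['num_orders']` to see how strongly the 1-hour, 24-hour, and 168-hour lags correlate with the target.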

## Training

In [None]:
# Splitting the dataset into features and target:
X = features.drop('num_orders', axis=1)
y = features['num_orders']

# Splitting the dataset chronologically: hold out the last 10% of the data for final testing:
split_idx = int(len(X) * 0.9)

# Creating train and test sets:
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Display the size of each subset:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
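
The positional split above keeps the rows in time order. As a hedged alternative sketch, scikit-learn's `train_test_split` with `shuffle=False` produces the same kind of chronological split; the frame below is a placeholder with made-up values, and the final check confirms that no training timestamp falls after the start of the test period.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder hourly frame with 100 rows (made-up values).
idx = pd.date_range("2018-03-01", periods=100, freq="60min")
X_demo = pd.DataFrame({"lag_1": np.arange(100)}, index=idx)
y_demo = pd.Series(np.arange(100), index=idx)

# shuffle=False keeps the rows in order, so the last 10% form the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, shuffle=False
)

# Every training timestamp should precede every test timestamp.
no_overlap = X_tr.index.max() < X_te.index.min()
print(len(X_tr), len(X_te), no_overlap)
```

A shuffled split would let the model train on hours that come after some test hours, which would make the test RMSE look better than a real forecast could be.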

# `Random Forest Model`:

In [None]:
# Defining parameter grid:
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10],
    'min_samples_split': [2, 5]
}

# Time-series-aware cross-validation:
tscv = TimeSeriesSplit(n_splits=3)

# Initializing Random Forest Regressor:
rf = RandomForestRegressor(random_state=42)

# RandomizedSearchCV setup:
search = RandomizedSearchCV(
    rf,
    param_distributions=param_grid,
    n_iter=5,
    cv=tscv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

# Fitting on the training set once and measuring the search time:
start_time = time.time()
search.fit(X_train, y_train)
training_time = time.time() - start_time

# Best model from search (refit on the full training set by default):
best_rf = search.best_estimator_

# Measuring prediction time on training set:
start_time = time.time()
y_pred_train = best_rf.predict(X_train)
prediction_time = time.time() - start_time

# Calculating RMSE on training set (np.sqrt also works on scikit-learn versions without `squared=False`):
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

# Displaying training results:
print("\n--- Tuned Random Forest Training Summary ---")
print(f"Training Time     : {training_time:.2f} seconds")
print(f"Prediction Time   : {prediction_time:.2f} seconds")
print(f"Train RMSE        : {rmse_train:.2f}")

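These results show that the Random Forest hyperparameter search completes in under a minute and that the tuned model predicts very quickly. The low training RMSE indicates that the model fits the training data well, but because it is computed on the same data the model was fitted on, it is an optimistic estimate. Evaluation on the held-out test set will show how well the model generalizes.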
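
To show what the `TimeSeriesSplit` cross-validation used in the search does, the following sketch prints the fold boundaries for a small placeholder array of 20 ordered rows (the size is arbitrary). Each fold trains on an expanding window of earlier rows and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 ordered placeholder rows standing in for chronologically sorted data.
X_demo = np.arange(20).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, val_idx in tscv_demo.split(X_demo):
    # Record the last training row and the validation range of each fold.
    fold = (int(train_idx.max()), int(val_idx.min()), int(val_idx.max()))
    folds.append(fold)
    print(f"train: 0..{fold[0]}  validate: {fold[1]}..{fold[2]}")
```

Because validation rows always come after the training rows, the cross-validated score is estimated under the same "past predicts future" condition as the final test.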

# `LightGBM model`:

In [None]:
# Defining parameter grid:
param_grid_lgb = {
    'n_estimators': [100, 300],
    'max_depth': [10, 20],
    'num_leaves': [30, 50],
    'learning_rate': [0.1]
}

# Initializing LightGBM Regressor
lgb_model = lgb.LGBMRegressor(random_state=42)

# RandomizedSearchCV setup:
search_lgb = RandomizedSearchCV(
    lgb_model,
    param_distributions=param_grid_lgb,
    n_iter=5,
    cv=tscv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

# Fitting on the training set once and measuring the search time:
start_time = time.time()
search_lgb.fit(X_train, y_train)
training_time = time.time() - start_time

# Best model from search (refit on the full training set by default):
best_lgb = search_lgb.best_estimator_

# Measuring prediction time on training set:
start_time = time.time()
y_pred_train_lgb = best_lgb.predict(X_train)
prediction_time = time.time() - start_time

# Calculating RMSE on training set (np.sqrt also works on scikit-learn versions without `squared=False`):
rmse_train_lgb = np.sqrt(mean_squared_error(y_train, y_pred_train_lgb))

# Displaying training results:
print("\n--- Tuned LightGBM Training Summary ---")
print(f"Training Time     : {training_time:.2f} seconds")
print(f"Prediction Time   : {prediction_time:.2f} seconds")
print(f"Train RMSE        : {rmse_train_lgb:.2f}")

These results show that the LightGBM hyperparameter search completes in just over 6 seconds and that the tuned model makes predictions almost instantly. The low training RMSE indicates that the model captures patterns in the training data, but because it is computed on the same data the model was fitted on, it is an optimistic estimate. Evaluation on the held-out test set will show how well the model generalizes.

## Testing

# `Random Forest Model`:

In [None]:
# Random Forest Test Evaluation:
start_time = time.time()
y_pred_rf = best_rf.predict(X_test)
rf_test_time = time.time() - start_time

# Calculating RMSE on test set:
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

# Displaying final test results:
print("\n--- Random Forest Test Set Evaluation ---")
print(f"Prediction Time : {rf_test_time:.2f} seconds")
print(f"Test RMSE       : {rmse_rf:.2f}")

The tuned Random Forest model achieved a test **RMSE of 44.51**, which is below the project’s threshold of **48**.

Prediction time: just 0.02 seconds

This indicates the model generalizes well to unseen data and is suitable for production use from both an accuracy and speed perspective.

# `LightGBM model`:

In [None]:
# LightGBM Test Evaluation
start_time = time.time()
y_pred_lgb = best_lgb.predict(X_test)
lgb_test_time = time.time() - start_time

# Calculating RMSE on test set:
rmse_lgb = np.sqrt(mean_squared_error(y_test, y_pred_lgb))

# Displaying final test results:
print("\n--- LightGBM Test Set Evaluation ---")
print(f"Prediction Time : {lgb_test_time:.2f} seconds")
print(f"Test RMSE       : {rmse_lgb:.2f}")

The tuned LightGBM model performed very well on the test set, achieving a test **RMSE of 43.27**, comfortably below the target threshold of **48**.

Prediction time: just 0.07 seconds

This result shows that the model not only generalizes effectively to unseen data but also predicts quickly enough for real-time forecasting needs.
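
One way to put these test scores in context is to compare them with a naive forecast that simply repeats the previous hour's value. The sketch below is not part of the original analysis; it uses made-up counts and a hypothetical helper `rmse` to show how such a baseline could be computed.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up hourly counts, for illustration only.
y_demo = pd.Series([100, 120, 90, 110, 130, 95])

# Previous-hour forecast: the prediction for hour t is the observed value at t-1.
naive_pred = y_demo.shift(1)

# The first row has no previous value, so it is excluded from the score.
mask = naive_pred.notna()
baseline_rmse = rmse(y_demo[mask], naive_pred[mask])
print(f"Naive baseline RMSE: {baseline_rmse:.2f}")
```

In this notebook the `lag_1` column of `X_test` already holds the previous-hour value, so `rmse(y_test, X_test['lag_1'])` would give the baseline on the same test split. A model is only useful if its test RMSE is clearly below that number.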

## General Conclusion:

In this project, we developed and evaluated models to forecast hourly taxi orders using historical airport data. After thorough data preparation, including resampling, exploratory analysis, decomposition, and feature engineering, we trained and fine-tuned two machine learning models: Random Forest and LightGBM.

Both models met the project requirement of an RMSE under 48 on the test set. `The LightGBM model` achieved the lower **RMSE of 43.27**, making it the more accurate choice for deployment. Its prediction time **(0.07 seconds)** was slightly longer than that of Random Forest (0.02 seconds), but both are fast enough for a real-time prediction setting. `The Random Forest model`, while slightly less accurate **(RMSE: 44.51)**, also delivered reliable performance.

These results suggest that time-based features, lag values, and rolling means helped the models capture temporal patterns in demand, although the contribution of each feature was not measured separately. The chosen approach enables the company to anticipate demand spikes and optimize driver distribution during peak hours.