# NYC Taxi Trip Duration Prediction – Modeling

This notebook covers the modeling phase of the project:
- Load and preprocess data  
- Train baseline models  
- Evaluate candidate models (tree-based ensembles)  
- Save results for tracking  
- Train and save final model  
- Test final model on hold-out dataset  


## 1. Setup and Install Dependencies
We install the specialized libraries (XGBoost, LightGBM, CatBoost); the remaining packages are imported in the next section.


In [28]:
!pip install xgboost
!pip install lightgbm
!pip install catboost



## 2. Mount Drive and Clone Repository
We import the required packages, mount Google Drive for data storage, and clone the project repository for its preprocessing functions.


In [29]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#models
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

#mount data
from google.colab import drive
drive.mount('/content/drive')
%matplotlib inline

#disable warnings
import warnings
warnings.filterwarnings('ignore')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import preprocessing functions

In [30]:
!git clone https://github.com/Eng-Moaz/NYC-Taxi-Trip-Duration.git
%cd NYC-Taxi-Trip-Duration


Cloning into 'NYC-Taxi-Trip-Duration'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 48 (delta 9), reused 48 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (48/48), 730.16 KiB | 2.92 MiB/s, done.
Resolving deltas: 100% (9/9), done.
/content/NYC-Taxi-Trip-Duration/NYC-Taxi-Trip-Duration/NYC-Taxi-Trip-Duration


In [31]:
import sys
sys.path.append('src')
from preprocessing import preprocess

## 3. Data Loading and Preprocessing
We load train, validation, and sample splits from Drive, then apply preprocessing functions from `preprocessing.py`.


In [32]:
train = pd.read_csv('/content/drive/MyDrive/Projects/NYC trip duration/split/train.csv')
val = pd.read_csv('/content/drive/MyDrive/Projects/NYC trip duration/split/val.csv')
train_sample = pd.read_csv('/content/drive/MyDrive/Projects/NYC trip duration/split_sample/train.csv')
val_sample = pd.read_csv('/content/drive/MyDrive/Projects/NYC trip duration/split_sample/val.csv')
train, val = preprocess(train), preprocess(val)
train_sample, val_sample = preprocess(train_sample), preprocess(val_sample)

In [33]:
# Separate the features (X) from the target column (y) for each split
train_X, train_y = train.drop('trip_duration_transformed', axis=1), train['trip_duration_transformed']
val_X, val_y = val.drop('trip_duration_transformed', axis=1), val['trip_duration_transformed']
train_sample_X, train_sample_y = train_sample.drop('trip_duration_transformed', axis=1), train_sample['trip_duration_transformed']
val_sample_X, val_sample_y = val_sample.drop('trip_duration_transformed', axis=1), val_sample['trip_duration_transformed']

## 4. Project Structure for Saving Results
We create folders for storing model results:
- baselines/ → Simple models for comparison  
- candidates/ → Advanced ensemble models  
- final/ → Best chosen model and test results  


In [34]:
import os
import json

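# Use absolute paths: the working directory is now the cloned repository,
# while the cells below save their results under /content/results.
for folder in ["/content/results/baselines", "/content/results/candidates", "/content/results/final"]:
    os.makedirs(folder, exist_ok=True)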


## 5. Helper Functions
- `save_results()` → saves results as JSON  
- `evaluate_model()` → trains model and calculates metrics (MAE, RMSE, R²)


In [35]:
def save_results(results: dict, filepath: str):
    """
    Save model results dictionary as JSON.
    """
    with open(filepath, "w") as f:
        json.dump(results, f, indent=4)
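
As a quick sanity check on this pattern, the sketch below round-trips a small dictionary through a temporary file. `save_results_safe`, `load_results`, the file name `demo_results.json`, and the sample values are all hypothetical and not part of the notebook. The `default=str` argument is an optional addition: it stores any value that `json` cannot encode natively as a string, which may help if `get_params()` ever returns a non-serializable object.

```python
import json
import os
import tempfile


def save_results_safe(results: dict, filepath: str) -> None:
    """Hypothetical variant of save_results: unknown types are stored as strings."""
    with open(filepath, "w") as f:
        json.dump(results, f, indent=4, default=str)


def load_results(filepath: str) -> dict:
    """Hypothetical helper: read a results file back into a dictionary."""
    with open(filepath) as f:
        return json.load(f)


# Illustrative values only; a set is not JSON-serializable by default,
# so it exercises the default=str fallback.
demo = {"Model": {"MAE": 0.25, "RMSE": 0.5, "tags": {"demo"}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_results.json")
    save_results_safe(demo, path)
    loaded = load_results(path)

print(loaded["Model"]["MAE"])
```

The trade-off is that stringified values do not convert back to their original type on load, so this suits logging rather than exact reconstruction of a model's parameters.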


In [36]:
def evaluate_model(model, X_train, y_train, X_val, y_val):
    """
    Fit model, predict, and return metrics.
    """
    model.fit(X_train, y_train)
    preds = model.predict(X_val)

    results = {
        "params": model.get_params(),
        "MAE": mean_absolute_error(y_val, preds),
        "RMSE": np.sqrt(mean_squared_error(y_val, preds)),
        "R2": r2_score(y_val, preds)
    }
    return results
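
To make the three metrics concrete, here is a minimal sketch that computes them by hand on a toy example, using the standard definitions (mean absolute error, root of the mean squared error, and one minus the ratio of residual to total sum of squares). `regression_metrics` and the toy values are hypothetical; the notebook itself relies on the scikit-learn functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"MAE": mae, "RMSE": math.sqrt(mse), "R2": 1 - ss_res / ss_tot}


# Toy values for illustration only
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two usually points to a few badly predicted trips.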

## 6. Baseline Models
We start with simple models trained and evaluated on the sample splits:
- Linear Regression  
- Ridge Regression  
- Lasso Regression  
- Random Forest (default settings)  

These act as a baseline to judge more complex models.


In [37]:
baseline_models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "RandomForest_Default": RandomForestRegressor(random_state=42)
}

baseline_results = {}

for name, model in baseline_models.items():
    metrics = evaluate_model(model, train_sample_X, train_sample_y, val_sample_X, val_sample_y)
    baseline_results[name] = metrics
    print(f"{name}: {metrics}")

# Save to results
save_results(baseline_results, "/content/results/baselines/baseline_results.json")

LinearRegression: {'params': {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}, 'MAE': 0.2997796099286002, 'RMSE': np.float64(0.41291303746294544), 'R2': 0.6826770493804539}
Ridge: {'params': {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.0001}, 'MAE': 0.30007437798781095, 'RMSE': np.float64(0.4142042718476044), 'R2': 0.6806893235353795}
Lasso: {'params': {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': 1000, 'positive': False, 'precompute': False, 'random_state': None, 'selection': 'cyclic', 'tol': 0.0001, 'warm_start': False}, 'MAE': 0.42822719854458685, 'RMSE': np.float64(0.5473508799009009), 'R2': 0.4424087708432711}
RandomForest_Default: {'params': {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 

## 7. Candidate Models
We train stronger tree-based models:
- Gradient Boosting (commented out in the cell below, so it is not part of this run)  
- XGBoost  
- LightGBM  
- CatBoost  

Each model is evaluated on both the sample data (quick tests) and the full data (final evaluation).


In [38]:
results = {}
results_sample = {}

models = {
    #"GradientBoosting": GradientBoostingRegressor(n_estimators=200, max_depth=6, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42, n_jobs=-1),
    "LightGBM": LGBMRegressor(n_estimators=200, max_depth=-1, learning_rate=0.1, random_state=42, n_jobs=-1),
    "CatBoost": CatBoostRegressor(n_estimators=200, depth=6, learning_rate=0.1, random_state=42, verbose=0)
}


for name, model in models.items():
    print(f"\nTraining {name}...")
    results[name] = evaluate_model(model, train_X, train_y, val_X, val_y)
    results_sample[name] = evaluate_model(model, train_sample_X, train_sample_y, val_sample_X, val_sample_y)
    print(f"Done {name}...")
    metrics = {k: results[name][k] for k in ["MAE", "RMSE", "R2"]}
    print(metrics)


# Save to results
save_results(results, "/content/results/candidates/candidate_results.json")
save_results(results_sample, "/content/results/candidates/candidate_results_sample.json")



Training XGBoost...
Done XGBoost...
{'MAE': 0.010463517162666554, 'RMSE': np.float64(0.016340910921777852), 'R2': 0.9994511773619413}

Training LightGBM...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.169042 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2958
[LightGBM] [Info] Number of data points in the train set: 959130, number of used features: 24
[LightGBM] [Info] Start training from score 6.450472
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000358 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2958
[LightGBM] [Info] Number of data points in the train set: 4782, number of used features: 24
[LightGBM] [Info] Start training from score 6.431870
Done LightGBM...
{'MAE': 0.012901100295933611, 'RMSE': np.float64(0.018962946604024634), 'R2': 

## 8. Model Selection
- Based on the validation scores (MAE, RMSE, R²), we choose XGBoost as the final model (MAE ≈ 0.0105, RMSE ≈ 0.0163, R² ≈ 0.9995).
- We skip hyperparameter tuning, since the validation scores are already close to perfect.
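
The selection step can also be done programmatically. The sketch below is one possible way, not the notebook's own code: it picks the model with the lowest value of a chosen error metric. The scores are copied from the validation output printed in Section 7; CatBoost is left out because its metrics do not appear in that output, and `pick_best` is a hypothetical helper.

```python
# Validation scores on the full data, copied from the Section 7 output
candidate_scores = {
    "XGBoost": {"MAE": 0.010463517162666554, "RMSE": 0.016340910921777852},
    "LightGBM": {"MAE": 0.012901100295933611, "RMSE": 0.018962946604024634},
}


def pick_best(scores, metric="RMSE"):
    """Return the model name with the lowest value of an error metric."""
    return min(scores, key=lambda name: scores[name][metric])


best_name = pick_best(candidate_scores)
print(best_name)
```

For a metric where higher is better, such as R², the same idea applies with `max` instead of `min`.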

In [39]:
import joblib
final_model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1, random_state=42, n_jobs=-1)
final_model.fit(train_X, train_y)
joblib.dump(final_model, "/content/results/final/final_model.pkl")

['/content/results/final/final_model.pkl']

In [40]:
test = pd.read_csv('/content/drive/MyDrive/Projects/NYC trip duration/split/test.csv')
test = preprocess(test)
test_X = test.drop('trip_duration_transformed', axis=1)
test_y = test['trip_duration_transformed']

## 9. Final Testing
We load the saved model, test on the hold-out test set, and record final metrics.


In [41]:
# Reload the saved model to confirm the serialized artifact works end to end
model = joblib.load("/content/results/final/final_model.pkl")
y_preds = model.predict(test_X)

# Metrics are computed on the transformed target scale
test_results = {
    "MAE": mean_absolute_error(test_y, y_preds),
    "RMSE": np.sqrt(mean_squared_error(test_y, y_preds)),
    "R2": r2_score(test_y, y_preds)
}
print(test_results)
save_results(test_results, "/content/results/final/testing_results.json")


{'MAE': 0.010557512258037189, 'RMSE': np.float64(0.016497358692916356), 'R2': 0.9994408524418551}
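
These metrics are measured on `trip_duration_transformed`, not in seconds. The transform itself lives in `preprocessing.py` and is not shown in this notebook, so the sketch below assumes, purely for illustration, that it is `log1p(trip_duration)`. Under that assumption, an absolute error of `d` on the transformed scale corresponds to a multiplicative error of about `exp(d)` on the duration. `to_seconds` and `approx_relative_error` are hypothetical helpers, and the value 6.45 is the approximate starting score LightGBM reported during training.

```python
import math


def to_seconds(transformed):
    """Invert an ASSUMED log1p transform: seconds = exp(t) - 1."""
    return math.expm1(transformed)


def approx_relative_error(mae_transformed):
    """Under the log1p ASSUMPTION, a log-scale error d maps to a factor exp(d)."""
    return math.expm1(mae_transformed)


test_mae = 0.010557512258037189  # MAE from the test output above
typical_target = 6.45            # approximate mean target seen in the LightGBM log

typical_seconds = to_seconds(typical_target)
relative_error = approx_relative_error(test_mae)
print(typical_seconds, relative_error)
```

If the actual transform differs, replace `math.expm1` with the matching inverse before interpreting the errors in seconds.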
