# Fleet Analytics & Prediction Tasks

## Classification Task
Predict whether a trip is operationally risky or problematic.

We define a trip as **High Risk (1)** if its status indicates a delay or problem
(e.g., "Delayed", "Cancelled", "Failed"); otherwise it is **Low Risk (0)**.
In this dataset, "Delayed" is the only such status that occurs.

This helps the fleet manager identify trips likely to cause service issues.
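
The labelling rule above can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: `RISKY_STATUSES` and `label_high_risk` are hypothetical names, and the sample values are toy data rather than rows from the real file.

```python
import pandas as pd

# Hypothetical set of statuses treated as risky (lowercase for matching)
RISKY_STATUSES = {"delayed", "cancelled", "failed"}

def label_high_risk(status: pd.Series) -> pd.Series:
    """Return 1 for risky statuses, 0 otherwise (case- and whitespace-insensitive)."""
    normalized = status.str.strip().str.lower()
    return normalized.isin(RISKY_STATUSES).astype(int)

# Toy example, not rows from the real dataset
sample = pd.Series(["Delivered", "Delayed", "In Transit", "cancelled ", "Scheduled"])
labels = label_high_risk(sample)
print(labels.tolist())  # [0, 1, 0, 1, 0]
```

Using `isin` with a set keeps the rule easy to extend if other problem statuses appear in future data.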

## Regression Task
Predict the **maintenance_cost** of a vehicle/trip based on:
- distance_km
- vehicle age / type
- driving behaviour (violations, speeding incidents)
- fuel_cost, toll_cost, load_value
- weather_condition and route information
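
As a rough sketch of what a baseline for this task could look like, the block below fits a scaled linear regression and reports MAE and R². It runs on synthetic data so that it is self-contained: the feature matrix, the target formula, and the name `reg` are assumptions for illustration and do not come from the fleet dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for three numeric features (e.g. distance, fuel cost, toll cost)
X = rng.normal(size=(500, 3))
# Hypothetical linear target with small noise; NOT the real maintenance_cost
y = 100 + 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features, then fit an ordinary least-squares model
reg = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}  R^2: {r2:.3f}")
```

On real data the same pipeline shape would apply, but the scores depend entirely on how much signal the features carry.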

In [1]:
import pandas as pd

df = pd.read_csv("../data/fleet_dummy_5000.csv")

df.head()

|  | trip_id | driver_id | driver_name | origin | destination | distance_km | pickup_time | delivery_time | actual_delivery_time | status | ... | profit_margin | vehicle_id | vehicle_type | violation_count | speeding_incidents | gps_start_lat | gps_start_lon | gps_end_lat | gps_end_lon | weather_condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 112 | Driver_112 | New York | Boston | 615.26 | 2025-08-28 08:03:12.970230 | 2025-09-01 07:03:12.970230 | 2025-09-01 06:40:12.970230 | Delivered | ... | 608.22 | 209 | Truck | 0 | 0 | 29.893102 | -91.745548 | 26.687392 | -113.888335 | Sunny |
| 1 | 2 | 139 | Driver_139 | Dallas | Atlanta | 2112.03 | 2025-08-28 20:03:12.970302 | 2025-09-01 20:03:12.970302 | 2025-09-01 22:11:12.970302 | Delivered | ... | 2764.47 | 211 | Van | 3 | 3 | 29.475943 | -103.372264 | 33.766738 | -98.863483 | Cloudy |
| 2 | 3 | 111 | Driver_111 | Miami | Boston | 281.87 | 2025-08-27 05:03:12.970348 | 2025-08-28 22:03:12.970348 | 2025-08-28 21:54:12.970348 | Scheduled | ... | 97.58 | 208 | Van | 0 | 0 | 25.038389 | -74.604847 | 32.415163 | -81.995082 | Cloudy |
| 3 | 4 | 113 | Driver_113 | Seattle | Los Angeles | 1362.6 | 2025-08-28 21:03:12.970392 | 2025-09-01 03:03:12.970392 | 2025-09-01 05:22:12.970392 | Delivered | ... | 1105.35 | 237 | Van | 1 | 1 | 35.949256 | -92.280003 | 27.23883 | -99.362534 | Sunny |
| 4 | 5 | 109 | Driver_109 | Seattle | Denver | 410.36 | 2025-08-28 11:03:12.970434 | 2025-08-30 11:03:12.970434 | 2025-08-30 11:51:12.970434 | Delivered | ... | -32.84 | 213 | Van | 0 | 0 | 25.305279 | -95.553751 | 34.761454 | -76.599042 | Rain |


In [4]:
df["status"].value_counts()

status
Delivered     2973
In Transit    1013
Scheduled      513
Delayed        501
Name: count, dtype: int64

In [5]:
# Normalize status to lowercase
df["status_lower"] = df["status"].str.lower()

# Define high-risk trips: "Delayed" is the only problem status present in this dataset
df["high_risk"] = (df["status_lower"] == "delayed").astype(int)

# Check result
df["high_risk"].value_counts()

high_risk
0    4499
1     501
Name: count, dtype: int64
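
The counts above imply a strong class imbalance. The short calculation below, using only the two numbers from that output, shows the positive rate and the accuracy a model would get by always predicting Low Risk. It is a useful reference point when reading accuracy figures later on.

```python
# Class counts copied from the high_risk value_counts() output above
counts = {0: 4499, 1: 501}

total = sum(counts.values())
positive_rate = counts[1] / total                 # share of High Risk trips
majority_accuracy = max(counts.values()) / total  # accuracy of "always predict 0"

print(f"positive rate: {positive_rate:.4f}")            # 0.1002
print(f"majority-class accuracy: {majority_accuracy:.4f}")  # 0.8998
```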

In [6]:
# Convert pickup_time to datetime
df["pickup_time"] = pd.to_datetime(df["pickup_time"])

# Add time-based features
df["pickup_hour"] = df["pickup_time"].dt.hour
df["pickup_dayofweek"] = df["pickup_time"].dt.dayofweek

# Feature columns
features = [
    "distance_km",
    "fuel_cost",
    "driver_pay",
    "toll_cost",
    "load_value",
    "profit_margin",
    "violation_count",
    "speeding_incidents",
    "gps_start_lat",
    "gps_start_lon",
    "gps_end_lat",
    "gps_end_lon",
    "pickup_hour",
    "pickup_dayofweek"
]

# Feature matrix
X = df[features]

# Targets
y_class = df["high_risk"]          # classification
y_reg = df["maintenance_cost"]     # regression

X.head(), y_class.head(), y_reg.head()

(   distance_km  fuel_cost  driver_pay  toll_cost  load_value  profit_margin  \
 0       615.26     205.54      324.76      71.01     1292.17         608.22   
 1      2112.03     917.13     1050.47      41.84     4846.48        2764.47   
 2       281.87      76.47      161.41      67.92      676.83          97.58   
 3      1362.60     580.53      816.53      71.46     2735.60        1105.35   
 4       410.36     192.53      218.25      55.18      728.79         -32.84   
 
    violation_count  speeding_incidents  gps_start_lat  gps_start_lon  \
 0                0                   0      29.893102     -91.745548   
 1                3                   3      29.475943    -103.372264   
 2                0                   0      25.038389     -74.604847   
 3                1                   1      35.949256     -92.280003   
 4                0                   0      25.305279     -95.553751   
 
    gps_end_lat  gps_end_lon  pickup_hour  pickup_dayofweek  
 0    26.687392 
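
To make the two time-based features concrete, the sketch below applies the same `.dt.hour` and `.dt.dayofweek` accessors to two `pickup_time` strings copied from the `df.head()` output. In pandas, `dayofweek` counts from Monday=0 to Sunday=6.

```python
import pandas as pd

# Two pickup_time strings copied from the df.head() output
pickup = pd.to_datetime(pd.Series([
    "2025-08-28 08:03:12.970230",
    "2025-08-27 05:03:12.970348",
]))

hours = pickup.dt.hour          # hour of day, 0-23
weekdays = pickup.dt.dayofweek  # Monday=0 ... Sunday=6

print(hours.tolist())     # [8, 5]
print(weekdays.tolist())  # [3, 2]  (Thursday, Wednesday)
```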

In [7]:
print("X shape:", X.shape)
print("y_class shape:", y_class.shape)
print("y_reg shape:", y_reg.shape)

df[features + ["maintenance_cost", "high_risk"]].isna().sum()

X shape: (5000, 14)
y_class shape: (5000,)
y_reg shape: (5000,)


distance_km           0
fuel_cost             0
driver_pay            0
toll_cost             0
load_value            0
profit_margin         0
violation_count       0
speeding_incidents    0
gps_start_lat         0
gps_start_lon         0
gps_end_lat           0
gps_end_lon           0
pickup_hour           0
pickup_dayofweek      0
maintenance_cost      0
high_risk             0
dtype: int64

In [8]:
from sklearn.model_selection import train_test_split

# Classification split
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y_class, test_size=0.2, random_state=42, stratify=y_class
)

# Regression split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)

print("Classification Train:", X_train_c.shape, "Test:", X_test_c.shape)
print("Regression Train:", X_train_r.shape, "Test:", X_test_r.shape)

Classification Train: (4000, 14) Test: (1000, 14)
Regression Train: (4000, 14) Test: (1000, 14)
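
The `stratify=y_class` argument is meant to keep the High Risk share the same in the train and test sets. A minimal way to check this is sketched below on synthetic labels with the same class counts as `high_risk`; the placeholder feature column is an assumption made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as high_risk (4499 zeros, 501 ones)
y = np.array([0] * 4499 + [1] * 501)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, all three rates should be close to 0.10
print("overall positive rate:", y.mean())
print("train positive rate:  ", y_train.mean())
print("test positive rate:   ", y_test.mean())
```

Without `stratify`, the test-set share of positives can drift from the overall rate by chance, which makes per-class metrics harder to compare across runs.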


In [9]:
numeric_features = X.columns.tolist()
numeric_features

['distance_km',
 'fuel_cost',
 'driver_pay',
 'toll_cost',
 'load_value',
 'profit_margin',
 'violation_count',
 'speeding_incidents',
 'gps_start_lat',
 'gps_start_lon',
 'gps_end_lat',
 'gps_end_lon',
 'pickup_hour',
 'pickup_dayofweek']

In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Preprocessing pipeline for numeric features
preprocess_numeric = Pipeline(
    steps=[
        ("scaler", StandardScaler())
    ]
)

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   trip_id               5000 non-null   int64  
 1   driver_id             5000 non-null   int64  
 2   driver_name           5000 non-null   object 
 3   origin                5000 non-null   object 
 4   destination           5000 non-null   object 
 5   distance_km           5000 non-null   float64
 6   pickup_time           5000 non-null   object 
 7   delivery_time         5000 non-null   object 
 8   actual_delivery_time  5000 non-null   object 
 9   status                5000 non-null   object 
 10  fuel_cost             5000 non-null   float64
 11  driver_pay            5000 non-null   float64
 12  maintenance_cost      5000 non-null   float64
 13  toll_cost             5000 non-null   float64
 14  load_value            5000 non-null   float64
 15  profit_margin        

In [3]:
df.describe()

|  | trip_id | driver_id | distance_km | fuel_cost | driver_pay | maintenance_cost | toll_cost | load_value | profit_margin | vehicle_id | violation_count | speeding_incidents | gps_start_lat | gps_start_lon | gps_end_lat | gps_end_lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 125.3916 | 1356.902714 | 554.283932 | 611.457656 | 176.087792 | 55.517052 | 2721.313518 | 1323.967086 | 225.4726 | 0.406 | 0.406 | 33.551597 | -94.935752 | 33.415294 | -94.714018 |
| std | 1443.520003 | 14.513786 | 666.350404 | 300.998545 | 327.766767 | 71.855372 | 26.546452 | 1410.386234 | 902.287093 | 14.509781 | 0.715309 | 0.715309 | 4.937381 | 14.517723 | 4.945796 | 14.36047 |
| min | 1.0 | 101.0 | 200.01 | 52.58 | 62.33 | 50.02 | 10.0 | 320.88 | -296.69 | 201.0 | 0.0 | 0.0 | 25.003645 | -119.993053 | 25.001161 | -119.988266 |
| 25% | 1250.75 | 113.0 | 778.5575 | 302.8675 | 336.0925 | 115.1375 | 32.0975 | 1521.575 | 583.0325 | 213.0 | 0.0 | 0.0 | 29.28512 | -107.549334 | 29.092212 | -107.190732 |
| 50% | 2500.5 | 125.0 | 1376.11 | 535.115 | 594.915 | 176.415 | 55.555 | 2686.6 | 1208.63 | 225.0 | 0.0 | 0.0 | 33.536276 | -94.505503 | 33.413396 | -94.278565 |
| 75% | 3750.25 | 138.0 | 1935.3575 | 761.0 | 839.625 | 238.685 | 79.1875 | 3815.9525 | 1956.075 | 238.0 | 1.0 | 1.0 | 37.825708 | -82.137262 | 37.779266 | -82.254098 |
| max | 5000.0 | 150.0 | 2499.91 | 1552.81 | 1489.21 | 299.97 | 100.0 | 6208.53 | 4160.26 | 250.0 | 3.0 | 3.0 | 41.999318 | -70.00817 | 41.999345 | -70.004045 |


In [12]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Classification pipeline
clf_logreg = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

# Train
clf_logreg.fit(X_train_c, y_train_c)

# Predict
y_pred_logreg = clf_logreg.predict(X_test_c)

# Evaluate
print("Logistic Regression Results:\n")
print(classification_report(y_test_c, y_pred_logreg))
print("Confusion Matrix:")
print(confusion_matrix(y_test_c, y_pred_logreg))

Logistic Regression Results:

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       900
           1       0.00      0.00      0.00       100

    accuracy                           0.90      1000
   macro avg       0.45      0.50      0.47      1000
weighted avg       0.81      0.90      0.85      1000

Confusion Matrix:
[[900   0]
 [100   0]]
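
A confusion matrix with no predicted positives means no test trip reached the default 0.5 probability threshold. One possible explanation is that the features carry little signal for the label. A threshold-free metric such as ROC AUC can help check this. The sketch below shows the pattern on synthetic noise features with about 10% positives; the data is an assumption for illustration and makes no claim about the fleet dataset itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 14 pure-noise features, ~10% positives unrelated to the features
X = rng.normal(size=(5000, 14))
y = (rng.random(5000) < 0.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# predict() applies a 0.5 threshold; the probabilities show how close any row gets to it
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print("predicted positives:", int(model.predict(X_test).sum()))
print("max predicted probability:", round(float(proba.max()), 3))
print("ROC AUC:", round(auc, 3))
```

An AUC near 0.5 would suggest the model ranks trips no better than chance, in which case reweighting classes or lowering the threshold trades one error type for another without adding real predictive power.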




In [13]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest classification pipeline
clf_rf = Pipeline(
    steps=[
        # Tree models don't strictly need scaling, so we skip StandardScaler here
        ("model", RandomForestClassifier(
            n_estimators=100,
            random_state=42
        ))
    ]
)

# Train the model
clf_rf.fit(X_train_c, y_train_c)

# Predict on test set
y_pred_rf = clf_rf.predict(X_test_c)

# Evaluate
print("=== Random Forest Results ===\n")
print(classification_report(y_test_c, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test_c, y_pred_rf))

=== Random Forest Results ===

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       900
           1       0.00      0.00      0.00       100

    accuracy                           0.90      1000
   macro avg       0.45      0.50      0.47      1000
weighted avg       0.81      0.90      0.85      1000

Confusion Matrix:
[[900   0]
 [100   0]]




In [14]:
clf_logreg_bal = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, class_weight="balanced"))
    ]
)

clf_logreg_bal.fit(X_train_c, y_train_c)

y_pred_logreg_bal = clf_logreg_bal.predict(X_test_c)

print("=== Balanced Logistic Regression Results ===\n")
print(classification_report(y_test_c, y_pred_logreg_bal))
print(confusion_matrix(y_test_c, y_pred_logreg_bal))

=== Balanced Logistic Regression Results ===

              precision    recall  f1-score   support

           0       0.89      0.50      0.64       900
           1       0.09      0.46      0.16       100

    accuracy                           0.50      1000
   macro avg       0.49      0.48      0.40      1000
weighted avg       0.81      0.50      0.60      1000

[[454 446]
 [ 54  46]]
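
The class-1 figures in the report can be recomputed directly from this confusion matrix, which makes the trade-off explicit: class weighting recovers 46 of the 100 High Risk trips, at the cost of 446 false alarms. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[454, 446],
               [54, 46]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)           # 46 / 492
recall = tp / (tp + fn)              # 46 / 100
f1 = 2 * tp / (2 * tp + fp + fn)     # 92 / 592
accuracy = (tp + tn) / cm.sum()      # 500 / 1000

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.09 recall=0.46 f1=0.16 accuracy=0.50
```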


In [15]:
clf_rf_bal = Pipeline(
    steps=[
        ("model", RandomForestClassifier(
            n_estimators=200,
            random_state=42,
            class_weight="balanced"
        ))
    ]
)

clf_rf_bal.fit(X_train_c, y_train_c)

y_pred_rf_bal = clf_rf_bal.predict(X_test_c)

print("=== Balanced Random Forest Results ===\n")
print(classification_report(y_test_c, y_pred_rf_bal))
print(confusion_matrix(y_test_c, y_pred_rf_bal))

=== Balanced Random Forest Results ===

              precision    recall  f1-score   support

           0       0.90      1.00      0.95       900
           1       0.00      0.00      0.00       100

    accuracy                           0.90      1000
   macro avg       0.45      0.50      0.47      1000
weighted avg       0.81      0.90      0.85      1000

[[900   0]
 [100   0]]




