# ETA Error Time Series — Aleph RF–inspired

**Project context**
- **Dataset:** `data_etas.csv` exported from the Aleph Random Forest ETA notebook. Each row is a stop-to-stop segment (origin → next stop on same route).
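- **Timestamps:** `real_departure_origin`, `real_arrival_dest`, `eta_arrival_dest`. Real and estimated trip times are both measured from `real_departure_origin` and converted to minutes; rows are kept only when both values are non-null and > 0. ETA error = real_trip_time_minutes − estimated_trip_time_minutes, so a positive error means the trip took longer than estimated.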
- **Segment ID:** `trip_id` = origin–destination pair (repeats across days/routes). We pick one segment and turn irregular observations into an evenly spaced series for forecasting.
- **Regularization:** signed-log transform of the ETA error, then 6-hour buckets aggregated with the median; missing buckets are filled with a weekday × 6h-slot median baseline, then a global median fallback. Final format: `unique_id`, `ds`, `y`, where `y` is the signed-log ETA error. `ds` is kept timezone-naive for Prophet/TimeCopilot.

## 1. Load
- Read `data_etas.csv` (stop-to-stop segments from Aleph RF ETA notebook).
- Raise a `FileNotFoundError` with a clear message if the file is missing.

In [15]:
import os
import pandas as pd
import numpy as np
import nest_asyncio
from timecopilot import TimeCopilot

In [16]:
def import_data(file_name):
    """Load segment-level ETA data from CSV. Raises clear error if file not found."""
    if not os.path.isfile(file_name):
        raise FileNotFoundError(f"Data file not found: {file_name}")
    return pd.read_csv(file_name)

data_trips = import_data("data_etas.csv")

## 2. Parse / Compute
- Parse timestamp columns; compute real and estimated trip times; convert to minutes.
- Same logic as the Aleph RF ETA notebook.

In [17]:
time_cols = [
    "real_departure_origin",
    "real_arrival_dest",
    "eta_arrival_dest",
]
for col in time_cols:
    data_trips[col] = pd.to_datetime(data_trips[col], errors="coerce")

data_trips["real_trip_time"] = (
    data_trips["real_arrival_dest"] - data_trips["real_departure_origin"]
)
data_trips["estimated_trip_time"] = (
    data_trips["eta_arrival_dest"] - data_trips["real_departure_origin"]
)

data_trips["real_trip_time_minutes"] = (
    data_trips["real_trip_time"].dt.total_seconds() / 60
)
data_trips["estimated_trip_time_minutes"] = (
    data_trips["estimated_trip_time"].dt.total_seconds() / 60
)


## 3. Filter
- Keep rows where both real and estimated trip time (minutes) are non-null and > 0.
- Compute ETA error in minutes: real_trip_time_minutes − estimated_trip_time_minutes.

In [18]:
valid_mask = (
    data_trips["real_trip_time_minutes"].notna()
    & (data_trips["real_trip_time_minutes"] > 0)
    & data_trips["estimated_trip_time_minutes"].notna()
    & (data_trips["estimated_trip_time_minutes"] > 0)
)
data_valid_trips = data_trips.loc[valid_mask].copy()

data_valid_trips["eta_error_minutes"] = (
    data_valid_trips["real_trip_time_minutes"]
    - data_valid_trips["estimated_trip_time_minutes"]
)

print("Total rows:", len(data_trips))
print("Valid segments:", len(data_valid_trips))
data_valid_trips.head()

Total rows: 586164
Valid segments: 345664


Unnamed: 0,external_id,lat_origin,lon_origin,location_id_origin,real_departure_origin,external_schedule_departure_origin,real_seq,lat_dest,lon_dest,location_id_dest,real_arrival_dest,external_schedule_arrival_dest,eta_arrival_dest,real_trip_time,estimated_trip_time,real_trip_time_minutes,estimated_trip_time_minutes,trip_id,eta_error_minutes
16,ae9503ad-81ad-417b-a49b-6cc67ec3d7d5,25.70715,-100.27937,9820891a-be20-4ebd-a402-655ec6c6844c,2025-07-22 02:41:39+00:00,2025-07-21 18:48:27+00:00,1,19.46121,-99.15228,2463c639-c724-4caf-b922-13fa0b270442,2025-07-23 00:42:47+00:00,2025-07-22 14:45:00+00:00,2025-07-22 18:07:36+00:00,0 days 22:01:08,0 days 15:25:57,1321.133333,925.95,9820891a-be20-4ebd-a402-655ec6c6844c-2463c639-...,395.183333
17,ae9503ad-81ad-417b-a49b-6cc67ec3d7d5,19.46121,-99.15228,2463c639-c724-4caf-b922-13fa0b270442,2025-07-23 00:43:00+00:00,2025-07-22 16:45:00+00:00,2,25.70715,-100.27937,9820891a-be20-4ebd-a402-655ec6c6844c,2025-07-26 05:50:31+00:00,2025-07-23 11:49:57+00:00,2025-07-26 05:50:53+00:00,3 days 05:07:31,3 days 05:07:53,4627.516667,4627.883333,2463c639-c724-4caf-b922-13fa0b270442-9820891a-...,-0.366667
18,fd8798f9-fa99-4e33-b726-633e8e39dd23,19.658553,-99.170608,cf2dbff4-1216-4843-9ccb-a87427a1e98a,2025-01-04 01:29:07+00:00,2025-01-04 04:00:00+00:00,1,25.624188,-103.510615,67c0d710-d3d0-49c5-af2f-4e191d16186f,2025-01-05 18:55:46+00:00,2025-01-05 01:26:12+00:00,2025-01-05 19:03:53+00:00,1 days 17:26:39,1 days 17:34:46,2486.65,2494.766667,cf2dbff4-1216-4843-9ccb-a87427a1e98a-67c0d710-...,-8.116667
19,fd8798f9-fa99-4e33-b726-633e8e39dd23,25.624188,-103.510615,67c0d710-d3d0-49c5-af2f-4e191d16186f,2025-01-06 15:11:19+00:00,2025-01-05 02:56:12+00:00,2,28.719916,-106.139718,1cc50ee6-df0e-4484-8614-b1cd341f1db8,2025-01-06 22:42:49+00:00,2025-01-05 10:54:25+00:00,2025-01-06 22:57:05+00:00,0 days 07:31:30,0 days 07:45:46,451.5,465.766667,67c0d710-d3d0-49c5-af2f-4e191d16186f-1cc50ee6-...,-14.266667
20,fd8798f9-fa99-4e33-b726-633e8e39dd23,28.719916,-106.139718,1cc50ee6-df0e-4484-8614-b1cd341f1db8,2025-01-08 15:28:40+00:00,2025-01-05 12:24:25+00:00,3,19.658553,-99.170608,cf2dbff4-1216-4843-9ccb-a87427a1e98a,2025-01-10 01:38:28+00:00,2025-01-06 21:42:08+00:00,2025-01-10 01:50:22+00:00,1 days 10:09:48,1 days 10:21:42,2049.8,2061.7,1cc50ee6-df0e-4484-8614-b1cd341f1db8-cf2dbff4-...,-11.9
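
As a quick hand check of the definitions above, the sketch below recomputes the fourth preview row (label 19: departure `2025-01-06 15:11:19`, arrival `22:42:49`, ETA `22:57:05`, all UTC) with the standard library only. The helper name `minutes_between` is illustrative and not part of the notebook.

```python
from datetime import datetime, timezone

def minutes_between(start, end):
    """Elapsed time from start to end, in minutes."""
    return (end - start).total_seconds() / 60

# Timestamps copied from the fourth row of the preview (all UTC).
departure = datetime(2025, 1, 6, 15, 11, 19, tzinfo=timezone.utc)
arrival = datetime(2025, 1, 6, 22, 42, 49, tzinfo=timezone.utc)
eta = datetime(2025, 1, 6, 22, 57, 5, tzinfo=timezone.utc)

real_minutes = minutes_between(departure, arrival)
estimated_minutes = minutes_between(departure, eta)
eta_error_minutes = real_minutes - estimated_minutes

print(real_minutes, round(estimated_minutes, 6), round(eta_error_minutes, 6))
```

The result matches the table (451.5, 465.766667, -14.266667). The negative sign means the vehicle arrived about 14 minutes before the ETA.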


In [19]:
data_valid_trips.to_csv("data_valid_trips.csv", index=False)  # index=False avoids an extra unnamed column on reload

## 4. Segment selection
- Pick one segment (`trip_id`) to turn into a regular time series for forecasting.
- Set `TARGET_TRIP_ID` below to the origin–destination pair to analyze.

In [None]:
TARGET_TRIP_ID = (
    "71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-7320-40f0-bb72-3ead8356e87c"
)
segment_df = data_valid_trips[
    data_valid_trips["trip_id"] == TARGET_TRIP_ID
].copy()
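
One way to choose `TARGET_TRIP_ID` is to rank segments by the number of valid observations, since sparse segments leave more 6-hour buckets to impute. A minimal sketch on made-up data follows; the `trip_id` values are placeholders, not real segments, and the same `value_counts` call could be applied to `data_valid_trips`.

```python
import pandas as pd

# Toy stand-in for data_valid_trips; the ids are hypothetical.
toy = pd.DataFrame({
    "trip_id": ["A-B", "A-B", "A-B", "B-C", "B-C", "C-A"],
    "eta_error_minutes": [3.0, -1.5, 12.0, 0.4, -7.0, 2.2],
})

# Observations per segment, most frequent first.
counts = toy["trip_id"].value_counts()
densest_trip_id = counts.index[0]

print(counts)
print("Densest segment:", densest_trip_id)
```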

## 5. Regularize to 6H buckets & imputation
- Build an evenly spaced 6H series from irregular segment observations: signed-log ETA error, then resample with median (or mean).
- Fill missing buckets: weekday × 6h-slot median baseline; fallback to global median. Output: `unique_id`, `ds`, `y`.
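
The signed-log transform used in this step is `sign(e) * log(1 + |e|)`: it keeps the sign of the error, maps 0 to 0, and compresses large delays so that a few extreme trips do not dominate a bucket. A small standard-library sketch (the function name `signed_log` is illustrative; the notebook uses the vectorized NumPy form):

```python
import math

def signed_log(error_minutes):
    """Scalar form of sign(e) * log(1 + |e|)."""
    return math.copysign(math.log1p(abs(error_minutes)), error_minutes)

# A small error, a moderate negative error, and the large error
# from the first preview row (395.183333 minutes).
for e in (0.2, -14.266667, 395.183333):
    print(f"{e:>12.6f} min -> {signed_log(e):+.6f}")
```

An error of 0.2 minutes maps to about 0.182322, while an error of more than six hours maps to a value just under 6.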

In [6]:
def build_6h_even_series(data_valid_trips, trip_id, agg="median"):
    """
    Build an evenly spaced 6H time series for one segment.

    Inputs:
      data_valid_trips: DataFrame with columns trip_id, real_departure_origin, eta_error_minutes.
      trip_id: Segment identifier (origin-destination pair).
      agg: Aggregation for 6H buckets — "median" (default) or "mean".

    Output:
      (final_df, impute_info): final_df has columns unique_id, ds, y; impute_info is
      {"n_imputed": int, "n_total": int} for sanity checks.

    Logic: Filter segment → signed-log(eta_error_minutes) → resample 6H with agg →
    fill missing buckets with weekday×slot6h median baseline, then global median fallback.
    """
    required = ["trip_id", "real_departure_origin", "eta_error_minutes"]
    for col in required:
        if col not in data_valid_trips.columns:
            raise ValueError(f"Required column missing: {col}")

    df = data_valid_trips[data_valid_trips["trip_id"] == trip_id].copy()
    if df.empty:
        raise ValueError(f"No rows found for trip_id: {trip_id!r}")

    df["real_departure_origin"] = pd.to_datetime(
        df["real_departure_origin"], utc=True, errors="coerce"
    )
    df = (
        df.dropna(subset=["real_departure_origin"])
        .sort_values("real_departure_origin")
        .set_index("real_departure_origin")
    )

    e = df["eta_error_minutes"].astype(float)
    df["signed_log_error"] = np.sign(e) * np.log1p(np.abs(e))

    if agg not in ("median", "mean"):
        raise ValueError(f"agg must be 'median' or 'mean', got {agg!r}")
    # Lowercase "6h" avoids the deprecation warning that pandas >= 2.2 emits for "6H".
    y = df["signed_log_error"].resample("6h").agg(agg)

    # Flag empty buckets before filling so the imputed share can be reported.
    missing = y.isna()
    tmp = pd.DataFrame({"y": y})
    tmp["weekday"] = tmp.index.dayofweek
    tmp["slot6h"] = tmp.index.hour // 6

    # Baseline per (weekday, 6h slot): median of the observed buckets (NaNs are skipped).
    # transform keeps the original index and also works when no bucket is missing.
    slot_baseline = tmp.groupby(["weekday", "slot6h"])["y"].transform("median")
    y_filled = y.fillna(slot_baseline)
    # Fallback for weekday/slot combinations with no observed bucket at all.
    y_filled = y_filled.fillna(y.median())

    final_df = pd.DataFrame({
        "unique_id": trip_id,
        "ds": y_filled.index,
        "y": y_filled.values,
    })
    impute_info = {"n_imputed": int(missing.sum()), "n_total": len(y)}
    return final_df, impute_info


In [7]:
segment_df_final, impute_info = build_6h_even_series(
    data_valid_trips, TARGET_TRIP_ID
)
segment_df_final.head()


Unnamed: 0,unique_id,ds,y
0,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 12:00:00+00:00,1.936341
1,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 18:00:00+00:00,1.813738
2,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 00:00:00+00:00,0.182322
3,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 06:00:00+00:00,-1.496642
4,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 12:00:00+00:00,-1.15268


## 6. Sanity check
- Shape, date range, and share of buckets that were imputed; head and tail of the series.

In [8]:
n_imp, n_tot = impute_info["n_imputed"], impute_info["n_total"]
pct_imputed = (n_imp / n_tot * 100) if n_tot else 0
print("Shape:", segment_df_final.shape)
print("Date range:", segment_df_final["ds"].min(), "->", segment_df_final["ds"].max())
print(f"Buckets imputed: {n_imp} / {n_tot} ({pct_imputed:.1f}%)")
print("\nHead:")
display(segment_df_final.head())
print("Tail:")
display(segment_df_final.tail())

Shape: (1205, 3)
Date range: 2025-01-02 12:00:00+00:00 -> 2025-10-30 12:00:00+00:00
Buckets imputed: 652 / 1205 (54.1%)

Head:


Unnamed: 0,unique_id,ds,y
0,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 12:00:00+00:00,1.936341
1,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 18:00:00+00:00,1.813738
2,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 00:00:00+00:00,0.182322
3,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 06:00:00+00:00,-1.496642
4,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 12:00:00+00:00,-1.15268


Tail:


Unnamed: 0,unique_id,ds,y
1200,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-29 12:00:00+00:00,1.360738
1201,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-29 18:00:00+00:00,1.462483
1202,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-30 00:00:00+00:00,-1.308333
1203,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-30 06:00:00+00:00,-1.576786
1204,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-30 12:00:00+00:00,1.193922
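
The shape and imputed share printed above can be cross-checked from the date range alone: an evenly spaced 6-hour series has one bucket per step between its first and last timestamp, inclusive. A standard-library sketch using the printed values:

```python
from datetime import datetime, timedelta

first_bucket = datetime(2025, 1, 2, 12, 0)
last_bucket = datetime(2025, 10, 30, 12, 0)
step = timedelta(hours=6)

# Steps between the endpoints, plus one for the first bucket itself.
expected_buckets = (last_bucket - first_bucket) // step + 1
print(expected_buckets)

n_imputed, n_total = 652, 1205
observed_buckets = n_total - n_imputed
print(observed_buckets, f"{n_imputed / n_total:.1%}")
```

The expected count is 1205, matching the reported shape. Only 553 buckets contain real observations, so more than half of the series is baseline fill, which is worth keeping in mind when reading the forecast accuracy.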


## 7. Final dataset (timezone-naive)
Prophet (used by TimeCopilot) does not support timezone-aware `ds`, so strip the timezone before forecasting. The timestamps were parsed as UTC, so the naive values remain UTC wall-clock times.

In [9]:
segment_df_final = segment_df_final.copy()
segment_df_final["ds"] = pd.to_datetime(segment_df_final["ds"]).dt.tz_localize(None)


In [10]:
segment_df_final.head()

Unnamed: 0,unique_id,ds,y
0,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 12:00:00,1.936341
1,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-02 18:00:00,1.813738
2,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 00:00:00,0.182322
3,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 06:00:00,-1.496642
4,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-01-03 12:00:00,-1.15268
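
The sketch below illustrates what `tz_localize(None)` does to a timezone-aware series: it drops the offset and keeps the wall-clock time of whatever zone the series is currently in. The UTC-6 offset is only an example, chosen to show how the naive values would shift if the series had been converted to a local zone first.

```python
from datetime import timedelta, timezone

import pandas as pd

utc_ds = pd.Series(pd.to_datetime(["2025-01-02 12:00:00+00:00"], utc=True))

# Dropping the timezone from a UTC series keeps the UTC wall-clock time.
naive_from_utc = utc_ds.dt.tz_localize(None)

# After converting to a local zone (UTC-6 here, as an example offset),
# the naive values come out six hours earlier.
utc_minus_6 = timezone(timedelta(hours=-6))
naive_from_local = utc_ds.dt.tz_convert(utc_minus_6).dt.tz_localize(None)

print(naive_from_utc.iloc[0], naive_from_local.iloc[0])
```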


## 8. TimeCopilot forecast
- Use TimeCopilot to select a model and forecast the next `h=20` buckets at 6-hour frequency (5 days ahead).
- Set `OPENAI_API_KEY` via environment variable; do not hardcode secrets in the notebook.

In [11]:
nest_asyncio.apply()

In [None]:
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before starting Jupyter.")

In [13]:
tc = TimeCopilot(
    llm="openai:gpt-4o",
    retries=3
)

In [14]:
result = tc.forecast(
    df=segment_df_final,
    freq="6H",
    h=20,  # 20 buckets of 6 hours = 5 days ahead
)

1it [00:08,  8.33s/it]
1it [00:00,  3.04it/s]
1it [00:05,  5.60s/it]
1it [00:00, 108.39it/s]
301it [01:13,  4.10it/s]


In [None]:
print(result)


AgentRunResult(output=ForecastAgentOutput(tsfeatures_analysis='The time series analysis indicates moderate seasonality and autocorrelation. With an entropy of 0.694, the data is reasonably complex but predictable, while the KPSS statistic suggests stationarity. The ACF values imply dependencies at various lags, with a notable seasonal component.', selected_model='AutoETS', model_details='AutoETS automatically selects the best Exponential Smoothing parameters to fit time series data. It models various types of trends and seasonal patterns, making it versatile for data exhibiting moderate seasonality and trend, as seen in our features.', model_comparison='The cross-validation results reveal AutoETS achieved the lowest MASE (0.586), outperforming other models like AutoARIMA (0.834), HistoricAverage (0.963), and SeasonalNaive (1.101).', is_better_than_seasonal_naive=True, reason_for_selection='AutoETS was chosen for its superior performance in cross-validation, capturing seasonality and tr
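
The model comparison above is expressed in MASE. Under the usual definition, MASE divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast, so values below 1 indicate an improvement over that naive baseline. The sketch below is a plain-Python illustration of that definition on toy numbers. The scaling and seasonal period TimeCopilot uses internally are not shown in the output, so the `season_length` parameter here is an assumption.

```python
def mase(actual, forecast, history, season_length=1):
    """Mean absolute scaled error against an in-sample (seasonal) naive forecast."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(history) <= season_length:
        raise ValueError("history must be longer than season_length")
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # Naive errors: each training value compared with the one a season earlier.
    naive_errors = [
        abs(history[i] - history[i - season_length])
        for i in range(season_length, len(history))
    ]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return forecast_mae / naive_mae

history = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]  # toy training values
actual = [4.0, 6.0]
forecast = [4.5, 5.0]
print(mase(actual, forecast, history))
```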

In [26]:
result.fcst_df


Unnamed: 0,unique_id,ds,AutoETS
0,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-30 18:00:00,1.310728
1,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-31 00:00:00,0.128078
2,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-31 06:00:00,-1.674279
3,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-31 12:00:00,1.078863
4,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-10-31 18:00:00,1.310728
5,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-11-01 00:00:00,0.128078
6,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-11-01 06:00:00,-1.674279
7,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-11-01 12:00:00,1.078863
8,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-11-01 18:00:00,1.310728
9,71a24abd-0fb8-4110-be27-a93d8078d2cf-622564d0-...,2025-11-02 00:00:00,0.128078
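
The forecast values are on the signed-log scale, not in minutes. A minimal sketch of the inverse transform, `sign(y) * (exp(|y|) - 1)`, applied to the first four values of the table (the function name `signed_log_inverse` is illustrative):

```python
import math

def signed_log_inverse(y):
    """Invert y = sign(e) * log(1 + |e|), returning e in minutes."""
    return math.copysign(math.expm1(abs(y)), y)

# First four AutoETS values from the forecast table (one per 6-hour slot).
forecast_log = [1.310728, 0.128078, -1.674279, 1.078863]
forecast_minutes = [signed_log_inverse(y) for y in forecast_log]

for y, minutes in zip(forecast_log, forecast_minutes):
    print(f"{y:+.6f} -> {minutes:+.2f} min")
```

These four values come out at roughly +2.7, +0.1, -4.3, and +1.9 minutes. In the rows shown, the forecast repeats every four buckets, which corresponds to a daily cycle of four 6-hour slots.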
