<a href="https://colab.research.google.com/github/Jana-Alrzoog/2025_GP_28/blob/main/masar-sim/notebooks/masar_occupancy_week.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚌 **Masar Occupancy — Week Generator**

This notebook generates **minute-level passenger occupancy** for a **7-day window** across selected stations/lines.  
It starts from the **base curve** and applies **context modifiers** (station capacity, weekend, weather, events; holidays per config),  
then exports tidy CSVs for dashboards and Firestore/pipeline publishing.

---

### 🎯 **Purpose**
- Produce **7 consecutive daily time series** at **1-minute resolution**.
- Fill any **missing days** in the target week with a consistent **template day**.
- Output clean, validated CSVs for QA, visualization, and aggregation.

---

### 🧩 **Inputs**
| File / Seed | Description |
|---|---|
| `base_week.csv`, or `base_day.csv` replicated ×7 | Base minute-level demand (from **masar_base_demand.ipynb**) |
| Seeds | `stations`, `events`, `weather` *(holidays controlled via config)* |
| Config | `00_config.yaml` *(paths, multipliers, timezone, resolution, headways)* |
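
The later cells read only a handful of keys from `00_config.yaml`: `multipliers.weekend`, `multipliers.weather`, `multipliers.events`, `headway.peak_pattern`, `headway.offpeak_pattern`, and `peaks[].hour`. The sketch below shows what the parsed config might look like as a Python dict, and how a headway in seconds can be derived from it. Every value is an illustrative assumption rather than the project's real setting (the two headway patterns simply mirror the fallback defaults used in the export cell), and `headway_seconds` is a hypothetical helper name.

```python
from statistics import median

# Hypothetical parsed form of 00_config.yaml; every value here is an assumption.
example_config = {
    "multipliers": {
        "weekend": 0.8,
        "weather": {"Sunny": 1.0, "Rainy": 0.9, "Dusty": 0.85},
        "events": {"SaudiNationalDay": 1.3, "Other": 1.1},
    },
    "headway": {"peak_pattern": [7, 7, 6, 8], "offpeak_pattern": [11, 10, 12, 11]},
    "peaks": [{"hour": 8}, {"hour": 17}],
}

def headway_seconds(hour: int, cfg: dict) -> int:
    """Median headway (minutes) of the matching pattern, converted to seconds."""
    peak_hours = {int(p["hour"]) for p in cfg.get("peaks", []) if "hour" in p}
    hw = cfg.get("headway", {})
    key = "peak_pattern" if hour in peak_hours else "offpeak_pattern"
    return int(median(hw[key]) * 60)

print(headway_seconds(8, example_config), headway_seconds(13, example_config))
```

Using the median of each pattern keeps a single outlier headway from shifting the per-hour value.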

---

### ⚙️ **Workflow**
1️⃣ **Load config & seeds** (paths, stations, events, weather).  
2️⃣ **Select week window** (e.g., `2025-09-21 → 2025-09-27`) and ensure TZ & minute resolution.  
3️⃣ **Build minute grid per day & station**; **expand missing days** using the densest template day.  
4️⃣ **Compute modifiers** per minute:
   - `station_scale` (capacity vs. network mean)  
   - `weekend_mult` (Fri/Sat)  
   - `weather_mult` (Sunny/Dusty/Rainy…)  
   - `event_mult` (supports global and station-scoped events)  
   - `holiday_mult` *(enabled/disabled via config)*
5️⃣ **Finalize demand** → multiply base demand by the combined modifier, normalize by the week's global maximum, map to `station_total` via station capacity, derive `crowd_level`.  
6️⃣ **Headways** from config peak/off-peak patterns.  
7️⃣ **QA checks** (non-negative totals, station coverage, event flags).  
8️⃣ **Export** a consolidated weekly CSV (`cf_week_2025-09-21_to_27.csv`).
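
Step 4 combines independent multipliers by multiplying them together. A minimal sketch of that stacking follows; `stack_modifiers` is a hypothetical helper and the numbers are made up for illustration.

```python
from math import prod

def stack_modifiers(base_demand: float, **multipliers: float) -> float:
    """Multiply a base demand value by every supplied multiplier ("stack" mode)."""
    return base_demand * prod(multipliers.values())

# Illustrative values only; the real multipliers come from 00_config.yaml and the seeds.
demand = stack_modifiers(
    0.50, weekend_mult=0.8, weather_mult=0.9, event_mult=1.0, holiday_mult=1.0
)
print(round(demand, 3))  # 0.36
```

A multiplier of 1.0 leaves demand unchanged, so a disabled modifier can simply be set to 1.0.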

---

In [2]:
%cd /content
!git clone https://github.com/Jana-Alrzoog/2025_GP_28.git
%cd /content/2025_GP_28/masar-sim
!ls


/content
fatal: destination path '2025_GP_28' already exists and is not an empty directory.
/content/2025_GP_28/masar-sim
data  lib  notebooks  sims


In [3]:
# =========================================================
# masar_occupancy_week.ipynb
# Generate a full week with changing scenarios → cf_week_2025-09-21_to_27.csv
# =========================================================

import os, json, csv, yaml
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from dateutil.parser import parse

CANDIDATES = [
    "/content/2025_GP_28_latest/masar-sim",
    "/content/2025_GP_28/masar-sim",
    "/content/masar-sim",
]
ROOT = next((p for p in CANDIDATES if os.path.exists(p)), None)
assert ROOT, "masar-sim folder not found. Check the clone and the path."
SEED = f"{ROOT}/data/seeds"
GEN  = f"{ROOT}/data/generated"
CONF = f"{ROOT}/sims/00_config.yaml"

print("ROOT =", ROOT)
print("GEN  =", GEN)
print("CONF =", CONF)

with open(CONF) as f:
    config = yaml.safe_load(f)

with open(f"{SEED}/stations.json") as f:
    stations = json.load(f)
with open(f"{SEED}/weather_patterns.json") as f:
    weather_map = json.load(f)
with open(f"{SEED}/calendar_events.csv") as f:
    events_seed = list(csv.DictReader(f))


base_path = f"{GEN}/base_day.csv"
assert os.path.exists(base_path), "base_day.csv not found; run masar_base_demand.ipynb first."
base_day = pd.read_csv(base_path, parse_dates=["timestamp"])
print(f"base_day rows={len(base_day):,}, stations={base_day['station_id'].nunique()}, day={base_day['timestamp'].dt.date.iloc[0]}")


ROOT = /content/2025_GP_28/masar-sim
GEN  = /content/2025_GP_28/masar-sim/data/generated
CONF = /content/2025_GP_28/masar-sim/sims/00_config.yaml
base_day rows=6,486, stations=6, day=2025-09-24


In [4]:
import sys
sys.path.append(f"{ROOT}/lib")
from modifiers import compute_demand_modifier


In [5]:
week_start = parse("2025-09-21")

scenario_cycle = ["normal", "rainy", "event_kafd", "normal", "holiday", "dusty", "normal"]

days = [
    {"date": (week_start + timedelta(days=i)).date(), "tag": scenario_cycle[i % len(scenario_cycle)]}
    for i in range(7)
]
days


[{'date': datetime.date(2025, 9, 21), 'tag': 'normal'},
 {'date': datetime.date(2025, 9, 22), 'tag': 'rainy'},
 {'date': datetime.date(2025, 9, 23), 'tag': 'event_kafd'},
 {'date': datetime.date(2025, 9, 24), 'tag': 'normal'},
 {'date': datetime.date(2025, 9, 25), 'tag': 'holiday'},
 {'date': datetime.date(2025, 9, 26), 'tag': 'dusty'},
 {'date': datetime.date(2025, 9, 27), 'tag': 'normal'}]
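
The export cell further down looks up a `weather_code` on each row, while the frames built in this notebook do not appear to carry that column. One way to connect the scenario tags to it is a per-date lookup, sketched below. `TAG_TO_WEATHER` and the `"Sunny"` default are assumptions, and the real codes are whatever keys the weather seeds and config multipliers use.

```python
from datetime import date, timedelta

# Hypothetical mapping from scenario tag to weather code (an assumption).
TAG_TO_WEATHER = {"rainy": "Rainy", "dusty": "Dusty"}

week_start = date(2025, 9, 21)
scenario_cycle = ["normal", "rainy", "event_kafd", "normal", "holiday", "dusty", "normal"]

# ISO date string -> weather code; tags without a weather meaning fall back to "Sunny".
weather_by_date = {
    (week_start + timedelta(days=i)).isoformat(): TAG_TO_WEATHER.get(tag, "Sunny")
    for i, tag in enumerate(scenario_cycle)
}
print(weather_by_date["2025-09-22"], weather_by_date["2025-09-26"])
```

On the weekly frame, `df["weather_code"] = df["date"].map(weather_by_date)` would then attach the code to every row, since `date` is stored as a `YYYY-MM-DD` string.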

In [8]:
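import os

# Rename base_day.csv → day_base.csv; guarded so the cell can be re-run safely
gen_dir = "/content/2025_GP_28/masar-sim/data/generated"
src = f"{gen_dir}/base_day.csv"
dst = f"{gen_dir}/day_base.csv"
if os.path.exists(src):
    os.rename(src, dst)
    print("Renamed ✓ base_day.csv → day_base.csv")
else:
    print("Nothing to rename: base_day.csv not found (already renamed?)")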


Renamed ✓ base_day.csv → day_base.csv


In [10]:
req_cols = ["station_id", "minute_of_day", "base_demand"]
# Auto-create minute_of_day if missing (0–1439)
if "minute_of_day" not in base_day.columns:
    if "hour" in base_day.columns and "minute" in base_day.columns:
        base_day["minute_of_day"] = base_day["hour"] * 60 + base_day["minute"]
    elif "timestamp" in base_day.columns:
        ts = pd.to_datetime(base_day["timestamp"], errors="coerce")
        base_day["minute_of_day"] = ts.dt.hour * 60 + ts.dt.minute
    else:
        raise KeyError("minute_of_day not found and no way to reconstruct it (need hour/minute or timestamp).")


In [14]:
# ============================================================
# Fix base_day columns and build minute_of_day if missing
# Run this ONCE after loading `base_day` (or before bootstrap).
# ============================================================

import pandas as pd

if 'base_day' not in globals():
    # Try load if not in memory
    import os
    ROOT = "/content/2025_GP_28/masar-sim"
    GEN  = f"{ROOT}/data/generated"
    candidates = [
        f"{GEN}/day_base.csv",
        f"{GEN}/base_day.csv",           # your current file name
        f"{GEN}/day_demand_base.csv",
        f"{ROOT}/data/base/day_base.csv",
        f"{ROOT}/data/base/day_demand_base.csv",
    ]
    for p in candidates:
        if os.path.exists(p):
            base_day = pd.read_csv(p)
            print("Loaded base_day from:", p)
            break
    else:
        raise FileNotFoundError("No base-day CSV found in common locations.")

# 1) normalize headers
base_day.columns = [str(c).strip().lower() for c in base_day.columns]

# 2) standardize common names
rename_map = {
    # station id
    "station": "station_id",
    "station code": "station_id",
    "station_code": "station_id",
    "sid": "station_id",
    # base demand
    "base": "base_demand",
    "basedemand": "base_demand",
    "demand_base": "base_demand",
    "base_day_demand": "base_demand",
    "base_day": "base_demand",
    # minute of day
    "minute": "minute_of_day",
    "min": "minute_of_day",
    "minuteofday": "minute_of_day",
    "minute-of-day": "minute_of_day",
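}
# When an "hour" column exists, "minute"/"min" mean minute-within-hour;
# leave them unrenamed so step 3 can combine them with "hour".
if "hour" in base_day.columns:
    rename_map.pop("minute", None)
    rename_map.pop("min", None)
base_day = base_day.rename(columns=rename_map)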

# 3) build minute_of_day if missing
if "minute_of_day" not in base_day.columns:
    if {"hour","minute"}.issubset(base_day.columns):
        base_day["minute_of_day"] = (
            pd.to_numeric(base_day["hour"], errors="coerce").fillna(0).astype(int)*60 +
            pd.to_numeric(base_day["minute"], errors="coerce").fillna(0).astype(int)
        )
        print("Built minute_of_day from hour+minute ✓")
    elif "time" in base_day.columns:
        # supports "HH:MM" or "HH:MM:SS"
        t = pd.to_datetime(base_day["time"], errors="coerce")
        base_day["minute_of_day"] = (t.dt.hour*60 + t.dt.minute).astype("Int64").fillna(0).astype(int)
        print("Built minute_of_day from time (HH:MM[:SS]) ✓")
    elif "timestamp" in base_day.columns:
        ts = pd.to_datetime(base_day["timestamp"], errors="coerce")
        base_day["minute_of_day"] = (ts.dt.hour*60 + ts.dt.minute).astype("Int64").fillna(0).astype(int)
        print("Built minute_of_day from timestamp ✓")
    else:
        # fallback: if rows are per-minute in order, use the index
        if len(base_day) in (1440, 2880):  # minute-level (1 day / maybe 2)
            base_day = base_day.reset_index().rename(columns={"index":"minute_of_day"})
            base_day["minute_of_day"] = base_day["minute_of_day"].clip(0, 1439).astype(int)
            print("Built minute_of_day from index fallback ✓")
        else:
            raise KeyError(
                "minute_of_day is missing and cannot be reconstructed.\n"
                "Provide either (hour, minute) OR a 'time' column OR 'timestamp'."
            )

# 4) ensure required columns and types
if "station_id" not in base_day.columns:
    # try recover from code/name columns if present
    for cand in ["code","stationname","name"]:
        if cand in base_day.columns:
            base_day["station_id"] = base_day[cand].astype(str)
            print(f"Created station_id from '{cand}' ✓")
            break
if "station_id" not in base_day.columns:
    raise KeyError("Missing 'station_id' in base_day (and no alternative column found).")

base_day["station_id"]    = base_day["station_id"].astype(str).str.strip()
base_day["minute_of_day"] = pd.to_numeric(base_day["minute_of_day"], errors="coerce").fillna(0).astype(int)

# base_demand may be under a different name in your file; try a safe fallback
if "base_demand" not in base_day.columns:
    for cand in ["demand", "baseline", "base", "y_base"]:
        if cand in base_day.columns:
            base_day["base_demand"] = pd.to_numeric(base_day[cand], errors="coerce")
            print(f"Created base_demand from '{cand}' ✓")
            break
if "base_demand" not in base_day.columns:
    raise KeyError("Missing 'base_demand' in base_day.")

base_day["base_demand"] = pd.to_numeric(base_day["base_demand"], errors="coerce").fillna(0.0)

print("Normalization ✓  Columns:", list(base_day.columns)[:10], "…")
print(base_day.head(3))


Built minute_of_day from timestamp ✓
Normalization ✓  Columns: ['timestamp', 'station_id', 'base_demand', 'base_demand_norm', 'minute_of_day'] …
        timestamp station_id  base_demand  base_demand_norm  minute_of_day
0  9/24/2025 6:00         S1     0.078566          0.074824            360
1  9/24/2025 6:01         S1     0.080302          0.076478            361
2  9/24/2025 6:02         S1     0.082128          0.078217            362
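
After normalization it can help to confirm that every station covers the same minute range with no duplicates. The sketch below is one way to summarize that. `minute_coverage` is a hypothetical helper, and the toy frame is made up rather than project data.

```python
import pandas as pd

def minute_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """Per-station row count, distinct minutes, and first/last minute_of_day."""
    g = df.groupby("station_id")["minute_of_day"]
    return pd.DataFrame({
        "rows": g.size(),
        "distinct_minutes": g.nunique(),
        "first_minute": g.min(),
        "last_minute": g.max(),
    })

# Tiny made-up frame for illustration (S1 has a duplicated minute, S2 has a gap).
toy = pd.DataFrame({
    "station_id": ["S1", "S1", "S1", "S2", "S2"],
    "minute_of_day": [360, 361, 361, 360, 362],
})
report = minute_coverage(toy)
print(report)
```

A station where `rows` differs from `distinct_minutes` has duplicated minutes. One where `distinct_minutes` is smaller than `last_minute - first_minute + 1` has gaps.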


In [21]:
# ============================================================
#  Bootstrap loader: build `base_df` for the week from `base_day` in memory
#  - Uses the `base_day` dataframe already loaded and normalized.
#  - Replicates the base-day minute grid across WEEK_START..WEEK_END
# ============================================================

import os
import pandas as pd
from datetime import timedelta

# ---- Project roots (edit if your path differs)
if 'ROOT' not in globals():
    ROOT = "/content/2025_GP_28/masar-sim"
SEED = f"{ROOT}/data/seeds"
GEN  = f"{ROOT}/data/generated"

# ---- Target week (keep in sync with your main script)
WEEK_START = pd.Timestamp("2025-09-21")
WEEK_END   = pd.Timestamp("2025-09-27")

# Use the base_day DataFrame already loaded and normalized
if 'base_day' not in globals():
    raise RuntimeError("The 'base_day' DataFrame was not found in memory. Please run the preceding cells.")

# Ensure required columns exist after prior normalization
req_cols = ["station_id", "minute_of_day", "base_demand"]
missing = [c for c in req_cols if c not in base_day.columns]
if missing:
    raise KeyError(f"Missing required column(s) in 'base_day' DataFrame: {missing}. Rerun cells that load and normalize base_day.")

# If timestamp exists we ignore its date and rebuild per target dates
# Build a full week by repeating the base-day minute grid per date
dates = pd.date_range(WEEK_START, WEEK_END, freq="D")
frames = []
for d in dates:
    df_d = base_day.copy()
    h = (df_d["minute_of_day"] // 60).astype(int)
    m = (df_d["minute_of_day"] %  60).astype(int)
    df_d["date"]      = d.strftime("%Y-%m-%d")
    df_d["timestamp"] = pd.to_datetime(df_d["date"] + " " + h.astype(str).str.zfill(2) + ":" + m.astype(str).str.zfill(2) + ":00")
    df_d["hour"]      = h
    df_d["day_of_week"] = d.weekday()
    df_d["is_weekend"]  = df_d["day_of_week"].isin([4,5]).astype(int)  # Fri=4, Sat=5
    frames.append(df_d)

base_df = pd.concat(frames, ignore_index=True).sort_values(
    ["date","station_id","minute_of_day"]
).reset_index(drop=True)

print("base_df ready ✓")
print("Dates:", base_df["date"].unique()[:3], "...", base_df["date"].unique()[-3:])
print("Rows:", len(base_df), "| Stations:", base_df["station_id"].nunique())

base_df ready ✓
Dates: ['2025-09-21' '2025-09-22' '2025-09-23'] ... ['2025-09-25' '2025-09-26' '2025-09-27']
Rows: 45402 | Stations: 6
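
The bootstrap cell builds each timestamp by concatenating zero-padded strings and parsing them. An equivalent that avoids string handling is to add a minute offset to midnight of the target date, sketched here with a hypothetical helper `stamp_day`.

```python
import pandas as pd

def stamp_day(minute_of_day: pd.Series, day: pd.Timestamp) -> pd.Series:
    """Timestamps for one calendar day: midnight plus minute_of_day minutes."""
    return day.normalize() + pd.to_timedelta(minute_of_day, unit="min")

minutes = pd.Series([0, 360, 1439])
ts = stamp_day(minutes, pd.Timestamp("2025-09-21"))
print(ts.dt.strftime("%Y-%m-%d %H:%M").tolist())
# ['2025-09-21 00:00', '2025-09-21 06:00', '2025-09-21 23:59']
```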


In [23]:
import os, csv, json, yaml
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# ===================== 0) Data sources =====================
# NOTE: If your pipeline used 'day_demand_base.csv', rename it to 'day_base.csv'.
if 'week_df' in globals():
    df = week_df.copy()
elif 'base_df' in globals():
    df = base_df.copy()
else:
    raise RuntimeError("week_df or base_df not found in memory.")

# Project roots
if 'ROOT' not in globals():
    ROOT = "/content/2025_GP_28/masar-sim"
SEED = f"{ROOT}/data/seeds"
CONF = f"{ROOT}/sims/00_config.yaml"

# Config
with open(CONF, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

# ===================== 1) Time fields + week window =====================
ts = pd.to_datetime(df["timestamp"], errors="coerce")
df["timestamp"]     = ts
df["date"]          = ts.dt.strftime("%Y-%m-%d")
df["hour"]          = ts.dt.hour
df["minute_of_day"] = df["hour"]*60 + ts.dt.minute
df["day_of_week"]   = ts.dt.weekday
if "is_weekend" not in df.columns:
    # Asia/Riyadh: Fri=4, Sat=5
    df["is_weekend"] = df["day_of_week"].isin([4,5]).astype(int)

# Target week: 21 → 27 September 2025
WEEK_START = pd.Timestamp("2025-09-21")
WEEK_END   = pd.Timestamp("2025-09-27")
mask_week  = (df["timestamp"] >= WEEK_START) & (df["timestamp"] <= WEEK_END + pd.Timedelta(days=1) - pd.Timedelta(seconds=1))
df = df.loc[mask_week].copy()
if df.empty:
    raise RuntimeError(f"No rows found within week window: {WEEK_START.date()} → {WEEK_END.date()}")

# ========= 1.1) Fill missing days within week if needed =========
def rebuild_ts(date_iso: str, minute_of_day: int) -> pd.Timestamp:
    h = int(minute_of_day // 60)
    m = int(minute_of_day % 60)
    return pd.Timestamp(f"{date_iso} {h:02d}:{m:02d}:00")

def expand_week_if_needed(df_week: pd.DataFrame,
                          week_start: pd.Timestamp,
                          week_end: pd.Timestamp) -> pd.DataFrame:
    target_dates = [(week_start + pd.Timedelta(days=i)).strftime("%Y-%m-%d")
                    for i in range((week_end - week_start).days + 1)]
    have_dates = sorted(df_week["date"].unique().tolist())
    missing = [d for d in target_dates if d not in have_dates]
    if not missing:
        print("No expansion needed. All week dates present ✓")
        return df_week

    # Use the densest day inside the window as a template
    tmpl_date = df_week["date"].value_counts().idxmax()
    template  = df_week[df_week["date"] == tmpl_date].copy()
    keep_cols = template.columns.tolist()

    clones = []
    for d in missing:
        c = template.copy()
        # Update time fields
        c["date"] = d
        c["timestamp"] = c["minute_of_day"].apply(lambda mo: rebuild_ts(d, int(mo)))
        c["hour"] = pd.to_datetime(c["timestamp"]).dt.hour
        c["day_of_week"] = pd.to_datetime(c["timestamp"]).dt.weekday
        c["is_weekend"]  = c["day_of_week"].isin([4,5]).astype(int)
        clones.append(c[keep_cols])

    if clones:
        df_week = pd.concat([df_week] + clones, axis=0, ignore_index=True)

    print(f"Expanded missing dates → added {len(missing)} day(s): {missing}")
    # Sort after merge
    df_week = df_week.sort_values(["date","station_id","minute_of_day"]).reset_index(drop=True)
    return df_week

df = expand_week_if_needed(df, WEEK_START, WEEK_END)

# ===================== 2) Station mapping =====================
def _norm(x): return str(x).strip().upper()

with open(f"{SEED}/stations.json", "r", encoding="utf-8") as f:
    stations_list = json.load(f)

sid_by_code, sid_by_name = {}, {}
for st in stations_list:
    sid  = str(st.get("station_id","")).strip()
    code = str(st.get("code","")).strip()
    name = str(st.get("name","")).strip()
    if code: sid_by_code[_norm(code)] = sid
    if name: sid_by_name[_norm(name)] = sid

capacity_df = pd.DataFrame(stations_list)[["station_id","capacity_station"]]

ALIASES = {
    "AIRPORT T1-2": "AIRP_T12",
    "QASR AL-HOKM": "QASR",
    "NATIONAL MUSEUM": "MUSEUM",
    "WESTERN STATION": "S6",
}
def resolve_sid(token: str):
    t = _norm(token)
    if t in sid_by_code: return sid_by_code[t]
    if t in sid_by_name: return sid_by_name[t]
    if t in ALIASES:
        c = _norm(ALIASES[t])
        return sid_by_code.get(c, ALIASES[t])
    return None

# ===================== 3) Events only (holidays disabled) =====================
def norm_date(x: str) -> str:
    if x is None: return ""
    s = str(x).strip()
    if not s: return ""
    d = pd.to_datetime(s, errors="coerce", dayfirst=False)
    if pd.isna(d):
        d = pd.to_datetime(s, errors="coerce", dayfirst=True)
    return "" if pd.isna(d) else d.strftime("%Y-%m-%d")

events_csv = f"{SEED}/calendar_events.csv"
event_rows = []
with open(events_csv, "r", encoding="utf-8") as f:
    rdr = csv.DictReader(f)
    cols = {c.lower().strip(): c for c in rdr.fieldnames}
    for r in rdr:
        event_rows.append({
            "date": norm_date(r.get(cols.get("date","date"), "")),
            "event_type": (r.get(cols.get("event_type","event_type")) or r.get(cols.get("type","type")) or "Other").strip(),
            "stations_impacted": (r.get(cols.get("stations_impacted","stations_impacted")) or r.get(cols.get("stations","stations")) or "*").strip(),
            "demand_modifier": float((r.get(cols.get("demand_modifier","demand_modifier")) or "1.0")),
        })

GLOBAL_EVENT_TYPES = {"SaudiNationalDay"}

event_types_map = {}              # (date, SID) -> set(types)
event_mult_override = {}          # (date, SID) -> product(mods)
global_event_types_by_date = {}   # date -> set(types)
global_event_mult_by_date  = {}   # date -> product(mods)

for e in event_rows:
    d = e["date"]
    if not d:
        continue
    etype = e["event_type"] or "Other"
    dm    = float(e.get("demand_modifier", 1.0) or 1.0)
    tokens = [s.strip() for s in (e["stations_impacted"] or "*").split(";")]

    # Global event?
    is_global = (etype in GLOBAL_EVENT_TYPES) or any(_norm(t) in {"*", "ALL", "ALL STATIONS"} for t in tokens)
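    if is_global:
        global_event_types_by_date.setdefault(d, set()).add(etype)
        global_event_mult_by_date[d] = global_event_mult_by_date.get(d, 1.0) * dm
        # A global event already covers every station; skipping the per-station
        # loop keeps `dm` from being applied twice to stations listed alongside it.
        continue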

    # Station-scoped events
    for tok in tokens:
        if tok == "" or _norm(tok) in {"*", "ALL", "ALL STATIONS"}:
            continue
        sid = resolve_sid(tok)
        if sid is None:
            print(f"[warn] Unknown station alias in events CSV: '{tok}'")
            continue
        key = (d, _norm(sid))
        event_types_map.setdefault(key, set()).add(etype)
        event_mult_override[key] = event_mult_override.get(key, 1.0) * dm

# Holidays disabled entirely
holiday_dates = set()

def list_event_types(date_str, sid):
    sidn = _norm(sid)
    types = set()
    if (date_str, sidn) in event_types_map:
        types |= event_types_map[(date_str, sidn)]
    if date_str in global_event_types_by_date:
        types |= global_event_types_by_date[date_str]
    return sorted(types)

def event_csv_multiplier(date_str, sid):
    sidn = _norm(sid)
    m = 1.0
    if (date_str, sidn) in event_mult_override:
        m *= event_mult_override[(date_str, sidn)]
    if date_str in global_event_mult_by_date:
        m *= global_event_mult_by_date[date_str]
    return float(m)

# ===================== 4) Modifiers (weekend + events + weather; holidays OFF) =====================
mult_cfg     = (config.get("multipliers", {}) or {})
weather_mult = mult_cfg.get("weather", {}) or {}
events_mult  = mult_cfg.get("events", {}) or {}
weekend_mult = float(mult_cfg.get("weekend", 1.0))
holiday_mult = 1.0  # force no holiday effect
COMBINE_MODE = "stack"  # multiply

def build_modifier(row):
    m = 1.0
    # weekend
    if int(row.get("is_weekend",0)) == 1:
        m *= weekend_mult

    # holidays off
    hol_m = 1.0

    # events: prefer explicit CSV multiplier; fallback to config types
    ev_m = event_csv_multiplier(row["date"], row["station_id"])
    if ev_m == 1.0:
        tmp = 1.0
        for t in list_event_types(row["date"], row["station_id"]):
            tmp *= float(events_mult.get(t, events_mult.get("Other", 1.0)))
        ev_m = tmp

    m = m * hol_m * ev_m if COMBINE_MODE == "stack" else m * max(hol_m, ev_m)

    # weather
    w = str(row.get("weather_code", "") or "")
    m *= float(weather_mult.get(w, 1.0))
    return float(m)

df["modifier"] = df.apply(build_modifier, axis=1)

# ===================== 5) Final demand =====================
base_demand_safe = pd.to_numeric(df.get("base_demand", 0), errors="coerce").fillna(0)
df["demand_final"] = (base_demand_safe * pd.to_numeric(df["modifier"], errors="coerce").fillna(1.0)).fillna(0)

# ===================== 6) station_total + crowd_level =====================
df = df.merge(capacity_df, on="station_id", how="left")
# Normalize by global max instead of per-station max
global_max = max(df["demand_final"].max(), 1e-9)
df["_denom"] = global_max
df["demand_norm_final"] = (df["demand_final"] / df["_denom"]).clip(0, 1)

def station_total_from_norm(row):
    cap = float(row.get("capacity_station") or 0)
    if cap <= 0: return 0
    norm = float(row["demand_norm_final"])
    evb  = event_csv_multiplier(row["date"], row["station_id"])
    # Soft event boost: never below 1.0, capped at +10%
    boost = min(max(evb, 1.0), 1.10)
    return int(np.round(norm * cap * boost))

df["station_total"] = df.apply(station_total_from_norm, axis=1).astype(int)

def crowd_from_cap(row):
    cap = float(row.get("capacity_station") or 0)
    x = float(row.get("station_total") or 0)
    if cap <= 0: return "Medium"
    r = x / cap
    if   r < 0.30: return "Low"
    elif r < 0.60: return "Medium"
    elif r < 0.85: return "High"
    else:          return "Extreme"

df["crowd_level"] = df.apply(crowd_from_cap, axis=1)

# ===================== 7) Flags (event/holiday) =====================
df["special_event_type"] = df.apply(lambda r: "+".join(list_event_types(r["date"], r["station_id"])) or "None", axis=1)
df["event_flag"]   = (df["special_event_type"] != "None").astype(int)
df["holiday_flag"] = 0  # holidays disabled

# ===================== 8) Headway seconds =====================
headway_cfg = config.get("headway") or {}   # tolerate a missing or null section
peaks_cfg   = config.get("peaks") or []
peak_hours  = [int(x["hour"]) for x in peaks_cfg if "hour" in x]
peak_hw_min    = float(np.median(headway_cfg.get("peak_pattern",    [7,7,6,8])))
offpeak_hw_min = float(np.median(headway_cfg.get("offpeak_pattern", [11,10,12,11])))
def hw_for_hour(h): return int(peak_hw_min*60) if int(h) in peak_hours else int(offpeak_hw_min*60)

if "headway_seconds" in df.columns:
    df["headway_seconds"] = pd.to_numeric(df["headway_seconds"], errors="coerce")
    mask = df["headway_seconds"].isna()
    df.loc[mask, "headway_seconds"] = df.loc[mask, "hour"].apply(hw_for_hour)
else:
    df["headway_seconds"] = df["hour"].apply(hw_for_hour)
df["headway_seconds"] = df["headway_seconds"].astype(int)

# ===================== 9) Export week =====================
FINAL_SCHEMA = [
    "date","timestamp","hour","minute_of_day","day_of_week","is_weekend",
    "station_id",
    "base_demand","modifier","demand_final",
    "station_total","crowd_level",
    "special_event_type","event_flag","holiday_flag",
    "headway_seconds"
]
for c in FINAL_SCHEMA:
    if c not in df.columns:
        df[c] = np.nan

out = df[FINAL_SCHEMA].sort_values(["date","station_id","minute_of_day"]).reset_index(drop=True)

# QA
assert out["station_id"].notna().all()
assert (out["station_total"] >= 0).all()

# Quick checks
print("Per-day rows in window:")
print(out["date"].value_counts().sort_index())
_23 = out[out["date"]=="2025-09-23"]
print("\n23-Sep unique stations with event_flag=1:", _23[_23["event_flag"]==1]["station_id"].nunique())

# Save
OUT_DIR = f"{ROOT}/data/generated"
os.makedirs(OUT_DIR, exist_ok=True)
OUT_PATH = f"{OUT_DIR}/cf_week_2025-09-21_to_27.csv"
out.to_csv(OUT_PATH, index=False, encoding="utf-8-sig")
print("Saved ✓", OUT_PATH, "| Rows:", len(out), "| Week:", WEEK_START.date(), "→", WEEK_END.date())


No expansion needed. All week dates present ✓
Per-day rows in window:
date
2025-09-21    6486
2025-09-22    6486
2025-09-23    6486
2025-09-24    6486
2025-09-25    6486
2025-09-26    6486
2025-09-27    6486
Name: count, dtype: int64

23-Sep unique stations with event_flag=1: 6
Saved ✓ /content/2025_GP_28/masar-sim/data/generated/cf_week_2025-09-21_to_27.csv | Rows: 45402 | Week: 2025-09-21 → 2025-09-27
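
The `crowd_from_cap` function in the export cell classifies each row with a Python-level `apply`. The same thresholds (0.30, 0.60, 0.85 of capacity) can be applied in one vectorized step with `pd.cut`, as sketched below. `crowd_levels` is a hypothetical helper and assumes every capacity is positive, so it does not reproduce the original's `"Medium"` fallback for missing capacity.

```python
import pandas as pd

def crowd_levels(station_total: pd.Series, capacity: pd.Series) -> pd.Series:
    """Label occupancy ratio bands; assumes capacity > 0 for every row."""
    ratio = station_total / capacity
    levels = pd.cut(
        ratio,
        bins=[0.0, 0.30, 0.60, 0.85, float("inf")],
        labels=["Low", "Medium", "High", "Extreme"],
        right=False,  # left-closed bands: [0, 0.30), [0.30, 0.60), ...
    )
    return levels.astype(str)

# Made-up totals against a capacity of 1000.
totals = pd.Series([100, 350, 700, 900])
caps = pd.Series([1000, 1000, 1000, 1000])
labels = crowd_levels(totals, caps)
print(labels.tolist())  # ['Low', 'Medium', 'High', 'Extreme']
```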


In [24]:
try:
    from google.colab import files
    files.download(OUT_PATH)
except Exception as e:
    print("Download skipped (not running in Colab):", e)
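
The export cell writes one consolidated weekly file. If per-day files are also wanted, grouping on the `date` column is one option, sketched below. `export_per_day` and the `cf_day` prefix are hypothetical names, and the toy frame stands in for the weekly output.

```python
import os
import tempfile
import pandas as pd

def export_per_day(week: pd.DataFrame, out_dir: str, prefix: str = "cf_day") -> list[str]:
    """Write one CSV per distinct value of the 'date' column; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for day, frame in week.groupby("date", sort=True):
        path = os.path.join(out_dir, f"{prefix}_{day}.csv")
        frame.to_csv(path, index=False, encoding="utf-8-sig")
        paths.append(path)
    return paths

# Made-up two-day frame for illustration.
toy = pd.DataFrame({
    "date": ["2025-09-21", "2025-09-21", "2025-09-22"],
    "station_id": ["S1", "S2", "S1"],
    "station_total": [120, 80, 95],
})
with tempfile.TemporaryDirectory() as tmp:
    written = export_per_day(toy, tmp)
    print([os.path.basename(p) for p in written])
```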

