<a href="https://colab.research.google.com/github/Jana-Alrzoog/2025_GP_28/blob/main/masar-sim/notebooks/masar_occupancy_day.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# 🚇 Masar Occupancy — Day Generator

This notebook generates **minute-level passenger occupancy** for a single day across selected stations/lines.
It scales the **base-day curve** by **context modifiers** (station capacity, weekend, weather, and events; holidays are disabled in the current config), then exports tidy CSVs ready for weekly/monthly aggregation and dashboards.

---

## 🎯 Purpose

* Produce **24-hour occupancy time series** at 1-minute resolution.
* Apply standardized modifiers to reflect realistic day-to-day variation.
* Emit clean outputs for QA, visualization, and Firestore/pipeline publishing.
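
To make the 1-minute resolution concrete, here is a minimal sketch of a full-day minute grid. The helper name `minute_grid` and its `step_minutes` parameter are hypothetical and do not appear in the notebook; the date matches the one used in the run recorded further down.

```python
from datetime import datetime, timedelta

def minute_grid(date_str, step_minutes=1):
    """Return one timestamp per step for a full calendar day (hypothetical helper)."""
    start = datetime.strptime(date_str, "%Y-%m-%d")
    n_steps = (24 * 60) // step_minutes
    return [start + timedelta(minutes=i * step_minutes) for i in range(n_steps)]

grid = minute_grid("2025-09-24")
print(len(grid))   # 1440
print(grid[0])     # 2025-09-24 00:00:00
print(grid[-1])    # 2025-09-24 23:59:00
```

A full calendar day gives 1440 slots per station. The run recorded in this notebook keeps 6480 rows for 6 stations, fewer than 6 × 1440 = 8640, which suggests the base file does not cover every minute of the day for every station.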

---

## 🔧 Inputs

* **`base_day.csv`** (from `masar_base_demand.ipynb`)
* **Seeds:** `stations`, `events`, `weather` (holidays ignored)
* **Config:** `00_config.yaml` (paths, multipliers, timezone, resolution)
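
For orientation, the configuration keys that the main cell reads can be pictured as the nested mapping below. This is only a sketch: the key names (`multipliers`, `weekend`, `weather`, `events`, `headway`, `peaks`) and the two headway patterns are taken from the notebook code, while the multiplier values, the peak hours, and the helper `get_multiplier` are hypothetical placeholders, not the contents of `00_config.yaml`.

```python
# Hypothetical stand-in for the parsed 00_config.yaml; multiplier values are placeholders.
example_config = {
    "multipliers": {
        "weekend": 0.8,
        "weather": {"Sunny": 1.0, "Rainy": 0.9},
        "events": {"Other": 1.1},
    },
    "headway": {"peak_pattern": [7, 7, 6, 8], "offpeak_pattern": [11, 10, 12, 11]},
    "peaks": [{"hour": 8}, {"hour": 17}],
}

def get_multiplier(config, group, key, default=1.0):
    """Look up config['multipliers'][group][key], falling back to a neutral default."""
    group_map = (config.get("multipliers") or {}).get(group) or {}
    return float(group_map.get(key, default))

print(get_multiplier(example_config, "weather", "Rainy"))  # 0.9
print(get_multiplier(example_config, "weather", "Foggy"))  # 1.0 (unknown key stays neutral)
print(get_multiplier({}, "events", "Other"))               # 1.0 (empty config stays neutral)
```

Defaulting every missing key to `1.0` keeps a sparse config from changing demand, which mirrors how the notebook treats unknown weather codes and event types.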

---

## ⚙️ Workflow

1️⃣ **Load config & seeds** (paths, stations, events, weather).
2️⃣ **Pick target date** (e.g., `DATE="2025-09-23"`), set `TZ` and `MINUTE_RES`.
3️⃣ **Build minute grid** for the day per station.
4️⃣ **Compute modifiers** per minute:

* `station_scale` (capacity vs. network mean)
* `weekend_mult` (Fri/Sat)
* `weather_mult` (Sunny/Dusty/Rainy…)
* `event_mult` (supports `stations_impacted` lists)
* *(holiday multiplier disabled by design)*

5️⃣ **Apply**: `occupancy = base_norm × final_modifier`.
6️⃣ **Export CSVs** and basic charts.

---
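
The modifier steps in the workflow above can be sketched in plain Python. This is a simplified illustration, not the notebook's implementation: the function names `parse_stations_impacted` and `final_modifier` are hypothetical, the station codes and all numbers are made up, and the capacity-based `station_scale` is left out.

```python
GLOBAL_TOKENS = {"*", "ALL", "ALL STATIONS"}

def parse_stations_impacted(field):
    """Split a ';'-separated station list; report whether it targets all stations."""
    tokens = [t.strip() for t in (field or "*").split(";") if t.strip()]
    is_global = any(t.upper() in GLOBAL_TOKENS for t in tokens)
    stations = [t for t in tokens if t.upper() not in GLOBAL_TOKENS]
    return is_global, stations

def final_modifier(is_weekend, weather_code, event_mult, weekend_mult, weather_table):
    """Multiply the weekend, event, and weather factors; unknown weather is neutral."""
    m = weekend_mult if is_weekend else 1.0
    m *= event_mult
    m *= weather_table.get(weather_code, 1.0)
    return m

# All numbers below are made-up placeholders chosen for easy arithmetic.
weather_table = {"Sunny": 1.0, "Rainy": 0.5}
m = final_modifier(True, "Rainy", event_mult=2.0, weekend_mult=0.5, weather_table=weather_table)
print(m)                                  # 0.5  (0.5 * 2.0 * 0.5)
print(100 * m)                            # 50.0 (a base value of 100 scaled by the modifier)
print(parse_stations_impacted("S1; S2"))  # (False, ['S1', 'S2'])
print(parse_stations_impacted("*"))       # (True, [])
```

Because the factors are multiplied, a neutral value of `1.0` for any one of them leaves the others unchanged.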


In [1]:
!git clone https://github.com/Jana-Alrzoog/2025_GP_28.git
%cd /content/2025_GP_28/masar-sim
!ls


Cloning into '2025_GP_28'...
remote: Enumerating objects: 662, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (170/170), done.
remote: Total 662 (delta 92), reused 1 (delta 1), pack-reused 485 (from 1)
Receiving objects: 100% (662/662), 4.57 MiB | 9.11 MiB/s, done.
Resolving deltas: 100% (239/239), done.
/content/2025_GP_28/masar-sim
data  lib  notebooks  sims


In [2]:
ROOT = "/content/2025_GP_28/masar-sim"
GEN = f"{ROOT}/data/generated"
SEED = f"{ROOT}/data/seeds"
CONF = f"{ROOT}/sims/00_config.yaml"


In [6]:
%cd /content/2025_GP_28
!git fetch origin
!git checkout main
!git reset --hard origin/main
!ls masar-sim/sims


/content/2025_GP_28
Already on 'main'
Your branch is up to date with 'origin/main'.
HEAD is now at d2ec5b3 Created using Colab
00_config.yaml


In [7]:
%cd /content
!git clone https://github.com/Jana-Alrzoog/2025_GP_28.git 2025_GP_28_latest
!ls /content/2025_GP_28_latest/masar-sim/sims

ROOT = "/content/2025_GP_28_latest/masar-sim"
GEN  = f"{ROOT}/data/generated"
SEED = f"{ROOT}/data/seeds"
CONF = f"{ROOT}/sims/00_config.yaml"

!mkdir -p $GEN
!cp /content/2025_GP_28/masar-sim/data/generated/base_demand_day.csv $GEN/

!ls $CONF
!ls $GEN


/content
fatal: destination path '2025_GP_28_latest' already exists and is not an empty directory.
00_config.yaml
/content/2025_GP_28_latest/masar-sim/sims/00_config.yaml
base_day.csv	     cf_week_f.csv	occupancy_week.csv
base_demand_day.csv  occupancy_day.csv


In [9]:
import os, json, csv, yaml
import numpy as np
import pandas as pd

ROOT = "/content/2025_GP_28_latest/masar-sim"
GEN  = f"{ROOT}/data/generated"
SEED = f"{ROOT}/data/seeds"
CONF = f"{ROOT}/sims/00_config.yaml"

base_path = f"{GEN}/base_day.csv"
assert os.path.exists(base_path), "base_day.csv not found"

base_df = pd.read_csv(base_path, parse_dates=["timestamp"])

with open(CONF) as f:
    config = yaml.safe_load(f)

with open(f"{SEED}/stations.json") as f:
    stations = json.load(f)
with open(f"{SEED}/weather_patterns.json") as f:
    weather_map = json.load(f)
with open(f"{SEED}/calendar_events.csv") as f:
    events = list(csv.DictReader(f))

print("rows:", len(base_df), "| stations:", base_df["station_id"].nunique())


rows: 6486 | stations: 6
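
A per-station row count is a cheap follow-up check after loading, since uneven counts usually point to missing or extra minutes. The sketch below uses only the standard library; the column names match those used in this notebook, while the inline sample rows and the station codes `S1` and `S2` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample standing in for base_day.csv.
sample_csv = io.StringIO(
    "timestamp,station_id,base_demand\n"
    "2025-09-24 06:00:00,S1,10\n"
    "2025-09-24 06:01:00,S1,12\n"
    "2025-09-24 06:00:00,S2,7\n"
)

rows_per_station = Counter(row["station_id"] for row in csv.DictReader(sample_csv))
uneven = len(set(rows_per_station.values())) > 1

print(dict(rows_per_station))  # {'S1': 2, 'S2': 1}
print(uneven)                  # True
```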


In [15]:
import os, csv, json, yaml
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# ===================== 0) Data sources =====================
# Expect one of these to exist in memory: week_df or base_df
if 'week_df' in globals():
    df = week_df.copy()
elif 'base_df' in globals():
    df = base_df.copy()
else:
    raise RuntimeError("week_df or base_df not found in memory.")

# Root paths
if 'ROOT' not in globals():
    ROOT = "/content/2025_GP_28/masar-sim"
SEED = f"{ROOT}/data/seeds"
CONF = f"{ROOT}/sims/00_config.yaml"

# Config
with open(CONF, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

# ===================== 1) Time fields + single-day window =====================
ts = pd.to_datetime(df["timestamp"], errors="coerce")
df["timestamp"]     = ts
df["date"]          = ts.dt.strftime("%Y-%m-%d")
df["hour"]          = ts.dt.hour
df["minute_of_day"] = df["hour"]*60 + ts.dt.minute
df["day_of_week"]   = ts.dt.weekday
if "is_weekend" not in df.columns:
    # Fri=4, Sat=5 in Asia/Riyadh
    df["is_weekend"] = df["day_of_week"].isin([4,5]).astype(int)

# Pick ONE day (default: 2025-09-24).
# Tip: if your pipeline still produces 'base_demand_day.csv', rename it to 'base_day.csv', the file this notebook loads.
DAY_DATE = pd.Timestamp("2025-09-24")
mask_day = (df["timestamp"] >= DAY_DATE) & (df["timestamp"] < DAY_DATE + pd.Timedelta(days=1))
df = df.loc[mask_day].copy()
if df.empty:
    raise RuntimeError(f"No rows found for the day window: {DAY_DATE.date()}")

# ===================== 2) Station mapping =====================
def _norm(x): return str(x).strip().upper()

with open(f"{SEED}/stations.json", "r", encoding="utf-8") as f:
    stations_list = json.load(f)

sid_by_code, sid_by_name = {}, {}
for st in stations_list:
    sid  = str(st.get("station_id","")).strip()
    code = str(st.get("code","")).strip()
    name = str(st.get("name","")).strip()
    if code: sid_by_code[_norm(code)] = sid
    if name: sid_by_name[_norm(name)] = sid

capacity_df = pd.DataFrame(stations_list)[["station_id","capacity_station"]]

ALIASES = {
    "AIRPORT T1-2": "AIRP_T12",
    "QASR AL-HOKM": "QASR",
    "NATIONAL MUSEUM": "MUSEUM",
    "WESTERN STATION": "S6",
}
def resolve_sid(token: str):
    t = _norm(token)
    if t in sid_by_code: return sid_by_code[t]
    if t in sid_by_name: return sid_by_name[t]
    if t in ALIASES:
        c = _norm(ALIASES[t])
        return sid_by_code.get(c, ALIASES[t])
    return None

# ===================== 3) Events only (holidays disabled) =====================
def norm_date(x: str) -> str:
    if x is None: return ""
    s = str(x).strip()
    if not s: return ""
    d = pd.to_datetime(s, errors="coerce", dayfirst=False)
    if pd.isna(d):
        d = pd.to_datetime(s, errors="coerce", dayfirst=True)
    return "" if pd.isna(d) else d.strftime("%Y-%m-%d")

events_csv = f"{SEED}/calendar_events.csv"
event_rows = []
with open(events_csv, "r", encoding="utf-8") as f:
    rdr = csv.DictReader(f)
    cols = {c.lower().strip(): c for c in (rdr.fieldnames or [])}  # empty file => no columns
    for r in rdr:
        event_rows.append({
            "date": norm_date(r.get(cols.get("date","date"), "")),
            "event_type": (r.get(cols.get("event_type","event_type")) or r.get(cols.get("type","type")) or "Other").strip(),
            "stations_impacted": (r.get(cols.get("stations_impacted","stations_impacted")) or r.get(cols.get("stations","stations")) or "*").strip(),
            "demand_modifier": float((r.get(cols.get("demand_modifier","demand_modifier")) or "1.0")),
        })

# Global events (e.g., SaudiNationalDay) still supported, but holidays are disabled below.
GLOBAL_EVENT_TYPES = {"SaudiNationalDay"}

event_types_map = {}              # (date, SID) -> set(types)
event_mult_override = {}          # (date, SID) -> product(mods)
global_event_types_by_date = {}   # date -> set(types)
global_event_mult_by_date  = {}   # date -> product(mods)

for e in event_rows:
    d = e["date"]
    if not d:
        continue
    etype = e["event_type"] or "Other"
    dm    = float(e.get("demand_modifier", 1.0) or 1.0)
    tokens = [s.strip() for s in (e["stations_impacted"] or "*").split(";")]

    # Global event?
    is_global = (etype in GLOBAL_EVENT_TYPES) or any(_norm(t) in {"*", "ALL", "ALL STATIONS"} for t in tokens)
    if is_global:
        global_event_types_by_date.setdefault(d, set()).add(etype)
        global_event_mult_by_date[d] = global_event_mult_by_date.get(d, 1.0) * dm

    # Station-specific events
    for tok in tokens:
        if tok == "" or _norm(tok) in {"*", "ALL", "ALL STATIONS"}:
            continue
        sid = resolve_sid(tok)
        if sid is None:
            print(f"[warn] Unknown station alias in events CSV: '{tok}'")
            continue
        key = (d, _norm(sid))
        event_types_map.setdefault(key, set()).add(etype)
        event_mult_override[key] = event_mult_override.get(key, 1.0) * dm

# Holidays DISABLED (force empty set)
holiday_dates = set()

def list_event_types(date_str, sid):
    sidn = _norm(sid)
    types = set()
    if (date_str, sidn) in event_types_map:
        types |= event_types_map[(date_str, sidn)]
    if date_str in global_event_types_by_date:
        types |= global_event_types_by_date[date_str]
    return sorted(types)

def event_csv_multiplier(date_str, sid):
    sidn = _norm(sid)
    m = 1.0
    if (date_str, sidn) in event_mult_override:
        m *= event_mult_override[(date_str, sidn)]
    if date_str in global_event_mult_by_date:
        m *= global_event_mult_by_date[date_str]
    return float(m)

# ===================== 4) Multipliers (weekend + events + weather); holidays off =====================
mult_cfg     = (config.get("multipliers", {}) or {})
weather_mult = mult_cfg.get("weather", {}) or {}
events_mult  = mult_cfg.get("events", {}) or {}
weekend_mult = float(mult_cfg.get("weekend", 1.0))
holiday_mult = 1.0  # force no holiday effect
COMBINE_MODE = "stack"  # "stack" => multiply; else use max(holiday, events)

def build_modifier(row):
    m = 1.0
    # weekend
    if int(row.get("is_weekend",0)) == 1:
        m *= weekend_mult

    # holidays disabled => hol_m = 1.0 always
    hol_m = 1.0

    # events: prefer explicit CSV multiplier; otherwise fallback to config-based types
    ev_m = event_csv_multiplier(row["date"], row["station_id"])
    if ev_m == 1.0:
        tmp = 1.0
        for t in list_event_types(row["date"], row["station_id"]):
            tmp *= float(events_mult.get(t, events_mult.get("Other", 1.0)))
        ev_m = tmp

    m = m * hol_m * ev_m if COMBINE_MODE == "stack" else m * max(hol_m, ev_m)

    # weather
    w = str(row.get("weather_code", "") or "")
    m *= float(weather_mult.get(w, 1.0))
    return float(m)

df["modifier"] = df.apply(build_modifier, axis=1)

# ===================== 5) Final demand =====================
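if "base_demand" in df.columns:
    base_demand_safe = pd.to_numeric(df["base_demand"], errors="coerce").fillna(0)
else:
    # Missing column: df.get(..., 0) would return a scalar with no .fillna, so build a zero Series
    base_demand_safe = pd.Series(0.0, index=df.index)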
df["demand_final"] = (base_demand_safe * pd.to_numeric(df["modifier"], errors="coerce").fillna(1.0)).fillna(0)

# ===================== 6) station_total + crowd_level =====================
df = df.merge(capacity_df, on="station_id", how="left")
df["_denom"] = df.groupby("station_id")["demand_final"].transform(lambda s: max(s.max(), 1e-9))
df["demand_norm_final"] = (df["demand_final"] / df["_denom"]).clip(0, 1)

def station_total_from_norm(row):
    cap = row.get("capacity_station")
    # Unknown capacity (NaN after the left merge) or non-positive capacity yields 0
    if pd.isna(cap) or float(cap) <= 0: return 0
    cap  = float(cap)
    norm = float(row["demand_norm_final"])
    evb  = event_csv_multiplier(row["date"], row["station_id"])
    # Soft cap boost for events: clamp the event multiplier to [1.0, 1.10]
    boost = min(max(evb, 1.0), 1.10)
    return int(np.round(norm * cap * boost))

df["station_total"] = df.apply(station_total_from_norm, axis=1).astype(int)

def crowd_from_cap(row):
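    cap = row.get("capacity_station")
    x = float(row.get("station_total") or 0)
    # Unknown (NaN) or non-positive capacity falls back to "Medium" instead of "Extreme"
    if pd.isna(cap) or float(cap) <= 0: return "Medium"
    r = x / float(cap)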
    if   r < 0.30: return "Low"
    elif r < 0.60: return "Medium"
    elif r < 0.85: return "High"
    else:          return "Extreme"

df["crowd_level"] = df.apply(crowd_from_cap, axis=1)

# ===================== 7) Event/holiday flags =====================
df["special_event_type"] = df.apply(lambda r: "+".join(list_event_types(r["date"], r["station_id"])) or "None", axis=1)
df["event_flag"]   = (df["special_event_type"] != "None").astype(int)
df["holiday_flag"] = 0  # holidays disabled

# ===================== 8) Headway seconds =====================
headway_cfg = config.get("headway", {})
peaks_cfg   = config.get("peaks", [])
peak_hours  = [int(x.get("hour")) for x in peaks_cfg if "hour" in x]
peak_hw_min    = float(np.median(headway_cfg.get("peak_pattern",    [7,7,6,8])))
offpeak_hw_min = float(np.median(headway_cfg.get("offpeak_pattern", [11,10,12,11])))
def hw_for_hour(h): return int(peak_hw_min*60) if int(h) in peak_hours else int(offpeak_hw_min*60)

if "headway_seconds" in df.columns:
    df["headway_seconds"] = pd.to_numeric(df["headway_seconds"], errors="coerce")
    mask = df["headway_seconds"].isna()
    df.loc[mask, "headway_seconds"] = df.loc[mask, "hour"].apply(hw_for_hour)
else:
    df["headway_seconds"] = df["hour"].apply(hw_for_hour)
df["headway_seconds"] = df["headway_seconds"].astype(int)

# ===================== 9) Output (single day) =====================
FINAL_SCHEMA = [
    "date","timestamp","hour","minute_of_day","day_of_week","is_weekend",
    "station_id",
    "base_demand","modifier","demand_final",
    "station_total","crowd_level",
    "special_event_type","event_flag","holiday_flag",
    "headway_seconds"
]
for c in FINAL_SCHEMA:
    if c not in df.columns:
        df[c] = np.nan

out = df[FINAL_SCHEMA].sort_values(["date","station_id","minute_of_day"]).reset_index(drop=True)

# QA
assert out["station_id"].notna().all()
assert (out["station_total"] >= 0).all()

# Quick checks
print("Rows for the selected day:", len(out))
print("Unique stations with event_flag=1:", out[out["event_flag"]==1]["station_id"].nunique())

# Save as a per-day file (changed from 'cf_week_2025-09-21_to_27.csv')
OUT_DIR = f"{ROOT}/data/generated"
os.makedirs(OUT_DIR, exist_ok=True)
OUT_PATH = f"{OUT_DIR}/cf_day_{DAY_DATE.date()}.csv"
out.to_csv(OUT_PATH, index=False, encoding="utf-8-sig")
print("Saved ✓", OUT_PATH)


Rows for the selected day: 6480
Unique stations with event_flag=1: 0
Saved ✓ /content/2025_GP_28_latest/masar-sim/data/generated/cf_day_2025-09-24.csv
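
As a worked example of the last two derived columns, the sketch below restates the station-total and crowd-level rules from the cell above as standalone functions. The thresholds (0.30, 0.60, 0.85) and the 10% event boost cap come from that cell; the function names, the capacity of 1000, and the input values are hypothetical, and Python's built-in `round` stands in for `np.round`.

```python
def station_total(norm, capacity, event_mult):
    """Scale normalized demand to capacity, with an event boost clamped to [1.0, 1.10]."""
    if capacity <= 0:
        return 0
    boost = min(max(event_mult, 1.0), 1.10)
    return int(round(norm * capacity * boost))

def crowd_level(total, capacity):
    """Bucket the total-to-capacity ratio using the notebook's thresholds."""
    if capacity <= 0:
        return "Medium"
    ratio = total / capacity
    if ratio < 0.30:
        return "Low"
    if ratio < 0.60:
        return "Medium"
    if ratio < 0.85:
        return "High"
    return "Extreme"

CAPACITY = 1000  # hypothetical station capacity
for norm, event_mult in [(0.2, 1.0), (0.45, 1.0), (0.7, 1.0), (1.0, 1.5)]:
    total = station_total(norm, CAPACITY, event_mult)
    print(norm, event_mult, total, crowd_level(total, CAPACITY))
# 0.2 1.0 200 Low
# 0.45 1.0 450 Medium
# 0.7 1.0 700 High
# 1.0 1.5 1100 Extreme
```

Note that an event multiplier of 1.5 is clamped to 1.10 here, so the total can exceed capacity by at most 10%.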


In [16]:
from google.colab import files

OUT_PATH = f"{ROOT}/data/generated/cf_day_{DAY_DATE.date()}.csv"

files.download(OUT_PATH)
