# 02 — Data Audit (Schema, Timezone & Session Coverage)

**Goal.** Before any backtest, verify that our raw data cleanly supports the strategy’s session rules:
- Timestamps parse correctly and map to **America/New_York** (NY) time.
- **Opening Range (OR)** window **09:30–10:00 (inclusive)** has all expected minutes (31 bars).
- The trading window **09:30–12:00** is complete, and both **10:22** (entry) and **12:00** (hard exit) bars exist.
- Days with gaps, duplicates, or anomalies are **excluded with a clear reason**.
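The coverage rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's actual implementation: `expected_minutes`, `audit_day`, and the reason strings are hypothetical names, and the input is assumed to be one day's bar timestamps already converted to NY time and formatted as `HH:MM`.

```python
from datetime import datetime, timedelta

def expected_minutes(start: str, end_inclusive: str) -> list:
    """Every HH:MM label from start to end, both endpoints included."""
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end_inclusive, "%H:%M")
    n_steps = int((t1 - t0).total_seconds() // 60)
    return [(t0 + timedelta(minutes=i)).strftime("%H:%M") for i in range(n_steps + 1)]

OR_MINUTES = expected_minutes("09:30", "10:00")        # 31 labels
SESSION_MINUTES = expected_minutes("09:30", "12:00")   # 151 labels

def audit_day(bar_times: list) -> list:
    """Return exclusion reasons for one day's NY-time bar labels ([] means the day passes)."""
    seen = set(bar_times)
    reasons = []
    if len(seen) != len(bar_times):
        reasons.append("duplicate_bars")
    if any(m not in seen for m in OR_MINUTES):
        reasons.append("incomplete_or_window")
    if any(m not in seen for m in SESSION_MINUTES):
        reasons.append("session_gap")
    if "10:22" not in seen:
        reasons.append("missing_entry_bar")
    if "12:00" not in seen:
        reasons.append("missing_exit_bar")
    return reasons

# Usage: a complete day passes; dropping the entry bar yields two reasons.
full_day = list(SESSION_MINUTES)
print(len(OR_MINUTES), audit_day(full_day))
print(audit_day([m for m in full_day if m != "10:22"]))
```

Returning every reason rather than only the first keeps the exclusion log informative when a day fails more than one check.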

**Inputs.**
- Raw 1-minute CSVs in `data/raw/`, semicolon `;` delimited, **no header**  
  Columns: `datetime;open;high;low;close;volume`  
- Config files (single source of truth):  
  - `config/strategy.yml`  → zones, SL/TP, one-trade/day  
  - `config/instruments.yml` → timezone, OR/entry/exit times, costs, data schema
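Because the session rules are defined in NY time, each raw timestamp has to be parsed and then mapped to `America/New_York`. The sketch below shows one way to do that with the standard library. It assumes the source timestamps are in UTC, which is an assumption made purely for illustration (the real value belongs in the `source_timezone` field of the data config); `to_ny` is a hypothetical helper name.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

DATETIME_FORMAT = "%Y%m%d %H%M%S"    # format of the raw `datetime` column
SOURCE_TZ = ZoneInfo("UTC")          # ASSUMPTION for illustration only
NY = ZoneInfo("America/New_York")

def to_ny(raw: str) -> datetime:
    """Parse a raw timestamp, attach the source timezone, convert to NY time."""
    naive = datetime.strptime(raw, DATETIME_FORMAT)
    return naive.replace(tzinfo=SOURCE_TZ).astimezone(NY)

# The 09:30 NY open falls on different UTC clock times on either side of the
# US daylight-saving switch (10 March 2024), so a fixed offset would be wrong.
print(to_ny("20240308 143000"))  # winter (EST) example
print(to_ny("20240311 133000"))  # summer (EDT) example
```

Using a named zone instead of a fixed UTC offset is what keeps the 09:30 open aligned across daylight-saving transitions.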

**Outputs of this notebook.**
- `reports/tables/valid_days.csv` — dates that passed all checks, with their OR statistics  
- `reports/tables/exclusion_log.csv` — dates we skipped **and why**  
- 2–3 quick visuals for coverage and OR-range distribution
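As a rough picture of what the two tables might contain, the sketch below splits per-day audit results into a valid-days table and an exclusion log. The column names (`date`, `or_high`, `or_low`, `or_range`, `reason`), the helper `write_audit_tables`, and the sample values are all hypothetical; the real layout is whatever the notebook ends up writing.

```python
import csv
import io

def write_audit_tables(day_results, valid_fh, excluded_fh):
    """Split per-day audit results into a valid-days table and an exclusion log."""
    valid = csv.writer(valid_fh)
    excluded = csv.writer(excluded_fh)
    valid.writerow(["date", "or_high", "or_low", "or_range"])
    excluded.writerow(["date", "reason"])
    for day in day_results:
        if day["reasons"]:
            # Keep every reason so the log explains the exclusion fully.
            excluded.writerow([day["date"], "|".join(day["reasons"])])
        else:
            or_range = round(day["or_high"] - day["or_low"], 4)
            valid.writerow([day["date"], day["or_high"], day["or_low"], or_range])

# Made-up sample results, for illustration only.
results = [
    {"date": "2024-01-02", "or_high": 105.5, "or_low": 100.0, "reasons": []},
    {"date": "2024-01-03", "or_high": None, "or_low": None,
     "reasons": ["missing_entry_bar", "session_gap"]},
]
valid_buf, excluded_buf = io.StringIO(), io.StringIO()
write_audit_tables(results, valid_buf, excluded_buf)
print(valid_buf.getvalue())
print(excluded_buf.getvalue())
```

When writing to disk instead of in-memory buffers, the files would be opened with `open(path, "w", newline="")` so the `csv` module controls line endings.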

> **Why this matters.** Clean coverage prevents hidden biases (such as a missing 10:22 entry bar) and makes backtest results **reproducible** and **trustworthy**.

---


### 2.1 — Imports, paths, and config preview (no data read yet)

```python
# 2.1 — Imports, paths, and config preview (no data read yet)

from pathlib import Path

import pandas as pd
import yaml
from IPython.display import display  # implicit in Jupyter; explicit so the cell also runs as a script

# Pretty printing
pd.set_option("display.width", 140)
pd.set_option("display.max_columns", 30)

# Project layout: resolve the repo root whether we run from notebooks/ or the root itself
ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
CONFIG_DIR = ROOT / "config"
DATA_RAW_DIR = ROOT / "data" / "raw"
REPORTS_TBLS = ROOT / "reports" / "tables"
REPORTS_FIGS = ROOT / "reports" / "figures"

for d in (REPORTS_TBLS, REPORTS_FIGS):
    d.mkdir(parents=True, exist_ok=True)

def load_yaml(p: Path) -> dict:
    """Load a YAML file; an empty file yields {} so the .get() calls below stay safe."""
    with open(p, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}

# Load configs (single source of truth)
STRATEGY = load_yaml(CONFIG_DIR / "strategy.yml")
INSTR    = load_yaml(CONFIG_DIR / "instruments.yml")

# Extract frequently used sections for a quick human check
session   = INSTR.get("session", {})
data_cfg  = INSTR.get("data", {})
policies  = INSTR.get("policies", {})
market    = INSTR.get("market", {})
costs     = INSTR.get("costs", {})
or_window = session.get("or_window", {})
params    = STRATEGY.get("parameters", {})
zones     = params.get("zones", {})
risk      = params.get("risk", {})

# Note: the fallbacks below are display defaults only. A key missing from the
# config silently shows its default here, so compare against the YAML files.
summary = pd.DataFrame.from_dict({
    "Instrument": [market.get("instrument", "TBD")],
    "Timezone": [session.get("timezone", "America/New_York")],
    "OR Start": [or_window.get("start", "09:30")],
    "OR End (inclusive)": [or_window.get("end_inclusive", "10:00")],
    "Entry Time": [session.get("entry_time", "10:22")],
    "Hard Exit": [session.get("hard_exit_time", "12:00")],
    "Top Zone %": [zones.get("top_pct", 0.35)],
    "Bottom Zone %": [zones.get("bottom_pct", 0.35)],
    "SL (pts)": [risk.get("stop_loss_points", 25)],
    "TP (pts)": [risk.get("take_profit_points", 75)],
    "$/point": [market.get("point_value_usd", 80.0)],
    "Tick size": [market.get("tick_size", None)],
    "Fees/side ($)": [costs.get("fees_per_side_usd", 0.0)],
    "Slippage (pts)": [costs.get("slippage_points", 0.0)],
    "Delimiter": [data_cfg.get("delimiter", ";")],
    "Datetime fmt": [data_cfg.get("datetime_format", "%Y%m%d %H%M%S")],
    "Source TZ": [data_cfg.get("source_timezone", None)],
    "Has premarket": [data_cfg.get("has_premarket", False)],
})

print("ROOT:", ROOT)
print("CONFIG_DIR:", CONFIG_DIR)
print("DATA_RAW_DIR:", DATA_RAW_DIR)
display(summary)

print("\nPolicies snapshot:")
display(pd.Series(policies, name="value"))
```


ROOT: d:\Projects\OpeningRange
CONFIG_DIR: d:\Projects\OpeningRange\config
DATA_RAW_DIR: d:\Projects\OpeningRange\data\raw


|  | Instrument | Timezone | OR Start | OR End (inclusive) | Entry Time | Hard Exit | Top Zone % | Bottom Zone % | SL (pts) | TP (pts) | $/point | Tick size | Fees/side ($) | Slippage (pts) | Delimiter | Datetime fmt | Source TZ | Has premarket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NSXUSD | America/New_York | 09:30 | 10:00 | 10:22 | 12:00 | 0.35 | 0.35 | 25 | 75 | 80.0 | None | 0.0 | 0.0 | `;` | `%Y%m%d %H%M%S` | None | False |



Policies snapshot:


| policy | value |
|---|---|
| require_complete_or_window | True |
| require_entry_bar | True |
| require_exit_bar | True |
| skip_zero_or_range | True |
| min_or_range_points | 0.0 |
| skip_days_with_gaps | True |
| skip_days_with_premarket_detected | True |
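To make the policy flags in the snapshot above concrete, here is one way they might be turned into a per-day decision. This is a sketch under stated assumptions: the policy keys and values come from the snapshot, while `exclusion_reason`, the shape of the `day` dictionary, the reason strings, and the order of the checks are hypothetical.

```python
from typing import Optional

# Policy values copied from the snapshot above.
POLICIES = {
    "require_complete_or_window": True,
    "require_entry_bar": True,
    "require_exit_bar": True,
    "skip_zero_or_range": True,
    "min_or_range_points": 0.0,
    "skip_days_with_gaps": True,
    "skip_days_with_premarket_detected": True,
}

def exclusion_reason(day: dict, policies: dict) -> Optional[str]:
    """Return the first policy a day violates, or None if the day is usable."""
    if policies.get("skip_days_with_premarket_detected") and day["has_premarket"]:
        return "premarket_detected"
    if policies.get("skip_days_with_gaps") and day["has_gaps"]:
        return "gaps_in_session"
    if policies.get("require_complete_or_window") and day["or_bars"] != 31:
        return "incomplete_or_window"
    if policies.get("require_entry_bar") and not day["has_entry_bar"]:
        return "missing_entry_bar"
    if policies.get("require_exit_bar") and not day["has_exit_bar"]:
        return "missing_exit_bar"
    if policies.get("skip_zero_or_range") and day["or_range"] == 0:
        return "zero_or_range"
    if day["or_range"] < policies.get("min_or_range_points", 0.0):
        return "or_range_below_minimum"
    return None

# Hypothetical per-day summary, for illustration only.
clean_day = {"has_premarket": False, "has_gaps": False, "or_bars": 31,
             "has_entry_bar": True, "has_exit_bar": True, "or_range": 12.5}
print(exclusion_reason(clean_day, POLICIES))
print(exclusion_reason({**clean_day, "or_range": 0.0}, POLICIES))
```

Reading each flag with `.get()` means a policy absent from the config is treated as switched off rather than raising an error.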

### 2.2 — File Discovery & Schema Validation

**What we do now (read this first):**

* List CSV files in `data/raw/`.
* Read a **small sample** from each file using the configured delimiter and expected schema (6 columns, no header).
* Validate that the sample rows look right:

  * `datetime` matches the configured `datetime_format`.
  * `open/high/low/close` are numeric.
  * `volume` is numeric.
* Produce a small **table** summarizing pass/fail per file and notes.
  *(No timezone mapping or session slicing yet—this is just a quick sanity check.)*
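The sample checks listed above can be sketched with the standard library alone. This is an illustrative version rather than the notebook's own cell: `validate_sample` and the wording of its notes are hypothetical, and the sample rows are made up. The delimiter and datetime format defaults mirror the values shown in the config preview.

```python
import csv
from datetime import datetime

EXPECTED_COLUMNS = ["datetime", "open", "high", "low", "close", "volume"]

def validate_sample(lines, delimiter=";", datetime_format="%Y%m%d %H%M%S"):
    """Check a few raw rows against the expected 6-column, headerless schema."""
    notes = []
    for i, row in enumerate(csv.reader(lines, delimiter=delimiter), start=1):
        if len(row) != len(EXPECTED_COLUMNS):
            notes.append(f"row {i}: expected 6 columns, got {len(row)}")
            continue
        try:
            datetime.strptime(row[0], datetime_format)
        except ValueError:
            notes.append(f"row {i}: datetime {row[0]!r} does not match the format")
        # open/high/low/close/volume must all parse as numbers
        for name, value in zip(EXPECTED_COLUMNS[1:], row[1:]):
            try:
                float(value)
            except ValueError:
                notes.append(f"row {i}: {name} is not numeric ({value!r})")
    return {"passed": not notes, "notes": notes}

# Made-up sample rows, for illustration only.
good = ["20240102 093000;100.0;101.0;99.5;100.5;12"]
bad = ["2024-01-02 09:30;100.0;101.0;99.5;100.5;abc", "20240102 093100;100.0"]
print(validate_sample(good))
print(validate_sample(bad))
```

Collecting notes instead of raising on the first problem makes it easy to build the per-file pass/fail table with one row of notes per file.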