# Evidence Notebook — Azure Batch Run Verification

This notebook verifies the output of a completed Azure Synapse batch run.

The batch execution was performed in Azure Synapse using:

Pipeline → Notebook → Runner (from `azure_batch_code.zip`) → ADLS

This notebook does **not** execute the pipeline and does **not** recompute metrics.

Instead, it:

1. Reads the committed `sample_run.log` (single source of truth).
2. Verifies the run status and row counts.
3. Loads the curated and metric parquet files that were generated in ADLS and downloaded locally for inspection.
4. Confirms that the curated and metric row counts match the values recorded in the run log.
5. Displays sample rows from each metric.

---

## Structure of This Notebook

Each section in this notebook follows a strict three-part pattern:

1. **Explanation**  
   - States the specific claim being tested  
   - Identifies which artifact(s) are inspected  
   - Defines what would invalidate the claim  

2. **Execution (Read-Only Code)**  
   - Only performs file reads, JSON parsing, previews, and row counts  
   - Does not mutate data  
   - Does not re-run the pipeline  
   - Does not alter parameters  

3. **Evidence**  
   - Interprets only what is displayed  
   - Clearly states whether the claim is supported  
   - Adds no narrative beyond the observed results  

This structure ensures that the notebook remains deterministic, transparent, and audit-focused.

---

## Important Notes

- The actual execution occurred in Azure.
- The parquet files shown here were produced in ADLS under:

  `metrics/run_id=b7919d1111974426b23e31ccc0807276/`

- They were downloaded locally only for offline inspection due to infrastructure dependencies.
- The metric parquet files are **not version-controlled** in this repository.

This notebook exists solely to provide transparent, reproducible evidence of the Azure batch run.
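
Because the parquet copies are not version-controlled, one way to pin down exactly which local files were inspected is to record a checksum for each of them. The following is a minimal sketch and not part of the original notebook; `sha256_of` and `build_manifest` are hypothetical helper names, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large parquet files are not held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each path to its digest; paths that do not exist map to None."""
    manifest = {}
    for p in paths:
        p = Path(p)
        manifest[str(p)] = sha256_of(p) if p.is_file() else None
    return manifest
```

Calling `build_manifest(["m1.parquet", "m2.parquet"])` before and after running the notebook would return identical digests if the cells are truly read-only.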


# Section 1 — Load Run Log

## Explanation

**Claim being proven**  
A completed Azure batch run exists and its execution evidence was recorded.

**Artifact inspected**  
`evidence/sample_run.log`

**What would invalidate the claim**  
- File does not exist  
- JSON cannot be parsed  
- `status != "success"`
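
The three invalidation conditions above can be sketched as a single function that reports the first condition that fails. This is an illustrative sketch, not the notebook's own cell; `invalidation_reasons` is a hypothetical name, and it assumes the log is a UTF-8 JSON object with a `status` key, as the run log shown in this notebook suggests.

```python
import json
from pathlib import Path


def invalidation_reasons(log_path):
    """Return a list of reasons the claim fails; an empty list means it is supported."""
    path = Path(log_path)
    if not path.exists():
        return [f"file does not exist: {path}"]
    try:
        with open(path, "r", encoding="utf-8") as f:
            log = json.load(f)
    except json.JSONDecodeError as exc:
        return [f"JSON cannot be parsed: {exc}"]
    if not isinstance(log, dict):
        return ["top-level JSON value is not an object"]
    if log.get("status") != "success":
        return [f"status is {log.get('status')!r}, expected 'success'"]
    return []
```

Returning reasons instead of asserting keeps the check read-only and lets the caller decide how to display a failure.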


In [1]:

import json
from pathlib import Path
from pprint import pprint

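# Committed run log (the single source of truth for this notebook).
log_path = Path("evidence/sample_run.log")

assert log_path.exists(), f"Run log not found. Expected {log_path}"

# Read-only: parse the JSON log into a dict.
with open(log_path, "r", encoding="utf-8") as f:
    run_log = json.load(f)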

print("=== RUN LOG CONTENT ===")
pprint(run_log)

print("\n=== BASIC CHECKS ===")
print("Run ID:", run_log["run_id"])
print("Status:", run_log["status"])
print("Input Path:", run_log["input_path"])
print("Metrics Root:", run_log["metrics_root"])


=== RUN LOG CONTENT ===
{'curated_path': 'abfss://data@batchrawdata.dfs.core.windows.net/curated/run_id=b7919d1111974426b23e31ccc0807276/curated.parquet',
 'input_path': 'abfss://data@batchrawdata.dfs.core.windows.net/raw/spreadspoke_scores.csv',
 'metrics': ['M1', 'M2', 'M3', 'M4', 'M5', 'M6'],
 'metrics_root': 'abfss://data@batchrawdata.dfs.core.windows.net/metrics/run_id=b7919d1111974426b23e31ccc0807276',
 'per_metric_rows': {'M1': 1774,
                     'M2': 1774,
                     'M3': 1774,
                     'M4': 60,
                     'M5': 60,
                     'M6': 60},
 'rows_curated': 14358,
 'rows_read_raw': 14358,
 'run_id': 'b7919d1111974426b23e31ccc0807276',
 'status': 'success'}

=== BASIC CHECKS ===
Run ID: b7919d1111974426b23e31ccc0807276
Status: success
Input Path: abfss://data@batchrawdata.dfs.core.windows.net/raw/spreadspoke_scores.csv
Metrics Root: abfss://data@batchrawdata.dfs.core.windows.net/metrics/run_id=b7919d1111974426b23e31ccc0807276


## Evidence

- The run log file exists.
- The JSON structure is readable.
- `run_id` is present.
- `input_path`, `curated_path`, and `metrics_root` use `abfss://` (Azure Data Lake).
- `status` is `success`.

Since `status == "success"`, the claim is supported: the run log records a completed batch run together with its curated and metric output locations.
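
The observation that the three paths use `abfss://` is made by eye from the printed log. A small check along these lines could make it explicit. It is a sketch under the assumption that the paths follow the usual ADLS Gen2 form `abfss://<container>@<account host>/<path>`; `parse_abfss` and `paths_outside_abfss` are hypothetical helper names.

```python
from urllib.parse import urlparse


def parse_abfss(uri):
    """Split an abfss:// URI into (container, account_host, path)."""
    parts = urlparse(uri)
    if parts.scheme != "abfss":
        raise ValueError(f"not an abfss URI: {uri}")
    # In abfss://container@host/path the container sits in the userinfo position.
    if not parts.username or not parts.hostname:
        raise ValueError(f"missing container or account host: {uri}")
    return parts.username, parts.hostname, parts.path


def paths_outside_abfss(run_log, keys=("input_path", "curated_path", "metrics_root")):
    """Return the keys whose values are missing or are not well-formed abfss URIs."""
    bad = []
    for key in keys:
        value = run_log.get(key)
        if not isinstance(value, str):
            bad.append(key)
            continue
        try:
            parse_abfss(value)
        except ValueError:
            bad.append(key)
    return bad
```

With the run log shown above, `paths_outside_abfss(run_log)` would be expected to return an empty list.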


# Section 2 — Inspect Curated Output

## Explanation

**Claim being proven**  
The curated dataset was produced by the Azure run and contains the expected number of rows.

**Artifact inspected**  
Downloaded parquet copy from:
`curated/run_id=b7919d1111974426b23e31ccc0807276/`

(Local copy only for offline inspection.)

**What would invalidate the claim**  
- Curated file missing  
- Row count does not match `rows_curated` in run log


In [2]:
import pandas as pd
from pathlib import Path

curated_path = Path("curated/curated.parquet")

assert curated_path.exists(), f"Curated parquet not found at {curated_path}"

curated_df = pd.read_parquet(curated_path)

print("=== CURATED DATASET ===")
print("Rows:", len(curated_df))
print("\nSchema:")
print(curated_df.dtypes)

curated_df.head()


=== CURATED DATASET ===
Rows: 14358

Schema:
schedule_date           object
schedule_season          int32
schedule_week           object
schedule_playoff          bool
team_home               object
score_home             float64
score_away             float64
team_away               object
team_favorite_id        object
spread_favorite        float64
over_under_line         object
stadium                 object
stadium_neutral           bool
weather_temperature    float64
weather_wind_mph       float64
weather_humidity       float64
weather_detail          object
dtype: object


| | schedule_date | schedule_season | schedule_week | schedule_playoff | team_home | score_home | score_away | team_away | team_favorite_id | spread_favorite | over_under_line | stadium | stadium_neutral | weather_temperature | weather_wind_mph | weather_humidity | weather_detail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9/2/1966 | 1966 | 1 | False | Miami Dolphins | 14.0 | 23.0 | Oakland Raiders | | | | Orange Bowl | False | 83.0 | 6.0 | 71.0 | |
| 1 | 9/3/1966 | 1966 | 1 | False | Houston Oilers | 45.0 | 7.0 | Denver Broncos | | | | Rice Stadium | False | 81.0 | 7.0 | 70.0 | |
| 2 | 9/4/1966 | 1966 | 1 | False | San Diego Chargers | 27.0 | 7.0 | Buffalo Bills | | | | Balboa Stadium | False | 70.0 | 7.0 | 82.0 | |
| 3 | 9/9/1966 | 1966 | 2 | False | Miami Dolphins | 14.0 | 19.0 | New York Jets | | | | Orange Bowl | False | 82.0 | 11.0 | 78.0 | |
| 4 | 9/10/1966 | 1966 | 1 | False | Green Bay Packers | 24.0 | 3.0 | Baltimore Colts | | | | Lambeau Field | False | 64.0 | 8.0 | 62.0 | |


## Evidence

- Curated parquet file exists locally.
- Row count is 14358, matching `rows_curated` in the run log.
- Sample preview shows the 17 expected columns; the displayed string values appear trimmed.
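
The "expected columns" observation could be turned into an explicit comparison. The sketch below takes the expected names from the schema printed in this section; `EXPECTED_COLUMNS` and `column_diff` are hypothetical names introduced here for illustration, and no such check appears in the original notebook.

```python
# Column names copied from the schema output of the curated dataset.
EXPECTED_COLUMNS = [
    "schedule_date", "schedule_season", "schedule_week", "schedule_playoff",
    "team_home", "score_home", "score_away", "team_away", "team_favorite_id",
    "spread_favorite", "over_under_line", "stadium", "stadium_neutral",
    "weather_temperature", "weather_wind_mph", "weather_humidity",
    "weather_detail",
]


def column_diff(actual, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, each in its original order."""
    actual = list(actual)
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected
```

In the notebook this would be called as `column_diff(curated_df.columns)`; two empty lists mean the columns match.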


# Section 3 — Inspect Metric Outputs

## Explanation

**Claim being proven**  
All six metrics (M1–M6) were produced and contain the expected number of rows.

**Artifacts inspected**  
Downloaded parquet copies from:
`metrics/run_id=b7919d1111974426b23e31ccc0807276/M1..M6`

(Local copies only for offline inspection.)

**What would invalidate the claim**  
- Any metric file missing  
- Row count does not match `per_metric_rows` in run log


In [10]:
import pandas as pd
from pathlib import Path

metrics = ["M1", "M2", "M3", "M4", "M5", "M6"]
metric_dfs = {}

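for m in metrics:
    # Local copies are named after the metric in lower case, e.g. m1.parquet.
    file_name = f"{m.lower()}.parquet"
    path = Path(file_name)

    print(f"\nMetric {m} exists:", path.exists())
    assert path.exists(), f"{file_name} not found."

    # Read-only load; keep the frame for the Section 4 integrity check.
    df = pd.read_parquet(path)
    metric_dfs[m] = df

    # Show the observed and logged row counts, then a short preview.
    print(f"{m} rows:", len(df))
    print("Expected rows:", run_log["per_metric_rows"][m])
    print("Preview:")
    print(df.head())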



Metric M1 exists: True
M1 rows: 1774
Expected rows: 1774
Preview:
              team  season  games_played
0  Atlanta Falcons    1966            14
1  Baltimore Colts    1966            14
2  Boston Patriots    1966             7
3    Buffalo Bills    1966            15
4    Chicago Bears    1966            14

Metric M2 exists: True
M2 rows: 1774
Expected rows: 1774
Preview:
              team  season  wins  losses  ties
0  Atlanta Falcons    1966     3      11     0
1  Baltimore Colts    1966     9       5     0
2  Boston Patriots    1966     4       2     1
3    Buffalo Bills    1966     9       5     1
4    Chicago Bears    1966     5       7     2

Metric M3 exists: True
M3 rows: 1774
Expected rows: 1774
Preview:
              team  season  points_for  points_against
0  Atlanta Falcons    1966         204             437
1  Baltimore Colts    1966         314             226
2  Boston Patriots    1966         158             146
3    Buffalo Bills    1966         365             

## Evidence

- All six metric parquet outputs (`m1.parquet` through `m6.parquet`) exist locally.
- Row counts match `per_metric_rows` from the run log.
- Sample previews confirm expected grain and columns.
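
The row-count comparison is currently read off the printed output metric by metric. A compact side-by-side report is one possible alternative. This is a sketch only; `compare_row_counts` and `format_report` are hypothetical helpers, and a metric that was never loaded is reported with an observed value of `None`.

```python
def compare_row_counts(observed, expected):
    """Return one (metric, observed, expected, ok) tuple per expected metric."""
    rows = []
    for metric, want in expected.items():
        got = observed.get(metric)  # None when the metric was never loaded
        rows.append((metric, got, want, got == want))
    return rows


def format_report(rows):
    """Render the comparison as fixed-width text lines."""
    lines = [f"{'metric':<8}{'observed':>10}{'expected':>10}  ok"]
    for metric, got, want, ok in rows:
        flag = "yes" if ok else "NO"
        lines.append(f"{metric:<8}{str(got):>10}{want:>10}  {flag}")
    return "\n".join(lines)
```

In the notebook, `observed` could be built as `{m: len(df) for m, df in metric_dfs.items()}` and `expected` taken from `run_log["per_metric_rows"]`.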


# Section 4 — Final Integrity Check

## Explanation

**Claim being proven**  
Execution evidence, curated output, and metric outputs are internally consistent.

**Artifacts inspected**  
Run log + downloaded parquet files.

**What would invalidate the claim**  
- Any row count mismatch  
- Missing metric  
- `status != "success"`


In [11]:
assert run_log["status"] == "success", "Run did not complete successfully."

assert len(curated_df) == run_log["rows_curated"], "Curated row mismatch."

for m in metrics:
    assert len(metric_dfs[m]) == run_log["per_metric_rows"][m], f"{m} row count mismatch."

print("All integrity checks passed.")


All integrity checks passed.
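
One caveat about the cell above: Python removes `assert` statements when it is started with the `-O` flag, so the checks would silently do nothing in that mode. If the notebook were ever converted to a script, explicit exceptions would be a more robust choice. The following is a sketch under that assumption; `IntegrityError`, `require`, and `check_integrity` are hypothetical names, not part of the original notebook.

```python
class IntegrityError(Exception):
    """Raised when an evidence check fails (hypothetical name)."""


def require(condition, message):
    """Raise IntegrityError when condition is false; not affected by -O."""
    if not condition:
        raise IntegrityError(message)


def check_integrity(run_log, curated_rows, metric_rows):
    """Run the same three checks as the final cell, using explicit raises."""
    require(run_log["status"] == "success", "Run did not complete successfully.")
    require(curated_rows == run_log["rows_curated"], "Curated row mismatch.")
    for metric, expected in run_log["per_metric_rows"].items():
        require(metric_rows.get(metric) == expected, f"{metric} row count mismatch.")
    return "All integrity checks passed."
```

Here `curated_rows` would be `len(curated_df)` and `metric_rows` a dict of `len(df)` per metric.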
