# Lab 03: Generate Quality Report with Known Failures

Goal: Run validation on an intentionally flawed dataset, read the failures, and produce a Markdown report you can share.


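## 1) Install Pandera (if not already installed)
- Do: Run the install cell below.
- Why: Ensure the library is available for this session.
- You should see: Little or no output, because `-q` keeps pip quiet; the cell finishes without an error.
- If it doesn't look right: Rerun the cell and confirm the runtime has internet access.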


In [None]:
!pip -q install "pandera[pandas]"


## 2) Load schema and bad data
- Do: Add the Lab 02 `deliverables` folder to `sys.path`, import the schema from `validation_schema`, and load `../inputs/collection_with_failures.csv`.
- Why: Applying the same rules to a dataset with known issues shows how failures surface.
- You should see: A dataframe preview with some suspect values.
- If it doesn't look right: Check the paths; ensure the Lab 02 `deliverables` folder is accessible and contains the `validation_schema` module.


In [None]:
import sys
from pathlib import Path
import subprocess


def _find_repo_root(start: Path) -> Path:
    for candidate in [start] + list(start.parents):
        if (candidate / 'WORKSHOP_OVERVIEW.md').exists():
            return candidate
    return start


def _ensure_repo_root() -> Path:
    # Colab opens notebooks in /content without repo files. Clone so relative data imports work.
    if 'google.colab' in sys.modules:
        repo_root = Path('/content/data-workflows-workshop')
        if not repo_root.exists():
            subprocess.run(
                [
                    'git',
                    'clone',
                    '--depth',
                    '1',
                    'https://github.com/MSU-DHI-Lab/data-workflows-workshop.git',
                    str(repo_root),
                ],
                check=True,
            )
        return repo_root

    return _find_repo_root(Path.cwd().resolve())


REPO_ROOT = _ensure_repo_root()
LAB03_ROOT = REPO_ROOT / 'day-03-quality-gates-and-reuse/01-labs/lab-03'
LAB02_DELIVERABLES = REPO_ROOT / 'day-03-quality-gates-and-reuse/01-labs/lab-02/deliverables'

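# Make the Lab 02 deliverables importable so `validation_schema` resolves.
if str(LAB02_DELIVERABLES) not in sys.path:
    sys.path.append(str(LAB02_DELIVERABLES))

import pandas as pd
import importlib

# Reuse the schema object from the Lab 02 deliverables instead of redefining the rules here.
schema_module = importlib.import_module('validation_schema')
schema = schema_module.schema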

df_bad = pd.read_csv(LAB03_ROOT / 'inputs/collection_with_failures.csv')
df_bad.head()


## 3) Run validation and capture errors
- Do: Validate the bad dataframe in a try/except and collect the errors.
- Why: We expect failures; capturing them lets us report clearly.
- You should see: A failure cases table, taken from the raised `SchemaErrors`, that lists the column, check, failing value, and row index for each problem.
- If it doesn't look right: Ensure the schema import succeeded; check that the CSV has the expected columns.


In [None]:
import pandera as pa
from pandera.errors import SchemaErrors

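try:
    # lazy=True collects every failure instead of stopping at the first one.
    schema.validate(df_bad, lazy=True)
    validation_errors = None
    print('No failures found; check that the intended CSV was loaded.')
except SchemaErrors as err:
    # failure_cases is a dataframe with one row per failing value or check.
    validation_errors = err.failure_cases
    display(validation_errors)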


## 4) Generate a markdown report
- Do: Build a short report with counts and examples of failures.
- Why: Reports are artifacts for stakeholders, not only programmers; they tell the story of what failed and why.
- You should see: A Markdown report printed in the output and saved as `validation_report.md` in the Lab 03 folder.
- If it doesn't look right: Check that `validation_errors` is populated; ensure `lazy=True` was set so that all failures were collected.
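
Stakeholders often want totals per rule as well as sample rows. One possible extension, sketched below, groups failure cases by column and check before formatting them. The `example_failures` frame is made-up data that only borrows the four column names the report cell reads (`column`, `check`, `failure_case`, `index`); in the lab you would pass `validation_errors` instead.

```python
import pandas as pd

# Hypothetical failure cases; values and check names are invented for illustration.
example_failures = pd.DataFrame({
    "column": ["year", "year", "pages"],
    "check": ["in_range(1800, 2025)", "in_range(1800, 2025)", "greater_than_or_equal_to(1)"],
    "failure_case": [1500, 3000, 0],
    "index": [1, 4, 2],
})

# Count failure cases per (column, check) pair, largest group first.
summary = (
    example_failures.groupby(["column", "check"])
    .size()
    .reset_index(name="failures")
    .sort_values("failures", ascending=False)
)

# Format each group as one Markdown bullet.
summary_lines = ["## Failures by column and check"]
for row in summary.itertuples(index=False):
    summary_lines.append(f"- {row.column} | {row.check}: {row.failures}")

print("\n".join(summary_lines))
```

Appending `summary_lines` to the report before it is written would put the counts next to the sample failures.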


In [None]:
total_rows = len(df_bad)
# Each row of the failure cases table is one failing value or check, not one distinct check.
fail_count = len(validation_errors) if validation_errors is not None else 0
report_lines = [
    '# Quality Report',
    '',
    f'- Total rows: {total_rows}',
    f'- Failure cases: {fail_count}',
]
if validation_errors is not None:
    sample = validation_errors.head(10)
    report_lines += ['', '## Sample failures (first 10)', '']
    for _, row in sample.iterrows():
        report_lines.append(
            f"- Column: {row['column']} | Check: {row['check']} | "
            f"Failure: {row['failure_case']} | Index: {row['index']}"
        )
report = '\n'.join(report_lines) + '\n'
print(report)
report_path = LAB03_ROOT / 'validation_report.md'
# Write as UTF-8 so non-ASCII values in the data survive on any platform.
report_path.write_text(report, encoding='utf-8')
print(f'Saved report to {report_path}')


## 5) Reflect
- Do: Note which checks caught issues and why they matter.
- Why: Reflection helps you decide whether to fix the data, adjust the checks, or quarantine records.
- You should see: Your own notes summarizing next actions.
- If it doesn't look right: Review the failure cases table; tie each check to the problem it prevented.
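
If quarantine turns out to be the right next action, one way to do it is to split the dataframe on the row positions listed in the failure cases table. This is a minimal sketch on made-up data: `records`, `failures`, and `item_id` are hypothetical names, and it assumes that failures not tied to a specific row carry a missing `index` value, which is why they are dropped before the split.

```python
import pandas as pd

# Hypothetical records and failure cases; column names mirror the table used in step 4.
records = pd.DataFrame({"item_id": [10, 11, 12, 13], "year": [1999, 1500, 2001, 3000]})
failures = pd.DataFrame({
    "column": ["year", "year"],
    "check": ["in_range(1800, 2025)", "in_range(1800, 2025)"],
    "failure_case": [1500, 3000],
    "index": [1, 3],
})

# Rows whose index appears in the failure table go to quarantine; the rest pass through.
bad_index = failures["index"].dropna().unique()
quarantined = records.loc[records.index.isin(bad_index)]
clean = records.drop(index=quarantined.index)

print(f"{len(clean)} clean rows, {len(quarantined)} quarantined")
```

Keeping the quarantined rows in their own frame, rather than deleting them, leaves a record of what was held back and why.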
