# Day 01 Validation Notebook (Colab-friendly)

Use this notebook after exporting your cleaned CSV from OpenRefine. You'll upload the file, run light checks, and copy a short validation summary.

## 1) Upload your cleaned CSV
- Do: Run the cell below and choose the cleaned CSV you exported in Lab 02.
- Why: Colab runs on a hosted runtime, so it needs its own copy of the file to run the checks. The uploaded file stays scoped to your session.
- You should see: A file chooser, then the filename listed with a size.
- If it doesn't look right: Re-export the CSV from OpenRefine, confirm the header row exists, and retry the upload.

In [None]:
try:
    from google.colab import files
    upload = files.upload()  # Choose your cleaned CSV when prompted
except ModuleNotFoundError:
    upload = {}
    print("Not running in Colab. Set `clean_filename` in the next cell to your CSV path.")
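
To confirm what was uploaded, a small helper can list each filename with its size. This is a minimal sketch, not part of the original lab: `describe_uploads` is a hypothetical name, and it assumes `upload` maps each filename to the file's raw bytes, which is the shape `files.upload()` returns in Colab.

```python
def describe_uploads(upload):
    """Return one 'name: N bytes' line per uploaded file.

    Assumes `upload` is a dict of filename -> bytes (hypothetical helper).
    An empty dict, as set when not running in Colab, gives an empty list.
    """
    return [f"{name}: {len(content)} bytes" for name, content in upload.items()]
```

Calling `print("\n".join(describe_uploads(upload)))` after the cell above shows what arrived. If more than one line appears, note that Step 2 only reads the first file.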


## 2) Load the CSV and preview
- Do: Replace the filename below if needed, then run.
- Why: Confirms the file loads cleanly and the header matches expectations.
- You should see: The first five rows with normalized rights and places.
- If it doesn't look right: Check for extra header rows, ensure UTF-8 encoding, and verify the separator is a comma.

In [None]:
from pathlib import Path

import pandas as pd

# If you uploaded via Colab, this will use the uploaded filename.
# Otherwise, set this to where your Lab 02 export lives.
if upload:
    # If several files were uploaded, the first one is used.
    clean_filename = next(iter(upload))
else:
    clean_filename = '../lab-02/outputs/collection_cleaned.csv'

if not Path(clean_filename).exists():
    raise FileNotFoundError(
        f"Couldn't find {clean_filename!r}.\n\n"
        "In Colab: run Step 1 and upload your cleaned CSV.\n"
        "Locally: edit `clean_filename` to point at your exported file."
    )

df = pd.read_csv(clean_filename)

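required_columns = {'id', 'title', 'creator', 'place', 'rights', 'date'}
# Named `missing_columns` so it cannot be confused with the per-column
# missing-value counts that Step 3 stores in `missing`.
missing_columns = required_columns - set(df.columns)
if missing_columns:
    raise ValueError(
        "CSV loaded, but it's missing expected columns: "
        + ", ".join(sorted(missing_columns))
        + "\nFound columns: "
        + ", ".join(map(str, df.columns))
    )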

df.head()
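
If the preview looks wrong, the encoding and separator checks from the troubleshooting note can be sketched with the standard library alone. The helper below is an illustration, not part of the lab: `diagnose_csv` is a hypothetical name, and the list of candidate delimiters is an assumption.

```python
import csv
from pathlib import Path


def diagnose_csv(path):
    """Report whether a file decodes as UTF-8 and which delimiter it seems to use.

    Hypothetical helper. The candidate delimiters (comma, semicolon, tab,
    pipe) are an assumption about what an export might contain.
    """
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"utf8": False, "delimiter": None, "first_line": None}

    # Sniff only the start of the file; csv.Error means no delimiter was found.
    try:
        delimiter = csv.Sniffer().sniff(text[:4096], delimiters=",;\t|").delimiter
    except csv.Error:
        delimiter = None

    lines = text.splitlines()
    return {
        "utf8": True,
        "delimiter": delimiter,
        "first_line": lines[0] if lines else None,
    }
```

Running `diagnose_csv(clean_filename)` should report `utf8` as `True` and a comma delimiter for a clean export. The `first_line` value makes an extra or missing header row easy to spot.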

## 3) Check counts and missing values
- Do: Run to see row count and missing values per column.
- Why: Confirms that cleaning did not drop or duplicate rows unexpectedly.
- You should see: Row count matching your input (8 in the sample) and few or no missing values.
- If it doesn't look right: Re-run the OpenRefine export, confirm all rows were included, and inspect columns with many missing values.

In [None]:
row_count = len(df)
missing = df.isna().sum()
print(f'Rows: {row_count}')
print('Missing values per column:')
print(missing)
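
A matching row count does not rule out duplication, because one dropped row and one duplicated row cancel out. A minimal sketch of a direct check follows, assuming the `id` column is meant to be unique per record. That is an assumption about the dataset, and `duplicate_ids` is a hypothetical helper name.

```python
import pandas as pd


def duplicate_ids(frame: pd.DataFrame, id_column: str = "id") -> pd.DataFrame:
    """Return every row whose id value occurs more than once, sorted by id.

    Hypothetical helper; assumes ids should be unique.
    """
    # keep=False marks all members of a duplicated group, not just the repeats.
    repeated = frame[frame[id_column].duplicated(keep=False)]
    return repeated.sort_values(id_column)
```

`duplicate_ids(df)` returns an empty frame when every id is unique. Any rows it does return are worth inspecting in OpenRefine before re-exporting.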

## 4) Check distinct rights and place values
- Do: Run to list unique rights tokens and places.
- Why: Confirms normalization stuck and no stray variants remain.
- You should see: Rights limited to the chosen statements; places limited to the normalized forms.
- If it doesn't look right: Reapply the OpenRefine operations file, re-export, and rerun this notebook.

In [None]:
print('Distinct rights:', sorted(df['rights'].dropna().unique()))
print('Distinct places:', sorted(df['place'].dropna().unique()))
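
Reading the lists by eye works for a small sample. For larger files, the comparison against the chosen statements can be sketched as a set difference. `unexpected_values` is a hypothetical helper, and the allowed values you pass in are whatever your own normalization targeted; none are assumed here.

```python
import pandas as pd


def unexpected_values(series: pd.Series, allowed) -> list:
    """Return the sorted distinct values in `series` that are not in `allowed`.

    Hypothetical helper. Missing values are ignored, and matching is exact,
    so case or whitespace variants count as unexpected.
    """
    found = set(series.dropna().unique())
    return sorted(found - set(allowed))
```

For example, `unexpected_values(df['rights'], allowed_rights)` returns an empty list when normalization held, where `allowed_rights` is a set you define from your chosen rights statements.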

## 5) Validation summary
- Do: Run to print a short summary you can copy into your deliverables notes.
- Why: Captures evidence of what you checked immediately after cleaning.
- You should see: A compact block with row count, missing counts, and distinct values.
- If it doesn't look right: Confirm the DataFrame loaded (Step 2), rerun the checks in Steps 3 and 4, then rerun this cell.

In [None]:
summary = [
    'Validation Summary:',
    '- Rows: {}'.format(row_count),
    '- Missing values: {}'.format(missing.to_dict()),
    '- Rights tokens: {}'.format(sorted(df['rights'].dropna().unique())),
    '- Places: {}'.format(sorted(df['place'].dropna().unique()))
]
print('\n'.join(summary))
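
If copying from the cell output is awkward, the same lines can be written to a text file. This is an optional sketch, not part of the original lab: `save_summary` and the default filename `validation_summary.txt` are hypothetical.

```python
from pathlib import Path


def save_summary(lines, path="validation_summary.txt"):
    """Write the summary lines to a UTF-8 text file and return its path.

    Hypothetical helper; the default filename is an assumption.
    """
    out = Path(path)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling `save_summary(summary)` writes the file next to the notebook. In Colab that location is session storage, so download the file before the session ends if you want to keep it.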
