# Lab 01: Profile the Dataset

Goal: build professional curiosity about the dataset before writing checks. You'll look at shape, missingness, and unique values.


## 1) Load the CSV
- Do: Run the cell to load `inputs/collection_cleaned.csv`.
- Why: Profiling starts with a clean load so you trust the frame you're inspecting.
- You should see: A dataframe preview with 4 rows and 6 columns.
- If it doesn't look right: Confirm the file is CSV with headers; check the path; rerun this cell.


In [None]:
import subprocess
import sys
from pathlib import Path

import pandas as pd


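def _find_repo_root(start: Path) -> Path:
    # Walk up from `start` until a directory holding the workshop marker file is found.
    for candidate in [start] + list(start.parents):
        if (candidate / 'WORKSHOP_OVERVIEW.md').exists():
            return candidate
    # No marker found: fall back to the starting directory.
    return start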


def _ensure_repo_root() -> Path:
    # Colab opens notebooks without the repo files present. Clone so relative lab inputs exist.
    if 'google.colab' in sys.modules:
        repo_root = Path('/content/data-workflows-workshop')
        if not repo_root.exists():
            subprocess.run(
                [
                    'git',
                    'clone',
                    '--depth',
                    '1',
                    'https://github.com/MSU-DHI-Lab/data-workflows-workshop.git',
                    str(repo_root),
                ],
                check=True,
            )
        return repo_root

    return _find_repo_root(Path.cwd().resolve())


REPO_ROOT = _ensure_repo_root()
LAB01_ROOT = REPO_ROOT / 'day-03-quality-gates-and-reuse/01-labs/lab-01'

csv_path = LAB01_ROOT / 'inputs/collection_cleaned.csv'
df = pd.read_csv(csv_path)
df.head()
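
If the preview looks off, a quick sanity check along these lines can narrow down the cause. This is a minimal sketch on an in-memory CSV rather than the lab file; the helper `load_and_check` and the columns `id` and `title` are hypothetical and only there for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the lab CSV (the real file has its own columns and values).
sample_csv = io.StringIO("id,title\n1,First item\n2,Second item\n")


def load_and_check(source, min_rows=1):
    """Load a CSV and fail early if it is empty or has columns without headers."""
    frame = pd.read_csv(source)
    if len(frame) < min_rows:
        raise ValueError(f"expected at least {min_rows} row(s), got {len(frame)}")
    # pandas labels blank header cells 'Unnamed: N', a common sign of a bad header row.
    unnamed = [c for c in frame.columns if str(c).startswith("Unnamed:")]
    if unnamed:
        raise ValueError(f"columns without headers: {unnamed}")
    return frame


sample_df = load_and_check(sample_csv)
print(sample_df.shape)
```

The same function could be pointed at `csv_path` from the cell above, assuming the lab file is present.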


## 2) Check shape and dtypes
- Do: Run to see row/column counts and data types.
- Why: Confirms the expected structure before writing validation.
- You should see: `(4, 6)`, with text fields shown as `object` (pandas' usual dtype for strings) and the date column as an integer dtype if it parsed as a number.
- If it doesn't look right: Check for extra header rows; ensure date column is numeric or castable.


In [None]:
df.shape, df.dtypes
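
To test whether a date column is "castable", one option is to coerce it to numbers and look at what fails. The sketch below uses a small made-up frame; the column name `date` and its values are assumptions, not the lab data.

```python
import pandas as pd

# Hypothetical frame: 'date' holds years as text, with one value that is not a number.
toy = pd.DataFrame(
    {"id": ["a1", "a2", "a3"], "date": ["1901", "1923", "circa 1950"]}
)

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(toy["date"], errors="coerce")

# Rows where the cast failed are the ones to inspect by hand.
not_castable = toy.loc[as_number.isna(), "date"].tolist()
print(not_castable)
```

An empty list here would suggest the column can be cast safely; anything else is a candidate for cleanup or for a documented exception.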


## 3) Missingness scan
- Do: Run to count missing values per column.
- Why: Guides which checks to add (e.g., id cannot be missing).
- You should see: Zeros across key fields in this sample.
- If it doesn't look right: Inspect columns with missing values; confirm they are expected.


In [None]:
df.isna().sum()
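
When a column does show missing values, it can help to see the share of rows affected and the rows themselves. This is a sketch on a made-up frame, assuming hypothetical `id` and `place` columns.

```python
import pandas as pd

# Hypothetical frame with gaps; None is treated as missing by pandas.
toy = pd.DataFrame(
    {"id": ["a1", "a2", None], "place": ["Albany", None, None]}
)

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

# Rows with at least one missing value, for manual inspection.
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_count)
print(rows_with_gaps)
```

Note that `isna()` only counts true missing values; a cell containing just spaces is a non-empty string and will not be counted.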


## 4) Unique values for rights and place
- Do: Run to see allowed tokens and place variants.
- Why: Informs allowed lists and normalization expectations for validation.
- You should see: Rights tokens ['CC BY 4.0', 'Public Domain', 'Rights Reserved']; places ['Albany', 'New York City']. The order may differ, since `unique()` lists values in order of first appearance.
- If it doesn't look right: Check for trailing spaces; adjust allowed lists later in validation.


In [None]:
df['rights'].unique(), df['place'].unique()
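
Trailing spaces are easy to miss in printed output. One way to surface them is to compare the raw tokens against a stripped copy and against the expected list. The frame below is made up for illustration; the `allowed` set reuses the rights tokens listed in this step.

```python
import pandas as pd

# Hypothetical frame: the second value carries a trailing space.
toy = pd.DataFrame({"rights": ["CC BY 4.0", "CC BY 4.0 ", "Public Domain"]})

raw_tokens = sorted(toy["rights"].unique())
clean_tokens = sorted(toy["rights"].str.strip().unique())

# Tokens outside the expected list; repr() makes stray whitespace visible.
allowed = {"CC BY 4.0", "Public Domain", "Rights Reserved"}
unexpected = sorted(set(toy["rights"]) - allowed)

print(len(raw_tokens), len(clean_tokens))
print([repr(token) for token in unexpected])
```

If the raw and stripped counts differ, at least one value differs only by surrounding whitespace.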
