# 01 Data Loading & Validation
**Goal:** Load processed tables, validate schema expectations, and surface missingness.

This notebook:
- Loads the AI Incident Database snapshot
- Normalizes schema
- Audits coverage across taxonomies
- Evaluates data completeness

In [23]:
from pathlib import Path
import pandas as pd

DATA = Path("../data")
OUT = Path("../outputs/figures")
OUT.mkdir(parents=True, exist_ok=True)

inc = pd.read_csv(DATA / "incidents.csv")
rep = pd.read_csv(DATA / "reports.csv")
sub = pd.read_csv(DATA / "submissions.csv")

def read_optional(name):
    # Taxonomy files are optional; return None when the file is absent.
    path = DATA / name
    return pd.read_csv(path) if path.exists() else None

mit = read_optional("classifications_MIT.csv")
gmf = read_optional("classifications_GMF.csv")
cset = read_optional("classifications_CSETv1.csv")

print("Incidents:", inc.shape)
print("Reports:", rep.shape)
print("Submissions:", sub.shape)
print("MIT:", None if mit is None else mit.shape)
print("GMF:", None if gmf is None else gmf.shape)
print("CSET:", None if cset is None else cset.shape)

# normalize columns
inc.columns = [c.strip().lower() for c in inc.columns]
rep.columns = [c.strip().lower() for c in rep.columns]
sub.columns = [c.strip().lower() for c in sub.columns]

if mit is not None: mit.columns = [c.strip().lower() for c in mit.columns]
if gmf is not None: gmf.columns = [c.strip().lower() for c in gmf.columns]
if cset is not None: cset.columns = [c.strip().lower() for c in cset.columns]

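# MIT sometimes names the key "incident id"; rename it to match incidents.csv.
# GMF and CSET keep their original key names, which pick_incident_col resolves later.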
if mit is not None and "incident id" in mit.columns:
    mit = mit.rename(columns={"incident id": "incident_id"})

Incidents: (1367, 9)
Reports: (6687, 21)
Submissions: (45, 15)
MIT: (1242, 8)
GMF: (326, 21)
CSET: (214, 65)
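
The stated goal includes validating schema expectations, but the cell above only prints shapes. A minimal sketch of an explicit check follows. It assumes the nine lowercase incident columns that appear in the missingness output later in this notebook are the expected schema; `EXPECTED_INC_COLS` and `validate_schema` are hypothetical names, not part of the notebook, and the toy frame stands in for `inc`.

```python
import pandas as pd

# Hypothetical expectation: the nine incident columns seen in this snapshot.
EXPECTED_INC_COLS = {
    "_id", "incident_id", "date", "reports",
    "alleged deployer of ai system",
    "alleged developer of ai system",
    "alleged harmed or nearly harmed parties",
    "description", "title",
}

def validate_schema(df, expected_cols, key):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing_cols = sorted(set(expected_cols) - set(df.columns))
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
    if key in df.columns:
        if df[key].isna().any():
            problems.append(f"null values in key column '{key}'")
        if df[key].duplicated().any():
            problems.append(f"duplicate values in key column '{key}'")
    return problems

# Toy frame standing in for `inc`; in the notebook, pass `inc` itself.
demo_schema = pd.DataFrame({"incident_id": [1, 2, 2], "title": ["a", "b", "c"]})
demo_problems = validate_schema(demo_schema, {"incident_id", "title", "date"}, "incident_id")
print(demo_problems)
```

In the notebook this could be called as `validate_schema(inc, EXPECTED_INC_COLS, "incident_id")` after the column normalization step, failing fast if the returned list is not empty.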


In [24]:
print("inc keys:", [c for c in inc.columns if "incident" in c or c.endswith("_id")])
print("rep keys:", [c for c in rep.columns if "incident" in c or c.endswith("_id")])
print("sub keys:", [c for c in sub.columns if "incident" in c or "url" in c or c.endswith("_id")])
print("mit keys:", [] if mit is None else [c for c in mit.columns if "incident" in c or c.endswith("_id")])

inc keys: ['_id', 'incident_id']
rep keys: ['_id']
sub keys: ['image_url', 'incident_date', 'incident_id', 'mongodb_id', 'url']
mit keys: ['incident_id']


In [29]:
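def pick_incident_col(df, label):
    """Return the name of the incident-id column in df.

    Side effect: normalizes df.columns in place (stripped, lowercased).
    """
    cols = [c.strip().lower() for c in df.columns]
    df.columns = cols
    # 1) Exact matches on the known spellings.
    for cand in ["incident_id", "incident id", "incidentid"]:
        if cand in cols:
            return cand
    # 2) Any column mentioning both "incident" and "id". The word "incident"
    #    itself contains "id", so look for "id" outside of it.
    for c in cols:
        if "incident" in c and "id" in c.replace("incident", ""):
            return c
    # 3) Last resort: any column mentioning "incident".
    for c in cols:
        if "incident" in c:
            return c
    raise KeyError(f"[{label}] Could not find incident id column.")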

incident_ids = set(inc["incident_id"])

mit_inc_col = pick_incident_col(mit, "MIT") if mit is not None else None
gmf_inc_col = pick_incident_col(gmf, "GMF") if gmf is not None else None
cset_inc_col = pick_incident_col(cset, "CSET") if cset is not None else None

print("MIT incident column:", mit_inc_col)
print("GMF incident column:", gmf_inc_col)
print("CSET incident column:", cset_inc_col)

mit_cov = (mit[mit_inc_col].nunique() / len(incident_ids)) if mit is not None else 0
gmf_cov = (gmf[gmf_inc_col].nunique() / len(incident_ids)) if gmf is not None else 0
cset_cov = (cset[cset_inc_col].nunique() / len(incident_ids)) if cset is not None else 0

print(f"MIT coverage:  {mit_cov:.1%}")
print(f"GMF coverage:  {gmf_cov:.1%}")
print(f"CSET coverage: {cset_cov:.1%}")

print("Total reports (rows):", len(rep))
print("Unique report URLs:", rep["url"].nunique() if "url" in rep.columns else "n/a")
print("Unique source domains:", rep["source_domain"].nunique() if "source_domain" in rep.columns else "n/a")
print("Submissions rows (auxiliary):", len(sub))
print("Note: This snapshot has no incident↔report mapping table, so per-incident report counts are not reproducible here.")

MIT incident column: incident_id
GMF incident column: incident id
CSET incident column: incident id
MIT coverage:  90.9%
GMF coverage:  23.8%
CSET coverage: 15.7%
Total reports (rows): 6687
Unique report URLs: 5846
Unique source domains: 1781
Submissions rows (auxiliary): 45
Note: This snapshot has no incident↔report mapping table, so per-incident report counts are not reproducible here.
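
The coverage ratios above divide the number of distinct IDs in each taxonomy file by the number of incidents. That would overstate coverage if a taxonomy file referenced IDs that are absent from `incidents.csv`. A sketch of a stricter variant that counts only the overlap is shown below; `coverage_report` is a hypothetical helper, and the toy values stand in for `incident_ids` and a taxonomy table.

```python
import pandas as pd

def coverage_report(incident_ids, tax_df, id_col):
    """Coverage restricted to IDs that exist in the incidents table."""
    tax_ids = set(tax_df[id_col].dropna())
    matched = tax_ids & incident_ids
    return {
        "coverage": len(matched) / len(incident_ids),
        # IDs present in the taxonomy file but not in the incidents table.
        "orphan_ids": sorted(tax_ids - incident_ids),
    }

# Toy data: ID 99 does not exist in the incident set.
demo_inc_ids = {1, 2, 3, 4}
demo_tax = pd.DataFrame({"incident id": [1, 1, 2, 99]})
demo_cov = coverage_report(demo_inc_ids, demo_tax, "incident id")
print(f"coverage: {demo_cov['coverage']:.1%}, orphans: {demo_cov['orphan_ids']}")
```

In the notebook this would be called as `coverage_report(incident_ids, mit, mit_inc_col)`. When `orphan_ids` comes back empty, the result equals the `nunique()` ratio used above. The comparison assumes both ID columns share a dtype; integer IDs in one table and strings in the other would never match.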


In [30]:
missing = inc.isna().mean().sort_values(ascending=False)
missing.head(25)

_id                                        0.0
incident_id                                0.0
date                                       0.0
reports                                    0.0
alleged deployer of ai system              0.0
alleged developer of ai system             0.0
alleged harmed or nearly harmed parties    0.0
description                                0.0
title                                      0.0
dtype: float64
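
The listing above shows a fully populated `reports` column in the incidents table. If that column stores each incident's report numbers as a stringified list (an assumption not checked in this notebook), it may offer a partial route to per-incident report counts even without a separate mapping table. A sketch under that assumption follows, with `count_reports` as a hypothetical helper and made-up rows.

```python
import ast
import pandas as pd

def count_reports(cell):
    """Count report numbers in a cell such as "[1, 2, 3]"; None if unparseable."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    return len(parsed) if isinstance(parsed, (list, tuple)) else None

# Made-up rows; the real cell format should be inspected before relying on this.
demo_reports = pd.DataFrame({
    "incident_id": [1, 2, 3],
    "reports": ["[10, 11, 12]", "[13]", "not a list"],
})
demo_reports["n_reports"] = demo_reports["reports"].map(count_reports)
print(demo_reports[["incident_id", "n_reports"]])
```

Rows where `n_reports` is missing would indicate cells that do not follow the assumed format and should be inspected by hand.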

In [31]:
high_missing = missing[missing > 0.5]
print(f"Columns with >50% missingness: {len(high_missing)}")
high_missing

Columns with >50% missingness: 0


Series([], dtype: float64)

### Missingness Assessment (Incidents Table)

We evaluated column-level missingness in `incidents.csv` to assess structural completeness and downstream analytical reliability.

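**Finding:**  
No column exceeds the 50% missingness threshold. All nine columns in `incidents.csv` report 0% missing values under `isna()`.

**Implications:**
- Every incident row has a value in every column, so the table is complete at the level `isna()` can detect.
- Core identifiers (`_id`, `incident_id`) and metadata fields (`date`, `title`, `description`) are fully populated.
- Descriptive trend analysis can proceed without imputation.
- Null values will not bias high-level distributional statistics, though placeholder entries that pandas does not parse as NaN are not captured by this check.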

**Analytical Decision:**  
All incident-level variables are retained for descriptive and comparative analysis. No columns are excluded due to sparsity in this snapshot.
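
`isna()` only counts values that pandas parsed as NaN. `read_csv` maps empty fields to NaN by default, but placeholder strings or whitespace-only entries would still be counted as populated. A sketch of a broader audit is shown below. The tokens in `PLACEHOLDERS` are guesses for illustration, not values confirmed for this dataset, and `effective_missing` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust after inspecting the real values.
PLACEHOLDERS = {"", "[]", "unknown", "n/a", "none"}

def effective_missing(df):
    """Share of cells per column that are NaN or a placeholder string."""
    def is_blank(value):
        if pd.isna(value):
            return True
        return isinstance(value, str) and value.strip().lower() in PLACEHOLDERS
    return df.apply(lambda col: col.map(is_blank).mean()).sort_values(ascending=False)

# Toy frame: three of the four "deployer" cells are effectively empty.
demo_missing = pd.DataFrame({
    "title": ["Incident A", "Incident B", "Incident C", "Incident D"],
    "deployer": ["Acme", "[]", " Unknown ", None],
})
print(effective_missing(demo_missing))
```

Running `effective_missing(inc)` next to `inc.isna().mean()` would show whether the 0% figures hold once placeholders are counted.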