# Privacy & Governance: NovaCred Credit Applications

**Dataset:** `data/cleaned_credit_applications.csv`, pre-cleaned by `01-data-quality.ipynb`

## Scope

This notebook covers:
1. **PII identification:** list all personally identifiable fields (`full_name`, `email`, `ssn`, `ip_address`, `date_of_birth`, `zip_code`)
2. **Pseudonymization / anonymization:** demonstrate on at least one PII column
3. **GDPR mapping:** lawful basis (Art. 6), data minimisation (Art. 5(1)(c)), storage limitation (Art. 5(1)(e)), right to erasure (Art. 17)
4. **EU AI Act:** classify the credit scoring system and note its obligations


In [6]:
import pandas as pd
import hashlib

# Load pre-cleaned dataset from the data quality pipeline
df = pd.read_csv("../data/cleaned_credit_applications.csv")

# PII fields identified in the dataset schema (Table 1 of project description)
PII_FIELDS = [
    'applicant_info.full_name',     # direct identifier
    'applicant_info.email',         # direct identifier
    'applicant_info.ssn',           # highly sensitive unique government ID
    'applicant_info.ip_address',    # online identifier (GDPR Recital 30); can link to a device or approximate location
    'applicant_info.date_of_birth', # quasi-identifier (can re-identify when combined with other fields)
    'applicant_info.zip_code',      # quasi-identifier (geographic)
]

print(f"Loaded {len(df)} records")
print(f"\nPII fields present in dataset ({len(PII_FIELDS)}):")
for f in PII_FIELDS:
    present = f in df.columns
    pct_filled = df[f].notna().mean() if present else 0
    print(f"  {'OK' if present else 'MISSING':<8} {f:<45} {pct_filled:.0%} populated")

Loaded 500 records

PII fields present in dataset (6):
  OK       applicant_info.full_name                      100% populated
  OK       applicant_info.email                          98% populated
  OK       applicant_info.ssn                            99% populated
  OK       applicant_info.ip_address                     99% populated
  OK       applicant_info.date_of_birth                  99% populated
  OK       applicant_info.zip_code                       100% populated
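
The pseudonymization step listed in the scope could be approached with a keyed hash. The block below is a minimal sketch, not the notebook's actual method: it assumes HMAC-SHA256 is an acceptable technique, and the names `PSEUDONYM_KEY` and `pseudonymize` as well as the sample values are hypothetical and made up for illustration. A keyed hash is used rather than a plain `hashlib.sha256` digest because SSNs have a small, structured value space, so unkeyed hashes could be reversed by enumerating candidates.

```python
import hashlib
import hmac

# Hypothetical secret key (assumption): in practice it would be loaded from a
# secrets store and kept separate from the dataset.
PSEUDONYM_KEY = b"replace-with-a-secret-key"


def pseudonymize(value, key=PSEUDONYM_KEY):
    """Return a keyed SHA-256 hex digest of `value`, or None if it is missing."""
    # `value != value` is True only for NaN, which is how pandas.read_csv
    # represents missing cells by default.
    if value is None or value != value:
        return None
    # Normalize so that trivial formatting differences map to the same token.
    normalized = str(value).strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up sample values, not taken from the dataset.
samples = ["123-45-6789", " 123-45-6789 ", "987-65-4321", None]
tokens = [pseudonymize(s) for s in samples]
```

Under these assumptions, the function could be applied to a column with something like `df['applicant_info.ssn'].map(pseudonymize)`. The same input always yields the same token, so records stay linkable across tables without exposing the raw value. Because anyone holding the key can recompute the mapping, the result is pseudonymized rather than anonymized data and still counts as personal data under GDPR (Recital 26).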
