# NovaCred Credit Application — Governance Assessment

**Role:** Governance Officer  

> This notebook covers the governance layer of the NovaCred audit: PII identification, GDPR compliance mapping, EU AI Act classification, pseudonymization demonstration, and actionable governance recommendations.

---
## 1. Setup & Data Loading

In [9]:
import json
import hashlib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["novacred"]
collection = db["credit_applications"]

# Load JSON 
data_path = Path('../data/raw_credit_applications.json')
with open(data_path, 'r') as f:
    raw_data = json.load(f)

In [None]:
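# Rename the source "_id" to "old_id" so MongoDB can assign its own ObjectId
for doc in raw_data:
    if "_id" in doc:
        doc["old_id"] = doc.pop("_id")

# Insert only into an empty collection so re-running the cell does not duplicate records
if collection.count_documents({}) == 0:
    collection.insert_many(raw_data)
else:
    print("Collection already populated — skipping insert.")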


print(f"Total records in collection: {collection.count_documents({})}")

# Preview one document
collection.find_one()

Collection already populated — skipping insert.
Total records in collection: 502


{'_id': ObjectId('69a1a95b3a6b5d003e46fb20'),
 'applicant_info': {'full_name': 'Jerry Smith',
  'email': 'jerry.smith17@hotmail.com',
  'ssn': '596-64-4340',
  'ip_address': '192.168.48.155',
  'gender': 'Male',
  'date_of_birth': '2001-03-09',
  'zip_code': '10036'},
 'financials': {'annual_income': 73000,
  'credit_history_months': 23,
  'debt_to_income': 0.2,
  'savings_balance': 31212},
 'spending_behavior': [{'category': 'Shopping', 'amount': 480},
  {'category': 'Rent', 'amount': 790},
  {'category': 'Alcohol', 'amount': 247}],
 'decision': {'loan_approved': False,
  'rejection_reason': 'algorithm_risk_score'},
 'processing_timestamp': '2024-01-15T00:00:00Z',
 'old_id': 'app_200'}

---
## 2. PII Identification



In [None]:
# MongoDB-based PII scan — counts & percentages
import re
from collections import Counter

total = collection.count_documents({})
sample_docs = list(collection.find().limit(500))
if not sample_docs:
    raise RuntimeError('No documents sampled from MongoDB collection.')

def flatten_keys(doc, prefix=''):
    keys = set()
    for k, v in doc.items():
        path = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            keys |= flatten_keys(v, path)
        elif isinstance(v, list):
            keys.add(path)
            if v and isinstance(v[0], dict):
                for subk in v[0].keys():
                    keys.add(f"{path}.{subk}")
        else:
            keys.add(path)
    return keys

all_keys = set()
for d in sample_docs:
    all_keys |= flatten_keys(d)
all_keys = sorted(all_keys)

stats = []
for field in all_keys:
    try:
        exists = collection.count_documents({field: {'$exists': True}})
    except Exception:
        exists = 0
    stats.append(dict(field=field, exists=exists))

# ── Print table ──────────────────────────────────────────────────────────────
print(f"Total records in collection: {total}\n")
header = f"{'Field':<45} {'Present':>14}"
print(header)
print('─' * len(header))

for s in sorted(stats, key=lambda x: x['field']):
    if s['exists'] > 0:
        n = s['exists']
        print(f"{s['field'][:45]:<45} {n:>5} ({n/total*100:5.1f}%)")

# ── PII summary ───────────────────────────────────────────────────────────────
print("\n── PII Field Coverage Summary ──────────────────────────────────────────")
email_pat = r'^[\w\.\-\+]+@[\w\.\-]+\.[a-zA-Z]{2,}$'
ssn_pat   = r'^\d{3}-\d{2}-\d{4}$'
ip_pat    = r'^\d{1,3}(?:\.\d{1,3}){3}$'
date_pat  = r'^\d{4}-\d{2}-\d{2}$|^\d{2}/\d{2}/\d{4}$|^\d{2}/\d{2}/\d{2,4}$'
zip_pat   = r'^\d{3,5}$'

pii_fields = {
    'full_name  (direct id)': ('applicant_info.full_name',      None),
    'email      (direct id)': ('applicant_info.email',          email_pat),
    'SSN        (direct id)': ('applicant_info.ssn',            ssn_pat),
    'ip_address (direct id)': ('applicant_info.ip_address',     ip_pat),
    'date_of_birth (quasi)':  ('applicant_info.date_of_birth',  date_pat),
    'zip_code   (quasi)':     ('applicant_info.zip_code',       zip_pat),
    'gender     (quasi)':     ('applicant_info.gender',         None),
}

for label, (field, pat) in pii_fields.items():
    if pat:
        n = collection.count_documents({field: {'$regex': pat, '$options': 'i'}})
    else:
        # '$nin' excludes both None and '' (a dict literal cannot hold two '$ne' keys)
        n = collection.count_documents({field: {'$exists': True, '$nin': [None, '']}})
    bar = '█' * int(n / total * 40)
    print(f"  {label:<30} {n:>4}/{total}  ({n/total*100:5.1f}%)  {bar}")

# ── Sensitive spending categories ────────────────────────────────────────────
print("\n── Sensitive Spending Categories (Article 9 risk) ──────────────────────")
sensitive_cats = {'Healthcare', 'Gambling', 'Adult Entertainment', 'Alcohol'}
cat_counts = Counter()
records_with_sensitive = set()
for doc in collection.find({}, {'spending_behavior': 1}):
    sb = doc.get('spending_behavior', [])
    if isinstance(sb, list):
        for it in sb:
            if isinstance(it, dict):
                cat = it.get('category', '')
                cat_counts[cat] += 1
                if cat in sensitive_cats:
                    records_with_sensitive.add(str(doc['_id']))

print(f"  Records containing at least one sensitive category: "
      f"{len(records_with_sensitive)}/{total} ({len(records_with_sensitive)/total*100:.1f}%)\n")
for cat in sorted(sensitive_cats):
    n = cat_counts[cat]
    bar = '█' * int(n / total * 40)
    print(f"  {cat:<25} {n:>4} entries  {bar}")

print("\n  All spending categories (entry count):")
for cat, cnt in cat_counts.most_common():
    print(f"    {cat:<25} {cnt:>4}")


### PII Inventory (based on GDPR Article 4(1))
- **Direct identifiers (High risk)**: `applicant_info.full_name`, `applicant_info.email`, `applicant_info.ssn`, `applicant_info.ip_address`, and the original application ID (source `_id`, stored as `old_id`).


- **Quasi-identifiers (Moderate risk)**: `applicant_info.date_of_birth`, `applicant_info.zip_code`, `applicant_info.gender`, `spending_behavior` categories (when granular).  


- **Financial & decision attributes (Personal; contextually sensitive)**: `financials.annual_income`, `financials.credit_history_months`, `financials.debt_to_income`, `financials.savings_balance`, `decision.loan_approved`, `decision.interest_rate`, `decision.approved_amount`, `loan_purpose`.


- **Potential Article 9 risks (Requires DPIA / higher protection)**: spending categories such as `Healthcare`, `Gambling`, `Adult Entertainment` can enable sensitive inferences (health, addictions, sexual behaviour).  
  - Risk: inferred special-category data — treat as sensitive.  

- **System / metadata**: `processing_timestamp`, database ObjectId (if kept), ingestion logs — personal when linked to an individual.  
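
The inventory above can be captured as a small lookup so that later steps (masking, export, retention) can act on a field's tier. The following is a minimal sketch, assuming the dotted field paths reported by the scan. The tier names, `PII_TIERS`, and the `classify_fields` helper are illustrative and not part of the NovaCred codebase.

```python
# Hypothetical tier map derived from the inventory above; tier names are illustrative.
PII_TIERS = {
    "applicant_info.full_name": "direct",
    "applicant_info.email": "direct",
    "applicant_info.ssn": "direct",
    "applicant_info.ip_address": "direct",
    "old_id": "direct",
    "applicant_info.date_of_birth": "quasi",
    "applicant_info.zip_code": "quasi",
    "applicant_info.gender": "quasi",
    "spending_behavior": "quasi",
}


def flatten_paths(doc, prefix=""):
    """Yield dotted paths for every leaf; lists are treated as a single leaf."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_paths(value, path)
        else:
            yield path


def classify_fields(doc):
    """Group a document's field paths by tier; unlisted fields fall under 'other'."""
    tiers = {}
    for path in flatten_paths(doc):
        tiers.setdefault(PII_TIERS.get(path, "other"), []).append(path)
    return tiers


example = {
    "applicant_info": {"full_name": "Jerry Smith", "zip_code": "10036"},
    "financials": {"annual_income": 73000},
    "spending_behavior": [{"category": "Rent", "amount": 790}],
}
tiers = classify_fields(example)
print(tiers)
```

Keeping the map in one place means a new field shows up under `other` until someone classifies it, which makes unreviewed fields easy to spot.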


### Why This Dataset Is Unsafe in a Data Breach (GDPR Perspective)

All 502 records store direct identifiers (name, SSN, email), financial data, and sensitive spending categories **in plaintext in a single document**. A single compromised credential exposes everything at once, with no encryption, hashing, or separation to limit the damage. A breach of this kind would require notifying the supervisory authority within 72 hours (Art. 33) and, because the risk to individuals is high, notifying the affected individuals directly (Art. 34). Fines for the underlying security failures can reach €10M or 2% of worldwide annual turnover, whichever is higher, under Art. 83(4).

---
## 3. GDPR Compliance Mapping

### Lawful Basis — Art. 6 & Art. 22

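Credit application processing relies on **Art. 6(1)(b)**: it is necessary for the performance of a contract, or for steps taken at the applicant's request before entering into one. SSN collection is better grounded in **Art. 6(1)(c)** (legal obligation), for example where identity verification is required by law, and must be strictly limited to that purpose.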

Automated rejections (e.g., `algorithm_risk_score`) trigger **Art. 22**: where a decision is solely automated and produces legal or similarly significant effects, applicants have the right to obtain human intervention, express their point of view, and contest the decision.
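
As a rough illustration of how Art. 22 cases could be surfaced, the sketch below flags records whose rejection appears to come from the algorithm alone. It assumes that `decision.rejection_reason == "algorithm_risk_score"` marks a solely automated decision, which the dataset suggests but does not confirm. `needs_human_review` is a hypothetical helper, and the second record is made up for the example.

```python
def needs_human_review(doc):
    """Return True when a rejection appears to be solely automated."""
    decision = doc.get("decision") or {}
    return (
        decision.get("loan_approved") is False
        and decision.get("rejection_reason") == "algorithm_risk_score"
    )


records = [
    {
        "old_id": "app_200",
        "decision": {"loan_approved": False, "rejection_reason": "algorithm_risk_score"},
    },
    # Made-up approved record, included only to show a non-match
    {"old_id": "example_approved", "decision": {"loan_approved": True}},
]

review_queue = [r["old_id"] for r in records if needs_human_review(r)]
print(review_queue)
```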

### Data Minimization — Art. 5(1)(c)

Several fields exceed what is strictly necessary for credit assessment:

- **`ip_address`** — irrelevant to creditworthiness; should not be retained after submission.
- **`spending_behavior` categories** — granular categories (`Healthcare`, `Gambling`, `Adult Entertainment`) enable sensitive inferences beyond what scoring requires; aggregated totals suffice.
- **`ssn`** — needed only for identity verification at intake; must not persist in the operational collection afterwards.
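
The three points above could be applied before a record reaches the operational collection. The following is a minimal sketch on a plain dictionary, assuming that `amount` values are numeric. `minimise_record` and the `spending_total` field are hypothetical names introduced for illustration.

```python
import copy


def minimise_record(doc):
    """Return a copy without ip_address and ssn, with spending reduced to one total."""
    slim = copy.deepcopy(doc)  # leave the caller's document untouched
    applicant = slim.get("applicant_info") or {}
    applicant.pop("ip_address", None)
    applicant.pop("ssn", None)

    spending = slim.pop("spending_behavior", None)
    if isinstance(spending, list):
        # Assumes numeric amounts; category names are dropped entirely
        slim["spending_total"] = sum(
            item.get("amount", 0) for item in spending if isinstance(item, dict)
        )
    return slim


raw = {
    "applicant_info": {"ssn": "596-64-4340", "ip_address": "192.168.48.155", "zip_code": "10036"},
    "spending_behavior": [
        {"category": "Shopping", "amount": 480},
        {"category": "Rent", "amount": 790},
        {"category": "Alcohol", "amount": 247},
    ],
}
slim = minimise_record(raw)
print(slim)
```

Working on a deep copy keeps the raw intake record available for the one-off identity check while the stored version carries only what scoring needs.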

### Storage Limitation — Art. 5(1)(e)

No retention schedule exists in the current system. Recommended policy:

- **Rejected applications** — retain for the statutory minimum (typically 5 years under EU financial regulation), then delete or fully anonymize.
- **Approved loans** — retain for the loan term plus the regulatory minimum.
- **`ip_address`** — delete immediately after the application session closes.
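
A retention rule for rejected applications could be checked along these lines. This is a sketch only: the 5-year figure is the assumption stated above and needs legal confirmation, `processing_timestamp` is assumed to be an ISO 8601 UTC string as in the sample record, and `is_past_retention` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumed 5-year period for rejected applications; the exact figure needs legal confirmation.
RETENTION_REJECTED = timedelta(days=5 * 365)


def is_past_retention(doc, now):
    """True if a rejected application is older than the assumed retention period."""
    decision = doc.get("decision") or {}
    if decision.get("loan_approved") is not False:
        return False  # approved loans follow a different rule (loan term plus minimum)
    ts = doc.get("processing_timestamp")
    if not ts:
        return False  # no timestamp: cannot decide, leave for manual review
    processed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return now - processed > RETENTION_REJECTED


rejected = {
    "decision": {"loan_approved": False},
    "processing_timestamp": "2024-01-15T00:00:00Z",
}
still_within = is_past_retention(rejected, datetime(2025, 1, 1, tzinfo=timezone.utc))
expired = is_past_retention(rejected, datetime(2030, 1, 1, tzinfo=timezone.utc))
print(still_within, expired)
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the rule easy to test and to replay in an audit.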

### Right to Erasure — Art. 17

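The pseudonymisation design supports erasure requests: deleting a subject's record from `identity_store` severs the name-to-token link. The main collection entry becomes effectively anonymous only if the remaining direct identifiers (`email`, `ssn`, `ip_address`) are removed or tokenised as well, since Section 5 pseudonymises `full_name` alone.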

Two limitations apply:
1. Backup copies and audit logs must be purged separately within the response window.
2. Erasure may be refused where retention is required by law (e.g., anti-money laundering obligations).

---
## 4. EU AI Act Classification

### Risk Classification — Annex III, Point 5(b)

NovaCred's credit scoring algorithm is explicitly classified as **high-risk** under the EU AI Act (Annex III, point 5(b)): *"AI systems intended to be used to evaluate the creditworthiness of natural persons or establish their credit score."*

The determining factor is the **use case**, not model complexity — any automated system producing credit decisions with significant legal effect on individuals falls under this classification, regardless of the underlying algorithm.

### High-Risk Obligations — Chapter III, Sections 2 and 3

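NovaCred must ensure the following obligations are met. Arts. 9 to 14 are addressed primarily to the provider of a high-risk AI system, so they bind NovaCred directly if it developed the scoring model itself; Art. 26 applies to it as a deployer in either case: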

- **Art. 9 — Risk management**: Continuous identification and mitigation of risks throughout the system lifecycle.
- **Art. 10 — Data governance**: Training data must be examined for biases; the presence of `gender` as an input feature requires particular scrutiny under this article.
- **Art. 12 — Logging**: Automatic logging of system operation to ensure traceability of decisions — absent in the current dataset (`rejection_reason` alone is insufficient).
- **Art. 14 — Human oversight**: Measures enabling human intervention or override must be in place; `algorithm_risk_score` rejections with no visible human review mechanism raise a direct compliance gap.
- **Art. 26 — Deployer obligations**: NovaCred must monitor system operation, report serious incidents to the national authority, and retain logs for at least 6 months.
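
To make the Art. 12 gap concrete, the sketch below shows what a per-decision log entry might contain. The field set is an assumption rather than a format prescribed by the Act, `model_version` does not exist in the dataset, and `build_decision_log` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def build_decision_log(doc, model_version, reviewer=None):
    """Assemble one traceability record for a credit decision."""
    decision = doc.get("decision") or {}
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "application_id": doc.get("old_id"),
        "model_version": model_version,  # hypothetical: not present in the dataset
        "loan_approved": decision.get("loan_approved"),
        "rejection_reason": decision.get("rejection_reason"),
        "human_reviewer": reviewer,  # None means no human was involved
    }


application = {
    "old_id": "app_200",
    "decision": {"loan_approved": False, "rejection_reason": "algorithm_risk_score"},
}
entry = build_decision_log(application, model_version="example-version")
print(json.dumps(entry, indent=2))
```

Recording the reviewer explicitly, even when it is `None`, also documents whether the human oversight required by Art. 14 took place for a given decision.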

---
## 5. Pseudonymisation — Full Name

Replace `applicant_info.full_name` with a random token and store the mapping in a separate `identity_store` collection. The main `credit_applications` collection no longer contains real names; only the identity store — which should be access-controlled separately — can resolve a token back to a person.

In [None]:
import uuid

identity_store = db["identity_store"]

# Only run if not already pseudonymised
if identity_store.count_documents({}) == 0:
    for doc in collection.find({}, {"_id": 1, "applicant_info.full_name": 1}):
        full_name = doc.get("applicant_info", {}).get("full_name")
        if not full_name:
            continue

        token = str(uuid.uuid4())

        # Store mapping in identity_store
        identity_store.insert_one({
            "token": token,
            "full_name": full_name,
            "application_id": doc["_id"]
        })

        # Replace name with token in main collection
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {"applicant_info.full_name": token}}
        )

    print(f"Pseudonymised {identity_store.count_documents({})} records.")
else:
    print("Already pseudonymised — skipping.")

# Preview: main collection no longer holds real names
print("\n── Main collection (no real name) ──")
doc = collection.find_one({}, {"applicant_info.full_name": 1, "_id": 0})
print(doc)

# Preview: identity store holds the mapping
print("\n── Identity store (restricted access) ──")
identity_store.find_one({}, {"_id": 0})

Each applicant's real name is replaced with a random UUID token in the main database. The mapping between token and real name lives in a separate `identity_store` collection that would, in a real system, have stricter access controls. The original JSON file on disk is not modified. This is pseudonymisation, not anonymisation, because the name can still be recovered if you have access to the identity store.
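
The cell above tokenises only `full_name`; `email` and `ssn` stay in plaintext. One option for those fields is keyed hashing with HMAC-SHA256, which is a different technique from the UUID mapping used above: it is deterministic and needs no lookup table, but the original value cannot be read back from the digest, and the key has to be protected as carefully as the identity store. The sketch below is illustrative only; `pseudonymise_value` is a hypothetical helper and the hard-coded key is for demonstration.

```python
import hashlib
import hmac


def pseudonymise_value(value, secret_key):
    """Return a hex HMAC-SHA256 digest of value under secret_key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: a real key would come from a secrets manager, never from source code
demo_key = b"demo-only-key"

token_a = pseudonymise_value("596-64-4340", demo_key)
token_b = pseudonymise_value("596-64-4340", demo_key)
token_c = pseudonymise_value("596-64-4340", b"another-key")

# Same key gives the same token; a different key gives an unrelated token
print(token_a == token_b, token_a == token_c)
```

A plain unkeyed hash would not be enough here, because the SSN format is small enough to enumerate; the secret key is what prevents that.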

---
## 6. GDPR Data Subject Rights

This section covers the five core data subject rights applicable to NovaCred's processing. The right to erasure (Art. 17) and the rules on automated decision-making (Art. 22) were both introduced in Section 3; this section gives the full picture with example queries.

### Right of Access — Art. 15

Applicants may request all data NovaCred holds on them. The lookup uses `identity_store` to resolve the subject's name to their token, then retrieves the full record from the main collection.

In [None]:
# Art. 15 — Right of Access
# Resolve name → token via identity_store, then retrieve the full record

subject = identity_store.find_one({}, {"token": 1, "full_name": 1, "_id": 0})
token = subject["token"]
print(f"Subject: {subject['full_name']}  →  token: {token[:8]}...")

subject_data = collection.find_one({"applicant_info.full_name": token}, {"_id": 0})
subject_data

### Right to Rectification — Art. 16

Applicants may request correction of inaccurate personal data. The update targets the main collection via the applicant's token — the identity store is not modified since the name itself has not changed.

In [None]:
# Art. 16 — Right to Rectification
# Correct an inaccurate field — example: wrong date of birth

result = collection.update_one(
    {"applicant_info.full_name": token},
    {"$set": {"applicant_info.date_of_birth": "1990-05-15"}}
)
print(f"Matched: {result.matched_count}  Modified: {result.modified_count}")

### Right to Erasure — Art. 17

See Section 3 for full discussion. Erasure is executed in two steps: delete the `identity_store` mapping (severing the name-to-token link), then delete the application record from the main collection.

In [None]:
# Art. 17 — Right to Erasure
# Step 1: delete from identity_store — severs the name-to-token link
# Step 2: delete the application record from the main collection
# (Commented out to avoid modifying live data in this demo — see Section 3 for design rationale)

# identity_store.delete_one({"token": token})
# collection.delete_one({"applicant_info.full_name": token})

print("Erasure procedure: remove token mapping from identity_store, then delete record from main collection.")
print("Once executed, no link between the token and the real identity remains.")

### Right to Data Portability — Art. 20

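Applicants may request their data in a structured, commonly used, machine-readable format for transfer to another provider. The export should cover the data the subject provided, including observed data such as financials and spending behaviour. Derived outputs such as the decision fall outside Art. 20 strictly speaking, but remain available under Art. 15; the example below exports the full record for simplicity.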

In [None]:
# Art. 20 — Right to Data Portability
# Export the subject's full record as a portable JSON object

subject_data = collection.find_one({"applicant_info.full_name": token}, {"_id": 0})
portable_export = json.dumps(subject_data, default=str, indent=2)
print(portable_export)

### Right to Object — Art. 21

Applicants may object to processing for profiling purposes. On receipt of an objection, NovaCred must stop automated processing of that record, log the objection with a timestamp, and route it to human review.

This is a workflow obligation — a production system would set a status flag (e.g., `processing_status: "objection_pending"`) and create an audit trail entry. No single query implements it.
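
The status flag and audit entry described above might look like this on a single record. It is a sketch on plain dictionaries, assuming the `processing_status` value named in the text; the event name and the `register_objection` helper are hypothetical. Against MongoDB, the flag would typically be written with `update_one` and `$set`.

```python
from datetime import datetime, timezone


def register_objection(doc, audit_log, received_at=None):
    """Flag a record as under objection and append an audit entry."""
    received_at = received_at or datetime.now(timezone.utc)
    doc["processing_status"] = "objection_pending"
    audit_log.append({
        "application_id": doc.get("old_id"),
        "event": "art21_objection_received",  # hypothetical event name
        "timestamp": received_at.isoformat(),
    })
    return doc


record = {"old_id": "app_200"}
audit_log = []
register_objection(record, audit_log, datetime(2024, 1, 15, tzinfo=timezone.utc))
print(record)
print(audit_log)
```

Any scoring job would then need to skip records whose `processing_status` is `objection_pending` until a human reviewer clears them.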