# NovaCred Credit Application — Governance Assessment

**Role:** Governance Officer  
**Course:** Data Ecosystems and Governance in Organizations (DEGO 2606) — Nova SBE  
**Dataset:** `raw_credit_applications.json`

> This notebook covers the governance layer of the NovaCred audit: PII identification, GDPR compliance mapping, EU AI Act classification, a pseudonymisation demonstration, and actionable governance recommendations.

---
## 1. Setup & Data Loading

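In [12]:
# Reset step: drops the collection so that the load below starts from an empty state.
# `collection` is defined in the setup cell that follows, so run that cell first.
collection.drop()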

In [14]:
import json
import hashlib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["novacred"]
collection = db["credit_applications"]

# Load JSON and rename _id -> old_id so MongoDB assigns fresh ObjectIds
data_path = Path('../raw_credit_applications.json')
with open(data_path, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

for doc in raw_data:
    if "_id" in doc:
        doc["old_id"] = doc.pop("_id")

# Insert all records (including duplicates, now distinguishable by ObjectId)
if collection.count_documents({}) == 0:
    result = collection.insert_many(raw_data)
    print(f"Inserted {len(result.inserted_ids)} documents.")
else:
    print("Collection already populated — skipping insert.")

print(f"Total records in collection: {collection.count_documents({})}")

# Preview one document
collection.find_one()

Inserted 502 documents.
Total records in collection: 502


{'_id': ObjectId('69a1a95b3a6b5d003e46fb20'),
 'applicant_info': {'full_name': 'Jerry Smith',
  'email': 'jerry.smith17@hotmail.com',
  'ssn': '596-64-4340',
  'ip_address': '192.168.48.155',
  'gender': 'Male',
  'date_of_birth': '2001-03-09',
  'zip_code': '10036'},
 'financials': {'annual_income': 73000,
  'credit_history_months': 23,
  'debt_to_income': 0.2,
  'savings_balance': 31212},
 'spending_behavior': [{'category': 'Shopping', 'amount': 480},
  {'category': 'Rent', 'amount': 790},
  {'category': 'Alcohol', 'amount': 247}],
 'decision': {'loan_approved': False,
  'rejection_reason': 'algorithm_risk_score'},
 'processing_timestamp': '2024-01-15T00:00:00Z',
 'old_id': 'app_200'}

---
## 2. PII Identification

### 2.1 PII Inventory

Map every field in the dataset against the GDPR definitions of *personal data* (Article 4(1)) and *special categories of personal data* (Article 9).
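A minimal sketch of how such an inventory could be recorded, using the field names visible in the sample document from Section 1. The category labels and the helpers `flatten` and `classify_fields` are illustrative assumptions for this notebook, not a legal determination.

```python
# Illustrative PII inventory keyed by dotted field path.
# Field names come from the sample document; the category labels are an
# assumption of this sketch and should be reviewed by the auditor.
PII_INVENTORY = {
    "applicant_info.full_name": "direct identifier",
    "applicant_info.email": "direct identifier",
    "applicant_info.ssn": "direct identifier",
    "applicant_info.ip_address": "direct identifier",
    "applicant_info.gender": "quasi-identifier",
    "applicant_info.date_of_birth": "quasi-identifier",
    "applicant_info.zip_code": "quasi-identifier",
}


def flatten(doc, prefix=""):
    """Yield the dotted path of every leaf field in a nested document."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path


def classify_fields(doc):
    """Map each leaf path to its inventory category, or 'unclassified'."""
    return {path: PII_INVENTORY.get(path, "unclassified") for path in flatten(doc)}
```

Fields that come back as `unclassified` are the ones still awaiting a decision, which makes gaps in the inventory easy to spot.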

### 2.2 Sensitive PII — Direct Identifiers

Flag fields that can identify a natural person on their own, without being combined with other data (full name, SSN, email address, IP address).

### 2.3 Quasi-Identifiers & Indirect PII

Flag fields that are not identifying on their own but can be combined to re-identify individuals (date of birth, ZIP code, gender).
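One way to gauge the combined risk is to count how many records share each quasi-identifier combination, in the spirit of a k-anonymity check. The sketch below assumes records shaped like the sample document; `quasi_identifier_groups` is a hypothetical helper name.

```python
from collections import Counter


def quasi_identifier_groups(records, fields=("gender", "date_of_birth", "zip_code")):
    """Count how many records share each quasi-identifier combination.

    Returns the group sizes and the list of combinations held by exactly
    one record, which carry the highest re-identification risk.
    """
    sizes = Counter(
        tuple((rec.get("applicant_info") or {}).get(field) for field in fields)
        for rec in records
    )
    unique = [combo for combo, count in sizes.items() if count == 1]
    return sizes, unique
```

Because the collection was loaded with its duplicates, group sizes may be inflated until the records are deduplicated.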

---
## 3. GDPR Compliance Assessment

### 3.1 Lawful Basis for Processing (Article 6)

Evaluate which legal basis NovaCred could rely on for each processing activity and whether it is adequately documented.

### 3.2 Data Minimisation (Article 5(1)(c))

Assess whether every collected field is strictly necessary for the credit-scoring purpose.

### 3.3 Storage Limitation (Article 5(1)(e))

Check for evidence of a data retention policy. Identify fields with no clear retention justification.
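The sample document carries a `processing_timestamp`, so record age can be compared against a retention window. The sketch below assumes ISO 8601 timestamps like the one in the sample; the five-year window is a hypothetical placeholder, not NovaCred policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window, chosen only for illustration.
RETENTION = timedelta(days=5 * 365)


def is_past_retention(record, now=None, retention=RETENTION):
    """Return True/False, or None when the record's age cannot be assessed."""
    now = now or datetime.now(timezone.utc)
    raw = record.get("processing_timestamp")
    if not raw:
        return None  # missing timestamp: itself a storage-limitation finding
    try:
        # Older Python versions do not accept a trailing "Z", so map it to UTC.
        processed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None  # unparseable timestamp
    if processed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        processed = processed.replace(tzinfo=timezone.utc)
    return now - processed > retention
```

Records that return `None` are worth listing separately, since an unknown age makes any retention schedule unenforceable.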

### 3.4 Right to Erasure (Article 17)

Evaluate whether the data architecture supports erasure requests (e.g., can a single applicant's record be fully removed?).
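A minimal sketch of an erasure check against the MongoDB collection from Section 1. It assumes the SSN is a reliable key for one applicant, which still needs to be verified; records with a missing or inconsistent SSN, as well as backups and derived datasets, are outside its scope. `erase_applicant` is a hypothetical helper name.

```python
def erase_applicant(collection, ssn):
    """Delete every document for one applicant, duplicates included.

    `collection` is expected to be a PyMongo collection (or any object with
    compatible `delete_many` and `count_documents` methods).
    Returns (number deleted, number still matching afterwards).
    """
    query = {"applicant_info.ssn": ssn}
    result = collection.delete_many(query)
    remaining = collection.count_documents(query)
    return result.deleted_count, remaining
```

A non-zero second value would indicate that the erasure did not fully take effect; a deleted count above one confirms that duplicates of the same applicant were present.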

### 3.5 Automated Decision-Making (Article 22)

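Credit decisions are made by an ML model. Assess NovaCred's obligations around transparency, human oversight, and the applicant's rights to obtain human intervention, to contest the decision (Article 22(3)), and to receive an explanation of it (Recital 71).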
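As a starting point, the rejections attributed to the model can be isolated for review. The sketch below relies on the `rejection_reason` value `algorithm_risk_score` seen in the sample document; whether other values also denote automated decisions is an open question, and `automated_rejections` is a hypothetical helper name.

```python
def automated_rejections(records):
    """Return rejected records whose stated reason is the algorithmic score."""
    flagged = []
    for rec in records:
        decision = rec.get("decision") or {}
        if (
            decision.get("loan_approved") is False
            and decision.get("rejection_reason") == "algorithm_risk_score"
        ):
            flagged.append(rec)
    return flagged
```

The sample document contains no field recording human review, so each flagged record is a candidate for the Article 22 analysis.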

---
## 4. EU AI Act Classification

### 4.1 Risk Classification

Classify NovaCred's credit-scoring system under the EU AI Act (Annex III, High-Risk AI Systems, which lists creditworthiness evaluation of natural persons in point 5(b)) and document the implications.

### 4.2 High-Risk Obligations

Map the applicable obligations (Articles 9 to 15): risk management, data governance, transparency, human oversight, accuracy & robustness, logging.

---
## 5. Privacy Demonstration

### 5.1 Pseudonymisation of Direct Identifiers

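Apply one-way hashing (SHA-256) to replace direct identifiers (SSN, email, full name) with pseudonyms. The original values are not stored in the output dataset. Pseudonymised data remains personal data under the GDPR (Article 4(5), Recital 26), so the output is still in scope.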
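A minimal sketch of this step. Note one deliberate substitution: it uses keyed hashing (HMAC-SHA256) rather than plain SHA-256, because identifiers with a small value space, such as a nine-digit SSN, can be recovered from an unkeyed hash by enumerating every candidate. The environment variable `PSEUDONYM_KEY` and the helper names are assumptions of this sketch.

```python
import hashlib
import hmac
import os

# Secret key for keyed hashing. Reading it from an environment variable named
# PSEUDONYM_KEY is an assumption; the fallback value is for demonstration only.
SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "demo-only-key").encode("utf-8")

# Direct identifiers named in the section text.
DIRECT_IDENTIFIERS = ("ssn", "email", "full_name")


def pseudonymise(value, key=SECRET_KEY):
    """Return a hex HMAC-SHA256 pseudonym for a single value."""
    # Normalise so that trivially different spellings map to one pseudonym.
    normalised = str(value).strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_record(record, key=SECRET_KEY):
    """Return a copy of the record with direct identifiers replaced."""
    info = dict(record.get("applicant_info") or {})
    for field in DIRECT_IDENTIFIERS:
        if info.get(field) is not None:
            info[field] = pseudonymise(info[field], key)
    # The input record is left untouched; only the copy carries pseudonyms.
    return {**record, "applicant_info": info}
```

The same input and key always yield the same pseudonym, so duplicate applicants remain linkable after the transformation; anyone holding the key can recompute the mapping, which is why the key needs to be stored separately from the output dataset.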

### 5.2 Anonymisation / Generalisation of Quasi-Identifiers

Generalise date of birth to age brackets and ZIP code to region to reduce re-identification risk.
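A minimal sketch of both generalisations. The ten-year bracket edges and the use of the three-digit ZIP prefix as a stand-in for "region" are assumptions; values that cannot be parsed are mapped to `unknown` rather than guessed.

```python
from datetime import date


def age_bracket(date_of_birth, today=None):
    """Map an ISO date of birth (YYYY-MM-DD) to a coarse age bracket."""
    today = today or date.today()
    try:
        born = date.fromisoformat(date_of_birth)
    except (TypeError, ValueError):
        return "unknown"
    # Subtract one year if this year's birthday has not happened yet.
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    if age < 25:
        return "<25"
    if age >= 65:
        return "65+"
    lower = 25 + (age - 25) // 10 * 10
    return f"{lower}-{lower + 9}"


def zip_region(zip_code):
    """Keep only the three-digit ZIP prefix and mask the remainder."""
    digits = str(zip_code or "").strip()
    if len(digits) < 3 or not digits[:3].isdigit():
        return "unknown"
    return digits[:3] + "**"
```

Coarser brackets lower re-identification risk but also reduce analytical detail, so the bracket width is a trade-off to document in the assessment.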

### 5.3 Before / After Comparison

Show a side-by-side sample of the original and privacy-protected records.

---
## 6. Governance Gaps Analysis

### 6.1 Identified Gaps

Systematically document each governance gap found in the dataset and the processing pipeline.

### 6.2 Gap Heat-Map / Summary Table

Visualise severity and coverage of each gap across GDPR principles.

---
## 7. Governance Recommendations

### 7.1 Short-Term Controls (0–3 months)

Immediate actions NovaCred can take to reduce regulatory exposure.

### 7.2 Medium-Term Controls (3–12 months)

Structural changes: audit trail implementation, consent management, data retention schedules.

### 7.3 Long-Term Controls (12+ months)

Strategic governance programme: DPIA process, AI governance board, continuous fairness monitoring.

---
## 8. Summary

| Area | Key Finding | Recommended Action | GDPR Article / AI Act Ref |
|------|-------------|-------------------|---------------------------|
| PII exposure | | | |
| Lawful basis | | | |
| Data minimisation | | | |
| Storage limitation | | | |
| Automated decisions | | | |
| Audit trail | | | |
| Human oversight | | | |