# Data Governance Assessment: Credit Application Analysis

**Role:** Data Governance Officer

**Objective:** Conduct a comprehensive governance assessment of the credit applications dataset, identifying data quality issues, bias patterns, privacy risks, and governance gaps.

**Output:** Findings and recommendations for the executive summary in the README

## Section 1: Load and Explore Data

We begin by loading the raw credit applications dataset and performing initial exploration to understand:
- Dataset size and structure
- Data types and fields
- Presence of PII (Personally Identifiable Information)
- Decision outcomes (loan approvals/rejections)
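
As a rough illustration of the PII check listed above, the sketch below screens column names against a keyword list. The keyword tuple and the `flag_pii_columns` helper are hypothetical and not part of the original notebook; a name-based screen is only a first pass and does not replace a review of the actual values.

```python
# Hypothetical keyword list; a real screen should follow the organisation's
# data classification policy rather than this illustrative tuple.
PII_KEYWORDS = ("name", "email", "ssn", "ip_address", "birth", "zip")


def flag_pii_columns(columns):
    """Return the column names that contain any PII keyword (case-insensitive)."""
    return [col for col in columns if any(key in col.lower() for key in PII_KEYWORDS)]


# A subset of the field names used in the credit applications dataset
columns = ["app_id", "full_name", "email", "ssn", "ip_address", "gender",
           "date_of_birth", "zip_code", "annual_income", "loan_approved"]

pii_columns = flag_pii_columns(columns)
print(pii_columns)
```

Note that `gender` is not flagged by this screen: it is a protected attribute relevant to the bias analysis rather than a direct identifier, so it would need to be tracked separately.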

In [1]:
import json
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

# Set styling
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load the raw dataset
with open('../raw_credit_applications.json', 'r') as f:
    raw_data = json.load(f)

print("✓ Dataset loaded successfully")
print(f"✓ Total records: {len(raw_data)}")

✓ Dataset loaded successfully
✓ Total records: 502
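
Before flattening, it can help to confirm that every record actually carries the nested sections the next cell relies on. The sketch below counts how often each key appears, using `Counter`. It runs on a tiny synthetic sample whose values are invented for illustration, and the `profile_keys` helper is an assumption rather than part of the original analysis.

```python
from collections import Counter

# Tiny synthetic sample that mimics the nested layout of the applications file;
# all values are invented for illustration only.
sample_data = [
    {"_id": "a1", "applicant_info": {"full_name": "A"}, "financials": {"annual_income": 1},
     "decision": {"loan_approved": True}, "processing_timestamp": "t"},
    {"_id": "a2", "applicant_info": {"full_name": "B"}, "financials": {"annual_income": 2},
     "decision": {"loan_approved": False, "rejection_reason": "x"}},
]


def profile_keys(records, section=None):
    """Count how many records contain each key, at the top level or inside one nested section."""
    counts = Counter()
    for rec in records:
        # Fall back to an empty dict when the section is missing or null
        target = rec if section is None else (rec.get(section) or {})
        counts.update(target.keys())
    return counts


top_level = profile_keys(sample_data)
decision_keys = profile_keys(sample_data, section="decision")
print(top_level)
print(decision_keys)
```

Applied to the real data, this would be called as `profile_keys(raw_data)`. Any key whose count is lower than `len(raw_data)` marks a field that is absent from some records.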


In [2]:
# Convert JSON to pandas DataFrame for easier analysis
# Flatten the nested structure

records = []
for app in raw_data:
    # Fall back to empty dicts so a record with a missing section
    # yields None values instead of raising a KeyError
    applicant = app.get('applicant_info') or {}
    financials = app.get('financials') or {}
    decision = app.get('decision') or {}

    record = {
        'app_id': app.get('_id'),
        'full_name': applicant.get('full_name'),
        'email': applicant.get('email'),
        'ssn': applicant.get('ssn'),
        'ip_address': applicant.get('ip_address'),
        'gender': applicant.get('gender'),
        'date_of_birth': applicant.get('date_of_birth'),
        'zip_code': applicant.get('zip_code'),
        'annual_income': financials.get('annual_income'),
        'credit_history_months': financials.get('credit_history_months'),
        'debt_to_income': financials.get('debt_to_income'),
        'savings_balance': financials.get('savings_balance'),
        'loan_approved': decision.get('loan_approved'),
        'rejection_reason': decision.get('rejection_reason'),
        'processing_timestamp': app.get('processing_timestamp')
    }
    records.append(record)

df = pd.DataFrame(records)
print(f"✓ DataFrame created: {df.shape[0]} rows × {df.shape[1]} columns")
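print("\nDataFrame Info:")
# df.info() prints its summary directly and returns None,
# so wrapping it in print() would emit a stray "None"
df.info()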

✓ DataFrame created: 502 rows × 15 columns

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   app_id                 502 non-null    object 
 1   full_name              502 non-null    object 
 2   email                  502 non-null    object 
 3   ssn                    497 non-null    object 
 4   ip_address             497 non-null    object 
 5   gender                 501 non-null    object 
 6   date_of_birth          501 non-null    object 
 7   zip_code               501 non-null    object 
 8   annual_income          497 non-null    object 
 9   credit_history_months  502 non-null    int64  
 10  debt_to_income         502 non-null    float64
 11  savings_balance        502 non-null    int64  
 12  loan_approved          502 non-null    bool   
 13  rejection_reason       210 non-null    object 
 14