# 01 — Data Quality Analysis

**NovaCred Credit Application Governance | DEGO 2606**

**Author:** Jasper Gräfe (Data Engineer)

---

## Objectives

Assess the quality of `raw_credit_applications.json` across four dimensions:

| Dimension | Description |
|-----------|-------------|
| Completeness | Are all required fields present and populated? |
| Consistency | Are values formatted and coded uniformly? |
| Validity | Do values conform to expected ranges and formats? |
| Accuracy | Are values plausible and internally consistent? |

---

## Sections

0. Setup & Data Loading
1. Dataset Overview & Schema Profiling
2. Completeness Analysis
3. Consistency Analysis
4. Validity Analysis
5. Accuracy Analysis
6. Consolidated Quality Report & Clean Export
7. Quality Scores & Governance Notes

---
## Section 0 — Setup & Data Loading

In [None]:
import json
import re
import pathlib
import collections
from datetime import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [None]:
DATA_PATH = pathlib.Path('../data/raw_credit_applications.json')

with open(DATA_PATH, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

print(f'Records loaded: {len(raw_data)}')
print(f'\nSample record keys: {list(raw_data[0].keys())}')

In [None]:
def flatten_record(record):
    """Flatten one nested application record into a single-level dict."""
    flat = {'_id': record.get('_id')}
    # `or {}` guards against sub-objects that are present but null
    flat.update(record.get('applicant_info') or {})
    flat.update(record.get('financials') or {})
    decision = record.get('decision') or {}
    flat['loan_approved'] = decision.get('loan_approved')
    flat['interest_rate'] = decision.get('interest_rate')
    flat['approved_amount'] = decision.get('approved_amount')
    flat['rejection_reason'] = decision.get('rejection_reason')
    return flat

df = pd.DataFrame([flatten_record(r) for r in raw_data])
print(f'Shape: {df.shape}')
df.head(3)
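
One thing to watch with `flatten_record`: keys from `applicant_info` and `financials` are merged into the same flat dict, so a key name shared by both sub-objects would be silently overwritten. As a possible cross-check, `pd.json_normalize` keeps the parent name as a prefix. The sketch below runs on two made-up records; `sample_records`, `df_alt`, and the nested field names `full_name` and `annual_income` are illustrative assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical records; the nested field names are illustrative only.
sample_records = [
    {
        "_id": "a1",
        "applicant_info": {"full_name": "Jane Doe"},
        "financials": {"annual_income": 52000},
        "decision": {"loan_approved": True, "interest_rate": 4.5},
    },
    {
        "_id": "a2",
        "applicant_info": {"full_name": "John Roe"},
        "financials": {"annual_income": 31000},
        "decision": {"loan_approved": False, "rejection_reason": "low income"},
    },
]

# json_normalize prefixes each nested key with its parent name, so identically
# named keys in different sub-objects cannot overwrite each other.
df_alt = pd.json_normalize(sample_records, sep=".")
print(sorted(df_alt.columns))
```

Keys that are absent from a record (here `decision.interest_rate` on the second one) come out as `NaN`, which matches how `flatten_record` yields `None` for missing decision fields.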

---
## Section 1 — Dataset Overview & Schema Profiling

In [None]:
# To be implemented
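
A minimal sketch of what a schema profile could look like, assuming a per-column summary of dtype, non-null count, and distinct values is enough for a first pass. `profile_schema` and the `toy` frame are hypothetical names; in the notebook the function would be applied to `df`.

```python
import pandas as pd


def profile_schema(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, non-null count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy data standing in for the flattened applications.
toy = pd.DataFrame({
    "_id": ["a1", "a2", "a3"],
    "loan_approved": [True, False, None],
    "interest_rate": [4.5, None, None],
})
schema_profile = profile_schema(toy)
print(schema_profile)
```

An `object` dtype on a column that should be numeric or boolean is an early hint that the later consistency and validity checks will find mixed types.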

---
## Section 2 — Completeness Analysis

In [None]:
# To be implemented
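
One way the completeness check might be sketched, under the assumption that blank or whitespace-only strings should count as missing alongside `None`/`NaN`. `completeness_report` and the `toy` frame are hypothetical.

```python
import pandas as pd


def completeness_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column, treating blank strings as missing."""
    # Assumption: whitespace-only strings carry no information.
    blank = frame.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
    )
    missing = (frame.isna() | blank).sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    }).sort_values("missing", ascending=False)


# Toy data standing in for the flattened applications.
toy = pd.DataFrame({
    "_id": ["a1", "a2", "a3", "a4"],
    "rejection_reason": [None, "", "  ", "low income"],
    "interest_rate": [4.5, None, None, 3.9],
})
completeness = completeness_report(toy)
print(completeness)
```

Note that some missingness is structural rather than a defect: an approved loan would not be expected to carry a `rejection_reason`, so raw percentages need to be read against the decision outcome.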

---
## Section 3 — Consistency Analysis

In [None]:
# To be implemented
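
A possible starting point for the consistency check, assuming the most common coding problem in free-text categorical fields is the same value written with different casing or stray whitespace. `find_format_variants` and the sample values are hypothetical; which columns to scan depends on the real schema.

```python
import pandas as pd


def find_format_variants(series: pd.Series) -> dict:
    """Group raw string values that collapse to the same normalised form."""
    values = series.dropna().astype(str)
    normalised = values.str.strip().str.lower()
    groups = values.groupby(normalised).unique()
    # Keep only normalised forms that appear under more than one raw spelling.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Toy column with inconsistent casing and trailing whitespace.
toy_column = pd.Series(
    ["Approved", "approved ", "REJECTED", "rejected", "pending", None]
)
variants = find_format_variants(toy_column)
print(variants)
```

Values that appear under a single spelling (here `pending`) are not reported, so the result lists only the codes that need harmonising.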

---
## Section 4 — Validity Analysis

In [None]:
# To be implemented
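
A rule-driven range check is one way to approach validity. In the sketch below the bounds in `RANGE_RULES` are placeholders chosen for illustration, not NovaCred's actual limits, and `range_violations` is a hypothetical helper. Only `interest_rate` and `approved_amount` are covered because they are the numeric fields named in `flatten_record`.

```python
import pandas as pd

# Hypothetical bounds; real limits should come from the business rules.
RANGE_RULES = {
    "interest_rate": (0.0, 100.0),
    "approved_amount": (0.0, float("inf")),
}


def range_violations(frame: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Return a boolean frame that is True where a non-null value breaks its rule."""
    flags = pd.DataFrame(index=frame.index)
    for column, (low, high) in rules.items():
        if column not in frame.columns:
            continue
        numeric = pd.to_numeric(frame[column], errors="coerce")
        # Non-numeric values fail validity; truly missing values are left
        # to the completeness analysis.
        unparseable = numeric.isna() & frame[column].notna()
        flags[column] = unparseable | (numeric < low) | (numeric > high)
    return flags


# Toy data with a negative rate, a non-numeric rate, and a negative amount.
toy = pd.DataFrame({
    "interest_rate": [4.5, -1.0, "n/a", None],
    "approved_amount": [10000, 5000, -200, None],
})
violations = range_violations(toy, RANGE_RULES)
print(violations.sum())
```

Keeping the rules in a dict rather than in code makes it easier to review the thresholds separately from the checking logic.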

---
## Section 5 — Accuracy Analysis

In [None]:
# To be implemented
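
For internal consistency, the four decision fields extracted by `flatten_record` can be checked against each other. The sketch below encodes three assumptions that would need confirming with the business: approved loans carry a rate and an amount, rejected loans carry neither, and only rejected loans carry a rejection reason. `decision_inconsistencies` and the `toy` frame are hypothetical.

```python
import pandas as pd


def decision_inconsistencies(frame: pd.DataFrame) -> pd.DataFrame:
    """Flag rows whose decision fields contradict each other."""
    approved = frame["loan_approved"].eq(True)
    rejected = frame["loan_approved"].eq(False)
    has_rate = frame["interest_rate"].notna()
    has_amount = frame["approved_amount"].notna()
    return pd.DataFrame({
        # Assumption: an approved loan should carry a rate and an amount.
        "approved_missing_terms": approved & ~(has_rate & has_amount),
        # Assumption: a rejected loan should not carry approval terms.
        "rejected_with_terms": rejected & (has_rate | has_amount),
        # Assumption: only rejected loans should carry a rejection reason.
        "approved_with_reason": approved & frame["rejection_reason"].notna(),
    })


# Toy data: row 0 is clean, rows 1 and 3 contradict themselves.
toy = pd.DataFrame({
    "loan_approved": [True, True, False, False],
    "interest_rate": [4.5, None, None, 3.0],
    "approved_amount": [10000, 5000, None, None],
    "rejection_reason": [None, "low income", "low income", None],
})
inconsistencies = decision_inconsistencies(toy)
print(inconsistencies.sum())
```

Rows where `loan_approved` is missing are flagged by none of the three checks; they are a completeness issue rather than a contradiction.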

---
## Section 6 — Consolidated Quality Report & Clean Export

In [None]:
# To be implemented
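
A sketch of the export step only, assuming the earlier sections each produce a boolean frame of issue flags aligned to the rows of `df`. It drops every flagged row and de-duplicates on `_id`; whether flagged rows should be dropped, repaired, or quarantined is a governance decision this sketch does not settle. `export_clean`, the flag column name, and the output file name are hypothetical.

```python
import pathlib
import tempfile

import pandas as pd


def export_clean(
    frame: pd.DataFrame, issue_flags: pd.DataFrame, out_path: pathlib.Path
) -> pd.DataFrame:
    """Write rows with no flagged issue to CSV and return them."""
    has_issue = issue_flags.any(axis=1)
    clean = frame.loc[~has_issue].drop_duplicates(subset="_id")
    clean.to_csv(out_path, index=False)
    return clean


# Toy data: one duplicated _id and one row flagged by a validity check.
toy = pd.DataFrame({
    "_id": ["a1", "a2", "a2", "a3"],
    "interest_rate": [4.5, 3.0, 3.0, -1.0],
})
toy_flags = pd.DataFrame({"interest_rate_out_of_range": [False, False, False, True]})

# A temporary directory keeps the sketch from writing into the project tree.
out_path = pathlib.Path(tempfile.mkdtemp()) / "clean_credit_applications.csv"
clean = export_clean(toy, toy_flags, out_path)
print(f"Kept {len(clean)} of {len(toy)} rows")
```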

---
## Section 7 — Quality Scores & Governance Notes

In [None]:
# To be implemented
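
One simple scoring scheme, offered as an assumption rather than a standard: score each dimension as the share of rows with no flagged issue in that dimension. `dimension_scores` and the toy flag frames are hypothetical; other weightings (per field, per severity) are equally defensible.

```python
import pandas as pd


def dimension_scores(flags_by_dimension: dict) -> pd.Series:
    """Score each dimension as the share of rows with no flagged issue (0 to 1)."""
    scores = {}
    for dimension, flags in flags_by_dimension.items():
        row_has_issue = flags.any(axis=1)
        scores[dimension] = 1.0 - row_has_issue.mean()
    return pd.Series(scores, name="score").round(3)


# Toy flags for four rows: one completeness issue, two rows with validity issues.
toy_flags = {
    "completeness": pd.DataFrame({"interest_rate_missing": [False, True, False, False]}),
    "validity": pd.DataFrame({
        "rate_out_of_range": [True, False, False, False],
        "amount_out_of_range": [True, False, True, False],
    }),
}
scores = dimension_scores(toy_flags)
print(scores)
```

Because the score is row-based, a row with several problems in one dimension counts once, which keeps the four scores comparable across dimensions with different numbers of checks.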