Skip to content

Stdelarosa/reference-data-quality-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reference Data Quality Engine

Overview

Financial firms depend on reference data, the master records describing every security they trade. ISIN identifiers, currency codes, asset classes, expiry dates. When this data is wrong, trades fail, risk calculations break, and regulatory reports become inaccurate.

This project is a full reference data quality system built to simulate the validation infrastructure used by data operations teams at hedge funds. It ingests a securities dataset, runs automated validation across five data quality dimensions, enforces SLA rules against defined thresholds, scores vendor data quality, and surfaces everything in a live Streamlit dashboard.

Architecture

data/securities.csv
        |
        v
db/schema.py          Creates PostgreSQL database and tables
db/seed.py            Loads securities data into PostgreSQL
        |
        v
validation/sql_checks.py        10 SQL validation queries
validation/schema_validation.py     Pandera schema enforcement
validation/sla_engine.py        SLA rules engine and vendor scorecard
        |
        v
dashboard/app.py      Streamlit dashboard: metrics, failures, SLA, drill-down

Validation Checks

Check Dimension Severity Result
Null ISIN values Completeness High FAIL
Null CUSIP values Completeness High FAIL
Duplicate ISIN values Uniqueness High FAIL
Invalid ISIN format Validity High FAIL
Negative price values Validity High FAIL
Expired securities with active status Consistency Medium FAIL
Currency country mismatch Consistency Medium FAIL
Invalid status values Validity Medium PASS
Null sector values Completeness Low PASS
Zero price on active securities Validity Medium PASS
Pandera: price_usd schema enforcement Validity High FAIL

SLA Rules

Rule Threshold Actual Status
ISIN Completeness 95% 94.4% FAIL
Zero Duplicate ISINs 0 1 FAIL
Zero Negative Prices 0 1 FAIL
All Active Securities Have Price 0 missing 0 missing PASS
No Expired Securities Marked Active 0 1 FAIL
Vendor A ISIN Completeness 95% 100.0% PASS
Vendor B ISIN Completeness 95% 88.9% FAIL

Key Findings

Finding Detail
Total securities 18 records across equity and fixed income
Validation failures 8 out of 11 checks failed
SLA breaches 5 out of 7 rules breached
Critical issues Null ISIN, duplicate ISIN, invalid ISIN format, negative price
Consistency violations 1 expired security still marked Active, 1 currency/country mismatch
Vendor performance Vendor A clean at 100% ISIN completeness, Vendor B failing at 88.9%

Tools

Python | PostgreSQL | Pandas | Pandera | Streamlit | SQLAlchemy

How to Run

  1. Install dependencies: pip install pandas pandera streamlit sqlalchemy psycopg2-binary

  2. Set database password: export DB_PASSWORD=your_password_here

  3. Create database and tables: python db/schema.py

  4. Load securities data: python db/seed.py

  5. Run SQL validation checks: python validation/sql_checks.py

  6. Run schema validation: python validation/schema_validation.py

  7. Run SLA engine: python validation/sla_engine.py

  8. Launch dashboard: streamlit run dashboard/app.py

About

Financial securities validation system built on PostgreSQL. SQL quality rules across five data dimensions, Pandera schema enforcement, SLA breach detection, vendor scorecard, and a live Streamlit dashboard.

Resources

Stars

Watchers

Forks

Contributors

Languages