
Fabric Data Quality Framework

A reusable, configurable data quality framework built on Great Expectations, designed for Microsoft Fabric environments and usable across all your HS2 projects.

πŸš€ What's New in v1.2.0

  • ABFSS Support: Added support for loading configuration files directly from abfss:// paths in Microsoft Fabric.
  • Configurable Thresholds: Set custom pass/fail thresholds (e.g., 95%) instead of a hardcoded 100% (see the sketch below).
  • Enhanced Reporting: Validation results now include threshold details.
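
A minimal sketch combining the new options inside a Fabric notebook (the abfss:// URL, file names, and the 95% threshold are illustrative placeholders):

import pandas as pd
from dq_framework import DataQualityValidator

# Load a validation config directly from OneLake (placeholder abfss:// path)
validator = DataQualityValidator(
    config_path="abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/dq_configs/my_table.yml"
)

df = pd.read_parquet("Files/bronze/my_table.parquet")  # or any pandas DataFrame

# Pass when at least 95% of expectations succeed instead of the default 100%
results = validator.validate(df, threshold=95.0)
print(results['success'], results['success_rate'])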

🧭 Environment & Path Support

  • Local Python: use local filesystem paths (e.g., config/my_table.yml, /tmp/data.parquet).
  • Microsoft Fabric: abfss://... and Files/... paths require Fabric utilities (mssparkutils). Outside Fabric these paths are not supported and will fail fast.
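
A small illustrative guard for picking the right path style per environment (not framework internals, and it assumes the Fabric runtime exposes mssparkutils via the notebookutils package):

# Choose a config path that matches the runtime (illustrative only)
try:
    from notebookutils import mssparkutils  # available inside Microsoft Fabric
    CONFIG_PATH = "Files/dq_configs/my_table_config.yml"   # Fabric lakehouse path
except ImportError:
    CONFIG_PATH = "config/my_table.yml"                    # local filesystem path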

🎯 Purpose

This standalone framework provides data quality validation capabilities that can be used by:

  • full_stack_hss - Incident data analysis
  • AIMS_LOCAL - AIMS data processing
  • ACA_COMMERCIAL - Commercial data pipelines
  • Any other project in your workspace

πŸ“ Project Structure

fabric_data_quality/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ requirements.txt                   # Dependencies
β”œβ”€β”€ setup.py                           # Installation script
β”‚
β”œβ”€β”€ dq_framework/                      # Core framework package
β”‚   β”œβ”€β”€ __init__.py                   # Package exports
β”‚   β”œβ”€β”€ validator.py                  # Core validation engine
β”‚   β”œβ”€β”€ fabric_connector.py           # MS Fabric integration
β”‚   β”œβ”€β”€ config_loader.py              # YAML configuration loader
β”‚   β”œβ”€β”€ data_profiler.py              # Data profiling engine
β”‚   β”œβ”€β”€ batch_profiler.py             # Parallel batch profiling
β”‚   β”œβ”€β”€ loader.py                     # Robust data loading (local/ABFSS)
β”‚   └── utils.py                      # File system utilities
β”‚
β”œβ”€β”€ scripts/                           # Utility scripts
β”‚   β”œβ”€β”€ profile_data.py               # Data profiling CLI tool
β”‚   └── activate_and_test.sh          # Environment setup & test
β”‚
β”œβ”€β”€ config_templates/                  # Reusable YAML templates
β”‚   β”œβ”€β”€ bronze_layer_template.yml     # Raw data validation
β”‚   β”œβ”€β”€ silver_layer_template.yml     # Cleaned data validation
β”‚   β”œβ”€β”€ gold_layer_template.yml       # Business logic validation
β”‚   └── custom_template.yml           # Blank template
β”‚
β”œβ”€β”€ examples/                          # Project-specific examples
β”‚   β”œβ”€β”€ hss_incidents_example.yml     # HSS project
β”‚   β”œβ”€β”€ aims_data_example.yml         # AIMS project
β”‚   β”œβ”€β”€ aca_commercial_example.yml    # ACA project
β”‚   └── usage_examples.py             # Code examples
β”‚
β”œβ”€β”€ tests/                             # Unit tests
β”‚   β”œβ”€β”€ test_validator.py
β”‚   └── test_config_loader.py
β”‚
└── docs/                              # Documentation
    β”œβ”€β”€ CONFIGURATION_GUIDE.md
    β”œβ”€β”€ FABRIC_INTEGRATION.md
    β”œβ”€β”€ FABRIC_ETL_INTEGRATION.md     # Complete ETL integration guide
    β”œβ”€β”€ FABRIC_QUICK_START.md         # 5-minute Fabric setup
    β”œβ”€β”€ PROFILING_WORKFLOW.md         # Profiling guide
    └── ...                           # Other documentation

πŸš€ Quick Start

1. Setup (One Time)

# Activate the conda environment
conda activate fabric-dq

# Verify installation
python -c "from dq_framework import DataProfiler; print('βœ… Ready')"

2. Profile Your Data (One Time Per Dataset)

# Profile any data source (CSV, Parquet, Excel, JSON)
python scripts/profile_data.py path/to/your/data.csv \
    --output config/my_validation.yml \
    --null-tolerance 10 \
    --severity medium

# Profile a directory of mixed files (Batch Mode)
# Use --workers to speed up processing with parallel execution
python scripts/profile_data.py path/to/data_folder/ --output config/ --workers 4

# Example: Profile CAUSEWAY data
python scripts/profile_data.py sample_source_data/CAUSEWAY_combined_scr_2024.csv \
    --output config/causeway_validation.yml \
    --sample 50000

3. Advanced Features

  • Parallel Processing: Use --workers N to profile multiple files simultaneously.
  • Smart Sampling: Automatically limits large files (>500MB) to 100k rows to prevent memory crashes.
  • Efficient Parquet Reading: Uses pyarrow batch reading for massive Parquet files (see the sketch after this list).
  • Batch Mode: Point to a directory to profile all supported files (CSV, Parquet, Excel, JSON) at once.
  • Fabric Native Support: Works directly with abfss:// paths in MS Fabric environments.
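
The batched Parquet read mentioned above looks roughly like this (an illustrative sketch of the technique, not the framework's internal code; the 100k cap mirrors the sampling limit described above):

import pyarrow as pa
import pyarrow.parquet as pq

# Stream a large Parquet file in record batches instead of loading it all at once
parquet_file = pq.ParquetFile("path/to/large_file.parquet")
batches, rows = [], 0
for batch in parquet_file.iter_batches(batch_size=50_000):
    batches.append(batch)
    rows += batch.num_rows
    if rows >= 100_000:  # stop once the sampling cap is reached
        break

df = pa.Table.from_batches(batches).to_pandas()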

4. Enhance Config (One Time Per Dataset)

Review and add business rules to config/my_validation.yml:

expectations:
  # Auto-generated rules
  - expectation_type: expect_column_to_exist
    kwargs: {column: customer_id}
  
  # Add your business rules
  - expectation_type: expect_column_values_to_be_unique
    kwargs: {column: customer_id}
    meta: {severity: critical, description: "Customer IDs must be unique"}

πŸ“¦ Deployment to Microsoft Fabric

To use this framework as a standard library in Fabric:

  1. Build the Wheel:

    python -m build --wheel --no-isolation

    This creates dist/fabric_data_quality-1.2.0-py3-none-any.whl.

  2. Upload to Fabric:

    • Go to your Fabric Workspace -> Manage environments.
    • Select your environment -> Custom libraries -> Upload.
    • Select the .whl file.
    • Publish the environment.
  3. Use in Notebooks:

    from dq_framework import FabricDataQualityRunner
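
    For example, a notebook cell could then run the checks (this mirrors the Fabric usage shown later in this README; the config path and table name are placeholders):

    runner = FabricDataQualityRunner(
        config_path="Files/dq_configs/my_table_config.yml"
    )
    results = runner.validate_delta_table("my_table_name")

    if not results['success']:
        runner.handle_failure(results, action="alert")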

βœ… Validate New Data (Every Batch)

from dq_framework import DataQualityValidator, ConfigLoader
import pandas as pd

# Load new data
df = pd.read_csv('new_data_batch.csv')

# Validate using config created once
config = ConfigLoader().load('config/my_validation.yml')
validator = DataQualityValidator(config_dict=config)
results = validator.validate(df)

print(f"Success: {results['success']}")

Key Principle: Profile once, validate forever!

πŸ” Universal Data Profiler

The framework includes a universal data profiler that works with any data source:

Supported Formats

  • βœ… CSV (auto-detects encoding)
  • βœ… Parquet
  • βœ… Excel (.xlsx, .xls)
  • βœ… JSON
  • βœ… Any pandas DataFrame

CLI Usage

# Basic profiling
python scripts/profile_data.py data/file.csv

# With all options
python scripts/profile_data.py data/file.csv \
    --output config/validation.yml \
    --name "my_validation" \
    --null-tolerance 5.0 \
    --severity high \
    --sample 100000

# Just profile (no config generation)
python scripts/profile_data.py data/file.csv --profile-only

See docs/PROFILING_WORKFLOW.md for the complete guide.

πŸš€ Quick Start (Local)

1. Setup (One Time)

Option A: Install as editable package (Recommended for development)

cd /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality
pip install -e .

Option B: Direct import (Add to Python path)

import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

Option C: Install dependencies only

pip install -r requirements.txt

2. Basic Usage

In ANY project (e.g., full_stack_hss):

from dq_framework import DataQualityValidator

# Load your data
import pandas as pd
df = pd.read_parquet('data/my_data.parquet')

# Create validator with config
validator = DataQualityValidator(
    config_path='/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality/config_templates/bronze_layer_template.yml'
)

# Validate
results = validator.validate(df)

# Check results
if results['success']:
    print("βœ… Data quality checks passed!")
else:
    print(f"❌ {results['failed_checks']} checks failed")

3. MS Fabric Usage

from dq_framework import FabricDataQualityRunner

# Initialize for Fabric
runner = FabricDataQualityRunner(
    config_path="Files/dq_configs/my_table_config.yml"
)

# Validate Delta table
results = runner.validate_delta_table("my_table_name")

# Handle results
if not results['success']:
    runner.handle_failure(results, action="alert")

πŸ”§ Integration Guide

Using in full_stack_hss project:

# In full_stack_hss/src/transform/transform.py
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

from dq_framework import DataQualityValidator

def transform_with_quality_checks():
    # Your existing code
    df = load_data()
    
    # Add DQ check
    validator = DataQualityValidator(
        config_path='../../fabric_data_quality/examples/hss_incidents_example.yml'
    )
    results = validator.validate(df)
    
    if not results['success']:
        logger.warning(f"Data quality issues detected: {results['summary']}")
    
    # Continue transformation
    df_transformed = transform(df)
    return df_transformed

Using in AIMS_LOCAL project:

# In AIMS_LOCAL/src/your_script.py
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

import pandas as pd
from dq_framework import DataQualityValidator

# Use AIMS-specific config
validator = DataQualityValidator(
    config_path='../../fabric_data_quality/examples/aims_data_example.yml'
)

df_aims = pd.read_parquet('data/aims_data.parquet')
results = validator.validate(df_aims)

Using in ACA_COMMERCIAL project:

# In ACA_COMMERCIAL notebooks
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

from dq_framework import DataQualityValidator

validator = DataQualityValidator(
    config_path='../../fabric_data_quality/examples/aca_commercial_example.yml'
)

results = validator.validate(df_commercial)

πŸ“ Creating Custom Configurations

Step 1: Copy a template

cp config_templates/bronze_layer_template.yml my_configs/my_data_checks.yml

Step 2: Customize expectations

# my_configs/my_data_checks.yml
data_source:
  name: "my_data_source"
  description: "My custom data validation"

expectations:
  - expectation_type: "expect_column_values_to_not_be_null"
    kwargs:
      column: "id"
    meta:
      severity: "critical"
  
  - expectation_type: "expect_column_values_to_be_unique"
    kwargs:
      column: "email"

Step 3: Use in your project

validator = DataQualityValidator(
    config_path='path/to/my_data_checks.yml'
)
results = validator.validate(df)

🎨 Configuration Templates Available

1. Bronze Layer Template

  • Basic schema validation
  • Row count checks
  • Null checks for critical columns
  • Use for: Raw data landing validation

2. Silver Layer Template

  • Data type validation
  • Format validation (emails, dates, etc.)
  • Range checks
  • Use for: Cleaned/transformed data

3. Gold Layer Template

  • Business rule validation
  • Aggregation checks
  • Cross-column validation
  • Use for: Final business-ready data

4. Custom Template

  • Blank template with examples
  • Use for: Your specific needs
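
Any template can also be expressed inline as a plain dictionary and passed via config_dict (a minimal sketch; the expectation types are standard Great Expectations names, the columns and values are placeholders):

from dq_framework import DataQualityValidator

inline_config = {
    "data_source": {"name": "orders_silver", "description": "Inline silver-layer checks"},
    "expectations": [
        {  # range check, typical of the silver layer template
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 1_000_000},
            "meta": {"severity": "high"},
        },
        {  # format check on email addresses
            "expectation_type": "expect_column_values_to_match_regex",
            "kwargs": {"column": "email", "regex": r"[^@]+@[^@]+\.[^@]+"},
            "meta": {"severity": "medium"},
        },
    ],
}

validator = DataQualityValidator(config_dict=inline_config)
results = validator.validate(df)  # df is any pandas DataFrame you have loaded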

πŸ“Š Features

βœ… Configurable - YAML-based rules, no code changes needed
βœ… Reusable - One framework, multiple projects
βœ… Fabric-Native - Works with Spark DataFrames and Delta tables
βœ… Flexible - Severity levels, sampling, smart error handling
βœ… Documented - Examples for each project type

πŸ“š Documentation

For MS Fabric Users

  • docs/FABRIC_QUICK_START.md - 5-minute Fabric setup
  • docs/FABRIC_INTEGRATION.md
  • docs/FABRIC_ETL_INTEGRATION.md - Complete ETL integration guide

For All Users

  • docs/CONFIGURATION_GUIDE.md
  • docs/PROFILING_WORKFLOW.md - Profiling guide

πŸ§ͺ Testing

# Run all tests
pytest tests/

πŸ”„ Version History

  • v1.2.0 (2025-12-06) - Added configurable global thresholds and ABFSS support
  • v1.1.0 (2025-10-28) - Added MS Fabric ETL integration guides
  • v1.0.0 (2025-10-28) - Initial standalone framework with universal profiler

πŸ’» Code Example: Using Thresholds

from dq_framework import DataQualityValidator

# Initialize
validator = DataQualityValidator("config/my_checks.yml")

# Validate with a custom 95% threshold
# (Overrides config file settings if provided)
result = validator.validate(df, threshold=95.0)

if result['success']:
    print(f"Passed! Score: {result['success_rate']}%")
else:
    print(f"Failed. Score: {result['success_rate']}%")

🀝 Contributing

To add new features or templates:

  1. Create your feature in dq_framework/
  2. Add tests in tests/
  3. Add examples in examples/
  4. Update this README

πŸ“ž Support

For questions or issues:

  • Check examples in examples/
  • Review configuration templates in config_templates/
  • See detailed docs in docs/

Framework Owner: Data Engineering Team
Last Updated: December 2025
Status: Production Ready
