A reusable, configurable data quality framework using Great Expectations, designed for Microsoft Fabric environments and usable across all your HS2 projects.
- ABFSS Support: Added support for loading configuration files directly from `abfss://` paths in Microsoft Fabric.
- Configurable Thresholds: Set custom pass/fail thresholds (e.g., 95%) instead of a hardcoded 100%.
- Enhanced Reporting: Validation results now include threshold details.
- Local Python: use local filesystem paths (e.g., `config/my_table.yml`, `/tmp/data.parquet`).
- Microsoft Fabric: `abfss://...` and `Files/...` paths require Fabric utilities (`mssparkutils`). Outside Fabric these paths are not supported and will fail fast (see the sketch below).
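For code that needs to run in both environments, here is a minimal sketch of choosing a config path at runtime (the `abfss://` URL and the Fabric detection via `notebookutils` are illustrative assumptions, not framework features):

```python
from dq_framework import DataQualityValidator

# Outside Fabric, abfss:// and Files/ paths fail fast, so fall back to a local file.
try:
    import notebookutils  # assumption: only importable inside Microsoft Fabric notebooks
    config_path = "abfss://Workspace@onelake.dfs.fabric.microsoft.com/Lakehouse.Lakehouse/Files/dq_configs/my_table.yml"  # illustrative
except ImportError:
    config_path = "config/my_table.yml"  # local filesystem path

validator = DataQualityValidator(config_path=config_path)
```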
This standalone framework provides data quality validation capabilities that can be used by:
- `full_stack_hss` - Incident data analysis
- `AIMS_LOCAL` - AIMS data processing
- `ACA_COMMERCIAL` - Commercial data pipelines
- Any other project in your workspace
```
fabric_data_quality/
├── README.md                        # This file
├── requirements.txt                 # Dependencies
├── setup.py                         # Installation script
│
├── dq_framework/                    # Core framework package
│   ├── __init__.py                  # Package exports
│   ├── validator.py                 # Core validation engine
│   ├── fabric_connector.py          # MS Fabric integration
│   ├── config_loader.py             # YAML configuration loader
│   ├── data_profiler.py             # Data profiling engine
│   ├── batch_profiler.py            # Parallel batch profiling
│   ├── loader.py                    # Robust data loading (local/ABFSS)
│   └── utils.py                     # File system utilities
│
├── scripts/                         # Utility scripts
│   ├── profile_data.py              # Data profiling CLI tool
│   └── activate_and_test.sh         # Environment setup & test
│
├── config_templates/                # Reusable YAML templates
│   ├── bronze_layer_template.yml    # Raw data validation
│   ├── silver_layer_template.yml    # Cleaned data validation
│   ├── gold_layer_template.yml      # Business logic validation
│   └── custom_template.yml          # Blank template
│
├── examples/                        # Project-specific examples
│   ├── hss_incidents_example.yml    # HSS project
│   ├── aims_data_example.yml        # AIMS project
│   ├── aca_commercial_example.yml   # ACA project
│   └── usage_examples.py            # Code examples
│
├── tests/                           # Unit tests
│   ├── test_validator.py
│   └── test_config_loader.py
│
└── docs/                            # Documentation
    ├── CONFIGURATION_GUIDE.md
    ├── FABRIC_INTEGRATION.md
    ├── FABRIC_ETL_INTEGRATION.md    # Complete ETL integration guide
    ├── FABRIC_QUICK_START.md        # 5-minute Fabric setup
    ├── PROFILING_WORKFLOW.md        # Profiling guide
    └── ...                          # Other documentation
```
```bash
# Activate the conda environment
conda activate fabric-dq

# Verify installation
python -c "from dq_framework import DataProfiler; print('✅ Ready')"
```

```bash
# Profile any data source (CSV, Parquet, Excel, JSON)
python scripts/profile_data.py path/to/your/data.csv \
    --output config/my_validation.yml \
    --null-tolerance 10 \
    --severity medium

# Profile a directory of mixed files (Batch Mode)
# Use --workers to speed up processing with parallel execution
python scripts/profile_data.py path/to/data_folder/ --output config/ --workers 4

# Example: Profile CAUSEWAY data
python scripts/profile_data.py sample_source_data/CAUSEWAY_combined_scr_2024.csv \
    --output config/causeway_validation.yml \
    --sample 50000
```

- Parallel Processing: Use `--workers N` to profile multiple files simultaneously.
- Smart Sampling: Automatically limits large files (>500MB) to 100k rows to prevent memory crashes.
- Efficient Parquet Reading: Uses `pyarrow` batch reading for massive parquet files.
- Batch Mode: Point to a directory to profile all supported files (CSV, Parquet, Excel, JSON) at once.
- Fabric Native Support: Works directly with `abfss://` paths in MS Fabric environments.
Review and add business rules to `config/my_validation.yml`:

```yaml
expectations:
  # Auto-generated rules
  - expectation_type: expect_column_to_exist
    kwargs: {column: customer_id}

  # Add your business rules
  - expectation_type: expect_column_values_to_be_unique
    kwargs: {column: customer_id}
    meta: {severity: critical, description: "Customer IDs must be unique"}
```

To use this framework as a standard library in Fabric:
- Build the Wheel:

  ```bash
  python -m build --wheel --no-isolation
  ```

  This creates `dist/fabric_data_quality-1.2.0-py3-none-any.whl`.

- Upload to Fabric:
  - Go to your Fabric Workspace -> Manage environments.
  - Select your environment -> Custom libraries -> Upload.
  - Select the `.whl` file.
  - Publish the environment.

- Use in Notebooks:
```python
from dq_framework import FabricDataQualityRunner
from dq_framework import DataQualityValidator, ConfigLoader
import pandas as pd

# Load new data
df = pd.read_csv('new_data_batch.csv')

# Validate using config created once
config = ConfigLoader().load('config/my_validation.yml')
validator = DataQualityValidator(config_dict=config)
results = validator.validate(df)
print(f"Success: {results['success']}")
```

Key Principle: Profile once, validate forever!
The framework includes a universal data profiler that works with any data source:
- ✅ CSV (auto-detects encoding)
- ✅ Parquet
- ✅ Excel (.xlsx, .xls)
- ✅ JSON
- ✅ Any pandas DataFrame (see the sketch below)
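For a DataFrame that is already in memory, something along these lines should be possible with `DataProfiler`; the `profile` and `to_config` method names below are hypothetical placeholders, so check `dq_framework/data_profiler.py` (or use the CLI shown next) for the real interface:

```python
import pandas as pd
from dq_framework import DataProfiler

df = pd.read_parquet("data/my_data.parquet")

profiler = DataProfiler()
profile = profiler.profile(df)                            # hypothetical method name
profiler.to_config(profile, "config/my_validation.yml")   # hypothetical method name
```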
```bash
# Basic profiling
python scripts/profile_data.py data/file.csv

# With all options
python scripts/profile_data.py data/file.csv \
    --output config/validation.yml \
    --name "my_validation" \
    --null-tolerance 5.0 \
    --severity high \
    --sample 100000

# Just profile (no config generation)
python scripts/profile_data.py data/file.csv --profile-only
```

See PROFILING_WORKFLOW.md for the complete guide.
```bash
# Install as an editable package
cd /home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality
pip install -e .
```

```python
# Or add the framework to your Python path
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')
```

```bash
# Install dependencies
pip install -r requirements.txt
```

```python
from dq_framework import DataQualityValidator
import pandas as pd

# Load your data
df = pd.read_parquet('data/my_data.parquet')

# Create validator with config
validator = DataQualityValidator(
    config_path='/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality/config_templates/bronze_layer_template.yml'
)

# Validate
results = validator.validate(df)

# Check results
if results['success']:
    print("✅ Data quality checks passed!")
else:
    print(f"❌ {results['failed_checks']} checks failed")
```

```python
from dq_framework import FabricDataQualityRunner

# Initialize for Fabric
runner = FabricDataQualityRunner(
    config_path="Files/dq_configs/my_table_config.yml"
)

# Validate Delta table
results = runner.validate_delta_table("my_table_name")

# Handle results
if not results['success']:
    runner.handle_failure(results, action="alert")
```

```python
# In full_stack_hss/src/transform/transform.py
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

from dq_framework import DataQualityValidator

def transform_with_quality_checks():
    # Your existing code (load_data, transform and logger come from this module)
    df = load_data()

    # Add DQ check
    validator = DataQualityValidator(
        config_path='../../fabric_data_quality/examples/hss_incidents_example.yml'
    )
    results = validator.validate(df)

    if not results['success']:
        logger.warning(f"Data quality issues detected: {results['summary']}")

    # Continue transformation
    df_transformed = transform(df)
    return df_transformed
```

```python
# In AIMS_LOCAL/src/your_script.py
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

from dq_framework import DataQualityValidator
import pandas as pd

# Use AIMS-specific config
validator = DataQualityValidator(
    config_path='../../fabric_data_quality/examples/aims_data_example.yml'
)

df_aims = pd.read_parquet('data/aims_data.parquet')
results = validator.validate(df_aims)
```

```python
# In ACA_COMMERCIAL notebooks (df_commercial is loaded earlier in the notebook)
import sys
sys.path.append('/home/sanmi/Documents/HS2/HS2_PROJECTS_2025/fabric_data_quality')

from dq_framework import DataQualityValidator

validator = DataQualityValidator(
    config_path='../../fabric_data_quality/examples/aca_commercial_example.yml'
)
results = validator.validate(df_commercial)
```

```bash
# Start from a template
cp config_templates/bronze_layer_template.yml my_configs/my_data_checks.yml
```

```yaml
# my_configs/my_data_checks.yml
data_source:
  name: "my_data_source"
  description: "My custom data validation"

expectations:
  - expectation_type: "expect_column_values_to_not_be_null"
    kwargs:
      column: "id"
    meta:
      severity: "critical"

  - expectation_type: "expect_column_values_to_be_unique"
    kwargs:
      column: "email"
```

```python
validator = DataQualityValidator(
    config_path='path/to/my_data_checks.yml'
)
results = validator.validate(df)
```

bronze_layer_template.yml:
- Basic schema validation
- Row count checks
- Null checks for critical columns
- Use for: Raw data landing validation

silver_layer_template.yml:
- Data type validation
- Format validation (emails, dates, etc.)
- Range checks
- Use for: Cleaned/transformed data

gold_layer_template.yml:
- Business rule validation
- Aggregation checks
- Cross-column validation (see the sketch after this list)
- Use for: Final business-ready data

custom_template.yml:
- Blank template with examples
- Use for: Your specific needs
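To make the silver- and gold-layer bullets concrete, the sketch below expresses a few such rules as a config dict passed via `config_dict` (the expectation names are standard Great Expectations ones; whether a given template ships with them is an assumption, and the column names are invented):

```python
from dq_framework import DataQualityValidator

config = {
    "data_source": {"name": "orders_gold", "description": "Illustrative silver/gold rules"},
    "expectations": [
        # Range check (silver-style)
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "amount", "min_value": 0},
         "meta": {"severity": "high"}},
        # Format check (silver-style)
        {"expectation_type": "expect_column_values_to_match_regex",
         "kwargs": {"column": "email", "regex": r"[^@]+@[^@]+\.[^@]+"},
         "meta": {"severity": "medium"}},
        # Cross-column check (gold-style)
        {"expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
         "kwargs": {"column_A": "invoice_total", "column_B": "amount_paid", "or_equal": True},
         "meta": {"severity": "critical"}},
    ],
}

validator = DataQualityValidator(config_dict=config)
```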
✅ Configurable - YAML-based rules, no code changes needed
✅ Reusable - One framework, multiple projects
✅ Fabric-Native - Works with Spark DataFrames and Delta tables (see the sketch after this list)
✅ Flexible - Severity levels, sampling, smart error handling
✅ Documented - Examples for each project type
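For the Fabric-Native point above, one possible pattern in a Fabric notebook is to validate a Spark DataFrame by converting it to pandas first. This is only a sketch, assuming the validator consumes pandas DataFrames as in the examples above; the table name is illustrative and `spark` is the notebook's ambient SparkSession (FABRIC_INTEGRATION.md covers the full PySpark patterns):

```python
from dq_framework import DataQualityValidator

# Read a Delta table from the attached lakehouse (table name is illustrative)
spark_df = spark.read.table("my_table_name")

validator = DataQualityValidator(config_path="Files/dq_configs/my_table_config.yml")

# Convert to pandas before validating (assumption: the validator expects pandas)
results = validator.validate(spark_df.toPandas())
```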
- Installation Guide - Setup instructions
- Configuration Guide - How to create configs
- Fabric Integration - MS Fabric specific guidance
- Examples - Code examples
- FABRIC_QUICK_START.md - 5-minute setup guide for MS Fabric
- FABRIC_ETL_INTEGRATION.md - Complete ETL pipeline integration
- fabric_etl_example.py - Copy-paste Fabric notebook code
- PROFILING_WORKFLOW.md - "Profile once, validate forever" workflow
- CONFIGURATION_GUIDE.md - YAML configuration reference
- FABRIC_INTEGRATION.md - PySpark integration patterns
- QUICK_REFERENCE.md - One-page API cheat sheet
```bash
# Run all tests
pytest tests/
```

- v1.1.3 (2025-12-06) - Added configurable global thresholds and ABFSS support
- v1.1.0 (2025-10-28) - Added MS Fabric ETL integration guides
- v1.0.0 (2025-10-28) - Initial standalone framework with universal profiler
```python
from dq_framework import DataQualityValidator

# Initialize
validator = DataQualityValidator("config/my_checks.yml")

# Validate with a custom 95% threshold
# (overrides config file settings if provided)
result = validator.validate(df, threshold=95.0)

if result['success']:
    print(f"Passed! Score: {result['success_rate']}%")
else:
    print(f"Failed. Score: {result['success_rate']}%")
```

To add new features or templates:
- Create your feature in `dq_framework/`
- Add tests in `tests/`
- Add examples in `examples/`
- Update this README
For questions or issues:
- Check examples in `examples/`
- Review configuration templates in `config_templates/`
- See detailed docs in `docs/`
Framework Owner: Data Engineering Team
Last Updated: October 2025
Status: Production Ready