# Derived Variable Engine

A config-driven composite & index builder for analytics workflows.
The Derived Variable Engine is a modular, configuration-driven transformation engine designed to build composite metrics, indices, and derived KPIs from structured datasets.
It supports:
- Multiple aggregation strategies (mean, sum, weighted mean, etc.)
- Special code handling
- Minimum valid response thresholds
- Optional governance validation (scale enforcement)
- Config-controlled fallback behavior
- Execution reporting with JSON audit logs
The architecture intentionally separates:
- Computation logic
- Fallback logic
- Governance validation
- Configuration validation
This keeps the system extensible, auditable, and production-friendly.
## Table of Contents

- Architecture
- Features
- Project Structure
- Configuration
- Execution Flow
- Validation Layers Explained
- Edge Case Testing
- Requirements
- License
## Architecture

```text
Config (YAML)
        ↓
Config Validation (engine.py)
        ↓
Optional Governance Layer (validation.py)
        ↓
Aggregation Registry (aggregations.py)
        ↓
Fallback Registry (fallback.py)
        ↓
Output Dataset + JSON Report
```
Each layer has a clearly defined responsibility.
## Features

### Aggregation Strategies

- `mean`
- `sum`
- `median`
- `min`
- `max`
- `std`
- `count_valid`
- `weighted_mean`

All aggregations are registry-based and easily extensible.
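As an illustration of the registry pattern (the names below are hypothetical, not the actual `src/aggregations.py` API), a new strategy can be registered without touching the engine:

```python
import numpy as np

# Hypothetical registry mirroring a few of the built-in strategies above.
AGGREGATIONS = {
    "mean": lambda values, **kw: float(np.mean(values)),
    "sum": lambda values, **kw: float(np.sum(values)),
    "median": lambda values, **kw: float(np.median(values)),
    "weighted_mean": lambda values, weights=None, **kw: float(
        np.average(values, weights=weights)
    ),
}

def register(name):
    """Decorator that adds a new aggregation strategy to the registry."""
    def wrap(fn):
        AGGREGATIONS[name] = fn
        return fn
    return wrap

# Extending the engine is one registration away:
@register("range")
def value_range(values, **kw):
    return float(np.max(values) - np.min(values))
```

Because the engine looks strategies up by name, a config entry like `aggregation: range` would resolve to the new function with no engine changes.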
### Special Code Handling

Special values (e.g., `-98`, `-99`) are excluded from aggregation and handled via configurable fallback strategies:

- `nan_if_no_valid`
- `propagate_special`
- Configurable multi-special fallback values
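A minimal sketch of the fallback semantics for a single row, assuming a simple mean aggregation (function name and signature are illustrative, not the actual `src/fallback.py` API):

```python
import math

def apply_fallback(row_values, special_codes, strategy, multi_special_fallback=None):
    """Drop special codes; if no valid values remain, apply the fallback."""
    valid = [v for v in row_values if v not in special_codes]
    if valid:
        return sum(valid) / len(valid)       # aggregation proceeds normally
    specials = {v for v in row_values if v in special_codes}
    if strategy == "nan_if_no_valid":
        return math.nan                       # no valid answers -> NaN
    if strategy == "propagate_special":
        if len(specials) == 1:
            return specials.pop()             # a single code propagates as-is
        return multi_special_fallback         # mixed codes -> configured value
    raise ValueError(f"unknown fallback strategy: {strategy}")
```

For example, a row of `[-98, -99]` under `propagate_special` with `multi_special_fallback: -98` would yield `-98`.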
### Minimum Valid Response Thresholds

Control the proportion of valid responses required before computing a derived variable.

Example:

```yaml
min_valid_ratio: 0.5
```

### Governance Validation

Enable strict scale validation:

```yaml
enable_validation: true
```
Validation Checks:
- Numeric enforcement
- Scale bounds (scale_min, scale_max)
- Special code exclusion from scale validation
### Execution Reporting

Each run generates:
- Execution time
- Total rows processed
- Derived variables created
- Valid vs invalid row counts per variable
- JSON audit report
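An audit report covering these fields might look like the following (field names and values are illustrative only; the actual schema is defined by the engine):

```json
{
  "execution_time_seconds": 0.08,
  "rows_processed": 1000,
  "derived_variables": {
    "SATIS": {"valid_rows": 940, "invalid_rows": 60}
  }
}
```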
## Project Structure

```text
derived-variable-engine/
│
├── src/
│   ├── main.py
│   ├── engine.py
│   ├── aggregations.py
│   ├── fallback.py
│   └── validation.py
│
├── configs/
│   └── derived_config.yaml
│
├── data/
│   ├── sample_input.csv
│   └── sample_input_edge_case.csv
│
├── outputs/
│   ├── derived_output.csv
│   └── derived_output_edge_case.csv
│
├── logs/
│   └── derived_report.json
│
├── requirements.txt
├── README.md
└── License
```
## Configuration

Configuration is YAML-driven.

Example:

```yaml
enable_validation: true

derived_variables:
  - name: SATIS
    source_columns:
      - q100_1
      - q100_2
      - q100_3
      - q100_4
    aggregation: mean
    special_codes: [-98, -99]
    fallback_strategy: propagate_special
    multi_special_fallback: -98
    scale_min: 1
    scale_max: 5
    min_valid_ratio: 0.5
```
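For intuition, the `SATIS` entry above can be evaluated roughly like this (a pandas sketch with made-up data, not the engine's actual code; it omits the fallback step):

```python
import pandas as pd

# Toy input matching the source_columns in the example config.
df = pd.DataFrame({
    "q100_1": [4, 5, -98],
    "q100_2": [3, -99, -98],
    "q100_3": [5, 4, -98],
    "q100_4": [4, 4, -98],
})
special_codes = [-98, -99]

src = df[["q100_1", "q100_2", "q100_3", "q100_4"]]
valid = src.where(~src.isin(special_codes))            # special codes become NaN
ratio = valid.notna().mean(axis=1)                     # per-row valid-response ratio
df["SATIS"] = valid.mean(axis=1).where(ratio >= 0.5)   # enforce min_valid_ratio
```

Row 3 is all special codes, so it fails the threshold and is left as NaN here; in the engine, the configured `propagate_special` fallback would then assign `-98`.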
No code changes are required to:
- Add new derived variables
- Change aggregation strategy
- Modify fallback logic
- Adjust governance strictness
## Execution Flow

1. Load dataset (CSV)
2. Load YAML configuration
3. Validate configuration structure
4. Optionally run governance validation
5. Apply aggregation registry
6. Apply fallback registry
7. Save derived dataset
8. Generate execution report
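The flow above can be sketched end-to-end as a toy driver (function names, report fields, and the mean-only aggregation are assumptions for illustration, not the project's API):

```python
import json
import time
import pandas as pd

def run_pipeline(df, config):
    """Toy version of the flow: derive variables, then emit a JSON report."""
    start = time.time()
    report = {"rows": len(df), "derived": []}
    for var in config["derived_variables"]:
        src = df[var["source_columns"]]
        valid = src.where(~src.isin(var["special_codes"]))  # drop special codes
        df[var["name"]] = valid.mean(axis=1)                # 'mean' strategy only
        report["derived"].append(var["name"])
    report["execution_seconds"] = round(time.time() - start, 4)
    return df, json.dumps(report)

df = pd.DataFrame({"q1": [1, 2], "q2": [3, -99]})
config = {"derived_variables": [
    {"name": "D1", "source_columns": ["q1", "q2"], "special_codes": [-99]}
]}
out, report_json = run_pipeline(df, config)
```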
## Validation Layers Explained

### 1️⃣ Configuration Validation (engine.py)
Ensures:
- No duplicate derived variables
- No overwriting existing columns
- Aggregation exists
- Source columns exist
- Weight lengths match (for weighted mean)
Stops execution if invalid.
### 2️⃣ Governance Validation (validation.py)

Triggered via config flag.
Ensures:
- Numeric data types
- Scale boundaries respected
- Special codes excluded from scale checks
Stops execution on scale violations.
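A sketch of what such a scale check might look like (a simplified stand-in, assuming pandas; the actual `validation.py` logic may differ):

```python
import pandas as pd

def validate_scale(series, scale_min, scale_max, special_codes):
    """Non-special values must be numeric and inside [scale_min, scale_max]."""
    values = pd.to_numeric(series, errors="coerce")   # non-numeric -> NaN
    in_scope = ~values.isin(special_codes)            # exclude special codes
    bad = in_scope & (values.isna() | (values < scale_min) | (values > scale_max))
    if bad.any():
        raise ValueError(f"scale violation at rows {list(series.index[bad])}")
```

Raising on the first violating column is what makes this layer "stop execution"; special codes pass through untouched.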
### 3️⃣ Minimum Valid Response Threshold (engine.py)

Inside the engine, this layer:
- Counts valid responses
- Enforces min_valid_ratio
- Determines whether fallback applies
Does not stop execution — controls derived output behavior.
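A boundary illustration of the threshold check (values chosen for the example; `min_valid_ratio` semantics as described above):

```python
import pandas as pd

row = pd.Series([4, -98, -99, 5])
special_codes = [-98, -99]

valid_ratio = (~row.isin(special_codes)).mean()   # 2 of 4 valid -> 0.5
meets_threshold = valid_ratio >= 0.5              # exactly at the boundary
```

A row sitting exactly at `min_valid_ratio` is computed rather than sent to fallback, assuming the comparison is inclusive.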
## Edge Case Testing

The edge-case dataset covers:
- Fully valid rows
- All special code rows
- Mixed special code rows
- Threshold boundary rows
- Weighted mean edge cases
- Below-threshold cases
- Multi-special fallback cases
Edge case test files:
- sample_input_edge_case.csv
- derived_output_edge_case.csv
## Requirements

```text
pandas>=1.5
numpy>=1.23
PyYAML>=6.0
```
## License

MIT License

Copyright (c) 2026