A production-grade PostgreSQL data warehouse implementing the Medallion Architecture (Bronze → Silver → Gold) with comprehensive logging, data lineage tracking, and enterprise-grade ETL capabilities.
This project demonstrates a complete data warehouse solution built on PostgreSQL with a focus on:
- Medallion Architecture: Bronze (raw) → Silver (cleansed) → Gold (analytics) data layers
- Centralized Configuration: Environment-based configuration management via `.env`
- Comprehensive Logging: Process tracking, error handling, performance monitoring, and data lineage
- Modular SQL Generation: Reusable SQL builders and templates
- Type Safety: SQLAlchemy ORM for database operations
- Enterprise Patterns: Audit trails, data quality checks, and impact analysis
Phase 1 Complete - Infrastructure & Logging Foundation
- ✅ Database setup infrastructure (database, schemas, logging tables)
- ✅ Comprehensive logging system with 4 specialized modules
- ✅ Centralized ORM models in the `models/` package
- ✅ Complete test suite with 207 passing tests
- ✅ No circular dependencies - clean architecture
- ✅ Production-ready error handling and recovery mechanisms
- 🚧 Next Phase: Bronze layer implementation
For details, see PHASE1_COMPLETE.md and REFACTORING_PLAN.md.
- Current Status
- Architecture
- Quick Start
- Project Structure
- Core Components
- Models Package
- Setup Components
- SQL Utilities
- Logging Infrastructure
- Testing
- Configuration
- Usage Examples
- Development
- License
┌─────────────────┐
│ Data Sources │ (CRM, ERP Systems)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Bronze Layer │ Raw, unprocessed data
│ (bronze.*) │ - Exact copy of source
└────────┬────────┘ - Append-only
│ - Full history
▼
┌─────────────────┐
│ Silver Layer │ Cleansed, conformed data
│ (silver.*) │ - Data validation
└────────┬────────┘ - Deduplication
│ - Standardization
▼
┌─────────────────┐
│ Gold Layer │ Business-ready analytics
│ (gold.*) │ - Aggregations
└─────────────────┘ - KPIs & metrics
- Optimized for BI tools
┌─────────────────┐
│ Logs Layer │ Audit & monitoring
│ (logs.*) │ - Process tracking
└─────────────────┘ - Error logging
- Performance metrics
- Data lineage
- Database: PostgreSQL 13+
- ORM: SQLAlchemy 2.0+
- Language: Python 3.8+
- Configuration: python-dotenv
- Monitoring: psutil (system metrics)
# PostgreSQL 13 or higher
psql --version
# Python 3.8 or higher
python --version
- Clone the repository
git clone <repository-url>
cd SQL-Data-Warehouse-Analytics-Project
- Install dependencies
pip install -r requirements.txt
- Configure environment
# Copy and edit .env file
cp .env.example .env
# Edit .env with your PostgreSQL credentials
- Run setup
# Complete warehouse setup
python -m setup.setup_orchestrator
# Or setup without sample tables
python -m setup.setup_orchestrator --no-samples
Or run the setup from Python:
from setup import SetupOrchestrator
orchestrator = SetupOrchestrator()
results = orchestrator.run_complete_setup()
if all(results.values()):
print("✅ Setup completed successfully!")SQL-Data-Warehouse-Analytics-Project/
├── core/ # Core infrastructure
│ ├── __init__.py # Package exports
│ ├── config.py # Configuration management
│ └── logger.py # Centralized logging
│
├── datasets/ # Data storage
│ ├── source_crm/ # CRM source data
│ └── source_erp/ # ERP source data
│
├── logs/ # Logging infrastructure ✅ COMPLETE
│ ├── __init__.py # Package exports
│ ├── audit_logger.py # Process & config logging (30 tests)
│ ├── data_lineage.py # Lineage tracking & analysis (26 tests)
│ ├── error_handler.py # Error logging & recovery (35 tests)
│ └── performance_monitor.py # Performance metrics (28 tests)
│
├── medallion/ # Data layers
│ ├── bronze/ # Raw data layer 🚧 NEXT PHASE
│ ├── silver/ # Cleansed data layer 🚧 FUTURE
│ └── gold/ # Analytics layer 🚧 FUTURE
│
├── models/ # ORM Models ✅ NEW IN PHASE 1
│ ├── __init__.py # Centralized model exports
│ └── logs_models.py # All logging ORM definitions
│
├── setup/ # Database setup ✅ COMPLETE
│ ├── __init__.py # Package exports
│ ├── create_database.py # Database creation (21 tests)
│ ├── create_schemas.py # Schema creation (45 tests)
│ ├── create_logs.py # Logging tables (22 tests)
│ └── setup_orchestrator.py # Setup coordination
│
├── sql/ # SQL utilities
│ ├── __init__.py # Package exports
│ ├── ddl.py # Data Definition Language
│ ├── dml.py # Data Manipulation Language
│ ├── query_builder.py # Query builders
│ └── common_queries.py # Common patterns
│
├── tests/ # Test suite ✅ 207 PASSING TESTS
│ ├── conftest.py # Root test configuration
│ ├── tests_logs/ # Logging module tests (119 tests)
│ │ ├── conftest.py # Shared fixtures
│ │ ├── test_audit_logger.py
│ │ ├── test_data_lineage.py
│ │ ├── test_error_handler.py
│ │ └── test_performance_monitor.py
│ └── tests_setup/ # Setup module tests (88 tests)
│ ├── conftest.py # Shared fixtures
│ ├── test_create_database.py
│ ├── test_create_schemas.py
│ └── test_create_logs.py
│
├── utils/ # Utility functions
│ ├── __init__.py # Package exports
│ └── database_utils.py # Database utilities
│
├── .env # Environment configuration (not in git)
├── .gitignore # Git ignore rules
├── LICENSE # MIT License
├── PHASE1_COMPLETE.md # Phase 1 completion documentation
├── README.md # This file
├── REFACTORING_PLAN.md # Technical refactoring details
├── pytest.ini # Pytest configuration
└── requirements.txt # Python dependencies (frozen)
Configuration Management
A centralized configuration system that loads settings from the `.env` file:
from core.config import config
# Database connection
conn_string = config.get_connection_string(use_warehouse=True)
print(f"Connecting to {config.db_host}:{config.db_port}")
# Project paths
data_dir = config.project.data_dir
logs_dir = config.project.logs_dir
Key Features:
- Single source of truth for all settings
- Type-safe configuration classes
- Environment variable validation
- Connection string generation
Centralized Logging
Unified logging configuration for the entire application:
from core.logger import get_logger
logger = get_logger(__name__)
logger.info("Processing started")
logger.error("An error occurred", exc_info=True)Key Features:
- Consistent log formatting
- Module-specific loggers
- File and console output
- Exception tracking
Centralized ORM Definitions ✅ NEW IN PHASE 1
All SQLAlchemy ORM models for the logging infrastructure, eliminating circular dependencies:
from models.logs_models import (
ProcessLog, ErrorLog, PerformanceMetric,
DataLineage, ConfigurationLog
)
# All models accessible from one location
# No circular imports between setup and logs modules
Key Features:
- Single source of truth for all ORM definitions
- No circular dependencies - imported by setup and logs modules
- Clean architecture - separation of concerns
- Type safety - SQLAlchemy type hints
- Centralized Base - shared metadata for all tables
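A minimal sketch of the shared-Base pattern (illustrative and abbreviated; column names follow the model descriptions below, and the exact declarations may differ):
```python
# Sketch of models/logs_models.py: one declarative Base shared by all tables.
# setup and logs both import models from here, so neither imports the other.
from sqlalchemy import Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    """Single Base: all logging tables register on its shared metadata."""

class ProcessLog(Base):
    __tablename__ = "process_log"
    __table_args__ = {"schema": "logs"}

    process_id: Mapped[int] = mapped_column(Integer, primary_key=True)
    process_name: Mapped[str] = mapped_column(String(255))
    status: Mapped[str] = mapped_column(String(50))
```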
ORM Models:
- `ProcessLog` - Tracks ETL process execution
  - process_id (PK), process_name, layer, status
  - start_time, end_time, duration_seconds
  - rows_processed, error_message, metadata
- `ErrorLog` - Comprehensive error tracking
  - error_id (PK), process_id (FK), error_type
  - severity, error_message, stack_trace
  - resolution_status, resolved_at, resolution_notes
- `PerformanceMetric` - System and process metrics
  - metric_id (PK), process_id (FK), metric_name
  - metric_value, metric_unit, tags
  - dimension, recorded_at
- `DataLineage` - End-to-end lineage tracking
  - lineage_id (PK), process_id (FK)
  - source_schema, source_table, source_column
  - target_schema, target_table, target_column
  - transformation_logic, lineage_level
- `ConfigurationLog` - Configuration change audit
  - config_id (PK), config_key, config_value
  - previous_value, changed_by, changed_at
Integration:
- Imported by `setup.create_logs` for table creation
- Imported by all `logs.*` modules for data operations
- Exported via `models/__init__.py` for easy access
- Used in all 207 tests for mocking and validation
Database Creation Module
Creates the target PostgreSQL database with proper encoding and collation:
from setup import DatabaseCreator
creator = DatabaseCreator(
host='localhost',
user='postgres',
password='password',
admin_db='postgres',
target_db='warehouse'
)
if creator.create_database():
print("Database created successfully")Key Features:
- UTF-8 encoding with en_GB collation
- Connection termination for cleanup
- Database existence checking
- Uses SQL from the `sql.ddl` module
Integration:
- Called by `setup_orchestrator`
- Uses `sql.ddl` for SQL generation
- Connects to the admin database (not the target)
Schema Creation Module
Creates all medallion architecture schemas:
from setup import SchemaCreator
creator = SchemaCreator(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
results = creator.create_all_schemas()
# Creates: bronze, silver, gold, logs
Key Features:
- Four core schemas: bronze, silver, gold, logs
- SQLAlchemy-based operations
- Schema verification and inspection
- Idempotent execution
Integration:
- Called by `setup_orchestrator`
- Uses `sql.query_builder` for metadata queries
- Creates the foundation for `create_logs`
Logging Infrastructure Module
Creates comprehensive logging tables using SQLAlchemy ORM:
from setup import LoggingInfrastructure
logs = LoggingInfrastructure(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
results = logs.create_all_tables()
Tables Created:
- `logs.process_log` - ETL process execution tracking
- `logs.error_log` - Centralized error logging
- `logs.data_lineage` - Data flow tracking
- `logs.performance_metrics` - Performance monitoring
- `logs.configuration_log` - Configuration changes
Key Features:
- SQLAlchemy ORM models
- Foreign key relationships
- JSONB for flexible metadata
- PostgreSQL-specific types
Integration:
- Models used by the `logs/` package
- Called by `setup_orchestrator`
- Foundation for audit and monitoring
Setup Coordination Module
Orchestrates the complete warehouse setup process:
from setup import SetupOrchestrator
orchestrator = SetupOrchestrator()
# Complete setup
results = orchestrator.run_complete_setup(include_samples=True)
# Individual steps
orchestrator.create_database()
orchestrator.create_schemas()
orchestrator.create_logging_infrastructure()
orchestrator.create_sample_medallion_tables()
# Rollback
orchestrator.rollback_setup(keep_database=False)
Setup Sequence:
1. Create the target database
2. Create the medallion schemas (bronze/silver/gold/logs)
3. Create the logging infrastructure
4. Create sample tables (optional)
Key Features:
- Dependency management
- Process tracking via audit logs
- Comprehensive error handling
- Rollback capabilities
- Step timing and metrics
Integration:
- Uses `DatabaseCreator`
- Uses `SchemaCreator`
- Uses `LoggingInfrastructure`
- Uses `sql.ddl` for sample tables
- Tracks execution via `logs.audit_logger`
CLI Usage:
# Complete setup
python -m setup.setup_orchestrator
# Setup without samples
python -m setup.setup_orchestrator --no-samples
# Rollback keeping database
python -m setup.setup_orchestrator --rollback --keep-db
# Verbose output
python -m setup.setup_orchestrator --verbose
Data Definition Language Utilities
Functions for generating PostgreSQL DDL statements:
from sql.ddl import create_table, create_medallion_table_template
# Standard table
table_sql = create_table(
schema='bronze',
table='customer_data',
columns=[
{'name': 'id', 'type': 'SERIAL', 'constraints': ['PRIMARY KEY']},
{'name': 'name', 'type': 'VARCHAR(255)', 'constraints': ['NOT NULL']}
],
medallion_metadata=True,
comment='Raw customer data'
)
# Full medallion template
medallion_sql = create_medallion_table_template(
schema='silver',
table='customers',
business_columns=[
{'name': 'customer_id', 'type': 'INTEGER'},
{'name': 'customer_name', 'type': 'VARCHAR(255)'}
],
partition_by='created_at'
)
Key Functions:
- `create_schema()` - Schema creation
- `create_table()` - Table with medallion metadata
- `create_index()` - Performance indexes
- `create_constraint()` - Integrity constraints
- `create_medallion_table_template()` - Full medallion table
- `create_database_sql()` - Database creation
- `drop_database_sql()` - Database cleanup
- `terminate_connections_sql()` - Connection management
Medallion Metadata Columns:
- `created_at` - Record creation timestamp
- `updated_at` - Last update timestamp
- `created_by` - User/system that created the record
- `updated_by` - User/system that updated the record
- `source_system` - Source system identifier
- `batch_id` - ETL batch identifier
- `is_deleted` - Soft delete flag
- `row_hash` - Data integrity hash (see the sketch below)
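For illustration, `row_hash` could be a deterministic digest of the business columns, computed before insert. A minimal sketch, assuming an MD5-over-sorted-JSON scheme (the project's actual hashing logic is not shown in this README):
```python
import hashlib
import json

def compute_row_hash(row: dict) -> str:
    """Deterministic hash of a record's business columns, for change detection."""
    # Sorting keys makes the hash independent of column order
    payload = json.dumps({k: str(v) for k, v in row.items()}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# The same business data always yields the same hash
assert compute_row_hash({"customer_id": 42, "name": "Ada"}) == \
       compute_row_hash({"name": "Ada", "customer_id": 42})
```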
Integration:
- Used by `setup_orchestrator`
- Used by `create_logs`
- Foundation for all table creation
Data Manipulation Language Utilities
Functions for generating PostgreSQL DML statements:
from sql.dml import bulk_insert, upsert, merge_statement
# Bulk insert
insert_sql = bulk_insert(
schema='bronze',
table='customer_data',
columns=['customer_id', 'name', 'email'],
on_conflict='DO NOTHING'
)
# Upsert operation
upsert_sql = upsert(
schema='silver',
table='customers',
columns=['customer_id', 'name', 'email'],
key_columns=['customer_id'],
update_columns=['name', 'email']
)
# MERGE operation
merge_sql = merge_statement(
target_schema='silver',
target_table='customers',
source_query='SELECT * FROM bronze.crm_customers',
key_columns=['customer_id'],
insert_columns=['customer_id', 'name', 'email'],
update_columns=['name', 'email']
)
Key Functions:
- `bulk_insert()` - Efficient bulk inserts
- `upsert()` - INSERT ... ON CONFLICT
- `soft_delete()` - Soft delete operations
- `batch_update()` - Batch updates
- `merge_statement()` - MERGE-like operations with CTEs
- `incremental_load()` - Incremental data loading
- `generate_copy_statement()` - COPY for CSV imports
Integration:
- Used by ETL processes
- Supports medallion patterns
- Handles metadata columns automatically
Query Building Utilities
Low-level builders for constructing SQL queries:
from sql.query_builder import (
select_builder,
window_function_builder,
cte_builder
)
# SELECT query
query = select_builder(
schema='silver',
table='customers',
columns=['customer_id', 'name', 'email'],
where_conditions=['status = :status'],
order_by=['created_at DESC'],
limit=100
)
# Window function
window_sql = window_function_builder(
function_name='ROW_NUMBER',
partition_by=['customer_id'],
order_by=['created_at DESC'],
alias='row_num'
)
# Common Table Expression
cte_sql = cte_builder(
cte_name='recent_customers',
cte_query='SELECT * FROM silver.customers WHERE created_at > NOW() - INTERVAL \'30 days\''
)
Key Builders:
- `select_builder()` - SELECT statements
- `join_builder()` - JOIN clauses
- `where_builder()` - WHERE conditions
- `pagination_builder()` - LIMIT/OFFSET
- `cte_builder()` - Common Table Expressions
- `window_function_builder()` - Window functions
- `subquery_builder()` - Subqueries
- `recursive_cte_builder()` - Recursive CTEs
Metadata Queries:
- `check_schema_exists_sql()` - Schema existence
- `get_schema_info_sql()` - Schema metadata
- `get_table_info_sql()` - Table metadata
- `get_column_info_sql()` - Column metadata
- `get_database_info_sql()` - Database info
- `get_table_stats_sql()` - Table statistics
Integration:
- Used by `common_queries`
- Used by `create_schemas`
- Foundation for all query construction
Common Query Patterns
High-level query patterns for common use cases:
from sql.common_queries import (
analyze_medallion_layer,
trace_data_lineage,
check_data_quality
)
# Analyze medallion layer
query = analyze_medallion_layer(
layer='silver',
table='customers',
business_date='2024-01-01',
source_system='crm'
)
# Trace data lineage
lineage_query = trace_data_lineage(
target_schema='gold',
target_table='customer_analytics',
include_upstream=True,
max_depth=3
)
# Data quality checks
quality_query = check_data_quality(
schema='silver',
table='customers',
checks=[
{'column': 'email', 'check_type': 'not_null'},
{'column': 'customer_id', 'check_type': 'unique'}
]
)
Key Functions:
- `analyze_medallion_layer()` - Query medallion layers
- `trace_data_lineage()` - Lineage tracing
- `check_data_quality()` - Data quality validation
- `compute_pivot_table()` - Pivot operations
- `compute_running_totals()` - Running calculations
- `analyze_lag_lead()` - Time series analysis
- `analyze_cohort_retention()` - Cohort analysis
- `detect_time_series_gaps()` - Gap detection
Integration:
- Uses `query_builder` internally
- Provides medallion-optimized patterns
- Supports audit column filtering
Process and Configuration Auditing
Track ETL process execution and configuration changes:
from logs import ProcessLogger, ConfigurationLogger
# Process logging
process_logger = ProcessLogger(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
process_id = process_logger.start_process(
process_name='bronze_ingestion',
process_description='Load CRM data',
source_system='CRM',
target_layer='bronze'
)
# ... do ETL work ...
process_logger.end_process(
log_id=process_id,
status='SUCCESS',
rows_processed=1000,
error_message=None
)
# Configuration logging
config_logger = ConfigurationLogger()
config_logger.log_config_change(
parameter_name='batch_size',
old_value='1000',
new_value='5000',
change_reason='Performance optimization'
)
Key Classes:
- `ProcessLogger` - ETL process lifecycle tracking
- `ConfigurationLogger` - Configuration change auditing
- `BatchLogger` - Specialized batch operation logging
Integration:
- Uses `create_logs` table models
- Called by `setup_orchestrator`
- Foundation for compliance and debugging
Data Lineage Tracking and Analysis
Track data flow through the medallion architecture:
from logs import LineageTracker, LineageAnalyzer, ImpactAnalyzer
# Track lineage
tracker = LineageTracker(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
lineage_id = tracker.log_lineage(
process_log_id=123,
source_schema='bronze',
source_table='crm_customers',
target_schema='silver',
target_table='customers',
transformation_logic='Data cleansing and standardization',
rows_read=1000,
rows_written=950
)
# Analyze lineage
analyzer = LineageAnalyzer()
upstream = analyzer.get_upstream_lineage('gold', 'customer_analytics', max_depth=5)
downstream = analyzer.get_downstream_lineage('bronze', 'crm_customers', max_depth=5)
medallion_flow = analyzer.get_medallion_flow()
# Impact analysis
impact_analyzer = ImpactAnalyzer(analyzer)
impact = impact_analyzer.analyze_impact(
changed_schema='silver',
changed_table='customers',
change_type='SCHEMA_CHANGE'
)
Key Classes:
- `LineageTracker` - Record data transformations
- `LineageAnalyzer` - Analyze lineage relationships
- `ImpactAnalyzer` - Assess downstream impact
Key Features:
- Complete lineage tracking across layers
- Transformation logic documentation
- Source-to-target mapping
- Impact analysis for changes
- Critical path identification
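Underneath these features, lineage analysis is a graph walk over `logs.data_lineage` rows. A simplified upstream-traversal sketch (illustrative; `LineageAnalyzer` encapsulates the real logic):
```python
from collections import defaultdict
from typing import List, Set, Tuple

def upstream(edges: List[Tuple[str, str]], table: str, max_depth: int = 5) -> Set[str]:
    """Walk (source, target) table edges backwards from `table`, up to max_depth hops."""
    parents = defaultdict(set)
    for src, tgt in edges:
        parents[tgt].add(src)
    seen: Set[str] = set()
    frontier = {table}
    for _ in range(max_depth):
        frontier = {p for t in frontier for p in parents[t]} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

edges = [("bronze.crm_customers", "silver.customers"),
         ("silver.customers", "gold.customer_analytics")]
assert upstream(edges, "gold.customer_analytics") == {
    "silver.customers", "bronze.crm_customers"}
```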
Integration:
- Uses `create_logs` table models
- Supports compliance and governance
- Enables root cause analysis
Error Logging and Recovery
Centralized error handling with recovery mechanisms:
from logs import ErrorLogger, ErrorRecovery
# Error logging
error_logger = ErrorLogger(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
error_logger.log_error(
process_log_id=123,
error_message="Validation failed: Invalid email format",
error_code="DATA_VALIDATION_ERROR",
error_level="ERROR",
affected_table="bronze.crm_customers",
recovery_suggestion="Check source data format"
)
# Error recovery
recovery = ErrorRecovery()
recovery_plan = recovery.get_recovery_plan('DATA_VALIDATION_ERROR')
success = recovery.attempt_recovery(error_id=456, max_retries=3)
Key Classes:
- `ErrorLogger` - Centralized error logging
- `ErrorRecovery` - Automated recovery mechanisms (retry sketch below)
- `ErrorAnalyzer` - Error pattern analysis
Error Levels:
- `INFO` - Informational messages
- `WARNING` - Warnings (not critical)
- `ERROR` - Errors requiring attention
- `CRITICAL` - Critical failures
Integration:
- Uses `create_logs` table models
- Provides recovery suggestions
- Supports error trend analysis
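As a sketch of the retry behaviour implied by `attempt_recovery(..., max_retries=3)` (illustrative only; the project's actual recovery strategy may differ), a retry loop with exponential backoff looks like:
```python
import time
from typing import Callable

def retry_with_backoff(action: Callable[[], None],
                       max_retries: int = 3,
                       base_delay: float = 1.0) -> bool:
    """Retry an action, doubling the wait after each failed attempt."""
    for attempt in range(1, max_retries + 1):
        try:
            action()
            return True  # recovered
        except Exception:
            if attempt == max_retries:
                return False  # retries exhausted; error stays unresolved
            time.sleep(base_delay * 2 ** (attempt - 1))
    return False
```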
Performance Monitoring
Track and analyze performance metrics:
from logs import PerformanceMonitor, MetricsCollector
# Performance monitoring
monitor = PerformanceMonitor(
host='localhost',
user='postgres',
password='password',
database='warehouse'
)
# Record metrics
monitor.record_metric(
process_log_id=123,
metric_name='processing_time',
metric_value=45.2,
metric_unit='seconds'
)
monitor.record_metric(
process_log_id=123,
metric_name='memory_usage',
metric_value=512.5,
metric_unit='MB'
)
# Context manager for automatic timing
with monitor.track_performance(process_log_id=123, operation_name='data_load'):
# ... do work ...
pass
# Collect and analyze metrics
collector = MetricsCollector()
summary = collector.get_performance_summary(
process_name='bronze_ingestion',
days=7
)
Key Classes:
- `PerformanceMonitor` - Track performance metrics
- `MetricsCollector` - Collect and aggregate metrics
- `ThroughputAnalyzer` - Analyze data processing throughput
Key Features:
- Execution time tracking
- Resource usage monitoring (CPU, memory)
- Custom metric recording
- Context manager for automatic timing (sketched below)
- Performance trend analysis
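A minimal sketch of how a timing context manager like `track_performance()` can work (the real implementation is not shown in this README; `record_metric` follows the signature used above):
```python
import time
from contextlib import contextmanager

@contextmanager
def timed(record_metric, process_log_id: int, operation_name: str):
    """Record elapsed wall-clock time as a metric when the block exits."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record_metric(
            process_log_id=process_log_id,
            metric_name=f"{operation_name}_duration",
            metric_value=time.perf_counter() - start,
            metric_unit="seconds",
        )

# Usage mirrors monitor.track_performance(...):
# with timed(monitor.record_metric, 123, "data_load"):
#     ...
```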
Integration:
- Uses `create_logs` table models
- Uses `psutil` for system metrics
- Supports SLA monitoring
Comprehensive Test Suite - 207 Passing Tests ✅
Total: 207 tests across 7 test files
├── tests_logs/ 119 tests (all logging modules)
│ ├── test_audit_logger.py 30 tests
│ ├── test_data_lineage.py 26 tests
│ ├── test_error_handler.py 35 tests
│ └── test_performance_monitor.py 28 tests
└── tests_setup/ 88 tests (all setup modules)
├── test_create_database.py 21 tests
├── test_create_schemas.py 45 tests
└── test_create_logs.py 22 tests
# Run all tests
python -m pytest tests/ -v
# Run specific module tests
python -m pytest tests/tests_logs/test_audit_logger.py -v
# Run with coverage
python -m pytest tests/ --cov=logs --cov=setup --cov-report=html
# Run only setup tests
python -m pytest tests/tests_setup/ -v
# Run only logging tests
python -m pytest tests/tests_logs/ -v
- Isolated environments: Each test uses SQLite in-memory databases (see the fixture sketch below)
- Comprehensive fixtures: Shared fixtures via `conftest.py` files
- Mocked dependencies: External dependencies properly mocked
- Edge case coverage: Tests for error conditions, edge cases, and integrations
- Fast execution: All 207 tests run in ~1.3 seconds
- No database required: Tests run without PostgreSQL installation
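A sketch of the in-memory fixture pattern these tests rely on (assuming `models` exports the shared declarative `Base`; the schema mapping is a common SQLite workaround, not confirmed from the codebase):
```python
# conftest.py-style fixture: every test gets a fresh, throwaway database
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from models import Base  # assumed export of the shared declarative Base

@pytest.fixture
def session():
    # SQLite has no schemas, so map the "logs" schema away for tests
    engine = create_engine("sqlite:///:memory:").execution_options(
        schema_translate_map={"logs": None}
    )
    Base.metadata.create_all(engine)  # build all logging tables
    db = sessionmaker(bind=engine)()
    yield db          # hand the session to the test
    db.close()
    engine.dispose()  # discard the in-memory database
```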
Test categories:
- Unit Tests: Individual function and method testing
- Integration Tests: Multi-component workflow testing
- Smoke Tests: Basic import and initialization verification
- Error Tests: Exception handling and edge cases
- Lifecycle Tests: Complete operation workflows
Environment variables (`.env`):
# PostgreSQL Connection
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_password
# Database Names
POSTGRES_DB=postgres
WAREHOUSE_DB=sql_retail_analytics_warehouse
Access these settings from Python:
from core.config import config
# Database connection settings
host = config.db_host
port = config.db_port
user = config.db_user
password = config.db_password
admin_db = config.db_name
warehouse_db = config.warehouse_db_name
# Connection string
conn_str = config.get_connection_string(use_warehouse=True)
# Project paths
data_dir = config.project.data_dir
logs_dir = config.project.logs_dir
medallion_dir = config.project.medallion_dir
Run a complete setup:
from setup import SetupOrchestrator
orchestrator = SetupOrchestrator()
results = orchestrator.run_complete_setup(include_samples=True)
for step, success in results.items():
print(f"{step}: {'✓' if success else '✗'}")from logs import ProcessLogger, PerformanceMonitor, LineageTracker
# Initialize loggers
process_logger = ProcessLogger()
perf_monitor = PerformanceMonitor()
lineage = LineageTracker()
# Start process
process_id = process_logger.start_process(
process_name='bronze_to_silver',
process_description='Transform bronze data to silver',
source_system='bronze',
target_layer='silver'
)
try:
# Track performance
with perf_monitor.track_performance(process_id, 'transformation'):
# ... ETL logic ...
rows_processed = 1000
# Log lineage
lineage.log_lineage(
process_log_id=process_id,
source_schema='bronze',
source_table='crm_customers',
target_schema='silver',
target_table='customers',
transformation_logic='Data cleansing and validation',
rows_read=1000,
rows_written=950
)
# End process
process_logger.end_process(
log_id=process_id,
status='SUCCESS',
rows_processed=rows_processed
)
except Exception as e:
# Log error
from logs import ErrorLogger
error_logger = ErrorLogger()
error_logger.log_error(
process_log_id=process_id,
error_message=str(e),
error_level='ERROR'
)
# End process with failure
process_logger.end_process(
log_id=process_id,
status='FAILED',
error_message=str(e)
)
Run data quality checks:
from sql.common_queries import check_data_quality
from sqlalchemy import create_engine, text
from core.config import config
# Define quality checks
checks = [
{'column': 'email', 'check_type': 'not_null'},
{'column': 'email', 'check_type': 'format', 'pattern': '%@%.%'},
{'column': 'customer_id', 'check_type': 'unique'},
{'column': 'registration_date', 'check_type': 'range', 'min': '2020-01-01'}
]
# Generate and execute quality check query
quality_sql = check_data_quality(
schema='silver',
table='customers',
checks=checks
)
engine = create_engine(config.get_connection_string(use_warehouse=True))
with engine.connect() as conn:
    results = conn.execute(text(quality_sql)).fetchall()
Analyze the impact of changes via data lineage:
from logs.data_lineage import LineageAnalyzer, ImpactAnalyzer
# Initialize analyzers
lineage_analyzer = LineageAnalyzer()
impact_analyzer = ImpactAnalyzer(lineage_analyzer)
# Analyze impact of schema change
impact = impact_analyzer.analyze_impact(
changed_schema='silver',
changed_table='customers',
change_type='SCHEMA_CHANGE'
)
print(f"Impact Severity: {impact['impact_severity']}")
print(f"Affected Tables: {impact['downstream']['total_downstream_tables']}")
print(f"Critical Paths: {len(impact['critical_paths'])}")
print("\nRecommendations:")
for rec in impact['recommendations']:
print(f" - {rec}")# Database
psycopg2-binary==2.9.10 # PostgreSQL adapter
SQLAlchemy==2.0.44 # ORM and database toolkit
# Configuration
python-dotenv==1.2.1 # Environment variable management
# Monitoring
psutil==7.1.3 # System and process monitoring
# Utilities
typing_extensions==4.15.0 # Type hint extensions
greenlet==3.2.4 # Coroutine support (SQLAlchemy dependency)
# Run all tests
python -m pytest tests/
# Run specific test module
python -m pytest tests/tests_setup/ -v
# Run with coverage
python -m pytest --cov=. tests/
The project follows Python best practices:
- PEP 8 style guide
- Type hints for function signatures
- Comprehensive docstrings
- Modular, reusable code
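For example, a hypothetical helper written in this style (illustrative, not part of the codebase):
```python
def rows_per_second(rows_processed: int, duration_seconds: float) -> float:
    """Compute ETL throughput.

    Args:
        rows_processed: Number of rows handled by the process.
        duration_seconds: Wall-clock duration of the process.

    Returns:
        Throughput in rows per second (0.0 for a zero duration).
    """
    if duration_seconds <= 0:
        return 0.0
    return rows_processed / duration_seconds
```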
To add a new module:
- Create the module in the appropriate package (`core/`, `logs/`, `sql/`, `setup/`)
- Add exports to the package `__init__.py`
- Update this README with module documentation
- Add tests in the `tests/` directory
This project is licensed under the MIT License - see the LICENSE file for details.
This project demonstrates:
- Medallion Architecture: Multi-layered data warehouse design
- SQLAlchemy ORM: Type-safe database operations
- Enterprise Patterns: Audit trails, data lineage, error handling
- Modular Design: Reusable, maintainable code
- Configuration Management: Environment-based settings
- Performance Monitoring: Metrics and optimization
Contributions are welcome! Please feel free to submit a Pull Request.
For questions or issues, please open an issue on the repository.
Built with ❤️ using Python, PostgreSQL, and SQLAlchemy