A containerized data engineering solution demonstrating end-to-end data pipeline capabilities: ingestion from multiple sources, multi-layer data warehouse architecture, and quality monitoring dashboards.
```
┌─────────────────┐      ┌─────────────────┐
│    REST API     │      │    CSV Files    │
└────────┬────────┘      └────────┬────────┘
         │                        │
         ▼                        ▼
┌─────────────────────────────────────────┐
│        INGESTION (Python/Airflow)       │
└─────────────────────────────────────────┘
                    │
         ┌──────────┼──────────┐
         ▼          ▼          ▼
    ┌─────────┐ ┌─────────┐ ┌─────────┐
    │   RAW   │ │ STAGING │ │   DW    │
    │  Layer  │→│  Layer  │→│  Layer  │
    └─────────┘ └─────────┘ └─────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
   ┌───────────┐         ┌───────────┐
   │ Data Marts│         │  Quality  │
   │           │         │  Reports  │
   └───────────┘         └───────────┘
         │                     │
         └──────────┬──────────┘
                    ▼
        ┌─────────────────────┐
        │   Apache Superset   │
        │     Dashboards      │
        └─────────────────────┘
```
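The ingestion box above is where Python lands REST API and CSV data in the raw layer. A minimal sketch of the CSV path with pandas; the `load_csv_to_raw` helper and the `order_id`/`amount` columns are illustrative names, not the project's actual code:

```python
import io
from datetime import datetime, timezone

import pandas as pd

def load_csv_to_raw(csv_buffer, source_name: str) -> pd.DataFrame:
    """Read a CSV as-is and stamp it with load metadata, mirroring a
    raw-layer landing step (no typing yet: every column stays a string)."""
    df = pd.read_csv(csv_buffer, dtype=str)
    df["_source"] = source_name
    df["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return df

# In-memory CSV standing in for a file from data/sample/
sample = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
raw_df = load_csv_to_raw(sample, "csv_orders")
print(len(raw_df), list(raw_df.columns))
```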
| Component | Technology |
|---|---|
| Database | SQL Server 2022 |
| Orchestration | Apache Airflow 2.8 |
| Visualization | Apache Superset 3.1 |
| Data Processing | Python (pandas, requests) |
| Infrastructure | Docker Compose |
- Docker Desktop 4.0+
- Docker Compose 2.0+
- 8GB RAM available for containers
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd microsoft_data_stack
   ```

2. Create environment file

   ```bash
   cp .env.example .env
   # Edit .env to customize passwords if needed
   ```

3. Start the stack

   ```bash
   docker compose up -d
   ```

4. Wait for initialization (~2-3 minutes)

   ```bash
   # Check container status
   docker compose ps

   # View logs
   docker compose logs -f sqlserver-init
   ```
| Service | URL | Credentials |
|---|---|---|
| Airflow | http://localhost:8080 | admin / admin |
| Superset | http://localhost:8088 | admin / admin |
| SQL Server | localhost:1433 | sa / (from .env) |
```
microsoft_data_stack/
├── docker-compose.yml       # Container orchestration
├── .env.example             # Environment template
├── SPEC.md                  # Detailed specification
│
├── airflow/
│   ├── Dockerfile
│   ├── requirements.txt
│   └── dags/                # Pipeline definitions
│
├── superset/
│   ├── Dockerfile
│   └── superset_init.sh
│
├── sql/
│   ├── 01_create_schemas.sql
│   ├── 02_raw_tables.sql
│   ├── 03_staging_tables.sql
│   ├── 04_dw_tables.sql
│   ├── 05_mart_tables.sql
│   └── 06_quality_tables.sql
│
├── src/
│   ├── connectors/          # Data source connectors
│   ├── loaders/             # Data loading logic
│   ├── transformers/        # Data transformation
│   └── quality/             # Quality tracking
│
└── data/
    ├── sample/              # Sample data files
    └── raw/                 # Landing zone (git-ignored)
```
| Layer | Schema | Purpose |
|---|---|---|
| Raw | raw.* | Landing zone, data preserved as-is |
| Staging | staging.* | Cleaned, validated, typed data |
| Warehouse | dw.* | Star schema (dimensions + facts) |
| Marts | mart.* | Pre-aggregated analytics views |
| Quality | quality.* | Pipeline metrics and audit logs |
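The raw → staging hop is where typing and validation happen. A hedged sketch of that step with pandas; the frame shape, column names, and the single business rule are illustrative, not the project's actual transformers:

```python
import pandas as pd

def raw_to_staging(raw: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    """Cast raw string columns to proper types and drop rows that fail
    validation; return the clean frame and the reject count."""
    df = raw.copy()
    df["order_id"] = pd.to_numeric(df["order_id"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    valid = df.dropna(subset=["order_id", "amount"])
    valid = valid[valid["amount"] >= 0]  # example business rule
    rejected = len(df) - len(valid)
    return valid, rejected

raw = pd.DataFrame({"order_id": ["1", "2", "x"], "amount": ["19.99", "-1", "5"]})
staged, rejected = raw_to_staging(raw)
print(len(staged), rejected)  # → 1 2
```

The reject count feeds directly into the quality score logged per pipeline run (see the quality metrics section below).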
1. Start the stack:

   ```bash
   docker compose up -d
   ```

2. Open Airflow: http://localhost:8080
3. Enable and trigger the `data_ingestion` DAG, wait for completion (~1 min)
4. Enable and trigger the `data_transformation` DAG, wait for completion (~1 min)
5. Open Superset: http://localhost:8088
6. Navigate to SQL Lab to explore data
7. Create dashboards using pre-built views (see below)
Two dashboards are available:

Sales dashboard:

- Revenue, Orders, Customer KPIs
- Daily sales trend
- Sales by category
- Top products by revenue
- Customer segments

Quality dashboard:

- Overall quality score
- Records received vs stored
- Quality trend over time
- Pipeline run status
See `superset/dashboards/DASHBOARD_SETUP.md` for detailed setup instructions.
```sql
-- Sales views
SELECT * FROM dw.vw_sales_kpi_summary;
SELECT * FROM dw.vw_daily_sales_trend;
SELECT * FROM dw.vw_sales_by_category;
SELECT * FROM dw.vw_top_products;

-- Quality views
SELECT * FROM quality.vw_quality_summary;
SELECT * FROM quality.vw_quality_by_source;
```

Every pipeline run logs:
- Records Received: Count extracted from source
- Records Stored: Count successfully loaded
- Records Rejected: Count failed validation
- Quality Score: `(stored / received) × 100`
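The score is a straightforward ratio; a minimal helper matching the formula above (hypothetical, not the project's actual quality module):

```python
def quality_score(received: int, stored: int) -> float:
    """Quality score = (stored / received) × 100, as logged per run.
    Returns 0.0 when nothing was received, to avoid division by zero."""
    if received == 0:
        return 0.0
    return round(stored / received * 100, 2)

print(quality_score(1000, 987))  # → 98.7
```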
Query the quality metrics:

```sql
SELECT
    source_name,
    records_received,
    records_stored,
    quality_score,
    created_at
FROM quality.ingestion_log
ORDER BY created_at DESC;
```

Common Docker commands:

```bash
# Start all services
docker compose up -d

# Stop all services
docker compose down

# View logs
docker compose logs -f <service-name>

# Restart a specific service
docker compose restart airflow-webserver

# Reset everything (delete volumes)
docker compose down -v

# Connect to SQL Server
docker exec -it mds_sqlserver /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P 'YourStrong@Passw0rd123' -C
```

If SQL Server fails to start:

- Ensure you have at least 2GB RAM allocated to Docker
- Check the password meets SQL Server complexity requirements
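SQL Server's complexity policy expects at least 8 characters drawn from at least three of the four character classes (uppercase, lowercase, digits, symbols). A quick local check you can run before editing `.env` (a sketch, not part of the project):

```python
def meets_sqlserver_policy(password: str) -> bool:
    """True if the password is 8+ characters and uses at least three of:
    uppercase, lowercase, digits, non-alphanumeric symbols."""
    if len(password) < 8:
        return False
    classes = [
        any(c.isupper() for c in password),
        any(c.islower() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ]
    return sum(classes) >= 3

print(meets_sqlserver_policy("YourStrong@Passw0rd123"))  # → True
print(meets_sqlserver_policy("password"))                # → False
```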
If Airflow DAGs are not running:

```bash
# Check scheduler logs
docker compose logs airflow-scheduler

# Verify SQL Server connection
docker exec mds_airflow_webserver python -c "
from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook
hook = MsSqlHook(mssql_conn_id='mssql_default')
print(hook.get_conn())
"
```

If Superset cannot connect to the database:

- Verify SQL Server is healthy: `docker compose ps`
- Check that the connection string in Superset uses `sqlserver` as the hostname (not `localhost`)
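Inside the Compose network, containers reach SQL Server by its service name. A sketch of the SQLAlchemy URI Superset would need, assuming the `pymssql` driver and an illustrative database name `mds` (neither is confirmed by this README):

```python
from urllib.parse import quote_plus

def superset_mssql_uri(user: str, password: str, database: str,
                       host: str = "sqlserver", port: int = 1433) -> str:
    """Build a SQLAlchemy URI for Superset's SQL Server connection.
    The host is the Docker service name, not localhost; the password is
    URL-encoded so characters like '@' don't break URI parsing."""
    return f"mssql+pymssql://{user}:{quote_plus(password)}@{host}:{port}/{database}"

print(superset_mssql_uri("sa", "YourStrong@Passw0rd123", "mds"))
```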
See `SPEC.md` for:
- Detailed architecture diagrams
- Star schema design
- Implementation phases
- Sample queries