This project demonstrates the design and implementation of a modern data engineering pipeline that consolidates structured and semi-structured customer data into a centralized analytics warehouse for reporting and predictive analytics.
The solution integrates relational (MS SQL Server) and NoSQL (MongoDB) data sources, processes and transforms data using ETL pipelines, and loads it into a star-schema analytics warehouse.
The pipeline is orchestrated using both Azure Data Factory (ADF) and a prototype Apache Airflow workflow for comparison. Data loading is also demonstrated with SQL Server Integration Services (SSIS).
The project highlights best practices in data integration, transformation, and workflow management, with a focus on maintainability and scalability.
```mermaid
flowchart LR
    A[MS SQL Server: CRM Data] --> B["ETL Pipeline (ADF/Airflow)"]
    C[MongoDB: Customer Events] --> B
    B --> D[Staging]
    D --> E[SQL Server: Analytics Warehouse]
    E --> F[Reporting & Dashboards]
```
Sources:
- MS SQL Server: Structured customer and transaction data
- MongoDB: Semi-structured customer event logs
Orchestration:
- Azure Data Factory (production-level orchestration)
- Apache Airflow (academic prototype for DAG-based orchestration)
Loading:
- SSIS packages for efficient batch inserts
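To make the staging step above concrete, here is a minimal sketch of how structured CRM rows and semi-structured MongoDB event documents could be flattened into a common staging shape. It uses only the Python standard library; the field names and sample data are illustrative, not the project's actual schema:

```python
import json
from datetime import datetime, timezone

# Illustrative CRM export rows (structured, as from SQL Server)
crm_rows = [
    {"customer_id": 101, "name": "Ada Lovelace", "signup_date": "2024-01-15"},
]

# Illustrative MongoDB event document (semi-structured JSON)
event_docs = [
    json.loads(
        '{"customer_id": 101, "event": "login", '
        '"ts": "2024-02-01T09:30:00Z", "meta": {"device": "mobile"}}'
    ),
]

def to_staging(crm_rows, event_docs):
    """Join events to customers and flatten nested fields for staging."""
    customers = {r["customer_id"]: r for r in crm_rows}
    staged = []
    for doc in event_docs:
        cust = customers.get(doc["customer_id"], {})
        staged.append({
            "customer_id": doc["customer_id"],
            "customer_name": cust.get("name"),
            "event_type": doc["event"],
            # Parse the ISO timestamp into a timezone-aware datetime
            "event_ts": datetime.strptime(doc["ts"], "%Y-%m-%dT%H:%M:%SZ")
                        .replace(tzinfo=timezone.utc),
            # Promote a nested field to a flat staging column
            "device": doc.get("meta", {}).get("device"),
        })
    return staged

rows = to_staging(crm_rows, event_docs)
```

The real pipeline would read from live SQL Server and MongoDB connections; the join-and-flatten logic is the part this sketch isolates.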
✅ Integrate multiple heterogeneous data sources into a single warehouse
✅ Automate data cleaning, transformation, and enrichment
✅ Enable timely and reliable business reporting
✅ Demonstrate proficiency in modern data engineering tools
| Tool/Technology | Purpose |
|---|---|
| MS SQL Server | Relational database and analytics warehouse |
| MongoDB | NoSQL source for event-based data |
| Azure Data Factory (ADF) | Cloud-based ETL orchestration |
| SQL Server Integration Services (SSIS) | ETL for batch loading |
| Apache Airflow | Workflow orchestration (prototype) |
| Python | Data cleaning and Airflow DAG scripts |
| Jupyter Notebook | Exploratory data analysis |
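Since Python handles data cleaning in this stack, the following is a hedged sketch of the kind of cleaning step the pipeline might apply: trimming, case normalization, type coercion, and dropping invalid rows. Column names and the validation rule are illustrative, standard library only:

```python
import re
from datetime import datetime
from typing import Optional

def clean_record(raw: dict) -> Optional[dict]:
    """Normalize one raw customer record; return None if unusable."""
    email = raw.get("email", "").strip().lower()
    # Drop rows whose email does not look like addr@domain.tld
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return None
    return {
        "customer_id": int(raw["customer_id"]),          # coerce to int
        "email": email,                                   # trimmed, lowercased
        "signup_date": datetime.strptime(
            raw["signup_date"].strip(), "%Y-%m-%d"
        ).date(),                                         # coerce to date
    }

raw_rows = [
    {"customer_id": "101", "email": "  Ada@Example.COM ", "signup_date": "2024-01-15 "},
    {"customer_id": "102", "email": "not-an-email", "signup_date": "2024-02-01"},
]
cleaned = [r for r in (clean_record(row) for row in raw_rows) if r is not None]
```

In the repository, logic of this kind would live in the Airflow DAG's transform tasks or in the ADF pipeline's mapping steps.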
```
customer-data-integration-pipeline/
├── README.md
├── data/
│   ├── sample_sql_server_export.csv
│   ├── sample_mongodb_export.json
├── dags/
│   └── airflow_dag.py
├── ssis/
│   └── ssis_package.ispac
├── adf/
│   └── pipeline_definition.json
├── sql/
│   ├── warehouse_schema.sql
│   └── transformations.sql
├── notebooks/
│   └── exploratory_analysis.ipynb
├── docs/
│   └── architecture_diagram.png
└── LICENSE
```
- SQL Server instance (local or cloud)
- MongoDB instance (local or cloud)
- Azure account with Data Factory (optional for demonstration)
- Apache Airflow (optional, can be run locally using Docker)
- Clone the repository.
- Create the analytics warehouse schema using `sql/warehouse_schema.sql`.
- Load the sample data into SQL Server and MongoDB from `/data`.
- Deploy the ADF pipeline from `/adf/pipeline_definition.json` (or simulate the steps with the Airflow DAG).
- Run the SSIS package in `/ssis` to load staged data into the warehouse.
- Explore the results or extend with custom dashboards.
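As a sketch of the final load step into the star schema, the snippet below uses `sqlite3` as a local stand-in for the SQL Server warehouse (so it runs anywhere); the table and column names are illustrative and not taken from `sql/warehouse_schema.sql`. It shows the key pattern: upsert the dimension row, resolve its surrogate key, then insert the fact row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified star schema: one dimension table, one fact table
cur.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY,"
    " customer_id INTEGER UNIQUE,"   # natural key from the source system
    " name TEXT)"
)
cur.execute(
    "CREATE TABLE fact_events ("
    " event_key INTEGER PRIMARY KEY,"
    " customer_key INTEGER REFERENCES dim_customer,"
    " event_type TEXT,"
    " event_ts TEXT)"
)

staged = [
    {"customer_id": 101, "name": "Ada Lovelace", "event_type": "login",
     "event_ts": "2024-02-01T09:30:00Z"},
    {"customer_id": 101, "name": "Ada Lovelace", "event_type": "purchase",
     "event_ts": "2024-02-02T14:00:00Z"},
]

for row in staged:
    # Upsert the dimension row (ignored if the natural key already exists)
    cur.execute(
        "INSERT OR IGNORE INTO dim_customer (customer_id, name) VALUES (?, ?)",
        (row["customer_id"], row["name"]),
    )
    # Resolve the surrogate key for the fact insert
    customer_key = cur.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
        (row["customer_id"],),
    ).fetchone()[0]
    cur.execute(
        "INSERT INTO fact_events (customer_key, event_type, event_ts) VALUES (?, ?, ?)",
        (customer_key, row["event_type"], row["event_ts"]),
    )

conn.commit()
fact_count = cur.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0]
dim_count = cur.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
```

In production this load is handled by the SSIS package against SQL Server; the dimension-then-fact ordering is the part the sketch demonstrates.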
✅ Consolidated customer data from multiple systems into one analytics-ready database.
✅ Automated repeatable workflows for daily ETL operations.
✅ Improved data consistency, quality, and availability for reporting teams.
This project is licensed under the MIT License. See the LICENSE file for details.