# Customer Data Integration and Analytics Pipeline

## Overview

This project demonstrates the design and implementation of a modern data engineering pipeline that consolidates structured and semi-structured customer data into a centralized analytics warehouse for reporting and predictive analytics.

The solution integrates relational (MS SQL Server) and NoSQL (MongoDB) data sources, processes and transforms data using ETL pipelines, and loads it into a star-schema analytics warehouse.

The pipeline is orchestrated using both Azure Data Factory (ADF) and a prototype Apache Airflow workflow for comparison. Data loading is also demonstrated with SQL Server Integration Services (SSIS).

This project highlights best practices in data integration, transformation, and workflow management, with a focus on maintainability and scalability.
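
To make the "semi-structured to warehouse" step concrete, below is a minimal, illustrative Python sketch of how MongoDB event documents could be flattened into tabular rows and landed in a staging table. The database, collection, table names, and connection string are assumptions for illustration, not values taken from this repository.

```python
# Illustrative sketch only: flatten semi-structured MongoDB event documents into
# tabular rows for a staging table. Database, collection, table, and connection
# names are assumptions, not values from this repository.
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

# Assumed local MongoDB instance and an assumed "crm.customer_events" collection.
events = MongoClient("mongodb://localhost:27017")["crm"]["customer_events"]

# Pull the documents and flatten nested fields (e.g. event payloads) into columns.
docs = list(events.find({}, {"_id": 0}))
staging_df = pd.json_normalize(docs, sep="_")

# Land the flattened rows in a SQL Server staging table
# (assumes pyodbc and ODBC Driver 17 are installed; adjust the connection string).
engine = create_engine(
    "mssql+pyodbc://user:password@localhost/Staging?driver=ODBC+Driver+17+for+SQL+Server"
)
staging_df.to_sql("stg_customer_events", engine, if_exists="append", index=False)
```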

## 📐 Architecture

```mermaid
flowchart LR
    A["MS SQL Server: CRM Data"] --> B["ETL Pipeline (ADF/Airflow)"]
    C["MongoDB: Customer Events"] --> B
    B --> D["Staging"]
    D --> E["SQL Server: Analytics Warehouse"]
    E --> F["Reporting & Dashboards"]
```

- Sources:
  - MS SQL Server: Structured customer and transaction data
  - MongoDB: Semi-structured customer event logs
- Orchestration:
  - Azure Data Factory (production-level orchestration)
  - Apache Airflow (academic prototype for DAG-based orchestration; see the sketch after this list)
- Loading:
  - SSIS packages for efficient batch inserts
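
As a hedged illustration of the DAG-based orchestration mentioned above, here is a minimal Airflow 2.x sketch of a three-stage extract, transform, and load workflow. Task IDs, the schedule, and the callables are illustrative assumptions, not the repository's actual `dags/airflow_dag.py`.

```python
# Minimal Airflow 2.x sketch of an extract -> transform -> load DAG.
# Task ids, schedule, and callables are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sources():
    """Pull customer rows from SQL Server and event documents from MongoDB."""


def transform_to_staging():
    """Clean, conform, and flatten the extracted data into staging tables."""


def load_warehouse():
    """Load the staged rows into the star-schema analytics warehouse."""


with DAG(
    dag_id="customer_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_sources", python_callable=extract_sources)
    transform = PythonOperator(task_id="transform_to_staging", python_callable=transform_to_staging)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_warehouse)

    extract >> transform >> load
```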

## 🎯 Objectives

✅ Integrate multiple heterogeneous data sources into a single warehouse
✅ Automate data cleaning, transformation, and enrichment
✅ Enable timely and reliable business reporting
✅ Demonstrate proficiency in modern data engineering tools

## 🛠️ Technologies Used

| Tool/Technology | Purpose |
| --- | --- |
| MS SQL Server | Relational database and analytics warehouse |
| MongoDB | NoSQL source for event-based data |
| Azure Data Factory (ADF) | Cloud-based ETL orchestration |
| SQL Server Integration Services (SSIS) | ETL for batch loading |
| Apache Airflow | Workflow orchestration (prototype) |
| Python | Data cleaning and Airflow DAG scripts |
| Jupyter Notebook | Exploratory data analysis |

## 📁 Repository Structure

```text
customer-data-integration-pipeline/
├── README.md
├── data/
│   ├── sample_sql_server_export.csv
│   ├── sample_mongodb_export.json
├── dags/
│   └── airflow_dag.py
├── ssis/
│   └── ssis_package.ispac
├── adf/
│   └── pipeline_definition.json
├── sql/
│   ├── warehouse_schema.sql
│   └── transformations.sql
├── notebooks/
│   └── exploratory_analysis.ipynb
├── docs/
│   └── architecture_diagram.png
└── LICENSE
```

## 🚀 Getting Started

### Prerequisites

- SQL Server instance (local or cloud)
- MongoDB instance (local or cloud)
- Azure account with Data Factory (optional for demonstration)
- Apache Airflow (optional, can be run locally using Docker)

### Steps

  1. Clone the repository.
  2. Create the analytics warehouse schema using `sql/warehouse_schema.sql`.
  3. Load the sample data from `/data` into SQL Server and MongoDB (see the loading sketch after these steps).
  4. Deploy the ADF pipeline from `adf/pipeline_definition.json`, or simulate the same steps with the Airflow DAG.
  5. Run the SSIS package in `/ssis` to load the staged data into the warehouse.
  6. Explore the results or extend with custom dashboards.
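
For step 3, the following is a hedged Python sketch of loading the bundled sample exports into local instances. The connection string, database, table, and collection names are assumptions, and the JSON export is assumed to be a plain JSON array; adjust for your environment and for newline-delimited exports.

```python
# Hypothetical helper for step 3: load the bundled sample exports into local
# SQL Server and MongoDB instances. Connection details and target names are
# assumptions; adjust them to your environment.
import json

import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

# SQL Server: load the CSV export into a source table
# (assumes pyodbc and ODBC Driver 17 are installed).
engine = create_engine(
    "mssql+pyodbc://user:password@localhost/CRM?driver=ODBC+Driver+17+for+SQL+Server"
)
pd.read_csv("data/sample_sql_server_export.csv").to_sql(
    "customers", engine, if_exists="replace", index=False
)

# MongoDB: insert the JSON export as event documents
# (assumes the file is a JSON array; adjust for newline-delimited JSON).
with open("data/sample_mongodb_export.json") as f:
    events = json.load(f)
MongoClient("mongodb://localhost:27017")["crm"]["customer_events"].insert_many(events)
```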

## 📊 Outcomes

✅ Consolidated customer data from multiple systems into one analytics-ready database.
✅ Automated repeatable workflows for daily ETL operations.
✅ Improved data consistency, quality, and availability for reporting teams.

## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
