This project showcases a real-world solution leveraging Microsoft Fabric, SQL Database, and AI capabilities. Using data from the AdventureWorks database, it integrates ingestion, transformation, and AI-driven analytics to deliver actionable insights and efficient workflows.
The AdventureWorks database is used as the on-premises source, providing a rich dataset for enterprise-level demonstrations.
Data Ingestion:
- Data is extracted from the AdventureWorks database.
- Supports both full and delta (incremental) loads.
- Pipelines are built with Data Factory pipelines and notebooks; a minimal extraction sketch follows below.
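To make the full/delta distinction concrete, here is a minimal PySpark sketch of how an extraction query could be assembled from a stored watermark. The table name, watermark column, and connection details are illustrative placeholders, not the repository's actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<on-prem-host>:1433;databaseName=AdventureWorks"  # placeholder
table = "Sales.SalesOrderHeader"        # example source table
watermark_col = "ModifiedDate"          # example change-tracking column
last_watermark = "2024-01-01 00:00:00"  # would normally be read from the metadata control table

def build_query(load_type: str) -> str:
    """Return a full or delta extraction query for the source table."""
    if load_type == "full":
        return f"(SELECT * FROM {table}) AS src"
    # Delta: only rows modified since the last successful load.
    return (f"(SELECT * FROM {table} "
            f"WHERE {watermark_col} > '{last_watermark}') AS src")

# Read either the full table or just the delta through the JDBC connector.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", build_query("delta"))
      .option("user", "<user>")          # placeholder credentials
      .option("password", "<password>")
      .load())
```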
Data Transformation:
- Utilizes Notebooks for processing delta datasets.
- Ensures metadata consistency across the SQL Database and the Lakehouse (see the schema check sketch below).
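As an illustration of the metadata consistency checks, the sketch below compares the column set recorded for a source table with the schema of the corresponding Lakehouse table. The table and column names are assumptions used only for demonstration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns recorded for the source table (e.g. collected earlier from INFORMATION_SCHEMA).
source_columns = {"SalesOrderID", "OrderDate", "CustomerID", "TotalDue", "ModifiedDate"}

# Columns of the corresponding Lakehouse table.
lakehouse_columns = {f.name for f in spark.table("lakehouse.sales_order_header").schema.fields}

missing_in_lakehouse = source_columns - lakehouse_columns
unexpected_in_lakehouse = lakehouse_columns - source_columns

if missing_in_lakehouse or unexpected_in_lakehouse:
    raise ValueError(
        f"Schema drift detected: missing={missing_in_lakehouse}, unexpected={unexpected_in_lakehouse}"
    )
```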
AI-Driven Insights:
- Includes reconciliation and validation for data consistency.
Orchestration:
- Automated pipelines manage the entire data workflow, ensuring seamless progression from ingestion to preparation.
- The system structures and organizes data effectively, making it ready for downstream analytics and decision-making processes.
Logging and Monitoring:
- Detailed logging tracks every step of the ingestion process for transparency and auditing; a minimal log-record sketch follows below.
- Monitoring systems provide real-time insight into pipeline performance, ensuring data flows smoothly and that potential issues are detected early.
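The following is a minimal sketch of what a log record could look like when appended to a Delta table in the Lakehouse. The table name and log schema are illustrative assumptions, not the exact structure used by the Log notebooks.

```python
from datetime import datetime, timezone
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# One log record for a pipeline run; fields shown here are only an example layout.
log_entry = Row(
    pipeline_name="Delta_Load_LH",
    table_name="Sales.SalesOrderHeader",
    status="Succeeded",
    rows_loaded=1250,
    logged_at=datetime.now(timezone.utc).isoformat(),
)

# Append the record to a Delta log table in the Lakehouse.
(spark.createDataFrame([log_entry])
      .write.format("delta")
      .mode("append")
      .saveAsTable("lakehouse.ingestion_log"))
```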
Notebooks:
- Aggregation of Log data.Notebook: Summarizes log data for monitoring and debugging.
- Create Tables.Notebook: Automates the creation of the required tables in the Lakehouse.
- Create query for source system.Notebook: Generates the queries used to extract data from the source system.
- Delta_Join_Parquet.Notebook: Identifies and processes the daily delta updates for ingestion, using the fields that uniquely identify the records of each table (see the sketch below).
- Get Columns.Notebook: Retrieves column metadata across the source system.
- Initialize log ctrl file.Notebook: Sets up and initializes the logging controls for ingestion.
- Load_System_Tables.Notebook: Automates the loading of reference and system tables.
- Load_fabric_metadata.Notebook: Creates the metadata control table used to parameterize the pipelines.
- Reconciliation Check.Notebook: Ensures consistency and validation of the ingested data.
- Reconciliation Checks Total.Notebook: Aggregates the reconciliation checks across all loaded tables.
- Truncate Table.Notebook: Used to re-run the source system daily counts.
- Update_fabric_metadata_table.Notebook: Updates the metadata table after the ingestion process completes.
- Log.Notebook: Monitors processes for debugging and auditing.
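In the spirit of Delta_Join_Parquet.Notebook, the sketch below shows one way daily deltas could be isolated by comparing today's extract against the previous snapshot. The paths, table names, and key fields are illustrative assumptions rather than the notebook's actual logic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

key_fields = ["SalesOrderID"]  # example fields that uniquely identify records in this table

# Today's extract and the previous day's snapshot, both landed as Parquet in staging.
today = spark.read.parquet("Files/staging/sales_order_header/2024-01-02")
previous = spark.read.parquet("Files/staging/sales_order_header/2024-01-01")

# Rows that are new or modified: present in today's extract but not identical to any
# row in the previous snapshot (full-row comparison).
delta_rows = today.exceptAll(previous)

# In a real run the delta set would be merged into the target table on key_fields;
# here it is simply appended to a staging delta table for illustration.
delta_rows.write.format("delta").mode("append").saveAsTable("lakehouse.sales_order_header_delta")
```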
Pipelines:
- Load_System_Tables: Manages the ingestion of system and reference tables.
- Full_Load_LH: Executes a full data load for scenarios requiring complete dataset ingestion, bypassing delta constraints.
- Delta_Load_LH: Handles incremental delta loading for the Lakehouse environment.
- Load_Counts_Daily: Automates the daily ingestion and metadata tracking processes.
- Orchestration: Oversees and automates workflow execution (an illustrative code-level sketch follows below).
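In this project the sequencing is handled by the Orchestration Data Factory pipeline. Purely as an illustration, the sketch below shows a code-level equivalent that runs the notebooks in order from a driver notebook, assuming the Fabric/Synapse mssparkutils notebook utilities are available in the workspace.

```python
from notebookutils import mssparkutils  # available inside Fabric/Synapse Spark notebooks

# Notebook names taken from this repository; parameters would normally come from the
# metadata control table rather than being hard-coded here.
steps = [
    "Initialize log ctrl file",
    "Load_System_Tables",
    "Delta_Join_Parquet",
    "Reconciliation Check",
    "Update_fabric_metadata_table",
]

for step in steps:
    # Run each notebook with a 30-minute timeout and capture its exit value.
    result = mssparkutils.notebook.run(step, 1800)
    print(f"{step}: {result}")
```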
This repository provides the foundation for building an integrated, scalable data platform using Microsoft Fabric and an on-premises SQL Server database.
- Microsoft Fabric: Seamlessly integrates Lakehouse and SQL services.
- SQL Server Database: Centralized data storage for query and analysis.
- AdventureWorks: Rich on-premises dataset for demonstration.
- AI: Enables advanced insights.
- Data Factory Pipelines: Automate data ingestion and processing.
- Notebooks: Enable interactive and advanced transformation workflows.
The workflow for this project follows a structured sequence to ensure efficient data handling, transformation, and analysis:
DDL Generation from the On-Premises Database:
- Extracts the DDL definitions of the source tables from the AdventureWorks on-premises database.
- Automatically generates equivalent DDL scripts for the Microsoft Fabric environment.
- Notebooks are used to analyze and adjust the data types as needed to ensure compatibility with Fabric's data processing capabilities; a type-mapping sketch follows below.
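A hedged sketch of the data type translation step: mapping SQL Server column types to Spark SQL types while generating the Fabric DDL. The mapping below is partial and illustrative; the actual notebooks drive this from the extracted metadata.

```python
# Partial, illustrative mapping of SQL Server types to Spark SQL / Lakehouse types.
TYPE_MAP = {
    "int": "INT",
    "bigint": "BIGINT",
    "bit": "BOOLEAN",
    "datetime": "TIMESTAMP",
    "nvarchar": "STRING",
    "varchar": "STRING",
    "decimal": "DECIMAL(18,2)",   # precision/scale would come from the source metadata
    "uniqueidentifier": "STRING",
}

def build_create_table(schema: str, table: str, columns: list[tuple[str, str]]) -> str:
    """Build a CREATE TABLE statement for the Lakehouse from (name, sql_server_type) pairs."""
    cols = ",\n  ".join(f"{name} {TYPE_MAP.get(sql_type.lower(), 'STRING')}"
                        for name, sql_type in columns)
    return f"CREATE TABLE IF NOT EXISTS {schema}.{table} (\n  {cols}\n)"

print(build_create_table("lakehouse", "sales_order_header",
                         [("SalesOrderID", "int"), ("OrderDate", "datetime"), ("TotalDue", "decimal")]))
```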
Data Ingestion:
- The process begins with extracting data from the AdventureWorks database.
- The data is loaded into staging areas in the Lakehouse (see the staging sketch below).
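A minimal sketch of the staging step, assuming the extracted rows arrive as a Spark DataFrame and are landed as Parquet under the Lakehouse Files area; the path and columns are illustrative placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'extracted' stands in for the DataFrame read from the AdventureWorks source
# (see the extraction sketch earlier in this README).
extracted = spark.createDataFrame(
    [(43659, "2011-05-31"), (43660, "2011-05-31")],
    ["SalesOrderID", "OrderDate"],
)

# Land the extract in a dated staging folder inside the Lakehouse Files area.
(extracted.write
     .mode("overwrite")
     .parquet("Files/staging/sales_order_header/2024-01-02"))
```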
System Table Loading:
- The Load_system_tables.DataPipeline initializes and populates the metadata and reference tables, ensuring the foundational system data is ready for subsequent operations.
Data Delta Processing:
- Pipelines and notebooks identify and extract the daily differences (delta) in data from the previous day.
- This approach ensures efficient handling of only the changed data, minimizing processing overhead and keeping the system up-to-date.
- The Reconciliation Check.Notebook validates the accuracy of the deltas, ensuring consistency between the source and target systems; a count-check sketch follows below.
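As an illustration of the reconciliation step, the following sketch compares the row count captured from the source system with the rows that landed in the Lakehouse table. The table name and the way the source count is obtained are assumptions for demonstration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In practice the source count would be read from the daily counts captured from the
# source system; a literal value is used here only for illustration.
source_count = 31465
target_count = spark.table("lakehouse.sales_order_header").count()

if source_count != target_count:
    raise ValueError(f"Reconciliation failed: source={source_count}, target={target_count}")
print(f"Reconciliation passed: {target_count} rows")
```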
Orchestration and Logging:
- The Orchestration.DataPipeline automates the execution of all workflows.
- Logs are recorded in the Log.Notebook to track processes and identify any issues.
This end-to-end workflow enables robust and scalable data operations, delivering meaningful insights from raw data efficiently.
- Azure Subscription with Microsoft Fabric enabled.
- AdventureWorks Database (download from here).
- Microsoft On-premises Data Gateway (download from here).
Clone the Repository:
git clone https://github.com/ManolisTr/DevpostProject.git
Set Up Resources:
- AdventureWorks Database:
- Download the AdventureWorks database from AdventureWorks GitHub.
- Configure it as the on-premises source.
- Pipelines and Connectors:
- Deploy the Data Factory Pipelines to handle data ingestion and transformation.
Run Pipelines:
- Trigger the orchestration pipeline to start the data ingestion process.
- Monitor the transformation and reconciliation mechanisms to validate data accuracy.