Azure Databricks Notebook Link
This project demonstrates the full ETL lifecycle for the AdventureWorks dataset using Azure Databricks, PySpark, and Azure Data Lake Storage Gen2.
## Key Objectives
- Load and transform retail CSV/Parquet data into a Delta Lake data warehouse.
- Apply column transformations and cleansing for accurate analytics.
- Perform aggregations to derive business metrics like total orders per day.
- Automate data storage and management in Silver/Gold layers of the data lake.
## Tech Stack
- Azure Databricks – Notebook development and Spark processing
- PySpark – DataFrame transformations, aggregations, and cleansing
- Azure Data Lake Storage Gen2 – Data storage (Parquet & Delta)
- Delta Lake – ACID-compliant storage for reliable ETL pipelines
- Python – Scripting, functions, and logic
## ETL Pipeline
- Source: AdventureWorks raw datasets (sales, customers, returns, products)
- Transformations:
  - Column concatenation and renaming (e.g., full customer name)
  - Data cleansing (e.g., removing nulls, correcting formats)
  - Aggregations (e.g., total orders per day)
  - Sorting, filtering, and ranking
- Destination: Silver and Gold layers in Delta/Parquet formats on ADLS Gen2
## Data Quality Checks
- Ensures all numeric columns are valid
- Removes duplicate records
- Handles missing or inconsistent values
## Data Lake Layers
- Silver Layer: Cleansed and enriched data from raw sources
- Gold Layer: Aggregated datasets ready for BI or reporting
## Pipeline Features
- Append Mode: Efficiently adds new data without overwriting existing tables
- Partitioning: Organized by `OrderDate` for performance and cost optimization
## Business Insights
- 📈 Sales performance metrics
- 💰 Product profitability analysis
- 👥 Customer demographics & distribution
**Marwa Medhat** | Data Engineer | Azure | Databricks | PySpark | Synapse | Power BI
