## Table of Contents

- Project Overview
- Technologies Used
- Data Pipeline
- Repository Structure
- Prerequisites
- Configuration
- Running the Project
- Cleaning Up
- Dashboard
- Acknowledgments
- Conclusion
- Contacts
## Project Overview

This ETL (Extract, Transform, Load) project demonstrates how to extract data from a SQL Server database, transform it with Python, orchestrate the pipeline with Apache Airflow (running in a Docker container), load the transformed data into a Google BigQuery data warehouse, and finally build a dashboard in Looker Studio.
## Technologies Used

The following technologies and tools were used in this project:
- Python: For data extraction and transformation.
- Apache Airflow: To orchestrate the ETL pipeline.
- Docker: To run the Apache Airflow instance in a container.
- Google Cloud Platform (GCP): For BigQuery data warehousing.
- Looker Studio: For creating the dashboard.
## Data Pipeline

Here is the sales data pipeline:
## Repository Structure

```
sales_data_pipeline
│   etl_script.ipynb
│   README.md
│   vocal-raceway-400113-7e5d84b42be1.json
│
├───airflow_installation
│       docker-compose.yml
│       Dockerfile
│
└───images
        connection_mssql.png
        data_in_bigQuery.png
        project_metadata.png
        Sales_Dashboard.jpg
        sales_Data_pipeline.png
        sales_data_sqlserver.png
```
Note that the JSON file contains your project metadata and some other information. You can generate it from your project created on GCP (BigQuery). Here is a capture of the file:
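For reference, here is a minimal sketch of how such a service account key file can be used to authenticate the BigQuery client. It assumes the `google-cloud-bigquery` package is installed; the exact code in `etl_script.ipynb` may differ.

```python
# Minimal sketch: authenticate a BigQuery client with the service
# account key file generated on GCP (file name matches this repo).
from google.cloud import bigquery
from google.oauth2 import service_account

KEY_PATH = "vocal-raceway-400113-7e5d84b42be1.json"

credentials = service_account.Credentials.from_service_account_file(
    KEY_PATH,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# The client picks up the project ID embedded in the key file.
client = bigquery.Client(
    credentials=credentials, project=credentials.project_id
)
print(client.project)  # sanity check: prints your GCP project ID
```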
## Prerequisites

Before running this project, you should have the following prerequisites installed and configured:

- Python: Install Python (version >= 3.x) on your system.
- Apache Airflow: Set up Apache Airflow and ensure it's running in a Docker container.
- Google Cloud Platform (GCP) account: You need a GCP account with the appropriate permissions to create and manage a BigQuery dataset.
- Looker Studio: Sign in to Looker Studio.
## Configuration

Before running the ETL project, you'll need to configure some settings.

In `etl_script.ipynb`, modify the following variables (see the sketch below):

- `BIGQUERY_PROJECT_ID`: Your GCP project ID where BigQuery is located.
- `BIGQUERY_DATASET_ID`: The BigQuery dataset where you want to load the transformed data.
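As a reference, here is a minimal sketch of how these variables might be wired up in the notebook. The `load_to_bigquery` helper and the `sales` table name are hypothetical, not taken from this repo.

```python
# Sketch of how the configuration variables might be used to load a
# transformed DataFrame into BigQuery. Table name is a placeholder.
import pandas as pd
from google.cloud import bigquery

BIGQUERY_PROJECT_ID = "your-gcp-project-id"  # your GCP project ID
BIGQUERY_DATASET_ID = "your_dataset"         # target BigQuery dataset

client = bigquery.Client(project=BIGQUERY_PROJECT_ID)


def load_to_bigquery(df: pd.DataFrame, table_name: str = "sales") -> None:
    """Load a transformed DataFrame into the configured dataset."""
    table_id = f"{BIGQUERY_PROJECT_ID}.{BIGQUERY_DATASET_ID}.{table_name}"
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # block until the load job finishes
```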
In the `docker-compose.yml` file, configure the Airflow environment variables, such as the database connection, authentication, and the DAG location.
Before running this project, ensure that you have the data in your SQL Server database.
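For reference, here is a minimal sketch of what the extraction side might look like with `pyodbc` and pandas. The server, database, credentials, and table name are placeholders, not values from this repo.

```python
# Sketch of the extraction step: pull sales data from SQL Server
# into a pandas DataFrame. All connection details are placeholders.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=sales_db;"
    "UID=sa;"
    "PWD=your_password;"
)

df = pd.read_sql("SELECT * FROM sales", conn)
conn.close()
print(df.head())
```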
## Running the Project

Follow these steps to run the ETL project:

- Start the Apache Airflow instance:

  ```
  docker-compose up -d
  ```

- Access the Apache Airflow web UI to trigger the ETL process: open a web browser, navigate to http://localhost:8080, and log in to the Airflow web UI. Then create the MSSQL connection under Admin > Connections.
- Trigger the ETL DAG to start the data extraction and transformation process (a minimal DAG sketch follows these steps).
- Once the data is transformed, it will be loaded into the specified BigQuery dataset.
- In Looker Studio, after connecting to BigQuery, you can now create a dashboard.
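For illustration only, here is a minimal Airflow 2.x DAG sketch for this kind of pipeline. The DAG ID, connection ID (`mssql_default`), SQL query, and BigQuery table ID are placeholders, not values taken from this repo; the actual DAG lives in the DAGs folder configured in `docker-compose.yml`.

```python
# Illustrative sketch only: a single-task ETL DAG in TaskFlow style.
# The connection id, query, and table id below are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def sales_etl():
    @task
    def run_etl() -> None:
        # Extract: read from SQL Server via the Airflow connection
        # created under Admin > Connections.
        from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

        df = MsSqlHook(mssql_conn_id="mssql_default").get_pandas_df(
            "SELECT * FROM sales"
        )

        # Transform: placeholder clean-up step.
        df = df.drop_duplicates()

        # Load: push the result into BigQuery (placeholder table id).
        from google.cloud import bigquery

        client = bigquery.Client()
        client.load_table_from_dataframe(
            df, "your-gcp-project-id.your_dataset.sales"
        ).result()

    run_etl()


sales_etl()
```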
## Cleaning Up

After completing the project, you can clean up as follows:

- Stop the Apache Airflow instance:

  ```
  docker-compose down
  ```

- Optionally, delete the BigQuery dataset and any other resources you created on GCP (see the sketch below).
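If you prefer to clean up programmatically instead of through the GCP console, here is a minimal sketch; the project and dataset IDs are placeholders.

```python
# Sketch: delete the BigQuery dataset and everything inside it.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")
client.delete_dataset(
    "your_dataset",
    delete_contents=True,  # also drop the tables in the dataset
    not_found_ok=True,     # no error if it was already removed
)
```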
## Dashboard

Here is the sales dashboard created in Looker Studio:
## Acknowledgments

Special thanks to the open-source communities behind Apache Airflow, Python, and Looker Studio for providing powerful tools for data extraction, transformation, orchestration, and visualization.
## Conclusion

This ETL project showcases a comprehensive workflow for extracting, transforming, and loading data into a cloud-based data warehouse, while also creating an interactive dashboard for data exploration and analysis.
## Contacts

For any questions or inquiries related to this project, feel free to contact me :)

- LinkedIn: Tejodhay Bonam