Sales ETL Project

Table of Contents

  1. Project Overview
  2. Technologies Used
  3. Data Pipeline
  4. Repository Structure
  5. Prerequisites
  6. Configuration
  7. Running the Project
  8. Cleaning Up
  9. Dashboard
  10. Acknowledgments
  11. Conclusion
  12. Contacts

Project Overview

This ETL (Extract, Transform, Load) project demonstrates the process of extracting data from a SQL Server database, transforming it using Python, orchestrating the data pipeline with Apache Airflow (running in a Docker container), loading the transformed data into Google BigQuery data warehouse, and finally creating a dashboard using Looker Studio.

Technologies Used

The following technologies and tools were used in this project:

  • Python: For data extraction and transformation.
  • Apache Airflow: To orchestrate the ETL pipeline.
  • Docker: To run the Apache Airflow instance in a container.
  • Google Cloud Platform (GCP): For BigQuery data warehousing.
  • Looker Studio: For creating the dashboard.

Data Pipeline

Here is the sales data pipeline:

sales_Data_pipeline

Repository Structure

sales_data_pipeline:.
│   etl_script.ipynb
│   README.md
│   vocal-raceway-400113-7e5d84b42be1.json
│
├───airflow_installation
│       docker-compose.yml
│       Dockerfile
│
└───images
        connection_mssql.png
        data_in_bigQuery.png
        project_metadata.png
        Sales_Dashboard.jpg
        sales_Data_pipeline.png
        sales_data_sqlserver.png

Note that the JSON file contains your project metadata and some other information; you can generate it from your project created on GCP (BigQuery). Here is a capture of the file:

project metadata
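As a sketch of how this file might be used to authenticate from Python (assuming it is a service-account key and that the google-cloud-bigquery package is installed; the file name below is the one in this repository, so substitute your own):

    from google.cloud import bigquery

    # Authenticate with the service-account key generated from the GCP console.
    # The file name matches the one in this repository; substitute your own key.
    client = bigquery.Client.from_service_account_json(
        "vocal-raceway-400113-7e5d84b42be1.json"
    )

    # Quick sanity check: list the datasets this account can see.
    for dataset in client.list_datasets():
        print(dataset.dataset_id)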

Prerequisites

Before running this project, you should have the following prerequisites installed and configured:

  1. Python: Install Python 3 on your system.

  2. Apache Airflow: Set up Apache Airflow, and ensure it's running in a Docker container.

  3. Google Cloud Platform (GCP) Account: You need a GCP account with appropriate permissions to create and manage a BigQuery dataset.

  4. Looker Studio: Sign in to Looker Studio.

Configuration

Before running the ETL project, you'll need to configure some settings:

ETL Configuration

In etl_script.ipynb, modify the following variables:

  • BIGQUERY_PROJECT_ID: Your GCP project ID where BigQuery is located.

  • BIGQUERY_DATASET_ID: The BigQuery dataset where you want to load the transformed data.
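A minimal sketch of how these two variables might feed the load step (assuming pandas and google-cloud-bigquery; the table name and sample rows below are placeholders, not the notebook's actual values):

    import pandas as pd
    from google.cloud import bigquery

    BIGQUERY_PROJECT_ID = "your-gcp-project-id"  # your GCP project ID
    BIGQUERY_DATASET_ID = "sales_dataset"        # target dataset (name assumed)

    client = bigquery.Client(project=BIGQUERY_PROJECT_ID)
    table_id = f"{BIGQUERY_PROJECT_ID}.{BIGQUERY_DATASET_ID}.sales"  # table name assumed

    # Stand-in for the transformed sales data produced by the notebook.
    df = pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 5.00]})

    # Load the DataFrame, replacing the table contents if it already exists.
    job = client.load_table_from_dataframe(
        df,
        table_id,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
    )
    job.result()  # block until the load job finishes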

Apache Airflow Configuration

In the docker-compose.yml file, configure Airflow environment variables such as database connection, authentication, and DAG location.
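For illustration, a fragment of what such settings might look like in docker-compose.yml (service names, paths, and credentials here are assumptions, not the repository's actual values):

    services:
      airflow:
        build: .   # builds from the Dockerfile in airflow_installation
        environment:
          # Metadata database connection (assumes a Postgres service named "postgres")
          AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
          # Where Airflow looks for DAG files inside the container
          AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
          # Keep the example DAGs out of the UI
          AIRFLOW__CORE__LOAD_EXAMPLES: "False"
        volumes:
          - ./dags:/opt/airflow/dags
        ports:
          - "8080:8080"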

Running the Project

Before running this project, ensure that you have the data in your SQL Server database:

sql server database
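For reference, a minimal sketch of the extraction step (assuming pyodbc and pandas; the server, database, credentials, and sales table name are placeholders for your own setup):

    import pandas as pd
    import pyodbc

    # Connection details are placeholders; point them at your SQL Server instance.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;"
        "DATABASE=sales_db;"
        "UID=sa;PWD=your_password"
    )

    # Pull the raw sales rows into a DataFrame for transformation.
    df = pd.read_sql("SELECT * FROM sales", conn)
    conn.close()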

Follow these steps to run the ETL project:

  1. Start the Apache Airflow instance:

    docker-compose up -d
  2. Access the Apache Airflow web UI to trigger the ETL process. Open a web browser, navigate to http://localhost:8080, and log in. Then create the MSSQL connection from Admin > Connections: mssql connection

  3. Trigger the ETL DAG to start the data extraction and transformation process (a sketch of such a DAG appears after this list).

  4. Once the data is transformed, it will be loaded into the specified BigQuery dataset.

data in BigQuery

  5. In Looker Studio, after connecting to BigQuery, you can now create a dashboard.
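As referenced in step 3, here is a minimal sketch of what the orchestrating DAG might look like (the DAG id, task names, and schedule are assumptions; the actual extract/transform/load logic lives in etl_script.ipynb):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # read the sales tables from SQL Server (see the pyodbc sketch above)

    def transform():
        ...  # clean and reshape the data with pandas

    def load():
        ...  # write the result to BigQuery with load_table_from_dataframe

    with DAG(
        dag_id="sales_etl",          # DAG name assumed
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",  # schedule assumed
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # Run the three stages in order.
        extract_task >> transform_task >> load_task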

Cleaning Up

After completing the project, you can clean up as follows:

  • Stop the Apache Airflow instance:

    docker-compose down
  • Optionally, delete the BigQuery dataset and any other resources you created on GCP.
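If you prefer to clean up the warehouse programmatically, a sketch (the project and dataset names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project="your-gcp-project-id")

    # Delete the dataset and everything in it; the name is a placeholder.
    client.delete_dataset("sales_dataset", delete_contents=True, not_found_ok=True)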

Dashboard

Here is the Sales Dashboard created in Looker Studio:

Sales Dashboard

Acknowledgments

Special thanks to the open-source communities behind Apache Airflow and Python, and to the Looker Studio team, for providing powerful tools for data extraction, transformation, orchestration, and visualization.

Conclusion

This ETL project showcases a complete workflow for extracting, transforming, and loading data into a cloud-based data warehouse, while also creating an interactive dashboard for data exploration and analysis.

Contacts

For any questions or inquiries related to this project, feel free to contact me :)
