👨‍🔧 Online Retail Data Pipeline 👷

Retail Data Pipeline with Terraform, Airflow, GCP BigQuery, dbt, Soda, and Looker

📝 Table of Contents

Project Overview
Key Insights
Project Architecture
Usage
Credits
Contact

🔬 Project Overview

This an end-to-end data engineering project, where I created an ELT data pipeline to extract, analyze, and visualize insights from the data of an online retail company based in the UK.

💾 Dataset

This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

The dataset includes the following columns:

Column	Description
InvoiceNo	Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
StockCode	Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description	Product (item) name. Nominal.
Quantity	The quantities of each product (item) per transaction. Numeric.
InvoiceDate	Invoice Date and time. Numeric, the day and time when each transaction was generated.
UnitPrice	Unit price. Numeric, Product price per unit in sterling.
CustomerID	Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country	Country name. Nominal, the name of the country where each customer resides.

🎯 Project Goals

Set up the cloud infrastructure using Terraform.
Set up your Airflow local environment with the Astro CLI.
Create a data pipeline from scratch using Apache Airflow.
Upload CSV files into Google Cloud Storage.
Ingest data into BigQuery.
Implement data quality checks in the pipeline using Soda.
Integrate dbt and run data models with Airflow and Cosmos.
Visualize insights using Looker Studio.

🕵️ Key Insights

💸 Total Revenue by Country
- The UK 🇬🇧 is the country that generated the most of the company's revenue with over 7M followed by the Netherlands with 285k.
🎁 Top 3 sold products
- N°1: "PAPER CRAFT, LITTLE BIRDIE" is the Top selling product with 24,5%
- N°2: "MEDIUM CERAMIC TOP STORAGE JAR" is the second Top seller with 23%
- N°3: "WORLD WAR 2 GLIDERS" are the third most selling gift product with 15,6%

These 3 products alone account for more than 50% of the sales.

📈 Revenue by months
- The month with the most revenue is November with more than 1,1M.
- The month with the lowest revenue is February.

This can be explained by the months of October, November, and December having important holidays like Christmas, Halloween, etc... where people tend to give each other gifts.

📝 Project Architecture

The end-to-end data pipeline includes the following steps:

Setting up the infrastructure on GCP (Terraform)
Downloading, processing, and uploading the initial dataset to a Data Lake (GCP Storage Bucket)
Moving the data from the lake to a Data Warehouse (GCP BigQuery)
Transforming the data in the Data Warehouse and preparing it for the dashboard (dbt)
Checking the quality of the data in the Data Warehouse (Soda)
Creating the dashboard (Looker Studio)

You can find the detailed information on the diagram below:

🔧 Pipeline Architecture

🌪️ Pipeline on Airflow

⚙️ Data Modeling

🛠️ Technologies Used

Infrastructure: Terraform
Google Cloud Platform (GCP)
- Data Lake (DL): Cloud Storage
- Data Warehouse (DWH): BigQuery
Astro SDK for Airflow
Workflow orchestration: Apache Airflow
Transforming data: dbt (Data Build Tool)
Data quality checks: Soda
Containerization: Docker
Data Visualization: Looker Studio

💻 Usage

First, clone this repository.

git clone https://github.com/Hamagistral/OnlineRetail-DataEng.git

1. Pre-requisites

Make sure you have the following pre-installed components:

2. Google Cloud Platform

To set up GCP, please follow the steps below:

If you don't have a GCP account, please create a free trial.
Set up new project and write down your Project ID.
Configure service account to get access to this project and download auth-keys (.json). Please check the service account has all the permissions below:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
Download SDK for local setup.
Set environment variable to point to your downloaded auth-keys:

export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login

Enable the following options under the APIs and services section:

Identity and Access Management (IAM) API
IAM service account credentials API

3. Terraform

We use Terraform to build and manage GCP infrastructure. Terraform configuration files are located in the separate folder. There are 3 configuration files:

terraform-version - contains information about the installed version of Terraform;
variables.tf - contains variables to make your configuration more dynamic and flexible;
main.tf - is a key configuration file consisting of several sections.

Now you can use the steps below to generate resources inside the GCP:

Move to the terraform folder using bash command cd
Run terraform init command to initialize the configuration.
Use terraform plan to match previews local changes against a remote state.
Apply changes to the cloud with terraform apply command.

Note: In steps 3 and 4 Terraform may ask you to specify the Project ID. Please use the ID that you noted down earlier at the project setup stage.

If you would like to remove your stack from the Cloud, use the terraform destroy command.

4. Airflow

Start Airflow on your local machine by running:

astro dev start

This command will spin up 4 Docker containers on your machine, each for a different Airflow component:

Postgres: Airflow's Metadata Database
Webserver: The Airflow component responsible for rendering the Airflow UI
Scheduler: The Airflow component responsible for monitoring and triggering tasks
Triggerer: The Airflow component responsible for triggering deferred tasks

Verify that all 4 Docker containers were created by running 'docker ps'.

Note: Running 'astro dev start' will start your project with the Airflow Webserver exposed at port 8080 and Postgres exposed at port 5432. If you already have either of those ports allocated, you can either stop your existing Docker containers or change the port.

Access the Airflow UI for your local Airflow project. To do so, go to http://localhost:8080/ and log in with 'admin' for both your Username and Password.

You should also be able to access your Postgres Database at 'localhost:5432/postgres'.

Configure your Google Cloud Platform credentials.
Create and configure the necessary connections in Airflow.
Customize the Airflow DAGs to suit your specific requirements.
Run the pipeline and monitor its execution.
Explore the data using Looker Studio for insights and visualization.

📋 Credits

This Project is inspired by the recent video of the YouTube Channel Data With Marc
Readme inspired by this Project

📨 Contact Me

LinkedIn • Website • Gmail

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.astro		.astro
dags		dags
include		include
tests/dags		tests/dags
.dockerignore		.dockerignore
.gitignore		.gitignore
Architecture.png		Architecture.png
Dockerfile		Dockerfile
README.md		README.md
packages.txt		packages.txt
requirements.txt		requirements.txt

Hamagistral/OnlineRetail-DataEng

Folders and files

Latest commit

History

Repository files navigation