An end-to-end batch data pipeline that ingests, transforms, and visualizes European payment transaction statistics published by the European Central Bank (ECB).
Quick start: The project can be run using `make` commands (see Makefile) or manually step by step (see Setup).
The ECB publishes the PST (Payment Statistics) dataset — semi-annual data on payment transactions processed across European payment systems (T2 and retail/large-value systems) reported by National Central Banks of EU member states. The data covers volumes and values of transactions broken down by country, transaction type, payment channel, currency, and payment system — available from 2000 onwards.
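For reference, the full series can be pulled from the ECB data portal as CSV. Below is a minimal sketch: the endpoint URL and the `format=csvdata` parameter are assumptions based on the ECB's SDMX REST API, and the project's actual ingestion lives in `src/fetch_data.py` (which also uploads the Parquet to GCS).

```python
# A minimal ingestion sketch; the endpoint URL and format=csvdata parameter
# are assumptions based on the ECB data portal's SDMX REST API.
import io

import pandas as pd
import requests

ECB_PST_URL = "https://data-api.ecb.europa.eu/service/data/PST"  # assumed endpoint

resp = requests.get(ECB_PST_URL, params={"format": "csvdata"}, timeout=120)
resp.raise_for_status()

# Parse the CSV payload and persist it as Parquet (requires pyarrow).
df = pd.read_csv(io.StringIO(resp.text))
df.to_parquet("payment_transactions.parquet", index=False)
print(f"fetched {len(df):,} rows")
```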
This project answers two questions:
- How has the volume of different transaction types evolved in the EU over time?
- Which countries dominate European payment systems?
```
ECB API (CSV)
      │
      ▼
[ fetch_data.py ]
      │  pandas → Parquet
      ▼
Google Cloud Storage (raw)
      │
      ▼
[ transform_pyspark.py ]
      │  PySpark aggregations → pandas → BigQuery
      ▼
BigQuery (aggregated dashboard tables)
      │
      ▼
Metabase (dashboards)
```
Orchestration: Dagster manages the full pipeline as a graph of assets.
Infrastructure as Code: Terraform provisions GCS bucket and BigQuery dataset.
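To make the orchestration concrete, here is a simplified sketch of how two of the pipeline's assets could be wired together. The real asset definitions live in `dagster_pipeline/assets/` and carry the actual GCS and BigQuery I/O; the names below mirror the assets listed under "Asset execution order", and the return value is a placeholder.

```python
# A simplified sketch of the asset graph, assuming a modern (1.x) Dagster.
from dagster import Definitions, asset


@asset
def payment_transactions_gcs() -> str:
    # Upstream asset: fetch the ECB CSV and land it in GCS as Parquet
    # (see src/fetch_data.py). The returned path is a placeholder.
    return "gs://your-bucket/raw/payment_transactions.parquet"


@asset
def dash_eu_trend(payment_transactions_gcs: str) -> None:
    # Dagster infers the dependency from the parameter name, so this asset
    # only materializes after payment_transactions_gcs has succeeded.
    print(f"aggregating {payment_transactions_gcs} and writing to BigQuery")


# Entry point equivalent to dagster_pipeline/definitions.py
defs = Definitions(assets=[payment_transactions_gcs, dash_eu_trend])
```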
| Layer | Technology |
|---|---|
| Cloud | Google Cloud Platform |
| IaC | Terraform |
| Data Lake | Google Cloud Storage |
| Data Warehouse | BigQuery |
| Transformation | PySpark (local) |
| Orchestration | Dagster |
| Visualization | Metabase (Docker) |
Line chart showing how transaction volumes for each `TYP_TRNSCTN` (e.g. credit transfers, card payments, direct debits) evolved across semi-annual periods. Filter by `UNIT_MEASURE = PN` (pure number) to compare transaction counts.
Horizontal bar chart ranking countries (`REF_AREA`) by total transaction volume, sorted descending. It shows which EU member states dominate European payment systems.
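Both charts are backed by aggregates computed in the PySpark step. The sketch below shows the kind of aggregation involved: `TYP_TRNSCTN`, `REF_AREA`, and `UNIT_MEASURE` come from the dataset description above, while `TIME_PERIOD` and `OBS_VALUE` are assumed standard SDMX column names. The project's actual logic lives in `src/transform_pyspark.py`.

```python
# A sketch of the aggregations behind the two dashboard tables; TIME_PERIOD
# and OBS_VALUE are assumed SDMX column names, not confirmed by the source.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pst-transform").getOrCreate()
df = spark.read.parquet("payment_transactions.parquet")

# dash_eu_trend_data: transaction counts (UNIT_MEASURE = PN) per semi-annual
# period and transaction type.
eu_trend = (
    df.filter(F.col("UNIT_MEASURE") == "PN")
    .groupBy("TIME_PERIOD", "TYP_TRNSCTN")
    .agg(F.sum("OBS_VALUE").alias("total_volume"))
)

# dash_country_map_data: total counts per reporting country, ranked descending.
country_map = (
    df.filter(F.col("UNIT_MEASURE") == "PN")
    .groupBy("REF_AREA")
    .agg(F.sum("OBS_VALUE").alias("total_volume"))
    .orderBy(F.desc("total_volume"))
)
```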
After configuring `.env` (step 2 below), you can use the `make` shortcuts:
```
make bootstrap       # full setup from scratch: install deps, provision infra, start everything
make setup           # install uv, Python 3.13, venv and all dependencies
make install-java    # install OpenJDK 21 (optional, if Java is not present)
make infra-init      # terraform init
make infra-apply     # provision GCS + BigQuery
make metabase-up     # start Metabase
make pipeline        # start Dagster UI
make run             # infra-apply + metabase-up + pipeline
make infra-destroy   # tear down infrastructure
```

- Python 3.13+
- Java 21 (required by PySpark — tested with OpenJDK 21.0.10)
- Docker & Docker Compose
- Terraform
- GCP project with a service account that has the `Storage Admin` and `BigQuery Admin` roles
```
git clone <repo-url>
cd payment_transaction
make setup
```

This installs uv (if missing) and Python 3.13, creates a virtual environment, and installs all dependencies.
Copy `.env_example` to `.env` and fill in your values:

```
cp .env_example .env
```

```
TF_VAR_project="your-gcp-project-id"
TF_VAR_gcs_bucket_name="your-unique-bucket-name"
TF_VAR_bq_dataset_name="payment_transaction_dataset"
GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account-key.json"
```

Load environment variables:
```
set -a && source .env && set +a
```

```
cd terraform
terraform init
terraform apply
cd ..
```

This creates a GCS bucket and BigQuery dataset in `europe-central2`.
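Optionally, you can sanity-check that both resources exist before running the pipeline. A small sketch, assuming the `google-cloud-storage` and `google-cloud-bigquery` packages are available in the virtual environment and the `.env` variables from step 2 are loaded in the current shell:

```python
# Optional sanity check that Terraform created both resources.
import os

from google.cloud import bigquery, storage

project = os.environ["TF_VAR_project"]
bucket_name = os.environ["TF_VAR_gcs_bucket_name"]
dataset_name = os.environ["TF_VAR_bq_dataset_name"]

assert storage.Client(project=project).bucket(bucket_name).exists(), "bucket missing"
bigquery.Client(project=project).get_dataset(dataset_name)  # raises NotFound if absent
print("GCS bucket and BigQuery dataset are provisioned")
```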
```
docker compose up -d
```

Metabase will be available at http://localhost:3000.
Start Dagster:
```
dagster dev
```

Open http://localhost:3000 (or the port shown in the logs if 3000 is taken, e.g. because Metabase is already using it) and materialize all assets.
Asset execution order:
1. `payment_transactions_gcs` — fetches ECB data and uploads to GCS
2. `payment_description_gcs` — uploads column descriptions to GCS
3. `payment_transactions_local` — downloads the parquet locally for PySpark
4. `dash_eu_trend_data`, `dash_country_map_data` — PySpark transformations
5. `dash_eu_trend`, `dash_country_map` — write aggregated tables to BigQuery (sketched below)
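The final step collects each small aggregate to pandas and loads it into BigQuery, matching the "PySpark aggregations → pandas → BigQuery" edge in the diagram. A minimal sketch of one such load, using `google-cloud-bigquery`'s `load_table_from_dataframe`; the DataFrame here is illustrative, since in the pipeline it comes from the PySpark assets:

```python
# A minimal sketch of one BigQuery write; the DataFrame contents are
# illustrative placeholders, not real pipeline output.
import os

import pandas as pd
from google.cloud import bigquery

pdf = pd.DataFrame(
    {"TIME_PERIOD": ["2023-H1"], "TYP_TRNSCTN": ["CT"], "total_volume": [123.0]}
)

client = bigquery.Client(project=os.environ["TF_VAR_project"])
table_id = f"{client.project}.{os.environ['TF_VAR_bq_dataset_name']}.dash_eu_trend"
client.load_table_from_dataframe(pdf, table_id).result()  # wait for the load job
```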
In Metabase: Settings → Databases → Add database → BigQuery
- Project ID: value of `TF_VAR_project`
- Dataset ID: value of `TF_VAR_bq_dataset_name`
- Service account JSON: paste the contents of your credentials file
```
├── src/
│   ├── constants.py            # Shared constants (URLs, table names, dtypes)
│   ├── fetch_data.py           # DataIngestion: ECB API → GCS
│   └── transform_pyspark.py    # DataTransform: GCS → PySpark → BigQuery
├── dagster_pipeline/
│   ├── assets/
│   │   ├── ingestion.py        # Dagster ingestion assets
│   │   └── transformation.py   # Dagster transformation + BQ write assets
│   └── definitions.py          # Dagster Definitions entry point
├── terraform/                  # GCS bucket + BigQuery dataset
├── docker-compose.yml          # Metabase
├── dagster.yaml                # Dagster logs configuration
└── .env_example                # Environment variables template
```


