- Objective
- Technologies Used
- Problem Description
- Project Overview
- Architecture Overview
- Setup Instructions
- Airflow Pipeline Details
- dbt Transformation Details
- Running the Complete Pipeline
- Visualization in Power BI
- Next Steps
# Unlocking Insights into Mental Health: Building a Robust Data Pipeline for Global Analysis
## Objective

The primary objective of this project is to engineer an automated, end-to-end data intelligence pipeline that monitors and analyzes the global intersection of professional environments and mental health outcomes. By leveraging Airflow, PySpark, and Google Cloud Platform, the project transforms static global survey data into a dynamic analytical ecosystem. This system is designed to identify shifting trends in workplace stress, bridge the 'treatment gap' through predictive data modeling, and provide public health stakeholders with a real-time diagnostic tool to optimize workplace wellness and intervention strategies.
## Technologies Used

This project leverages cutting-edge cloud and data engineering tools:
Terraform - Infrastructure as Code
Google Cloud Platform (GCP) - Cloud Services
Apache Airflow - Workflow Orchestration
Google BigQuery - Data Warehouse
dbt - Data Transformation
Power BI - Data Visualization
## Problem Description

This project utilizes the "Mental Health Dataset" from Kaggle, available here. This open-source dataset provides a foundation for analyzing mental health patterns.
## Project Overview

The goal is to construct a comprehensive data pipeline that ingests, processes, and visualizes mental health data, revealing insights across time, geography, and demographic variables to inform public health strategies.
The project encompasses:
- Infrastructure Provisioning: Deploy cloud resources using Terraform.
- Data Ingestion: Load dataset into Google Cloud Storage (GCS).
- Data Transfer: Move data from GCS to BigQuery via Airflow.
- Data Transformation: Process data in BigQuery using dbt.
- Visualization: Create dashboards in Power BI.
All development occurs within a GCP Virtual Machine for consistency.
## Architecture Overview

This project implements a modern data engineering pipeline with four key stages:
```
┌─────────────────────────────────────────────────────────────────┐
│ DATA PIPELINE ARCHITECTURE │
└─────────────────────────────────────────────────────────────────┘
STAGE 1: INFRASTRUCTURE (Terraform)
┌──────────────────────────────────────────────────────────────────┐
│ Google Cloud Platform │
│ ├── GCS Bucket (data lake) - Stores raw/processed data │
│ ├── BigQuery Dataset - Data warehouse │
│ └── Compute Instance (n2-std-4) - Runs Spark & Airflow │
└──────────────────────────────────────────────────────────────────┘
↓
STAGE 2: ORCHESTRATION & INGESTION (Apache Airflow)
┌──────────────────────────────────────────────────────────────────┐
│ Airflow DAGs (Running in Docker) │
│ ├── kaggle_ingestion_dag │
│ │ ├── Download from Kaggle │
│ │ └── Verify data integrity │
│ │ │
│ ├── gcp_upload_dag (Full Pipeline) │
│ │ ├── Download from Kaggle │
│ │ ├── Upload to GCS (via PySpark) │
│ │ └── Transfer GCS → BigQuery │
│ │ │
│ └── gcs_to_bigquery_dag (Optional) │
│ └── Direct GCS → BigQuery transfer │
└──────────────────────────────────────────────────────────────────┘
↓
STAGE 3: TRANSFORMATION & ANALYTICS (dbt)
┌──────────────────────────────────────────────────────────────────┐
│ dbt Models & Transformations │
│ ├── Staging Layer (Views) │
│ │ └── stg_mental_health: Clean & standardize raw data │
│ │ │
│ └── Mart Layer (Tables) │
│ └── fct_mental_health_analysis: Analytical fact table │
└──────────────────────────────────────────────────────────────────┘
↓
STAGE 4: VISUALIZATION
┌──────────────────────────────────────────────────────────────────┐
│ Power BI Dashboards │
│ └── Connected to dbt fact tables in BigQuery │
└──────────────────────────────────────────────────────────────────┘
```
Key Technologies:
- Infrastructure: Terraform + GCP
- Orchestration: Apache Airflow (Docker)
- Processing: PySpark + Spark 3.5.0
- Storage: Google Cloud Storage (GCS) + BigQuery
- Transformation: dbt (data build tool)
- Visualization: Power BI
## Setup Instructions

- Create a new project in the Google Cloud Console.
- Navigate to IAM & Admin > Service Accounts and create a new service account.
- Generate a JSON key for the service account and download it.
- Save the key file as `teraform-mar-0fb97fcd6586.json` in the `keys/` directory.
- Enable the Compute Engine API: Go to "APIs & Services" > "Library," search for "Compute Engine API," and enable it.
- Under IAM & Admin > IAM, assign these roles to the service account:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- Compute Admin
- Generate an SSH key pair:
  ```bash
  ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USERNAME -b 2048
  ```
- Copy the public key content and paste it into Compute Engine > Metadata > SSH Keys.
Terraform automates the creation of essential cloud infrastructure. This section provides step-by-step guidance for beginners to replicate the setup.
- Install Terraform on your local machine.
- Ensure GCP credentials are configured as above.
The Terraform configuration deploys:
- Google Cloud Storage (GCS) Bucket: For data lake storage with automatic cleanup of incomplete uploads after 1 day.
- BigQuery Dataset: Named `Mar_Mental_Health_Big_Data_Project` for data warehousing.
- GCP Virtual Machine: An `n2-standard-4` instance in `us-central1-a` with Debian 11, tagged for the project, and equipped with a local SSD for performance.
- Navigate to the Terraform Directory:
  ```bash
  cd Terraform/
  ```
- Initialize Terraform:
  ```bash
  terraform init
  ```
  This downloads necessary providers and sets up the working directory.
- Review the Plan (Optional but Recommended):
  ```bash
  terraform plan
  ```
  This shows what resources will be created without applying changes.
- Apply the Configuration:
  ```bash
  terraform apply
  ```
  - Review the proposed changes and type `yes` to confirm.
  - Terraform will create the GCS bucket, BigQuery dataset, and VM instance.
- Verify Resources:
  - Check the GCP Console for the new bucket, dataset, and VM.
After deployment:
- Go to Compute Engine > VM instances in the GCP Console.
- Note the External IP of the `mental-health-vm` instance.
- Connect via SSH:
  ```bash
  ssh -i ~/.ssh/KEY_FILENAME USERNAME@EXTERNAL_IP
  ```
  Replace `KEY_FILENAME` and `USERNAME` with your values.
- Optional: Configure SSH Config for easier access. Add to `~/.ssh/config`:
  ```
  Host mental-health-vm
      HostName EXTERNAL_IP
      User USERNAME
      IdentityFile ~/.ssh/KEY_FILENAME
  ```
  Then connect with:
  ```bash
  ssh mental-health-vm
  ```
Docker is essential for containerizing applications and running services like Airflow. Follow these step-by-step instructions to install Docker and Docker Compose on a fresh Debian/Ubuntu system.
- Update Package Index:
  ```bash
  sudo apt update
  ```
- Install Prerequisites:
  ```bash
  sudo apt install apt-transport-https ca-certificates curl gnupg lsb-release -y
  ```
- Add Docker's Official GPG Key:
  ```bash
  curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
  ```
- Add Docker Repository:
  ```bash
  echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  ```
- Install Docker and Docker Compose:
  ```bash
  sudo apt update
  sudo apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
  ```
- Start Docker Service:
  ```bash
  sudo systemctl start docker
  sudo systemctl enable docker
  ```
- Add User to Docker Group:
  ```bash
  sudo usermod -aG docker $USER
  ```
  Note: Log out and back in for changes to take effect.
- Verify Installation:
  ```bash
  docker --version
  docker compose version
  ```
This section outlines the steps to set up Java, Apache Spark, and PySpark on the GCP VM.
- Download and extract OpenJDK 11:
  ```bash
  wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
  tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
  rm openjdk-11.0.2_linux-x64_bin.tar.gz
  ```
- Configure environment variables:
  ```bash
  export JAVA_HOME="${HOME}/jdk-11.0.2"
  export PATH="${JAVA_HOME}/bin:${PATH}"
  ```
  Add these lines to `~/.bashrc` for persistence.
- Download and unpack Spark:
  ```bash
  wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
  tar xzfv spark-3.5.0-bin-hadoop3.tgz
  rm spark-3.5.0-bin-hadoop3.tgz
  ```
- Update environment variables:
  ```bash
  export SPARK_HOME="${HOME}/spark-3.5.0-bin-hadoop3"
  export PATH="${SPARK_HOME}/bin:${PATH}"
  ```
  Add these lines to `~/.bashrc` for persistence.
- Install GCS Connector: To allow Spark to read from and write to Google Cloud Storage (`gs://` paths), download the GCS connector JAR:
  ```bash
  wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
  mv gcs-connector-hadoop3-latest.jar ${SPARK_HOME}/jars/
  ```
- Install Python and Jupyter:
  ```bash
  sudo apt update
  sudo apt install -y python3-pip python3-pandas
  pip3 install jupyter
  ```
- Launch PySpark with Jupyter support:
  ```bash
  PYSPARK_DRIVER_PYTHON=jupyter \
  PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
  $SPARK_HOME/bin/pyspark
  ```
This section provides detailed steps to set up Apache Airflow using Docker Compose, along with troubleshooting tips for common issues.
- Docker and Docker Compose installed
- At least 2GB of available RAM
- Ports 8080 (Airflow UI) and 5432 (PostgreSQL) available
- Kaggle Account: Required for data ingestion.
- Navigate to the Airflow directory:
  ```bash
  cd Apache_airflow
  ```
- Kaggle Configuration (Crucial for Ingestion):
  - Place your `kaggle.json` file in `~/.kaggle/kaggle.json`.
  - Permissions: Ensure the directory and file are accessible by the Docker user (UID 50000):
    ```bash
    chmod 755 ~/.kaggle
    chmod 644 ~/.kaggle/kaggle.json
    ```
- Data Directory Permissions:
  - Ensure the `data/` directory is world-writable so Airflow can save the downloaded files:
    ```bash
    chmod 777 ../data
    ```
- Start services:
  ```bash
  docker compose up -d
  ```
- Access Airflow UI:
  - URL: http://localhost:8080
  - Username: `admin`
  - Password: `admin`
The project includes an automated pipeline to fetch the "Mental Health Dataset" directly from Kaggle.
The docker-compose.yml is configured to:
- Install Dependencies: Automatically installs `kaggle` and `pandas` at runtime using:
  ```yaml
  _PIP_ADDITIONAL_REQUIREMENTS: "kaggle pandas"
  ```
- Volume Mounts: Maps local directories for persistent storage and configuration:
  ```yaml
  volumes:
    - ../data:/opt/airflow/data        # For downloaded datasets
    - ~/.kaggle:/home/airflow/.kaggle  # For Kaggle API credentials
  ```
For manual testing or initial exploration, a Jupyter notebook is provided with a reusable function:
```python
from kaggle.api.kaggle_api_extended import KaggleApi

def download_kaggle_dataset(dataset, download_path='./data'):
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(dataset, path=download_path, unzip=True, force=True)
```

Located in `Apache_airflow/dags/mental_health_etl.py`, this DAG automates the daily ingestion:
- `download_from_kaggle`: Authenticates and downloads the latest ZIP file, extracting it to `/opt/airflow/data`.
- `process_and_verify_data`: Uses `pandas` to read the CSV and log data statistics (row/column counts) to ensure the file is valid.
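As a rough illustration, the verification step could look like this minimal sketch (the file-discovery pattern and error handling are illustrative, not the exact DAG code):

```python
import glob
import logging

import pandas as pd

def process_and_verify_data(data_dir: str = "/opt/airflow/data") -> None:
    # Pick up whichever CSV the Kaggle task extracted (pattern is illustrative).
    csv_path = glob.glob(f"{data_dir}/*.csv")[0]
    df = pd.read_csv(csv_path)
    # Log basic statistics so a bad or empty download fails loudly.
    logging.info("Rows: %d, Columns: %d", len(df), len(df.columns))
    if df.empty:
        raise ValueError(f"Downloaded CSV is empty: {csv_path}")
```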
```
.
├── ingestion script.ipynb # Manual ingestion & testing
├── data/ # Host directory for datasets (mounted to /opt/airflow/data)
└── Apache_airflow/
├── docker-compose.yml # Orchestration & Volume setup
├── dags/ # Airflow DAGs
│ └── mental_health_etl.py # Kaggle ETL Pipeline
├── logs/ # Task execution logs
└── plugins/ # Custom plugins
```
- Permission Denied (Kaggle API):
  - If the task fails with `Missing username in configuration`, verify `~/.kaggle/kaggle.json` is world-readable (`chmod 644`) and the mount in `docker-compose.yml` is correct.
- Permission Denied (Data Folder):
  - If `pandas` or `kaggle` cannot write to `/opt/airflow/data`, run `chmod 777 data/` on the host machine.
- ModuleNotFoundError (kaggle/pandas):
  - These are installed on startup. If missing, restart the containers: `docker compose down && docker compose up -d`.
- DAG not appearing:
  - Airflow can take up to 60 seconds to parse new files. Refresh the UI and ensure no filters (like "Active") are hiding the `kaggle_ingestion_dag`.
```bash
# Check service status
docker compose ps

# Check PostgreSQL
docker compose exec postgres pg_isready -U airflow -d airflow

# Check Airflow webserver
curl http://localhost:8080/health
```

To achieve a fully automated pipeline that leverages high-performance processing, this project uses a hybrid architecture where Apache Airflow runs in Docker while utilizing Spark and Java installed on the host Linux machine.
- Host Machine: Houses the Spark 3.5.0 binaries and OpenJDK 11. This allows for persistent management of Spark configurations and JARs (like the GCS connector).
- Docker Containers: Airflow services (Webserver, Scheduler, Init) run in isolated containers but access the host's Spark and Java through high-performance volume mounts.
To enable the mental_health_full_pipeline DAG, the following dependencies are automatically managed:
- PIP Requirements: `pyspark`, `kaggle`, and `pandas` are injected into the containers at runtime via the `_PIP_ADDITIONAL_REQUIREMENTS` variable in `docker-compose.yml`.
- Spark-GCS Connectivity: The `gcs-connector-hadoop3-latest.jar` is placed in the host's Spark `jars/` directory, making it available to the containerized Airflow workers.
The following mappings were established to bridge the Host and Docker environments:
| Host Path | Container Path | Purpose |
|---|---|---|
| `~/spark-3.5.0-bin-hadoop3` | `/usr/local/spark` | Provides Spark binaries and PySpark libraries. |
| `~/jdk-11.0.2` | `/usr/local/openjdk-11` | Provides Java runtime required by Spark. |
| `../key` | `/opt/airflow/key` | Grants access to GCP Service Account JSON keys. |
| `~/.kaggle` | `/home/airflow/.kaggle` | Grants access to Kaggle API credentials. |
Several key configurations were implemented to ensure smooth execution:
- PATH Resolution: A customized `PATH` was set inside the containers to include `/usr/local/spark/bin` and `/usr/local/openjdk-11/bin` while preserving the default Airflow binary paths. This fixed the "airflow: command not found" error encountered during initial setup.
- PySpark Integration: `PYTHONPATH` was configured to point to `/usr/local/spark/python` and the necessary `py4j` source ZIPs, allowing Airflow's Python interpreter to import `pyspark` correctly.
- GCP Authentication: The `GOOGLE_APPLICATION_CREDENTIALS` variable was added to the Airflow environment, pointing to the mounted key path. This allows GCP Operators (like `GCSToBigQueryOperator`) to authenticate without manual connection setup in the UI.
- Java Runtime: By mounting the host's JDK and setting `JAVA_HOME`, we resolved the `java: command not found` errors that typically occur when running Spark jobs in standard Airflow images.
DAG running Spark and Python tasks to upload data to GCS and BigQuery:

## Airflow Pipeline Details

Apache Airflow orchestrates the complete data ingestion and transfer pipeline. The project includes three main DAGs that handle different aspects of the data pipeline.
Purpose: Basic data ingestion from Kaggle
File: Apache_airflow/dags/mental_health_etl.py
Schedule: Daily (@daily)
Start Date: 2026-04-10
Tasks:
- `download_from_kaggle`: Authenticates to the Kaggle API and downloads the Mental Health Dataset
  - Dataset: `divaniazzahra/mental-health-dataset`
  - Output: Extracted CSV files to `/opt/airflow/data`
- `process_and_verify_data`: Reads the downloaded CSV and verifies data integrity
  - Logs row count, column count, and sample data
  - Ensures the file is valid before downstream processing
Dependencies: download_from_kaggle >> process_and_verify_data
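A minimal skeleton of how these tasks might be wired together (illustrative only; it assumes the `download_kaggle_dataset` helper and the `process_and_verify_data` sketch shown earlier are defined in the same file, and the real DAG may differ in details):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    "kaggle_ingestion_dag",
    schedule_interval="@daily",
    start_date=datetime(2026, 4, 10),
    catchup=False,
) as dag:
    download = PythonOperator(
        task_id="download_from_kaggle",
        python_callable=download_kaggle_dataset,  # notebook helper shown earlier
        op_kwargs={"dataset": "divaniazzahra/mental-health-dataset",
                   "download_path": "/opt/airflow/data"},
    )
    verify = PythonOperator(
        task_id="process_and_verify_data",
        python_callable=process_and_verify_data,  # verification sketch above
    )

    download >> verify
```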
Configuration:
```yaml
# From docker-compose.yml
_PIP_ADDITIONAL_REQUIREMENTS: "kaggle pandas pyspark"
```

Volume mounts:

- `../data:/opt/airflow/data` (for downloaded datasets)
- `~/.kaggle:/home/airflow/.kaggle` (for Kaggle API credentials)

Purpose: Complete pipeline: Kaggle → GCS → BigQuery
File: Apache_airflow/dags/gcp_upload.py
Type: Full orchestration with data processing
Tasks:
- `download_from_kaggle`: Downloads the Mental Health Dataset from Kaggle
- `upload_to_gcs`: Reads the CSV and uploads it to GCS as Parquet using PySpark
  - Uses the GCS connector for Hadoop 3
  - Spark configuration includes GCP service account authentication
  - Output path: `gs://mar_mental_health_bucket/mental_health_data/`
- `load_to_bigquery`: Transfers Parquet data from GCS to BigQuery
  - Destination: `teraform-mar.Mar_Mental_Health_Big_Data_Project.mental_health_data`
  - Creates table with schema matching raw data structure
Data Schema (17 fields):
```
Timestamp, Gender, Country, Occupation, self_employed, family_history,
treatment, Days_Indoors, Growing_Stress, Changes_Habits,
Mental_Health_History, Mood_Swings, Coping_Struggles, Work_Interest,
Social_Weakness, mental_health_interview, care_options
```
Spark Configuration:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MentalHealthDataUpload") \
    .config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5") \
    .getOrCreate()

# GCP Authentication (conf is the session's underlying Hadoop configuration)
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", "/opt/airflow/key/teraform-mar-0fb97fcd6586.json")
```
Purpose: Direct data transfer from GCS to BigQuery

File: `Apache_airflow/dags/gcs_to_bigquery.py`
Tasks:
- `gcs_to_bigquery_transfer`: Uses `GCSToBigQueryOperator` for direct transfer
  - Source: `gs://mar_mental_health_bucket/mental_health_data/*.parquet`
  - Destination: BigQuery table
  - Write disposition: `WRITE_TRUNCATE` (overwrites existing data)
```
Airflow Scheduler (Every Day)
↓
Kaggle Ingestion DAG
├─→ Download from Kaggle
│ └─→ CSV to /opt/airflow/data
└─→ Verify Data Integrity
└─→ Log statistics
↓ (or) ↓
GCP Upload DAG (Full Pipeline)
├─→ Download from Kaggle
│ └─→ CSV extracted to /opt/airflow/data
├─→ Upload to GCS (PySpark)
│ ├─ Configure Spark + GCS Connector
│ ├─ Read CSV with defined schema
│ └─ Write to GCS as Parquet
│ └─→ gs://mar_mental_health_bucket/mental_health_data/
└─→ Transfer to BigQuery
└─→ teraform-mar.Mar_Mental_Health_Big_Data_Project.mental_health_data
↓
GCS to BigQuery DAG (Optional)
└─→ Direct transfer from GCS → BigQuery
```
## dbt Transformation Details

This project uses dbt (data build tool) for transforming raw data in BigQuery into analytical models.
It is recommended to use a Python virtual environment to manage dbt and its adapters.
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dbt Core and the BigQuery adapter
pip install dbt-core dbt-bigquery
```

Verify that dbt and the BigQuery adapter are correctly installed:

```bash
dbt --version
```

You should see `Core: 1.x.x` and `bigquery: 1.x.x` in the output.
For an optimized development experience, install these extensions:
- dbt Power User: Highly recommended for model lineage, real-time query validation, and compiled SQL preview.
- dbt: Basic syntax highlighting and support.
dbt uses a profiles.yml file to store connection details. This file is typically located at ~/.dbt/profiles.yml or C:\Users\<User>\.dbt\profiles.yml.
The project uses a Service Account JSON key for authentication, located at:
DE_PROJECT1/key/teraform-mar-0fb97fcd6586.json
Update your profiles.yml with the following (adjust paths based on your system):
```yaml
mental_health_data:
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: teraform-mar
      dataset: Mar_Mental_Health_Big_Data_Project
      threads: 4
      keyfile: /path/to/teraform-mar-0fb97fcd6586.json
      location: US
      priority: interactive
      job_execution_timeout_seconds: 300
      job_retries: 1
  target: dev
```

Navigate to the dbt project directory (`dbt/mental_health_data`) and run:

```bash
dbt debug
```

A successful connection will show `Connection test: [OK connection ok]`.
The raw data sources are defined in models/staging/sources.yml. To pull data from the raw BigQuery dataset and create your first staging model:
```bash
# Run the staging model
dbt run --select stg_mental_health
```

The dbt project is organized as follows:
```
dbt/mental_health_data/
├── dbt_project.yml # Project configuration
├── models/
│ ├── staging/
│ │ ├── stg_mental_health.sql # Staging model (View)
│ │ └── sources.yml # Source definitions
│ │
│ ├── marts/
│ │ └── fct_mental_health_analysis.sql # Fact table for analytics (Table)
│ │
│ ├── example/ # Example models (optional)
│ │ ├── my_first_dbt_model.sql
│ │ ├── my_second_dbt_model.sql
│ │ └── schema.yml
│ │
│ └── intermediate/ # Placeholder for intermediate transformations
│
├── macros/ # Custom dbt macros
├── tests/ # Data quality tests
├── seeds/ # Static data files
└── logs/ # Execution logs
```
```
BigQuery Raw Data
↓
Staging: stg_mental_health (View)
├─ Source: raw_mental_health.mental_health_data
└─ Operation: Pass-through with SELECT *
↓
Mart: fct_mental_health_analysis (Table)
├─ Dependency: stg_mental_health
├─ Transformations:
│ ├─ Rename columns for clarity
│ │ (Timestamp → created_at, Gender → gender, etc.)
│ └─ Select analytical dimensions & measures
└─ Output: Analytical fact table ready for BI
↓
Power BI Dashboards
```
1. Staging Model (stg_mental_health.sql)
- Type: View
- Materialization: `view` (as configured in `dbt_project.yml`)
- Purpose: Cleans and standardizes raw mental health data from BigQuery
- Source Definition:
  ```yaml
  database: teraform-mar
  schema: Mar_Mental_Health_Big_Data_Project
  table: mental_health_data
  ```
- Operations: Simple pass-through SELECT to standardize raw data
- Output Columns: All 17 columns from raw data
- Timestamp, Gender, Country, Occupation, self_employed, family_history, treatment
- Days_Indoors, Growing_Stress, Changes_Habits, Mental_Health_History
- Mood_Swings, Coping_Struggles, Work_Interest, Social_Weakness
- mental_health_interview, care_options
2. Fact Table Model (fct_mental_health_analysis.sql)
- Type: Table
- Materialization: `table` (as configured in `dbt_project.yml`)
- Purpose: Creates an analytical fact table with curated columns and cleaned naming conventions
- Dependencies: References the `stg_mental_health` staging model via `{{ ref('stg_mental_health') }}`
- Key Transformations:
- Column renaming for clarity and consistency
- Consolidation of related fields
- Ready for visualization
- Output Dimensions:
  - Temporal: `created_at` (from Timestamp)
  - Demographic: `gender`, `country`, `occupation`
  - Employment: `self_employed`
  - Health Indicators: `family_history`, `treatment`, `mental_health_history`
  - Behavioral: `days_indoors`, `growing_stress`, `changes_habits`, `mood_swings`
  - Social: `coping_struggles`, `work_interest`, `social_weakness`
  - Response Data: `mental_health_interview`, `care_options`
- Usage: Connected to Power BI for dashboard creation
```bash
cd dbt/mental_health_data

# Activate virtual environment (if using one)
source .venv/bin/activate

# Run all models
dbt run

# Run only the staging model
dbt run --select stg_mental_health

# Run only the fact table model
dbt run --select fct_mental_health_analysis

# Run with multiple threads for faster execution
dbt run --threads 4

# Compile models without running them
dbt compile

# Generate dbt documentation and lineage
dbt docs generate

# View documentation locally (opens on http://localhost:8000)
dbt docs serve

# Run all data quality tests
dbt test

# Run tests for a specific model
dbt test --select stg_mental_health
```

After running dbt models, the transformed data is available in BigQuery:
Project: teraform-mar
Dataset: Mar_Mental_Health_Big_Data_Project
Objects Created:
- `stg_mental_health` (View) - Staging layer
- `fct_mental_health_analysis` (Table) - Analytical fact table
Access Methods:
- BigQuery Web Console:
  - Navigate to Google Cloud Console
  - BigQuery > Projects > `teraform-mar` > Dataset > `Mar_Mental_Health_Big_Data_Project`
  - View tables/views and execute SQL queries
- Query in BigQuery:
  ```sql
  -- View transformed data
  SELECT *
  FROM `teraform-mar.Mar_Mental_Health_Big_Data_Project.fct_mental_health_analysis`
  LIMIT 100;
  ```
- Connect from Power BI:
  - Create a new data source: BigQuery connector
  - Project: `teraform-mar`
  - Dataset: `Mar_Mental_Health_Big_Data_Project`
  - Select `fct_mental_health_analysis` for dashboard creation
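You can also inspect the created objects programmatically; here is a minimal sketch using the `google-cloud-bigquery` Python client (assumes `pip install google-cloud-bigquery` and that `GOOGLE_APPLICATION_CREDENTIALS` points at the service account key):

```python
from google.cloud import bigquery

client = bigquery.Client(project="teraform-mar")

# List the dbt-created objects in the project dataset.
for table in client.list_tables("Mar_Mental_Health_Big_Data_Project"):
    print(table.table_id, table.table_type)  # e.g. stg_mental_health VIEW
```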
## Running the Complete Pipeline

This section provides a comprehensive guide to running the entire pipeline from start to finish.
```bash
# 1. Navigate to Terraform directory
cd Terraform

# 2. Initialize Terraform
terraform init

# 3. Review planned changes
terraform plan

# 4. Deploy infrastructure
terraform apply

# 5. Verify in GCP Console
# - Check GCS bucket created
# - Check BigQuery dataset created
# - Check Compute Instance created
```

```bash
# 1. SSH into the GCP Virtual Machine
ssh -i ~/.ssh/KEY_FILENAME USERNAME@EXTERNAL_IP
# 2. Install Java (OpenJDK 11)
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
export JAVA_HOME="${HOME}/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
# 3. Install Apache Spark
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar xzfv spark-3.5.0-bin-hadoop3.tgz
export SPARK_HOME="${HOME}/spark-3.5.0-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
# 4. Install GCS Connector
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar
mv gcs-connector-hadoop3-latest.jar ${SPARK_HOME}/jars/
# 5. Install Docker and Docker Compose
sudo apt update
sudo apt install -y docker.io docker-compose-plugin
sudo usermod -aG docker $USER
# 6. Setup Kaggle credentials
mkdir -p ~/.kaggle
# Place your kaggle.json in ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

```bash
# 1. Navigate to Airflow directory
cd Apache_airflow
# 2. Set permissions for shared directories
chmod 777 ../data
# 3. Start Airflow services
docker compose up -d
# 4. Wait for services to be ready (~30 seconds)
sleep 30
# 5. Access Airflow UI
# Open http://localhost:8080
# Username: admin | Password: admin
# 6. Enable and trigger DAGs
# In Airflow UI:
# - Enable "kaggle_ingestion_dag" or "gcp_upload_dag"
# - Monitor task execution in real-time
# - Check logs for any errors
```

```bash
# Check if data has been downloaded
ls -la ../data/
# Verify data in GCS
gsutil ls -r gs://mar_mental_health_bucket/
# Verify data in BigQuery
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) as row_count FROM `teraform-mar.Mar_Mental_Health_Big_Data_Project.mental_health_data`'
```

```bash
# 1. Navigate to dbt project
cd ../dbt/mental_health_data
# 2. Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# 3. Install dbt and BigQuery adapter
pip install dbt-core dbt-bigquery
# 4. Configure dbt profiles
# Edit ~/.dbt/profiles.yml with your project credentials
# 5. Test dbt connection
dbt debug
# 6. Run dbt transformations
dbt run
dbt test
# 7. Generate documentation
dbt docs generate
dbt docs serve
```

1. Open Power BI Desktop
2. Get Data > Google BigQuery
3. Project: teraform-mar
4. Dataset: Mar_Mental_Health_Big_Data_Project
5. Table: fct_mental_health_analysis
6. Create visualizations and dashboards
To fully automate the pipeline, you can orchestrate dbt runs within Airflow:
Create DAG: `Apache_airflow/dags/dbt_transformation_dag.py`

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2026, 4, 15),
}

with DAG(
    'daily_dbt_transformation',
    default_args=default_args,
    description='Run dbt transformations daily after data ingestion',
    schedule_interval='0 3 * * *',  # 3 AM UTC every day (after 2 AM ingestion)
    catchup=False,
) as dag:

    dbt_run = BashOperator(
        task_id='run_dbt_models',
        bash_command="""
        cd /path/to/dbt/mental_health_data && \
        source .venv/bin/activate && \
        dbt run --profiles-dir ~/.dbt
        """,
    )

    dbt_test = BashOperator(
        task_id='test_dbt_models',
        bash_command="""
        cd /path/to/dbt/mental_health_data && \
        source .venv/bin/activate && \
        dbt test --profiles-dir ~/.dbt
        """,
    )

    dbt_run >> dbt_test
```

Deploy this DAG to Airflow:
```bash
cp dbt_transformation_dag.py Apache_airflow/dags/
docker compose restart airflow-scheduler
```

```bash
# Check service status
docker compose ps

# View logs for a specific service
docker compose logs airflow-webserver
docker compose logs airflow-scheduler

# Execute a command in a running container
docker compose exec airflow-webserver airflow dags test kaggle_ingestion_dag

# Check PostgreSQL connection
docker compose exec postgres pg_isready -U airflow -d airflow
```

| Issue | Solution |
|---|---|
| DAG not appearing in UI | Restart scheduler: `docker compose restart airflow-scheduler` |
| `ModuleNotFoundError: kaggle` | Verify `_PIP_ADDITIONAL_REQUIREMENTS` in `docker-compose.yml` |
| Permission denied on data folder | Run `chmod 777 ../data` on host machine |
| GCP authentication failed | Verify service account JSON is in `/opt/airflow/key/` and readable |
```bash
cd dbt/mental_health_data

# Check dbt debug info
dbt debug

# View compiled SQL
dbt compile
cat target/compiled/mental_health_data/models/staging/stg_mental_health.sql

# Run with debug logging
dbt run --debug

# View execution logs
tail -f logs/dbt.log
```

```bash
# Check table row counts
bq query --use_legacy_sql=false \
  'SELECT table_id, row_count FROM `teraform-mar.Mar_Mental_Health_Big_Data_Project.__TABLES__` ORDER BY row_count DESC'

# View table schema
bq show --schema teraform-mar:Mar_Mental_Health_Big_Data_Project.fct_mental_health_analysis

# Sample data
bq query --use_legacy_sql=false \
  'SELECT * FROM `teraform-mar.Mar_Mental_Health_Big_Data_Project.fct_mental_health_analysis` LIMIT 5'
```

View the transformed data flow:

## Visualization in Power BI

The `fct_mental_health_analysis` table in BigQuery is designed to be directly connected to Power BI dashboards.
Connection Steps:
- Open Power BI Desktop
- Get Data → BigQuery (enter project ID: `teraform-mar`)
- Navigate to dataset `Mar_Mental_Health_Big_Data_Project`
- Select table `fct_mental_health_analysis`
- Create visualizations based on dimensions and measures
## Next Steps

With the complete pipeline implemented, you can:
- Monitor Data Quality: Add dbt tests and Great Expectations for data validation
- Enhance Transformations: Add more intermediate models and complex transformations
- Scale Analytics: Create additional fact and dimension tables for deeper insights
- Automate Fully: Schedule all DAGs and dbt runs for production-level automation
- Add Documentation: Document business logic and data lineage for stakeholders
- Optimize Performance: Fine-tune Spark jobs and BigQuery queries for cost-efficiency
```
DE_PROJECT1/
├── Terraform/ # Infrastructure as Code
│ ├── main.tf # Cloud resources definition
│ ├── variable.tf # Variable definitions
│ └── key/ # GCP service account keys
│
├── Apache_airflow/ # Workflow Orchestration
│ ├── docker-compose.yml # Airflow containers setup
│ ├── dags/ # DAG definitions
│ │ ├── mental_health_etl.py # Basic ingestion DAG
│ │ ├── gcp_upload.py # Full pipeline DAG (Kaggle→GCS→BigQuery)
│ │ └── gcs_to_bigquery.py # Direct transfer DAG
│ ├── config/ # Airflow configuration
│ ├── logs/ # Task execution logs
│ └── plugins/ # Custom plugins
│
├── dbt/ # Data Transformation
│ └── mental_health_data/
│ ├── dbt_project.yml # dbt project config
│ ├── models/
│ │ ├── staging/
│ │ │ ├── stg_mental_health.sql # Staging view
│ │ │ └── sources.yml # Data sources
│ │ └── marts/
│ │ └── fct_mental_health_analysis.sql # Fact table
│ ├── macros/ # Custom macros
│ ├── tests/ # Data quality tests
│ └── seeds/ # Static data
│
├── data/ # Local data storage
├── images/ # Documentation images
│ ├── airflow_orchestration.png
│ ├── dbt_lineage.png
│ └── powerbi_report.png
│
├── key/ # GCP authentication keys
├── README.md # This file
└── ingestion script.ipynb # Manual ingestion notebook
```
For detailed implementation instructions and troubleshooting, refer to the relevant sections above.




