Museum Kiosk Data Pipeline

A real-time ETL pipeline that processes visitor kiosk interactions from Liverpool Museum of Natural History (LMNH), validates the data, and stores it in a PostgreSQL database with accompanying Tableau dashboards for stakeholder analysis.

Project Overview

This project implements an end-to-end data pipeline that:

  • Consumes live kiosk interaction data from a Kafka stream
  • Validates and cleans data based on museum business rules
  • Stores processed data in an AWS RDS PostgreSQL database
  • Provides interactive Tableau dashboards for different stakeholder needs

Architecture

Kafka Stream → ETL Pipeline (EC2) → PostgreSQL (RDS) → Tableau Dashboard
     ↓              ↓                    ↓               ↓
Live Kiosks    Data Validation    Clean Storage    Stakeholder Insights

Files Structure

pipeline_live/
├── .env                   # Environment variables (not in repo)
├── consumer.py            # Standalone Kafka consumer for testing
├── pipeline.py            # Main ETL pipeline script
├── load_master_data.py    # One-time exhibition data loader
├── schema.sql             # Database schema definition
├── reset_database.sh      # Script to reset transactional data
├── data/                  # Exhibition JSON files from S3
└── .terraform/            # Terraform infrastructure files

Python Requirements

All required Python packages are listed in requirements.txt and can be installed with:

pip3 install -r requirements.txt

Infrastructure Requirements

  • AWS Account with S3, RDS and EC2 access
  • Terraform for infrastructure management
  • PostgreSQL database
  • Kafka cluster access (Confluent Cloud)
  • Tableau Online account

Environment Configuration

Create a .env file in the project root with the following variables:

# Kafka Configuration
BOOTSTRAP_SERVERS=your-kafka-cluster-endpoint
SECURITY_PROTOCOL=SASL_SSL
SASL_MECHANISM=PLAIN
USERNAME=your-kafka-username
PASSWORD=your-kafka-password

# Database Configuration
DATABASE_NAME=museum
DATABASE_USERNAME=postgres
DATABASE_PASSWORD=your-rds-password
DATABASE_IP=your-rds-endpoint.rds.amazonaws.com
DATABASE_PORT=5432
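
The scripts read these values at startup; below is a minimal sketch of how that might look with python-dotenv (the variable names come from the .env above, everything else is illustrative):

# Illustrative only: reading the .env values with python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()                                 # pulls the key=value pairs above into os.environ
bootstrap_servers = os.environ["BOOTSTRAP_SERVERS"]
database_name = os.environ["DATABASE_NAME"]   # and so on for the remaining variables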

Installation & Setup

1. Infrastructure Setup

# Deploy AWS infrastructure
terraform init
terraform plan
terraform apply

2. Database Setup

# Create database schema
psql -h your-rds-endpoint -U postgres -d postgres -f schema.sql

# Load master data (exhibitions)
python3 load_master_data.py
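
load_master_data.py is a one-off loader for the exhibition JSON files in data/. The real field and column names live in the script and in schema.sql; the sketch below only illustrates the shape of the job, and the EXHIBITION_NAME/DEPARTMENT keys and exhibitions columns are assumptions:

# Sketch of a one-off exhibition loader; JSON keys and column names are assumptions
import json
import os
from pathlib import Path

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)

with conn, conn.cursor() as cur:              # commits on success, rolls back on error
    for path in sorted(Path("data").glob("*.json")):
        exhibition = json.loads(path.read_text())
        cur.execute(
            "INSERT INTO exhibitions (exhibition_name, department) VALUES (%s, %s)",
            (exhibition["EXHIBITION_NAME"], exhibition["DEPARTMENT"]),
        )

conn.close()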

3. Local Development

# Test Kafka connection
python3 consumer.py

# Run pipeline locally
python3 pipeline.py
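
As a rough guide to what pipeline.py does end to end, here is a stripped-down sketch of its main loop. The topic name "lmnh", the group id, and the is_valid / insert_interaction helpers are assumptions rather than the project's actual names; the two helpers are sketched further down under Data Validation Rules and Database Schema.

# Illustrative main loop only; topic name, group id and helper bodies are assumptions
import json
import os

import psycopg2
from confluent_kafka import Consumer
from dotenv import load_dotenv


def is_valid(message: dict) -> bool:                     # placeholder; sketched under Data Validation Rules
    return True


def insert_interaction(cursor, message: dict) -> None:   # placeholder; sketched under Database Schema
    ...


load_dotenv()

consumer = Consumer({
    "bootstrap.servers": os.environ["BOOTSTRAP_SERVERS"],
    "security.protocol": os.environ["SECURITY_PROTOCOL"],
    "sasl.mechanisms": os.environ["SASL_MECHANISM"],
    "sasl.username": os.environ["USERNAME"],
    "sasl.password": os.environ["PASSWORD"],
    "group.id": "lmnh-pipeline",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["lmnh"])

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        message = json.loads(msg.value().decode("utf-8"))
        if not is_valid(message):
            print(f"Rejected: {message}")                # invalid rows are logged, never stored
            continue
        with conn, conn.cursor() as cur:                 # commit per valid message
            insert_interaction(cur, message)
finally:
    consumer.close()
    conn.close()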

4. Production Deployment

# SSH into EC2 instance
ssh -i your-key.pem ec2-user@ec2-ip-address

# Install dependencies
sudo yum update -y
sudo yum install python3 python3-pip git postgresql15-devel gcc python3-devel -y
pip3 install confluent-kafka psycopg2-binary python-dotenv

# Transfer files to EC2
scp -i your-key.pem pipeline.py .env ec2-user@ec2-ip-address:~

# Run pipeline in background
nohup python3 pipeline.py > pipeline.log 2>&1 &

How to Run

Development/Testing

# Start the ETL pipeline locally
python3 pipeline.py

Production

# On EC2 instance, run in background
nohup python3 pipeline.py > pipeline.log 2>&1 &

# Monitor logs
tail -f pipeline.log

# Check process status
ps aux | grep pipeline

Data Validation Rules

The pipeline validates incoming kiosk interactions against these business rules:

  • Operating Hours: Only interactions between 8:45 AM and 6:15 PM are accepted
  • Rating Values: Must be between -1 and 4 (where -1 = assistance request)
  • Button Types: 0 (assistance) or 1 (emergency)
  • Valid Exhibitions: Only exhibitions 0-5 (cross-referenced with master data)
  • Data Integrity: Required fields (timestamp, site, value) must be present

Invalid data is logged and rejected, ensuring only clean data reaches the database.
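
As an illustration only, a validator enforcing these rules might look like the sketch below. The message field names ("at", "site", "val", "type") are assumptions about the kiosk payload, not the project's confirmed format:

# Sketch of the business rules above; payload field names are assumptions
from datetime import datetime, time

OPENING, CLOSING = time(8, 45), time(18, 15)    # kiosk operating hours
VALID_EXHIBITIONS = set(range(6))               # exhibitions 0-5 from master data


def is_valid(message: dict) -> bool:
    try:
        # Data integrity: timestamp, site and value must all be present
        at = datetime.fromisoformat(message["at"])
        site = int(message["site"])
        value = int(message["val"])
    except (KeyError, TypeError, ValueError):
        return False

    # Operating hours: 8:45 AM - 6:15 PM only
    if not OPENING <= at.time() <= CLOSING:
        return False

    # Valid exhibitions: 0-5 only
    if site not in VALID_EXHIBITIONS:
        return False

    # Rating values: -1 to 4, where -1 flags an assistance/emergency request
    if not -1 <= value <= 4:
        return False

    # Button types: a request must carry type 0 (assistance) or 1 (emergency)
    if value == -1 and message.get("type") not in (0, 1):
        return False

    return True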

Database Schema

Core Tables

  • exhibitions - Master data for museum exhibitions
  • kiosk_interactions - Visitor interaction records
  • rating_types - Lookup table for rating descriptions
  • button_types - Lookup table for assistance/emergency types

Key Features

  • Referential integrity with foreign keys
  • Check constraints for data validation
  • Separation of ratings vs assistance requests (NULL handling)
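
To make the NULL handling concrete: a rating interaction fills the rating column and leaves the button column NULL, while an assistance/emergency interaction does the reverse. The table and column names below are assumptions; schema.sql holds the real definitions:

# Sketch of loading one validated message; table and column names are assumptions
def insert_interaction(cursor, message: dict) -> None:
    value = int(message["val"])
    is_request = value == -1                    # -1 marks an assistance/emergency press

    cursor.execute(
        """
        INSERT INTO kiosk_interactions (exhibition_id, event_at, rating_id, button_id)
        VALUES (%s, %s, %s, %s)
        """,
        (
            int(message["site"]),
            message["at"],
            None if is_request else value,                  # rating is NULL for requests
            message.get("type") if is_request else None,    # button is NULL for ratings
        ),
    )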

Tableau Dashboards

Three specialized dashboards serve different stakeholder needs:

  1. Overview Dashboard - General museum metrics and KPIs
  2. Exhibition Rating Data - Visitor satisfaction analysis for Angela (Exhibition Manager)
  3. Exhibition Security Data - Assistance/emergency patterns for Rita (Security Manager)

Interactive Features

  • Real-time data updates from live pipeline
  • Interactive filters for date ranges and exhibitions
  • Drill-down capabilities by time and location

Monitoring & Management

Reset Database (Preserve Master Data)

./reset_database.sh
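
The script's contents aren't shown here; assuming it simply truncates the kiosk_interactions table and leaves exhibitions and the lookup tables alone, a rough Python equivalent of that reset would be:

# Rough Python equivalent of the assumed reset: clear interactions, keep master data
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)
with conn, conn.cursor() as cur:
    cur.execute("TRUNCATE kiosk_interactions RESTART IDENTITY;")
conn.close()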

Pipeline Health Checks

# Check if pipeline is running
ps aux | grep pipeline

# View recent logs
tail -n 100 pipeline.log

# Monitor database row count
psql -h rds-endpoint -U postgres -d museum -c "SELECT COUNT(*) FROM kiosk_interactions;"

Project Context

This pipeline was developed as part of a data engineering coursework project, demonstrating real-world ETL pipeline implementation with modern cloud technologies. It showcases integration of Kafka streaming, PostgreSQL databases, AWS infrastructure, and business intelligence tools.
