Museum Kiosk Data Pipeline

A real-time ETL pipeline that processes visitor kiosk interactions from Liverpool Museum of Natural History (LMNH), validates the data, and stores it in a PostgreSQL database with accompanying Tableau dashboards for stakeholder analysis.

Project Overview

This project implements an end-to-end data pipeline that:

  • Consumes live kiosk interaction data from a Kafka stream
  • Validates and cleans data based on museum business rules
  • Stores processed data in an AWS RDS PostgreSQL database
  • Provides interactive Tableau dashboards for different stakeholder needs

Architecture

Kafka Stream → ETL Pipeline (EC2) → PostgreSQL (RDS) → Tableau Dashboard
     ↓              ↓                    ↓               ↓
Live Kiosks    Data Validation    Clean Storage    Stakeholder Insights

Files Structure

pipeline_live/
├── .env                   # Environment variables (not in repo)
├── consumer.py            # Standalone Kafka consumer for testing
├── pipeline.py            # Main ETL pipeline script
├── load_master_data.py    # One-time exhibition data loader
├── schema.sql             # Database schema definition
├── reset_database.sh      # Script to reset transactional data
├── data/                  # Exhibition JSON files from S3
└── .terraform/            # Terraform infrastructure files

Python Requirements

All required Python packages are listed in requirements.txt and can be installed with:

pip3 install -r requirements.txt

Infrastructure Requirements

  • AWS Account with S3, RDS and EC2 access
  • Terraform for infrastructure management
  • PostgreSQL database
  • Kafka cluster access (Confluent Cloud)
  • Tableau Online account

Environment Configuration

Create a .env file in the project root with the following variables:

# Kafka Configuration
BOOTSTRAP_SERVERS=your-kafka-cluster-endpoint
SECURITY_PROTOCOL=SASL_SSL
SASL_MECHANISM=PLAIN
USERNAME=your-kafka-username
PASSWORD=your-kafka-password

# Database Configuration
DATABASE_NAME=museum
DATABASE_USERNAME=postgres
DATABASE_PASSWORD=your-rds-password
DATABASE_IP=your-rds-endpoint.rds.amazonaws.com
DATABASE_PORT=5432
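
The scripts read these values at startup; below is a minimal sketch of how that might look with python-dotenv (the variable names come from the .env above, everything else is illustrative):

# Illustrative only: reading the .env values with python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()                                 # pulls the key=value pairs above into os.environ
bootstrap_servers = os.environ["BOOTSTRAP_SERVERS"]
database_name = os.environ["DATABASE_NAME"]   # and so on for the remaining variables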

Installation & Setup

1. Infrastructure Setup

# Deploy AWS infrastructure
terraform init
terraform plan
terraform apply

2. Database Setup

# Create database schema
psql -h your-rds-endpoint -U postgres -d postgres -f schema.sql

# Load master data (exhibitions)
python3 load_master_data.py
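
load_master_data.py is a one-off loader for the exhibition JSON files in data/. The real field and column names live in the script and in schema.sql; the sketch below only illustrates the shape of the job, and the EXHIBITION_NAME/DEPARTMENT keys and exhibitions columns are assumptions:

# Sketch of a one-off exhibition loader; JSON keys and column names are assumptions
import json
import os
from pathlib import Path

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)

with conn, conn.cursor() as cur:              # commits on success, rolls back on error
    for path in sorted(Path("data").glob("*.json")):
        exhibition = json.loads(path.read_text())
        cur.execute(
            "INSERT INTO exhibitions (exhibition_name, department) VALUES (%s, %s)",
            (exhibition["EXHIBITION_NAME"], exhibition["DEPARTMENT"]),
        )

conn.close()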

3. Local Development

# Test Kafka connection
python3 consumer.py

# Run pipeline locally
python3 pipeline.py
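
As a rough guide to what pipeline.py does end to end, here is a stripped-down sketch of its main loop. The topic name "lmnh", the group id, and the is_valid / insert_interaction helpers are assumptions rather than the project's actual names; the two helpers are sketched further down under Data Validation Rules and Database Schema.

# Illustrative main loop only; topic name, group id and helper bodies are assumptions
import json
import os

import psycopg2
from confluent_kafka import Consumer
from dotenv import load_dotenv


def is_valid(message: dict) -> bool:                     # placeholder; sketched under Data Validation Rules
    return True


def insert_interaction(cursor, message: dict) -> None:   # placeholder; sketched under Database Schema
    ...


load_dotenv()

consumer = Consumer({
    "bootstrap.servers": os.environ["BOOTSTRAP_SERVERS"],
    "security.protocol": os.environ["SECURITY_PROTOCOL"],
    "sasl.mechanisms": os.environ["SASL_MECHANISM"],
    "sasl.username": os.environ["USERNAME"],
    "sasl.password": os.environ["PASSWORD"],
    "group.id": "lmnh-pipeline",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["lmnh"])

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        message = json.loads(msg.value().decode("utf-8"))
        if not is_valid(message):
            print(f"Rejected: {message}")                # invalid rows are logged, never stored
            continue
        with conn, conn.cursor() as cur:                 # commit per valid message
            insert_interaction(cur, message)
finally:
    consumer.close()
    conn.close()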

4. Production Deployment

# SSH into EC2 instance
ssh -i your-key.pem ec2-user@ec2-ip-address

# Install dependencies
sudo yum update -y
sudo yum install python3 python3-pip git postgresql15-devel gcc python3-devel -y
pip3 install confluent-kafka psycopg2-binary python-dotenv

# Transfer files to EC2
scp -i your-key.pem pipeline.py .env ec2-user@ec2-ip-address:~

# Run pipeline in background
nohup python3 pipeline.py > pipeline.log 2>&1 &

How to Run

Development/Testing

# Start the ETL pipeline locally
python3 pipeline.py

Production

# On EC2 instance, run in background
nohup python3 pipeline.py > pipeline.log 2>&1 &

# Monitor logs
tail -f pipeline.log

# Check process status
ps aux | grep pipeline

Data Validation Rules

The pipeline validates incoming kiosk interactions against these business rules:

  • Operating Hours: Only interactions between 8:45 AM and 6:15 PM are accepted
  • Rating Values: Must be between -1 and 4 (where -1 = assistance request)
  • Button Types: 0 (assistance) or 1 (emergency)
  • Valid Exhibitions: Only exhibitions 0-5 (cross-referenced with master data)
  • Data Integrity: Required fields (timestamp, site, value) must be present

Invalid data is logged and rejected, ensuring only clean data reaches the database.
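
As an illustration only, a validator enforcing these rules might look like the sketch below. The message field names ("at", "site", "val", "type") are assumptions about the kiosk payload, not the project's confirmed format:

# Sketch of the business rules above; payload field names are assumptions
from datetime import datetime, time

OPENING, CLOSING = time(8, 45), time(18, 15)    # kiosk operating hours
VALID_EXHIBITIONS = set(range(6))               # exhibitions 0-5 from master data


def is_valid(message: dict) -> bool:
    try:
        # Data integrity: timestamp, site and value must all be present
        at = datetime.fromisoformat(message["at"])
        site = int(message["site"])
        value = int(message["val"])
    except (KeyError, TypeError, ValueError):
        return False

    # Operating hours: 8:45 AM - 6:15 PM only
    if not OPENING <= at.time() <= CLOSING:
        return False

    # Valid exhibitions: 0-5 only
    if site not in VALID_EXHIBITIONS:
        return False

    # Rating values: -1 to 4, where -1 flags an assistance/emergency request
    if not -1 <= value <= 4:
        return False

    # Button types: a request must carry type 0 (assistance) or 1 (emergency)
    if value == -1 and message.get("type") not in (0, 1):
        return False

    return True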

Database Schema

Core Tables

  • exhibitions - Master data for museum exhibitions
  • kiosk_interactions - Visitor interaction records
  • rating_types - Lookup table for rating descriptions
  • button_types - Lookup table for assistance/emergency types

Key Features

  • Referential integrity with foreign keys
  • Check constraints for data validation
  • Separation of ratings vs assistance requests (NULL handling)
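
To make the NULL handling concrete: a rating interaction fills the rating column and leaves the button column NULL, while an assistance/emergency interaction does the reverse. The table and column names below are assumptions; schema.sql holds the real definitions:

# Sketch of loading one validated message; table and column names are assumptions
def insert_interaction(cursor, message: dict) -> None:
    value = int(message["val"])
    is_request = value == -1                    # -1 marks an assistance/emergency press

    cursor.execute(
        """
        INSERT INTO kiosk_interactions (exhibition_id, event_at, rating_id, button_id)
        VALUES (%s, %s, %s, %s)
        """,
        (
            int(message["site"]),
            message["at"],
            None if is_request else value,                  # rating is NULL for requests
            message.get("type") if is_request else None,    # button is NULL for ratings
        ),
    )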

Tableau Dashboards

Three specialized dashboards serve different stakeholder needs:

  1. Overview Dashboard - General museum metrics and KPIs
  2. Exhibition Rating Data - Visitor satisfaction analysis for Angela (Exhibition Manager)
  3. Exhibition Security Data - Assistance/emergency patterns for Rita (Security Manager)

Interactive Features

  • Real-time data updates from live pipeline
  • Interactive filters for date ranges and exhibitions
  • Drill-down capabilities by time and location

Monitoring & Management

Reset Database (Preserve Master Data)

./reset_database.sh
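
The script's contents aren't shown here; assuming it simply truncates the kiosk_interactions table and leaves exhibitions and the lookup tables alone, a rough Python equivalent of that reset would be:

# Rough Python equivalent of the assumed reset: clear interactions, keep master data
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DATABASE_NAME"],
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    host=os.environ["DATABASE_IP"],
    port=os.environ["DATABASE_PORT"],
)
with conn, conn.cursor() as cur:
    cur.execute("TRUNCATE kiosk_interactions RESTART IDENTITY;")
conn.close()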

Pipeline Health Checks

# Check if pipeline is running
ps aux | grep pipeline

# View recent logs
tail -n 100 pipeline.log

# Monitor database row count
psql -h rds-endpoint -U postgres -d museum -c "SELECT COUNT(*) FROM kiosk_interactions;"

Project Context

This pipeline was developed as part of a data engineering coursework project, demonstrating real-world ETL pipeline implementation with modern cloud technologies. It showcases integration of Kafka streaming, PostgreSQL databases, AWS infrastructure, and business intelligence tools.
