# Data Forge Service Connections Validation

This notebook validates connections to all services in the Data Forge platform.

## Services Covered:
- 📊 **PostgreSQL** - Primary database
- 🚀 **ClickHouse** - Analytics database 
- 🗄️ **MinIO** - Object storage
- 📨 **Kafka** - Message streaming
- ⚡ **Trino** - SQL query engine
- 🔥 **Spark** - Distributed computing
- 📈 **Superset** - Business intelligence
- 🌪️ **Airflow** - Workflow orchestration
- ⚡ **Redis** - Caching layer

## Environment Setup & Imports

In [1]:
import os
import pandas as pd
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import connection libraries
import psycopg2
import clickhouse_connect
import boto3
from kafka import KafkaProducer, KafkaConsumer
import redis
import requests
from trino.dbapi import connect as trino_connect
from pyspark.sql import SparkSession

print("✅ All libraries imported successfully!")
print(f"📅 Validation started at: {datetime.now()}")

✅ All libraries imported successfully!
📅 Validation started at: 2025-09-04 09:25:58.663656


## Connection Configuration

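Credentials, database names, and service URLs are read from environment variables passed by Docker Compose, with fallback defaults. Database hostnames and ports are hardcoded to the Compose service names.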

In [2]:
# Database configurations
POSTGRES_CONFIG = {
    'host': 'postgres',
    'port': 5432,
    'database': os.getenv('POSTGRES_DB', 'metastore'),
    'user': os.getenv('POSTGRES_USER', 'admin'),
    'password': os.getenv('POSTGRES_PASSWORD', 'admin')
}

CLICKHOUSE_CONFIG = {
    'host': 'clickhouse',
    'port': 8123,
    'database': os.getenv('CLICKHOUSE_DB', 'analytics'),
    'username': os.getenv('CLICKHOUSE_USER', 'admin'),
    'password': os.getenv('CLICKHOUSE_PASSWORD', 'admin')
}

# Object Storage configuration
MINIO_CONFIG = {
    'endpoint_url': 'http://minio:9000',
    'aws_access_key_id': os.getenv('MINIO_ROOT_USER', 'minio'),
    'aws_secret_access_key': os.getenv('MINIO_ROOT_PASSWORD', 'minio123')
}

# Streaming configuration
KAFKA_CONFIG = {
    'bootstrap_servers': [os.getenv('KAFKA_BOOTSTRAP_SERVERS', 'kafka:9092')],
    'schema_registry_url': os.getenv('SCHEMA_REGISTRY_URL', 'http://schema-registry:8081')
}

# Service URLs
SERVICE_URLS = {
    'trino': os.getenv('TRINO_URL', 'http://trino:8080'),
    'superset': os.getenv('SUPERSET_URL', 'http://superset:8088'),
    'airflow': os.getenv('AIRFLOW_URL', 'http://airflow-apiserver:8080'),
    'spark': os.getenv('SPARK_MASTER_URL', 'spark://spark-master:7077')
}

print("✅ Configuration loaded from environment variables")
print(f"🗄️ PostgreSQL: {POSTGRES_CONFIG['host']}:{POSTGRES_CONFIG['port']}")
print(f"📊 ClickHouse: {CLICKHOUSE_CONFIG['host']}:{CLICKHOUSE_CONFIG['port']}")
print(f"☁️ MinIO: {MINIO_CONFIG['endpoint_url']}")
print(f"📨 Kafka: {KAFKA_CONFIG['bootstrap_servers']}")

✅ Configuration loaded from environment variables
🗄️ PostgreSQL: postgres:5432
📊 ClickHouse: clickhouse:8123
☁️ MinIO: http://minio:9000
📨 Kafka: ['kafka:9092']
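
The hostnames and ports in the configuration cell are fixed, while credentials come from the environment. A minimal sketch of making the host and port overridable as well follows. `POSTGRES_HOST` and `POSTGRES_PORT` are hypothetical variable names chosen for illustration; nothing in this notebook confirms that the platform defines them.

```python
import os

def build_postgres_config(env=os.environ):
    """Build psycopg2 connection kwargs from a mapping of environment variables.

    POSTGRES_HOST and POSTGRES_PORT are hypothetical names (an assumption);
    the defaults match the values hardcoded in POSTGRES_CONFIG.
    """
    return {
        'host': env.get('POSTGRES_HOST', 'postgres'),
        'port': int(env.get('POSTGRES_PORT', '5432')),
        'database': env.get('POSTGRES_DB', 'metastore'),
        'user': env.get('POSTGRES_USER', 'admin'),
        'password': env.get('POSTGRES_PASSWORD', 'admin'),
    }

# Passing an explicit mapping keeps the example independent of the real environment.
config = build_postgres_config({'POSTGRES_HOST': 'localhost', 'POSTGRES_PORT': '15432'})
print(config['host'], config['port'])
```

Taking the environment as a parameter makes the function easy to exercise outside the Compose network, for example when running the notebook from a host machine with forwarded ports.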


## 1. PostgreSQL Connection Test

Testing connection to the primary PostgreSQL database used for metadata and Airflow.

In [3]:
def test_postgresql():
    try:
        # Connect to PostgreSQL
        conn = psycopg2.connect(**POSTGRES_CONFIG)
        cursor = conn.cursor()
        
        # Test query
        cursor.execute("SELECT version();")
        version = cursor.fetchone()[0]
        
        # Get database list
        cursor.execute("SELECT datname FROM pg_database WHERE datistemplate = false;")
        databases = [row[0] for row in cursor.fetchall()]
        
        cursor.close()
        conn.close()
        
        print("✅ PostgreSQL Connection: SUCCESS")
        print(f"📊 Version: {version.split(',')[0]}")
        print(f"🗄️ Available Databases: {', '.join(databases)}")
        return True
        
    except Exception as e:
        print(f"❌ PostgreSQL Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

postgres_status = test_postgresql()

✅ PostgreSQL Connection: SUCCESS
📊 Version: PostgreSQL 16.10 (Debian 16.10-1.pgdg13+1) on x86_64-pc-linux-gnu
🗄️ Available Databases: postgres, metastore
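
In `test_postgresql`, the cursor and connection are closed only on the success path, so an exception raised by a query leaves them open until garbage collection. One possible way to guarantee cleanup is `contextlib.closing`, sketched below. The helper name `fetch_server_version` is hypothetical, and `connect` stands for any DB-API style factory such as `psycopg2.connect`.

```python
from contextlib import closing

def fetch_server_version(connect, **config):
    """Run SELECT version() and close the cursor and connection even on error.

    `connect` is any DB-API style connection factory (for example psycopg2.connect).
    """
    with closing(connect(**config)) as conn, closing(conn.cursor()) as cursor:
        cursor.execute("SELECT version();")
        return cursor.fetchone()[0]
```

Inside the notebook this would be called as `fetch_server_version(psycopg2.connect, **POSTGRES_CONFIG)`. `closing` is used here because a psycopg2 connection used directly in a `with` statement ends the transaction but does not close the connection.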


## 2. ClickHouse Connection Test

Testing connection to the ClickHouse analytics database.

In [4]:
def test_clickhouse():
    try:
        # Connect to ClickHouse
        client = clickhouse_connect.get_client(**CLICKHOUSE_CONFIG)
        
        # Test query
        result = client.query("SELECT version()")
        version = result.result_rows[0][0]
        
        # Get databases
        databases = client.query("SHOW DATABASES")
        db_list = [row[0] for row in databases.result_rows]
        
        # Test table creation and data insertion
        client.command("""
            CREATE TABLE IF NOT EXISTS test_table (
                id UInt32,
                name String,
                timestamp DateTime
            ) ENGINE = Memory
        """)
        
        client.insert('test_table', [
            [1, 'Test Record', datetime.now()]
        ])
        
        count = client.query("SELECT count() FROM test_table").result_rows[0][0]
        
        client.close()
        
        print("✅ ClickHouse Connection: SUCCESS")
        print(f"📊 Version: {version}")
        print(f"🗄️ Available Databases: {', '.join(db_list)}")
        print(f"📝 Test Records: {count}")
        return True
        
    except Exception as e:
        print(f"❌ ClickHouse Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

clickhouse_status = test_clickhouse()

✅ ClickHouse Connection: SUCCESS
📊 Version: 25.8.1.5101
🗄️ Available Databases: INFORMATION_SCHEMA, analytics, default, information_schema, system
📝 Test Records: 1
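
`test_clickhouse` creates `test_table` with `IF NOT EXISTS` and never drops it, so repeated runs against the same server may insert additional rows and report a count above 1. A minimal sketch of a scratch-table context manager that drops the table afterwards is shown here. The name `scratch_table` is hypothetical; the only client method it relies on is `command`, which the cell above already uses.

```python
from contextlib import contextmanager

@contextmanager
def scratch_table(client, name, columns_ddl):
    """Create a Memory-engine table and drop it when the block exits."""
    client.command(
        f"CREATE TABLE IF NOT EXISTS {name} ({columns_ddl}) ENGINE = Memory"
    )
    try:
        yield name
    finally:
        # Runs on success and on error, so no test data is left behind.
        client.command(f"DROP TABLE IF EXISTS {name}")
```

With this helper, the insert and count queries would go inside `with scratch_table(client, 'test_table', 'id UInt32, name String, timestamp DateTime'):`.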


## 3. MinIO Object Storage Test

Testing connection to MinIO S3-compatible object storage.

In [5]:
def test_minio():
    try:
        # Create S3 client
        s3_client = boto3.client(
            's3',
            endpoint_url=MINIO_CONFIG['endpoint_url'],
            aws_access_key_id=MINIO_CONFIG['aws_access_key_id'],
            aws_secret_access_key=MINIO_CONFIG['aws_secret_access_key']
        )
        
        # List buckets
        response = s3_client.list_buckets()
        buckets = [bucket['Name'] for bucket in response['Buckets']]
        
        # Create test bucket if it doesn't exist
        test_bucket = 'test-bucket'
        if test_bucket not in buckets:
            s3_client.create_bucket(Bucket=test_bucket)
        
        # Upload test file
        test_data = "Hello from Data Forge! 🚀"
        s3_client.put_object(
            Bucket=test_bucket,
            Key='test-file.txt',
            Body=test_data.encode('utf-8')
        )
        
        # List objects in test bucket
        objects = s3_client.list_objects_v2(Bucket=test_bucket)
        object_count = objects.get('KeyCount', 0)
        
        print("✅ MinIO Connection: SUCCESS")
        print(f"🪣 Available Buckets: {', '.join(buckets) if buckets else 'None'}")
        print(f"📁 Objects in test-bucket: {object_count}")
        print(f"📄 Test file uploaded successfully")
        return True
        
    except Exception as e:
        print(f"❌ MinIO Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

minio_status = test_minio()

✅ MinIO Connection: SUCCESS
🪣 Available Buckets: None
📁 Objects in test-bucket: 1
📄 Test file uploaded successfully
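
The output reports `Available Buckets: None` alongside one object in `test-bucket` because the bucket list is captured before the test bucket is created. A small sketch that returns the bucket names as they stand after creation could look like this. The helper name `ensure_bucket` is hypothetical; `list_buckets` and `create_bucket` are the same boto3 S3 client calls used in the cell above.

```python
def ensure_bucket(s3_client, name):
    """Create `name` if it is missing and return the bucket names afterwards."""
    existing = [b['Name'] for b in s3_client.list_buckets()['Buckets']]
    if name not in existing:
        s3_client.create_bucket(Bucket=name)
        # Refresh the listing so the caller sees the bucket that was just created.
        existing = [b['Name'] for b in s3_client.list_buckets()['Buckets']]
    return existing
```

Printing the value returned by `ensure_bucket(s3_client, 'test-bucket')` would then include `test-bucket` on a fresh MinIO instance.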


## 4. Kafka Messaging Test

Testing Kafka producer functionality and Schema Registry availability.

In [6]:
def test_kafka():
    try:
        # Test Kafka Producer
        producer = KafkaProducer(
            bootstrap_servers=KAFKA_CONFIG['bootstrap_servers'],
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
        
        # Send test message
        test_topic = 'test-topic'
        test_message = {
            'message': 'Hello from Data Forge!',
            'timestamp': datetime.now().isoformat(),
            'source': 'jupyter-validation'
        }
        
        future = producer.send(test_topic, test_message)
        record_metadata = future.get(timeout=10)
        producer.close()
        
        # Test Schema Registry
        schema_response = requests.get(f"{KAFKA_CONFIG['schema_registry_url']}/subjects", timeout=10)
        
        print("✅ Kafka Connection: SUCCESS")
        print(f"📨 Message sent to topic: {test_topic}")
        print(f"📍 Partition: {record_metadata.partition}, Offset: {record_metadata.offset}")
        print(f"📋 Schema Registry Status: {schema_response.status_code}")
        return True
        
    except Exception as e:
        print(f"❌ Kafka Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

kafka_status = test_kafka()

✅ Kafka Connection: SUCCESS
📨 Message sent to topic: test-topic
📍 Partition: 0, Offset: 0
📋 Schema Registry Status: 200
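
`KafkaConsumer` is imported, but the test above only produces a message and never reads it back. A minimal sketch of a read-back step is given below. The helper `first_message` is hypothetical and accepts any iterable of records with a `.value` attribute holding bytes, so it can be tried without a broker.

```python
import json

def first_message(records, decode=lambda raw: json.loads(raw.decode('utf-8'))):
    """Return the decoded value of the first record, or None if there are none.

    `records` is any iterable of objects exposing `.value` as bytes, such as a
    kafka-python KafkaConsumer configured to stop iterating after a timeout.
    """
    for record in records:
        return decode(record.value)
    return None
```

Against a live broker, the iterable could be `KafkaConsumer('test-topic', bootstrap_servers=KAFKA_CONFIG['bootstrap_servers'], auto_offset_reset='earliest', consumer_timeout_ms=5000)`. The `consumer_timeout_ms` setting matters here: without it, iterating over the consumer blocks indefinitely when the topic is empty.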


## 5. Redis Cache Test

Testing Redis connection and basic operations.

In [7]:
def test_redis():
    try:
        # Connect to Redis
        r = redis.Redis(host='redis', port=6379, decode_responses=True)
        
        # Test connection
        ping_response = r.ping()
        
        # Test basic operations
        r.set('test:key', 'Hello from Data Forge!')
        value = r.get('test:key')
        
        # Get Redis info
        info = r.info()
        redis_version = info['redis_version']
        used_memory = info['used_memory_human']
        
        print("✅ Redis Connection: SUCCESS")
        print(f"🏓 Ping Response: {ping_response}")
        print(f"📊 Version: {redis_version}")
        print(f"💾 Used Memory: {used_memory}")
        print(f"📝 Test Value: {value}")
        return True
        
    except Exception as e:
        print(f"❌ Redis Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

redis_status = test_redis()

✅ Redis Connection: SUCCESS
🏓 Ping Response: True
📊 Version: 7.2.10
💾 Used Memory: 982.02K
📝 Test Value: Hello from Data Forge!


## 6. Trino SQL Engine Test

Testing Trino distributed SQL query engine.

In [8]:
def test_trino():
    try:
        # Connect to Trino
        conn = trino_connect(
            host='trino',
            port=8080,
            user='admin',
            catalog='system',
            schema='runtime'
        )
        
        cursor = conn.cursor()
        
        # Test basic query
        cursor.execute("SELECT node_version FROM system.runtime.nodes")
        nodes = cursor.fetchall()
        
        # Get catalogs
        cursor.execute("SHOW CATALOGS")
        catalogs = [row[0] for row in cursor.fetchall()]
        
        # Test sample query
        cursor.execute("SELECT current_timestamp")
        current_time = cursor.fetchone()[0]
        
        cursor.close()
        conn.close()
        
        print("✅ Trino Connection: SUCCESS")
        print(f"🖥️ Active Nodes: {len(nodes)}")
        print(f"📚 Available Catalogs: {', '.join(catalogs)}")
        print(f"🕐 Current Time: {current_time}")
        return True
        
    except Exception as e:
        print(f"❌ Trino Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

trino_status = test_trino()

✅ Trino Connection: SUCCESS
🖥️ Active Nodes: 1
📚 Available Catalogs: clickhouse, iceberg, kafka, postgres, redis, system
🕐 Current Time: 2025-09-04 09:26:23.038000+00:00


## 7. Spark Distributed Computing Test

Testing Apache Spark connection and basic operations.

In [9]:
def test_spark():
    try:
        # Create Spark Session
        spark = SparkSession.builder \
            .appName("DataForgeValidation") \
            .master(SERVICE_URLS['spark']) \
            .getOrCreate()
        
        # Test basic operations
        data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
        columns = ["id", "name", "age"]
        
        df = spark.createDataFrame(data, columns)
        
        # Perform transformations
        result = df.filter(df.age > 25).select("name", "age").collect()
        
        # Get Spark context info
        sc = spark.sparkContext
        app_id = sc.applicationId
        default_parallelism = sc.defaultParallelism
        
        spark.stop()
        
        print("✅ Spark Connection: SUCCESS")
        print(f"🆔 Application ID: {app_id}")
        print(f"⚡ Default Parallelism: {default_parallelism}")
        print(f"📊 Test Result Count: {len(result)}")
        print(f"👥 Filtered Users: {[row.name for row in result]}")
        return True
        
    except Exception as e:
        print(f"❌ Spark Connection: FAILED")
        print(f"🚨 Error: {str(e)}")
        return False

spark_status = test_spark()

✅ Spark Connection: SUCCESS
🆔 Application ID: app-20250904092628-0006
⚡ Default Parallelism: 2
📊 Test Result Count: 2
👥 Filtered Users: ['Bob', 'Charlie']


## 8. Service Health Checks

Testing HTTP endpoints for web services.

In [10]:
def test_http_services():
    services_status = {}
    
    # Test each service
    for service_name, url in SERVICE_URLS.items():
        if service_name == 'spark':  # Skip: SPARK_MASTER_URL is a spark:// address, not an HTTP endpoint
            continue

        # Build the health-check URL before the try block so the except
        # branch always reports the URL for the current service
        if service_name == 'trino':
            test_url = f"{url}/v1/info"
        elif service_name in ('airflow', 'superset'):
            test_url = f"{url}/health"
        else:
            test_url = url

        try:
            response = requests.get(test_url, timeout=10)
            services_status[service_name] = {
                'status': 'SUCCESS' if response.status_code == 200 else f'HTTP {response.status_code}',
                'url': test_url
            }

        except Exception as e:
            services_status[service_name] = {
                'status': 'FAILED',
                'error': str(e),
                'url': test_url
            }
    
    print("🌐 HTTP Services Health Check:")
    for service, status in services_status.items():
        emoji = "✅" if status['status'] == 'SUCCESS' else "❌"
        print(f"{emoji} {service.title()}: {status['status']}")
        if 'error' in status:
            print(f"   🚨 Error: {status['error']}")
    
    return services_status

http_services_status = test_http_services()

🌐 HTTP Services Health Check:
✅ Trino: SUCCESS
✅ Superset: SUCCESS
❌ Airflow: FAILED
   🚨 Error: HTTPConnectionPool(host='airflow-apiserver', port=8080): Max retries exceeded with url: /health (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7248567facd0>: Failed to resolve 'airflow-apiserver' ([Errno -2] Name or service not known)"))
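
The Airflow failure is a name-resolution error rather than an HTTP error: the hostname `airflow-apiserver` could not be resolved from the notebook container. That often means the service is not running, is attached to a different Docker network, or has a different name in the Compose file. A small sketch for checking resolution separately from the HTTP request follows; the helper names `hostname_of` and `can_resolve` are hypothetical.

```python
import socket
from urllib.parse import urlparse

def hostname_of(url):
    """Extract the hostname from a service URL such as http://trino:8080."""
    return urlparse(url).hostname

def can_resolve(hostname):
    """Return True if the local resolver can map `hostname` to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

print(hostname_of('http://airflow-apiserver:8080'))
print(can_resolve('127.0.0.1'))
```

Running `can_resolve(hostname_of(url))` for each entry in `SERVICE_URLS` separates DNS problems from services that resolve but answer with an error status.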


## 📊 Connection Summary Report

Overview of all service connection tests.

In [11]:
# Create summary report
services_summary = {
    'PostgreSQL': postgres_status,
    'ClickHouse': clickhouse_status,
    'MinIO': minio_status,
    'Kafka': kafka_status,
    'Redis': redis_status,
    'Trino': trino_status,
    'Spark': spark_status
}

# Add HTTP services
for service, status in http_services_status.items():
    services_summary[f"{service.title()} (HTTP)"] = status['status'] == 'SUCCESS'

print("=" * 60)
print("🎯 DATA FORGE SERVICES CONNECTION SUMMARY")
print("=" * 60)

successful_connections = 0
total_connections = len(services_summary)

for service, is_successful in services_summary.items():
    status_emoji = "✅" if is_successful else "❌"
    status_text = "CONNECTED" if is_successful else "FAILED"
    print(f"{status_emoji} {service:<25} {status_text}")
    if is_successful:
        successful_connections += 1

print("=" * 60)
success_rate = (successful_connections / total_connections) * 100
print(f"📈 Success Rate: {successful_connections}/{total_connections} ({success_rate:.1f}%)")
print(f"🕐 Validation completed at: {datetime.now()}")

if success_rate == 100:
    print("🎉 All services are connected and operational!")
elif success_rate >= 80:
    print("🟡 Most services are operational. Check failed connections.")
else:
    print("🔴 Multiple service connection issues detected. Please check service status.")

🎯 DATA FORGE SERVICES CONNECTION SUMMARY
✅ PostgreSQL                CONNECTED
✅ ClickHouse                CONNECTED
✅ MinIO                     CONNECTED
✅ Kafka                     CONNECTED
✅ Redis                     CONNECTED
✅ Trino                     CONNECTED
✅ Spark                     CONNECTED
✅ Trino (HTTP)              CONNECTED
✅ Superset (HTTP)           CONNECTED
❌ Airflow (HTTP)            FAILED
📈 Success Rate: 9/10 (90.0%)
🕐 Validation completed at: 2025-09-04 09:26:50.591864
🟡 Most services are operational. Check failed connections.
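
The summary arithmetic is simple: 9 successful checks out of 10 gives 90.0%, which falls in the "most services operational" band (at least 80% but not all). A sketch of the same classification as a reusable function is shown below. The name `summarize` and the verdict labels are hypothetical, and the all-pass case compares counts instead of testing a float for equality with 100.

```python
def summarize(results):
    """Return (successes, total, rate_percent, verdict) for a name -> bool mapping."""
    total = len(results)
    successes = sum(1 for ok in results.values() if ok)
    rate = successes * 100 / total if total else 0.0
    if total > 0 and successes == total:
        verdict = 'all'
    elif rate >= 80:
        verdict = 'most'
    else:
        verdict = 'multiple-failures'
    return successes, total, rate, verdict

# Nine passing checks and one failure, mirroring the run recorded above.
results = {f'service-{i}': True for i in range(9)}
results['Airflow (HTTP)'] = False
print(summarize(results))
```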


## 🔧 Troubleshooting Guide

If any connections failed, here are common solutions:

### PostgreSQL Issues
- Ensure PostgreSQL container is running: `docker compose ps postgres`
- Check logs: `docker compose logs postgres`

### ClickHouse Issues
- Verify ClickHouse is healthy: `docker compose ps clickhouse`
- Check logs: `docker compose logs clickhouse`

### MinIO Issues
- Check MinIO service: `docker compose ps minio`
- Verify credentials in `.env` file

### Kafka Issues
- Ensure Kafka is running: `docker compose ps kafka`
- Check broker logs: `docker compose logs kafka`

### Spark Issues
- Check Spark master: `docker compose ps spark-master`
- Verify workers: `docker compose ps | grep spark-worker`

### General Issues
- Restart services: `docker compose --profile core --profile explore restart`
- Check all services: `docker compose ps`
- View logs: `docker compose logs [service-name]`