# Data Forge Connections

Ready-to-run connection code for all stack services.

Run Environment Setup first, then any service cell you need.

## ⚙️ Environment Setup

Loads connection URLs and credentials from the Docker container's environment variables, falling back to the stack defaults when a variable is unset.

🛑 Run this cell before any service connection cell.

In [1]:
import os

# Database connections
POSTGRES_URL = f"postgresql://{os.getenv('POSTGRES_USER', 'admin')}:{os.getenv('POSTGRES_PASSWORD', 'admin')}@postgres:5432/{os.getenv('POSTGRES_DB', 'metastore')}"
CLICKHOUSE_URL = f"clickhouse://{os.getenv('CLICKHOUSE_USER', 'admin')}:{os.getenv('CLICKHOUSE_PASSWORD', 'admin')}@clickhouse:8123/{os.getenv('CLICKHOUSE_DB', 'analytics')}"

# Object storage
MINIO_ENDPOINT = "http://minio:9000"
MINIO_ACCESS_KEY = os.getenv('MINIO_ROOT_USER', 'minio')
MINIO_SECRET_KEY = os.getenv('MINIO_ROOT_PASSWORD', 'minio123')

# Streaming
KAFKA_SERVERS = os.getenv('KAFKA_BOOTSTRAP_SERVERS', 'kafka:9092')
SCHEMA_REGISTRY_URL = os.getenv('SCHEMA_REGISTRY_URL', 'http://schema-registry:8081')

# Services
TRINO_URL = "http://trino:8080"
SPARK_MASTER = os.getenv('SPARK_MASTER_URL', 'spark://spark-master:7077')

print("Environment configured")

Environment configured
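
The URLs above interpolate credentials as-is, which is fine for the default `admin`/`admin` values. A password containing `@`, `:` or `/` would need percent-encoding before it goes into a URL. A minimal sketch, assuming a hypothetical helper `build_db_url` and a made-up password (neither is part of this notebook):

```python
from urllib.parse import quote_plus

def build_db_url(scheme, user, password, host, port, database):
    """Hypothetical helper: assemble a database URL with percent-encoded credentials."""
    return f"{scheme}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

# Illustrative password only; real values would come from os.getenv as above
safe_url = build_db_url("postgresql", "admin", "p@ss:word/1", "postgres", 5432, "metastore")
print(safe_url)
```

The encoded URL can be passed to `create_engine` in place of `POSTGRES_URL`.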


## 🧩 Spark

Distributed processing engine for batch and stream workloads.

In [2]:
from pyspark.sql import SparkSession

print(f"Connecting to Spark: {SPARK_MASTER}")

spark = SparkSession.builder \
    .appName("DataForge") \
    .master(SPARK_MASTER) \
    .config("spark.executor.memory", "512m") \
    .config("spark.driver.memory", "512m") \
    .config("spark.executor.cores", "1") \
    .config("spark.cores.max", "2") \
    .getOrCreate()

print(f"Spark ready: {spark.version}")

# Test query
test_data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(test_data, ["name", "age"])
df.show()

print("Spark session active")

Connecting to Spark: spark://spark-master:7077
Spark ready: 3.5.0
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

Spark session active


## 🧩 PostgreSQL

Primary OLTP database for source data.

In [3]:
import pandas as pd
from sqlalchemy import create_engine

pg_engine = create_engine(POSTGRES_URL)
df = pd.read_sql("SELECT current_timestamp as now", pg_engine)
print("PostgreSQL result:")
print(df)

print("PostgreSQL ready")

PostgreSQL result:
                               now
0 2025-09-05 09:34:02.091861+00:00
PostgreSQL ready


## 🧩 ClickHouse

Columnar analytics database for fast queries.

In [4]:
import clickhouse_connect

ch_client = clickhouse_connect.get_client(
    host='clickhouse',
    port=8123,
    username=os.getenv('CLICKHOUSE_USER', 'admin'),
    password=os.getenv('CLICKHOUSE_PASSWORD', 'admin'),
    database=os.getenv('CLICKHOUSE_DB', 'analytics')
)

result = ch_client.query("SELECT 'Hello ClickHouse' as message, now() as timestamp")
df_ch = pd.DataFrame(result.result_rows, columns=result.column_names)
print("ClickHouse result:")
print(df_ch)

print("ClickHouse ready")

ClickHouse result:
            message           timestamp
0  Hello ClickHouse 2025-09-05 09:34:04
ClickHouse ready


## 🧩 MinIO

S3-compatible object storage for data lakes.

In [5]:
import boto3

s3_client = boto3.client(
    's3',
    endpoint_url=MINIO_ENDPOINT,
    aws_access_key_id=MINIO_ACCESS_KEY,
    aws_secret_access_key=MINIO_SECRET_KEY
)

buckets = s3_client.list_buckets()
print(f"Available buckets: {[b['Name'] for b in buckets['Buckets']]}")

print("MinIO ready")

Available buckets: ['test-bucket']
MinIO ready


## 🧩 Trino

Federated SQL query engine across data sources.

In [6]:
from trino.dbapi import connect as trino_connect

trino_conn = trino_connect(
    host='trino',
    port=8080,
    user='admin',
    catalog='system',
    schema='runtime'
)

def query_trino(sql):
    cursor = trino_conn.cursor()
    cursor.execute(sql)
    columns = [desc[0] for desc in cursor.description]
    data = cursor.fetchall()
    return pd.DataFrame(data, columns=columns)

catalogs_df = query_trino("SHOW CATALOGS")
print("Available catalogs:")
print(catalogs_df)

print("Trino ready")

Available catalogs:
      Catalog
0  clickhouse
1     iceberg
2       kafka
3    postgres
4       redis
5      system
Trino ready


## 🧩 Redis

In-memory cache and message broker.

In [7]:
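import redis
from io import StringIO

r = redis.Redis(host='redis', port=6379, decode_responses=True)

# Test cache
r.set('test:notebook', 'Data Forge connection test')
message = r.get('test:notebook')
print(f"Cached message: {message}")

def cache_dataframe(key, df, expire_seconds=3600):
    # Store the DataFrame as a JSON array of records with a TTL (the index is not preserved)
    json_data = df.to_json(orient='records')
    r.setex(key, expire_seconds, json_data)
    print(f"DataFrame cached: {key}")

def get_cached_dataframe(key):
    # Returns None on a cache miss or after the key has expired
    json_data = r.get(key)
    if json_data:
        # Wrap in StringIO: pandas 2.1+ deprecates passing a literal JSON string
        return pd.read_json(StringIO(json_data), orient='records')
    return None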

print("Redis ready")

Cached message: Data Forge connection test
Redis ready


## 🧩 Kafka

Event streaming platform for real-time data pipelines.

In [8]:
from kafka import KafkaProducer, KafkaConsumer
import json
from datetime import datetime

producer = KafkaProducer(
    bootstrap_servers=[KAFKA_SERVERS],
    value_serializer=lambda x: json.dumps(x).encode('utf-8')
)

def send_message(topic, message):
    data = {
        'timestamp': datetime.now().isoformat(),
        'message': message
    }
    future = producer.send(topic, data)
    record = future.get(timeout=10)
    print(f"Sent to {topic}: partition {record.partition}, offset {record.offset}")
    return record

def create_consumer(topic, group_id='notebook-consumer'):
    return KafkaConsumer(
        topic,
        bootstrap_servers=[KAFKA_SERVERS],
        group_id=group_id,
        value_deserializer=lambda m: json.loads(m.decode('utf-8')),
        auto_offset_reset='latest'
    )

print("Kafka ready")

Kafka ready
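
To see what actually travels over the topic, the serializer and deserializer from the cell above can be exercised without a broker. This is only a sketch: the names `serialize`, `deserialize`, `envelope`, `wire_bytes` and `received` are introduced here for illustration.

```python
import json
from datetime import datetime

# Same lambdas as the producer's value_serializer and the consumer's value_deserializer
serialize = lambda x: json.dumps(x).encode('utf-8')
deserialize = lambda m: json.loads(m.decode('utf-8'))

# send_message wraps the caller's payload in an envelope of this shape
envelope = {
    'timestamp': datetime.now().isoformat(),
    'message': {'key': 'value'},
}

wire_bytes = serialize(envelope)    # bytes the producer would put on the topic
received = deserialize(wire_bytes)  # dict a consumer would expose as msg.value

print(type(wire_bytes).__name__, received['message'])
```

Because the payload must survive `json.dumps`, values such as `datetime` objects or NumPy types need converting to plain Python types before they are passed to `send_message`.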


## 🚀 Status Check

Verify all service connections are working.

In [9]:
def check_all_connections():
    status = {}
    
    # PostgreSQL
    try:
        pd.read_sql("SELECT 1", pg_engine)
        status['PostgreSQL'] = '✅'
    except Exception as e:
        status['PostgreSQL'] = '❌'
        print(f"PostgreSQL error: {str(e)[:50]}...")
    
    # ClickHouse
    try:
        ch_client.query("SELECT 1")
        status['ClickHouse'] = '✅'
    except Exception as e:
        status['ClickHouse'] = '❌'
        print(f"ClickHouse error: {str(e)[:50]}...")
    
    # MinIO
    try:
        s3_client.list_buckets()
        status['MinIO'] = '✅'
    except Exception as e:
        status['MinIO'] = '❌'
        print(f"MinIO error: {str(e)[:50]}...")
    
    # Redis
    try:
        r.ping()
        status['Redis'] = '✅'
    except Exception as e:
        status['Redis'] = '❌'
        print(f"Redis error: {str(e)[:50]}...")
    
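    # Kafka
    try:
        # bootstrap_connected() returns a bool instead of raising, so check it explicitly
        if producer.bootstrap_connected():
            status['Kafka'] = '✅'
        else:
            status['Kafka'] = '❌ (not connected)'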
    except Exception as e:
        status['Kafka'] = '❌'
        print(f"Kafka error: {str(e)[:50]}...")
    
    # Trino
    try:
        query_trino("SELECT 1 as test")
        status['Trino'] = '✅'
    except Exception as e:
        status['Trino'] = '❌'
        print(f"Trino error: {str(e)[:50]}...")
    
    # Spark
    try:
        if 'spark' in globals() and spark is not None:
            spark.sql("SELECT 1").collect()
            status['Spark'] = '✅'
        else:
            status['Spark'] = '❌ (not initialized)'
    except Exception as e:
        status['Spark'] = '❌'
        print(f"Spark error: {str(e)[:50]}...")
    
    print("Connection Status:")
    for service, stat in status.items():
        print(f"  {stat} {service}")
    
    successful = sum(1 for s in status.values() if '✅' in s)
    total = len(status)
    print(f"\nResult: {successful}/{total} services connected")
    
    return status

check_all_connections()

Connection Status:
  ✅ PostgreSQL
  ✅ ClickHouse
  ✅ MinIO
  ✅ Redis
  ✅ Kafka
  ✅ Trino
  ✅ Spark

Result: 7/7 services connected


{'PostgreSQL': '✅',
 'ClickHouse': '✅',
 'MinIO': '✅',
 'Redis': '✅',
 'Kafka': '✅',
 'Trino': '✅',
 'Spark': '✅'}
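
The seven try/except blocks above all follow the same shape, so the check could also be driven by a table of probes. A minimal sketch, assuming a hypothetical `run_checks` helper and stand-in probes so that it runs without the stack:

```python
def run_checks(probes):
    """Run each zero-argument probe. A probe passes if it returns without
    raising and does not return False."""
    status = {}
    for name, probe in probes.items():
        try:
            ok = probe() is not False
            status[name] = '✅' if ok else '❌'
        except Exception as e:
            status[name] = '❌'
            print(f"{name} error: {str(e)[:50]}...")
    return status

def failing_probe():
    # Stand-in for a service that refuses the connection
    raise ConnectionError("connection refused")

# Stand-in probes; none of these talk to a real service
demo_status = run_checks({
    'Healthy': lambda: 1,
    'Disconnected': lambda: False,
    'Unreachable': failing_probe,
})
print(demo_status)
```

In the notebook the probes would be callables built from the clients defined earlier, for example `r.ping` or `producer.bootstrap_connected`.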

## 📏 Usage Patterns

Common data pipeline operations:

**PostgreSQL → Spark**
```python
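# The PostgreSQL JDBC driver does not accept user:password@ in the URL,
# so build the JDBC URL from host and database and pass credentials as options.
# Assumes the PostgreSQL JDBC driver jar is available on the Spark classpath.
jdbc_url = f"jdbc:postgresql://postgres:5432/{os.getenv('POSTGRES_DB', 'metastore')}"

df_spark = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "your_table") \
    .option("user", os.getenv('POSTGRES_USER', 'admin')) \
    .option("password", os.getenv('POSTGRES_PASSWORD', 'admin')) \
    .load()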
```

**Spark → ClickHouse**
```python
# toPandas() collects every row onto the driver, so use it only for small results
pandas_df = spark_df.toPandas()
ch_client.insert_df('your_table', pandas_df)
```

**Kafka messaging**
```python
send_message('your-topic', {'key': 'value'})
# auto_offset_reset='latest': a new consumer group only sees messages sent after it joins
consumer = create_consumer('your-topic')
```

**DataFrame caching**
```python
cache_dataframe('my_data', df)
cached_df = get_cached_dataframe('my_data')
```

**Trino federated queries**
```python
result = query_trino('''
    SELECT pg.*, ch.analytics_column
    FROM postgres.public.users pg
    JOIN clickhouse.analytics.events ch ON pg.id = ch.user_id
''')
```