# Challenge 5: PostgreSQL Integration for Security Analytics

## Task Description
In this challenge, we need to:
1. Write streaming results to PostgreSQL for persistence
2. Configure proper checkpointing for fault tolerance
3. Implement different output modes based on use cases
4. Ensure exactly-once semantics for critical data

## Prerequisites
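Complete Challenges 1-4 first. This challenge relies on the streaming DataFrames they produce (`alerts_df` and `df_with_watermark`), including the anomaly detection results.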

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

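# Create a Spark session.
# The JDBC sink below needs the PostgreSQL driver on the classpath, so it is added
# to spark.jars.packages. The driver version is an assumption; match it to your environment.
spark = SparkSession.builder \
    .appName("VPN Security Stream Processing") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,org.postgresql:postgresql:42.6.0") \
    .getOrCreate()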

# Set log level
spark.sparkContext.setLogLevel("WARN")

# Assume alerts_df (Challenge 4) and df_with_watermark are already defined by the earlier challenges

## Setting up the PostgreSQL Sink

In [None]:
# Define connection properties
jdbc_url = "jdbc:postgresql://postgres:5432/datamart"
connection_properties = {
    "user": "spark",
    "password": "spark",
    "driver": "org.postgresql.Driver"
}

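# Function that writes one micro-batch to PostgreSQL
def write_to_postgres(df, epoch_id):
    """Append one micro-batch to the security_alerts table.

    foreachBatch calls this once per micro-batch. After a failure, Spark may
    replay a batch with the same epoch_id, so a plain append gives
    at-least-once delivery rather than exactly-once.
    """
    df.write \
        .jdbc(
            url=jdbc_url,
            table="security_alerts",
            mode="append",
            properties=connection_properties
        )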

## Configuring Checkpointing

In [None]:
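# Checkpoint directory for fault tolerance. Spark stores source offsets and
# aggregation state here and uses them to resume after a restart.
# /tmp is typically cleared on reboot, so use it for local testing only.
checkpoint_path = "/tmp/spark-checkpoints/security-analytics"

# Alternative: Use a durable location such as HDFS or S3
# checkpoint_path = "hdfs:///checkpoints/security-analytics"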
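Each streaming query needs its own checkpoint directory, because the offsets and state stored there belong to exactly one query. The following is a minimal sketch of a helper that derives per-query paths from a base path and refuses to hand out the same one twice. `checkpoint_for` and its `used_names` registry are hypothetical names for illustration, not part of Spark.

```python
import posixpath

def checkpoint_for(base_path, query_name, used_names):
    """Return a dedicated checkpoint directory for one streaming query.

    used_names is a set owned by the caller; it records which query names
    already have a checkpoint directory so that two queries never share one.
    """
    if not query_name or "/" in query_name:
        raise ValueError(f"invalid query name: {query_name!r}")
    if query_name in used_names:
        raise ValueError(f"checkpoint for {query_name!r} is already in use")
    used_names.add(query_name)
    # posixpath keeps forward slashes, which also suits hdfs:// or s3a:// URIs
    return posixpath.join(base_path, query_name)
```

It could be used as `checkpoint_for(checkpoint_path, "alerts", used)` in place of string concatenation when setting `checkpointLocation`.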

## Writing Security Alerts to PostgreSQL

In [None]:
# Write security alerts to PostgreSQL
postgres_query = alerts_df \
    .writeStream \
    .foreachBatch(write_to_postgres) \
    .option("checkpointLocation", checkpoint_path + "/alerts") \
    .start()

## Writing Dashboard Metrics

In [None]:
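# Aggregate per-minute connection metrics for the dashboard
# Hint: Use update mode for dashboards that need real-time updates

dashboard_metrics = df_with_watermark \
    .groupBy(
        window(col("event_time"), "1 minute"),
        col("country")
    ) \
    .agg(
        count("*").alias("connection_count"),
        sum(when(col("is_successful") == True, 1).otherwise(0)).alias("successful_connections"),
        sum(when(col("is_successful") == False, 1).otherwise(0)).alias("failed_connections")
    ) \
    .select(
        # The JDBC writer cannot map the window struct to a column type, so flatten it.
        # window_start / window_end are assumed column names in connection_metrics.
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        "country", "connection_count", "successful_connections", "failed_connections"
    )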

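# Write metrics for dashboard. Update mode emits a window's row every time its
# counts change. With mode="append" each emission adds a new row, so the
# dashboard should read the latest row per window and country.
dashboard_query = dashboard_metrics \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, epoch_id: df.write \
        .jdbc(
            url=jdbc_url,
            table="connection_metrics",
            mode="append",
            properties=connection_properties
        )) \
    .option("checkpointLocation", checkpoint_path + "/metrics") \
    .start()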
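In update mode the same window can be emitted many times, so a dashboard table is usually easier to query if each emission overwrites the previous row instead of appending a new one. Below is a rough sketch of one way to do that with PostgreSQL's `INSERT ... ON CONFLICT`. It assumes the window struct has been flattened into hypothetical `window_start` and `window_end` columns, that `connection_metrics` has a unique constraint on `(window_start, country)`, and that the `psycopg2` package is installed; none of these come from the challenge itself. `build_upsert_sql` and `upsert_metrics` are illustrative names.

```python
def build_upsert_sql(table, columns, key_columns):
    """Build an upsert statement for psycopg2's execute_values.

    Table and column names are trusted constants here, not user input.
    """
    updates = [c for c in columns if c not in key_columns]
    col_list = ", ".join(columns)
    key_list = ", ".join(key_columns)
    set_list = ", ".join(f"{c} = EXCLUDED.{c}" for c in updates)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES %s "
        f"ON CONFLICT ({key_list}) DO UPDATE SET {set_list}"
    )

def upsert_metrics(df, epoch_id):
    """foreachBatch handler: upsert one micro-batch of dashboard metrics."""
    # Imported lazily so the module loads even where psycopg2 is absent
    import psycopg2
    from psycopg2.extras import execute_values

    columns = df.columns
    # Assumption: a micro-batch of per-minute aggregates fits on the driver
    rows = [tuple(row) for row in df.collect()]
    if not rows:
        return
    sql = build_upsert_sql("connection_metrics", columns, ["window_start", "country"])
    # Connection details mirror jdbc_url and connection_properties above
    conn = psycopg2.connect(host="postgres", port=5432, dbname="datamart",
                            user="spark", password="spark")
    try:
        # The connection context manager commits on success, rolls back on error
        with conn, conn.cursor() as cur:
            execute_values(cur, sql, rows)
    finally:
        conn.close()
```

Passing `upsert_metrics` to `.foreachBatch(...)` instead of the JDBC lambda would keep one row per window and country. Because the upsert is idempotent, a replayed batch rewrites the same values rather than duplicating them.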

## Schema Evolution Considerations

In [None]:
# TODO: Handle schema evolution
# Consider how to handle schema changes without breaking the pipeline
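One conservative way to approach this is to allow only additive changes: when a batch carries a column the table does not have yet, add it as a nullable column before writing, and never drop or retype existing columns. The sketch below only plans the `ALTER TABLE` statements; running them is left to the caller. `plan_new_columns`, the type mapping, and the `risk_score` column in the usage note are assumptions for illustration, and the mapping covers just a handful of Spark types.

```python
# Partial mapping from Spark type names (as reported by df.dtypes) to PostgreSQL types
SPARK_TO_POSTGRES = {
    "string": "TEXT",
    "int": "INTEGER",
    "bigint": "BIGINT",
    "double": "DOUBLE PRECISION",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}

def plan_new_columns(table, existing_columns, incoming_dtypes):
    """Return ALTER TABLE statements for columns missing from the table.

    existing_columns: set of column names currently in the PostgreSQL table.
    incoming_dtypes:  list of (name, spark_type) pairs, e.g. df.dtypes.
    """
    statements = []
    for name, spark_type in incoming_dtypes:
        if name in existing_columns:
            continue
        pg_type = SPARK_TO_POSTGRES.get(spark_type)
        if pg_type is None:
            # Fail loudly rather than guess a type for an unmapped column
            raise ValueError(f"no PostgreSQL mapping for {name!r} ({spark_type})")
        statements.append(
            f'ALTER TABLE {table} ADD COLUMN IF NOT EXISTS "{name}" {pg_type}'
        )
    return statements
```

For example, if `security_alerts` currently has only `user_id` and a batch arrives with an extra `risk_score` double column, `plan_new_columns("security_alerts", {"user_id"}, df.dtypes)` would return a single statement adding `risk_score` as `DOUBLE PRECISION`. Old rows then hold `NULL` for the new column.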

## Exactly-Once Semantics

In [None]:
# TODO: Implement transaction management for exactly-once semantics
# Hint: You might need to use transactions with PostgreSQL
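Spark's checkpoint guarantees that a replayed micro-batch arrives with the same `epoch_id`, so one way to get an exactly-once effect is to record the epoch and write the rows in the same PostgreSQL transaction. If the epoch is already recorded, the batch is skipped. The sketch below assumes a hypothetical `processed_batches (query_name, epoch_id)` table with a primary key on both columns, and the `psycopg2` package; neither is part of the challenge. The core function takes any DB-API connection, which makes it easy to test without Spark.

```python
def write_batch_once(conn, query_name, epoch_id, columns, rows, placeholder="%s"):
    """Insert rows into security_alerts unless this epoch was already committed.

    Returns True if the batch was written, False if it was a replay.
    The epoch marker and the data commit or roll back together.
    """
    p = placeholder
    col_list = ", ".join(columns)
    value_list = ", ".join([p] * len(columns))
    cur = conn.cursor()
    try:
        # Claim the epoch; a conflict means a previous attempt already committed
        cur.execute(
            f"INSERT INTO processed_batches (query_name, epoch_id) "
            f"VALUES ({p}, {p}) ON CONFLICT DO NOTHING",
            (query_name, epoch_id),
        )
        if cur.rowcount == 0:
            conn.rollback()
            return False
        if rows:
            cur.executemany(
                f"INSERT INTO security_alerts ({col_list}) VALUES ({value_list})",
                rows,
            )
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()

def write_alerts_exactly_once(df, epoch_id):
    """foreachBatch handler wrapping write_batch_once."""
    import psycopg2  # imported lazily; assumed to be installed

    # Assumption: alert batches are small enough to collect on the driver
    rows = [tuple(row) for row in df.collect()]
    conn = psycopg2.connect(host="postgres", port=5432, dbname="datamart",
                            user="spark", password="spark")
    try:
        write_batch_once(conn, "alerts", epoch_id, df.columns, rows)
    finally:
        conn.close()
```

Passing `write_alerts_exactly_once` to `.foreachBatch(...)` in place of `write_to_postgres` would make a replayed batch a no-op. The trade-off is that rows travel through the driver rather than being written in parallel by executors, which is reasonable for low-volume alerts but not for bulk data.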

## Testing Notes
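- Create the `security_alerts` and `connection_metrics` tables before running so that column types and constraints are under your control (Spark creates a missing table with inferred types on the first write)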
- Verify data is correctly persisted to PostgreSQL
- Test recovery from failures using checkpoint information
- Check that duplicate data isn't written on restart