# Challenge 3: Windowed Aggregations for VPN Events

## Task Description
In this challenge, we need to:
1. Apply time windows to the streaming data
2. Perform aggregations within these time windows
3. Calculate statistics that reveal patterns, such as attempt counts and success/failure ratios
4. Output the results to a sink (the console, during development)

## Prerequisites
Complete Challenges 1 and 2 first. This notebook expects the structured streaming DataFrame `enriched_df` produced in Challenge 2.

In [None]:
from pyspark.sql import SparkSession
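# Note: the wildcard import shadows Python built-ins such as sum, min, and max
# with the Spark column functions of the same name
from pyspark.sql.functions import *
from pyspark.sql.types import *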

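# Create a Spark session
# Note: the Kafka connector coordinates below (Scala 2.12, version 3.4.1) should match your Spark installation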
spark = SparkSession.builder \
    .appName("VPN Security Stream Processing") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1") \
    .getOrCreate()

# Set log level
spark.sparkContext.setLogLevel("WARN")

# Assume we have the enriched_df from Challenge 2

## Define Watermarking and Windows

In [None]:
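# Add a watermark so Spark can handle late data and bound the aggregation state:
# events arriving more than 10 minutes behind the latest event time seen may be dropped
# This is crucial for stateful operations like windowing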
df_with_watermark = enriched_df \
    .withWatermark("event_time", "10 minutes")

# TODO: Define different types of windows
# Hint: Consider both tumbling and sliding windows
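
Before defining the windows in Spark, it can help to see how a single event maps onto them. The sketch below is a plain-Python model, not Spark code: `assign_windows` is a hypothetical helper, and it assumes windows are half-open intervals `[start, end)` aligned to the Unix epoch, which is how `window()` behaves when no start offset is given.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def assign_windows(event_time, window, slide=None):
    """Return the [start, end) windows that contain event_time, earliest first.

    Hypothetical helper for illustration only; Spark does this internally.
    """
    slide = slide or window  # tumbling window: the slide equals the window length
    # Latest window start at or before the event, aligned to the epoch.
    start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    # Walk backwards while the window still covers the event.
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


event = datetime(2024, 1, 1, 12, 3, 30, tzinfo=timezone.utc)

# Tumbling: 5-minute windows that do not overlap.
tumbling = assign_windows(event, timedelta(minutes=5))

# Sliding: 5-minute windows that start every minute.
sliding = assign_windows(event, timedelta(minutes=5), timedelta(minutes=1))

for start, end in tumbling + sliding:
    print(start.time(), "->", end.time())
```

The event lands in exactly one tumbling window but in five overlapping sliding windows, so counts from a sliding window should not be summed across windows.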

## Implement Window-based Aggregations

In [None]:
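# Count connection attempts by user and country in 5-minute tumbling windows
user_country_counts = df_with_watermark \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("user_id"),
        col("country")
    ) \
    .agg(
        # Starter metric so the cell runs; agg() needs at least one expression
        count("*").alias("attempt_count"),
        # TODO: Add further aggregation metrics
        # Examples: success ratio, distinct devices
        # Note: exact distinct counts are not supported in streaming
        # aggregations; consider approx_count_distinct() instead
    )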

In [None]:
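# TODO: Calculate success/failure ratios by country
country_metrics = df_with_watermark \
    .groupBy(
        window(col("event_time"), "5 minutes", "1 minute"),  # 5-minute window sliding every minute
        col("country")
    ) \
    .agg(
        # Starter metric so the cell runs; agg() needs at least one expression
        count("*").alias("total_attempts"),
        # TODO: Add success and failure counts, then derive the ratios
        # Hint: Use sum(when()) for conditional counts
    )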
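
The `sum(when())` hint amounts to conditional counting: add 1 for each row that matches a condition and 0 otherwise, then divide by the total. The sketch below models that logic in plain Python rather than Spark. The sample rows and the `status` field with values `"success"` and `"failure"` are assumptions for illustration; use whatever column your `enriched_df` carries.

```python
from collections import defaultdict

# Hypothetical events, already assigned to a window. "status" is an assumed field name.
events = [
    {"window": "12:00-12:05", "country": "DE", "status": "success"},
    {"window": "12:00-12:05", "country": "DE", "status": "failure"},
    {"window": "12:00-12:05", "country": "DE", "status": "failure"},
    {"window": "12:00-12:05", "country": "US", "status": "success"},
]


def summarize_by_country(rows):
    """Group by (window, country) and compute conditional counts and a ratio."""
    totals = defaultdict(lambda: {"attempts": 0, "successes": 0, "failures": 0})
    for row in rows:
        metrics = totals[(row["window"], row["country"])]
        metrics["attempts"] += 1  # plays the role of count("*")
        # Plays the role of sum(when(condition, 1).otherwise(0))
        metrics["successes"] += 1 if row["status"] == "success" else 0
        metrics["failures"] += 1 if row["status"] == "failure" else 0
    for metrics in totals.values():
        # Every group holds at least one row, so attempts is never zero.
        metrics["failure_ratio"] = metrics["failures"] / metrics["attempts"]
    return dict(totals)


summary = summarize_by_country(events)
for key, metrics in summary.items():
    print(key, metrics)
```

In the Spark version the ratio is one more expression inside `.agg()`: the conditional sum divided by the row count.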

## Calculate Rate-based Metrics

In [None]:
# TODO: Calculate connection attempt rates
# Hint: You might need to use groupBy and count to calculate rates

## Output Aggregation Results

In [None]:
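# TODO: Output window aggregations to console (for development)
# Use "update" mode for aggregations: only windows that changed are emitted.
# A comment cannot follow a line-continuation backslash, so it lives up here.
window_query = user_country_counts \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", False) \
    .start()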

## Notes and Hints
- Consider how to interpret the window results: with a 5-minute window sliding every minute, each event is counted in five overlapping windows
- Think about what thresholds would indicate suspicious behavior
- Consider both count-based and rate-based aggregations
- Remember that windows can slide or tumble (fixed)
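
One way to combine the count-based and rate-based ideas from the notes above is sketched below in plain Python. The function `flag_suspicious`, the field names, and both threshold values are hypothetical placeholders, not recommended settings. Tune them against your own data.

```python
def flag_suspicious(rows, window_minutes=5, max_rate_per_min=2.0, max_failure_ratio=0.5):
    """Return the aggregated rows that exceed either illustrative threshold.

    Each row is one (window, user) result with "attempts" >= 1 and "failures".
    """
    flagged = []
    for row in rows:
        rate = row["attempts"] / window_minutes  # attempts per minute
        failure_ratio = row["failures"] / row["attempts"]
        reasons = []
        if rate > max_rate_per_min:
            reasons.append("high_rate")
        if failure_ratio > max_failure_ratio:
            reasons.append("high_failure_ratio")
        if reasons:
            flagged.append({"user_id": row["user_id"], "rate": rate, "reasons": reasons})
    return flagged


# Hypothetical aggregation results for one 5-minute window.
window_results = [
    {"user_id": "u1", "attempts": 20, "failures": 15},
    {"user_id": "u2", "attempts": 3, "failures": 1},
    {"user_id": "u3", "attempts": 4, "failures": 3},
]

flagged = flag_suspicious(window_results)
for item in flagged:
    print(item)
```

A rate normalizes the count by the window length, so the same threshold stays comparable if the window size changes later.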