# Challenge 4: Anomaly Detection for VPN Security

## Task Description
In this challenge, we need to:
1. Detect suspicious patterns in VPN connection attempts
2. Identify rapid country-hopping (same user connecting from different countries)
3. Detect potential brute force attacks (multiple failed connections)
4. Flag anomalous activity for further investigation

## Prerequisites
Complete Challenges 1-3 first. This notebook builds on the windowed, watermarked streaming DataFrame (`df_with_watermark`) produced in Challenge 3.

In [None]:
from pyspark.sql import SparkSession
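# Import only what is used: a wildcard import shadows Python builtins such as sum, max and round
from pyspark.sql.functions import (
    col, collect_set, concat_ws, count, create_map, hour, lit, size, window
)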
from pyspark.sql.types import *

# Create a Spark session
spark = SparkSession.builder \
    .appName("VPN Security Stream Processing") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1") \
    .getOrCreate()

# Set log level
spark.sparkContext.setLogLevel("WARN")

# df_with_watermark (the watermarked streaming DataFrame from Challenge 3)
# must already be defined before running the cells below

## Country-Hopping Detection

In [None]:
# TODO: Detect users connecting from multiple countries in a short timeframe
# Hint: Group by user_id and count distinct countries within time windows

country_hopping_df = df_with_watermark \
    .groupBy(
        window(col("event_time"), "10 minutes"),
        col("user_id")
    ) \
    .agg(
        collect_set("country").alias("countries"),
        count("*").alias("connection_count")
    ) \
    .select(
        col("window"),
        col("user_id"),
        col("countries"),
        col("connection_count"),
        size(col("countries")).alias("country_count")
    ) \
    .filter(col("country_count") >= 2)  # Adjust threshold as needed
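
To sanity-check the window logic on a handful of events before running it on the stream, a plain-Python reference can help. The sketch below is not Spark code; it assumes events arrive as `(user_id, country, event_time)` tuples with timezone-aware timestamps, and the `at` helper and sample data are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 10 * 60  # same width as the 10-minute window above


def window_start(ts, width=WINDOW_SECONDS):
    """Floor a timezone-aware timestamp to the start of its tumbling window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)


def country_hoppers(events, min_countries=2):
    """Return {(window_start, user_id): countries} for users seen in at
    least `min_countries` distinct countries within a single window."""
    seen = defaultdict(set)
    for user_id, country, ts in events:
        seen[(window_start(ts), user_id)].add(country)
    return {k: v for k, v in seen.items() if len(v) >= min_countries}


def at(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary sample day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


sample_events = [
    ("u1", "DE", at(12, 1)),
    ("u1", "BR", at(12, 7)),   # two countries in one window: flagged
    ("u2", "FR", at(12, 3)),
    ("u2", "FR", at(12, 8)),   # same country twice: not flagged
    ("u3", "US", at(12, 9)),
    ("u3", "JP", at(12, 11)),  # hop straddles the 12:10 boundary: missed
]

hoppers = country_hoppers(sample_events)
```

The `u3` case shows a limitation of tumbling windows: a hop that crosses a window boundary goes undetected. A sliding window, for example `window(col("event_time"), "10 minutes", "5 minutes")`, is one way to catch it, at the cost of reporting some hops in more than one window.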

## Brute Force Detection

In [None]:
# TODO: Detect multiple failed connection attempts
# Hint: Look for patterns of consecutive failures

brute_force_df = df_with_watermark \
    .filter(col("is_successful") == False) \
    .groupBy(
        window(col("event_time"), "5 minutes"),
        col("user_id")
    ) \
    .agg(
        count("*").alias("failed_attempts")
    ) \
    .filter(col("failed_attempts") >= 5)  # Adjust threshold as needed
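
The hint mentions consecutive failures, while the query above counts all failures in a window regardless of order. A plain-Python sketch of the difference follows; it is not Spark code, and it assumes events are `(user_id, event_time, is_successful)` tuples where `event_time` is any sortable value. On a stream, tracking streaks like this would likely need stateful processing rather than a plain windowed count.

```python
from collections import defaultdict


def max_failure_streaks(events):
    """Return {user_id: longest run of consecutive failed attempts}.

    Users with no failed attempts are omitted from the result.
    """
    current = defaultdict(int)
    longest = {}
    # Process in event-time order so "consecutive" is well defined
    for user_id, _, is_successful in sorted(events, key=lambda e: e[1]):
        if is_successful:
            current[user_id] = 0  # a success breaks the streak
        else:
            current[user_id] += 1
            longest[user_id] = max(longest.get(user_id, 0), current[user_id])
    return longest


# Made-up sample: times are seconds since an arbitrary start
sample_events = [
    ("u1", 1, False), ("u1", 2, False), ("u1", 3, False),
    ("u1", 4, True), ("u1", 5, False), ("u1", 6, False),
    ("u2", 1, False), ("u2", 2, True), ("u2", 3, False),
    ("u2", 4, True), ("u2", 5, False),
    ("u3", 1, True),
]

streaks = max_failure_streaks(sample_events)
```

Here `u2` fails three times in total but never twice in a row, which looks more like a forgetful user than an attack. A pure count treats both users similarly; a streak-based signal separates them.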

## Unusual Access Patterns

In [None]:
# TODO: Detect access from unusual countries for a user
# Hint: You might need to track historical patterns and compare

# TODO: Identify irregular connection times
# Hint: Look at the hour of day distribution
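
As a starting point for both TODOs, the comparison against a user's history can be sketched in plain Python before deciding how to maintain that history in Spark. The function names, the `min_seen` and `min_share` parameters, and the sample history are all assumptions for illustration.

```python
from collections import Counter


def is_unusual_country(country_history, country, min_seen=1):
    """True if the user has connected from `country` fewer than `min_seen` times."""
    return Counter(country_history)[country] < min_seen


def is_unusual_hour(hour_history, hour, min_share=0.05):
    """True if `hour` accounts for less than `min_share` of past connections."""
    if not hour_history:
        return False  # no baseline yet, so nothing can be called unusual
    share = Counter(hour_history)[hour] / len(hour_history)
    return share < min_share


# Made-up history: mostly office hours, 20 connections in total
past_hours = [9] * 10 + [10] * 8 + [14] * 2
past_countries = ["DE", "DE", "FR"]

flag_night = is_unusual_hour(past_hours, 3)       # never seen at 03:00
flag_afternoon = is_unusual_hour(past_hours, 14)  # 2 of 20 connections
flag_country = is_unusual_country(past_countries, "BR")
```

In the streaming job, the hour can be derived with `hour(col("event_time"))`. Where the per-user history comes from (a static lookup table, a batch job, or streaming state) is a design decision this sketch leaves open.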

## Combining Anomaly Signals

In [None]:
# TODO: Create a unified view of security alerts
# Combine different anomaly signals with appropriate severity levels

alerts_df = country_hopping_df.select(
    col("window.end").alias("alert_time"),
    col("user_id"),
    lit("country_hopping").alias("alert_type"),
    col("country_count").alias("severity_factor"),
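    # Use Spark's create_map (not Python's builtin map); map values must
    # share one type, so both details are stored as strings
    create_map(
        lit("countries"), concat_ws(",", col("countries")),
        lit("connection_count"), col("connection_count").cast("string")
    ).alias("details")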
)

# TODO: Join with other alert types
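
One way to think about "appropriate severity levels" is a small rule table that maps each alert type's `severity_factor` to a label. The plain-Python sketch below uses hypothetical thresholds and level names; none of them come from the challenge data, and the same rules could later be expressed in Spark with conditional column expressions.

```python
# Hypothetical rules: (minimum severity_factor, level), highest first
SEVERITY_RULES = {
    "country_hopping": [(3, "high"), (2, "medium")],
    "brute_force": [(20, "high"), (10, "medium"), (5, "low")],
}


def severity_level(alert_type, severity_factor):
    """Map a numeric factor to a label; unknown types fall back to 'info'."""
    for minimum, level in SEVERITY_RULES.get(alert_type, []):
        if severity_factor >= minimum:
            return level
    return "info"


def to_alert(alert_time, user_id, alert_type, severity_factor, details):
    """Normalise any anomaly signal into one shared alert record."""
    return {
        "alert_time": alert_time,
        "user_id": user_id,
        "alert_type": alert_type,
        "severity_factor": severity_factor,
        "severity": severity_level(alert_type, severity_factor),
        "details": details,
    }


# Made-up alerts of two different types sharing one schema
alerts = [
    to_alert("2024-01-01T12:10:00Z", "u1", "country_hopping", 2,
             {"countries": "DE,BR"}),
    to_alert("2024-01-01T12:05:00Z", "u7", "brute_force", 25,
             {"failed_attempts": "25"}),
]
```

Keeping every alert type in the same shape (`alert_time`, `user_id`, `alert_type`, `severity_factor`, `details`) is what makes it possible to combine them into a single view.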

## Output Alerts

In [None]:
# Write alerts to console for monitoring
alert_query = alerts_df \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", False) \
    .start()

## Notes and Hints
- Consider what constitutes "normal" vs. "suspicious" behavior
- You might need different thresholds for different types of users
- Consider time of day in your anomaly detection
- Remember that false positives are costly too: overly sensitive thresholds bury real alerts in noise
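
The point about per-user thresholds can be sketched as a simple lookup. The user categories and numbers below are hypothetical (the challenge data may not include a user type at all); only the default of 5 mirrors the brute force cell.

```python
# Hypothetical user categories with stricter or looser limits
FAILED_ATTEMPT_THRESHOLDS = {
    "service_account": 3,  # automated logins should rarely fail
    "contractor": 4,
}
DEFAULT_THRESHOLD = 5  # same value as the brute force filter above


def is_brute_force(user_type, failed_attempts):
    """Apply the threshold for this user type, or the default if unknown."""
    threshold = FAILED_ATTEMPT_THRESHOLDS.get(user_type, DEFAULT_THRESHOLD)
    return failed_attempts >= threshold


flag_service = is_brute_force("service_account", 3)
flag_employee = is_brute_force("employee", 3)
```

Tightening a threshold for one group raises its false positive rate, so each value is worth checking against real traffic before it is used for alerting.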