# Challenge 2: Basic Transformations for VPN Event Stream

## Task Description
In this challenge, we need to:
1. Parse JSON data from Kafka messages
2. Extract relevant fields
3. Apply transformations to derive new columns
4. Filter events based on conditions

## Setup
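Reuse the Spark session setup from Challenge 1. The Kafka connector version in `spark.jars.packages` must match your installed Spark and Scala versions.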

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Create a Spark session
spark = SparkSession.builder \
    .appName("VPN Security Stream Processing") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1") \
    .getOrCreate()

# Set log level
spark.sparkContext.setLogLevel("WARN")

## Parse JSON and Extract Fields

In [None]:
# Recreate stream connection
kafka_bootstrap_servers = "kafka:9092"
kafka_topic = "vpn_connection_events"

stream_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic) \
    .option("startingOffsets", "earliest") \
    .load()

# Parse value column from Kafka (contains JSON data)
parsed_df = stream_df.selectExpr("CAST(value AS STRING)")

# Define schema for the JSON data
schema = StructType([
    # TODO: Complete the schema based on Challenge 1
])

# TODO: Convert JSON string to structured data
# Hint: Use from_json() function with the schema defined above
parsed_json_df = parsed_df.select(
    # TODO: Use from_json to parse the JSON string
)

# TODO: Select and rename columns as needed
# Hint: Extract fields from the parsed JSON structure
transformed_df = parsed_json_df.select(
    # Select fields from the parsed JSON
    # Example: col("data.user_id").alias("user_id")
)
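
Before filling in the schema, it can help to look at one raw message and see which fields and value types it actually contains. The snippet below is a minimal sketch in plain Python, not part of the challenge solution. The sample payload is hypothetical (the `bytes_sent` field in particular is an assumption), so replace it with a real message copied from the topic.

```python
import json

# Hypothetical sample payload; replace with a real message from the topic.
sample_message = (
    '{"user_id": "u-123", "timestamp": "2024-01-15T10:30:00", "bytes_sent": 2048}'
)


def describe_fields(raw_json: str) -> dict:
    """Map each top-level JSON key to the Python type name of its value."""
    record = json.loads(raw_json)
    return {key: type(value).__name__ for key, value in record.items()}


field_types = describe_fields(sample_message)
for name, type_name in field_types.items():
    print(f"{name}: {type_name}")
```

As a rough guide when writing the `StructType`, `str` values usually map to `StringType()`, `int` to `IntegerType()` or `LongType()`, `float` to `DoubleType()`, and `bool` to `BooleanType()`.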

## Add Derived Columns

In [None]:
# TODO: Add derived columns based on existing data
enriched_df = (
    transformed_df
    .withColumn("event_time", to_timestamp(col("timestamp")))
    # TODO: Logic to determine if device is mobile (placeholder is NULL for now)
    .withColumn("is_mobile", lit(None).cast("boolean"))
    # TODO: Logic to check if connection was successful (placeholder is NULL for now)
    .withColumn("is_successful", lit(None).cast("boolean"))
    .withColumn("event_hour", hour(col("event_time")))
    .withColumn("event_date", to_date(col("event_time")))
)

# TODO: Add any other useful derived columns

## Apply Filters

In [None]:
# TODO: Filter for specific conditions
# Examples:
# - Only successful connections
# - Only mobile devices
# - Connections from specific countries
filtered_df = enriched_df.filter(lit(True))  # TODO: Replace lit(True) with your filter conditions

## Output Stream

In [None]:
# Display the transformed and enriched stream (use filtered_df instead to check your filters)
query = enriched_df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .start()

query.awaitTermination()

## Testing Notes
- Make sure your JSON parsing matches the actual data format
- Verify derived columns work as expected
- Check that filter conditions are working correctly
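
One way to verify the derived columns is to compute the expected values for a single event in plain Python and compare them with the console output. The sketch below assumes that `timestamp` is an ISO-8601 string without a timezone offset. The `device_type` field and the set of mobile values are hypothetical, chosen only for illustration, so adjust them to the real data.

```python
from datetime import datetime

# Hypothetical set of device types treated as mobile; adjust to the real data.
MOBILE_DEVICE_TYPES = {"ios", "android"}


def expected_derived(event: dict) -> dict:
    """Compute reference values for the derived columns of one event."""
    event_time = datetime.fromisoformat(event["timestamp"])
    return {
        "event_hour": event_time.hour,
        "event_date": event_time.date().isoformat(),
        # "device_type" is an assumed field name, not confirmed by the source.
        "is_mobile": event.get("device_type", "").lower() in MOBILE_DEVICE_TYPES,
    }


sample_event = {"timestamp": "2024-01-15T10:30:00", "device_type": "Android"}
expected = expected_derived(sample_event)
print(expected)
```

If the stream output disagrees with these reference values, check the timestamp format first and then the Spark session time zone, since `hour()` and `to_date()` are evaluated in that time zone.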