## Exercise One

Our brand is having a flash sale! All physical stores are discounting their prices for a national holiday. The discount is meant to boost sales at our store locations, and we don't necessarily want to extend it to our international customers, so we need to transform our data to see what this would look like.

First, using the PySpark transformation methods we looked at previously, create a new column on the incoming data to show the discounted amount. Then remove any online stores from the transformed data.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, TimestampType, IntegerType
from pyspark.sql.functions import from_json, col

# Define the path to the jars on the EC2 instance
spark_jars_path = "/home/ec2-user/stream-processing-template/jars"  # <-- Update this path

spark = SparkSession.builder.appName("retail_pyspark_consumer") \
    .config("spark.jars", f"{spark_jars_path}/commons-pool2-2.11.1.jar,"
            f"{spark_jars_path}/spark-sql-kafka-0-10_2.12-3.4.0.jar,"
            f"{spark_jars_path}/spark-streaming-kafka-0-10-assembly_2.12-3.4.0.jar") \
    .getOrCreate()

# Define the schema for our data
schema = StructType([
    StructField("store_location", StringType(), True),
    StructField("time_of_purchase", TimestampType(), True),
    StructField("product_ID", StringType(), True),
    StructField("transaction_amount", IntegerType(), True)
])

# Stream from Kafka topic
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "b-1.monstercluster1.6xql65.c3.kafka.eu-west-2.amazonaws.com:9092") \
    .option("subscribe", "retail_transactions") \
    .load()

transactions = (df.selectExpr("CAST(value AS STRING)")
                .withColumn("data", from_json(col("value"), schema))
                .select("data.*"))

Objective: Calculate and add a new column discounted_amount, which is 90% of the transaction_amount.

In [None]:
from pyspark.sql.functions import col
# Apply the discount
with_discount = transactions.withColumn("discounted_amount", col("transaction_amount") * 0.9)

# Define a streaming query to view the results
query_with_discount = with_discount.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Await termination of the stream (this call blocks the cell; interrupt it before running the stop() cell below)
query_with_discount.awaitTermination()

In [None]:
query_with_discount.stop()
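
Because `awaitTermination()` with no arguments blocks until the query ends, the `stop()` cell cannot run until the previous cell is interrupted. One possible alternative in a notebook is to let the query run for a fixed window and then stop it. The sketch below assumes a hypothetical helper named `run_for` (it is not part of PySpark) and relies on `awaitTermination` accepting an optional timeout in seconds.

```python
def run_for(query, seconds=30):
    """Let a streaming query run for up to `seconds`, then stop it.

    Returns True if the query ended on its own within the window,
    False if it was still running and had to be stopped here.
    """
    try:
        # With a timeout, awaitTermination returns instead of blocking forever.
        finished = query.awaitTermination(seconds)
    finally:
        # Always release the stream, even if the query raised an error.
        query.stop()
    return finished
```

With this in place, a cell such as `run_for(query_with_discount, seconds=30)` would print batches to the console for roughly thirty seconds and then return control to the notebook.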

Objective: Filter out transactions from the store_location "online" (i.e., keep only physical store transactions).

In [None]:
physical_transactions = transactions.filter(transactions.store_location != "online")

# Define a streaming query to view the results
physical_transactions_query = physical_transactions.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Await termination of the stream (this call blocks the cell; interrupt it before running the stop() cell below)
physical_transactions_query.awaitTermination()

In [None]:
physical_transactions_query.stop()
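
The exercise brief asks for both steps together: add the discounted amount, then drop online sales. As a cross-check of what the combined result should look like, the plain-Python sketch below applies the same two rules to a few made-up messages. The sample records, the store names and the helper `apply_flash_sale` are hypothetical; only the field names come from the schema defined earlier, and the sketch assumes well-formed records.

```python
import json

# Hypothetical Kafka message values; only the field names match the schema above.
sample_messages = [
    '{"store_location": "London", "time_of_purchase": "2023-06-01T10:15:00", '
    '"product_ID": "P100", "transaction_amount": 200}',
    '{"store_location": "online", "time_of_purchase": "2023-06-01T10:16:00", '
    '"product_ID": "P200", "transaction_amount": 50}',
    '{"store_location": "Leeds", "time_of_purchase": "2023-06-01T10:17:00", '
    '"product_ID": "P300", "transaction_amount": 80}',
]

DISCOUNT_RATE = 0.9  # customers pay 90% of the original amount


def apply_flash_sale(raw_messages):
    """Parse each JSON message, drop online sales, and add discounted_amount."""
    results = []
    for raw in raw_messages:
        record = json.loads(raw)
        # Skip online transactions; the sale applies to physical stores only.
        if record["store_location"] == "online":
            continue
        record["discounted_amount"] = record["transaction_amount"] * DISCOUNT_RATE
        results.append(record)
    return results


for row in apply_flash_sale(sample_messages):
    print(row["store_location"], row["transaction_amount"],
          round(row["discounted_amount"], 2))
```

In PySpark the two steps can be chained on the streaming DataFrame in the same way, for example by calling `.filter(...)` and then `.withColumn(...)` on `transactions` before starting a single console query.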