### Stock Market Data Processing Using Spark Structured Streaming

#### Setting the Environment Variables

##### Define the Confluent Kafka broker, the Kafka topics, and the API keys as environment variables.


In [0]:
import os

# Set environment variables.
# Never commit real credentials: the API key and secret below are placeholders.
os.environ["KAFKA_BROKER"] = "pkc-921jm.us-east-2.aws.confluent.cloud:9092"
os.environ["KAFKA_TOPIC"] = "market_data"
os.environ["KAFKA_TOPIC_PROCESSED"] = "processed_data"
os.environ["KAFKA_API_KEY"] = "<CONFLUENT_API_KEY>"
os.environ["KAFKA_API_SECRET"] = "<CONFLUENT_API_SECRET>"
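
Credentials typed into a notebook cell are easy to leak when the notebook is shared or exported. The following is a minimal sketch, assuming the variables are injected outside the notebook (for example through the cluster configuration), that fails fast when one of them is missing. The names `require_env`, `REQUIRED_VARS`, and `load_kafka_settings` are hypothetical helpers and not part of the original notebook.

```python
import os

# Environment variable names used by this notebook.
REQUIRED_VARS = [
    "KAFKA_BROKER",
    "KAFKA_TOPIC",
    "KAFKA_TOPIC_PROCESSED",
    "KAFKA_API_KEY",
    "KAFKA_API_SECRET",
]


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def load_kafka_settings() -> dict:
    """Collect all required settings in one dict, failing on the first missing one."""
    return {name: require_env(name) for name in REQUIRED_VARS}
```

Calling `load_kafka_settings()` once at the top of the notebook surfaces a missing key immediately, instead of as an authentication error when the stream starts.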

#### Stream Processing

##### Calculating the Average Price Point of IBM Stock for Every One-Minute Interval

##### Once the average price of IBM stock is calculated every minute, the next step is visualization.


In [0]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Stock Price Streaming") \
    .getOrCreate()

# Kafka configuration, read from the environment variables set above
# (the second argument is the fallback used when a variable is not set)
kafka_bootstrap_servers = os.getenv("KAFKA_BROKER", "pkc-921jm.us-east-2.aws.confluent.cloud:9092")
kafka_topic = os.getenv("KAFKA_TOPIC", "market_data")
kafka_topic_processed = os.getenv("KAFKA_TOPIC_PROCESSED", "processed_data")
checkpoint_location = "/mnt/checkpoint/kafka_sink_v3"

kafka_config = {
    'kafka.bootstrap.servers': kafka_bootstrap_servers,
    'subscribe': kafka_topic,
    'startingOffsets': 'earliest',  # Start from the earliest message
    'kafka.security.protocol': 'SASL_SSL',
    'kafka.sasl.mechanism': 'PLAIN',
    "failOnDataLoss": "false",
    "kafka.ssl.endpoint.identification.algorithm": "https",
    'kafka.sasl.jaas.config': f'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="{os.getenv("KAFKA_API_KEY")}" password="{os.getenv("KAFKA_API_SECRET")}";',
}

# Define schema for incoming data
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("ticker", StringType(), True)
])

# Read data from Kafka
raw_stream = spark.readStream \
    .format("kafka") \
    .options(**kafka_config) \
    .load()

# Parse JSON data
parsed_stream = raw_stream.selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), schema).alias("data")) \
    .select("data.*")

# Repartition by ticker so that all rows for the same ticker land in the same partition.
# The groupBy below also shuffles by its grouping keys, so this step is optional.
parsed_stream = parsed_stream.repartition("ticker")


# Transformation: Calculate average price over a 1-minute window
average_price_stream = parsed_stream \
    .withColumn("timestamp", col("timestamp").cast("timestamp")) \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(window(col("timestamp"), "1 minute"), col("ticker")) \
    .agg(avg("price").alias("average_price"))


# Prepare DataFrame for Kafka
kafka_df = average_price_stream.selectExpr(
    "CAST(ticker AS STRING) AS key",
    "to_json(struct(window.start AS window_start, window.end AS window_end, ticker, average_price)) AS value"
)

# Write the DataFrame to Kafka
query = kafka_df.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", kafka_topic_processed) \
    .option("checkpointLocation", checkpoint_location) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", f'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="{os.getenv("KAFKA_API_KEY")}" password="{os.getenv("KAFKA_API_SECRET")}";')\
    .outputMode("update") \
    .start()
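
To make the windowed aggregation concrete, here is a plain-Python sketch of what a 1-minute tumbling-window average computes: each event is assigned to the window that starts at its timestamp floored to the minute, and prices are averaged per (window, ticker). It is only an illustration on hypothetical sample events. It does not model the watermark, late-data handling, or the incremental state that Structured Streaming maintains.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=1)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 1-minute tumbling window."""
    return ts.replace(second=0, microsecond=0)


def average_by_window(events):
    """Average prices per (window_start, window_end, ticker).

    `events` is an iterable of (timestamp, ticker, price) tuples.
    Each window covers [window_start, window_end).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, ticker, price in events:
        start = window_start(ts)
        acc = sums[(start, start + WINDOW, ticker)]
        acc[0] += price
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical sample events (not taken from the real feed).
events = [
    (datetime(2025, 1, 2, 20, 19, 5, tzinfo=timezone.utc), "IBM", 100.0),
    (datetime(2025, 1, 2, 20, 19, 40, tzinfo=timezone.utc), "IBM", 102.0),
    (datetime(2025, 1, 2, 20, 20, 10, tzinfo=timezone.utc), "IBM", 110.0),
]

averages = average_by_window(events)
for (start, end, ticker), avg_price in sorted(averages.items()):
    print(start.isoformat(), end.isoformat(), ticker, avg_price)
```

The first two events fall into the 20:19 window and the third into the 20:20 window, so the sketch yields one average per window, mirroring the one row per window and ticker that the streaming query emits.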



#### Live Dashboard on Databricks or Google Colab

##### Once the data is processed, it is visualized immediately using the visualization tool in Databricks.

###### NB: Here the data is visualized immediately after processing. This differs from the local setup, where the processed data is first written to the "processed_data" Kafka topic and then visualized with Plotly/Dash.
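
For the local setup mentioned in the note, a consumer of the "processed_data" topic has to decode each message value before plotting it. The sketch below parses one such value using the field names produced by the `to_json(struct(...))` expression in the streaming query (`window_start`, `window_end`, `ticker`, `average_price`). The sample payload is hypothetical, the ISO 8601 timestamp format with a trailing `Z` is an assumption, and the function names are illustrative only.

```python
import json
from datetime import datetime


def _parse_ts(text: str) -> datetime:
    """Parse an ISO 8601 timestamp; assumes UTC is written as a trailing 'Z'."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def parse_processed_message(value: bytes) -> dict:
    """Decode one Kafka message value (UTF-8 JSON) into typed Python fields."""
    record = json.loads(value.decode("utf-8"))
    return {
        "window_start": _parse_ts(record["window_start"]),
        "window_end": _parse_ts(record["window_end"]),
        "ticker": record["ticker"],
        "average_price": float(record["average_price"]),
    }


# Hypothetical message value, shaped like the JSON the query writes.
sample_value = (
    b'{"window_start":"2025-01-02T20:19:00.000Z",'
    b'"window_end":"2025-01-02T20:20:00.000Z",'
    b'"ticker":"IBM","average_price":100.5}'
)

parsed = parse_processed_message(sample_value)
print(parsed["ticker"], parsed["window_start"], parsed["average_price"])
```

A dashboard callback could call `parse_processed_message` on every consumed message and append the result to the series it plots.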

In [0]:
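# Expose the window start as its own top-level column for plotting.
# Note: the new column's name contains a literal dot ("window.start").
average_price_stream_formatted = average_price_stream \
    .withColumn("window.start", col("window.start").cast("timestamp"))


In [0]:
display(average_price_stream_formatted)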

window,ticker,average_price,window.start
"List(2025-01-02T20:19:00Z, 2025-01-02T20:20:00Z)",IBM,100.93499946594238,2025-01-02T20:19:00Z
"List(2025-01-07T15:16:00Z, 2025-01-07T15:17:00Z)",IBM,87.12916628519694,2025-01-07T15:16:00Z
"List(2025-01-02T19:55:00Z, 2025-01-02T19:56:00Z)",IBM,92.48416646321614,2025-01-02T19:55:00Z
"List(2025-01-02T20:06:00Z, 2025-01-02T20:07:00Z)",IBM,102.55583254496256,2025-01-02T20:06:00Z
"List(2025-01-02T19:41:00Z, 2025-01-02T19:42:00Z)",IBM,118.79916763305664,2025-01-02T19:41:00Z
"List(2025-01-02T20:01:00Z, 2025-01-02T20:02:00Z)",IBM,106.48916689554852,2025-01-02T20:01:00Z
"List(2025-01-02T20:18:00Z, 2025-01-02T20:19:00Z)",IBM,119.74583435058594,2025-01-02T20:18:00Z
"List(2025-01-02T19:48:00Z, 2025-01-02T19:49:00Z)",IBM,90.85833168029784,2025-01-02T19:48:00Z
"List(2025-01-02T20:38:00Z, 2025-01-02T20:39:00Z)",IBM,111.98666699727376,2025-01-02T20:38:00Z
"List(2025-01-02T20:13:00Z, 2025-01-02T20:14:00Z)",IBM,115.1433334350586,2025-01-02T20:13:00Z


Databricks visualization. Run in Databricks to view.