# Exercise 17 - Kafka Consumer with Spark

## Objectives
- Read Kafka messages with Spark Structured Streaming
- Parse JSON messages
- Process data in streaming mode
- Write the results to MinIO

---

## 1. Spark Consumer Architecture

```
+------------------------------------------------------------------+
|               SPARK STRUCTURED STREAMING + KAFKA                 |
+------------------------------------------------------------------+
|                                                                  |
|   KAFKA                  SPARK                   DESTINATION    |
|                                                                  |
|  +--------+         +-------------+            +----------+     |
|  | Topic  |         |             |            |          |     |
|  |--------|  read   |  DataFrame  |   write    |  MinIO   |     |
|  | Part 0 |-------->|  Streaming  |----------->|  Parquet |     |
|  | Part 1 |         |             |            |          |     |
|  | Part 2 |         +------+------+            +----------+     |
|  +--------+                |                                    |
|                            v                                    |
|                     +------+------+            +----------+     |
|                     |             |            |          |     |
|                     | Aggregation |----------->|  Console |     |
|                     |             |            |          |     |
|                     +-------------+            +----------+     |
|                                                                  |
+------------------------------------------------------------------+

Output modes:
- append   : emits only the rows added since the last trigger
- complete : rewrites the entire result table (for aggregations)
- update   : emits only the rows that changed since the last trigger
```
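
As a rough illustration of the three output modes listed above, the following plain-Python sketch (no Spark involved) models a count-per-customer query over three micro-batches. The function name `simulate_output_modes` and the sample batches are hypothetical. Real Spark sinks behave analogously but with more constraints; for instance, `append` on a streaming aggregation requires a watermark.

```python
from collections import Counter

def simulate_output_modes(batches):
    """Model what a sink would receive at each micro-batch.

    'append' is shown for the raw (non-aggregated) rows; 'complete' and
    'update' are shown for a running count per key.
    """
    state = Counter()  # aggregation state kept across micro-batches
    emitted = []
    for batch in batches:
        changed = set(batch)  # keys whose count is modified by this batch
        state.update(batch)
        emitted.append({
            "append": list(batch),                       # only the new rows
            "complete": dict(state),                     # the whole result table
            "update": {k: state[k] for k in changed},    # only the modified rows
        })
    return emitted

batches = [["CUST-001", "CUST-002"], ["CUST-001"], ["CUST-003"]]
results = simulate_output_modes(batches)
for i, out in enumerate(results):
    print(f"batch {i}: {out}")
```

Note how `complete` grows with every batch while `update` stays proportional to the number of keys touched, which is why `update` is usually the cheaper choice for console debugging of large aggregations.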

## 2. Spark Configuration

In [1]:
# === Setup (Kafka + MinIO + Spark) ===
# IMPORTANT: If you change packages below, restart the kernel before re-running.
import os
from pyspark.sql import SparkSession
# Note: importing `round` here shadows Python's built-in round() for the rest of the notebook.
from pyspark.sql.functions import col, from_json, to_timestamp, window, count, sum as spark_sum, avg, max as spark_max, explode, current_timestamp, round
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType, TimestampType

# Align versions with the Docker image (Spark 4.x uses Scala 2.13)
SPARK_VER = os.environ.get("SPARK_VER", "4.0.1")
SCALA_SUFFIX = os.environ.get("SCALA_SUFFIX", "2.13")

packages = ",".join([
    f"org.apache.spark:spark-sql-kafka-0-10_{SCALA_SUFFIX}:{SPARK_VER}",
    f"org.apache.spark:spark-token-provider-kafka-0-10_{SCALA_SUFFIX}:{SPARK_VER}",
    # S3A / MinIO
    "org.apache.hadoop:hadoop-aws:3.4.1",
    "com.amazonaws:aws-java-sdk-bundle:1.12.262",
])

# Ensure Spark loads the right connector JARs (avoids Scala/Spark mismatch errors)
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {packages} pyspark-shell"

# (Re)create Spark session
spark = (
    SparkSession.builder
    .appName("KafkaConsumer")
    # MinIO (S3A) configuration
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
    )
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark version:", spark.version)
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionNumberString())
print("Packages:", packages)

# Docker-internal Kafka listener (use this from inside the Jupyter container)
KAFKA_BROKER = "broker:29092"

# Default MinIO base path for this lab
MINIO_BUCKET = "s3a://bronze"
print("Configuration ready")


Spark version: 4.0.1
Scala: 2.13.16
Packages: org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.1,org.apache.spark:spark-token-provider-kafka-0-10_2.13:4.0.1,org.apache.hadoop:hadoop-aws:3.4.1,com.amazonaws:aws-java-sdk-bundle:1.12.262
Configuration ready
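
The connector coordinates in the setup cell follow the Maven pattern `group:artifact_<scala suffix>:<version>`, and the Scala suffix has to match the Scala build of the running Spark. The sketch below isolates that string construction in a small helper so it can be checked without starting Spark. The helper name `kafka_connector_packages` is hypothetical, not part of PySpark.

```python
def kafka_connector_packages(spark_ver: str, scala_suffix: str) -> str:
    """Build the comma-separated --packages value for the Kafka connector JARs."""
    artifacts = ["spark-sql-kafka-0-10", "spark-token-provider-kafka-0-10"]
    return ",".join(
        f"org.apache.spark:{name}_{scala_suffix}:{spark_ver}" for name in artifacts
    )

# Same versions as the ones printed by the setup cell
coords = kafka_connector_packages("4.0.1", "2.13")
print(coords)
```

Comparing this string with the `Spark version` and `Scala` lines printed by the session is a quick way to spot a connector mismatch before any read is attempted.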


In [2]:
# Configuration (same values as in the setup cell, repeated so this cell can be re-run on its own)
KAFKA_BROKER = "broker:29092"
MINIO_BUCKET = "s3a://bronze"

print("Configuration ready")

Configuration ready


## 3. Batch Read from Kafka

In [3]:
# Read the existing messages (batch mode)
df_kafka = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "earliest") \
    .load()

print(f"Messages read: {df_kafka.count()}")
df_kafka.printSchema()

Messages read: 66
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



In [4]:
# Structure of a Kafka record
# - key: message key (binary)
# - value: message payload (binary)
# - topic: topic name
# - partition: partition number
# - offset: position within the partition
# - timestamp: Kafka timestamp

df_kafka.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    "topic",
    "partition",
    "offset",
    "timestamp"
).show(5, truncate=50)

+--------+--------------------------------------------------+--------------+---------+------+-----------------------+
|     key|                                             value|         topic|partition|offset|              timestamp|
+--------+--------------------------------------------------+--------------+---------+------+-----------------------+
|CUST-016|{"order_id": 35924, "customer_id": "CUST-016", ...|commandes-json|        0|     0|2026-01-18 10:08:23.143|
|CUST-013|{"order_id": 29067, "customer_id": "CUST-013", ...|commandes-json|        0|     1|2026-01-18 10:08:23.146|
|CUST-017|{"order_id": 91880, "customer_id": "CUST-017", ...|commandes-json|        0|     2|2026-01-18 10:08:23.147|
|CUST-013|{"order_id": 34621, "customer_id": "CUST-013", ...|commandes-json|        0|     3|2026-01-18 10:08:23.147|
|CUST-020|{"order_id": 11525, "customer_id": "CUST-020", ...|commandes-json|        0|     4|2026-01-18 10:08:23.148|
+--------+--------------------------------------------------+--------------+---------+------+-----------------------+

## 4. Parsing the JSON Messages

In [5]:
# Define the order schema
item_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("subtotal", DoubleType(), True)
])

commande_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", StringType(), True),
    StructField("timestamp", StringType(), True),
    StructField("items", ArrayType(item_schema), True),
    StructField("total", DoubleType(), True),
    StructField("status", StringType(), True)
])

print("Schema defined")

Schema defined


In [6]:
# Parse the JSON payload, then flatten the resulting struct into columns
df_commandes = df_kafka \
    .select(
        col("key").cast("string").alias("customer_key"),
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "partition",
        "offset",
        col("timestamp").alias("kafka_timestamp")
    ) \
    .select(
        "customer_key",
        "data.*",
        "partition",
        "offset",
        "kafka_timestamp"
    )

df_commandes.printSchema()

root
 |-- customer_key: string (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- product_id: integer (nullable = true)
 |    |    |-- product_name: string (nullable = true)
 |    |    |-- quantity: integer (nullable = true)
 |    |    |-- unit_price: double (nullable = true)
 |    |    |-- subtotal: double (nullable = true)
 |-- total: double (nullable = true)
 |-- status: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- kafka_timestamp: timestamp (nullable = true)
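
To make the cast-then-parse step more concrete, here is a plain-Python sketch of what happens to a single Kafka value: the bytes are decoded as UTF-8, parsed as JSON, and projected onto the expected field names. The helper `parse_order` is hypothetical, and its error handling (an unparseable payload yields all-`None` fields instead of raising) is an assumption modelled on permissive parsing rather than a description of Spark's exact behavior.

```python
import json

ORDER_FIELDS = ("order_id", "customer_id", "timestamp", "items", "total", "status")

def parse_order(raw: bytes) -> dict:
    """Decode one Kafka value and project it onto the expected order fields.

    Missing fields become None; an unparseable payload yields all-None fields
    (assumption: permissive handling, chosen for this illustration).
    """
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}  # e.g. a bare JSON number or list is not an order
    return {field: payload.get(field) for field in ORDER_FIELDS}

good = parse_order(b'{"order_id": 35924, "customer_id": "CUST-016", "total": 1409.91}')
bad = parse_order(b"not json")
print(good)
print(bad)
```

In the notebook itself, a quick sanity check after parsing is to count rows where `order_id` is null, since those are the records that did not match the schema.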



In [7]:
# Display the orders
df_commandes.select(
    "order_id",
    "customer_id",
    "total",
    "status",
    "partition"
).show(10)

+--------+-----------+-------+-------+---------+
|order_id|customer_id|  total| status|partition|
+--------+-----------+-------+-------+---------+
|   35924|   CUST-016|1409.91|created|        0|
|   29067|   CUST-013| 539.94|created|        0|
|   91880|   CUST-017|5239.92|created|        0|
|   34621|   CUST-013| 359.96|created|        0|
|   11525|   CUST-020|1599.92|created|        0|
|   64269|   CUST-002|3249.92|created|        0|
|   89777|   CUST-016|2239.95|created|        0|
|   20996|   CUST-013| 399.92|created|        0|
|   56230|   CUST-013|2199.95|created|        0|
|   24810|   CUST-002|3179.92|created|        0|
+--------+-----------+-------+-------+---------+
only showing top 10 rows


## 5. Analyzing Order Items

In [8]:
# Explode the items array (one output row per item)
df_items = df_commandes \
    .select(
        "order_id",
        "customer_id",
        "timestamp",
        explode("items").alias("item")
    ) \
    .select(
        "order_id",
        "customer_id",
        "timestamp",
        col("item.product_id"),
        col("item.product_name"),
        col("item.quantity"),
        col("item.unit_price"),
        col("item.subtotal")
    )

df_items.show(10)

+--------+-----------+--------------------+----------+------------+--------+----------+--------+
|order_id|customer_id|           timestamp|product_id|product_name|quantity|unit_price|subtotal|
+--------+-----------+--------------------+----------+------------+--------+----------+--------+
|   35924|   CUST-016|2026-01-18T10:08:...|         6|      Webcam|       3|     89.99|  269.97|
|   35924|   CUST-016|2026-01-18T10:08:...|         2|      Souris|       3|     29.99|   89.97|
|   35924|   CUST-016|2026-01-18T10:08:...|         4|       Ecran|       3|    349.99| 1049.97|
|   29067|   CUST-013|2026-01-18T10:08:...|         6|      Webcam|       2|     89.99|  179.98|
|   29067|   CUST-013|2026-01-18T10:08:...|         2|      Souris|       2|     29.99|   59.98|
|   29067|   CUST-013|2026-01-18T10:08:...|         5|      Casque|       2|    149.99|  299.98|
|   91880|   CUST-017|2026-01-18T10:08:...|         1|      Laptop|       3|    999.99| 2999.97|
|   91880|   CUST-017|2026-01-
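
The effect of `explode` can be sketched in plain Python as a nested loop that emits one flat row per (order, item) pair. The function `explode_items` and the sample orders are hypothetical. One behavior worth knowing is that `explode` produces no row for an order whose array is null or empty (`explode_outer` keeps such rows), which the sketch mirrors.

```python
def explode_items(orders):
    """Yield one flat row per (order, item) pair, similar to explode() on an array column."""
    for order in orders:
        # An order whose items list is missing or empty contributes no rows
        for item in order.get("items") or []:
            yield {
                "order_id": order["order_id"],
                "customer_id": order["customer_id"],
                **item,  # item fields become top-level columns
            }

orders = [
    {"order_id": 1, "customer_id": "CUST-001", "items": [
        {"product_name": "Webcam", "quantity": 3, "subtotal": 269.97},
        {"product_name": "Souris", "quantity": 3, "subtotal": 89.97},
    ]},
    {"order_id": 2, "customer_id": "CUST-002", "items": []},  # disappears after explode
]
rows = list(explode_items(orders))
for row in rows:
    print(row)
```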

In [9]:
# Statistics per product (sales count, total quantity, total revenue)
df_stats_produit = df_items.groupBy("product_name").agg(
    count("*").alias("nb_ventes"),
    spark_sum("quantity").alias("quantite_totale"),
    spark_sum("subtotal").alias("ca_total")
).orderBy(col("ca_total").desc())

df_stats_produit.show()

+------------+---------+---------------+------------------+
|product_name|nb_ventes|quantite_totale|          ca_total|
+------------+---------+---------------+------------------+
|      Laptop|       28|             62|          61999.38|
|       Ecran|       16|             35|12249.650000000001|
|      Casque|       27|             56|           8399.44|
|     Clavier|       20|             42|3359.5799999999995|
|      Webcam|       17|             36|3239.6400000000003|
|         SSD|       10|             17|           2209.83|
|     USB Hub|       26|             44|           1759.56|
|      Souris|       26|             50|            1499.5|
+------------+---------+---------------+------------------+
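
Values such as `12249.650000000001` in the `ca_total` column are ordinary binary floating-point noise from summing `double` values, not a data problem. The plain-Python sketch below reproduces the effect with illustrative values (35 units at 349.99) and shows two common remedies: rounding for display, or exact decimal arithmetic. In Spark, the analogous options would be wrapping the sum in `round(..., 2)` or declaring the price columns with a decimal type; which one fits is a design choice this notebook does not make.

```python
from decimal import Decimal

subtotals = ["349.99"] * 35  # illustrative values, not read from Kafka

float_total = sum(float(s) for s in subtotals)      # binary floating point
decimal_total = sum(Decimal(s) for s in subtotals)  # exact decimal arithmetic

print(float_total)            # may carry trailing binary noise
print(round(float_total, 2))  # rounded for display
print(decimal_total)          # exact
```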



## 6. Saving to MinIO

In [10]:
# Save the orders as Parquet (the nested items array is dropped here and saved separately)
df_commandes \
    .drop("items") \
    .write \
    .mode("overwrite") \
    .parquet(f"{MINIO_BUCKET}/kafka/commandes")

print("Orders saved")

Orders saved


In [11]:
# Save the items
df_items \
    .write \
    .mode("overwrite") \
    .parquet(f"{MINIO_BUCKET}/kafka/items")

print("Items saved")

Items saved


## 7. Real-Time Streaming

In [12]:
# Read in streaming mode (only messages produced after the query starts)
df_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "latest") \
    .load()

print("Stream configured")
print(f"Is streaming: {df_stream.isStreaming}")

Stream configured
Is streaming: True
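
The difference between `startingOffsets` set to `"earliest"` and `"latest"` can be modelled on a single partition as a choice of the first offset to read. The sketch below is plain Python with a hypothetical helper `records_visible`; it ignores retention and multiple partitions. Also keep in mind that this option only applies when a query starts fresh: a query resumed from a checkpoint continues from its saved offsets.

```python
def records_visible(log, starting_offsets, log_end_at_start):
    """Return the records a newly started query would process from one partition.

    log               -- list of records, where the list index is the offset
    starting_offsets  -- "earliest" or "latest"
    log_end_at_start  -- length of the log at the moment the query started
    """
    if starting_offsets == "earliest":
        first = 0                 # replay everything still in the topic
    elif starting_offsets == "latest":
        first = log_end_at_start  # only records appended after the start
    else:
        raise ValueError(f"unsupported value: {starting_offsets!r}")
    return log[first:]

log = ["order-0", "order-1", "order-2"]  # already in the topic
end_at_start = len(log)
log.append("order-3")                    # produced after the query started

earliest_view = records_visible(log, "earliest", end_at_start)
latest_view = records_visible(log, "latest", end_at_start)
print(earliest_view)
print(latest_view)
```

This is why the console sink in this section stays empty until new messages are produced from another notebook.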


In [13]:
# Parse the stream
df_stream_parsed = df_stream \
    .select(
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "timestamp"
    ) \
    .select(
        "data.order_id",
        "data.customer_id",
        "data.total",
        "data.status",
        col("timestamp").alias("kafka_time")
    )

print("Stream parsed")

Stream parsed


In [14]:
# Write to the console (for debugging)
# Note: run this cell, then send Kafka messages from another notebook

query = df_stream_parsed \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="5 seconds") \
    .start()

print("Stream started - send Kafka messages to see them")
print("Run query.stop() to stop it")

Stream started - send Kafka messages to see them
Run query.stop() to stop it


In [15]:
# Let the stream run for a while, then stop it
import time
time.sleep(30)  # Wait 30 seconds
query.stop()
print("Stream stopped")

Stream stopped


## 8. Streaming Aggregations

In [16]:
# Read the stream again (from the earliest offsets this time)
df_stream2 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse the payload; the Kafka record timestamp is used as event_time
df_agg = df_stream2 \
    .select(
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "timestamp"
    ) \
    .select(
        "data.customer_id",
        "data.total",
        col("timestamp").alias("event_time")
    )

print("Stream ready for aggregation")

Stream ready for aggregation


In [17]:
# Aggregation per 1-minute time window and per customer
df_windowed = df_agg \
    .groupBy(
        window(col("event_time"), "1 minute"),
        "customer_id"
    ) \
    .agg(
        count("*").alias("nb_commandes"),
        spark_sum("total").alias("total_commandes")
    )

print("Aggregation defined")

Aggregation defined
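
A tumbling window assigns every event to exactly one fixed-size, non-overlapping bucket. The plain-Python sketch below computes that bucket for a given timestamp, assuming windows aligned to the Unix epoch (which is how Spark aligns `window` when no start offset is given). The helper name `tumbling_window` is hypothetical, and the sample timestamp is the Kafka timestamp of the first record shown earlier, interpreted as UTC for the sake of the example.

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(event_time: datetime, size: timedelta):
    """Return the [start, end) window containing event_time, aligned to the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    index = (event_time - epoch) // size  # number of whole windows since the epoch
    start = epoch + index * size
    return start, start + size

ts = datetime(2026, 1, 18, 10, 8, 23, 143000, tzinfo=timezone.utc)
start, end = tumbling_window(ts, timedelta(minutes=1))
print(start.isoformat(), "->", end.isoformat())
```

The window start is inclusive and the end is exclusive, so an event stamped exactly on a minute boundary belongs to the window that starts there.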


In [18]:
# Write the aggregations (complete mode re-emits the whole result table at every trigger)
query_agg = df_windowed \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="10 seconds") \
    .start()

print("Aggregation stream started")

Aggregation stream started


In [19]:
# Wait, then stop
time.sleep(20)
query_agg.stop()
print("Stream stopped")

Stream stopped


## 9. Reading Application Logs

In [20]:
# Log schema
log_schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("level", StringType(), True),
    StructField("module", StringType(), True),
    StructField("message", StringType(), True),
    StructField("request_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("duration_ms", IntegerType(), True)
])

# Read the logs (batch mode) and parse the JSON payload
df_logs = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "logs-application") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(
        from_json(col("value").cast("string"), log_schema).alias("log")
    ) \
    .select("log.*")

print(f"Logs read: {df_logs.count()}")
df_logs.show(10, truncate=False)

Logs read: 15
+--------------------------+-------+--------+---------------------+----------+-------+-----------+
|timestamp                 |level  |module  |message              |request_id|user_id|duration_ms|
+--------------------------+-------+--------+---------------------+----------+-------+-----------+
|2026-01-18T10:08:42.729874|ERROR  |database|Connection closed    |req-58629 |user-3 |312        |
|2026-01-18T10:08:42.739303|INFO   |cache   |Cache hit            |req-60579 |user-95|168        |
|2026-01-18T10:08:42.744905|INFO   |auth    |Authentication failed|req-35880 |user-46|464        |
|2026-01-18T10:08:42.759420|INFO   |auth    |User login           |req-70606 |user-30|170        |
|2026-01-18T10:08:42.765970|ERROR  |api     |Invalid request      |req-20739 |user-18|221        |
|2026-01-18T10:08:42.774219|INFO   |auth    |Session expired      |req-89030 |user-39|208        |
+--------------------------+-------+--------+---------------------+----------+-------+----------

In [21]:
# Statistics per log level
df_logs.groupBy("level").count().show()

+-------+-----+
|  level|count|
+-------+-----+
|   INFO|    7|
|  ERROR|    4|
+-------+-----+



In [22]:
# Statistics per module (log count and average duration)
df_logs.groupBy("module").agg(
    count("*").alias("nb_logs"),
    avg("duration_ms").alias("duree_moyenne_ms")
).orderBy(col("nb_logs").desc()).show()

+--------+-------+----------------+
|  module|nb_logs|duree_moyenne_ms|
+--------+-------+----------------+
|    auth|      5|           211.2|
|     api|      4|           342.0|
|database|      2|           226.0|
|   cache|      2|           186.0|
| payment|      2|           198.5|
+--------+-------+----------------+



In [23]:
# Stop Spark (uncomment when you are done with the notebook)
# spark.stop()

---

## Exercise

**Objective**: Analyze the Kafka metrics

**Instructions**:
1. Read the "metrics" topic from Kafka
2. Parse the JSON messages
3. Compute the average CPU and memory usage per server
4. Identify the most heavily loaded servers

Your turn:

In [24]:
# TODO: Define the metrics schema

# Configuration
KAFKA_BROKER = "broker:29092"
TOPIC_METRICS = "metrics"

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Nested schema for the "metrics" field
metrics_inner_schema = StructType([
    StructField("cpu_percent", DoubleType()),
    StructField("memory_percent", DoubleType()),
    StructField("disk_percent", DoubleType()),
    StructField("response_time_ms", DoubleType())
])

# Top-level schema of the JSON message
schema_json = StructType([
    StructField("timestamp", StringType()),
    StructField("host", StringType()),
    StructField("metrics", metrics_inner_schema)
])

print(f"Configuration: Broker={KAFKA_BROKER} | Topic={TOPIC_METRICS}")

Configuration: Broker=broker:29092 | Topic=metrics


In [25]:
# TODO: Read the metrics topic
df_raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", TOPIC_METRICS) \
    .option("startingOffsets", "earliest") \
    .load()

# Convert the JSON (the 'value' column) into structured columns.
# Note: kafka_timestamp is selected here but dropped by the final select("data.*").
df_parsed = df_raw.select(
    from_json(col("value").cast("string"), schema_json).alias("data"),
    col("timestamp").alias("kafka_timestamp")
).select("data.*")

# Convert the timestamp string into a real timestamp for time windows
df_clean = df_parsed.withColumn("event_time", to_timestamp("timestamp"))

print("Stream initialized. Schema:")
df_clean.printSchema()

Stream initialized. Schema:
root
 |-- timestamp: string (nullable = true)
 |-- host: string (nullable = true)
 |-- metrics: struct (nullable = true)
 |    |-- cpu_percent: double (nullable = true)
 |    |-- memory_percent: double (nullable = true)
 |    |-- disk_percent: double (nullable = true)
 |    |-- response_time_ms: double (nullable = true)
 |-- event_time: timestamp (nullable = true)



In [27]:
# TODO: Compute per-server statistics

# Aggregation over a 30-second tumbling (non-overlapping) window
df_stats = df_clean \
    .withWatermark("event_time", "1 minute") \
    .groupBy(
        window("event_time", "30 seconds"),
        "host"
    ) \
    .agg(
        round(avg("metrics.cpu_percent"), 2).alias("avg_cpu"),
        round(avg("metrics.memory_percent"), 2).alias("avg_mem"),
        round(spark_max("metrics.memory_percent"), 2).alias("max_mem"),
        count("*").alias("nb_mesures")
    )

# Print the results to the console (update mode emits only the windows that changed)
query = df_stats.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .queryName("stats_serveurs") \
    .start()

import time
time.sleep(20)
query.stop()
print("Streaming finished.")

Streaming finished.
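
The watermark used in the solution bounds how late an event may arrive and still be counted. The plain-Python sketch below is a deliberately simplified model: the watermark trails the maximum event time seen so far by a fixed delay, and any event older than the watermark is discarded. The function `split_late_events` is hypothetical; Spark advances its watermark per micro-batch rather than per record and reasons about window boundaries, so treat this as an intuition aid only.

```python
from datetime import datetime, timedelta

def split_late_events(events, delay):
    """Partition event times into (accepted, dropped) using a simple watermark.

    The watermark is (max event time seen so far) - delay; an event strictly
    older than the current watermark is treated as too late.
    """
    max_seen = None
    accepted, dropped = [], []
    for event_time in events:
        watermark = max_seen - delay if max_seen is not None else None
        if watermark is not None and event_time < watermark:
            dropped.append(event_time)
        else:
            accepted.append(event_time)
            max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return accepted, dropped

base = datetime(2026, 1, 18, 10, 0, 0)
events = [
    base,
    base + timedelta(seconds=90),
    base + timedelta(seconds=45),  # 45 s behind the max: within the 1-minute delay
    base + timedelta(seconds=10),  # 80 s behind the max: beyond the delay
]
accepted, dropped = split_late_events(events, timedelta(minutes=1))
print(len(accepted), "accepted,", len(dropped), "dropped")
```

A longer delay tolerates more out-of-order data at the cost of keeping window state in memory for longer.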


---

## Summary

In this notebook, you learned:
- How to **read Kafka messages** with Spark
- How to **parse JSON messages** with a schema
- How to use **batch mode** and **streaming mode**
- How to run **streaming aggregations** over time windows
- How to **save data** to MinIO

### Next step
The next notebook goes deeper into Spark streaming and covers advanced concepts.