# Exercise 17 - Kafka Consumer with Spark

## Objectives
- Read Kafka messages with Spark Structured Streaming
- Parse JSON messages
- Process data in streaming mode
- Write the results to MinIO

---

## 1. Spark Consumer Architecture

```
+------------------------------------------------------------------+
|               SPARK STRUCTURED STREAMING + KAFKA                 |
+------------------------------------------------------------------+
|                                                                  |
|   KAFKA                  SPARK                   DESTINATION    |
|                                                                  |
|  +--------+         +-------------+            +----------+     |
|  | Topic  |         |             |            |          |     |
|  |--------|  read   |  DataFrame  |   write    |  MinIO   |     |
|  | Part 0 |-------->|  Streaming  |----------->|  Parquet |     |
|  | Part 1 |         |             |            |          |     |
|  | Part 2 |         +------+------+            +----------+     |
|  +--------+                |                                    |
|                            v                                    |
|                     +------+------+            +----------+     |
|                     |             |            |          |     |
|                     | Aggregation |----------->|  Console |     |
|                     |             |            |          |     |
|                     +-------------+            +----------+     |
|                                                                  |
+------------------------------------------------------------------+

Output modes:
- append   : Emits only the rows added since the last trigger
- complete : Rewrites the entire result table (aggregations only)
- update   : Emits only the rows changed since the last trigger
```
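
To make the three output modes concrete, here is a minimal plain-Python sketch (no Spark involved) that imitates what a sink would receive after each micro-batch of a running count per key. The `emit` function and the sample batches are hypothetical illustrations, not Spark APIs.

```python
from collections import Counter

def emit(mode, state, batch):
    """Return what a sink would receive for one micro-batch of keys.

    state is the running count per key (the "result table") and is
    updated in place. This only imitates the semantics of the modes.
    """
    if mode == "append":
        # No aggregation: each new input row is emitted exactly once.
        return list(batch)
    changed = Counter(batch)
    state.update(changed)
    if mode == "complete":
        # The whole result table is rewritten on every trigger.
        return dict(state)
    if mode == "update":
        # Only keys whose count changed in this batch are emitted.
        return {key: state[key] for key in changed}
    raise ValueError(f"unknown output mode: {mode}")

batches = [["alice", "bob"], ["alice"]]  # two hypothetical micro-batches

complete_state, update_state = Counter(), Counter()
complete_out = [emit("complete", complete_state, b) for b in batches]
update_out = [emit("update", update_state, b) for b in batches]
append_out = [emit("append", Counter(), b) for b in batches]

print(complete_out)
print(update_out)
print(append_out)
```

After the second batch, `complete` still includes `bob` even though nothing changed for that key, while `update` only reports `alice`.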

## 2. Spark Configuration

In [1]:
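import os
from pyspark.sql import SparkSession
# Functions and types used by the cells below
from pyspark.sql.functions import (
    avg, col, count, explode, from_json, window,
    sum as spark_sum,
)
from pyspark.sql.types import (
    ArrayType, DoubleType, IntegerType,
    StringType, StructField, StructType,
)

SPARK_VER = "4.0.1"
SCALA_SUFFIX = "2.13"

# Kafka connector and S3A (MinIO) dependencies
packages = ",".join([
    f"org.apache.spark:spark-sql-kafka-0-10_{SCALA_SUFFIX}:{SPARK_VER}",
    f"org.apache.spark:spark-token-provider-kafka-0-10_{SCALA_SUFFIX}:{SPARK_VER}",
    "org.apache.hadoop:hadoop-aws:3.4.1",
    "com.amazonaws:aws-java-sdk-bundle:1.12.262"
])

# Must be set before the first SparkSession (and its JVM) is created
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {packages} pyspark-shell"

spark = (
    SparkSession.builder
    .appName("KafkaConsumer")
    .getOrCreate()
)

spark.sparkContext.setLogLevel("WARN")
print("Spark:", spark.version)
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionNumberString())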


In [2]:
# Configuration
KAFKA_BROKER = "broker:29092"
MINIO_BUCKET = "s3a://bronze"

print("Configuration ready")

Configuration ready
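
The notebook writes to `s3a://bronze`, which only works if the S3A connector knows where MinIO is running. If your environment does not already provide this (for instance through `spark-defaults.conf`), a sketch like the following may help. The endpoint `http://minio:9000` and the `minioadmin` credentials are assumptions, not values taken from this course, and `s3a_options` is a hypothetical helper; the `fs.s3a.*` keys themselves are standard Hadoop S3A settings.

```python
def s3a_options(endpoint, access_key, secret_key):
    """Build Spark configuration entries for an S3-compatible store."""
    use_ssl = endpoint.startswith("https")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed as host/bucket rather than bucket.host
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }

# Assumed values: replace them with those of your own MinIO instance.
options = s3a_options("http://minio:9000", "minioadmin", "minioadmin")

# Applying them while building the session would look like this:
# builder = SparkSession.builder.appName("KafkaConsumer")
# for key, value in options.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
for key, value in options.items():
    print(key, "=", "***" if "secret" in key else value)
```

These options only take effect if they are set before the session is created, so they belong in the same cell as the `SparkSession.builder` call.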


## 3. Batch Read from Kafka

In [3]:
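# Read the existing messages (batch mode).
# The topic must already exist on the broker; otherwise count() fails with
# UnknownTopicOrPartitionException, as in the output below.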
df_kafka = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "earliest") \
    .load()

print(f"Messages read: {df_kafka.count()}")
df_kafka.printSchema()

Py4JJavaError: An error occurred while calling o36.count.
: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.


In [None]:
# Structure of a Kafka record
# - key: message key (binary)
# - value: message payload (binary)
# - topic: topic name
# - partition: partition number
# - offset: position within the partition
# - timestamp: Kafka timestamp

df_kafka.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    "topic",
    "partition",
    "offset",
    "timestamp"
).show(5, truncate=50)
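
Kafka stores keys and values as raw bytes, which is why the cell above casts them to strings before displaying them. The plain-Python sketch below shows the same idea on a single hand-made record; the record contents are invented for illustration.

```python
import json

# A hypothetical record as a consumer would receive it: raw bytes.
record = {
    "key": b"C001",
    "value": b'{"order_id": 1, "customer_id": "C001", "total": 59.9}',
    "partition": 0,
    "offset": 42,
}

# Decoding the bytes plays the same role as cast("string") in Spark.
key = record["key"].decode("utf-8")
# Parsing the decoded text plays the same role as from_json.
payload = json.loads(record["value"].decode("utf-8"))

print(key, payload["order_id"], payload["total"])
```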

## 4. Parsing JSON Messages

In [None]:
# Define the order schema
item_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("subtotal", DoubleType(), True)
])

commande_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", StringType(), True),
    StructField("timestamp", StringType(), True),
    StructField("items", ArrayType(item_schema), True),
    StructField("total", DoubleType(), True),
    StructField("status", StringType(), True)
])

print("Schema defined")

In [None]:
# Parse the JSON
df_commandes = df_kafka \
    .select(
        col("key").cast("string").alias("customer_key"),
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "partition",
        "offset",
        col("timestamp").alias("kafka_timestamp")
    ) \
    .select(
        "customer_key",
        "data.*",
        "partition",
        "offset",
        "kafka_timestamp"
    )

df_commandes.printSchema()
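
Parsing against a fixed schema is forgiving: with the default options, messages that cannot be parsed typically surface as null columns instead of failing the query, missing keys become null, and keys absent from the schema are ignored. The plain-Python sketch below imitates that behaviour for intuition only; `parse_order` is a hypothetical helper and does not reproduce every detail of `from_json`.

```python
import json

EXPECTED_FIELDS = ("order_id", "customer_id", "timestamp", "items", "total", "status")

def parse_order(raw):
    """Return a dict limited to the expected fields, or None if raw is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Missing fields become None, extra fields are dropped.
    return {name: data.get(name) for name in EXPECTED_FIELDS}

good = parse_order('{"order_id": 7, "customer_id": "C007", "total": 12.5, "extra": 1}')
bad = parse_order("not json at all")

print(good)
print(bad)
```

In the notebook, counting the rows where `order_id` is null is a quick way to spot messages that did not match the schema.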

In [None]:
# Display the orders
df_commandes.select(
    "order_id",
    "customer_id",
    "total",
    "status",
    "partition"
).show(10)

## 5. Analyzing Order Items

In [None]:
# Explode the items array
df_items = df_commandes \
    .select(
        "order_id",
        "customer_id",
        "timestamp",
        explode("items").alias("item")
    ) \
    .select(
        "order_id",
        "customer_id",
        "timestamp",
        col("item.product_id"),
        col("item.product_name"),
        col("item.quantity"),
        col("item.unit_price"),
        col("item.subtotal")
    )

df_items.show(10)

In [None]:
# Statistics per product
df_stats_produit = df_items.groupBy("product_name").agg(
    count("*").alias("nb_ventes"),
    spark_sum("quantity").alias("quantite_totale"),
    spark_sum("subtotal").alias("ca_total")
).orderBy(col("ca_total").desc())

df_stats_produit.show()
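
If `explode` followed by `groupBy` feels abstract, the same two steps can be sketched in plain Python on a couple of made-up orders. The sample data is hypothetical; the aggregate names mirror the aliases used above.

```python
from collections import defaultdict

orders = [  # hypothetical sample data
    {"order_id": 1, "items": [
        {"product_name": "Keyboard", "quantity": 1, "subtotal": 40.0},
        {"product_name": "Mouse", "quantity": 2, "subtotal": 30.0},
    ]},
    {"order_id": 2, "items": [
        {"product_name": "Mouse", "quantity": 1, "subtotal": 15.0},
    ]},
]

# "Explode": one output row per element of the items array.
rows = [
    {"order_id": order["order_id"], **item}
    for order in orders
    for item in order["items"]
]

# Group by product, like groupBy("product_name").agg(...).
stats = defaultdict(lambda: {"nb_ventes": 0, "quantite_totale": 0, "ca_total": 0.0})
for row in rows:
    entry = stats[row["product_name"]]
    entry["nb_ventes"] += 1
    entry["quantite_totale"] += row["quantity"]
    entry["ca_total"] += row["subtotal"]

print(len(rows))
print(dict(stats))
```

Two orders become three item rows, and the two `Mouse` rows collapse into a single aggregated line.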

## 6. Saving to MinIO

In [None]:
# Save the orders as Parquet (the nested items are saved separately)
df_commandes \
    .drop("items") \
    .write \
    .mode("overwrite") \
    .parquet(f"{MINIO_BUCKET}/kafka/commandes")

print("Orders saved")

In [None]:
# Save the items
df_items \
    .write \
    .mode("overwrite") \
    .parquet(f"{MINIO_BUCKET}/kafka/items")

print("Items saved")

## 7. Real-Time Streaming

In [None]:
# Read in streaming mode
df_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "latest") \
    .load()

print("Stream configured")
print(f"Is streaming: {df_stream.isStreaming}")

In [None]:
# Parse the stream
df_stream_parsed = df_stream \
    .select(
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "timestamp"
    ) \
    .select(
        "data.order_id",
        "data.customer_id",
        "data.total",
        "data.status",
        col("timestamp").alias("kafka_time")
    )

print("Stream parsed")

In [None]:
# Write to the console (for debugging)
# Note: run this cell, then send Kafka messages from another notebook

query = df_stream_parsed \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="5 seconds") \
    .start()

print("Stream started - send Kafka messages to see them")
print("Run query.stop() to stop it")

In [None]:
# Let the stream run for a while, then stop it
import time
time.sleep(30)  # Wait 30 seconds
query.stop()
print("Stream stopped")

## 8. Streaming Aggregations

In [None]:
# Read the stream again
df_stream2 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "commandes-json") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse
df_agg = df_stream2 \
    .select(
        from_json(col("value").cast("string"), commande_schema).alias("data"),
        "timestamp"
    ) \
    .select(
        "data.customer_id",
        "data.total",
        col("timestamp").alias("event_time")
    )

print("Stream ready for aggregation")

In [None]:
# Aggregation over time windows
df_windowed = df_agg \
    .groupBy(
        window(col("event_time"), "1 minute"),
        "customer_id"
    ) \
    .agg(
        count("*").alias("nb_commandes"),
        spark_sum("total").alias("total_commandes")
    )

print("Aggregation defined")
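
A one-argument `window(..., "1 minute")` defines tumbling windows: fixed, non-overlapping one-minute buckets aligned on the epoch, so each event falls into exactly one of them. The plain-Python sketch below shows how an event time maps to its bucket; `tumbling_window` is a hypothetical helper and the sample event is invented.

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(event_time, size_seconds=60):
    """Return the [start, end) window containing event_time (epoch-aligned)."""
    epoch_seconds = int(event_time.timestamp())
    # Round down to the nearest multiple of the window size.
    start_seconds = epoch_seconds - (epoch_seconds % size_seconds)
    start = datetime.fromtimestamp(start_seconds, tz=timezone.utc)
    return start, start + timedelta(seconds=size_seconds)

event = datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc)  # hypothetical event
start, end = tumbling_window(event)
print(start.isoformat(), end.isoformat())
```

Note that in this notebook `event_time` is the Kafka record timestamp, not the `timestamp` field carried inside the JSON payload.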

In [None]:
# Write the aggregations
query_agg = df_windowed \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="10 seconds") \
    .start()

print("Aggregation stream started")

In [None]:
# Wait, then stop
time.sleep(20)
query_agg.stop()
print("Stream stopped")

## 9. Reading Application Logs

In [None]:
# Log schema
log_schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("level", StringType(), True),
    StructField("module", StringType(), True),
    StructField("message", StringType(), True),
    StructField("request_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("duration_ms", IntegerType(), True)
])

# Read the logs
df_logs = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "logs-application") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(
        from_json(col("value").cast("string"), log_schema).alias("log")
    ) \
    .select("log.*")

print(f"Logs read: {df_logs.count()}")
df_logs.show(10, truncate=False)

In [None]:
# Statistics per log level
df_logs.groupBy("level").count().show()

In [None]:
# Statistics per module
df_logs.groupBy("module").agg(
    count("*").alias("nb_logs"),
    avg("duration_ms").alias("duree_moyenne_ms")
).orderBy(col("nb_logs").desc()).show()

In [None]:
# Stop Spark
# spark.stop()

---

## Exercise

**Goal**: Analyze the Kafka metrics

**Instructions**:
1. Read the "metrics" topic from Kafka
2. Parse the JSON messages
3. Compute the average CPU and memory usage per server
4. Identify the most heavily loaded servers

Your turn:

In [None]:
# TODO: Define the metrics schema

In [None]:
# TODO: Read the metrics topic

In [None]:
# TODO: Compute the statistics per server

---

## Summary

In this notebook, you learned:
- How to **read Kafka messages** with Spark
- How to **parse JSON messages** with a schema
- How to use **batch mode** and **streaming mode**
- How to run **streaming aggregations** over time windows
- How to **save the data** to MinIO

### Next step
In the next notebook, we will go deeper into Spark streaming with advanced concepts.