# Streaming: Real-Time Statistics Computation

**Objective:** Process the Kafka event stream and compute statistics in real time.

**Statistics to compute:**
- Minimum, maximum, mean, and variance (reported in the code as standard deviation) of:
  - Passes per team
  - Distance covered (computed here as pass distance)
  - Average positions
  - Possession (approximated by event count)

**Architecture:** Spark Structured Streaming + RAPIDS GPU
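
As a reference for what these statistics mean, the following is a minimal pure-Python sketch of how minimum, maximum, mean, and variance can be maintained one value at a time (Welford's online algorithm). It is an illustration only: the notebook itself relies on Spark's built-in aggregates, and the class name `RunningStats` and the sample values are hypothetical.

```python
import math


class RunningStats:
    """Running min, max, mean, and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        # Sample variance (divides by n - 1); undefined for fewer than 2 values
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

    @property
    def stddev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for distance in [5.0, 10.0, 15.0]:  # made-up pass distances
    stats.update(distance)
print(stats.minimum, stats.maximum, stats.mean, stats.variance, stats.stddev)
```

The update touches each value once and keeps only a handful of numbers, which is the property that makes this kind of aggregate suitable for unbounded streams.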

## 1. Setup and Configuration

In [1]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

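from pyspark.sql import SparkSession
# Note: this wildcard import shadows Python built-ins such as sum, min, max,
# and pow with their Spark column equivalents; the aggregation cells below
# rely on the Spark versions.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime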

print(f"Setup complete - {datetime.now()}")

Setup complete - 2025-11-06 18:17:30.130333


## 2. Initialize the Spark Session with GPU and Kafka

In [2]:
# Initialize Spark Session with Kafka support
# Note: GPU/CPU configuration is controlled by spark-defaults.conf
# This allows easy switching between GPU and CPU architectures
spark = SparkSession.builder \
    .appName("StatsBomb-Streaming-Statistics") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint-stats") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print(f"✓ Spark Version: {spark.version}")
print(f"✓ Spark Master: {spark.sparkContext.master}")
print(f"✓ Spark UI: http://localhost:4040")
print(f"\nSpark Configuration:")
print(f"  RAPIDS enabled: {spark.conf.get('spark.rapids.sql.enabled', 'false')}")
print(f"  GPU resources: {spark.conf.get('spark.executor.resource.gpu.amount', 'none')}")
print("\n📊 Monitor the Spark UI to capture performance metrics:")
print("   - Job Time")
print("   - Shuffle Read/Write")
print("   - I/O")
print("   - Scheduler Delay")
print("   - Spill (Memory/Disk)")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-bb6df69a-8b2e-4214-a485-92d2a6289c5d;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central
	found org.apache.kafka#kafka-clients;2.8.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.32 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.2 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.2 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.apache.commons#commons-pool2;2.11.1 in central
:: resolution report

25/11/06 18:17:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/11/06 18:17:34 WARN ResourceUtils: The configuration of resource: gpu (exec = 1, task = 0.1/10, runnable tasks = 10) will result in wasted resources due to resource cpus limiting the number of runnable tasks per executor to: 4. Please adjust your configuration.
25/11/06 18:17:35 WARN RapidsPluginUtils: RAPIDS Accelerator 23.12.2 using cudf 23.12.1.
25/11/06 18:17:35 WARN RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
25/11/06 18:17:35 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
25/11/06 18:17:35 WARN RapidsPluginUtils: spark.rapids.sql.explain is set to `ALL`. Set it to 'NONE' to suppress the diagnostics logging about the query placement on the GPU.
✓ Spark Version: 3.3.1
✓ Spark Master: spark://spark-master:7077
✓ Spark UI: http://localhost:4040

Spark Configuration:
  RAPIDS enabled: true
  GPU resources: 1

📊 Monitor the Spark UI to capture performance metrics:
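
The `ResourceUtils` warning in the log above comes from simple slot arithmetic: the number of tasks an executor can run concurrently is the smaller of its CPU-based and GPU-based limits. The sketch below reproduces that calculation. The 4 cores and 1 CPU per task are inferred from the warning text rather than read from the actual `spark-defaults.conf`, and the function name `runnable_tasks` is hypothetical.

```python
def runnable_tasks(executor_cores, task_cpus, executor_gpus, task_gpu_amount):
    """Return (cpu_slots, gpu_slots, effective concurrent tasks) per executor."""
    cpu_slots = executor_cores // task_cpus
    # The small epsilon guards against float rounding (e.g. 1 / 0.1)
    gpu_slots = int(executor_gpus / task_gpu_amount + 1e-9)
    return cpu_slots, gpu_slots, min(cpu_slots, gpu_slots)


# Values taken from the warning: gpu exec = 1, task = 0.1, CPU limit = 4
cpu_slots, gpu_slots, effective = runnable_tasks(4, 1, 1, 0.1)
print(cpu_slots, gpu_slots, effective)

# A task GPU amount of 0.25 would line the GPU slots up with the 4 CPU slots
print(runnable_tasks(4, 1, 1, 0.25))
```

Under these assumptions, setting `spark.task.resource.gpu.amount` to 0.25 is one way to make the two limits agree; whether that changes throughput depends on the workload.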


## 3. Define the StatsBomb Event Schema

In [3]:
# Define schema for StatsBomb events
event_schema = StructType([
    StructField("event", StructType([
        StructField("id", StringType(), True),
        StructField("index", IntegerType(), True),
        StructField("period", IntegerType(), True),
        StructField("timestamp", StringType(), True),
        StructField("minute", IntegerType(), True),
        StructField("second", IntegerType(), True),
        StructField("type", StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True)
        ]), True),
        StructField("team", StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True)
        ]), True),
        StructField("player", StructType([
            StructField("id", IntegerType(), True),
            StructField("name", StringType(), True)
        ]), True),
        StructField("location", ArrayType(DoubleType()), True),
        StructField("pass_end_location", ArrayType(DoubleType()), True),
        StructField("under_pressure", BooleanType(), True),
    ]), True),
    StructField("metadata", StructType([
        StructField("producer_timestamp", StringType(), True),
        StructField("producer_id", StringType(), True)
    ]), True)
])

print("✓ Schema defined")
print("\nSchema structure:")
print(event_schema.simpleString())

✓ Schema defined

Schema structure:
struct<event:struct<id:string,index:int,period:int,timestamp:string,minute:int,second:int,type:struct<id:int,name:string>,team:struct<id:int,name:string>,player:struct<id:int,name:string>,location:array<double>,pass_end_location:array<double>,under_pressure:boolean>,metadata:struct<producer_timestamp:string,producer_id:string>>
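
For orientation, a Kafka message value matching this schema could look like the following. The sketch uses only the standard library to build and inspect one such message; every value in it (IDs, names, coordinates, timestamps) is invented for illustration and does not come from the StatsBomb data.

```python
import json

# Hypothetical message value; field names follow event_schema, values are made up
sample_value = json.dumps({
    "event": {
        "id": "00000000-0000-0000-0000-000000000000",
        "index": 1,
        "period": 1,
        "timestamp": "00:00:01.000",
        "minute": 0,
        "second": 1,
        "type": {"id": 30, "name": "Pass"},
        "team": {"id": 1, "name": "Team A"},
        "player": {"id": 10, "name": "Player X"},
        "location": [60.0, 40.0],
        "pass_end_location": [63.0, 44.0],
        "under_pressure": False,
    },
    "metadata": {
        "producer_timestamp": "2025-11-06T18:17:30",
        "producer_id": "producer-1",
    },
})

record = json.loads(sample_value)
# Mirror the flattening done in section 4: event fields first, then metadata
flat = {**record["event"], **record["metadata"]}
print(sorted(flat))
print(flat["type"]["name"], flat["location"])
```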


## 4. Connect to the Kafka Stream

In [4]:
# Kafka configuration
KAFKA_BOOTSTRAP_SERVERS = "kafka:9092"
KAFKA_TOPIC = "statsbomb-360-events"

# Read from Kafka
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load()

print(f"✓ Connected to Kafka: {KAFKA_BOOTSTRAP_SERVERS}")
print(f"✓ Topic: {KAFKA_TOPIC}")
print(f"✓ Stream created")

# Parse JSON from Kafka value (malformed JSON values parse to null)
parsed_stream = kafka_stream.select(
    col("timestamp").alias("kafka_timestamp"),
    from_json(col("value").cast("string"), event_schema).alias("data")
).select(
    "kafka_timestamp",
    "data.event.*",
    "data.metadata.*"
)

print("✓ Stream parsed")
print("\nAvailable columns:")
print(parsed_stream.columns)

✓ Connected to Kafka: kafka:9092
✓ Topic: statsbomb-360-events
✓ Stream created
✓ Stream parsed

Available columns:
['kafka_timestamp', 'id', 'index', 'period', 'timestamp', 'minute', 'second', 'type', 'team', 'player', 'location', 'pass_end_location', 'under_pressure', 'producer_timestamp', 'producer_id']


## 5. Event Processing - Feature Extraction

In [5]:
# Extract relevant features from events
events_with_features = parsed_stream \
    .withColumn("event_time", col("kafka_timestamp")) \
    .withColumn("event_type", col("type.name")) \
    .withColumn("team_name", col("team.name")) \
    .withColumn("player_name", col("player.name")) \
    .withColumn("location_x", col("location").getItem(0)) \
    .withColumn("location_y", col("location").getItem(1)) \
    .withColumn("is_pass", when(col("event_type") == "Pass", 1).otherwise(0)) \
    .withColumn("is_shot", when(col("event_type") == "Shot", 1).otherwise(0)) \
    .withColumn("is_dribble", when(col("event_type") == "Dribble", 1).otherwise(0)) \
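    .withColumn(
        "pass_distance",
        # Euclidean pass length. Events without an end location stay null so
        # that min/avg/stddev aggregate over passes only, instead of being
        # pulled toward 0 by non-pass events (sum is null if a window has no passes).
        when(
            col("pass_end_location").isNotNull(),
            sqrt(
                pow(col("pass_end_location").getItem(0) - col("location_x"), 2) +
                pow(col("pass_end_location").getItem(1) - col("location_y"), 2)
            )
        )
    )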

print("✓ Features extracted from events")
print("\nComputed features:")
print("  - event_type: Event type")
print("  - team_name: Team")
print("  - location_x, location_y: Position on the pitch")
print("  - is_pass, is_shot, is_dribble: Event-type flags")
print("  - pass_distance: Pass distance")

✓ Features extracted from events

Computed features:
  - event_type: Event type
  - team_name: Team
  - location_x, location_y: Position on the pitch
  - is_pass, is_shot, is_dribble: Event-type flags
  - pass_distance: Pass distance
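
The `pass_distance` expression is the Euclidean distance between the start and end coordinates of a pass. A plain-Python equivalent with made-up coordinates is sketched below. The helper name mirrors the column but is otherwise hypothetical, and returning `None` for events without an end location is a choice made in this sketch so that such events can be skipped when averaging.

```python
import math


def pass_distance(location, pass_end_location):
    """Euclidean distance between two [x, y] points, or None if no end point."""
    if location is None or pass_end_location is None:
        return None
    dx = pass_end_location[0] - location[0]
    dy = pass_end_location[1] - location[1]
    return math.hypot(dx, dy)


print(pass_distance([60.0, 40.0], [63.0, 44.0]))  # 3-4-5 triangle -> 5.0
print(pass_distance([60.0, 40.0], None))          # no end location -> None
```

Distances come out in whatever units the event coordinates use.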


## 6. Aggregations over Time Windows (1 minute)

In [6]:
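# Define 1-minute tumbling windows per team.
# The 30-second watermark bounds how late an event may arrive; note that in
# "complete" output mode (used in section 7) Spark still keeps all window state.
# The final orderBy is only allowed on a streaming aggregation in complete mode.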
windowed_stats = events_with_features \
    .withWatermark("event_time", "30 seconds") \
    .groupBy(
        window(col("event_time"), "1 minute"),
        col("team_name")
    ) \
    .agg(
        # Passes (is_pass is a per-event 0/1 flag, so avg_passes is the share
        # of events in the window that are passes)
        sum("is_pass").alias("total_passes"),
        min("is_pass").alias("min_passes"),
        max("is_pass").alias("max_passes"),
        avg("is_pass").alias("avg_passes"),
        stddev("is_pass").alias("stddev_passes"),

        # Pass distance
        sum("pass_distance").alias("total_pass_distance"),
        min("pass_distance").alias("min_pass_distance"),
        max("pass_distance").alias("max_pass_distance"),
        avg("pass_distance").alias("avg_pass_distance"),
        stddev("pass_distance").alias("stddev_pass_distance"),

        # Average positions
        avg("location_x").alias("avg_position_x"),
        avg("location_y").alias("avg_position_y"),
        stddev("location_x").alias("stddev_position_x"),
        stddev("location_y").alias("stddev_position_y"),

        # Shots and dribbles
        sum("is_shot").alias("total_shots"),
        sum("is_dribble").alias("total_dribbles"),

        # Total events (proxy for possession)
        count("*").alias("total_events")
    ) \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        "team_name",
        "total_passes",
        "min_passes",
        "max_passes",
        "avg_passes",
        "stddev_passes",
        "total_pass_distance",
        "min_pass_distance",
        "max_pass_distance",
        "avg_pass_distance",
        "stddev_pass_distance",
        "avg_position_x",
        "avg_position_y",
        "stddev_position_x",
        "stddev_position_y",
        "total_shots",
        "total_dribbles",
        "total_events"
    ) \
    .orderBy("window_start", "team_name")

print("✓ Aggregations configured")
print("✓ Windows: 1 minute (tumbling window)")
print("✓ Watermark: 30 seconds")

✓ Aggregations configured
✓ Windows: 1 minute (tumbling window)
✓ Watermark: 30 seconds
25/11/06 18:18:16 ERROR TaskSchedulerImpl: Lost executor 0 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
25/11/06 19:14:37 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 3378133 ms exceeds timeout 120000 ms
25/11/06 19:14:37 ERROR TaskSchedulerImpl: Lost executor 1 on 172.19.0.6: Executor heartbeat timed out after 3378133 ms
25/11/06 19:15:20 ERROR TaskSchedulerImpl: Lost executor 2 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
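
To make the windowing concrete: a 1-minute tumbling window assigns each event to the interval that starts at its timestamp floored to the minute, and the watermark trails the latest event time seen by 30 seconds. The sketch below reproduces that arithmetic in plain Python with invented timestamps. It illustrates the concept only and is not how Spark implements it internally; the function name `window_bounds` is hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)
WATERMARK_DELAY = timedelta(seconds=30)


def window_bounds(event_time, window=WINDOW):
    """Return the [start, end) tumbling window containing event_time."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % window  # time elapsed since window start
    start = event_time - offset
    return start, start + window


# Invented event times
events = [
    datetime(2025, 11, 6, 18, 17, 5),
    datetime(2025, 11, 6, 18, 17, 59),
    datetime(2025, 11, 6, 18, 18, 10),
]
for t in events:
    print(t.time(), "->", window_bounds(t))

# The watermark trails the maximum event time seen so far
watermark = max(events) - WATERMARK_DELAY
print("watermark:", watermark)
```

The first two events fall into the same window and the third opens the next one, which is why each team yields one output row per minute.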


## 7. Start the Stream - Console Output

In [7]:
# Start streaming query with console output
query = windowed_stats.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .option("numRows", 50) \
    .trigger(processingTime="10 seconds") \
    .start()

print("="*80)
print("STREAMING ACTIVE - REAL-TIME STATISTICS")
print("="*80)
print(f"Query ID: {query.id}")
print(f"Query Name: {query.name}")
print(f"Status: {query.status}")
print("\n📊 METRICS CAPTURE:")
print("   1. Open the Spark UI at http://localhost:4040")
print("   2. Go to the 'Streaming' tab")
print("   3. Capture the following metrics:")
print("      - Input Rate (events/second)")
print("      - Process Rate (events/second)")
print("      - Batch Duration")
print("      - Scheduling Delay")
print("   4. Go to the 'Jobs' tab for detailed metrics:")
print("      - Job Time")
print("      - Shuffle Read/Write")
print("      - Spill (Memory/Disk)")
print("="*80)
print("\n⚠️  Press the STOP button on the cell to stop streaming")
print("="*80)

25/11/06 18:09:34 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
STREAMING ACTIVE - REAL-TIME STATISTICS
Query ID: c32852ab-e3a7-4e33-bc09-122fd9205586
Query Name: None
Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}

📊 METRICS CAPTURE:
   1. Open the Spark UI at http://localhost:4040
   2. Go to the 'Streaming' tab
   3. Capture the following metrics:
      - Input Rate (events/second)
      - Process Rate (events/second)
      - Batch Duration
      - Scheduling Delay
   4. Go to the 'Jobs' tab for detailed metrics:
      - Job Time
      - Shuffle Read/Write
      - Spill (Memory/Disk)

⚠️  Press the STOP button on the cell to stop streaming
25/11/06 18:09:36 WARN GpuOverrides: 
! <WriteToDataSourceV2Exec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.dat

[Stage 0:>                                                          (0 + 0) / 1]

25/11/06 18:10:10 ERROR TaskSchedulerImpl: Lost executor 2 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.



25/11/06 18:10:49 ERROR TaskSchedulerImpl: Lost executor 3 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
25/11/06 18:11:28 ERROR TaskSchedulerImpl: Lost executor 4 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.



25/11/06 18:12:07 ERROR TaskSchedulerImpl: Lost executor 5 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.



25/11/06 18:12:46 ERROR TaskSchedulerImpl: Lost executor 6 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
25/11/06 18:13:25 ERROR TaskSchedulerImpl: Lost executor 7 on 172.19.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.



In [None]:
# Monitor query status
import time

try:
    while query.isActive:
        print(f"\n[{datetime.now().strftime('%H:%M:%S')}] Query Status: {query.status}")
        print("Recent Progress:")

        # recentProgress is a list of dicts; the last entry is the newest batch
        if query.recentProgress:
            latest = query.recentProgress[-1]
            rate = latest.get('processedRowsPerSecond')
            # Compare against None so a legitimate rate of 0.0 is still shown
            rate_text = f"{rate:.2f} rows/sec" if rate is not None else "N/A"
            print(f"  - Batch ID: {latest.get('batchId', 'N/A')}")
            print(f"  - Input Rows: {latest.get('numInputRows', 'N/A')}")
            print(f"  - Process Rate: {rate_text}")
            print(f"  - Duration: {latest.get('durationMs', {}).get('triggerExecution', 'N/A')} ms")

        time.sleep(10)

except KeyboardInterrupt:
    print("\n⚠️  Stopping streaming...")
    query.stop()
    print("✓ Streaming stopped")
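
The same fields can be pulled out by a small helper that works on any progress dictionary, which makes the formatting easy to check without a running query. The sketch below assumes the dictionary keys used in the monitoring cell (`batchId`, `numInputRows`, `processedRowsPerSecond`, and `triggerExecution` under `durationMs`); the helper name `summarize_progress` and the sample values are hypothetical.

```python
def summarize_progress(progress):
    """Format the key fields of one streaming progress dict as a single line."""
    rate = progress.get("processedRowsPerSecond")
    rate_text = f"{rate:.2f} rows/sec" if rate is not None else "N/A"
    duration = progress.get("durationMs", {}).get("triggerExecution", "N/A")
    return (
        f"batch={progress.get('batchId', 'N/A')} "
        f"rows={progress.get('numInputRows', 'N/A')} "
        f"rate={rate_text} "
        f"duration={duration} ms"
    )


# Invented sample, shaped like an entry of query.recentProgress
sample = {
    "batchId": 3,
    "numInputRows": 1200,
    "processedRowsPerSecond": 240.0,
    "durationMs": {"triggerExecution": 5000},
}
print(summarize_progress(sample))
print(summarize_progress({}))  # missing fields fall back to N/A
```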

## 8. Stop the Stream (run when needed)

In [None]:
# Stop the streaming query
if query.isActive:
    print("Stopping streaming query...")
    query.stop()
    query.awaitTermination(timeout=30)
    print("✓ Query stopped")
else:
    print("Query is not active")

## 9. Summary of Metrics to Capture

In [None]:
print("="*80)
print("METRICS TO CAPTURE FROM SPARK UI (localhost:4040)")
print("="*80)
print("\n1. STREAMING TAB:")
print("   ✓ Input Rate (events/second)")
print("   ✓ Process Rate (events/second)")
print("   ✓ Batch Duration (ms)")
print("   ✓ Operation Duration (ms)")
print("   ✓ Scheduling Delay (ms)")
print("\n2. JOBS TAB:")
print("   ✓ Job Duration")
print("   ✓ Shuffle Read Size")
print("   ✓ Shuffle Write Size")
print("   ✓ Spill (Memory)")
print("   ✓ Spill (Disk)")
print("\n3. STAGES TAB:")
print("   ✓ Stage Duration")
print("   ✓ Task Deserialization Time")
print("   ✓ GC Time")
print("   ✓ Input Size / Records")
print("   ✓ Output Size / Records")
print("\n4. EXECUTORS TAB:")
print("   ✓ Storage Memory Used")
print("   ✓ Disk Used")
print("   ✓ Total Tasks")
print("   ✓ Task Time (GC Time)")
print("="*80)
print("\n💡 ARCHITECTURE COMPARISON:")
print("   For architecture 2 (CPU), your teammate should:")
print("   1. Comment out the RAPIDS lines in spark-defaults.conf")
print("   2. Remove the GPU assignment in docker-compose.yml")
print("   3. Re-run this same notebook")
print("   4. Compare GPU vs CPU metrics")
print("="*80)
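
Once both runs are recorded, the GPU vs CPU comparison reduces to a ratio per metric. A minimal sketch follows; the function name, the metric names, and all numbers are placeholders chosen for illustration, not measurements from this notebook.

```python
def speedup(cpu_metrics, gpu_metrics):
    """Return CPU/GPU ratios for metrics present in both runs.

    For duration-like metrics (lower is better), a ratio above 1 favors the GPU.
    """
    return {
        name: cpu_metrics[name] / gpu_metrics[name]
        for name in cpu_metrics
        if name in gpu_metrics and gpu_metrics[name] != 0
    }


# Placeholder values only; replace with numbers read from the Spark UI
cpu_run = {"batch_duration_ms": 8000, "job_duration_ms": 6000}
gpu_run = {"batch_duration_ms": 4000, "job_duration_ms": 2000}

for name, ratio in speedup(cpu_run, gpu_run).items():
    print(f"{name}: {ratio:.2f}x")
```

Keeping the metric names identical across both runs is what makes the comparison mechanical; metrics missing from either run are simply skipped.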

In [None]:
# Clean up
# spark.stop()
print("\nNote: the Spark session is still active.")
print("Run 'spark.stop()' when you finish all tests.")