# Training the Live Win Probability (LWP) Model

**Objective:** Train a machine learning model that predicts live win probability from the current state of the match.

**Data:** 5 La Liga seasons with StatsBomb 360 data

**Outputs:**
- `P(Home Win)`
- `P(Draw)`
- `P(Away Win)`

## 1. Setup and Configuration

In [1]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from datetime import datetime
import statsbombpy.sb as sb

# PySpark imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced to avoid shadowing builtins such as sum, min, max
from pyspark.sql import types as T
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

print(f"Setup complete - {datetime.now()}")

Setup complete - 2025-11-06 04:42:25.332770


## 2. Initialize the Spark Session

**Note:** This notebook uses the CPU configuration for batch training. GPU/CPU mode is controlled in `spark-conf/spark-defaults.conf` and is currently set to **CPU mode** to avoid GPU conflicts between multiple executors.

In [2]:
# Initialize Spark Session
# Note: GPU/CPU configuration is controlled by spark-defaults.conf
# For training: CPU mode (GPU disabled in spark-defaults.conf)
# For streaming: GPU mode (GPU enabled in spark-defaults.conf)
spark = SparkSession.builder \
    .appName("LWP-Model-Training") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Spark Master: {spark.sparkContext.master}")
print(f"Spark UI: http://localhost:4040")
print("\nSpark Configuration:")
print(f"  RAPIDS enabled: {spark.conf.get('spark.rapids.sql.enabled', 'false')}")
print(f"  GPU resources: {spark.conf.get('spark.executor.resource.gpu.amount', 'none')}")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e4f05839-668d-45a5-b39d-2abbb492bfc1;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central
	found org.apache.kafka#kafka-clients;2.8.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.8.4 in central
	found org.slf4j#slf4j-api;1.7.32 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.2 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.hadoop#hadoop-client-api;3.3.2 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.apache.commons#commons-pool2;2.11.1 in central
:: resolution report 

25/11/06 04:42:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


Spark Version: 3.3.1
Spark Master: spark://spark-master:7077
Spark UI: http://localhost:4040

Spark Configuration:
  RAPIDS enabled: false
  GPU resources: none


## 3. Load Historical StatsBomb Data

In [3]:
# La Liga competition ID
COMPETITION_ID = 11  # La Liga

print("Loading competitions...")
competitions = sb.competitions()
la_liga = competitions[competitions['competition_id'] == COMPETITION_ID]
print("\nAvailable La Liga seasons:")
print(la_liga[['season_id', 'season_name']])

# Use the five most recent seasons (2020/2021 back to 2016/2017)
season_ids = [90, 42, 4, 1, 2]
print(f"\nSelected seasons: {season_ids}")

Loading competitions...

Available La Liga seasons:
    season_id season_name
38         90   2020/2021
39         42   2019/2020
40          4   2018/2019
41          1   2017/2018
42          2   2016/2017
43         27   2015/2016
44         26   2014/2015
45         25   2013/2014
46         24   2012/2013
47         23   2011/2012
48         22   2010/2011
49         21   2009/2010
50         41   2008/2009
51         40   2007/2008
52         39   2006/2007
53         38   2005/2006
54         37   2004/2005
55        278   1973/1974

Selected seasons: [90, 42, 4, 1, 2]
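
The season IDs above are hard-coded. As a hedged alternative, the sketch below picks the N most recent seasons by sorting on `season_name`. It assumes the names follow the `YYYY/YYYY` pattern shown in the listing; the helper name `latest_season_ids` is hypothetical and not part of statsbombpy.

```python
import pandas as pd

def latest_season_ids(seasons: pd.DataFrame, n: int) -> list:
    """Return the season_id values of the n most recent seasons.

    Assumes season_name starts with a four-digit year, so sorting by that
    year orders the seasons chronologically.
    """
    start_year = seasons["season_name"].str.slice(0, 4).astype(int)
    ordered = seasons.assign(start_year=start_year).sort_values(
        "start_year", ascending=False
    )
    return ordered["season_id"].head(n).tolist()

# Small stand-in for the la_liga DataFrame printed above
sample = pd.DataFrame({
    "season_id": [27, 90, 42, 4, 1, 2],
    "season_name": ["2015/2016", "2020/2021", "2019/2020",
                    "2018/2019", "2017/2018", "2016/2017"],
})
print(latest_season_ids(sample, 5))
```

With the real listing, `latest_season_ids(la_liga, 5)` would be expected to reproduce the five IDs selected by hand.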


In [4]:
# Load all matches from the selected seasons
all_matches = []

for season_id in season_ids:
    print(f"\nLoading matches for season {season_id}...")
    matches = sb.matches(competition_id=COMPETITION_ID, season_id=season_id)
    all_matches.append(matches)
    print(f"  - {len(matches)} matches loaded")

matches_df = pd.concat(all_matches, ignore_index=True)
print(f"\nTotal matches: {len(matches_df)}")
print(f"Columns: {matches_df.columns.tolist()}")
matches_df.head()


Loading matches for season 90...
  - 35 matches loaded

Loading matches for season 42...
  - 33 matches loaded

Loading matches for season 4...
  - 34 matches loaded

Loading matches for season 1...
  - 36 matches loaded

Loading matches for season 2...
  - 34 matches loaded

Total matches: 172
Columns: ['match_id', 'match_date', 'kick_off', 'competition', 'season', 'home_team', 'away_team', 'home_score', 'away_score', 'match_status', 'match_status_360', 'last_updated', 'last_updated_360', 'match_week', 'competition_stage', 'stadium', 'referee', 'home_managers', 'away_managers', 'data_version', 'shot_fidelity_version', 'xy_fidelity_version']


Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,...,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,3773386,2020-10-31,21:00:00.000,Spain - La Liga,2020/2021,Deportivo Alavés,Barcelona,1,1,available,...,2023-07-25T04:25:41.348202,8,Regular Season,Estadio de Mendizorroza,,Pablo Javier Machín Díez,Ronald Koeman,1.1.0,2,2
1,3773565,2021-01-09,18:30:00.000,Spain - La Liga,2020/2021,Granada,Barcelona,0,4,available,...,2023-07-25T04:30:16.058384,18,Regular Season,Estadio Nuevo Los Cármenes,Ricardo De Burgos Bengoetxea,Diego Martínez Penas,Ronald Koeman,1.1.0,2,2
2,3773457,2021-05-16,18:30:00.000,Spain - La Liga,2020/2021,Barcelona,Celta Vigo,1,2,available,...,2023-04-27T23:03:53.506485,37,Regular Season,Spotify Camp Nou,,Ronald Koeman,Eduardo Germán Coudet,1.1.0,2,2
3,3773631,2021-02-07,21:00:00.000,Spain - La Liga,2020/2021,Real Betis,Barcelona,2,3,available,...,2023-07-25T03:56:34.733180,22,Regular Season,Estadio Benito Villamarín,,Manuel Luis Pellegrini Ripamonti,Ronald Koeman,1.1.0,2,2
4,3773665,2021-03-06,21:00:00.000,Spain - La Liga,2020/2021,Osasuna,Barcelona,0,2,available,...,2023-04-28T02:57:03.412841,26,Regular Season,Estadio El Sadar,Guillermo Cuadra Fernández,Jagoba Arrasate Elustondo,Ronald Koeman,1.1.0,2,2


## 4. Feature Engineering: Feature Extraction

In [5]:
def extract_features_from_events(match_id, events_df):
    """
    Extract features for each time snapshot of the match.
    Each row represents the state of the match at a given minute.
    """
    features = []
    
    # Get match info
    match_info = matches_df[matches_df['match_id'] == match_id].iloc[0]
    home_team = match_info['home_team']
    away_team = match_info['away_team']
    home_score = match_info['home_score']
    away_score = match_info['away_score']
    
    # Determine final result
    if home_score > away_score:
        result = 'home_win'
    elif home_score < away_score:
        result = 'away_win'
    else:
        result = 'draw'
    
    # Sort events by time
    events_df = events_df.sort_values(['period', 'minute', 'second'])
    
    # Create snapshots every 5 minutes (0, 5, ..., 90); events after minute 90 are not included
    for minute in range(0, 95, 5):
        events_until_now = events_df[events_df['minute'] <= minute]
        
        if len(events_until_now) == 0:
            continue
        
        # Count events by team
        home_events = events_until_now[events_until_now['team'] == home_team]
        away_events = events_until_now[events_until_now['team'] == away_team]
        
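        # Calculate current score from shots whose outcome is 'Goal'
        # Note: own goals are recorded as separate event types and are not counted here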
        goals = events_until_now[events_until_now['type'] == 'Shot']
        current_home_score = len(goals[(goals['team'] == home_team) & (goals['shot_outcome'] == 'Goal')])
        current_away_score = len(goals[(goals['team'] == away_team) & (goals['shot_outcome'] == 'Goal')])
        
        # Calculate stats
        home_shots = len(home_events[home_events['type'] == 'Shot'])
        away_shots = len(away_events[away_events['type'] == 'Shot'])
        home_passes = len(home_events[home_events['type'] == 'Pass'])
        away_passes = len(away_events[away_events['type'] == 'Pass'])
        
        # Approximate possession as the home team's share of all events so far
        total_events = len(home_events) + len(away_events)
        home_possession = len(home_events) / total_events if total_events > 0 else 0.5
        
        # Create feature row
        feature_row = {
            'match_id': match_id,
            'minute': minute,
            'home_score': current_home_score,
            'away_score': current_away_score,
            'score_diff': current_home_score - current_away_score,
            'home_shots': home_shots,
            'away_shots': away_shots,
            'shots_diff': home_shots - away_shots,
            'home_passes': home_passes,
            'away_passes': away_passes,
            'passes_diff': home_passes - away_passes,
            'home_possession': home_possession,
            'time_remaining': 90 - minute,
            'result': result
        }
        
        features.append(feature_row)
    
    return features
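
The function above derives the running score from shots with outcome `Goal`. A minimal sketch of a score helper that also credits own goals follows. It assumes that own goals appear as events of type `'Own Goal For'` whose `team` column holds the benefiting team; that naming should be checked against the event data before relying on it. The helper name `score_at_minute` is hypothetical.

```python
import pandas as pd

def score_at_minute(events: pd.DataFrame, team: str, minute: int) -> int:
    """Count goals credited to `team` up to and including `minute`."""
    so_far = events[events["minute"] <= minute]
    own = so_far[so_far["team"] == team]
    # Regular goals: shots by this team with outcome 'Goal'
    regular = ((own["type"] == "Shot") & (own["shot_outcome"] == "Goal")).sum()
    # Assumption: 'Own Goal For' rows carry the benefiting team in `team`
    own_goals = (own["type"] == "Own Goal For").sum()
    return int(regular + own_goals)

# Tiny synthetic event log for illustration
events = pd.DataFrame({
    "minute": [10, 25, 40, 70],
    "team": ["A", "A", "B", "A"],
    "type": ["Shot", "Own Goal For", "Shot", "Shot"],
    "shot_outcome": ["Goal", None, "Saved", "Goal"],
})
print(score_at_minute(events, "A", 45))
print(score_at_minute(events, "A", 90))
```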

In [6]:
# Extract features from all matches
print("Extracting features from the matches...")
print("Note: this step can take several minutes because of the API calls.\n")

all_features = []
match_ids = matches_df['match_id'].head(50).tolist()  # Use first 50 matches for training

for i, match_id in enumerate(match_ids, 1):
    try:
        print(f"[{i}/{len(match_ids)}] Processing match {match_id}...", end=" ")
        events = sb.events(match_id=match_id, split=False, flatten_attrs=True)
        features = extract_features_from_events(match_id, events)
        all_features.extend(features)
        print(f"✓ {len(features)} snapshots")
    except Exception as e:
        print(f"✗ Error: {e}")
        continue

# Convert to pandas DataFrame
features_pd = pd.DataFrame(all_features)
print(f"\nTotal snapshots (rows): {len(features_pd)}")
print("Distribution of results:")
print(features_pd['result'].value_counts())
features_pd.head(10)

Extracting features from the matches...
Note: this step can take several minutes because of the API calls.

[1/50] Processing match 3773386... ✓ 19 snapshots
[2/50] Processing match 3773565... ✓ 19 snapshots
[3/50] Processing match 3773457... ✓ 19 snapshots
[4/50] Processing match 3773631... ✓ 19 snapshots
[5/50] Processing match 3773665... ✓ 19 snapshots
[6/50] Processing match 3773497... ✓ 19 snapshots
[7/50] Processing match 3773660... ✓ 19 snapshots
[8/50] Processing match 3773593... ✓ 19 snapshots
[9/50] Processing match 3773466... ✓ 19 snapshots
[10/50] Processing match 3773585... ✓ 19 snapshots
[11/50] Processing match 3773552... ✓ 19 snapshots
[12/50] Processing match 3773672... ✓ 19 snapshots
[13/50] Processing match 3773587... ✓ 19 snapshots
[14/50] Processing match 3773656... ✓ 19 snapshots
[15/50] Processing match 3773377... ✓ 19 snapshots
[16/50] Processing match 3773586... ✓ 19 snapshots
[17/50] Processing match 3773372... ✓ 19 snapshots
[18/50] Processing

Unnamed: 0,match_id,minute,home_score,away_score,score_diff,home_shots,away_shots,shots_diff,home_passes,away_passes,passes_diff,home_possession,time_remaining,result
0,3773386,0,0,0,0,0,0,0,2,19,-17,0.225352,90,draw
1,3773386,5,0,0,0,0,0,0,33,61,-28,0.398148,85,draw
2,3773386,10,0,0,0,0,0,0,57,93,-36,0.407921,80,draw
3,3773386,15,0,0,0,0,1,-1,76,126,-50,0.397626,75,draw
4,3773386,20,0,0,0,1,2,-1,92,178,-86,0.368192,70,draw
5,3773386,25,0,0,0,1,3,-2,101,202,-101,0.366442,65,draw
6,3773386,30,1,0,1,2,3,-1,114,254,-140,0.35074,60,draw
7,3773386,35,1,0,1,2,4,-2,116,320,-204,0.319051,55,draw
8,3773386,40,1,0,1,2,4,-2,129,353,-224,0.323565,50,draw
9,3773386,45,1,0,1,3,6,-3,135,389,-254,0.317884,45,draw
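
Because the extraction loop calls the API once per match, re-running the notebook repeats every download. The sketch below shows one possible on-disk cache, assuming pickle files in a local directory are acceptable. The helper `cached_fetch` and the cache layout are hypothetical, and `fake_fetch` only stands in for `sb.events` so the example runs without network access.

```python
import pickle
import tempfile
from pathlib import Path

def cached_fetch(match_id, fetch, cache_dir):
    """Return fetch(match_id), reusing a pickled copy on disk when present."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"events_{match_id}.pkl"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = fetch(match_id)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result

# Demo with a stand-in fetch function that records how often it is called
calls = []
def fake_fetch(match_id):
    calls.append(match_id)
    return {"match_id": match_id, "events": ["kick-off"]}

demo_dir = tempfile.mkdtemp()
first = cached_fetch(3773386, fake_fetch, demo_dir)
second = cached_fetch(3773386, fake_fetch, demo_dir)
print(first == second, len(calls))
```

In the loop, the `fetch` argument could be a small wrapper around `sb.events(match_id=..., split=False, flatten_attrs=True)`, with a cache directory of your choosing.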


## 5. Prepare Data for Training in Spark

In [7]:
# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(features_pd)

print("Spark DataFrame created:")
spark_df.printSchema()
print(f"\nNumber of rows: {spark_df.count()}")
spark_df.show(10)

Spark DataFrame created:
root
 |-- match_id: long (nullable = true)
 |-- minute: long (nullable = true)
 |-- home_score: long (nullable = true)
 |-- away_score: long (nullable = true)
 |-- score_diff: long (nullable = true)
 |-- home_shots: long (nullable = true)
 |-- away_shots: long (nullable = true)
 |-- shots_diff: long (nullable = true)
 |-- home_passes: long (nullable = true)
 |-- away_passes: long (nullable = true)
 |-- passes_diff: long (nullable = true)
 |-- home_possession: double (nullable = true)
 |-- time_remaining: long (nullable = true)
 |-- result: string (nullable = true)


Number of rows: 950
+--------+------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-------------------+--------------+------+
|match_id|minute|home_score|away_score|score_diff|home_shots|away_shots|shots_diff|home_passes|away_passes|passes_diff|    home_possession|time_remaining|result|
+--------+------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-------------------+--------------+------+
| 3773386|     0|         0|         0|         0|         0|         0|         0|          2|         19|        -17|0.22535211267605634|            90|  draw|
| 3773386|     5|         0|         0|         0|         0|         0|         0|         33|         61|        -28|0.39814814814814814|            85|  draw|
| 3773386|    10|         0|         0|         0|         0|         0|         0|         57|         93|        -36| 0.4079207920792079|            80|  draw|
| 3773

In [8]:
# Prepare features and labels
feature_cols = [
    'minute', 'home_score', 'away_score', 'score_diff',
    'home_shots', 'away_shots', 'shots_diff',
    'home_passes', 'away_passes', 'passes_diff',
    'home_possession', 'time_remaining'
]

# Create vector assembler
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Create label indexer (convert string labels to numeric)
label_indexer = StringIndexer(
    inputCol="result",
    outputCol="label"
)

# Transform data
data_with_features = assembler.transform(spark_df)
data_with_labels = label_indexer.fit(data_with_features).transform(data_with_features)

print("Data prepared for training:")
data_with_labels.select('features', 'label', 'result').show(10, truncate=False)

Data prepared for training:
+---------------------------------------------------------------------------+-----+------+
|features                                                                   |label|result|
+---------------------------------------------------------------------------+-----+------+
|(12,[7,8,9,10,11],[2.0,19.0,-17.0,0.22535211267605634,90.0])               |2.0  |draw  |
|(12,[0,7,8,9,10,11],[5.0,33.0,61.0,-28.0,0.39814814814814814,85.0])        |2.0  |draw  |
|(12,[0,7,8,9,10,11],[10.0,57.0,93.0,-36.0,0.4079207920792079,80.0])        |2.0  |draw  |
|[15.0,0.0,0.0,0.0,0.0,1.0,-1.0,76.0,126.0,-50.0,0.39762611275964393,75.0]  |2.0  |draw  |
|[20.0,0.0,0.0,0.0,1.0,2.0,-1.0,92.0,178.0,-86.0,0.3681917211328976,70.0]   |2.0  |draw  |
|[25.0,0.0,0.0,0.0,1.0,3.0,-2.0,101.0,202.0,-101.0,0.36644165863066536,65.0]|2.0  |draw  |
|[30.0,1.0,0.0,1.0,2.0,3.0,-1.0,114.0,254.0,-140.0,0.35074045206547155,60.0]|2.0  |draw  |
|[35.0,1.0,0.0,1.0,2.0,4.0,-2.0,116.0,320.0,-204.0,0.
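
The `features` column above mixes two notations: rows with many zeros print in sparse form, `(size,[indices],[values])`, while the others print as dense lists. As an illustration only, the pure-Python sketch below expands the sparse form into a dense list and pairs it with the feature names. The helper `sparse_to_dense` is hypothetical and is not a Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's sparse notation (size, [indices], [values]) to a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

feature_cols = [
    'minute', 'home_score', 'away_score', 'score_diff',
    'home_shots', 'away_shots', 'shots_diff',
    'home_passes', 'away_passes', 'passes_diff',
    'home_possession', 'time_remaining'
]

# First row of the output above: (12,[7,8,9,10,11],[2.0,19.0,-17.0,0.2253...,90.0])
row = sparse_to_dense(12, [7, 8, 9, 10, 11],
                      [2.0, 19.0, -17.0, 0.22535211267605634, 90.0])
named = dict(zip(feature_cols, row))
print(named['home_passes'], named['away_passes'], named['time_remaining'])
```

Both notations describe the same 12-element vector, so the model treats them identically.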

In [9]:
# Split data into training and test sets
# Note: the split is per snapshot, so snapshots of the same match can land in both sets
train_data, test_data = data_with_labels.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} rows")
print(f"Test set: {test_data.count()} rows")

Training set: 763 rows
Test set: 187 rows
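
Because every match contributes 19 snapshots that all carry the same final result, a per-row random split can place snapshots of one match in both sets, which tends to inflate test metrics. A minimal sketch of a match-level split in pandas follows; it is an alternative to the split used in this notebook, not what the notebook ran. The helper name `split_by_match`, the 20% test fraction, and the seed are assumptions.

```python
import numpy as np
import pandas as pd

def split_by_match(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Split snapshot rows so that every match lands entirely in one set."""
    rng = np.random.default_rng(seed)
    match_ids = np.array(df["match_id"].unique())  # copy so shuffling is safe
    rng.shuffle(match_ids)
    n_test = max(1, int(round(len(match_ids) * test_fraction)))
    test_ids = list(match_ids[:n_test])
    is_test = df["match_id"].isin(test_ids)
    return df[~is_test], df[is_test]

# Stand-in for features_pd: 10 matches with 19 snapshots each
demo = pd.DataFrame({
    "match_id": np.repeat(np.arange(10), 19),
    "minute": np.tile(np.arange(0, 95, 5), 10),
})
train_pd, test_pd = split_by_match(demo)
print(len(train_pd), len(test_pd))
```

The two pandas frames could then be converted with `spark.createDataFrame` and passed through the same assembler and label indexer.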


## 6. Train the Classification Model

In [10]:
# Create Random Forest classifier
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    numTrees=100,
    maxDepth=10,
    seed=42
)

print("Training Random Forest model (CPU mode)...")
print("Check the Spark UI at http://localhost:4040 for performance metrics\n")

start_time = datetime.now()
rf_model = rf.fit(train_data)
training_time = (datetime.now() - start_time).total_seconds()

print(f"✓ Model trained in {training_time:.2f} seconds")
print(f"✓ Number of trees: {rf_model.getNumTrees}")
print(f"✓ Feature importances available")

Training Random Forest model (CPU mode)...
Check the Spark UI at http://localhost:4040 for performance metrics

25/11/06 04:44:21 WARN DAGScheduler: Broadcasting large task binary with size 1091.0 KiB
25/11/06 04:44:22 WARN DAGScheduler: Broadcasting large task binary with size 1462.8 KiB
25/11/06 04:44:22 WARN DAGScheduler: Broadcasting large task binary with size 1843.4 KiB
25/11/06 04:44:23 WARN DAGScheduler: Broadcasting large task binary with size 2.1 MiB
✓ Model trained in 4.55 seconds
✓ Number of trees: 100
✓ Feature importances available


## 7. Evaluate the Model

In [11]:
# Make predictions on test set
predictions = rf_model.transform(test_data)

print("Model predictions:")
predictions.select('minute', 'score_diff', 'result', 'label', 'prediction', 'probability').show(20, truncate=False)

Model predictions:
25/11/06 04:44:36 WARN DAGScheduler: Broadcasting large task binary with size 1652.6 KiB
+------+----------+--------+-----+----------+--------------------------------------------------------------+
|minute|score_diff|result  |label|prediction|probability                                                   |
+------+----------+--------+-----+----------+--------------------------------------------------------------+
|10    |0         |draw    |2.0  |1.0       |[0.2774039370570409,0.5583783577940511,0.164217705148908]     |
|30    |1         |draw    |2.0  |1.0       |[0.3200557924228494,0.5270165103649969,0.1529276972121536]    |
|40    |1         |draw    |2.0  |2.0       |[0.14277476802417793,0.3071699612147394,0.5500552707610826]   |
|65    |0         |draw    |2.0  |2.0       |[0.02,0.18749969939605937,0.7925003006039407]                 |
|0     |0         |away_win|1.0  |0.0       |[0.7583378396274043,0.162545551281854,0.07911660909074174]    |
|20    |0     

In [12]:
# Evaluate model
evaluator_accuracy = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)

accuracy = evaluator_accuracy.evaluate(predictions)
f1_score = evaluator_f1.evaluate(predictions)

print("="*50)
print("MODEL METRICS")
print("="*50)
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1_score:.4f}")
print(f"Training Time: {training_time:.2f} seconds")
print("="*50)

25/11/06 04:44:54 WARN DAGScheduler: Broadcasting large task binary with size 1666.0 KiB
25/11/06 04:44:54 WARN DAGScheduler: Broadcasting large task binary with size 1666.0 KiB
MODEL METRICS
Accuracy: 0.8396
F1 Score: 0.8382
Training Time: 4.55 seconds
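
Accuracy and F1 only score the hard class prediction, while the purpose of this model is the probability vector. As a hedged complement, the NumPy sketch below computes multiclass log loss and the Brier score from probabilities and integer labels. The function names are hypothetical and the input values are made up for illustration; they are not taken from the model. In Spark 3.x, `MulticlassClassificationEvaluator` should also accept `metricName="logLoss"`, which is worth confirming in the documentation for your version.

```python
import numpy as np

def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def multiclass_brier(probs, labels):
    """Mean squared distance between the probability vector and the one-hot label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    one_hot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

# Illustrative values only: two predictions over three classes
example_probs = [[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]]
example_labels = [0, 1]
print(multiclass_log_loss(example_probs, example_labels))
print(multiclass_brier(example_probs, example_labels))
```

Lower is better for both metrics; a model that always predicts one third for each outcome gives a log loss of about 1.0986, which is a useful baseline.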


In [13]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.featureImportances.toArray()
}).sort_values('importance', ascending=False)

print("\nFeature importance:")
print(feature_importance)


Feature importance:
            feature  importance
3        score_diff    0.190174
10  home_possession    0.157861
6        shots_diff    0.108214
9       passes_diff    0.098928
5        away_shots    0.081349
4        home_shots    0.066008
2        away_score    0.065497
1        home_score    0.061120
8       away_passes    0.056377
7       home_passes    0.049368
0            minute    0.041527
11   time_remaining    0.023578


## 8. Save the Trained Model

In [14]:
# Save model to disk
MODEL_PATH = "/work/models/lwp_model"

print(f"Saving model to {MODEL_PATH}...")
rf_model.write().overwrite().save(MODEL_PATH)
print("✓ Model saved successfully")

# Report the label mapping (StringIndexer orders labels by descending frequency)
label_mapping = label_indexer.fit(data_with_features).labels
print(f"\nLabel mapping: {label_mapping}")
for index, label in enumerate(label_mapping):
    print(f"  {index} = {label}")

Saving model to /work/models/lwp_model...
✓ Model saved successfully

Label mapping: ['home_win', 'away_win', 'draw']
  0 = home_win
  1 = away_win
  2 = draw
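
The cell above prints the label mapping but does not persist it, and any later consumer of the model needs the same order to interpret the probability vector. A minimal sketch of storing the mapping as JSON follows. The file name `labels.json`, its location, and the helper names are assumptions rather than part of this project.

```python
import json
import tempfile
from pathlib import Path

def save_labels(labels, path):
    """Write the ordered label list to a JSON file."""
    Path(path).write_text(json.dumps({"labels": list(labels)}))

def load_labels(path):
    """Read the ordered label list back from the JSON file."""
    return json.loads(Path(path).read_text())["labels"]

label_mapping = ['home_win', 'away_win', 'draw']  # order reported by the StringIndexer above
labels_path = Path(tempfile.mkdtemp()) / "labels.json"  # hypothetical location
save_labels(label_mapping, labels_path)
print(load_labels(labels_path))
```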


## 9. Inference Test

In [15]:
# Test inference with sample data
test_scenario = spark.createDataFrame([
    # Scenario 1: Home team winning 2-0 at minute 70
    (70, 2, 0, 2, 8, 3, 5, 250, 180, 70, 0.58, 20),
    # Scenario 2: Tied 1-1 at minute 45
    (45, 1, 1, 0, 5, 5, 0, 200, 200, 0, 0.50, 45),
    # Scenario 3: Away team leading 0-1 at minute 80
    # Note: passes_diff is -60 here, although home_passes - away_passes = 280 - 220 = 60
    (80, 0, 1, -1, 6, 8, -2, 280, 220, -60, 0.56, 10),
], feature_cols)

test_features = assembler.transform(test_scenario)
test_predictions = rf_model.transform(test_features)

print("Inference test with sample scenarios:")
test_predictions.select(
    'minute', 'score_diff', 'home_possession', 'time_remaining',
    'prediction', 'probability'
).show(truncate=False)

# The probability vector follows the StringIndexer label order: ['home_win', 'away_win', 'draw']
print("\nInterpreting the probabilities:")
print("probability[0] = P(Home Win)")
print("probability[1] = P(Away Win)")
print("probability[2] = P(Draw)")

Inference test with sample scenarios:
25/11/06 04:45:16 WARN DAGScheduler: Broadcasting large task binary with size 1624.0 KiB
25/11/06 04:45:16 WARN DAGScheduler: Broadcasting large task binary with size 1624.0 KiB
25/11/06 04:45:16 WARN DAGScheduler: Broadcasting large task binary with size 1624.0 KiB
+------+----------+---------------+--------------+----------+------------------------------------------------------------+
|minute|score_diff|home_possession|time_remaining|prediction|probability                                                 |
+------+----------+---------------+--------------+----------+------------------------------------------------------------+
|70    |2         |0.58           |20            |0.0       |[0.97,0.0,0.03]                                             |
|45    |0         |0.5            |45            |1.0       |[0.08806183902755549,0.4833456566543206,0.428592504318124]  |
|80    |-1        |0.56           |10            |1.0       |[0.03736078
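
The positions in the `probability` vector follow the StringIndexer label order, which the mapping printed in section 8 reports as `['home_win', 'away_win', 'draw']`. As an illustration, the sketch below attaches those names to a probability vector so the order never has to be remembered by hand. The helper `name_probabilities` is hypothetical.

```python
def name_probabilities(probability, labels):
    """Pair each probability with its outcome label, in label order."""
    if len(probability) != len(labels):
        raise ValueError("probability and labels must have the same length")
    return {label: float(p) for label, p in zip(labels, probability)}

labels = ['home_win', 'away_win', 'draw']
# Scenario 1 from the table above: home team leading 2-0 at minute 70
named = name_probabilities([0.97, 0.0, 0.03], labels)
print(named)
print(max(named, key=named.get))
```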

## 10. Summary and Next Steps

In [16]:
print("="*60)
print("TRAINING SUMMARY")
print("="*60)
print(f"✓ Model: Random Forest Classifier")
print(f"✓ Data: {len(match_ids)} matches processed")
print(f"✓ Features: {len(feature_cols)} features")
print(f"✓ Training samples: {train_data.count()}")
print(f"✓ Test samples: {test_data.count()}")
print(f"✓ Accuracy: {accuracy:.4f}")
print(f"✓ F1 Score: {f1_score:.4f}")
print(f"✓ Training time: {training_time:.2f} seconds")
print(f"✓ Model saved to: {MODEL_PATH}")
print("="*60)
print("\nNEXT STEPS:")
print("1. Run notebook 03_Streaming_Estadisticas.ipynb")
print("2. Run notebook 04_Streaming_Inferencia_LWP.ipynb")
print("3. Capture metrics from the Spark UI (localhost:4040)")
print("="*60)

TRAINING SUMMARY
✓ Model: Random Forest Classifier
✓ Data: 50 matches processed
✓ Features: 12 features
✓ Training samples: 763
✓ Test samples: 187
✓ Accuracy: 0.8396
✓ F1 Score: 0.8382
✓ Training time: 4.55 seconds
✓ Model saved to: /work/models/lwp_model

NEXT STEPS:
1. Run notebook 03_Streaming_Estadisticas.ipynb
2. Run notebook 04_Streaming_Inferencia_LWP.ipynb
3. Capture metrics from the Spark UI (localhost:4040)


In [17]:
# Stop Spark session
# spark.stop()
print("\nNote: the Spark session is still active for further exploration.")
print("Run 'spark.stop()' when you are done.")


Note: the Spark session is still active for further exploration.
Run 'spark.stop()' when you are done.