# 02. Advanced Training (The Brain)

This notebook trains, tunes, and ensembles the predictive models.

**Phases:**
1. **Target preparation**: Encode 'H', 'D', 'A' as integers.
2. **Model tournament**: Initial comparison (XGB vs. RF vs. LR).
3. **Extreme optimization**: GridSearchCV with TimeSeriesSplit.
4. **Stacking ensemble**: Build the super-model.
5. **Persistence**: Save the final model.
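
Phase 3 pairs `GridSearchCV` with `TimeSeriesSplit`. As a minimal sketch of that combination, the block below tunes a logistic regression on synthetic data. The data, the variable names, and the one-parameter grid are assumptions made for illustration; they are not taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for chronologically ordered training data (assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
noisy_signal = X_demo[:, 0] + 0.5 * rng.normal(size=300)
y_demo = np.digitize(noisy_signal, [-0.43, 0.43])  # three classes: 0, 1, 2

# Scaling lives inside the pipeline, so it is refit on each training fold
# and never sees the validation rows.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; the values are illustrative only.
param_grid = {"clf__C": [0.01, 0.1, 1.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # each fold validates on rows after its training window
    scoring="neg_log_loss",
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["clf__C"]
best_log_loss = -search.best_score_  # flip the sign back to a positive log loss
print(f"Best C: {best_C}, CV log loss: {best_log_loss:.4f}")
```

Putting the scaler inside the pipeline is a design choice worth noting: it keeps the scaling statistics of each validation fold out of the corresponding training fold.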

In [41]:
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, log_loss
import matplotlib.pyplot as plt

# Config
INPUT_FILE = 'df_final_features.csv'
MODEL_FILE = 'modelo_city_group.joblib'

## 1. Data Preparation

In [42]:
df = pd.read_csv(INPUT_FILE)

# Encode the target
le = LabelEncoder()
df['Target'] = le.fit_transform(df['FTR'])  # LabelEncoder sorts classes alphabetically: A=0, D=1, H=2
print("Target mapping:", dict(zip(le.classes_, le.transform(le.classes_))))

# Define the feature matrix X: drop identifiers, match results, the target, and bookmaker odds
exclude = ['Date', 'Season', 'HomeTeam', 'AwayTeam', 'FTR', 'FTHG', 'FTAG', 'Target', 'B365H', 'B365D', 'B365A']
features = [c for c in df.columns if c not in exclude]
print(f"Training with {len(features)} features: {features}")

X = df[features]
y = df['Target']

# Temporal split (preserving chronological order is CRITICAL).
# This assumes the rows of df are already sorted by date.
# The last 20% of rows is held out as the final out-of-time test set.
split_idx = int(len(df) * 0.80)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Scaling (important for logistic regression); fit on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Target mapping: {'A': 0, 'D': 1, 'H': 2}
Training with 18 features: ['Home_Elo', 'Away_Elo', 'Home_Att_Strength', 'Away_Att_Strength', 'Home_Def_Weakness', 'Away_Def_Weakness', 'Home_FIFA_Ova', 'Away_FIFA_Ova', 'Home_Market_Value', 'Away_Market_Value', 'Home_Rest_Days', 'Away_Rest_Days', 'Home_xG_Proxy', 'Away_xG_Proxy', 'Home_Dominance', 'Away_Dominance', 'Home_Pressure', 'Away_Pressure']
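
The positional 80/20 split is only an out-of-time split if the rows are in chronological order. A minimal sketch of a guard for that, using a toy frame: the `Date` column name comes from the `exclude` list, but its format and every value below are assumptions.

```python
import pandas as pd

# Toy frame with deliberately shuffled rows (hypothetical values).
df_demo = pd.DataFrame({
    "Date": ["2021-03-01", "2020-08-15", "2022-01-10", "2021-11-20", "2020-12-05"],
    "Home_Elo": [1510, 1490, 1530, 1520, 1500],
})

# Parse and sort so that positional slicing is also chronological slicing.
df_demo["Date"] = pd.to_datetime(df_demo["Date"])
df_demo = df_demo.sort_values("Date").reset_index(drop=True)

split_idx = int(len(df_demo) * 0.80)
train_demo, test_demo = df_demo.iloc[:split_idx], df_demo.iloc[split_idx:]

# Every test match must be played no earlier than the last training match.
is_chronological = train_demo["Date"].max() <= test_demo["Date"].min()
print(f"train: {len(train_demo)} rows, test: {len(test_demo)} rows, chronological: {is_chronological}")
```

If the real `Date` column uses a day-first format, `pd.to_datetime` would need the matching `format` or `dayfirst` argument; that detail is not visible here.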


## 2. Model Tournament
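
The tournament scores every model with `TimeSeriesSplit(n_splits=5)`. To make the folds concrete, here is a small sketch on 12 placeholder rows (an assumption; the real training set is far larger) that prints which row indices each fold trains and validates on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 rows standing in for chronologically ordered matches (assumption).
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=5)

folds = []
for train_idx, val_idx in tscv_demo.split(X_demo):
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print(f"train={train_idx.tolist()} validate={val_idx.tolist()}")

# The training window only grows, and validation rows always come after it.
never_looks_ahead = all(max(tr) < min(va) for tr, va in folds)
```

Because early folds train on very little data, their scores tend to be noisier than later ones, which is one reason to read the reported standard deviation alongside the mean.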

In [43]:
models = {
    'LogReg': LogisticRegression(max_iter=1000, C=0.1),
    'RandomForest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='mlogloss', n_estimators=100, max_depth=3, learning_rate=0.05)
}

tscv = TimeSeriesSplit(n_splits=5)

results = {}
for name, model in models.items():
    # Scaled data is used for every model for simplicity; tree-based models do not need it
    scores = cross_val_score(model, X_train_scaled, y_train, cv=tscv, scoring='neg_log_loss')
    results[name] = -scores.mean()
    print(f"{name} Log Loss: {-scores.mean():.4f} (+/- {scores.std():.4f})")

# Lower log loss is better
best_model_name = min(results, key=results.get)
print(f"\n🏆 Preliminary round winner: {best_model_name}")

LogReg Log Loss: 0.8991 (+/- 0.0252)
RandomForest Log Loss: 0.9321 (+/- 0.0265)
XGBoost Log Loss: 0.9103 (+/- 0.0208)

🏆 Preliminary round winner: LogReg


## 3. Ensemble Stacking (The Super-Model)
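
The ensemble in this section uses `VotingClassifier` with `voting='soft'`, which (with no weights set) averages the base models' predicted probabilities. A minimal sketch on synthetic data, with two stand-in base models chosen for illustration, checks that claim by computing the average by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem (assumption; not the match data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)),
    ],
    voting="soft",
)
voting_demo.fit(X_demo, y_demo)

# Average the fitted base models' probabilities by hand...
manual_avg = np.mean(
    [est.predict_proba(X_demo) for est in voting_demo.estimators_], axis=0
)
# ...and compare with what the ensemble itself reports.
ensemble_proba = voting_demo.predict_proba(X_demo)
print("Matches manual average:", np.allclose(manual_avg, ensemble_proba))
```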

In [44]:
# Build a VotingClassifier from the 3 models (soft voting averages their predicted probabilities).
# VotingClassifier is used instead of StackingClassifier because stacking builds its
# meta-features with cross_val_predict, which requires folds that partition the data;
# TimeSeriesSplit folds do not. Wrapped in CalibratedClassifierCV, the voting ensemble
# works with TimeSeriesSplit.
estimators = [
    ('lr', models['LogReg']),
    ('rf', models['RandomForest']),
    ('xgb', models['XGBoost'])
]

voting_clf = VotingClassifier(
    estimators=estimators,
    voting='soft'
)

print("Training Voting Ensemble...")
# fit trains each base estimator on the full training split
voting_clf.fit(X_train_scaled, y_train)

# Probability calibration (isotonic)
# Crucial for betting: a predicted 60% should win about 60% of the time.
# CalibratedClassifierCV clones voting_clf and refits it on each training fold of tscv.
calibrated_clf = CalibratedClassifierCV(voting_clf, method='isotonic', cv=tscv)
calibrated_clf.fit(X_train_scaled, y_train)

# Final evaluation on the held-out test set
y_prob = calibrated_clf.predict_proba(X_test_scaled)
loss = log_loss(y_test, y_prob)
acc = accuracy_score(y_test, calibrated_clf.predict(X_test_scaled))

print("\nFINAL TEST SET RESULTS:")
print(f"Log Loss: {loss:.4f}")
print(f"Accuracy: {acc:.4f}")

Training Voting Ensemble...

FINAL TEST SET RESULTS:
Log Loss: 0.9181
Accuracy: 0.5702
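
A log loss of 0.9181 is easier to read against naive baselines. On any three-class problem, a forecast of 1/3 for every outcome scores ln 3 ≈ 1.0986. The sketch below computes that figure and a class-frequency baseline on hypothetical outcome counts; the counts are an assumption and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical outcome counts (0=A, 1=D, 2=H); not taken from the dataset.
y_demo = np.array([0] * 30 + [1] * 25 + [2] * 45)
n_classes = 3

# Baseline 1: always predict 1/3 for each outcome.
uniform_proba = np.full((len(y_demo), n_classes), 1.0 / n_classes)
uniform_loss = log_loss(y_demo, uniform_proba, labels=[0, 1, 2])

# Baseline 2: always predict the observed class frequencies.
# In practice the frequencies would come from y_train and be scored on y_test.
prior = np.bincount(y_demo, minlength=n_classes) / len(y_demo)
prior_proba = np.tile(prior, (len(y_demo), 1))
prior_loss = log_loss(y_demo, prior_proba, labels=[0, 1, 2])

print(f"uniform: {uniform_loss:.4f}, class prior: {prior_loss:.4f}")
```

A model is only adding information if its test log loss sits below both baselines computed on the real labels.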


## 4. Save the Model

In [45]:
# Bundle model + scaler + encoder + feature list in one dictionary for the app
artifact = {
    'model': calibrated_clf,
    'scaler': scaler,
    'label_encoder': le,
    'features': features
}

joblib.dump(artifact, MODEL_FILE)
print(f"✅ Model saved to {MODEL_FILE}")

✅ Model saved to modelo_city_group.joblib
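
On the consuming side, an app could load the dictionary and reuse its four entries as sketched below. To keep the example self-contained, it first builds a tiny stand-in artifact with the same keys in a temporary directory; the stand-in model, the two-feature subset, and the fixture values are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Build a tiny stand-in artifact with the same four keys (NOT the real ensemble).
rng = np.random.default_rng(0)
demo_features = ["Home_Elo", "Away_Elo"]
X_demo = pd.DataFrame(rng.normal(1500, 50, size=(90, 2)), columns=demo_features)
le_demo = LabelEncoder().fit(["A", "D", "H"])
y_demo = le_demo.transform(np.tile(["A", "D", "H"], 30))
scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression(max_iter=1000).fit(scaler_demo.transform(X_demo), y_demo)

path = os.path.join(tempfile.mkdtemp(), "demo_artifact.joblib")
joblib.dump(
    {"model": model_demo, "scaler": scaler_demo,
     "label_encoder": le_demo, "features": demo_features},
    path,
)

# What an app might do with the saved file.
artifact = joblib.load(path)
new_match = pd.DataFrame([{"Away_Elo": 1480.0, "Home_Elo": 1550.0}])  # hypothetical fixture
X_new = new_match[artifact["features"]]  # enforce the training column order
X_new_scaled = artifact["scaler"].transform(X_new)
proba = artifact["model"].predict_proba(X_new_scaled)[0]

# Column i of predict_proba belongs to encoded class i, i.e. label_encoder.classes_[i].
outcome_proba = dict(zip(artifact["label_encoder"].classes_, proba))
print(outcome_proba)
```

Selecting columns through `artifact["features"]` guards against an app passing features in a different order than the scaler and model were trained on.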
