# 05.5 XGBoost Scratch Demo

This notebook demonstrates the XGBoost prototype implemented from scratch (`src/tree/xgboost_scratch.py`).

**Goal:** Validate that the sequential boosting algorithm works correctly on a subset of the processed data.
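
For orientation, the idea of sequential boosting can be sketched in a few lines. This is a minimal illustration, not the code in `src/tree/xgboost_scratch.py`: it assumes log loss, depth-1 trees (stumps), and XGBoost-style second-order leaf weights with an L2 term. All names here (`fit_stump`, `boost`, `reg_lambda`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y, raw):
    """Mean binary log loss for raw (pre-sigmoid) scores."""
    p = sigmoid(raw)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_stump(X, grad, hess, reg_lambda=1.0):
    """Return the single split with the largest second-order gain, or None."""
    best, best_gain = None, 0.0
    G, H = grad.sum(), hess.sum()
    parent_score = G**2 / (H + reg_lambda)
    for j in range(X.shape[1]):
        # Skip the largest value so the right child is never empty
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            GL, HL = grad[left].sum(), hess[left].sum()
            GR, HR = G - GL, H - HL
            gain = GL**2 / (HL + reg_lambda) + GR**2 / (HR + reg_lambda) - parent_score
            if gain > best_gain:
                best_gain = gain
                # Leaf weights are regularized Newton steps: -G / (H + lambda)
                best = (j, thr, -GL / (HL + reg_lambda), -GR / (HR + reg_lambda))
    return best


def predict_stump(stump, X):
    j, thr, w_left, w_right = stump
    return np.where(X[:, j] <= thr, w_left, w_right)


def boost(X, y, n_estimators=10, learning_rate=0.1):
    """Fit stumps one after another, each on the gradients left by the previous ones."""
    raw = np.zeros(len(y))  # raw scores start at 0, i.e. probability 0.5
    stumps = []
    for _ in range(n_estimators):
        p = sigmoid(raw)
        grad = p - y           # first derivative of log loss w.r.t. the raw score
        hess = p * (1.0 - p)   # second derivative
        stump = fit_stump(X, grad, hess)
        if stump is None:      # no split improves the objective
            break
        raw += learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return stumps, raw


# Toy data for illustration only: the label depends on the sign of feature 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(float)

initial_loss = log_loss(y_toy, np.zeros(len(y_toy)))
stumps, raw_scores = boost(X_toy, y_toy)
final_loss = log_loss(y_toy, raw_scores)
print(f"log loss before: {initial_loss:.3f}, after {len(stumps)} rounds: {final_loss:.3f}")
```

Each round corrects the errors of the current ensemble rather than fitting the labels from scratch, which is what makes the procedure sequential.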

In [None]:
import sys
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, classification_report, confusion_matrix

# Add project root to path to import src modules
sys.path.append(os.path.abspath('..'))

from src.tree.xgboost_scratch import XGBoostScratch

## 1. Data Loading
We load the processed dataset and select a subset of variables for a quick demo.

In [None]:
data_path = '../data/02_intermediate/process_data.parquet'
try:
    df = pd.read_parquet(data_path)
    print(f"Data loaded. Shape: {df.shape}")
except FileNotFoundError:
    print("Data file not found. Creating synthetic data for demo purposes.")
    # Synthetic fallback if data is missing in the sandbox; seeded so reruns match
    rng = np.random.default_rng(42)
    n_rows = 100
    df = pd.DataFrame({
        'Age': rng.integers(20, 80, n_rows),
        'SystolicBP': rng.integers(90, 180, n_rows),
        'BMI': rng.uniform(18, 40, n_rows),
        'Glucose': rng.integers(70, 200, n_rows),
        'TotalCholesterol': rng.integers(150, 300, n_rows),
        'HeartDisease': rng.integers(0, 2, n_rows)
    })

df.head()

In [None]:
# Configuration
features = ['Age', 'SystolicBP', 'BMI', 'Glucose', 'TotalCholesterol']
target = 'HeartDisease'

# Ensure features exist
available_features = [f for f in features if f in df.columns]
print(f"Using features: {available_features}")

# Prepare X and y
df_subset = df[available_features + [target]].dropna()
X = df_subset[available_features].values
y = df_subset[target].values

print(f"Dataset for training shape: {X.shape}")

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}")
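
The split above is not stratified. If `HeartDisease` turns out to be imbalanced in the real data, the class proportions in the test set can drift from those of the full dataset. The sketch below, on made-up data (the `*_demo` names and the 20% positive rate are assumptions for illustration), shows how the `stratify` argument of `train_test_split` keeps the proportions aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data for illustration: roughly 20% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (rng.random(200) < 0.2).astype(int)

# Plain random split, as in the cell above
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: each class is sampled in proportion to its frequency
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate, full data:       {y_demo.mean():.3f}")
print(f"Positive rate, plain test set:  {y_te_plain.mean():.3f}")
print(f"Positive rate, stratified test: {y_te_strat.mean():.3f}")
```

Stratification requires at least two samples per class, so it is worth checking the class counts before enabling it on a small subset.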

## 2. Training the Scratch Model
We instantiate `XGBoostScratch` and train it.

In [None]:
# Initialize model
xgb = XGBoostScratch(n_estimators=10, learning_rate=0.1, max_depth=3)

print("Starting training...")
xgb.fit(X_train, y_train)
print("Training complete.")

## 3. Evaluation
We generate predictions and evaluate the model's performance.

In [None]:
y_pred = xgb.predict(X_test)
y_proba = xgb.predict_proba(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
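
To read the numbers above, it helps to see how accuracy and recall follow from the confusion matrix. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not results from this notebook) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# For binary labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Accuracy: share of all predictions that are correct
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
# Recall: share of actual positives that the model found
recall_manual = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy_manual, "| sklearn:", accuracy_score(y_true_demo, y_pred_demo))
print("Recall:  ", recall_manual, "| sklearn:", recall_score(y_true_demo, y_pred_demo))
```

For a heart-disease target, recall is often the metric to watch, since a false negative is a missed case.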

In [None]:
# Inspect the first five test samples: probabilities, predicted labels, true labels
print("Example Probabilities:", y_proba[:5])
print("Example Predictions:", y_pred[:5])
print("True Labels:", y_test[:5])
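
Probabilities can also be turned into labels with a custom decision threshold, which trades precision for recall. The shape returned by `XGBoostScratch.predict_proba` is not shown in this notebook, so the hypothetical helpers below accept either a 1-D array of positive-class probabilities or a 2-D scikit-learn-style array whose last column is the positive class. That layout is an assumption, not a statement about the prototype.

```python
import numpy as np


def positive_class_proba(y_proba):
    """Return P(class 1) from a 1-D array or from the last column of a 2-D array."""
    y_proba = np.asarray(y_proba, dtype=float)
    if y_proba.ndim == 2:
        return y_proba[:, -1]
    return y_proba


def predict_with_threshold(y_proba, threshold=0.5):
    """Label a sample positive when its probability reaches the threshold."""
    return (positive_class_proba(y_proba) >= threshold).astype(int)


# Made-up probabilities for illustration only
proba_demo = np.array([0.15, 0.45, 0.55, 0.90])
proba_demo_2d = np.column_stack([1 - proba_demo, proba_demo])

labels_default = predict_with_threshold(proba_demo)
labels_lenient = predict_with_threshold(proba_demo_2d, threshold=0.4)

print("threshold 0.5:", labels_default)
print("threshold 0.4:", labels_lenient)
```

Lowering the threshold can only add positive predictions, so recall never decreases, at the cost of more false positives.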