# 🏎️ Complete Machine Learning Pipeline: F1 Results

This notebook demonstrates a **complete Machine Learning pipeline** on the Formula 1 results dataset, drawing on the visualization capabilities of **BESTLIB** throughout.

## Pipeline Objectives

1. Extensive **Exploratory Data Analysis (EDA)** with interactive visualizations
2. **Data preprocessing** with quality analysis
3. **Feature engineering** and feature selection
4. **Machine Learning modeling** (classification and regression)
5. **Evaluation and visualization of results**

## Dataset: F1 Results

The dataset contains historical Formula 1 race results, with information about:
- Races (raceId, year, race_name, race_date)
- Drivers (driverId, first_name, last_name, nationality)
- Constructors (constructorId)
- Results (position, points, laps, fastest_lap, fastest_lap_speed)

## ML Problems to Solve

1. **Classification**: Predict whether a driver finishes on the podium (top 3)
2. **Regression**: Predict the points a driver scores
3. **Multiclass classification**: Predict the final position category (podium, 4-5, 6-10, 11+, or DNF)
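
The classification targets above can be derived from the `position` column alone. The following is a minimal sketch on a hypothetical toy frame (the data and the names `toy` and `band` are illustrative assumptions, not part of the dataset). It uses `pd.cut` as a vectorized way to build position bands like the ones defined later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical finishing positions; NaN stands for a driver with no classified position.
toy = pd.DataFrame({"position": [1, 3, 4, 8, 15, np.nan]})

# Binary target: NaN compares as False, so non-finishers map to 0.
toy["is_podium"] = (toy["position"] <= 3).astype(int)

# Multiclass target: right-inclusive bins (0, 3], (3, 5], (5, 10], (10, inf).
band = pd.cut(
    toy["position"],
    bins=[0, 3, 5, 10, np.inf],
    labels=["Podium (1-3)", "Top 5 (4-5)", "Top 10 (6-10)", "Outside Top 10 (11+)"],
)
# Positions that are missing fall outside every bin; label them explicitly.
band = band.cat.add_categories(["DNF"]).fillna("DNF")

print(toy.assign(band=band))
```

The regression target needs no derivation, since it is the `points` column itself.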


## 📦 Step 1: Library Imports and Configuration


In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# BESTLIB for visualizations
from BESTLIB.reactive import ReactiveMatrixLayout, SelectionModel
from BESTLIB.matrix import MatrixLayout

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, r2_score, mean_squared_error
from sklearn.impute import SimpleImputer

# Plotting configuration
import matplotlib.pyplot as plt
import seaborn as sns

# Configure pandas to show more columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ All libraries imported successfully")


## 📊 Step 2: Data Loading and Initial Exploration


In [None]:
# Load dataset
df = pd.read_csv('../datasets/f1_results.csv')

print(f"📊 Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\n📋 Columns: {list(df.columns)}")
print("\n🔍 Dataset information:")
df.info()


In [None]:
# First rows
print("👀 First 10 rows of the dataset:")
df.head(10)


In [None]:
# Descriptive statistics
print("📈 Descriptive statistics for numeric variables:")
df.describe()


## 🔍 Step 3: Data Quality Assessment


In [None]:
# Missing-value analysis
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Columna': missing_data.index,
    'Valores_Faltantes': missing_data.values,
    'Porcentaje': missing_percent.values
})
missing_df = missing_df[missing_df['Valores_Faltantes'] > 0].sort_values('Valores_Faltantes', ascending=False)

print("❌ Columns with missing values:")
print(missing_df)

# Visualize missing values with BESTLIB
if len(missing_df) > 0:
    MatrixLayout.map_barchart('M', missing_df,
                              category_col='Columna',
                              value_col='Valores_Faltantes',
                              xLabel='Column',
                              yLabel='Missing Values',
                              title='Missing-Value Analysis',
                              interactive=True)
    
    layout_missing = MatrixLayout("M")
    layout_missing.display()


In [None]:
# Duplicate analysis
duplicates = df.duplicated().sum()
print(f"🔄 Duplicate rows: {duplicates}")

# Data type analysis
print("\n📊 Data types:")
print(df.dtypes)


## 🎨 Step 4: Exploratory Data Analysis (EDA) with BESTLIB

This section runs a full EDA using a broad set of BESTLIB visualizations with **linked views**, so the data can be explored interactively.

### 4.1 Preparing the Data for EDA


In [None]:
# Prepare data for EDA
# Clean data: coerce non-numeric placeholders such as '\N' to NaN
df_clean = df.copy()

# Convert numeric columns
numeric_cols = ['position', 'points', 'laps', 'fastest_lap', 'fastest_lap_speed']
for col in numeric_cols:
    if col in df_clean.columns:
        df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Create derived variables for analysis
df_clean['full_name'] = df_clean['first_name'] + ' ' + df_clean['last_name']
df_clean['is_podium'] = (df_clean['position'] <= 3).astype(int)
df_clean['is_top5'] = (df_clean['position'] <= 5).astype(int)
df_clean['is_top10'] = (df_clean['position'] <= 10).astype(int)

# Position category
def classify_position(pos):
    if pd.isna(pos):
        return 'DNF'
    elif pos <= 3:
        return 'Podium (1-3)'
    elif pos <= 5:
        return 'Top 5 (4-5)'
    elif pos <= 10:
        return 'Top 10 (6-10)'
    else:
        return 'Outside Top 10 (11+)'

df_clean['position_category'] = df_clean['position'].apply(classify_position)

print("✅ Data prepared for EDA")
print(f"📊 Rows: {len(df_clean)}, Columns: {len(df_clean.columns)}")


### 4.2 Interactive Dashboard: Position and Points Analysis

This dashboard explores the relationship between position, points, and other variables using **linked views**. Select points in the main scatter plot to filter the other charts automatically.
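
Conceptually, a linked view redraws each dependent chart from the rows selected in the main chart. The sketch below imitates that idea in plain pandas, with no BESTLIB calls. The toy data and the helper `summarize_selection` are hypothetical, and BESTLIB's internal mechanism may well differ.

```python
import pandas as pd

# Hypothetical toy results, standing in for the cleaned dataset.
toy = pd.DataFrame({
    "position": [1, 2, 5, 9, 14],
    "points": [25.0, 18.0, 10.0, 2.0, 0.0],
    "position_category": ["Podium", "Podium", "Top 5", "Top 10", "Outside"],
})

def summarize_selection(df, selected_index):
    """Recompute the aggregates that dependent charts would display."""
    subset = df.loc[selected_index]
    return {
        # What a linked pie chart would count.
        "pie_counts": subset["position_category"].value_counts().to_dict(),
        # What a linked bar chart of average points would show.
        "bar_mean_points": subset.groupby("position_category")["points"].mean().to_dict(),
    }

# Pretend the user brushed the first three points in the scatter plot.
summary = summarize_selection(toy, [0, 1, 2])
print(summary)
```

Every new selection simply triggers the same recomputation on a different subset of rows.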


In [None]:
# Create an interactive dashboard with linked views
selection_eda = SelectionModel()
layout_eda = ReactiveMatrixLayout("""
SPH
BXP
""", selection_model=selection_eda)

layout_eda.set_data(df_clean)

# S: Main scatter plot - Position vs Points (primary view)
layout_eda.add_scatter('S',
                       x_col='position',
                       y_col='points',
                       category_col='position_category',
                       interactive=True,
                       xLabel='Final Position',
                       yLabel='Points Scored',
                       title='Position vs Points (select to filter the other charts)',
                       pointRadius=4,
                       opacity=0.6)

# P: Pie chart - Distribution of position categories
layout_eda.add_pie('P',
                   category_col='position_category',
                   linked_to='S',
                   interactive=True,
                   selection_var='selected_position_category')

# H: Histogram - Distribution of positions
layout_eda.add_histogram('H',
                         column='position',
                         bins=20,
                         linked_to='S',
                         interactive=True,
                         selection_var='selected_positions',
                         xLabel='Final Position',
                         yLabel='Frequency')

# B: Bar chart - Points by position category
layout_eda.add_barchart('B',
                        category_col='position_category',
                        value_col='points',
                        linked_to='S',
                        interactive=True,
                        selection_var='selected_categories',
                        xLabel='Position Category',
                        yLabel='Average Points')

# X: Boxplot - Distribution of points by category
layout_eda.add_boxplot('X',
                       column='points',
                       category_col='position_category',
                       linked_to='S',
                       xLabel='Position Category',
                       yLabel='Points')

print("✅ EDA dashboard configured")
print("\n💡 Instructions:")
print("   - Select points in the scatter plot (S) to filter the other charts")
print("   - Click bars, pie slices, or histogram bins to see details")
print("   - Use the selection variables for follow-up analysis")

layout_eda.display()


In [None]:
# Driver analysis dashboard
selection_pilots = SelectionModel()
layout_pilots = ReactiveMatrixLayout("""
AB
CD
""", selection_model=selection_pilots)

layout_pilots.set_data(df_clean)

# A: Scatter - Points vs laps completed
layout_pilots.add_scatter('A',
                          x_col='laps',
                          y_col='points',
                          category_col='nationality',
                          interactive=True,
                          xLabel='Laps Completed',
                          yLabel='Points Scored',
                          title='Performance: Laps vs Points')

# B: Bar chart - Top drivers by total points
top_pilots = df_clean.groupby('full_name')['points'].sum().reset_index()
top_pilots = top_pilots.sort_values('points', ascending=False).head(20)
top_pilots.columns = ['category', 'value']

MatrixLayout.map_barchart('B', top_pilots,
                          category_col='category',
                          value_col='value',
                          xLabel='Driver',
                          yLabel='Total Points',
                          title='Top 20 Drivers by Total Points',
                          interactive=True)

layout_pilots._layout._map['B'] = MatrixLayout._map.get('B', {})

# C: Grouped bar chart - Points by constructor and year
constructor_year = df_clean.groupby(['constructorId', 'year'])['points'].sum().reset_index()
constructor_year.columns = ['main_col', 'sub_col', 'value']
constructor_year['main_col'] = constructor_year['main_col'].astype(str)
constructor_year['sub_col'] = constructor_year['sub_col'].astype(str)

MatrixLayout.map_grouped_barchart('C', constructor_year.head(100),
                                  main_col='main_col',
                                  sub_col='sub_col',
                                  value_col='value',
                                  xLabel='Constructor',
                                  yLabel='Total Points')

layout_pilots._layout._map['C'] = MatrixLayout._map.get('C', {})

# D: Line chart - Points per year (top constructors)
top_constructors = df_clean.groupby('constructorId')['points'].sum().sort_values(ascending=False).head(5).index
df_top_constructors = df_clean[df_clean['constructorId'].isin(top_constructors)]
constructor_evolution = df_top_constructors.groupby(['year', 'constructorId'])['points'].sum().reset_index()
constructor_evolution.columns = ['x', 'series', 'y']
# After the rename the constructor column is called 'series', so use that name from here on
constructor_evolution['series'] = constructor_evolution['series'].astype(str)

MatrixLayout.map_line('D', constructor_evolution,
                      x_col='x',
                      y_col='y',
                      series_col='series',
                      xLabel='Year',
                      yLabel='Total Points',
                      title='Points per Year by Constructor (Top 5)')

layout_pilots._layout._map['D'] = MatrixLayout._map.get('D', {})

layout_pilots.display()


In [None]:
# Dashboard with multidimensional visualizations
selection_multidim = SelectionModel()
layout_multidim = ReactiveMatrixLayout("""
RS
PC
""", selection_model=selection_multidim)

layout_multidim.set_data(df_clean)

# R: RadViz - Multidimensional view of the features
df_radviz = df_clean[['position', 'points', 'laps', 'fastest_lap_speed', 'position_category']].dropna()
layout_multidim._radviz_data = df_radviz

MatrixLayout.map_radviz('R', df_radviz,
                        features=['position', 'points', 'laps', 'fastest_lap_speed'],
                        class_col='position_category',
                        interactive=True)

layout_multidim._layout._map['R'] = MatrixLayout._map.get('R', {})

# S: Star Coordinates - Interactive view with draggable nodes
layout_multidim.add_star_coordinates('S',
                                     features=['position', 'points', 'laps', 'fastest_lap_speed'],
                                     class_col='position_category',
                                     linked_to='R',
                                     interactive=True)

# P: Parallel Coordinates - Analysis across multiple dimensions
layout_multidim.add_parallel_coordinates('P',
                                          dimensions=['position', 'points', 'laps', 'fastest_lap_speed'],
                                          category_col='position_category',
                                          linked_to='R',
                                          interactive=True)

# C: Correlation Heatmap
layout_multidim.add_correlation_heatmap('C',
                                         linked_to='R',
                                         showValues=True)

print("✅ Multidimensional dashboard configured")
print("\n💡 Instructions:")
print("   - In Star Coordinates: drag the nodes to explore different perspectives")
print("   - In Parallel Coordinates: drag columns to reorder them, click lines to select")
print("   - Select in RadViz to filter the other charts")

layout_multidim.display()


## 🧹 Step 5: Data Preprocessing

In this step we clean and prepare the data for Machine Learning modeling.
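
The notebook imports `SimpleImputer`, while the cell below fills medians by hand. As a minimal sketch of the imputer-based alternative, the example here applies median imputation to a hypothetical toy frame (the values are invented for illustration). In a real pipeline the imputer would normally be fit on the training split only, so that test rows do not influence the medians.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "laps": [57.0, np.nan, 44.0, 60.0],
    "fastest_lap_speed": [210.5, 198.0, np.nan, 205.0],
})

# strategy="median" replaces each NaN with its column's median.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(filled)
```

The fitted medians are stored in `imputer.statistics_`, which makes it possible to apply exactly the same fill values to new data with `imputer.transform`.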


In [None]:
# Clean and prepare data for ML
print("🔧 Starting data preprocessing...")

# 1. Handle missing values
df_ml = df_clean.copy()

# For numeric variables, fill with the median.
# Note: this also gives rows with a missing position (non-finishers) the median position.
numeric_cols_ml = ['position', 'points', 'laps', 'fastest_lap', 'fastest_lap_speed']
for col in numeric_cols_ml:
    if col in df_ml.columns:
        median_val = df_ml[col].median()
        # Assign back instead of a chained inplace fillna, which newer pandas deprecates
        df_ml[col] = df_ml[col].fillna(median_val)

# 2. Create target variables
# Target 1: Binary classification - finished on the podium?
df_ml['target_podium'] = (df_ml['position'] <= 3).astype(int)

# Target 2: Regression - Points scored
df_ml['target_points'] = df_ml['points']

# Target 3: Multiclass classification - Position category
df_ml['target_category'] = df_ml['position_category']

# 3. Feature Engineering
# Numeric variables
df_ml['points_per_lap'] = df_ml['points'] / (df_ml['laps'] + 1)  # +1 avoids division by zero
df_ml['speed_per_position'] = df_ml['fastest_lap_speed'] / (df_ml['position'] + 1)

# Encode categorical variables
le_nationality = LabelEncoder()
df_ml['nationality_encoded'] = le_nationality.fit_transform(df_ml['nationality'].astype(str))

le_constructor = LabelEncoder()
df_ml['constructor_encoded'] = le_constructor.fit_transform(df_ml['constructorId'].astype(str))

le_category = LabelEncoder()
df_ml['target_category_encoded'] = le_category.fit_transform(df_ml['target_category'])

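# 4. Select features for ML
# Caution: 'position', 'points_per_lap' and 'speed_per_position' are derived from the
# race outcome, so they leak the targets and make the scores below optimistic.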
feature_cols = [
    'year',
    'position',
    'laps',
    'fastest_lap',
    'fastest_lap_speed',
    'nationality_encoded',
    'constructor_encoded',
    'points_per_lap',
    'speed_per_position'
]

# Keep only rows where every feature is present
df_ml_clean = df_ml[feature_cols + ['target_podium', 'target_points', 'target_category_encoded']].dropna()

print("✅ Preprocessing complete")
print(f"📊 Final rows for ML: {len(df_ml_clean)}")
print(f"📋 Selected features: {len(feature_cols)}")
print("\n🔍 Features:")
for i, feat in enumerate(feature_cols, 1):
    print(f"   {i}. {feat}")


In [None]:
# Visualize the distribution of the target variables
selection_preprocessing = SelectionModel()
layout_preprocessing = ReactiveMatrixLayout("""
AB
CD
""", selection_model=selection_preprocessing)

layout_preprocessing.set_data(df_ml_clean)

# A: Histogram - Distribution of points (regression target)
layout_preprocessing.add_histogram('A',
                                   column='target_points',
                                   bins=30,
                                   interactive=True,
                                   xLabel='Points',
                                   yLabel='Frequency',
                                   title='Distribution of Points (Regression Target)')

# B: Pie chart - Podium distribution (binary classification target)
podium_dist = pd.DataFrame({
    'category': ['Podium', 'No Podium'],
    'value': [df_ml_clean['target_podium'].sum(), (df_ml_clean['target_podium'] == 0).sum()]
})

MatrixLayout.map_pie('B', podium_dist,
                     category_col='category',
                     value_col='value',
                     title='Podium vs No Podium Distribution')

layout_preprocessing._layout._map['B'] = MatrixLayout._map.get('B', {})

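# C: Bar chart - Distribution of categories (multiclass target)
# df_ml_clean holds only the encoded target, so read the labels from df_ml on the same rows
category_counts = df_ml.loc[df_ml_clean.index, 'target_category'].value_counts().reset_index()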
category_counts.columns = ['category', 'value']

MatrixLayout.map_barchart('C', category_counts,
                          category_col='category',
                          value_col='value',
                          xLabel='Position Category',
                          yLabel='Frequency',
                          title='Distribution of Categories (Multiclass Target)',
                          interactive=True)

layout_preprocessing._layout._map['C'] = MatrixLayout._map.get('C', {})

# D: Boxplot - Distribution of points by target_podium
layout_preprocessing.add_boxplot('D',
                                column='target_points',
                                category_col='target_podium',
                                xLabel='Podium (0=No, 1=Yes)',
                                yLabel='Points',
                                title='Distribution of Points: Podium vs No Podium')

layout_preprocessing.display()


## 🤖 Step 6: Machine Learning Modeling

### 6.1 Preparing the Data for ML
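
When one feature matrix serves several targets, the rows of every target must stay aligned with the rows of the features after splitting. `train_test_split` accepts several arrays at once and splits them all with the same indices. The sketch below shows this on synthetic data (the arrays and names such as `X_toy` are assumptions made for illustration) and fits the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and two targets that are functions of those features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_binary = (X_toy[:, 0] > 0).astype(int)
y_numeric = X_toy[:, 1] * 2.0

# One call, one set of indices: every output shares the same row split.
X_tr, X_te, yb_tr, yb_te, yn_tr, yn_te = train_test_split(
    X_toy, y_binary, y_numeric,
    test_size=0.2, random_state=42, stratify=y_binary,
)

# Fit the scaler on training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
```

Calling `train_test_split` separately for each target with different `stratify` arguments would produce different row subsets each time, so targets from an earlier call would no longer match the final feature matrices.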


In [None]:
# Prepare data for training
X = df_ml_clean[feature_cols]
y_podium = df_ml_clean['target_podium']
y_points = df_ml_clean['target_points']
y_category = df_ml_clean['target_category_encoded']

# Split once into train and test so the features and all three targets stay row-aligned.
# Stratifying on the position category also balances the podium target,
# because podium is one of its classes.
(X_train, X_test,
 y_podium_train, y_podium_test,
 y_points_train, y_points_test,
 y_category_train, y_category_test) = train_test_split(
    X, y_podium, y_points, y_category,
    test_size=0.2, random_state=42, stratify=y_category
)

# Standardize features (scaler fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Data prepared for ML")
print(f"📊 Train: {X_train.shape[0]} samples, Test: {X_test.shape[0]} samples")
print(f"üìã Features: {X_train.shape[1]}")


### 6.2 Model 1: Binary Classification - Podium Prediction
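
Only three drivers per race reach the podium, so the positive class is rare and accuracy alone can look good even for a weak model. As a minimal sketch, the example below derives accuracy, precision, recall, and F1 from a hypothetical 2x2 confusion matrix (the counts are invented for illustration), using the same row and column order that `confusion_matrix` returns for labels 0 and 1.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual, columns = predicted,
# class order [no podium, podium].
cm = np.array([[80, 5],
               [5, 10]])

# For a 2x2 matrix in this layout, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are right
precision = tp / (tp + fp)           # share of predicted podiums that are real
recall = tp / (tp + fn)              # share of real podiums that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case accuracy is 0.9 while precision and recall for the podium class are both about 0.67, which is why the per-class metrics are worth reporting alongside accuracy.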


In [None]:
# Train several models for binary classification
print("🎯 Model 1: Binary Classification - Podium Prediction")
print("=" * 60)

# Random Forest
rf_podium = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf_podium.fit(X_train_scaled, y_podium_train)
rf_podium_pred = rf_podium.predict(X_test_scaled)
rf_podium_score = accuracy_score(y_podium_test, rf_podium_pred)

# Logistic Regression
lr_podium = LogisticRegression(random_state=42, max_iter=1000)
lr_podium.fit(X_train_scaled, y_podium_train)
lr_podium_pred = lr_podium.predict(X_test_scaled)
lr_podium_score = accuracy_score(y_podium_test, lr_podium_pred)

# Gradient Boosting
gb_podium = GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=5)
gb_podium.fit(X_train_scaled, y_podium_train)
gb_podium_pred = gb_podium.predict(X_test_scaled)
gb_podium_score = accuracy_score(y_podium_test, gb_podium_pred)

print("\n📊 Results (test accuracy):")
print(f"   Random Forest:      {rf_podium_score:.4f}")
print(f"   Logistic Regression: {lr_podium_score:.4f}")
print(f"   Gradient Boosting:   {gb_podium_score:.4f}")

# Select the best model
best_podium_model = rf_podium if rf_podium_score >= max(lr_podium_score, gb_podium_score) else (
    lr_podium if lr_podium_score >= gb_podium_score else gb_podium
)
best_podium_name = 'Random Forest' if rf_podium_score >= max(lr_podium_score, gb_podium_score) else (
    'Logistic Regression' if lr_podium_score >= gb_podium_score else 'Gradient Boosting'
)

print(f"\n🏆 Best model: {best_podium_name} ({max(rf_podium_score, lr_podium_score, gb_podium_score):.4f})")


In [None]:
# Visualize the results of the binary classification model
selection_ml1 = SelectionModel()
layout_ml1 = ReactiveMatrixLayout("""
CM
HB
""", selection_model=selection_ml1)

# Build a DataFrame with the Random Forest predictions and the actual values
results_podium = pd.DataFrame({
    'real': y_podium_test.values,
    'predicted': rf_podium_pred,
    'correct': (y_podium_test.values == rf_podium_pred).astype(int)
})

layout_ml1.set_data(results_podium)

# C: Confusion matrix (rendered as a heatmap)
cm = confusion_matrix(y_podium_test, rf_podium_pred)
cm_df = pd.DataFrame(cm, 
                     index=['No Podium Real', 'Podium Real'],
                     columns=['No Podium Pred', 'Podium Pred'])

# Convert to long format for the heatmap
cm_long = []
for i, row_label in enumerate(cm_df.index):
    for j, col_label in enumerate(cm_df.columns):
        cm_long.append({
            'x': col_label,
            'y': row_label,
            'value': int(cm_df.iloc[i, j])
        })

MatrixLayout.map_heatmap('C', pd.DataFrame(cm_long),
                         x_col='x',
                         y_col='y',
                         value_col='value',
                         title='Confusion Matrix - Random Forest',
                         showValues=True)

layout_ml1._layout._map['C'] = MatrixLayout._map.get('C', {})

# M: Scatter - Actual vs predicted values
layout_ml1.add_scatter('M',
                       x_col='real',
                       y_col='predicted',
                       category_col='correct',
                       interactive=True,
                       xLabel='Actual Value',
                       yLabel='Predicted Value',
                       title='Actual vs Predicted (Green=Correct, Red=Incorrect)')

# H: Histogram - Distribution of errors
layout_ml1.add_histogram('H',
                          column='correct',
                          bins=2,
                          linked_to='M',
                          interactive=True,
                          xLabel='Correct (1) vs Incorrect (0)',
                          yLabel='Frequency')

# B: Bar chart - Model metrics (classification report computed once and reused)
report_podium = classification_report(y_podium_test, rf_podium_pred, output_dict=True)
metrics_data = pd.DataFrame({
    'category': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'value': [
        accuracy_score(y_podium_test, rf_podium_pred),
        report_podium['1']['precision'],
        report_podium['1']['recall'],
        report_podium['1']['f1-score']
    ]
})

MatrixLayout.map_barchart('B', metrics_data,
                          category_col='category',
                          value_col='value',
                          xLabel='Metric',
                          yLabel='Value',
                          title='Binary Classification Model Metrics',
                          interactive=True)

layout_ml1._layout._map['B'] = MatrixLayout._map.get('B', {})

layout_ml1.display()

# Show the classification report
print("\n📋 Detailed Classification Report:")
print(classification_report(y_podium_test, rf_podium_pred, 
                          target_names=['No Podium', 'Podium']))


### 6.3 Model 2: Regression - Points Prediction
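
The regression models in this section are compared on R² and RMSE, with MAE added in the results dashboard. As a minimal sketch of what those numbers measure, the example below computes all three by hand with NumPy on hypothetical actual and predicted points (the values are invented for illustration). These are the standard definitions that `r2_score` and `mean_squared_error` implement.

```python
import numpy as np

# Hypothetical actual and predicted points for five results.
y_true = np.array([25.0, 18.0, 0.0, 10.0, 2.0])
y_pred = np.array([22.0, 15.0, 1.0, 12.0, 0.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual (penalizes large misses).
rmse = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual (in the same units as the target).
mae = np.mean(np.abs(residuals))

# R²: 1 minus residual variation over total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

Because all three are computed from the same residuals, they should be reported for the same model's predictions when summarized together.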


In [None]:
# Train regression models
print("🎯 Model 2: Regression - Points Prediction")
print("=" * 60)

# Random Forest Regressor
rf_points = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10)
rf_points.fit(X_train_scaled, y_points_train)
rf_points_pred = rf_points.predict(X_test_scaled)
rf_points_r2 = r2_score(y_points_test, rf_points_pred)
rf_points_rmse = np.sqrt(mean_squared_error(y_points_test, rf_points_pred))

# Linear Regression
lr_points = LinearRegression()
lr_points.fit(X_train_scaled, y_points_train)
lr_points_pred = lr_points.predict(X_test_scaled)
lr_points_r2 = r2_score(y_points_test, lr_points_pred)
lr_points_rmse = np.sqrt(mean_squared_error(y_points_test, lr_points_pred))

print("\n📊 Results:")
print("   Random Forest Regressor:")
print(f"      R² Score: {rf_points_r2:.4f}")
print(f"      RMSE: {rf_points_rmse:.4f}")
print("\n   Linear Regression:")
print(f"      R² Score: {lr_points_r2:.4f}")
print(f"      RMSE: {lr_points_rmse:.4f}")

# Select the best model
best_points_model = rf_points if rf_points_r2 >= lr_points_r2 else lr_points
best_points_name = 'Random Forest' if rf_points_r2 >= lr_points_r2 else 'Linear Regression'
best_points_pred = rf_points_pred if rf_points_r2 >= lr_points_r2 else lr_points_pred

print(f"\n🏆 Best model: {best_points_name} (R² = {max(rf_points_r2, lr_points_r2):.4f})")


In [None]:
# Visualize regression results
selection_ml2 = SelectionModel()
layout_ml2 = ReactiveMatrixLayout("""
SH
XB
""", selection_model=selection_ml2)

# Build a DataFrame with predictions and actual values
results_points = pd.DataFrame({
    'real': y_points_test.values,
    'predicted': best_points_pred,
    'error': np.abs(y_points_test.values - best_points_pred),
    'error_percent': (np.abs(y_points_test.values - best_points_pred) / (y_points_test.values + 1)) * 100
})

layout_ml2.set_data(results_points)

# S: Scatter - Actual vs predicted values (primary view)
layout_ml2.add_scatter('S',
                       x_col='real',
                       y_col='predicted',
                       interactive=True,
                       xLabel='Actual Points',
                       yLabel='Predicted Points',
                       title=f'Regression: Actual vs Predicted ({best_points_name})')

# H: Histogram - Distribution of errors
layout_ml2.add_histogram('H',
                         column='error',
                         bins=30,
                         linked_to='S',
                         interactive=True,
                         xLabel='Absolute Error',
                         yLabel='Frequency',
                         title='Distribution of Errors')

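# X: Boxplot - Distribution of errors by range of actual points
# include_lowest keeps the 0-point rows; the open upper bin keeps results above 25 points
results_points['points_range'] = pd.cut(results_points['real'],
                                        bins=[0, 5, 10, 15, np.inf],
                                        labels=['0-5', '6-10', '11-15', '16+'],
                                        include_lowest=True)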
layout_ml2._points_range_data = results_points

layout_ml2.add_boxplot('X',
                       column='error',
                       category_col='points_range',
                       linked_to='S',
                       xLabel='Actual Points Range',
                       yLabel='Absolute Error',
                       title='Error by Points Range')

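# B: Bar chart - Metrics of the selected model (all computed from the same predictions)
metrics_reg = pd.DataFrame({
    'category': ['R² Score', 'RMSE', 'MAE'],
    'value': [
        r2_score(results_points['real'], results_points['predicted']),
        np.sqrt(mean_squared_error(results_points['real'], results_points['predicted'])),
        np.mean(results_points['error'])
    ]
})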

MatrixLayout.map_barchart('B', metrics_reg,
                          category_col='category',
                          value_col='value',
                          xLabel='Metric',
                          yLabel='Value',
                          title='Regression Model Metrics',
                          interactive=True)

layout_ml2._layout._map['B'] = MatrixLayout._map.get('B', {})

layout_ml2.display()


### 6.4 Model 3: Multiclass Classification - Position Category


In [None]:
# Train the multiclass classification model
print("🎯 Model 3: Multiclass Classification - Position Category")
print("=" * 60)

# Random Forest for multiclass classification
rf_category = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf_category.fit(X_train_scaled, y_category_train)
rf_category_pred = rf_category.predict(X_test_scaled)
rf_category_score = accuracy_score(y_category_test, rf_category_pred)

print("\n📊 Results:")
print(f"   Accuracy: {rf_category_score:.4f}")

# Show the classification report (category_names maps each encoded class back to its label)
category_names = le_category.inverse_transform(range(len(le_category.classes_)))
print("\n📋 Classification Report:")
print(classification_report(y_category_test, rf_category_pred, 
                          target_names=category_names))


In [None]:
# Visualize multiclass classification results
selection_ml3 = SelectionModel()
layout_ml3 = ReactiveMatrixLayout("""
CM
PB
""", selection_model=selection_ml3)

# Build the multiclass confusion matrix
cm_multiclass = confusion_matrix(y_category_test, rf_category_pred)
cm_multiclass_df = pd.DataFrame(cm_multiclass,
                                index=category_names,
                                columns=category_names)

# Convert to long format
cm_multiclass_long = []
for i, row_label in enumerate(cm_multiclass_df.index):
    for j, col_label in enumerate(cm_multiclass_df.columns):
        cm_multiclass_long.append({
            'x': col_label,
            'y': row_label,
            'value': int(cm_multiclass_df.iloc[i, j])
        })

MatrixLayout.map_heatmap('C', pd.DataFrame(cm_multiclass_long),
                         x_col='x',
                         y_col='y',
                         value_col='value',
                         title='Confusion Matrix - Multiclass Classification',
                         showValues=True)

layout_ml3._layout._map['C'] = MatrixLayout._map.get('C', {})

# M: Scatter - Actual vs Predicted
results_category = pd.DataFrame({
    'real': y_category_test.values,
    'predicted': rf_category_pred,
    'correct': (y_category_test.values == rf_category_pred).astype(int)
})

layout_ml3.set_data(results_category)
layout_ml3.add_scatter('M',
                       x_col='real',
                       y_col='predicted',
                       category_col='correct',
                       interactive=True,
                       xLabel='Actual Category (Encoded)',
                       yLabel='Predicted Category (Encoded)',
                       title='Actual vs Predicted - Multiclass Classification')

# P: Pie chart - Distribution of predictions
pred_dist = pd.Series(rf_category_pred).value_counts()
pred_dist_df = pd.DataFrame({
    'category': [category_names[i] for i in pred_dist.index],
    'value': pred_dist.values
})

MatrixLayout.map_pie('P', pred_dist_df,
                    category_col='category',
                    value_col='value',
                    title='Distribution of Predictions',
                    linked_to='M')

layout_ml3._layout._map['P'] = MatrixLayout._map.get('P', {})

# B: Bar chart - F1-Score per class
class_report = classification_report(y_category_test, rf_category_pred, 
                                    target_names=category_names, output_dict=True)
accuracy_by_class = []
for cat in category_names:
    if cat in class_report:
        accuracy_by_class.append({
            'category': cat,
            'value': class_report[cat]['f1-score']
        })

MatrixLayout.map_barchart('B', pd.DataFrame(accuracy_by_class),
                          category_col='category',
                          value_col='value',
                          xLabel='Category',
                          yLabel='F1-Score',
                          title='F1-Score by Category',
                          interactive=True)

layout_ml3._layout._map['B'] = MatrixLayout._map.get('B', {})

layout_ml3.display()


## 📊 Step 7: Feature Importance Analysis


In [None]:
# Analyze feature importance
print("🔍 Feature Importance Analysis")
print("=" * 60)

# Get feature importances from the Random Forest podium classifier (rf_podium)
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_podium.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

print("\n📊 Feature Importance (Top 10):")
print(feature_importance.head(10))

# Visualize feature importance
selection_features = SelectionModel()
layout_features = ReactiveMatrixLayout("""
BH
""", selection_model=selection_features)

# B: Bar chart - Feature importance
MatrixLayout.map_barchart('B', feature_importance.head(15),
                          category_col='feature',
                          value_col='importance',
                          xLabel='Feature',
                          yLabel='Importance',
                          title='Feature Importance (Top 15)',
                          interactive=True,
                          selection_var='selected_features')

layout_features._layout._map['B'] = MatrixLayout._map.get('B', {})

# H: Histogram - Distribution of importance
layout_features.add_histogram('H',
                              column='importance',
                              bins=20,
                              linked_to='B',
                              interactive=True,
                              xLabel='Importance',
                              yLabel='Frequency',
                              title='Distribution of Feature Importance')

layout_features.set_data(feature_importance)
layout_features.display()


## 🎯 Step 8: Model Comparison and Final Summary


In [None]:
# Build the final comparison dashboard
selection_final = SelectionModel()
layout_final = ReactiveMatrixLayout("""
ABC
DEF
""", selection_model=selection_final)

# A: Bar chart - Accuracy comparison of binary classification models
model_comparison_binary = pd.DataFrame({
    'category': ['Random Forest', 'Logistic Regression', 'Gradient Boosting'],
    'value': [rf_podium_score, lr_podium_score, gb_podium_score]
})

MatrixLayout.map_barchart('A', model_comparison_binary,
                          category_col='category',
                          value_col='value',
                          xLabel='Model',
                          yLabel='Accuracy',
                          title='Model Comparison - Binary Classification',
                          interactive=True)

layout_final._layout._map['A'] = MatrixLayout._map.get('A', {})

# B: Bar chart - R² comparison of regression models
model_comparison_reg = pd.DataFrame({
    'category': ['Random Forest', 'Linear Regression'],
    'value': [rf_points_r2, lr_points_r2]
})

MatrixLayout.map_barchart('B', model_comparison_reg,
                          category_col='category',
                          value_col='value',
                          xLabel='Model',
                          yLabel='R² Score',
                          title='Model Comparison - Regression',
                          interactive=True)

layout_final._layout._map['B'] = MatrixLayout._map.get('B', {})

# C: Correlation Heatmap - Correlations between features
layout_final.add_correlation_heatmap('C',
                                     showValues=True)

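# D: Scatter - Error comparison between models
# The model labels are repeated in blocks so they line up with the concatenated
# errors: all Random Forest rows first, then all Linear Regression rows.
n_test = len(y_points_test)
error_comparison = pd.DataFrame({
    'model': ['Random Forest'] * n_test + ['Linear Regression'] * n_test,
    'error': list(np.abs(y_points_test.values - rf_points_pred)) + 
             list(np.abs(y_points_test.values - lr_points_pred)),
    'real': list(y_points_test.values) * 2
})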

layout_final.set_data(error_comparison)
layout_final.add_scatter('D',
                         x_col='real',
                         y_col='error',
                         category_col='model',
                         interactive=True,
                         xLabel='Actual Points',
                         yLabel='Absolute Error',
                         title='Error Comparison: RF vs LR')

# E: Boxplot - Error distribution by model
layout_final.add_boxplot('E',
                         column='error',
                         category_col='model',
                         linked_to='D',
                         xLabel='Model',
                         yLabel='Absolute Error',
                         title='Error Distribution by Model')

# F: Pie chart - Class distribution in the test set
test_category_dist = pd.Series(y_category_test).value_counts()
test_category_df = pd.DataFrame({
    'category': [category_names[i] for i in test_category_dist.index],
    'value': test_category_dist.values
})

MatrixLayout.map_pie('F', test_category_df,
                    category_col='category',
                    value_col='value',
                    title='Class Distribution in Test Set')

layout_final._layout._map['F'] = MatrixLayout._map.get('F', {})

layout_final.display()
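
The error-comparison frame in panel D stacks the Random Forest errors first and the Linear Regression errors second, so the `model` labels need to be repeated in blocks rather than interleaved. A minimal sketch with made-up numbers (the error values are hypothetical, not results from this notebook) shows the difference:

```python
rf_errors = [0.5, 1.0, 2.0]   # hypothetical Random Forest absolute errors
lr_errors = [1.5, 0.2, 3.0]   # hypothetical Linear Regression absolute errors
n = len(rf_errors)

# Errors are concatenated: RF block first, LR block second
errors = rf_errors + lr_errors

# Interleaved labels: RF, LR, RF, LR, ... (does not match the block layout)
interleaved = ['Random Forest', 'Linear Regression'] * n
# Blocked labels: RF, RF, RF, LR, LR, LR (matches the block layout)
blocked = ['Random Forest'] * n + ['Linear Regression'] * n

rf_from_blocked = [e for m, e in zip(blocked, errors) if m == 'Random Forest']
rf_from_interleaved = [e for m, e in zip(interleaved, errors) if m == 'Random Forest']
print(rf_from_blocked)      # the original RF errors
print(rf_from_interleaved)  # a mix of RF and LR errors
```

With interleaved labels, roughly half of each model's points in the scatter and boxplot would belong to the other model.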


## 📝 Summary and Conclusions


In [None]:
print("=" * 60)
print("📊 MACHINE LEARNING PIPELINE SUMMARY")
print("=" * 60)

print("\n✅ STEPS COMPLETED:")
print("   1. Initial data loading and exploration")
print("   2. Data quality analysis")
print("   3. Extensive EDA with BESTLIB visualizations")
print("   4. Preprocessing and feature engineering")
print("   5. ML modeling (3 different problems)")
print("   6. Evaluation and visualization of results")
print("   7. Feature importance analysis")

print("\n🎯 MODELS TRAINED:")
print(f"   1. Binary Classification (Podium): {best_podium_name} - Accuracy: {max(rf_podium_score, lr_podium_score, gb_podium_score):.4f}")
print(f"   2. Regression (Points): {best_points_name} - R²: {max(rf_points_r2, lr_points_r2):.4f}")
print(f"   3. Multiclass Classification (Category): Random Forest - Accuracy: {rf_category_score:.4f}")

print("\n📊 VISUALIZATIONS CREATED:")
print("   - Interactive scatter plots with brush selection")
print("   - Histograms, bar charts, pie charts")
print("   - Boxplots, heatmaps, line charts")
print("   - RadViz, Star Coordinates, Parallel Coordinates")
print("   - Linked views for interactive exploration")
print("   - Confusion matrices and evaluation metrics")

print("\n💡 KEY FEATURES:")
print("   - All visualizations are interactive")
print("   - Linked views enable dynamic exploration")
print("   - Selection variables for later analysis")
print("   - Selected data returned as DataFrames")

print("\n🎉 Pipeline completed successfully!")
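
The summary prints a model name next to the maximum score. One way to keep the two in step is to derive both from a single mapping. This is only a sketch: the `podium_scores` dictionary and its values are hypothetical placeholders, not results from this notebook.

```python
# Hypothetical accuracy scores, for illustration only
podium_scores = {
    'Random Forest': 0.91,
    'Logistic Regression': 0.88,
    'Gradient Boosting': 0.90,
}

# Pick the name whose score is highest, then look up that same score
best_name = max(podium_scores, key=podium_scores.get)
best_score = podium_scores[best_name]
print(f"{best_name} - Accuracy: {best_score:.4f}")
```

In the notebook, the dictionary values would be `rf_podium_score`, `lr_podium_score`, and `gb_podium_score`.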
