# 🤖 Fundamentals of Artificial Intelligence: Model Training

**Instructor:** Alexander  
**Duration:** 3-4 hours  
**Level:** Intermediate

## 📋 Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Understand the complete workflow of a Machine Learning project
2. ✅ Prepare and explore data effectively
3. ✅ Implement and compare multiple ML algorithms
4. ✅ Evaluate models with appropriate metrics
5. ✅ Tune hyperparameters systematically
6. ✅ Deploy models to production

## 📚 Contents

1. Environment setup
2. Machine Learning workflow
3. Data exploration and preparation
4. Classification models (7+ algorithms)
5. Regression models (5+ algorithms)
6. Evaluation and advanced metrics
7. Preprocessing and Feature Engineering
8. Pipelines and automation
9. Hyperparameter optimization
10. Handling class imbalance
11. Advanced cross-validation
12. Saving and deploying models
13. Best practices and professional tips
14. Practical exercises
15. Final project

---

## 1️⃣ Environment Setup

### Installing dependencies

Run this cell if you need to install the libraries:

In [None]:
# Uncomment and run if you need to install
# !pip install scikit-learn pandas numpy matplotlib seaborn joblib xgboost lightgbm imbalanced-learn

### Importing libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Plotting configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# scikit-learn - Datasets
from sklearn.datasets import load_iris, load_diabetes, load_breast_cancer, make_classification, make_regression

# scikit-learn - Model selection and preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, StratifiedKFold, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# scikit-learn - Classification models
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# scikit-learn - Regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# scikit-learn - Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score, auc,
    mean_squared_error, mean_absolute_error, r2_score,
    precision_recall_curve, average_precision_score
)

# Feature selection
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel

# Handling class imbalance
try:
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline as ImbPipeline
    IMBLEARN_AVAILABLE = True
except ImportError:
    print("⚠️ imbalanced-learn is not installed. Some features will not be available.")
    IMBLEARN_AVAILABLE = False

# Persistence
import joblib
import pickle
from datetime import datetime

# Versions
import sklearn
print('✅ Libraries loaded successfully')
print(f'📦 NumPy: {np.__version__}')
print(f'📦 Pandas: {pd.__version__}')
print(f'📦 Scikit-learn: {sklearn.__version__}')
print(f'📦 Matplotlib: {plt.matplotlib.__version__}')
print(f'📦 Seaborn: {sns.__version__}')

---

## 2️⃣ Machine Learning Workflow

### 🔄 Complete Pipeline

```
1. PROBLEM DEFINITION
   ├─ Classification or regression?
   ├─ Which success metric?
   └─ What constraints apply?

2. DATA COLLECTION
   ├─ Data sources
   ├─ Quality and quantity
   └─ Ethical considerations

3. EXPLORATION (EDA)
   ├─ Descriptive statistics
   ├─ Visualizations
   ├─ Correlations
   └─ Outlier detection

4. DATA PREPARATION
   ├─ Cleaning (missing values, duplicates)
   ├─ Transformations
   ├─ Encoding categorical variables
   └─ Feature Engineering

5. DATA SPLITTING
   ├─ Train / Test (typically 70-80% / 20-30%)
   ├─ Train / Validation / Test (60% / 20% / 20%)
   └─ Stratification if needed

6. TRAINING
   ├─ Model selection
   ├─ Training (fit)
   └─ Cross-validation

7. EVALUATION
   ├─ Appropriate metrics
   ├─ Confusion matrix
   ├─ ROC/PR curves
   └─ Error analysis

8. OPTIMIZATION
   ├─ Hyperparameter tuning
   ├─ Feature selection
   └─ Ensemble methods

9. FINAL VALIDATION
   ├─ Evaluation on the test set
   ├─ Comparison with a baseline
   └─ Bias/variance analysis

10. DEPLOYMENT
    ├─ Save the model
    ├─ Documentation
    ├─ Monitoring
    └─ Continuous updating
```
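
As a rough sketch of steps 5 to 9 above (not a full project), the following cell splits Iris, cross-validates a scaled logistic regression on the training portion, and scores once on the held-out test set. The variable names (`X_demo`, `workflow_demo`, and so on) are illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 5: hold out a stratified test set before any preprocessing
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Step 6: scaler and model live in one pipeline, so each CV fold
# fits the scaler on its own training portion only
workflow_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=500)),
])
cv_acc = cross_val_score(workflow_demo, X_tr, y_tr, cv=5, scoring="accuracy")

# Steps 7 and 9: refit on the full training set, then score once on the test set
workflow_demo.fit(X_tr, y_tr)
test_acc = workflow_demo.score(X_te, y_te)

print(f"CV accuracy:   {cv_acc.mean():.3f} +/- {cv_acc.std():.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```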

---

## 3️⃣ Data Exploration and Preparation

### Dataset: Iris (Multiclass Classification)

In [None]:
# Load the Iris dataset
iris = load_iris(as_frame=True)
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Build a full DataFrame for analysis
df_iris = X_iris.copy()
df_iris['target'] = y_iris
df_iris['species'] = df_iris['target'].map(lambda x: target_names[x])

print("📊 Iris dataset loaded")
print(f"   Shape: {df_iris.shape}")
print(f"   Features: {len(feature_names)}")
print(f"   Classes: {len(target_names)} - {list(target_names)}")
print("\n🔍 First rows:")
df_iris.head(10)

### Exploratory Data Analysis (EDA)

In [None]:
# Descriptive statistics
print("📈 Descriptive Statistics:\n")
print(df_iris.describe())

print("\n🔢 Dataset Information:\n")
df_iris.info()  # info() prints directly; wrapping it in print() would also print "None"

print("\n⚠️ Missing values:")
print(df_iris.isnull().sum())

print("\n📊 Class distribution:")
print(df_iris['species'].value_counts())

In [None]:
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Class distribution
df_iris['species'].value_counts().plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Species Distribution', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_xlabel('Species')

# Simplified pairplot (2 features)
for species in df_iris['species'].unique():
    subset = df_iris[df_iris['species'] == species]
    axes[0, 1].scatter(subset['sepal length (cm)'], subset['sepal width (cm)'], 
                       label=species, alpha=0.6, s=50)
axes[0, 1].set_xlabel('Sepal Length (cm)')
axes[0, 1].set_ylabel('Sepal Width (cm)')
axes[0, 1].set_title('Sepal Length vs Width', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Boxplot
df_iris.boxplot(column='petal length (cm)', by='species', ax=axes[1, 0])
fig.suptitle('')  # Remove the super-title that pandas adds when grouping with `by`
axes[1, 0].set_title('Petal Length by Species', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Species')
axes[1, 0].set_ylabel('Petal Length (cm)')
plt.sca(axes[1, 0])
plt.xticks(rotation=0)

# Correlation matrix
corr_matrix = df_iris[feature_names].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1, 1], 
            square=True, cbar_kws={'shrink': 0.8})
axes[1, 1].set_title('Correlation Matrix', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("✅ Visualizations generated")

### Data Splitting

**Good practices:**
- Use `stratify` to preserve class proportions
- Set `random_state` for reproducibility
- Set the test set aside BEFORE any preprocessing

In [None]:
# Stratified split (preserves class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, 
    test_size=0.25,      # 75% train, 25% test
    random_state=42,     # Reproducibility
    stratify=y_iris      # Preserves class proportions
)

print("✂️ Data split complete")
print(f"   Train: {X_train.shape[0]} samples ({X_train.shape[0]/len(X_iris)*100:.1f}%)")
print(f"   Test:  {X_test.shape[0]} samples ({X_test.shape[0]/len(X_iris)*100:.1f}%)")
print("\n📊 Class distribution in Train:")
print(pd.Series(y_train).value_counts().sort_index())
print("\n📊 Class distribution in Test:")
print(pd.Series(y_test).value_counts().sort_index())
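
The workflow in section 2 also lists a Train / Validation / Test split (60% / 20% / 20%). One possible way to get it, sketched here with illustrative variable names, is to call `train_test_split` twice: first to reserve the test set, then to carve a validation set out of the remainder.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# First cut: reserve 20% of the data as the final test set
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_all, y_all, test_size=0.20, random_state=42, stratify=y_all
)

# Second cut: 25% of the remaining 80% gives a 20% validation set overall
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_fit), len(X_val), len(X_hold))  # 90 30 30
```

The validation set is then available for model selection, so the test set is only used for the final estimate.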

---

## 4️⃣ Classification Models

### Comparing Multiple Algorithms

We will train and compare 14 classification algorithms:

In [None]:
# Dictionary of models to compare
classification_models = {
    'Logistic Regression': LogisticRegression(max_iter=500, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
    'SVM (Linear)': SVC(kernel='linear', probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gaussian Naive Bayes': GaussianNB(),
    'LDA': LinearDiscriminantAnalysis(),
    'QDA': QuadraticDiscriminantAnalysis(),
    'Ridge Classifier': RidgeClassifier(random_state=42),
    'SGD Classifier': SGDClassifier(max_iter=1000, random_state=42)
}

print(f"üéØ {len(classification_models)} modelos de clasificaci√≥n listos para entrenar")

### Training and Evaluation with a Pipeline

In [None]:
# Results for all models
results = []

print("🚀 Training models...\n")

for name, model in classification_models.items():
    # Build a pipeline with scaling + model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Standardization (zero mean, unit variance)
        ('classifier', model)
    ])
    
    # Train
    pipeline.fit(X_train, y_train)
    
    # Predict
    y_pred = pipeline.predict(X_test)
    
    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Cross-validation (5-fold)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'CV Mean': cv_mean,
        'CV Std': cv_std
    })
    
    print(f"✅ {name:25s} | Acc: {accuracy:.4f} | CV: {cv_mean:.4f} ± {cv_std:.4f}")

# Build the results DataFrame
results_df = pd.DataFrame(results)
# Rank by cross-validated accuracy so the test set is not used to pick the winner
results_df = results_df.sort_values('CV Mean', ascending=False).reset_index(drop=True)

print("\n" + "="*80)
print("📊 RESULTS SUMMARY")
print("="*80)
print(results_df.to_string(index=False))
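
The workflow in section 2 calls for a comparison with a baseline. A minimal sketch of one, assuming scikit-learn's `DummyClassifier` as the baseline (the notebook itself does not specify one), is shown below; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_b, y_b = load_iris(return_X_y=True)
X_b_train, X_b_test, y_b_train, y_b_test = train_test_split(
    X_b, y_b, test_size=0.25, random_state=42, stratify=y_b
)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline_cv = cross_val_score(baseline, X_b_train, y_b_train, cv=5, scoring="accuracy")

print(f"Baseline CV accuracy: {baseline_cv.mean():.3f}")
```

On a balanced three-class problem such a baseline sits near one third, so any model in the table above should clear it by a wide margin.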

### Visualizing the Results

In [None]:
# Comparison chart
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Accuracy comparison
results_df_sorted = results_df.sort_values('Accuracy')
axes[0].barh(results_df_sorted['Model'], results_df_sorted['Accuracy'], color='steelblue')
axes[0].set_xlabel('Accuracy', fontsize=12)
axes[0].set_title('Accuracy Comparison by Model', fontsize=14, fontweight='bold')
axes[0].axvline(x=results_df_sorted['Accuracy'].mean(), color='red', linestyle='--', 
                label=f"Mean: {results_df_sorted['Accuracy'].mean():.3f}")
axes[0].legend()
axes[0].grid(axis='x', alpha=0.3)

# Multiple metrics for the best model
best_model_row = results_df.iloc[0]
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [best_model_row[m] for m in metrics]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']

bars = axes[1].bar(metrics, values, color=colors, alpha=0.7, edgecolor='black')
axes[1].set_ylabel('Score', fontsize=12)
axes[1].set_title(f'Metrics of the Best Model: {best_model_row["Model"]}', 
                  fontsize=14, fontweight='bold')
axes[1].set_ylim([0, 1.1])
axes[1].grid(axis='y', alpha=0.3)

# Add the values above the bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.02,
                f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

### Detailed Analysis of the Best Model

In [None]:
# Train the best model
best_model_name = results_df.iloc[0]['Model']
best_model = classification_models[best_model_name]

best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', best_model)
])

best_pipeline.fit(X_train, y_train)
y_pred_best = best_pipeline.predict(X_test)

print(f"üèÜ Mejor modelo: {best_model_name}")
print("\n" + "="*80)
print("üìã CLASSIFICATION REPORT")
print("="*80)
print(classification_report(y_test, y_pred_best, target_names=target_names))

### Confusion Matrix

In [None]:
# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred_best)

# Heatmap with class labels
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=target_names, yticklabels=target_names,
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=16, fontweight='bold', pad=20)
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()

# Error analysis
print("\n🔍 Confusion Matrix Analysis:")
print(f"   Diagonal (correct predictions): {cm.diagonal().sum()} / {cm.sum()}")
print(f"   Total errors: {cm.sum() - cm.diagonal().sum()}")
print(f"   Accuracy: {cm.diagonal().sum() / cm.sum():.4f}")
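
Per-class precision and recall can be read straight off a confusion matrix. The small example below uses a made-up 3-class matrix (not the result of the cell above) to show the arithmetic: recall divides the diagonal by row sums, precision divides it by column sums.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true class, columns = predicted)
cm_demo = np.array([
    [13, 0, 0],
    [0, 11, 2],
    [0, 1, 11],
])

true_positives = np.diag(cm_demo)
recall_per_class = true_positives / cm_demo.sum(axis=1)     # row sums: actual counts
precision_per_class = true_positives / cm_demo.sum(axis=0)  # column sums: predicted counts

for i, (p, r) in enumerate(zip(precision_per_class, recall_per_class)):
    print(f"class {i}: precision={p:.3f}, recall={r:.3f}")
```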

### ROC Curves (One-vs-Rest)

In [None]:
from sklearn.preprocessing import label_binarize

# Binarize the labels for multiclass ROC
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
n_classes = y_test_bin.shape[1]

# Get prediction scores (probabilities if available, decision values otherwise)
if hasattr(best_pipeline, "predict_proba"):
    y_score = best_pipeline.predict_proba(X_test)
elif hasattr(best_pipeline, "decision_function"):
    y_score = best_pipeline.decision_function(X_test)
else:
    print("⚠️ The model supports neither predict_proba nor decision_function")
    y_score = None

if y_score is not None:
    # Compute the ROC curve for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    
    # Plot
    plt.figure(figsize=(10, 8))
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
    
    for i, color in zip(range(n_classes), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=2,
                label=f'{target_names[i]} (AUC = {roc_auc[i]:.3f})')
    
    plt.plot([0, 1], [0, 1], 'k--', lw=2, label='Random (AUC = 0.5)')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
    plt.ylabel('True Positive Rate (Sensitivity)', fontsize=12)
    plt.title(f'ROC Curves (One-vs-Rest) - {best_model_name}', fontsize=14, fontweight='bold')
    plt.legend(loc="lower right", fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print("üìà AUC Scores:")
    for i in range(n_classes):
        print(f"   {target_names[i]:15s}: {roc_auc[i]:.4f}")
    print(f"   Promedio (macro):  {np.mean(list(roc_auc.values())):.4f}")
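
As a cross-check on the manual macro average, `roc_auc_score` can compute a one-vs-rest macro AUC directly from class probabilities. The sketch below is self-contained and assumes a scaled logistic regression as a stand-in for whichever model ranked first above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_r, y_r = load_iris(return_X_y=True)
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(
    X_r, y_r, test_size=0.25, random_state=42, stratify=y_r
)

clf_roc = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
clf_roc.fit(X_r_train, y_r_train)

# multi_class="ovr" with average="macro" averages the per-class one-vs-rest AUCs
macro_auc = roc_auc_score(
    y_r_test, clf_roc.predict_proba(X_r_test), multi_class="ovr", average="macro"
)
print(f"Macro one-vs-rest AUC: {macro_auc:.4f}")
```

Note that the multiclass form of `roc_auc_score` expects probabilities, so it does not apply to models that only expose `decision_function`.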

---

## 5️⃣ Regression Models

### Dataset: Diabetes (Regression)

In [None]:
# Load the dataset
diabetes = load_diabetes(as_frame=True)
X_diabetes = diabetes.data
y_diabetes = diabetes.target

df_diabetes = X_diabetes.copy()
df_diabetes['target'] = y_diabetes

print("📊 Diabetes dataset loaded")
print(f"   Shape: {df_diabetes.shape}")
print(f"   Features: {X_diabetes.shape[1]}")
print("\n🎯 Target (diabetes progression):")
print(f"   Min: {y_diabetes.min():.2f}")
print(f"   Max: {y_diabetes.max():.2f}")
print(f"   Mean: {y_diabetes.mean():.2f}")
print(f"   Std: {y_diabetes.std():.2f}")

df_diabetes.head()

In [None]:
# Split the data
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_diabetes, y_diabetes, 
    test_size=0.25, 
    random_state=42
)

print(f"✂️ Split complete: {Xr_train.shape[0]} train / {Xr_test.shape[0]} test")

### Comparing Regression Models

In [None]:
# Regression models
regression_models = {
    'Linear Regression': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0, random_state=42),
    'Lasso (L1)': Lasso(alpha=1.0, random_state=42),
    'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42),
    'Bayesian Ridge': BayesianRidge(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostRegressor(n_estimators=100, random_state=42),
    'SVR (RBF)': SVR(kernel='rbf'),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5)
}

print(f"üéØ {len(regression_models)} modelos de regresi√≥n listos")

In [None]:
# Training and evaluation
regression_results = []

print("🚀 Training regression models...\n")

for name, model in regression_models.items():
    # Pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('regressor', model)
    ])
    
    # Train
    pipeline.fit(Xr_train, yr_train)
    
    # Predict
    yr_pred = pipeline.predict(Xr_test)
    
    # Metrics
    mse = mean_squared_error(yr_test, yr_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(yr_test, yr_pred)
    r2 = r2_score(yr_test, yr_pred)
    
    # Cross-validation: turn each fold's negative MSE into an RMSE first,
    # then summarize (the square root of the MSE spread is not an RMSE spread)
    cv_scores = cross_val_score(pipeline, Xr_train, yr_train, cv=5, 
                                scoring='neg_mean_squared_error')
    cv_rmse = np.sqrt(-cv_scores)
    cv_rmse_mean = cv_rmse.mean()
    cv_rmse_std = cv_rmse.std()
    
    regression_results.append({
        'Model': name,
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'CV RMSE': cv_rmse_mean,
        'CV RMSE Std': cv_rmse_std
    })
    
    print(f"✅ {name:22s} | RMSE: {rmse:6.2f} | R²: {r2:6.3f} | CV RMSE: {cv_rmse_mean:6.2f}")

# Results DataFrame
regression_results_df = pd.DataFrame(regression_results)
regression_results_df = regression_results_df.sort_values('RMSE').reset_index(drop=True)

print("\n" + "="*90)
print("📊 RESULTS SUMMARY - REGRESSION")
print("="*90)
print(regression_results_df.to_string(index=False))
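
To make the four regression metrics concrete, here is a small worked example with made-up values (not taken from the Diabetes data) that computes MSE, RMSE, MAE and R² by hand with NumPy; the results can be compared against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example values, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.0, 10.0])

errors = y_true_demo - y_pred_demo
mse_manual = np.mean(errors ** 2)        # mean of squared errors
rmse_manual = np.sqrt(mse_manual)        # same units as the target
mae_manual = np.mean(np.abs(errors))     # mean of absolute errors
# R^2 = 1 - SS_res / SS_tot
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true_demo - y_true_demo.mean()) ** 2)

print(f"MSE={mse_manual:.3f} RMSE={rmse_manual:.3f} MAE={mae_manual:.3f} R2={r2_manual:.3f}")
```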

### Visualizing Predictions

In [None]:
# Train the best regression model
best_reg_name = regression_results_df.iloc[0]['Model']
best_reg_model = regression_models[best_reg_name]

best_reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', best_reg_model)
])

best_reg_pipeline.fit(Xr_train, yr_train)
yr_pred_best = best_reg_pipeline.predict(Xr_test)

# Plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Predicted vs Actual
axes[0].scatter(yr_test, yr_pred_best, alpha=0.6, s=50, edgecolors='k', linewidth=0.5)
axes[0].plot([yr_test.min(), yr_test.max()], [yr_test.min(), yr_test.max()], 
             'r--', lw=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Value', fontsize=12)
axes[0].set_ylabel('Prediction', fontsize=12)
axes[0].set_title(f'Predicted vs Actual - {best_reg_name}', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Residuals against predictions
residuals = yr_test - yr_pred_best
axes[1].scatter(yr_pred_best, residuals, alpha=0.6, s=50, edgecolors='k', linewidth=0.5)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Prediction', fontsize=12)
axes[1].set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
axes[1].set_title('Residual Analysis', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Residual statistics
print("\n📊 Residual Statistics:")
print(f"   Mean: {residuals.mean():.4f} (close to 0 is ideal)")
print(f"   Std. Dev.: {residuals.std():.4f}")
print(f"   Min: {residuals.min():.4f}")
print(f"   Max: {residuals.max():.4f}")

---

## 6️⃣ Feature Engineering and Selection

### Feature Importance (Random Forest)

In [None]:
# Train a Random Forest to obtain feature importances
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Collect the importances
feature_importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("🌟 Feature Importance (Random Forest):\n")
print(feature_importances.to_string(index=False))

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['feature'], feature_importances['importance'], color='teal')
plt.xlabel('Importance', fontsize=12)
plt.title('Feature Importance - Random Forest', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
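
The `feature_importances_` attribute is impurity-based and computed on the training data. As a complementary view (an addition to this notebook, not part of the original analysis), scikit-learn's `permutation_importance` measures how much the score drops on held-out data when one feature is shuffled. The sketch below is self-contained and its variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_pi = load_iris(as_frame=True)
X_pi_train, X_pi_val, y_pi_train, y_pi_val = train_test_split(
    iris_pi.data, iris_pi.target, test_size=0.25, random_state=42, stratify=iris_pi.target
)

rf_pi = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_pi_train, y_pi_train)

# Shuffle each feature on held-out data and record the drop in accuracy
perm = permutation_importance(rf_pi, X_pi_val, y_pi_val, n_repeats=10, random_state=42)

ranked = sorted(
    zip(iris_pi.feature_names, perm.importances_mean, perm.importances_std),
    key=lambda item: item[1],
    reverse=True,
)
for name, mean, std in ranked:
    print(f"{name:20s} {mean:.3f} +/- {std:.3f}")
```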

### Feature Selection with SelectKBest

In [None]:
# Select the 2 best features
selector = SelectKBest(score_func=f_classif, k=2)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Selected features
selected_features = X_train.columns[selector.get_support()].tolist()
print(f"✅ Selected features (k=2): {selected_features}")

# Same model on all features, so the comparison isolates the effect of selection
lr_all = LogisticRegression(max_iter=500, random_state=42)
lr_all.fit(X_train, y_train)
y_pred_all = lr_all.predict(X_test)

# Train the model on the selected features
lr_selected = LogisticRegression(max_iter=500, random_state=42)
lr_selected.fit(X_train_selected, y_train)
y_pred_selected = lr_selected.predict(X_test_selected)

print(f"\n📊 Accuracy with all features: {accuracy_score(y_test, y_pred_all):.4f}")
print(f"📊 Accuracy with the 2 best features: {accuracy_score(y_test, y_pred_selected):.4f}")

---

## 7️⃣ Hyperparameter Optimization

### GridSearchCV - Exhaustive Search

In [None]:
# Define the search space
# 'liblinear' only supports one-vs-rest for multiclass targets, so the grid
# compares solvers that optimize the multinomial loss on the 3-class Iris data
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    'classifier__max_iter': [200, 500, 1000],
    'classifier__penalty': ['l2']
}

# Base pipeline
pipeline_grid = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline_grid,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Number of combinations = product of the number of values per parameter
n_combinations = np.prod([len(values) for values in param_grid.values()])

print("🔍 Starting GridSearchCV...")
print(f"   Total combinations: {n_combinations}")
print("   CV folds: 5\n")

grid_search.fit(X_train, y_train)

print("\n✅ GridSearchCV complete")
print(f"\n🏆 Best score (CV): {grid_search.best_score_:.4f}")
print("\n⚙️ Best hyperparameters:")
for param, value in grid_search.best_params_.items():
    print(f"   {param}: {value}")

In [None]:
# Evaluate on the test set
y_pred_grid = grid_search.predict(X_test)
print(f"\n📊 Accuracy on the test set: {accuracy_score(y_test, y_pred_grid):.4f}")

# Top 10 configurations
cv_results = pd.DataFrame(grid_search.cv_results_)
top_configs = cv_results.nsmallest(10, 'rank_test_score')[[
    'rank_test_score', 'mean_test_score', 'std_test_score', 'params'
]]

print("\n📋 Top 10 Configurations:\n")
for idx, row in top_configs.iterrows():
    print(f"Rank {int(row['rank_test_score'])}: Score {row['mean_test_score']:.4f} ± {row['std_test_score']:.4f}")
    print(f"   Params: {row['params']}\n")

### RandomizedSearchCV - Random Search

In [None]:
from scipy.stats import randint

# Distributions for Random Forest
param_distributions = {
    'classifier__n_estimators': randint(50, 200),
    'classifier__max_depth': [None, 5, 10, 15, 20],
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10),
    'classifier__max_features': ['sqrt', 'log2', None]
}

pipeline_random = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

random_search = RandomizedSearchCV(
    estimator=pipeline_random,
    param_distributions=param_distributions,
    n_iter=20,  # Number of combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

print("üé≤ Iniciando RandomizedSearchCV...\n")
random_search.fit(X_train, y_train)

print("\n‚úÖ RandomizedSearchCV completado")
print(f"\nüèÜ Mejor score (CV): {random_search.best_score_:.4f}")
print(f"\n‚öôÔ∏è Mejores hiperpar√°metros:")
for param, value in random_search.best_params_.items():
    print(f"   {param}: {value}")

# Test set
y_pred_random = random_search.predict(X_test)
print(f"\nüìä Accuracy en test set: {accuracy_score(y_test, y_pred_random):.4f}")

---

## 8️⃣ Handling Imbalanced Data

### Creating a synthetic imbalanced dataset

In [None]:
# Generate an imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    flip_y=0.01,
    random_state=42
)

print("üìä Dataset desbalanceado generado")
print(f"   Total: {len(y_imb)} muestras")
print("\n   Distribuci√≥n de clases:")
unique, counts = np.unique(y_imb, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"   Clase {cls}: {count} ({count/len(y_imb)*100:.1f}%)")

# Divisi√≥n
X_imb_train, X_imb_test, y_imb_train, y_imb_test = train_test_split(
    X_imb, y_imb, test_size=0.25, random_state=42, stratify=y_imb
)

### Technique 1: Class Weight

In [None]:
# Without balancing
lr_no_balance = LogisticRegression(max_iter=1000, random_state=42)
lr_no_balance.fit(X_imb_train, y_imb_train)
y_pred_no_balance = lr_no_balance.predict(X_imb_test)

# With class_weight='balanced'
lr_balanced = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
lr_balanced.fit(X_imb_train, y_imb_train)
y_pred_balanced = lr_balanced.predict(X_imb_test)

print("⚖️ Comparison: no balancing vs class_weight\n")
print("WITHOUT BALANCING:")
print(classification_report(y_imb_test, y_pred_no_balance))
print("\nWITH CLASS_WEIGHT='balanced':")
print(classification_report(y_imb_test, y_pred_balanced))
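
The `'balanced'` option weights each class by `n_samples / (n_classes * np.bincount(y))`. The short example below, using made-up labels with a 90/10 split, computes those weights by hand and compares them with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 90 samples of class 0 and 10 of class 1
y_w = np.array([0] * 90 + [1] * 10)

# "balanced" weights follow n_samples / (n_classes * np.bincount(y))
manual_weights = len(y_w) / (2 * np.bincount(y_w))
sk_weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_w)

print("manual :", manual_weights.round(4))
print("sklearn:", sk_weights.round(4))
```

The minority class ends up with a weight of 5.0 against roughly 0.56 for the majority class, so each minority error costs about nine times as much during training.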

### Technique 2: SMOTE (Oversampling)

In [None]:
if IMBLEARN_AVAILABLE:
    # Apply SMOTE to the training data only
    smote = SMOTE(random_state=42)
    X_imb_train_smote, y_imb_train_smote = smote.fit_resample(X_imb_train, y_imb_train)
    
    print("🔄 SMOTE applied")
    print(f"   Before: {len(y_imb_train)} samples")
    print(f"   After: {len(y_imb_train_smote)} samples")
    print("\n   New distribution:")
    unique, counts = np.unique(y_imb_train_smote, return_counts=True)
    for cls, count in zip(unique, counts):
        print(f"   Class {cls}: {count} ({count/len(y_imb_train_smote)*100:.1f}%)")
    
    # Train on the balanced data
    lr_smote = LogisticRegression(max_iter=1000, random_state=42)
    lr_smote.fit(X_imb_train_smote, y_imb_train_smote)
    y_pred_smote = lr_smote.predict(X_imb_test)
    
    print("\n📊 Results with SMOTE:")
    print(classification_report(y_imb_test, y_pred_smote))
else:
    print("⚠️ imbalanced-learn is not available. Install it with: pip install imbalanced-learn")

---

## 9️⃣ Advanced Cross-Validation

In [None]:
# Cross-validation with several metrics at once
from sklearn.model_selection import cross_validate

# Model to evaluate
model_cv = RandomForestClassifier(n_estimators=100, random_state=42)

# Multiple metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision_weighted',
    'recall': 'recall_weighted',
    'f1': 'f1_weighted'
}

# Cross-validation
cv_results = cross_validate(
    model_cv, X_train, y_train,
    cv=5,
    scoring=scoring,
    return_train_score=True,
    n_jobs=-1
)

print("üìä Resultados de Validaci√≥n Cruzada (5-fold)\n")
print("="*60)
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    train_scores = cv_results[f'train_{metric}']
    test_scores = cv_results[f'test_{metric}']
    
    print(f"{metric.upper():12s}:")
    print(f"  Train: {train_scores.mean():.4f} ¬± {train_scores.std():.4f}")
    print(f"  Test:  {test_scores.mean():.4f} ¬± {test_scores.std():.4f}")
    print()
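
Passing `cv=5` uses a single stratified 5-fold split for classifiers. On small datasets, repeating the split with different shuffles gives a steadier estimate. The sketch below shows one way to do that with `RepeatedStratifiedKFold`; the number of repeats is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)

# 5 stratified folds, reshuffled and repeated 3 times, give 15 scores
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
rep_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cv, y_cv, cv=rskf, scoring="accuracy",
)

print(f"{len(rep_scores)} scores: {rep_scores.mean():.3f} +/- {rep_scores.std():.3f}")
```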

### Learning Curve

In [None]:
from sklearn.model_selection import learning_curve

# Compute the learning curve
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_train, y_train,
    cv=5,
    n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy'
)

# Compute the mean and standard deviation
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='r', label='Train Score')
plt.plot(train_sizes, test_mean, 'o-', color='g', label='Cross-validation Score')

plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, 
                 alpha=0.1, color='r')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, 
                 alpha=0.1, color='g')

plt.xlabel('Training set size', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Learning Curve - Random Forest', fontsize=14, fontweight='bold')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("üìà Interpretaci√≥n:")
print("   - Gap grande entre train y CV ‚Üí Overfitting")
print("   - Ambas curvas bajas ‚Üí Underfitting")
print("   - Ambas curvas altas y cercanas ‚Üí Buen ajuste")

---

## 🔟 Saving and Deploying Models

In [None]:
# Create a directory for the models
import os
model_dir = '/mnt/user-data/outputs'
os.makedirs(model_dir, exist_ok=True)

# Model information
model_info = {
    'model_name': best_model_name,
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'sklearn_version': sklearn.__version__,
    'accuracy': float(results_df.iloc[0]['Accuracy']),
    'f1_score': float(results_df.iloc[0]['F1-Score']),
    'features': list(feature_names),
    'target_names': list(target_names),
    'hyperparameters': best_pipeline.named_steps['classifier'].get_params()
}

# Save the model with joblib (recommended)
model_path = os.path.join(model_dir, 'best_model_iris.joblib')
joblib.dump(best_pipeline, model_path)
print(f"✅ Model saved: {model_path}")

# Save the metadata
import json
metadata_path = os.path.join(model_dir, 'model_info.json')
with open(metadata_path, 'w') as f:
    # default=str covers hyperparameter values that JSON cannot encode natively
    json.dump(model_info, f, indent=2, default=str)
print(f"✅ Metadata saved: {metadata_path}")

# Save with pickle (alternative)
pickle_path = os.path.join(model_dir, 'best_model_iris.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(best_pipeline, f)
print(f"✅ Model saved (pickle): {pickle_path}")

### Loading and Using the Model

In [None]:
# Load the model
loaded_model = joblib.load(model_path)
print("✅ Model loaded successfully\n")

# Make predictions
sample_data = X_test.iloc[:5]
predictions = loaded_model.predict(sample_data)
probabilities = loaded_model.predict_proba(sample_data)

print("🔮 Predictions on new data:\n")
for i, (pred, probs) in enumerate(zip(predictions, probabilities)):
    print(f"Sample {i+1}:")
    print(f"  Prediction: {target_names[pred]}")
    print("  Probabilities:")
    for j, prob in enumerate(probs):
        print(f"    {target_names[j]:15s}: {prob:.4f} ({prob*100:.1f}%)")
    print()
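
Because the metadata records the scikit-learn version, it can be compared against the running environment when the model is loaded. scikit-learn does not guarantee that a model saved with one version loads correctly in another, and joblib or pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code. The sketch below is a minimal, self-contained illustration: `load_model_checked`, the `demo_` variables, and the file names are hypothetical and independent of the cells above.

```python
import json
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_model_checked(model_file, metadata_file):
    """Load a joblib model and report whether its recorded sklearn version matches."""
    with open(metadata_file) as f:
        info = json.load(f)
    version_ok = info.get("sklearn_version") == sklearn.__version__
    model = joblib.load(model_file)
    return model, info, version_ok


# Train a small stand-in pipeline so the example runs on its own.
demo_X, demo_y = load_iris(return_X_y=True)
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(demo_X, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_model_file = os.path.join(tmp_dir, "demo_model.joblib")
    demo_metadata_file = os.path.join(tmp_dir, "demo_info.json")
    joblib.dump(demo_pipe, demo_model_file)
    with open(demo_metadata_file, "w") as f:
        json.dump({"sklearn_version": sklearn.__version__}, f)

    demo_loaded, demo_info, demo_version_ok = load_model_checked(
        demo_model_file, demo_metadata_file
    )

# The reloaded pipeline should reproduce the original predictions.
demo_same = bool((demo_loaded.predict(demo_X) == demo_pipe.predict(demo_X)).all())
print(f"version match: {demo_version_ok}, identical predictions: {demo_same}")
```

If the versions differ, retraining in the new environment is usually safer than relying on the old file.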

---

## 1️⃣1️⃣ Best Practices and Tips

### ✅ DO's

1. **Reproducibility**
   - Always set `random_state` in models and splits
   - Document library versions
   - Save seeds and configurations

2. **Preprocessing**
   - Use Pipelines to avoid data leakage
   - Fit scalers AFTER the split, on the training data only
   - Document the transformations applied

3. **Validation**
   - Use cross-validation
   - Keep the test set untouched until the end
   - Stratify when classes are imbalanced

4. **Metrics**
   - Choose metrics that fit the problem
   - Report multiple metrics
   - Consider the business context

5. **Documentation**
   - Document decisions and experiments
   - Save metadata alongside models
   - Version models and datasets
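
The reproducibility, preprocessing, and validation points can be combined in a few lines. This is a minimal sketch under assumed choices (breast cancer dataset, logistic regression, 5 stratified folds); the `demo_` names are illustrative. Because the scaler lives inside the Pipeline, it is refit on the training part of each fold, so no statistics from the validation part leak into training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_X, demo_y = load_breast_cancer(return_X_y=True)

# Scaling is a pipeline step, so it is fit only on each fold's training data.
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
])

# A fixed random_state makes the fold assignment repeatable.
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

demo_scores = cross_val_score(demo_pipe, demo_X, demo_y, cv=demo_cv, scoring="accuracy")
demo_scores_again = cross_val_score(demo_pipe, demo_X, demo_y, cv=demo_cv, scoring="accuracy")

print(f"accuracy per fold: {demo_scores.round(3)}")
print(f"same result on a second run: {(demo_scores == demo_scores_again).all()}")
```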

### ❌ DON'Ts

1. **Data Leakage**
   - Don't use information from the test set during training
   - Don't scale using statistics from the full dataset
   - Don't do feature engineering with future data

2. **Overfitting**
   - Don't rely on training accuracy alone
   - Don't skip cross-validation
   - Don't use very complex models without regularization

3. **Evaluation**
   - Don't use accuracy alone on imbalanced data
   - Don't optimize on the test set
   - Don't skip error analysis

4. **Generalization**
   - Don't assume the CV score equals real-world performance
   - Don't ignore the data distribution in production
   - Don't forget to monitor the model in production
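
To see why accuracy alone misleads on imbalanced data, consider a baseline that always predicts the majority class. The sketch below uses a made-up label vector with 95 negatives and 5 positives (an assumption for illustration only) and compares accuracy with balanced accuracy and F1.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Synthetic, heavily imbalanced labels: 95 negatives, 5 positives.
demo_y_true = np.array([0] * 95 + [1] * 5)
demo_X_dummy = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class.
demo_baseline = DummyClassifier(strategy="most_frequent").fit(demo_X_dummy, demo_y_true)
demo_y_pred = demo_baseline.predict(demo_X_dummy)

demo_acc = accuracy_score(demo_y_true, demo_y_pred)
demo_bal_acc = balanced_accuracy_score(demo_y_true, demo_y_pred)
demo_f1 = f1_score(demo_y_true, demo_y_pred, zero_division=0)

print(f"accuracy:          {demo_acc:.2f}")      # 0.95
print(f"balanced accuracy: {demo_bal_acc:.2f}")  # 0.50
print(f"F1 (positive):     {demo_f1:.2f}")       # 0.00
```

The baseline never finds a single positive case, yet its accuracy looks excellent; the other two metrics expose the problem.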

### 🎯 Pre-Deployment Checklist

- [ ] Model trained following best practices
- [ ] Cross-validation performed
- [ ] Test set evaluated
- [ ] Metrics documented
- [ ] Error analysis completed
- [ ] Model saved with metadata
- [ ] Preprocessing pipeline included
- [ ] Documentation complete
- [ ] Code under version control
- [ ] Monitoring plan defined

---

## 1️⃣2️⃣ Practical Exercises

### Exercise 1: Breast Cancer Classification

**Goal:** Predict whether a tumor is malignant or benign

**Tasks:**
1. Load the dataset with `load_breast_cancer()`
2. Perform a complete EDA
3. Compare at least 5 models
4. Optimize the best model with GridSearchCV
5. Evaluate with multiple metrics
6. Generate a confusion matrix and ROC curve
7. Save the final model
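
As one possible starting point for tasks 1 and 6, the sketch below fits a single model and computes a confusion matrix and the ROC AUC. The model choice, split size, and `bc_` variable names are assumptions for illustration; the exercise still asks you to compare several models and tune the best one. Note that in this dataset the target is encoded as 0 = malignant and 1 = benign, so the default "positive" class is benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

bc_data = load_breast_cancer(as_frame=True)
bc_X, bc_y = bc_data.data, bc_data.target  # 0 = malignant, 1 = benign

# Stratified split keeps the class proportions in both parts.
bc_X_train, bc_X_test, bc_y_train, bc_y_test = train_test_split(
    bc_X, bc_y, test_size=0.2, stratify=bc_y, random_state=42
)

bc_clf = RandomForestClassifier(n_estimators=100, random_state=42)
bc_clf.fit(bc_X_train, bc_y_train)

bc_pred = bc_clf.predict(bc_X_test)
bc_score = bc_clf.predict_proba(bc_X_test)[:, 1]  # probability of class 1 (benign)

bc_cm = confusion_matrix(bc_y_test, bc_pred)  # rows: true class, columns: predicted
bc_auc = roc_auc_score(bc_y_test, bc_score)

print(bc_cm)
print(f"ROC AUC: {bc_auc:.3f}")
```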

In [None]:
# YOUR CODE HERE
# Space to solve exercise 1

# Hint: breast_cancer = load_breast_cancer(as_frame=True)


### Exercise 2: Regression with Feature Engineering

**Goal:** Improve the Diabetes regression model

**Tasks:**
1. Create polynomial features of degree 2
2. Apply feature selection
3. Compare linear vs. tree-based models
4. Evaluate with multiple metrics (MSE, MAE, R²)
5. Analyze residuals
6. Document the improvements
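
A possible starting point for tasks 1 and 2 is sketched below. The selector (`SelectKBest` with `f_regression`), the value `k=20`, the Ridge regressor, and the `db_` names are all assumptions picked for illustration, not the expected solution. Keeping every step inside one Pipeline means feature expansion and selection are refit on each training fold.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

db_X, db_y = load_diabetes(return_X_y=True)

db_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # squares and pairwise products
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=20)),      # k=20 is an arbitrary choice
    ("regressor", Ridge(alpha=1.0)),
])

db_cv = KFold(n_splits=5, shuffle=True, random_state=42)
db_r2 = cross_val_score(db_pipe, db_X, db_y, cv=db_cv, scoring="r2")

# Fit once on all data just to inspect how many features the expansion creates.
db_n_poly = db_pipe.fit(db_X, db_y).named_steps["poly"].n_output_features_

print(f"polynomial features: {db_n_poly}")
print(f"mean CV R^2: {db_r2.mean():.3f}")
```

With 10 original features, a degree-2 expansion without the bias column yields 65 features, which is why a selection step is worth trying.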

In [None]:
# YOUR CODE HERE
# Space to solve exercise 2

# Hint: from sklearn.preprocessing import PolynomialFeatures


### Exercise 3: Complete Pipeline

**Goal:** Build a robust pipeline for mixed data

**Tasks:**
1. Generate a dataset with numeric and categorical features
2. Implement a ColumnTransformer
3. Handle missing values
4. Build the complete pipeline
5. Optimize hyperparameters
6. Save the complete pipeline
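
Tasks 1 to 4 could begin along these lines. The synthetic columns (`age`, `income`, `city`), the imputation strategies, and the `mx_` names are hypothetical choices made for this sketch; hyperparameter search and saving are left for you to add.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic mixed-type data (made up for illustration).
rng = np.random.default_rng(42)
n_rows = 200
mx_df = pd.DataFrame({
    "age": rng.normal(40, 10, n_rows),
    "income": rng.normal(50_000, 15_000, n_rows),
    "city": rng.choice(["north", "south", "east"], n_rows),
})
mx_y = (mx_df["income"] > 50_000).astype(int)  # target defined before adding NaNs

# Inject missing values into one numeric and one categorical column.
mx_df.loc[rng.choice(n_rows, 20, replace=False), "age"] = np.nan
mx_df.loc[rng.choice(n_rows, 20, replace=False), "city"] = np.nan

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

mx_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

mx_pipe = Pipeline([
    ("preprocessor", mx_preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
mx_pipe.fit(mx_df, mx_y)

mx_transformed = mx_pipe.named_steps["preprocessor"].transform(mx_df)
mx_pred = mx_pipe.predict(mx_df)
print(f"transformed shape: {mx_transformed.shape}")
```

Each column type gets its own imputer and encoder, and the whole object can be saved with joblib as a single artifact.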

In [None]:
# YOUR CODE HERE
# Space to solve exercise 3


---

## 1️⃣3️⃣ Final Project: Complete ML System

### 🎯 Project Description

Develop a complete Machine Learning system that includes:

1. **Data loading and exploration**
2. **Robust preprocessing**
3. **Comparison of multiple models**
4. **Hyperparameter optimization**
5. **Thorough evaluation**
6. **Saving and documentation**

### 📋 Requirements

- Use a real dataset (from Kaggle or sklearn, for example)
- Implement at least 5 different models
- Build a complete Pipeline
- Include cross-validation
- Generate informative visualizations
- Document the entire process
- Save the final model with metadata

### 🏆 Evaluation Criteria

1. **Code quality** (20%)
2. **Exploratory analysis** (15%)
3. **Preprocessing** (15%)
4. **Modeling** (25%)
5. **Evaluation** (15%)
6. **Documentation** (10%)

### 💡 Suggested Datasets

- Titanic (classification)
- House Prices (regression)
- Credit Card Fraud (imbalanced classification)
- Wine Quality (multiclass classification)

Good luck! 🚀

In [None]:
# FINAL PROJECT - YOUR CODE HERE

# 1. Load data


# 2. EDA


# 3. Preprocessing


# 4. Modeling


# 5. Evaluation


# 6. Saving


---

## 1️⃣4️⃣ References and Additional Resources

### 📚 Official Documentation

- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn API Reference](https://scikit-learn.org/stable/modules/classes.html)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [NumPy Documentation](https://numpy.org/doc/)
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)

### 📖 Recommended Books

- "Hands-On Machine Learning" - Aurélien Géron
- "Introduction to Machine Learning with Python" - Andreas Müller
- "Python Data Science Handbook" - Jake VanderPlas
- "The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman

### 🎓 Online Courses

- [Coursera: Machine Learning by Andrew Ng](https://www.coursera.org/learn/machine-learning)
- [Fast.ai: Practical Deep Learning](https://www.fast.ai/)
- [Google's Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)

### 🛠️ Useful Tools

- [Kaggle](https://www.kaggle.com/) - Datasets and competitions
- [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php) - Classic datasets
- [Papers With Code](https://paperswithcode.com/) - State of the art

### 💬 Communities

- [Stack Overflow - Machine Learning](https://stackoverflow.com/questions/tagged/machine-learning)
- [Reddit - r/MachineLearning](https://www.reddit.com/r/MachineLearning/)
- [Kaggle Forums](https://www.kaggle.com/discussion)

---

## 🎉 Congratulations!

You have completed the AI Fundamentals notebook. You now have the tools you need to:

✅ Prepare and explore data  
✅ Train multiple models  
✅ Evaluate and compare results  
✅ Optimize hyperparameters  
✅ Deploy models to production  

**Next steps:**
1. Practice with real datasets
2. Take part in Kaggle competitions
3. Go deeper into advanced topics (Deep Learning, NLP, Computer Vision)
4. Contribute to open source projects

**Keep learning and building! 🚀**

---

*Notebook created by Alexander for the AI Fundamentals course*  
*Last updated: November 2025*