# 📘 Session 10: Scikit-learn - Classic Machine Learning

---

## 🎯 Objectives

- Understand the ML pipeline
- Preprocess data correctly
- Implement classification and regression models
- Evaluate and validate models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

## 1. Machine Learning Pipeline

In [None]:
# Load data
iris = load_iris()
X, y = iris.data, iris.target

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Classes: {iris.target_names}")

In [None]:
# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
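
To see what `stratify=y` does in the split above, the class counts in each partition can be checked directly. This is a small self-contained sketch; the names `train_counts` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each partition
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Iris has 50 samples per class, so with stratification each class contributes 40 training samples and 10 test samples. Without `stratify`, the counts would depend on the random shuffle.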

## 2. Preprocessing

In [None]:
# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on train
X_test_scaled = scaler.transform(X_test)        # transform only on test

print("Mean before:", X_train.mean(axis=0).round(2))
print("Mean after:", X_train_scaled.mean(axis=0).round(2))
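
The reason the test set only gets `transform` is that the scaler reuses the mean and standard deviation it learned from the training data. A minimal sketch of that arithmetic, assuming the same Iris split as above (the name `manual_test` is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# StandardScaler stores the per-feature mean (mean_) and standard deviation (scale_)
# of the training data and applies z = (x - mean) / std to any data it transforms
manual_test = (X_test - scaler.mean_) / scaler.scale_

print("Matches scaler.transform:", np.allclose(manual_test, scaler.transform(X_test)))
print("Scaled train mean:", scaler.transform(X_train).mean(axis=0).round(2))
print("Scaled test mean:", scaler.transform(X_test).mean(axis=0).round(2))
```

The scaled training features have mean 0 and standard deviation 1 by construction. The scaled test features usually do not, because their statistics were never used, and that is the intended behavior.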

In [None]:
# Full pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
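
A fitted `Pipeline` is equivalent to calling each step by hand. The sketch below, which assumes the same Iris split and step names as the cell above, pulls the fitted steps out through `named_steps` and checks that the manual route gives the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Access the fitted steps by the names given in the Pipeline definition
fitted_scaler = pipeline.named_steps['scaler']
fitted_clf = pipeline.named_steps['classifier']

# Manual route: scale the test data, then predict with the classifier
manual_pred = fitted_clf.predict(fitted_scaler.transform(X_test))
same_predictions = np.array_equal(manual_pred, pipeline.predict(X_test))

print("Pipeline and manual predictions match:", same_predictions)
```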

## 3. Classification Models

In [None]:
# Compare models
modelos = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

for nombre, modelo in modelos.items():
    pipeline = Pipeline([('scaler', StandardScaler()), ('clf', modelo)])
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{nombre}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
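
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. Because the scaler lives inside the pipeline, it is refit on the training part of every fold. The sketch below writes that loop out by hand with an explicit `StratifiedKFold`; the shuffling and the seed are choices made here for illustration, not part of the cell above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])

# Explicit splitter: stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe, X_iris, y_iris, cv=cv)

# The same procedure written out: refit a fresh copy of the pipeline per fold
manual_scores = []
for train_idx, val_idx in cv.split(X_iris, y_iris):
    fold_pipe = clone(pipe)
    fold_pipe.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_pipe.score(X_iris[val_idx], y_iris[val_idx]))

print("cross_val_score:", cv_scores.round(4))
print("manual loop:    ", np.round(manual_scores, 4))
```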

In [None]:
# Detailed metrics
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
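
The precision and recall columns of the classification report can be read straight off the confusion matrix. In scikit-learn the rows are the actual classes and the columns are the predicted classes. A sketch under the same Iris split; the variable names are illustrative, and the model is trained on unscaled features here, which is an assumption that should not matter much for a tree ensemble.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
rf_pred = rf.predict(X_test)

cm = confusion_matrix(y_test, rf_pred)
true_pos = np.diag(cm)

# Precision: correct predictions of a class / everything predicted as that class (column sum)
manual_precision = true_pos / cm.sum(axis=0)
# Recall: correct predictions of a class / everything that truly is that class (row sum)
manual_recall = true_pos / cm.sum(axis=1)

print("Precision per class:", manual_precision.round(3))
print("Recall per class:   ", manual_recall.round(3))
```

These values should agree with `precision_score(..., average=None)` and `recall_score(..., average=None)`. The manual division only works when every class is predicted at least once.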

## 4. Cross-Validation and GridSearch

In [None]:
# GridSearchCV for hyperparameter optimization
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
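
The search above runs on data that was scaled once, before the folds were created, so every validation fold has already contributed to the scaler's statistics. One way to avoid this is to search over a `Pipeline` and address the model's hyperparameters with the `step__parameter` naming convention. The sketch below uses a deliberately smaller grid than the cell above to keep it quick, and the step name `clf` is a choice made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter name>'
pipe_grid = {
    'clf__n_estimators': [50, 100],
    'clf__max_depth': [None, 5],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring='accuracy')
pipe_search.fit(X_train, y_train)  # raw features: scaling happens inside each fold

# cv_results_ holds one entry per parameter combination
for params, mean_score in zip(pipe_search.cv_results_['params'],
                              pipe_search.cv_results_['mean_test_score']):
    print(f"{params}: {mean_score:.4f}")

print(f"Test accuracy of the best pipeline: {pipe_search.score(X_test, y_test):.4f}")
```

For tree-based models scaling has little effect, so the difference is mostly one of habit. With scale-sensitive models such as SVC or LogisticRegression, keeping the scaler inside the search matters more.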

## 5. Feature Importance

In [None]:
# Feature importance
best_model = grid_search.best_estimator_
importances = best_model.feature_importances_

plt.figure(figsize=(10, 5))
plt.barh(iris.feature_names, importances)
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest')
plt.show()
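
`feature_importances_` is computed from impurity reductions on the training data. A common complement is permutation importance, which measures how much the score drops on held-out data when one feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance`; the model settings and the choice of `n_repeats=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42,
    stratify=iris_data.target
)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the drop in accuracy
perm = permutation_importance(rf_model, X_test, y_test, n_repeats=10, random_state=42)

for name, impurity_imp, perm_imp in zip(iris_data.feature_names,
                                        rf_model.feature_importances_,
                                        perm.importances_mean):
    print(f"{name:20s} impurity: {impurity_imp:.3f}  permutation: {perm_imp:.3f}")
```

Impurity-based importances sum to 1, while permutation importances are score differences and can be zero or slightly negative for features the model does not rely on.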

---
## 🏋️ Solved Exercises

In [None]:
# Exercise 1: Full binary classification pipeline
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                          n_redundant=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', GradientBoostingClassifier())
])

scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {scores.mean():.4f}")

pipeline.fit(X_train, y_train)
print(f"Test Accuracy: {pipeline.score(X_test, y_test):.4f}")

---
## 📝 Practice Exercises

In [None]:
# Exercise 1: Implement regression with multiple models
# Your code here
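
One possible starting point for this exercise, with a single model. It assumes synthetic data from `make_regression`, which is a stand-in chosen for this sketch rather than a dataset used in the session. Adding more regressors and comparing them is left as the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration; swap in a real dataset)
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# No stratify here: the target is continuous
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])
reg_pipeline.fit(Xr_train, yr_train)
yr_pred = reg_pipeline.predict(Xr_test)

mse = mean_squared_error(yr_test, yr_pred)
r2 = r2_score(yr_test, yr_pred)
print(f"MSE: {mse:.2f}  RMSE: {mse ** 0.5:.2f}  R2: {r2:.4f}")
```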

In [None]:
# Exercise 2: Pipeline with preprocessing of mixed data
# Your code here
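
A hedged sketch of how numeric and categorical columns can be routed to different transformers with `ColumnTransformer`. The tiny DataFrame and its column names (`age`, `income`, `city`, `bought`) are invented purely for illustration and do not come from the session.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, made up for this example only
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29, 60, 44],
    'income': [30.0, 42.5, 80.0, 65.0, 55.0, 38.0, 90.0, 70.0],
    'city': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'bought': [0, 0, 1, 1, 0, 0, 1, 1],
})
X_mixed = df[['age', 'income', 'city']]
y_mixed = df['bought']

# Scale the numeric columns, one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

mixed_pipeline = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression())
])
mixed_pipeline.fit(X_mixed, y_mixed)

# 2 scaled numeric columns + 3 one-hot columns (cities A, B, C)
n_columns = mixed_pipeline.named_steps['preprocess'].transform(X_mixed).shape[1]
print("Columns after preprocessing:", n_columns)
```

`handle_unknown='ignore'` makes the encoder output all zeros for a category it did not see during `fit`, instead of raising an error at prediction time.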

In [None]:
# Exercise 3: Compare ROC curves of several models
# Your code here

---
## 🎯 Summary

- **Pipeline**: train_test_split → preprocess → fit → predict → evaluate
- **Preprocessing**: StandardScaler, OneHotEncoder
- **Models**: LogisticRegression, RandomForest, SVM, GradientBoosting
- **Validation**: cross_val_score, GridSearchCV
- **Metrics**: accuracy, precision, recall, F1, confusion_matrix