# Part 3: Classification with Classical Descriptors and Deep Learning

**Workshop 3 - Medical Image Classification**

Universidad Nacional de Colombia - Computer Vision

---

## Objectives
1. Build the feature matrix for the entire dataset
2. Normalize the features (StandardScaler)
3. Reduce dimensionality (PCA)
4. Train and evaluate multiple classifiers:
   - Support Vector Machine (SVM)
   - Random Forest
   - k-Nearest Neighbors (k-NN)
   - Logistic Regression
   - Convolutional Neural Network (CNN)
5. Cross-validation
6. Feature importance analysis


## 1. Setup and Imports


In [None]:
# Standard imports
import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Add the parent directory to sys.path so the src package can be imported
sys.path.insert(0, os.path.abspath('..'))

# Imports from our own modules
from src.data_loader import load_image_paths, split_by_set, labels_to_numeric
from src.preprocessing import read_and_preprocess
from src.features import build_feature_matrix, normalize_features, apply_pca
from src.classical_models import (
    get_default_models, 
    train_all_models, 
    cross_validate_models,
    get_feature_importance_rf,
    get_feature_importance_linear
)
from src.evaluation import compute_metrics, print_metrics, compare_models
from src.visualization import plot_confusion_matrix, plot_roc_curve, plot_training_history

# Configuration
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Imports completed successfully ✓")


In [None]:
# Load data
DATA_DIR = "../data/chest_xray/chest_xray"
paths, labels = load_image_paths(DATA_DIR)
(paths_train, labels_train), (paths_val, labels_val), (paths_test, labels_test) = split_by_set(paths, labels)

# Convert labels to numeric values
y_train = labels_to_numeric(labels_train)
y_val = labels_to_numeric(labels_val)
y_test = labels_to_numeric(labels_test)

print("Data loaded:")
print(f"  Train: {len(paths_train)} images")
print(f"  Val:   {len(paths_val)} images")
print(f"  Test:  {len(paths_test)} images")
print("\nClass distribution in train:")
print(f"  NORMAL (0):    {np.sum(y_train == 0)}")
print(f"  PNEUMONIA (1): {np.sum(y_train == 1)}")


## 2. Feature Extraction

We build the feature matrix using all the implemented descriptors (HOG, Hu, Fourier, LBP, GLCM, Gabor).

⚠️ **Note**: This step can take several minutes, depending on the size of the dataset.


In [None]:
%%time
# Extract features for every split
print("Extracting features from the training set...")
X_train = build_feature_matrix(paths_train, show_progress=True)

print("\nExtracting features from the validation set...")
X_val = build_feature_matrix(paths_val, show_progress=True)

print("\nExtracting features from the test set...")
X_test = build_feature_matrix(paths_test, show_progress=True)

print("\n✓ Extraction complete:")
print(f"  X_train: {X_train.shape}")
print(f"  X_val:   {X_val.shape}")
print(f"  X_test:  {X_test.shape}")


## 3. Feature Normalization

We use StandardScaler to standardize the features (mean=0, std=1).
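
As a minimal sketch of what this step involves, the example below standardizes synthetic data with scikit-learn's `StandardScaler`. It assumes the scaler is fitted on the training split only and then reused on the other splits; the internals of `normalize_features` are not shown in this notebook, so the variable names and data here are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrices (hypothetical shapes and values).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 4))

# Fit on the training split only, so no test statistics leak into training.
scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)

# Reuse the training mean and std for the held-out split.
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training columns have mean 0 and std 1; test columns are only close to that.
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

Reusing the training statistics on validation and test data is what keeps the evaluation honest: the held-out splits are transformed exactly as unseen data would be.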


In [None]:
# Normalize the features
X_train_norm, X_val_norm, X_test_norm, scaler = normalize_features(X_train, X_val, X_test)

print("Normalized features:")
print(f"  Mean of X_train_norm: {X_train_norm.mean():.6f}")
print(f"  Std of X_train_norm:  {X_train_norm.std():.6f}")

# Visualize the distribution before and after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sample of features before normalization
sample_features = X_train[:, :5]
axes[0].boxplot(sample_features)
axes[0].set_title('Original Features (first 5)')
axes[0].set_xlabel('Feature')
axes[0].set_ylabel('Value')

# After normalization
sample_features_norm = X_train_norm[:, :5]
axes[1].boxplot(sample_features_norm)
axes[1].set_title('Normalized Features (first 5)')
axes[1].set_xlabel('Feature')
axes[1].set_ylabel('Value (standardized)')

plt.tight_layout()
plt.savefig('../results/03_normalization.png', dpi=150, bbox_inches='tight')
plt.show()


## 4. Dimensionality Reduction (PCA)

We apply PCA to reduce dimensionality while preserving 95% of the variance.
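
The variance threshold can be illustrated with scikit-learn's `PCA`, which accepts a float in (0, 1) as `n_components` and keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below uses synthetic data; whether `apply_pca` is implemented this way is an assumption, since its source is not shown here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 correlated features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# A float n_components selects components by explained-variance fraction.
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"{X_demo.shape[1]} features -> {pca_demo.n_components_} components")
print(f"Cumulative explained variance: {cumulative.round(4)}")
```

As with the scaler, the projection would be fitted on the training split and then applied unchanged to the validation and test splits.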


In [None]:
# Apply PCA
X_train_pca, X_val_pca, X_test_pca, pca = apply_pca(X_train_norm, X_val_norm, X_test_norm, variance_ratio=0.95)

print("Dimensionality reduction with PCA:")
print(f"  Original dimensions: {X_train_norm.shape[1]}")
print(f"  Dimensions after PCA: {X_train_pca.shape[1]}")
print(f"  Reduction: {100*(1 - X_train_pca.shape[1]/X_train_norm.shape[1]):.1f}%")
print(f"  Explained variance: {pca.explained_variance_ratio_.sum()*100:.1f}%")

# Plot the cumulative explained variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(10, 5))
# Start the x-axis at 1 so it reads as a component count
plt.plot(np.arange(1, len(cumsum) + 1), cumsum, 'b-', linewidth=2)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Variance Explained by PCA')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('../results/03_pca_variance.png', dpi=150, bbox_inches='tight')
plt.show()


## 5. Training the Classical Classifiers

We train and evaluate multiple classifiers:
- SVM with a linear kernel
- SVM with an RBF kernel
- Random Forest
- k-NN (k=5)
- Logistic Regression
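
For orientation, the list above could map to a dictionary of scikit-learn estimators like the one sketched here. The helper `make_demo_models` is hypothetical and is not the project's `get_default_models`. Only the keys `SVM-linear` and `RandomForest` appear later in this notebook; the other keys and all hyperparameters except k=5 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def make_demo_models(seed=0):
    """Hypothetical stand-in for get_default_models(); settings are assumptions."""
    return {
        "SVM-linear": SVC(kernel="linear", random_state=seed),
        "SVM-rbf": SVC(kernel="rbf", random_state=seed),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }


# Synthetic binary problem, split into a fitting part and a held-out part.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_fit, y_fit = X_demo[:200], y_demo[:200]
X_eval, y_eval = X_demo[200:], y_demo[200:]

demo_scores = {}
for name, clf in make_demo_models().items():
    clf.fit(X_fit, y_fit)
    demo_scores[name] = clf.score(X_eval, y_eval)  # accuracy on held-out rows
    print(f"{name}: {demo_scores[name]:.3f}")
```

Keeping the estimators in a dictionary keyed by name makes it straightforward to loop over them for training, evaluation, and plotting.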


In [None]:
# Get the default models
models = get_default_models()
print("Models to train:")
for name in models.keys():
    print(f"  - {name}")


In [None]:
%%time
# Train and evaluate all models
results = train_all_models(
    models,
    X_train_pca, y_train,
    X_val_pca, y_val,
    X_test_pca, y_test,
    verbose=True
)


In [None]:
# Compare results
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)

# Build the results table
import pandas as pd

comparison_data = []
for name, res in results.items():
    comparison_data.append({
        'Model': name,
        'Val Accuracy': f"{res['val_accuracy']:.4f}",
        'Val F1': f"{res['val_f1']:.4f}",
        'Test Accuracy': f"{res['test_accuracy']:.4f}",
        'Test F1': f"{res['test_f1']:.4f}"
    })

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

# Best model
best_model_name = max(results, key=lambda x: results[x]['test_f1'])
print(f"\n🏆 Best model (by test F1): {best_model_name}")
print(f"   Test F1: {results[best_model_name]['test_f1']:.4f}")


In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

model_names = list(results.keys())
test_acc = [results[m]['test_accuracy'] for m in model_names]
test_f1 = [results[m]['test_f1'] for m in model_names]

# Accuracy
axes[0].barh(model_names, test_acc, color='steelblue')
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Test Accuracy by Model')
axes[0].set_xlim([0.5, 1.0])
for i, v in enumerate(test_acc):
    axes[0].text(v + 0.01, i, f'{v:.3f}', va='center')

# F1 Score
axes[1].barh(model_names, test_f1, color='coral')
axes[1].set_xlabel('F1 Score')
axes[1].set_title('Test F1 Score by Model')
axes[1].set_xlim([0.5, 1.0])
for i, v in enumerate(test_f1):
    axes[1].text(v + 0.01, i, f'{v:.3f}', va='center')

plt.tight_layout()
plt.savefig('../results/03_model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()


## 6. Cross-Validation

We evaluate the models with 5-fold cross-validation to obtain a more robust estimate of performance.
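
The idea can be sketched with scikit-learn's `cross_val_score` on synthetic, imbalanced data. This example wraps the scaler, PCA, and classifier in a pipeline so that each fold refits the preprocessing on its own training portion; that is a design choice of this sketch, and it is an assumption that `cross_validate_models` behaves similarly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (about 75% positives) as a hypothetical stand-in.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.25, 0.75], random_state=0
)

# Preprocessing lives inside the pipeline, so it is refit within every fold.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the class ratio roughly constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")

print(f"F1 per fold: {np.round(scores, 3)}")
print(f"Mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```

When the scaler and PCA are instead fitted once on the full training set before cross-validation, each fold's held-out part has already influenced the preprocessing, so the resulting scores can be slightly optimistic.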


In [None]:
%%time
# Cross-validation
cv_results = cross_validate_models(models, X_train_pca, y_train, cv=5, scoring='f1', verbose=True)


In [None]:
# Visualize the CV results
cv_means = [cv_results[m][0] for m in model_names]
cv_stds = [cv_results[m][1] for m in model_names]

plt.figure(figsize=(10, 6))
plt.barh(model_names, cv_means, xerr=cv_stds, color='teal', capsize=5)
plt.xlabel('F1 Score (CV)')
plt.title('Cross-Validation (5-fold F1 Score)')
plt.xlim([0.5, 1.0])
for i, (mean, std) in enumerate(zip(cv_means, cv_stds)):
    plt.text(mean + std + 0.02, i, f'{mean:.3f}±{std:.3f}', va='center')
plt.tight_layout()
plt.savefig('../results/03_cross_validation.png', dpi=150, bbox_inches='tight')
plt.show()


## 7. Confusion Matrices


In [None]:
# Plot the confusion matrices for all models
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, (name, res) in enumerate(results.items()):
    if idx < 5:
        cm = res['confusion_matrix']
        ax = axes[idx]
        im = ax.imshow(cm, cmap='Blues')
        ax.set_title(f'{name}\nAcc={res["test_accuracy"]:.3f}, F1={res["test_f1"]:.3f}')
        ax.set_xticks([0, 1])
        ax.set_yticks([0, 1])
        ax.set_xticklabels(['NORMAL', 'PNEUMONIA'])
        ax.set_yticklabels(['NORMAL', 'PNEUMONIA'])
        ax.set_xlabel('Predicted')
        ax.set_ylabel('Actual')
        
        for i in range(2):
            for j in range(2):
                ax.text(j, i, str(cm[i, j]), ha='center', va='center', 
                       color='white' if cm[i, j] > cm.max()/2 else 'black', fontsize=14)

# Hide the last (unused) subplot
axes[5].axis('off')

plt.tight_layout()
plt.savefig('../results/03_confusion_matrices.png', dpi=150, bbox_inches='tight')
plt.show()


## 8. Feature Importance

We analyze which features matter most according to Random Forest and the linear models.
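
A minimal sketch of the two importance measures, on synthetic data: a Random Forest exposes `feature_importances_`, and a linear-kernel SVM exposes `coef_`, whose absolute values can be ranked. The project helpers `get_feature_importance_rf` and `get_feature_importance_linear` may work differently; this only illustrates the underlying scikit-learn attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

rf_demo = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

# Impurity-based importances: non-negative and summing to 1.
rf_importance = rf_demo.feature_importances_
# Magnitude of each weight in the linear decision function.
svm_importance = np.abs(svm_demo.coef_[0])

# Indices of the three highest-ranked features, most important first.
top_rf = np.argsort(rf_importance)[::-1][:3]
top_svm = np.argsort(svm_importance)[::-1][:3]
print("Top RF features: ", top_rf)
print("Top SVM features:", top_svm)
```

Two caveats apply when reading such rankings. Coefficient magnitudes are only comparable when the inputs share a common scale. And because the classifiers in this notebook are trained on PCA outputs, the ranked "features" are principal components rather than the original descriptors.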


In [None]:
# Feature importance - Random Forest
rf_model = results['RandomForest']['model']
top_features_rf = get_feature_importance_rf(rf_model, top_n=20, verbose=True)


In [None]:
# Feature importance - Linear SVM
svm_model = results['SVM-linear']['model']
top_features_svm = get_feature_importance_linear(svm_model, top_n=20, verbose=True)


In [None]:
# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random Forest
importances_rf = rf_model.feature_importances_
top_idx_rf = np.argsort(importances_rf)[-15:]
axes[0].barh(range(15), importances_rf[top_idx_rf], color='forestgreen')
axes[0].set_yticks(range(15))
axes[0].set_yticklabels([f'PC{i}' for i in top_idx_rf])
axes[0].set_xlabel('Importance')
axes[0].set_title('Top 15 Features - Random Forest')

# Linear SVM
coefs_svm = np.abs(svm_model.coef_)[0]
top_idx_svm = np.argsort(coefs_svm)[-15:]
axes[1].barh(range(15), coefs_svm[top_idx_svm], color='darkorange')
axes[1].set_yticks(range(15))
axes[1].set_yticklabels([f'PC{i}' for i in top_idx_svm])
axes[1].set_xlabel('|Coefficient|')
axes[1].set_title('Top 15 Features - Linear SVM')

plt.tight_layout()
plt.savefig('../results/03_feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()


---
## 9. Classification with a CNN (Deep Learning)

We train a simple CNN that works directly on the images instead of the hand-crafted features.


In [None]:
# Imports for deep learning
import torch
from src.deep_learning import SimpleCNN, ChestXrayDataset, create_dataloaders, train_cnn, evaluate_cnn
from src.deep_learning.models import get_device

device = get_device()
print(f"Device: {device}")


In [None]:
# Create the DataLoaders
train_loader, val_loader, test_loader = create_dataloaders(
    paths_train, labels_train,
    paths_val, labels_val,
    paths_test, labels_test,
    batch_size=32
)

print(f"Train batches: {len(train_loader)}")
print(f"Val batches:   {len(val_loader)}")
print(f"Test batches:  {len(test_loader)}")


In [None]:
# Create the CNN model and inspect it
model = SimpleCNN()
print("Model architecture:")
print(model)
print(f"\nTrainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")


In [None]:
%%time
# Train the CNN (few epochs for the demo; increase for better results)
EPOCHS = 10  # Increase to 30-50 for better results

cnn_result = train_cnn(
    model, 
    train_loader, 
    val_loader,
    labels_train=labels_train,
    epochs=EPOCHS,
    lr=1e-4,
    use_class_weights=True,
    verbose=True
)


In [None]:
# Evaluate the CNN on the test set
print("\nCNN evaluation on the TEST set:")
cnn_metrics = evaluate_cnn(cnn_result['model'], test_loader, verbose=True)


In [None]:
# Plot the training history
if 'history' in cnn_result:
    history = cnn_result['history']
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Loss
    axes[0].plot(history['train_loss'], 'b-', label='Train')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Training Loss')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Accuracy
    axes[1].plot(history['val_accuracy'], 'g-', label='Val')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_title('Validation Accuracy')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # F1
    axes[2].plot(history['val_f1'], 'r-', label='Val')
    axes[2].set_xlabel('Epoch')
    axes[2].set_ylabel('F1 Score')
    axes[2].set_title('Validation F1 Score')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('../results/03_cnn_training.png', dpi=150, bbox_inches='tight')
    plt.show()


In [None]:
# Confusion matrix and ROC curve for the CNN
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Confusion matrix
cm = cnn_metrics['confusion_matrix']
im = axes[0].imshow(cm, cmap='Blues')
axes[0].set_title(f'CNN - Confusion Matrix\nAcc={cnn_metrics["accuracy"]:.3f}, F1={cnn_metrics["f1"]:.3f}')
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['NORMAL', 'PNEUMONIA'])
axes[0].set_yticklabels(['NORMAL', 'PNEUMONIA'])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
for i in range(2):
    for j in range(2):
        axes[0].text(j, i, str(cm[i, j]), ha='center', va='center',
                    color='white' if cm[i, j] > cm.max()/2 else 'black', fontsize=14)

# ROC curve
axes[1].plot(cnn_metrics['fpr'], cnn_metrics['tpr'], 'b-', linewidth=2, 
             label=f'CNN (AUC = {cnn_metrics["auc"]:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve - CNN')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../results/03_cnn_evaluation.png', dpi=150, bbox_inches='tight')
plt.show()


## 10. Final Comparison: Classical vs Deep Learning


In [None]:
# Final comparison
print("="*70)
print("FINAL COMPARISON: CLASSICAL TECHNIQUES vs DEEP LEARNING")
print("="*70)

# Comparison table
all_results = []
for name, res in results.items():
    all_results.append({
        'Model': name,
        'Type': 'Classical',
        'Test Acc': res['test_accuracy'],
        'Test F1': res['test_f1']
    })

all_results.append({
    'Model': 'SimpleCNN',
    'Type': 'Deep Learning',
    'Test Acc': cnn_metrics['accuracy'],
    'Test F1': cnn_metrics['f1']
})

comparison_final = pd.DataFrame(all_results)
comparison_final = comparison_final.sort_values('Test F1', ascending=False)
print(comparison_final.to_string(index=False))

# Best overall model
best_overall = comparison_final.iloc[0]
print(f"\n🏆 BEST OVERALL MODEL: {best_overall['Model']}")
print(f"   Type: {best_overall['Type']}")
print(f"   Test Accuracy: {best_overall['Test Acc']:.4f}")
print(f"   Test F1 Score: {best_overall['Test F1']:.4f}")


In [None]:
# Final comparison chart
fig, ax = plt.subplots(figsize=(12, 6))

models_sorted = comparison_final['Model'].tolist()
f1_scores = comparison_final['Test F1'].tolist()
types = comparison_final['Type'].tolist()

colors = ['coral' if t == 'Classical' else 'steelblue' for t in types]
bars = ax.barh(models_sorted, f1_scores, color=colors)

ax.set_xlabel('Test F1 Score')
ax.set_title('Final Comparison: All Models')
ax.set_xlim([0.5, 1.0])

for i, (bar, f1) in enumerate(zip(bars, f1_scores)):
    ax.text(f1 + 0.01, bar.get_y() + bar.get_height()/2, f'{f1:.3f}', 
            va='center', fontweight='bold')

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='coral', label='Classical Techniques'),
                   Patch(facecolor='steelblue', label='Deep Learning')]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('../results/03_final_comparison.png', dpi=150, bbox_inches='tight')
plt.show()


## 11. Conclusions

### Summary of results:

**Classical techniques (hand-crafted descriptors + ML):**
- Require manual feature design (HOG, LBP, GLCM, etc.)
- Shorter training time
- Interpretability: we can analyze which features are important
- Good results with small datasets

**Deep Learning (CNN):**
- Learns features automatically
- Longer training time (especially without a GPU)
- Potential for better performance with more data and more epochs
- Less interpretable ("black box")

### Observations:
1. Class imbalance affects the performance of all models
2. Normalization and PCA significantly improve the results of the classical models
3. The CNN needs more epochs to reach its full potential
4. Texture descriptors (LBP, GLCM) are particularly useful for radiographs

### Recommendations:
- For production: train the CNN for more epochs (30-50) and use data augmentation
- For interpretability: use Random Forest or a linear SVM
- Consider class-balancing techniques (SMOTE, class weights)
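
As an illustration of the class-weight option, the sketch below computes "balanced" weights, where each class is weighted by `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn applies for `class_weight='balanced'`. The function name and the label counts are hypothetical, and it is an assumption that `use_class_weights=True` in `train_cnn` does something comparable.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Hypothetical label vector: 100 NORMAL (0) and 300 PNEUMONIA (1).
y_demo = np.array([0] * 100 + [1] * 300)
weights_demo = balanced_class_weights(y_demo)

# The minority class receives the larger weight.
print(weights_demo)
```

With these weights, errors on the minority class cost more in the loss, which counteracts the tendency of a model to favor the majority class.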


In [None]:
# Save the CNN model
from src.deep_learning.training import save_model

os.makedirs('../models', exist_ok=True)
save_model(cnn_result['model'], '../models/simple_cnn.pt', history=cnn_result.get('history'))

print("\n✓ Results saved to ../results/")
print("  - 03_normalization.png")
print("  - 03_pca_variance.png")
print("  - 03_model_comparison.png")
print("  - 03_cross_validation.png")
print("  - 03_confusion_matrices.png")
print("  - 03_feature_importance.png")
print("  - 03_cnn_training.png")
print("  - 03_cnn_evaluation.png")
print("  - 03_final_comparison.png")
print("\n✓ Model saved to ../models/simple_cnn.pt")
