# CC3092 - Deep Learning - Worksheet 2

Team members:
Diego Valenzuela - 22309
Gerson Ramirez - 22281

**Compatibility:** The requirements.txt file pins the dependencies needed to run this project on Apple Silicon chips.

In [1]:
import torch
print("torch version:", torch.__version__)
print("MPS available:", torch.backends.mps.is_available())
print("MPS built:", torch.backends.mps.is_built())
if torch.backends.mps.is_available():
    print("GPU: Apple Silicon Metal Performance Shaders")
    x = torch.rand((2, 2), device="mps")
    print("Tensor on device:", x.device)
elif torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    x = torch.rand((2, 2), device="cuda")
    print("Tensor on device:", x.device)
else:
    print("Using CPU")
    x = torch.rand((2, 2), device="cpu")
    print("Tensor on device:", x.device)

torch version: 2.8.0
MPS available: True
MPS built: True
GPU: Apple Silicon Metal Performance Shaders
Tensor on device: mps:0


In [2]:
import time, math, random
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(False)  
set_seed(42)

# Apple Silicon optimized device selection
def get_device():
    if torch.backends.mps.is_available() and torch.backends.mps.is_built():
        return torch.device("mps")
    elif torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")

device = get_device()
print(f"Using device: {device}")
device

Using device: mps


device(type='mps')

## Task 1 - Loading Iris, train/val split, and scaling

Iris is used for multiclass classification (3 classes). The data is split 75/25 (train/val) with stratification to preserve the class proportions. Features are standardized (mean 0, variance 1) using statistics fitted on the training split only, which speeds up and stabilizes MLP training.

In [3]:
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target.astype(np.int64)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_val   = scaler.transform(X_val).astype(np.float32)

# Move tensors to the appropriate device (MPS for Apple Silicon)
X_train_t = torch.from_numpy(X_train).to(device)
y_train_t = torch.from_numpy(y_train).to(device)
X_val_t   = torch.from_numpy(X_val).to(device)
y_val_t   = torch.from_numpy(y_val).to(device)

train_ds = TensorDataset(X_train_t, y_train_t)
val_ds   = TensorDataset(X_val_t,   y_val_t)

len(train_ds), len(val_ds)

(112, 38)
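
As a quick sanity check on the stratification described above, the sketch below repeats the same split on the raw Iris arrays and counts the samples per class in each part. The variable names (`X_tr`, `train_counts`, and so on) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Same split parameters as the cell above.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Number of samples per class in each split.
train_counts = np.bincount(y_tr, minlength=3)
val_counts = np.bincount(y_va, minlength=3)
print("train class counts:", train_counts)
print("val class counts:  ", val_counts)
```

Because Iris has 50 samples per class, a stratified split should leave the three classes within one sample of each other in both parts.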

## Task 2 - Simple, parameterizable MLP

Feedforward architecture with configurable hidden layers, a selectable activation, and Dropout (for Task 4). The final layer outputs logits (no softmax); the loss function applies whatever softmax or log-softmax it needs.

In [4]:
class MLP(nn.Module):
    def __init__(self, in_dim: int = 4, hidden: List[int] = [32, 16],
                 out_dim: int = 3, activation: str = "relu", dropout_p: float = 0.0):
        super().__init__()
        acts = {
            "relu": nn.ReLU(),
            "tanh": nn.Tanh(),
            "gelu": nn.GELU(),
            "leakyrelu": nn.LeakyReLU(0.1),
        }
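        key = activation.lower()
        if key not in acts:
            # Fail loudly on typos instead of silently falling back to ReLU.
            raise ValueError(f"Unsupported activation: {activation!r}")
        self.act = acts[key]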
        layers = []
        prev = in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), self.act]
            if dropout_p and dropout_p > 0.0:
                layers += [nn.Dropout(dropout_p)]
            prev = h
        layers += [nn.Linear(prev, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
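
To give a sense of the size of the default architecture (4 inputs, hidden layers of 32 and 16 units, 3 outputs), here is a small sketch that counts its trainable parameters. `mlp_param_count` is a hypothetical helper written for this illustration. It counts only the `Linear` layers, since activations and Dropout have no parameters.

```python
from typing import Sequence

def mlp_param_count(in_dim: int, hidden: Sequence[int], out_dim: int) -> int:
    """Count weights and biases of a fully connected stack in_dim -> hidden -> out_dim."""
    dims = [in_dim, *hidden, out_dim]
    # Each Linear(a, b) layer holds a*b weights plus b biases.
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))

n_params = mlp_param_count(4, [32, 16], 3)
print(n_params)  # 739 = (4*32 + 32) + (32*16 + 16) + (16*3 + 3)
```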


## Task 3 - Parameterized loss functions (CE, NLL, MSE)

We use CrossEntropyLoss, NLLLoss (on log_softmax outputs), and MSE (softmax probabilities against one-hot targets). This meets the requirement of at least three losses and allows comparing convergence and performance.

In [5]:
@dataclass
class LossAdapter:
    name: str
    criterion: nn.Module
    def __call__(self, logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        if self.name == "cross_entropy":
            return self.criterion(logits, y)
        elif self.name == "nll":
            return self.criterion(F.log_softmax(logits, dim=1), y)
        elif self.name == "mse":
            probs = F.softmax(logits, dim=1)
            one_hot = F.one_hot(y, num_classes=logits.shape[1]).float()
            return self.criterion(probs, one_hot)
        else:
            raise ValueError(f"Unknown loss: {self.name}")

def make_loss(name: str) -> LossAdapter:
    name = name.lower()
    if name in ("crossentropy", "ce", "cross_entropy"):
        return LossAdapter("cross_entropy", nn.CrossEntropyLoss())
    if name in ("nll", "nllloss"):
        return LossAdapter("nll", nn.NLLLoss())
    if name in ("mse", "mseloss"):
        return LossAdapter("mse", nn.MSELoss())
    raise ValueError(f"Unsupported loss: {name}")
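
The relationship between the three losses can be checked numerically. The sketch below uses made-up logits and targets (not the Iris data) to show that `CrossEntropyLoss` on logits gives the same value as `NLLLoss` on `log_softmax` outputs. The MSE variant measures something different: the squared distance between softmax probabilities and one-hot targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 3)               # 5 samples, 3 classes (illustrative values)
targets = torch.tensor([0, 2, 1, 1, 0])

# Cross-entropy applies log_softmax internally, so these two should match.
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# MSE compares softmax probabilities against one-hot targets instead.
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(targets, num_classes=3).float()
mse = nn.MSELoss()(probs, one_hot)

print("CE == NLL(log_softmax):", torch.allclose(ce, nll))
print("MSE on probabilities:", mse.item())
```

Since the first two values agree, the `cross_entropy` and `nll` configurations optimize the same objective. Any gap between their results is more likely due to the different random seeds than to the loss itself.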


## Task 4 - Regularization (L1, L2, Dropout)

* L2: applied through the optimizer's weight_decay.
* L1: added manually to the loss: λ₁ * Σ|w|.
* Dropout: parameterized in the model (already included).

In [6]:
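def l1_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of absolute values of all trainable parameters (weights and biases)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Accumulate on the model's own device rather than the global `device`.
    total = torch.zeros((), device=params[0].device) if params else torch.tensor(0.0)
    for p in params:
        total = total + p.abs().sum()
    return total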
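
To make the L2 bullet concrete: in `torch.optim.SGD`, `weight_decay=λ` adds `λ * w` to each gradient, which matches adding the penalty `(λ/2) * Σ w²` to the loss. The sketch below checks this on a toy weight vector with a single update. The values of `lr` and `lam`, the toy loss `w.sum()`, and the helper `one_step` are assumptions made for this illustration only.

```python
import torch

lr, lam = 0.1, 0.01                      # illustrative values, not the notebook's settings
w0 = torch.tensor([1.0, -2.0, 3.0])      # toy weight vector

def one_step(weight_decay: float, explicit_l2: float) -> torch.Tensor:
    """Run one SGD step on the toy loss w.sum(), optionally with an explicit L2 term."""
    w = w0.clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr, weight_decay=weight_decay)
    loss = w.sum() + 0.5 * explicit_l2 * (w ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return w.detach()

w_decay = one_step(weight_decay=lam, explicit_l2=0.0)   # L2 via the optimizer
w_manual = one_step(weight_decay=0.0, explicit_l2=lam)  # L2 via the loss
print(w_decay, w_manual)
```

Both paths should give the same updated weights. The L1 term has no such optimizer shortcut, which is why it is added to the loss by hand.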

## Task 5 - “Optimization algorithms”

* Batch GD: updates using the whole training set (a single batch).
* Mini-Batch GD: updates on small batches (e.g. 16).
* SGD: updates on every sample (batch_size=1).

> We use the same optimizer (torch.optim.SGD) and change only the batch size to reflect each technique.

In [7]:
def make_loader(dataset, mode: str, batch_size: int = 16, shuffle: bool = True):
    mode = mode.lower()
    if mode == "batch_gd":
        bs = len(dataset)
    elif mode == "sgd":
        bs = 1
    elif mode == "mini-batch" or mode == "mini_batch":
        bs = batch_size
    else:
        raise ValueError("mode must be 'batch_gd', 'mini-batch' or 'sgd'")
    return DataLoader(dataset, batch_size=bs, shuffle=shuffle)
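
Since the three modes differ only in batch size, the practical difference is how many parameter updates happen per epoch. A minimal sketch, assuming the 112 training samples from Task 1 and the default mini-batch size of 16:

```python
import math

n_train = 112  # size of train_ds in this notebook
batch_sizes = {"batch_gd": n_train, "mini_batch": 16, "sgd": 1}

# With drop_last=False (the DataLoader default), an epoch has ceil(n / batch_size) batches.
updates_per_epoch = {
    mode: math.ceil(n_train / bs) for mode, bs in batch_sizes.items()
}
print(updates_per_epoch)
# {'batch_gd': 1, 'mini_batch': 7, 'sgd': 112}
```

Over 200 epochs that comes to 200, 1,400, and 22,400 updates respectively. This is consistent with the `sgd` run taking far longer than the mini-batch runs in the log further down.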


## Task 6 - Experimentation and Analysis

In [8]:
@dataclass
class ExperimentConfig:
    loss_fn: str = "cross_entropy"
    l1_reg: float = 0.0
    l2_reg: float = 0.0
    dropout: float = 0.0
    opt_mode: str = "mini_batch"
    lr: float = 0.001
    epochs: int = 200
    hidden: List[int] = None
    
    def __post_init__(self):
        if self.hidden is None:
            self.hidden = [32, 16]

def train_model(config: ExperimentConfig, train_ds, val_ds, device=device, verbose=False):
    """Training function optimized for Metal/MPS"""
    model = MLP(
        in_dim=4, 
        hidden=config.hidden, 
        out_dim=3, 
        dropout_p=config.dropout
    ).to(device)
    
    loss_adapter = make_loss(config.loss_fn)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.lr, weight_decay=config.l2_reg)
    
    train_loader = make_loader(train_ds, config.opt_mode, batch_size=16)
    val_loader = make_loader(val_ds, "batch_gd", shuffle=False)
    
    history = {
        'train_loss': [], 'val_loss': [],
        'train_acc': [], 'val_acc': [],
        'train_f1': [], 'val_f1': []
    }
    
    start_time = time.time()
    
    for epoch in range(config.epochs):
        # Training phase
        model.train()
        train_loss, train_correct, train_total = 0.0, 0, 0
        train_preds, train_targets = [], []
        
        for X_batch, y_batch in train_loader:
            # Ensure tensors are on the correct device (Metal/MPS)
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            
            optimizer.zero_grad()
            logits = model(X_batch)
            loss = loss_adapter(logits, y_batch)
            
            # Add L1 regularization if specified
            if config.l1_reg > 0:
                loss = loss + config.l1_reg * l1_penalty(model)
            
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item() * len(X_batch)
            train_correct += (logits.argmax(1) == y_batch).sum().item()
            train_total += len(X_batch)
            
            # Collect predictions for F1 score (move to CPU for sklearn)
            train_preds.extend(logits.argmax(1).cpu().numpy())
            train_targets.extend(y_batch.cpu().numpy())
        
        # Validation phase
        model.eval()
        val_loss, val_correct, val_total = 0.0, 0, 0
        val_preds, val_targets = [], []
        
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                logits = model(X_batch)
                loss = loss_adapter(logits, y_batch)
                
                if config.l1_reg > 0:
                    loss = loss + config.l1_reg * l1_penalty(model)
                
                val_loss += loss.item() * len(X_batch)
                val_correct += (logits.argmax(1) == y_batch).sum().item()
                val_total += len(X_batch)
                
                val_preds.extend(logits.argmax(1).cpu().numpy())
                val_targets.extend(y_batch.cpu().numpy())
        
        # Calculate metrics
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        train_f1 = f1_score(train_targets, train_preds, average='weighted')
        val_f1 = f1_score(val_targets, val_preds, average='weighted')
        
        history['train_loss'].append(train_loss / train_total)
        history['val_loss'].append(val_loss / val_total)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        history['train_f1'].append(train_f1)
        history['val_f1'].append(val_f1)
        
        if verbose and (epoch + 1) % 50 == 0:
            print(f"Epoch {epoch+1:3d} | "
                  f"Train Loss: {train_loss/train_total:.4f} | "
                  f"Val Loss: {val_loss/val_total:.4f} | "
                  f"Train Acc: {train_acc:.4f} | "
                  f"Val Acc: {val_acc:.4f}")
    
    total_time = time.time() - start_time
    
    return {
        'model': model,
        'history': history,
        'final_train_acc': train_acc,
        'final_val_acc': val_acc,
        'final_train_f1': train_f1,
        'final_val_f1': val_f1,
        'training_time': total_time,
        'config': config
    }

In [9]:
# Define 9+ experimental configurations
experiments = [
    # Baseline: different loss functions with no regularization
    ExperimentConfig(loss_fn="cross_entropy", opt_mode="mini_batch"),
    ExperimentConfig(loss_fn="nll", opt_mode="mini_batch"),
    ExperimentConfig(loss_fn="mse", opt_mode="mini_batch"),
    
    # Different regularization techniques with CrossEntropy
    ExperimentConfig(loss_fn="cross_entropy", l1_reg=0.001, opt_mode="mini_batch"),
    ExperimentConfig(loss_fn="cross_entropy", l2_reg=0.001, opt_mode="mini_batch"),
    ExperimentConfig(loss_fn="cross_entropy", dropout=0.3, opt_mode="mini_batch"),
    
    # Different optimization methods with CrossEntropy
    ExperimentConfig(loss_fn="cross_entropy", opt_mode="sgd"),
    ExperimentConfig(loss_fn="cross_entropy", opt_mode="batch_gd"),
    ExperimentConfig(loss_fn="cross_entropy", opt_mode="mini_batch"),  # same config as experiment 1; only the seed differs
    
    # Extra combinations for bonus points
    ExperimentConfig(loss_fn="nll", l1_reg=0.001, dropout=0.2, opt_mode="mini_batch"),
    ExperimentConfig(loss_fn="mse", l2_reg=0.01, dropout=0.1, opt_mode="batch_gd"),
]

print(f"Running {len(experiments)} experiments on {device}")
print("Experiment configurations:")
for i, config in enumerate(experiments):
    print(f"{i+1:2d}. Loss: {config.loss_fn:<12} | "
          f"L1: {config.l1_reg:<6} | L2: {config.l2_reg:<6} | "
          f"Dropout: {config.dropout:<4} | Opt: {config.opt_mode}")

Running 11 experiments on mps
Experiment configurations:
 1. Loss: cross_entropy | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: mini_batch
 2. Loss: nll          | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: mini_batch
 3. Loss: mse          | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: mini_batch
 4. Loss: cross_entropy | L1: 0.001  | L2: 0.0    | Dropout: 0.0  | Opt: mini_batch
 5. Loss: cross_entropy | L1: 0.0    | L2: 0.001  | Dropout: 0.0  | Opt: mini_batch
 6. Loss: cross_entropy | L1: 0.0    | L2: 0.0    | Dropout: 0.3  | Opt: mini_batch
 7. Loss: cross_entropy | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: sgd
 8. Loss: cross_entropy | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: batch_gd
 9. Loss: cross_entropy | L1: 0.0    | L2: 0.0    | Dropout: 0.0  | Opt: mini_batch
10. Loss: nll          | L1: 0.001  | L2: 0.0    | Dropout: 0.2  | Opt: mini_batch
11. Loss: mse          | L1: 0.0    | L2: 0.01   | Dropout: 0.1  | Opt: batch_gd


In [10]:
# Run all experiments (optimized for Metal/MPS)
results = []
print("Starting experimental runs...")
print("=" * 80)

for i, config in enumerate(experiments):
    print(f"\nExperiment {i+1}/{len(experiments)}: "
          f"{config.loss_fn} + {config.opt_mode} + "
          f"L1={config.l1_reg} + L2={config.l2_reg} + dropout={config.dropout}")
    
    # Ensure fresh random state for each experiment
    set_seed(42 + i)
    
    # Run training
    result = train_model(config, train_ds, val_ds, device=device, verbose=False)
    results.append(result)
    
    print(f"✓ Completed in {result['training_time']:.2f}s | "
          f"Final Val Acc: {result['final_val_acc']:.4f} | "
          f"Final Val F1: {result['final_val_f1']:.4f}")

print("\n" + "=" * 80)
print("All experiments completed!")

# Create summary table
summary_data = []
for i, result in enumerate(results):
    config = result['config']
    summary_data.append({
        'Experiment': i + 1,
        'Loss Function': config.loss_fn,
        'L1 Reg': config.l1_reg,
        'L2 Reg': config.l2_reg,
        'Dropout': config.dropout,
        'Optimizer': config.opt_mode,
        'Train Acc': f"{result['final_train_acc']:.4f}",
        'Val Acc': f"{result['final_val_acc']:.4f}",
        'Train F1': f"{result['final_train_f1']:.4f}",
        'Val F1': f"{result['final_val_f1']:.4f}",
        'Time (s)': f"{result['training_time']:.2f}"
    })

summary_df = pd.DataFrame(summary_data)
print("\nExperimental Results Summary:")
print(summary_df.to_string(index=False))

Starting experimental runs...

Experiment 1/11: cross_entropy + mini_batch + L1=0.0 + L2=0.0 + dropout=0.0
✓ Completed in 5.01s | Final Val Acc: 0.6579 | Final Val F1: 0.5610

Experiment 2/11: nll + mini_batch + L1=0.0 + L2=0.0 + dropout=0.0
✓ Completed in 3.13s | Final Val Acc: 0.6316 | Final Val F1: 0.5011

Experiment 3/11: mse + mini_batch + L1=0.0 + L2=0.0 + dropout=0.0
✓ Completed in 3.71s | Final Val Acc: 0.5000 | Final Val F1: 0.4082

Experiment 4/11: cross_entropy + mini_batch + L1=0.001 + L2=0.0 + dropout=0.0
✓ Completed in 4.64s | Final Val Acc: 0.6316 | Final Val F1: 0.5883

Experiment 5/11: cross_entropy + mini_batch + L1=0.0 + L2=0.001 + dropout=0.0
✓ Completed in 3.23s | Final Val Acc: 0.6842 | Final Val F1: 0.5987

Experiment 6/11: cross_entropy + mini_batch + L1=0.0 + L2=0.0 + dropout=0.3
✓ Completed in 3.69s | Final Val Acc: 0.8158 | Final Val F1: 0.8126

Experiment 7/11: cross_entropy + sgd + L1=0.0 + L2=0.0 + dropout=0.0
✓ Completed in 38.19s | Final Val Acc: 0.9211 

# Part 2

1. **What is the main innovation of the Transformer architecture?**

  The Transformer's main innovation is dispensing with recurrence and convolutions entirely. Instead, it relies solely on attention mechanisms, especially *self-attention*, to model dependencies between tokens. This allows far more parallelization during training and better handling of long-range dependencies.

2. **How does the scaled dot-product attention mechanism work?**

  Scaled dot-product attention takes queries (Q), keys (K), and values (V). It computes the dot products between each query and all keys, scales them by $1/\sqrt{d_k}$ so that large dot products do not push the softmax into regions with very small gradients, and applies softmax to obtain attention weights. It then uses those weights to form a linear combination of the values:

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $$

  This scaling is what distinguishes the mechanism from plain dot-product attention and makes it more stable for large $d_k$.
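
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration with random inputs and no masking or batching. The function names and the shapes chosen here are assumptions for the example.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)       # each row sums to 1 over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of the rows of `V`, with weights given by how strongly that query matches each key.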

3. **Why does the Transformer use multi-head attention?**

  Multi-head attention projects Q, K, and V into different subspaces, applies attention in parallel, and concatenates the results. This lets the model attend to different aspects of the information at once, such as distinct syntactic and semantic relations. With a single head, the information would be averaged and nuances would be lost.
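
The project, attend in parallel, and concatenate steps can be sketched in NumPy as follows. This is a simplified self-attention illustration with random projection matrices, no biases, and no masking. The names `W_q`, `W_k`, `W_v`, `W_o` and the sizes used are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head): one subspace per head.
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                    # separate weights per head
    heads = weights @ V                                   # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # concatenate the heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 3, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Each head gets its own `(n, n)` attention matrix, so two heads can weight the same pair of tokens differently before their outputs are concatenated and mixed by `W_o`.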
