## Experiment tracking (Wandb)

<a target="_blank" href="https://colab.research.google.com/github/pglez82/DeepLearningWeb/blob/master/labs/notebooks/Monitorizaci%C3%B3n%20de%20experimentos%20(Wandb).ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Experiment tracking is a key part of deep learning work. We usually try many configurations, models, and so on, so we need a tool that keeps a record of the experiments we have run and of their results.

One such tool is **Weights and Biases**, which gives us a collaborative dashboard where all our experiments are stored.

The first step is to go to the [Weights and Biases](https://wandb.ai/site) website and create an account (it is free).

### Installing the required packages
Using Weights and Biases is straightforward: it only requires installing one package.

In [1]:
!pip install wandb

Defaulting to user installation because normal site-packages is not writeable
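
To confirm that the installation worked before running the rest of the notebook, a quick check such as the following can help. It is only a sketch: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module rather than on anything specific to wandb.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical usage: report whether wandb can be imported in this environment.
version = installed_version("wandb")
if version is None:
    print("wandb is not installed; run `pip install wandb` first")
else:
    print(f"wandb {version} is available")
```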


### Training and validation loops with wandb

The training and validation loops are the same as usual; the only difference is that we interleave a few calls that log the results to the wandb platform.

In [2]:
import wandb
import math
import random
import torch, torchvision
import torch.nn as nn
import torchvision.transforms as T

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def get_dataloader(is_train, batch_size, slice=5):
    "Get a training or validation dataloader over every `slice`-th MNIST sample"
    full_dataset = torchvision.datasets.MNIST(root=".", train=is_train, transform=T.ToTensor(), download=True)
    sub_dataset = torch.utils.data.Subset(full_dataset, indices=range(0, len(full_dataset), slice))
    # Shuffle only the training data; the validation order stays fixed
    loader = torch.utils.data.DataLoader(dataset=sub_dataset, batch_size=batch_size, shuffle=is_train, num_workers=2)
    return loader

def get_model(dropout):
    # Simple model, just for testing
    model = nn.Sequential(nn.Flatten(),
                         nn.Linear(28*28, 256),
                         nn.BatchNorm1d(256),
                         nn.ReLU(),
                         nn.Dropout(dropout),
                         nn.Linear(256,10)).to(device)
    return model

def validate_model(model, valid_dl, loss_func, log_images=False, batch_idx=0):
    model.eval()
    val_loss = 0.
    with torch.inference_mode():
        correct = 0
        for i, (images, labels) in enumerate(valid_dl):
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            # CrossEntropyLoss returns the batch mean, so weight it by the batch size
            val_loss += loss_func(outputs, labels).item()*labels.size(0)

            # Compute accuracy and accumulate
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == labels).sum().item()

            # Log one batch of images to the dashboard, always same batch_idx.
            if i==batch_idx and log_images:
                log_image_table(images, predicted, labels, outputs.softmax(dim=1))
    return val_loss / len(valid_dl.dataset), correct / len(valid_dl.dataset)

def log_image_table(images, predicted, labels, probs):
    "Log a wandb.Table with (img, pred, target, scores)"
    # 🐝 Create a wandb Table to log images, labels and predictions to
    table = wandb.Table(columns=["image", "pred", "target"]+[f"score_{i}" for i in range(10)])
    for img, pred, targ, prob in zip(images.to("cpu"), predicted.to("cpu"), labels.to("cpu"), probs.to("cpu")):
        table.add_data(wandb.Image(img[0].numpy()*255), pred, targ, *prob.numpy())
    wandb.log({"predictions_table":table}, commit=False)
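
`validate_model` multiplies each batch loss by `labels.size(0)` before dividing by the dataset size. The reason is that `nn.CrossEntropyLoss` returns the mean over the batch by default, and the last batch is usually smaller than the others, so a plain average of batch means would weight its samples too heavily. The sketch below illustrates the difference in plain Python; the function name `dataset_mean_loss` and the loss values are made up for illustration.

```python
def dataset_mean_loss(batch_means, batch_sizes):
    """Per-sample mean loss over a dataset, rebuilt from per-batch mean losses."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical values: two full batches of 256 samples and a final partial batch.
batch_means = [0.30, 0.20, 0.50]
batch_sizes = [256, 256, 208]

weighted = dataset_mean_loss(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)  # ignores that the last batch is smaller

print(f"weighted by batch size: {weighted:.4f}")
print(f"plain mean of batches:  {naive:.4f}")
```

The two numbers differ because the plain mean gives the 208-sample batch the same weight as the 256-sample ones.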

In [3]:
def train(config):    
    # Get the data
    train_dl = get_dataloader(is_train=True, batch_size=config.batch_size)
    valid_dl = get_dataloader(is_train=False, batch_size=2*config.batch_size)
    n_steps_per_epoch = math.ceil(len(train_dl.dataset) / config.batch_size)
    
    # A simple MLP model
    model = get_model(config.dropout)

    # Make the loss and optimizer
    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)

    # Training
    example_ct = 0
    step_ct = 0
    for epoch in range(config.epochs):
        model.train()
        for step, (images, labels) in enumerate(train_dl):
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            train_loss = loss_func(outputs, labels)
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()
            
            example_ct += len(images)
            metrics = {"train/train_loss": train_loss, 
                       "train/epoch": (step + 1 + (n_steps_per_epoch * epoch)) / n_steps_per_epoch, 
                       "train/example_ct": example_ct}
            
            if step + 1 < n_steps_per_epoch:
                # 🐝 Log train metrics to wandb 
                wandb.log(metrics)
                
            step_ct += 1

        val_loss, accuracy = validate_model(model, valid_dl, loss_func, log_images=(epoch==(config.epochs-1)))

        # 🐝 Log train and validation metrics to wandb
        val_metrics = {"val/val_loss": val_loss, 
                       "val/val_accuracy": accuracy}
        wandb.log({**metrics, **val_metrics})
        
        print(f"Train Loss: {train_loss:.3f}, Valid Loss: {val_loss:3f}, Accuracy: {accuracy:.2f}")

    # 🐝 Close your wandb run 
    wandb.finish()
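
Two details of `train` are easier to follow with numbers. First, `"train/epoch"` is a fractional epoch counter that reaches a whole number at the end of each epoch. Second, the last step of every epoch is not logged on its own: its metrics are merged with the validation metrics into a single `wandb.log` call, presumably so that both end up on the same step of the dashboard. The sketch below reproduces that bookkeeping without wandb or PyTorch; `fractional_epoch` and `simulate_log_calls` are hypothetical helper names, and the list of dictionaries is only a stand-in for the calls to `wandb.log`.

```python
import math


def fractional_epoch(step, epoch, n_steps_per_epoch):
    """Same expression as the "train/epoch" metric: progress measured in epochs."""
    return (step + 1 + n_steps_per_epoch * epoch) / n_steps_per_epoch


# MNIST has 60,000 training images; keeping every 5th one leaves 12,000.
n_train = len(range(0, 60_000, 5))
n_steps_per_epoch = math.ceil(n_train / 128)  # batch_size=128, as in this notebook's runs


def simulate_log_calls(epochs, n_steps):
    """Record the payload of each logging call (stand-in for wandb.log)."""
    calls = []
    for epoch in range(epochs):
        for step in range(n_steps):
            metrics = {"train/epoch": fractional_epoch(step, epoch, n_steps)}
            if step + 1 < n_steps:
                calls.append(metrics)  # every step except the last of the epoch
        # Last step of the epoch: train and validation metrics go out together.
        calls.append({**metrics, "val/val_loss": None})  # placeholder value
    return calls


calls = simulate_log_calls(epochs=2, n_steps=n_steps_per_epoch)
print(f"{n_train} samples, {n_steps_per_epoch} steps per epoch, {len(calls)} log calls")
```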

### Launching several experiments

In this example we launch 5 experiments, each with a different dropout value. All of them are logged to our dashboard.
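
Because the dropout values are drawn at random, every execution of the notebook produces a different set of runs. If reproducibility matters, one option is to draw the values from a seeded generator first and then pass each resulting dictionary as `config`. This is a minimal sketch under that assumption; `sample_configs` and its `seed` parameter are hypothetical and not part of wandb.

```python
import random


def sample_configs(n_runs, seed=0):
    """Build one config dict per run, with dropout drawn uniformly from [0.01, 0.80]."""
    rng = random.Random(seed)  # private generator: the global random state is untouched
    return [
        {
            "epochs": 10,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": rng.uniform(0.01, 0.80),
        }
        for _ in range(n_runs)
    ]


configs = sample_configs(n_runs=5)
for i, cfg in enumerate(configs):
    print(f"run_{i}: dropout={cfg['dropout']:.3f}")
```

Calling `sample_configs(5)` again with the same seed returns the same five dropout values, so a sweep can be repeated exactly.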

In [4]:
for experiment in range(5):
    # 🐝 initialise a wandb run
    wandb.init(project="test-wandb", name="run_{}".format(experiment), config={
            "epochs": 10,
            "batch_size": 128,
            "lr": 1e-3,
            "dropout": random.uniform(0.01, 0.80),
            })
    
    # Copy your config 
    config = wandb.config
    train(config)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: pglez82. Use `wandb login --relogin` to force relogin


Train Loss: 0.436, Valid Loss: 0.277615, Accuracy: 0.92
Train Loss: 0.217, Valid Loss: 0.219783, Accuracy: 0.94
Train Loss: 0.217, Valid Loss: 0.199790, Accuracy: 0.93
Train Loss: 0.089, Valid Loss: 0.188260, Accuracy: 0.95
Train Loss: 0.140, Valid Loss: 0.154381, Accuracy: 0.95
Train Loss: 0.062, Valid Loss: 0.164541, Accuracy: 0.95
Train Loss: 0.041, Valid Loss: 0.149887, Accuracy: 0.95
Train Loss: 0.043, Valid Loss: 0.150042, Accuracy: 0.95
Train Loss: 0.032, Valid Loss: 0.159821, Accuracy: 0.95
Train Loss: 0.018, Valid Loss: 0.146578, Accuracy: 0.95


wandb: Network error (TransientError), entering retry loop.
wandb: ERROR Control-C detected -- Run data was not synced


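
When the W&B servers cannot be reached, the client keeps retrying, which can stall a notebook. One possible workaround is to record the runs locally and upload them later. The sketch below assumes that the installed wandb version honors the `WANDB_MODE` environment variable. The helper `set_wandb_mode` is hypothetical: it only sets that variable and would have to run before `wandb.init`.

```python
import os

# Values assumed to be accepted by the WANDB_MODE environment variable.
VALID_MODES = {"online", "offline", "disabled"}


def set_wandb_mode(mode):
    """Set WANDB_MODE for this process; call it before wandb.init."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {sorted(VALID_MODES)}")
    os.environ["WANDB_MODE"] = mode


# "offline" keeps the run data on disk instead of sending it over the network.
set_wandb_mode("offline")
print("WANDB_MODE =", os.environ["WANDB_MODE"])
```

Runs recorded this way can be uploaded afterwards with the `wandb sync` command once the connection is back.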
