# Prov4ML Torch Example

This notebook is a simple example of how to use Prov4ML with PyTorch and the MNIST dataset. The task is digit classification with a minimal fully connected model (a single linear layer).

#### Importing the necessary libraries and defining constants

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torchvision import transforms
from torch.utils.data import DataLoader, Subset
from tqdm import tqdm

import prov4ml


In [None]:
PATH_DATASETS = "./data"
BATCH_SIZE = 64
EPOCHS = 2


#### Creating the experiment and starting a run

Initialize a new run within an experiment and start logging provenance data.
The call sets the user namespace, the experiment name, the directory where provenance logs are saved, and how often the logs are written to disk.
 - **prov_user_namespace**: The unique identifier for the user or organization, ensuring the provenance data is correctly attributed.
 - **experiment_name**: The name of the experiment, used to group related runs together.
 - **provenance_save_dir**: The directory where the provenance logs are stored.
 - **save_after_n_logs**: The number of logged entries after which the logs are written to file, which frees the values held in memory.

In [4]:
prov4ml.start_run(
    prov_user_namespace="www.example.org",
    experiment_name="experiment_name", 
    provenance_save_dir="prov",
    save_after_n_logs=100,
)
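The role of `save_after_n_logs` can be pictured with a small buffer that keeps records in memory and writes them out once a threshold is reached. The sketch below is only an illustration of that idea using the standard library: the `BufferedLog` class, the JSON-lines file and the record fields are assumptions made here, not prov4ml's actual implementation or file format.

```python
import json
import os
import tempfile


class BufferedLog:
    """Hypothetical buffer: hold records in memory, flush every n entries."""

    def __init__(self, path, save_after_n_logs):
        self.path = path
        self.save_after_n_logs = save_after_n_logs
        self.buffer = []
        self.flush_count = 0

    def log(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.save_after_n_logs:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Append one JSON object per line, then empty the in-memory buffer.
        with open(self.path, "a", encoding="utf-8") as f:
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        self.buffer.clear()
        self.flush_count += 1


log_path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
log = BufferedLog(log_path, save_after_n_logs=100)
for step in range(250):
    log.log({"name": "loss", "value": 1.0 / (step + 1), "step": step})

flushes_during_run = log.flush_count  # automatic flushes triggered by the threshold
pending = len(log.buffer)             # records still held in memory
log.flush()                           # final flush, as closing a run would require
```

A smaller threshold keeps less in memory but touches the disk more often; a larger one does the opposite.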

#### Defining the model and the datasets

Prov4ML provides logging functions that track metrics and parameters as part of the experiment’s provenance.
- **log_metric**: Logs a metric value to the provenance data, recording the value, time, epoch and context.
- **log_param**: Logs a parameter used in the experiment to the provenance data.

When defining the dataset transformations, datasets and data loaders, prov4ml can log the relevant information through the `log_dataset` and `log_param` functions.
- **log_dataset**: Logs information extracted from the dataset used in the experiment.
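To make the description of `log_metric` concrete, the following sketch shows one way a metric entry carrying a value, a time, a step and a context could be represented. Everything in it is hypothetical: `MetricRecord`, `record_metric` and this `Context` enum are stand-ins written for illustration and do not reflect how prov4ml stores metrics internally.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class Context(Enum):
    # Hypothetical stand-in for the contexts used in this notebook.
    TRAINING = "training"
    EVALUATION = "evaluation"


@dataclass
class MetricRecord:
    name: str
    value: float
    context: Context
    step: int
    # Wall-clock time at which the record was created.
    timestamp: float = field(default_factory=time.time)


records = []


def record_metric(name, value, context, step):
    """Store one metric value together with its context and step."""
    records.append(MetricRecord(name, float(value), context, step))


record_metric("loss", 0.71, Context.TRAINING, step=0)
record_metric("loss", 0.64, Context.TRAINING, step=1)
record_metric("loss", 0.69, Context.EVALUATION, step=1)

# The context makes it possible to separate training values from evaluation values.
training_values = [r.value for r in records if r.context is Context.TRAINING]
```

Keeping the context next to each value is what allows the same metric name to be reused for training and evaluation without mixing the two series.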

In [3]:
class MNISTModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(28 * 28, 10), 
        )

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

In [5]:
tform = transforms.Compose([
    transforms.RandomRotation(10), 
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor()
])

prov4ml.log_param("dataset transformation", tform)

train_ds = MNIST(PATH_DATASETS, train=True, download=True, transform=tform)
test_ds = MNIST(PATH_DATASETS, train=False, download=True, transform=tform)
train_ds = Subset(train_ds, range(BATCH_SIZE*4))
test_ds = Subset(test_ds, range(BATCH_SIZE*2))

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE)

prov4ml.log_dataset(train_loader, "train_ds")
prov4ml.log_dataset(test_loader, "test_ds")

#### Training the model

Train the MNIST model with PyTorch, saving a model version after each epoch, then evaluate it on the test dataset and log the final model with prov4ml.

In [6]:

mnist_model = MNISTModel()
optim = torch.optim.Adam(mnist_model.parameters(), lr=0.0002)
prov4ml.log_param("optimizer", "Adam")

for epoch in tqdm(range(EPOCHS)):
    for i, (x, y) in enumerate(train_loader):
        optim.zero_grad()
        y_hat = mnist_model(x)
        loss = F.cross_entropy(y_hat, y)
        loss.backward()
        optim.step()
        # The loss is cross-entropy, so name the metric accordingly and log a plain float
        prov4ml.log_metric("cross_entropy_train", loss.item(), context=prov4ml.Context.TRAINING, step=epoch)
    
    prov4ml.log_carbon_metrics(prov4ml.Context.TRAINING, step=epoch)
    prov4ml.log_system_metrics(prov4ml.Context.TRAINING, step=epoch)
    prov4ml.save_model_version(mnist_model, f"mnist_model_version_{epoch}", prov4ml.Context.TRAINING, epoch)
        
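# Evaluate without tracking gradients; `epoch` still holds the last training epoch
mnist_model.eval()
with torch.no_grad():
    for i, (x, y) in tqdm(enumerate(test_loader)):
        y_hat = mnist_model(x)
        loss = F.cross_entropy(y_hat, y)
        prov4ml.log_metric("cross_entropy_test", loss.item(), prov4ml.Context.EVALUATION, step=epoch)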

prov4ml.log_model(mnist_model, "mnist_model_final")

100%|██████████| 2/2 [00:00<00:00, 29.95it/s]
2it [00:00, 213.47it/s]
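Because the training loop logs one loss per batch under the same `step=epoch`, several values share each step. If a single number per epoch is wanted, the per-batch values can be averaged afterwards. The sketch below does this on made-up `(step, value)` pairs shaped like this run (four batches per epoch, two epochs); the numbers are invented and the code does not read anything back from prov4ml.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (step, value) pairs: four batches per epoch, two epochs.
logged = [
    (0, 2.0), (0, 1.5), (0, 1.0), (0, 1.5),
    (1, 1.0), (1, 0.5), (1, 0.5), (1, 1.0),
]

# Group the per-batch values by the step they were logged under.
by_epoch = defaultdict(list)
for step, value in logged:
    by_epoch[step].append(value)

# Reduce each group to its mean to obtain one value per epoch.
epoch_means = {step: mean(values) for step, values in sorted(by_epoch.items())}
```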


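#### Closing the run and saving the provenance as ProvJSON

Ending the run saves the provenance data to a ProvJSON file for further analysis and visualization. The `create_graph` and `create_svg` flags additionally request a graph of the provenance and an SVG rendering of it.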

In [8]:
prov4ml.end_run(create_graph=True, create_svg=True)

fatal: not a git repository (or any of the parent directories): .git


Git not found, skipping commit hash retrieval
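For orientation, a PROV-JSON document is a JSON object that groups records by type (`entity`, `activity`, relations such as `wasGeneratedBy`) under a shared `prefix` table. The sketch below builds a tiny hand-written document and reads it back with the standard `json` module. All identifiers under the `ex` prefix are made up for illustration; the file prov4ml writes will contain different identifiers and many more records, and its exact name and layout are not shown in this notebook.

```python
import json

# Hand-written PROV-JSON; every identifier under the "ex" prefix is made up.
doc = {
    "prefix": {"ex": "http://www.example.org/"},
    "entity": {
        "ex:mnist_model_final": {"prov:type": "ex:Model"},
    },
    "activity": {
        "ex:training_run": {},
    },
    "wasGeneratedBy": {
        "_:gen1": {
            "prov:entity": "ex:mnist_model_final",
            "prov:activity": "ex:training_run",
        },
    },
}

# Round-trip through text, as if the document had been written to and read from a file.
text = json.dumps(doc, indent=2)
loaded = json.loads(text)

# Map each generated entity to the activity that produced it.
generated = {
    rel["prov:entity"]: rel["prov:activity"]
    for rel in loaded.get("wasGeneratedBy", {}).values()
}
```

The same pattern (load the JSON, then walk one record type at a time) applies to any PROV-JSON file, whatever tool produced it.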
