# Reproducible Training of nnU-Net

This notebook reproduces the training of nnU-Net on the dataset of your choice. Choosing the initialization is not available yet.
This Jupyter Notebook is available on the branch test_leo (``git checkout test_leo``).

**Very important:** before running anything in this notebook, open an Onyxia service named "Vscode-pytorch-gpu".

**What is missing?**
- Early stopping (current heuristic: stop manually after about 80 epochs, which takes ~5 h)
- Different initializations

## 1. Requirements


Python libraries required to run the training and to download/upload files:

In [5]:
!pip install nnunetv2 tqdm s3fs
import os
import subprocess
import threading
import time
from pathlib import Path

import s3fs
import torch
from tqdm import tqdm

Collecting argparse (from unittest2->batchgenerators>=0.25.1->nnunetv2)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


Before training, configure Git with your identity: your ENSAE email (e.g. blabla.blabla@ensae.fr) and your Onyxia username.

In [None]:
email = input("Enter your email ENSAE: ")
name = input("Enter your username Onyxia: ")

subprocess.run(["git", "config", "--global", "user.email", email])
subprocess.run(["git", "config", "--global", "user.name", name])

print(f"Git configured with email : {email} and username : {name}")

Git configured with email : leo.leroy@ensae.fr and username : leoacpr


Next, enter your S3 credentials. They are available on Onyxia under Account > Connexion au stockage (storage connection).

In [6]:
from getpass import getpass

# getpass hides what you type, so the secrets do not stay visible in the notebook
aws_access_key_id = input("Enter your AWS_ACCESS_KEY_ID: ")
aws_secret_access_key = getpass("Enter your AWS_SECRET_ACCESS_KEY: ")
aws_session_token = getpass("Enter your AWS_SESSION_TOKEN: ")

# Environment variables, read by s3fs below
os.environ["AWS_ACCESS_KEY_ID"] = aws_access_key_id
os.environ["AWS_SECRET_ACCESS_KEY"] = aws_secret_access_key
os.environ["AWS_SESSION_TOKEN"] = aws_session_token

print("AWS keys configured as environment variables.")


AWS keys configured as environment variables.
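A blank or truncated paste in one of the prompts above only shows up later, as an S3 authentication error. A minimal sketch of an early check, assuming the three variable names set above (`missing_aws_vars` is a hypothetical helper, not part of any library):

```python
import os

# Names of the variables set in the cell above
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_aws_vars(environ=os.environ):
    """Return the required AWS variables that are unset or contain only whitespace."""
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name, "").strip()]


missing = missing_aws_vars()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All AWS variables are set.")
```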


## 2. Downloading files from S3

The datasets are stored on the S3 service provided by Onyxia, under the path ``projet-statapp-segmedic/diffusion``. Download them locally by running the code below. Estimated time: 4 minutes.

In [7]:
# Connect to Onyxia's MinIO S3 service
s3 = s3fs.S3FileSystem(
    client_kwargs={'endpoint_url': 'https://minio.lab.sspcloud.fr'},
    key=os.getenv("AWS_ACCESS_KEY_ID"),
    secret=os.getenv("AWS_SECRET_ACCESS_KEY"),
    token=os.getenv("AWS_SESSION_TOKEN")
)
#print(len(s3.ls("projet-statapp-segmedic/diffusion/nnunet_dataset/nnUNet_raw/Dataset001_Annot1/labelsTr")))
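The download step mirrors each S3 key found under ``s3_path`` into the matching local folder. A minimal sketch of that mapping as a standalone function, using `PurePosixPath.relative_to` so that only the leading prefix is stripped and keys outside the prefix raise an error (`local_path_for` and the example key are hypothetical):

```python
from pathlib import Path, PurePosixPath


def local_path_for(s3_key: str, s3_prefix: str, local_root: Path) -> Path:
    """Map an S3 key located under s3_prefix to the same relative path under local_root.

    Raises ValueError if the key is not under the prefix.
    """
    relative = PurePosixPath(s3_key).relative_to(PurePosixPath(s3_prefix))
    return local_root.joinpath(*relative.parts)


# Hypothetical key, for illustration only
prefix = "projet-statapp-segmedic/diffusion/nnunet_dataset/nnUNet_raw"
key = f"{prefix}/Dataset001_Annot1/labelsTr/case_001.nii.gz"
print(local_path_for(key, prefix, Path("/tmp/nnunet/nnUNet_raw")))
# /tmp/nnunet/nnUNet_raw/Dataset001_Annot1/labelsTr/case_001.nii.gz
```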

In [9]:

def download_s3_folder():

    # Defining paths
    base_local_path = Path('/tmp/nnunet')
    s3_base_path = "projet-statapp-segmedic/diffusion/nnunet_dataset"
    folders = ['nnUNet_raw', 'nnUNet_preprocessed', 'nnUNet_results']

    for folder in folders:
        # Create the local folder
        local_folder = base_local_path / folder
        local_folder.mkdir(parents=True, exist_ok=True)

        s3_path = f"{s3_base_path}/{folder}"
        print(f"\nDownloading folder {folder}...")

        try:
            # Recursive list of all files under s3_path
            files = s3.find(s3_path)

            # Progress bar
            with tqdm(total=len(files), desc=f"Files in {folder}") as pbar:
                for file_path in files:
                    # Strip only the leading S3 prefix to get the relative path
                    relative_path = file_path[len(s3_path):].lstrip('/')
                    local_file_path = local_folder / relative_path

                    # Create parent directories if needed
                    local_file_path.parent.mkdir(parents=True, exist_ok=True)

                    # Download only files that are not already present locally
                    if not local_file_path.exists():
                        try:
                            s3.get(file_path, str(local_file_path))
                        except Exception as e:
                            print(f"Error while downloading {file_path}: {e}")

                    pbar.update(1)

        except Exception as e:
            print(f"Error while reading {s3_path}: {e}")
            continue

    # FIX: nnU-Net expects a 4-digit channel identifier in image file names
    # (case_0000.nii.gz), whereas the files on S3 use 3 digits (case_000.nii.gz).
    # Done once, after all folders are downloaded.
    for i in ['1', '2', '3']:
        images = base_local_path / f"nnUNet_raw/Dataset00{i}_Annot{i}/imagesTr"
        for f in images.glob("*_000.nii.gz"):
            f.rename(f.with_name(f.name.replace("_000.nii.gz", "_0000.nii.gz")))

    # Environment variables nnU-Net reads to locate its folders
    env_vars = {
        'nnUNet_raw': str(base_local_path / 'nnUNet_raw'),
        'nnUNet_preprocessed': str(base_local_path / 'nnUNet_preprocessed'),
        'nnUNet_results': str(base_local_path / 'nnUNet_results')
    }

    # Applies to this kernel and to the subprocesses it launches (e.g. nnUNetv2_train).
    # A "!source ~/.bashrc" would run in a throwaway subshell and could not do this.
    for var_name, path in env_vars.items():
        os.environ[var_name] = path

    # Also persist them in ~/.bashrc for new terminals (only once)
    bashrc = Path.home() / '.bashrc'
    if '# nnUNet paths' not in (bashrc.read_text() if bashrc.exists() else ''):
        with open(bashrc, 'a') as f:
            f.write('\n# nnUNet paths\n')
            for var_name, path in env_vars.items():
                f.write(f'export {var_name}="{path}"\n')

    print("\nConfiguration finished. Environment variables created:")
    for var_name, path in env_vars.items():
        print(f"{var_name}={path}")

download_s3_folder()



Downloading folder nnUNet_raw...
Files in nnUNet_raw: 100%|██████████| 133/133 [02:07<00:00,  1.04it/s]

Downloading folder nnUNet_preprocessed...
Files in nnUNet_preprocessed: 100%|██████████| 253/253 [00:00<00:00, 33779.81it/s]

Downloading folder nnUNet_results...
Files in nnUNet_results: 100%|██████████| 45/45 [00:00<00:00, 27310.62it/s]


Configuration finished. Environment variables created:
nnUNet_raw=/tmp/nnunet/nnUNet_raw
nnUNet_preprocessed=/tmp/nnunet/nnUNet_preprocessed
nnUNet_results=/tmp/nnunet/nnUNet_results
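The renaming step in `download_s3_folder` enforces nnU-Net v2's file naming: each training image is `{CASE_IDENTIFIER}_{XXXX}.nii.gz`, where `XXXX` is a four-digit channel identifier (`0000` for a single CT channel). A minimal sketch that generalizes the fix to any three-digit channel suffix, assuming every three-digit suffix in `imagesTr` is a truncated channel identifier (`fixed_name` and `fix_channel_suffixes` are hypothetical helpers):

```python
import re
from pathlib import Path

# Matches a 3-digit channel suffix such as "case_000.nii.gz" (assumed input pattern)
THREE_DIGIT_SUFFIX = re.compile(r"^(?P<case>.+)_(?P<channel>\d{3})\.nii\.gz$")


def fixed_name(name: str) -> str:
    """Zero-pad a 3-digit channel identifier to the 4 digits nnU-Net expects."""
    match = THREE_DIGIT_SUFFIX.match(name)
    if match is None:
        return name  # already compliant (or not an image file): leave it untouched
    return f"{match['case']}_{int(match['channel']):04d}.nii.gz"


def fix_channel_suffixes(images_dir: Path) -> int:
    """Rename every non-compliant file in images_dir and return how many were renamed."""
    renamed = 0
    # sorted() materializes the listing before any file is renamed
    for path in sorted(images_dir.glob("*.nii.gz")):
        new_name = fixed_name(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed += 1
    return renamed
```

For example, `fix_channel_suffixes(Path("/tmp/nnunet/nnUNet_raw/Dataset001_Annot1/imagesTr"))` returns 0 once the folder is already compliant, so it is safe to run twice.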





Check that the download succeeded by running the following line.

Expected output: ``dataset_fingerprint.json``, ``gt_segmentations``, ``nnUNetPlans.json``, ``dataset.json``, ``nnUNetPlans_3d_fullres``, ``splits_final.json``.

In [10]:
!ls /tmp/nnunet/nnUNet_preprocessed/Dataset002_Annot2

dataset_fingerprint.json  gt_segmentations	  nnUNetPlans.json
dataset.json		  nnUNetPlans_3d_fullres  splits_final.json
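The ``ls`` above only checks `Dataset002_Annot2`. A minimal sketch that checks all three datasets at once, assuming they were all preprocessed with the same `3d_fullres` plan and should therefore contain the same entries as the expected output (`missing_entries` is a hypothetical helper):

```python
from pathlib import Path

# Entries listed in the expected output above
EXPECTED_ENTRIES = {
    "dataset_fingerprint.json",
    "dataset.json",
    "gt_segmentations",
    "nnUNetPlans.json",
    "nnUNetPlans_3d_fullres",
    "splits_final.json",
}


def missing_entries(dataset_dir: Path) -> set:
    """Return the expected entries absent from dataset_dir (all of them if it does not exist)."""
    present = {p.name for p in dataset_dir.iterdir()} if dataset_dir.is_dir() else set()
    return EXPECTED_ENTRIES - present


root = Path("/tmp/nnunet/nnUNet_preprocessed")
for name in ["Dataset001_Annot1", "Dataset002_Annot2", "Dataset003_Annot3"]:
    missing = missing_entries(root / name)
    print(name, "OK" if not missing else f"missing: {sorted(missing)}")
```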


If you wish to preprocess the datasets and verify their integrity, copy-paste and run the following lines. **BE CAREFUL:** this may make Onyxia crash if you do not increase the CPU and RAM resources, and each line takes more than 20 minutes. These lines have already been run, so you normally do not need to run them; that is why they are not in a code cell.

``!nnUNetv2_plan_and_preprocess -h``

``!nnUNetv2_plan_and_preprocess -d 1 -c 3d_fullres --verify_dataset_integrity -np 2 -npfp 2``

``!nnUNetv2_plan_and_preprocess -d 2 -c 3d_fullres --verify_dataset_integrity -np 2 -npfp 2``

``!nnUNetv2_plan_and_preprocess -d 3 -c 3d_fullres --verify_dataset_integrity -np 2 -npfp 2``

(``-d`` expects the numeric dataset ID: ``1`` for ``Dataset001_Annot1``, ``2`` for ``Dataset002_Annot2``, ``3`` for ``Dataset003_Annot3``.)
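The three commands differ only in the dataset ID, so they can also be generated. A minimal sketch, assuming nnU-Net v2's `-d` option takes numeric dataset IDs (1, 2 and 3 here); it only prints the commands so that nothing heavy runs by accident (`preprocess_command` is a hypothetical helper):

```python
import shlex


def preprocess_command(dataset_id: int, n_proc: int = 2) -> list:
    """Build the nnUNetv2_plan_and_preprocess call for one dataset."""
    return [
        "nnUNetv2_plan_and_preprocess",
        "-d", str(dataset_id),
        "-c", "3d_fullres",
        "--verify_dataset_integrity",
        "-np", str(n_proc),      # processes for preprocessing
        "-npfp", str(n_proc),    # processes for fingerprint extraction
    ]


for dataset_id in (1, 2, 3):
    # Print instead of running: each call takes 20+ minutes and a lot of RAM
    print(shlex.join(preprocess_command(dataset_id)))
```

To actually run one of them, pass the list to `subprocess.run(preprocess_command(1), check=True)`; using a list avoids any shell quoting issues.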


(**Optional**) To upload all the files stored locally, run the following code and choose one folder among ``nnUNet_preprocessed`` and ``nnUNet_results`` (you normally do not need to upload ``nnUNet_raw``). Estimated time: between 10 s and 1 min 10 s.

In [None]:
def upload_to_s3(folder):
    # Local and remote folders
    local_folder = Path(f'/tmp/nnunet/{folder}')
    s3_folder = f"projet-statapp-segmedic/diffusion/nnunet_dataset/{folder}"

    # List every entry to upload (directories are skipped in the loop below)
    files = list(local_folder.rglob("*"))

    print(f"\nUploading {folder} to {s3_folder}...")
    with tqdm(total=len(files), desc=f"Upload {folder}") as pbar:
        for file_path in files:
            if file_path.is_file():
                relative_path = file_path.relative_to(local_folder)
                s3_path = f"{s3_folder}/{relative_path.as_posix()}"
                try:
                    s3.put(str(file_path), s3_path)
                except Exception as e:
                    print(f"Error while uploading {file_path} → {s3_path}: {e}")
            pbar.update(1)

#upload_to_s3(input("Enter nnUNet_preprocessed or nnUNet_results: "))
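`upload_to_s3` re-sends every file on each call, even files that have not changed. A minimal sketch of an incremental variant, assuming a file only needs re-uploading when its size or modification time differs from the previous pass (`snapshot` and `changed_files` are hypothetical helpers; their output would be fed to `s3.put` file by file):

```python
from pathlib import Path


def snapshot(folder: Path) -> dict:
    """Map each file under folder to its (size, mtime in ns) signature."""
    signatures = {}
    for p in folder.rglob("*"):
        if p.is_file():
            st = p.stat()
            signatures[p] = (st.st_size, st.st_mtime_ns)
    return signatures


def changed_files(folder: Path, previous: dict):
    """Return (files new or modified since `previous`, current snapshot)."""
    current = snapshot(folder)
    changed = [p for p, sig in current.items() if previous.get(p) != sig]
    return sorted(changed), current
```

Keep the snapshot between calls, e.g. `changed, previous = changed_files(Path("/tmp/nnunet/nnUNet_results"), previous)` starting from `previous = {}`, and upload only `changed`. A file caught mid-write gets a new signature once it is finished, so it is simply uploaded again on the next pass.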

## 3. Training

Training must run alongside file uploading: nnU-Net writes many files (checkpoints, logs) to save its progress. They are stored locally, but we need them on S3. Since an epoch takes about 300 s (see the log below), the upload interval is set to 300 s.
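The training cell below runs the uploader as a daemon thread with an infinite `while True` loop, so it only stops when the kernel does. A minimal sketch of a stoppable variant based on `threading.Event`, assuming the upload is passed in as a callable (`PeriodicSync` is a hypothetical class; in this notebook the callable would be `lambda: upload_to_s3('nnUNet_results')`):

```python
import threading


class PeriodicSync:
    """Run `action` immediately, then every `interval` seconds, until stop() is called."""

    def __init__(self, action, interval: float = 300.0):
        self.action = action
        self.interval = interval
        self._stop_event = threading.Event()
        # daemon=True: the thread never blocks interpreter shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            self.action()
            # wait() returns True as soon as stop() sets the event, ending the loop early
            if self._stop_event.wait(self.interval):
                break

    def start(self):
        self._thread.start()

    def stop(self, final_sync: bool = True):
        """End the loop (after any upload in progress) and optionally sync one last time."""
        self._stop_event.set()
        self._thread.join()
        if final_sync:
            self.action()
```

In the training cell this would replace `uploader_thread`: create `sync = PeriodicSync(lambda: upload_to_s3('nnUNet_results'))`, call `sync.start()` before `trainer_thread.start()`, and `sync.stop()` after `trainer_thread.join()`. Unlike `time.sleep`, `Event.wait` lets the loop exit without waiting out the full interval.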

Choose the dataset (i.e. the set of annotations) you want to use: ``Dataset001_Annot1``, ``Dataset002_Annot2``, ``Dataset003_Annot3``.

**CAREFUL**: the project is not finished yet, and there is no early stopping for the moment. Check regularly that Onyxia has not crashed during training (this should not normally happen), and stop the training at around 80 epochs. To resume training, run ``nnUNetv2_train <dataset> 3d_fullres all --npz --c``; note that it only resumes from a multiple of 50 epochs, because nnU-Net saves its latest checkpoint every 50 epochs.
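Resuming with `--c` therefore discards up to 49 finished epochs. A minimal sketch of that arithmetic, assuming a checkpoint is written after every 50th completed epoch (consistent with the log below, where the resumed run restarts at `Epoch 50`) and about 300 s per epoch as in that log (`resume_epoch` and `lost_hours` are hypothetical helpers):

```python
SAVE_EVERY = 50        # checkpoint frequency in epochs, as described above
EPOCH_SECONDS = 300    # rough epoch duration observed in the training log


def resume_epoch(completed_epochs: int, save_every: int = SAVE_EVERY) -> int:
    """Epoch index (0-based, as printed by nnU-Net) at which --c restarts."""
    return (completed_epochs // save_every) * save_every


def lost_hours(completed_epochs: int) -> float:
    """Training time thrown away when resuming after `completed_epochs` epochs."""
    return (completed_epochs - resume_epoch(completed_epochs)) * EPOCH_SECONDS / 3600


print(resume_epoch(80), round(lost_hours(80), 1))  # 50 2.5
```

Stopping at 80 epochs and later resuming thus costs about 2.5 h of recomputation, which is worth keeping in mind when choosing when to stop.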

In [None]:
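# Code to train nnU-Net and upload its results

dataset = input("Enter one among: Dataset001_Annot1, Dataset002_Annot2, Dataset003_Annot3: ")

# Upload function with a fixed time interval
# IDEA: upload as soon as the content of /tmp/nnunet/nnUNet_results changes
def sync_results_to_s3():
    print("[Uploader] Starting S3 sync thread.")

    while True:
        upload_to_s3('nnUNet_results')
        print("upload done")
        time.sleep(300)  # roughly one epoch


# Training function
def run_training():
    print("[Trainer] Launching nnUNet training...")
    command = [
        "nnUNetv2_train",
        dataset,       # Dataset name (or ID)
        "3d_fullres",  # Configuration
        "all",         # Fold: train on all cases
        "--npz",       # Also save softmax predictions of the final validation
        "--c",         # Continue from the latest checkpoint if one exists
    ]
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"[Trainer] Training failed with exit code {result.returncode}.")
    else:
        print("[Trainer] Training complete.")


# Threads (daemon=True: the uploader does not block interpreter shutdown)
uploader_thread = threading.Thread(target=sync_results_to_s3, daemon=True)
trainer_thread = threading.Thread(target=run_training)

uploader_thread.start()
trainer_thread.start()

trainer_thread.join()
# Final upload, so that files written after the last periodic sync also reach S3
upload_to_s3('nnUNet_results')
print("[Main] All done.")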

[Uploader] Starting S3 sync thread.

Uploading nnUNet_results to projet-statapp-segmedic/diffusion/nnunet_dataset/nnUNet_results...
[Trainer] Launching nnUNet training...


Upload nnUNet_results:  48%|████▊     | 30/62 [00:05<00:03,  8.32it/s]


############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################



Upload nnUNet_results:  53%|█████▎    | 33/62 [00:08<00:12,  2.39it/s]

2025-05-01 21:06:07.090267: Using torch.compile...


Upload nnUNet_results:  66%|██████▌   | 41/62 [00:11<00:09,  2.25it/s]

2025-05-01 21:06:09.558565: do_dummy_2d_data_aug: False
using pin_memory on device 0


Upload nnUNet_results:  73%|███████▎  | 45/62 [00:14<00:08,  2.02it/s]

using pin_memory on device 0

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [192, 112, 112], 'median_image_size_in_voxels': [943.0, 512.0, 512.0], 'spacing': [1.0, 0.9626015722751617, 0.9626015722751617], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures

Upload nnUNet_results:  76%|███████▌  | 47/62 [00:14<00:05,  2.97it/s]

2025-05-01 21:06:13.151583: Unable to plot network architecture: nnUNet_compile is enabled!
2025-05-01 21:06:13.170194: 
2025-05-01 21:06:13.171853: Epoch 50
2025-05-01 21:06:13.172388: Current learning rate: 0.00955


Upload nnUNet_results: 100%|██████████| 62/62 [00:20<00:00,  3.09it/s]


upload done

Uploading nnUNet_results to projet-statapp-segmedic/diffusion/nnunet_dataset/nnUNet_results...


Upload nnUNet_results: 100%|██████████| 63/63 [00:21<00:00,  2.95it/s]


upload done
2025-05-01 21:11:11.027475: train_loss -0.6404
2025-05-01 21:11:11.028437: val_loss -0.7084
2025-05-01 21:11:11.028638: Pseudo dice [np.float32(0.8433), np.float32(0.9555), np.float32(0.9364)]
2025-05-01 21:11:11.029136: Epoch time: 297.86 s
2025-05-01 21:11:11.029487: Yayy! New best EMA pseudo Dice: 0.8966000080108643
2025-05-01 21:11:13.900142: 
2025-05-01 21:11:13.900580: Epoch 51
2025-05-01 21:11:13.900786: Current learning rate: 0.00954
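Early stopping is still missing. One possible heuristic, sketched here as an assumption rather than nnU-Net's own behavior: parse the training log and stop once the EMA pseudo Dice has not improved for a fixed number of epochs (the patience). The regular expressions match the `Epoch N` and `New best EMA pseudo Dice` lines printed above; `epochs_since_best`, `should_stop` and the patience of 20 are hypothetical:

```python
import re

EPOCH_RE = re.compile(r"Epoch (\d+)\s*$")           # e.g. "...: Epoch 51"
BEST_RE = re.compile(r"New best EMA pseudo Dice: ([0-9.]+)")


def epochs_since_best(log_lines):
    """Epochs started since the last new best EMA pseudo Dice (None if no epoch seen)."""
    first = current = best = None
    for line in log_lines:
        epoch = EPOCH_RE.search(line)
        if epoch:
            current = int(epoch.group(1))
            if first is None:
                first = current
        elif BEST_RE.search(line) and current is not None:
            best = current  # the improvement belongs to the epoch in progress
    if current is None:
        return None
    return current - (best if best is not None else first)


def should_stop(log_lines, patience=20):
    """Hypothetical rule: stop when the EMA pseudo Dice has not improved for `patience` epochs."""
    waited = epochs_since_best(log_lines)
    return waited is not None and waited >= patience
```

The same lines are also written to a `training_log_*.txt` file in the fold output folder under `nnUNet_results`, which such a check could read periodically. Acting on the decision would still require terminating the training process, e.g. by launching it with `subprocess.Popen` instead of `subprocess.run` and calling its `terminate()` method.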
