<a href="https://colab.research.google.com/github/SzymonNowakowski/Machine-Learning-2025/blob/master/Lab15-EDM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 15 - EDM, Working in ICM

### Author: Szymon Nowakowski

# Introduction

Today I will talk about using methods you can find in the following paper:

Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. In Proceedings of the Neural Information Processing Systems (NeurIPS) Conference.
https://arxiv.org/abs/2206.00364

### EDM Codebase

https://github.com/NVlabs/edm

Although 3 years old, the codebase and the pretrained networks it provides are still considered state of the art for image generation.

# Presentation Plan

## Cluster Computation (ICM Example)

## Working with Containers

## Best (Programming) Practices

## EDM

### Training

### Generating Images

### Calculating FID

# Cluster Computation (ICM Example)



## SLURM

ICM uses ***SLURM***, which stands for **Simple Linux Utility for Resource Management**. It is a workload manager widely used on **HPC (High-Performance Computing)** clusters to schedule and manage jobs.

It handles job submission, queuing, scheduling, monitoring, and resource allocation.

In SLURM, each command starts with the letter `s`:

    {bash}
    # Submit a job
    sbatch myjob.sh

    # Check job queue
    squeue -u $USER

    # Check all jobs
    squeue

    # Cancel a job
    scancel 12345


I will show you an example of a SLURM batch job file (`myjob.sh` in the examples above) in a moment.

A great resource is the [ICM help page on SLURM](https://kdm.icm.edu.pl/Tutorials/HPC-intro/slurm_intro/).

The available resources (node configurations, GPUs, memory) can be checked on the [ICM help page on computing resources](https://kdm.icm.edu.pl/Zasoby/komputery_w_icm.pl/).
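
In practice, the commands above are often chained through the job ID. The sketch below is a minimal illustration, not an ICM-specific recipe: it extracts the ID from the `Submitted batch job <id>` line that `sbatch` normally prints, so that it can be passed to `squeue -j` or `scancel`. The helper name `job_id_from_sbatch` is hypothetical, and a sample string stands in for real `sbatch` output so that the snippet also runs off-cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): extract the numeric job ID
# from the confirmation line that sbatch prints on success.
job_id_from_sbatch() {
    # Keep the last whitespace-separated field, e.g. "12345".
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Sample text standing in for real sbatch output (assumption for the demo).
sample_output="Submitted batch job 12345"
job_id=$(job_id_from_sbatch "$sample_output")
echo "Job ID: ${job_id}"

# On the cluster one would typically write (shown as comments only):
#   job_id=$(job_id_from_sbatch "$(sbatch myjob.sh)")
#   squeue -j "$job_id"     # status of this one job
#   scancel "$job_id"       # cancel it if needed
```

Keeping the ID in a variable avoids copying it by hand from the `squeue` listing.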




## LSF


Other HPC workload managers exist. The most popular alternative to SLURM is ***LSF*** (**Load Sharing Facility**), originally developed by Platform Computing (now delivered by IBM).

In LSF, each command starts with the letter `b`:

    {bash}
    # Submit a job
    bsub < myjob.lsf

    # Check your jobs
    bjobs

    # Check all jobs
    bjobs -u all

    # Show information about available queues
    bqueues

    # Kill a job
    bkill 12345


# Working with Containers

In many HPC environments (including ICM), it is hard to install your own dependencies. While it is possible to install additional Python packages, it is sometimes hard or impossible to install additional binaries.

On another cluster I work with, at WIM, there is no Internet access at all, so one cannot even install additional packages.

**The way out is to use containers.** But there is another obstacle: for security reasons, you typically cannot run Docker directly on a cluster.

Instead, you typically **build Docker images locally** (on your workstation or laptop) and then **transfer them** to the cluster, where they can be converted to the *Apptainer* (formerly Singularity) format.

Below are the steps to create, save, and rebuild a Docker container image.


## `Dockerfile`

The `Dockerfile` describes your environment — e.g., which base image to use, which Python packages to install, etc.

    {bash}
    # Example Dockerfile
    FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

    # Set working directory
    WORKDIR /workspace

    # Copy your project
    COPY . /workspace

    # Install extra dependencies
    RUN pip install -r requirements.txt

    # Default command
    CMD ["python", "train.py"]



## Building the Docker Image

    {bash}
    docker build -t myproject:latest .

This creates a local image named `myproject:latest`. Note the `.` (dot): it sets the build context to the current directory, which is also where Docker looks for the `Dockerfile` by default.

Now, when you run

    {bash}
    docker run myproject

the default command `python train.py` is executed inside the container.

There can only be one `CMD` instruction in a `Dockerfile` (if there are multiple, only the last one is used).

Often, `CMD` launches the main script or application of the container. It can be overridden: to run some other code, optionally with arguments, execute

    {bash}
    docker run myproject python some_other_code.py --epochs 10 --lr 1e-3

## Saving the Image to a `.tar` File

    {bash}
    docker save myproject:latest -o myproject_latest.tar
    ls -lh myproject_latest.tar


This writes the file `myproject_latest.tar` to the current directory.

## Copying the container image to ICM

Use `scp` (secure copy) to transfer the image file to your home directory at ICM.  
Replace `username` with your ICM login name.

    {bash}
    scp myproject_latest.tar username@hpc.icm.edu.pl:/lu/tetyda/home/username/

`/lu/tetyda/home/username/` is the global path of your `rysy` home directory, as seen from the `hpc.icm.edu.pl` server.
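
Image archives can be several gigabytes in size, so it may be worth checking that the copy arrived intact. Below is a minimal sketch, assuming GNU `sha256sum` is available on both machines (on macOS, `shasum -a 256` plays the same role). The small dummy file only stands in for `myproject_latest.tar`, so the snippet can be tried anywhere.

```shell
#!/bin/bash
# Work in a scratch directory; the dummy file stands in for the real archive.
workdir=$(mktemp -d)
cd "$workdir"
printf 'dummy image contents\n' > myproject_latest.tar

# 1. On your workstation: record the checksum next to the archive.
sha256sum myproject_latest.tar > myproject_latest.tar.sha256

# 2. Copy both files to the cluster with scp (not executed in this sketch).

# 3. On the cluster: re-compute the checksum and compare with the record.
if sha256sum -c myproject_latest.tar.sha256; then
    transfer_ok=yes
else
    transfer_ok=no
fi
echo "Transfer intact: ${transfer_ok}"
```

A mismatch at step 3 means the archive should be copied again before spending cluster time on the conversion.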


## Converting the Image to Apptainer Format on ICM

Here is an example of a SLURM batch script that converts the Docker image archive `myproject_latest.tar` into the Apptainer image `myproject_latest.sif`:

    {bash}
    # after `ssh`-ing to RYSY
    more rebuild_container.slurm

    #!/bin/bash
    #SBATCH --job-name=docker2apptainer
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1
    #SBATCH --time=48:00:00
    #SBATCH --account=g99-4302
    #SBATCH --output=slurm-%j.out

    export APPTAINER_TMPDIR=/home/$USER/tmp

    apptainer build ./myproject_latest.sif docker-archive:///home/szymon/edm/myproject_latest.tar

Two lines deserve a comment:

- `export APPTAINER_TMPDIR=/home/$USER/tmp` - this line points Apptainer at a directory with enough space for its temporary files. You need to create the `~/tmp` directory first.
- Note the triple `///`: the `docker-archive://` prefix ends with two slashes, and the absolute path to the archive adds the third. It cost me a few days to figure this out. These days, ChatGPT can supply such hints instantly.
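
The two points above can be sketched as follows. This is an illustration under assumptions, not the exact procedure used on RYSY: the variable names are hypothetical, and a scratch directory stands in for the real home directory. It shows that prefixing an absolute path with `docker-archive://` is what produces the three slashes.

```shell
#!/bin/bash
# Scratch directory standing in for $HOME, so the sketch is safe to run.
fake_home=$(mktemp -d)

# Create the temporary directory before pointing Apptainer at it;
# mkdir -p does nothing if the directory already exists.
export APPTAINER_TMPDIR="${fake_home}/tmp"
mkdir -p "$APPTAINER_TMPDIR"

# An absolute path starts with "/", so appending it to "docker-archive://"
# yields "docker-archive:///...".
tar_path="${fake_home}/edm/myproject_latest.tar"
archive_uri="docker-archive://${tar_path}"
echo "$archive_uri"
```

Building the URI from a variable that holds an absolute path makes it harder to drop a slash by accident.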

Now one needs to execute on RYSY

    {bash}
    sbatch rebuild_container.slurm
    squeue

to see something like

    {bash}
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    127248       gpu setup_en   ljanis PD       0:00      1 (AssocGrpCPUMinutesLimit)
    134837       gpu      OLA tomaszsu  R 12-08:46:28     1 rysy-n6
    134838       gpu docker2a   szymon  R       0:18      1 rysy-n1
    135515        ve     bash   herman  R    8:13:47      1 pbaran

After a few minutes or hours, depending on the size of the Docker image, the file `myproject_latest.sif` is written to the current directory. **This is the Apptainer container file**.
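
Once the `.sif` file exists, a command can be run inside it with `apptainer exec`. Below is a minimal sketch, assuming the image from this section and a `train.py` in the current directory. The `--nv` flag asks Apptainer to expose the host's NVIDIA GPUs to the container. The guard around the call is only there to make the snippet harmless on machines without Apptainer or without the image.

```shell
#!/bin/bash
sif_image=./myproject_latest.sif

if command -v apptainer >/dev/null 2>&1 && [ -f "$sif_image" ]; then
    # Run the training script inside the container with GPU support.
    apptainer exec --nv "$sif_image" python train.py
    run_status=$?
else
    echo "apptainer or ${sif_image} not available; nothing to run" >&2
    run_status=skipped
fi
echo "Run status: ${run_status}"
```

On the cluster, a call like this would normally sit at the end of a SLURM batch script rather than be typed on the login node.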

    



## EDM Example

The original EDM `Dockerfile` from `https://github.com/NVlabs/edm/blob/main/Dockerfile` is the following:

    {bash}
    # Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    #
    # This work is licensed under a Creative Commons
    # Attribution-NonCommercial-ShareAlike 4.0 International License.
    # You should have received a copy of the license along with this
    # work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/

    FROM nvcr.io/nvidia/pytorch:22.10-py3

    ENV PYTHONDONTWRITEBYTECODE 1
    ENV PYTHONUNBUFFERED 1

    RUN pip install imageio imageio-ffmpeg==0.4.4 pyspng==0.1.0

    WORKDIR /workspace

    RUN (printf '#!/bin/bash\nexec \"$@\"\n' >> /entry.sh) && chmod a+x /entry.sh
    ENTRYPOINT ["/entry.sh"]

The original `Dockerfile` gave me errors: I was not able to execute the code from within the resulting container, most likely because some dependencies have changed since the code was released 3 years ago. I also had no access to the original NVIDIA image, which would presumably have worked fine. I needed to upgrade the `pillow` package in my fork **`https://github.com/SzymonNowakowski/edm/blob/main/Dockerfile`:**

    {bash}
    # Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    #
    # This work is licensed under a Creative Commons
    # Attribution-NonCommercial-ShareAlike 4.0 International License.
    # You should have received a copy of the license along with this
    # work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/

    FROM nvcr.io/nvidia/pytorch:22.10-py3

    ENV PYTHONDONTWRITEBYTECODE 1
    ENV PYTHONUNBUFFERED 1

    RUN pip install imageio imageio-ffmpeg==0.4.4 pyspng==0.1.0
    RUN pip install --upgrade pillow

    WORKDIR /workspace

    RUN (printf '#!/bin/bash\nexec \"$@\"\n' >> /entry.sh) && chmod a+x /entry.sh
    ENTRYPOINT ["/entry.sh"]
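
The last two lines of the `Dockerfile` create a tiny entrypoint script whose only job is to `exec` whatever command is passed to the container. The sketch below reproduces that script outside Docker to show the effect. The scratch directory and the `echo` arguments are illustrative assumptions.

```shell
#!/bin/bash
# Recreate the entrypoint script in a scratch directory instead of "/".
scratch=$(mktemp -d)
printf '#!/bin/bash\nexec "$@"\n' > "${scratch}/entry.sh"
chmod a+x "${scratch}/entry.sh"

# The script replaces itself with the command given as its arguments,
# which is what happens to the arguments of `docker run edm:latest ...`.
entry_output=$("${scratch}/entry.sh" echo "hello from the container")
echo "$entry_output"
```

Because the script uses `exec`, the launched command takes over the script's process instead of running as its child.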

From there, the standard sequence builds the Docker image and ships it to ICM:

    {bash}
    docker build -t edm:latest .
    docker save edm:latest -o edm_latest.tar
    scp edm_latest.tar username@hpc.icm.edu.pl:/lu/tetyda/home/username/

Now we need to adapt the SLURM script to the new filenames. It assumes that the `~/tmp` directory exists, so create it if it does not. The script looks like this:

    {bash}
    # while on your terminal
    ssh username@hpc.icm.edu.pl

    # while on HPC computer
    ssh rysy

    # while on RYSY computer
    more rebuild_container.slurm
    #!/bin/bash
    #SBATCH --job-name=docker2apptainer
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1
    #SBATCH --time=48:00:00
    #SBATCH --account=g99-4302
    #SBATCH --output=slurm-%j.out

    export APPTAINER_TMPDIR=/home/$USER/tmp

    apptainer build ./edm_latest.sif docker-archive:///home/szymon/edm/edm_latest.tar

Now one needs to execute

    {bash}
    sbatch rebuild_container.slurm
    squeue

to see something like

    {bash}
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    127248       gpu setup_en   ljanis PD       0:00      1 (AssocGrpCPUMinutesLimit)
    134837       gpu      OLA tomaszsu  R 12-08:46:28     1 rysy-n6
    134838       gpu docker2a   szymon  R       0:18      1 rysy-n1
    135515        ve     bash   herman  R    8:13:47      1 pbaran

And after a few hours the file `edm_latest.sif` gets written to the current directory.

**This is the apptainer container file we shall need**.

# Best (Programming) Practices



## Configure Your `ssh` and `scp` Connection

Because you need a stable `ssh` and `scp` connection to the cluster, it is worth configuring it in the file `~/.ssh/config`. Here is an example of the config file I use:

    {bash}
    cat ~/.ssh/config

    Host icm
        ForwardX11 yes
        ForwardAgent yes
        UserKnownHostsFile ~/.ssh/known_hosts
        Hostname hpc.icm.edu.pl
        LocalForward 8022 rysy:22
        ServerAliveInterval 120
        ServerAliveCountMax 2
        User szymon

    Host fizyk1.fuw.edu.pl
        Hostname fizyk1.fuw.edu.pl
        ServerAliveInterval 120
        ServerAliveCountMax 2
        User snowakowski

    Host dg1
        ForwardX11 yes
        ForwardAgent yes
        UserKnownHostsFile ~/.ssh/known_hosts
        Hostname 10.21.2.118
        LocalForward 5900 localhost:5900
        LocalForward 8000 localhost:8000
        LocalForward 8001 localhost:8001
        LocalForward 8008 localhost:8008
        LocalForward 8888 localhost:8888
        LocalForward 6006 localhost:6006
        LocalForward 1433 localhost:1433
        RequestTTY yes
        ServerAliveInterval 120
        ServerAliveCountMax 2
        ProxyJump  fizyk1.fuw.edu.pl:22
        User szym

- `ServerAliveInterval 120` - instructs your SSH client to send a small, encrypted keep-alive message to the server every 120 seconds (2 minutes).
So every 2 minutes, your client pings the server silently to say “I'm still here.”

- `ServerAliveCountMax 2` - defines how many unanswered keep-alive messages the client will tolerate before disconnecting.
With `ServerAliveCountMax 2`, if the server fails to respond to two consecutive keep-alives, the client assumes the connection is dead and terminates it.
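
Taken together, the two settings bound how long a dead connection can linger. A back-of-the-envelope sketch, using the values from the config above (the variable names are illustrative):

```shell
#!/bin/bash
server_alive_interval=120   # seconds between keep-alive probes
server_alive_count_max=2    # unanswered probes tolerated

# The client gives up after roughly interval * count seconds of silence.
timeout_seconds=$(( server_alive_interval * server_alive_count_max ))
echo "An unresponsive server is dropped after about ${timeout_seconds} s"
```

With these values, the client gives up after roughly four minutes of silence from the server.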

The `dg1` entry, which connects via `fizyk1`, shows how to configure a proxy jump.

Now, instead of

    {bash}
    # while on your terminal
    ssh -l snowakowski fizyk1.fuw.edu.pl

    # while on fizyk1
    ssh -l szym 10.21.2.118

you can connect in a single step, using the host alias instead of the IP address and without typing a username:

    {bash}
    ssh dg1

The same applies to `scp`.

**Next I will show you how to enable the tab-completion of remote paths in `scp`**.



## Configure Your `ssh` and `scp` Certificate

As you will frequently need to `ssh` and `scp` to and from the cluster, the second thing to consider is to set up a password-less connection. You not only skip typing your password every time, but you also unlock extra conveniences such as **tab-completion of remote paths** in `scp`.

To generate a public/private key pair execute:

    {bash}
    ssh-keygen

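With the default settings of older OpenSSH versions, it generates an RSA key pair (OpenSSH 9.5 and later default to Ed25519, in which case the files are named `id_ed25519` and `id_ed25519.pub`):
- your private key: `~/.ssh/id_rsa`
- your public key: `~/.ssh/id_rsa.pub`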
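
To see exactly which files are produced, the key type and the output path can be given explicitly. The sketch below is a demonstration only: it writes a throwaway Ed25519 key pair into a scratch directory, with an empty passphrase (`-N ""`) so that it runs without prompts. The directory, the comment string, and the empty passphrase are assumptions of the demo, not a recommendation for real keys.

```shell
#!/bin/bash
# Scratch directory, so that nothing in ~/.ssh is touched.
key_dir=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
    # -t: key type, -f: output file, -N: passphrase, -C: comment, -q: quiet
    ssh-keygen -q -t ed25519 -N "" -C "demo key" -f "${key_dir}/id_ed25519"
    ls "$key_dir"
else
    echo "ssh-keygen not found; skipping the demo" >&2
fi
```

The file without an extension is the private key and must stay on your machine. Only the `.pub` file is copied to the server.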

To copy the public key to the server (for instance, to ICM), run

    {bash}
    ssh-copy-id icm

(assuming you have your `~/.ssh/config` file already created).

After that

    {bash}
    ssh icm

gets you to the `hpc.icm.edu.pl` server with only the OTP (one-time password). That you cannot avoid, or at least I don't know how. On `hpc`, you can set up a direct key-based (password-less) connection to `rysy`, so that you only need to type

    {bash}
    ssh rysy

## Smart Use of GitHub

Using GitHub is always a good idea, as it provides independent backup and code versioning without much added effort. Showing how to use `git` is well beyond the scope of this class, but I want to make some points along the way.

- As we want to extend EDM