<a href="https://colab.research.google.com/github/SzymonNowakowski/Machine-Learning-2025/blob/master/Lab15-EDM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 15 - EDM, Working in ICM


### Author: Szymon Nowakowski


# Presentation Plan

## Cluster Computation (ICM Example)

## Working with Containers

## Best (Programming) Practices

## EDM

### Training

### Generating Images

### Calculating FID

# Cluster Computation (ICM Example)



## SLURM

ICM uses ***SLURM***. The name stands for **Simple Linux Utility for Resource Management**; it is a workload manager widely used on **HPC (High-Performance Computing)** clusters to schedule and manage jobs.

It handles job submission, queuing, scheduling, monitoring, and resource allocation.

Every SLURM command starts with the letter `s`:

    {bash}
    # Submit a job
    sbatch myjob.sh

    # Check job queue
    squeue -u $USER

    # Check all jobs
    squeue

    # Cancel a job
    scancel 12345


I will show you an example of a SLURM batch job file (`myjob.sh` in the examples above) in a moment.

A great resource is the [ICM help page on SLURM](https://kdm.icm.edu.pl/Tutorials/HPC-intro/slurm_intro/).

The available resources (configurations, GPUs, memory) can be checked on the [ICM help page on computing resources](https://kdm.icm.edu.pl/Zasoby/komputery_w_icm.pl/).
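
As a stand-in for the `myjob.sh` file used in the commands above, here is a minimal sketch of a batch script. The job name, time limit, CPU, memory and GPU values are illustrative assumptions, not ICM's actual settings; site-specific options such as `--partition` and `--account` are left out on purpose and should be taken from the ICM help pages. The snippet only writes the file; nothing is submitted.

```shell
# Write a hypothetical batch script to myjob.sh (all values are placeholders).
cat > myjob.sh <<'EOF'
#!/bin/bash
# Name shown in the squeue listing
#SBATCH --job-name=edm-train
# Log file; %j expands to the job ID
#SBATCH --output=slurm-%j.out
# Wall-clock limit (HH:MM:SS)
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# One GPU; the exact GPU request syntax can differ between sites
#SBATCH --gres=gpu:1

echo "Job ${SLURM_JOB_ID} running on $(hostname)"
python train.py
EOF

# Submit it on the cluster with:
#   sbatch myjob.sh
```

SLURM reads the `#SBATCH` lines as options before the first executable command, so the same file runs as an ordinary shell script when started by hand.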




## LSF


Other HPC workload managers exist. A widely used alternative to SLURM is ***LSF*** (**Load Sharing Facility**), originally developed by Platform Computing (now delivered by IBM).

Every LSF command starts with the letter `b`:

    {bash}
    # Submit a job
    bsub < myjob.lsf

    # Check your jobs
    bjobs

    # Check all jobs
    bjobs -u all

    # Show information about available queues
    bqueues

    # Kill a job
    bkill 12345
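
For comparison, a minimal sketch of what the `myjob.lsf` file passed to `bsub` might contain. All values (job name, time limit, slot count, GPU request) are illustrative assumptions, and queue names are site-specific, so none is given. The snippet only writes the file.

```shell
# Write a hypothetical LSF job file to myjob.lsf (all values are placeholders).
cat > myjob.lsf <<'EOF'
#!/bin/bash
# Job name shown by bjobs
#BSUB -J edm-train
# Log file; %J expands to the job ID
#BSUB -o lsf-%J.out
# Wall-clock limit (HH:MM)
#BSUB -W 01:00
# Number of job slots
#BSUB -n 4
# One GPU; supported by recent LSF versions, may differ per site
#BSUB -gpu "num=1"

echo "Job ${LSB_JOBID} running on $(hostname)"
python train.py
EOF

# Submit it with:
#   bsub < myjob.lsf
```

The `#BSUB` lines play the same role as `#SBATCH` lines in SLURM: they carry the submission options inside the script itself.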


# Working with Containers

In many HPC environments (including ICM), you cannot run Docker directly for security reasons.  
Instead, you typically **build containers locally** (on your workstation or laptop) and then **transfer them** to the cluster, where they can be converted to the *Apptainer* (formerly Singularity) format.

Below are the steps to create, save, and transfer a Docker container image.


## `Dockerfile`

The `Dockerfile` describes your environment: which base image to use, which Python packages to install, and so on.

    {bash}
    # Example Dockerfile
    FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

    # Set working directory
    WORKDIR /workspace

    # Copy your project
    COPY . /workspace

    # Install extra dependencies
    RUN pip install -r requirements.txt

    # Default command
    CMD ["python", "train.py"]


## Building the Docker Image

    {bash}
    docker build -t myproject:latest .

This creates a local image named `myproject:latest`.

When you run

    {bash}
    docker run myproject
the container executes `python train.py`, as specified by the `CMD` instruction.

There can only be one `CMD` instruction in a `Dockerfile` (if there are multiple, only the last one is used).

Often, `CMD` is used to launch the main script or application of the container. It can be overridden by passing a different command after the image name in `docker run`.
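
Replacing the default command can be sketched with two small wrapper functions. The function names `run_default` and `run_override` are hypothetical helpers introduced only for this illustration; `--rm` simply removes the container once it exits. Defining the functions does not require Docker, but calling them does.

```shell
# Run the image with its default CMD (python train.py).
run_default() {
    docker run --rm myproject:latest
}

# Run the same image, replacing CMD with the arguments given to the function.
run_override() {
    docker run --rm myproject:latest "$@"
}

# Example calls (require Docker and the image built above):
#   run_default
#   run_override python -c "import torch; print(torch.__version__)"
```

Anything placed after the image name replaces `CMD` for that single run; the image itself is not modified.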

## Saving the Image to a `.tar` File

    {bash}
    docker save myproject:latest -o myproject.tar
    ls -lh myproject.tar


This writes the file `myproject.tar` to the current directory.

## Copying the Container Image to ICM

Use `scp` (secure copy) to transfer the image file to your home directory at ICM.  
Replace `username` with your ICM login name.

    {bash}
    scp myproject.tar username@hpc.edu.pl:/lu/tetyda/home/username/




## Converting the Image to Apptainer Format on ICM
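
A minimal sketch of the conversion, assuming the `apptainer` command is available on the cluster (on some systems it has to be loaded as a module first) and that `myproject.tar` sits in the current directory. The output name `myproject.sif` is a placeholder. The guard makes the snippet a no-op where either prerequisite is missing.

```shell
IMAGE_TAR="myproject.tar"   # Docker archive copied over with scp
SIF_FILE="myproject.sif"    # hypothetical name of the converted image

if command -v apptainer >/dev/null 2>&1 && [ -f "$IMAGE_TAR" ]; then
    # Convert the Docker archive into a single Apptainer image file.
    apptainer build "$SIF_FILE" "docker-archive://$IMAGE_TAR"
else
    echo "apptainer or $IMAGE_TAR not found; nothing converted"
fi

# Inside a batch job the image could then be used like this
# (--nv exposes the NVIDIA GPUs to the container):
#   apptainer exec --nv myproject.sif python /workspace/train.py
```

The full path `/workspace/train.py` is used in the comment because Apptainer normally starts in the caller's current directory rather than in the `WORKDIR` set in the `Dockerfile`.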

    



# Best (Programming) Practices

## Configure Your `ssh` and `scp` Connection

Because you need a stable `ssh` and `scp` connection to the cluster, configure it in the file `~/.ssh/config`. An example of a config file I use:
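
The sketch below is a hypothetical entry, not the author's actual file. It assumes the host `hpc.edu.pl` and the login `username` from the `scp` example earlier; the alias `icm` and the timing values are illustrative. It writes the entry to a separate file so it can be reviewed before it touches the real configuration.

```shell
# Write a hypothetical ssh config entry to a scratch file for review.
cat > ssh_config_icm.example <<'EOF'
Host icm
    HostName hpc.edu.pl
    User username
    # Send a keep-alive every 60 s so idle sessions are not dropped.
    ServerAliveInterval 60
    ServerAliveCountMax 3
    # Reuse one authenticated connection for later ssh/scp calls.
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
EOF

# After reviewing the file, append it to the real configuration:
#   cat ssh_config_icm.example >> ~/.ssh/config
# The alias then works for both tools:
#   ssh icm
#   scp myproject.tar icm:/lu/tetyda/home/username/
```

The keep-alive options address dropped idle sessions, and the `Control*` options let repeated `scp` calls reuse an already open connection instead of authenticating each time.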



## Configure Your `ssh` and `scp` Certificate

The second thing ....

As you will frequently need to `ssh` and `scp` to and from the cluster, it is reasonable to