In [None]:
# IGNORE THIS CELL WHICH CUSTOMIZES LAYOUT AND STYLING OF THE NOTEBOOK !
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import warnings

import matplotlib.pyplot as plt

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings = lambda *a, **kw: None
from IPython.core.display import HTML

HTML(open("../documents/custom.html", "r").read())

<p style="font-size: 2.5em; font-weight: bold;">Section 6a: Introduction to the Euler HPC Cluster</p>
<br/>
<span style="background:#f0f0e0;padding:1em">Copyright (c) 2020-2021 ETH Zurich, Scientific IT Services. This work is licensed under <a href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a></span><br/>
<br/>

> **Target audience:** this presentation is intended for beginner and intermediate cluster users, and for users who are familiar with HPC systems but new to SLURM. If you are confident that you can answer the "check questions" at the end of this part, feel free to skip this presentation.

# HPC and Python
* An HPC cluster is a **collection of tens, hundreds, or thousands of fast computers** that together can perform far more computations than a single computer.
* An HPC cluster is usually **shared** between users and research groups.  
* Use cases:
    * Traditionally used in the simulation of physical systems (e.g. quantum systems, weather, climate or fluid dynamics)
    * Nowadays, HPC clusters also accommodate Big Data workloads from diverse fields that cannot run on a PC because of their resource requirements.
* The increased popularity of Python in research (and industry) motivated the development of Python libraries that take advantage of HPC clusters (e.g. IPython Parallel, Dask).

In this section we will learn:
- how to use the ETHZ Euler cluster, and 
- which pythonic solutions are available.

# Introduction to the ETHZ HPC Clusters

The [Scientific IT Services](https://sis.id.ethz.ch) provide two [research IT environments](https://sis.id.ethz.ch/services/hpc/):



| **Euler**                                                           |        **LeoMed**                                                             |
|---------------------------------------------------------------------|-------------------------------------------------------------------------------|
| large CPU & GPU cluster                                             | multiple CPU & GPU clusters                                                   |
| designed for high-performance and high-throughput applications      | designed for biomedical applications with sensitive data; includes very strict IT security controls |
| provided by SIS's High Performance Computing group                  | provided by SIS's Research IT Platforms group                                 |
| shareholder model + a small public share for all ETH members (free) | shareholders only                                                             |
| https://scicomp.ethz.ch                                             | https://unlimited.ethz.ch/display/LeoMed2/Leonhard+Med+Intro+for+shareholders |


**In this tutorial we will focus on the Euler cluster**

## The scicomp wiki
* https://scicomp.ethz.ch/
* Documentation on all aspects of Euler

<p>
<img src="./images/scicomp.png" width="700">
</p>

## [Euler cluster](https://scicomp.ethz.ch/wiki/Euler)
- regularly expanded since 2013
- shareholder model 
  - over 180 research groups from almost all departments of ETH have invested in Euler
  - a small public share (only CPUs), with **free** limited access for all ETH members (up to 48 cores and 128 GB of memory)
  - GPU nodes (only for shareholders)



## Using the Euler cluster

* An HPC cluster is shared between multiple users and its mission is to allow the users to run computing tasks, which we call **jobs**.  
* Jobs run on so-called **compute** nodes. To have a job scheduled on a compute node, we have to submit it.
* For this purpose there are so-called **login** nodes. They allow users to log in from their PC and submit jobs.
* login: `ssh <username>@euler.ethz.ch` (see [scicomp wiki](https://scicomp.ethz.ch/wiki/Accessing_the_cluster))
* The submitted jobs are scheduled by the **batch system**, which starts them on the compute nodes once the requested resources are available. Depending on what you requested and on the current load, the waiting time can be short or long (or infinite, if the request can never be satisfied).

<p>
<img src="./images/Cluster.png" width="1000">
<div>Source: <a href=https://scicomp.ethz.ch/wiki/File:Accessing_the_clusters.png>ETHZ scicomp wiki</a></div>
</p>

## Cluster architecture

The main components of the cluster are:
- Login nodes (for accessing the cluster)
- Compute nodes (where jobs are executed)
- Storage (where data is stored)
- Environment - Modules (centrally installed applications, libraries, compilers, ...)
- Batch system (manages jobs)

## Login vs Compute Nodes
- **login** nodes: allow users to login to the cluster and submit jobs
- **compute** nodes: the place where jobs are running

**Do not run computation on the login node:**

- The batch system balances loads on the compute nodes
- Running on the login node circumvents this mechanism
- As a result, you might make the login node unusable for you and others
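
A script can guard itself against being started on a login node by accident. The following is a minimal sketch, assuming that SLURM defines the environment variable `SLURM_JOB_ID` only inside jobs (the variable is used in the job chaining examples later in this section). The function name `running_inside_job` is made up for this illustration.

```python
import os


def running_inside_job(environ=None):
    """Return True if the process appears to run inside a SLURM job."""
    env = os.environ if environ is None else environ
    # Assumption: SLURM exports SLURM_JOB_ID for every job it starts.
    return "SLURM_JOB_ID" in env


if __name__ == "__main__":
    if running_inside_job():
        print("Running inside a job, starting the computation.")
    else:
        print("No SLURM job detected, submit this script with sbatch or srun.")
```

A check like this only detects the batch system environment; it does not replace the rule above.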

## Storage and data management

Data can be stored in different locations (see also the [scicomp wiki](https://scicomp.ethz.ch/wiki/Storage_systems)):

<p>
<img src="./images/Storage.png" width="500">
<div>Source: <a href=https://scicomp.ethz.ch/wiki/File:Storage.png>ETHZ scicomp wiki</a></div>
</p>

* **Personal storage for all users**
    * Home
    * Global Scratch
    * Local Scratch
* **Group storage for shareholders**
    * Project
    * Work
* **External Storage**


### Personal storage for all users
* Home
  * `/cluster/home/username`, `$HOME`, `~`
  * safe, long-term storage 
  * for critical data (program source, scripts, etc.)
* Global Scratch
  * `/cluster/scratch/username`, `$SCRATCH`
  * fast, short-term storage 
  * for computations running on the cluster
* Local Scratch
  * `/scratch`, `$TMPDIR` on each compute node
  * very short-term: deleted automatically when the job ends
  * for serial, I/O-intensive applications. 
  * https://scicomp.ethz.ch/wiki/Using_local_scratch
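
For I/O-intensive Python code, a common pattern is to copy the input to local scratch at the start of the job and read it from there. The sketch below assumes that `$TMPDIR` points to the local scratch directory of the job, as listed above, and falls back to the system temporary directory elsewhere. The helper names `local_scratch_dir` and `stage_to_scratch` are hypothetical.

```python
import os
import shutil
import tempfile
from pathlib import Path


def local_scratch_dir():
    """Return the job's local scratch directory, or the system temp dir as fallback."""
    # tempfile.gettempdir() is the fallback when $TMPDIR is not set,
    # e.g. when the script is tested outside of a job.
    return Path(os.environ.get("TMPDIR", tempfile.gettempdir()))


def stage_to_scratch(source):
    """Copy one input file to local scratch and return the path of the copy."""
    target = local_scratch_dir() / Path(source).name
    shutil.copy2(source, target)
    return target
```

Because local scratch is deleted when the job ends, results written there have to be copied back to permanent storage before the job finishes.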




### Group storage for shareholders
Shareholders can buy as much space on Project and Work as they need and manage the access rights themselves.
* Project
  * `/cluster/project/groupname`
  * similar to $HOME, but for groups
* Work
  * `/cluster/work/groupname`
  * Similar to global scratch, but without purge




### External Storage
* **Central NAS:**
    - [Central Network Attached Storage (NAS)](https://ethz.ch/services/en/it-services/catalogue/storage/nas.html)
    - can be mounted via NFS (not CIFS)
* **Other NAS:**
    - it needs to support NFSv3 (the only NFS version supported by the cluster)
    - contact cluster-support@id.ethz.ch for more details

## Overview
### Life span

|                | Personal storage<br>for all users | Group storage<br>for shareholders |
|----------------|:----------------------------------:|:----------------------------------:|
| **long-term**  | Home                               | Project                            |
| **short-term** | Global Scratch<br>Local Scratch    | Work                               |


### File size


|  File system   |    Small files  |  Large files  |
|:--------------|:---------------:|:-------------:|
| Home           |         ✓       |       o       |
| Global Scratch |        o        |        ✓✓     |
|  Local Scratch |        ✓✓       |       o       |
| Project        |         ✓       |       ✓       |
| Work           |         o       |       ✓✓      |
|  Central NAS   |         ✓       |       ✓       |

>  For a more in-depth comparison of the file systems, see the [scicomp wiki](https://scicomp.ethz.ch/wiki/Storage_systems)

## Quotas

* The availability of disk resources is linked to your user account and the shareholder group.
* You can check the available disk space using `lquota`.
* There are limitations (quotas) for 
  * the numbers of files created 
  * the total storage space used
* Soft and hard quota
    * exceeding **soft quota**:
      * you'll have 5 days ("grace period") to reduce your disk usage
      * if you still exceed the soft quota after the grace period, you can no longer write
    * **hard quota** cannot be exceeded


```bash
$ lquota

+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/home/<username>    | space       |          3.36 GB |         17.18 GB |         21.47 GB |
| /cluster/home/<username>    | files       |            75505 |            80000 |           100000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/shadow             | space       |          8.19 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                3 |            50000 |            50000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/scratch/<username> | space       |         85.12 MB |          2.50 TB |          2.70 TB |
| /cluster/scratch/<username> | files       |            20779 |          1000000 |          1500000 |
+-----------------------------+-------------+------------------+------------------+------------------+
```

Remark: *`shadow` is internally used by the batch system to temporarily store the output from your compute jobs.*
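
Since the quotas limit both the number of files and the total size, it can help to find out which directory is responsible for most of the usage. The following sketch counts files and bytes below a directory with the standard library only. The function name `disk_usage` is made up for this example, and `lquota` remains the authoritative source for the actual quota values.

```python
import os


def disk_usage(root):
    """Return (number_of_files, total_bytes) for everything below root."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow or count symbolic links
            n_files += 1
            n_bytes += os.path.getsize(path)
    return n_files, n_bytes
```

A call such as `disk_usage(os.path.expanduser("~/euler_scripts"))` returns the two numbers for a single directory tree. Walking very large trees takes time, so it is better to apply it to individual subdirectories.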

**Copying data from/to the cluster:**

To transfer data between your computer and the cluster, you have to copy it yourself. The most common way is to use `scp`:

```bash
$ scp [options] source destination
```
Alternatives are: 
- command line tools: `rsync`, `sftp`, `svn`, `git`, `wget`, ...
- tools with a graphical user interface: [FileZilla](https://filezilla-project.org/), [Cyberduck](https://cyberduck.io/), [WinSCP](https://winscp.net/eng/index.php), ...

## Environment - Modules

The clusters provide modules to configure our computing environment for specific tools, e.g.:
- Development tools 
- Scientific libraries 
- Communication libraries (MPI)
- Third-party applications

Advantages:
- Users do not need to install software; they can just load existing modules
- Different versions of the same software can co-exist and can be selected explicitly
- We can easily try out different tools and  switch between versions to find out which one works best for us

**The New Software Stack**  


* The HPC team introduced a new system in 2020 to manage the modules, the so-called *new software stack*.
    * The old stack is still the default, but we recommend using the new software stack.
    * To permanently switch to the new software stack (more information [here](https://scicomp.ethz.ch/wiki/New_SPACK_software_stack_on_Euler)):
    `/cluster/apps/local/set_software_stack.sh new`
    * You can temporarily switch from 
      * new -> old stack: `lmod2env`
      * old -> new stack: `env2lmod`
* For a list of available Python modules, see [Python on Euler](https://scicomp.ethz.ch/wiki/Python_on_Euler)

**Module command**

* Software is organized into modules
* Modules can be managed (listed, loaded, ...) using the `module` command


| Command            	 | Description                                         	  |
|----------------------- |------------------------------------------------------- |
| `module`            	 | get info about module sub-commands                  	  |
| `module spider`      	 | list all modules available on the cluster           	  |
| `module spider <name>` | list all modules that match `<name>`                   |
| `module avail`       	 | list all modules that can be loaded in the current environment  |
| `module avail <name>`  | list all modules that match `<name>`                   |
| `module key <keyword>` | list all modules whose description contains `<keyword>`|
| `module help <name>`   | get information about module `<name>`                  |
| `module show <name>`   | show what module `<name>` does (without loading it)    |
| `module load <name>`   | load module `<name>`                                   |
| `module list`        	 | list all currently loaded modules                   	  |
| `module unload <name>` | unload module `<name>`                                 |
| `module purge`       	 | unload all modules at once                          	  |

* These commands are the same for both the old and the new software stack. 
* The main difference between the stacks is the set of available modules.


**Example**

```bash

$ env2lmod
Current modulesystem is already LMOD modules, nothing to change for env2lmod

$ module list
Currently Loaded Modules:
  1) StdEnv   2) gcc/4.8.5
```

```bash
$ module spider python

--------------------
  python:
--------------------

     Versions:
        python/2.7.14
        ...
        python/3.10.4
```

```bash
$ module spider python/3.10.4
...
You will need to load all module(s) on any one of the lines below 
before the "python/3.10.4" module is available to load.

      gcc/8.2.0
...
      
$ module load gcc/8.2.0 python/3.10.4
```

```bash
$ which python
/cluster/apps/nss/gcc-8.2.0/python/3.10.4/x86_64/bin/python

$ python --version
Python 3.10.4

$ module list
Currently Loaded Modules:
  1) StdEnv   2) gcc/8.2.0   3) openblas/0.3.15   4) python/3.10.4


```

### The Python ecosystem on Euler

* Conda is not recommended on HPC systems (amongst other things, because it creates a lot of small files which most HPC storage is not optimized for; see [here](https://scicomp.ethz.ch/wiki/Conda))
* When available, use centrally installed packages (installed as part of a Python module; see [Python on Euler](https://scicomp.ethz.ch/wiki/Python_on_Euler))
* Additional packages can be installed in `virtualenvs` (see [here](https://scicomp.ethz.ch/wiki/Python_virtual_environment))
* You can also ask for a package to be installed via a request to cluster-support@id.ethz.ch

To get a list of centrally installed packages:
```bash
$ module load gcc/8.2.0 python/3.10.4
$ pip list 
Package                         Version
------------------------------- ----------------------
absl-py                         1.0.0
AccessControl                   5.3.1
...
```
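
Before creating a virtual environment, we can also check from within Python whether the packages we need are already provided by the loaded module. The sketch below uses `importlib.metadata` from the standard library (Python 3.8 or newer). The helper names `installed_version` and `missing_packages` are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def missing_packages(required):
    """Return the names from `required` that are not installed."""
    return [name for name in required if installed_version(name) is None]
```

For example, `missing_packages(["numpy", "dask"])` returns the subset of the two names that would still have to be installed in a virtual environment.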

## Exercise 1: login, modules, data transfer [20min]

### Account
* In the next exercises, you should replace `<username>` with the training account and use the corresponding password.
* After the workshop, you can use your nethz account (the one you use to log in to [mail.ethz.ch](https://mail.ethz.ch/)). The password is the same as for your email account.

### Goals
* login to Euler and familiarize yourself
* navigate the file system
* explore the module system
* transfer some data from your PC

1. Login to the cluster.

```bash
$ ssh <username>@euler.ethz.ch

# Example:
$ ssh msmith@euler.ethz.ch
```

2. Navigate the file system

```bash
$ ls -a # list all files from your home directory

$ cd $SCRATCH # change into the global scratch directory
$ pwd
/cluster/scratch/username

$ cd $TMPDIR
$ pwd
/cluster/home/username # why is the current directory Home and not local scratch? 
```

3. Get familiar with the python modules available on the cluster.

```bash
$ env2lmod
$ module list
$ module spider python
$ module load gcc/8.2.0 python/3.10.4
$ module list
$ python --version
```

4. Exit from the cluster.

```bash
$ exit
```

5. Locate the `fast-python` repo on your PC and copy the directory `./scripts/section_6/euler_scripts` to your home directory on the cluster.

```bash
$ cd <fast-python-repo>
$ scp -r ./scripts/section_6/euler_scripts <username>@euler.ethz.ch:~
```

6. Login to the cluster again and check the content of the `~/euler_scripts` directory

```bash
$ ssh <username>@euler.ethz.ch

$ ls ~/euler_scripts
```

## Batch System

<p>
<img src="./images/Cluster.png" width="640">
<div>Source: <a href=https://scicomp.ethz.ch/wiki/File:Accessing_the_clusters.png>ETHZ scicomp wiki</a></div>
</p>

* The batch system is responsible for scheduling jobs.  
* Different batch systems exist: SGE, LSF, SLURM, PBS, HTCondor, etc
* Each system comes with its own features and syntax

### Ongoing transition to SLURM
* The cluster support team is in the process of transitioning Euler's batch system from **LSF** to **SLURM**
* By the end of 2022 the majority of nodes will have been migrated to SLURM
* If you are still using LSF, it is recommended to migrate to SLURM as soon as possible 
* Here, we will only discuss SLURM, not LSF
* SLURM docs in the scicomp wiki:
  * For more information on the transition, see the 
  [Transition from LSF to Slurm](https://scicomp.ethz.ch/wiki/Transition_from_LSF_to_Slurm)
  * For more information on SLURM commands and options, see the 
  [LSF to Slurm quick reference](https://scicomp.ethz.ch/wiki/LSF_to_Slurm_quick_reference)
  * [LSF/Slurm submission line advisor](https://scicomp.ethz.ch/public/lsla/index2.html) (helper tool to create submission commands or jobscripts for LSF and Slurm; we'll take a look together later)
 

### Check cluster status

```bash
$ sinfo
$ sinfo -Nel # 1 line per node and more info
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
...
eu-a2p-273          1  normal.24h       mixed 128    2:64:1 249200   842000    200 AMD,EPYC none                
...
```

### Job types

* Batch job
  * Create job script
  * Submit job to cluster
  * Do something else until the job has finished
  * Inspect results
* Interactive job
  * Submit interactive job
  * Get prompt on compute node
  * Run commands interactively
  * Exit interactive job

### Job types: use cases

* Batch job
  * Standard approach for data analysis
* Interactive job
  * Debugging a batch job (you have the same environment as in the batch job and can inspect states and errors interactively)
  * GUIs (via X11-forwarding; see [scicomp wiki](https://scicomp.ethz.ch/wiki/X11_forwarding_batch_interactive_jobs))
  


### Batch jobs
* Create job file
* Submit to cluster (`sbatch`)
* SLURM adds your job to the queue (`squeue`) 
* SLURM will schedule your job and execute it on a compute node once resources 
are available
* Check if your job is complete/failed/pending
* Output (stdout/stderr) is written to log file

* Create job file
    * Include bash shebang (`#!/bin/bash`)
    * Include a request for resources (starting with `#SBATCH` pragma; more on job options later)
    * Include commands to run (e.g., run Python script)
    
`job.sh`:
```bash 
#!/bin/bash

#SBATCH -c 2                              # Number of cpus (default: 1)
#SBATCH --time=2:00:00                    # Wall-clock time hours:minutes:seconds (default: 4h)
#SBATCH --mem-per-cpu=2G                  # 2 GB
#SBATCH --tmp=4000                        # temporary disk space per node (default units MB [M])
#SBATCH --job-name=analysis1
#SBATCH --output=analysis1.out
#SBATCH --error=analysis1.err

hostname
python ~/data_analysis.py
```

Use `sbatch` to submit a job to the batch system
```bash
$ module load gcc/8.2.0 python/3.10.4
$ sbatch job.sh
Submitted batch job <jobid>
```


* SLURM analyzes the job requirements and dispatches it to the job **queue**. 
* Once there are available resources, the job will start running.
* We recommend loading all required modules before submitting a job because the batch system uses module information to better schedule a job.
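
When many similar job scripts are needed, they can be generated from Python instead of being written by hand. The following is a minimal sketch that only uses the `#SBATCH` options shown in `job.sh` above. The function `render_job_script` and its default values are assumptions made for this illustration, not part of SLURM.

```python
def render_job_script(command, cpus=1, time="4:00:00", mem_per_cpu="1G", job_name="job"):
    """Return the text of a SLURM job script that runs `command`."""
    lines = [
        "#!/bin/bash",
        "",
        f"#SBATCH -c {cpus}",
        f"#SBATCH --time={time}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu}",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={job_name}.out",
        f"#SBATCH --error={job_name}.err",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_job_script(
    "python ~/data_analysis.py", cpus=2, time="2:00:00", mem_per_cpu="2G", job_name="analysis1"
)
print(script)
```

The resulting text can be written to a file and submitted with `sbatch` in the usual way.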

#### Batch jobs without job script

via the `--wrap` flag:

`sbatch --wrap "python ~/data_analysis.py"`

### Interactive jobs
* Request an interactive job with specific resources
* SLURM will schedule your job and execute it on a compute node once resources 
are available
* You receive a prompt on a compute node to run your commands interactively
* Output is printed directly in the terminal
* Interactive jobs offer
* Exit interactive job

Use `srun` to submit an interactive job with the default options
```bash
user@eu-login$ srun --pty bash  # submit interactive job
user@eu-compute$ pwd            # run command on compute node
/cluster/home/user
user@eu-compute$ exit           # exit interactive job
exit
user@eu-login$
```

**Example**:

`srun --time 1:00:00 -c 2 --mem-per-cpu 2G --pty bash`

This requests an interactive job with a maximum run time of 1 hour, 2 CPUs, and 2 GB of memory per CPU (4 GB in total).

### Resource requirements
The batch system of Euler works like a black box:
- We do not need to know anything about queues, hosts, user groups, priorities, etc. to use it.
- We only need to specify the resources needed by our job.

Resources are requested:
- via the `#SBATCH` pragma (directive) in job scripts, or
- as options when calling SLURM submission commands (`sbatch` and `srun`)

The most important resources are:
- **maximal run-time** `--time <HH:MM:SS>` (default 4 hours) 
- **memory** `--mem-per-cpu <memory>` (default 1024 MB per CPU)
- Multiprocessing and parallel jobs
  - **number  of CPUs per task** 
    - `-c <number_of_cpus>`, (or `--cpus-per-task`; default 1)
    - Use this option for multi-threaded jobs
  - **number of tasks** 
    - `-n <number_of_tasks>`, (or `--ntasks`; default 1) for parallel jobs
    - Use this option for parallel tasks (e.g., MPI jobs)
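
Because memory is requested per CPU, the total memory of a job grows with the number of CPUs. The small sketch below makes this arithmetic explicit, assuming that the total is the product of memory per CPU, CPUs per task, and number of tasks. The function name `total_memory_mb` is made up for this example.

```python
def total_memory_mb(mem_per_cpu_mb=1024, cpus_per_task=1, ntasks=1):
    """Total memory requested by a job, in MB."""
    return mem_per_cpu_mb * cpus_per_task * ntasks


# -c 2 --mem-per-cpu 2G: 2 CPUs with 2048 MB each
print(total_memory_mb(mem_per_cpu_mb=2048, cpus_per_task=2))  # 4096
```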

Note that:
* Options passed to the command will override the options in the job script.
* If you request more resources than are available, the job will stay pending forever (you can cancel it with `scancel`).
* Docs
    * Further information on resource options can be found in the 
    [scicomp wiki - LSF to Slurm quick reference](https://scicomp.ethz.ch/wiki/LSF_to_Slurm_quick_reference)
    * You can assemble a jobscript with the desired options using the [LSF/Slurm submission line advisor](https://scicomp.ethz.ch/public/lsla/index2.html).

#### GPUs

To request GPUs:
* `--gpus=1`
* `--gpus=gtx_1080:1` (for a specific GPU model)

## Exercise 2: submission advisor [5 min]
  * Open the [LSF/Slurm submission line advisor](https://scicomp.ethz.ch/public/lsla/index2.html)
  * Select "Slurm" as "Batch System"
  * Tweak the options and click "display command/script"
  * Also switch between "command" and "jobscript"
  
  <p>
<img src="./images/submission-advisor.png" width="640">
</p>
  

### Job queue
```
$ squeue
$ squeue -i 5   # (iterate) refresh every 5 seconds
$ squeue -l     # (--long) slightly more verbose output
```

```
$ squeue -l
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
3266  normal.4h analysis     user  RUNNING       1:10   2:00:00      1 eu-a2p-524
```

### Resource usage information of running jobs

```bash
$ sstat -a --format JobID,AveCPU,AveRSS,MaxRSS -j <jobid>.batch
        JobID     AveCPU     AveRSS     MaxRSS 
-------------- ---------- ---------- ---------- 
<jobid>.batch   00:14.000   8745040K   8745040K 
```

* `AveCPU`: Average (system + user) CPU time of all tasks in the job. Example: 2 CPUs running at 100% for one minute result in 2 minutes of `AveCPU` time.
* `AveRSS`: Average resident set size, i.e., the average memory usage of the job.
* `MaxRSS`: Maximum resident set size, i.e., the maximum memory usage of the job.
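
The memory columns are printed with a unit suffix, which makes them awkward to compare with the requested memory. The following sketch converts such values to bytes, assuming that the suffixes `K`, `M`, `G` and `T` denote powers of 1024. The function name `parse_memory` is hypothetical.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_memory(value):
    """Convert a value such as '8745040K' to a number of bytes."""
    value = value.strip()
    if value and value[-1].upper() in UNITS:
        return float(value[:-1]) * UNITS[value[-1].upper()]
    return float(value)  # no suffix: assume the value is already in bytes


max_rss = parse_memory("8745040K")
print(f"MaxRSS: {max_rss / 1024**3:.2f} GiB")  # MaxRSS: 8.34 GiB
```

Comparing this number with the memory requested for the job shows whether the request was too generous or too tight.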

### Details on past jobs
```bash
$ sacct -X --format JobID,Elapsed,State,ExitCode -j <jobid>
```

**Example successful job**
```
$ sacct -X --format JobID,Elapsed,State,ExitCode -j 3284
JobID           Elapsed      State ExitCode 
------------ ---------- ---------- -------- 
3284           00:01:43  COMPLETED      0:0 
```

**Example failed job (exit code 1)**
```
$ sacct -X --format JobID,Elapsed,State,ExitCode -j 3286
JobID           Elapsed      State ExitCode 
------------ ---------- ---------- -------- 
3286           00:00:01     FAILED      1:0 
```

**Example failed job (out of memory)**
```
$ sacct -X --format JobID,Elapsed,State,ExitCode -j 11349
JobID           Elapsed      State ExitCode 
------------ ---------- ---------- -------- 
11349          00:00:05 OUT_OF_ME+    0:125  
```

For information on the job state codes, see the [official SLURM docs](https://slurm.schedmd.com/sacct.html#SECTION_JOB-STATE-CODES)
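
When many jobs have to be checked, the `State` and `ExitCode` columns can be evaluated in Python. The sketch below works on the values from the three examples above and assumes that the `ExitCode` field has the form `<exit code>:<signal>`. The helper names `parse_exit_code` and `job_succeeded` are made up for this illustration.

```python
def parse_exit_code(field):
    """Split an ExitCode field such as '1:0' into (exit_code, signal)."""
    code, _, signal = field.partition(":")
    return int(code), int(signal or 0)


def job_succeeded(state, exit_code_field):
    """Return True only for a COMPLETED job whose exit code and signal are both 0."""
    return state == "COMPLETED" and parse_exit_code(exit_code_field) == (0, 0)


# Rows copied from the examples above: (JobID, State, ExitCode)
rows = [("3284", "COMPLETED", "0:0"), ("3286", "FAILED", "1:0"), ("11349", "OUT_OF_ME+", "0:125")]
for job_id, state, exit_code in rows:
    print(job_id, "ok" if job_succeeded(state, exit_code) else "failed")
```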

**Resource usage information of past jobs**
```bash
$ sacct --format JobID,AveCPU,AveRSS,MaxRSS -j <jobid>.batch
```

### Output of jobs
* SLURM creates an output file for each job. 
* By default the name of the file is `slurm-<jobid>.out`.
* If specified, `stdout` is written to the file given by `--output` and `stderr` to the file given by `--error`.

### Cancelling a job
Cancel pending or running jobs:
```bash
$ scancel <jobid>
$ scancel -u $USER  # cancel all your jobs
```

## Exercise 3: job submission & management [20 min]


### Account
* In the next exercises, you should replace `<username>` with the training account and use the corresponding password.
* After the workshop, you can use your nethz account (the one you use to log in to [mail.ethz.ch](https://mail.ethz.ch/)). The password is the same as for your email account.

### Goals
* login to Euler
* inspect the files required to run the job
* submit the job
* observe its status & output

1. Login to the cluster.

```bash
$ ssh <username>@euler.ethz.ch

# Example:
$ ssh msmith@euler.ethz.ch
```

2. Check the content of the Python script `~/euler_scripts/job_summary.py` and of the job script `~/euler_scripts/job.sh`.

```bash
$ cat ~/euler_scripts/job_summary.py 
$ cat ~/euler_scripts/job.sh
```

3. Load modules, submit the script, and identify its `<jobid>`

```bash
$ module load gcc/8.2.0 python/3.10.4
$ sbatch ~/euler_scripts/job.sh
```


4. Check the job status and the queue using `squeue` and/or `sstat`.

```bash
$ squeue -j <jobid>
$ sstat -a --format JobID,AveCPU,AveRSS,MaxRSS -j <jobid>.batch
```

5. Once the job is done list the files that start with `slurm-`, and check that you have one that includes the `<jobid>` of the previous job.

```bash
$ ls -l slurm-*
```

6.  Display the content of the output file corresponding to your job

```bash
$ cat slurm-<jobid>*
```

Note the difference between `SLURM_CPUS_PER_TASK` and `os.cpu_count()`. What is the implication for multi-core Python code?

7. Check the job status and exit code. 

```bash
$ sacct -X --format JobID,State,Elapsed,ExitCode -j <jobid>
```
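
One way to act on the observation from step 6 is to derive the number of worker processes from the job allocation instead of from `os.cpu_count()`, which reports all cores of the node. The following is a sketch under the assumption that `SLURM_CPUS_PER_TASK` is set when the job was submitted with `-c`. The function name `available_cpus` is hypothetical.

```python
import os


def available_cpus():
    """Number of CPUs this process should use."""
    # Inside a job: the number of CPUs requested with -c / --cpus-per-task.
    # Assumption: SLURM only sets this variable when that option was given.
    from_slurm = os.environ.get("SLURM_CPUS_PER_TASK")
    if from_slurm:
        return int(from_slurm)
    # On Linux: the CPUs this process is allowed to run on.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Last resort: all CPUs of the machine, which may be more than allocated.
    return os.cpu_count() or 1


print(f"os.cpu_count() = {os.cpu_count()}, usable CPUs = {available_cpus()}")
```

The returned value can be passed to, for example, `multiprocessing.Pool(processes=available_cpus())` so that the program does not start more workers than the job has CPUs.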

## Dos and Don'ts

**Dos**

- Understand what we are doing
- Ask for help if we don't understand what we are doing, e.g. write to cluster-support@id.ethz.ch
- Use the wiki https://scicomp.ethz.ch
- Optimize our workflow to make it as efficient as possible
    - Jobs with shorter expected running time (`--time HH:MM:SS`) are likely to get the compute resources faster.
- Keep in mind that our clusters are shared by many users
- Carefully choose the file system you want to use
- Try to have jobs of at least 5 minutes
- Before requesting multiple cores, check whether the program supports parallelism and how to use it.

**Don'ts**

- Don't waste CPU time or disk space
- Don't run applications on the login nodes
    - The login nodes are shared between many users and we can negatively impact their experience.
    - You should run them on compute nodes by using an interactive batch job, e.g.:
    ```bash
    $ srun --pty bash
    ```
- Don't write large amounts of data to standard output
    - The size of standard output (`/cluster/shadow`) is finite, so you might lose it. Write or redirect to a file (see [here](https://scicomp.ethz.ch/wiki/Too_much_space_is_used_by_your_output_files)).
- Don't run module commands within a job or job script
    - We strongly recommend loading all needed modules before submitting a job because the batch system uses module information to better schedule a job.
- Don't create millions of small files
    - The file system used by **Global scratch** and **Work** is not designed for a lot of small files. The entire storage can be negatively impacted.
- Don't use conda (because it creates a lot of small files)
- Don't run hundreds of small jobs if the same work can be done in a single job
    - The batch system has to do some work to put the job in the queue. By creating hundreds of small jobs the batch system is negatively impacted.
    - Use job arrays if possible (see [Job Arrays](#BatchParallel))

## Check questions [10min]

1. How many users can use an HPC cluster at the same time?
2. What is the difference between a compute and a login node?
3. Who is scheduling the jobs?
4. What is a module?
5. How can you get Python 3.10 on Euler?
6. How do you test a code / script on Euler?
7. Imagine that you are using two different HPC clusters, each of them with a different batch system. What type of problems do you expect to have when you try to move your job from one cluster to another: think about the nodes, data, environment and batch system.

**Solution**
1. Many, since the HPC cluster is shared between the users.
2. The login nodes allow users to login to the cluster and submit jobs, and the compute nodes are running the jobs (doing the computation).
3. The batch system (SLURM on Euler).
4. The way to configure our computing environment for specific tools.
5. We activate the new software stack if desired via `env2lmod` and load the modules with `module load gcc/8.2.0 python/3.10.4`.
6. Via an interactive job. More precisely, in case we want to experiment on a compute node, we can start a new terminal with `srun --pty bash`.
7. A lot of problems:
    - different nodes so the time to solution might be different, big memory nodes might not be available;
    - the storage layouts may differ and be optimized for different kinds of data
    - for the environment we have to test whether all applications are available. The environment might work differently, e.g. we have to reload all modules when the jobs are running on the compute node;
    - for the batch system we have to check how to do the daily business: submit jobs, monitor, kill, interactive jobs, ask for resources, ...

## Further Reading
- https://scicomp.ethz.ch
- https://slurm.schedmd.com/

# Optional Topics
## Batch System: Advanced Topic - Parallel/Dependent Jobs 
<a name="BatchParallel"></a>

### Embarrassingly Parallel Jobs

![](./images/job_array.svg)

Each job submitted to the batch system creates a small amount of work for the batch system.  
For multiple similar jobs we can reduce this overhead by submitting all of them at once as a so-called "job array".

In the following example we submit a job array of 4 jobs:

```
$ sbatch --array=1-4 --wrap 'echo "Hello, I am task $SLURM_ARRAY_TASK_ID of $SLURM_ARRAY_TASK_COUNT"'
```
* `--wrap` allows submitting a batch job without a job script
* The `SLURM_ARRAY_TASK_ID` environment variable can be used in scripts to select the item to work on (input file, patient ID, list entry, ...)
* In Python you can read environment variables using `os.environ` which is a dictionary mapping names of variables to their values as strings.
* You can limit the number of simultaneously running tasks using the `%` sign (e.g., `--array=1-4%2` will only run 2 tasks at a time)
* You can define the `--array` via the command line or inside a job script

So in case you want to process 1000 files in parallel you could:

1. create a text file with the names of the 1000 files
2. submit a job array using `sbatch --array=1-1000%20 --wrap "python process.py"` to start 1000 tasks, at most 20 of them running at the same time
3. `process.py` 
    - reads the list of file names and picks the file name at index `int(os.environ["SLURM_ARRAY_TASK_ID"]) - 1` (Python indices start at `0`)
    - processes this file.
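
The steps above can be sketched as follows. This is a minimal illustration, not a complete `process.py`: the list file name `file_list.txt` and the functions `select_file` and `process` are assumptions made for this example, and the actual processing is left as a placeholder.

```python
import os
import sys
from pathlib import Path


def select_file(list_path, task_id):
    """Return the file name on line `task_id` (1-based) of the list file."""
    names = Path(list_path).read_text().splitlines()
    return names[task_id - 1]  # array IDs start at 1 here, Python indices at 0


def process(file_name):
    """Placeholder for the real work on one file."""
    print(f"processing {file_name}")


if __name__ == "__main__" and "SLURM_ARRAY_TASK_ID" in os.environ:
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    # Hypothetical list file: one input file name per line.
    list_path = sys.argv[1] if len(sys.argv) > 1 else "file_list.txt"
    process(select_file(list_path, task_id))
```

Keeping the selection logic in a separate function makes it possible to test it on a PC, where `SLURM_ARRAY_TASK_ID` is not set.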

*Key aspects*

- All jobs in an array share the same `<jobid>`.  
- To refer to a single task (e.g., `scancel`), use `<jobid>_<taskid>`
- Each task can have its own standard output (default: `slurm-<jobid>_<taskid>.out`).

*For more information on SLURM job arrays, see*
* [LSF to Slurm quick reference](https://scicomp.ethz.ch/wiki/LSF_to_Slurm_quick_reference)
* [SLURM docs](https://slurm.schedmd.com/job_array.html)

## Dependent Jobs - Job Chaining<a name="JobChaining"></a>


![](./images/job_chaining.svg)


In case the output of a job is needed as input for another job, one can use `--dependency` to wait until the required job has completed successfully.

For example, we can create a chain `Job_1+Job_2->Job_3->Job_4` and use the job IDs to specify the dependencies:

```bash
$ sbatch -J Job_1 --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 20'
Submitted batch job 9551
$ sbatch -J Job_2 --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 20'
Submitted batch job 9552
$ sbatch -J Job_3 --dependency=afterok:9551,afterok:9552 --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 10'
Submitted batch job 9553
$ sbatch -J Job_4 --dependency=afterok:9553 --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 10'
Submitted batch job 9554
```

or look up the `<jobid>` via the job name (simplified example):

```bash
$ sbatch -J Job_1 --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 20'
$ jobid=$(squeue --noheader --format %i --name Job_1)
$ sbatch -J Job_2 --dependency=afterok:$jobid --wrap 'date; echo "I am job $SLURM_JOB_ID"; sleep 20'
```
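
If the chain is built from a Python script, the job ID has to be extracted from the output of `sbatch` and turned into a `--dependency` option. The sketch below shows only this string handling, based on the output format shown above. The helper names `parse_job_id` and `dependency_option` are hypothetical.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the job ID from a line such as 'Submitted batch job 9551'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def dependency_option(job_ids, kind="afterok"):
    """Build the --dependency option for one or more job IDs."""
    return "--dependency=" + ",".join(f"{kind}:{job_id}" for job_id in job_ids)


ids = [parse_job_id("Submitted batch job 9551"), parse_job_id("Submitted batch job 9552")]
print(dependency_option(ids))  # --dependency=afterok:9551,afterok:9552
```

On the cluster, the input of `parse_job_id` would come from running `sbatch` via `subprocess.run(..., capture_output=True, text=True)` and reading its `stdout`.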

<div class="alert alert-warning">
  <strong>Warning!</strong> 
  Using sophisticated job chaining can introduce a dependency on the batch system. Running our workflow on a cluster with a different batch system might therefore require rewriting the job chaining part.<br/>
  Workflow management systems that provide integration with various HPC systems can overcome this.
</div>

## Workflow Management System + HPC integration <a name="WMS"></a>

As we mentioned in the [Job Chaining](#JobChaining) subsection, the workflow management systems allow us to specify the dependency between the jobs independently of the batch system used.   
The jobs (also called tasks or rules) determine a **Directed Acyclic Graph (DAG)**.  
In practice one usually defines for each rule the input and output files, and the DAG is determined directly by the workflow management system.  
Nodes of this graph are tasks and edges define the dependencies between those tasks.


![](./images/DAG.svg)

A simple workflow may consist of multiple scripts that need to be run in a dependent manner, but there are also parts that can profit from parallelization, e.g. `Job_1` and `Job_2` can run concurrently.  
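
To illustrate how an execution order follows from such a graph, the sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer). It assumes that the DAG has the same shape as the chain `Job_1+Job_2->Job_3->Job_4` from the job chaining subsection, and it only computes the order without running anything.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (Job_1+Job_2->Job_3->Job_4).
dependencies = {
    "Job_1": set(),
    "Job_2": set(),
    "Job_3": {"Job_1", "Job_2"},
    "Job_4": {"Job_3"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)  # pretend the whole batch finished successfully

print(batches)  # [['Job_1', 'Job_2'], ['Job_3'], ['Job_4']]
```

Jobs in the same batch do not depend on each other and could run concurrently. Workflow management systems perform this kind of analysis internally.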

Since the input and output files are available, they can be used as checkpoints, i.e. in case `Job_4` fails there is no need to recompute the previous jobs, since their output is available in the corresponding output files.
       
Examples: [Airflow](https://airflow.apache.org/), [Luigi](https://luigi.readthedocs.io), [Nextflow](https://www.nextflow.io/), [Snakemake](https://snakemake.readthedocs.io/), [Nipype](https://nipype.readthedocs.io/), ...

These tools allow for easy scaling up and out on cloud or HPC clusters, e.g. for HPC clusters:
- [Snakemake](https://snakemake.readthedocs.io/): SGE, LSF, SLURM, PBS, HTCondor, DRMAA, ...
- [Nextflow](https://www.nextflow.io/): SGE, LSF, SLURM, PBS and HTCondor
- [Luigi](https://luigi.readthedocs.io): SGE and LSF
- [Nipype](https://nipype.readthedocs.io/): SGE, PBS, HTCondor, LSF and SLURM


<div class="alert alert-warning">
  <strong>Warning!</strong> 
  Workflow management systems generally do not provide Job Array integration.
</div>