Tutorial 1: First Steps on the Supercomputer
==========

**Content creators**: Stefan Kesselheim, Jan Ebert

**Content reviewers / testers**: Alexandre Strube

In this first tutorial, you will take your first steps on **Juwels**, including **Juwels Booster**, the powerful upgrade that features 936 nodes with four NVIDIA A100 GPUs each. This tutorial assumes that you have at least basic familiarity with the command line.

## Exercise 0: Install an SSH client

Before you can start, an SSH client must be installed on your machine. On both Mac and Linux, an SSH client should be installed by default. On Windows, we recommend installing the Windows Subsystem for Linux (WSL). On older Windows versions without WSL, you have to install a terminal emulator like [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/).

## Exercise 1: SSH connection to Juwels
As a first step, you will create a key pair for public/private key authentication. Then, you will register the public key for access to Juwels using the JuDoor web page. To do so, you must add a meaningful restriction of the range of IPs or hostnames that are allowed to connect to Juwels. Finally, you will be able to connect to Juwels. This exercise guides you through the process that is explained in more detail in the [Juwels access documentation pages](https://apps.fz-juelich.de/jsc/hps/juwels/access.html).

Execute the following command in the command line to create an ED25519 key pair directly in your `.ssh` directory.
```bash
ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519
```
On Windows, you must define a different storage location for the key pair, but otherwise the command works. In WSL you can execute the command right away.

The command generated two keys, a public one and a private one. The public key (ending in `.pub`) is similar to your hand-written signature: you may give it to others who can then use it to confirm your identity. The private key (we called it `id_ed25519`) **must not** be shared. Continuing with the hand-written signature analogy, the private key is the way _you_ write your signature. Just as you would not give others the ability to perfectly copy your hand-written signature, you should under no circumstance publicize your private key.

Before you can add your public SSH key to the list of authorized SSH keys for Juwels, you must create a valid *from-clause* that meaningfully restricts the range of IPs. There are several ways to do that, e.g. checking the IP range of your internet service provider (ISP). If you know the IP range of your ISP, or if you can connect to a VPN giving you a fixed IP range (FZ Jülich's VPN is an example, but other institutions work as well), this is very easy: you can directly use the IP range as a *from-clause*. For FZ Jülich your *from-clause* would be:
```
from="134.94.0.0/16"
```
Note that the `/16` is the prefix length of the subnet: the first 16 bits of the address are fixed, hence all addresses of the form 134.94.\*.\* will be allowed. If you use this option, you can directly jump to the point *Register your public key*.
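
To make the meaning of the prefix length concrete, here is a minimal Bash sketch that checks whether an IPv4 address falls inside a CIDR range. The helper names `ip_to_int` and `ip_in_cidr` are made up for this illustration; the actual matching is done by the SSH server, not by a script like this.

```bash
#!/usr/bin/env bash
# Convert a dotted IPv4 address into a single 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return success (0) if the address $1 lies inside the CIDR range $2.
ip_in_cidr() {
    local ip_int net_int prefix mask
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    # A /16 prefix keeps the top 16 bits and masks out the lower 16.
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    (( (ip_int & mask) == (net_int & mask) ))
}

ip_in_cidr 134.94.12.7 134.94.0.0/16 && echo "134.94.12.7 is allowed"
ip_in_cidr 134.95.0.1 134.94.0.0/16 || echo "134.95.0.1 is not allowed"
```

The second address differs in the second octet, which is part of the fixed 16 bits, so it falls outside the range.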

We also show here the slightly more difficult steps to create a *from-clause* based on reverse DNS lookup.

1. Visit the [JuDoor page](https://judoor.fz-juelich.de). Prior to this course, you should have visited this page to register and get access to the compute resources. Under the header **Systems**, find **juwels -> Manage SSH-keys** and navigate to it. On this page, your IP should be visible.\
   Example: *Your current IP address is 37.201.214.241*.
2. Perform a reverse DNS search of your IP and extract the DNS name (the field *Name*) associated with your IP. Type into your command line:

   ```bash
   nslookup <your-ip>
   ```

   Example results:\
   *Name:    aftr-37-201-214-241.unity-media.net* or *\[...\] name = aftr-37-201-214-241.unity-media.net*

3. Guess a wildcard pattern that will likely match all your future connections, for example `*.unity-media.net`.
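
One simple way to guess such a pattern is to replace the first, host-specific label of the DNS name with `*`. The following sketch does exactly that; the function name `wildcard_from_host` is hypothetical, and whether the resulting pattern really covers your future connections depends on how your ISP names its hosts.

```bash
# Derive a wildcard pattern from a reverse-DNS name by replacing
# the first (host-specific) label with "*".
wildcard_from_host() {
    local host="${1%.}"   # drop a trailing dot, as printed by some nslookup versions
    echo "*.${host#*.}"
}

wildcard_from_host "aftr-37-201-214-241.unity-media.net"
```

For the example name above, this prints `*.unity-media.net`.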

### Register your public key
Now, you can register your public key in JuDoor: Create a *from-clause* from your wildcard expression or IP range and enter it into the field *Your public key and options string*, but do not confirm yet. Then, open your public key file `~/.ssh/id_ed25519.pub` and copy its contents into the same field (making sure there is a single space between the *from-clause* and the contents of the file) and select *Start upload of SSH-Keys*. Note the file ending `.pub`!\
Example line:
```
from="*.unity-media.net" ssh-ed25519 AAAAasdbmnsowrmnsdigninmnmnasdta username@HOSTNAME
```
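
If you prefer to assemble this line on the command line and paste it into JuDoor, a small helper along the following lines may be useful. It is only a sketch: the function name `make_key_line` is made up, and the pattern `*.unity-media.net` is the example from above, not a value you should reuse.

```bash
# Build the line for the JuDoor field from a from-clause pattern
# and a public key file.
make_key_line() {
    local pattern="$1" pubkey_file="$2"
    # Exactly one space separates the from-clause from the key material.
    printf 'from="%s" %s\n' "$pattern" "$(cat "$pubkey_file")"
}

# Print the line for the default public key, if it exists.
if [ -f "${HOME}/.ssh/id_ed25519.pub" ]; then
    make_key_line "*.unity-media.net" "${HOME}/.ssh/id_ed25519.pub"
fi
```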

After a few minutes, your newly added SSH key should be available. Note that JuDoor writes the file `~/.ssh/authorized_keys` in your Juwels home directory, thus manually added SSH keys will automatically be overwritten.

Finally, you can log into Juwels, using
```bash
ssh -i <path>/<to>/id_ed25519 username@juwels-booster.fz-juelich.de
```
If you have created the key pair in `~/.ssh/` under the default name `id_ed25519`, you can omit the `-i` option, as `ssh` tries the default key files in your `.ssh` directory automatically. Your username is identical to your username on the JuDoor website, typically *lastname1*.
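
To avoid typing the full host name and user name every time, you can add a host entry to your SSH client configuration. The sketch below writes an example entry to a scratch file so that you can review it before copying it into `~/.ssh/config` yourself. The alias `booster` is a hypothetical name and `lastname1` is a placeholder for your own username.

```bash
# Write an example host entry to a scratch file; review it before
# appending it to ~/.ssh/config yourself.
config_snippet="$(mktemp)"
cat > "${config_snippet}" <<'EOF'
# "booster" is a hypothetical alias; replace lastname1 with your JuDoor username.
Host booster
    HostName juwels-booster.fz-juelich.de
    User lastname1
    IdentityFile ~/.ssh/id_ed25519
EOF
cat "${config_snippet}"
```

With such an entry in `~/.ssh/config`, `ssh booster` is enough to connect.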


### Tasks
Once SSH is up and running, you are ready to perform a few tasks.
1. Create a personal directory named after your user in the project folder located in `/p/project/training2306/`.
   ```bash
   mkdir /p/project/training2306/${USER}
   ```

1. Navigate to this folder.
   ```bash
   cd /p/project/training2306/${USER}
   ```

1. Clone the [course material Git repository](https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_deep_learning/course-material.git) to that folder.
   ```bash
   git clone https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_deep_learning/course-material.git
   ```


## Exercise 2: Your first Slurm job

After logging into Juwels Booster, you may notice that you are not actually connected to a node named `juwels-booster.fz-juelich.de`. The hostname might be `jwlogin23.juwels`, as you have been redirected to one of the many login nodes. Login nodes are not intended for computational workloads; they only serve as entry points to the supercomputer. To start a job on a compute node, type the following command:
```bash
srun --pty --nodes=1 -A training2306 --partition=dc-gpu --gres gpu --time=00:15:00 /bin/bash
```
Notice how the command prompt changes. For example, when writing this tutorial, it changed to `kesselheim1@jwb0012`. 
![prompt](./images/prompt.png)

Now you have started an interactive job on a compute node. Execute `nvidia-smi` to check the status of the GPUs installed on the machine you have been assigned to.  

Open a second terminal and `ssh` to Juwels. To check on your job, use
```bash
squeue
```
to inspect the current status of the queues. Enter
```
squeue -u <username>
```
to show only the lines of `squeue` that belong to your user.
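
If the default `squeue` output is too wide or too noisy, you can select the columns yourself with the `-o` format option. The following wrapper is a sketch: the function name `my_jobs` and the chosen column widths are arbitrary, and the call is guarded so that nothing happens on machines without Slurm.

```bash
# Show only your own jobs in a compact, fixed set of columns.
# Columns: job ID, partition, job name, state, elapsed time,
# and node list (or the reason why a pending job is waiting).
my_jobs() {
    squeue -u "${USER:-$(id -un)}" -o "%.10i %.12P %.20j %.10T %.10M %R"
}

# Only call it where Slurm is available, e.g. on a Juwels login node.
if command -v squeue > /dev/null 2>&1; then
    my_jobs
fi
```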

### Tasks
1. What is the meaning of the string `training2306` in the command above?
1. What is the partition and queue that your job was assigned to?
1. What is your job's Slurm job ID?
1. Can you download a file from the internet from the compute node? Try for example
   ```bash
   wget https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_deep_learning/course-material/-/raw/main/README.md
   ```
1. Cancel your job by using `scancel <job_id>`.

## Exercise 3: Batch jobs
In the previous exercise, you learned how to run an interactive Slurm job. In practice, you will often run longer-running jobs as *batch jobs* that wait in the queue until compute nodes are allocated. Batch jobs are typically written as scripts, often in the user's favorite shell language, such as Bash. Here is an example `hello_world.sbatch`:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH -A training2306
#SBATCH --partition=dc-gpu
#SBATCH --gres gpu
#SBATCH --time=00:02:00

# Create directory if needed and navigate to it
mkdir -p /p/project/training2306/${USER}
cd /p/project/training2306/${USER}

echo "This message indicates that the job is running."
echo "Hello world!" > greeting.txt
```

Note the `#SBATCH` comments that Slurm will interpret as flags for where and how to run the job. You can build complex resource requirements, build multi-stage jobs etc. with these. For this exercise, it will be enough to know that you can use the lines above to run a two-minute job on one of the booster's nodes. It is straightforward to adjust the maximum runtime and the number of compute nodes.

To start the batch job, run
```bash
sbatch hello_world.sbatch
```

Since you will want to read the output to check for errors or to be politely greeted, Slurm automatically writes the job's output to a file named after the job ID, `slurm-<job-id>.out`. You can give this file your own name with the `--output` flag.
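
As an illustration of how the runtime, the number of nodes, and the output file name can be adjusted, the following sketch generates a variant of the script above. The function name `write_job_script` and the file names `hello_two_nodes.sbatch` and `hello-%j.out` are hypothetical choices; `%j` is the Slurm file name pattern that is replaced by the job ID.

```bash
# Write a variant of hello_world.sbatch with adjustable resources.
# Arguments: number of nodes, time limit (HH:MM:SS), script file to create.
write_job_script() {
    local nodes="$1" time_limit="$2" script="$3"
    cat > "${script}" <<EOF
#!/bin/bash
#SBATCH --nodes=${nodes}
#SBATCH -A training2306
#SBATCH --partition=dc-gpu
#SBATCH --gres gpu
#SBATCH --time=${time_limit}
#SBATCH --output=hello-%j.out

echo "This message indicates that the job is running."
EOF
}

# Two nodes, five minutes; inspect the result before submitting it.
write_job_script 2 00:05:00 hello_two_nodes.sbatch
cat hello_two_nodes.sbatch
```

Alternatively, options given on the `sbatch` command line take precedence over the `#SBATCH` lines in the script, so `sbatch --time=00:05:00 hello_world.sbatch` changes the time limit without editing the file.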

### Tasks

1. Modify the script such that it will print the host name of the compute node into a file `hostname.txt`.
1. Add a line `sleep 60` at the end of the script. Run the job again and use `squeue` to determine if and when the job is running. Hint: `squeue -u $USER` will only show your jobs.
1. Cancel the job. (Take a look at the previous exercise if you cannot remember how to do this.)
1. Use `sacct -S 2020-01-01` to retrieve information about all jobs you have run since Jan 1st, 2020.

## Exercise 4: Compute environment

A supercomputer is a shared resource, and therefore it is challenging to build a compute environment that satisfies scientific criteria like reproducibility. At JSC, [environment modules](https://modules.readthedocs.io/en/latest/index.html) are used to provide a modularized but consistent compute environment. Software is not installed system-wide but encapsulated in modules. Loading a module corresponds to setting a set of environment variables such that certain software is found. This also allows multiple versions of the same software to be installed side by side without interfering with each other. Providing curated sets of environment modules is a challenging task in the administration of a supercomputer.

Modules can be loaded and unloaded with the `ml` command. Documentation can be found [here](https://modules.readthedocs.io/en/latest/ml.html).
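
To get a feeling for what "setting environment variables such that certain software is found" means, the following sketch simulates the effect by hand. It does not use the real module system: the tool `mytool`, its two versions, and the directory layout are hypothetical stand-ins created in a temporary directory.

```bash
# Simulate what loading a module does: prepend a directory to PATH
# so that the shell finds a specific version of a tool.
software_root="$(mktemp -d)"
mkdir -p "${software_root}/mytool/1.0/bin" "${software_root}/mytool/2.0/bin"
for version in 1.0 2.0; do
    printf '#!/bin/sh\necho "mytool version %s"\n' "${version}" \
        > "${software_root}/mytool/${version}/bin/mytool"
    chmod +x "${software_root}/mytool/${version}/bin/mytool"
done

original_path="${PATH}"

# "Load" version 1.0
PATH="${software_root}/mytool/1.0/bin:${original_path}"
mytool

# "Swap" to version 2.0 without touching the 1.0 installation
PATH="${software_root}/mytool/2.0/bin:${original_path}"
mytool
```

Both versions stay installed next to each other; only the search path decides which one is picked up. Real modules typically adjust several variables in the same spirit, not just `PATH`.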

On top of the environment modules, it is possible to use Python virtual environments. For this course, we have selected a set of modules and installed a set of Python packages that can be used for all tutorials.

### Tasks

1. Activate the tutorial compute environment by typing
   ```bash
   source /p/project/training2306/software_environment/activate.sh
   ```
   You will have to activate this environment each time you log in. Get used to typing it or learn to use <kbd>CTRL</kbd>+<kbd>R</kbd> to search through your command history.
2. Open an IPython interpreter and check whether you can import TensorFlow.
![ipython](./images/ipython.png)
