## LUMI
LUMI (or, ‘Large Unified Modern Infrastructure’) is a large, high performance computing (‘HPC’) facility situated in Kajaani, Finland. It opened in 2021, and is one of the most powerful computing facilities in the world (in 2022 it was ranked as the 3rd fastest supercomputer in the world). 
Denmark is one of the members of the LUMI consortium, meaning that DeIC (Danish e-Infrastructure Consortium), the organization that handles national computing infrastructure in Denmark, has access to a certain amount of computing resources on LUMI. They have been so kind as to share some of those resources with the students of this course.

Hence this guide 😉

If you are curious about LUMI in general, I recommend browsing this website: https://www.lumi-supercomputer.eu/

## First things first:

The very first thing to do, if you're interested in using LUMI, is to put your email on the list of interested students. Then I send that list to DeIC, who create toy projects for everyone interested.

## Getting ready 
After the toy projects have been created, it's time to sign up. You should have received an email from 'lumi-noreply@lumi.deic.dk' titled 'Invitation to XXXXXXX project'. Inside there is a link for signing up and accepting your invitation. Please click that and follow the instructions. 

__Please be aware that there is a time limit for how long the invitation is active__ - usually it's several days, but don't wait a week or more. The link will become inactive, and it's a hassle to have to go through the process again.

After you have done this, you will receive a username to LUMI (this may take up to an hour). That will be in an email from 'lumi-noreply@csc.fi' with the words 'Your CSC username is '. Whatever follows immediately after that is your username. Please keep that handy.

As hinted at above, using LUMI is not free, and to do anything, we need to tell LUMI which project will be paying for our activities. This means going into another email that you will receive from 'info-noreply@csc.fi', which has a project ID of the form 'project_465XXXXXX'. We need this project ID whenever we want to do anything substantial with LUMI. For now, simply find the email, and keep it handy for later.

(when I signed up, the project ID arrived before the username, I don't know if that is how it always goes)

## Accessing LUMI

After receiving the above emails, we have to actually access the machine. There are, fundamentally, two approaches:

### Use the web interface

1. Go to https://www.lumi.csc.fi/.
2. Click 'Go to login'.
3. Click on 'MyAccessID' and go through the login steps.
4. Open VS Code (this gives you both an editor and a terminal).


### Use a public/private keypair and an SSH client

This is the 'advanced' method which will probably be preferable if you are to work with LUMI for a longer time. First set up an SSH key pair (explained here: https://docs.lumi-supercomputer.eu/firststeps/SSH-keys/). Then, open your favourite SSH client and connect to the lumi.csc.fi server. If you have no favourite, I suggest either installing the SSH plugin for Visual Studio Code, or using 'mobaxterm'. If you get in, you will be greeted by this cool picture of a wolf:





In [None]:


 *  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  *   *      *
                                                       *      *  *
   * ████       ████   ████   █████▄    ▄█████   ████     *     *
 *   ████       ████   ████   ████ █▄  ▄█ ████   ████         ,    *,
     ████       ████   ████   ████  ████  ████   ████  *   *  |\_ _/|
     ████       ████   ████   ████   ▀▀   ████   ████   *    .| ." ,|
  *  ████       ████   ████   ████        ████   ████        /(  \_\)
     ████       ████   ████   ████        ████   ████       /    ,-,|
 *   ████▄▄▄▄▄  ▀███   ███▀   ████        ████   ████ *    * /      \
     █████████    ▀▀███▀▀     ████        ████   ████  * ,/  (      *
 *                                                     ,/       |  /
  * ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒/    \  * || |
                 *              *               ,_   (       )| || |
*   *    *    The Supercomputer of the North  * | `\_|   __ /_| || |
        **               *            * *       \_____\______)\__)__)

## Getting started
On LUMI, we work with 'containers'. If you're not familiar with containers, they are a handy way to run code without installing a lot of other libraries and dependencies (which is something the people behind LUMI do not want us to do). 

If you are curious, the specific type of container which we will be using is the 'Singularity' container: https://github.com/sylabs/singularity (if you're not curious, just skip that link)

For convenience, I have already created a container which we will be using in this guide. It contains a Python environment with PyTorch, NumPy, and scikit-learn, and you are free to use it for the project too. (If you have more elaborate needs for your Python environment, look at the bottom of this guide for instructions on how to make your own container.) You can download my container directly to your LUMI folder using:

In [None]:
wget https://anon.erda.au.dk/share_redirect/FYaG5IWTPR -O cotainrImage.sif

(takes about 50 seconds)

In case you were wondering: we build singularity containers using a tool called 'cotainr'. This is the reason for the name of the .sif-file above. 

# Hello world
It's now, finally, time to run a python script on LUMI. We start with the classic 'hello world'. Put the following in a .py file called hello.py:

In [None]:
import os

#describe the environment that has been loaded:
print("Hello World!")

print('The name of my current conda environment is:')
print(os.environ['CONDA_DEFAULT_ENV'])

And run this command in a terminal on LUMI:

In [None]:
singularity exec cotainrImage.sif python hello.py

It should give you:

In [None]:
Hello World!
The name of my current conda environment is:
conda_container_env

So far so good. There are two important takeaways from this:

1: We can run regular python scripts on LUMI without bothering with actually installing python or anaconda or anything else

2: This is __the wrong way to do this__.

The reason for #2 is that right now, we are actually running code on the 'front-end', which is a big no-no on a cluster such as LUMI. To explain what is going on, I need to show you this overall diagram of how clusters are *supposed* to be used:

![cluster diagram](clusterDiagram.png)


Here, we log onto the front-end computer (the blue box at the top) and send a jobscript, which is a detailed description of the calculation we would like to perform, to a job manager (on LUMI this is 'Slurm'). The job manager then makes sure that the calculation is executed on one or more of the cluster's many *nodes*. LUMI has 2928 GPU-enabled nodes; each node is a powerful computer in its own right, with 512 GB of RAM, a 64-core processor, and 8 GPUs that each have 64 GB of dedicated memory. In your projects in this course, you will be using at most one full node. As you can see, the storage is separate from both the front-end and the nodes, so our jobscript usually also needs to specify where the data for the computation is located (we'll do that later in this guide).

What we did above with our 'hello world' script was to run our job directly on the front end, without bothering the jobmanager. This is fine if you're just downloading data or checking that your code works (like calling hello.py), but you shouldn't do it for anything heavy. 

Let's run a more elaborate, but also more realistic, example:

## Hello mnist

First, download mnist data and move it to your scratch folder:

In [None]:
wget https://anon.erda.au.dk/share_redirect/AIYv1rmrtI -O mnist.h5
mv mnist.h5 /scratch/project_465XXXXXX/

Notice that I have saved the mnist data (in the form of numpy arrays) to an hdf5-file, which we then store in the scratch folder.

This serves two purposes:
 - minimizes the bookkeeping necessary for data handling on the cluster (that is nice for us as users)
 - minimizes the number of individual files stored on LUMI. This is an advantage because many small files (such as a large image dataset, for instance) risk burdening the LUMI file system with overhead (the file system used on LUMI prefers a few large files over many small ones).

Please use a similar convention in your own projects (and remember to place the data on scratch, you have more space there).
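
To make the convention concrete, here is a minimal sketch of how a set of numpy arrays can be packed into a single hdf5-file and read back again. The helper names ('save_arrays', 'load_arrays') and the random stand-in data are my own, purely for illustration; only the dataset names mirror the ones used in mnist.h5 in this guide.

```python
import os
import tempfile

import h5py
import numpy as np


def save_arrays(path, arrays):
    """Write a dict of name -> numpy array into one hdf5-file."""
    with h5py.File(path, "w") as f:
        for name, array in arrays.items():
            f.create_dataset(name, data=array)


def load_arrays(path):
    """Read all top-level datasets back into memory as numpy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][()] for name in f}


# Random stand-in data with MNIST-like shapes (not the real data set).
rng = np.random.default_rng(0)
arrays = {
    "Xtrain": rng.random((100, 28, 28), dtype=np.float32),
    "ytrain": rng.integers(0, 10, size=100),
}

# On LUMI the target would be a path in your scratch folder instead.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
save_arrays(path, arrays)
restored = load_arrays(path)
print({name: array.shape for name, array in restored.items()})
```

However many arrays you have, this leaves exactly one file on the file system, which is the point of the convention.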

Next put this code in a python script called hello_mnist.py:

(you're welcome to read it and make sure you understand it)

In [None]:

#%% import packages:
import numpy as np
import torch
import torch.nn as nn
import h5py

#%% check if cuda is available:
if torch.cuda.is_available():
    print('cuda is available')
    device = torch.device("cuda:0")
else:
    print('cuda is not available')
    device = torch.device("cpu")

#%% set up data sets:
data=h5py.File('/data/mnist.h5')
Xtrain=np.array(data['Xtrain'])
Xtest=np.array(data['Xtest'])
ytrain=np.array(data['ytrain'])
ytest=np.array(data['ytest'])

#convert numpy arrays to torch tensors:
Xtrain=torch.from_numpy(Xtrain).float()
Xtest=torch.from_numpy(Xtest).float()
ytrain=torch.from_numpy(ytrain)
ytest=torch.from_numpy(ytest)

#set up data sets:
train_set = torch.utils.data.TensorDataset(Xtrain, ytrain)
test_set = torch.utils.data.TensorDataset(Xtest, ytest)

#set up data loaders:
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)

#%% set up sequential model:

net=nn.Sequential(
        nn.Conv2d(1, 10, kernel_size=5),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Conv2d(10, 20, kernel_size=5),
        nn.Dropout2d(),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(320, 50),
        nn.ReLU(),
        nn.Linear(50, 10)
                )

#%% set up optimizer:

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)


#%% set up loss function:

loss_fn = nn.CrossEntropyLoss()


#%% train the model:

net.to(device)
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        X, labels = data


        X=X.to(device)
        labels=labels.long().to(device)

        optimizer.zero_grad()

        outputs = net(X.view(-1, 1, 28, 28))
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

    if epoch % 10 == 0:
        print(epoch+1, loss.item())

print('Finished Training')

#%% test the model:
net.to('cpu')
correct = 0
total = 0

with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        outputs = net(inputs.view(-1, 1, 28, 28))
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted==labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100*correct/total))

Next, put the following in another file, jobscript1.sh:

In [None]:
#!/bin/bash
#SBATCH --job-name=helloMnist
#SBATCH --account=project_465XXXXXX
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=5G
#SBATCH --partition=small-g
#SBATCH --gpus-per-task=1

SCRATCH_FOLDER=/scratch/project_465XXXXXX/

srun singularity exec -B $SCRATCH_FOLDER:/data cotainrImage.sif python hello_mnist.py

In the above, we give our job a name ('helloMnist'), define a project to bill (notice that I put in a placeholder, you need to put your own project number), tell slurm how long the job is allowed to run at most (0 hrs, 10 minutes and 0 seconds), tell it to only use 1 node, that our job consists of 1 task, that it should dedicate 8 cpus to this task, 5 GB RAM, 1 GPU and to put it in the 'small-g' queue. 

For ease of use, I create a variable called 'SCRATCH_FOLDER' indicating where our data is going to be found.

In the bottom line, we tell slurm to run the singularity command from before, using the command 'srun'. Note that we tell singularity to 'mount' (with the '-B' flag) the scratch folder to '/data' inside the container. This means that anything in the scratch folder will appear (from inside the container) as if it were inside a folder called '/data' at the root of the file system, which is why hello_mnist.py reads '/data/mnist.h5'.
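
Because the script only sees the scratch folder through this mount, a typo in the '-B' argument tends to show up as a confusing error deep inside h5py. A small fail-fast check at the top of the script can make that easier to diagnose. This is only a sketch; the helper name 'require_file' is hypothetical and not part of the guide's scripts.

```python
import os
import sys


def require_file(path):
    """Return path if it is an existing file, otherwise raise a readable error."""
    if not os.path.isfile(path):
        parent = os.path.dirname(path) or "."
        if os.path.isdir(parent):
            hint = f"contents of {parent}: {os.listdir(parent)}"
        else:
            hint = f"{parent} does not exist"
        raise FileNotFoundError(
            f"{path} not found ({hint}). "
            "Was the scratch folder mounted with '-B <scratch folder>:/data'?"
        )
    return path


if __name__ == "__main__":
    # '/data/mnist.h5' is the path hello_mnist.py uses inside the container.
    try:
        print("Found", require_file("/data/mnist.h5"))
    except FileNotFoundError as err:
        print(err, file=sys.stderr)
```

Run outside the container (or with a wrong mount), this prints what is actually visible at '/data' instead of a stack trace.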

We can use this file to make slurm run our python job on a node:

In [None]:
sbatch jobscript1.sh

This creates an output file which (because we haven't changed the name) will get a pretty unimaginative name of the form 'slurm-1234567.out'. If we print the contents, we see that it's just the standard output from our script:


In [None]:
MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx90a6e.kdb Performance may degrade. Please follow instructions to install: https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package
cuda is available
1 1.3244482278823853
11 0.7749866843223572
21 0.4031255841255188
31 0.6772407293319702
41 0.5549220442771912
51 0.7431323528289795
61 0.7124130129814148
71 0.6285948157310486
81 0.5698408484458923
91 0.9437584280967712
Finished Training
Accuracy of the network on the 10000 test images: 74 %

(ignore the warning in the beginning)

We are really getting somewhere now!

In your own jobscripts, you will probably want to increase the maximum duration (the 'walltime') from 10 minutes to something sensible, like a couple of hours. However, be careful when changing 'mem' and 'gpus'. As you can read at https://docs.lumi-supercomputer.eu/runjobs/lumi_env/billing/, your jobs are billed based on the resources you reserve, not the resources you use. If, for instance, you reserve 512 GB of RAM, which is an entire node, you will also be billed for all 8 GPUs on that node, even if you only used 7 GB of RAM and a single GPU. This makes it possible to blow through the whole compute allocation for your project really quickly if you accidentally reserve many more resources than you need.
The same goes for increasing the number of nodes, or starting multiple jobs.

Note:
If you decide to use LUMI for your project, your group will be awarded 200 GPU hours and 20 TB hours (meaning that you can have 20 TB stored for one hour, or about 40 GB for three weeks). It's up to you to make those resources last the full duration of your project. 
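
The storage arithmetic is easy to redo for your own project length. The sketch below assumes 1 TB = 1000 GB and that you keep the same amount stored the whole time; the function name is my own.

```python
def max_storage_gb(tb_hours, days):
    """Constant amount of storage (in GB) that a TB-hour budget covers for `days` days."""
    hours = days * 24
    return tb_hours * 1000 / hours  # assumption: 1 TB = 1000 GB


# 20 TB hours spread over three weeks: roughly 40 GB
print(round(max_storage_gb(20, 21), 1))
```

If your project runs for six weeks instead of three, the same budget only covers about half as much.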

If you, during the project, are curious how much of your resources you have spent already, the command 'lumi-allocations' will give you a handy overview.
Another thing to note is that the 8 GPUs on each node are actually counted as 8 half-GPUs. This means that allocating a single GPU for 1 hour actually only costs half an hour... which is a little weird. More details about billing can be found here: https://docs.lumi-supercomputer.eu/runjobs/lumi_env/billing/
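
To get a feeling for how reservations turn into GPU hours, here is a small sketch of the billing rule for the 'small-g' partition as I read the billing page linked above: you are billed for the largest of the GPUs, CPU cores (in chunks of 8) and memory (in chunks of 64 GB) that you reserve, and each of the 8 GPUs counts as half a GPU. Treat the formula as an assumption and check the billing page before relying on it; the function name is my own.

```python
import math


def billed_gpu_hours(gcds, cpu_cores, mem_gb, hours):
    """GPU hours billed for a job on 'small-g' (assumed rule, see text).

    gcds is the number of GPUs requested in the jobscript (each one is a
    "half-GPU"). Every GPU is assumed to come with 8 CPU cores and 64 GB
    of RAM, and the largest of the three shares decides the bill.
    """
    slots = max(gcds, math.ceil(cpu_cores / 8), math.ceil(mem_gb / 64))
    return slots * hours * 0.5  # each of the 8 GPUs counts as half a GPU


# jobscript1.sh: 1 GPU, 8 cpus, 5 GB of RAM, 10 minutes
print(billed_gpu_hours(1, 8, 5, 10 / 60))
# the same job, but reserving 512 GB of RAM for one hour
print(billed_gpu_hours(1, 8, 512, 1))
```

The first job costs about 0.083 GPU hours. The second costs 4.0 GPU hours per hour of walltime, because the memory reservation alone claims the whole node.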

## student_sbatch
Sometimes, particularly when learning to use slurm, you may accidentally create a job which allocates much more compute than you intended to (for instance, accidentally allocating all the memory on a node, even if you're just using a single GPU). This is annoying because you pay for what you reserve, not what you use. To help catch such errors, we have created *student_sbatch*, which is a wrapper for sbatch. It works almost exactly the same as sbatch:

In [None]:
./student_sbatch.sh jobscript1.sh
GPU hours billed by this job: .08333333333333333333
Total GPU hours allocated: 500

Here, 'student_sbatch' reads jobscript1, calculates what the total number of GPU hours billed would be if the job ran to the very end, and compares that with the allocation of the whole project. If the job requires more than 10% of the full amount (in this case that would be 50 GPU hours), the job is not allowed to run. Otherwise, the jobscript is passed to sbatch which runs it like normal. 
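
If you are curious what such a check could look like, here is a rough Python sketch of the same idea. It is not the actual student_sbatch script (which I have not reproduced here); the function names are made up, the billing rule is my reading of the LUMI billing page, and it only understands the handful of '#SBATCH' options used in this guide (no job arrays, no multi-node jobs, only 'HH:MM:SS' times and memory given in 'G').

```python
import math
import re


def parse_sbatch(script_text):
    """Collect '#SBATCH --key=value' directives into a dict."""
    return dict(re.findall(r"^#SBATCH\s+--([\w-]+)=(\S+)", script_text, flags=re.M))


def walltime_hours(text):
    """Convert 'HH:MM:SS' to hours (other Slurm time formats are not handled)."""
    h, m, s = (int(part) for part in text.split(":"))
    return h + m / 60 + s / 3600


def billed_gpu_hours(opts):
    """Assumed small-g rule: the largest share of GPUs, cores/8 or mem/64GB, times 0.5."""
    ntasks = int(opts.get("ntasks", 1))
    cores = ntasks * int(opts.get("cpus-per-task", 1))
    if "gpus" in opts:
        gcds = int(opts["gpus"])
    else:
        gcds = ntasks * int(opts.get("gpus-per-task", 0))
    mem_gb = float(opts.get("mem", "0G").rstrip("G"))
    slots = max(gcds, math.ceil(cores / 8), math.ceil(mem_gb / 64))
    return slots * walltime_hours(opts["time"]) * 0.5


def may_submit(script_text, total_gpu_hours, max_fraction=0.10):
    """Allow the job only if it can bill at most max_fraction of the allocation."""
    return billed_gpu_hours(parse_sbatch(script_text)) <= max_fraction * total_gpu_hours


JOBSCRIPT = """#!/bin/bash
#SBATCH --job-name=helloMnist
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=5G
#SBATCH --gpus-per-task=1
"""

hours = billed_gpu_hours(parse_sbatch(JOBSCRIPT))
print(f"GPU hours billed by this job: {hours:.4f}")
print("Allowed to run:", may_submit(JOBSCRIPT, total_gpu_hours=500))
```

For the options of jobscript1.sh this gives about 0.083 GPU hours, in line with the output shown above.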

You obtain student_sbatch by downloading it and making it executable:

In [None]:
wget https://anon.erda.au.dk/share_redirect/CyBLhyxsNa -O student_sbatch.sh
chmod +x student_sbatch.sh

### Slurmlearning
So far, this has been a pretty patchy crash course in the slurm jobmanager. Fortunately, DeIC has made a really excellent slurm tutorial which I recommend anyone who wants to use LUMI go through: http://slurmlearning.deic.dk/

### You have now finished the compulsory part of the tutorial :)

# Other useful details

## Job arrays
A nice thing about working on a cluster is that you can do a lot of things simultaneously. For instance, run a bunch of similar calculations with slightly different hyper parameter values (like when we're doing hyper parameter optimization). 

One way to organize that is by using 'job arrays', which essentially just tells slurm to start a lot of small jobs independently of each other. We do this by adding one more line to the jobscript:

In [None]:
#!/bin/bash
#SBATCH --job-name=helloMnist
#SBATCH --account=project_465XXXXXX
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=5G
#SBATCH --partition=small-g
#SBATCH --gpus=1
#SBATCH --array=1-5


srun singularity exec -B /scratch/project_465XXXXXX/:/data cotainrImage.sif \
    python array_mnist.py

saying that this will be an array of 5 (indexed by 1-5) independent jobs. This then creates an environment variable called 'SLURM_ARRAY_TASK_ID' which distinguishes between the 5 jobs. There are many ways of using this, but I think the simplest way is to just access it directly from within our python script:

In [None]:
os.getenv('SLURM_ARRAY_TASK_ID')

(note that this is a string)

In the example below, I have changed the 'hello_mnist' script to a crude hyper parameter line scan, trying different sizes of the last hidden layer of the network:

In [None]:

#%% import packages:
import numpy as np
import torch
import torch.nn as nn
import os
import h5py

#%% check if cuda is available:
if torch.cuda.is_available():
    print('cuda is available')
    device = torch.device("cuda:0")
else:
    print('cuda is not available')
    device = torch.device("cpu")


#%% set up data sets:

#load the mnist arrays from the hdf5-file in the mounted scratch folder:
data=h5py.File('/data/mnist.h5')
Xtrain=np.array(data['Xtrain'])
Xtest=np.array(data['Xtest'])
ytrain=np.array(data['ytrain'])
ytest=np.array(data['ytest'])

#convert numpy arrays to torch tensors:
Xtrain=torch.from_numpy(Xtrain).float()
Xtest=torch.from_numpy(Xtest).float()
ytrain=torch.from_numpy(ytrain)
ytest=torch.from_numpy(ytest)

#set up data sets:
train_set = torch.utils.data.TensorDataset(Xtrain, ytrain)
test_set = torch.utils.data.TensorDataset(Xtest, ytest)

#set up data loaders:
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)

#%% set up sequential model:

sizeArray=[10,20,30,40,50]
lastSize=sizeArray[int(os.getenv('SLURM_ARRAY_TASK_ID'))-1]

print('lastSize: ',lastSize)

net=nn.Sequential(
        nn.Conv2d(1, 10, kernel_size=5),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Conv2d(10, 20, kernel_size=5),
        nn.Dropout2d(),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(320, lastSize),
        nn.ReLU(),
        nn.Linear(lastSize, 10)
                )

#%% set up optimizer:

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)


#%% set up loss function:

loss_fn = nn.CrossEntropyLoss()


#%% train the model:

net.to(device)
for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        X, labels = data


        X=X.to(device)
        labels=labels.long().to(device)

        optimizer.zero_grad()

        outputs = net(X.view(-1, 1, 28, 28))
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()


print('Finished Training')

#%% test the model:
net.to('cpu')

correct = 0
total = 0

with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        outputs = net(inputs.view(-1, 1, 28, 28))
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted==labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100*correct/total))

This is a pretty crude way of doing things, but I hope you get the idea. Some slightly more elaborate examples are given here:

https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/throughput/
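
One small step up from the line scan above is to let each array task pick one combination from a grid of several hyper parameters. The sketch below shows only the mapping from task ID to configuration; the grid values and the function name are made up for illustration, and the training code itself is left out.

```python
import itertools
import os

# Hypothetical grid: 3 layer sizes x 2 learning rates = 6 combinations,
# which would need '#SBATCH --array=1-6' in the jobscript.
GRID = list(itertools.product([10, 30, 50], [0.01, 0.001]))


def config_for_task(task_id, grid=GRID):
    """Map a 1-based array task ID to one (lastSize, lr) combination."""
    if not 1 <= task_id <= len(grid):
        raise IndexError(f"task ID {task_id} is outside 1..{len(grid)}")
    return grid[task_id - 1]


# Outside of a slurm array job the variable is unset, so fall back to task 1.
task_id = int(os.getenv("SLURM_ARRAY_TASK_ID", "1"))
last_size, lr = config_for_task(task_id)
print(f"task {task_id}: lastSize={last_size}, lr={lr}")
```

The explicit range check is there because a mismatch between '--array' and the size of the grid would otherwise only show up as an obscure IndexError in some of the jobs.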


### How to make a new container
If you need other packages installed instead of what is in the container I have supplied, this is the workflow:

First you have to describe the conda environment which you intend to build. This has to include both the 'normal' packages that you are interested in, and GPU libraries that work with the AMD GPUs on LUMI (so not CUDA). Below is the 'yaml'-file that I used to create my container:



In [None]:
name: py311_062024
channels:
- conda-forge
dependencies:
- python=3.11
- mne
- mne-bids
- neptune
- lightning
- matplotlib
- numpy
- scikit-learn
- ray-tune
- optuna
- tabulate
- pandas
- pip
- h5py
- wandb
- pyarrow
- pip:
    - --extra-index-url https://download.pytorch.org/whl/rocm6.0
    - torch==2.4+rocm6.0
    - torchaudio==2.4+rocm6.0
    - torchvision==0.19+rocm6.0

This was saved in a file called 'py311_092024.yml'. You should probably not touch anything below 'pip'. 

If you make your own yml file, you then need to load first the CrayEnv module and then the cotainr module on the LUMI front-end, after which you can build the container:

In [None]:
module load CrayEnv
module load cotainr
cotainr build py311_092024.sif --system=lumi-g --conda-env=py311_092024.yml --accept-licenses

This creates a container called py311_092024.sif. Please note that these containers take up a lot of space, and take a while to create (~30 minutes). Your user folder is limited to 20 GB of storage, meaning that two or three containers is the most you can have in your personal storage at a time. If you start creating containers without enough space, the building process will not succeed. So, it's a good idea to just have one container which works for everything. 

If you need a newer version of Python or PyTorch, you are welcome to try changing the version requirements in the yml file above. It is possible, though, that you may get some really long error messages from cotainr if you pick unlucky combinations.

### Using 'Ray' on LUMI
Possibly the most straight-forward use of a machine like LUMI is for hyper parameter optimization, which you will probably want to do towards the end of the project (most projects, really). A really good tool for that is the 'ray' library, which has a subcomponent called 'tune'. In the example below, we return to the optimization problem from before, but we use ray.tune to adjust the size of the last layer and the learning rate:


In [None]:
#%% import packages:
import os

import h5py
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

import ray
from ray import tune, air
from ray.air import session
from ray.tune.search.optuna import OptunaSearch

#%% check resources
#
# if cuda is available:
if torch.cuda.is_available():
    print('cuda is available')
    device = torch.device("cuda:0")
    numGPUs=torch.cuda.device_count()
else:
    print('cuda is not available')
    device = torch.device("cpu")
    numGPUs=0

#number of cpus:
print('SLURM_GPUS_PER_TASK: ',os.getenv('SLURM_GPUS_PER_TASK'))
numCPUs=int(os.getenv('SLURM_CPUS_PER_TASK'))
print('numCPUs: ',numCPUs)

print(os.listdir('/data'))


#%% set up data sets:
data=h5py.File('/data/mnist.h5')
Xtrain=np.array(data['Xtrain'])
Xtest=np.array(data['Xtest'])
ytrain=np.array(data['ytrain'])
ytest=np.array(data['ytest'])

#convert numpy arrays to torch tensors:
Xtrain=torch.from_numpy(Xtrain).float()
Xtest=torch.from_numpy(Xtest).float()
ytrain=torch.from_numpy(ytrain)
ytest=torch.from_numpy(ytest)

#set up data sets:
train_set = torch.utils.data.TensorDataset(Xtrain, ytrain)
test_set = torch.utils.data.TensorDataset(Xtest, ytest)



loss_fn = nn.CrossEntropyLoss()

def trainTestNet(config,train_set,test_set):
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
    else:
        device = torch.device("cpu")

    #set up data loaders:
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)

    net=nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Conv2d(10, 20, kernel_size=5),
            nn.Dropout2d(),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(320, config['lastSize']),
            nn.ReLU(),
            nn.Linear(config['lastSize'], 10)
                    )

    optimizer = optim.SGD(net.parameters(), lr=config['lr'], momentum=0.9)

    # train the model:
    net.to(device)
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            X, labels = data


            X=X.to(device)
            labels=labels.long().to(device)

            optimizer.zero_grad()

            outputs = net(X.view(-1, 1, 28, 28))
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

        #test the model on the same device, so that training can continue in the next epoch:
        correct = 0
        total = 0

        with torch.no_grad():
            for data in test_loader:
                inputs, labels = data
                inputs = inputs.to(device)
                labels = labels.to(device)
                outputs = net(inputs.view(-1, 1, 28, 28))
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted==labels).sum().item()

        # return 100*correct/total
        session.report({"mean_accuracy": 100*correct/total})  # Report to Tune



search_space = {"lr": tune.uniform(1e-4, 1e-2), "lastSize": tune.randint(10,100)}
algo = OptunaSearch()

trainable_with_resources = tune.with_resources(
    tune.with_parameters(trainTestNet,train_set=train_set,test_set=test_set),
     {"cpu": 1, "gpu":1})


tuner = tune.Tuner(
    trainable_with_resources,
    tune_config=tune.TuneConfig(
        metric="mean_accuracy",
        mode="max",
        search_alg=algo,
        num_samples=1,
    ),
    run_config=air.RunConfig(
        stop={"training_iteration": 1},
    ),
    param_space=search_space,
)

#if I don't limit num_cpus, ray tries to use the whole node and crashes:
ray.init( num_cpus=numCPUs,num_gpus=numGPUs, log_to_driver = False)

result_grid = tuner.fit()
print("Best config is:", result_grid.get_best_result().config,
 ' with accuracy: ', result_grid.get_best_result().metrics['mean_accuracy'])

When we run this, quite a lot of things get written to the slurm-*.out file, but the bottom part is: 


In [None]:
== Status ==
Current time: 2023-09-18 23:15:22 (running for 00:00:13.02)
Memory usage on this node: 35.3/503.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/7 CPUs, 0/1 GPUs, 0.0/328.01 GiB heap, 0.0/144.57 GiB objects
Current best trial: 110a0a66 with mean_accuracy=19.0 and parameters={'lr': 0.008263221162693079, 'lastSize': 73}
Result logdir: /users/mikkelse/ray_results/trainTestNet_2023-09-18_23-15-07
Number of trials: 1/1 (1 TERMINATED)
+-----------------------+------------+---------------------+------------+------------+-------+--------+------------------+
| Trial name            | status     | loc                 |   lastSize |         lr |   acc |   iter |   total time (s) |
|-----------------------+------------+---------------------+------------+------------+-------+--------+------------------|
| trainTestNet_110a0a66 | TERMINATED | 10.253.41.144:96970 |         73 | 0.00826322 |    19 |      1 |           10.755 |
+-----------------------+------------+---------------------+------------+------------+-------+--------+------------------+


Best config is: {'lr': 0.008263221162693079, 'lastSize': 73}  with accuracy:  19.0
End:
------------ Mon 18 Sep 2023 11:15:26 PM EEST ----------------

Things to note in the ray.tune-script:
- Ray does not understand the concept of SLURM, in particular the fact that the job may not have the entire node allocated. This means that we have to go in and restrict ray 'manually' (this is what the arguments to 'ray.init' near the bottom of the script are for).
- To be able to share the training data between instances, I wrap 'trainTestNet' in a 'with_parameters' call. This passes these parameters to all instances, but using a shared copy, so we don't have to make a separate version of the data set for each run. In more elaborate setups, for instance where you are reading from disk, this may backfire. If so, I suggest reading up on threadsafe data loaders, for instance here: https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html
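
One fragile spot in the script above is 'int(os.getenv('SLURM_CPUS_PER_TASK'))': as far as I know, slurm only sets that variable when '--cpus-per-task' is given in the jobscript, and 'int(None)' raises a TypeError. A slightly more defensive way of reading such variables could look like the sketch below; the helper name 'slurm_int' is my own.

```python
import os


def slurm_int(name, default):
    """Read an integer environment variable, falling back to `default`
    when it is unset or not a number."""
    value = os.getenv(name)
    try:
        return int(value) if value is not None else default
    except ValueError:
        return default


# Falls back to a single CPU when the variable is missing, for instance
# when testing the script outside of a slurm job.
num_cpus = slurm_int("SLURM_CPUS_PER_TASK", 1)
print("CPUs to pass on to ray.init(num_cpus=...):", num_cpus)
```

Falling back to 1 is a deliberately conservative choice: it is better for ray to use too few CPUs than to try to grab the whole node.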

### Good habits when working on a cluster
Finally, a few pieces of advice for being efficient with your own time on a cluster:
 - Use a version control system (git) for synchronizing your work computer (where you presumably do most of the coding) with your folder on the cluster (LUMI). This means you will only have to push & pull to make sure that all files are the correct version, and only the minimal set of changes needs to be transferred. This reduces errors (it's really annoying if you're copying files over manually, but forgot to move one file that was also changed), and makes it much easier for you to keep track of which version of your pile of files was actually used for creating something.
 - Use a logger. Jobs running on the cluster are not that easy to check on, and output will just be text written to some output file. If you use a logger (such as 'neptune' or 'wandb'), checking in on your code is as easy as opening a browser. An added bonus is that a good logger will also help you keep track of how well your code utilizes the compute resources, what errors and warnings you're getting, and which version of your repository (assuming you followed advice #1) was responsible for creating a certain output. 
 - If you want to inspect a running job, you can use 'srun' to log into the compute node: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/#using-srun-to-check-running-jobs. In this regard, the command 'rocm-smi' may be useful for checking what the GPUs are doing.

## Getting to use LUMI in your project

If you have made it this far in the guide, you should have a decent idea whether you want to use LUMI in your deep learning project. If you do, send me an email before the start of the project period, and I will ask DeIC to set up a project allocation for your group.