# 5. Computing at CESGA

The users of CESGA's infrastructure and services include researchers, developers, technicians, and innovators from public and private institutions:

- Galician Universities
- Regional Research Centres
- The National Scientific Research Council (CSIC)
- Other public or private organisations all over the world, including:
  - R&D departments of industries and companies,
  - technological and research centres oriented to industry,
  - other Universities all over the world, and
  - non-profit R&D organisations.

CESGA offers several computing platforms with different architectures, so researchers can choose the one that best suits their computational needs.

For high-performance computing and supercomputing workloads, the **FinisTerrae-III supercomputer** (FT-III) offers high-performance nodes and a high-performance interconnection network for parallel jobs and for jobs that require GPUs. It is also suited to workloads that handle large volumes of data. The general characteristics of FT-III are shown below:

![General characteristics of FinisTerrae-III](images/PANEL-FINISTERRAE-ALMACENAMENTO-FINAL_2-600x1364.png)

For an individual user, the basic services available are the following:

- **Queues:**
- **$HOME:**
- **$STORE:**
- **$LUSTRE:**



## 5.1. Creating an account

[See the full process here](images/proceso_usuarios.pdf)

To request an account, fill in the form at https://www.cesga.es/en/community/service-request/ and attach a **certified document from your institution supporting your right of access.**


## 5.2. Connecting to CESGA

Once your account is created, follow the instructions sent by email, such as resetting your default password.

The first step is to connect to the remote machine by typing the following in a terminal (this also works on Windows):

```
ssh -XY YourUser@ft3.cesga.es
```

Replace `YourUser` with your own username. In my case the user is **uviirlcc**, in case it shows up in later commands.

After you run this command, you will be asked for your password. Once logged in, you should see a screen like the following:

![alternative text](images/cesga_login.png)
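To avoid retyping the full host name and the X11 flags on every connection, an alias can be added to the OpenSSH client configuration. The sketch below only prints such a block; the alias `ft3` and the helper name `print_ft3_ssh_config` are hypothetical, and `ForwardX11`/`ForwardX11Trusted` are the configuration equivalents of `-X` and `-Y`.

```shell
#!/bin/sh
# Print an OpenSSH client configuration block for FT-III.
# "ft3" is a hypothetical alias; the user name is the first argument.
print_ft3_ssh_config() {
    cat <<EOF
Host ft3
    HostName ft3.cesga.es
    User $1
    ForwardX11 yes
    ForwardX11Trusted yes
EOF
}

# Show the block for the placeholder user "YourUser".
print_ft3_ssh_config "YourUser"
```

If the printed block looks right, it could be appended with `print_ft3_ssh_config YourUser >> ~/.ssh/config`, after which `ssh ft3` should behave like the full command above.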


To copy files between your local machine and FT-III, use ```scp```:

- From local to remote: ```scp /path/from/local/to/specific/file YourUser@ft3.cesga.es:/path/in/FTIII```

- From remote to local: ```scp YourUser@ft3.cesga.es:/path/from/FTIII/to/specific/file path/to/local```

**Important:** by default ```scp``` copies regular files only; to copy **folders** (recursively), add the ```-r``` flag.
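As a minimal sketch of the folder case, the hypothetical helper `ft3_upload` below adds `-r` automatically when the source is a directory. The `DRY_RUN` variable is also an assumption of this example: when set to `1`, the command is only printed, which is handy for checking it before any data is transferred.

```shell
#!/bin/sh
# Copy a local file or folder to FT-III, adding -r for folders.
# Usage: ft3_upload <local path> <remote path> [user]
# With DRY_RUN=1 the scp command is printed instead of executed.
ft3_upload() (
    src="$1"; dest="$2"; user="${3:-YourUser}"
    flags=""
    if [ -d "$src" ]; then
        flags="-r"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo scp $flags "$src" "${user}@ft3.cesga.es:${dest}"
    else
        scp $flags "$src" "${user}@ft3.cesga.es:${dest}"
    fi
)

# Print the command that would copy the current folder.
DRY_RUN=1 ft3_upload . /path/in/FTIII
```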

## 5.3. Basic commands


```sbatch``` Submits a script to a SLURM partition. The only mandatory parameters are the estimated run time and the estimated memory per node/CPU.

For example, to submit a script called ```script.sh``` with a time limit of 24 hours and 4 GB of memory per node: ```sbatch -t 24:00:00 --mem=4GB script.sh```

If the submission succeeds, the command returns the job number (`<jobid>`).
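Since most of the other commands take the job number as an argument, it is convenient to store it in a variable. The sketch below parses the default `sbatch` message (`Submitted batch job <jobid>`); the helper name `extract_jobid` and the sample ID are made up for illustration.

```shell
#!/bin/sh
# Extract the numeric job ID from sbatch's default message,
# e.g. "Submitted batch job 123456".
extract_jobid() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Demonstration with a sample message (123456 is a made-up ID).
extract_jobid "Submitted batch job 123456"
```

On the cluster this could be used as `jobid=$(extract_jobid "$(sbatch -t 24:00:00 --mem=4GB script.sh)")`. Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.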

```srun``` Commonly used to run a parallel task within a script controlled by SLURM.

```sinfo``` Displays information about SLURM nodes and partitions, including:

- Existing partitions (PARTITION)

- Whether or not they are available (AVAIL)

- The maximum run time of each partition (TIMELIMIT); if it is infinite, the limit is regulated externally

- The nodes belonging to each partition (NODES)

- Node state (STATE); the most common are:
```
            idle:  available
            alloc: in use
            mix:   some of the node's CPUs are available
            resv:  reserved for a specific use
            drain: temporarily removed for technical reasons
```

- Information about a specific partition: ```sinfo -p <partitionname>```

- Information every 60 seconds: ```sinfo -i60```

- List reasons nodes are in the down, drained, fail or failing state: ```sinfo -R```
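For a quick overview of how busy the cluster is, the output of `sinfo` can be summarised per node state. The sketch below assumes input lines of the form `<partition> <nodes> <state>`, which is what `sinfo -h -o "%P %D %t"` should produce; the helper name `count_nodes_by_state` and the sample numbers are made up. Nodes that belong to several partitions are counted once per partition.

```shell
#!/bin/sh
# Sum node counts per state from lines "<partition> <nodes> <state>".
count_nodes_by_state() {
    awk '{ total[$3] += $2 } END { for (s in total) print s, total[s] }' | sort
}

# Sample input (made-up numbers) standing in for real sinfo output.
printf '%s\n' "short 10 idle" "short 4 mix" "medium 6 idle" "long 2 drain" \
    | count_nodes_by_state
```

On the cluster, the equivalent call would be `sinfo -h -o "%P %D %t" | count_nodes_by_state`.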

```squeue``` Displays information about jobs (only your own) and their status in the SLURM scheduling queue:

- State of a job with the jobid: ```squeue -j <jobid>```

- Report the expected start time and resources to be allocated for pending jobs in order of increasing start time: ```squeue --start```

- List all the running jobs: ```squeue -t RUNNING```

- List all the pending jobs: ```squeue -t PENDING```

- List the jobs demanding a specific partition: ```squeue -p <partition name>```

You can also see the full list of job states here: https://cesga-docs.gitlab.io/ft3-user-guide/batch_jobs_states.html
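A common use of `squeue` is to wait until a job has left the queue before post-processing its results. The sketch below is one possible polling loop, not an official CESGA tool: the helper name `wait_for_job` is hypothetical, and it assumes a job is finished once `squeue -h -j <jobid>` prints nothing (`-h` suppresses the header line).

```shell
#!/bin/sh
# Block until job $1 no longer appears in squeue,
# polling every $2 seconds (default: 60).
wait_for_job() (
    jobid="$1"; interval="${2:-60}"
    # Any error message from squeue (e.g. unknown job ID) is discarded,
    # so an empty result is treated as "job no longer queued".
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid is no longer in the queue"
)
```

For example, `wait_for_job <jobid> 120` would check every two minutes. Polling much more often than that puts unnecessary load on the scheduler.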

```scancel``` Cancels jobs, job arrays or job steps:

 - Cancel a job: ```scancel <jobid>```

 - Cancel all pending jobs: ```scancel -t PENDING```

 - Cancel one or more jobs with name “jobname”: ```scancel --name <jobname>```

 - Cancel all your jobs: ```scancel -u <YourUser>```

```scontrol``` Returns detailed information about the nodes, partitions, job steps, and configuration. It is used for monitoring and modifying queued jobs.

 - Show detailed information about a job: ```scontrol show jobid -dd <jobid>```

 - Write the batch script for a given job_id to a file or to stdout: ```scontrol write batch_script <jobid> -```

 - Prevent a pending job from being started (without cancelling it): ```scontrol hold <jobid>```

 - Release a previously held job to begin execution: ```scontrol release <jobid>```

 - Requeue a running, suspended or finished Slurm batch job into pending state (equivalent to scancel + sbatch): ```scontrol requeue <jobid>```
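The output of `scontrol show job` is a long list of `Key=Value` pairs, so in scripts it is often enough to pull out a single field. The sketch below extracts the `JobState` value; the helper name `job_state` and the sample lines are made up, and only mimic the general shape of the real output.

```shell
#!/bin/sh
# Print the value of the first JobState=... field read from stdin.
job_state() {
    sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Sample lines (made-up values) standing in for real scontrol output.
printf '%s\n' "JobId=123456 JobName=test_lunes" \
    "   JobState=RUNNING Reason=None Dependency=(null)" | job_state
```

On the cluster, the equivalent call would be `scontrol show job <jobid> | job_state`.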

```sqstat``` Detailed information about the queue system, resources consumption, status of all partitions and jobs

## 5.4. GPU Nodes

See the GPU nodes section of the FT-III user guide: https://cesga-docs.gitlab.io/ft3-user-guide/gpu_nodes.html

## 5.5. Sending a job to a queue

To send a job to a queue you need a shell script which, as stated before, must be submitted with the ```sbatch``` command. A simple (but effective) example of how to send a job to the **GPU nodes** is the following:

```
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run CUDA applications
# on CESGA's FT-III system.
#----------------------------------------------------
#SBATCH -J test_lunes       # Job name
#SBATCH -o test_lunes.o%j   # Name of stdout output file (%j expands to jobId)
#SBATCH -e test_lunes.o%j   # Name of stderr output file (same file as stdout here)
#SBATCH -c 32               # Cores per task requested (1 task job). 32 cores are needed per A100 requested!
#SBATCH --mem-per-cpu=3G    # Memory per core requested
#SBATCH --gres=gpu          # Request 1 GPU
#SBATCH -t 01:30:00         # Run time

# Run the CUDA application
python my_script_that_uses_GPU.py
```

Remember that the time limit (```-t```) has the following format: ```days-hours:minutes:seconds``` (the ```days-``` part is optional, as in the example above).
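To compare a requested time against a queue limit, it can help to convert this format into seconds. The sketch below handles only the two forms used in this section (`days-hours:minutes:seconds` and `hours:minutes:seconds`); the helper name `slurm_time_to_seconds` is hypothetical.

```shell
#!/bin/sh
# Convert "days-hours:minutes:seconds" or "hours:minutes:seconds" to seconds.
# Other time formats are not handled and print nothing.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

slurm_time_to_seconds "01:30:00"     # prints 5400
slurm_time_to_seconds "3-00:00:00"   # prints 259200
```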

Here is a helpful table with different threshold values for each queue:

```
Name           Priority       GrpTRES       MaxTRES     MaxWall MaxJobsPU     MaxTRESPU MaxSubmit
------------ ---------- ------------- ------------- ----------- --------- ------------- ---------
short                50                    cpu=2048                    50      cpu=2048       100
medium               40                    cpu=2048                    30      cpu=2048        50
long                 30      cpu=8576      cpu=2048                     5      cpu=2048        10
requeue              20                    cpu=2048                     5      cpu=2048        10
ondemand             10      cpu=4288      cpu=1024                     2      cpu=1024        10
...
clk_short            50                      node=1    06:00:00       200       cpu=960       400
clk_medium           40                      node=1  3-00:00:00       200       cpu=960       250
clk_long             30      cpu=1440        node=1  7-00:00:00        60       cpu=360        60
clk_ondemand         10       cpu=720        node=1 42-00:00:00        20       cpu=240        20
```

## 5.6. Dealing with Docker containers on FT-III

```
#----------------------------------------------------------------------------
# INSTALLING AND RUNNING CONTAINERS THAT REQUIRE A GPU ON FT-III
# The steps are almost the same as usual, but a few things must be done
# differently because of the GPUs.

cd /path/where/udocker/will/be/installed
wget https://github.com/indigo-dc/udocker/releases/download/v1.3.1/udocker-1.3.1.tar.gz
tar zxvf udocker-1.3.1.tar.gz
export PATH=`pwd`/udocker:$PATH
udocker install

# Make the PATH change permanent: open ~/.bashrc and add the line
#   export PATH=`pwd`/udocker:$PATH
# replacing `pwd` with the absolute path of the directory used above,
# then save and quit with [Esc] :wq
vi ~/.bashrc

# Now pull the image we want to use from Docker Hub and create a container from it:
udocker pull tensorflow/tensorflow:1.2.0-gpu
udocker create --name=contenedor_pointnet [ID of the image to link]

# IMPORTANT! Now ask CESGA for a GPU session, otherwise it will crash later on:
compute --gpu

# Wait a while, then run:
udocker setup --nvidia [container ID]

# The container is now ready to use ($LUSTRE is the volume mounted in my case)
udocker run --volume=$(pwd) --volume=$LUSTRE --workdir=$(pwd) [container ID] /bin/bash

...
#----------------------------------------------------------------------------
```


