- Working in a node (link to the scheduler)
Our mini-cluster is located at the Higher Technical School of Computer Engineering facilities, next to our Department of Computer Science and Artificial Intelligence. It is currently composed of 4 servers, named after the highest peaks of Spain: Teide (3715m, Canary Islands), Mulhacen (3479m, Granada), Aneto (3404m, Huesca), Veleta (3396m, Granada). The specifications are detailed at http://www.gcn.us.es/gpu_computing_servers. This is a picture of the servers taken in May 2022:
The configuration is the following:
- The Veleta server can be openly accessed through a public IP of the University of Seville (we can give you the IP address if you collaborate with us, and you should have received it in the email with your credentials). It has three GPUs, two of which are no longer supported in CUDA 12, so this server will stay on CUDA 11.
- The Aneto server is hidden behind Veleta in a private network. Unfortunately, this server has hardware issues and is currently unavailable. Its hard drives and GPUs were moved to Veleta.
- The Mulhacen server is hidden behind Veleta in a private network. It has two GPUs.
- The Teide server is hidden behind Veleta in a private network. It has two GPUs.
If you want to access each node by its name from your computer, do the following.
You should have an OpenSSH client installed on your operating system:
- In Linux/Ubuntu: sudo apt install openssh-client (or search for openssh in your distribution's package manager).
- In Windows: install the OpenSSH Client feature.
You should have a $HOME/.ssh folder in your home directory (on your computer). If not, please create it:
mkdir $HOME/.ssh
Next, back up your $HOME/.ssh/config file if it already exists, by executing:
cp $HOME/.ssh/config $HOME/.ssh/config.backup
If this config file does not exist, just continue with the next steps.
Next, download the config file from these links (depending on your operating system): linux or windows. Copy the file to your $HOME/.ssh folder. If you had a config file previously, simply copy the contents into your existing file.
Please edit the $HOME/.ssh/config file with your favourite editor, replace GPU-RGNC with the public IP of our server, and replace USER with the username given to you in the mini-cluster (you should have received it by email). Moreover, delete the comment (the text after #) in those two lines (the third and fourth lines in the example below). In the end, you should have something like this (for example):
### First jumphost. Directly reachable
Host veleta
    HostName **someservername.cs.us.es**
    User **gheorghepaun**
    ProxyCommand none
    ForwardAgent yes
    GSSAPIAuthentication no
### Host to jump to via veleta
Host aneto
    HostName aneto
    ProxyCommand ssh veleta -W %h:%p
    ForwardAgent yes
    GSSAPIAuthentication no
### Host to jump to via veleta
Host mulhacen
    HostName mulhacen
    ProxyCommand ssh veleta -W %h:%p
    ForwardAgent yes
    GSSAPIAuthentication no
### Host to jump to via veleta
Host teide
    HostName teide
    ProxyCommand ssh veleta -W %h:%p
    ForwardAgent yes
    GSSAPIAuthentication no
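As a quick check of this configuration (just a sketch; replace USER with your mini-cluster username), you can already reach the hidden nodes through Veleta. Until you set up SSH keys as described below, you will be asked for your password twice:
ssh USER@mulhacen
exit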
If you don’t want to enter your password twice when accessing Teide, Mulhacen or Aneto through Veleta, execute the following lines on your machine, simply replacing USER with the username given to you in the mini-cluster (if you are a Windows user, please see the note below):
- ssh-keygen -t rsa -b 2048 (press Enter for all the prompted questions)
- ssh-copy-id USER@veleta (enter your password as many times as required)
- ssh-copy-id USER@aneto (enter your password as many times as required)
- ssh-copy-id USER@mulhacen (enter your password as many times as required)
- ssh-copy-id USER@teide (enter your password as many times as required)
Try, for example, ssh **USER**@mulhacen and check that you are asked for your password only once.
Note: For Windows users, ssh-copy-id does not exist, so you have to use an alternative such as this or this.
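A sketch of such a manual alternative from Windows PowerShell, assuming you generated your key at the default path with ssh-keygen as above (replace USER with your mini-cluster username, and repeat for aneto, mulhacen and teide):
type $env:USERPROFILE\.ssh\id_rsa.pub | ssh USER@veleta "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"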
If you have configured your ssh client as in the previous section, you can access each node independently. So, depending on the server (replace USER with your username in the mini-cluster):
- Access to Veleta: ssh USER@veleta
- Access to Aneto: ssh USER@aneto
- Access to Mulhacen: ssh USER@mulhacen
- Access to Teide: ssh USER@teide
Once you have accessed the desired server, you can run your programs remotely. If you need to copy files to and from the servers, do the following (replace USER with your username and NODE with your desired node, either veleta, aneto, mulhacen or teide):
- Copy a file to a NODE: scp YOUR_FILE USER@NODE:
- Copy a file from a NODE: scp USER@NODE:YOUR_FILE .
- Copy a folder to a NODE: scp -r YOUR_FOLDER USER@NODE:
- Copy a folder from a NODE: scp -r USER@NODE:YOUR_FOLDER .
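For example, a minimal sketch that copies a script to Mulhacen and retrieves a results folder afterwards (train.py and results are hypothetical names; replace USER with your username):
scp train.py USER@mulhacen:
scp -r USER@mulhacen:results .
Thanks to the ProxyCommand in your ssh config, scp to the hidden nodes goes transparently through Veleta.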
Please keep using the password that you were given, or change it to another secure password. DO NOT use your username as your password, or similar bad practices, since we receive a lot of external attacks. To change your password, please execute
passwd
on each server (we do not have a centralized user directory for the moment).
Given the high demand we have for our GPUs, please select a time slot in our scheduler and book a GPU (or GPUs) for yourself. It is very easy: go to the sheet of a GPU you have access to, find the day (row) and time (column) you want, and write your username in the hours you plan to use it. Please be responsible and do not reserve the GPUs for longer than you expect to need (e.g. no more than two days in a row). Moreover, do not delete a reservation made by another user. We trust in the good behaviour of the users. In case you are having problems with other users, please notify the administrator with the subject [GPU at RGNC] USER REPORT and explain the situation.
This is a temporary solution while demand is high; it may be replaced in the future with SLURM, TensorHive or something similar. Moreover, bear in mind that the RGNC can lock GPUs for research purposes. The required management time will be shown in the spreadsheet well in advance.
You do not need to reserve a GPU when your work is CPU-only, e.g. compilation, code development, scripting, or Python and deep learning on CPU only. Also note that other users might consume all the RAM while using only one GPU; if this happens and you have booked the other GPU in the system, please report it to mdelamor.
Once you are logged into a node, you can check which GPUs are available and their status by typing:
nvidia-smi
Please double-check that your booked GPU is idle. The output of the command should be something like the following:
In the example above, the first GPU is busy with 14GB of memory in use, while GPU 1 is idle. At the bottom of the output you can see the processes using the GPUs; in the example, only one process is using GPU 0. In theory you can launch more than one process on the same GPU, as long as they fit in memory. In any case, please try to avoid this, since the GPUs may be used for benchmarking and this can affect performance.
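If you only want to check the GPU you have booked, nvidia-smi can also query a single device by its index; for example, for GPU 1:
nvidia-smi -i 1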
By default you are given GPU number 0. To select a different GPU, for example GPU number 1, type:
export CUDA_VISIBLE_DEVICES=1
If you want to use both GPUs, type:
export CUDA_VISIBLE_DEVICES=0,1
If you want to use GPU number 0 again, type:
export CUDA_VISIBLE_DEVICES=0
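You can also set the variable for a single command only, so that it does not affect the rest of your session. A minimal sketch (train.py is a hypothetical script name):
CUDA_VISIBLE_DEVICES=1 python train.py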
All servers should have enough storage for your needs, but please check that home or / is not full by executing df -h. If you are using a large dataset, or you have a lot of data, please consider moving it to the /data partition (currently on Mulhacen and Teide only). This partition has a lot of space (1TB in Mulhacen and 2TB in Teide; Veleta has no such partition yet).
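A sketch of checking the free space and moving a dataset there (my_dataset and the per-user folder under /data are assumptions; check with the administrator how /data is organized):
df -h / /data
mkdir -p /data/$USER
mv ~/my_dataset /data/$USER/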
We use CVMFS on all nodes to access different compiler and library versions. This is a CERN virtual file system with a lot of development tools compiled for Red Hat based systems (CentOS, RockyLinux, AlmaLinux…). All our servers run Rocky Linux 9. In order to use CVMFS, do the following:
- ls /cvmfs/sft.cern.ch/lcg/releases (this will mount the remote file system and show all the tools available; it is a long list)
- For instance, if you want to use a version of gcc other than the default one, select your version:
  - First, execute ls /cvmfs/sft.cern.ch/lcg/releases/gcc to see all available versions.
  - Select a version that has a subfolder for centos9 (although centos7 usually works well too).
  - For example, assume you want to use GCC 11.3.1; then just type:
    source /cvmfs/sft.cern.ch/lcg/releases/gcc/11.3.1/x86_64-centos9/setup.sh
- Some other libraries, instead of requiring a setup file to source, just need their folder added to your $PATH or your $LD_LIBRARY_PATH (for example cmake); see the sketch right after this list.
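A minimal sketch of that second case, using cmake from CVMFS (the version and platform folders below are assumptions; check what ls actually shows first):
ls /cvmfs/sft.cern.ch/lcg/releases/CMake
export PATH=/cvmfs/sft.cern.ch/lcg/releases/CMake/3.23.1/x86_64-centos9/bin:$PATH
cmake --version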
You can check which CUDA versions are installed by running ls -l /usr/local.
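If you need a CUDA version other than the default one, the usual approach is to put its bin and lib64 folders first in your environment. A sketch, assuming the listing shows a /usr/local/cuda-11.8 folder (adjust to one that actually exists):
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
nvcc --version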
We recommend using CVMFS for development, and Python pip or conda for Machine Learning. If you need other software installed and admin credentials are needed, please email mdelamor ‘at’ us.es with the subject [GPU at RGNC] SOFTWARE INSTALL.
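For pip, a minimal sketch of an isolated environment for your ML work (myproject is a hypothetical name; install only the packages you need):
python3 -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate
pip install --upgrade pip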
Both Jupyter Notebook and Lab are installed on all nodes. If you need to run a remote Jupyter notebook environment, then do the following (replace USER with your username and NODE with your desired node, either veleta, aneto, mulhacen or teide):
- Type on your machine: ssh -L 8888:localhost:8888 USER@NODE
- On the node you chose, type: jupyter notebook --port=8888 --no-browser and copy the URL with the token shown at the end.
- Now, in your browser, paste the URL you copied before.
If more than one user is using Jupyter notebook on the same server, please change the port in the lines above (instead of 8888, use e.g. 8889, 8890, 8891, etc.). Finally, anaconda is installed on our servers (through the conda command), in case you need to install a custom configuration. Please keep it to a minimum since disk space is limited, and consider using miniconda instead.
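Coming back to the port change mentioned above, a sketch of the same Jupyter workflow on a non-default port (8890 and mulhacen are just examples; replace USER with your username):
ssh -L 8890:localhost:8890 USER@mulhacen
jupyter notebook --port=8890 --no-browser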
It is possible to use the Remote Explorer plugin of Visual Studio Code (and NVIDIA Nsight Visual Studio Code Edition) on our servers. You will need to configure the .ssh/config file used by your active terminal in Visual Studio Code, as explained above; in this way, Visual Studio Code will find our servers. If you use WSL, please do the configuration in Windows, so that VS Code takes the ssh configuration from there.
Have fun, and we wish you efficient code!
v9.0, 06-09-2025
Miguel Ángel Martínez del Amor,
Research Group on Natural Computing
University of Seville
This README was generated with pandoc