# Talapas knowledge base  
https://hpcrcf.atlassian.net/wiki/spaces/TCP/overview?homepageId=6761503  
Service Desk  
https://hpcrcf.atlassian.net/servicedesk/customer/portal/1  
Before you read the rest of this notebook, go read the Talapas Quick Start Guide:  
https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7312376/Quick+Start+Guide  
There's a lot of information there, including how to request an account. This notebook is a companion to that guide, which I'll be referring to frequently.

# Logging in
You log into talapas on a login node, but you shouldn't do any serious work there! As the Quick Start guide states: "The login nodes are for light tasks needed to set up and submit your work.  They're not for running significant applications, simulations, etc.  Processes that use a lot of memory or CPU will be killed." There are two login nodes, ln1 and ln2. You can ssh into talapas with any of the following:  
  
ssh username@talapas-ln1.uoregon.edu  
ssh username@talapas-ln2.uoregon.edu  
ssh username@talapas-login.uoregon.edu  
  
You will need to be connected to the on-campus network (directly or with VPN) for that to work. If you want to run matlab, fsleyes, or some other graphical interface, detailed instructions are here:  
  
https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7312376/Quick+Start+Guide#QuickStartGuide-RunningGraphicalInteractiveJobs  
  
The easiest way to do this is to use OnDemand. In my experience, it also has the best performance when logging in from home.

# OnDemand
Assuming you already have a talapas account, the easiest way to log in is to use OnDemand. No VPN required.    
Open a private browser window and go to:  
https://talapas-ln1.uoregon.edu/  
OR  
https://talapas-ln2.uoregon.edu/  
Log in using your Duck ID  

![](1.images/ondemand.png)

Menu options include:  
* Files: opens a file browser. You can download and upload files here.
* Jobs: view your active jobs, or use the Job Composer to create a job script. We'll come back to this later.
* Clusters: This will let you work on a login node without starting a job or interactive app, similar to ssh. You won't have access to any graphical interface, and you shouldn't use this for anything but light tasks.
* Interactive apps: Start an interactive app. This is where you'll do your "real work."

# Directories on talapas
* your home directory. This is where you start when you log in. There's very little space for files here! (10 GB) Keep small text files and scripts here but data elsewhere.  
* /projects/[pirg]/USERNAME  This is your project directory and where you should store data that you don't share with other pirg members. It shares a quota with your pirg. pirg stands for principal investigator research group, and is usually your lab's name (for me, it's lcni). If you aren't sure what yours is, type `groups` at a command prompt. You should be in at least two: the talapas group and your pirg.
* /projects/[pirg]/shared  This also shares a quota with your pirg, and is intended for data that you share with other pirg members.
* /tmp  This directory is for temporary files. It should be faster than using a directory inside the /projects paths. If you use this directory, be a good citizen and delete any files you create when you are through with them. This directory is not shared across hosts, so if you use it for one job another job will not be able to access it.
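
One way to follow the clean-up advice for /tmp is to let the shell delete your scratch files for you. This is a minimal sketch under my own assumptions, not an official Talapas recipe: `mktemp -d` makes a uniquely named directory, and `trap ... EXIT` removes it when the script ends, even if something fails along the way. The `myjob` prefix and the `work.txt` file name are placeholders.

```shell
#!/bin/bash
# Create a private scratch directory under /tmp (or $TMPDIR if it is set).
scratch=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX")

# Remove the scratch directory when the script exits, even after an error.
trap 'rm -rf "$scratch"' EXIT

# Do the temporary work inside the scratch directory.
echo "intermediate data" > "$scratch/work.txt"
cat "$scratch/work.txt"

# Before the script ends, copy anything worth keeping to permanent storage,
# for example somewhere under /projects/[pirg]/USERNAME.
```

Because /tmp is not shared across hosts, the copy-out step has to happen inside the same job that created the files.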

# Permissions and access control lists
You can use access control lists to give or restrict access to your files. To see who has access, use the command getfacl. For example, if you run `getfacl /projects/[pirg]/shared`, you should see something like this:  
```
[jolinda@talapas-ln1]$ getfacl /projects/lcni/shared
# file: projects/lcni/shared  
# owner: fws
# group: lcni 
# flags: -s-  
user::rwx  
group::rwx  
other::---
```

There are three levels of permissions (user, group, other), and three types of permissions (read, write, and execute). "Owner" shows the default user, and is usually the person who created the file or folder. "Group" shows the default group, usually your pirg. "Other" covers everyone else. In this example, anyone in the pirg can read, write, or execute any file in the shared folder, while those outside the pirg have no access. What happens if you create a folder inside this directory?

```
[jolinda@talapas-ln1]$ cd /projects/lcni/shared
[jolinda@talapas-ln1 shared]$ mkdir test
[jolinda@talapas-ln1 shared]$ getfacl test
# file: test
# owner: jolinda
# group: lcni
# flags: -s-
user::rwx
group::r-x
other::r-x
```

You have read/write/execute permissions in the folder. Everyone else has read & execute permissions. Because they don't have 'x' permissions on the parent folder, users outside of your pirg still won't be able to access this folder. You can change permissions using the `setfacl` command (as long as you are the owner). For instance, suppose you don't want most members of your pirg to be able to read the folder, but you do want user janedoe to have full permissions:

```
[jolinda@talapas-ln1 shared]$ setfacl -m group::--- test
[jolinda@talapas-ln1 shared]$ setfacl -m user:janedoe:rwx test
[jolinda@talapas-ln1 shared]$ getfacl test
# file: test
# owner: jolinda
# group: lcni
# flags: -s-
user::rwx
user:janedoe:rwx
group::---
mask::rwx
other::r-x
```
Again, if janedoe is not a member of your pirg, these permissions won't do her much good because she has no permissions for /projects/[pirg]/shared. Yes, this makes sharing data with members of other pirgs a bit of a pain unless you are the owner of /projects/[pirg], in which case you have all the permissions and can do what you like.
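
The `r-x` entries that the new `test` folder started out with are not specific to Talapas. New directories start from full permissions and the shell's umask takes bits away; a umask of 022 (a common default, and an assumption here, so check yours by typing `umask`) removes write permission for group and other. Here is a small sketch you can run in any throwaway location. It uses `stat -c`, which is the GNU coreutils form, and the directory names `demo` and `private` are arbitrary.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing real is touched.
cd "$(mktemp -d)" || exit 1

# With umask 022, new directories lose write permission for group and other.
umask 022
mkdir demo
stat -c '%A %n' demo       # symbolic mode, then name

# With umask 077, group and other get no permissions at all.
umask 077
mkdir private
stat -c '%A %n' private
```

In an ordinary temporary directory this prints `drwxr-xr-x demo` and `drwx------ private`. Setting a stricter umask before creating folders is an alternative to tightening them afterwards with `setfacl`.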

# Partitions
When starting a job on Talapas, you'll need to specify what partition you want it to run on. If your pirg has its own condo partition, you'll probably want to use it. Otherwise, you'll probably be choosing between interactive, short, long, and gpu. The biggest difference between these is how long you have before being kicked off:
* interactive: 4 hours
* short: 24 hours
* long: 14 days
* gpu: 24 hours but you can use gpus

There are more partitions, including some with access to more memory. See the full list here: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7285967/Partition+List
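
If you'd rather check the limits from the command line, Slurm's `sinfo` command can list them. Treat this as a sketch rather than official Talapas guidance: the format codes (`%P` partition, `%l` time limit, `%D` node count) are standard `sinfo` ones, but the `list_partitions` wrapper is a name I made up for this example.

```shell
#!/bin/bash
# Print each partition with its time limit and number of nodes.
list_partitions() {
    if command -v sinfo >/dev/null 2>&1; then
        sinfo --format="%P %l %D"
    else
        # sinfo only exists on machines with Slurm installed.
        echo "sinfo not found; run this on a Talapas login node" >&2
    fi
}

list_partitions
```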

# Interactive Apps 
There are currently two: Talapas Desktop and Jupyter Notebook. For either, you'll need the following information: your pirg, the partition you are requesting, and how long you want your job to run. You may also want to include 'idx: [project/index]' in the comment field to help with billing. 

Talapas Desktop gives you a nice desktop environment where you can run your GUIs. Some applications, including matlab and rstudio, can be launched from the menu. For others you'll need to load a module, which we'll cover next.  
  
![](1.images/desktop.png)  
Jupyter Notebook will start a JupyterLab session with an interactive python kernel. There are several options for what sort of notebook to start, most of which are for specific UO courses. I will be using the Python3/TensorFlow server for my examples. We'll be coming back to Jupyter Notebook later (it's where this document was created).

# Modules  
Most software on Talapas is installed as LMOD modules. A nice guide on how to use them is here:  
  
https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/7198035/How-to+Use+LMOD  

To see all available modules, use `module avail`, or `module avail [name]`. For example, to see all available modules starting with fsl:
```
[jolinda@talapas-ln1 ~]$ module avail fsl

----------------------------------- /packages/modulefiles/Core ----------------------------------
   fsl/5.0.9    fsl/5.0.10 (D)    fsl/6.0.1    fsleyes/0.23.0

  Where:
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
```
We see that there are four packages matching "fsl", and the default module is version 5.0.10. That's the version that will load if you simply type `module load fsl`. To control which version of fsl you load, type the whole name, e.g. `module load fsl/6.0.1`. Once the module is loaded you'll be able to run any of the usual commands from the command line.  
  
'module spider' returns similar information to 'module avail', except that it doesn't indicate the default module and will also show modules that can't be loaded with 'module load'.
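
A typical load, check, and unload cycle might look like the sketch below. `module load`, `module list`, and `module unload` are standard LMOD commands; the `status` variable and the fallback branch are my own additions so the snippet does something sensible on a machine without LMOD.

```shell
#!/bin/bash
# Load one exact version, confirm what is loaded, then unload it again.
if type module >/dev/null 2>&1; then
    module load fsl/6.0.1     # exact version rather than the default
    module list               # show everything currently loaded
    module unload fsl         # remove it when you are done
    status=loaded
else
    # Outside Talapas (or any LMOD system) the module command does not exist.
    status=unavailable
    echo "module command not found" >&2
fi
```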

# Slurm
Slurm Workload Manager is an open source job scheduler, formerly known as the Simple Linux Utility for Resource Management. When you submit a job to the talapas cluster, Slurm is the software that decides where and when to send it and keeps track of whether it completes successfully or not. You can see all the active jobs for just yourself or for all users by choosing the "active jobs" menu option, or by using the command `sacct` (which lists only your own jobs by default). `sacct` will also show you completed or canceled jobs.

In [4]:
!sacct

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11691589     sys/dashb+      short       lcni          1    TIMEOUT      0:0 
11691589.ba+      batch                  lcni          1  CANCELLED     0:15 
11691589.ex+     extern                  lcni          1  COMPLETED      0:0 
11745141     sys/dashb+      short       lcni          1    RUNNING      0:0 
11745141.ba+      batch                  lcni          1    RUNNING      0:0 
11745141.ex+     extern                  lcni          1    RUNNING      0:0 
11745178     sys/dashb+      short       lcni          1    RUNNING      0:0 
11745178.ba+      batch                  lcni          1    RUNNING      0:0 
11745178.ex+     extern                  lcni          1    RUNNING      0:0 
11745191        convert      short       lcni          1  COMPLETED      0:0 
11745191.ba+      batch                  lcni          1  COMPLE

'sacct' shows recent job information for the current user. Each job has a JobID and a name. Jobs may have separate steps (indicated by jobId.something), with their own names. The '+' signs mean the names are too long for the default column widths. We can use the -j parameter to show the result of a specific jobid (it can be another user's job).

In [5]:
!sacct -j 11617266 

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11617266     snakejob.+      short    kernlab          1  COMPLETED      0:0 
11617266.ba+      batch               kernlab          1  COMPLETED      0:0 
11617266.ex+     extern               kernlab          1  COMPLETED      0:0 


You can use the `--format` option to change which columns are shown. There are a ton of columns you can include, and there's a special option, `--helpformat`, to list them all.

In [6]:
!sacct --helpformat

Account             AdminComment        AllocCPUS           AllocGRES          
AllocNodes          AllocTRES           AssocID             AveCPU             
AveCPUFreq          AveDiskRead         AveDiskWrite        AvePages           
AveRSS              AveVMSize           BlockID             Cluster            
Comment             Constraints         ConsumedEnergy      ConsumedEnergyRaw  
CPUTime             CPUTimeRAW          DerivedExitCode     Elapsed            
ElapsedRaw          Eligible            End                 ExitCode           
Flags               GID                 Group               JobID              
JobIDRaw            JobName             Layout              MaxDiskRead        
MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite        MaxDiskWriteNode   
MaxDiskWriteTask    MaxPages            MaxPagesNode        MaxPagesTask       
MaxRSS              MaxRSSNode          MaxRSSTask          MaxVMSize          
MaxVMSizeNode       MaxVMSizeTask       

For completed jobs, "Elapsed" and "MaxRSS" are particularly helpful. They tell us how long the job ran, and how much memory it used. This can be useful for planning how much time & memory to request in the future.

In [7]:
!sacct -j 11617266 --format="Elapsed, MaxRSS, ReqMem, user, account, partition"

   Elapsed     MaxRSS     ReqMem      User    Account  Partition 
---------- ---------- ---------- --------- ---------- ---------- 
  00:11:44                 220Gc   jadrion    kernlab      short 
  00:11:44 107742656K      220Gc              kernlab            
  00:11:44          0      220Gc              kernlab            


This job requested 220 GB of RAM, used about 103 GB at its peak (a MaxRSS of 107742656K, where K means 1024 bytes), and took 11 minutes 44 seconds to run.
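
Since sacct reports MaxRSS in units of 1024 bytes (the trailing K), a quick conversion helps when deciding what to pass to `--mem` next time. Here's a minimal sketch; `kib_to_gib` is a helper name I made up for this example, not part of Slurm.

```shell
#!/bin/bash
# Convert a sacct memory value such as "107742656K" to GiB.
# 1 GiB = 1024 * 1024 KiB.
kib_to_gib() {
    local kib=${1%K}    # drop the trailing K if present
    awk -v k="$kib" 'BEGIN { printf "%.1f\n", k / 1024 / 1024 }'
}

kib_to_gib 107742656K   # MaxRSS of the batch step above; prints 102.8
```

For a rerun of this job, a request somewhat above the 102.8 it peaked at would leave headroom without asking for the full 220 GB again.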

# Running a job: sbatch
You launch jobs from the command line with the sbatch command. Very simple jobs can be launched using the --wrap option. More complex jobs can be launched from a saved file. In these examples I'll be using my pirg (lcni) but you should use yours. Since I'm in a notebook, I'm starting all my bash commands with an exclamation point. Don't use these if you are in a terminal.

In [14]:
!sbatch --account=lcni --wrap "echo hello"

Submitted batch job 11749761


In [16]:
!sacct -j 11749761

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11749761           wrap      short       lcni          1  COMPLETED      0:0 
11749761.ba+      batch                  lcni          1  COMPLETED      0:0 
11749761.ex+     extern                  lcni          1  COMPLETED      0:0 


In [17]:
!cat slurm-11749761.out

hello
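
If you are scripting this rather than typing it, you'll want the job number in a variable so you can hand it to `sacct -j` or build the output file name. A small sketch: it parses the "Submitted batch job ..." line shown above, and `job_id_from` is a helper name I invented. (sbatch also has a `--parsable` option that prints the ID without the surrounding text, which is another route.)

```shell
#!/bin/bash
# sbatch prints a line like "Submitted batch job 11749761".
# The job ID is the last word, which parameter expansion can pull out.
job_id_from() {
    local line=$1
    echo "${line##* }"
}

# On Talapas you would capture real output, for example:
#   line=$(sbatch --account=lcni --wrap "echo hello")
line="Submitted batch job 11749761"    # sample output copied from above
jobid=$(job_id_from "$line")
echo "$jobid"                          # prints 11749761

# Then, once the job has finished:
#   sacct -j "$jobid"
#   cat "slurm-${jobid}.out"
```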


If we want a more descriptive name than "wrap", we can set that. There are lots of options you can set, and there's a convenient list of them here: https://slurm.schedmd.com/pdfs/summary.pdf

In [18]:
!sbatch --account=lcni --job-name='echo' --wrap "echo hello" 

Submitted batch job 11749773


In [19]:
!sacct -j 11749773

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11749773           echo      short       lcni          1  COMPLETED      0:0 
11749773.ba+      batch                  lcni          1  COMPLETED      0:0 
11749773.ex+     extern                  lcni          1  COMPLETED      0:0 


In many cases it's easier to write a script file and submit that. Here's a simple one:
```
#!/bin/bash
#SBATCH --job-name=fslinfo
#SBATCH --account=lcni

module load fsl/6.0.1
fslinfo bold.nii.gz
```  
If I save that text to a file called fslinfo.srun, I can submit it to slurm with `sbatch`:

In [21]:
!sbatch fslinfo.srun

Submitted batch job 11749811


In [22]:
!sacct -j 11749811

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
11749811        fslinfo      short       lcni          1  COMPLETED      0:0 
11749811.ba+      batch                  lcni          1  COMPLETED      0:0 
11749811.ex+     extern                  lcni          1  COMPLETED      0:0 


In [23]:
!cat slurm-11749811.out # cat is a bash command that prints a file to the console

data_type	INT16
dim1		64
dim2		64
dim3		30
dim4		184
datatype	4
pixdim1		4.000000
pixdim2		4.000000
pixdim3		3.999975
pixdim4		2.500000
cal_max		0.000000
cal_min		0.000000
file_type	NIFTI-1+


# Parts of a slurm script  
  
Let's take apart that script line by line. First we define the interpreter. You'll usually want bash, but you can use other scripting languages such as python. Just make sure you have the right path.  
`#!/bin/bash`   
  
Next up are our SLURM options, one per line. Account is the only required one (and that's only if you don't have environment variables set)  
`#SBATCH --job-name=fslinfo`   
`#SBATCH --account=lcni` 
   
Finally, we have our bash commands:  
`module load fsl/6.0.1`  
`fslinfo bold.nii.gz`

# Array jobs  
We often need to run the same command on a long list of subjects. This is when array jobs come in handy. Instead of writing 10 different scripts with different subject names, we write one script with an array of names. Just separate out the part of the command that changes and replace it with ${x}, put the values in a bash array, set x to the element selected by $SLURM_ARRAY_TASK_ID, and include the --array option in your file.

In [5]:
!cat hello_everyone.srun

#!/bin/bash
#SBATCH --job-name=hello_array
#SBATCH --account=lcni
#SBATCH --array=0-3

data=(Eugene Oregon USA World)

x=${data[$SLURM_ARRAY_TASK_ID]}

echo hello ${x}


In [6]:
!sbatch hello_everyone.srun

Submitted batch job 20940806


In [7]:
!sacct -j 20940806

               JobID                   JobName  Partition      State    Elapsed     MaxRSS 
-------------------- ------------------------- ---------- ---------- ---------- ---------- 
          20940806_0               hello_array      short  COMPLETED   00:00:02            
    20940806_0.batch                     batch             COMPLETED   00:00:02          0 
   20940806_0.extern                    extern             COMPLETED   00:00:02          0 
          20940806_1               hello_array      short  COMPLETED   00:00:02            
    20940806_1.batch                     batch             COMPLETED   00:00:02          0 
   20940806_1.extern                    extern             COMPLETED   00:00:02          0 
          20940806_2               hello_array      short  COMPLETED   00:00:02            
    20940806_2.batch                     batch             COMPLETED   00:00:02          0 
   20940806_2.extern                    extern             COMPLETED   00:00:02 

This created a single job with four tasks. For each task, the variable ${x} was replaced with the corresponding value in the array "data". There's a slurm output file for each value of x.

In [11]:
!cat slurm-20940806_*.out

hello Eugene
hello Oregon
hello USA
hello World
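
For a long subject list, typing every name into the script gets unwieldy. One variation is to keep the names in a text file and let each task pick its own line. This is a sketch under my own assumptions: the subject names and the `nth_line` helper are hypothetical, and the list is written to a temporary file here only so the example is self-contained.

```shell
#!/bin/bash
#SBATCH --job-name=hello_subjects
#SBATCH --account=lcni
#SBATCH --array=0-2

# Stand-in subject list so the sketch runs on its own. In a real job,
# set list to an existing file with one subject per line.
list=$(mktemp)
printf 'sub-01\nsub-02\nsub-03\n' > "$list"

# Print line (index + 1) of a file: array IDs start at 0, sed counts from 1.
nth_line() {
    sed -n "$(( $2 + 1 ))p" "$1"
}

# Outside Slurm, SLURM_ARRAY_TASK_ID is unset, so default to 0 for testing.
subject=$(nth_line "$list" "${SLURM_ARRAY_TASK_ID:-0}")
echo "processing ${subject}"
```

The range in `--array` has to match the number of lines in the file, so a three-line list uses 0-2.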


# Slurm parameters to know (with examples)
`--partition=long`  
Partition you are submitting to. The default is short.  
`--output=slurmjobname.out`  
`--error=slurmjobname-%j.err`  
Alternative to the slurm-{jobnumber}.out filename for standard out and standard error. %j will be replaced with the jobnumber  
`--time=5`  
`--time=5:30:0`  
`--time=2-0`  
Time limit for your job. First example is 5 minutes, second is 5 hours 30 minutes (h:m:s), third is 2 days (d-h). If unspecified it will be the limit for the partition (24 hours for short, 14 days for long, etc). Asking for less time means your job should launch more quickly.   
`--mem=16G`  
Memory requested for your job. The default amount varies, but it's about 4 GB for a standard node. This will not be enough for some jobs. You'll know you need more memory because your job will fail. Don't just ask for all the memory on the node (128 GB). It will take longer to launch your job and you will be taking resources away from other users. If you are doing something that requires more than 128 GB, use fat or longfat.  
`--cpus-per-task=1`  
Number of threads per task. If you are running something that can take advantage of multithreading, increase this number.  
`--dependency=afterok:123456`  
Run this job after job 123456 finishes with no errors  
`--mail-user=email_address --mail-type=END`  
Send an email when the job ends. I like to combine this with email-to-text from my carrier to get a text message notification.  
`--comment=idx:index`  
Write a comment. Talapas may use idx:index to indicate which project should be billed for your time.
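
Putting several of these options together, a script header might look like the sketch below. The option names are the ones described above; the values (10 minutes, 1 GB, the `hello_opts` name) are arbitrary choices for illustration, and you should swap in your own pirg for lcni.

```shell
#!/bin/bash
#SBATCH --job-name=hello_opts
#SBATCH --account=lcni
#SBATCH --partition=short
#SBATCH --time=10
#SBATCH --mem=1G
#SBATCH --cpus-per-task=1
#SBATCH --output=hello_opts-%j.out
#SBATCH --error=hello_opts-%j.err

# Slurm sets SLURM_JOB_ID inside a job. The fallback text only appears
# if you run this file directly with bash instead of submitting it.
msg="job ${SLURM_JOB_ID:-(not under Slurm)} on $(uname -n)"
echo "$msg"
```

To bash, the `#SBATCH` lines are just comments, so you can test the commands with `bash scriptname` before submitting.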

# Canceling a job
You can cancel a job with the scancel command: `scancel jobnumber`.
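
If you don't remember the job number, `squeue` will show it. A short sketch: `squeue -u` and `scancel -u` are standard Slurm options, while the `my_jobs` wrapper and the job number 123456 are placeholders of mine.

```shell
#!/bin/bash
# Show your own queued and running jobs; the first column is the job ID.
my_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        squeue -u "$(id -un)"
    else
        # squeue only exists on machines with Slurm installed.
        echo "squeue not found; run this on Talapas" >&2
    fi
}

my_jobs

# Cancel one job by ID (placeholder number):
#   scancel 123456
# Cancel every job you own, so use with care:
#   scancel -u "$(id -un)"
```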