# NAIST (Nara Institute of Science and Technology) High-Performance Computing Server Guide

This note records hands-on procedures as a supplement to the guides provided by the ITC:

NAIST Server Guidebook:  
https://itcw3.naist.jp/ITC-local/manual/cc21/index.html

NAIST GPU Guidebook:  
https://itcw3.naist.jp/ITC-local/manual/cc21/Microsoft_Azure_CloudGPU_20230802.pdf

## SSH log in

### 1. Normal login from your PC to the server

```bash
# Private PC terminal

# method 1
ssh [mandara_account]@[server_node].naist.jp
# method 2
ssh -l [mandara_account] [server_node]

```
For instance, if the mandara account is 'lyt', the dev node 'cc21dev0' (or 'cc21dev1') is accessed as follows:
```bash
# Private PC terminal

# method 1
ssh lyt@cc21dev0.naist.jp
# method 2
ssh -l lyt cc21dev0

```

And then enter your mandara password.

### 2.ssh login through ssh-keygen
(1). Create an ssh key pair on your private PC.
```bash
# Private PC terminal
ssh-keygen -t rsa -b 4096 -C "[your email]" -f ~/.ssh/[name]
# For instance
# ssh-keygen -t rsa -b 4096 -C "lyt@gmail.com" -f ~/.ssh/rsa-server
```
**Notice:**   
In the '~/.ssh/' folder, 'rsa-server' is your private key: keep it safe and never share it or upload it to the internet. 'rsa-server.pub' is your public key, which you copy to the servers you want to log in to.

(2). Check for the 'authorized_keys' file on the server.   
```bash
# Remote server terminal
ls ~/.ssh/
```
If there is no 'authorized_keys' file, create one.
```bash
# Remote server terminal
touch ~/.ssh/authorized_keys
```
(3). Copy your public key to '~/.ssh/authorized_keys' on the remote server.

```bash
# Private PC terminal
ssh-copy-id -i ~/.ssh/rsa-server.pub lyt@cc21dev0
```
(4). Log in by specifying the identity file.
``` bash
# Private PC terminal
ssh -l lyt cc21dev0 -i ~/.ssh/rsa-server
# or
ssh lyt@cc21dev0 -i ~/.ssh/rsa-server
```
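sshd commonly ignores `authorized_keys` when the `.ssh` directory or the file is writable by other users, in which case key login silently falls back to the password prompt. The following is a minimal sketch for step (2), assuming the server uses OpenSSH's default strict permission checks; the function name `prepare_ssh_dir` is only illustrative.

```bash
# Create .ssh and authorized_keys with owner-only permissions.
# Assumption: the server runs OpenSSH with its default StrictModes check.
# Usage (on the remote server): prepare_ssh_dir "$HOME"
prepare_ssh_dir() {
    local home_dir="${1:-$HOME}"
    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"                  # directory: owner only
    touch "$home_dir/.ssh/authorized_keys"      # keeps existing content
    chmod 600 "$home_dir/.ssh/authorized_keys"  # file: owner read/write
}
```

`ssh-copy-id` in step (3) normally takes care of this by itself, so the function is mainly useful when the key is pasted by hand.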



### 3.ssh log in off-campus
Ref: https://itcw3.naist.jp/ITC-local/remotelogin/ssh.en.html

Access from off campus has to go through sh.naist.jp first to get past the firewall.
So repeat section 2 to set up ssh-keygen login on sh.naist.jp.

Step1. Create an ssh key pair as described in section '2.ssh login through ssh-keygen', step (1).

Step2. [Submit](https://mandara-request.naist.jp/sh.naist.jp/sh-registration.ja.cgi) your public key to the ITC; they will add it to `~/.ssh/authorized_keys` on sh.naist.jp.

Step3. Log in.
```bash
# slogin is an older alias of ssh; if your PC does not have it, `ssh -l ...` works the same way
% slogin -l [mandara_name] sh.naist.jp
# For instance, mandara_name is 'lyt'
% slogin -l lyt sh.naist.jp
Enter passphrase for key '/home/itc/lyt/.ssh/id_rsa':

```
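If the gateway allows connection forwarding, OpenSSH's `ProxyJump` option can chain the two hops so that a single `ssh cc21dev0` works from off campus. This is a sketch under that assumption, which the ITC page may or may not support; the alias `naist-gw`, the key path `~/.ssh/rsa-server` and the function name are hypothetical.

```bash
# Append a jump-host entry to an ssh config file on your private PC.
# Assumptions: sh.naist.jp permits forwarding; the key is ~/.ssh/rsa-server.
# Usage: add_naist_jump_config lyt          # appends to ~/.ssh/config
add_naist_jump_config() {
    local user="$1" config="${2:-$HOME/.ssh/config}"
    cat >> "$config" <<EOF
Host naist-gw
    HostName sh.naist.jp
    User $user
    IdentityFile ~/.ssh/rsa-server

Host cc21dev0 cc21dev1
    HostName %h.naist.jp
    User $user
    IdentityFile ~/.ssh/rsa-server
    ProxyJump naist-gw
EOF
}
```

The one-off equivalent without a config file is `ssh -J lyt@sh.naist.jp lyt@cc21dev0.naist.jp`.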



Azure HPC node: NVIDIA A100 (80 GB GPU memory), 24 cores, 220 GB memory, 880 GB local storage.

For comparison, Mac Studio (M4 Max): 16-core CPU, 40-core GPU, 128 GB unified memory, 1 TB storage, 528,300 yen.

## Change default shell

### 1. check your default shell  
Ref: https://unix.stackexchange.com/questions/136423/making-zsh-default-shell-without-root-access

```bash
echo $SHELL
>>> /bin/csh  # C shell, an old Unix shell. I want to change to zsh or bash.

# Change to zsh.
## check that zsh exists
which zsh
>>> /usr/bin/zsh  # nice, zsh is installed

chsh -s /usr/bin/zsh
>>> chsh: user 'lyt' does not exist in /etc/passwd
# Failed: the account is not listed in the local /etc/passwd, so chsh cannot modify it.

getent passwd $(whoami)  # check whether the account is managed by a centralized identity service
>>> XXX:/work/lyt:/bin/csh


```



### 2. Change to zsh shell


```bash
# in server terminal
vim ~/.login # csh reads the .login file in your home directory at login

```


```bash
# add the following content to the .login file
if ( $?ZSH_VERSION == 0 ) then
    # echo "testpoint1:.login"
    if ( -x /usr/bin/zsh ) then
        # echo "testpoint2:.login"
        exec /usr/bin/zsh
    else if ( -x /bin/zsh ) then
        # echo "testpoint3:.login"
        exec /bin/zsh
    endif
endif

# then, in vim, type `:wq` to save and close the file
```

Now log in again through ssh and zsh will start automatically. If you want to change the zsh theme with oh-my-zsh, set it in the .zshrc file.


```bash
# in .zshrc file
ZSH_THEME="agnoster"
```



Summary:  

| File | When it is loaded | Shell |
| ----------- | ----------- |-----|
| .bash_profile | login; bash's first choice for login config | bash |
| .profile | login; generic login config | sh, dash, ksh, zsh (partially), bash (if there is no .bash_profile) |
| .cshrc | every time a shell starts, like .bashrc for bash | csh, tcsh |
| .login | login, similar to .bash_profile | csh, tcsh |
| .logout | logout | csh, tcsh |
| .bash_logout | logout | bash |
| .zlogout | logout | zsh |


## Assign jobs through Slurm
Use `srun` for interactive jobs and `sbatch` for batch jobs.

### 1. Check partition, node list and its state

(1) `sinfo`: lists all partitions and nodes with their states.

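| Abbreviation | State      | Meaning                                                      |
| :----------- | :--------- | ------------------------------------------------------------ |
| alloc        | allocated  | node is fully allocated to running jobs                      |
| block        | blocked    |                                                              |
| comp         | completing | jobs on the node are finishing                               |
| down         | down       | node cannot be used                                          |
| drain        | drained    | node is unavailable at the administrator's request           |
| drng         | draining   | node is running a job but will not accept new ones           |
| fail         | fail       | node is expected to fail soon and is unavailable             |
| failg        | failing    | node is running a job but is expected to fail soon           |
| futr         | future     | node is defined for future use and not available yet         |
| idle         | idle       | node is free and can be used                                 |
| maint        | maint      | node is in a maintenance reservation                         |
| mix          | mixed      | part of the node is free, part is busy                       |
| npc          | perfctrs   | network performance counters are in use; node is not usable  |
| plnd         | planned    | node is held by the backfill scheduler for a planned job     |
| pow_dn       | power_down | node is about to be powered down                             |
| pow_up       | power_up   | node is being powered up                                     |
| resv         | reserved   | node is in an advance reservation                            |
| unk          | unknown    | state has not been determined yet                            |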



```bash
lyt@cc21dev0  ~  sinfo
PARTITION        AVAIL  TIMELIMIT  NODES  STATE NODELIST
cluster_short*      up    4:00:00      2   comp cc21cluster[00,13]
cluster_short*      up    4:00:00      1    mix cc21cluster05
cluster_short*      up    4:00:00     35  alloc cc21cluster[01-04,06-12,14-37]
cluster_long        up 4-04:00:00     18  alloc cc21cluster[20-37]
cluster_low         up 41-16:00:0      2   comp cc21cluster[00,13]
cluster_low         up 41-16:00:0      1    mix cc21cluster05
cluster_low         up 41-16:00:0     17  alloc cc21cluster[01-04,06-12,14-19]
cluster_intr        up   10:00:00      2   comp cc21cluster[00,13]
cluster_intr        up   10:00:00      1    mix cc21cluster05
cluster_intr        up   10:00:00     17  alloc cc21cluster[01-04,06-12,14-19]
gpu_short           up    4:00:00      5    mix cc21gpu[01-02,04-06]
gpu_short           up    4:00:00      1  alloc cc21gpu00
gpu_short           up    4:00:00      2   idle cc21gpu[03,07]
gpu_long            up 4-04:00:00      3    mix cc21gpu[04-06]
gpu_long            up 4-04:00:00      1   idle cc21gpu07
gpu_intr            up   10:00:00      5    mix cc21gpu[01-02,04-06]
gpu_intr            up   10:00:00      1  alloc cc21gpu00
gpu_intr            up   10:00:00      2   idle cc21gpu[03,07]
hmem_short          up    4:00:00      1  alloc cc21hmem00
hmem_short          up    4:00:00      1   idle cc21hmem01
hmem_long           up 4-04:00:00      1  alloc cc21hmem00
hmem_long           up 4-04:00:00      1   idle cc21hmem01
hmem_intr           up   10:00:00      1  alloc cc21hmem00
hmem_intr           up   10:00:00      1   idle cc21hmem01
msas_short          up    4:00:00      1    mix cc21msas
msas_long           up 4-04:00:00      1    mix cc21msas
msas_intr           up   10:00:00      1    mix cc21msas
azuregpu1_long      up 4-04:00:00      5  idle~ cc19azuregpu[100-104]
azuregpu1_intr      up   10:00:00      5  idle~ cc19azuregpu[100-104]
ocigpu8a100_long    up 4-04:00:00      1  idle~ cc21ocigpu8a110
ocigpu8a100_long    up 4-04:00:00      1 drain~ cc21ocigpu8a100
ocigpu8a100_intr    up   10:00:00      1  idle~ cc21ocigpu8a110
ocigpu8a100_intr    up   10:00:00      1 drain~ cc21ocigpu8a100
ocigpu1a10_long     up 4-04:00:00      7  idle~ cc21ocigpu1a[001-007]
ocigpu1a10_long     up 4-04:00:00      1    mix cc21ocigpu1a000
ocigpu1a10_intr     up   10:00:00      7  idle~ cc21ocigpu1a[001-007]
ocigpu1a10_intr     up   10:00:00      1    mix cc21ocigpu1a000
```
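The listing is long, so it can help to filter it. The following small sketch reads `sinfo` output on standard input and prints only the rows with idle nodes. It assumes the default six-column layout of the listing, and the function name `idle_nodes` is made up for this note.

```bash
# Print "partition nodelist" for rows whose STATE starts with "idle".
# A trailing "~" (as in idle~) marks nodes that are currently powered
# down, according to the sinfo manual, so they may take a while to start.
# Usage on a login node: sinfo | idle_nodes
idle_nodes() {
    awk '$5 ~ /^idle/ { sub(/\*$/, "", $1); print $1, $6 }'
}
```

The `sub` call strips the `*` that `sinfo` appends to the default partition name.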



### 2. Assign partition or node for your work      

(1). Assign partition or node in command line

- `srun` case
  
```bash
# interactive mode: srun -p [partition] --pty bash -l
# -p -> partition
lyt@cc21dev0 ~ srun -p cluster_intr --pty bash -l

srun: job 3468405 queued and waiting for resources
srun: job 3468405 has been allocated resources
testpoin1: .profile  # This line appears because I added a test echo to my .profile; it shows that the login shell started by srun reads '.profile'.
lyt@cc21cluster05:~$  # here node cc21cluster05 was allocated to me; the following commands are executed on this node.
```

Let's run a test.sh file as a test. Its content is as follows:
```
#!/bin/bash -l
echo "success test."
```

In interactive mode, we can run commands like this 👇
```bash
lyt@cc21cluster05: bash test.sh
success test.

lyt@cc21cluster05: exit  # logout node
logout

```


- `sbatch` case     
  
You can submit a batch job with `sbatch`. Parameters can be set in the bash script through '#SBATCH' lines or on the command line.
Priority: command line > script, so a parameter given on the command line overrides the same parameter in the script.

As an example, we have a file called 'test_cpu_run.sh' in the 'GPU_test' folder.
Its content is:
```
#!/bin/bash -l
#SBATCH -p azuregpu1_intr  # Notice: the 'azuregpu1_intr' partition is assigned here; later I override it on the command line.
#SBATCH --chdir=/work/lyt  # working directory
#SBATCH --job-name=test_cpu
#SBATCH --output=test_cpu_output.txt  # output is recorded in this file, saved in the working directory.

# Load modules if necessary, e.g., Anaconda or Python environment
# module load anaconda
# source activate your_env

echo "Starting CPU job...list miniconda if installed"
apt list --installed | grep miniconda
echo "Job finished."
```

```bash
# batch job mode: sbatch -p [partition] [script]
lyt@cc21dev0 ~ sbatch -p cluster_low GPU_test/test_cpu_run.sh  # Notice: here I override the -p parameter on the command line.

Submitted batch job 3468410
```


Check the '/work/lyt/' folder: 'test_cpu_output.txt' has been saved there with the following output:

```text
testpoin1: .profile
Starting CPU job...list miniconda if installed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Job finished.

```


To run on a specific node, add `--nodelist` to the sbatch parameters:
```bash
#SBATCH --nodelist=cc21ocigpu1a003
```
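The `#SBATCH` options can be collected into a reusable template. The sketch below prints a job script for a given partition and job name. The resource values (10 minutes, 2 CPUs, 4 GB) are placeholders rather than recommendations for this cluster, and the function name `write_job_script` is hypothetical. `%j` in the output file name is Slurm's placeholder for the job ID.

```bash
# Print a minimal job script; adjust the resource lines to your needs.
# Usage: write_job_script cluster_short hello > hello.sh && sbatch hello.sh
write_job_script() {
    local partition="$1" job_name="$2"
    cat <<EOF
#!/bin/bash -l
#SBATCH -p $partition
#SBATCH --job-name=$job_name
#SBATCH --output=${job_name}_%j.txt
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

echo "Running on \$(hostname)"
EOF
}
```

Writing the job ID into the output file name keeps repeated runs from overwriting each other's logs.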

### 3. Cancel job

1. Find the job ID
```bash
squeue -u $USER

```
2. Cancel a single job
```bash
scancel [job id]
```

3. Cancel all your jobs
```bash
scancel -u $USER
```
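When jobs are submitted from a script, it is convenient to capture the job ID at submission time instead of looking it up in `squeue`. A small sketch, assuming the `Submitted batch job N` message format shown earlier in this note; `job_id` is an illustrative name.

```bash
# Extract the numeric job ID from sbatch's "Submitted batch job N" line.
# Usage: jid="$(sbatch GPU_test/test_cpu_run.sh | job_id)"; scancel "$jid"
job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}
```

`sbatch --parsable` is an alternative that makes `sbatch` print only the job ID.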

## GPU node - super parallel calculation nodes/ Cloud HPC Node (OCI)

I have only tested the super parallel calculation (SPC) nodes; OCI has not been checked yet.
Here is the test flow for the SPC nodes.

(1). `sbatch`

On the cc21dev0 node, 'test_gpu_run.sh' and 'test_gpu.py' are in the GPU_test folder. 

```bash
lyt@cc21dev0  ~ sbatch -p gpu_short GPU_test/test_gpu_run.sh
Submitted batch job 3468418
```

Output file 'test_gpu_output.txt':
``` text
testpoin1: .profile
Starting GPU test job...
Running test_gpu.py
Traceback (most recent call last):
  File "/work/lyt/GPU_test/test_gpu.py", line 2, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'  # The environment is not set up yet, so the torch package cannot be imported.
Job finished.

```
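Before debugging Python packages, it can be worth confirming that the job actually sees a GPU. A minimal sketch, assuming `nvidia-smi` is installed on the GPU nodes (not verified in this note); `gpu_visible` is an illustrative name, and the function could be pasted into a job script such as 'test_gpu_run.sh'.

```bash
# Report whether an NVIDIA GPU is visible on the current node.
# Assumption: GPU nodes provide the nvidia-smi tool.
gpu_visible() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        nvidia-smi -L    # prints one line per visible GPU
    else
        echo "no NVIDIA GPU visible on $(uname -n)"
    fi
}
```

On a dev node or a CPU partition the function is expected to take the second branch.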

## GPU node - Cloud HPC Node (Azure)

### 1. Copy data and script to azure specific folder

Ref:
https://itcw3.naist.jp/ITC-local/manual/cc21/Microsoft_Azure_CloudGPU_20230802.pdf
(page 17-19)

Because the Azure HPC center is located in the United States, data and scripts have to be copied into the dedicated Azure folder 'azure-cc1' with the `sftp` command.
The Azure folder is a temporary place for your work and is limited to 500 GB per user.

**copy file/folder from dev node to Azure folder:**


```bash
# upload folder
sftp azure-cc1 <<< 'put -r [dev_node/folder] [remote/folder]'
# upload file
sftp azure-cc1 <<< 'put [dev_node/file] [remote/file]'

# For instance, upload test folder
# sftp azure-cc1 <<< 'put -r test test'

```

**get file/folder from Azure folder to dev node:**
```bash
# get folder
sftp azure-cc1 <<< 'get -r [remote/folder] [dev_node/folder]'
# get file
sftp azure-cc1 <<< 'get [remote/file] [dev_node/file]'

# For instance, download the test folder
# sftp azure-cc1 <<< 'get -r test test'

```

**It is also possible to work interactively:**
```bash
# open interactive line
sftp azure-cc1

# list file/folders
ls

# remove a file
rm [file]
# remove an empty folder (the stock OpenSSH sftp `rm` has no -r option)
rmdir [folder]

```
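For repeated uploads, the `put` commands can be collected in a batch file and passed to `sftp -b`. The sketch below generates such a file from a list of local paths. It assumes that the connection to `azure-cc1` does not ask for a password, since batch mode cannot prompt; `make_sftp_batch` is a hypothetical helper name.

```bash
# Write one "put" line per path: "put -r" for folders, "put" for files.
# Paths containing spaces would need quoting; this sketch does not handle them.
# Usage: make_sftp_batch batch.txt GPU_test data.csv && sftp -b batch.txt azure-cc1
make_sftp_batch() {
    local batch_file="$1" path
    shift
    : > "$batch_file"                      # start from an empty file
    for path in "$@"; do
        if [ -d "$path" ]; then
            echo "put -r $path" >> "$batch_file"
        else
            echo "put $path" >> "$batch_file"
        fi
    done
}
```

Keeping the batch file next to the job script documents exactly what was uploaded for a given run.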

### 2. Interactive command

Open a GPU node in an interactive environment:

```bash
srun -p azuregpu1_intr --gres=gpu:1 --chdir=/work/[mandara_account] --pty bash
# For instance
# srun -p azuregpu1_intr --gres=gpu:1 --chdir=/work/lyt --pty bash
```


Memo: the Azure HPC has a really heavy delay; maybe that's why it is idle. 🤷


## Others
- Usage limitation
- On which node should the conda environment be installed?
- What's the difference between `module load` and a conda virtual environment?
- How to keep the ssh connection alive all the time? (client_loop: send disconnect: Broken pipe)
  

### Usage limitation

1. Space limitation
   
| Directory      | Total storage | Max storage per person |
| -------------- | ------------- | ---------------------- |
| /work          | 200 TB        | -                      |
| azurecc-1/work | 2 TB          | 512 GB                 |
| /home          |               |                        |
| /project       |               |                        |

2. System-wide restrictions
   - Number of concurrent jobs submitted per user: 4160
   - Number of concurrent job executions per user: 4160
   - Arrays per array job: 4160

### On which node should the conda environment be installed?
Install the conda environment from a development node, cc21dev0 or cc21dev1.
Every calculation node can access the same /work/[mandara account]/ directory that you see on cc21dev0, whereas the calculation node assigned to your job changes from run to run.
So set up your virtual environment (e.g. conda) under /work from cc21dev0, and put the following launch commands in your job script.

```bash
source /work/lyt/miniconda3/bin/activate
conda activate [myenv]  # myenv is your environment name; do not type the [].

```
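If the conda path is wrong, the job otherwise continues with the system Python and fails later with a confusing import error. A guarded variant of the activation step is sketched below, assuming a Miniconda install under `/work/[account]/miniconda3` as in the commands in this section; `activate_env` is an illustrative name.

```bash
# Activate a conda environment, failing early if the install is missing.
# Usage inside a job script: activate_env /work/lyt/miniconda3 myenv || exit 1
activate_env() {
    local conda_root="$1" env_name="$2"
    if [ ! -f "$conda_root/bin/activate" ]; then
        echo "conda not found under $conda_root" >&2
        return 1
    fi
    # Sourcing activate with an environment name activates that environment.
    source "$conda_root/bin/activate" "$env_name"
}
```

The `|| exit 1` in the usage line stops the job right away, so the error message is the last thing in the output file.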


### What do these modules do?
There are many modules on the system; list them with `module avail`.
Load a module with `module load [name]`.
Check your loaded modules with `module list`.
```bash
2025-04-11 module avail

-----------------------------------------  /etc/environment-modules/modules ------------------------------------------
compiler/intel/2024.1.0            compiler/nvhpc-openmpi3/23.11  java/17/          MATLAB/R2023a      mpi/openmpi/4.1.5      sys/slurm
compiler/nvhpc-byo-compiler/22.2   compiler/nvhpc/22.2            java/17/0/11      MATLAB/R2024a      profiler/nvidia/22.2
compiler/nvhpc-byo-compiler/23.11  compiler/nvhpc/23.11           java/21/0/3       mkl/2024.1         profiler/vtune/2024.1
compiler/nvhpc-nompi/22.2          cuda/11.6u1                    mathematica/13.0  mpi/intel/2021.12  R/4.1.3
compiler/nvhpc-nompi/23.11         cuda/12.2u2                    MATLAB/R2022a     mpi/openmpi/4.1.3  singularity/3.9.6
```

Here is a short summary, generated by ChatGPT-4o, of what these modules do (additions and corrections are welcome):
```markdown
🧠 Compiler Modules

compiler/intel/2024.1.0
- Intel compiler suite (C/C++/Fortran)
- Optimized for high-performance numerical computing
- Works well with Intel MKL
- Great for simulation, computational chemistry, climate models

compiler/nvhpc/* (NVIDIA HPC SDK)
- Designed for GPU-accelerated computing
- Supports CUDA Fortran, OpenACC, OpenMP Offload
- Versions:
  - nvhpc/22.2, nvhpc/23.11: Standard GPU-enabled versions
  - nvhpc-nompi/*: Versions without MPI support
  - nvhpc-openmpi3/23.11: Comes with OpenMPI3 pre-integrated
  - nvhpc-byo-compiler/*: "Bring Your Own Compiler" for advanced configuration

🧮 Math Libraries & Numerical Computing

mkl/2024.1
- Intel Math Kernel Library
- Optimized implementations of:
  - BLAS, LAPACK
  - FFTs
  - Sparse matrix operations
- Accelerates NumPy, SciPy, and machine learning libraries when they are built against MKL

🔢 CUDA Modules

cuda/11.6u1, cuda/12.2u2
- NVIDIA CUDA Toolkit
- Includes:
  - nvcc CUDA compiler
  - cuBLAS and other GPU libraries (cuDNN is distributed separately from the toolkit)
- Required for training or deploying deep learning models on GPUs

🔬 Profiler Tools

profiler/nvidia/22.2
- NVIDIA Nsight tools for GPU performance analysis

profiler/vtune/2024.1
- Intel VTune Profiler
- Used to analyze:
  - CPU and thread performance
  - Memory and cache usage
  - Parallel execution bottlenecks

📡 MPI Modules (Parallel Computing)

mpi/openmpi/*
- Open-source MPI (Message Passing Interface)
- Standard for distributed memory parallelism

mpi/intel/2021.12
- Intel MPI
- Offers improved performance with Intel compilers and hardware

🧪 Applications / Programming Languages

MATLAB
- MATLAB/R2022a, R2023a, R2024a
- Used for:
  - Data analysis
  - Control systems
  - Machine learning and signal processing

R
- R/4.1.3
- Widely used in:
  - Statistical analysis
  - Data visualization
  - Bioinformatics pipelines (DESeq2, edgeR, Seurat, etc.)

Java
- java/17, java/21
- For Java-based applications and tools

📦 System & Utility Tools

sys/slurm
- Job scheduler used to submit and manage jobs on the cluster

singularity/3.9.6
- Container platform (like Docker for HPC)
- Enables reproducible workflows
- Useful for packaging entire bioinformatics pipelines
```

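### What's the difference between `module load` and a conda virtual environment?

`module load` makes system-level software installed by the administrators (compilers, CUDA, MPI, MATLAB and so on) available in your current shell by adjusting environment variables such as `PATH`. A conda virtual environment is a user-level, isolated set of packages that you install and manage yourself.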

### How to keep ssh connection alive all the time? (client_loop: send disconnect: Broken pipe)

In this situation, try the solution: https://unix.stackexchange.com/questions/602518/ssh-connection-client-loop-send-disconnect-broken-pipe-or-connection-reset

Error:

```bash
Connection to cc21dev0 closed by remote host.
Connection to cc21dev0 closed.
client_loop: send disconnect: Broken pipe

```

Solution:

```bash
vim ~/.ssh/config

# add this content to ~/.ssh/config (on your private PC)
Host *
    ServerAliveInterval 20
    TCPKeepAlive no
```
