# **Data Transfer Nodes for Large Data Transfer at Utah CHPC**
For science workflows that transfer very large datasets between institutions, we need ***advanced parallel transfer tools*** running on tuned devices such as ***Data Transfer Nodes (DTNs)***. The University of Utah CHPC provides several parallel transfer tools for these heavyweight tasks.

**DTNs** are dedicated CHPC nodes designed to handle high-speed transfers of large datasets. They are optimized for moving heavy data volumes within the Utah CHPC network and between CHPC and external sources (such as your local machine).

Network traffic from most CHPC systems (on campus) passes through the campus firewall when communicating with resources off campus.
* Large research computing workflows need more bandwidth and more connections/sessions than the campus firewall can handle: they overwhelm the firewall's capacity, degrading service for the rest of campus.
    * To address these needs, the Utah campus has created a Science DMZ (a network segment with a different security approach) that allows specific high-performance, low-latency data transfers.
## **General DTN environments**
All CHPC users can use the following DTNs:
* intdtn01.chpc.utah.edu (connected at 10 Gbps, no DMZ, use for internal campus transfers)
* intdtn02.chpc.utah.edu (connected at 10 Gbps, no DMZ, use for internal campus transfers)
* intdtn03.chpc.utah.edu (connected at 10 Gbps, no DMZ, use for internal campus transfers)
* intdtn04.chpc.utah.edu (connected at 10 Gbps, no DMZ, use for internal campus transfers)
* dtn05.chpc.utah.edu (connected via DMZ at 100 Gbps)
* dtn06.chpc.utah.edu (connected via DMZ at 100 Gbps)
* dtn07.chpc.utah.edu (connected via DMZ at 100 Gbps)
* dtn08.chpc.utah.edu (connected via DMZ at 100 Gbps)

For moving large datasets:
* dtn05-08 operate individually, as well as together.
* intdtn01-03 operate both individually as well as together.
Furthermore:
* CHPC supports specialized tools for moving data to/from cloud storage.
    * `s3cmd` for Amazon cloud services
    * `rclone` for different cloud storage providers.
* Access to the **dtns** via Slurm is enabled on `notchpeak`.
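
As a rough illustration of the host list above, the choice between the internal nodes and the DMZ-connected nodes can be captured in a small shell helper. This is only a sketch: the function name `dtn_host` and the `internal`/`external` labels are hypothetical, and picking the first node of each group is an arbitrary assumption, not a CHPC recommendation.

```bash
# Hypothetical helper: map a transfer scope to one of the DTN hostnames listed above.
dtn_host() {
    case "$1" in
        internal) echo "intdtn01.chpc.utah.edu" ;;   # 10 Gbps, no DMZ
        external) echo "dtn05.chpc.utah.edu" ;;      # 100 Gbps via the Science DMZ
        *) echo "usage: dtn_host internal|external" >&2; return 1 ;;
    esac
}

# Print the host to use for a transfer that leaves campus.
dtn_host external
```

The printed hostname could then be passed to whichever transfer tool is being used.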
# Data Transfer Node Access via SLURM
Note that each **dtn** node has:
* 24 cores, 128 GB RAM
    * Only 12 cores and 96 GB RAM are available to run Slurm jobs.
        * For `notchpeak` cluster:
            * Slurm partition: `notchpeak-dtn`.
            * Slurm Account: `dtn`.
            * Nodes: `dtn05`,`dtn06`,`dtn07`,`dtn08`.
            * `notchpeak-dtn` has 100 Gbps connections to **Utah's Science DMZ** (a segment of the Utah network with streamlined data flow across the campus firewall to and from off-campus).
    * All CHPC users have been set up to use the dtns.

The `notchpeak-dtn` Slurm partition is similar to other shared Slurm partitions at CHPC, with multiple transfer jobs sharing a node.
* Each Slurm job running on a **dtn** is allocated 1 core and 2 GB RAM.
* `notchpeak-dtn` has a maximum time limit of 72 hours per job.
* For parallel transfers, users can request the number of cores and memory using `#SBATCH` directives.
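
To make the last point concrete, the following is a minimal sketch of a batch script that requests several cores and runs one download per core. The values `--cpus-per-task=4` and `--mem=8G`, the file name `urls.txt`, the function name `parallel_fetch`, and the `FETCH` override variable are all illustrative assumptions; only the partition and account names come from the text above.

```bash
#!/bin/bash
#SBATCH --partition=notchpeak-dtn
#SBATCH --account=dtn
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   # illustrative value, within the 12 cores available
#SBATCH --mem=8G            # illustrative value, within the 96 GB available

# Download every URL listed (one per line) in the given file,
# running as many downloads at once as cores were allocated.
parallel_fetch() {
    local url_list="$1"
    # SLURM_CPUS_PER_TASK is set by Slurm; fall back to 1 outside a job.
    local workers="${SLURM_CPUS_PER_TASK:-1}"
    # FETCH is a hypothetical override hook; the default is a quiet wget.
    xargs -P "$workers" -n 1 ${FETCH:-wget -q} < "$url_list"
}

# urls.txt is a hypothetical input file; skip quietly if it is absent.
if [ -f urls.txt ]; then
    parallel_fetch urls.txt
fi
```

Setting `FETCH=echo` before calling the function gives a dry run that only prints the URLs it would fetch.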
## **Download a dataset using a dtn Slurm script:**
```bash
#!/bin/tcsh 

#SBATCH --partition=notchpeak-dtn

#SBATCH --account=dtn

#SBATCH --time=1:00:00

#SBATCH -o slurm-%j.out-%N

#SBATCH -e slurm-%j.err-%N

setenv SCR /scratch/general/lustre/$USER/$SLURM_JOB_ID

mkdir -p $SCR

cd $SCR

wget https://www1.ncdc.noaa.gov/pub/data/uscrn/products/daily01/2020/CRND0103-2020-AK_Aleknagik_1_NNE.txt
```
Note that:
* The appropriate partition and account were used (`notchpeak-dtn` and `dtn`).
* `setenv SCR /scratch/general/lustre/$USER/$SLURM_JOB_ID` sets the environment variable `SCR` to a path in the scratch file system (specific to the user and job ID). `setenv` (a `tcsh` command) is the equivalent of `export` in bash.
    * `SCR` is the name of the environment variable being set.
    * `/scratch/general/lustre` is a directory path on the file system intended for temporary or intermediate data storage. The **scratch** space is a high-performance temporary storage area.
    * `$USER` is an environment variable holding the username; it keeps each user's data separate.
    * `$SLURM_JOB_ID` is a Slurm environment variable containing the unique job ID assigned to the current job. It ensures that data from different jobs run by the same user do not collide and are stored in separate directories.
        * `/scratch/general/lustre/$USER/$SLURM_JOB_ID` is the value being assigned to the `SCR` environment variable. It constructs a path where temporary files can be stored for the job.
* `mkdir -p $SCR` creates the directory if it doesn't already exist, ensuring that the path `/scratch/general/lustre/$USER/$SLURM_JOB_ID` exists.
* `cd $SCR` changes the current directory to the one just created.
* `wget [...]` downloads a specific file into the directory defined by `SCR`.
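
For readers more used to bash than `tcsh`, the path construction described above can be sketched as a small function. The name `job_workdir` and the `manual` and `unknown` fallbacks are hypothetical additions for running outside Slurm; they are not part of the original script.

```bash
# Build a per-user, per-job working directory path under a given base directory.
job_workdir() {
    local base="$1"
    local user="${USER:-unknown}"          # fallback is an assumption
    local job="${SLURM_JOB_ID:-manual}"    # hypothetical fallback outside Slurm
    printf '%s/%s/%s\n' "$base" "$user" "$job"
}

# bash equivalent of the tcsh `setenv SCR ...` line
export SCR="$(job_workdir /scratch/general/lustre)"
echo "$SCR"
```

The function only builds the string; creating the directory is still left to `mkdir -p "$SCR"`.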
## **There is a problem with `/scratch/general/lustre/`!**
Note that you have to log in to the `notchpeak` cluster to use the `notchpeak-dtn` partition and `dtn` account. Now consider the following:
```bash
(base) [u6059911@notchpeak2:~]$ cd /scratch/general/lustre/ && mkdir $USER
mkdir: cannot create directory ‘u6059911’: Read-only file system
```
This means that `/scratch/general/lustre/` is a read-only file system here. For this reason, the following commands (used in the Slurm script above):
```bash
setenv SCR /scratch/general/lustre/$USER/$SLURM_JOB_ID

mkdir -p $SCR

cd $SCR
```
do not make sense, because we are not able to create directories inside `lustre`. If we nevertheless try to run the Slurm batch script using the `scratch` path, we find:
```bash
(base) [u6059911@notchpeak2:~]$ cat slurm-1458728.out-dtn05 
(base) [u6059911@notchpeak2:~]$ cat slurm-1458728.err-dtn05 
mkdir: cannot create directory ‘/scratch/general/lustre/u6059911’: Permission denied
/scratch/general/lustre/u6059911/1458728: No such file or directory.
```
As a result, the Slurm batch job for the data transfer is not executed. To overcome this problem, we need to change the job directory path.
* Using `/tmp` can be a good option, because it is often on fast storage, has broad write permissions, and its files are automatically deleted after a while.
```bash
#!/bin/tcsh

#SBATCH --partition=notchpeak-dtn         
#SBATCH --account=dtn                      
#SBATCH --time=1:00:00                    
#SBATCH --ntasks=1                       
#SBATCH --cpus-per-task=1                
#SBATCH --mem=4GB                        
#SBATCH -o slurm-%j.out-%N               
#SBATCH -e slurm-%j.err-%N             

setenv SCR /tmp/$USER/$SLURM_JOB_ID

mkdir -p $SCR

cd $SCR

echo "Working directory: $SCR"

wget https://www1.ncdc.noaa.gov/pub/data/uscrn/products/daily01/2020/CRND0103-2020-AK_Aleknagik_1_NNE.txt

ls -lh CRND0103-2020-AK_Aleknagik_1_NNE.txt

head CRND0103-2020-AK_Aleknagik_1_NNE.txt
```
Then you can submit the Slurm batch job with `sbatch` followed by the name of the `.slurm` file.
```bash
(base) [u6059911@notchpeak2:~]$ sbatch dtn_test.slurm 
Submitted batch job 1458720
```
Note how the job is being processed:
```bash
(base) [u6059911@notchpeak2:~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1458720 notchpeak dtn_test u6059911 CG       0:05      1 dtn05
(base) [u6059911@notchpeak2:~]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
```
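
The two `squeue` calls above can be read programmatically as well. The sketch below pulls the `ST` (state) column for one job out of `squeue`-style output; `CG` is the Slurm code for a job that is completing, and an empty result means the job is no longer listed. The function name `job_state` is hypothetical, and the column position assumes the default layout shown above with no spaces in the job name.

```bash
# Print the ST (state) column for a given job ID from squeue-style output on stdin.
# Prints nothing if the job is not listed.
job_state() {
    local job_id="$1"
    awk -v id="$job_id" '$1 == id { print $5 }'
}

# Demonstration on the captured output shown above (no live cluster needed).
sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1458720 notchpeak dtn_test u6059911 CG       0:05      1 dtn05'
printf '%s\n' "$sample" | job_state 1458720
```

On a cluster, the same function could be fed live output with `squeue -u "$USER" | job_state 1458720`.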
Then you can see the `.err` & `.out` outputs of the Slurm job:
```bash
(base) [u6059911@notchpeak2:~]$ ls
dtn_test.slurm  Miniconda3-latest-Linux-x86_64.sh  scripts                  slurm-1458720.out-dtn05
environments    MyModules                          slurm-1458720.err-dtn05  software
(base) [u6059911@notchpeak2:~]$ cat sl
slurm-1458720.err-dtn05  slurm-1458720.out-dtn05
```
You can then see what the `.out` file contains using `cat`.
```bash
(base) [u6059911@notchpeak2:~]$ cat slurm-1458720.out-dtn05 
Working directory: /tmp/u6059911/1458720
-rw-r--r-- 1 u6059911 nineil 78K Nov 20  2022 CRND0103-2020-AK_Aleknagik_1_NNE.txt
23583 20200101  2.514 -158.61   59.28   -19.0   -27.8   -23.4   -21.0     0.0     0.31 C   -20.0   -36.2   -25.2    79.3    49.9    62.6 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
23583 20200102  2.514 -158.61   59.28   -21.8   -28.0   -24.9   -24.1     0.0     0.90 C   -25.7   -36.4   -29.7    86.6    64.2    73.5 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
23583 20200103  2.514 -158.61   59.28   -17.7   -22.8   -20.2   -20.2     0.0     1.22 C   -21.9   -27.1   -24.6    71.5    59.9    64.6 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
23583 20200104  2.514 -158.61   59.28   -20.1   -26.0   -23.0   -22.3     0.0     1.19 C   -18.1   -32.2   -26.0    89.5    69.3    78.1 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
```
This verifies that the dataset has been downloaded correctly. Note that it will be deleted after a while (because of the nature of `/tmp`). If you want the dataset to stay in a directory, you will have to change the path that follows `setenv`.
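
When choosing a different path, a quick pre-flight check can help avoid the `Permission denied` failure seen earlier. The sketch below returns the first candidate directory that exists and is writable by the current user. The function name `first_writable_dir` and the candidate list are hypothetical, and the `-w` test is assumed (not verified on CHPC) to also reject read-only mounts.

```bash
# Print the first directory from the argument list that exists and is writable.
# Returns non-zero if none of the candidates qualifies.
first_writable_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Prefer scratch when it is writable, otherwise fall back to /tmp.
if base=$(first_writable_dir /scratch/general/lustre /tmp); then
    echo "Using base directory: $base"
else
    echo "No writable base directory found" >&2
fi
```

The chosen base can then take the place of the hard-coded prefix in the `SCR` path.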