# Run some HDFS commands

First, we need to launch a Hadoop cluster on the VSC-4 cluster.

After starting the cluster, we can run our commands.

Finally, we can shut down the cluster.

### Edit SLURM script

The only part of the script that needs editing lies between `EDIT BEGIN` and `EDIT END`.

In [1]:
%%writefile hdfs_commands.slrm
#!/bin/bash
#SBATCH --job-name=hdfs_commands      # name of the job
#SBATCH --nodes=2                     # number of nodes reserved
#SBATCH --time=00:04:00               # time needed for job
#SBATCH --partition=skylake_0096  
#SBATCH --qos=skylake_0096
#SBATCH --reservation=training        # available only during the course


###################################################
# Use SBATCH --reservation=training during training

##########################################
# clear all modules & load Java and Hadoop
module purge
module load openjdk
module load hadoop

################################
# set some environment variables
export PDSH_RCMD_TYPE=ssh

#######################
# launch Hadoop cluster
prolog_create_key.sh
source vsc_start_hadoop.sh


#########################################################
# EDIT BEGIN
# here the cluster is running and we can run our commands

DATADIR='/home/fs70824/training/hadoop_training_data'   # this is where data is located 

echo "Creating a new directory myDir"
hdfs dfs -mkdir myDir

echo "List new HDFS directory myDir"
hdfs dfs -ls -h myDir 

echo "Upload file to new directory myDir"
hdfs dfs -put ${DATADIR}/wiki_sample_2400lines myDir

echo "List new directory myDir"
hdfs dfs -ls -h myDir

echo "Show disk usage of directory myDir"
hdfs dfs -du -h myDir 

# change replication factor to 2
hdfs dfs -setrep -w 2 myDir

echo "Show disk usage of directory myDir after changing replication factor"
hdfs dfs -du -h myDir 

echo "Remove directory myDir (this will also remove all files contained in it)"
hdfs dfs -rm -r myDir

# EDIT END
#########################################################



#####################
# stop Hadoop cluster
source vsc_stop_hadoop.sh
epilog_discard_key.sh

# check output/errors in slurm-<jobID>.out


Writing hdfs_commands.slrm
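The script prints the disk usage of `myDir` before and after `-setrep`. In Hadoop 3.x, `hdfs dfs -du` reports two numbers per path: the raw size and the space consumed across all replicas. The sketch below illustrates the arithmetic behind the second number. It assumes an initial replication factor of 3 (the HDFS default; the configuration on this cluster may differ), and both the function name and the file size are hypothetical.

```python
def disk_space_consumed(file_size_bytes: int, replication: int) -> int:
    """Space used across all replicas: every replica is a full copy of the file."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return file_size_bytes * replication


size = 1_000_000  # hypothetical file size in bytes, not the real sample file

before = disk_space_consumed(size, 3)  # assumed initial replication factor
after = disk_space_consumed(size, 2)   # after `hdfs dfs -setrep -w 2 myDir`

print(f"raw size: {size}, consumed before: {before}, consumed after: {after}")
```

Under these assumptions the raw size stays the same while the consumed space drops by one third.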


### Launch batch script

Launch the `hdfs_commands.slrm` script to start a Hadoop cluster and run some HDFS commands. The output of `sbatch` is captured so that we can keep track of the job ID.

In [2]:
bash_output = !sbatch hdfs_commands.slrm

In [3]:
import os
JOBNR=bash_output[0].split()[-1]
print("Job number = {}".format(JOBNR))

Job number = 38946
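Taking the last whitespace-separated token works as long as `sbatch` prints exactly `Submitted batch job <ID>`. A slightly more defensive variant, sketched here with a hypothetical helper `parse_job_id`, matches the ID explicitly and fails loudly when the submission did not produce the expected line:

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Extract the numeric job ID from a 'Submitted batch job <ID>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        # e.g. sbatch printed an error message instead of a job ID
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 38946"))
```

In the notebook this could be called as `parse_job_id(bash_output[0])`. Alternatively, `sbatch --parsable` prints the job ID without the surrounding text.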


#### Check the queue

You can re-run the following cell until the job is running or completed.

In [4]:
os.system("squeue --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R' --me -j {}".format(JOBNR))

   JOBID PARTITION            NAME      USER    STATE     TIME TIME_LIMIT NODES NODELIST(REASON)
   38946 skylake_0   hdfs_commands trainee20  PENDING     0:00       4:00     2 (Resources)


0

In [5]:
!squeue --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R' --me -u $USER

   JOBID PARTITION            NAME      USER    STATE     TIME TIME_LIMIT NODES NODELIST(REASON)
   38946 skylake_0   hdfs_commands trainee20  RUNNING  INVALID       4:00     2 n4901-[015-016]
   38529 skylake_0 vsc4_jh_conda_t trainee20  RUNNING  1:47:53   12:00:00     1 n4901-003
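Instead of re-running the cell by hand, the check can be wrapped in a small polling loop. This is only a sketch: `job_state` and `wait_for_job` are hypothetical helper names, and the loop assumes that `squeue` prints nothing on standard output once the job has left the queue.

```python
import subprocess
import time


def job_state(job_id: str) -> str:
    """Return the state reported by squeue, or '' if the job is no longer listed."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%T", "-j", job_id],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def wait_for_job(job_id, get_state=job_state, interval=10.0, max_polls=60,
                 sleep=time.sleep):
    """Poll until the job leaves the queue; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        state = get_state(job_id)
        if state == "":
            return True  # assumption: empty output means the job is finished
        print(f"job {job_id}: {state}")
        sleep(interval)
    return False
```

A call such as `wait_for_job(JOBNR)` would then block until the job is gone or the polling budget (here 60 polls, 10 seconds apart) runs out. The `get_state` and `sleep` parameters exist only so that the loop can be exercised without a running scheduler.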


### Check other participants' jobs

List the jobs that other participants have submitted under the `training` reservation.

In [6]:
os.system("squeue -R'training' --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R'|grep trainee|sort -k3")

   38941 skylake_0   hdfs_commands trainee01  RUNNING     0:31       4:00     2 n4901-[022-023]
   38947 skylake_0   hdfs_commands trainee08  RUNNING  INVALID       4:00     2 n4901-[020-021]
   38945 skylake_0   hdfs_commands trainee15  RUNNING  INVALID       4:00     2 n4901-[009-010]
   38946 skylake_0   hdfs_commands trainee20  RUNNING  INVALID       4:00     2 n4901-[015-016]
   38940 skylake_0   hdfs_commands trainee32  RUNNING     0:38       4:00     2 n4901-[018-019]
   38942 skylake_0   hdfs_commands trainee35  RUNNING     0:25       4:00     2 n4901-[011-012]
   38944 skylake_0   hdfs_commands trainee35  RUNNING  INVALID       4:00     2 n4901-[013-014]
   38950 skylake_0   hdfs_commands trainee97  PENDING     0:00       4:00     2 (Resources)


0

In [7]:
os.system("squeue -R'training' --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R'|grep trainee|sort -k3")

   38947 skylake_0   hdfs_commands trainee08  RUNNING  INVALID       4:00     2 n4901-[020-021]
   38945 skylake_0   hdfs_commands trainee15  RUNNING     0:17       4:00     2 n4901-[009-010]
   38946 skylake_0   hdfs_commands trainee20  RUNNING  INVALID       4:00     2 n4901-[015-016]
   38944 skylake_0   hdfs_commands trainee35  RUNNING     0:08       4:00     2 n4901-[013-014]
   38950 skylake_0   hdfs_commands trainee97  RUNNING  INVALID       4:00     2 n4901-[018-019]


0

#### Check results
When the job no longer appears in the queue, you can check its output in a file called `slurm-<JOBNR>.out`.

In [8]:
os.system("ls -l slurm-{}.out".format(JOBNR))

-rw-r--r-- 1 trainee20 p70824 54996 Oct 19 15:07 slurm-38946.out


0

In [9]:
!scontrol show job {JOBNR}

JobId=38946 JobName=hdfs_commands
   UserId=trainee20(73846) GroupId=p70824(70824) MCS_label=N/A
   Priority=961 Nice=0 Account=p70824 QOS=skylake_0096
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:39 TimeLimit=00:04:00 TimeMin=N/A
   SubmitTime=2022-10-19T15:06:59 EligibleTime=2022-10-19T15:06:59
   AccrueTime=2022-10-19T15:06:59
   StartTime=2022-10-19T15:07:23 EndTime=2022-10-19T15:11:23 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-19T15:07:23 Scheduler=Main
   Partition=skylake_0096 AllocNode:Sid=n4901-003:2624425
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n4901-[015-016]
   BatchHost=n4901-015
   NumNodes=2 NumCPUs=192 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=192,mem=192592M,node=2,billing=192
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=training
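The `scontrol` output is a sequence of `Key=Value` pairs, so individual fields such as `JobState` or `NodeList` can be picked out programmatically. The following is a minimal sketch with a hypothetical helper `parse_scontrol`. It assumes that no value contains whitespace, which holds for the output above but not necessarily for every job.

```python
def parse_scontrol(text: str) -> dict:
    """Split 'Key=Value' tokens into a dict (assumes values contain no whitespace)."""
    fields = {}
    for token in text.split():
        # Split on the first '=' only, so 'TRES=cpu=192,...' keeps its full value.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


sample = """JobId=38946 JobName=hdfs_commands
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:00:39 TimeLimit=00:04:00 TimeMin=N/A
   NodeList=n4901-[015-016]
   TRES=cpu=192,mem=192592M,node=2,billing=192"""

info = parse_scontrol(sample)
print(info["JobState"], info["RunTime"], info["NodeList"])
```

In the notebook, the text could be obtained by capturing the cell output, for example `"\n".join(out)` after `out = !scontrol show job {JOBNR}`.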

In [10]:
!sacct -j {JOBNR} --format=JobID,JobName,MaxRSS,Elapsed

JobID           JobName     MaxRSS    Elapsed 
------------ ---------- ---------- ---------- 
38946        hdfs_comm+              00:01:27 
38946.batch       batch    817476K   00:01:27 
38946.extern     extern          0   00:01:28 
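`sacct` reports `MaxRSS` with a unit suffix, such as `817476K` above. To compare it with the memory available on a node, it can help to convert the value to MiB. The helper below is a sketch under two assumptions: the suffixes are 1024-based, and a bare number such as `0` is given in bytes. The name `maxrss_to_mib` is hypothetical.

```python
def maxrss_to_mib(value: str) -> float:
    """Convert a sacct MaxRSS string such as '817476K' to MiB."""
    # Assumption: suffixes are powers of 1024; factors are expressed in KiB.
    kib_per_unit = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}
    value = value.strip()
    if not value:
        raise ValueError("empty MaxRSS field (sacct leaves it blank for the job line)")
    suffix = value[-1].upper()
    if suffix in kib_per_unit:
        kib = float(value[:-1]) * kib_per_unit[suffix]
    else:
        kib = float(value) / 1024  # assumption: bare numbers are bytes
    return kib / 1024


print(f"{maxrss_to_mib('817476K'):.1f} MiB")
```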
