# Launch a Hadoop cluster on the VSC-4 cluster

## Create a batch script

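This SLURM batch script runs on one or more VSC-4 nodes. It loads Java and Hadoop, starts a Hadoop cluster on the allocated nodes, runs the commands placed between `EDIT BEGIN` and `EDIT END`, and then stops the cluster.

In this example, the only command that runs once Hadoop has started is `echo 'Hello, World!'`.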

In [1]:
%%writefile launch_Hadoop_cluster.slrm
#!/bin/bash
#SBATCH --job-name=launch_Hadoop_cluster  # name of the job
#SBATCH --nodes=1                         # number of nodes reserved
#SBATCH --time=00:04:00                   # time needed for job
#SBATCH --partition=skylake_0096  
#SBATCH --qos=skylake_0096
#SBATCH --reservation=training            # available only during the course


###################################################
# Use SBATCH --reservation=training during training


##########################################
# clear all modules & load Java and Hadoop
module purge
module load openjdk
module load hadoop

################################
# set some environment variables
export PDSH_RCMD_TYPE=ssh

#######################
# launch Hadoop cluster
prolog_create_key.sh
source vsc_start_hadoop.sh


#########################################################
# EDIT BEGIN
# here the cluster is running and we can run our commands
# echo 'Hello, World!'
echo 'Hello, World!'
# EDIT END
#########################################################


#####################
# stop Hadoop cluster
source vsc_stop_hadoop.sh
epilog_discard_key.sh

# check output/errors in slurm-<jobID>.out


Writing launch_Hadoop_cluster.slrm


## Launch batch script

Submit the `launch_Hadoop_cluster.slrm` script with `sbatch` to start a Hadoop cluster.

In [2]:
!sbatch launch_Hadoop_cluster.slrm

Submitted batch job 38921


### Check all your SLURM jobs

With `squeue -u $USER` (or, equivalently, `squeue --me`) you can see all your SLURM jobs, whether queued or running.

**Note:** The job whose name starts with `vsc4_jh_conda_t` is the one serving the current JupyterHub session.

In [3]:
!squeue --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R' --me -u $USER

   JOBID PARTITION            NAME      USER    STATE     TIME TIME_LIMIT NODES NODELIST(REASON)
   38529 skylake_0 vsc4_jh_conda_t trainee20  RUNNING  1:39:38   12:00:00     1 n4901-003
   38921 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-018


Check the queue again:

In [4]:
!squeue --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R' --me -u $USER

   JOBID PARTITION            NAME      USER    STATE     TIME TIME_LIMIT NODES NODELIST(REASON)
   38529 skylake_0 vsc4_jh_conda_t trainee20  RUNNING  1:39:51   12:00:00     1 n4901-003
   38921 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-018


### Launch job keeping track of job ID

We now submit the same script again, this time storing the job ID in a Python variable so that we do not need to type it in later commands.

In [5]:
bash_output = !sbatch launch_Hadoop_cluster.slrm

In [6]:
import os
JOBNR=bash_output[0].split()[-1]
print("Job number = {}".format(JOBNR))

Job number = 38923
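
Taking the last whitespace-separated word works as long as `sbatch` prints exactly the confirmation line shown above. A slightly more defensive sketch matches the message with a regular expression instead. The helper name `parse_job_id` is hypothetical, and the pattern assumes the `Submitted batch job <ID>` wording seen in this notebook.

```python
import re

def parse_job_id(sbatch_output):
    """Return the job ID found in sbatch's confirmation text, or None if absent."""
    # Assumes the "Submitted batch job <ID>" wording shown in this notebook.
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return match.group(1) if match else None

print(parse_job_id("Submitted batch job 38923"))
```

In the notebook this could be called as `parse_job_id("\n".join(bash_output))`; a `None` result signals that the submission did not produce the expected message.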


#### Check the queue

You can re-run the following cell until the job is running or has completed.

**Note:** Press `Ctrl`+`Enter` to run a cell while keeping the cursor in that cell.

In [7]:
os.system("squeue --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R' --me -j {}".format(JOBNR))

   JOBID PARTITION            NAME      USER    STATE     TIME TIME_LIMIT NODES NODELIST(REASON)
   38923 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-020


0
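
Instead of re-running the cell by hand, the check can be repeated in a loop. The following is a minimal sketch rather than a recommended setup: `job_state` and `wait_for_job` are hypothetical helper names, the polling interval and the set of "still active" states are assumptions, and an empty string is taken to mean that the job has left the queue.

```python
import subprocess
import time

# States treated as "still in progress" (an assumption; adjust as needed).
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}

def job_state(job_id):
    """Ask squeue for one job's state; return '' once it has left the queue."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%T", "-j", str(job_id)],
        capture_output=True, text=True,
    )
    return result.stdout.strip()

def wait_for_job(job_id, get_state=job_state, interval=5.0, max_polls=60):
    """Poll until the job is no longer active; return the list of states seen."""
    seen = []
    for _ in range(max_polls):
        state = get_state(job_id)
        seen.append(state)
        if state not in ACTIVE_STATES:
            break
        time.sleep(interval)
    return seen
```

Calling `wait_for_job(JOBNR)` blocks the notebook until the job leaves the queue or `max_polls` checks have been made. The `get_state` parameter exists only so the loop can be tried without a running SLURM installation.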

### Check other participants' jobs

List the jobs of other participants by restricting `squeue` to the `training` reservation and keeping only the lines that contain `trainee`.

In [8]:
os.system("squeue -R'training' --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R'|grep trainee|sort -k3")

   38921 skylake_0 launch_Hadoop_c trainee20  RUNNING     0:05       4:00     1 n4901-018
   38923 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-020
   38922 skylake_0 launch_Hadoop_c trainee32  RUNNING  INVALID       4:00     1 n4901-019


0

Check the queue again to see the progress:

In [9]:
os.system("squeue -R'training' --format='%.8i %.9P %.15j %.9u %.8T %.8M %.10l %.5D %R'|grep trainee|sort -k3")

   38921 skylake_0 launch_Hadoop_c trainee20  RUNNING     0:07       4:00     1 n4901-018
   38923 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-020
   38922 skylake_0 launch_Hadoop_c trainee32  RUNNING  INVALID       4:00     1 n4901-019


0
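
The `grep` and `sort` steps can also be done in Python once the `squeue` text has been captured. The sketch below works on a copy of the output shown above. The function name `parse_squeue_line` and the dictionary keys are hypothetical, and the parsing assumes the nine-column format string used in this notebook and job names without spaces.

```python
def parse_squeue_line(line):
    """Split one squeue row (nine columns, as in the format string above) into a dict."""
    keys = ["jobid", "partition", "name", "user", "state",
            "time", "time_limit", "nodes", "nodelist"]
    # At most eight splits, so the last column keeps any embedded spaces.
    return dict(zip(keys, line.split(None, 8)))

sample = """\
   38921 skylake_0 launch_Hadoop_c trainee20  RUNNING     0:07       4:00     1 n4901-018
   38923 skylake_0 launch_Hadoop_c trainee20  RUNNING  INVALID       4:00     1 n4901-020
   38922 skylake_0 launch_Hadoop_c trainee32  RUNNING  INVALID       4:00     1 n4901-019
"""

rows = [parse_squeue_line(line) for line in sample.splitlines() if line.strip()]
# Keep only course accounts and order them by user name.
trainees = sorted(
    (row for row in rows if row["user"].startswith("trainee")),
    key=lambda row: row["user"],
)
for row in trainees:
    print(row["jobid"], row["user"], row["state"], row["nodelist"])
```

Having each row as a dictionary makes follow-up questions, such as counting jobs per user, easier than working with the raw text.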

### View results

Once the job no longer appears in the queue, you can find its output in a file called `slurm-<JOBNR>.out`.

In [10]:
os.system("ls -l slurm-{}.out".format(JOBNR))

-rw-r--r-- 1 trainee20 p70824 54077 Oct 19 14:59 slurm-38923.out


0
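
The output file is fairly large (about 54 kB here), so it can help to look only for the lines of interest, for example the `Hello, World!` message printed by the batch script. A minimal sketch follows; the helper name `matching_lines` is hypothetical.

```python
from pathlib import Path

def matching_lines(path, needle):
    """Return (line_number, text) pairs for the lines of `path` containing `needle`."""
    hits = []
    # errors="replace" avoids a crash if the log contains undecodable bytes.
    with Path(path).open(errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

In this notebook it could be used as `matching_lines("slurm-{}.out".format(JOBNR), "Hello, World!")`.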

## Check SLURM job run time

### With `scontrol`

While the job is running or shortly after its completion, one can get information about it with `scontrol`.

In [11]:
os.system("scontrol show job {}".format(JOBNR))

JobId=38923 JobName=launch_Hadoop_cluster
   UserId=trainee20(73846) GroupId=p70824(70824) MCS_label=N/A
   Priority=958 Nice=0 Account=p70824 QOS=skylake_0096
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=INVALID TimeLimit=00:04:00 TimeMin=N/A
   SubmitTime=2022-10-19T14:59:46 EligibleTime=2022-10-19T14:59:46
   AccrueTime=2022-10-19T14:59:46
   StartTime=2022-10-19T14:59:48 EndTime=2022-10-19T15:03:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-19T14:59:48 Scheduler=Main
   Partition=skylake_0096 AllocNode:Sid=n4901-003:2624075
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n4901-020
   BatchHost=n4901-020
   NumNodes=1 NumCPUs=96 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=96296M,node=1,billing=96
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=training
  

0

Or, equivalently, as a shell command run with `!`:

In [12]:
!scontrol show job {JOBNR}

JobId=38923 JobName=launch_Hadoop_cluster
   UserId=trainee20(73846) GroupId=p70824(70824) MCS_label=N/A
   Priority=958 Nice=0 Account=p70824 QOS=skylake_0096
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=INVALID TimeLimit=00:04:00 TimeMin=N/A
   SubmitTime=2022-10-19T14:59:46 EligibleTime=2022-10-19T14:59:46
   AccrueTime=2022-10-19T14:59:46
   StartTime=2022-10-19T14:59:48 EndTime=2022-10-19T15:03:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-19T14:59:48 Scheduler=Main
   Partition=skylake_0096 AllocNode:Sid=n4901-003:2624075
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n4901-020
   BatchHost=n4901-020
   NumNodes=1 NumCPUs=96 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,mem=96296M,node=1,billing=96
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=training
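
The `scontrol` output is a series of `Key=Value` pairs, so individual fields can be picked out programmatically. The sketch below parses an excerpt of the output shown above. The function name `parse_scontrol` is hypothetical, and the approach assumes that no value contains whitespace, which holds for the fields printed here but may not hold for every job.

```python
from datetime import datetime

def parse_scontrol(text):
    """Turn whitespace-separated Key=Value tokens into a dict (values kept as strings)."""
    fields = {}
    for token in text.split():
        # Split on the first "=" only, so values like TRES=cpu=96,... stay whole.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

sample = """JobId=38923 JobName=launch_Hadoop_cluster
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=INVALID TimeLimit=00:04:00 TimeMin=N/A
   StartTime=2022-10-19T14:59:48 EndTime=2022-10-19T15:03:48 Deadline=N/A
   TRES=cpu=96,mem=96296M,node=1,billing=96
"""

info = parse_scontrol(sample)
start = datetime.fromisoformat(info["StartTime"])
end = datetime.fromisoformat(info["EndTime"])
print(info["JobState"], (end - start).total_seconds())
```

For this running job, the difference between `EndTime` and `StartTime` is 240 seconds, which matches `TimeLimit=00:04:00` rather than the time actually used.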
  

### With `sacct`

After completion, information on the job can be retrieved with `sacct`.

In [14]:
!sacct -j {JOBNR} --format=JobID,JobName,MaxRSS,Elapsed

JobID           JobName     MaxRSS    Elapsed 
------------ ---------- ---------- ---------- 
38923        launch_Ha+              00:00:18 
38923.batch       batch              00:00:18 
38923.extern     extern              00:00:18 
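
To compare run times numerically, the `Elapsed` column can be converted to seconds. This is a minimal sketch assuming durations of the form `[DD-][HH:]MM:SS`, as in the table above; the function name `elapsed_to_seconds` is hypothetical.

```python
def elapsed_to_seconds(elapsed):
    """Convert a SLURM duration such as '00:00:18' or '1-02:03:04' to seconds."""
    # An optional day count is separated from the clock part by a hyphen.
    days, _, clock = elapsed.rpartition("-")
    parts = [float(part) for part in clock.split(":")]
    # Pad missing leading fields (e.g. 'MM:SS') with zeros.
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds

print(elapsed_to_seconds("00:00:18"))
```

The same helper also handles the shorter `MM:SS` form that `squeue` prints in its `TIME_LIMIT` column, for example `4:00`.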
