mostly initial stuff, other cosmetics

ryanbcox committed Oct 11, 2013
1 parent 1e78593 commit c4a45cb

Showing 43 changed files with 1,064 additions and 10 deletions.
19 changes: 19 additions & 0 deletions LICENSE
@@ -0,0 +1,19 @@
Unless otherwise noted or not copyrightable, each file is copyrighted under the MIT/Expat License (http://opensource.org/licenses/MIT) as follows:

Copyright (C) 2013, Brigham Young University

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the "Software"), to deal in the Software
without restriction, including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons
to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
22 changes: 22 additions & 0 deletions README
@@ -0,0 +1,22 @@
This collection of scripts and programs is designed to improve the stability of shared nodes, whether login or compute, in an HPC setting. It may work in other scenarios but is not tested for anything but HPC. The tools were developed by Ryan Cox at BYU's Fulton Supercomputing Lab in order to limit the ability of users to negatively affect each other's work. They control memory and CPU usage and keep /tmp and /dev/shm clean, using cgroups, namespaces, process limits, and a polling mechanism where cgroups aren't available.

These tools are grouped into different categories and can generally be used separately.

cgroups_compute/
Ideally your scheduler will have support for cgroups. There are scripts that help catch ssh-launched tasks (i.e. tasks not launched through the scheduler) for SLURM (available in 13.12 or newer) and Torque. Some incomplete code is included that may help in the development of a prologue-based mechanism for cgroups in Torque or other schedulers.
cgroups on compute nodes are probably only necessary if you allow node sharing (multiple jobs can run on the same node).

cgroups_login/
This contains files intended to control login node usage, specifically memory usage and CPU sharing. It uses the memory and cpu (not cpuset) cgroups to provide hard memory limits and soft core limits on a per-user basis. Also included is oom_notifierd, an out-of-memory notification tool for users.
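
As a rough sketch of the approach (not the exact files in this directory), a per-user cgroup with a hard memory cap and a soft CPU share can be set up roughly as follows. The /cgroup mount points, the users/ sub-hierarchy, the 4G limit, and the user name are all illustrative; in practice the user name and PID come from whatever hook runs at login (e.g. a PAM session module):

    # create a per-user cgroup under each controller and place the current process in it
    USERNAME=exampleuser    # hypothetical; normally supplied by the login hook
    for ctrl in memory cpu; do
        mkdir -p "/cgroup/$ctrl/users/$USERNAME"
    done
    echo 4G   > "/cgroup/memory/users/$USERNAME/memory.limit_in_bytes"   # hard memory cap
    echo 1024 > "/cgroup/cpu/users/$USERNAME/cpu.shares"                 # soft, relative CPU weight
    echo $$   > "/cgroup/memory/users/$USERNAME/tasks"
    echo $$   > "/cgroup/cpu/users/$USERNAME/tasks"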

cputime_controls/
Ensure that login nodes are used only as login nodes: prohibit long processing tasks but optionally allow long-running data movement processes such as scp and rsync.
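
As a sketch of the idea only (not the actual scripts in this directory), one polling pass could look like the following; the one-hour limit, the UID cutoff, and the exemption list are illustrative, and it assumes a procps-ng ps that supports the cputimes output field:

    #!/bin/sh
    # kill non-exempt user processes that have accumulated more than LIMIT seconds of CPU time
    LIMIT=3600
    ps -eo pid=,uid=,cputimes=,comm= | while read -r pid uid cpu comm
    do
        [ "$uid" -lt 1000 ] && continue                            # leave system accounts alone
        case "$comm" in scp|rsync|sftp-server) continue ;; esac    # allow long data movement
        [ "$cpu" -gt "$LIMIT" ] && kill "$pid"
    done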

loginlimits/
Contains a script that reads certain information from the current cgroup, along with cputime limits, and explains it to the user in a very verbose way.
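
Most of what such a script reports is available from /proc/self/cgroup, the cgroup filesystem, and ulimit. A minimal sketch of the idea, assuming the memory controller is mounted at /cgroup/memory (this is not the actual script shipped here):

    MEMCG=$(awk -F: '$2 ~ /memory/ {print $3}' /proc/self/cgroup)
    echo "memory cgroup:  $MEMCG"
    echo "memory limit:   $(cat /cgroup/memory$MEMCG/memory.limit_in_bytes) bytes"
    echo "memory usage:   $(cat /cgroup/memory$MEMCG/memory.usage_in_bytes) bytes"
    echo "cputime limit:  $(ulimit -t)"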

loginmemlimitenforcer/
Enforce a memory usage threshold on a login node by killing a process when a user exceeds the limit. Uses a polling mechanism. Use cgroups if possible, but use this if nothing else is available.
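
A minimal sketch of such a polling pass (illustrative only; the limit and UID cutoff are assumptions, and it would be run periodically from cron or a sleep loop): sum each user's resident memory and, if it exceeds the limit, kill that user's single largest process.

    #!/bin/sh
    LIMIT_KB=$((8 * 1024 * 1024))        # 8 GB of resident memory per user (example value)
    ps -eo uid=,rss= | awk -v limit="$LIMIT_KB" '$1 >= 1000 {sum[$1] += $2}
        END {for (u in sum) if (sum[u] > limit) print u}' |
    while read -r uid
    do
        victim=$(ps -u "$uid" -o pid= --sort=-rss | awk 'NR==1 {print $1}')
        [ -n "$victim" ] && kill "$victim"          # kill the user's largest process
    done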

namespaces/
Create a separate /tmp and /dev/shm per user on login and compute nodes. Users will not notice any difference, but it greatly simplifies cleanup of user data after a user exits the node.
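
In practice this is commonly done through PAM (e.g. pam_namespace with polyinstantiated directories). Purely as a shell illustration of the underlying mount-namespace mechanism, and not necessarily how these files implement it, a session started as root could be wrapped like this (the /ns path and the user name are hypothetical):

    USERNAME=exampleuser                           # hypothetical; supplied by the login hook
    mkdir -p "/ns/$USERNAME/tmp" "/ns/$USERNAME/dev_shm"
    chmod 1777 "/ns/$USERNAME/tmp" "/ns/$USERNAME/dev_shm"
    unshare --mount /bin/sh -c "
        mount --make-rprivate /                    # keep the bind mounts from leaking out
        mount --bind /ns/$USERNAME/tmp /tmp
        mount --bind /ns/$USERNAME/dev_shm /dev/shm
        exec su - $USERNAME
    "
    # after the session ends, cleanup is simply: rm -rf /ns/$USERNAME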
3 changes: 3 additions & 0 deletions cgroups_compute/slurm/README
@@ -0,0 +1,3 @@
SLURM already has support for cgroups, but an extra feature is provided here. See http://slurm.schedmd.com/cgroups.html as well as the slurm.conf and cgroup.conf manpages.

The helper script /etc/ssh/sshrc helps "catch" users who use ssh to distribute tasks and assigns them to the appropriate cgroup. Requires SLURM 13.12 or newer. Unfortunately, a user can still bypass it by creating a ~/.ssh/rc file, though most users are unaware of that file.
3 changes: 3 additions & 0 deletions cgroups_compute/slurm/ssh_config
@@ -0,0 +1,3 @@
# Add this to /etc/ssh/ssh_config
# You may also want to add other SLURM_* variables such as SLURM_STEP_ID and SLURM_TASK_ID.
SendEnv SLURM_JOB_ID
3 changes: 3 additions & 0 deletions cgroups_compute/slurm/sshd_config
@@ -0,0 +1,3 @@
# Add this to /etc/ssh/sshd_config
# You may also want to add other SLURM_* variables such as SLURM_STEP_ID and SLURM_TASK_ID.
AcceptEnv SLURM_JOB_ID
19 changes: 19 additions & 0 deletions cgroups_compute/slurm/sshrc
@@ -0,0 +1,19 @@
# /etc/ssh/sshrc
# must do this for X forwarding purposes
# reads from stdin so be careful what you place before it
# see sshd manpage for details under SSHRC
if read proto cookie && [ -n "$DISPLAY" ]; then
    if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]; then
        # X11UseLocalhost=yes
        echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
    else
        # X11UseLocalhost=no
        echo add $DISPLAY $proto $cookie
    fi | xauth -q -
fi

# For this to work you must set AcceptEnv SLURM_JOB_ID in sshd_config and SendEnv SLURM_JOB_ID in ssh_config
#for SLURM >= 13.12:
# use new scontrol call to adopt $PPID to $SLURM_JOB_ID
#
# TODO: Insert scontrol command here once 13.12 is finalized
3 changes: 3 additions & 0 deletions cgroups_compute/torque/README
@@ -0,0 +1,3 @@
Torque does not have support for cgroups as of this writing. Ryan Cox at BYU nearly completed a prologue-based mechanism for using cgroups, but it was never finished due to the anticipated (and eventual) switch to SLURM. The test code is provided in an incomplete state in the directory incomplete_cgroups_support/. Read the README in that directory.

The helper script sshrc helps "catch" users who use ssh to distribute tasks. It uses tm_adopt() to begin tracking any processes launched through ssh under the proper job ID. It requires additional ssh configuration as demonstrated in the ssh* files. If you add cgroups support or Torque adds it, you may need to add something to sshrc so the process will be in the proper cgroup.
9 changes: 9 additions & 0 deletions cgroups_compute/torque/incomplete_cgroups_support/README
@@ -0,0 +1,9 @@
The files provided here are INCOMPLETE. That means they do NOT work properly. They were fairly close to completion but are NOT COMPLETE and may burn down your servers and get you fired. These files still have hard-coded paths to Ryan's testing directories. That said, here is some information about the files in case you want to complete them yourself. Stream of consciousness, engage!

Here are some thoughts, written more than a year after the code was last tested. Unfortunately I can't remember what everything was supposed to do or where I left off. I also can't remember why I mounted multiple control group types at a single mount point. It probably seemed easier at the time and I'm not sure it was the right decision.

There is a release_agent in the top level cgroup for each controller type, for example /cgroup/memory/release_agent. You need a release agent that will cause the cgroup to be removed and the aggregate cgroup usage to be recalculated. You can have a different script per cgroup type, but a common way of handling it is to use one common script and parse $1 (the cgroup path under the cgroup mount point). I chose to have only one mount point, so it behaves a little differently: there is only one release_agent to work with. Echo the path of that script into the top level release_agent file; see create_cgroups.sh. Another file I was playing with is release_cgroup.sh. I think I created a much better, functional version on a node and forgot to copy the working one back... Right now it just uses wall to inform you of events.
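
For reference, a minimal release agent for this single-mount-point layout only needs to remove the now-empty cgroup; the kernel invokes it with the cgroup's path relative to the mount point as $1. This is a sketch, not the release_cgroup.sh shipped here:

    #!/bin/sh
    # $1 is the released cgroup's path relative to the mount point, e.g. "/job123"
    CGROUPROOT=/cgroup
    logger -t release_cgroup "removing empty cgroup $1"
    rmdir "$CGROUPROOT$1"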

I can't remember if assign_pid_to_cgroup.c was intended to be setuid or not. It's been almost a year since I used Torque and I can't recall how the prologue worked even though I used to do a lot of work with it. Basically, something needs to run as root to create and modify the cgroups. If you set the cgroup's tasks file to be owned by and writable by the user, the user can assign new pids (enforced by the kernel to be ones that he owns) to his cgroup. Running as setuid is the only reason I can think of to have assign_pid_to_cgroup.c. Ideally the task would be broken into two pieces: 1) create/modify/delete cgroups as root and 2) assign new pids to the correct cgroup as the user.
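
A sketch of that two-piece split, with purely illustrative paths, values, and user name:

    # 1) as root (e.g. from the prologue): create the job cgroup and hand the tasks file to the user
    JOBCG=/cgroup/torque/job123          # hypothetical per-job cgroup
    mkdir -p "$JOBCG"
    echo 4G > "$JOBCG/memory.limit_in_bytes"
    chown exampleuser "$JOBCG/tasks"

    # 2) as the user (e.g. from sshrc): adopt a newly ssh-launched process into the job's cgroup
    echo $$ > "$JOBCG/tasks"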

Feel free to ask me (Ryan Cox) questions, though I may not remember how to answer them.
34 changes: 34 additions & 0 deletions cgroups_compute/torque/incomplete_cgroups_support/assign_pid_to_cgroup.c
@@ -0,0 +1,34 @@
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

/* Hard-coded path to the cgroup management script; adjust for your site. */
#define PATH_TO_CGROUP_MGT_SCRIPT "/fslhome/ryancox/cgroups/manage_cgroup.pl"

#define UID_STR_BUF 10

int main(int argc, char *argv[]) {
    struct stat sb;          /* currently unused */
    char uid[UID_STR_BUF];   /* currently unused */

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <PID> <JOBID>\n", argv[0]);
        exit(1);
    }

    /* Hand the PID and job ID to the management script as "-p <PID> -j <JOBID>".
     * execl() only returns if it fails. */
    execl(
        PATH_TO_CGROUP_MGT_SCRIPT,
        PATH_TO_CGROUP_MGT_SCRIPT,
        "-p",
        argv[1],
        "-j",
        argv[2],
        (char *) NULL
    );
    fprintf(stderr, "%s: exec errno=%d\n", argv[0], errno);

    return 1;
}
33 changes: 33 additions & 0 deletions cgroups_compute/torque/incomplete_cgroups_support/create_cgroups.sh
@@ -0,0 +1,33 @@
#!/bin/sh -e

# Version 1: mount the controllers you need (add others here) at a single
# combined mount point. The "exit" below means only this version is used.
mkdir -p /cgroup
mount -tcgroup -o cpu,memory cgroup /cgroup
echo 1 > "/cgroup/memory.use_hierarchy"
echo 0 > "/cgroup/notify_on_release"
echo 1 > "/cgroup/cgroup.clone_children"
echo /fslhome/ryancox/cgroups/release_cgroup.sh > "/cgroup/release_agent"

exit

# Version 2 (never reached because of the exit above): mount every controller
# listed in /proc/cgroups at its own mount point under $CGROUPPATH.
CGROUPPATH=/cgroup

mkdir -p "$CGROUPPATH"
awk 'NR>1 {print $1}' /proc/cgroups |
while read -r type
do
    path="$CGROUPPATH/$type"
    mkdir -p "$path"
    mount -tcgroup -o"$type" "cgroup:$type" "$path"
    echo /fslhome/ryancox/cgroups/release_cgroup.sh > "$path/release_agent"
    echo 0 > "$path/notify_on_release"
    if [ -e "$path/cgroup.clone_children" ]
    then
        echo 1 > "$path/cgroup.clone_children"
    fi
done

echo 1 > "$CGROUPPATH/memory/memory.use_hierarchy"
