New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PP-810: Update cgroup hook to work with systemd #626
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey mike, I know it's still a WIP, but pylint picked up some issues.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
if not result: | ||
continue | ||
key, val, extra = result.groups() | ||
if not key or not value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value
should probably be val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next commit.
@@ -1770,10 +1969,13 @@ class CgroupUtils(object): | |||
repr(self.subsystems), | |||
repr(self.paths), | |||
repr(self.vntype), | |||
repr(self.assigned_resources))) | |||
repr(self.assigned_resources), | |||
repr(self.systemd_version))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's too many arguments to the string, you need to add another %s
to the format string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next commit.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -1177,11 +1249,48 @@ class NodeConfig(object): | |||
convert_size(entries[3] + entries[4], 'kb') | |||
elif entries[2] == "HugePages_Total:": | |||
numa_nodes[num]['HugePages_Total'] = entries[3] | |||
# Spread memory resources across NUMA nodes | |||
if self.cfg['vnode_per_numa_node'] and len(numa_nodes) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since numa_nodes is a list, you can just check if numa_nodes
instead of its length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we use len(numa_nodes) as a denominator we must ensure it is not zero. There was no sense recalculating this value in the block that followed, so assigned len(num_nodes) to a variable.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -347,6 +356,8 @@ def expand_list(old): | |||
2,5-7,10 | |||
""" | |||
new = [] | |||
if isinstance(old, list): | |||
old = ','.join(str(x) for x in old) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map(str, old)
does the same thing but is more concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be restored to map() in the next commit.
@@ -360,20 +371,37 @@ def expand_list(old): | |||
return new | |||
|
|||
|
|||
def find_files(path, pattern='*', kind=''): | |||
def find_files(path, pattern='*', kind='', | |||
follow_links=False, follow_mounts=True): | |||
""" | |||
Return a list of files similar to the find command | |||
""" | |||
if isinstance(pattern, str): | |||
pattern = [pattern] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern
can be ['']
. Is this okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is safe. Just tried it out to make sure.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
info = {} | ||
jobfile = os.path.join(PBS_MOM_JOBS, '%s.JB' % jobid) | ||
if not os.path.isfile(jobfile): | ||
pbs.logmsg(pbs.EVENT_DEBUG4, "File not found: %s" % (jobfile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why parentheses around jobfile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly necessary in this case, but if we were doing multiple substitutions it would be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And PEP8 doesn't complain, so it must be "acceptable" Python syntax.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
return substate in ['0x2b', '0x2d', 'unknown'] | ||
except KeyboardInterrupt: | ||
raise | ||
except Exception, exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comma here. Do you mean
except Exception as exc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All instances have been updated to "except Exception as exc:" which is compatible with Python 2.6 through 3.x.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
numa_nodes[num]['cpus'] = expand_list(desc.readline()) | ||
numa_nodes[num]['cpus'] = \ | ||
[x for x in expand_list(desc.readline()) if x not in | ||
expand_list(self.cfg['cgroup']['cpuset']['exclude_cpus'])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way of writing this is
a = expand_list(self.cfg['cgroup']['cpuset']['exclude_cpus'])
b = expand_list(desc.readline())
numa_nodes[num]['cpus'] = filter(lambda x : x not in a, b)
a
and b
are just placeholder variable names that you can replace with more meaningful ones.
I personally this is more readable but you are free to choose which one you like better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with you, so I updated it.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
def _systemd_unescape(self, buf): | ||
""" | ||
Unescape strings encoded for usage in system unit names | ||
Some distros don't provide the systemd-escape commad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Missing a 'n'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
while i < length: | ||
if (length - i) > 3 and buf[i] == '\\' and buf[i + 1] == 'x' and \ | ||
buf[i + 2] in '0123456789abcdef' and \ | ||
buf[i + 3] in '0123456789abcdef': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about capital 'A' to 'F'? Do they need to be handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because the actual systemd-escape command doesn't handles lower case.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
if os.path.isdir(os.path.join(os.sep, 'proc', 'self', 'task')): | ||
check_tasks = 1 | ||
else: | ||
check_tasks = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a boolean type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converted to boolean type.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
|
||
def remove_jobid_from_cgroup_jobs(self, jobid): | ||
""" | ||
Remove a job idea from the file where local jobs are maintained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: job id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -3773,66 +4383,82 @@ class CgroupUtils(object): | |||
return self.assigned_resources is not None | |||
|
|||
def add_jobid_to_cgroup_jobs(self, jobid): | |||
""" | |||
Add a job idea to the file where local jobs are maintained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: job id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
result = re.match(r'([-+]?\d+)\s*([a-zA-Z]+)', | ||
str(value).lower()) | ||
if not result: | ||
raise ValueError('Unrecognized value.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd drop the period from this error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be removed in the next commit.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -347,6 +356,8 @@ def expand_list(old): | |||
2,5-7,10 | |||
""" | |||
new = [] | |||
if isinstance(old, list): | |||
old = ','.join(str(x) for x in old) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC you previously used map() to call str() on each element in a list. This is fine though. Up to you on how you want to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched it back to map().
src/hooks/cgroups/pbs_cgroups.PY
Outdated
return substate in ['0x2b', '0x2d', 'unknown'] | ||
except KeyboardInterrupt: | ||
raise | ||
except Exception, exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the newer except Exception as esc style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That won't work with some older versions of PBS Pro. There are still users with Python 2.5 that want to make use of the cgroup hook. We'll have to use the "as" syntax when we migrate to Python 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the thing. You are checking this hook into 18.1 OSS. It uses the system python which will be at least 2.6. You really should be using the as syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to have to defer to @scc138 for a decision. There was an issue related to this recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @subhasisb should weigh in on the decision as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it would be convenient if this hook would work with PBS Pro 13.x where Python 2.5 is present it is not a requirement. Doing the right thing for the targeted Python versions (2.6 and up, less than 3.0) and being future proof are more important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update all of the "Exception, exc:" to "Exception as exc:" in the next push.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
pbs.logmsg(pbs.EVENT_DEBUG2, "Error message: %s" % err) | ||
pbs.logmsg(pbs.EVENT_DEBUG2, "Exception: %s" % exc) | ||
return info | ||
pattern = re.compile(r'^(\w.*):\s*(\S+)\s*(\S*).*$') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need the .*$ since you're just throwing it away. The regex should apply to a substring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next commit.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
result = re.match(pattern, line) | ||
if not result: | ||
continue | ||
key, val, extra = result.groups() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why group extra if you don't use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next commit.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
if int(entries[0]) not in pids: | ||
pids.append(int(entries[0])) | ||
except: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try except block doesn't do anything. You have a blanket except where all you do is raise the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment and changed the raise to a pass. We're inspecting proc, which could change while we're reading files. That should not cause a fatal error since it is expected (occasionally). When I pasted the method I had unit tested into the hook I didn't update that piece.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
elif vtype is pbs.int: | ||
dest[key] = pbs.int(0) | ||
elif vtype is pbs.float: | ||
dest[key] = pbs.float(0.0) | ||
else: | ||
raise ValueError('Unrecognized resource type.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove the period from the error message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pbs.logmsg(pbs.EVENT_DEBUG4, "%s: Method called" % caller_name()) | ||
if not os.path.isdir(path): | ||
pbs.logmsg(pbs.EVENT_DEBUG4, "%s: No such directory: %s" % | ||
(caller_name(), path)) | ||
return False | ||
tasks_file = os.path.join(path, 'tasks') | ||
if not jobid: | ||
parent = path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can pass None in for jobid, make it a default value for jobid so it doesn't have to be passed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also corrected some log messages associated with the jobid.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
except KeyboardInterrupt: | ||
raise | ||
except: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use blanket except statements like this. Narrow the scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another case where we don't care what the error was, or even that it occurred. We just keep trying to identify and cleanup orphans. I added a log message in the case where os.rename() fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using blanket except pass statements is a bad habit to get into. I understand you don't care about the exceptions, but I'd still suggest you narrow the scope. You know what os.rename() will throw.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
try: | ||
with open(self.cgroup_jobs_file, 'w') as f: | ||
with open(self.cgroup_jobs_file, 'w+') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why w+? You're not reading the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 'w'. Thanks for all of your feedback!
04356e8
to
f0e29da
Compare
71310be
to
56eec1e
Compare
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -2738,14 +2739,16 @@ class CgroupUtils(object): | |||
if int(entries[0]) not in pids: | |||
pids.append(int(entries[0])) | |||
except: | |||
raise | |||
# PIDs may come and go as we read /proc so the glob data can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this except be an except OSError? The comment makes it sound like we're trying to be accepting of pids that go away. That means the open will fail and raise an OSError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added OSError.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -3166,8 +3170,8 @@ class CgroupUtils(object): | |||
avail_mem = available['memory'] | |||
if req_mem > avail_mem: | |||
pbs.logmsg(pbs.EVENT_DEBUG4, | |||
"Insufficient memory on socket(s) " + | |||
"%s: requested:%s, assigned:%s" % | |||
'Insufficient memory on socket(s) ' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd replace this + with a '\'. This way the string doesn't need to be regenerated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Placed both strings within parens and removed the + operator.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
msg += "has the offline file." | ||
pbs.logmsg(pbs.EVENT_DEBUG2, msg) | ||
pbs.logmsg(pbs.EVENT_DEBUG2, | ||
'Cgroup(s) not cleaning up but the node already ' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this +. It isn't needed and will cause the string to be regenerated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed several instances, including those you flagged.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
msg += "already offline." | ||
pbs.logmsg(pbs.EVENT_DEBUG2, msg) | ||
pbs.logmsg(pbs.EVENT_DEBUG2, | ||
'Cgroup(s) not cleaning up but the node is ' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, remove the +
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
msg += "cleaning up" | ||
pbs.logmsg(pbs.EVENT_DEBUG2, msg) | ||
pbs.logmsg(pbs.EVENT_DEBUG2, | ||
'Offlining node since cgroup(s) are not ' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, remove the +
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
3363b3e
to
5641d4b
Compare
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -1252,20 +1252,23 @@ class NodeConfig(object): | |||
# Huge page memory | |||
host_hpmem = self.get_hpmem_on_node(ignore_reserved=True) | |||
host_resv_hpmem = host_hpmem - self.get_hpmem_on_node() | |||
host_resv_hpmem = 0 if (host_resv_hpmem < 0) else host_resv_hpmem | |||
if (host_resv_hpmem < 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python style doesn't use () in ifs unless necessary. Please remove them (and for the other cases in this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
'Description=%s\n' | ||
'[Slice]\n' | ||
'Delegate=yes\n' | ||
'TasksMax=infinity\n') % description) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the extra set of (). You're in a set of () by being a function call. (and for other similar strings in function calls)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -4156,8 +4150,6 @@ class CgroupUtils(object): | |||
raise CgroupLimitError('Invalid limit value: %s' % (value)) | |||
else: | |||
raise | |||
except KeyboardInterrupt: | |||
raise | |||
except: | |||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no real purpose to an except: raise. If you don't have this statement, all uncaught exceptions will be just that, uncaught. This just makes you catch them and reraise them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
2ac5837
to
5367b96
Compare
src/hooks/cgroups/pbs_cgroups.PY
Outdated
if os.path.isfile(tasks_file): | ||
self._kill_tasks(tasks_file) | ||
pbs.logmsg(pbs.EVENT_DEBUG2, '%s: Removing directory %s' % | ||
(caller_name(), parent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parent is undefined here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Thanks!
ce29cd8
to
35aeceb
Compare
9ccf0c1
to
3300370
Compare
Conflict resolution did not occur smoothly. Git tried to replay each commit in order, and it was a waste of time re-applying changes one by one to get the branch back on track. I ended up copying the latest hook and tests over the instances with conflicts until everything was resolved. Git then noticed that several patches had already been applied and without my knowledge discarded some previous commits. |
oh no and I see Vincent's commit is gone .. |
But his changes are all still there. Nothing was lost in the process, aside from a bit of history which is eventually going to be squashed out. |
oh okay .. phew!! |
if not os.path.isdir(fdir): | ||
fdir = self.get_cgroup_job_dir('freezer', '123.foo', self.momA) | ||
fdir = os.path.dirname(fdir) | ||
fdir = os.path.dirname(fdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note why you are assigning fdir twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you turn these two comments into one that says "Get the parent's parent"
'python - 100 4 <<EOF\n' + \ | ||
self.eatmem_script + 'EOF\n' + \ | ||
'let i=0; while [ $i -lt 400000 ]; do let i+=1 ; done\n' + \ | ||
'sleep 5\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this job script was updated to allocate less memory as sometime VMs were running out of memory or causing a hang. Why do we need this job script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It allocates less memory than it used to, and is needed to accumulate CPU time so that the measurements get larger as the job runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mike. Please update the script to run explicitly with bash. As these syntax are failing with csh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -1166,7 +1167,7 @@ def test_cgroup_enforce_memory(self): | |||
max_attempts=20) | |||
self.wait_and_remove_file(filename=o.split(':')[1], mom=self.momA) | |||
|
|||
def Xtest_cgroup_enforce_memsw(self): | |||
def test_cgroup_enforce_memsw(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please skip this test if resources_available.vmem is not present as jobs will not run in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done completely clean installs and immediately after ran the cgroup tests, but have not seen this fail. Do you have an example where it fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the example,
pbsroot@testdev-09-s12:~/cgroup_dev/tests/functional> qsub -l select=1:ncpus=1:mem=300mb:vmem=320mb:host=testdev-09-s12 -- /bin/sleep 10
63.testdev-09-s12
comment = Can Never Run: Insufficient amount of resource: vmem (R: 320mb A:
0kb T: 0kb)
pbsroot@testdev-09-s12:~/cgroup_dev/tests/functional> pbsnodes -a | grep vmem
resources_available.vmem = 0kb
resources_assigned.vmem = 0kb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code with following and added a skip
if have_swap() == 0 or not (self.server.status(NODE,'resources_available.vmem',id=self.momA)):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must have manually added "vmem" to the sched_config "resources:" line. The test harness doesn't do that, so the scheduler doesn't consider vmem when scheduling jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, now I understand why. I'll update it.
@@ -84,7 +107,6 @@ def setUp(self): | |||
self.vntypenameA = "no_cgroups" | |||
self.vntypenameB = self.vntypenameA | |||
self.iscray = 0 | |||
self.tempfile = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this piece of code back with related changes at other places too. This was added to cleanup the temporary files created by create_temp_file function.
@@ -321,7 +373,7 @@ def setUp(self): | |||
} | |||
""" | |||
self.cfg2 = """{ | |||
"cgroup_prefix" : "pbs", | |||
"cgroup_prefix" : "propbs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other palces it is pbspro. is "propbs" intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is intentional. It is needed by test_cgroup_prefix_and_devices.
self.logger.info("vntype set for %s is %s" % | ||
(self.momA, self.vntypenameA)) | ||
self.logger.info("vntype set for %s is %s" % | ||
(self.momB, self.vntypenameB)) | ||
(self.momB, self.vntypenameA)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be self.vntypenameB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -571,7 +637,6 @@ def load_config(self, cfg): | |||
Create a hook configuration file with the provided contents. | |||
""" | |||
fn = self.du.create_temp_file(hostname=self.serverA, body=cfg) | |||
self.tempfile.append(fn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add all self.tempfile related code back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -580,9 +645,11 @@ def load_config(self, cfg): | |||
'file request received', | |||
starttime=self.server.ctime) | |||
self.logger.info("Current config: %s" % cfg) | |||
self.du.rm(hostname=self.serverA, path=fn, force=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with self.tempfile we do not need to remove files here but in tearDown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, removed the line.
"%s;update_job_usage: CPU usage: [0-9.]+ secs" % jid, | ||
regexp=True) | ||
"%s;update_job_usage: CPU usage: [0-9].[0-9]+ secs" % jid, | ||
regexp=True, max_attempts=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove max_attempts or keep something more than 15-20s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed max_attempts.
self.wait_and_remove_file(filename=o.split(':')[1], mom=self.momA) | ||
lines = self.hostA.log_match( | ||
"%s;update_job_usage: Memory usage: mem=" % jid, | ||
allmatch=True, max_attempts=5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove max_attempts or keep something more than 15-20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed max_attempts.
if not os.path.isdir(fdir): | ||
fdir = self.get_cgroup_job_dir('freezer', '123.foo', self.momA) | ||
fdir = os.path.dirname(fdir) | ||
fdir = os.path.dirname(fdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note why you are assigning fdir twice?
@@ -1598,24 +1676,24 @@ def tearDown(self): | |||
dest=freezer_file, sudo=True, | |||
uid='root', gid='root', | |||
mode=0644) | |||
self.du.rm(hostname=self.momA, path=fn) | |||
cmd = ["rmdir", jpath] | |||
self.du.rm(hostname=self.momA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe some of the changes were not merged from master. for example, above rm do not have file specified which was fixed in master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed one of your commits, but I thought I got them all. Fixed this as well.
I've already added the comments to the fdir lines for my next commit, so not adding them here. I realize this line doesn't have anything to do with fdir, but GitHub wouldn't let me respond to your earlier comment.
93ee5ef
to
003b82b
Compare
if entries[2] != 'cgroup': | ||
continue | ||
flags = entries[3].split(',') | ||
if 'systemd' in flags or 'name=systemd' in flags: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If name=systemd is in flags, then the first part would have caught it first (systemd is in flags).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe that is correct because 'systemd' will not match 'name=systemd' and vice-versa. We need to explicitly check for both. On my system the line we start with is...
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's a list. I thought it was a string. If it was a string, in would work. Since it is a list of strings, you do need the exact string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Will respond to more comments tomorrow.
if curval == 0: | ||
self.write_value(path, '1') | ||
else: | ||
if curval == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this else and if be turned into an elif?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not. We're testing a boolean, so there are only two possible values.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -4297,6 +4367,8 @@ class CgroupUtils(object): | |||
# Check to see if the job id is found in dmesg output | |||
for line in out: | |||
start = line.find('Killed process ') | |||
if start < 0: | |||
start = line.find('Task in /%s' % self.cfg['cgroup_prefix']) | |||
if start < 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be inside the above if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is correct. The dmesg output differs on newer kernels. First, we look for the old line, and only look for the new line if we didn't find the old one.
raise ValueError('Not a basetype string') | ||
ret = '' | ||
for i, char in enumerate(buf): | ||
if i < 1 and char == '.': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enumerate() will start at 0. The only value this can be is 0. Consider checking for only that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, but it's effectively the same test whether we check for (i < 1) or (i == 0). It's a shame this function even has to be present, IMHO. I'll change it if you insist.
@@ -136,92 +160,150 @@ def setUp(self): | |||
totalSizeMb = chunkSizeMb * iterations | |||
print('Allocating %d chunk(s) of size %dMB. (%dMB total)' % | |||
(iterations, chunkSizeMb, totalSizeMb)) | |||
buf = "" | |||
buf = '' | |||
for i in range(0, iterations): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ranges start at 0 by default. You can just say range(iterations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's been there for a long time. Fixed.
result = self.du.cat(hostname=self.momA, filename=fn, sudo=True) | ||
self.assertTrue(result['rc'] == 0) | ||
self.logger.info('output: %s' % result['out'][0]) | ||
self.assertTrue(result['out'][0] == '1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertEqual()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
fn = os.path.join(fn, 'cpuset.memory_spread_page') | ||
result = self.du.cat(hostname=self.momA, filename=fn, sudo=True) | ||
self.assertTrue(result['rc'] == 0) | ||
self.assertTrue(result['out'][0] == '0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertEqual()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
fn = os.path.join(fn, 'cpuset.memory_spread_page') | ||
result = self.du.cat(hostname=self.momA, filename=fn, sudo=True) | ||
self.assertTrue(result['rc'] == 0) | ||
self.assertTrue(result['out'][0] == '1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertEqual()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cmd = ["rmdir", jpath] | ||
self.logger.info("deleting jobdir %s" % cmd) | ||
cmd = 'rmdir ' + jpath | ||
self.logger.info('deleting jobdir %s' % cmd) | ||
self.du.run_cmd(cmd=cmd, sudo=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use du.run_cmd() instead of du.rm()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's more of a question for Anamika. At first glance, we're running the command "rmdir" and not "rm".
self.du.run_cmd(cmd=cmd2, sudo=True) | ||
for jdir in glob.glob(os.path.join(cpath, '*', '')): | ||
if not os.path.isdir(cpath): | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know cpath is a dir, you're in a if os.path.isdir(cpath)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should have been jdir and not cpath. Fixed.
val *= 2 ** factor | ||
slop = val - int(val) | ||
if (val - int(val)) > 0.0: | ||
val += 1.0 | ||
val = int(val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to do a ceiling() operation here? If so I recommend using the library call math.ceil
under math
module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid point, but I didn't want to pull in the math module for a single operation that can be done without it. Leaving it as-is unless you insist.
@@ -237,7 +236,7 @@ def size_as_int(value): | |||
# | |||
# FUNCTION convert_time | |||
# | |||
def convert_time(value, return_unit='s'): | |||
def convert_time(value, units='s'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use plural instead of unit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because one would normally say "express the time in units" as opposed to "express the time in unit".
undermount = True | ||
break | ||
if undermount: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain what this part is doing? What is undermount?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The os.walk() method takes a parameter called followlinks to determine whether symbolic links are to be avoided. It does not offer a similar parameter (like followmounts) to avoid mount points it encounters. We maintain a list of mount points encountered in the mounts variable. When we encounter a mount, it gets add it to the list. As the search progresses we check to see if the current item is under a mount point. If it is, we skip it when follow_mounts is false.
return size_as_int(total) | ||
pbs.logmsg(pbs.EVENT_DEBUG4, | ||
'%s: Failed to obtain memory using NUMA node method' % | ||
caller_name()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the three methods get_mem_on_node
, get_vmem_on_node
and get_hpmem_on_node
be combined into one? They have a lot of common code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, but I'm apprehensive about refactoring so close to release. Filed a ticket so we can address it later. https://pbspro.atlassian.net/browse/PP-1276
src/hooks/cgroups/pbs_cgroups.PY
Outdated
elif key == 'hpmem': | ||
# Used for vnodes per NUMA socket | ||
if vnodes: | ||
vnode_resc_avail['hpmem'] = pbs.size(val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three cases above can be combined into
elif key in ['mem', 'vmem', 'hpmem'] and vnodes:
vnode_resc_avail[key] = pbs.size(val)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
else: | ||
self.assigned_resources = self._get_assigned_cgroup_resources() | ||
self.offline_msg = 'Hook %s: ' % pbs.event().hook_name | ||
self.offline_msg += 'Unable to clean up one or more cgroups' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a single string assignment and save a string concat operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/hooks/cgroups/pbs_cgroups.PY
Outdated
@@ -1902,13 +2138,21 @@ class CgroupUtils(object): | |||
# but use a different prefix | |||
paths['memsw'] = \ | |||
self._assemble_path('memsw', entries[1], flags) | |||
if 'pids' in flags: | |||
paths['pids'] = \ | |||
self._assemble_path('pids', entries[1], flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of cases in this if statement do the same thing so why not
for subsys in ['blkio', 'cpu', 'cpuacct', 'cpuset', 'devices', 'freezer', 'hugetlb', 'pids']:
if subsys in flags:
paths[subsys] = self._assemble_path(subsys, entries[1], flags)
And then handle memory
, memsw
, systemd
in seperate cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
7296a54
to
265da8a
Compare
match = re.match(r'(\d+)kb', mem) | ||
self.assertFalse(match is None) | ||
usage = int(match.groups()[0]) | ||
self.assertTrue(usage < 20000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertLesser()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI... there is no assertLesser. I used assertGreater and reversed the arguments.
self.assertFalse(match is None) | ||
usage = int(match.groups()[0]) | ||
self.assertTrue(usage < 400000) | ||
self.assertTrue(usage < 20000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use assertLesser()
# Allow some time to pass for values to be updated | ||
self.logger.info('Waiting for periodic hook to update usage data.') | ||
time.sleep(5) | ||
begin = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to use this to match logs, you need to int() this. time.time() returns a float.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already addressed with latest commit.
usage = float(match.groups()[0]) | ||
if usage > 1.0: | ||
break | ||
self.assertTrue(usage > 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertGreater()
@@ -1010,14 +1012,14 @@ def test_cgroup_periodic_update(self): | |||
usage = int(match.groups()[0]) | |||
if usage > 400000: | |||
break | |||
self.assertTrue(usage > 400000) | |||
self.assertTrue(usage > 400000, 'Max memory usage: %dkb' % usage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assertGreater()
Pull request is too large and becoming unresponsive. I will open a new one. |
New PR available here: #678 |
Bug/feature Description
Affected Platform(s)
Cause / Analysis / Design
Solution Description
Checklist:
For further information please visit the Developer Guide Home.