New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
user reliability workflow - creating reservation out of a job #1145
Conversation
@PBSPro/pbspro-maintainers - please do not close this PR. I understand it is inactive. |
@prakashcv13 why don't you close this and re-open when you are ready? it has 0 reviews so there's really not much value in keeping this open right? |
this PR is ready for review - PTL tests are pending for some scenarios - I will push them in a day or two. |
src/include/batch_request.h
Outdated
@@ -175,7 +175,7 @@ struct rq_py_spawn { | |||
|
|||
struct rq_move { | |||
char rq_jid[PBS_MAXSVRJOBID+1]; | |||
char rq_destin[(PBS_MAXSVRJOBID > PBS_MAXDEST ? PBS_MAXSVRJOBID:PBS_MAXDEST)+1]; | |||
char rq_destin[(PBS_MAXSVRJOBID > PBS_MAXDEST ? PBS_MAXSVRJOBID:PBS_MAXDEST)+2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With +1 people can guess it is for a null byte. I think +2 deserves a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we need +2 there, so reverting the change.
src/include/pbs_error.h
Outdated
@@ -260,6 +260,8 @@ extern "C" { | |||
|
|||
#define PBSE_SOFTWT_STF 15180 /* soft_walltime is incompatible with STF jobs */ | |||
|
|||
#define PBSE_RESVFROMRESVJOB 15181 /* A reservation used to create a reservation */ | |||
#define PBSE_RESVFROMARRJOB 15182 /* A reservation used to create a reservation */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean a job used to create a reservation? Also, I don't understand how these two error codes differ. Also, some underscores might make them more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15181 is for creating a reservation out of a job within a reservation
15182 is for creating a reservation out of an array job.
I have inserted underscores and updated the comments as well
<member_at_comp>comp_str</member_at_comp> | ||
<member_at_free>free_str</member_at_free> | ||
<member_at_action>NULL_FUNC</member_at_action> | ||
<member_at_flags><both>READ_WRITE | ATR_DFLAG_ALTRUN | ATR_DFLAG_SELEQ | ATR_DFLAG_MOM</both></member_at_flags> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the mom care about this attribute? Actually does ATR_DFLAG_MOM mean anything to a reservation? It usually means to send a job attribute to the mom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no - removed the flag.
src/server/hook_func.c
Outdated
snprintf(log_buffer, sizeof(log_buffer), | ||
"Found job '%s' attribute flagged to be set", | ||
ATTR_create_resv_from); | ||
log_event(PBSEVENT_DEBUG2, PBS_EVENTCLASS_HOOK, LOG_ERR, phook->hook_name, log_buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using log_eventf(). Also, I don't think you need to use %s. ATTR_create_from_resv is going to be replaced with a quoted string. You should be able to do "found job '" ATTR_create_from_resv "' attribute flagged to be set"
src/server/queue_func.c
Outdated
*/ | ||
|
||
resc_resv * | ||
find_resvbyquename(char *quename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our standard suggests using underscores between words
jid = self.server.submit(job) | ||
self.server.expect(JOB, {ATTR_state: 'R'}, jid) | ||
|
||
rid = 'R' + str(int(jid.split(".")[0]) + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't assume this. When we get multi-server, ids will be all over the place. Do a self.server.status(RESV)[0]['id'] (here and other tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, but here we do not have multi-server, this test will run on a single server system only, here, we can safely assume this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. Don't assume ids. Query them. No other test I've read has assumed IDs to be like this, they query them.
delete_hook = [qmgr_cmd, '-c', 'delete hook %s' % hook_name] | ||
ret = self.du.run_cmd(self.server.hostname, | ||
cmd=delete_hook, sudo=True) | ||
self.assertEqual(ret['rc'], 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hooks will be deleted in setUp(). No need to delete it.
resv = Reservation(job=jid) | ||
self.server.submit(resv) | ||
|
||
rid = 'R' + str(int(jid.split(".")[0]) + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.server.submit() returns the rid. No need to assume the id.
try: | ||
self.server.submit(resv) | ||
except PbsSubmitError as e: | ||
self.assertTrue(msg in e.msg[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an else: self.fail() to both of these try/except blocks (here and below)
test/fw/ptl/lib/pbs_testlib.py
Outdated
@@ -14262,7 +14266,7 @@ class Reservation(ResourceResv): | |||
|
|||
dflt_attributes = {} | |||
|
|||
def __init__(self, username=TEST_USER, attrs=None, hosts=None): | |||
def __init__(self, username=TEST_USER, attrs=None, hosts=None, job=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, maybe key off of the reservation attribute instead of adding an argument to Reservation()? That's how it will work in API mode.
src/server/hook_func.c
Outdated
} | ||
if (new_conv_str != NULL) | ||
log_eventf(PBSEVENT_DEBUG2, PBS_EVENTCLASS_HOOK, LOG_ERR, phook->hook_name, | ||
"Found job "ATTR_create_resv_from_job" attribute flagged to be set"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the %s was taken out, you can just use log_event() now. Also, put spaces after the " and before the next opening " around the macro. Right now it looks like it is quoted and not the other strings.
src/server/req_runjob.c
Outdated
pjob->ji_qs.ji_jobid, log_buffer); | ||
pjob->ji_qs.ji_jobid, "batch request allocation failed"); | ||
log_event(PBSEVENT_SYSTEM, PBS_EVENTCLASS_JOB, LOG_ERR, | ||
pjob->ji_qs.ji_jobid, "Could not create reservation from the job"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combine these log messages into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wanted to keep 2 separate log messages, but let us combine them into one.
src/server/req_quejob.c
Outdated
if (resc_def != NULL) { | ||
job_resc_entry = find_resc_entry(&pjob->ji_wattr[(int)JOB_ATR_resource], resc_def); | ||
job_resc_entry = (resource *)GET_NEXT(pjob->ji_wattr[(int)JOB_ATR_resource].at_val.at_list); | ||
while (job_resc_entry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are iterating, use a for() loop.
resc_def = find_resc_def(svr_resc_def, WALLTIME, svr_resc_size); | ||
if (resc_def != NULL) { | ||
resv_resc_entry = find_resc_entry(&presv->ri_wattr[(int)RESV_ATR_resource], resc_def); | ||
if (resv_resc_entry == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If walltime is copied, how will this ever be NULL? We only get in here if we copied walltime, so it should be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if JOB_ATR_resource does not have the entry for walltime, we set the walltime of 5 years.
src/server/queue_func.c
Outdated
|
||
(void)strncpy(qname, quename, PBS_MAXDEST); | ||
qname[PBS_MAXDEST] ='\0'; | ||
pc = strchr(qname, (int)'@'); /* strip off server (fragment) */ | ||
if (pc) | ||
*pc = '\0'; | ||
pque = (pbs_queue *)GET_NEXT(svr_queues); | ||
while (pque != NULL) { | ||
for (; pque != NULL; pque = (pbs_queue *)GET_NEXT(pque->qu_link)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add the initialization into the for() header? (here and below)
@@ -80,15 +80,17 @@ def test_create_resv_from_job_using_runjob_hook(self): | |||
jid = self.server.submit(job) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI (above), you don't need setUp() if the only thing you do is call the super's setUp()
|
||
a = {ATTR_job: jid, 'reserve_state': "RESV_RUNNING", | ||
'Resource_List.select': '1:ncpus=3', | ||
'Resource_List.walltime': 9999} | ||
self.server.expect(RESV, a, id=rid) | ||
|
||
a = {ATTR_queue: rid} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When checking for the state of a reservation do (MATCH_RE, 'RESV_RUNNING|5)' (here and other places)
hey @bhroam - could you please let me know if there are further comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -1089,6 +1089,9 @@ find_job(char *jobid) | |||
char buf[PBS_MAXSVRJOBID+1]; | |||
|
|||
/* Make a copy of the job ID string before we modify it. */ | |||
if (jobid == NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think above comment is for the sprintf statement, can you move your changes above the comment?
src/server/queue_func.c
Outdated
|
||
/** | ||
* @brief | ||
* find_resvbyquename() - find a reservation by the name of its queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find_resv_by_quename
src/server/req_quejob.c
Outdated
char tbuf1[256] = {0}; | ||
char tbuf2[256] = {0}; | ||
int is_maintenance = 0; | ||
char buf[PBS_MAXUSER + PBS_MAXHOSTNAME + 64]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This buf is left uninitialized.
src/server/req_quejob.c
Outdated
/*for resources that are not specified in the request and | ||
*for which default values can be determined, set these values | ||
*as the values for those resources | ||
/* for resources that are not specified in the request and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it doxygen standard.
src/server/req_runjob.c
Outdated
plhed = &newreq->rq_ind.rq_queuejob.rq_attr; | ||
CLEAR_HEAD(newreq->rq_ind.rq_queuejob.rq_attr); | ||
if ((psatl = attrlist_create(ATTR_resv_job, NULL, len)) != NULL) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this extra line.
@@ -1966,3 +1969,47 @@ req_defschedreply(struct batch_request *preq) | |||
|
|||
reply_send(preq); | |||
} | |||
|
|||
void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doxygen comment for the function is missing.
src/server/req_runjob.c
Outdated
/** | ||
* @brief | ||
* convert_job_to_resv - creates a reservation out of the job | ||
* and moved the job to the newly created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want the present tense here (move). Also, I you might want to make creates not plural.
hey @bhroam, shall we go ahead and merge this PR. If there are any more changes, please let me know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thank you @bhroam and @riyazhakki for the review |
Describe Bug or Feature
When a job encounters a problem with application/data/script/etc, the job exits and the resources allocated to the job are released back to the server. If the user wants to run the job again (after correcting the issues that caused the problem), they need to re-submit the job which could take a while to run (depending on various factors like number of jobs in the queue, priority, scheduling policy etc.). If the user needs resources to be allocated for a time period so that the job can be re-submitted and can run without delays, they can use reservations. As of now reservations have a start and end time, so it is not yet possible to have a reservation that is schedule-able.
Describe Your Change
This PR focuses on creating a reservation out of a job.
Link to Design Doc
Design
Attach Test and Valgrind Logs/Output
hook demo.txt
qsub demo.txt
pbs_rsub demo.txt
ptl.txt