Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user reliability workflow - creating reservation out of a job #1145

Merged
merged 10 commits into from Mar 3, 2020
Merged

user reliability workflow - creating reservation out of a job #1145

merged 10 commits into from Mar 3, 2020

Conversation

prakashcv13
Copy link
Contributor

@prakashcv13 prakashcv13 commented May 28, 2019

Describe Bug or Feature

When a job encounters a problem with application/data/script/etc, the job exits and the resources allocated to the job are released back to the server. If the user wants to run the job again (after correcting the issues that caused the problem), they need to re-submit the job which could take a while to run (depending on various factors like number of jobs in the queue, priority, scheduling policy etc.). If the user needs resources to be allocated for a time period so that the job can be re-submitted and can run without delays, they can use reservations. As of now reservations have a start and end time, so it is not yet possible to have a reservation that is schedule-able.

Describe Your Change

This PR focuses on creating a reservation out of a job.

Link to Design Doc

Design

Attach Test and Valgrind Logs/Output

hook demo.txt
qsub demo.txt
pbs_rsub demo.txt

ptl.txt

@prakashcv13 prakashcv13 changed the title [WIP] NASA user reliability [WIP] user reliability May 28, 2019
@prakashcv13
Copy link
Contributor Author

@PBSPro/pbspro-maintainers - please do not close this PR. I understand it is inactive.

@agrawalravi90
Copy link

@prakashcv13 why don't you close this and re-open when you are ready? it has 0 reviews so there's really not much value in keeping this open right?

@agrawalravi90 agrawalravi90 added the inactive This pull request has been inactive for > 2 weeks label Dec 12, 2019
@prakashcv13 prakashcv13 removed the inactive This pull request has been inactive for > 2 weeks label Dec 20, 2019
@prakashcv13 prakashcv13 changed the title [WIP] user reliability user reliability workflow - creating reservation out of a job Jan 10, 2020
@prakashcv13
Copy link
Contributor Author

this PR is ready for review - PTL tests are pending for some scenarios - I will push them in a day or two.

@@ -175,7 +175,7 @@ struct rq_py_spawn {

struct rq_move {
char rq_jid[PBS_MAXSVRJOBID+1];
char rq_destin[(PBS_MAXSVRJOBID > PBS_MAXDEST ? PBS_MAXSVRJOBID:PBS_MAXDEST)+1];
char rq_destin[(PBS_MAXSVRJOBID > PBS_MAXDEST ? PBS_MAXSVRJOBID:PBS_MAXDEST)+2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With +1 people can guess it is for a null byte. I think +2 deserves a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we need +2 there, so reverting the change.

@@ -260,6 +260,8 @@ extern "C" {

#define PBSE_SOFTWT_STF 15180 /* soft_walltime is incompatible with STF jobs */

#define PBSE_RESVFROMRESVJOB 15181 /* A reservation used to create a reservation */
#define PBSE_RESVFROMARRJOB 15182 /* A reservation used to create a reservation */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a job used to create a reservation? Also, I don't understand how these two error codes differ. Also, some underscores might make them more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15181 is for creating a reservation out of a job within a reservation
15182 is for creating a reservation out of an array job.
I have inserted underscores and updated the comments as well

<member_at_comp>comp_str</member_at_comp>
<member_at_free>free_str</member_at_free>
<member_at_action>NULL_FUNC</member_at_action>
<member_at_flags><both>READ_WRITE | ATR_DFLAG_ALTRUN | ATR_DFLAG_SELEQ | ATR_DFLAG_MOM</both></member_at_flags>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the mom care about this attribute? Actually does ATR_DFLAG_MOM mean anything to a reservation? It usually means to send a job attribute to the mom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - removed the flag.

snprintf(log_buffer, sizeof(log_buffer),
"Found job '%s' attribute flagged to be set",
ATTR_create_resv_from);
log_event(PBSEVENT_DEBUG2, PBS_EVENTCLASS_HOOK, LOG_ERR, phook->hook_name, log_buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using log_eventf(). Also, I don't think you need to use %s. ATTR_create_from_resv is going to be replaced with a quoted string. You should be able to do "found job '" ATTR_create_from_resv "' attribute flagged to be set"

*/

resc_resv *
find_resvbyquename(char *quename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our standard suggests using underscores between words

jid = self.server.submit(job)
self.server.expect(JOB, {ATTR_state: 'R'}, jid)

rid = 'R' + str(int(jid.split(".")[0]) + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't assume this. When we get multi-server, ids will be all over the place. Do a self.server.status(RESV)[0]['id'] (here and other tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, but here we do not have multi-server, this test will run on a single server system only, here, we can safely assume this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Don't assume ids. Query them. No other test I've read has assumed IDs to be like this, they query them.

delete_hook = [qmgr_cmd, '-c', 'delete hook %s' % hook_name]
ret = self.du.run_cmd(self.server.hostname,
cmd=delete_hook, sudo=True)
self.assertEqual(ret['rc'], 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooks will be deleted in setUp(). No need to delete it.

resv = Reservation(job=jid)
self.server.submit(resv)

rid = 'R' + str(int(jid.split(".")[0]) + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.server.submit() returns the rid. No need to assume the id.

try:
self.server.submit(resv)
except PbsSubmitError as e:
self.assertTrue(msg in e.msg[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an else: self.fail() to both of these try/except blocks (here and below)

@@ -14262,7 +14266,7 @@ class Reservation(ResourceResv):

dflt_attributes = {}

def __init__(self, username=TEST_USER, attrs=None, hosts=None):
def __init__(self, username=TEST_USER, attrs=None, hosts=None, job=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, maybe key off of the reservation attribute instead of adding an argument to Reservation()? That's how it will work in API mode.

}
if (new_conv_str != NULL)
log_eventf(PBSEVENT_DEBUG2, PBS_EVENTCLASS_HOOK, LOG_ERR, phook->hook_name,
"Found job "ATTR_create_resv_from_job" attribute flagged to be set");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the %s was taken out, you can just use log_event() now. Also, put spaces after the " and before the next opening " around the macro. Right now it looks like it is quoted and not the other strings.

pjob->ji_qs.ji_jobid, log_buffer);
pjob->ji_qs.ji_jobid, "batch request allocation failed");
log_event(PBSEVENT_SYSTEM, PBS_EVENTCLASS_JOB, LOG_ERR,
pjob->ji_qs.ji_jobid, "Could not create reservation from the job");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine these log messages into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to keep 2 separate log messages, but let us combine them into one.

if (resc_def != NULL) {
job_resc_entry = find_resc_entry(&pjob->ji_wattr[(int)JOB_ATR_resource], resc_def);
job_resc_entry = (resource *)GET_NEXT(pjob->ji_wattr[(int)JOB_ATR_resource].at_val.at_list);
while (job_resc_entry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are iterating, use a for() loop.

resc_def = find_resc_def(svr_resc_def, WALLTIME, svr_resc_size);
if (resc_def != NULL) {
resv_resc_entry = find_resc_entry(&presv->ri_wattr[(int)RESV_ATR_resource], resc_def);
if (resv_resc_entry == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If walltime is copied, how will this ever be NULL? We only get in here if we copied walltime, so it should be set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if JOB_ATR_resource does not have the entry for walltime, we set the walltime of 5 years.


(void)strncpy(qname, quename, PBS_MAXDEST);
qname[PBS_MAXDEST] ='\0';
pc = strchr(qname, (int)'@'); /* strip off server (fragment) */
if (pc)
*pc = '\0';
pque = (pbs_queue *)GET_NEXT(svr_queues);
while (pque != NULL) {
for (; pque != NULL; pque = (pbs_queue *)GET_NEXT(pque->qu_link)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add the initialization into the for() header? (here and below)

@@ -80,15 +80,17 @@ def test_create_resv_from_job_using_runjob_hook(self):
jid = self.server.submit(job)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI (above), you don't need setUp() if the only thing you do is call the super's setUp()


a = {ATTR_job: jid, 'reserve_state': "RESV_RUNNING",
'Resource_List.select': '1:ncpus=3',
'Resource_List.walltime': 9999}
self.server.expect(RESV, a, id=rid)

a = {ATTR_queue: rid}
Copy link
Contributor

@bhroam bhroam Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When checking for the state of a reservation do (MATCH_RE, 'RESV_RUNNING|5)' (here and other places)

@prakashcv13
Copy link
Contributor Author

hey @bhroam - could you please let me know if there are further comments.

Copy link
Contributor

@bhroam bhroam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1089,6 +1089,9 @@ find_job(char *jobid)
char buf[PBS_MAXSVRJOBID+1];

/* Make a copy of the job ID string before we modify it. */
if (jobid == NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think above comment is for the sprintf statement, can you move your changes above the comment?


/**
* @brief
* find_resvbyquename() - find a reservation by the name of its queue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_resv_by_quename

char tbuf1[256] = {0};
char tbuf2[256] = {0};
int is_maintenance = 0;
char buf[PBS_MAXUSER + PBS_MAXHOSTNAME + 64];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This buf is left uninitialized.

/*for resources that are not specified in the request and
*for which default values can be determined, set these values
*as the values for those resources
/* for resources that are not specified in the request and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it doxygen standard.

plhed = &newreq->rq_ind.rq_queuejob.rq_attr;
CLEAR_HEAD(newreq->rq_ind.rq_queuejob.rq_attr);
if ((psatl = attrlist_create(ATTR_resv_job, NULL, len)) != NULL) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this extra line.

@@ -1966,3 +1969,47 @@ req_defschedreply(struct batch_request *preq)

reply_send(preq);
}

void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen comment for the function is missing.

/**
* @brief
* convert_job_to_resv - creates a reservation out of the job
* and moved the job to the newly created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want the present tense here (move). Also, I you might want to make creates not plural.

@prakashcv13
Copy link
Contributor Author

hey @bhroam, shall we go ahead and merge this PR. If there are any more changes, please let me know.

Copy link
Contributor

@bhroam bhroam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bhroam bhroam merged commit c047c42 into openpbs:master Mar 3, 2020
@prakashcv13
Copy link
Contributor Author

thank you @bhroam and @riyazhakki for the review

@prakashcv13 prakashcv13 deleted the nasa_reliable branch June 2, 2020 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants