Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PBS doesn't finish already running jobs after overlay upgrade #1049

Merged
merged 5 commits into from Apr 2, 2019

Conversation

Projects
None yet
3 participants
@Bhagat-Rajput
Copy link
Contributor

commented Mar 24, 2019

Bug/feature Description

  • PBS doesn't finish already running jobs after overlay upgrade if starts the mom with '-p' option.
  • Steps to reproduce:
  1. Install any older PBS version rpm.
  2. Submit few jobs and should be in R state.
  3. kill -INT mom_process_id
  4. Stop the PBS complex(/etc/init.d/pbs stop)
  5. Upgrade PBS with the 19 version rpm(rpm -U newer_version.rpm)
  6. In pbs.conf file set PBS_START_MOM=0
  7. Start the PBS(/etc/init.d/pbs start)
  8. Start the mom(execution host)(pbs_mom -p)
  9. Make sure all jobs should be finished after the given time and run_count should be 1.(In failed scenario, Jobs never ends).

Affected Platform(s)

  • ALL

Cause / Analysis / Design

  • Updated jobfix and taskfix structure was missing in the pbs_upgrade_job file thus Jobs did not fit into the new version's job structure.

Solution Description

  • Added new jobfix and taskfix structure to the file.

Testing logs/output

Checklist:

For further information please visit the Developer Guide Home.

@subhasisb

This comment has been minimized.

Copy link
Collaborator

commented Mar 24, 2019

Does this change support upgrading from 13 as well?

@Bhagat-Rajput

This comment has been minimized.

Copy link
Contributor Author

commented Mar 25, 2019

Yes @subhasisb, it supports from 13 to 19 as well.

#define ji_taskid ji_extended.ji_ext.ji_taskidx
#define ji_nodeid ji_extended.ji_ext.ji_nodeidx
#define JSVERSION_18 800
#define JSVERSION 900

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

It is better to clean-up the macros JSVERSION_514, JSVERSION_80 and its corresponding usage in the code as they are used for older version/s of PBS. Also do not use the values like 800 for PBS version 18. It looks like 800 meant for an older release. We need to change this number acoordingly.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 26, 2019

Author Contributor

Thanks for the review @suresh-thelkar, i think we can't change the value 800 because our
older versions are using the same value and could be directly affect the overlay upgrade because we are parsing the jobs on the basis of JSVERSION_18(800).

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

@Bhagat-Rajput, the value 800 meant for older release of PBS, since the code clean-up is happening now, we should use a number for the macro that goes into its respective structure which represents the release number for example we can use a value 1800 for JSVERSION_18. This is more readable than using 800.

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

As discussed offline we now write 1900 as a job structure version for the current release or 19.x.

@suresh-thelkar
Copy link
Contributor

left a comment

I have provided my initial set of comments. Please go through it.

* altered the size of the jobfix and taskfix structures.
* This tool is required due to the change in PBS macros which are defined in the pbs_ifl.h/server_limits.h
or in any other file and the same macros PBS uses in the job structure(see job.h) as well that alters the
size of the jobfix and taskfix structures.

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Why the description of this function is changed ?

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 26, 2019

Author Contributor

because previous description was claiming that tool is required only for change in single macro (PBS_MAXSERVERNAME), but instead of that PBS uses other macros as well in job structure.

char ji_queue[PBS_MAXQUEUENAME_V12+1];
char ji_destin[PBS_MAXROUTEDEST_V12+1];
int ji_un_type;
typedef struct jobfix_v12

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Better to clean-up this structure and its corresponding usage in the code if we are not going to support it.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 26, 2019

Author Contributor

cleaned up!

int ti_status;
pid_t ti_sid;
int ti_exitstat;
typedef struct taskfix_v12

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Better to clean-up this structure and its corresponding usage in the code if we are not going to support it.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

* Define macros that controlled the size of the jobfix and taskfix structure in
* pre 19 and post 13 versions. Append the _PRE19 suffix to each.
*/

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Better to modify the comment as we are going to have only one set of structures for both pre-19 and post-13.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

/* The string "Job_Name" should appear at address 01160 (octal) for v12 */
pos_new = lseek(fd, 01160, SEEK_SET);
if (pos_new != 01160) {
fprintf(stderr, "File format not recognized.\n");
goto check_job_file_exit;
}
errno = 0;
len = read(fd, &buf, 8);
if (len < 0) {
length = read(fd, &buf, 8);

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

All this code which belongs to older version can be removed if we are no longer going to support it.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

length = -1;
pos_new = lseek(fd, pos_saved, SEEK_SET);
memset(&old_jobfix_pre19, 0, sizeof(old_jobfix_pre19));
length = read(fd, (char *)&old_jobfix_pre19, sizeof(old_jobfix_pre19));

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

memset and read for the whole structure is not needed to just find out the version of the structure. We can only read the version and then decide accordingly.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

@suresh-thelkar
Copy link
Contributor

left a comment

I have provided my initial review comments. Please look into it.

*/
int
upgrade_job_file(int fd)
upgrade_job_file(int fd, int jobfix_version)

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

If we clean-up the code for older versions of PBS then this function gets called only for pre-19 case. So we actually no need to pass jobfix_version as an argument.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

if (len < 0) {
fprintf(stderr, "Failed to read temporary file [%s]\n",
errno ? strerror(errno) : "No error");
fprintf(stderr, "Failed to read input file [%s]\n",

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Is there any reason to change the comment to use input file instead of temporary file ?

fprintf(stderr, "Format not recognized, not enough fixed data.\n");
return 1;
}
if (old_jobfix_pre19.ji_jsversion != JSVERSION_18) {

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

I don't think we need to check the version here, it is already checked in the function upgrade_job_file().

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 26, 2019

Author Contributor

done!

* @return int
* @retval -1 : failure
* @retval 0 : success
*/
int
upgrade_task_file(char *taskfile)
upgrade_task_file(char *taskfile, int taskfile_version)

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

Here also no need to pass taskfile_version if we clean-up the code for older versions of PBS and then this function gets called only for pre-19 case then.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

ret = upgrade_task_file(namebuf);
ret = upgrade_task_file(namebuf, job_version);
if (ret != 0) {
fprintf(stderr, "Upgrade task failed\n");

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

As we are now throwing an error message in case of failure of an upgrade of task file, please mention the task file name for which the upgrade is failed.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

@suresh-thelkar

This comment has been minimized.

Copy link
Contributor

commented Mar 26, 2019

@Bhagat-Rajput, can you please provide the logs for other versions for which the upgrade is supported. Also fix the Codacy issues if they are valid.

len = read(fd, &buf, 8);
if (len < 0) {
/*---------- For PBSPro >=13 or <=18 versions job structure ---------- */
length = -1;

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 26, 2019

Contributor

No need to initialise length to -1.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!

@Bhagat-Rajput

This comment has been minimized.

Copy link
Contributor Author

commented Mar 26, 2019

Thanks for the review @suresh-thelkar, i've addressed your comments so please review it again.

@suresh-thelkar
Copy link
Contributor

left a comment

Added the comment.

pos_new = lseek(fd, pos_saved, SEEK_SET);
if (pos_new != 0) {
fprintf(stderr, "Couldn't set the file position to zero.\n");

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

Please print errno also as part of the message which tells us why lseek failed.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

fixed!

}
} else {
/* If current version, JSVERSION_80, read into place */
if (pj->ji_qs.ji_jsversion >= JSVERSION_18) {

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

change errno initialisation to zero.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!.

errno = 0;
len = read(fd, &buf, 8);
if (len < 0) {
memset(&old_jobfix_pre19.ji_jsversion, 0, sizeof(old_jobfix_pre19.ji_jsversion));

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

Instead of memset just initialise ji_jsversion to zero.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

done!.

goto check_job_file_exit;
}

check_job_file_exit:
pos_new = lseek(fd, pos_saved, SEEK_SET);
return ret;
if (pos_new != 0) {
fprintf(stderr, "Couldn't set the file position back to zero.\n");

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Mar 27, 2019

Contributor

Use errno in the fprintf.

This comment has been minimized.

Copy link
@Bhagat-Rajput

Bhagat-Rajput Mar 27, 2019

Author Contributor

added!.

@suresh-thelkar
Copy link
Contributor

left a comment

Most of the code changes look good. I have provided few more comments.

@Bhagat-Rajput

This comment has been minimized.

Copy link
Contributor Author

commented Mar 27, 2019

@suresh-thelkar, uploaded the fresh overlay upgrade test logs.Please look at it.

@suresh-thelkar
Copy link
Contributor

left a comment

Thanks for making all the code changes. Please make sure that you test all the relavent test cases after the review comments also.

@subhasisb

This comment has been minimized.

Copy link
Collaborator

commented Mar 28, 2019

@Bhagat-Rajput please check codacy warnings - looks like some or all of them can be addressed?
Also, please run test under valgrind

@Bhagat-Rajput

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2019

Fixed codacy warnings and ran overlay upgrade scenario in valgrind, log file attached below.
valgrind_overlay_upgrade.txt

@Bhagat-Rajput Bhagat-Rajput force-pushed the Bhagat-Rajput:overlay_upgrade branch from 5217944 to dd7b462 Apr 1, 2019

@subhasisb

This comment has been minimized.

Copy link
Collaborator

commented Apr 1, 2019

Will ignore the latest codacy warnings as the structure is required while we do not need to use all the structure members (of the job structure). The usage is perfectly legitimate.

@Bhagat-Rajput Bhagat-Rajput force-pushed the Bhagat-Rajput:overlay_upgrade branch from dd7b462 to db7dd4d Apr 2, 2019

@Bhagat-Rajput Bhagat-Rajput force-pushed the Bhagat-Rajput:overlay_upgrade branch from db7dd4d to 9c29977 Apr 2, 2019

@Bhagat-Rajput

This comment has been minimized.

Copy link
Contributor Author

commented Apr 2, 2019

Hello @subhasisb, I've tested the latest code changes (overlay upgrade from 18->19) and it's working fine, means all the old running jobs gets finished successfully after upgraded to newer version.
Added fresh testing logs and valgrind logs(pbs_mom and pbs_upgrade_job).
new_18_19_overlay_upgrade_logs.txt
valgrind_pbs_upgrade_job.txt
valgrind_pbs_mom.txt

@subhasisb subhasisb merged commit ec748e7 into PBSPro:master Apr 2, 2019

3 of 4 checks passed

Codacy/PR Quality Review Not up to standards. This pull request quality could be better.
Details
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.