Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database shutdown / server abort relating to duplicate queue creation #1076

Merged
merged 3 commits into from Apr 8, 2019

Conversation

Projects
None yet
4 participants
@nithinj
Copy link
Contributor

commented Apr 3, 2019

Describe Bug or Feature

Three issues are being addressed here related to queues

  1. database shutdown / server abort when trying to create the same queue second time which is having max queue name length .
    que_alloc() will copy queue name to a string which can hold only 14 characters. During the next attempt, 15 char long queue name is compared with this string in find_queuebyname() and proceeds to insert in database as it couldn't find a match. Database will see this as a duplicate entry and will throw a primary key violation resulting in a panic_stop_db

  2. If you create a queue, by default it wont be having any attributes with it. So it inserts empty string to the attribute field of pbs.queue table. However our database API's are not able to handle this empty string value and hence will not be able to load these values back to the memory after a restart. Hence qstat -Qf will not display this queue and re-creating this queue will cause server abort as it is unable to handle primary key violation exception in the db.

  3. After an overlay upgrade all the queues are found missing if one of the queue does not have any attribute associated with it:
    Overlay upgrade sql commands are trying to insert a NULL (as no attribute is present) and hence resulting in an error due to the NOT NULL constraint in the table defenition. Due to this error attribute fields of other entries are also filled with empty string rather than the actual value. After upgrade a fresh incarnation of the server wont be able to read these values due to bug mentioned above.

Describe Your Change

  1. Add one more byte to hold null character.
  2. Update the database API's to read empty string
  3. Following changes are made to the sql command to fix overlay upgrade issue:
    Create attribute field without NOT NULL constraint -> Insert values including NULL -> convert all the NULL to empty string -> Introduce NOT NULL constraint for attribute field.

Attach Test Logs or Output

logs.txt

CC: @subhasisb

@suresh-thelkar
Copy link
Contributor

left a comment

I have added a comment. Please look into it.

@@ -121,7 +121,7 @@ que_alloc(char *name)
CLEAR_HEAD(pq->qu_jobs);
CLEAR_LINK(pq->qu_link);

snprintf(pq->qu_qs.qu_name, PBS_MAXQUEUENAME, "%s", name);
snprintf(pq->qu_qs.qu_name, PBS_MAXQUEUENAME + 1, "%s", name);

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Apr 4, 2019

Contributor

Instead of using PBS_MAXQUEUENAME + 1 in snprintf, please use sizeof(pq->qu_qs.qu_name). This way even if the macro/its value changes we no need to change this piece of code.

nithinj added some commits Apr 5, 2019

@nithinj

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2019

@nithinj

This comment has been minimized.

Copy link
Contributor Author

commented Apr 6, 2019

Attaching valgrind logs for both devbranch and mainline for the restart scenario. Could not find anything new related to the changes.
server_1554568698_mainline.log
server_1554577126_devbranch.log

@nithinj nithinj changed the title database shutdown / server abort relating to duplicate queue name creation database shutdown / server abort relating to duplicate queue creation Apr 6, 2019

@suresh-thelkar
Copy link
Contributor

left a comment

@nithinj, Most of the changes look good to me. As discussed I have couple of comments. Please look into it.

@@ -101,7 +101,13 @@ convert_array_to_db_attr_list(char *raw_array, pbs_db_attr_list_t *attr_list)
struct pg_array *array = (struct pg_array *) raw_array;
struct str_data *val = (struct str_data *)(raw_array + sizeof(struct pg_array));

if (ntohl(array->ndim) != 1 || ntohl(array->elemtype) != TEXTOID) {
if (ntohl(array->ndim) == 0) {

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Apr 8, 2019

Contributor

As discussed raw_array contains an empty string in case if either NULL or empty string itself is present in the attributes column of the corresponding table.
So we just need to check this condition and decide accordingly.
So no need to convert raw_array to pg_array if this is the case and we can exit early from the function.

This comment has been minimized.

Copy link
@nithinj

nithinj Apr 8, 2019

Author Contributor

I''m seeing empty in raw_array even for NON NULL values. As we discussed we need to make use of APIs like PQgetisnull which might perfectly seem to apply for this case, but for which we need to study a little bit and decide. In the interest of time we will proceed with the current fix and since we are anyway planning to change db layer to handle null during which time we can take care of these comments.

@@ -163,47 +163,59 @@ upgrade_pbs_schema_from_v1_3_0() {
CREATE EXTENSION hstore;
ALTER TABLE pbs.job ADD attributes public.hstore DEFAULT ''::public.hstore NOT NULL;
ALTER TABLE pbs.job ADD attributes public.hstore DEFAULT ''::public.hstore;

This comment has been minimized.

Copy link
@suresh-thelkar

suresh-thelkar Apr 8, 2019

Contributor

As discussed if we implement my next comment our database layer is now capable of handling NULL or empty string. So no need to modify these sql queries a lot. We only need to make sure that we remove the NOT NULL constraint for the attributes column and have empty string as default.
Same comment applies for other sql queries in this file.

This comment has been minimized.

Copy link
@nithinj

nithinj Apr 8, 2019

Author Contributor

This I can handle as part of the upcoming db layer change.

@suresh-thelkar

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

Also one more optional comment. If attr_list->attr_count = 0 in decode_attr_db() function can we exit early from this function ? If things are fine even if we do not exit early then I am ok even if we do not have this change in place.

@suresh-thelkar
Copy link
Contributor

left a comment

As discussed the API PQgetisnull looks like perfectly solves our issue but needs a little bit of study which we are doing as part of the another db layer change to handle null I am fine with the existing code changes. I sign off.

@agrawalravi90 agrawalravi90 merged commit 34babc8 into PBSPro:master Apr 8, 2019

4 checks passed

Codacy/PR Quality Review Up to standards. A positive pull request.
Details
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.