This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Diagnostic invoking csm_allocation_create/delete #112

Closed
sanomiya opened this issue Jun 13, 2018 · 8 comments

Comments

@sanomiya
Contributor

The diagnostic issues the command csm_allocation_create, passing the state "running",
and at the end of the run it issues csm_allocation_delete. The csm_allocation_delete call fails with rc=25.
When I invoke the command manually, it reports:

c699mgt00: > /opt/ibm/csm/bin/csm_allocation_delete -a 14222
[csmapi][warning]       /home/ppsbld/workspace/PUBLIC_CAST_V1.1.1_ppc64LE_RH7.5_ProdBuild/csmi/src/common/src/csmi_common_utils.c-147: the Error Flag Set
[csmapi][error] csmi_sendrecv_cmd failed: 25 - csm_allocation_delete[838226367]; Database Error Message: ERROR:  Detected a multicast operation in progress for allocation, rejecting delete.

This message comes from the database function fn_csm_allocation_delete_start, when it checks for a valid state transition.

The database says:

c699mgt00:/home/diagadmin/log > psql -U postgres -d csmdb -c "select  state  from csm_allocation  where allocation_id=14222"
   state
------------
 to-running
(1 row)
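
A wrapper script could make the same check before it attempts the delete. The following is only a sketch: the function name is hypothetical and not part of CSM, and it assumes (based solely on the `to-running` value above) that a state with a `to-` prefix marks a transition that has not finished.

```shell
# Hypothetical helper, not part of CSM: classify a csm_allocation.state value.
# ASSUMPTION: states that start with "to-" mark a transition still in progress.
allocation_state_is_settled() {
    case "$1" in
        to-*) return 1 ;;   # e.g. "to-running": transition not finished
        "")   return 1 ;;   # empty result, e.g. no matching row
        *)    return 0 ;;   # e.g. "running"
    esac
}
```

The state string could come from the same query as above, run with psql's `-t -A` options so that only the bare value is printed, for example `allocation_state_is_settled "$(psql -U postgres -d csmdb -t -A -c "select state from csm_allocation where allocation_id=14222")"`.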

Why is csm_allocation_create not setting the state to "running"?

@mew2057
Contributor

mew2057 commented Jun 14, 2018

I have a fix pending in pull request #114. We need some test cases in the regression suite for this.

@pdlun92
Contributor

pdlun92 commented Jun 19, 2018

@sanomiya @mew2057 feel free to close this issue if it is resolved.

I've created a separate issue for tracking test case development.

@sanomiya
Contributor Author

Works now. Closing this issue.

Issue the create allocation command:

c650mnp05:/u/aldas/CAST/work/csm/bin > sudo ./csm_allocation_create -n 'c650f99p18'  -s 'running' -t 'diagnostics' 
[csmapi][warning]	Invalid 'primary_job_id supplied (<= 0), setting to 1.
---
allocation_id: 1
num_nodes: 1
- compute_nodes: c650f99p18
user_name: root
user_id: 0
state: running
type: diagnostics
job_submit_time: 2018-06-19 10:49:31

Database:

c650mnp05:/u/aldas/CAST/work/csm/bin > psql -U postgres -d csmdb -c "select * from csm_allocation where allocation_id=1"
 allocation_id | primary_job_id | secondary_job_id | ssd_file_system_name | launch_node_name | isolated_cores | user_flags | system_flags | ssd_min | ssd_max | num_nodes | num_processors | num_gpus | projected_memory |  state  |    type     | job_type | user_name | user_id | user_group_id | user_group_name | user_script |         begin_time         | account | comment | job_name |   job_submit_time   | queue | requeue | time_limit | wc_key 
---------------+----------------+------------------+----------------------+------------------+----------------+------------+--------------+---------+---------+-----------+----------------+----------+------------------+---------+-------------+----------+-----------+---------+---------------+-----------------+-------------+----------------------------+---------+---------+----------+---------------------+-------+---------+------------+--------
             1 |              1 |                0 |                      |                  |              0 |            |              |       0 |       0 |         1 |              0 |        0 |                0 | running | diagnostics | batch    | root      |       0 |             0 |                 |             | 2018-06-19 10:49:31.412651 |         |         |          | 2018-06-19 10:49:31 |       |         |          0 | 
(1 row)
c650mnp05:/u/aldas/CAST/work/csm/bin > ./csm_allocation_query -a 1|grep state
state:                          running

@sanomiya sanomiya reopened this Jul 20, 2018
@sanomiya
Contributor Author

We still see this problem on the Austin system, with 1.1.2:

# rpm -qa|grep ibm-csm
ibm-csm-hcdiag-1.1.2-585.noarch
ibm-csm-db-1.1.2-176.noarch
ibm-csm-api-1.1.2-176.ppc64le
ibm-csm-restd-1.1.2-176.ppc64le
ibm-csm-core-1.1.2-176.ppc64le
ibm-csm-bds-1.1.2-176.noarch

We deleted the previous database and created it again:

# /opt/ibm/csm/db/csm_db_script.sh -n csmdb  
-----------------------------------------------------------------------------------------------------------------
[Start   ] Welcome to CSM database automation script.
[Info    ] PostgreSQL is installed
[Info    ] csmdb database user: csmdb already exists
[Complete] csmdb database created.
[Complete] csmdb database tables created.
[Complete] csmdb database functions and triggers created.
[Complete] csmdb table data loaded successfully into csm_db_schema_version
[Complete] csmdb table data loaded successfully into csm_ras_type
[Info    ] csmdb DB schema version (15.0)
---------------------------------------------------------------------------------------------------------------
# su - postgres                    
Last login: Fri Jul 20 10:40:19 CDT 2018 on pts/54
-bash-4.2$ psql csmdb                             
psql (9.2.23)                                     
Type "help" for help.                             

csmdb=# \x
Expanded display is on.
csmdb=# \df+ fn_csm_allocation_node_sharing_status 
List of functions                                  
-[ RECORD 1 ]-------+--------------------------------------------------------------------------------------------------------------------                                                                                                                   
Schema              | public                                                                                                  
Name                | fn_csm_allocation_node_sharing_status                                                                   
Result data type    | void                                                                                                    
Argument data types | i_allocation_id bigint, i_type text, i_state text, i_shared boolean, i_nodenames text[]                 
Type                | normal                                                                                                  
Volatility          | volatile                                                                                                
Owner               | csmdb                                                                                                   
Language            | plpgsql                                                                                                 
Source code         |                                                                                                         
                    | DECLARE                                                                                                 
                    |     bad_nodes text[];                                                                                   
                    |     missing_nodes text[];                                                                               
                    |     running_nodes text[];                                                                               
                    | BEGIN                                                                                                   
                    |     --LOCK TABLE csm_allocation_node IN EXCLUSIVE MODE;                                                 
                    |     PERFORM 1 FROM csm_allocation_node WHERE allocation_id=i_allocation_id FOR UPDATE;                  
                    |                                                                                                         
                    |     -- TODO Should this be consolidated into one Query with missing_nodes?                              
                    |     -- Determine if any of the supplied nodes were not ready, or not computes.                          
                    |     bad_nodes := ARRAY(                                                                                 
                    |         SELECT node_name                                                                                
                    |         FROM csm_node                                                                                   
                    |         WHERE node_name = ANY(i_nodenames) AND (state != 'IN_SERVICE' OR type != 'compute')             
                    |     );                                                                                                  
                    |                                                                                                         
                    |     -- Check for any missing nodes.                                                                     
                    |     missing_nodes := ARRAY (                                                                            
                    |         SELECT p.node_name                                                                              
                    |         FROM (SELECT unnest(i_nodenames) as node_name) p                                                
                    |         LEFT JOIN csm_node n on n.node_name = p.node_name                                               
                    |         WHERE n.node_name IS NULL                                                                       
                    |     );                                                                                                  
                    |                                                                                                         
                    |     -- If this is not a diagnostic and any bad nodes were found                                         
                    |     -- OR there were nodes that couldn't be found, raise an exception.                                  
                    |     IF (i_type != 'diagnostics' AND array_length(bad_nodes, 1) > 0 )                                    
                    |         OR  array_length(missing_nodes,1) > 0 THEN                                                      
                    |         RAISE EXCEPTION 'The following nodes were not available: %                                      
                    | The following nodes were not found: %',                                                                 
                    |                 array_to_string(bad_nodes, ', ', '*' ),                                                 
                    |                 array_to_string(missing_nodes, ', ', '*');                                              
                    |     END IF;                                                                                             
                    |                                                                                                         
                    |     -- If the allocation is being created in the running state.                                         
                    |     IF (i_state='running') THEN                                                                         
                    |         IF (NOT(i_shared)) THEN                                                                         
                    |                                                                                                         
                    |             IF EXISTS (                                                                                 
                    |                 SELECT state                                                                            
                    |                 FROM csm_allocation_node                                                                
                    |                 WHERE node_name = ANY(i_nodenames) AND state!='staging-in' AND state!='staging-out'  )  
                    |             THEN                                                                                        
                    |                 running_nodes := ARRAY(                                                                 
                    |                     SELECT node_name                                                                    
                    |                     FROM csm_allocation_node                                                            
                    |                     WHERE node_name = ANY(i_nodenames) AND state!='staging-in' AND state!='staging-out');                                                                                                                             
                    |                                                                                                         
                    |                 RAISE EXCEPTION 'Node(s) are currently busy, unable to request exclusive job. Active Nodes: %',                                                                                                                       
                    |                         array_to_string(running_nodes, ', ', '*');                                      
                    |             END IF;                                                                                     
                    |                                                                                                         
                    |         ELSIF EXISTS (                                                                                  
                    |             SELECT state                                                                                
                    |             FROM csm_allocation_node                                                                    
                    |             WHERE node_name = ANY(i_nodenames) AND state!='staging-in' AND state!='staging-out' AND NOT shared )                                                                                                                      
                    |         THEN                                                                                            
                    |             running_nodes := ARRAY(                                                                     
                    |                 SELECT node_name                                                                        
                    |                 FROM csm_allocation_node                                                                
                    |                 WHERE node_name = ANY(i_nodenames) AND state!='staging-in' AND state!='staging-out');   
                    |                                                                                                         
                    |             RAISE EXCEPTION 'Node(s) can not be shared because an exclusive job currently active. Active Nodes: %',                                                                                                                   
                    |                 array_to_string(running_nodes, ', ', '*');                                              
                    |         END IF;                                                                                         
                    |     ELSIF i_state!='staging-in' THEN                                                                    
                    |         RAISE EXCEPTION using message = 'Inserting into invalid state';                                 
                    |     --ELSIF i_state='stage-out' THEN                                                                    
                    |         --RAISE EXCEPTION using message = 'Inserting into the stage-out state';                         
                    |     END IF;                                                                                             
                    |                                                                                                         
                    |     -- If no execption was raised insert into the allocation_node table.                                
                    |     INSERT INTO csm_allocation_node(                                                                    
                    |         allocation_id,                                                                                  
                    |         shared,                                                                                         
                    |         state,                                                                                          
                    |         node_name)                                                                                      
                    |     SELECT                                                                                              
                    |         i_allocation_id, i_shared, i_state, node                                                        
                    |     FROM                                                                                                
                    |         unnest(i_nodenames) as n(node);                                                                 
                    |     EXCEPTION                                                                                           
                    |         WHEN others THEN                                                                                
                    |             RAISE EXCEPTION                                                                             
                    |             USING ERRCODE = sqlstate,                                                                   
                    |                 MESSAGE = 'error_handling_test: ' || sqlstate || '/' || sqlerrm;                        
                    | END; -- releases all locks                                                                              
                    |                                                                                                         
Description         | csm_allocation_sharing_status function to handle exclusive usage of shared nodes on INSERT.             

csmdb=# 
csmdb=# \q

@mew2057
Contributor

mew2057 commented Jul 23, 2018

     -- If this is not a diagnostic and any bad nodes were found                                         
     -- OR there were nodes that couldn't be found, raise an exception.                                  
     IF (i_type != 'diagnostics' AND array_length(bad_nodes, 1) > 0 )                                    
         OR  array_length(missing_nodes,1) > 0 THEN                                                      
         RAISE EXCEPTION 'The following nodes were not available: %                                      
 The following nodes were not found: %',                                                                 
                 array_to_string(bad_nodes, ', ', '*' ),                                                 
                 array_to_string(missing_nodes, ', ', '*');                                              
     END IF;                              

This code is not correct. The code in mainline is as follows:

    -- If this is not a diagnostic and any bad nodes were found
    -- OR there were nodes that couldn't be found, raise an exception.
    ELSIF (array_length(bad_nodes, 1) > 0 )
        OR  array_length(missing_nodes,1) > 0 THEN
        RAISE EXCEPTION 'The following nodes were not available: % 
The following nodes were not found: %',
                array_to_string(bad_nodes, ', ', '*' ),
                array_to_string(missing_nodes, ', ', '*');
END IF;

If this is not present in csm_create_triggers.sql, then the RPM was most likely built before the last fix was submitted.
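
One way to script that check is to search the SQL text for the old condition. This is a rough sketch, not a CSM tool: the function name is hypothetical, and it assumes the pre-fix condition quoted above appears only in builds that lack the fix.

```shell
# Hypothetical helper, not part of CSM: read SQL text on stdin and succeed
# if it still contains the pre-fix condition quoted above.
# ASSUMPTION: this exact string is absent from builds that include the fix.
has_old_diag_check() {
    grep -qF "i_type != 'diagnostics' AND array_length(bad_nodes, 1) > 0"
}
```

It reads standard input, so it can be pointed at the installed csm_create_triggers.sql (wherever the RPM places it) with `has_old_diag_check < csm_create_triggers.sql`, or at the live function by piping in the `\df+ fn_csm_allocation_node_sharing_status` output from a psql 9.2 session like the one above.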

@sanomiya
Contributor Author

I'm chasing down the source of the 1.1.2-176 RPM. I assumed that all 1.1.2-xxx builds had the change, but that might not be true.

@pdlun92
Contributor

pdlun92 commented Jul 23, 2018

Looking around a bit, it seems that the fix was merged into master on Jun 18th. Judging by the regression logs, they would need at least version 1.1.2-432 (built with 559ee9b as the latest git commit).
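
To compare installed builds against that minimum, something like the following could work. It is a sketch under stated assumptions: the function name is hypothetical, the 1.1.2-432 threshold is only the estimate given above, and the comparison uses GNU `sort -V` rather than RPM's own version comparison, so the two may disagree on unusual version strings.

```shell
# Hypothetical helper, not part of CSM: succeed when version-release $1
# is greater than or equal to $2.
# ASSUMPTION: GNU sort with -V (version sort) is available.
build_at_least() {
    have="$1"
    need="$2"
    # After a version sort, the smaller of the two strings comes first;
    # $have is new enough exactly when that first line is $need.
    [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}
```

For example, `build_at_least 1.1.2-176 1.1.2-432` fails, matching the ibm-csm-db build listed earlier in this thread. The installed values can be printed with `rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n' 'ibm-csm*'`.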

@sanomiya
Contributor Author

It is fixed in 1.2.0. Closing this issue.
