bug: ClusteredApplicationManager state store operations can fail silently #251

@sfloess

Bug Description

The deploy(), start(), stop(), and undeploy() methods call state store operations (putApplicationDescriptor, putApplicationState) without verifying that the write succeeded or handling exceptions. A failed write therefore goes unnoticed and leaves the cluster state inconsistent with the node's local state.

Location

jplatform-cluster/src/main/java/org/flossware/jplatform/cluster/ClusteredApplicationManager.java:137-138,192,221,249

Problematic Code

@Override
public synchronized void deploy(ApplicationDescriptor descriptor) throws Exception {
    String appId = descriptor.getApplicationId();

    if (clusterManager != null && clusterManager.isJoined()) {
        logger.info("[{}] Deploying application in cluster mode", appId);

        // Write descriptor to cluster state
        stateStore.putApplicationDescriptor(appId, descriptor);  // Line 137 - can fail silently
        stateStore.putApplicationState(appId, ApplicationState.DEPLOYED);  // Line 138 - can fail silently
        
        // ... continues even if state store writes failed ...
    }
}

@Override
public synchronized void start(String applicationId) throws Exception {
    if (clusterManager != null && clusterManager.isJoined() && scheduler != null) {
        // ...
        super.start(applicationId);

        // Update cluster state
        stateStore.putApplicationState(applicationId, ApplicationState.RUNNING);  // Line 192 - can fail
    }
}

@Override
public synchronized void stop(String applicationId) throws Exception {
    if (clusterManager != null && clusterManager.isJoined() && scheduler != null) {
        // ...
        super.stop(applicationId);

        // Update cluster state
        stateStore.putApplicationState(applicationId, ApplicationState.STOPPED);  // Line 221 - can fail
    }
}

Impact

  • Application deployed locally but descriptor not in cluster state
  • Other nodes don't see the application
  • Application running but cluster state shows DEPLOYED or STOPPED
  • Monitoring dashboards show incorrect state
  • Leader makes decisions based on stale/incorrect state
  • No indication to caller that operation partially failed

Example

// Hazelcast network partition occurs
ClusteredApplicationManager manager = new ClusteredApplicationManager(...);

manager.deploy(descriptor);  
// putApplicationDescriptor fails due to partition
// putApplicationState fails due to partition
// Method continues, calls super.deploy()
// Application deployed locally
// Cluster state not updated
// Other nodes don't know about application
// No exception thrown

manager.start(appId);
// super.start() succeeds
// putApplicationState fails
// Cluster still shows DEPLOYED but app is RUNNING
// Leader might try to start it on another node
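
The start() sequence above could be reproduced in a test along the following lines. This is only a sketch: PartitionedStateStore and UncheckedManager are hypothetical stand-ins, not the jplatform classes, and it assumes the failed write is dropped without an exception, which the issue does not state explicitly.

```java
import java.util.HashMap;
import java.util.Map;

/** Reproduction sketch; all types are hypothetical stand-ins for the jplatform classes. */
final class SilentStateDivergenceSketch {

    enum ApplicationState { DEPLOYED, RUNNING, STOPPED }

    /** Stub store that drops writes while "partitioned" instead of throwing (assumed failure mode). */
    static final class PartitionedStateStore {
        private final Map<String, ApplicationState> states = new HashMap<>();
        boolean partitioned;

        void putApplicationState(String appId, ApplicationState state) {
            if (partitioned) {
                return; // write is lost and the caller is not told
            }
            states.put(appId, state);
        }

        ApplicationState getApplicationState(String appId) {
            return states.get(appId);
        }
    }

    /** Mirrors the unchecked call order in start(): local start first, then the state write. */
    static final class UncheckedManager {
        final Map<String, ApplicationState> localStates = new HashMap<>();
        final PartitionedStateStore store;

        UncheckedManager(PartitionedStateStore store) {
            this.store = store;
        }

        void start(String appId) {
            localStates.put(appId, ApplicationState.RUNNING);           // stands in for super.start()
            store.putApplicationState(appId, ApplicationState.RUNNING); // outcome never checked
        }
    }

    /** Returns true when the local view and the cluster view disagree. */
    static boolean diverged(UncheckedManager manager, String appId) {
        return manager.localStates.get(appId) != manager.store.getApplicationState(appId);
    }

    public static void main(String[] args) {
        PartitionedStateStore store = new PartitionedStateStore();
        store.putApplicationState("app-1", ApplicationState.DEPLOYED);
        UncheckedManager manager = new UncheckedManager(store);

        store.partitioned = true; // simulate the network partition
        manager.start("app-1");

        System.out.println("local=" + manager.localStates.get("app-1")
                + " cluster=" + store.getApplicationState("app-1")
                + " diverged=" + diverged(manager, "app-1"));
    }
}
```

Running main prints local=RUNNING cluster=DEPLOYED diverged=true, which is the mismatch described in the example.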

Proposed Fix

@Override
public synchronized void deploy(ApplicationDescriptor descriptor) throws Exception {
    String appId = descriptor.getApplicationId();

    if (clusterManager != null && clusterManager.isJoined()) {
        logger.info("[{}] Deploying application in cluster mode", appId);

        // Write descriptor to cluster state - must succeed before local deployment
        try {
            stateStore.putApplicationDescriptor(appId, descriptor);
            stateStore.putApplicationState(appId, ApplicationState.DEPLOYED);
        } catch (Exception e) {
            logger.error("[{}] Failed to update cluster state during deploy", appId, e);
            throw new Exception("Failed to update cluster state: " + e.getMessage(), e);
        }

        // If leader, try to assign to a node
        if (scheduler != null) {
            try {
                if (clusterManager.isLeader()) {
                    String assignedNode = scheduler.assignApplication(appId);
                    logger.info("[{}] Leader assigned application to node: {}", appId, assignedNode);
                }
            } catch (IllegalStateException e) {
                logger.debug("[{}] Lost leadership during assignment: {}", appId, e.getMessage());
            } catch (Exception e) {
                logger.error("[{}] Failed to assign application", appId, e);
                // Record the failure in cluster state (best effort)
                try {
                    stateStore.putApplicationState(appId, ApplicationState.FAILED);
                } catch (Exception se) {
                    logger.error("[{}] Failed to update state to FAILED", appId, se);
                }
                throw new Exception("Failed to assign application: " + e.getMessage(), e);
            }

            // Check if assigned to local node
            if (scheduler.isAssignedToLocalNode(appId)) {
                logger.info("[{}] Application assigned to local node, deploying locally", appId);
                try {
                    super.deploy(descriptor);
                } catch (Exception e) {
                    // Update cluster state to reflect failure
                    try {
                        stateStore.putApplicationState(appId, ApplicationState.FAILED);
                    } catch (Exception se) {
                        logger.error("[{}] Failed to update state to FAILED", appId, se);
                    }
                    throw e;
                }
            }
        }
    } else {
        // Standalone mode
        logger.info("[{}] Deploying application in standalone mode", appId);
        super.deploy(descriptor);
    }
}

Similar fixes are needed for the start(), stop(), and undeploy() methods.
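
As a rough illustration of how the same pattern might carry over to start(), here is a minimal, self-contained sketch. StateStore, LocalLifecycle, InMemoryLifecycle, and ClusterStateException are hypothetical stand-ins rather than the real jplatform types, and rolling back the local start when the write fails is an assumed policy that this issue does not prescribe.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: every type here is a hypothetical stand-in, not the jplatform API. */
final class StartWithStateCheckSketch {

    enum ApplicationState { DEPLOYED, RUNNING, STOPPED, FAILED }

    /** Hypothetical stand-in for the cluster state store. */
    interface StateStore {
        void putApplicationState(String appId, ApplicationState state) throws Exception;
    }

    /** Hypothetical stand-in for the local super.start() / super.stop() calls. */
    interface LocalLifecycle {
        void start(String appId) throws Exception;
        void stop(String appId) throws Exception;
    }

    /** Trivial local lifecycle that only tracks which applications are running. */
    static final class InMemoryLifecycle implements LocalLifecycle {
        private final Set<String> running = new HashSet<>();
        @Override public void start(String appId) { running.add(appId); }
        @Override public void stop(String appId) { running.remove(appId); }
        boolean isRunning(String appId) { return running.contains(appId); }
    }

    /** Dedicated type so callers can tell a cluster-state failure from a local one. */
    static final class ClusterStateException extends Exception {
        ClusterStateException(String message, Throwable cause) { super(message, cause); }
    }

    static void start(String appId, LocalLifecycle local, StateStore store) throws Exception {
        local.start(appId); // stands in for super.start(applicationId)
        try {
            store.putApplicationState(appId, ApplicationState.RUNNING);
        } catch (Exception writeFailure) {
            // Assumed policy: undo the local start so local and cluster state stay aligned.
            try {
                local.stop(appId);
            } catch (Exception stopFailure) {
                writeFailure.addSuppressed(stopFailure);
            }
            throw new ClusterStateException(
                    "[" + appId + "] Failed to record RUNNING in cluster state", writeFailure);
        }
    }

    public static void main(String[] args) {
        InMemoryLifecycle local = new InMemoryLifecycle();
        StateStore partitioned = (appId, state) -> {
            throw new IllegalStateException("simulated partition");
        };
        try {
            start("app-1", local, partitioned);
        } catch (Exception e) {
            System.out.println("caller sees: " + e.getMessage());
        }
        System.out.println("still running locally: " + local.isRunning("app-1"));
    }
}
```

Whether to roll back or to leave the application running and retry the write is a design choice. The rollback shown here favors keeping the cluster view authoritative, at the cost of an extra stop on a transient failure.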
