[IMPORTANT] Introduction of the "Freeze" concept and manual lifecycle management for individual services #1861

@Dangles

Describe the new feature

I suggest adding two concepts for managing specific service instances within a task:

  1. Commands service <name> freeze / unfreeze:
    • freeze: Completely stops the instance and marks it with an internal flag that prohibits any automatic startup by the TickLoop.
    • Core Logic: During the "frozen" state, the internal minServiceCount check (located in DefaultTickLoop#startService) must count this server as "occupied/present". This ensures the system does not attempt to create or start a new server to replace the frozen one.
  2. Command service <name> autoStart <true/false>:
    • Allows overriding the auto-start behavior for a specific static instance without modifying the global settings of the ServiceTask.
    • When set to false, the service should be ignored by the automatic service selection logic in CloudServiceManager, making it startable only through manual command or API call.
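The two commands above could be backed by a small per-instance flag store. The following is only a sketch of that idea: `InstanceOverrides`, `Flags`, and every method name are hypothetical and do not exist in CloudNet, and actually stopping the instance on freeze is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these types or methods exist in CloudNet.
final class InstanceOverrides {

  // Per-instance flags proposed in this issue; defaults mean "standard automated service".
  record Flags(boolean frozen, boolean autoStart) {
    static final Flags DEFAULT = new Flags(false, true);
  }

  private final Map<String, Flags> flagsByService = new ConcurrentHashMap<>();

  Flags flags(String serviceName) {
    return flagsByService.getOrDefault(serviceName, Flags.DEFAULT);
  }

  // Would back "service <name> freeze" (stopping the instance is not modelled here).
  void freeze(String serviceName) {
    update(serviceName, f -> new Flags(true, f.autoStart()));
  }

  // Would back "service <name> unfreeze".
  void unfreeze(String serviceName) {
    update(serviceName, f -> new Flags(false, f.autoStart()));
  }

  // Would back "service <name> autoStart <true/false>".
  void setAutoStart(String serviceName, boolean autoStart) {
    update(serviceName, f -> new Flags(f.frozen(), autoStart));
  }

  // The tick loop may only pick instances that are neither frozen nor opted out.
  boolean mayStartAutomatically(String serviceName) {
    Flags f = flags(serviceName);
    return !f.frozen() && f.autoStart();
  }

  // A manual command or API call is only blocked by the frozen flag.
  boolean mayStartManually(String serviceName) {
    return !flags(serviceName).frozen();
  }

  private void update(String serviceName, UnaryOperator<Flags> change) {
    flagsByService.compute(serviceName,
        (name, current) -> change.apply(current == null ? Flags.DEFAULT : current));
  }
}
```

Keeping `frozen` and `autoStart` as separate flags means unfreezing an instance does not silently re-enable automatic starts for it.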

Why do you need this feature?

As a DevOps engineer managing large-scale networks (200–1500+ players) with dozens of static servers, "targeted" maintenance is a frequent necessity. Tasks like wiping a specific SMP instance, performing seasonal map cleanups, or manually updating playerdata require the server to be completely offline to prevent file corruption.
Currently, CloudNet's minServiceCount mechanism makes this impossible to automate safely. As soon as a static instance is stopped for maintenance, the TickLoop detects a "shortage" and immediately tries to restart the same instance or spawn a replacement.

This leads to:

  • Persistent file access conflicts.
  • Data corruption due to concurrent writes.
  • Unnecessary resource load on the nodes.

Changing the minServiceCount for the whole task or putting the entire task into maintenance mode is not viable, as it affects all other healthy servers in that group.

Default Task Config Examples

This implementation would allow DevOps engineers to fully automate maintenance (via Ansible/API) for specific servers without "fighting" the CloudNet automation.

Technical reference:
The issue resides in the current implementation of eu.cloudnetservice.node.impl.tick.DefaultTickLoop#startService, where only services with ServiceLifeCycle.RUNNING are counted towards the minimum service requirement.
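To illustrate the requested change in counting, here is a minimal sketch that does not use the real CloudNet classes. `Lifecycle` is a local stand-in for `ServiceLifeCycle`, and `Service`, `missingCountingRunningOnly`, and `missingCountingFrozen` are hypothetical names. The first method mirrors the behaviour described above (only running services count); the second also counts frozen instances as present.

```java
import java.util.List;

// Hypothetical sketch of the counting rule; not the actual DefaultTickLoop code.
final class MinServiceCountSketch {

  // Local stand-in for CloudNet's ServiceLifeCycle.
  enum Lifecycle { PREPARED, RUNNING, STOPPED }

  record Service(String name, Lifecycle lifecycle, boolean frozen) {}

  // Behaviour as described in this issue: only RUNNING services count.
  static int missingCountingRunningOnly(int minServiceCount, List<Service> services) {
    long present = services.stream()
        .filter(s -> s.lifecycle() == Lifecycle.RUNNING)
        .count();
    return (int) Math.max(0, minServiceCount - present);
  }

  // Proposed behaviour: a frozen instance counts as occupied even though it is stopped.
  static int missingCountingFrozen(int minServiceCount, List<Service> services) {
    long present = services.stream()
        .filter(s -> s.frozen() || s.lifecycle() == Lifecycle.RUNNING)
        .count();
    return (int) Math.max(0, minServiceCount - present);
  }

  public static void main(String[] args) {
    List<Service> ffa = List.of(
        new Service("FFA-1", Lifecycle.STOPPED, true),   // frozen for maintenance
        new Service("FFA-2", Lifecycle.RUNNING, false),
        new Service("FFA-3", Lifecycle.RUNNING, false));

    System.out.println(missingCountingRunningOnly(3, ffa)); // prints 1
    System.out.println(missingCountingFrozen(3, ffa));      // prints 0
  }
}
```

With the frozen-aware count, the loop sees no shortage and has no reason to restart FFA-1 or spawn a replacement.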

Configuration Example (ServiceTask properties):
To support this per-instance, these flags could be stored in the properties of the ServiceTask or directly in the service's own configuration/snapshot. For a task managing multiple static instances, it might look like this:

{
  "name": "FFA",
  "minServiceCount": 3,
  "staticServices": true,
  "properties": {
    "instanceOverrides": {
      "FFA-1": {
        "autoStart": false,
        "frozen": true,
        "maintenance": true
      },
      "FFA-2": {
        "autoStart": true,
        "frozen": false,
        "maintenance": false
      },
      "FFA-3": {
        "autoStart": false,
        "frozen": false,
        "maintenance": true
      }
    }
  }
}
  • FFA-1: Frozen. It still counts as occupied towards minServiceCount, is never started by the TickLoop, and cannot be started manually without unfreezing.
  • FFA-2: Operates as a standard automated service.
  • FFA-3: Exists and is ready, but will never be started by the TickLoop automation; it requires a manual service FFA-3 start command.
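Reading these flags could look roughly like the sketch below, assuming the task properties are available as a plain nested `Map` after JSON parsing. `OverrideLookup`, `InstanceOverride`, and `resolve` are hypothetical names, and the defaults (autoStart on, frozen and maintenance off) are an assumption so that instances without an entry behave like a standard automated service.

```java
import java.util.Map;

// Hypothetical sketch: resolves the proposed "instanceOverrides" entry for one service.
final class OverrideLookup {

  record InstanceOverride(boolean autoStart, boolean frozen, boolean maintenance) {
    // Assumed defaults for instances that have no override entry.
    static final InstanceOverride DEFAULT = new InstanceOverride(true, false, false);
  }

  static InstanceOverride resolve(Map<String, Object> properties, String serviceName) {
    if (!(properties.get("instanceOverrides") instanceof Map<?, ?> overrides)) {
      return InstanceOverride.DEFAULT;
    }
    if (!(overrides.get(serviceName) instanceof Map<?, ?> entry)) {
      return InstanceOverride.DEFAULT;
    }
    return new InstanceOverride(
        flag(entry, "autoStart", true),
        flag(entry, "frozen", false),
        flag(entry, "maintenance", false));
  }

  // Missing or non-boolean values fall back to the default instead of failing.
  private static boolean flag(Map<?, ?> entry, String key, boolean fallback) {
    return entry.get(key) instanceof Boolean value ? value : fallback;
  }

  public static void main(String[] args) {
    Map<String, Object> ffa1 = Map.of("autoStart", false, "frozen", true, "maintenance", true);
    Map<String, Object> overrides = Map.of("FFA-1", ffa1);
    Map<String, Object> properties = Map.of("instanceOverrides", overrides);

    System.out.println(resolve(properties, "FFA-1")); // explicit override
    System.out.println(resolve(properties, "FFA-9")); // no entry, falls back to defaults
  }
}
```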

Alternatives

I have considered solving this via a custom module, but it is technically ineffective. While a module can cancel the CloudServicePreLifecycleEvent, it cannot intervene in the DefaultTickLoop counting logic. The core will still see "insufficient services" and attempt a restart every second, resulting in an endless loop of cancelled events and console spam. This functionality needs to be native to the core's lifecycle management.

Other

No response

Issue uniqueness

  • Yes, this issue is unique. There are no similar issues.

Labels: s: needs triage, t: feature request
