Feature/replica support#1

Merged
SecretSettler merged 16 commits into main from feature/replica-support
Feb 4, 2026

Conversation

@anyin233 anyin233 commented Jan 6, 2026

  1. Allow the submit API to start multiple replicas (for SwarmX)
  2. Add deletion of instances and offline workers
  3. Add graceful shutdown for workers (on receiving SIGINT, a worker notifies the head node, which sets the worker's status to OFFLINE, and then the worker exits)

anyin233 and others added 12 commits January 6, 2026 17:12
This commit adds comprehensive deletion support for PyLet instances
across all interfaces (HTTP API, Python API, and CLI).

Database Layer (db.py):
- Added delete_instance(instance_id) - delete by ID
- Added delete_instance_by_name(name) - delete by name
- Added delete_all_instances(status_filter) - bulk deletion
- Foreign key CASCADE automatically deletes allocations
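
A minimal sketch of the cascade behaviour described above, using the standard-library `sqlite3` module. The table and column names are assumptions made for illustration and are not taken from PyLet's actual schema in db.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Hypothetical schema, for illustration only.
    CREATE TABLE instances (
        id     TEXT PRIMARY KEY,
        name   TEXT UNIQUE,
        status TEXT NOT NULL
    );
    CREATE TABLE allocations (
        instance_id TEXT NOT NULL REFERENCES instances(id) ON DELETE CASCADE,
        gpu_index   INTEGER NOT NULL
    );
    """
)


def delete_instance(conn, instance_id):
    """Delete one instance by ID; return True if a row was removed."""
    cur = conn.execute("DELETE FROM instances WHERE id = ?", (instance_id,))
    conn.commit()
    return cur.rowcount > 0
```

With the pragma enabled, deleting a row from `instances` also removes its rows in `allocations`, so no separate cleanup query is needed.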

Controller Layer (controller.py):
- Added delete_instance(instance_id)
- Added delete_instance_by_name(name)
- Added delete_all_instances(status_filter)
- Pokes scheduler after deletion to handle freed resources

HTTP API (server.py):
- DELETE /instances/{instance_id} - delete by ID (returns 204)
- DELETE /instances/by-name/{instance_name} - delete by name (returns 204)
- DELETE /instances?status=X - delete all with optional status filter

Python Sync API (_sync_api.py):
- pylet.delete(name) or pylet.delete(id=...)
- pylet.delete_all(status="COMPLETED")

Python Async API (aio/__init__.py):
- await pylet.aio.delete(name) or await pylet.aio.delete(id=...)
- await pylet.aio.delete_all(status="COMPLETED")

HTTP Client (client.py):
- client.delete_instance(instance_id)
- client.delete_instance_by_name(name)
- client.delete_all_instances(status)

CLI (cli.py):
- pylet delete --instance-id <id>
- pylet delete --name <name>
- pylet delete --all [--status COMPLETED]
- Includes confirmation prompts (--yes to skip)

Documentation (docs/instance-state-machine.md):
- Added comprehensive state machine visualization
- Documented all state transitions and lifecycle

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated documentation to include the new instance deletion feature:

API Reference (docs/api_reference.md):
- Added pylet.delete(name, *, id) function documentation
- Added pylet.delete_all(*, status) function documentation
- Updated async API section with delete methods
- Updated API summary table with delete functions

CLI Reference (docs/cli_reference.md):
- Added comprehensive pylet delete command documentation
- Documented all deletion options (--instance-id, --name, --all, --status, --yes)
- Added safety features section (confirmation prompts)
- Added examples for single and bulk deletion
- Added best practices for safe deletion workflows
- Updated command summary table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit adds comprehensive deletion support for workers with
safety constraints ensuring only OFFLINE workers can be deleted.

Database Layer (db.py):
- Added delete_worker(worker_id) - delete by ID
- Added delete_all_offline_workers() - bulk deletion
- Foreign key CASCADE automatically deletes GPU inventory

Controller Layer (controller.py):
- Added delete_worker(worker_id) with OFFLINE status check
  Returns (success, error) tuple to differentiate "not_found" vs "online"
- Added delete_all_offline_workers() with in-memory state cleanup
- Cleans up desired_gen and gen_events for deleted workers

HTTP API (server.py):
- DELETE /workers/{worker_id} - Returns 204/404/400
  400 error if worker is not OFFLINE
- DELETE /workers - Delete all OFFLINE workers

Python Sync API (_sync_api.py):
- pylet.delete_worker(worker_id)
  Raises ValueError if worker is not OFFLINE
- pylet.delete_all_offline_workers()

Python Async API (aio/__init__.py):
- await pylet.aio.delete_worker(worker_id)
- await pylet.aio.delete_all_offline_workers()

HTTP Client (client.py):
- client.delete_worker(worker_id)
  Returns False if not found, raises ValueError if not OFFLINE
- client.delete_all_offline_workers()

CLI (cli.py):
- pylet delete-worker --worker-id <id>
- pylet delete-worker --all-offline
- Confirmation prompts (--yes to skip)
- Clear error messages for non-OFFLINE workers

Documentation:
- Updated API reference with worker deletion functions
- Updated CLI reference with delete-worker command
- Added examples and safety notes

Safety Features:
- ONLY OFFLINE workers can be deleted
- ONLINE and SUSPECT workers are protected (400 error)
- Confirmation prompts in CLI
- Automatic cleanup of in-memory controller state
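
The OFFLINE-only rule and the 204/404/400 mapping can be sketched as follows. This is an in-memory illustration, not the controller or server code: the `workers` dict, the `WorkerStatus` enum, and the `status_code` helper are assumptions; only the `(success, error)` tuple shape and the "not_found" / "online" error strings come from the description above.

```python
from enum import Enum


class WorkerStatus(Enum):
    ONLINE = "ONLINE"
    SUSPECT = "SUSPECT"
    OFFLINE = "OFFLINE"


def delete_worker(workers, worker_id):
    """Return (success, error); error is None, "not_found", or "online"."""
    status = workers.get(worker_id)
    if status is None:
        return False, "not_found"
    if status is not WorkerStatus.OFFLINE:
        # ONLINE and SUSPECT workers are both protected from deletion.
        return False, "online"
    del workers[worker_id]
    return True, None


def status_code(success, error):
    """Map a delete_worker result to the HTTP status the endpoint returns."""
    if success:
        return 204
    return 404 if error == "not_found" else 400
```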

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add optional delete parameter (default False) to cancel() method.
When True, deletes the instance after cancellation is requested.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replica support:
- Add `replicas` parameter to submit instances (default 1)
- When replicas > 1, instances are named `{base_name}-{index}`
- API returns `instance_id` for single replica, `instance_ids` for multiple
- Update sync/async APIs, CLI, client, controller, and server

Port configuration:
- Add `--port` option to CLI for customizing head node API port and worker HTTP port
- Worker accepts `http_port` parameter for log retrieval server

Bug fix:
- Handle 503/404 responses gracefully when fetching logs from unavailable workers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When workers receive SIGINT (Ctrl+C) or SIGTERM, they now:
1. Catch the signal and trigger graceful shutdown
2. Notify the head node via POST /workers/{id}/unregister endpoint
3. Head node immediately marks worker as OFFLINE

This enables faster failover compared to waiting for heartbeat timeout.

Changes:
- controller.py: Add unregister_worker() method
- server.py: Add POST /workers/{worker_id}/unregister endpoint
- worker.py: Add signal handlers and graceful shutdown logic
- test_controller.py: Fix test for auto-generated instance names

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When no name is provided, use instance_id[:8] as the auto-generated name
(original behavior) instead of a separate UUID. This ensures the name
is derived from the actual instance ID for single-replica submissions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update documentation to align with recent API changes:

- api_reference.md: Add full submit() signature with target_worker,
  gpu_indices, exclusive, labels, env, venv, and replicas parameters
- cli_reference.md: Add --port option to start command for both head
  and worker nodes
- cli_reference.md: Add --replicas, --target-worker, --gpu-indices,
  --exclusive, --label, --env, --venv options to submit command

Also includes:
- db.py: Fix foreign key constraint when deleting offline workers
- worker.py: Raise CancelledError after graceful shutdown notification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change heartbeat and monitor loops to check shutdown_event instead of while True
- Re-raise CancelledError in heartbeat loop when shutdown is triggered
- Add graceful instance termination before notifying head node
- Keep finally block to raise CancelledError for proper cleanup
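
A rough sketch of the loop change, assuming `shutdown_event` is an `asyncio.Event`. The function names and the `send_heartbeat` / `unregister` callables are hypothetical stand-ins (the latter for the POST /workers/{id}/unregister call); instance termination and the CancelledError handling in worker.py are left out.

```python
import asyncio
import signal


def install_signal_handlers(loop, shutdown_event):
    """Set shutdown_event on SIGINT or SIGTERM (Unix event loops only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, shutdown_event.set)


async def heartbeat_loop(shutdown_event, send_heartbeat, interval=1.0):
    """Send heartbeats until shutdown is requested, instead of `while True`."""
    while not shutdown_event.is_set():
        await send_heartbeat()
        try:
            # Sleep for one interval, but wake early if shutdown is requested.
            await asyncio.wait_for(shutdown_event.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def run_worker(shutdown_event, send_heartbeat, unregister, interval=1.0):
    """Run the heartbeat loop, then notify the head node before exiting."""
    await heartbeat_loop(shutdown_event, send_heartbeat, interval)
    await unregister()
```

Waiting on the event with a timeout, rather than calling `asyncio.sleep`, lets the loop exit as soon as the signal arrives instead of after the remaining heartbeat interval.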

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Workers now report their HTTP port during registration, which is stored
in the database. When proxying log requests, the server uses the worker's
registered port instead of hardcoding config.WORKER_HTTP_PORT. This fixes
503 errors when fetching logs from workers started with custom ports.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add missing documentation for:
- pylet.instances() labels parameter for filtering
- Instance.cancel() delete parameter
- Instance properties: display_status, gpu_indices, exclusive, labels, env, target_worker
- WorkerInfo.gpu_indices_available property
- Async API differences (instances lacks labels, cancel has delete param)
- Updated API Summary table with Instance properties

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of using static WORKER_PORT_MIN/MAX config, the instance port
range is now calculated from the worker's HTTP port. When a worker
starts with --port 16000, instances get ports 16001-16100.

This allows multiple workers on the same host without port conflicts.
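
The derivation can be sketched as below. The function name and the `size` parameter are illustrative assumptions; the range width of 100 is inferred from the 16000 to 16001-16100 example above.

```python
def instance_port_range(http_port, size=100):
    """Return (first, last) instance ports derived from the worker's HTTP port."""
    first = http_port + 1
    last = http_port + size
    if last > 65535:
        raise ValueError(f"port range {first}-{last} exceeds 65535")
    return first, last
```

Under this scheme, two workers on one host avoid overlapping ranges as long as their `--port` values are more than `size` apart, for example 16000 and 16200.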

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@SecretSettler SecretSettler self-requested a review January 8, 2026 13:12
anyin233 and others added 3 commits January 16, 2026 12:21
BREAKING CHANGE: The `replicas` parameter has been removed from all
submit functions. To create multiple instances, use a loop.
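
A possible shape for that loop, kept independent of the real API: `submit` here is any callable that accepts a command and a `name` keyword, standing in for the project's submit function, and the zero-based index is an assumption. The `{base_name}-{index}` naming follows the earlier replica commit.

```python
def submit_replicas(submit, command, base_name, replicas, **kwargs):
    """Submit `replicas` copies of `command`, named `{base_name}-{index}`."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    results = []
    for index in range(replicas):
        # Extra keyword arguments are passed through to every submission.
        results.append(submit(command, name=f"{base_name}-{index}", **kwargs))
    return results
```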

- Remove replicas parameter from controller, server, client, CLI
- Simplify return types (no more Union[str, List[str]])
- Update documentation with deprecation notice
- Add .ticktick-project to .gitignore

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@SecretSettler SecretSettler left a comment

Could you please remove the replica feature and resolve the conflict? Maybe place replica as an example?

anyin233 commented Feb 2, 2026

Replica support removed

Resolve conflicts in 10 files keeping feature branch functionality
(deletion APIs, graceful shutdown, http_port, port range derivation)
while accepting main's type-hint modernization, formatting, tooling
(ruff/mypy config), and documentation additions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

anyin233 commented Feb 2, 2026

Conflict resolved

@SecretSettler SecretSettler self-requested a review February 2, 2026 20:44
@SecretSettler SecretSettler merged commit 0e500f4 into main Feb 4, 2026
6 checks passed