Merged
3 changes: 2 additions & 1 deletion cerebrium/container-images/custom-web-servers.mdx
@@ -52,7 +52,8 @@ The configuration requires three key parameters:
<Info>
For ASGI applications like FastAPI, include the appropriate server package
(like `uvicorn`) in your dependencies. After deployment, your endpoints become
available at `https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
available at
`https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
</Info>

Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation.
12 changes: 9 additions & 3 deletions cerebrium/getting-started/introduction.mdx
@@ -50,7 +50,11 @@ We can then run this function in the cloud and pass it a prompt.

```bash
cerebrium run main.py::run --prompt "Hello World!"
```
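For reference, the `run` function being invoked might look like the minimal sketch below. The actual example file is collapsed in this diff, so the function body and names here are illustrative assumptions, not the real example code:

```python
# main.py -- a minimal function that `cerebrium run main.py::run` could invoke.
# The body below is a hypothetical stand-in for the collapsed example file.
def run(prompt: str) -> dict:
    # Log the prompt so it shows up in the cloud logs
    print(f"Received prompt: {prompt}")
    return {"result": f"You sent: {prompt}"}

if __name__ == "__main__":
    print(run("Hello World!"))
```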

You should see logs that output the prompt you sent in - this is running in the cloud!

Use the `run` command for quick code iteration, testing snippets, or one-off scripts that need a large CPU/GPU in the cloud.

Let us now turn this into a scalable REST endpoint - something we could put in production!

### 4. Deploy your app

@@ -60,11 +64,13 @@ Run the following command:

```bash
cerebrium deploy
```

This turns the function into a persistent, callable [endpoint](/cerebrium/endpoints/inference-api) that accepts JSON parameters (`prompt`) and can scale to thousands of requests automatically!

Once deployed, the app is callable through a POST endpoint `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/{function-name}` that takes a JSON body with a `prompt` parameter.
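As a sketch, calling the deployed endpoint from Python might look like this. The project ID, app name, function name, and token below are placeholders, not real values:

```python
import json
from urllib import request

# Placeholder identifiers -- substitute your own project ID, app name, and function
base = "https://api.aws.us-east-1.cerebrium.ai/v4"
url = f"{base}/p-example123/my-app/run"

payload = json.dumps({"prompt": "Hello World!"}).encode()
req = request.Request(
    url,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <YOUR_TOKEN>",  # placeholder token
    },
    method="POST",
)
# request.urlopen(req) would send the POST; omitted here since the IDs are fake
print(url)
```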

Great! You made it!

Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.

## How It Works

31 changes: 17 additions & 14 deletions cerebrium/scaling/graceful-termination.mdx
@@ -15,7 +15,7 @@ When Cerebrium needs to terminate a container, we do the following:

1. Stop routing new requests to the container.
2. Send a SIGTERM signal to your container.
3. Wait for `response_grace_period` seconds to elapse.
4. Send SIGKILL if the container hasn't stopped.
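In a custom runtime, capturing SIGTERM is what opens the grace window. A minimal, framework-agnostic Python sketch of the capture step (the drain logic that follows is up to your application):

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Flip a flag instead of exiting immediately; the main loop can
    # check it, stop accepting work, and drain in-flight requests.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate Cerebrium sending SIGTERM to this process (POSIX only)
os.kill(os.getpid(), signal.SIGTERM)
print(f"shutting_down={shutting_down}")  # prints shutting_down=True
```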

The flowchart below illustrates the sequence:

```mermaid
flowchart TD
A[SIGTERM sent] --> B[Cortex]
A --> C[Custom Runtime]

B --> D[automatically captured]
C --> E[User needs to capture]

D --> F[request finishes]
D --> G[response_grace_period reached]

E --> H[User logic]

F --> I[Graceful termination]
G --> J[SIGKILL]

H --> O[Graceful termination]
H --> G[response_grace_period reached]

J --> K[Gateway Timeout Error]
```

If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.


## Example: FastAPI Implementation

For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.

The code below tracks active requests using a counter and prevents new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.

@@ -63,11 +62,11 @@ lock = asyncio.Lock()
@asynccontextmanager
async def lifespan(app: FastAPI):
yield # Application startup complete

# Shutdown: runs when Cerebrium sends SIGTERM
global shutting_down
shutting_down = True

# Wait for active requests to complete
while active_requests > 0:
await asyncio.sleep(1)
@@ -79,7 +78,7 @@ async def track_requests(request, call_next):
global active_requests
if shutting_down:
raise HTTPException(503, "Shutting down")

async with lock:
active_requests += 1
try:
@@ -96,12 +95,15 @@ In your Dockerfile:
```dockerfile
# Exec form runs uvicorn as PID 1, so it receives SIGTERM directly
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Or in cerebrium.toml:

```toml
[cerebrium.runtime.custom]
entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]
```

In bash scripts:

```bash
exec fastapi run app.py --port ${PORT:-8000}
```
Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.
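The counter-and-drain idea in the FastAPI snippet above can be exercised without a web framework. The asyncio sketch below is an illustrative model of the mechanism, not Cerebrium or FastAPI API; all names are assumptions:

```python
import asyncio

active_requests = 0
shutting_down = False
lock = asyncio.Lock()

async def handle_request(duration: float) -> str:
    """Simulate one in-flight request, tracked by the shared counter."""
    global active_requests
    if shutting_down:
        raise RuntimeError("Shutting down")  # reject new work during drain
    async with lock:
        active_requests += 1
    try:
        await asyncio.sleep(duration)  # stand-in for real request work
        return "ok"
    finally:
        async with lock:
            active_requests -= 1  # always decrement, even on error

async def drain() -> None:
    """What the lifespan shutdown phase does: wait until nothing is in flight."""
    global shutting_down
    shutting_down = True
    while active_requests > 0:
        await asyncio.sleep(0.01)

async def main() -> None:
    tasks = [asyncio.create_task(handle_request(0.05)) for _ in range(3)]
    await asyncio.sleep(0.01)  # let the requests start
    await drain()              # returns only once all three have finished
    results = await asyncio.gather(*tasks)
    print(results)  # prints ['ok', 'ok', 'ok']

asyncio.run(main())
```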

<Tip>
Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `kill -TERM <pid>` (note that `Ctrl+C` sends SIGINT, not SIGTERM), and verify you see graceful shutdown logs.
</Tip>
8 changes: 7 additions & 1 deletion cerebrium/scaling/scaling-apps.mdx
@@ -79,7 +79,13 @@ During normal replica operation, this simply corresponds to a request timeout value
waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.

<Note>
When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
</Note>

Performance metrics available through the dashboard help monitor scaling behavior:
2 changes: 1 addition & 1 deletion docs.json
@@ -111,7 +111,7 @@
"pages": [
"v4/examples/gpt-oss",
"v4/examples/openai-compatible-endpoint-vllm",
"v4/examples/streaming-falcon-7B"
"v4/examples/sglang"
]
},
{
Binary file added images/sglang-arch.png
Binary file added images/sglang_advertisement.jpg