Merged
3 changes: 2 additions & 1 deletion cerebrium/container-images/custom-web-servers.mdx
@@ -52,7 +52,8 @@ The configuration requires three key parameters:
<Info>
For ASGI applications like FastAPI, include the appropriate server package
(like `uvicorn`) in your dependencies. After deployment, your endpoints become
available at `https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
available at
`https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
</Info>

Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation.
12 changes: 9 additions & 3 deletions cerebrium/getting-started/introduction.mdx
@@ -50,7 +50,11 @@ We can then run this function in the cloud and pass it a prompt.

```bash
cerebrium run main.py::run --prompt "Hello World!"
```
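For reference, the `run` function being invoked might look like the minimal sketch below. The actual example file is collapsed in this diff, so the function body and names here are illustrative assumptions, not the real example code:

```python
# main.py -- a minimal function that `cerebrium run main.py::run` could invoke.
# The body below is a hypothetical stand-in for the collapsed example file.
def run(prompt: str) -> dict:
    # Log the prompt so it shows up in the cloud logs
    print(f"Received prompt: {prompt}")
    return {"result": f"You sent: {prompt}"}

if __name__ == "__main__":
    print(run("Hello World!"))
```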

You should see logs that output the prompt you sent in - this is running in the cloud!

Use the `run` command for quick code iteration, testing snippets, or one-off scripts that need a large CPU/GPU in the cloud.

Let us now turn this into a scalable REST endpoint - something we could put in production!

### 4. Deploy your app

@@ -60,11 +64,13 @@ Run the following command:

```bash
cerebrium deploy
```

This turns the function into a persistent, callable [endpoint](/cerebrium/endpoints/inference-api) that accepts JSON parameters (`prompt`) and can scale to thousands of requests automatically!

Once deployed, the app is callable through a POST endpoint `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/{function-name}` that takes a JSON body with a `prompt` parameter.
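As a sketch, calling the deployed endpoint from Python might look like this. The project ID, app name, function name, and token below are placeholders, not real values:

```python
import json
from urllib import request

# Placeholder identifiers -- substitute your own project ID, app name, and function
base = "https://api.aws.us-east-1.cerebrium.ai/v4"
url = f"{base}/p-example123/my-app/run"

payload = json.dumps({"prompt": "Hello World!"}).encode()
req = request.Request(
    url,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <YOUR_TOKEN>",  # placeholder token
    },
    method="POST",
)
# request.urlopen(req) would send the POST; omitted here since the IDs are fake
print(url)
```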

Great! You made it!

Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.

## How It Works

31 changes: 17 additions & 14 deletions cerebrium/scaling/graceful-termination.mdx
@@ -15,7 +15,7 @@ When Cerebrium needs to terminate a container, we do the following:

1. Stop routing new requests to the container.
2. Send a SIGTERM signal to your container.
3. Wait for `response_grace_period` seconds to elapse.
4. Send SIGKILL if the container hasn't stopped.
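In a custom runtime, capturing SIGTERM is what opens the grace window. A minimal, framework-agnostic Python sketch of the capture step (the drain logic that follows is up to your application):

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Flip a flag instead of exiting immediately; the main loop can
    # check it, stop accepting work, and drain in-flight requests.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate Cerebrium sending SIGTERM to this process (POSIX only)
os.kill(os.getpid(), signal.SIGTERM)
print(f"shutting_down={shutting_down}")  # prints shutting_down=True
```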

The flowchart below illustrates the sequence:

```mermaid
flowchart TD
A[SIGTERM sent] --> B[Cortex]
A --> C[Custom Runtime]

B --> D[automatically captured]
C --> E[User needs to capture]

D --> F[request finishes]
D --> G[response_grace_period reached]

E --> H[User logic]

F --> I[Graceful termination]
G --> J[SIGKILL]

H --> O[Graceful termination]
H --> G[response_grace_period reached]

J --> K[Gateway Timeout Error]
```

If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.


## Example: FastAPI Implementation

For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.

The code below tracks active requests using a counter and prevents new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.

@@ -63,11 +62,11 @@ lock = asyncio.Lock()
@asynccontextmanager
async def lifespan(app: FastAPI):
yield # Application startup complete

# Shutdown: runs when Cerebrium sends SIGTERM
global shutting_down
shutting_down = True

# Wait for active requests to complete
while active_requests > 0:
await asyncio.sleep(1)
@@ -79,7 +78,7 @@ async def track_requests(request, call_next):
global active_requests
if shutting_down:
raise HTTPException(503, "Shutting down")

async with lock:
active_requests += 1
try:
@@ -96,12 +95,15 @@ In your Dockerfile:
```dockerfile
# Exec form runs uvicorn as PID 1, so it receives SIGTERM directly
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Or in cerebrium.toml:

```toml
[cerebrium.runtime.custom]
entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]
```

In bash scripts:

```bash
exec fastapi run app.py --port ${PORT:-8000}
```
Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.
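The counter-and-drain idea in the FastAPI snippet above can be exercised without a web framework. The asyncio sketch below is an illustrative model of the mechanism, not Cerebrium or FastAPI API; all names are assumptions:

```python
import asyncio

active_requests = 0
shutting_down = False
lock = asyncio.Lock()

async def handle_request(duration: float) -> str:
    """Simulate one in-flight request, tracked by the shared counter."""
    global active_requests
    if shutting_down:
        raise RuntimeError("Shutting down")  # reject new work during drain
    async with lock:
        active_requests += 1
    try:
        await asyncio.sleep(duration)  # stand-in for real request work
        return "ok"
    finally:
        async with lock:
            active_requests -= 1  # always decrement, even on error

async def drain() -> None:
    """What the lifespan shutdown phase does: wait until nothing is in flight."""
    global shutting_down
    shutting_down = True
    while active_requests > 0:
        await asyncio.sleep(0.01)

async def main() -> None:
    tasks = [asyncio.create_task(handle_request(0.05)) for _ in range(3)]
    await asyncio.sleep(0.01)  # let the requests start
    await drain()              # returns only once all three have finished
    results = await asyncio.gather(*tasks)
    print(results)  # prints ['ok', 'ok', 'ok']

asyncio.run(main())
```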

<Tip>
Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `kill -TERM <pid>` (note that `Ctrl+C` sends SIGINT, not SIGTERM), and verify you see graceful shutdown logs.
</Tip>
8 changes: 7 additions & 1 deletion cerebrium/scaling/scaling-apps.mdx
@@ -79,7 +79,13 @@ During normal replica operation, this simply corresponds to a request timeout value
waits for the specified grace period, issues a SIGKILL command if the instance has not stopped, and kills any active requests with a GatewayTimeout error.

<Note>
When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
</Note>

Performance metrics available through the dashboard help monitor scaling behavior:
2 changes: 1 addition & 1 deletion docs.json
@@ -111,7 +111,7 @@
"pages": [
"v4/examples/gpt-oss",
"v4/examples/openai-compatible-endpoint-vllm",
"v4/examples/streaming-falcon-7B"
"v4/examples/sglang"
]
},
{
Binary file added images/sglang-arch.png
Binary file added images/sglang_advertisement.jpg