diff --git a/cerebrium/container-images/custom-web-servers.mdx b/cerebrium/container-images/custom-web-servers.mdx
index f2c531f..50d0da6 100644
--- a/cerebrium/container-images/custom-web-servers.mdx
+++ b/cerebrium/container-images/custom-web-servers.mdx
@@ -52,7 +52,8 @@ The configuration requires three key parameters:
   For ASGI applications like FastAPI, include the appropriate server package
   (like `uvicorn`) in your dependencies. After deployment, your endpoints become
-  available at `https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
+  available at
+  `https://api.aws.us-east-1.cerebrium.ai/v4/[project-id]/[app-name]/your/endpoint`.
   Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides
   a complete implementation.
diff --git a/cerebrium/getting-started/introduction.mdx b/cerebrium/getting-started/introduction.mdx
index 8ce204f..778da21 100644
--- a/cerebrium/getting-started/introduction.mdx
+++ b/cerebrium/getting-started/introduction.mdx
@@ -50,7 +50,11 @@ We can then run this function in the cloud and pass it a prompt.
 cerebrium run main.py::run --prompt "Hello World!"
 ```

-Your should see logs that output the prompt you sent in - this is running in the cloud! Let us now turn this into a scalable REST endpoint.
+You should see logs that output the prompt you sent in - this is running in the cloud!
+
+Use the `run` functionality for quick code iteration, testing snippets, or one-off scripts that need a large CPU/GPU in the cloud.
+
+Let us now turn this into a scalable REST endpoint - something we could put in production!

 ### 4. Deploy your app

@@ -60,11 +64,13 @@ Run the following command:
 cerebrium deploy
 ```

-This will turn the function into a callable endpoint that accepts json parameters (prompt) and can scale to 1000s of requests automatically!
+This will turn the function into a callable, persistent [endpoint](/cerebrium/endpoints/inference-api) that accepts JSON parameters (prompt) and can scale to 1000s of requests automatically!

 Once deployed, an app becomes callable through a POST endpoint `https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/{function-name}` and takes a json parameter, prompt
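+
+For example, a request to the deployed function might look like this (the project ID, app name, and auth header below are placeholders - adjust them for your own deployment):
+
+```bash
+curl -X POST "https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/{app-name}/run" \
+  -H "Authorization: Bearer <YOUR_API_KEY>" \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "Hello World!"}'
+```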

-Great! You made it! Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.
+Great! You made it!
+
+Join our Community [Discord](https://discord.gg/ATj6USmeE2) for support and updates.

 ## How It Works

diff --git a/cerebrium/scaling/graceful-termination.mdx b/cerebrium/scaling/graceful-termination.mdx
index c547502..b452fa9 100644
--- a/cerebrium/scaling/graceful-termination.mdx
+++ b/cerebrium/scaling/graceful-termination.mdx
@@ -15,7 +15,7 @@ When Cerebrium needs to terminate an contanier, we do the following:

 1. Stop routing new requests to the container.
 2. Send a SIGTERM signal to your container.
-3. Waits for `response_grace_period` seconds to elaspse.
+3. Waits for `response_grace_period` seconds to elapse.
 4. Sends SIGKILL if the container hasn't stopped

 Below is a chart that shows it more eloquently:

@@ -24,30 +24,29 @@ Below is a chart that shows it more eloquently:
 flowchart TD
     A[SIGTERM sent] --> B[Cortex]
     A --> C[Custom Runtime]
-
+
     B --> D[automatically captured]
     C --> E[User needs to capture]
-
+
     D --> F[request finishes]
     D --> G[response_grace_period reached]
-
+
     E --> H[User logic]
-
+
     F --> I[Graceful termination]
     G --> J[SIGKILL]
-
+
     H --> O[Graceful termination]
     H --> G[response_grace_period reached]
-
+
     J --> K[Gateway Timeout Error]
 ```

 If you do not handle SIGTERM in the custom runtime, Cerebrium terminates containers immediately after sending `SIGTERM`, which can interrupt in-flight requests and cause **502 errors**.

-
 ## Example: FastAPI Implementation

-For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM. 
+For custom runtimes using FastAPI, implement the [`lifespan` pattern](https://fastapi.tiangolo.com/advanced/events/) to respond to SIGTERM.

 The code below tracks active requests using a counter and prevents new requests during shutdown. When SIGTERM is received, it sets a shutdown flag and waits for all active requests to complete before the application terminates.

@@ -63,11 +62,11 @@ lock = asyncio.Lock()
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     yield  # Application startup complete
-
+
     # Shutdown: runs when Cerebrium sends SIGTERM
     global shutting_down
     shutting_down = True
-
+
     # Wait for active requests to complete
     while active_requests > 0:
         await asyncio.sleep(1)
@@ -79,7 +78,7 @@ async def track_requests(request, call_next):
     global active_requests
     if shutting_down:
         raise HTTPException(503, "Shutting down")
-
+
     async with lock:
         active_requests += 1
     try:
@@ -96,12 +95,15 @@ In your Dockerfile:

 ```dockerfile
 ENTRYPOINT ["exec", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
 ```
+
 Or in cerebrium.toml:
+
 ```toml
 [cerebrium.runtime.custom]
 entrypoint = ["fastapi", "run", "app.py", "--port", "8000"]
 ```
+
 In bash scripts:

 ```bash
@@ -111,5 +113,6 @@ exec fastapi run app.py --port ${PORT:-8000}
 ```

 Without exec, SIGTERM is sent to the bash script (PID 1) instead of FastAPI, so your shutdown code never runs and Cerebrium force-kills the container after the grace period.

-Test SIGTERM handling locally before deploying: start your app, send SIGTERM with `Ctrl+C`, and verify you see graceful shutdown logs.
-
\ No newline at end of file
+  Test SIGTERM handling locally before deploying: start your app, stop it with
+  `Ctrl+C` (SIGINT) or `kill -TERM <pid>`, and verify you see graceful shutdown logs.
+
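+For example, a minimal local check might look like this (assuming the FastAPI app above is saved as `main.py`):
+
+```bash
+# Start the server locally in the background
+uvicorn main:app --port 8000 &
+SERVER_PID=$!
+
+sleep 5                    # give the server time to start
+kill -TERM "$SERVER_PID"   # send SIGTERM, as Cerebrium does
+wait "$SERVER_PID"         # the lifespan shutdown block should log before the process exits
+```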
diff --git a/cerebrium/scaling/scaling-apps.mdx b/cerebrium/scaling/scaling-apps.mdx
index f446ab6..1df45ce 100644
--- a/cerebrium/scaling/scaling-apps.mdx
+++ b/cerebrium/scaling/scaling-apps.mdx
@@ -79,7 +79,13 @@ During normal replica operation, this simply corresponds to a request timeout va
   waits for the specified grace period, issues a SIGKILL command if the instance
   has not stopped, and kills any active requests with a GatewayTimeout error.

-  When using the Cortex runtime (default), SIGTERM signals are automatically handled to allow graceful termination of requests. For custom runtimes, you'll need to implement SIGTERM handling yourself to ensure requests complete gracefully before termination. See our [Graceful Termination guide](/cerebrium/scaling/graceful-termination) for detailed implementation examples, including FastAPI patterns for tracking and completing in-flight requests during shutdown.
+  When using the Cortex runtime (default), SIGTERM signals are automatically
+  handled to allow graceful termination of requests. For custom runtimes, you'll
+  need to implement SIGTERM handling yourself to ensure requests complete
+  gracefully before termination. See our [Graceful Termination
+  guide](/cerebrium/scaling/graceful-termination) for detailed implementation
+  examples, including FastAPI patterns for tracking and completing in-flight
+  requests during shutdown.

 Performance metrics available through the dashboard help monitor scaling behavior:
diff --git a/docs.json b/docs.json
index e836c23..3a9d7b3 100644
--- a/docs.json
+++ b/docs.json
@@ -111,7 +111,7 @@
       "pages": [
         "v4/examples/gpt-oss",
         "v4/examples/openai-compatible-endpoint-vllm",
-        "v4/examples/streaming-falcon-7B"
+        "v4/examples/sglang"
       ]
     },
     {
diff --git a/images/sglang-arch.png b/images/sglang-arch.png
new file mode 100644
index 0000000..b6ff418
Binary files /dev/null and b/images/sglang-arch.png differ
diff --git a/images/sglang_advertisement.jpg b/images/sglang_advertisement.jpg
new file mode 100644
index 0000000..48c3152
Binary files /dev/null and b/images/sglang_advertisement.jpg differ
diff --git a/v4/examples/sglang.mdx b/v4/examples/sglang.mdx
new file mode 100644
index 0000000..c5181db
--- /dev/null
+++ b/v4/examples/sglang.mdx
@@ -0,0 +1,276 @@
---
title: "Deploy a Vision Language Model with SGLang"
description: "Build an intelligent ad analysis system that evaluates advertisements across multiple dimensions"
---

In this tutorial, we'll explore how to deploy a Vision Language Model (VLM) using SGLang on Cerebrium. A VLM is an AI model that combines a large language model (LLM) with a vision encoder, allowing it to understand and process both images and text.

We'll build an intelligent ad analysis system that evaluates advertisements across multiple dimensions, giving us a score for how well the advertisement relates to the business in question and how it performs against the given criteria.

SGLang (Structured Generation Language) differs from other inference frameworks such as vLLM and TensorRT by focusing on structured generation and complex, multi-step LLM workflows. SGLang is used in production by teams at xAI and DeepSeek to power their core language model capabilities, making it a trusted choice.

### SGLang Architecture

SGLang isn't just a domain-specific language (DSL). It's a complete, integrated execution system, designed with a clear separation of responsibilities:

| Layer    | What it does                                                                    | Why it matters                                                                    |
| -------- | ------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| Frontend | Where you define your LLM logic (with gen, fork, join, etc.)                    | This keeps your code clean, readable, and your workflows easily reusable.           |
| Backend  | Where SGLang intelligently figures out how to run your logic most efficiently.  | This is where the speed, scalability, and optimized inference truly come to life.   |

To give you a quick example, here are some of the frontend primitives you can use to create multi-step workflows (a short sketch of how they compose follows the table):

| Primitive  | What it does                             | Example                                     |
| ---------- | ---------------------------------------- | ------------------------------------------- |
| `gen()`    | Generates a text span                    | `gen("title", stop="\n")`                   |
| `fork()`   | Splits execution into multiple branches  | For parallel sub-tasks                      |
| `join()`   | Merges branches back together            | For combining outputs                       |
| `select()` | Chooses one option from many             | For controlled logic, like multiple choice  |
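As a minimal sketch of how these primitives read in code (the function and slot names below are illustrative, and a default backend is assumed to have been set with `sgl.set_default_backend`, as we do in the tutorial later):

```python
import sglang as sgl
from sglang import function

@function
def review_tagline(s, description):
    s += sgl.user("Product description: " + description)
    # gen(): generate a text span into the named slot "tagline"
    s += sgl.assistant("Tagline: " + sgl.gen("tagline", stop="\n"))

    # fork(): evaluate two aspects in parallel branches
    forks = s.fork(2)
    for i, (f, aspect) in enumerate(zip(forks, ["clarity", "appeal"])):
        f += sgl.user(f"Rate the tagline's {aspect} in one sentence.")
        f += sgl.assistant(sgl.gen(f"rating_{i}", stop="\n"))

    # select(): constrain the final answer to a fixed set of choices
    s += sgl.user("Overall, is the tagline usable?")
    s += sgl.assistant(sgl.select("verdict", choices=["yes", "no"]))

# Usage once a backend has been set:
# state = review_tagline.run("A lightweight trail-running shoe")
# print(state["tagline"], state["verdict"])
```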
![SGLang Architecture](/images/sglang-arch.png)

Here is a summary of SGLang's key advantages over traditional inference engines:

| Feature                 | Traditional Engines (vLLM, TGI)                   | SGLang                                                                      |
| ----------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------- |
| **Programming Model**   | Sequential API calls with manual prompt chaining   | Native structured logic with `gen()`, `fork()`, `join()`, `select()`          |
| **Memory Management**   | Basic KV caching, often discarded between calls    | **RadixAttention**: Intelligent prefix-aware cache reuse (up to 6x faster)    |
| **Output Control**      | Hope and pray for correct formatting               | **Compressed FSMs**: Guaranteed structured output (JSON, XML, etc.)           |
| **Parallel Processing** | Manual batching and coordination                   | Built-in `fork()` and `join()` for parallel execution                         |
| **Performance**         | Standard inference optimization                    | PyTorch-native with `torch.compile()`, quantization, sparse inference         |

If you would like to read more, check out this [article](https://huggingface.co/blog/paresh2806/sglang-efficient-llm-workflows). Let's see this in practice with our tutorial.
You can see the final code sample [here](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/7-vision-language-sglang).

## Tutorial

### Step 1: Project Setup

First, let's create our project structure:

```bash
cerebrium init 7-vision-language-sglang
cd 7-vision-language-sglang
```

### Step 2: Configure Dependencies

The VLM we will be using is the [Qwen3-VL-30B-A3B-Instruct-FP8](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct-FP8) model, which needs a lot of GPU memory - we configure this in our cerebrium.toml.

Cerebrium runs containers in the cloud and this file defines our environment, hardware, and scaling settings. We'll use an ADA_L40 GPU to accommodate our model's memory requirements. The configuration includes:

- Hardware settings for GPU, CPU, and memory allocation
- Scaling parameters to control instance counts
- Required pip packages like SGLang, flashinfer (our chosen backend), and PyTorch
- APT system dependencies
- FastAPI server configuration for hosting our API

For a complete reference of all available TOML settings, see our [TOML Reference](/toml-reference/toml-reference).

While we use flashinfer as our backend here, other options like flash attention are also available depending on your needs.

Update your cerebrium.toml with:

```toml
[cerebrium.deployment]
name = "7-vision-language-sglang"
python_version = "3.11"
docker_base_image_url = "nvidia/cuda:12.8.0-devel-ubuntu22.04"
deployment_initialization_timeout = 860

[cerebrium.hardware]
cpu = 6.0
memory = 60.0
compute = "ADA_L40"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2

[cerebrium.build]
use_uv = true

[cerebrium.dependencies.pip]
transformers = "latest"
huggingface_hub = "latest"
pydantic = "latest"
pillow = "latest"
requests = "latest"
torch = "latest"
"sglang[all]" = "latest"
"sgl-kernel" = "latest"
"flashinfer-python" = "latest"

[cerebrium.dependencies.apt]
libnuma-dev = "latest"

[cerebrium.runtime.custom]
port = 8000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Step 3: Implement the Ad Analysis Logic

One of the many great features of Cerebrium is that we don't enforce any special class design or way of architecting your applications - just write your Python code as if you were running it locally (and had a GPU ;). Below, we set up the SGLang Runtime engine (the backend) alongside our FastAPI app and load the model when the container starts. This means a cold start incurs the model load before the first request is served, but subsequent requests respond without that delay.

In your `main.py` file:

```python
import sglang as sgl
from sglang import function
from fastapi import FastAPI, HTTPException
from transformers import AutoProcessor

app = FastAPI(title="Vision Language SGLang API")
model_path = "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
processor = AutoProcessor.from_pretrained(model_path)

@app.on_event("startup")
def _startup_warmup():
    # Initialize engine on main thread during app startup
    runtime = sgl.Runtime(
        model_path=model_path,
        enable_multimodal=True,
        mem_fraction_static=0.8,
        tp_size=1,
        attention_backend="flashinfer",
    )
    runtime.endpoint.chat_template = sgl.lang.chat_template.get_chat_template(
        "qwen2-vl"
    )
    sgl.set_default_backend(runtime)


@app.get("/health")
def health():
    return {
        "status": "healthy",
    }
```
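Before adding the analysis logic, you can sanity-check that the server starts and responds. For instance (the project ID below is a placeholder, and an auth header may be required depending on your setup):

```bash
# Local check while developing
curl http://localhost:8000/health

# Against the deployed app (replace {project-id} with your own)
curl https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/7-vision-language-sglang/health
```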
To score the advertisement, we will use one of SGLang's core differentiators, `fork()`, which allows us to run many prompts in parallel and bring the results together at the end. This lets us execute many simultaneous generations with little increase in total latency. Lastly, we combine these results and structure them in a specific format to return to the user.

```python
@function
def analyze_ad(s, image, ad_description, dimensions):
    s += sgl.system("Evaluate an advertisement against a company's description.")
    s += sgl.user(sgl.image(image) + "Company Description: " + ad_description)
    s += sgl.assistant("Sure!")

    s += sgl.user("Is the company description related to the image?")
    s += sgl.assistant(sgl.select("related", choices=["yes", "no"]))
    if s["related"] == "no":
        return

    forks = s.fork(len(dimensions))
    for i, (f, dim) in enumerate(zip(forks, dimensions)):
        f += sgl.user("Evaluate based on the following dimension: " +
                      dim + ". End your judgment with the word 'END'")
        # Use unique slot names per dimension to avoid collisions
        f += sgl.assistant("Judgment: " + sgl.gen(f"judgment_{i}", stop="END"))

    s += sgl.user("Provide a one-sentence synthesis of the overall evaluation, then we will output JSON.")
    s += sgl.assistant(sgl.gen("summary_one_liner", stop="."))

    schema = r'^\{"summary": ".{1,400}", "grade": "[ABCD][+\-]?"\}$'
    s += sgl.user("Return only a JSON object with keys summary and grade (A, B, C, D, with an optional + or -), where summary is a short paragraph that synthesizes the above judgments.")
    s += sgl.assistant(sgl.gen("output", regex=schema))
```
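If you want to exercise the function outside of FastAPI while developing, a hypothetical smoke test could look like this (it assumes the SGLang runtime from the startup hook above has already been initialized and that an `ad.jpg` file exists locally):

```python
from PIL import Image

# Hypothetical local check; the default SGLang backend must already be set
state = analyze_ad.run(
    Image.open("ad.jpg"),
    "A sportswear brand known for its running shoes",
    ["Clarity", "Appeal"],
)
print(state["output"])
```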
To end, let's bring it all together in an endpoint that accepts a base64-encoded image, runs the analysis, and returns the structured result:

```python
from pydantic import BaseModel
import base64
import io
import json
from PIL import Image

class AnalyzeRequest(BaseModel):
    image_base64: str
    ad_description: str
    dimensions: list

def process_image(image_base64: str) -> Image.Image:
    image_data = base64.b64decode(image_base64)
    return Image.open(io.BytesIO(image_data))

@app.post("/analyze")
def analyze_advertisement(req: AnalyzeRequest):
    try:
        image = process_image(req.image_base64)
        state = analyze_ad.run(image, req.ad_description, req.dimensions)
        try:
            output = state["output"]
        except KeyError:
            output = None
        if isinstance(output, str):
            start = output.find("{")
            end = output.rfind("}") + 1
            if start != -1 and end > start:
                return {
                    "success": True,
                    "analysis": json.loads(output[start:end]),
                    "dimensions_evaluated": req.dimensions
                }
        return {
            "success": True,
            "analysis": output,
            "dimensions_evaluated": req.dimensions
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```

That's it! Let's deploy the application so it becomes a scalable inference endpoint.

### Step 4: Deploy Your Application

Run:

```bash
cerebrium deploy
```

Once deployed, test your application with a sample request:

```bash
curl -X POST "https://api.aws.us-east-1.cerebrium.ai/v4/p-/7-vision-language-sglang/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "ad_description": "Nike is a global leader in athletic footwear, apparel, and sports equipment known for its innovative designs and the iconic “swoosh” logo. The brand embodies performance, style, and inspiration, empowering athletes worldwide to Just Do It.",
    "image_base64": "",
    "dimensions": ["Effectiveness", "Clarity", "Appeal", "Credibility"]
  }'
```

![Nike AD](/images/sglang_advertisement.jpg)

### Example Response

```json
{
  "success": true,
  "analysis": {
    "summary": "The company description is relevant to the image because it accurately reflects Nike's branding, which is showcased through the advertised sneaker and logo. The ad promotes Nike's core products—athletic footwear—and its values of performance, style, and inspiration, aligning with the brand's identity. The collaboration with a superhero theme further emphasizes innovation and empowerment, core ",
    "grade": "A"
  },
  "dimensions_evaluated": [
    "Effectiveness",
    "Clarity",
    "Appeal",
    "Credibility"
  ]
}
```

We've demonstrated a simple application that leverages SGLang's structured generation capabilities to build a basic ad analysis system, using features like `fork()` for parallel processing and SGLang's built-in output control.

You can find the complete code for this tutorial in our [examples repository](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/7-vision-language-sglang).
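If you prefer calling the endpoint from Python instead of curl, a minimal client might look like this (the project ID, image path, and auth header are placeholders; `requests` is already listed in the pip dependencies above):

```python
import base64
import requests

# Placeholder project ID - replace with your own
url = "https://api.aws.us-east-1.cerebrium.ai/v4/{project-id}/7-vision-language-sglang/analyze"

# Encode a local advertisement image as base64
with open("ad.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

payload = {
    "ad_description": "Nike is a global leader in athletic footwear and apparel.",
    "image_base64": image_base64,
    "dimensions": ["Effectiveness", "Clarity", "Appeal", "Credibility"],
}

# Add an Authorization header here if your deployment requires one
response = requests.post(url, json=payload, timeout=300)
print(response.json())
```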