# Ray Serve

* Ray Serve is a scalable model serving library for building online inference APIs.
    * Framework-agnostic
    * Well-suited for model composition

# Key Concepts

### Deployment

* Deployments are the central concept in Ray Serve.
    * A deployment contains business logic or an ML model to handle incoming requests and can be scaled up to run across a Ray cluster.
    * At runtime, a deployment consists of a number of replicas, which are individual copies of the class or function that are started in separate Ray actors (processes). The number of replicas can be scaled up or down (or even autoscaled) to match the incoming request load.
    * To define a deployment:
        * Use the `@serve.deployment` decorator on a class (or function for simple use cases).
        * Then, `bind` the deployment with optional arguments to the constructor to define the application.
        * Finally, deploy the resulting application using `serve.run`.

In [2]:
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class MyFirstDeployment:
    # Take the message to return as an argument to the constructor.
    def __init__(self, msg):
        self.msg = msg
    
    def __call__(self):
        return self.msg
    
my_first_deployment = MyFirstDeployment.bind("Hello world!")
handle: DeploymentHandle = serve.run(my_first_deployment)

INFO 2024-11-14 18:27:15,753 serve 17528 api.py:259 - Connecting to existing Serve app in namespace "serve". New http options will not be applied.
INFO 2024-11-14 18:27:19,798 serve 17528 client.py:492 - Deployment 'MyFirstDeployment:42vdi3zg' is ready at `http://127.0.0.1:8000/`. component=serve deployment=MyFirstDeployment
INFO 2024-11-14 18:27:19,802 serve 17528 api.py:549 - Deployed app 'default' successfully.


### Application

* An application is the unit of upgrade in a Ray Serve cluster. An application consists of one or more deployments. 
* One of these deployments is considered the “ingress” deployment, which handles all inbound traffic.

* Applications can be called via HTTP at the specified `route_prefix` or in Python using a `DeploymentHandle`.

### `DeploymentHandle` (composing deployments)

* Ray Serve enables flexible model composition and scaling by allowing multiple independent deployments to call into each other.
* When binding a deployment, you can include references to other bound deployments. Then, at runtime each of these arguments is converted to a `DeploymentHandle` that can be used to query the deployment using a Python-native API. 

In [3]:
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Hello:
    def __call__(self) -> str:
        return "Hello"

@serve.deployment
class World:
    def __call__(self) -> str:
        return " world!"
    
@serve.deployment
class Ingress:
    def __init__(self, hello_handle: DeploymentHandle, world_handle: DeploymentHandle):
        self._hello_handle = hello_handle
        self._world_handle = world_handle

    async def __call__(self) -> str:
        hello_response = self._hello_handle.remote()
        world_response = self._world_handle.remote()
        return (await hello_response) + (await world_response)
    
hello = Hello.bind()
world = World.bind()

# The deployments passed to the Ingress constructor are replaced with handles.
app = Ingress.bind(hello, world)

# Deploys Hello, World, and Ingress.
handle: DeploymentHandle = serve.run(app)

INFO 2024-11-14 18:47:02,469 serve 17528 api.py:259 - Connecting to existing Serve app in namespace "serve". New http options will not be applied.
INFO 2024-11-14 18:47:05,498 serve 17528 client.py:492 - Deployment 'Hello:lp6t9mbo' is ready. component=serve deployment=Hello
INFO 2024-11-14 18:47:05,499 serve 17528 client.py:492 - Deployment 'World:lptc9rl5' is ready. component=serve deployment=World
INFO 2024-11-14 18:47:05,500 serve 17528 client.py:492 - Deployment 'Ingress:56vy2kd0' is ready at `http://127.0.0.1:8000/`. component=serve deployment=Ingress
INFO 2024-11-14 18:47:05,504 serve 17528 api.py:549 - Deployed app 'default' successfully.
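
The `Ingress.__call__` method above calls `.remote()` on both handles before awaiting either result, so the two downstream requests can be in flight at the same time. The pattern can be sketched with plain `asyncio` and no Ray dependency. Here `fake_remote` and the `events` list are hypothetical stand-ins for illustration and are not part of the Ray Serve API.

```python
import asyncio

events = []  # Records the order in which the simulated calls start and finish.


async def fake_remote(name: str, value: str) -> str:
    # Hypothetical stand-in for a downstream deployment call.
    events.append(f"start {name}")
    await asyncio.sleep(0.05)  # Simulated network/compute latency.
    events.append(f"done {name}")
    return value


async def ingress() -> str:
    # Issue both calls first so they are in flight at the same time...
    hello_task = asyncio.create_task(fake_remote("hello", "Hello"))
    world_task = asyncio.create_task(fake_remote("world", " world!"))
    # ...then await each result and combine them.
    return (await hello_task) + (await world_task)


result = asyncio.run(ingress())
print(result)
print(events[:2])  # Both calls start before either one finishes.
```

Awaiting the first call before issuing the second would instead run the two requests one after the other.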


### Ingress Deployment (HTTP Handling)

* A Serve application can consist of multiple deployments that can be combined to perform model composition or complex business logic.
    * However, one deployment is always the "top-level" one that is passed to `serve.run` to deploy the application. This deployment is called the "ingress deployment" because it serves as the entrypoint for all traffic to the application.
    * Often, it then routes to other deployments or calls into them using the `DeploymentHandle` API, and composes the results before returning to the user.

* The ingress deployment defines the HTTP handling logic for the application.
    * By default, the `__call__` method of the class is called and passed in a Starlette `request` object. 
    * The response will be serialized as JSON, but other Starlette response objects can also be returned directly.

In [4]:
import requests
from starlette.requests import Request

from ray import serve


@serve.deployment
class MostBasicIngress:
    async def __call__(self, request: Request) -> str:
        name = (await request.json())["name"]
        return f"Hello {name}!"
    
app = MostBasicIngress.bind()
serve.run(app)

INFO 2024-11-14 19:02:15,287 serve 17528 api.py:259 - Connecting to existing Serve app in namespace "serve". New http options will not be applied.
INFO 2024-11-14 19:02:17,308 serve 17528 client.py:492 - Deployment 'MostBasicIngress:0w54a9l9' is ready at `http://127.0.0.1:8000/`. component=serve deployment=MostBasicIngress
INFO 2024-11-14 19:02:17,312 serve 17528 api.py:549 - Deployed app 'default' successfully.


DeploymentHandle(deployment='MostBasicIngress')

In [None]:
requests.get("http://127.0.0.1:8000/", json={"name": "Corey"}).text

'Hello Corey!'


# Quickstart

In [1]:
import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

In [2]:
# 1: Define a Ray Serve application
@serve.deployment
class MyModelDeployment:
    def __init__(self, msg:str):
        # Initialize model state: could be very large neural net weights.
        self._msg = msg

    def __call__(self, request: Request) -> Dict:
        return {'result':self._msg}
    
app = MyModelDeployment.bind(msg='Hello world!')

# 2: Deploy the application locally.
serve.run(
    app, route_prefix="/"
)

# 3: Query the application and print the result.
print(
    requests.get("http://localhost:8000/").json()
)


2024-11-12 21:40:29,329	INFO worker.py:1807 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=31828) INFO 2024-11-12 21:40:38,827 proxy 127.0.0.1 proxy.py:1191 - Proxy starting on node fccb8ab9447526fdf3320db14157ba80322942fa7eea38b6abadf950 (HTTP port: 8000).
INFO 2024-11-12 21:40:40,441 serve 9360 api.py:277 - Started Serve in namespace "serve".
(ServeController pid=29596) INFO 2024-11-12 21:40:40,549 controller 29596 deployment_state.py:1604 - Deploying new version of Deployment(name='MyModelDeployment', app='default') (initial target replicas: 1).
(ServeController pid=29596) INFO 2024-11-12 21:40:40,673 controller 29596 deployment_state.py:1850 - Adding 1 replica to Deployment(name='MyModelDeployment', app='default').
(ServeReplica:default:MyModelDeployment pid=33100) INFO 2024-11-12 21:40:42,724 default_MyModelDeployment s41nlx0n 857f49ff-a984-4420-aeea-489d9481430b / replica.py:378

{'result': 'Hello world!'}


(ServeReplica:default:MyModelDeployment pid=33100) INFO 2024-11-12 21:40:45,550 default_MyModelDeployment s41nlx0n d35de062-0052-4eeb-8eae-2e35c0f75349 / replica.py:378 - __CALL__ OK 3.0ms
(ServeController pid=29596) INFO 2024-11-12 22:09:07,569 controller 29596 deployment_state.py:1604 - Deploying new version of Deployment(name='Adder', app='default') (initial target replicas: 1).
(ServeController pid=29596) INFO 2024-11-12 22:09:07,571 controller 29596 deployment_state.py:1604 - Deploying new version of Deployment(name='Adder_1', app='default') (initial target replicas: 1).
(ServeController pid=29596) INFO 2024-11-12 22:09:07,572 controller 29596 deployment_state.py:1604 - Deploying new version of Deployment(name='Combiner', app='default') (initial target replicas: 1).
(ServeController pid=29596) INFO 2024-11-12 22:09:07,574 controller 29596 deployment_state.py:1604 - Deploying new version of Deployment(name='Ingress', app='default') (init

# Model Composition

* Use Serve’s model composition API to combine multiple deployments into a single application.

In [1]:
import requests
import starlette.requests
from typing import Dict
from ray import serve
from ray.serve.handle import DeploymentHandle


In [3]:
# 1. Define the models in our composition graph and an ingress that calls them.
@serve.deployment
class Adder:
    def __init__(self, increment: int):
        self.increment = increment

    def add(self, inp: int):
        return self.increment + inp


@serve.deployment
class Combiner:
    def average(self, *inputs) -> float:
        return sum(inputs) / len(inputs)


@serve.deployment
class Ingress:
    def __init__(
        self,
        adder1: DeploymentHandle,
        adder2: DeploymentHandle,
        combiner: DeploymentHandle,
    ):
        self._adder1 = adder1
        self._adder2 = adder2
        self._combiner = combiner

    async def __call__(self, request: starlette.requests.Request) -> Dict[str, float]:
        input_json = await request.json()
        final_result = await self._combiner.average.remote(
            self._adder1.add.remote(input_json["val"]),
            self._adder2.add.remote(input_json["val"]),
        )
        return {"result": final_result}


# 2. Build the application consisting of the models and ingress.
app = Ingress.bind(Adder.bind(increment=1), Adder.bind(increment=2), Combiner.bind())
serve.run(app, route_prefix="/")

# 3: Query the application and print the result.
print(requests.post("http://localhost:8000/", json={"val": 100.0}).json())
# {"result": 101.5}

INFO 2024-11-12 22:17:44,902 serve 19084 api.py:259 - Connecting to existing Serve app in namespace "serve". New http options will not be applied.
INFO 2024-11-12 22:17:47,932 serve 19084 client.py:492 - Deployment 'Adder:bfezuj6x' is ready. component=serve deployment=Adder
INFO 2024-11-12 22:17:47,933 serve 19084 client.py:492 - Deployment 'Adder_1:3w8ofwz9' is ready. component=serve deployment=Adder_1
INFO 2024-11-12 22:17:47,934 serve 19084 client.py:492 - Deployment 'Combiner:77lexfl4' is ready. component=serve deployment=Combiner
INFO 2024-11-12 22:17:47,934 serve 19084 client.py:492 - Deployment 'Ingress:9h62rn08' is ready at `http://127.0.0.1:8000/`. component=serve deployment=Ingress
INFO 2024-11-12 22:17:47,938 serve 19084 api.py:549 - Deployed app 'default' successfully.


{'result': 101.5}
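
The result can be checked by tracing the same dataflow in plain Python. The sketch below drops the `@serve.deployment` decorators and the handles and calls the classes directly, so it illustrates only the arithmetic, not how Serve schedules or executes the graph.

```python
class Adder:
    def __init__(self, increment: int):
        self.increment = increment

    def add(self, inp: float) -> float:
        return self.increment + inp


class Combiner:
    def average(self, *inputs: float) -> float:
        return sum(inputs) / len(inputs)


# Same wiring as Ingress.bind(Adder.bind(increment=1), Adder.bind(increment=2), Combiner.bind()).
adder1, adder2, combiner = Adder(increment=1), Adder(increment=2), Combiner()

val = 100.0
# (100.0 + 1) and (100.0 + 2) are averaged: (101.0 + 102.0) / 2
result = combiner.average(adder1.add(val), adder2.add(val))
print({"result": result})  # {'result': 101.5}
```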


# FastAPI Integration

In [6]:
import requests
from fastapi import FastAPI
from ray import serve

# 1: Define a FastAPI app and wrap it in a deployment with a route handler.
app = FastAPI()


@serve.deployment
@serve.ingress(app)
class FastAPIDeployment:
    # FastAPI will automatically parse the HTTP request for us.
    @app.get("/hello")
    def say_hello(self, name: str) -> str:
        return f"Hello {name}!"


# 2: Deploy the deployment.
serve.run(FastAPIDeployment.bind(), route_prefix="/")

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/hello", params={"name": "Theodore"}).json())
# "Hello Theodore!"

INFO 2024-11-12 22:20:19,871 serve 19084 api.py:259 - Connecting to existing Serve app in namespace "serve". New http options will not be applied.


INFO 2024-11-12 22:20:23,903 serve 19084 client.py:492 - Deployment 'FastAPIDeployment:0oiif6jq' is ready at `http://127.0.0.1:8000/`. component=serve deployment=FastAPIDeployment
INFO 2024-11-12 22:20:23,909 serve 19084 api.py:549 - Deployed app 'default' successfully.


Hello Theodore!
