# Ray Serve - Model Serving Challenges

© 2019-2022, Anyscale. All Rights Reserved

## The Challenges of Model Serving

Model development happens in a data science research environment. There are many challenges, yet there are tools at the data scientists' disposal.

By contrast, model deployment to production faces an entirely different set of challenges and requires different tools, and it is desirable to bridge the divide as much as possible.

Here is a partial list of the challenges of model serving:

### It Should Be Framework Agnostic

First, model serving frameworks must be able to serve models from popular frameworks and libraries like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks, in order to get the best model. 

Second, machine learning models are typically surrounded by lots of application or business logic. For example, some model serving is implemented as a RESTful service to which scoring requests are made. Often this is too restrictive: additional processing, such as fetching data from an online feature store to augment the request data, may be desired as part of the scoring process, and the performance overhead of the extra remote calls may be suboptimal.

### Pure Python or Pythonic

A common trend for model serving is to use JVM-based systems, since many production enterprises are JVM-based. This is a disadvantage when model training and other data processing are done using only Python tools. 

In general, model serving should be intuitive for developers and simple to configure and run. Hence, it is desirable to use pure Python and to avoid verbose configurations using YAML files or other means. 

Data scientists and engineers use Python to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This need is growing more critical as online learning applications combine training and serving in the same applications.

### Simple and Scalable

Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically, over time. Meeting production uptime and performance requirements is essential for success.

### DevOps/MLOps Integrations

Model serving deployments need to integrate with existing "DevOps" CI/CD practices for controlled, audited, and predictable releases. Patterns like [Canary Deployment](https://martinfowler.com/bliki/CanaryRelease.html) are particularly useful for testing the efficacy of a new model before replacing existing models, just as this pattern is useful for other software deployments.

### Flexible Deployment Patterns

There are unique deployment patterns, too. For example, it should be easy to deploy a forest of models, to split traffic to different instances, and to score data in batches for greater efficiency.
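
The traffic-splitting idea can be illustrated without any serving framework. The following is a minimal sketch, not Ray Serve's implementation: `model_v1`, `model_v2`, and `route` are hypothetical names introduced only for this illustration, and the 90/10 split is an assumed canary ratio.

```python
import random
from collections import Counter

def model_v1(x: float) -> dict:
    """Hypothetical current model."""
    return {"version": "v1", "score": 2.0 * x}

def model_v2(x: float) -> dict:
    """Hypothetical candidate model that receives a small share of traffic."""
    return {"version": "v2", "score": 2.0 * x + 0.1}

def route(x: float, weights: dict, rng: random.Random) -> dict:
    """Pick one model at random, in proportion to its weight, and score x."""
    models = list(weights)
    chosen = rng.choices(models, weights=[weights[m] for m in models], k=1)[0]
    return chosen(x)

rng = random.Random(0)                    # seeded so runs are repeatable
split = {model_v1: 0.9, model_v2: 0.1}    # assumed 90/10 canary split
counts = Counter(route(1.0, split, rng)["version"] for _ in range(1000))
print(counts)
```

Because routing is decided per request, the split can be adjusted by changing the weights while both versions keep running.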

See also this [Ray blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) on the challenges of model serving and the way Ray Serve addresses them. It also provides an example of starting with a simple model, then deploying a more sophisticated model into the running application. Along the same lines, the blog post [Serving ML Models in Production: Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) discusses deployment patterns for model serving and how you can implement them with Ray Serve. Additionally, watch the webinar [Building a scalable ML model serving API with Ray Serve](https://www.anyscale.com/events/2021/09/09/building-a-scalable-ml-model-serving-api-with-ray-serve), an introduction that highlights how Ray Serve makes it easy to deploy, operate, and scale a machine learning API.

## Why Ray Serve?

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable, framework-agnostic and Python-first model serving library built on [Ray](https://ray.io).

For users, Ray Serve offers these benefits:

* **Framework Agnostic**: You can use the same toolkit to serve everything from deep learning models built with [PyTorch](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial), [TensorFlow](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), or [Keras](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), to [scikit-learn](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial) models, to arbitrary business logic.
* **Python First:** Configure your model serving with pure Python code. No YAML or JSON configurations required.

As a library, Ray Serve enables the following:

* [Splitting traffic between backends dynamically](https://docs.ray.io/en/latest/serve/advanced.html#serve-split-traffic) with zero downtime. This is accomplished by decoupling routing logic from response handling logic.
* [Support for batching](https://docs.ray.io/en/latest/serve/advanced.html#serve-batching) to improve throughput and help you meet your performance objectives. You can also use the same model for batch and online processing.
* Because Serve is a library, it's easy to integrate it with other tools in your environment, such as CI/CD.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.
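
To make the batching benefit listed above concrete, here is a minimal sketch of scoring in batches. It is plain Python, not the Ray Serve batching API: `score_batch` stands in for a hypothetical vectorized model, and `score_all` is a hypothetical helper that counts how many model calls were needed.

```python
from typing import List, Tuple

def score_batch(batch: List[float]) -> List[float]:
    """Hypothetical vectorized model: scores a whole batch in one call."""
    return [0.5 * x + 1.0 for x in batch]

def score_all(items: List[float], batch_size: int) -> Tuple[List[float], int]:
    """Score items in chunks of batch_size; return the scores and the call count."""
    results: List[float] = []
    calls = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(score_batch(batch))
        calls += 1
    return results, calls

scores, calls = score_all([0.0, 2.0, 4.0, 6.0, 8.0], batch_size=2)
print(scores, calls)
```

Five inputs with a batch size of two need three model calls instead of five, which is where the efficiency gain comes from when each call carries a fixed overhead.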

## Ray Serve Architecture and components

<img src="images/architecture.png" height="50%" width="60%">

There are three kinds of actors that are created to make up a Serve instance:

**Controller**: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller.

**Router**: There is one router per node. Each router is a Uvicorn HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed.

**Worker Replica**: Worker replicas actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests from the routers (they may be batched by the replica using `@serve.batch`, see the [batching docs](https://docs.ray.io/en/latest/serve/ml-models.html#serve-batching)).

For more details, see the [key concepts](https://docs.ray.io/en/latest/serve/index.html) and [architecture](https://docs.ray.io/en/latest/serve/architecture.html) documentation.

### Lifetime of a Request

When an HTTP request is sent to the router, the following things happen:

 * The HTTP request is received and parsed.

 * The correct deployment associated with the HTTP URL path is looked up. The request is placed on a queue.

 * For each request in a deployment queue, an available replica is looked up and the request is sent to it. If there are no available replicas (there are more than `max_concurrent_queries` requests outstanding), the request is left in the queue until an outstanding request is finished.

Each replica maintains a queue of requests and executes one at a time, possibly using asyncio to process them concurrently. If the handler (the function for the deployment or `__call__`) is `async`, the replica will not wait for the handler to run; otherwise, the replica will block until the handler returns.
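
The cap on outstanding requests described above can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not Ray Serve's router code: the `Replica` class, the cap of 2, and the 10 ms sleep that stands in for model inference are all hypothetical.

```python
import asyncio

MAX_CONCURRENT_QUERIES = 2  # assumed cap, chosen only for this illustration

class Replica:
    """Hypothetical replica that accepts a limited number of in-flight requests."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.slots = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def handle(self, request: int) -> str:
        # Requests beyond the cap wait here until a slot frees up.
        async with self.slots:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            await asyncio.sleep(0.01)  # stand-in for model inference
            self.in_flight -= 1
            return f"{self.name} handled {request}"

async def main():
    replica = Replica("replica-0", MAX_CONCURRENT_QUERIES)
    responses = await asyncio.gather(*(replica.handle(i) for i in range(6)))
    return replica.peak_in_flight, responses

peak, responses = asyncio.run(main())
print(peak, len(responses))
```

All six requests complete, but no more than two are in flight at once. Inside a notebook, where an event loop is already running, `await main()` takes the place of `asyncio.run(main())`.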



## Two Simple Ray Serve Examples

We'll explore a more detailed example in the next lesson, where we actually serve ML models. Here we explore two simple deployments to show how easy they are with Ray Serve. We will first use a function that does "scoring," which is sufficient for _stateless_ scenarios, and then use a class, which enables _stateful_ scenarios.

But first, import Ray and Ray Serve as before:

In [23]:
import ray
from ray import serve

import requests  # for making web requests

Now we initialize Ray Serve itself. Note that we did not have to start a Ray cluster explicitly. If one is not running, `serve.start()` automatically launches a Ray cluster; otherwise, it connects to the existing instance.

In [24]:
serve.start()

[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:38:07,073	INFO checkpoint_path.py:16 -- Using RayInternalKVStore for controller checkpoint and recovery.
[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:38:07,179	INFO http_state.py:98 -- Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:XLakOE:SERVE_PROXY_ACTOR-node:127.0.0.1-0' on node 'node:127.0.0.1-0' listening on '127.0.0.1:8000'
2022-03-31 13:38:07,482	INFO api.py:521 -- Started Serve instance in namespace 'serve'.


<ray.serve.api.Client at 0x7f9088ab6fa0>

[2m[36m(HTTPProxyActor pid=99435)[0m INFO:     Started server process [99435]


Next, we define a stateless function for processing requests. As with Ray tasks, we decorate the function, here with `@serve.deployment`, which means it will be deployed on Ray Serve as a function to which we can send HTTP requests.

It takes in a Starlette `request`, extracts the query parameter with the key `name`, and returns an echoed string. This simple example illustrates that Ray Serve can also serve plain Python functions.

### Create a Python function deployment 

In [25]:
@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"

Use the `<func_name>.deploy()` method to deploy it on Ray Serve.

### Deploy a Python function for serving

In [26]:
hello.deploy()

2022-03-31 13:38:18,486	INFO api.py:262 -- Updating deployment 'hello'. component=serve deployment=hello
[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:38:18,537	INFO deployment_state.py:920 -- Adding 1 replicas to deployment 'hello'. component=serve deployment=hello
2022-03-31 13:38:18,903	INFO api.py:274 -- Deployment 'hello' is ready at `http://127.0.0.1:8000/hello`. component=serve deployment=hello


### Send some requests to our Python function

In [27]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/hello?name=request_{i}").text
    print(f'{i:2d}: {response}')

 0: Hello request_0!
 1: Hello request_1!
 2: Hello request_2!
 3: Hello request_3!
 4: Hello request_4!
 5: Hello request_5!
 6: Hello request_6!
 7: Hello request_7!
 8: Hello request_8!
 9: Hello request_9!


You should see `Hello request_N!` for each request in the output.
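
The `name` value reaches the function through the URL query string, so values that contain spaces or other special characters must be URL-encoded by the client. The sketch below uses only the standard library to show the encoding round trip; the name `Ada Lovelace` is an arbitrary example value, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://127.0.0.1:8000/hello"

# Encode the query parameters so that spaces and special characters are safe.
url = f"{base_url}?{urlencode({'name': 'Ada Lovelace'})}"
print(url)

# Decoding the query string recovers the original value.
decoded = parse_qs(urlsplit(url).query)
print(decoded["name"][0])
```

Passing a `params` dictionary to `requests.get` performs this encoding automatically.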

Now let's serve another "model" in the same service:

In [44]:
from random import random

from starlette.requests import Request

@serve.deployment
class SimpleModel:
    def __init__(self):
        # Toy "model" parameters; the replica keeps them as state between requests.
        self.weight = 0.5
        self.bias = 1
        self.prediction = 0.0

    def __call__(self, starlette_request):
        if isinstance(starlette_request, Request):
            # Request came over HTTP: read the `data` query parameter.
            data = starlette_request.query_params['data']
        else:
            # Request came via a ServeHandle API method call.
            data = starlette_request
        # Scale the input by the weight and a random factor, then add the bias.
        self.prediction = float(data) * self.weight * random() + self.bias
        return {"prediction": self.prediction}

In [45]:
SimpleModel.deploy()

2022-03-31 13:57:46,991	INFO api.py:262 -- Updating deployment 'SimpleModel'. component=serve deployment=SimpleModel
[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:57:47,010	INFO deployment_state.py:882 -- Stopping 1 replicas of deployment 'SimpleModel' with outdated versions. component=serve deployment=SimpleModel
[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:57:49,174	INFO deployment_state.py:920 -- Adding 1 replicas to deployment 'SimpleModel'. component=serve deployment=SimpleModel
2022-03-31 13:57:49,504	INFO api.py:274 -- Deployment 'SimpleModel' is ready at `http://127.0.0.1:8000/SimpleModel`. component=serve deployment=SimpleModel
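
The arithmetic inside `SimpleModel.__call__` can be checked without a running Serve instance. The sketch below is a plain-Python copy of that arithmetic under a hypothetical name, `SimpleModelLogic`, with a seeded random generator added (an assumption, for repeatability). Since `random()` returns a value in `[0, 1)`, each prediction for a non-negative input lies between `bias` and `bias + data * weight`.

```python
from random import Random

class SimpleModelLogic:
    """Plain-Python copy of the scoring arithmetic, without the Serve decorator."""

    def __init__(self, seed: int = 0):
        self.weight = 0.5
        self.bias = 1
        self.rng = Random(seed)  # seeded so results are repeatable

    def predict(self, data) -> float:
        # Query parameters arrive as strings, hence the float() conversion.
        return float(data) * self.weight * self.rng.random() + self.bias

model = SimpleModelLogic()
predictions = [model.predict("0.8") for _ in range(5)]

# With data = 0.8, every prediction falls between 1.0 and 1.4.
print(all(1.0 <= p <= 1.4 for p in predictions))
```

The same bounds explain why predictions for inputs drawn from `random()` always fall between 1.0 and 1.5.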


### List Deployments

In [46]:
serve.list_deployments()

{'hello': Deployment(name=hello,version=None,route_prefix=/hello),
 'SimpleModel': Deployment(name=SimpleModel,version=None,route_prefix=/SimpleModel)}

### Send some requests to our Model

In [47]:
url = "http://127.0.0.1:8000/SimpleModel"
for i in range(5):
    print(f"prediction  : {requests.get(url, params={'data': random()}).text}")

prediction  : {
  "prediction": 1.345999158500957
}
prediction  : {
  "prediction": 1.4281992530664707
}
prediction  : {
  "prediction": 1.2471813082870054
}
prediction  : {
  "prediction": 1.1081882081506653
}
prediction  : {
  "prediction": 1.3262035646085715
}

In [48]:
serve.shutdown()

[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:58:02,296	INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'hello'. component=serve deployment=hello
[2m[36m(ServeController pid=99442)[0m 2022-03-31 13:58:02,299	INFO deployment_state.py:940 -- Removing 1 replicas from deployment 'SimpleModel'. component=serve deployment=SimpleModel


## Exercise - Try Adding more examples

Here are some things you can try:

1. Add a couple of functions, deploy them, and send requests.
2. Add a couple of stateful classes, deploy them, and send requests.
3. Send a dictionary of different data points to the SimpleModel and modify the prediction result to use these data points.