# Ray Serve - Model Serving Challenges

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademyLogo.png)

## The Challenges of Model Serving

Model development happens in a data science research environment. There are many challenges, but also many tools at the data scientist's disposal.

Model deployment to production faces an entirely different set of challenges and requires different tools, although it is desirable to bridge the divide as much as possible.

Here is a partial list of the challenges of model serving:

### It Should Be Framework Agnostic

Model serving frameworks must be able to serve models from popular systems like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks. 

Also, machine learning models are typically surrounded by lots of application or business logic. For example, some model serving is implemented as a RESTful service to which scoring requests are made. Often this is too restrictive: the scoring process may need additional processing, such as fetching more data from an online feature store, and the performance overhead of remote calls may be unacceptable.
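
To make the idea concrete, here is a minimal sketch of scoring with business logic wrapped around a model, all in one Python process. It is illustrative only: `FEATURE_STORE`, `fetch_features`, `threshold_model`, and `score` are hypothetical names, and the in-memory dictionary merely stands in for an online feature store.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for an online feature store.
FEATURE_STORE: Dict[str, List[float]] = {
    "user_1": [0.2, 0.7],
    "user_2": [0.9, 0.4],
}


def fetch_features(user_id: str) -> List[float]:
    """Business logic around the model: look up extra features for a request."""
    return FEATURE_STORE.get(user_id, [0.0, 0.0])


def threshold_model(features: List[float]) -> int:
    """A toy 'model'. Any callable with this shape would do, whatever framework built it."""
    return int(sum(features) / len(features) > 0.5)


def score(user_id: str, model: Callable[[List[float]], int]) -> dict:
    """Enrich the request with stored features, then call the model in-process."""
    features = fetch_features(user_id)
    return {"user_id": user_id, "prediction": model(features)}


print(score("user_1", threshold_model))  # prediction is 0 (mean 0.45)
print(score("user_2", threshold_model))  # prediction is 1 (mean 0.65)
```

Because the model is just a callable, swapping in a model from another framework would only require an adapter function with the same signature.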

### Pure Python

It has been common recently for model serving to be done using JVM-based systems, since many production enterprises are JVM-based. This is a disadvantage when model training and other data processing are done using only Python tools.

In general, model serving should be intuitive for developers and simple to configure and run. Hence, it is desirable to use pure Python and to avoid verbose configurations using YAML files or other means. 

Data scientists and engineers use Python to develop their machine learning models, so they should also be able to use Python to deploy their machine learning applications. This need is growing more critical as online learning applications combine training and serving in the same applications.

### Simple and Scalable

Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically, over time. Meeting production uptime and performance requirements is essential for success.

### DevOps Integrations

Model serving deployments need to integrate with existing "DevOps" CI/CD practices for controlled, audited, and predictable releases. Patterns like [Canary Deployment](https://martinfowler.com/bliki/CanaryRelease.html) are particularly useful for testing the efficacy of a new model before replacing existing models, just as this pattern is useful for other software deployments.
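
The core of a canary release is weighted routing: most requests go to the current model while a small fraction goes to the new one. The sketch below shows that idea in plain Python. It is not Ray Serve's API; `current_model`, `canary_model`, `route`, and the 90/10 split are assumptions chosen for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple


def current_model(x: float) -> str:
    """Hypothetical model currently in production."""
    return "v1"


def canary_model(x: float) -> str:
    """Hypothetical new model under evaluation."""
    return "v2"


def route(x: float,
          weighted_models: List[Tuple[Callable[[float], str], float]],
          rng: random.Random) -> str:
    """Pick one model at random, in proportion to its weight, and call it."""
    models = [model for model, _ in weighted_models]
    weights = [weight for _, weight in weighted_models]
    chosen = rng.choices(models, weights=weights, k=1)[0]
    return chosen(x)


rng = random.Random(42)  # seeded so the run is repeatable
split = [(current_model, 0.9), (canary_model, 0.1)]

counts: Dict[str, int] = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[route(1.0, split, rng)] += 1

print(counts)  # roughly 90% "v1" and 10% "v2"
```

Raising the canary weight step by step, while watching its error rate and prediction quality, is the usual way to promote the new model.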

### Flexible Deployment Patterns

There are unique deployment patterns, too. For example, it should be easy to deploy a forest of models, to split traffic to different instances, and to score data in batches for greater efficiency.
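
As a rough illustration of batch scoring, the sketch below groups incoming items into fixed-size batches and scores each batch with a single model call. The names `batched` and `score_batch`, and the batch size of 2, are hypothetical. A real serving system would typically also bound how long a request waits for its batch to fill.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Yield lists of up to `batch_size` items, preserving order."""
    batch: List[float] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def score_batch(batch: List[float]) -> List[float]:
    """Toy model that scores a whole batch in one call."""
    return [2.0 * x for x in batch]


incoming = [1.0, 2.0, 3.0, 4.0, 5.0]
results = [y for batch in batched(incoming, 2) for y in score_batch(batch)]
print(results)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Batching pays off when the per-call overhead of the model is high compared with the per-item cost, which is common for vectorized and GPU-backed models.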

See also this [Ray blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) on the challenges of model serving and the way Ray Serve addresses them. It also provides an example of starting with a simple model, then deploying a more sophisticated model into the running application. Along the same lines, the blog post [Serving ML Models in Production: Common Patterns](https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns) discusses deployment patterns for model serving and how you can use Ray Serve for them. Additionally, the introductory webinar [Building a scalable ML model serving API with Ray Serve](https://www.anyscale.com/events/2021/09/09/building-a-scalable-ml-model-serving-api-with-ray-serve) highlights how Ray Serve makes it easy to deploy, operate, and scale a machine learning API.

## Why Ray Serve?

[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) is a scalable, framework-agnostic and Python-first model serving library built on [Ray](https://ray.io).

For users, Ray Serve offers these benefits:

* **Framework Agnostic**: You can use the same toolkit to serve everything from deep learning models built with [PyTorch](https://docs.ray.io/en/latest/serve/tutorials/pytorch.html#serve-pytorch-tutorial), [TensorFlow](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), or [Keras](https://docs.ray.io/en/latest/serve/tutorials/tensorflow.html#serve-tensorflow-tutorial), to [scikit-learn](https://docs.ray.io/en/latest/serve/tutorials/sklearn.html#serve-sklearn-tutorial) models, to arbitrary business logic.
* **Python First:** Configure your model serving with pure Python code. No YAML or JSON configurations required.

As a library, Ray Serve enables the following:

* [Splitting traffic between backends dynamically](https://docs.ray.io/en/latest/serve/advanced.html#serve-split-traffic) with zero downtime. This is accomplished by decoupling routing logic from response handling logic.
* [Support for batching](https://docs.ray.io/en/latest/serve/advanced.html#serve-batching), which improves throughput and helps you meet your performance objectives. You can also use the same model for both batch and online processing.
* Because Serve is a library, it's easy to integrate it with other tools in your environment, such as CI/CD.

Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.

## Two Simple Ray Serve Examples

We'll explore a more detailed example in the next lesson, where we actually serve ML models. Here we explore how simple deployments are with Ray Serve! We will first use a function that does "scoring," which is sufficient for _stateless_ scenarios, then use a class, which enables _stateful_ scenarios.

But first, import Ray and Ray Serve, along with `requests` for making HTTP calls:

In [None]:
import ray
from ray import serve

import requests  # for making web requests

Now we initialize Serve itself:

In [None]:
serve.start()

Next, let's define a simple, stateless function for Ray Serve to serve. Much as with Ray tasks, we decorate the function, this time with `@serve.deployment`, which means it will be deployed on Ray Serve as a function to which we can send HTTP requests.

The function takes a `request`, extracts the query parameter with the key `name`, and returns a greeting that echoes it. It is deliberately simple, to illustrate that Ray Serve can serve plain Python functions as well as models.

In [None]:
@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"

Use the `<func_name>.deploy()` method to deploy it on Ray Serve:

In [None]:
hello.deploy()

Now send some requests to our Python function:

In [None]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/hello?name=request_{i}").text
    print(f'{i:2d}: {response}')

You should see `Hello request_N!` in the output, where `N` is the request number.
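
If you want to check the handler logic without starting Serve or sending HTTP traffic, one option is to call an undecorated copy of the function with a stand-in request. This is only a sketch: `hello_logic` and `fake_request` are hypothetical names, and the `SimpleNamespace` object mimics only the `query_params` attribute that the function reads.

```python
from types import SimpleNamespace


def hello_logic(request):
    """Same body as the `hello` deployment, without the Serve decorator."""
    name = request.query_params["name"]
    return f"Hello {name}!"


# Stand-in for an HTTP request; only `query_params` is provided.
fake_request = SimpleNamespace(query_params={"name": "request_0"})

print(hello_logic(fake_request))  # Hello request_0!
```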

Now let's serve another "model" in the same service:

In [None]:
@serve.deployment
class Counter:
    def __init__(self):
        self.count = 0

    def __call__(self, *args):
        self.count += 1
        return {"count": self.count}

In [None]:
Counter.deploy()

In [None]:
for i in range(10):
    response = requests.get(f"http://127.0.0.1:8000/Counter?i={i}").json()
    print(f'{i:2d}: {response}')
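
The class-based deployment is _stateful_ because `self.count` lives on the instance between calls. The plain-Python sketch below shows the same behavior without Serve. `PlainCounter` is a hypothetical, undecorated copy of the class. Assuming a single replica handles every request, the counts returned by the deployment should climb in the same way.

```python
class PlainCounter:
    """Undecorated copy of the Counter class, for illustration only."""

    def __init__(self):
        self.count = 0

    def __call__(self, *args):
        # State persists on the instance across calls.
        self.count += 1
        return {"count": self.count}


counter = PlainCounter()
responses = [counter() for _ in range(3)]
print(responses)  # [{'count': 1}, {'count': 2}, {'count': 3}]
```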

## Ray Serve Concepts

For more details, see this [key concepts](https://docs.ray.io/en/latest/serve/index.html) documentation.

In [None]:
serve.list_deployments()

In [None]:
serve.shutdown()