# Deploying your first model

[Ray Serve](https://docs.ray.io/en/master/serve/) is a library for scalable and programmable model serving.
It aims to address some of the major challenges found in model serving:

- **Framework-agnostic:** Model serving frameworks must be able to serve models from popular systems like TensorFlow, PyTorch, scikit-learn, or even arbitrary Python functions. Even within the same organization, it is common to use several machine learning frameworks.
- **Supports application logic:** Machine learning models are typically surrounded by lots of application logic. In our application, this will come up when we decide which types of movie recommendations to serve to a particular user based on information about what that user has selected before.
- **Python-first:** Configure your model serving with pure Python code instead of YAML or JSON configs.
- **Simple and scalable:** Model serving must be simple to scale on demand across many machines. It must also be easy to upgrade models dynamically over time. Meeting production uptime and performance requirements is essential for success.
- **Flexible deployment patterns:** Ray Serve makes it easy to deploy a forest of models and to split traffic to different instances.

See this [blog post](https://medium.com/distributed-computing-with-ray/the-simplest-way-to-serve-your-nlp-model-in-production-with-pure-python-d42b6a97ad55) and the [docs](https://docs.ray.io/en/master/serve/) for more background on Ray Serve!

In this notebook, we'll deploy our first models using Ray Serve.
We'll deploy one model that serves movie recommendations based on the movie cover's color palette and another that serves movie recommendations based on the movie's plot.

## If you didn't finish notebook 1:

Run the next cell to finish filling out the missing movie palettes in the database.

In [None]:
%%bash
# Fill out the missing movie palettes in the database, in case you haven't finished notebook 1.
bash run_1.sh

Now, we can connect to the cluster with `ray.init`.
This is the same as in the previous notebook: we pass the argument `address="auto"` to connect to an existing cluster that is running on the local machine.

In [None]:
import ray

ray.init(address="auto", ignore_reinit_error=True)

Next, we'll start Serve. This starts an HTTP server (at `http://localhost:8000` by default) that does not yet route requests anywhere.
After this cell, we'll have something that looks like the diagram below, minus the "Endpoint" and "Backend" boxes.

In [None]:
from ray import serve

try:
    client = serve.start(detached=True)
except Exception:
    # Serve is already running (e.g., this cell was run before), so connect to it instead.
    client = serve.connect()

## Backends and endpoints

Ray Serve has two key concepts, *backends* and *endpoints*.
Backends define the implementation of your business logic or models that will handle requests, and *endpoints* define how user requests should be routed to the various backends.

Each backend can have many replicas, which are individual processes running in the Ray cluster to handle requests.
To define a backend, first you must define the “handler” or the business logic you’d like to respond with.
The handler should take as input a Flask Request object and return any JSON-serializable object as output.
The implementation can be defined as either a function or a class. Use a function when your response is stateless and a class when you might need to maintain some state (like a model).
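The two handler shapes can be sketched without Ray at all. In this minimal sketch, `FakeRequest` is a hypothetical stand-in that models only the `.args` mapping of the Flask request that Serve passes in, and `echo_handler` and `CountingHandler` are illustrative names, not part of Serve or `util`.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the Flask request object Serve passes to a handler;
# only the `.args` mapping of query parameters is modeled.
@dataclass
class FakeRequest:
    args: dict = field(default_factory=dict)


# Function handler: stateless, all work happens per request.
def echo_handler(request):
    return {"liked_id": request.args["liked_id"]}


# Class handler: state is created once in __init__ and reused by every
# call to __call__.
class CountingHandler:
    def __init__(self):
        self.num_requests = 0  # state kept between requests

    def __call__(self, request):
        self.num_requests += 1
        return {"liked_id": request.args["liked_id"], "seen": self.num_requests}
```

Calling a `CountingHandler` instance twice returns `"seen": 1` and then `"seen": 2`. Because each replica is a separate process, each replica of a class-based backend would hold its own copy of state like `num_requests`.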

An endpoint is used to expose a backend to HTTP.
Each endpoint can have one or multiple backends that serve requests; in our case, we'll use one backend per endpoint.
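Conceptually, routing is a two-step lookup: from HTTP route to endpoint, then from endpoint to one of its backends. The sketch below illustrates that idea with plain dictionaries and a weighted random choice. The `ROUTES` and `TRAFFIC` tables and `pick_backend` are hypothetical; this is an illustration, not Serve's internal implementation.

```python
import random

# Hypothetical routing tables: route -> endpoint, and endpoint -> {backend: weight}.
# As in this notebook, each endpoint has a single backend.
ROUTES = {"/rec/color": "color", "/rec/plot": "plot"}
TRAFFIC = {
    "color": {"color:v1": 1.0},
    "plot": {"plot:v1": 1.0},
}


def pick_backend(route, rng=random):
    """Return the backend tag that should handle a request to `route`."""
    endpoint = ROUTES[route]
    backends = list(TRAFFIC[endpoint])
    weights = [TRAFFIC[endpoint][b] for b in backends]
    # choices() draws one backend with probability proportional to its weight.
    return rng.choices(backends, weights=weights, k=1)[0]
```

With `TRAFFIC["plot"] = {"plot:v1": 0.9, "plot:v2": 0.1}`, roughly 10% of `/rec/plot` requests would go to a hypothetical `plot:v2` backend, which is the idea behind splitting traffic between backends.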

In this notebook, we'll create one backend and endpoint each for the color-based recommender and the plot-based recommender.
By the end of this notebook, we'll have a system that looks something like this:

![](serve-notebook-2.jpg "Ray Serve diagram")


## Creating a backend and an endpoint

First, we'll define a *backend* for the server.
Our first backend will be a *stateful* class that serves movie recommendations based on a movie cover's color palette.
Each request to the backend will include the ID of a movie that the user liked (`liked_id`).
The backend will use k-nearest neighbors to determine the movies closest to the user's selected movie.
Since it would be expensive to have to reload the index on each request, we'll use a *stateful* backend to keep the index in memory between requests.
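The nearest-neighbor lookup itself can be sketched as a brute-force search. This is a hypothetical stand-in for `util.KNearestNeighborIndex.search`, whose implementation and distance metric are not shown here; it assumes each movie's palette is flattened into an equal-length list of numbers and ranks movies by Euclidean distance.

```python
import math


def knn_search(vectors, liked_id, k):
    """Brute-force k-nearest-neighbor search (hypothetical sketch).

    vectors: dict mapping movie ID -> equal-length list of floats
             (for example, a flattened color palette).
    Returns up to k IDs ordered from closest to farthest.
    """
    target = vectors[liked_id]
    # Sort every movie by its Euclidean distance to the liked movie.
    ranked = sorted(vectors, key=lambda movie_id: math.dist(vectors[movie_id], target))
    return ranked[:k]
```

The liked movie is at distance 0 from itself, so in this sketch it is always among the results.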

We'll specify the logic that should get run on the server in the `__call__` method.
Serve will pass each Flask request that gets routed to this backend as an argument to this method.
That way, we can access any user arguments that are passed in the request, such as parameters in an HTTP `GET` request.
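Query parameters always arrive as strings, which is why `count` is converted with `int(...)` and given a default of 6. The sketch below reproduces that parsing with the standard library; `parse_rec_args` is a hypothetical helper, and the plain dict it builds stands in for Flask's `request.args`.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_rec_args(url, default_count=6):
    """Extract the recommender's parameters from a GET URL, mirroring
    request.args["liked_id"] and int(request.args.get("count", 6))."""
    args = dict(parse_qsl(urlsplit(url).query))
    liked_id = args["liked_id"]  # required: raises KeyError if missing
    num_returns = int(args.get("count", default_count))  # values are strings
    return liked_id, num_returns
```

For example, `parse_rec_args("http://localhost:8000/rec/color?liked_id=123")` returns `("123", 6)`.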

In [None]:
from util import get_db_connection, KNearestNeighborIndex


class ColorRecommender:
    def __init__(self):
        self.db = get_db_connection()

        # Create index of cover image colors.
        colors = self.db.execute("SELECT id, palette_json FROM movies")
        self.color_index = KNearestNeighborIndex(colors)

    def __call__(self, request):
        liked_id = request.args["liked_id"]
        num_returns = int(request.args.get("count", 6))

        # Perform KNN search for similar images.
        recommended_ids = self.color_index.search(liked_id, num_returns)

        # Look up the titles of the recommended movies.
        titles_and_ids = self.db.execute(
            f"SELECT title, id FROM movies WHERE id in ({','.join(recommended_ids)})"
        ).fetchall()

        # Convert the rows into JSON-serializable dicts.
        return [{
            "id": movie_id,
            "title": title
        } for title, movie_id in titles_and_ids]
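The post-processing query above builds its `IN (...)` list by joining the IDs into the SQL string, and SQL does not guarantee that the matching rows come back in the order of `recommended_ids`. A variant that binds the IDs as query parameters and restores the ranking order might look like this. It assumes `get_db_connection()` returns a `sqlite3` connection (the `db.execute(...).fetchall()` calls are consistent with that, but `util` is not shown) and that the IDs have the same Python type as the stored `id` column; `titles_for_ids` is a hypothetical helper.

```python
import sqlite3


def titles_for_ids(db, recommended_ids):
    """Return [{"id": ..., "title": ...}] for recommended_ids, in ranking order.

    Assumes a sqlite3 connection with a movies(id, title) table.
    """
    if not recommended_ids:
        return []
    # One "?" placeholder per ID; sqlite3 binds the values itself, so the
    # IDs are never joined into the SQL text.
    placeholders = ",".join("?" for _ in recommended_ids)
    rows = db.execute(
        f"SELECT id, title FROM movies WHERE id IN ({placeholders})",
        list(recommended_ids),
    ).fetchall()
    # IN (...) does not guarantee row order, so restore the KNN ranking.
    title_by_id = dict(rows)
    return [
        {"id": movie_id, "title": title_by_id[movie_id]}
        for movie_id in recommended_ids
        if movie_id in title_by_id
    ]


# Usage with a throwaway in-memory database:
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE movies (id TEXT PRIMARY KEY, title TEXT)")
demo_db.executemany("INSERT INTO movies VALUES (?, ?)", [("1", "A"), ("2", "B"), ("3", "C")])
print(titles_for_ids(demo_db, ["3", "1"]))
# [{'id': '3', 'title': 'C'}, {'id': '1', 'title': 'A'}]
```

IDs with no matching row are skipped rather than raising an error.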

Next, we'll create an instance of the backend and define an *endpoint* that exposes it to HTTP.
This will tell Serve which traffic should go to the `ColorRecommender` instance.

The `create_backend` call gives Serve a name for the backend (`"color:v1"`) and the class or function that contains the logic that we want to run.
The `create_endpoint` call gives Serve a name for the endpoint, the backend that we want to use to serve requests, and the HTTP route.

In [None]:
# Instantiate the backend. This will create an instance of ColorRecommender.
client.create_backend(backend_tag="color:v1", func_or_class=ColorRecommender)
# Create an endpoint. This will route GET requests to /rec/color to the ColorRecommender backend.
client.create_endpoint(endpoint_name="color", backend="color:v1", route="/rec/color")

## Sending requests

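Let's try sending a request to the new endpoint.
You can also try this in a browser by visiting `http://localhost:8000/rec/color?liked_id=<movie ID>`, where `<movie ID>` is one of the IDs in `MOVIE_IDS`.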
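A request to the endpoint is just a URL with the parameters in its query string, which is what `requests.get(..., params=...)` builds for you. This sketch uses a hypothetical `rec_url` helper and assumes Serve is listening on `http://localhost:8000`, the address used throughout this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # address of Serve's HTTP server in this notebook


def rec_url(route, liked_id, count=None):
    """Build the GET URL for a recommender route, e.g. to paste into a browser."""
    params = {"liked_id": liked_id}
    if count is not None:
        params["count"] = count  # optional; the backend defaults to 6
    return f"{BASE_URL}{route}?{urlencode(params)}"
```

For example, `rec_url("/rec/color", MOVIE_IDS[0], count=3)` gives a URL that asks for three recommendations.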

In [None]:
from util import MOVIE_IDS
import requests


def send_color_request(movie_id):
    r = requests.get("http://localhost:8000/rec/color", params={"liked_id": movie_id})

    if r.status_code == 200:
        return r.json()
    print(r.text)

send_color_request(MOVIE_IDS[0])

We can also send requests to the endpoint directly from Python, without going through HTTP.
First, we use our Serve client to get a handle to the endpoint, using the endpoint name that we passed to the `create_endpoint` call.
Calling `.remote()` on the handle returns an `ObjectRef` right away, just like the Ray tasks in notebook 1, and `ray.get` retrieves the result.

Instead of passing the arguments as HTTP query parameters, we pass them as keyword arguments to the `.remote()` call.

In [None]:
color_handle = client.get_handle(endpoint_name="color")
ray.get(color_handle.remote(liked_id=MOVIE_IDS[0]))

## Exercise: Create another endpoint

There are lots of ways to provide movie recommendations!
Let's create a second endpoint that provides recommendations based on the movie plot.
We'll deploy a BERT NLP model that has been fine-tuned to determine similarity between movie plot descriptions.

Here's a stateless version of the backend to get you started.
The code loads plot vectors, computed offline, for every movie in our database.
Then, like `ColorRecommender`, it finds the k nearest neighbors of the movie the user liked.
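The `plot_vector_json` column name suggests each plot vector is stored as JSON. Assuming each one is a JSON array of floats, the sketch below parses `(id, vector_json)` rows, like the ones the `SELECT` returns, into vectors and ranks movies by cosine similarity, a common choice for comparing text embeddings. The helpers are hypothetical, and `KNearestNeighborIndex` may use a different metric.

```python
import json
import math


def load_vectors(rows):
    """Turn (id, vector_json) rows into {id: list_of_floats}.

    Assumes each vector is stored as a JSON array such as "[0.1, -0.2, 0.3]".
    """
    return {movie_id: json.loads(vector_json) for movie_id, vector_json in rows}


def cosine_similarity(u, v):
    # Dot product divided by the product of vector lengths
    # (undefined for all-zero vectors).
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(vectors, liked_id, k):
    """Return up to k IDs ranked by cosine similarity to liked_id, most similar first."""
    target = vectors[liked_id]
    ranked = sorted(vectors, key=lambda m: cosine_similarity(vectors[m], target), reverse=True)
    return ranked[:k]
```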

In [None]:
from util import get_db_connection, KNearestNeighborIndex

def recommend_by_plot(request):
    db = get_db_connection()

    bert_vectors = db.execute(
        "SELECT id, plot_vector_json FROM movies")
    index = KNearestNeighborIndex(bert_vectors)

    # Find k nearest movies with similar plots.
    liked_id = request.args["liked_id"]
    num_returns = int(request.args.get("count", 6))
    recommended_movie_ids = index.search(liked_id, num_returns)

    # Look up the titles of the recommended movies.
    titles_and_ids = db.execute(
        f"SELECT title, id FROM movies WHERE id in ({','.join(recommended_movie_ids)})"
    ).fetchall()

    # Convert the rows into JSON-serializable dicts.
    return [{
        "id": movie_id,
        "title": title
    } for title, movie_id in titles_and_ids]

We could deploy this model as a stateless function, which Serve invokes on each request.
The problem is that every request would then reload all of the movie plot vectors from the database and rebuild the index, which wastes a lot of time.
To see the cost, let's deploy the stateless backend and time a request.

In [None]:
def send_plot_request(movie_id):
    r = requests.get("http://localhost:8000/rec/plot", params={"liked_id": movie_id})

    if r.status_code == 200:
        return r.json()
    print(r.text)


# Instantiate the backend. This is the same as the ColorRecommender,
# except that we're deploying a stateless function.
client.create_backend(backend_tag="plot:v1", func_or_class=recommend_by_plot)
# Create an endpoint. This will route GET requests to /rec/plot to the recommend_by_plot function.
client.create_endpoint(endpoint_name="plot", backend="plot:v1", route="/rec/plot")

%timeit send_plot_request(MOVIE_IDS[0])

Let's try that again, but this time with a stateful backend!

**Task:** Convert `recommend_by_plot` into a stateful `PlotRecommender` backend.
1. Copy the code from the `recommend_by_plot` function to fill out the `PlotRecommender` class skeleton below. Make sure to load any state that should be reused between requests in the `__init__` method.
2. Test it by running the cell that deletes the stateless backend and deploys `PlotRecommender` (the one after the reference implementation; running the reference cell would overwrite your class). You should see the same movie results as with the `recommend_by_plot` backend, but the time per request should be much lower.

**Tip:** Use the `ColorRecommender` structure as a reference.

**If you haven't finished but want to move on:** We've included a reference implementation of `PlotRecommender` in the next cell. Show the code by clicking the "..." and evaluate it.

In [None]:
class PlotRecommender:
    def __init__(self):
        pass

    def __call__(self, request):
        return []

In [None]:
class PlotRecommender:
    def __init__(self):
        self.db = get_db_connection()

        bert_vectors = self.db.execute(
            "SELECT id, plot_vector_json FROM movies")
        self.index = KNearestNeighborIndex(bert_vectors)

    def __call__(self, request):
        # Find k nearest movies with similar plots.
        liked_id = request.args["liked_id"]
        num_returns = int(request.args.get("count", 6))
        recommended_movie_ids = self.index.search(liked_id, num_returns)

        # Look up the titles of the recommended movies.
        titles_and_ids = self.db.execute(
            f"SELECT title, id FROM movies WHERE id in ({','.join(recommended_movie_ids)})"
        ).fetchall()

        # Convert the rows into JSON-serializable dicts.
        return [{
            "id": movie_id,
            "title": title
        } for title, movie_id in titles_and_ids]

In [None]:
# Delete the stateless backend.
client.delete_endpoint("plot")
client.delete_backend("plot:v1")

# Instantiate the stateful backend.
# Tip! You can run this cell again if you need to debug the PlotRecommender code.
client.create_backend(backend_tag="plot:v1", func_or_class=PlotRecommender)
# Create an endpoint. This will route GET requests to /rec/plot to the PlotRecommender backend.
client.create_endpoint(endpoint_name="plot", backend="plot:v1", route="/rec/plot")

%timeit send_plot_request(MOVIE_IDS[0])

**Task:** Use the Serve `client` to get a handle to the "plot" endpoint and compare the recommendations to the "color" endpoint.
1. Get a handle to the "plot" endpoint. You can use the code to get the `color_handle` as a reference.
2. Submit requests to the "plot" and "color" endpoints with the same "liked_id", and compare the returned recommendations. They should be completely different except for the movie that was passed as the "liked_id".
> **Tip:** You can even do this in parallel! Try it by submitting all of the `.remote` functions first, then calling `ray.get` on a list of the results, like we did in notebook 1.

Here's the code again for getting the results from the color endpoint, to get you started:

In [None]:
color_handle = client.get_handle(endpoint_name="color")
ray.get(color_handle.remote(liked_id=MOVIE_IDS[0]))

## Once you've finished this notebook: head over to [step 3](3.%20Deploying%20a%20custom%20ensemble%20model.ipynb), where we'll deploy a composed model!