# Text Translation Model (before Ray Serve)

In [6]:
from transformers import pipeline
import os

os.environ['CUDA_LAUNCH_BLOCKING']='1'


class Translator:
    def __init__(self):
        # Load model
        self.model = pipeline("translation_en_to_fr", model="t5-small", device='cuda:0')

    def translate(self, text: str) -> str:
        # Run inference
        model_output = self.model(text)

        # Post-process output to return only the translation text
        translation = model_output[0]["translation_text"]

        return translation


translator = Translator()

translation = translator.translate("Hello world!")
print(translation)

Bonjour monde!


# Converting to a Ray Serve Application

* Run [serve_trans.py](scaling/ray/serve/serve_trans.py)
    ```bash
    serve run serve_trans:translator_app
    ```

In [16]:
# Send a post request w/ json data
import requests

english_text = "Hello world!"

response = requests.post("http://127.0.0.1:8000/", json=english_text)
french_text = response.text

In [None]:
print(french_text)

Unexpected error, traceback: ray::ServeReplica:default:MostBasicIngress.handle_request_with_rejection() (pid=2472, ip=127.0.0.1)
  File "d:\Miniconda\envs\mle_proj\Lib\site-packages\ray\serve\_private\utils.py", line 168, in wrap_to_ray_error
    raise exception
  File "d:\Miniconda\envs\mle_proj\Lib\site-packages\ray\serve\_private\replica.py", line 1151, in call_user_method
    await self._call_func_or_gen(
  File "d:\Miniconda\envs\mle_proj\Lib\site-packages\ray\serve\_private\replica.py", line 875, in _call_func_or_gen
    result = await result
             ^^^^^^^^^^^^
  File "D:\TEMP\Temp\ipykernel_17528\3284836380.py", line 10, in __call__
TypeError: string indices must be integers, not 'str'.
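
The final `TypeError` is what Python raises when a `str` is indexed with a string key. Because `json=english_text` sends the text as a bare JSON string, a handler that expects a JSON object would fail this way. The traceback also names the replica `MostBasicIngress`, so a deployment other than the translator may have been serving on port 8000. The sketch below reproduces the error with the standard library only. The `"text"` key is a hypothetical stand-in, since the handler's code isn't shown here.

```python
import json

# Body produced by requests.post(..., json="Hello world!"): a bare JSON string.
string_body = json.dumps("Hello world!").encode("utf-8")

# Body a dict-indexing handler would expect instead ("text" is a hypothetical key).
object_body = json.dumps({"text": "Hello world!"}).encode("utf-8")

decoded_string = json.loads(string_body)  # decodes to a Python str
decoded_object = json.loads(object_body)  # decodes to a Python dict

try:
    decoded_string["text"]  # indexing a str with a str key
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(type(decoded_string).__name__, type(decoded_object).__name__)
print(error_message)
```

If this reading is right, either the client should send the shape the handler expects, or the handler should treat the decoded body as a plain string.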



# Composing Multiple Models

* Ray Serve allows you to compose multiple deployments into a single Ray Serve application. 
    * This makes it easy to combine multiple machine learning models along with business logic to serve a single request.
    * Deployment options such as `autoscaling_config` and `num_replicas`, plus `num_cpus` and `num_gpus` (set through `ray_actor_options`), let you configure and scale each deployment in the application independently.

* Example, in the CLI:
    * Run [APP](summary_serve_composed.py):
        * `serve run summary_serve_composed:app`
    * Then run [CLIENT](summary_client.py):
        * `python summary_client.py`
    

# Develop and Deploy an ML Application

* The flow for developing a Ray Serve application locally and deploying it in production covers the following steps:
    * Converting a machine learning model into a Ray Serve application
    * Testing the application locally
    * Building Serve config files for production deployment
    * Deploying the application using a config file

### Convert a model into a Ray Serve Application

* See [model_serve.py](<deploy ml app/model_serve.py>)
* Run in CLI: `serve run model_serve:translator_app`
* To send a POST request w/ JSON data, run `python model_client.py`
* To check the status of the application and its deployments, run `serve status`

### Build Serve config files for production deployment
* To deploy Serve applications in production, you need to generate a Serve config YAML file. 
* A Serve config file is the single source of truth for the cluster, allowing you to specify system-level configurations and your applications in one place.
    * It also allows you to declaratively update your applications. 
* The `serve build` CLI command takes the import path as input and writes the generated config to the file given with the `-o` flag.
    * `serve build model_serve:translator_app -o config.yaml`
        * The `runtime_env` field will always be empty when using `serve build` and must be set manually.
    * The `serve build` command adds a default application name that can be modified.
    * You can also use the Serve config file with `serve run` for local testing.

### Dynamically change parameters without restarting replicas (`user_config`)

* You can use the `user_config` field in the YAML to supply a structured configuration for your deployment.
    * You can pass arbitrary JSON serializable objects to the YAML configuration.
    * Serve then applies it to all running and future deployment replicas.
    * The application of user configuration doesn't restart the replica.

* This deployment continuity means that you can use this field to dynamically:
    * Adjust model weights and versions without restarting the cluster.
    * Adjust traffic splitting percentage for your model composition graph.
    * Configure any feature flag, A/B tests, and hyper-parameters for your deployments.

* To enable the `user_config` feature, implement a `reconfigure` method that takes a JSON-serializable object (e.g., a `dict`, `list`, or `str`) as its only argument.

* See [Updating Applications In-Place](https://docs.ray.io/en/latest/serve/advanced-guides/inplace-updates.html#serve-inplace-updates)


* Example:
```python
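    # Add to a model_serve.py-style file
    from typing import Any, Dict

    from ray import serve

    @serve.deployment
    class Model:
        def reconfigure(self, config: Dict[str, Any]):
            # Serve calls this with the `user_config` value from the YAML
            self.threshold = config["threshold"]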
```
* The corresponding YAML snippet:
```YAML
    ...
    deployments:
        - name: Model
          user_config:
              threshold: 1.5
```
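
To see the `reconfigure` contract in isolation, here is a Ray-free sketch. It assumes a hypothetical driver function, `apply_user_config`, that stands in for what Serve does when the `user_config` value changes: it calls `reconfigure` on the existing instance, so `__init__` runs only once. The `is_positive` method and the `init_calls` counter are also illustrative additions.

```python
from typing import Any, Dict


class Model:
    """Plain-Python stand-in for a deployment replica (no Ray involved)."""

    init_calls = 0  # counts how often the constructor runs

    def __init__(self) -> None:
        Model.init_calls += 1
        self.threshold = 0.0

    def reconfigure(self, config: Dict[str, Any]) -> None:
        # Same shape as the deployment method: read the new value from the config.
        self.threshold = config["threshold"]

    def is_positive(self, score: float) -> bool:
        return score >= self.threshold


def apply_user_config(replica: Model, user_config: Dict[str, Any]) -> None:
    """Hypothetical stand-in for Serve pushing a new `user_config` to a replica."""
    replica.reconfigure(user_config)


replica = Model()
apply_user_config(replica, {"threshold": 1.5})
before = replica.is_positive(1.0)

apply_user_config(replica, {"threshold": 0.5})  # updated value, same instance
after = replica.is_positive(1.0)

print(before, after, Model.init_calls)
```

Keeping expensive setup such as model loading in `__init__` and cheap, changeable settings in `reconfigure` is what makes the update possible without a restart.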

# Deploy Compositions of Models

### Compose deployments using `DeploymentHandles`

* When building an application, you can `.bind()` multiple deployments and pass them to each other's constructors. At runtime, inside the deployment code, Ray Serve substitutes the bound deployments w/ `DeploymentHandles` that you can use to call methods of other deployments.
* This capability lets you divide your application’s steps, such as preprocessing, model inference, and post-processing, into independent deployments that you can independently scale and configure.
* Use `handle.<method_name>.remote()` to send a request to a deployment (`handle.remote()` calls its `__call__` method).
    * These requests can contain ordinary Python args and kwargs, which DeploymentHandles can pass directly to the method. 
    * The method call returns a `DeploymentResponse` that represents a future to the output. 
    * You can `await` the response to retrieve its result or pass it to another downstream `DeploymentHandle` call.

### Chaining `DeploymentHandle` calls

* Ray Serve can directly pass the `DeploymentResponse` object that a `DeploymentHandle` returns, to another `DeploymentHandle` call to chain together multiple stages of a pipeline. 
    * You don’t need to `await` the first response; Ray Serve manages the `await` behavior under the hood.
    * When the first call finishes, Ray Serve passes the output of the first call, instead of the `DeploymentResponse` object, directly to the second call.
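
The chaining behavior above can be pictured with plain `asyncio`. This is an analogy only, not Ray Serve's API: `FakeHandle` is a hypothetical stand-in whose `remote()` returns a future and resolves any future it receives as an argument before calling the target function, so the second stage sees a plain value.

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeHandle:
    """Asyncio stand-in for a DeploymentHandle (hypothetical, not Ray's API)."""

    def __init__(self, func: Callable[[Any], Awaitable[Any]]) -> None:
        self._func = func

    def remote(self, arg: Any) -> "asyncio.Future[Any]":
        async def call() -> Any:
            # Resolve an upstream future first, so the target function
            # receives its output rather than the future object itself.
            if isinstance(arg, asyncio.Future):
                value = await arg
            else:
                value = arg
            return await self._func(value)

        # Schedule the call and return immediately with a future.
        return asyncio.ensure_future(call())


async def add_one(x: int) -> int:
    return x + 1


async def double(x: int) -> int:
    return x * 2


async def pipeline(x: int) -> int:
    adder, doubler = FakeHandle(add_one), FakeHandle(double)
    first = adder.remote(x)         # a future, not awaited here
    second = doubler.remote(first)  # the future is passed straight through
    return await second             # only the final result is awaited


result = asyncio.run(pipeline(3))
print(result)
```

The caller awaits once, at the end of the chain, which mirrors how only the last `DeploymentResponse` needs to be awaited.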

### Streaming `DeploymentHandle` Calls

* You can also use `DeploymentHandles` to make streaming method calls that return multiple outputs.
* To make a streaming call:
    * The method must be a generator and you must set `handle.options(stream=True)`
    * Then, the handle call returns a `DeploymentResponseGenerator` instead of a unary `DeploymentResponse`
    * You can use `DeploymentResponseGenerators` as a sync or async generator, like in an `async for` code block.
    * You can't pass `DeploymentResponseGenerators` to other handle calls.
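
On the consuming side, the pattern is ordinary async-generator iteration. The sketch below uses plain `asyncio` as a stand-in and does not call Ray Serve: the hypothetical `stream_tokens` plays the role of a generator method on a deployment, and the `async for` loop plays the role of iterating a `DeploymentResponseGenerator`.

```python
import asyncio
from typing import AsyncIterator, List


async def stream_tokens(text: str) -> AsyncIterator[str]:
    """Hypothetical generator method: yields one output per token."""
    for token in text.split():
        await asyncio.sleep(0)  # yield control, as a real stream would between chunks
        yield token


async def consume(text: str) -> List[str]:
    received: List[str] = []
    # Each iteration receives one output as soon as it is produced.
    async for token in stream_tokens(text):
        received.append(token)
    return received


tokens = asyncio.run(consume("Bonjour le monde"))
print(tokens)
```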

# Deploy Multiple Applications

### When to use multiple applications
* Suppose you have multiple models and/or business logic that all need to be executed for a single request.

* Model Composition
    * If they live in one repository, you most likely upgrade them as a unit, so keep all those deployments in one application.

* Multiple Applications
    * If these models or business logic have logical groups, for example, groups of models that communicate with each other but live in different repositories, separate the models into applications.
    * Another common use-case for multiple applications is separate groups of models that may not communicate with each other, but you want to co-host them to increase hardware utilization.
    * Because one application is a unit of upgrade, having multiple applications allows you to deploy many independent models (or groups of models) each behind different endpoints. You can then easily add or delete applications from the cluster as well as upgrade applications independently of each other.

* See [Example](<deploy multiple applications>):
    * Run `serve build imgC:app txtT:app -o config.yaml`

# FastAPI and HTTP