# A Guided Tour of Ray Core: Ray Actor Tree Pattern

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook](./ex_05_multiprocess_pool.ipynb) <br>
⬅️ [Previous notebook](./ex_03_remote_classes.ipynb) <br>

### Overview

Ray has a myriad of design patterns for [tasks](https://docs.ray.io/en/latest/ray-core/tasks/patterns/index.html#task-patterns) and [actors](https://docs.ray.io/en/latest/ray-core/actors/patterns/index.html). These patterns
allow you to write distributed applications.

#### Tree of Actors Pattern

The tree of actors is a common pattern in Ray libraries such as [Ray Tune](https://docs.ray.io/en/latest/tune/index.html), [Ray Train](https://docs.ray.io/en/latest/train/train.html), and [RLlib](https://docs.ray.io/en/latest/rllib/index.html), which use it to train models in parallel or to conduct distributed hyperparameter optimization (HPO).

In this pattern, a collection of worker actors is managed by a supervisor actor. For example, you may want to train multiple models, each of a different type, at the same time, while being able to inspect each model's state during training.

<img src="https://docs.ray.io/en/latest/_images/tree-of-actors.svg" width="25%" height="25%">

This pattern is essentially the same as Same Data Different Function/Model (SDDF), which is popular in AutoML scenarios where you want to train different models at the same time on the same dataset.

<img src="images/same_data_different_model.png" width="35%" height="25%">


### Learning objectives
In this tutorial, we revisit Ray Actors and learn more about:
 * Common Ray Actors patterns used in Ray native libraries for writing distributed Actors
   * Tree of Actors 
   * Same Data Different Functions (SDDF)
 * How to use Actors for Batch Inference
 * How to pass Ray Actors to remote tasks for distributed computing

Let's implement a simple example to illustrate this pattern.

---

First, let's start Ray…

In [1]:
import logging
import time
import random
from random import randint
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

import ray
from ray.util.actor_pool import ActorPool

In [2]:
if ray.is_initialized():
    ray.shutdown()
ray.init(logging_level=logging.ERROR)

Python version: 3.8.13
Ray version: 2.0.1
Dashboard: http://127.0.0.1:8265


## Example 1: Supervisor and worker actor pattern

### Generic model factory utility  

This factory generates a few specific types of (fake 😏) models: regression, classification, or neural network. Each type has its own training function. Each model will be in a particular state during training; the final state is `DONE`.

In [3]:
# Factory function to return an instance of a model type
def model_factory(m: str, func: object):
    return Model(m, func)

# states to inspect or checkpoint
STATES = ["RUNNING", "PENDING", "DONE"]

class Model:

    def __init__(self, m:str, func: object):
        self._model = m
        self._func = func

    def train(self):
        # do some training work here for the respective model type
        self._func()

### Create a Worker Actor
This worker actor trains a model. When the model's state reaches `DONE`, we stop training.

In [4]:
@ray.remote
class Worker(object):
    def __init__(self, m:str, func: object):
        # type of a model: lr, cl, or nn
        self._model = m  
        self._func = func
       
    # Inspect the worker's current state and return it. For this demo,
    # the state is chosen at random from STATES
    def state(self) -> str:
        return random.choice(STATES)
    
    # Create the model from the factory for this worker and 
    # do the training by invoking its respective objective function 
    # for that model
    def work(self) -> None:
        model_factory(self._model, self._func).train()
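
`Worker.state` above picks a state at random, which is why the polling output later in this example jumps between states. As a rough sketch of what real state tracking could look like, the hypothetical `TrackedModel` class below (not part of the notebook) moves from `PENDING` to `RUNNING` to `DONE` as training proceeds. It is plain Python so it runs without a cluster; a worker actor could hold such an object and return its `state` attribute.

```python
from typing import Callable


class TrackedModel:
    """Hypothetical variant of Model that records its own training state."""

    def __init__(self, name: str, func: Callable[[], object]):
        self.name = name
        self._func = func
        self.state = "PENDING"  # nothing has run yet

    def train(self) -> None:
        self.state = "RUNNING"
        self._func()            # the (fake) training work
        self.state = "DONE"     # reached only if training did not raise


tracked = TrackedModel("lr", lambda: 0)
state_before = tracked.state
tracked.train()
state_after = tracked.state
print(state_before, "->", state_after)
```

One caveat if you try this inside an actor: a default Ray actor (neither async nor threaded) executes its method calls one at a time, so a `state` call would be answered only after a running `work` call has finished.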

### Create Supervisor Actor 
The supervisor creates three worker actors, each with its own model type and training function.

In [5]:
# Define respective model training functions

def lf_func():
    # Do some training work for linear regression
    time.sleep(1)
    return 0

def cl_func():
    # Do some training work for classification
    time.sleep(1)
    return 0

def nn_func():
    # Do some training work for neural networks
    time.sleep(1)
    return 0

@ray.remote
class Supervisor:
    def __init__(self):
        # Create three worker actors, each with its own model type and
        # its respective training function
        self.workers = [Worker.remote(name, func) for (name, func) 
                        in [("lr", lf_func), ("cl",cl_func), ("nn", nn_func)]]
                        
    def work(self):
        # do the work 
        [worker.work.remote() for worker in self.workers]
        
    def terminate(self):
        [ray.kill(worker) for worker in self.workers]
        
    def state(self):
        return ray.get([worker.state.remote() for worker in self.workers])

### Create an Actor instance for the supervisor and launch its workers

In [6]:
sup = Supervisor.remote()

# Launch remote actors as workers
sup.work.remote()

ObjectRef(16310a0f0a45af5c7b8bc69eb2e162a6abd773fb0100000001000000)

### Look at the Ray Dashboard

You should see Actors running as processes on the worker nodes:
 * Supervisor
 * Workers
 
Also, click on the `Logical View` to view more metrics and data on individual Ray Actors

In [7]:
# check their status
while True:
    # Fetch the states of all its workers
    states = ray.get(sup.state.remote())
    print(states)
    # check if all are DONE
    result = all('DONE' == e for e in states)
    if result:
        # Note: Actor processes will be terminated automatically when the initial actor handle goes out of scope in Python. 
        # If we create an actor with actor_handle = ActorClass.remote(), then when actor_handle goes out of scope and is destroyed, 
        # the actor process will be terminated. Note that this only applies to the original actor handle created for the actor 
        # and not to subsequent actor handles created by passing the actor handle to other tasks.
        
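        # Kill all of the supervisor's workers manually, only for illustration and demo.
        # ray.get() waits for terminate() to finish before the supervisor itself is killed.
        ray.get(sup.terminate.remote())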

        # kill the supervisor manually, only for illustration and demo
        ray.kill(sup)
        break

['PENDING', 'PENDING', 'PENDING']
['DONE', 'DONE', 'RUNNING']
['RUNNING', 'DONE', 'RUNNING']
['DONE', 'PENDING', 'DONE']
['DONE', 'RUNNING', 'RUNNING']
['RUNNING', 'RUNNING', 'DONE']
['DONE', 'DONE', 'PENDING']
['DONE', 'PENDING', 'PENDING']
['DONE', 'DONE', 'RUNNING']
['DONE', 'DONE', 'DONE']
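
The loop above polls as fast as it can and ends only because the fake states are random. As a hedged sketch, a polling helper with a sleep interval and a timeout might look like the following. The names `wait_until_done`, `fetch_states`, `poll_interval_s`, and `timeout_s` are illustrative and not part of Ray; in this notebook `fetch_states` could be `lambda: ray.get(sup.state.remote())`.

```python
import time
from typing import Callable, List


def wait_until_done(fetch_states: Callable[[], List[str]],
                    poll_interval_s: float = 0.5,
                    timeout_s: float = 30.0) -> bool:
    """Poll until every state is DONE; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        states = fetch_states()
        if states and all(s == "DONE" for s in states):
            return True
        time.sleep(poll_interval_s)  # avoid hammering the supervisor
    return False


# Stand-in for the supervisor: all workers report DONE on the third poll.
_responses = iter([["PENDING"] * 3, ["RUNNING", "DONE", "RUNNING"], ["DONE"] * 3])
finished = wait_until_done(lambda: next(_responses), poll_interval_s=0.01, timeout_s=5.0)
print(finished)
```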


### Any Questions?

## Example 2: Actor-based batch inference

In our first tutorial on [Ray Tasks](./ex_01_remote_funcs.ipynb), we covered a use case for parallelizing batch inference. In short, we
used the Different Data Same Function (DDSF) pattern. Here we use the same pattern with Ray Actors, which are stateful and are the preferred way to do batch inference. The elements remain largely the same, with a few modifications.

**Input dataset**: This is a large collection of observations to generate predictions for. The data is usually stored in an external storage system such as S3, HDFS, or a database, across
many files.

**ML model**: This is a trained ML model that is usually also stored in an external storage system or in a model store.

**Predictions**: These are the outputs of applying the ML model to the observations. Predictions are usually written back to the storage system. Instead of using tasks
to do the predictions, we employ a pool of Actors.

For the purposes of this tutorial, we make the following provisions:
 * create a dummy model that returns fake predictions
 * use real-world NYC taxi data to provide a large dataset for batch inference
 * create a pool of actors and submit each shard to the pool
 * return the predictions instead of writing them back to disk

As an example of the scaling pattern called Different Data Same Function (DDSF), also known as the Distributed Data Parallel (DDP) paradigm, our function in this diagram is the
pretrained **model**, and the data is split and distributed as **shards**.

<img src="images/batch-inference.png" width="25%" height="25%">
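
To make the DDSF idea concrete without Ray, here is a minimal plain-Python sketch: one function applied to every shard of a small list. The helper `make_shards` and the toy `passenger_counts` data are assumptions made up for this illustration; the notebook itself reads pre-sharded Parquet files instead.

```python
def make_shards(rows, num_shards):
    """Split rows into num_shards contiguous shards of near-equal size."""
    base, extra = divmod(len(rows), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        shards.append(rows[start:end])
        start = end
    return shards


def model(shard):
    # The same function for every shard: count rides with 2 or more passengers.
    return sum(1 for passenger_count in shard if passenger_count >= 2)


passenger_counts = [1, 2, 3, 1, 1, 4, 2, 1, 5, 1]  # toy data, not the NYC dataset
shards = make_shards(passenger_counts, num_shards=3)
scores = [model(shard) for shard in shards]          # different data, same function
print(shards, scores)
```

Summing the per-shard scores gives the same answer as running the function over the unsplit data, which is what makes the shards safe to process independently.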



In [8]:
NUM_ACTORS = 6             # You can always increase the number of actors to scale
NUM_SHARD_FILES = 12       # total number of shard files submitted to the pool of actors

In [9]:
# Our load model closure remains the same
def load_trained_model():
    # A fake model that predicts whether tips were given based on number of passengers in the taxi cab.
    def model(batch: pd.DataFrame) -> pd.DataFrame:
        # Some model payload so Ray copies the model in the shared plasma store to tasks scheduled across nodes.
        model.payload = np.arange(100, 100_000_000, dtype=float)
        model.cls = "regression"
        
        # give a tip if 2 or more passengers
        predict = batch["passenger_count"] >= 2 
        return pd.DataFrame({"score": predict})
    
    return model    

Create a Ray Actor that stores a model reference and does the prediction

In [10]:
@ray.remote
class NYCBatchPredictor:
    def __init__(self, model):
        self.model = model

    def predict(self, shard_path):
        # read each shard and convert to pandas
        df = pq.read_table(shard_path).to_pandas()
        
        # do the inference with our model and return the result
        result = self.model(df)
        return result

1. Get our trained model instance
2. Store it into the plasma object store

In [11]:
model = load_trained_model()
model_ref = ray.put(model)

Fetch our NYC taxi shard files

In [12]:
# Iterate through our NYC files ~ 2GB
input_shard_files = [
    f"s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet"
    f"/fe41422b01c04169af2a65a83b753e0f_{i:06d}.parquet"
    for i in range(NUM_SHARD_FILES) ]

In [13]:
input_shard_files

['s3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000000.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000001.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000002.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000003.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000004.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000005.parquet',
 's3://anonymous@air-example-data/ursa-labs-taxi-data/downsampled_2009_full_year_data.parquet/fe41422b01c04169af2a65a83b753e0f_000006.pa

1. Create six Actor instances (`NUM_ACTORS`), each initialized with the same model reference
2. Create a pool of these six actors

We use the Ray actor pool utility [ActorPool](https://docs.ray.io/en/latest/ray-core/actors/actor-utils.html?highlight=ActorPool#actor-pool).

See also the [ActorPool API reference](https://docs.ray.io/en/latest/ray-core/package-ref.html?highlight=ActorPool#ray-util-actorpool).

In [14]:
actors = [NYCBatchPredictor.remote(model_ref) for _ in range(NUM_ACTORS)]
actors_pool = ActorPool(actors)

In [15]:
# Submit each shard to the pool of actors for batch inference
# The API syntax is similar to Python's or Ray's multiprocessing pool APIs

for shard_path in input_shard_files:
    # Submit file shard for prediction to the pool
    actors_pool.submit(lambda actor, shard: actor.predict.remote(shard), shard_path)
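
For comparison with the pool APIs mentioned in the comment above, here is the same submit-then-collect shape using only the standard library's `ThreadPoolExecutor`. This is an analogy, not Ray code: `fake_predict` is a hypothetical stand-in for `NYCBatchPredictor.predict`, and the two constants merely mirror `NUM_ACTORS` and `NUM_SHARD_FILES`.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 6   # plays the role of NUM_ACTORS
NUM_SHARDS = 12   # plays the role of NUM_SHARD_FILES


def fake_predict(shard_id: int) -> int:
    # Hypothetical stand-in for a prediction call: returns a fake row count.
    return 1000 + shard_id


with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    # Submit more work items than there are workers; the pool queues the rest.
    futures = [pool.submit(fake_predict, shard_id) for shard_id in range(NUM_SHARDS)]
    # Collect the results in submission order.
    row_counts = [future.result() for future in futures]

print(len(row_counts), sum(row_counts))
```

As with the actor pool, there are more shards than workers, so each worker ends up handling several shards over the lifetime of the pool.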

In [16]:
# Iterate over the finished actors' predictions
while actors_pool.has_next():
    r =  actors_pool.get_next()
    print(f"Predictions dataframe size: {len(r)} | Total score for tips: {r['score'].sum()}")

Predictions dataframe size: 141062 | Total score for tips: 46360
Predictions dataframe size: 133932 | Total score for tips: 42175
Predictions dataframe size: 144014 | Total score for tips: 45175
Predictions dataframe size: 143087 | Total score for tips: 45510
Predictions dataframe size: 148108 | Total score for tips: 47713
Predictions dataframe size: 141981 | Total score for tips: 45188
Predictions dataframe size: 136394 | Total score for tips: 43234
Predictions dataframe size: 136999 | Total score for tips: 45142
Predictions dataframe size: 139985 | Total score for tips: 44138
Predictions dataframe size: 156198 | Total score for tips: 49909
Predictions dataframe size: 142893 | Total score for tips: 46112
Predictions dataframe size: 145976 | Total score for tips: 48036


### Recap

What we have demonstrated above is the tree of actors design pattern, commonly used in Ray for writing distributed applications. In particular, Ray's native libraries such as Train, Tune, Serve, and RLlib, as well as [Ray AIR's](https://docs.ray.io/en/latest/ray-air/getting-started.html) components, use it for distributed training and tuning trials. 

Additionally, we implemented the DDSF scaling pattern using an Actor-based predictor and the `ActorPool` utility class instead of tasks.  
Task-based batch inference has an overhead cost that can be significant if the model is large, since each task has to fetch the model from the driver's plasma store. We can optimize this by using Ray actors, 
which fetch the model just once and reuse it for all predictions assigned to the same actor in the pool.
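
The load-once argument can be illustrated with a plain-Python analogy that counts how often an expensive load step runs. Everything here is hypothetical and Ray-free: `load_model` stands in for fetching the model from the object store, `predict_stateless` behaves like a task, and `StatefulPredictor` behaves like an actor.

```python
load_calls = {"count": 0}


def load_model():
    # Hypothetical expensive step, e.g. fetching model weights from an object store.
    load_calls["count"] += 1
    return lambda passenger_count: passenger_count >= 2


def predict_stateless(passenger_count):
    # Task-like: every call pays for loading the model again.
    model = load_model()
    return model(passenger_count)


class StatefulPredictor:
    # Actor-like: the model is loaded once in the constructor and then reused.
    def __init__(self):
        self.model = load_model()

    def predict(self, passenger_count):
        return self.model(passenger_count)


passenger_counts = [1, 2, 3, 4]

stateless_results = [predict_stateless(n) for n in passenger_counts]
stateless_loads = load_calls["count"]

load_calls["count"] = 0
predictor = StatefulPredictor()
stateful_results = [predictor.predict(n) for n in passenger_counts]
stateful_loads = load_calls["count"]

print(f"stateless loads: {stateless_loads}, stateful loads: {stateful_loads}")
```

Both variants return the same predictions; only the number of loads differs, and that gap grows with the number of shards each actor handles.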

### Any questions?

Let's look at another example that follows a similar tree of actors pattern.

## Example 3: Passing Actor handles to Ray Tasks

Actors are versatile: they can be instantiated and passed to remote Ray tasks or even other actors. 

Let's consider writing a distributed messaging service, where workers may post messages to update the state of the messaging service. This could be a logging or monitoring service. For example, [WhyLabs](https://www.anyscale.com/blog/running-and-monitoring-distributed-ml-with-ray-and-whylogs) implemented a variation of this usage pattern to monitor Ray Serve deployments. Since tasks and actors are accessible as Python
objects, they can be passed around to other Python classes or functions.

You can pass actor handle instances to remote Ray tasks, which can change the actor's 
state. The `MessageActor` keeps or clears messages, depending on the method
invoked.

In [17]:
@ray.remote
class MessageActor(object):
    def __init__(self):
        # Keep the state of all the messages received
        self.messages = []
    
    def add_message(self, message):
        self.messages.append(message)
    
    # reset and clear all messages
    def get_and_clear_messages(self):
        messages = self.messages
        self.messages = []
        return messages
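
The drain semantics of `get_and_clear_messages` can be checked without Ray. The `Mailbox` class below is a hypothetical plain-Python copy of the actor body, used only to show that each drain returns the messages received since the previous drain.

```python
class Mailbox:
    """Hypothetical plain-Python copy of the MessageActor body (no Ray involved)."""

    def __init__(self):
        self.messages = []

    def add_message(self, message):
        self.messages.append(message)

    def get_and_clear_messages(self):
        # Rebind to a fresh list instead of clearing in place, so the
        # list handed back to the caller is not emptied afterwards.
        messages = self.messages
        self.messages = []
        return messages


mailbox = Mailbox()
mailbox.add_message("Message 0 from worker 0.")
mailbox.add_message("Message 0 from worker 1.")
first_drain = mailbox.get_and_clear_messages()
second_drain = mailbox.get_and_clear_messages()
mailbox.add_message("Message 1 from worker 0.")
print(first_drain, second_drain)
```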

Create a worker that does some specific work, such as updating a DB, posting a tweet, or
checking the status of a process, and then sends a message to the actor.

**NOTE**: _Question: What does this remind you of from the previous lessons?_ 

(PS)

In [18]:
@ray.remote
def worker(message_actor, w):
    for i in range(10):
        time.sleep(1)
        message_actor.add_message.remote(
            f"Message {i} from worker {w}.")

In [19]:
message_actor = MessageActor.remote()

Start three worker tasks that update the `MessageActor` service; each Ray task gets a handle to the `MessageActor`.

In [20]:
[worker.remote(message_actor, w) for w in range(3)]

[ObjectRef(c1464dc5b2308f10ffffffffffffffffffffffff0100000001000000),
 ObjectRef(c5db14a0419b947bffffffffffffffffffffffff0100000001000000),
 ObjectRef(91581beb08e6c9deffffffffffffffffffffffff0100000001000000)]

Since the workers are already launched, let's get the actor's state.

In [21]:
for _ in range(10):
    new_messages = ray.get(message_actor.get_and_clear_messages.remote())
    print("New messages\n:", new_messages)
    time.sleep(1)

New messages
: []
New messages
: ['Message 0 from worker 1.', 'Message 0 from worker 2.', 'Message 0 from worker 0.']
New messages
: ['Message 1 from worker 1.', 'Message 1 from worker 2.', 'Message 1 from worker 0.']
New messages
: ['Message 2 from worker 1.', 'Message 2 from worker 0.', 'Message 2 from worker 2.']
New messages
: ['Message 3 from worker 1.', 'Message 3 from worker 0.', 'Message 3 from worker 2.']
New messages
: ['Message 4 from worker 1.', 'Message 4 from worker 0.', 'Message 4 from worker 2.']
New messages
: ['Message 5 from worker 1.', 'Message 5 from worker 2.', 'Message 5 from worker 0.']
New messages
: ['Message 6 from worker 1.', 'Message 6 from worker 0.', 'Message 6 from worker 2.']
New messages
: ['Message 7 from worker 1.', 'Message 7 from worker 2.', 'Message 7 from worker 0.']
New messages
: ['Message 8 from worker 1.', 'Message 8 from worker 2.', 'Message 8 from worker 0.']


In [22]:
ray.shutdown()

### Exercises

1. Add a remote class, such as a logging actor, that keeps state by logging info (perhaps only in memory) about N experiments, where N <= 3.
2. Implement actor methods that alter the state. That is, the actor tracks the results of 9 separate runs for each experiment.
3. Write a separate Ray task that executes 9 runs for each experiment.
4. Instantiate the actor and call its methods from within the remote Ray task.

### Solution hints

This solution is just a structural hint. There are a few missing bits:
 * instantiation of `LoggingActor`
 * use of `ray.get()` to fetch the values from the object store

In [28]:
from collections import defaultdict
@ray.remote
class LoggingActor(object):
    def __init__(self):
        self.logs = defaultdict(list)
    
    def log(self, index, message):
        self.logs[index].append(message)
    
    def get_logs(self):
        return dict(self.logs)
    
@ray.remote
def run_experiment(experiment_index, logging_actor):
    for i in range(9):
        time.sleep(1)
        # Push a logging message to the actor.
        logging_actor.log.remote(experiment_index, 'On iteration {}'.format(i))    

In [29]:
# logging_actor = # TODO Instantiate Actor here
experiment_ids = []
for i in range(3):
    pass
    # TODO
    # invoke task and append results

In [None]:
logs = logging_actor.get_logs.remote()
# TODO use ray.get() to fetch the logs

### Next Step
We are going to switch focus a little and learn how you can use Ray's replacement
for Python's multiprocessing pool. 

Let's move on to the [Multiprocessing pool with Ray](ex_05_multiprocess_pool.ipynb)

### Homework

1. Read references below.
2. Can you implement calculating `pi` as a combination of actor (which keeps the state of the progress of calculating `pi` as it approaches its final value) and a task (which  computes candidates for `pi`)? 

**solution hint**: Check the Ray core quickstart docs only if you need to... :-)

---
## References

 * [Writing your First Distributed Python Application with Ray](https://www.anyscale.com/blog/writing-your-first-distributed-python-application-with-ray)
 * [Using and Programming with Actors](https://docs.ray.io/en/latest/actors.html)
 * [Ray Asynchronous and Threaded Actors: A way to achieve concurrency](https://medium.com/@2twitme/ray-asynchronous-and-threaded-actors-a-way-to-achieve-concurrency-ad9f86145f72)
