# Exploring Ray API Calls

© 2019-2022, Anyscale. All Rights Reserved

📖 [Back to Table of Contents](./ex_00_tutorial_overview.ipynb)<br>
➡ [Next notebook](./ex_07_ray_data.ipynb) <br>
⬅️ [Previous notebook](./ex_05_multiprocess_pool.ipynb) <br>

### Overview
Ray offers a rich set of APIs across all its components. Covering all of them is beyond the scope of this notebook (and tutorial), but we will touch on some common APIs, their use, and their arguments. The reference documentation listed below covers the full set.

 * [Ray Core](https://docs.ray.io/en/latest/ray-core/package-ref.html)
 * [Ray AIR](https://docs.ray.io/en/latest/ray-air/package-ref.html)
 * [Ray Data](https://docs.ray.io/en/latest/data/dataset.html)
 * [Ray Train](https://docs.ray.io/en/latest/train/train.html)
 * [Ray Tune](https://docs.ray.io/en/latest/tune/index.html)
 * [Ray Serve](https://docs.ray.io/en/latest/serve/index.html)
 * [RLlib](https://docs.ray.io/en/latest/rllib/index.html)


### Learning objectives
In this quick tour of the API, you will learn about:

 * Common Ray Core APIs
 * Some useful arguments to these APIs 
 * Tips and Tricks for first-time users


This lesson explores a few other API calls you might find useful, as well as options for the API calls we've already used in the previous lessons. We will also walk through some tips and tricks for first-time users.

> **Tip:** The [Ray Package Reference](https://docs.ray.io/en/latest/package-ref.html) in the [Ray Docs](https://docs.ray.io/en/latest/) is useful for exploring the API features we'll learn.

In [1]:
import ray, time, sys, logging
import numpy as np 
import json

In [2]:
if ray.is_initialized():
    ray.shutdown()
ray.init(num_cpus=4, logging_level=logging.ERROR)

Python version: 3.8.13
Ray version: 2.0.0
Dashboard: http://127.0.0.1:8269


## ray.init()

When we used [`ray.init()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.init), we used it to start Ray on our local machine. When the optional `address=...` argument is specified, the driver connects to the corresponding Ray cluster.

There are a lot of optional keyword arguments you can pass to `ray.init()`. Here are some of them. All options are described in the [documentation](https://ray.readthedocs.io/en/latest/package-ref.html#ray.init). 

| Name | Type | Example | Description |
| :--- | :--- | :------ | :---------- |
| `address` | `str` | `address='auto'` | The address of the Ray cluster to connect to. If this address is not provided, then this command will start a raylet, a plasma store, a plasma manager, and some workers. It will also kill these processes when Python exits. If the driver is running on a node in a Ray cluster, using `auto` as the value tells the driver to detect the cluster, removing the need to specify a specific node address. |
| `num_cpus` | `int` | `num_cpus=4` | Number of CPUs the user wishes to assign to each _raylet_. |
| `num_gpus` | `int` | `num_gpus=1` | Number of GPUs the user wishes to assign to each _raylet_. |
| `resources` | `dictionary` | `resources={'resource1': 4, 'resource2': 16}` | Maps the names of custom resources to the quantities of those resources available. |
| `memory` | `int` | `memory=1000000000` | The amount of memory (in bytes) that is available for use by workers requesting memory resources. By default, this is automatically set based on the available system memory. |
| `object_store_memory` | `int` | `object_store_memory=1000000000` | The amount of memory (in bytes) for the object store. By default, this is automatically set based on available system memory, subject to a 20GB cap. |
| `log_to_driver` | `bool` | `log_to_driver=True` | If true, then the output from all of the worker processes on all nodes will be directed to the driver program. |
| `local_mode` | `bool` | `local_mode=True` | If true, the code will be executed serially. This is useful for debugging. |
| `ignore_reinit_error` | `bool` | `ignore_reinit_error=True` | If true, Ray suppresses errors from calling `ray.init()` a second time (as we've done in these notebooks). Ray won't be restarted. |
| `configure_logging` | `bool` | `configure_logging=True` | If true (default), configuration of logging is allowed here. Otherwise, the user may want to configure it separately. |
| `logging_level` | _Flag_ | `logging_level=logging.INFO` | The logging level, defaults to `logging.INFO`. Ignored unless "configure_logging" is true. |
| `logging_format` | `str` | `logging_format='...'` | The logging format to use, defaults to a string containing a timestamp, filename, line number, and message. See the Ray source code `ray_constants.py` for details. Ignored unless "configure_logging" is true. |
| `runtime_env` | `dictionary` | `runtime_env={"working_dir": "/path/to/files"}` | Your Ray application might depend on source files or data files. In a development workflow these might live on your local machine, but to run at scale you need to get them to your remote cluster. Use this option to send those files to all nodes in the cluster so that your distributed tasks or actors can access them. See [runtime environments](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for an example. |

See also the documentation for [ray.shutdown()](https://ray.readthedocs.io/en/latest/package-ref.html#ray.shutdown), which is needed in some contexts.

## @ray.remote()

We've used [@ray.remote](https://ray.readthedocs.io/en/latest/package-ref.html#ray.remote) a lot. You can pass arguments when using it. Here are some of them.

| Name | Type | Example | Description |
| :--- | :--- | :------ | :---------- |
| `num_cpus` | `int` | `num_cpus=4` | The number of CPU cores to reserve for this task or for the lifetime of the actor. |
| `num_gpus` | `int` | `num_gpus=1` | The number of GPUs to reserve for this task or for the lifetime of the actor. |
| `num_returns` | `int` | `num_returns=2` | (Only for tasks, not actors.) The number of object refs returned by the remote function invocation. |
| `runtime_env` | `map` | `runtime_env={"working_dir": ".", "pip": ["requests"]}` | The runtime environment to use for this task or actor (see [Runtime environments](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) for details). |
| `max_calls` | `int` | `max_calls=5` | Only for *remote tasks*. This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can be used to address memory leaks in third-party libraries or to reclaim resources that cannot easily be released, e.g., GPU memory that was acquired by TensorFlow). By default this is infinite. |
| `max_restarts` | `int` | `max_restarts=-1` | Only for *actors*. This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn't need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely. |
| `max_task_retries` | `int` | `max_task_retries=-1` | Only for *actors*. How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. If set to -1, the system will retry the failed task until the task succeeds, or the actor has reached its max_restarts limit. If set to n > 0, the system will retry the failed task up to n times, after which the task will throw a `RayActorError` exception upon `ray.get`. Note that Python exceptions are not considered system errors and will not trigger retries. |
| `max_retries` | `int` | `max_retries=-1` | Only for *remote functions*. This specifies the maximum number of times that the remote function should be rerun when the worker process executing it crashes unexpectedly. The minimum valid value is 0, the default is 4, and a value of -1 indicates infinite retries. |

Here's an example with and without multiple return values, controlled by `num_returns`:

In [3]:
@ray.remote(num_returns=3)
def tuple3(one, two, three):
    return (one, two, three)

# Returns three object references, one for each value in the tuple
x_ref, y_ref, z_ref = tuple3.remote("a", 1, 2.2)
x, y, z = ray.get([x_ref, y_ref, z_ref])
print(f'({x}, {y}, {z})')

(a, 1, 2.2)


In [4]:
@ray.remote(num_returns=1)
def tuple3(one, two, three):
    return (one, two, three)

# Returns one object reference that holds a tuple of three values
xyz_ref = tuple3.remote("a", 1, 2.2)
x, y, z = ray.get(xyz_ref)
print(f'({x}, {y}, {z})')

(a, 1, 2.2)


### @ray.method()

Related to `@ray.remote()`, [@ray.method()](https://ray.readthedocs.io/en/latest/package-ref.html#ray.method) allows you to specify the number of return values for a method in an actor, by passing the `num_returns` keyword argument. None of the other `@ray.remote()` keyword arguments are allowed. Here is an example:

In [5]:
@ray.remote
class Tupleator:
    @ray.method(num_returns=3)
    def tuple3(self, one, two, three):
        return (one, two, three)
    
tupleator = Tupleator.remote()
x_ref, y_ref, z_ref = tupleator.tuple3.remote("a", 1, 2.2)
x, y, z = ray.get([x_ref, y_ref, z_ref])
print(f'({x}, {y}, {z})')   

(a, 1, 2.2)


## ray.put()

We used [`ray.get`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.get) a lot to retrieve objects, and we used actor methods to retrieve state from an actor. You can also put objects into the object store explicitly with [`ray.put`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.put), as shown in the following example:

In [6]:
ref = ray.put("Hello World!")
print(f'Object returned: {ray.get(ref)}')

Object returned: Hello World!


In [7]:
ref = ray.put(np.random.rand(2_000, 5_000))
print(f'Object returned: {ray.get(ref)}')

Object returned: [[0.50760842 0.34383957 0.87589267 ... 0.54818893 0.38742916 0.55451995]
 [0.72852359 0.4447501  0.74991959 ... 0.81483538 0.39146026 0.55049575]
 [0.31595207 0.76749984 0.84525161 ... 0.20796359 0.46627672 0.51465945]
 ...
 [0.05086173 0.21944815 0.56648786 ... 0.12172574 0.59320914 0.51803127]
 [0.5090366  0.15105528 0.07830268 ... 0.78327928 0.45107828 0.82046439]
 [0.5016987  0.97562026 0.73111256 ... 0.93782357 0.48384053 0.01655192]]


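`ray.put` has accepted an optional flag, `weakref=True` (defaults to `False`). If true, Ray is allowed to evict the object while a reference to the returned ref still exists. This is useful if you are putting a lot of objects into the object store and many of them might not be needed in the future, because it allows Ray to reclaim memory aggressively. Check the `ray.put` signature for your Ray version before relying on it: the flag comes from older releases and is not accepted by Ray 2.0, the version used in this notebook.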

## Fetching Cluster Information

Many methods return information:

| Method | Brief Description |
| :----- | :---------------- |
| [`ray.get_gpu_ids()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.get_gpu_ids) | IDs of the GPUs available to the current worker or driver |
| [`ray.nodes()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.nodes) | Information about the nodes in the cluster |
| [`ray.cluster_resources()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.cluster_resources) | Total resources in the cluster, whether in use or not |
| [`ray.available_resources()`](https://ray.readthedocs.io/en/latest/package-ref.html#ray.available_resources) | Resources not currently in use |

In [8]:
print(f"""
ray.get_gpu_ids():          {ray.get_gpu_ids()}
ray.nodes():                {ray.nodes()}
ray.cluster_resources():    {ray.cluster_resources()}
ray.available_resources():  {ray.available_resources()}
""")


ray.get_gpu_ids():          []
ray.nodes():                [{'NodeID': 'f6ad93ee502640b0bedfb01139cd5d277abbcb5be144bc6c8e8d5a46', 'Alive': True, 'NodeManagerAddress': '127.0.0.1', 'NodeManagerHostname': 'Juless-MacBook-Pro-16', 'NodeManagerPort': 52488, 'ObjectManagerPort': 52487, 'ObjectStoreSocketName': '/tmp/ray/session_2022-10-17_14-45-10_143104_55069/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2022-10-17_14-45-10_143104_55069/sockets/raylet', 'MetricsExportPort': 57204, 'NodeName': '127.0.0.1', 'alive': True, 'Resources': {'object_store_memory': 2147483648.0, 'node:127.0.0.1': 1.0, 'CPU': 4.0, 'memory': 29495276340.0}}]
ray.cluster_resources():    {'object_store_memory': 2147483648.0, 'CPU': 4.0, 'memory': 29495276340.0, 'node:127.0.0.1': 1.0}
ray.available_resources():  {'memory': 29495276340.0, 'CPU': 4.0, 'object_store_memory': 2067483391.0, 'node:127.0.0.1': 1.0}



Recall that we used `ray.nodes()[0]['Resources']['CPU']` in the second lesson to determine the number of CPU cores on our machines:

In [10]:
ray.nodes()[0]['Resources']['CPU']

4.0

# Tips and Tricks for first-time users
First-time users can trip over certain API calls and usage patterns. These short tips and tricks will help you avoid unexpected results. Below we briefly explore a handful of API calls and their best practices.

### Tip 1: Delay ray.get()

With Ray, every `.remote()` call is asynchronous: it returns immediately with a future (an object reference). This is key to achieving massive parallelism, because it lets a developer launch many remote tasks, each returning an object reference. When you need the value, you fetch it with `ray.get`. Because `ray.get` is a blocking call, where and how often you call it affects performance.


In [11]:
@ray.remote
def do_some_work(x):
    time.sleep(1)
    return x * x

#### Bad usage
Here we call `ray.get` inside the loop. Because it blocks until each `.remote()` call has finished, the tasks run one after another.

In [12]:
%%time
results = [ray.get(do_some_work.remote(x)) for x in range(10)]
results

CPU times: user 145 ms, sys: 558 ms, total: 703 ms
Wall time: 10.1 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

#### Good usage
We delay `ray.get` until all the tasks have been invoked and their references have been returned.


In [13]:
%%time
results = ray.get([do_some_work.remote(x) for x in range(10)])
results

CPU times: user 43.1 ms, sys: 165 ms, total: 208 ms
Wall time: 3.05 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

#### Takeaway tip 1:
Since `ray.get` is a blocking call, postpone it until you actually need the object's value. Called eagerly, it can undermine the parallelism you want.
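
A back-of-the-envelope estimate shows why the two timings above differ. This is a sketch that assumes every task takes exactly one second, that each of the four CPUs passed to `ray.init(num_cpus=4)` runs one task at a time, and that scheduling overhead is negligible. The helper names are hypothetical.

```python
import math

def serial_wall_time(num_tasks, task_seconds):
    """Blocking on each task before launching the next runs them back to back."""
    return num_tasks * task_seconds

def parallel_wall_time(num_tasks, task_seconds, num_cpus):
    """Launching everything first lets tasks run in waves of `num_cpus`."""
    return math.ceil(num_tasks / num_cpus) * task_seconds

serial_estimate = serial_wall_time(10, 1.0)         # ray.get inside the loop
parallel_estimate = parallel_wall_time(10, 1.0, 4)  # one ray.get at the end
print(f"serial: {serial_estimate:.1f} s, parallel: {parallel_estimate:.1f} s")
# serial: 10.0 s, parallel: 3.0 s
```

These estimates line up with the measured wall times of 10.1 s and 3.05 s.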

### Tip 2: Avoid tiny remote tasks
Ray APIs are general and simple to use. As a result, newcomers' natural instinct is to parallelize all tasks, including tiny ones, and the per-task overhead adds up over time.
In short, if Ray remote tasks do only a minuscule amount of computation, they may take longer to execute than their serial Python equivalents.

In [14]:
def tiny_task(x):
    time.sleep(0.0001)
    return x

In [15]:
%%time
results = [tiny_task(x) for x in range(100000)]
results[:10]

CPU times: user 197 ms, sys: 387 ms, total: 584 ms
Wall time: 13.1 s


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Now convert this into a Ray remote task:

In [16]:
@ray.remote
def remote_tiny_task(x):
    time.sleep(0.0001)
    return x

In [17]:
%%time
result_ids = [remote_tiny_task.remote(x) for x in range(100000)]
results = ray.get(result_ids)
results[:10]

CPU times: user 7.94 s, sys: 3.35 s, total: 11.3 s
Wall time: 11.4 s


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Surprisingly, Ray barely improved the execution time: the Ray program takes almost as long as the sequential program, even with four CPUs available! What's going on, and what can we do to remedy it?

Well, the issue here is that every task invocation has a non-trivial overhead (e.g., scheduling, inter-process communication, updating the system state) and this overhead dominates the actual time it takes to execute the task.

One way to mitigate this is to make the remote tasks "larger" in order to amortize the invocation overhead. Here we do that by aggregating the tiny tasks into chunks of 1,000.


In [18]:
@ray.remote
def mega_work(start, end):
    return [tiny_task(x) for x in range(start, end)]

In [19]:
%%time
# Launch 100 remote tasks, each covering a chunk of 1,000 items
result_ids = [mega_work.remote(x*1000, (x+1)*1000) for x in range(100)]
results = ray.get(result_ids)

CPU times: user 235 ms, sys: 93.9 ms, total: 329 ms
Wall time: 3.51 s


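A huge difference in execution time: about 3.5 s, versus 13.1 s for the sequential run and 11.4 s for the one-task-per-item Ray run. That is almost 4X faster than the sequential version!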
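
The batch boundaries above are hard-coded for 100 chunks of 1,000 items. A small helper can compute them for any input size, including sizes that don't divide evenly. This is a sketch: `chunk_bounds` is a hypothetical name, and the best chunk size depends on your workload.

```python
def chunk_bounds(total, chunk_size):
    """Return (start, end) pairs that cover range(total) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]

bounds = chunk_bounds(100_000, 1_000)
print(len(bounds), bounds[0], bounds[-1])  # 100 (0, 1000) (99000, 100000)

# The last chunk is shorter when the size does not divide evenly.
print(chunk_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can be passed straight to the task defined earlier, for example `[mega_work.remote(start, end) for start, end in chunk_bounds(100_000, 1_000)]`.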

### Tip 3: Using ray.wait() with ray.get()

As we noted above, an idiomatic way to use `ray.get()` is to delay fetching objects until you need them. Another way is to combine it with `ray.wait()` and fetch only the values that are already available. This is a way to [pipeline the execution](https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-4-pipeline-data-processing), especially when you want to process the results of completed Ray tasks as they finish.

Let's look at a simple example.

In [20]:
import numpy as np
@ray.remote
def make_array(n):
    time.sleep(n/10.0)
    return np.random.standard_normal(n)

Now define a task that can add two NumPy arrays together. The arrays need to be the same size, but we'll ignore any checking for this requirement.

In [21]:
@ray.remote
def add_arrays(a1, a2):
    time.sleep(a1.size/10.0)
    return np.add(a1, a2)

Now let's use `ray.wait` and `ray.get` in the recommended way:

In [22]:
%%time

array_refs = [make_array.remote(n*10) for n in range(6)]
added_array_refs = [add_arrays.remote(ref, ref) for ref in array_refs]

arrays = []
waiting_refs = list(added_array_refs)  # Assign a working list to the full list of refs
while len(waiting_refs) > 0:           # Loop until all tasks have completed
    # Call ray.wait with:
    #   1. the list of refs we're still waiting to complete,
    #   2. num_returns=2, so it returns as soon as two of them complete,
    #   3. timeout=10.0, so it waits at most 10 seconds before returning.
    ready_refs, remaining_refs = ray.wait(waiting_refs, num_returns=2, timeout=10.0)
    new_arrays = ray.get(ready_refs)
    arrays.extend(new_arrays)
    for array in new_arrays:
        print(f'{array.size}: {array}')
    waiting_refs = remaining_refs  # Reset this list; don't include the completed refs in the list again!
    
# print(f"\nall arrays: {arrays}")

0: []
10: [-2.95465341  2.62361862 -0.9709827   1.20121701 -3.15871418  0.95515512
 -0.58183771 -0.78388402 -2.15174708 -1.28391619]
20: [-1.68579785  2.389785    0.43981253 -1.16631623 -3.88743839  1.18525713
  0.1828003   0.75647277  1.67800321 -0.84983393 -2.75114661 -0.53912066
 -0.34130204 -0.22861622  0.18558549 -4.30599519  1.57052208  1.0186667
  1.83220248 -0.41025572]
30: [-0.93571163 -1.04205191  3.62223325  1.83127851 -0.41276751  0.61622148
  1.31362415  1.55611762  1.1284263  -2.85984813 -0.41485936  0.18443813
  0.55939491  0.76229764 -0.34663097 -2.38930675  2.97490243  0.03115254
 -0.65965427  2.98972712 -2.74402641 -0.41971211 -1.85231806 -0.05521468
 -1.34297677 -1.36967306  1.39218336 -5.28734457 -0.14647213  0.22265143]
40: [ 1.25236438e-01  6.72568851e-01  1.15882584e+00 -1.08785231e+00
 -2.10209993e+00 -9.50171462e-03  3.07679115e+00  5.92482497e-01
  1.75773272e+00 -6.70117623e-01  7.78768915e-01 -1.28744068e+00
  1.25903700e+00  3.33446294e-02 -3.43556603e+00  

### Tip 4: Avoid passing the same object repeatedly to remote tasks

When you pass a large object as an argument to a remote function, Ray calls `ray.put()` under the hood to store that object in the local object store. Storing the object once, say outside a loop, and passing its reference can significantly improve the performance of remote task invocations when the tasks run locally or on the same cluster node, because all tasks on a node share the object store in shared memory. Passing the same object directly on every call inside a loop, by contrast, stores it again each time and can degrade performance.

In [23]:
@ray.remote
def no_work(a):
    return

In [24]:
start = time.time()
a = np.zeros((5000, 5000))
# Sending the big array itself as an argument on every call
result_ids = [no_work.remote(a) for x in range(10)]
results = ray.get(result_ids)
print(f"duration = {time.time() - start:.3f} sec")

duration = 0.505 sec


Now we store the large array in the object store ourselves, so it is added only once and there is a single copy. We then send the tasks a reference to the array rather than the array itself.

In [25]:
start = time.time()
a_id = ray.put(np.zeros((5000, 5000)))
result_ids = [no_work.remote(a_id) for x in range(10)]
results = ray.get(result_ids)
print(f"duration = {time.time() - start:.3f} sec")

duration = 0.028 sec
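
Rough arithmetic makes the gap plausible. This sketch assumes the array uses NumPy's default `float64` dtype (8 bytes per element) and, as described above, that each call that receives the array directly stores its own copy in the object store. The variable names are illustrative.

```python
BYTES_PER_FLOAT64 = 8  # assumption: np.zeros defaults to float64
rows, cols, num_calls = 5000, 5000, 10

array_bytes = rows * cols * BYTES_PER_FLOAT64
copied_by_value = array_bytes * num_calls  # array passed directly to each call
copied_by_ref = array_bytes                # one ray.put, refs passed afterwards

print(f"one array:       {array_bytes / 1e6:.0f} MB")      # 200 MB
print(f"passed directly: {copied_by_value / 1e6:.0f} MB")  # 2000 MB
print(f"passed by ref:   {copied_by_ref / 1e6:.0f} MB")    # 200 MB
```

Under these assumptions the first version serializes ten times as much data as the second.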


In [26]:
ray.shutdown()

### Next Step
Before we go to the next and last module, any questions?

Let's move on to our final module, module 3, and start with the [Ray Datasets lesson](ex_07_ray_data.ipynb).

### Homework 

1. Read some more [tricks and tips](https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html) in the documentation
