# Advanced Ray Tutorial - Exercise Solutions

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

First, import everything we'll need and start Ray:

In [None]:
import ray, time, sys
import numpy as np
sys.path.append("../..")
from util.printing import pd, pnd  # convenience methods for printing results.

In [None]:
ray.init(ignore_reinit_error=True)

## Exercise 1 in 01: Ray Tasks Revisited

You were asked to convert the regular Python code to Ray code. Here are the three cells appropriately modified.

The imports and `ray.init()` were already handled at the top of this notebook, so we start with the remote function definition.

In [None]:
@ray.remote
def slow_square(n):
    time.sleep(n)
    return n*n

In [None]:
start = time.time()
refs = [slow_square.remote(n) for n in range(4)]
squares = ray.get(refs)
duration = time.time() - start

In [None]:
assert squares == [0, 1, 4, 9]
# This failed before the code was converted to Ray; it passes now that the tasks run in parallel:
assert duration < 4.1, f'duration = {duration}'
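
As a rough sanity check on the `4.1`-second bound, the arithmetic can be sketched without Ray. This assumes the four tasks run fully in parallel (enough free CPU cores) and ignores scheduling overhead; the variable names are illustrative only.

```python
# Sleep times of slow_square(n) for n in range(4), in seconds.
sleep_times = list(range(4))

# Sequential execution waits for each call in turn.
sequential_estimate = sum(sleep_times)  # 0 + 1 + 2 + 3

# Fully parallel execution only waits for the slowest task.
parallel_estimate = max(sleep_times)

print(f"sequential: ~{sequential_estimate}s, parallel: ~{parallel_estimate}s")
```

The sequential estimate exceeds the bound, while the parallel estimate leaves about a second of headroom for overhead.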

## Exercise 2 in 01: Ray Tasks Revisited

You were asked to use `ray.wait()` with a shorter timeout, `2.5` seconds. First we need to redefine in this notebook the remote functions we used in that lesson:

In [None]:
@ray.remote
def make_array(n):
    time.sleep(n/10.0)
    return np.random.standard_normal(n)

@ray.remote
def add_arrays(a1, a2):
    time.sleep(a1.size/10.0)
    return np.add(a1, a2)

In [None]:
start = time.time()
array_refs = [make_array.remote(n*10) for n in range(5)]
added_array_refs = [add_arrays.remote(ref, ref) for ref in array_refs]

arrays = []
waiting_refs = list(added_array_refs)  # Assign a working list to the full list of refs
while len(waiting_refs) > 0:            # Loop until all tasks have completed
    # Call ray.wait with:
    #   1. the list of refs we're still waiting to complete,
    #   2. tell it to return as soon as TWO of them complete (or ONE, if only one remains),
    #   3. tell it to wait up to 2.5 seconds before timing out.
    return_n = 2 if len(waiting_refs) > 1 else 1
    ready_refs, remaining_refs = ray.wait(waiting_refs, num_returns=return_n, timeout=2.5)
    print('Returned {:3d} completed tasks. (elapsed time: {:6.3f})'.format(len(ready_refs), time.time() - start))
    new_arrays = ray.get(ready_refs)
    arrays.extend(new_arrays)
    for array in new_arrays:
        print(f'{array.size}: {array}')
    waiting_refs = remaining_refs  # Reset this list; don't include the completed refs in the list again!
    
print(f"\nall arrays: {arrays}")
pd(time.time() - start, prefix="Total time:")

For a timeout of `2.5` seconds, the second call to `ray.wait()` times out before two tasks finish, so it returns only one completed task. Why did the third and last iterations not time out? (That is, they both successfully returned two items.) It's because all the tasks were running in parallel, so they had time to finish. If you use a shorter timeout, you'll see more timeouts, where zero or one items are returned.

Try `1.5` seconds, where the first iteration returns two items and every other iteration times out and returns one item.

Try `0.5` seconds, where several iterations time out and return zero items, while all the other iterations time out and return one item.

## Exercise 3 in 01: Ray Tasks Revisited

You were asked to convert the code to use Ray, especially `ray.wait()`.

In [None]:
@ray.remote
def slow_square(n):
    time.sleep(n)
    return n*n

start = time.time()
refs = [slow_square.remote(n) for n in range(4)]
squares = []
waiting_refs = refs
while len(waiting_refs) > 0:
    finished_refs, waiting_refs = ray.wait(waiting_refs)  # We just assign the second list to waiting_refs...
    squares.extend(ray.get(finished_refs))
duration = time.time() - start

In [None]:
assert squares == [0, 1, 4, 9]
assert duration < 4.1, f'duration = {duration}' 

## Exercise - "Homework" - in 02: Ray Actors Revisited

Since profiling shows that `live_neighbors` is the bottleneck, what could be done to reduce its execution time? The solution shown here reduces its overhead by about 40%. Not bad.

The solution also implements parallel invocations of grid updates, rather than doing the whole grid in sequential steps.

As discussed in lesson 4, these kinds of optimizations make sense when you _really_ have a compelling reason to squeeze optimal performance out of the code. Hence, this optimization exercise will mostly appeal to those of you with such requirements or who enjoy low-level performance optimizations like this.

This solution for optimizing `live_neighbors` was developed using [micro-perf-tests.py](micro-perf-tests.py). The changes to the game code can be found in [game-of-life-2-exercise.py](game-of-life-2-exercise.py) rather than repeating them in cells here. Both scripts run standalone and both have a `--help` flag for more information.
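
To give a feel for this kind of micro-benchmarking, here is a minimal, Ray-free sketch using `timeit`. Neither function below is taken from `micro-perf-tests.py` or the game code; both are hypothetical neighbor-counting variants on a made-up 20x20 grid, written only to show how two candidates can be checked for equal results and then timed.

```python
import timeit

# Hypothetical 20x20 grid with a fixed pattern of live (1) and dead (0) cells.
SIZE = 20
grid = [[1 if (i * 7 + j * 3) % 5 == 0 else 0 for j in range(SIZE)]
        for i in range(SIZE)]

def live_neighbors_checked(grid, i, j):
    """Count live neighbors, bounds-checking every candidate cell."""
    count = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # Skip the cell itself.
            ni, nj = i + di, j + dj
            if 0 <= ni < len(grid) and 0 <= nj < len(grid[0]):
                count += grid[ni][nj]
    return count

def live_neighbors_sliced(grid, i, j):
    """Count live neighbors by summing clipped row slices, then removing the cell itself."""
    lo_j, hi_j = max(j - 1, 0), min(j + 2, len(grid[0]))
    total = 0
    for row in grid[max(i - 1, 0):i + 2]:
        total += sum(row[lo_j:hi_j])
    return total - grid[i][j]

cells = [(i, j) for i in range(SIZE) for j in range(SIZE)]

# Check that both variants agree on every cell before comparing their speed.
assert all(live_neighbors_checked(grid, i, j) == live_neighbors_sliced(grid, i, j)
           for i, j in cells)

for fn in (live_neighbors_checked, live_neighbors_sliced):
    seconds = timeit.timeit(lambda: [fn(grid, i, j) for i, j in cells], number=20)
    print(f"{fn.__name__}: {seconds:.4f}s")
```

Timings vary by machine and Python version, so treat the printed numbers as relative, not absolute.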

If you tried the "easier experiments" suggested, such as enhancing `RayConwaysRules.step()` to accept a `num_steps` argument, you probably found that they didn't improve performance. As for the non-Ray game, this change only moves processing around but doesn't parallelize it more than before, so performance is about the same.

In [None]:
from game_of_life_2_exercise import RayGame, apply_rules_block, time_ray_games

For comparison, one set of test runs with the exercise code before improvements took about 25 seconds.

If you look at `RayGame2.step`, it calls `RayConwaysRules.step` one step at a time, using remote calls. This seems like a good place for improvement. Let's extend `RayConwaysRules.step` to do more than one step, just like `RayGame2.step` already supports.

Changes are indicated with comments.

In [None]:
time_ray_games(
    num_games = 1, 
    max_steps = 400, 
    batch_size = 1, 
    grid_dimensions = (100,100), 
    use_block_updates = False)

In a test run, this ran in about 15.5 seconds, about 9.5 seconds faster than the first version! Hence, as expected, optimizing `live_neighbors` provided significant improvement.

What about using block updates? Let's try a bigger grid, but fewer steps. First, without the block updates:

In [None]:
time_ray_games(
    num_games = 1, 
    max_steps = 100, 
    batch_size = 1, 
    grid_dimensions = (200,200), 
    use_block_updates = False)

In [None]:
time_ray_games(
    num_games = 1, 
    max_steps = 100, 
    batch_size = 1, 
    grid_dimensions = (200,200), 
    use_block_updates = True,
    block_size = 50)  # The default block size is -1, so no blocks are used!

In a test run, the version with block updates ran about twice as fast as the version without them! So block processing definitely helps.
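
To make the block idea concrete, here is a minimal sketch of how a 200x200 grid could be tiled into blocks 50 cells on a side. The helper `block_ranges` is hypothetical and is not taken from `game-of-life-2-exercise.py`; it assumes square blocks and treats a non-positive `block_size` as "one block covering the whole grid", mirroring the `-1` default mentioned above.

```python
def block_ranges(grid_size, block_size):
    """Yield (start, stop) index pairs that tile range(grid_size) in chunks of block_size."""
    if block_size <= 0:  # Assumed convention: no blocking, one chunk spans everything.
        block_size = grid_size
    for start in range(0, grid_size, block_size):
        yield start, min(start + block_size, grid_size)

rows = list(block_ranges(200, 50))
cols = list(block_ranges(200, 50))

# Every (row range, column range) pair is one independent block of cells.
blocks = [(r, c) for r in rows for c in cols]
print(f"{len(blocks)} blocks, first: {blocks[0]}, last: {blocks[-1]}")
```

Under these assumptions the grid splits into 16 blocks, each of which could be updated by a separate parallel task.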

Finally, does batching help? We'll use fewer steps and the original 100x100 grid. First without batching and then with batching:

In [None]:
%time time_ray_games(num_games = 1, max_steps = 100, batch_size = 1, grid_dimensions = (100,100), use_block_updates=False)

In [None]:
%time time_ray_games(num_games = 1, max_steps = 100, batch_size = 50, grid_dimensions = (100,100), use_block_updates=False)

Batching doesn't make much difference and, in fact, we don't expect it to matter, because it doesn't change the parallelism, like blocking does, and it doesn't make the algorithm more efficient, like the new `live_neighbors` does.
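
A quick back-of-the-envelope sketch illustrates the point. It assumes that `batch_size` is the number of game steps performed per call, which is an interpretation rather than something confirmed by the code shown here; `remote_calls_needed` is a hypothetical helper.

```python
import math

def remote_calls_needed(max_steps, batch_size):
    """Number of step invocations if each call advances the game by batch_size steps."""
    return math.ceil(max_steps / batch_size)

for batch_size in (1, 50):
    calls = remote_calls_needed(100, batch_size)
    print(f"batch_size={batch_size:2d}: {calls:3d} calls, 100 grid updates in total")
```

Either way, the same 100 grid updates are computed one after another, so batching only reduces the per-call overhead, not the amount of sequential work.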

To conclude, the new implementation of `live_neighbors` has a noticeable benefit. Batching doesn't make much difference, but using parallel blocks helps a lot.