In [None]:
# Some imports and constants that we will need later on.
import ray
import pandas as pd
import numpy as np

COLUMNS = 'ABCD'
SIZE_100MiB = 100 * 1024 * 1024

First, let's generate a 100 MiB DataFrame of random integers using plain Python with pandas and NumPy. How long does this take?

In [None]:
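def generate_random_array():
    # One row holds len(COLUMNS) one-byte values, so dividing the target size
    # by the column count gives a uint8 DataFrame of about 100 MiB in total.
    return pd.DataFrame(np.random.randint(0, 100,
            size=(SIZE_100MiB // len(COLUMNS), len(COLUMNS))),
            columns=list(COLUMNS),
            dtype=np.uint8)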

%time generate_random_array()
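
As a quick sanity check on the sizing, the sketch below builds a much smaller frame the same way and confirms that its data occupies exactly the requested number of bytes. `SIZE_1MiB` and `make_frame` are names made up for this illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

COLUMNS = 'ABCD'
SIZE_1MiB = 1024 * 1024  # hypothetical smaller size so the check runs quickly

def make_frame(num_bytes):
    # Illustrative helper: same construction as generate_random_array,
    # but with the total size passed in as a parameter.
    return pd.DataFrame(
        np.random.randint(0, 100, size=(num_bytes // len(COLUMNS), len(COLUMNS))),
        columns=list(COLUMNS),
        dtype=np.uint8)

df = make_frame(SIZE_1MiB)

# Bytes used by the column data only (the index is excluded).
data_bytes = int(df.memory_usage(index=False).sum())
print(df.shape, data_bytes)
```

Because every value is a single `uint8` byte, the number of rows times the number of columns equals the size in bytes.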

# Introducing Ray Core

Now let's try speeding that up with Ray. Your laptop most likely has multiple cores, and Ray can spread this computation across them.

First, we'll start a local instance of Ray using `ray.init()`. Under the hood, this starts several Python worker processes that can execute work in parallel.

In [None]:
ray.init(ignore_reinit_error=True)

Next, let's turn the code we ran earlier into a Ray *task*: a function that runs remotely, in a different Python process from the one that called it.

Each task invocation returns an *`ObjectRef`*, a future that you can use to get the result later. The value behind an `ObjectRef` can also live in distributed memory, which means that you can create a large object without allocating memory for it in the local process. If this were running on multiple nodes, the object could even be stored on a different machine!

Evaluate the next cell to see how you can invoke and get the result of a remote function.

In [None]:
@ray.remote
def generate_random_array():
    return pd.DataFrame(np.random.randint(0, 100,
            size=(SIZE_100MiB // len(COLUMNS), len(COLUMNS))),
            columns=list(COLUMNS),
            dtype=np.uint8)

# Why does this line return immediately?
%time ref = generate_random_array.remote()
# CPU times: user 5.3 ms, sys: 431 µs, total: 5.73 ms
# Wall time: 5.39 ms

# How long does the commented out line take?
# What does it return?
# %time ray.get(ref)

The code that we just ran isn't very useful yet! It runs the same function in a different Python process, so we didn't have to execute it locally, but the work itself still took just as long.

However, because tasks execute asynchronously, we can launch several at once and let Ray run them in parallel across our cores. Try this out yourself in the next cell.

In [None]:
PARALLELISM = 8

@ray.remote
def generate_random_array(array_size):
    return pd.DataFrame(np.random.randint(0, 100,
            size=(array_size // len(COLUMNS), len(COLUMNS))),
            columns=list(COLUMNS),
            dtype=np.uint8)

# TODO: Populate refs with a list of ObjectRefs, one per task, so that Ray computes them in parallel.
%time refs = []

%time ray.get(refs)