# A Guided Tour of Ray Core: Parallel Iterators

[*Parallel Iterators*](https://docs.ray.io/en/latest/iter.html) provide a simple yet powerful API for data ingest and stream processing, where transformations are based on method chaining.

Parallel iterators get partitioned into *data shards*, and Ray creates a worker (an *actor*) to produce the data for each shard.
Evaluation is *lazy*, i.e., only executed when the application calls `next()` to fetch the next item in a sequence.

Parallel iterators are fully serializable, so they can be passed to remote tasks and actors.
In effect, these can be used to operate over infinite sequences of items, with the processing distributed across a cluster.
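
Lazy evaluation here works much like Python generators. As a rough analogy in plain Python (no Ray involved; `noisy_source` is a made-up helper for illustration), nothing gets computed until `next()` is called, which is also why an unbounded source is workable:

```python
import itertools

log = []  # records when each source item is actually produced

def noisy_source():
    """Hypothetical unbounded source that records each item it produces."""
    for i in itertools.count(1):
        log.append(i)
        yield i

# Chain a filter and a transformation; building the pipeline consumes nothing.
pipeline = (x * 10 for x in noisy_source() if x % 2 == 1)
produced_before = list(log)

first = next(pipeline)   # pulls 1 from the source
second = next(pipeline)  # pulls 2 (filtered out), then 3

print(produced_before, first, second, log)
```

Only three source items are produced to serve the two calls to `next()`, even though the source itself never ends.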

---

First, let's start Ray…

In [None]:
from icecream import ic
import logging
import ray

ray.init(
    ignore_reinit_error=True,
    logging_level=logging.ERROR,
)

print(f"Dashboard URL: http://{ray.get_dashboard_url()}")

## Parallel Iterators

We'll create a parallel iterator from the sequence `items`, using 2 worker actors:

In [None]:
items = [1, 2, 3, 4, 5]

iter1 = ray.util.iter.from_items(items, num_shards=2)
iter1

This `iter1` object can now be passed (i.e., serialized) to remote tasks and remote methods.

To read elements from a parallel iterator, convert it to a [`LocalIterator`](https://docs.ray.io/en/latest/iter.html#ray.util.iter.LocalIterator) using one of two approaches.

Calling [`gather_sync()`](https://docs.ray.io/en/latest/iter.html#ray.util.iter.ParallelIterator.gather_sync)
returns a local iterable for *synchronous* iteration.
In other words, next items will be fetched from the shards on-demand as the application steps through the iterator sequence.

In [None]:
local_iter1 = iter1.gather_sync()
local_iter1

In [None]:
for item in local_iter1:
    ic(item)
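
To build some intuition for on-demand fetching, here's a model in plain Python. It is a sketch under two assumptions, not Ray's implementation: items are dealt to shards round-robin, and a synchronous gather visits the shards in turn. Both helper functions are hypothetical names used only for this illustration.

```python
from itertools import zip_longest

def partition_round_robin(items, num_shards):
    """Assumed partitioning: item i goes to shard i % num_shards."""
    shards = [[] for _ in range(num_shards)]
    for i, item in enumerate(items):
        shards[i % num_shards].append(item)
    return shards

def gather_sync_model(shards):
    """Take one item from each shard in turn until all are exhausted."""
    missing = object()  # sentinel for shards that ran out early
    for round_ in zip_longest(*shards, fillvalue=missing):
        for item in round_:
            if item is not missing:
                yield item

shards = partition_round_robin([1, 2, 3, 4, 5], num_shards=2)
gathered = list(gather_sync_model(shards))

print(shards)    # [[1, 3, 5], [2, 4]]
print(gathered)  # [1, 2, 3, 4, 5]
```

Under those assumptions the gathered sequence matches the original order of `items`, although a real cluster may schedule the work differently.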

When you apply a function to the sequence (i.e., some kind of transformation), a parallel iterator provides semantic guarantees for *fetch ordering*: the transformation is guaranteed to be applied to each element of the sequence before the next item is fetched from the source actor.
For example, this can be useful if you need to update the source actor between iterator steps.

To illustrate a simple case of how to apply a function, first we'll define a class to perform some calculation:

In [None]:
class CumulativeSum:
    def __init__(self):
        self.total = 0

    def __call__(self, x):
        self.total += x
        return (self.total, x)

Now apply that class to the sequence of items, using the [`for_each()`](https://docs.ray.io/en/latest/iter.html#ray.util.iter.ParallelIterator.for_each) method:

In [None]:
for x in iter1.for_each(CumulativeSum()).gather_sync():
    print(x)
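
One detail worth keeping in mind: a stateful callable such as `CumulativeSum` keeps its running total wherever it executes. The sketch below uses plain Python, without Ray, to contrast one shared instance with one copy per shard. The per-shard copy is an assumption about how a callable behaves once it has been serialized to separate shard actors, and the shard contents are made up.

```python
import copy

class CumulativeSum:
    def __init__(self):
        self.total = 0

    def __call__(self, x):
        self.total += x
        return (self.total, x)

# One shared instance applied over the whole sequence.
local = list(map(CumulativeSum(), [1, 2, 3, 4, 5]))

# Assumed model of sharded execution: each shard works on its own copy,
# so each running total only covers the items in that shard.
fn = CumulativeSum()
shards = [[1, 3, 5], [2, 4]]
per_shard = [list(map(copy.deepcopy(fn), shard)) for shard in shards]

print(local)      # [(1, 1), (3, 2), (6, 3), (10, 4), (15, 5)]
print(per_shard)  # [[(1, 1), (4, 3), (9, 5)], [(2, 2), (6, 4)]]
```

If the totals printed by the cell above look like the second result rather than the first, this is the likely reason.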

Alternatively, calling [`gather_async()`](https://docs.ray.io/en/latest/iter.html#ray.util.iter.ParallelIterator.gather_async)
returns a local iterable for *asynchronous* iteration.
In other words, next items will be fetched from the shards asynchronously as soon as the previous item gets computed.
In this case, the fetch ordering only applies per shard.
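
To illustrate what "fetch ordering only applies per shard" means, here's a model that uses threads and a queue in plain Python. It is a sketch of the ordering guarantee only, not of how Ray implements `gather_async()`; `gather_async_model` and the shard contents are made up for this example.

```python
import queue
import threading

def gather_async_model(shards):
    """Yield items from all shards as soon as any of them is available."""
    out = queue.Queue()
    done = object()  # sentinel marking the end of one shard

    def produce(shard):
        for item in shard:
            out.put(item)
        out.put(done)

    threads = [threading.Thread(target=produce, args=(s,)) for s in shards]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = out.get()
        if item is done:
            remaining -= 1
        else:
            yield item

    for t in threads:
        t.join()

shards = [[1, 3, 5], [2, 4]]
results = list(gather_async_model(shards))
print(results)  # interleaving across shards can vary from run to run
```

Items from the same shard always arrive in their original relative order, while the interleaving between shards depends on timing.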

Another way to access a parallel iterator is as a collection of its shards:

In [None]:
iter1.shards()

In [None]:
for shard in iter1.shards():
    ic(shard)
    
    for item in shard:
        ic(item)

Note that each shard should only be read by one process at a time.

As an example, let's iterate through the JSON source for the Jupyter notebooks in this repo as if this were a streaming input source.

We'll filter to get the text in markdown cells, then evaluate it in batches, which splits the input stream into consecutive fixed-size windows:

In [None]:
from pathlib import Path
import numpy as np
import json

nb_items = list(Path(".").glob("ex_*.ipynb"))
window_width = 20

iter2 = (
    ray.util.iter.from_items(nb_items, num_shards=3)
        .for_each(lambda f: json.loads(Path(f).read_text()))  # read and close the file
        .for_each(lambda nb: nb["cells"])
        .flatten()
        .for_each(lambda cell: cell["source"] if cell["cell_type"] == "markdown" else [])
        .flatten()
        .for_each(lambda line: 1 if "Ray" in line else 0)
        .batch(window_width)
        .for_each(np.mean)
)

iter2

Now calculate, for each batch, the fraction of lines that contain the term `Ray`:

In [None]:
for freq in iter2.gather_async():
    ic(freq)
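
To see what the batching step does to a stream of 0/1 flags, here's a small sketch in plain Python. The local `batch` helper is a stand-in written for this example, not Ray's method, and the flags are made up.

```python
from itertools import islice

def batch(iterable, n):
    """Group items into consecutive, non-overlapping lists of up to n items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

flags = [1, 0, 1, 1, 0, 0, 1, 1]  # made-up: 1 if a line mentions "Ray"
windows = list(batch(flags, 4))
freqs = [sum(w) / len(w) for w in windows]

print(windows)  # [[1, 0, 1, 1], [0, 0, 1, 1]]
print(freqs)    # [0.75, 0.5]
```

Each value is the mean of one window, i.e., the share of lines in that window that mention the term.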

Let's rework this to show an example of passing iterator shards to remote functions.
We'll define a remote function `nb_word_count` to tally *word count* among the markdown cells in each notebook:

In [None]:
from collections import defaultdict

@ray.remote
def nb_word_count(shard):
    wc = defaultdict(int)
    punct = """'`<>[](){}*.,:…-'"""
    
    for nb_path in shard:
        with open(nb_path) as f:
            nb = json.load(f)
            for cell in nb["cells"]:
                if cell["cell_type"] == "markdown":
                    for line in cell["source"]:
                        for token in line.strip("# ").lower().split():
                            token = token.strip(punct)
                            if token:  # skip tokens that were only punctuation
                                wc[token] += 1

    return wc
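
To see what the token clean-up inside `nb_word_count` does, here's the same logic applied to a single made-up markdown line, outside of Ray:

```python
punct = """'`<>[](){}*.,:…-'"""
line = "## Parallel *Iterators* in Ray.\n"  # made-up markdown line

# Drop leading heading markers and spaces, lowercase, split on whitespace.
raw_tokens = line.strip("# ").lower().split()

# Trim punctuation from both ends of each token.
tokens = [t.strip(punct) for t in raw_tokens]

print(raw_tokens)  # ['parallel', '*iterators*', 'in', 'ray.']
print(tokens)      # ['parallel', 'iterators', 'in', 'ray']
```

Note that `strip()` only trims characters at the ends of a string, so punctuation inside a token (such as the dot in `ray.get`) is kept.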

Now pass each of the shards to a remote function:

In [None]:
nb_items = list(Path(".").glob("ex_*.ipynb"))

iter3 = ray.util.iter.from_items(nb_items, num_shards=3)

work = [nb_word_count.remote(shard) for shard in iter3.shards()]

To show the end results, we'll aggregate the word counts calculated from each shard:

In [None]:
wc_sum = defaultdict(int)

for wc in ray.get(work):
    for token, count in wc.items():
        wc_sum[token] += count

Then list the tokens ranked in descending order:

In [None]:
for token, count in sorted(wc_sum.items(), key=lambda item: item[1], reverse=True):
    if count > 1:
        ic(token, count)
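
As an alternative sketch, `collections.Counter` from the standard library can handle both the merge and the ranking. The per-shard tallies below are made up and stand in for the results of `ray.get(work)`.

```python
from collections import Counter

# Made-up per-shard tallies, standing in for the remote results.
shard_counts = [
    {"ray": 3, "iterator": 2},
    {"ray": 1, "shard": 4},
    {"iterator": 1},
]

total = Counter()
for wc in shard_counts:
    total.update(wc)  # adds counts rather than replacing them

# most_common() returns (token, count) pairs in descending count order.
ranked = [(token, count) for token, count in total.most_common() if count > 1]
print(ranked)
```

Whether to use `defaultdict(int)` or `Counter` is mostly a matter of taste, although `Counter` saves writing the nested loop and the sort.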

Finally, shut down Ray:

In [None]:
ray.shutdown()

---

## Summary

Parallel iterators provide a somewhat higher-level abstraction built on Ray actors and `ray.wait` loops, and they fit conveniently into efficient software patterns in Python.

Engineering trade-offs are available at multiple levels:

  * trade off compute and memory requirements by partitioning sequences of items into data shards
  * trade off compute and memory requirements for transformations on items by passing the data shards to remote functions (stateless) and remote methods (stateful)
  * trade off the semantic guarantees on fetch ordering by using asynchronous iteration