# Preprocessing with Ray tasks

Cloud servers today often have multiple cores.
We can use Ray to parallelize the available work across all the cores in a single machine and across all the machines in a cluster.

Our local (SQLite) database maps a movie ID to information about the movie, including a color palette.
But some of the color palettes are missing!
In this tutorial, we'll use *Ray tasks* to parallelize the preprocessing step that gathers all of the missing color palettes from a dataset of movie covers.


## Serial execution

First, let's take a look at a serial execution of the code, without Ray.
Evaluate the next two notebook cells (shift+return, if you're new to notebooks), down to and including `get_palette(...)`:

In [None]:
# Some imports that we'll need later on.
import json
from tqdm import tqdm

from colorthief import ColorThief
import ray

from util import MOVIE_IDS, get_db_connection, load_image, progress_bar, delete_palettes

In [None]:
def get_palette(movie_id):
    return ColorThief(load_image(movie_id)).get_palette(color_count=6)

This is the function that we'll use to preprocess a single image.
This function loads an image from disk and then gets the palette using a handy open source library called `colorthief`.
The function returns the palette, which is just a list of integer tuples representing RGB values.
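
To make the return value concrete, here is a small sketch of what such a palette might look like and how it could be turned into CSS-style hex strings. The sample values and the `to_hex` helper are made up for illustration; they are not part of `colorthief` or `util`.

```python
# Hypothetical palette: a list of (R, G, B) integer tuples, each channel in 0..255.
sample_palette = [(34, 28, 26), (201, 172, 138), (120, 65, 43)]


def to_hex(rgb):
    """Format one (R, G, B) tuple as a '#rrggbb' string."""
    r, g, b = rgb
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


hex_palette = [to_hex(color) for color in sample_palette]
print(hex_palette)
```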

Try running the function on an image ID to see how long it takes.
You can pick an image ID from the `MOVIE_IDS` variable, which is a list of all of the available movie IDs:

In [None]:
%time get_palette(MOVIE_IDS[0])

That should have taken several seconds.
Hopefully, we can do better with Ray!

Let's try running this function on many images next.
First we'll fetch all of the movies from our local (SQLite) database that are currently missing a color palette.
We'll get the movie IDs from the database and use these as an input to the `get_palette` function that we defined above.

In [None]:
def get_missing_palettes():
    db = get_db_connection()
    c = db.execute("SELECT id, palette_json FROM movies WHERE palette_json == ''")
    return [r[0] for r in c.fetchall()]

movie_ids = get_missing_palettes()
print("Missing {} movie palettes.".format(len(movie_ids)))
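
If you'd like to see the query in isolation, the sketch below runs the same filter against an in-memory SQLite database. The table layout (`movies` with `id` and `palette_json` columns) is inferred from the query above, and the three rows are made up; the real database behind `get_db_connection` may contain more columns.

```python
import sqlite3

# In-memory stand-in for the tutorial database (assumed schema, made-up rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, palette_json TEXT)")
db.executemany(
    "INSERT INTO movies (id, palette_json) VALUES (?, ?)",
    [(1, "[[34, 28, 26]]"), (2, ""), (3, "")],
)

# Same filter as get_missing_palettes: an empty string marks a missing palette.
cursor = db.execute("SELECT id FROM movies WHERE palette_json = '' ORDER BY id")
missing_ids = [row[0] for row in cursor.fetchall()]
print(missing_ids)
db.close()
```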

Next we'll run `get_palette` on each of these image IDs.
**Evaluate the next cell and let it run.**
We'll use the `tqdm` library to display a progress bar.

In [None]:
palettes = []
for movie_id in tqdm(movie_ids):
    palettes.append(get_palette(movie_id))

As expected, the palettes are taking a while to generate.
That makes sense since we're running the `get_palette` function one movie at a time, and each image takes several seconds to process.

**Cancel the execution of the cell using the &#x25A0; button in the toolbar above.**
We'll try it again, this time using Ray to parallelize the execution.

## Let's Start Ray

Each notebook starts a mini-cluster with one node when we call `ray.init()` below. Then we'll shut it down at the end of the notebook, so be sure to evaluate the cell near the end of each notebook that calls `ray.shutdown()`.

The `ignore_reinit_error=True` argument tells Ray not to complain if we rerun the cell; Ray will just ignore the request to initialize.

In [None]:
ray.init(ignore_reinit_error=True)

## From Python Functions to Ray Tasks

Now that we have a Ray cluster, we can begin to submit *tasks* to the cluster.
You create a Ray _task_ by decorating a normal Python function with `@ray.remote`. These tasks will be scheduled across your Ray cluster.

> **Tip:** The [Ray Package Reference](https://ray.readthedocs.io/en/latest/package-ref.html) in the [Ray Docs](https://ray.readthedocs.io/en/latest/) is useful for exploring the API features we'll learn.

**Try this yourself!** Go up to the cell with the `get_palette` definition and add a `@ray.remote` decorator right before the `def get_palette` line.
This will indicate to Ray that the function can be run on a remote process.

Next, try creating a Ray task.
To invoke a task, you use `function.remote(args)`:

In [None]:
%time get_palette.remote(MOVIE_IDS[0])

What is this `ObjectRef`? A Ray task is an *asynchronous* computation. You'll notice this finished a lot faster than when we ran `get_palette` serially. That's because the actual computation is happening in the background, on a different process.

The `ObjectRef` returned is a future that we use to retrieve the resulting value from the task when it completes. We use `ray.get(ref)` to get it:

In [None]:
ref = get_palette.remote(MOVIE_IDS[0])
%time ray.get(ref)
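
If futures are new to you, the standard library's `concurrent.futures` module offers a rough analogy: `submit` returns immediately with a `Future`, and `result()` blocks until the value is ready, much as `ray.get` does for an `ObjectRef`. This sketch uses a thread pool and a made-up `slow_square` function; it only illustrates the idea and is not how Ray is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Made-up stand-in for an expensive computation."""
    time.sleep(0.5)
    return x * x


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_square, 7)  # returns right away with a Future
    submit_seconds = time.perf_counter() - start

    value = future.result()  # blocks until the work is done
    total_seconds = time.perf_counter() - start

print(f"submit took {submit_seconds:.3f}s, result arrived after {total_seconds:.3f}s")
```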

We can also work with lists of `ObjectRefs`:

In [None]:
refs = [get_palette.remote(MOVIE_IDS[i]) for i in range(3)]
%time ray.get(refs)

This last cell actually processed three different movie covers *in parallel*.
We let Ray know that the movie covers could be processed in parallel by launching three `get_palette` tasks, each on a separate image ID.
Since tasks are asynchronous, launching them takes only a few milliseconds.

We let Ray know when we need the results with the `ray.get` call.
This will block until *all* of the tasks have finished.
Like the serial version, the whole cell still took several seconds to finish.
The difference is that this time we actually processed three movie covers at the same time, instead of just one!

Now let's try running this on all of the missing color palettes.
We'll use the helper function `progress_bar` to display the progress during the `ray.get` call.
Try this yourself:

In [None]:
palettes = [get_palette.remote(movie_id) for movie_id in movie_ids]
palettes = ray.get(progress_bar(palettes))

That should have finished in several seconds!
Now, we just need to save the missing palettes into the database and clean up the cluster:

In [None]:
db = get_db_connection()
for movie_id, palette in zip(movie_ids, palettes):
    # Use bound parameters so SQLite handles quoting of the JSON string.
    db.execute(
        "UPDATE movies SET palette_json = ? WHERE id = ?",
        (json.dumps(palette), movie_id),
    )
db.commit()

ray.shutdown()

## If you need to start over:

Run this code to delete `n` color palettes from the database. Then you can run the notebook again to fill in the missing palettes.

In [None]:
from util import delete_palettes

delete_palettes(n=100)