
Parallel raster optimisation #228

Merged

merged 34 commits into main from parallel-raster-optimisation on Sep 1, 2021

Conversation

@nickeopti (Contributor) commented Aug 6, 2021

Use a multiprocessing.Pool to optimize raster files in parallel (each raster file is processed individually). Closes #55

Currently uses multiprocessing.cpu_count() worker processes. That seems to be the number of logical (hyper-threading) cores; I'm unsure whether the number of physical cores might be preferable.

Currently has some caveats:

1. The per-file progress bars are removed, so only the overall progress is shown. (And the filename postfix now shows the just-completed file, rather than the file currently being processed.)

2. When raising an exception because an optimized file already exists (and neither the --overwrite nor the --skip-existing flag is set), the entire stack trace is now shown, where previously just a nice little message was printed.

3. The test_reoptimize test in tests/scripts/test_optimize_rasters.py fails. Not because the code behaves incorrectly, but because of how exceptions inside the worker processes are handled. When submitting a task to the Pool, an error_callback function is given, which receives any error raised in a worker (and handles it in the main process). But pytest apparently detects that an error was raised inside a worker and fails the test on that basis. If a preprocessed file already exists and terracotta optimize-rasters is run without --overwrite or --skip-existing, the expected behaviour is to raise an exception; but pytest seems to expect the exception to be raised differently, or something like that. I can't figure out how to make the test accept the new exception flow, so that it only tests whether preexisting optimized files are handled properly, not where or how the exception is raised.
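For reference, the error_callback pattern described in (3) looks roughly like this (a minimal sketch with placeholder names, not the actual PR code):

```python
import multiprocessing


def optimize_raster(path):
    """Hypothetical per-file worker; stands in for the PR's real function."""
    ...


errors = []


def handle_error(exc):
    # error_callback: invoked in the main process when a worker raises.
    errors.append(exc)


if __name__ == "__main__":
    raster_files = ["raster_1.tif", "raster_2.tif"]  # placeholder input
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        for path in raster_files:
            pool.apply_async(optimize_raster, (path,), error_callback=handle_error)
        pool.close()
        pool.join()
    if errors:
        raise errors[0]  # surface the first worker error in the main process
```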

@j08lue (Collaborator) commented Aug 6, 2021

We often use concurrent.futures.ProcessPoolExecutor these days, often with executor.map(worker, iterable). It has a slightly simpler interface than multiprocessing, and maybe a bit better error propagation.

In any case, it's perhaps worth a try to switch to concurrent.futures and see how the exception behaves from there.
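A minimal sketch of that pattern (worker and raster_files are placeholders, not code from this PR):

```python
from concurrent.futures import ProcessPoolExecutor


def worker(path):
    """Placeholder for the per-file optimization function."""
    return path


if __name__ == "__main__":
    raster_files = ["raster_1.tif", "raster_2.tif"]  # placeholder input
    with ProcessPoolExecutor() as executor:
        # Any exception raised inside a worker is re-raised here,
        # in the main process, when its result is consumed.
        for result in executor.map(worker, raster_files):
            print(f"done: {result}")
```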

@dionhaefner (Collaborator)

Using concurrent.futures fixes problems 2 and 3. Exception propagation is notoriously hard in concurrent contexts, and concurrent.futures handles this nicely.

For problem 1, I suggest replacing the progress bars with a spinner (#171) and log messages (and perhaps a --quiet option). So something like

$ terracotta optimize-rasters *.tif
Optimizing 17 files on 5 processes
foo1.tif ...  (1/17)
foo2.tif ...  (2/17)
foo3.tif ...  (3/17)
foo4.tif ...  (4/17)
foo5.tif ...  (5/17)
<spinner>

But I think the real problem here is memory usage. We either need some heuristic that ensures we don't over-commit, or disable concurrency by default. Otherwise we risk crashing people's machines.

BTW, since all of the heavy lifting happens inside GDAL, I suggest you also try using a thread pool instead of processes. Chances are, it's the same speed with less overhead.
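The swap is a one-line change, since the two executors share an interface (a sketch, not code from this PR):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Both executors expose the same map/submit API, so comparing threads
# with processes only requires changing which class is instantiated.
Executor = ThreadPoolExecutor  # swap for ProcessPoolExecutor to compare

if __name__ == "__main__":
    with Executor(max_workers=4) as executor:
        results = list(executor.map(str.upper, ["a.tif", "b.tif"]))
    print(results)
```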

@codecov (bot) commented Aug 9, 2021

Codecov Report

Merging #228 (678d40c) into main (c1c5b56) will increase coverage by 0.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   98.46%   98.57%   +0.10%     
==========================================
  Files          45       45              
  Lines        2221     2251      +30     
  Branches      278      287       +9     
==========================================
+ Hits         2187     2219      +32     
+ Misses         19       18       -1     
+ Partials       15       14       -1     
Impacted Files Coverage Δ
terracotta/scripts/optimize_rasters.py 96.62% <100.00%> (+2.00%) ⬆️
terracotta/server/rgb.py 96.07% <0.00%> (-0.15%) ⬇️
terracotta/server/colormap.py 97.22% <0.00%> (-0.08%) ⬇️
terracotta/server/keys.py 100.00% <0.00%> (ø)
terracotta/drivers/mysql.py 100.00% <0.00%> (ø)
terracotta/drivers/sqlite.py 100.00% <0.00%> (ø)
terracotta/server/compute.py 100.00% <0.00%> (ø)
terracotta/server/datasets.py 100.00% <0.00%> (ø)
terracotta/server/metadata.py 100.00% <0.00%> (ø)
terracotta/server/flask_api.py 90.27% <0.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1c5b56...678d40c.

@nickeopti (Contributor, Author) commented Aug 9, 2021

The branch is now updated to use a ProcessPoolExecutor instead.

On my test data, the optimization takes between 1:00 and 1:10 (min:sec) with a ProcessPoolExecutor, whereas a ThreadPoolExecutor increases the processing time to between 1:30 and 2:30 (it varies a lot, sometimes approaching 3 minutes). But if you prefer a ThreadPoolExecutor anyway, the only required change is swapping that class (on line 292).

The progress indicator now shows

Optimizing 10 files on 8 processes...
Optimized 'file_1.tif'
Optimized 'file_2.tif'
Optimized 'file_3.tif'
Optimized 'file_4.tif'
Optimizing rasters:  40%|███████████████████████████████████████████████████████

where the progress bar shows the overall progress.

I still cannot figure out how to find out which files are currently being processed (i.e., how to be notified when the pool starts processing a new file) -- only which files are done being processed. Hence the 'Optimized' once a file is completed, rather than the 'Optimizing...' in your example.

A --cores flag has been added to manually set the number of cores to use. It now defaults to 1; setting it to -1 uses the logical CPU core count. Let me know if you have a more elegant solution for this.
Also, if you have any suggestions regarding memory-usage heuristics, that would be interesting. Otherwise, I guess the current solution works.
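For reference, a sketch of how such a flag could be wired up with click (the option name is from this PR; the defaults and wiring here are illustrative):

```python
import multiprocessing

import click


@click.command()
@click.option("--cores", type=int, default=1, show_default=True,
              help="Number of worker processes; -1 uses all logical cores.")
def optimize_rasters(cores):
    if cores == -1:
        # Hypothetical resolution of -1 to the logical core count.
        cores = multiprocessing.cpu_count()
    click.echo(f"Optimizing on {cores} processes")


if __name__ == "__main__":
    optimize_rasters()
```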

I have added the --cores flag to the test_optimize_rasters test; I hope that is in accordance with the existing style.
Even though it seems somewhat pointless to test multi-process optimization when only a single file is supplied to the test method, I guess the test works anyway.

@dionhaefner (Collaborator) commented Aug 9, 2021

Let's stay with processes then.

I like it better when the file currently being optimized is printed. That makes it a lot easier to debug hangs or out-of-memory situations (because you know which file is the culprit).

For this I would simply move the print into the subprocesses.
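A minimal sketch of that idea (the name _optimize_single_raster is mentioned later in this thread, but the signature here is illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

import click


def _optimize_single_raster(path, index, total):
    # Echoed from inside the subprocess, so the message appears the
    # moment a worker actually picks the file up.
    click.echo(f"Optimizing '{path}'... ({index}/{total})")
    ...  # the actual optimization work


if __name__ == "__main__":
    files = ["file1.tif", "file2.tif", "file3.tif"]  # placeholder
    with ProcessPoolExecutor() as executor:
        for i, path in enumerate(files, 1):
            executor.submit(_optimize_single_raster, path, i, len(files))
```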

Also, could we switch the progress bar for a spinner? When optimizing large rasters, this can hang for a long time, so I would like to give users the feeling that something is happening.

Here's a snippet to get you started:

import time

import click
import click_spinner


@click.command()
def main():
    # The spinner keeps animating while the loop works,
    # so the user sees activity between log lines.
    with click_spinner.spinner():
        i = 0
        while True:
            # \r returns to the start of the line so the log
            # message overwrites the spinner character.
            click.echo(f"\rfoo {i}")
            time.sleep(4)
            i += 1


if __name__ == "__main__":
    main()

@nickeopti (Contributor, Author)

It now prints something like

Optimizing 10 files on 4 processes...
Optimizing 'file1.tif'... (1/10)
Optimizing 'file2.tif'... (2/10)
Optimizing 'file3.tif'... (3/10)
Optimizing 'file4.tif'... (4/10)
Optimized 'file4.tif'
Optimizing 'file5.tif'... (5/10)
Optimized 'file3.tif'
Optimizing 'file6.tif'... (6/10)
Optimized 'file2.tif'
Optimizing 'file7.tif'... (7/10)
<spinner>

while optimising the raster files.

Does that match your intended style?

@dionhaefner (Collaborator)

It's sort of a nitpick, but could we match the style of #228 (comment) exactly?

@j08lue (Collaborator) commented Aug 10, 2021

exactly

Do you want the dots and (n/m) horizontally aligned, even when file names have different lengths? And at what file name length should names be truncated, if at all? 😉

@dionhaefner (Collaborator)

No alignment or truncation or fancy stuff. Just don't repeat "optimizing" and don't log when you're done :)

@nickeopti (Contributor, Author)

Updated to:

Optimizing 10 files on 4 processes
file1.tif ... (1/10)
file2.tif ... (2/10)
file3.tif ... (3/10)
file4.tif ... (4/10)
file5.tif ... (5/10)
<spinner>

I like it better when the file currently being optimized is printed. That makes it a lot easier to debug hangs or out-of-memory situations (because you know which file is the culprit).

In order to know which files might be causing problems, we need to know which files are currently being processed, right? As the optimization of files probably won't complete in order, I thought we needed either

  1. to log when a file is done, or
  2. to remove the files from the list once they are completed.

I considered the second option -- I think it would be a nice solution -- but deemed it overkill.
A third option is of course to do neither, and simply not know exactly which files are currently being processed.

Btw, occasionally the logging of files happens slightly out of order (file2 might log before file1) when two processes start almost simultaneously. Then the order looks slightly odd (printing (2/10) just before (1/10)). Do we agree that this is unimportant and can just be ignored? Otherwise I think a synchronised counter would be needed, and that seems a bit excessive.

@dionhaefner (Collaborator) left a review:

Looks really good already, let's just get this polished a little more :)

@dionhaefner (Collaborator)

Btw, occasionally the logging of files happens slightly out of order (file2 might log before file1) when two processes start almost simultaneously. Then the order looks slightly odd (printing (2/10) just before (1/10)). Do we agree that this is unimportant and can just be ignored? Otherwise I think a synchronised counter would be needed, and that seems a bit excessive.

Yes, no problem.

@nickeopti (Contributor, Author)

nickeopti commented Sep 1, 2021

Status update:
I believe all change requests are now (finally) implemented.
Codecov is complaining that the exception handling (in the worker processes) isn't tested (lines 331-332). If you know how to trigger an exception there, please let me know. Otherwise I think we'll just have to accept that it isn't tested.

@dionhaefner (Collaborator)

To trigger an exception you can use monkeypatch to replace _optimize_single_raster with a function that raises an exception.
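Something along these lines (the test name and invocation details are hypothetical, not the final test):

```python
def test_optimize_rasters_process_exception(monkeypatch, tmp_path):
    from terracotta.scripts import optimize_rasters

    def _raise(*args, **kwargs):
        raise RuntimeError("boom")

    # Make every per-file task fail so the executor's
    # exception-handling branch is exercised.
    monkeypatch.setattr(optimize_rasters, "_optimize_single_raster", _raise)
    # ... invoke the CLI here and assert that it reports the failure
```

Note that a patch applied in the test process only reaches workers that are forked from it; with a spawn start method the patched function would have to be hit before task submission.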

@dionhaefner (Collaborator)

Thanks! This turned out to be quite a bit of work, but I think it was worth it. This has much better UX than before.

@dionhaefner dionhaefner merged commit 9bd5873 into main Sep 1, 2021
@dionhaefner dionhaefner deleted the parallel-raster-optimisation branch September 1, 2021 19:18