
Parallel raster optimisation #228

Merged

merged 34 commits into main from parallel-raster-optimisation on Sep 1, 2021

Conversation

@nickeopti (Contributor) commented Aug 6, 2021

Use a multiprocessing.Pool to optimize raster files in parallel (each raster file is processed individually). Closes #55

Currently uses multiprocessing.cpu_count() worker processes. That seems to be the number of logical (hyper-threading) cores; I'm unsure whether the number of physical cores might be preferable.

Currently has some caveats:

1. The per-file progress bars are removed, so only the overall progress is shown. (And the filename postfix now shows the just-completed file, rather than the file currently being processed.)

2. When raising an exception because an optimized file already exists (and neither the --overwrite nor the --skip-existing flag is set), the entire stack trace is now shown, where previously just a nice little message was printed.

3. The test_reoptimize test in tests/scripts/test_optimize_rasters.py fails. Not because the code behaves incorrectly, but because of how exceptions inside the worker processes are handled. When submitting a task to the Pool, an error_callback function is given, which receives any error raised in a worker (and handles it in the main process). But pytest apparently detects that an error was raised inside a worker and fails the test on that basis. If a preprocessed file already exists and terracotta optimize-rasters is run without --overwrite or --skip-existing, the expected behaviour is to raise an exception; but pytest seems to expect the exception to be raised differently, or something like that. I can't figure out how to make the test accept the new exception flow, so that it only tests whether preexisting optimized files are handled properly, not where or how the exception is raised.
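For reference, the error_callback pattern described in (3) looks roughly like this (a minimal sketch with placeholder names, not the actual PR code):

```python
import multiprocessing


def optimize_raster(path):
    """Hypothetical per-file worker; stands in for the PR's real function."""
    ...


errors = []


def handle_error(exc):
    # error_callback: invoked in the main process when a worker raises.
    errors.append(exc)


if __name__ == "__main__":
    raster_files = ["raster_1.tif", "raster_2.tif"]  # placeholder input
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        for path in raster_files:
            pool.apply_async(optimize_raster, (path,), error_callback=handle_error)
        pool.close()
        pool.join()
    if errors:
        raise errors[0]  # surface the first worker error in the main process
```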

@j08lue (Collaborator) commented Aug 6, 2021

We often use concurrent.futures.ProcessPoolExecutor these days, often with executor.map(worker, iterable). It has a slightly simpler interface than multiprocessing, and maybe a bit better error propagation.

In any case, it's perhaps worth a try to switch to concurrent.futures and see how the exception behaves from there.
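A minimal sketch of that pattern (worker and raster_files are placeholders, not code from this PR):

```python
from concurrent.futures import ProcessPoolExecutor


def worker(path):
    """Placeholder for the per-file optimization function."""
    return path


if __name__ == "__main__":
    raster_files = ["raster_1.tif", "raster_2.tif"]  # placeholder input
    with ProcessPoolExecutor() as executor:
        # Any exception raised inside a worker is re-raised here,
        # in the main process, when its result is consumed.
        for result in executor.map(worker, raster_files):
            print(f"done: {result}")
```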

@dionhaefner (Collaborator)

Using concurrent.futures fixes problems 2 and 3. Exception propagation is notoriously hard in concurrent contexts, and concurrent.futures handles this nicely.

For problem 1, I suggest replacing the progress bars with a spinner (#171) and log messages (and perhaps a --quiet option). So something like

$ terracotta optimize-rasters *.tif
Optimizing 17 files on 5 processes
foo1.tif ...  (1/17)
foo2.tif ...  (2/17)
foo3.tif ...  (3/17)
foo4.tif ...  (4/17)
foo5.tif ...  (5/17)
<spinner>

But I think the real problem here is memory usage. We either need some heuristic that ensures we don't over-commit, or disable concurrency by default. Otherwise we risk crashing people's machines.

BTW, since all of the heavy lifting happens inside GDAL, I suggest you also try using a thread pool instead of processes. Chances are, it's the same speed with less overhead.
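The swap is a one-line change, since the two executors share an interface (a sketch, not code from this PR):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Both executors expose the same map/submit API, so comparing threads
# with processes only requires changing which class is instantiated.
Executor = ThreadPoolExecutor  # swap for ProcessPoolExecutor to compare

if __name__ == "__main__":
    with Executor(max_workers=4) as executor:
        results = list(executor.map(str.upper, ["a.tif", "b.tif"]))
    print(results)
```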

@codecov (bot) commented Aug 9, 2021

Codecov Report

Merging #228 (678d40c) into main (c1c5b56) will increase coverage by 0.10%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #228      +/-   ##
==========================================
+ Coverage   98.46%   98.57%   +0.10%     
==========================================
  Files          45       45              
  Lines        2221     2251      +30     
  Branches      278      287       +9     
==========================================
+ Hits         2187     2219      +32     
+ Misses         19       18       -1     
+ Partials       15       14       -1     
Impacted Files Coverage Δ
terracotta/scripts/optimize_rasters.py 96.62% <100.00%> (+2.00%) ⬆️
terracotta/server/rgb.py 96.07% <0.00%> (-0.15%) ⬇️
terracotta/server/colormap.py 97.22% <0.00%> (-0.08%) ⬇️
terracotta/server/keys.py 100.00% <0.00%> (ø)
terracotta/drivers/mysql.py 100.00% <0.00%> (ø)
terracotta/drivers/sqlite.py 100.00% <0.00%> (ø)
terracotta/server/compute.py 100.00% <0.00%> (ø)
terracotta/server/datasets.py 100.00% <0.00%> (ø)
terracotta/server/metadata.py 100.00% <0.00%> (ø)
terracotta/server/flask_api.py 90.27% <0.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c1c5b56...678d40c.

@nickeopti (Contributor, Author) commented Aug 9, 2021

The branch is now updated to use a ProcessPoolExecutor instead.

On my test data, the optimization takes between 1:00 and 1:10 (min:sec) with a ProcessPoolExecutor, whereas a ThreadPoolExecutor increases the processing time to between 1:30 and 2:30 (it varies a lot, sometimes approaching 3 minutes). But if you prefer a ThreadPoolExecutor anyway, the only required change is swapping that class (on line 292).

The progress indicator now shows

Optimizing 10 files on 8 processes...
Optimized 'file_1.tif'
Optimized 'file_2.tif'
Optimized 'file_3.tif'
Optimized 'file_4.tif'
Optimizing rasters:  40%|███████████████████████████████████████████████████████

where the progress bar shows the overall progress.

I still cannot figure out how to find out which files are currently being processed (i.e., how to be notified when the pool starts processing a new file) -- only which files are done being processed. Hence the 'Optimized' once a file is completed, rather than the 'Optimizing...' in your example.

A --cores flag has been added to manually set the number of cores to use. It now defaults to 1; setting it to -1 uses the logical CPU core count. Let me know if you have a more elegant solution for this.
Also, if you have any suggestions regarding memory-usage heuristics, that would be interesting. Otherwise, I guess the current solution works.
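For reference, a sketch of how such a flag could be wired up with click (the option name is from this PR; the defaults and wiring here are illustrative):

```python
import multiprocessing

import click


@click.command()
@click.option("--cores", type=int, default=1, show_default=True,
              help="Number of worker processes; -1 uses all logical cores.")
def optimize_rasters(cores):
    if cores == -1:
        # Hypothetical resolution of -1 to the logical core count.
        cores = multiprocessing.cpu_count()
    click.echo(f"Optimizing on {cores} processes")


if __name__ == "__main__":
    optimize_rasters()
```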

I have added the --cores flag to the test_optimize_rasters test; I hope that is in accordance with the existing style.
Even though it seems somewhat pointless to test multi-process optimization when only a single file is supplied to the test method, I guess the test works anyway.

@dionhaefner (Collaborator) commented Aug 9, 2021

Let's stay with processes then.

I like it better when the file currently being optimized is printed. That makes it a lot easier to debug hangs or out-of-memory situations (because you know which file is the culprit).

For this I would simply move the print into the subprocesses.
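A minimal sketch of that idea (the name _optimize_single_raster is mentioned later in this thread, but the signature here is illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

import click


def _optimize_single_raster(path, index, total):
    # Echoed from inside the subprocess, so the message appears the
    # moment a worker actually picks the file up.
    click.echo(f"Optimizing '{path}'... ({index}/{total})")
    ...  # the actual optimization work


if __name__ == "__main__":
    files = ["file1.tif", "file2.tif", "file3.tif"]  # placeholder
    with ProcessPoolExecutor() as executor:
        for i, path in enumerate(files, 1):
            executor.submit(_optimize_single_raster, path, i, len(files))
```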

Also, could we switch the progress bar for a spinner? When optimizing large rasters, this can hang for a long time, so I would like to give users the feeling that something is happening.

Here's a snippet to get you started:

import time

import click
import click_spinner


@click.command()
def main():
    # The spinner keeps animating while the loop works,
    # so the user sees activity between log lines.
    with click_spinner.spinner():
        i = 0
        while True:
            # \r returns to the start of the line so the log
            # message overwrites the spinner character.
            click.echo(f"\rfoo {i}")
            time.sleep(4)
            i += 1


if __name__ == "__main__":
    main()

@nickeopti (Contributor, Author)

It now prints something like

Optimizing 10 files on 4 processes...
Optimizing 'file1.tif'... (1/10)
Optimizing 'file2.tif'... (2/10)
Optimizing 'file3.tif'... (3/10)
Optimizing 'file4.tif'... (4/10)
Optimized 'file4.tif'
Optimizing 'file5.tif'... (5/10)
Optimized 'file3.tif'
Optimizing 'file6.tif'... (6/10)
Optimized 'file2.tif'
Optimizing 'file7.tif'... (7/10)
<spinner>

while optimising the raster files.

Does that match your intended style?

@dionhaefner (Collaborator)

It's sort of a nitpick, but could we match the style of #228 (comment) exactly?

@j08lue (Collaborator) commented Aug 10, 2021

exactly

Do you want the dots and (n/m) horizontally aligned, even when file names have different lengths? And at what file name length should names be truncated, if at all? 😉

@dionhaefner (Collaborator)

No alignment or truncation or fancy stuff. Just don't repeat "optimizing" and don't log when you're done :)

@nickeopti (Contributor, Author)

Updated to:

Optimizing 10 files on 4 processes
file1.tif ... (1/10)
file2.tif ... (2/10)
file3.tif ... (3/10)
file4.tif ... (4/10)
file5.tif ... (5/10)
<spinner>

I like it better when the file currently being optimized is printed. That makes it a lot easier to debug hangs or out-of-memory situations (because you know which file is the culprit).

In order to know which files might be causing problems, we need to know which files are currently being processed, right? As the optimization of files probably won't complete in order, I thought we needed either

  1. to log when a file is done, or
  2. to remove the files from the list once they are completed.

I considered the second option -- I think it would be a nice solution -- but deemed it overkill.
A third option is of course to do neither, and simply not know exactly which files are currently being processed.

Btw, occasionally the logging of files happens slightly out of order (file2 might log before file1) when two processes start almost simultaneously. Then the order looks slightly odd (printing (2/10) just before (1/10)). Do we agree that this is unimportant and can just be ignored? Otherwise I think a synchronised counter would be needed, and that seems a bit excessive.

@dionhaefner (Collaborator) left a review:

Looks really good already, let's just get this polished a little more :)

@dionhaefner (Collaborator)

Btw, occasionally the logging of files happens slightly out of order (file2 might log before file1) when two processes start almost simultaneously. Then the order looks slightly odd (printing (2/10) just before (1/10)). Do we agree that this is unimportant and can just be ignored? Otherwise I think a synchronised counter would be needed, and that seems a bit excessive.

Yes, no problem.

@nickeopti (Contributor, Author)

nickeopti commented Sep 1, 2021

Status update:
I believe all change requests are now (finally) implemented.
Codecov is complaining that the exception handling (in the worker processes) isn't tested (lines 331-332). If you know how to trigger an exception there, please let me know. Otherwise I think we'll just have to accept that it isn't tested.

@dionhaefner (Collaborator)

To trigger an exception you can use monkeypatch to replace _optimize_single_raster with a function that raises an exception.
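Something along these lines (the test name and invocation details are hypothetical, not the final test):

```python
def test_optimize_rasters_process_exception(monkeypatch, tmp_path):
    from terracotta.scripts import optimize_rasters

    def _raise(*args, **kwargs):
        raise RuntimeError("boom")

    # Make every per-file task fail so the executor's
    # exception-handling branch is exercised.
    monkeypatch.setattr(optimize_rasters, "_optimize_single_raster", _raise)
    # ... invoke the CLI here and assert that it reports the failure
```

Note that a patch applied in the test process only reaches workers that are forked from it; with a spawn start method the patched function would have to be hit before task submission.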

@dionhaefner (Collaborator)

Thanks! This turned out to be quite a bit of work, but I think it was worth it. This has much better UX than before.

@dionhaefner dionhaefner merged commit 9bd5873 into main Sep 1, 2021
@dionhaefner dionhaefner deleted the parallel-raster-optimisation branch September 1, 2021 19:18