
Add unbuffered output option to proj #554

Closed
wants to merge 2 commits

Conversation

QuLogic
Contributor

@QuLogic QuLogic commented Aug 4, 2017

With this option, we are no longer limited to a single path per proj process when creating the plots. There are between 13 and 1462 paths per plot, so cutting that down to one process per plot greatly improves runtime: creating all plots takes 6 minutes on master and 1m25s with this branch.

This reduces external processes from between 13 and 1462 calls to 1
depending on the plot.
@busstoptaktik
Member

@QuLogic: I may be too blind to see the obvious, but could you add a few comments about why it is necessary to switch off the buffering for this to work?

@kbevers - this is probably of interest to you

@kbevers
Member

kbevers commented Oct 19, 2017

> @kbevers - this is probably of interest to you

Well... As I don't plan on generating the complete set of projection plots very often, I don't care much about faster execution time.

I don't understand the difference between unbuffered and buffered output, or why it would make a difference to users.

@aaronpuchert
Contributor

Not the pull request owner, but my understanding is that @QuLogic used to start one process for every path, but with this change would be able to pipe all paths through the same process. (Of course sequentially.)

When using a pipe as stdout, as is the case here, the output is block buffered. This means that printf and similar functions don't write directly to the pipe, but to an internal buffer. The reason is that writing to a file takes a system call, i.e. context switch into the kernel, which is quite slow. Buffered output should be much faster.

The problem here: when starting a new path, some output data from the old path might still be stuck in the buffer. Having unbuffered output is essentially the same as calling fflush after every output call, so it guarantees that all data written to the FILE has landed in the pipe when we're done with a path.
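This "stuck in the buffer" effect can be demonstrated without proj at all. A minimal Python sketch (buffer size and coordinate values are illustrative): the reader of a block-buffered pipe sees nothing until the writer flushes:

```python
import os, io, fcntl

r, w = os.pipe()
# Make the read end non-blocking so we can probe whether data has arrived.
fl = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, fl | os.O_NONBLOCK)

# Wrap the write end in a block-buffered stream, like stdio does for pipes.
buffered = io.open(w, "wb", buffering=4096)
buffered.write(b"45.0 12.0\n")  # one path coordinate, still stuck in the buffer

try:
    os.read(r, 1024)
    arrived = True
except BlockingIOError:
    arrived = False
print("before flush:", arrived)  # False: the pipe is still empty

buffered.flush()                 # like fflush(stdout), or unbuffered output
data = os.read(r, 1024)
print("after flush:", data)      # the data is now in the pipe
```

This is exactly the failure mode with a single long-running child process: the parent waits for output that stdio is still holding back in the child's buffer.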

However, I'm not sure this is a clean solution. Even with buffering turned off, a lot of things have to be in the right place for this to work. A cleaner solution would be to merge the paths in Python, pipe them all through proj at once, and then separate them in the output again. Then there is no need to turn off buffering, and it doesn't rely on peculiar implementation details.

Summary: A better solution would be to not turn off buffering, but merge the input data in Python and separate it there again. Intermediate flushing only works if all parts work together, and is rather fragile.
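As a sketch of that suggestion: since proj emits one output line per input line, the paths can be concatenated, piped through one child process, and split back apart by line count. Here `cat` stands in for the proj invocation, and the coordinate data is made up for illustration:

```python
import subprocess

# Hypothetical paths: each is a list of "lon lat" coordinate lines.
paths = [["10 50", "11 51"], ["0 0"], ["5 5", "6 6", "7 7"]]

# Concatenate all paths into one input, remembering how many lines each had.
counts = [len(p) for p in paths]
joined = "\n".join(line for p in paths for line in p) + "\n"

# Pipe everything through a single child process; `cat` stands in for proj,
# which likewise produces one output line per input line.
out = subprocess.run(["cat"], input=joined,
                     capture_output=True, text=True).stdout

# Split the output back into per-path chunks using the recorded counts.
lines = out.splitlines()
result, pos = [], 0
for n in counts:
    result.append(lines[pos:pos + n])
    pos += n
print(result)
```

Because the whole batch is written and read in one go, ordinary block buffering works in our favor instead of against us.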

@busstoptaktik
Member

Thanks, @aaronpuchert, for this clear analysis. I see the reason now, and vaguely remember having done a similar thing some 20 years ago - which I would obviously not have done, had I known what you have told me now :-)

@QuLogic - what do you think about Aaron's suggestion of assembling and splitting the individual paths in Python, and keeping the I/O buffered?

@QuLogic
Contributor Author

QuLogic commented Nov 16, 2017

That might work for the frame and graticule, but I'm not sure we can use that for the general feature polygons as it's run through shapely's transform machinery. I don't think there's any way to get shapely to "transform and wait for results later". I'm pretty sure that the features (i.e., coastlines) are the majority of the small things that should be batched, so ignoring them is not really going to help.

setbuf is part of C89, so what else exactly do we need to be worried about? Keep in mind that the main use is this plotting script, which is usually run on a proj developer's machine.

@kbevers
Member

kbevers commented Nov 16, 2017

I don't think faster plot generation is enough reason for this to be added to the proj utility; it's just a quick hack. There are better ways to get good performance if that's your concern.

I am still not convinced that there are other use cases for this where interfacing with the library directly won't be the better option. Aaron's analysis backs this up. I am -1 on merging this.

@QuLogic
Contributor Author

QuLogic commented Nov 16, 2017

I think it's a right pain to make any incremental changes to the plots when it's so slow, but as this is a pretty niche change, that's understandable.

@QuLogic QuLogic closed this Nov 16, 2017
@aaronpuchert
Contributor

So you want to use this for an interactive feature? That changes the situation of course.

Piping data through proj or any other UNIX shell utility isn't exactly made for that. This mechanism is designed for throughput, not latency. (Some people call it batch processing for that reason.)

If you want to interactively transform a lot of coordinates, you should use the library directly. Now, I have a very limited understanding of shapely, but it's generally possible to bind to C functions in Python. Since only a few functions are needed for a typical transformation, I don't think this would even be hard.

The good thing is that it scales much better than piping unbuffered data back and forth. I don't think this would have solved your problem sustainably, because having no buffering carries a significant performance impact of its own.
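For illustration, the general pattern for binding to a C library from Python with ctypes looks like this. libm's `cos` stands in for proj's actual entry points, which depend on the proj version and are not shown here:

```python
import ctypes
import ctypes.util

# Load a shared C library and call one of its functions directly,
# with no child process or pipe involved. The same pattern would
# apply to libproj, substituting its transformation functions.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.argtypes = [ctypes.c_double]   # declare the C signature
libm.cos.restype = ctypes.c_double

value = libm.cos(0.0)
print(value)  # 1.0
```

Each call here is an in-process function call, so the per-coordinate overhead is tiny compared to round-tripping text through a pipe.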

@QuLogic
Contributor Author

QuLogic commented Nov 19, 2017

It's not interactive. It's the script that's in this PR.
