
Add unbuffered output option to proj #554

Closed
wants to merge 2 commits

Conversation

QuLogic
Contributor

@QuLogic QuLogic commented Aug 4, 2017

With this option, we are no longer limited to a single path per proj process when creating the plots. There are between 13 and 1462 paths per plot, so cutting that down to one process per plot greatly improves runtime: creating all plots takes 6 minutes on master and 1m25s with this branch.

This reduces external processes from between 13 and 1462 calls to 1
depending on the plot.
@busstoptaktik
Member

@QuLogic: I may be too blind to see the obvious, but could you add a few comments about why it is necessary to switch off the buffering for this to work?

@kbevers - this is probably of interest to you

@kbevers
Member

kbevers commented Oct 19, 2017

> @kbevers - this is probably of interest to you

Well... As I don't plan on generating the complete set of projection plots very often, I don't care much about faster execution time.

I don't understand the difference between unbuffered and buffered output, or why it would make a difference to users.

@aaronpuchert
Contributor

Not the pull request owner, but my understanding is that @QuLogic used to start one process for every path, but with this change would be able to pipe all paths through the same process. (Of course sequentially.)

When using a pipe as stdout, as is the case here, the output is block buffered. This means that printf and similar functions don't write directly to the pipe, but to an internal buffer. The reason is that writing to a file takes a system call, i.e. context switch into the kernel, which is quite slow. Buffered output should be much faster.

The problem here: when starting a new path, some output data from the old path might still be stuck in the buffer. Having unbuffered output is essentially the same as calling fflush after every output call, so it guarantees that all data written to the FILE has landed in the pipe when we're done with a path.
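This "stuck in the buffer" effect can be demonstrated without proj at all. A minimal Python sketch (buffer size and coordinate values are illustrative): the reader of a block-buffered pipe sees nothing until the writer flushes:

```python
import os, io, fcntl

r, w = os.pipe()
# Make the read end non-blocking so we can probe whether data has arrived.
fl = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, fl | os.O_NONBLOCK)

# Wrap the write end in a block-buffered stream, like stdio does for pipes.
buffered = io.open(w, "wb", buffering=4096)
buffered.write(b"45.0 12.0\n")  # one path coordinate, still stuck in the buffer

try:
    os.read(r, 1024)
    arrived = True
except BlockingIOError:
    arrived = False
print("before flush:", arrived)  # False: the pipe is still empty

buffered.flush()                 # like fflush(stdout), or unbuffered output
data = os.read(r, 1024)
print("after flush:", data)      # the data is now in the pipe
```

This is exactly the failure mode with a single long-running child process: the parent waits for output that stdio is still holding back in the child's buffer.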

However, I'm not sure this is a clean solution. Even with buffering turned off, a lot of things have to be in the right place for this to work. A cleaner solution would be to merge the paths in Python, pipe them all through proj at once, and then separate them in the output again. Then there is no need to turn off buffering, and it doesn't rely on peculiar implementation details.

Summary: A better solution would be to not turn off buffering, but merge the input data in Python and separate it there again. Intermediate flushing only works if all parts work together, and is rather fragile.
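As a sketch of that suggestion: since proj emits one output line per input line, the paths can be concatenated, piped through one child process, and split back apart by line count. Here `cat` stands in for the proj invocation, and the coordinate data is made up for illustration:

```python
import subprocess

# Hypothetical paths: each is a list of "lon lat" coordinate lines.
paths = [["10 50", "11 51"], ["0 0"], ["5 5", "6 6", "7 7"]]

# Concatenate all paths into one input, remembering how many lines each had.
counts = [len(p) for p in paths]
joined = "\n".join(line for p in paths for line in p) + "\n"

# Pipe everything through a single child process; `cat` stands in for proj,
# which likewise produces one output line per input line.
out = subprocess.run(["cat"], input=joined,
                     capture_output=True, text=True).stdout

# Split the output back into per-path chunks using the recorded counts.
lines = out.splitlines()
result, pos = [], 0
for n in counts:
    result.append(lines[pos:pos + n])
    pos += n
print(result)
```

Because the whole batch is written and read in one go, ordinary block buffering works in our favor instead of against us.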

@busstoptaktik
Member

Thanks, @aaronpuchert, for this clear analysis. I see the reason now, and vaguely remember having done a similar thing some 20 years ago - which I would obviously not have done, had I known what you have told me now :-)

@QuLogic - what do you think about Aaron's suggestion of assembling and splitting the individual paths in Python, and keeping the I/O buffered?

@QuLogic
Contributor Author

QuLogic commented Nov 16, 2017

That might work for the frame and graticule, but I'm not sure we can use that for the general feature polygons as it's run through shapely's transform machinery. I don't think there's any way to get shapely to "transform and wait for results later". I'm pretty sure that the features (i.e., coastlines) are the majority of the small things that should be batched, so ignoring them is not really going to help.

setbuf is part of C89, so what else exactly do we need to be worried about? Keep in mind that the main use is this plotting script, which is usually run on a proj developer's machine.

@kbevers
Member

kbevers commented Nov 16, 2017

I don't think faster plot generation is enough reason for this to be added to the proj utility; it's just a quick hack. There are better ways to get good performance if that's your concern.

I am still not convinced that there are other use cases for this where interfacing with the library directly won't be the better option. Aaron's analysis backs this up. I am -1 on merging this.

@QuLogic
Contributor Author

QuLogic commented Nov 16, 2017

I think it's a right pain to make any incremental changes to the plots when it's so slow, but as this is a pretty niche change, that's understandable.

@QuLogic QuLogic closed this Nov 16, 2017
@aaronpuchert
Contributor

So you want to use this for an interactive feature? That changes the situation of course.

Piping data through proj or any other UNIX shell utility isn't exactly made for that. This mechanism is designed for throughput, not latency. (Some people call it batch processing for that reason.)

If you want to interactively transform a lot of coordinates, you should use the library directly. Now, I have a very limited understanding of shapely, but it's generally possible to bind to C functions in Python. Since only a few functions are needed for a typical transformation, I don't think this would even be hard.

The good thing is that it scales much better than piping unbuffered data back and forth. I don't think this would have solved your problem sustainably, because having no buffering carries a significant performance impact of its own.
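For illustration, the general pattern for binding to a C library from Python with ctypes looks like this. libm's `cos` stands in for proj's actual entry points, which depend on the proj version and are not shown here:

```python
import ctypes
import ctypes.util

# Load a shared C library and call one of its functions directly,
# with no child process or pipe involved. The same pattern would
# apply to libproj, substituting its transformation functions.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.argtypes = [ctypes.c_double]   # declare the C signature
libm.cos.restype = ctypes.c_double

value = libm.cos(0.0)
print(value)  # 1.0
```

Each call here is an in-process function call, so the per-coordinate overhead is tiny compared to round-tripping text through a pipe.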

@QuLogic
Contributor Author

QuLogic commented Nov 19, 2017

It's not interactive. It's the script that's in this PR.
