Improving stream operation performance #149

process0 · 2022-10-20T17:05:20Z

Would it be possible to improve the performance of stream operations in a threaded manner? My use case is creating and comparing FSTs from ~300B strings. The initial merging operations you've provided work well until the final FST merge. Creating the last stream via union takes hours where only one core is utilized. I imagine the same occurs with difference as well.

I have not familiarized myself with the internals yet, but maybe it is possible (in this case) to partition the keys based on some prefix and batch the work with some synchronization such that the receiver can build the final FST in order? This assumes there is an easy way to partition the prefixes. Maybe identifying eligible partitions could be done traversing and comparing the FST?
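To make the idea concrete, here is a rough sketch of the partitioning scheme using only the standard library. It is not based on the crate's internals: sorted `Vec<String>` inputs stand in for the FST streams, and `union_sorted`, `range_of`, `partitioned_union`, and the hand-picked `bounds` are all hypothetical names and assumptions. Each key range is merged on its own thread, and the receiver consumes the partitions in order, so that last step is still sequential.

```rust
use std::cmp::Ordering;
use std::thread;

/// Merge two sorted, deduplicated slices into one sorted, deduplicated Vec.
fn union_sorted(a: &[String], b: &[String]) -> Vec<String> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => {
                out.push(a[i].clone());
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j].clone());
                j += 1;
            }
            Ordering::Equal => {
                // Key present in both inputs: emit it once.
                out.push(a[i].clone());
                i += 1;
                j += 1;
            }
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Sub-slice of sorted `keys` that falls in [lo, hi); `hi = None` means unbounded.
fn range_of<'a>(keys: &'a [String], lo: &str, hi: Option<&str>) -> &'a [String] {
    let start = keys.partition_point(|k| k.as_str() < lo);
    let end = match hi {
        Some(h) => keys.partition_point(|k| k.as_str() < h),
        None => keys.len(),
    };
    &keys[start..end]
}

/// Union two sorted key lists, one thread per key range.
/// `bounds` are sorted split points chosen by the caller (an assumption here).
fn partitioned_union(a: &[String], b: &[String], bounds: &[&str]) -> Vec<String> {
    // Partitions are ["", b0), [b0, b1), ..., [bn, unbounded).
    let mut lows = vec![""];
    lows.extend_from_slice(bounds);

    let parts: Vec<Vec<String>> = thread::scope(|s| {
        let handles: Vec<_> = lows
            .iter()
            .enumerate()
            .map(|(idx, lo)| {
                let hi = lows.get(idx + 1).copied();
                let (pa, pb) = (range_of(a, lo, hi), range_of(b, lo, hi));
                s.spawn(move || union_sorted(pa, pb))
            })
            .collect();
        // Join in partition order so the receiver sees keys in sorted order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Sequential step: whatever builds the final output consumes the
    // partitions one after another.
    parts.into_iter().flatten().collect()
}
```

With inputs `[apple, banana, cherry]` and `[banana, blueberry, date]` and split points `["b", "c"]`, the three partitions are merged independently and then concatenated in order.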

BurntSushi · 2022-10-20T17:51:18Z

I don't think so. You discuss some perhaps promising routes for parallelizing reading the stream and doing the actual merge, but you don't discuss how to write the result using multiple threads. At some point, you bottom out at a single thread when it comes time to actually write the data to the final FST. I don't see how that itself could be parallelized, and thus don't see much point in parallelizing any other piece. In order to parallelize writing the FST itself, you'd probably have to fundamentally change the algorithm and even the binary format of the FST.

It sounds possible, but it is likely a big enough change that I'd suggest starting a new project to tackle that.

An alternative for your situation is to impose an approximate max FST size and be okay with having multiple FSTs. Querying then requires querying all of the FSTs and merging the results. It's more complex code to write, but the advantage is that querying is then trivially parallelizable.
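A minimal sketch of what that query path could look like, under loud assumptions: `Shard` is a hypothetical stand-in for one size-capped FST (here just a sorted `Vec<String>`), and `prefix_search` and `query_all` are illustrative names, not part of the crate's API. Each shard is queried on its own thread and the per-shard results are merged and deduplicated at the end.

```rust
use std::collections::BTreeSet;
use std::thread;

/// Hypothetical stand-in for one size-capped FST: a sorted, deduplicated key list.
struct Shard {
    keys: Vec<String>,
}

impl Shard {
    fn new(mut keys: Vec<String>) -> Self {
        keys.sort();
        keys.dedup();
        Shard { keys }
    }

    /// All keys in this shard that start with `prefix`, in sorted order.
    fn prefix_search(&self, prefix: &str) -> Vec<String> {
        // Binary search for the first key >= prefix, then scan while it matches.
        let start = self.keys.partition_point(|k| k.as_str() < prefix);
        self.keys[start..]
            .iter()
            .take_while(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
}

/// Query every shard in parallel, then merge the results into one sorted,
/// deduplicated list.
fn query_all(shards: &[Shard], prefix: &str) -> Vec<String> {
    let per_shard: Vec<Vec<String>> = thread::scope(|s| {
        let handles: Vec<_> = shards
            .iter()
            .map(|shard| s.spawn(move || shard.prefix_search(prefix)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // The same key may live in several shards, so merge through a set.
    let merged: BTreeSet<String> = per_shard.into_iter().flatten().collect();
    merged.into_iter().collect()
}
```

The merge step is the extra complexity mentioned above: a key can appear in more than one shard, so the results have to be combined and deduplicated rather than simply concatenated.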
