Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving stream operation performance #149

Open
process0 opened this issue Oct 20, 2022 · 1 comment
Open

Improving stream operation performance #149

process0 opened this issue Oct 20, 2022 · 1 comment

Comments

@process0
Copy link

Would it be possible to improve the performance of stream operations in a threaded manner? My use case is creating and comparing FSTs from ~300B strings. The initial merging operations you've provided work well until the final FST merge. Creating the last stream via union takes hours where only one core is utilized. I imagine the same occurs with difference as well.

I have not familiarized myself with the internals yet, but maybe it is possible (in this case) to partition the keys based on some prefix and batch the work with some synchronization such that the receiver can build the final FST in order? This assumes there is an easy way to partition the prefixes. Maybe identifying eligible partitions could be done traversing and comparing the FST?

@BurntSushi
Copy link
Owner

I don't think so. You discuss perhaps some promising routes for how to parallelize reading the stream and doing the actual merge, but you don't discuss how to write the result using multiple threads. At some point, you bottom out to a single thread when it comes time to actually write the data to the final FST. I don't see how that itself could be parallelized, and thus don't see much point in parallelizing any other piece. In order to parallelize writing the FST itself, you'd have to fundamentally change probably the algorithm and even the binary format of the FST itself.

It sounds possible, but it likely a big enough change that I'd suggest starting a new project to tackle that.

An alternative to your situation is to impose an approximate max FST size and be okay with having multiple FSTs. Querying then requires querying all of the FSTs and merging the results. It's more complex code to write, but the advantage is that querying is then trivially parallelizeable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants