Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace TPIE sorting by partioning and merging fsa's #181

Closed
hendrikmuhs opened this issue Nov 12, 2020 · 1 comment · Fixed by #186
Closed

Replace TPIE sorting by partioning and merging fsa's #181

hendrikmuhs opened this issue Nov 12, 2020 · 1 comment · Fixed by #186

Comments

@hendrikmuhs
Copy link
Contributor

hendrikmuhs commented Nov 12, 2020

#180 made my think.

For some reason my vision was to replace the sorting code. I realized this might not be necessary, we have everything we need.

I compared the suggestion I gave in #180 on larger data sets:

  • creating a keyvi file from scratch, utilizing TPIE
  • creating x small keyvi files (using the "small data compilers" which don't use TPIE sort) and run merger on it

I ran different cases, in summary the merge approach was roughly 20% slower. Note, I did not optimize anything (I used simple python scripts). My merge approach had to copy more data, an improved implementation would avoid that.

The idea is as follows

  1. create an in-memory sorter
  2. if the in-memory sort buffer hits the threshold, sort the data, create an fsa, persist it, free buffers
  3. go to 1.
  4. after all data has been processed, sort, create, persist the final chunk
  5. merge the fsa's and create the final keyvi file
@narekgharibyan
Copy link
Member

@hendrikmuhs yeah, I really liked this. And this approach was tested on production on large scale datasets, so def worth doing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants