Replace TPIE sorting by partitioning and merging FSAs #181

hendrikmuhs · 2020-11-12T07:20:52Z

#180 made me think.

For some reason my vision was to replace the sorting code. I realized this might not be necessary: we already have everything we need.

I tested the suggestion I gave in #180 on larger data sets by comparing two approaches:

1. creating a keyvi file from scratch, utilizing TPIE
2. creating x small keyvi files (using the "small data compilers", which don't use TPIE sort) and running the merger on them

I ran different cases. In summary, the merge approach was roughly 20% slower. Note that I did not optimize anything (I used simple Python scripts). My merge approach had to copy more data; an improved implementation would avoid that.

The idea is as follows:

1. create an in-memory sorter
2. if the in-memory sort buffer hits the threshold, sort the data, create an FSA, persist it, and free the buffers
3. go to 1
4. after all data has been processed, sort, create, and persist the final chunk
5. merge the FSAs and create the final keyvi file
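
To make the steps above concrete, here is a minimal sketch of the partition-and-merge flow in plain Python. It is not keyvi code: sorted text files stand in for the small FSAs, `heapq.merge` stands in for the keyvi merger, and the function names (`write_sorted_chunks`, `merge_chunks`) are hypothetical. It also assumes the threshold is counted in keys rather than bytes, that keys contain no newlines, and that duplicate keys collapse into one entry.

```python
import heapq
import os
import tempfile


def flush_chunk(buffer, chunk_dir, index):
    """Sort one in-memory buffer and persist it (a sorted file stands in for a small FSA)."""
    path = os.path.join(chunk_dir, f"chunk-{index}.txt")
    with open(path, "w", encoding="utf-8") as out:
        for key in sorted(buffer):
            out.write(key + "\n")
    return path


def write_sorted_chunks(keys, threshold, chunk_dir):
    """Steps 1-4: buffer keys in memory, flushing a sorted chunk at the threshold."""
    paths = []
    buffer = []
    for key in keys:
        buffer.append(key)
        if len(buffer) >= threshold:
            paths.append(flush_chunk(buffer, chunk_dir, len(paths)))
            buffer = []  # free the buffer before reading more input
    if buffer:  # the final, possibly smaller, chunk
        paths.append(flush_chunk(buffer, chunk_dir, len(paths)))
    return paths


def read_chunk(path):
    """Stream the keys of one persisted chunk in sorted order."""
    with open(path, encoding="utf-8") as chunk:
        for line in chunk:
            yield line.rstrip("\n")


def merge_chunks(paths):
    """Step 5: k-way merge of the sorted chunks, yielding each key once."""
    previous = None
    for key in heapq.merge(*(read_chunk(p) for p in paths)):
        if key != previous:
            yield key
        previous = key


with tempfile.TemporaryDirectory() as chunk_dir:
    chunk_paths = write_sorted_chunks(
        ["pear", "apple", "fig", "apple", "kiwi"], threshold=2, chunk_dir=chunk_dir
    )
    print(list(merge_chunks(chunk_paths)))  # ['apple', 'fig', 'kiwi', 'pear']
```

Only one buffer of at most `threshold` keys is held in memory at a time, and the merge streams from the persisted chunks, which is what makes a separate external sorter unnecessary in this scheme.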

narekgharibyan · 2020-11-12T20:38:34Z

@hendrikmuhs yeah, I really like this. This approach has also been tested in production on large-scale datasets, so it's definitely worth doing!

hendrikmuhs mentioned this issue Nov 24, 2020

Refactor compiler to use a merge based approach for large dictionaries #186

Merged

hendrikmuhs closed this as completed in #186 Dec 5, 2020

hendrikmuhs mentioned this issue Dec 5, 2020

remove tpie from repository #194

Merged

hendrikmuhs added a commit that referenced this issue Dec 5, 2020

remove tpie from repository (#194)

43f56e2

remove tpie from repository relates #186, #180, #181

rickbeeloo mentioned this issue Mar 8, 2021

Advice to build a dictionary of already sorted keys #205

Closed
