feat: FST-based Curated Dictionary Spellchecking by grantlemons · Pull Request #258 · Automattic/harper

grantlemons · 2024-11-02T03:05:44Z

Created a new dictionary, `FstDictionary`, backed by a finite-state transducer (FST) map. The map is precomputed from the dictionary file at compile time and allows for extremely fast edit-distance calculations in the spellchecking logic.
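
For readers unfamiliar with the term, the edit distance being computed can be sketched with the textbook dynamic-programming algorithm. This is a std-only illustration and not the PR's implementation: the names `edit_distance` and `fuzzy_match_naive` are hypothetical, and the linear scan over every word is the kind of work an FST-backed lookup is meant to avoid.

```rust
/// Textbook Levenshtein distance, computed over Unicode scalar values
/// (so a multi-byte character counts as a single edit).
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] is the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical brute-force fuzzy lookup: checks every word in the list.
/// Its cost grows with the dictionary size.
fn fuzzy_match_naive<'a>(
    words: &[&'a str],
    query: &str,
    max_distance: usize,
) -> Vec<(&'a str, usize)> {
    let mut found: Vec<(&'a str, usize)> = words
        .iter()
        .map(|w| (*w, edit_distance(w, query)))
        .filter(|(_, d)| *d <= max_distance)
        .collect();
    // Stable sort: words with equal distance keep their input order.
    found.sort_by_key(|(_, d)| *d);
    found
}
```

Calling `fuzzy_match_naive(&["spelling", "spewing", "apple"], "speling", 1)` keeps the two words that are one edit away and drops `"apple"`.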

`FstDictionary` is ideal for the curated dictionary but is immutable, so the existing `FullDictionary` implementation is still used for user and file dictionaries.
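
The split between an immutable curated dictionary and mutable user dictionaries can be pictured as layers behind a shared trait. A minimal sketch, assuming hypothetical names (`WordSource`, `FrozenWords`, `MutableWords`, `Layered`) that are not harper's actual API:

```rust
use std::collections::HashSet;

/// Hypothetical common interface for any word list.
trait WordSource {
    fn contains(&self, word: &str) -> bool;
}

/// Immutable word list: built once, then only queried.
/// Stands in for a precomputed structure such as an FST.
struct FrozenWords {
    sorted: Vec<String>,
}

impl FrozenWords {
    fn new(mut words: Vec<String>) -> Self {
        words.sort();
        words.dedup();
        Self { sorted: words }
    }
}

impl WordSource for FrozenWords {
    fn contains(&self, word: &str) -> bool {
        // Binary search is valid because the list is sorted at construction.
        self.sorted
            .binary_search_by(|w| w.as_str().cmp(word))
            .is_ok()
    }
}

/// Mutable word list for entries added at runtime (e.g. by a user).
#[derive(Default)]
struct MutableWords {
    words: HashSet<String>,
}

impl MutableWords {
    fn add(&mut self, word: &str) {
        self.words.insert(word.to_owned());
    }
}

impl WordSource for MutableWords {
    fn contains(&self, word: &str) -> bool {
        self.words.contains(word)
    }
}

/// Queries each layer in turn; a word is known if any layer contains it.
struct Layered<'a> {
    layers: Vec<&'a dyn WordSource>,
}

impl WordSource for Layered<'_> {
    fn contains(&self, word: &str) -> bool {
        self.layers.iter().any(|layer| layer.contains(word))
    }
}
```

In this arrangement the frozen layer never needs a mutation path, which is what makes a precomputed structure usable for it.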

Using this new dictionary results in much faster spellchecking (about 10x faster when not yet cached per benchmark below).

One downside to this PR is that it moves the Hunspell parsing code out into a new crate, harper-dictionary-parsing, which makes some of the imports a little awkward. For instance, harper-core now imports Span from harper-dictionary-parsing.

Also, there is a known bug in the fst library's handling of Japanese characters, but for the time being this should not pose an issue, as multi-language support is nowhere near being implemented. (#138, fst/#38)

grantlemons · 2024-11-02T18:58:25Z

Made it more correct (better UTF-8 edit-distance support using the levenshtein-automata crate mentioned in #138) and a tiny bit faster by storing the DFA builders thread-locally 💪
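
The thread-local storage idea can be sketched with the standard library alone. Everything below is an assumption for illustration: `AutomatonBuilder` is a hypothetical stand-in for an expensive-to-construct object such as a DFA builder, and `with_builder` is not a function from the PR. The counter exists only to make the caching observable.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many builders were constructed.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a costly-to-build object.
struct AutomatonBuilder {
    max_distance: u8,
}

impl AutomatonBuilder {
    fn new(max_distance: u8) -> Self {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        Self { max_distance }
    }

    /// Placeholder for real work done with the builder.
    fn describe(&self, query: &str) -> String {
        format!("{}~{}", query, self.max_distance)
    }
}

thread_local! {
    // One cache per thread, keyed by maximum edit distance.
    // No locking is needed because each thread only sees its own map.
    static BUILDERS: RefCell<HashMap<u8, AutomatonBuilder>> =
        RefCell::new(HashMap::new());
}

/// Runs `f` with this thread's builder for `max_distance`,
/// constructing it on first use only.
/// Note: `f` must not call `with_builder` itself, or the RefCell
/// borrow would panic.
fn with_builder<R>(max_distance: u8, f: impl FnOnce(&AutomatonBuilder) -> R) -> R {
    BUILDERS.with(|cell| {
        let mut cache = cell.borrow_mut();
        let builder = cache
            .entry(max_distance)
            .or_insert_with(|| AutomatonBuilder::new(max_distance));
        f(builder)
    })
}
```

Two calls to `with_builder(2, ...)` on the same thread construct the builder once; a different thread would construct its own copy.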

elijah-potter

Grant, this is some really great work. There are a couple of test cases and organizational things that need to be looked at, but overall it's looking very good, particularly for the WASM target.

Let me know if you have any questions about my comments.

elijah-potter

I've got some more requests for changes. Still looks good. Let me figure out the performance problem.

elijah-potter

Everything here looks "correct". I'm going to do some more extensive testing before merging.

grantlemons · 2024-11-20T00:30:52Z

Update: For correctness reasons it is now only 80% faster.

grantlemons · 2024-11-23T01:48:11Z

-    // Let common words bubble up, but do not prioritize them over all else.
-    found.sort_by_key(|fmr| fmr.edit_distance + if fmr.metadata.common { 0 } else { 1 });
+    // Make commonality relevant
+    found.sort_by_key(|fmr| if fmr.metadata.common { 0 } else { 1 });


Why did you remove the edit distance portion?
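
To make the question concrete, here is a small sketch comparing the two sort keys from the diff above. The struct definitions are simplified, hypothetical stand-ins for the real types, keeping only the fields the closures touch. Because `sort_by_key` is a stable sort, the commonality-only key preserves whatever order the input already had within each group.

```rust
/// Simplified, hypothetical stand-in for word metadata.
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    common: bool,
}

/// Simplified, hypothetical stand-in for a fuzzy match result.
#[derive(Debug, Clone, PartialEq)]
struct FuzzyMatchResult {
    word: &'static str,
    edit_distance: u8,
    metadata: Metadata,
}

fn result(word: &'static str, edit_distance: u8, common: bool) -> FuzzyMatchResult {
    FuzzyMatchResult {
        word,
        edit_distance,
        metadata: Metadata { common },
    }
}

/// Key from the removed lines: distance plus a penalty of 1 for uncommon words.
fn sort_distance_and_commonality(found: &mut [FuzzyMatchResult]) {
    found.sort_by_key(|fmr| fmr.edit_distance + if fmr.metadata.common { 0 } else { 1 });
}

/// Key from the added lines: commonality only.
/// Stable sorting keeps the prior relative order inside each group.
fn sort_commonality_only(found: &mut [FuzzyMatchResult]) {
    found.sort_by_key(|fmr| if fmr.metadata.common { 0 } else { 1 });
}

/// Helper to inspect the resulting order.
fn words(found: &[FuzzyMatchResult]) -> Vec<&'static str> {
    found.iter().map(|fmr| fmr.word).collect()
}
```

With an input already ordered by distance, the first key lets a common word at distance 2 tie with an uncommon word at distance 1, while the second key moves every common word ahead of every uncommon one, regardless of distance.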

grantlemons added enhancement New feature or request harper-core labels Nov 2, 2024

grantlemons requested review from elijah-potter and lukasmwerner November 2, 2024 03:05

grantlemons self-assigned this Nov 2, 2024

grantlemons linked an issue Nov 2, 2024 that may be closed by this pull request

Use FST as dictionary? #138

Closed

elijah-potter requested changes Nov 4, 2024

elijah-potter force-pushed the fst-dictionary branch from c4645ac to d802b68 Compare November 8, 2024 02:13

elijah-potter requested changes Nov 14, 2024

grantlemons and others added 20 commits November 14, 2024 20:06

feat(fst-dict): create build script to generate map file

d9b361d

feat(fst-dict): start implementing dictionary based around fst map

d9a159d

It was at this moment I recognized the height of my folly.

feat(fst-dict): edit dictionary trait to be less FullDictionary-specific

55f8794

- Removed words_iter()
- Removed words_with_len_iter()
- Added fuzzy_match()
- Added fuzzy_match_str()

fix(fst-dict): remove unneeded mut in build script

caf8e40

fix(fst-dict): use todo for unfinished bits of fst_dictionary.rs

d391b04

feat(fst-dict): mostly implement fuzzy-find

4ff2db7

feat(fst-dict): remove memmap and resolve errors

1e80e9c

feat(fst-dict): implement suggesting logic in spell/mod.rs

01f996c

feat(fst-dict): add a curated function and use in benches

f4a8bb1

fix(fst-dict): enable debug symbols for release profile (for flamegraph)

49287d2

fix(fst-dict): put methods back in Dictionary trait

fc0f464

feat(fst-dict): use a fulldictionary for all parts of fstdictionary except fuzzy search

9879f5a

fix(fst-dict): not incrementing loop variable when consuming from levenshtein

fd0655f

feat(fst-dict): migrate harper-core to use FstDictionary

2dca3d8

feat(fst-dict): move hunspell parsing to new crate harper-dictionary-parsing

1cd50d6

The dictionary parsing is relatively isolated, and the parsing is needed for the harper-core build script.

feat(fst-dict): create FstDictionary-specific tests

36e1e5f

fix(fst-dict): make word len function not mutate word list and remove numbers from dictionary

9e955b3

fix(fst-dict): add Sync to words_iter dynamic return type

6ddc83c

fix(fst-dict): remove unused merged hashmap

39731fa

fix(fst-dict): FstDictionary not matching both normalized and lowercase words

5a95aaf

elijah-potter and others added 5 commits November 15, 2024 18:39

fix: we do not need a whole extra build step and crate for this

a7ffd4f

fix: removed broken benchmark

f9280ec

feat(fst-dict): create new struct FuzzyMatchResult returned from fuzzy find operations

4359a72

fix(fst-dict): change visibility of some FullDictionary functions to pub(super)

025e3f0

- get_word
- get_word_metadata

These two functions are needed by `FstDictionary`, but aren't something we need to expose publicly.

fix: formatting

69cbae1

grantlemons force-pushed the fst-dictionary branch from 1586d7b to 69cbae1 Compare November 16, 2024 02:01

elijah-potter and others added 4 commits November 16, 2024 12:01

fix: use Lrc for MergedDictionary

0d5c600

Revert "fix: use Lrc for MergedDictionary"

cd7a4d4

This reverts commit 0d5c600. Using `Lrc` in this position makes it much harder to support choosing `concurrency` support.

fix: removed dependency on implementation detail

0a01239

fix(fst-dictionary): change doc comments to reflect current state

9b14f92

elijah-potter reviewed Nov 18, 2024

Comment thread harper-core/src/spell/mod.rs Outdated

Comment thread harper-core/src/spell/mod.rs

elijah-potter and others added 4 commits November 18, 2024 09:29

test: removed redundant assertions

e0a8f06

feat: store just one curated dictionary regardless of thread count

55d0c01

Has the added bonus of doubling test execution speed

fix(fst-dictionary): remove features from criterion

5f1d010

style(fst-dictionary): add comments to document common_words_first test

41e4120

grantlemons requested a review from elijah-potter November 19, 2024 20:24

Merge branch 'master' into fst-dictionary

b800cc3

elijah-potter approved these changes Nov 19, 2024

elijah-potter and others added 5 commits November 19, 2024 14:14

fix(perf): removed allocation

16ca2bb

fix(lint): appeased clippy

c020607

fix(fst-dictionary): lowercase and uppercase distances for correctness

652e92f

fix(fst-dict): satisfy clippy

ebe01a6

fix(fst-dict): update dedup and sorting to be hopefully correct

426bd9d

grantlemons and others added 3 commits November 22, 2024 11:59

Revert "fix(fst-dict): update dedup and sorting to be hopefully correct"

8b92ea5

This reverts commit 426bd9d.

style(fst-dict): add comments explaining the dedup in FstDictionary

40d80ab

fix(test): improved performance and correctness

62ebf5e

grantlemons commented Nov 23, 2024

elijah-potter merged commit f3ed4c1 into master Nov 23, 2024

elijah-potter deleted the fst-dictionary branch November 23, 2024 19:05
