A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
It's helpful for
- Precision and Recall: Word matching is a retrieval process, LOGICAL match improves precision while TEXT VARIATIONS match improves recall.
- Content Filtering: Detecting and filtering out offensive or sensitive words.
- Search Engines: Improving search results by identifying relevant keywords.
- Text Analysis: Extracting specific information from large volumes of text.
- Spam Detection: Identifying spam content in emails or messages.
- ···
For detailed implementation, see the Design Document.
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Transformation:
  - Fanjian: Simplify traditional Chinese characters to simplified ones.
    Example: `蟲艸` -> `虫草`
  - Delete: Remove specific characters.
    Example: `*Fu&*iii&^%%*&kkkk` -> `Fuiiikkkk`
  - Normalize: Normalize special characters to identifiable characters.
    Example: `𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟!` -> `hello world!`
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching.
    Example: `西安` -> `xi an`, matches `洗按` -> `xi an`, but not `先` -> `xian`
  - PinYinChar: Convert Chinese characters to Pinyin.
    Example: `西安` -> `xian`, matches both `洗按` and `先` -> `xian`
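To make the transformations above concrete, here is a minimal stdlib-only sketch of a char-wise mapping/deletion pass. The two-entry Fanjian table and the delete list are hypothetical stand-ins; the real library ships full, generated conversion tables.

```rust
use std::collections::HashMap;

// Illustrative sketch only: delete unwanted characters, then map each
// remaining character through a (tiny, stand-in) conversion table.
fn process(text: &str, map: &HashMap<char, char>, delete: &[char]) -> String {
    text.chars()
        .filter(|c| !delete.contains(c))      // Delete step
        .map(|c| *map.get(&c).unwrap_or(&c))  // mapping step (e.g. Fanjian)
        .collect()
}

fn main() {
    // Fanjian: traditional -> simplified (two sample entries only).
    let fanjian = HashMap::from([('蟲', '虫'), ('艸', '草')]);
    assert_eq!(process("蟲艸", &fanjian, &[]), "虫草");

    // Delete: strip noise characters.
    let empty: HashMap<char, char> = HashMap::new();
    assert_eq!(
        process("*Fu&*iii&^%%*&kkkk", &empty, &['*', '&', '^', '%']),
        "Fuiiikkkk"
    );
}
```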
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: `hello&world` matches `hello world` and `world,hello`
  - Example: `无&法&无&天` matches `无无法天` (because `无` is repeated twice), but not `无法天`
  - Example: `hello~helloo~hhello` matches `hello` but not `helloo` and `hhello`
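The `&` (AND, repetition-aware) and `~` (NOT) rules can be sketched with the standard library alone; the two functions below are hypothetical illustrations of the semantics, not the library's API.

```rust
use std::collections::HashMap;

// AND: every word joined by `&` must occur in the haystack at least as
// many times as it is listed in the pattern.
fn and_match(pattern: &str, haystack: &str) -> bool {
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in pattern.split('&') {
        *required.entry(word).or_insert(0) += 1;
    }
    required
        .iter()
        .all(|(word, &n)| haystack.matches(word).count() >= n)
}

// NOT: the part before the first `~` must match, and none of the parts
// after a `~` may appear in the haystack.
fn not_match(pattern: &str, haystack: &str) -> bool {
    let mut parts = pattern.split('~');
    let positive = parts.next().unwrap_or("");
    haystack.contains(positive) && parts.all(|neg| !haystack.contains(neg))
}

fn main() {
    assert!(and_match("hello&world", "world,hello"));
    assert!(and_match("无&法&无&天", "无无法天")); // 无 appears twice
    assert!(!and_match("无&法&无&天", "无法天")); // 无 appears only once
    assert!(not_match("hello~helloo~hhello", "hello"));
    assert!(!not_match("hello~helloo~hhello", "helloo"));
}
```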
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
See the Rust README.
See the Python README.
We provide a dynamic library to link against. See the C README and Java README.
```shell
git clone https://github.com/Lips7/Matcher.git
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --default-toolchain nightly -y
cargo build --release
```
Then you should find the `libmatcher_c.so`/`libmatcher_c.dylib`/`matcher_c.dll` in the `target/release` directory.
Visit the release page to download the pre-built binary.
Please refer to benchmarks for details.
- Cache intermediate results across `reduce_process_text` calls with different `ProcessType`s. (failed, too slow)
- Try more aho-corasick-like libraries to improve performance and reduce memory usage.
  - https://github.com/daac-tools/crawdad (produces a char-wise index, not a byte-wise index, which is not acceptable)
  - https://github.com/daac-tools/daachorse (use it when the Fanjian, PinYin or PinYinChar transformation is performed)
- Test char-wise `HashMap` transformation for Chinese characters. (too slow)
- Make aho-corasick unsafe.
- Optimize NOT logic word-wise.
- Optimize `RegexMatcher` using `RegexSet`.
- Optimize `SimpleMatcher` when multiple `ProcessType`s are used.
  - Consider if there are multiple `ProcessType`s:
    - None
    - Fanjian
    - FanjianDelete
    - FanjianDeleteNormalize
    - FanjianNormalize
  - We can construct a chain of transformations:
    - None -> Fanjian -> Delete -> Normalize
    - \ -> Normalize
  - Calculate all possible transformations and cache the results, so that instead of calculating 8 times (Fanjian, Fanjian + Delete, Fanjian + Delete + Normalize, Fanjian + Normalize), we only need to calculate 4 times (Fanjian, Delete, Normalize, Normalize).
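The chained-cache idea can be sketched as follows. The stage functions here are trivial stand-ins for the real table-driven transformations; the point is that each combined `ProcessType` reuses the cached output of the previous stage rather than recomputing from scratch.

```rust
// Trivial stand-in stages (the real ones are table-driven).
fn fanjian(s: &str) -> String { s.replace('蟲', "虫") }
fn delete(s: &str) -> String { s.replace('*', "") }
fn normalize(s: &str) -> String { s.replace('Ｅ', "e") }

fn main() {
    let text = "Ｅ*xample蟲";
    // Four stage applications cover all four combined ProcessTypes.
    let t1 = fanjian(text);   // Fanjian
    let t2 = delete(&t1);     // Fanjian + Delete (reuses t1)
    let t3 = normalize(&t2);  // Fanjian + Delete + Normalize (reuses t2)
    let t4 = normalize(&t1);  // Fanjian + Normalize (reuses t1)
    assert_eq!(t3, "example虫");
    assert_eq!(t4, "e*xample虫");
}
```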
- Optimize the process matcher when performing reduce text processing.
  - Consider that to perform FanjianDeleteNormalize we need to perform Fanjian first, then Delete, then Normalize; 3 kinds of process matcher are needed to perform replacements or deletions, so the text has to be scanned 3 times.
  - What if we construct only 1 process matcher whose patterns contain all the Fanjian, Delete and Normalize patterns? We could then scan the text only once to get all the positions where a replacement or deletion should be performed.
  - We need to take care that byte indices change after a replacement or deletion, so we must take the offset changes into account.
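The offset problem can be side-stepped by expressing all edits as byte ranges in the *original* text and building the output left-to-right, so later edits need no manual offset correction. A minimal sketch (the edit list here is hypothetical):

```rust
// Apply sorted, non-overlapping (start, end, replacement) edits, where
// start/end are byte offsets into the ORIGINAL text. Building the output
// left-to-right means earlier replacements never shift later ranges.
fn apply_edits(text: &str, edits: &[(usize, usize, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut last = 0;
    for &(start, end, replacement) in edits {
        out.push_str(&text[last..start]); // untouched span before the edit
        out.push_str(replacement);        // "" means deletion
        last = end;
    }
    out.push_str(&text[last..]); // trailing untouched span
    out
}

fn main() {
    // One scan found two edits: delete "*" (bytes 1..2) and map 蟲 -> 虫
    // (蟲 occupies bytes 2..5 in UTF-8).
    let text = "a*蟲b";
    let edits = [(1usize, 2, ""), (2, 5, "虫")];
    assert_eq!(apply_edits(text, &edits), "a虫b");
}
```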
- Merge multiple aho-corasick matchers into one when multiple `ProcessType`s are used.
- When the `dfa` feature is disabled, use daachorse to perform text processing.
  - Do not use it for the simple process function; it is too slow to build.
- Use more regex sets to optimize the regex matcher.
- Cache `get_process_matcher` results globally, instead of caching results inside `SimpleMatcher`.
- Expose `reduce_process_text` to Python.
- Add a new function that can handle a single simple match type.
  - `text_process` is now available.
- Add a fuzzy matcher, https://github.com/lotabout/fuzzy-matcher.
  - Use `rapidfuzz` instead.
- Make `SimpleMatcher` and `Matcher` serializable.
  - Make aho-corasick serializable.
  - See https://github.com/Lips7/aho-corasick.
- Implement NOT logic word-wise.
- Support stable rust.
- Support iterator.
- A real Java package.
- Multiple Python version wheel build.
- Customize str conversion map.
- Add the `Matcher` process function to Python, C and Java.
- For the simple matcher, is it possible to use regex-automata to replace aho-corasick and support regex? (Keep it simple and efficient.)
- Add simple match type to `RegexMatcher` and `SimMatcher` to pre-process a text.
- Try to replace msgpack.
- More precise and convenient MatchTable.
- More detailed and rigorous benchmarks.
- More detailed and rigorous tests.
- More detailed simple match type explanation.
- More detailed DESIGN.
- Write a Chinese README.