SEETM (Sinhala-English Equivalent Token Mapper) lets you create maps of equivalent tokens and replace each mapped token with a single base token. This avoids out-of-vocabulary (OOV) tokens and produces one feature for all equivalent tokens in Sinhala-English code-switching datasets used by rasa-based conversational AIs.
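The idea of replacing equivalent tokens with a base token can be sketched in a few lines of Python. This is only an illustration of the concept, not SEETM's actual implementation or API: the `TOKEN_MAP` contents and the `build_lookup` and `map_tokens` helpers are hypothetical names introduced here, and plain whitespace splitting stands in for the tokenizer.

```python
from typing import Dict, List

# Hypothetical token map (not SEETM's file format): each base token
# is listed with the equivalent surface forms that should collapse into it.
TOKEN_MAP: Dict[str, List[str]] = {
    "බස්": ["බස්", "bus", "bas"],
    "කීයද": ["කීයද", "keeyada", "kiyada"],
}


def build_lookup(token_map: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the map so every equivalent token points at its base token."""
    lookup: Dict[str, str] = {}
    for base, equivalents in token_map.items():
        for token in equivalents:
            lookup[token.lower()] = base
    return lookup


def map_tokens(text: str, lookup: Dict[str, str]) -> List[str]:
    """Split on whitespace and replace known equivalents with the base token.

    Tokens that are not in the lookup are returned unchanged.
    """
    return [lookup.get(token.lower(), token) for token in text.split()]


LOOKUP = build_lookup(TOKEN_MAP)

if __name__ == "__main__":
    # Romanized and native spellings end up as the same base tokens.
    print(map_tokens("bas eka keeyada", LOOKUP))
    print(map_tokens("Bus eka kiyada", LOOKUP))
```

Because every spelling variant collapses to one base token before featurization, a downstream model sees a single vocabulary entry instead of several rare ones.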
- Allows mapping multiple equivalent tokens into a base token
- Fully supports rasa 2.8.x projects
- Provides an easy-to-use CLI
- Provides an efficient server-based GUI
- Provides a fully-functional custom whitespace tokenizer
- Fully supports Sinhala in the GUI
- Mapping suggestions in the SEETM server GUI
- Automatically generated mappings
- The SEETM tokenizer must be added manually to the rasa pipeline; otherwise the token maps have no effect
- IPA-based suggestions may vary slightly depending on the origin of the IPA mapping (SEETM uses CMU)
- CMU Pronunciation Dictionary
- eng-to-ipa pip package (GitHub)
📒 Docs: https://seetm.github.io
📦 PyPI: https://pypi.org/project/seetn/1.1.0/
🪵 Full Changelog: https://github.com/SEETM-NLP/seetm/blob/main/CHANGELOG.md