Tokenizer 1.37.0
New features
- Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions
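
To make "isolated combining mark" concrete, here is a minimal standard-library sketch that flags tokens made up only of Unicode combining marks. It is an illustration, not the Tokenizer's implementation: the helper name is_isolated_mark and the sample tokens are hypothetical, and the conditions under which allow_isolated_marks actually emits such tokens are not specified in these notes.

```python
import unicodedata


def is_isolated_mark(token: str) -> bool:
    """Return True if the token is non-empty and contains only combining marks.

    Hypothetical helper for illustration; not part of the Tokenizer API.
    """
    # Unicode general categories Mn, Mc and Me all start with "M" (Mark).
    return bool(token) and all(
        unicodedata.category(ch).startswith("M") for ch in token
    )


# Sample tokens: a base letter with an accent, a lone accent, plain text.
tokens = ["e\u0301", "\u0301", "abc"]
isolated = [is_isolated_mark(t) for t in tokens]
```

A check like this can help when inspecting tokenization output for marks that are no longer attached to a base character.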
Fixes and improvements
- Fix infinite loop when the text contains an invalid Unicode character
- Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
- [Python] Update ICU to 72.1