Ignore invalid UTF8 characters when tokenizing #61

guillaumekln · 2019-03-20T10:02:01Z

* master: (73 commits)
  Fix base directory in data path
  Only upgrade data configuration in training runs
  Also update tokenization config in local config after buildvocab
  Allow missing sample_dist
  Enable corpus synchronization for old-style data configuration (#68)
  Improve storage (#64)
  --copy_source translation option to build aligned source/target files (#67)
  Update to OpenNMT-tf 1.22.0
  Declare dtype in tensor proto
  Translate from gzip files (#65)
  Refresh OpenNMT-py framework with serving support (#54)
  Update TensorFlow to 1.13
  change request.get to request.post to support long sentences (#63)
  Ignore invalid UTF8 characters when tokenizing (#61)
  Update OpenNMT-tf to 1.21.7
  By default, disable TER and METEOR for computation reason (#60)
  refs 51557: change Thai to Char based evaluation (#59)
  Add missing auto_config flag for inference Runner
  Support returning the averaged checkpoint from OpenNMT-tf
  Update OpenNMT-tf to 1.21.6
  ...

Ignore invalid UTF8 characters when tokenizing

21d0d7f
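
The pull request carries no description, so for illustration only, here is a minimal Python sketch of what "ignoring invalid UTF-8" before tokenization can look like. It is not taken from the actual diff: the function names are hypothetical, and the whitespace split is a stand-in for a real tokenizer. The only library behavior relied on is the standard `bytes.decode` with `errors="ignore"`, which drops byte sequences that are not valid UTF-8 instead of raising `UnicodeDecodeError`.

```python
from typing import List


def decode_ignoring_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, dropping any byte sequence that is not valid.

    Hypothetical helper: with errors="ignore", malformed or truncated
    sequences are skipped rather than raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="ignore")


def tokenize_line(raw: bytes) -> List[str]:
    """Hypothetical stand-in for a real tokenizer: clean, then split on whitespace."""
    return decode_ignoring_invalid_utf8(raw).split()


if __name__ == "__main__":
    # \xff can never appear in valid UTF-8, so it is dropped before tokenizing.
    print(tokenize_line(b"caf\xc3\xa9 \xff ok"))
```

Dropping the bytes keeps a pipeline running on noisy corpora, at the cost of silently losing data; `errors="replace"` would instead mark each bad sequence with U+FFFD if the damage should stay visible.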

guillaumekln merged commit d49756f into OpenNMT:master Mar 20, 2019

guillaumekln deleted the ignore-invalid-utf8 branch March 20, 2019 10:04
