The camel_word_tokenize tool splits words from punctuation while collapsing
contiguous segments of spaces into a single whitespace character.
It is also language agnostic and splits all characters marked as punctuation or
symbols in the Unicode specification.
For example, the following input:
Hello, world!!!!
مرحبا يا عالم!!!
becomes:
Hello , world ! ! ! !
مرحبا يا عالم ! ! !
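The behavior in this example can be approximated in a few lines of Python using only the standard library. This is a minimal sketch, not the tool's actual implementation: the function name split_punct_and_symbols is hypothetical, and the sketch assumes that "punctuation or symbols" means every character whose Unicode general category starts with P or S.

```python
import unicodedata


def split_punct_and_symbols(line: str) -> str:
    """Separate punctuation/symbol characters and collapse whitespace runs.

    Hypothetical helper for illustration only; it is not part of the tool.
    """
    pieces = []
    for ch in line:
        # General categories beginning with "P" are punctuation and those
        # beginning with "S" are symbols (assumed reading of the description).
        if unicodedata.category(ch)[0] in ("P", "S"):
            pieces.append(f" {ch} ")
        else:
            pieces.append(ch)
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining with single spaces collapses contiguous spaces.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    for sentence in ["Hello, world!!!!", "مرحبا يا عالم!!!"]:
        print(split_punct_and_symbols(sentence))
```

Because the check relies on Unicode categories rather than a fixed list of ASCII characters, the same code handles the English and Arabic lines without any language-specific rules.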
Below is the usage information that can be generated by running
camel_word_tokenize --help.
Usage:
    camel_word_tokenize [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_word_tokenize (-v | --version)
    camel_word_tokenize (-h | --help)

Options:
    -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
    -h --help
        Show this screen.
    -v --version
        Show version.