The camel_diac tool allows you to diacritize Arabic text.
Below is the usage information that can be generated by running camel_diac --help.
Usage:
    camel_diac [-d DATABASE | --db=DATABASE]
               [-m MARKER | --marker=MARKER]
               [-I | --ignore-markers]
               [-S | --strip-markers]
               [-p | --pretokenized]
               [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_diac (-l | --list-schemes)
    camel_diac (-v | --version)
    camel_diac (-h | --help)

Options:
  -d DATABASE --db=DATABASE
        Morphology database to use. DATABASE could be the name of a builtin
        database or a path to a database file. [default: calima-msa-r13]
  -o OUTPUT --output=OUTPUT
        Output file. If not specified, output will be printed to stdout.
  -m MARKER --marker=MARKER
        Marker used to prefix tokens not to be transliterated.
        [default: @@IGNORE@@]
  -I --ignore-markers
        Transliterate marked words as well.
  -S --strip-markers
        Remove markers in output.
  -p --pretokenized
        Input is already pre-tokenized by punctuation. When this is set,
        camel_diac will not split tokens by punctuation but any tokens that
        do contain punctuation will not be diacritized.
  -l --list
        Show a list of morphological databases.
  -h --help
        Show this screen.
  -v --version
        Show version.
We provide builtin databases so that camel_diac can be run out of the box. The name of any builtin database can be passed to -d or --db. A list of available databases can be found at camel_morphology_dbs.
You can always check which builtin databases are provided in your current camel_tools installation by running camel_diac --list. Alternatively, you can pass in a path to a database of your choosing instead of the name of a builtin database.
If no database is specified, calima-msa-r13 is used.
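As a rough illustration of how the options above combine, the following Python sketch assembles a camel_diac command line from the flags listed in the usage information. The helper name build_camel_diac_args and the file names input.txt and diacritized.txt are hypothetical and chosen only for this example. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_camel_diac_args(input_path=None, db=None, output=None, marker=None,
                          ignore_markers=False, strip_markers=False,
                          pretokenized=False):
    """Assemble an argument list for camel_diac.

    Hypothetical helper: it only uses flags shown in the usage
    information above and performs no validation of its own.
    """
    args = ["camel_diac"]
    if db is not None:
        # Either a builtin database name or a path to a database file.
        args += ["-d", db]
    if marker is not None:
        args += ["-m", marker]
    if ignore_markers:
        args.append("-I")
    if strip_markers:
        args.append("-S")
    if pretokenized:
        args.append("-p")
    if output is not None:
        args += ["-o", output]
    if input_path is not None:
        args.append(input_path)
    return args


# Hypothetical file names; omitting db would fall back to the default database.
args = build_camel_diac_args("input.txt", db="calima-msa-r13",
                             output="diacritized.txt")
print(shlex.join(args))
```

Assuming camel_tools is installed and the input file exists, the resulting list could then be passed to subprocess.run(args, check=True) to invoke the tool.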
A marker is a string with no whitespace characters at the beginning, middle, or end of it (in other words, it is a single token without padding spaces). As a rule of thumb, pick a marker that is unlikely to appear in your text. We use @@IGNORE@@ as the default value, while some Arabic NLP tools use @@LAT@@ to denote Latin/foreign text.
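The marker rule can be sketched in Python as a pre-processing step that validates a marker and prefixes it to tokens you do not want processed. This is only an illustration under two assumptions that are not stated in the documentation above: that the marker is attached directly to the token with no separating space, and that a token counts as Latin/foreign if it contains an ASCII letter. The function names are hypothetical.

```python
import re

DEFAULT_MARKER = "@@IGNORE@@"

# Assumption: a token is treated as Latin/foreign if it has any ASCII letter.
_LATIN = re.compile(r"[A-Za-z]")


def is_valid_marker(marker):
    """Return True if the marker is non-empty and contains no whitespace."""
    return bool(marker) and not any(ch.isspace() for ch in marker)


def mark_latin_tokens(line, marker=DEFAULT_MARKER):
    """Prefix whitespace-separated tokens containing ASCII letters with marker.

    Assumption: the marker is joined to the token without a space.
    """
    if not is_valid_marker(marker):
        raise ValueError("marker must be non-empty and contain no whitespace")
    tokens = line.split()
    return " ".join(marker + tok if _LATIN.search(tok) else tok
                    for tok in tokens)


print(mark_latin_tokens("ذهب إلى London"))
print(is_valid_marker("@@ LAT@@"))  # contains a space, so it is rejected
```

If a different marker is used here, the same value would need to be passed to camel_diac through -m or --marker.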