HunspellXML Format (Settings)

TrnsltLife edited this page Jun 9, 2017 · 6 revisions


<settings>...</settings>

The settings element contains general settings that affect the Hunspell dictionary. <languageCode>, <characterSet> and <flagType> are required in HunspellXML.

<settings>
	<languageCode>xxx_XX</languageCode>
	<characterSet>UTF-8</characterSet>
	<flagType>long</flagType>
	<wordChars>- ' _</wordChars>
	<ignore>- ' _</ignore>
	<circumfix flag="CF"/>
	<forbiddenWord flag="FW"/>
	<keepCase flag="KC"/>
	<needAffix flag="NA"/>
	<substandard flag="ss"/>
	<checkSharpS flag="SS"/>
	<complexPrefixes/>
	<fullStrip/>
</settings>

<languageCode>[locale]</languageCode> (required)

The <languageCode> element must contain the 2- or 3-letter language code for your language, followed optionally by an underscore and the 2-letter country code.

Note: If you do not specify a country code here, you must specify one in the <localeList> element in <metadata>; otherwise the LibreOffice plugin will not function properly.
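The three accepted shapes of the value can be sketched as follows. These are alternatives, and a real file contains only one <languageCode> element; the specific codes (en, haw, US) are illustrative choices, not values taken from this page.

```xml
<!-- 2-letter language code plus 2-letter country code -->
<languageCode>en_US</languageCode>

<!-- 3-letter language code plus 2-letter country code -->
<languageCode>haw_US</languageCode>

<!-- Language code only: the country code must then be given
     in <localeList> inside <metadata> -->
<languageCode>en</languageCode>
```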

<characterSet>[option]</characterSet> (required)

You are encouraged to set <characterSet> to "UTF-8" and to encode all of your input files in UTF-8.

However, the following character sets are also available: ISO8859-1, ISO8859-2, ISO8859-3, ISO8859-4, ISO8859-5, ISO8859-6, ISO8859-7, ISO8859-8, ISO8859-9, ISO8859-10, ISO8859-13, ISO8859-14, ISO8859-15, KOI8-R, KOI8-U, microsoft-cp1251, ISCII-DEVANAGARI

One of these codes must be specified, and all your input data (including the HunspellXML file!) must be in this encoding for your dictionary to be properly created.
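As a rough sketch of this rule, the fragment below is meant to be saved in ISO 8859-2 and declares the same encoding in <characterSet>. It assumes that the file carries a standard XML declaration, which this page does not confirm. The enclosing HunspellXML elements are omitted. Note the two spellings: the XML declaration uses the IANA name ISO-8859-2, while <characterSet> uses the Hunspell name ISO8859-2 from the list above.

```xml
<?xml version="1.0" encoding="ISO-8859-2"?>
<!-- Enclosing HunspellXML elements are omitted; only <settings> is shown. -->
<settings>
	<languageCode>xxx_XX</languageCode>
	<!-- Hunspell spelling of the encoding the file is actually saved in -->
	<characterSet>ISO8859-2</characterSet>
	<flagType>long</flagType>
</settings>
```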

<flagType>[option]</flagType> (required)

Hunspell uses "flags" to specify what kinds of affixes can attach to dictionary words to add morphemes to them. Each morphology rule has a flag associated with it, and words in the dictionary may also have flags, indicating which morphology rules can attach to them.

For example, an English verb like "jump" might have affix flags "ED", "-S", and "IG", corresponding to the morphology rules for adding the morphemes "-ed" (past tense), "-s" (3rd person present tense) and "-ing" (present participle).

All flags in the Hunspell affix rules and in the dictionary must have the same form, but there are 4 different options. I recommend using either the "short" or "long" flag types, but here are all 4 options:

  • short - 1-character ASCII codes, e.g. A B s x ! (recommended)
  • long - 2-character ASCII codes, e.g. AF PF ED -S dv (recommended)
  • num - integer number codes between 1 and 65000, e.g. 1 2 4 129 65000
  • UTF-8 - 1-character Unicode code (not currently recommended since it doesn't work on ARM platforms)
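Because every flag must have the same form, the values in "flag" attributes elsewhere in <settings> presumably have to match the declared flag type. The alternative fragments below sketch that assumption for three of the options; they are not one file, and the flag values (K, N, KC, NA, 101, 102) are made up for illustration.

```xml
<!-- Variant 1: short flags, one ASCII character each -->
<settings>
	<languageCode>xxx_XX</languageCode>
	<characterSet>UTF-8</characterSet>
	<flagType>short</flagType>
	<keepCase flag="K"/>
	<needAffix flag="N"/>
</settings>

<!-- Variant 2: long flags, two ASCII characters each -->
<settings>
	<languageCode>xxx_XX</languageCode>
	<characterSet>UTF-8</characterSet>
	<flagType>long</flagType>
	<keepCase flag="KC"/>
	<needAffix flag="NA"/>
</settings>

<!-- Variant 3: num flags, integers between 1 and 65000 -->
<settings>
	<languageCode>xxx_XX</languageCode>
	<characterSet>UTF-8</characterSet>
	<flagType>num</flagType>
	<keepCase flag="101"/>
	<needAffix flag="102"/>
</settings>
```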

<wordChars>[list of chars]</wordChars>

This element can contain a list of single characters that are treated as part of a word, for example - (dash) and ' (apostrophe). Each entry should be a single character, separated from the others by a space.

<wordChars>- '</wordChars>

Hunspell documentation:

WORDCHARS extends tokenizer of Hunspell command line interface with additional word character. For example, dot, dash, n-dash, numbers, percent sign are word character in Hungarian.

<ignore>[list of chars]</ignore>

This element can contain a list of characters that should be ignored in dictionary words. Each entry should be a single character, separated from the others by a space.

<ignore>a e i o u</ignore>

Hunspell documentation:

Ignore characters from dictionary words, affixes and input words. Useful for optional characters, as Arabic diacritical marks (Harakat) or Hebrew niqqud.

Settings that require a flag attribute

The following setting elements require a "flag" attribute. Consult the Hunspell documentation for more about how and when to use them.

  • <checkSharpS flag="[flag]"/>
  • <circumfix flag="[flag]"/>
  • <forbiddenWord flag="[flag]"/>
  • <keepCase flag="[flag]"/>
  • <needAffix flag="[flag]"/>
  • <substandard flag="[flag]"/>

<checkSharpS flag="[flag]"/>

Hunspell documentation:

SS letter pair in uppercased (German) words may be upper case sharp s (ß). Hunspell can handle this special casing with the CHECKSHARPS declaration (see also KEEPCASE flag and tests/germancompounding example) in both spelling and suggestion.

<circumfix flag="[flag]"/>

Hunspell documentation:

Affixes signed with CIRCUMFIX flag may be on a word when this word also has a prefix with CIRCUMFIX flag and vice versa.

<forbiddenWord flag="[flag]"/>

Hunspell documentation:

This flag signs forbidden word form. Because affixed forms are also forbidden, we can subtract a subset from set of the accepted affixed and compound words.

<keepCase flag="[flag]"/>

Hunspell documentation:

Forbid uppercased and capitalized forms of words signed with KEEPCASE flags. Useful for special orthographies (measurements and currency often keep their case in uppercased texts) and writing systems (e.g. keeping lower case of IPA characters). Note: With CHECKSHARPS declaration, words with sharp s and KEEPCASE flag may be capitalized and uppercased, but uppercased forms of these words may not contain sharp s, only SS. See germancompounding example in the tests directory of the Hunspell distribution. Note: Using lot of zero affixes may have a big cost, because every zero affix is checked under affix analysis before the other affixes.

<needAffix flag="[flag]"/>

Hunspell documentation:

This flag signs virtual stems in the dictionary. Only affixed forms of these words will be accepted by Hunspell. Except, if the dictionary word has a homonym or a zero affix. NEEDAFFIX works also with prefixes and prefix + suffix combinations (see tests/pseudoroot5.*).

<substandard flag="[flag]"/>

Hunspell documentation:

SUBSTANDARD flag signs affix rules and dictionary words (allomorphs) not used in morphological generation (and in suggestion in the future versions). See also NOSUGGEST.

Settings that require only an empty element

The following elements do not require any attributes or text. They are empty elements that, by their presence, turn on a feature. Consult the Hunspell documentation for more about how and when to use them.

<complexPrefixes/>

By default, Hunspell allows a chain of 3 affix rules: 2 suffixes and 1 prefix. However, if the <complexPrefixes/> setting is used, Hunspell will allow 2 prefixes and 1 suffix instead. You may want to change this setting if your language has more prefix slots than suffix slots.

<fullStrip/>

Hunspell documentation:

With FULLSTRIP, affix rules can strip full words, not only one less characters. Note: conditions may be word length without FULLSTRIP, too.