HunspellXML Format (DictionaryFile)

TrnsltLife edited this page Mar 19, 2018 · 17 revisions


<dictionaryFile>...</dictionaryFile>

The <dictionaryFile>...</dictionaryFile> element contains lists of words with their affixation rules, links to files that contain such lists, or both. The data from the included files and from the <words>...</words> elements are combined to build the Hunspell .dic file.

Summary:

<dictionaryFile>
<words>
word
word/flags
word/flags morphemecodes
word morphemecodes
</words>
<words>
	<w>word</w>
	<w>word/flags</w>
	<w>word/flags morphemecodes</w>
	<w>word morphemecodes</w>
</words>
<words flags="[list of flags]" morph="[list of morphs]">
word
word/flags
word/flags morphemecodes
</words>
<words flags="[list of flags]" morph="[list of morphs]">
	<w>word</w>
	<w>word/flags</w>
	<w>word/flags morphemecodes</w>
	<w>word morphemecodes</w>
	<w flags="[list of flags]">word/flags</w>
	<w flags="[list of flags]" morph="[list of morphs]">word/flags morphemecodes</w>
</words>
<include file="[filename]" flags="[list of flags]" morph="[list of morphs]"/>
<include file="[filename]" flags="[list of flags]" morph="[list of morphs]"/>
</dictionaryFile>

<words>...</words>

The <words>...</words> element contains a list of words in the language. The list is plain text in Hunspell's .dic file format, with one word per line. Each line consists of:

  1. [word] the dictionary word (required)
  2. /[affixation-flags] a slash / followed by a list of affixation rule flags (optional)
    • Two letter flags are placed right next to each other, e.g. NA, SU, SF, IS, IF becomes NASUSFISIF
    • One letter flags are placed right next to each other, e.g. N, U, F, S, I becomes NUFSI
    • Numeric flags are separated by commas, e.g. 93,12,9
  3. [tab][morpheme-codes] a tab followed by a list of morpheme information (optional)
    • Morphological description fields should consist of a two-letter code followed by a colon : followed by a text label.
    • Multiple morphological description fields may be used. They are separated from each other by spaces.
    • Morphological information is used for parsing and is not needed for spell checking.
    • The morphological field codes that Hunspell defines are:
      • ph: Phonetic
      • st: Stem
      • al: Allomorph(s)
      • po: Part of speech
      • ds: Derivational suffix(es)
      • is: Inflectional suffix(es)
      • ts: Terminal suffix(es)
      • sp: Surface prefix
      • pa: Parts of the compound words
      • dp: Derivational prefix
      • ip: Inflectional prefix
      • tp: Terminal prefix

For example:

<words>
moose
fish
cat/s
dog/s
foot	is:singular
feet	st:foot is:plural
mice	st:mouse is:plural
sing/S	po:verb is:present al:sang al:sung
sang	st:sing
sung	st:sing
</words>

In the list above, "moose" and "fish" are dictionary entries with no affixation rules or morphology. "Cat" and "dog" have a single affix rule "s". "Foot", "feet", "mice", "sang", and "sung" all have morphology fields but no affixation rules. "Sing" has an affixation rule "S" and morphology fields.
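
The line format described above can be sketched as a small parser. This is an illustrative Python sketch, not HunspellXML's own code: the names parse_entry and split_flags and the style labels "char", "long" and "num" are hypothetical, and the sketch assumes that a tab separates the word from its morpheme codes.

```python
import re


def split_flags(flags: str, style: str = "char") -> list[str]:
    """Split a flag string into individual flags.

    The style labels are hypothetical names for the three notations:
    "char" (one-letter), "long" (two-letter), "num" (comma-separated numbers).
    """
    if not flags:
        return []
    if style == "num":
        return flags.split(",")
    if style == "long":
        return [flags[i:i + 2] for i in range(0, len(flags), 2)]
    return list(flags)


def parse_entry(line: str) -> tuple[str, str, list[tuple[str, str]]]:
    """Parse one word-list line into (word, flag string, morph fields)."""
    # Everything after the first tab is morphological information.
    head, _, tail = line.partition("\t")
    # Split on the first slash that is not escaped with a backslash.
    parts = re.split(r"(?<!\\)/", head, maxsplit=1)
    word = parts[0].replace("\\/", "/")
    flags = parts[1] if len(parts) > 1 else ""
    # Each morph field is "code:label"; fields are separated by spaces.
    morph = []
    for field in tail.split():
        code, _, label = field.partition(":")
        morph.append((code, label))
    return word, flags, morph


word, flags, morph = parse_entry("sing/S\tpo:verb is:present al:sang al:sung")
print(word, split_flags(flags), morph)
```

The flag string is kept as-is by parse_entry because the way it splits into individual flags depends on which flag notation the dictionary uses.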

Slashes in Words

If you need to put a slash / in a word (such as "either/or"), escape it using a backslash.

<words>
either\/or
</words>

You'll also need to declare "/" as a word character in the <wordChars> setting of the <affixFile>:

<affixFile>
...
<settings>
<wordChars>- ' /</wordChars> 
...
</settings>
...
</affixFile>
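
If the word list is generated from raw text, the escaping step can be automated. The following is a minimal Python sketch; escape_word and make_entry are hypothetical helper names, not part of HunspellXML, and the sketch assumes the raw words do not already contain backslash escapes.

```python
def escape_word(word: str) -> str:
    """Escape literal slashes so they are not read as the flag separator."""
    return word.replace("/", "\\/")


def make_entry(word: str, flags: str = "") -> str:
    """Build a word-list entry from a raw word and an optional flag string."""
    entry = escape_word(word)
    return f"{entry}/{flags}" if flags else entry


print(make_entry("either/or"))  # either\/or
print(make_entry("cat", "s"))   # cat/s
```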

Grouping Classes of Words

Although you can specify the affixation rules and morphology for each word separately, it is more efficient in HunspellXML to use the flags and morph attributes of the <words>...</words> element, which apply affixation rules and morphology to a whole group of words at once. For example:

<words flags="s" morph="po:noun is:singular">
cat
dog
rabbit
cow
horse
chicken
</words>

This will result in final output to the Hunspell .dic file that looks like this:

cat/s	po:noun is:singular
dog/s	po:noun is:singular
rabbit/s	po:noun is:singular
cow/s	po:noun is:singular
horse/s	po:noun is:singular
chicken/s	po:noun is:singular
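
The expansion shown above can be sketched in a few lines of Python. This is an illustration of the idea rather than HunspellXML's implementation: expand_group is a hypothetical name, and the sketch assumes the words in the group carry no flags or morphology of their own.

```python
def expand_group(words, flags="", morph=""):
    """Yield one .dic line per word, with the group's flags and morph appended."""
    for word in words:
        line = word
        if flags:
            line += "/" + flags
        if morph:
            # A tab separates the word (and flags) from the morpheme codes.
            line += "\t" + morph
        yield line


for line in expand_group(["cat", "dog", "rabbit"], flags="s", morph="po:noun is:singular"):
    print(line)
```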

To change the affixation rules or morphology for the whole group, edit the flags or morph attribute once instead of modifying every word. Even with group-level flags and morph attributes, individual words can still carry their own additional affixation flags or morphology fields (see the example below, where the list of irregular past participles specifies the stem for each word individually).

<!-- nouns with regular plural in "-s" -->
<words flags="s" morph="po:noun is:singular">
cat
dog
rabbit
cow
horse
chicken
</words>
<words morph="po:verb is:past.part.">
sung	st:sing
drunk	st:drink
gone	st:go
shot	st:shoot
flown	st:fly
</words>

The use of the flags and morph attributes with different <words>...</words> blocks also makes it easier to separate different classes of words:

  • English: regular nouns, regular verbs, irregular past participles
  • Spanish: regular -ar verbs, regular -er verbs, regular -ir verbs
  • Lingala: nouns in class 1/2, nouns in class 3/4, nouns in class 5/6
  • etc.

<w>word</w>

Instead of putting a list of words in plain-text Hunspell format, you can specify a list of words with each word surrounded by <w>...</w> tags.

Like the <words>...</words> tags, the <w>...</w> tags can also contain attributes to specify affixation flags and morpheme codes. The words themselves can also be tagged in Hunspell format with /affixation-rules and morpheme codes. All three levels (<words>, <w> and Hunspell format) will be combined in the output .dic file.

<!-- nouns with regular plural in "-s" -->
<words flags="s" morph="po:noun is:singular">
	<w>cat</w><w>dog</w>
	<w>rabbit</w><w>cow</w>
	<w>horse</w><w>chicken</w>
</words>
<!-- noun with plural in "-en" -->
<words>
	<w flags="EN">ox is:singular</w>
</words>
<!-- past participles of verbs, some can combine with un-, e.g. unsung -->
<words morph="po:verb">
	<w morph="st:sing">sung/un is:past.part</w>
	<w morph="st:drink">drunk/un is:past.part</w>
	<w morph="st:go">gone is:past.part</w>
	<w morph="st:shoot">shot/un is:past.part</w>
	<w morph="st:fly">flown/un is:past.part</w>
</words>

The entry for 'sung' in the example above would output in the Hunspell .dic file as:

sung/un po:verb st:sing is:past.part
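
The way the three levels combine can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration under stated assumptions, not HunspellXML's actual code: words_to_dic_lines is a hypothetical name; flags are assumed to be non-numeric and concatenated in the order <words>, <w>, inline; morpheme codes follow the same order (which matches the 'sung' output above); words are assumed to contain no whitespace; and a tab is written before the morpheme codes.

```python
import re
import xml.etree.ElementTree as ET


def words_to_dic_lines(xml_text: str) -> list[str]:
    """Combine <words> attributes, <w> attributes and inline tags into .dic lines."""
    group = ET.fromstring(xml_text)
    lines = []
    for w in group.iter("w"):
        text = (w.text or "").strip()
        if not text:
            continue
        # Inline Hunspell format: "word/flags morph..." (both parts optional).
        parts = text.split(None, 1)
        inline_morph = parts[1] if len(parts) > 1 else ""
        word_flags = re.split(r"(?<!\\)/", parts[0], maxsplit=1)
        word = word_flags[0]
        inline_flags = word_flags[1] if len(word_flags) > 1 else ""

        # Assumed order: group level, then <w> level, then inline.
        flags = group.get("flags", "") + w.get("flags", "") + inline_flags
        morph = " ".join(
            m for m in (group.get("morph", ""), w.get("morph", ""), inline_morph) if m
        )

        line = word
        if flags:
            line += "/" + flags
        if morph:
            line += "\t" + morph
        lines.append(line)
    return lines


sample = """
<words morph="po:verb">
    <w morph="st:sing">sung/un is:past.part</w>
    <w morph="st:go">gone is:past.part</w>
</words>
"""
for line in words_to_dic_lines(sample):
    print(line)
```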

<include .../>

The <include .../> element instructs HunspellXML to open an external file and load all its words (with optional affixation rules and morphology) into the dictionary word list that will be used to create the Hunspell .dic file. Anything that can go in a <words>...</words> block can go in the external file.

Imagine that you have a text file called "lin_nouns_1-2.txt" that lists all the regular nouns of class 1/2 in Lingala, and that it sits in the same directory as your HunspellXML file. You could include its words in your word list like this:

<include file="lin_nouns_1-2.txt"/>

And you can assign affixation rules and morphology to every word in the file by using the flags and morph attributes:

<include file="lin_nouns_1-2.txt" flags="12" morph="po:noun is:class_1-2"/>
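
The effect of an include with flags and morph attributes can be sketched in Python as follows. This is an illustration only, not how HunspellXML is implemented: load_include is a hypothetical name, and the sketch assumes the included file is UTF-8 text with one bare word per line (no flags or morphology of its own) and that the file name is resolved relative to the directory of the HunspellXML file.

```python
from pathlib import Path


def load_include(xml_path, file, flags="", morph=""):
    """Read an included word list and append the given flags and morph to each word."""
    # Resolve the include relative to the directory of the HunspellXML file.
    include_path = Path(xml_path).parent / file
    lines = []
    for raw in include_path.read_text(encoding="utf-8").splitlines():
        word = raw.strip()
        if not word:
            continue  # skip blank lines
        line = word
        if flags:
            line += "/" + flags
        if morph:
            line += "\t" + morph
        lines.append(line)
    return lines
```

For example, load_include("my_dictionary.xml", "lin_nouns_1-2.txt", flags="12", morph="po:noun is:class_1-2") would return one line per word in the file, each ending in /12 followed by a tab and the morphology fields; the file name "my_dictionary.xml" is a placeholder.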