Getting Started

TrnsltLife edited this page Dec 13, 2013 · 34 revisions

/!\ UNDER CONSTRUCTION /!\

Getting Started Tutorial

This tutorial will try to get you off to a good start learning HunspellXML. It's not going to cover every aspect of HunspellXML, but it should teach you enough to get started. If you are willing to experiment, you should be able to figure the rest out as you go along by consulting the rest of the wiki.

You can create the sample HunspellXML files by following the instructions in this tutorial, or find them in the samples/ directory for this project.

Getting Started

  • Download the latest version (Google Code) of HunspellXML Converter.
  • Unzip the contents.
  • Run the HunspellXML-Converter-[version].jar file. (You need to have Java installed. If you don't, download it from Oracle and install it first.)

Minimal HunspellXML File

  • Create a new file called "tutorial.xml".
  • Open it with a simple text editor such as Notepad++.
  • For this file, you'll want to save it as UTF-8 encoded. In Notepad++, you do this through the Encoding->Convert to UTF-8 without BOM menu item.
  • Copy and paste the following HunspellXML definition into your file and save it.

This is the minimal valid HunspellXML file. If you leave any of these elements out, the file won't validate.

<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
	</affixFile>
	

	<dictionaryFile>
	</dictionaryFile>
</hunspell>
  • You should change the language and country code values in <languageCode>...</languageCode> to match your language and country.
  • You can change the value in <characterSet>...</characterSet> but "UTF-8" is the recommended value. (See info about <characterSet>.)
  • You can change the value in <flagType>...</flagType> but "long" is the recommended value. (See <flagType>.)
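As a rough illustration of what "minimal" means here, the sketch below uses Python's standard xml.etree module to check that a file contains the elements shown in the minimal example. It is not the converter's validator, which may enforce more than this. The REQUIRED_PATHS list and the missing_elements helper are names made up for this example.

```python
import xml.etree.ElementTree as ET

# The minimal file from this tutorial, inlined so the sketch is self-contained.
MINIMAL = """<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
	</affixFile>
	<dictionaryFile>
	</dictionaryFile>
</hunspell>"""

# Element paths (relative to the <hunspell> root) that appear in the minimal file.
# Assumption: these are exactly the elements the tutorial calls required.
REQUIRED_PATHS = [
    "affixFile/settings/languageCode",
    "affixFile/settings/characterSet",
    "affixFile/settings/flagType",
    "dictionaryFile",
]


def missing_elements(xml_text):
    """Return the required paths that are absent from the given XML text."""
    root = ET.fromstring(xml_text)
    if root.tag != "hunspell":
        return ["hunspell"]
    return [path for path in REQUIRED_PATHS if root.find(path) is None]


if __name__ == "__main__":
    print(missing_elements(MINIMAL))
```

Deleting the flagType line from MINIMAL and running the check again reports that one path as missing, which mirrors the "file won't validate" behaviour described above.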

Converting the File with HunspellXML-Converter

  • If HunspellXML-Converter isn't running, run it by double-clicking the HunspellXML-Converter-[version].jar file that you downloaded.
  • Drag the tutorial.xml file onto the open HunspellXML-Converter window, or use HunspellXML-Converter's File->Open and Convert... menu to find and open the tutorial.xml file.
  • You should see output like this:
Opening file D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial.xml
Validate HunspellXML...
Reading the HunspellXML...
Read.
Character Set: UTF-8
(Java Character Set: UTF-8)
Hunspell flag type: long
Validating the HunspellXML file...
HunspellXMLValidator: successfully validated
Parsing the HunspellXML file...
Exporting the XML file...
Exporting to D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial
Affix file created.
Dictionary file created.
Test files created.
License file created.
Readme file created.
Firefox plugin file created.
LibreOffice plugin file created.
Opera plugin file created.
Finished!
Find your files in: D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial

Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial\tutorial_good.txt...
Correctly spelled words test completed without errors.
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial\tutorial_bad.txt...
Misspelled words test completed without errors.

The converter reports a lot of information about what it's doing. It opens, reads, validates, and parses the file. It exports the dictionary files and creates the plugin files. It also runs the tests (although there aren't any right now) and doesn't encounter any errors.

Normal informational messages will show up in black or blue. If you see a message in red (as you scroll back through the messages) that usually means that something went wrong and you'll need to figure out what happened and fix it.

Dictionary Files and Plugins

If you look back in the directory where your tutorial.xml file is, you should see that some new files and directories have been created. Specifically:

  • tutorial_HunspellXML-Converter.log (This contains log messages about your latest HunspellXML conversion)
  • tutorial/
    • tutorial.aff (The Hunspell .aff file. It will eventually contain all the affix rules, suggestion rules, etc. used to make your dictionary useful. It currently only has the language code, character encoding, and flag type settings.)
    • tutorial.dic (This will eventually contain all your dictionary words. It currently has 0 words.)
    • tutorial_bad.txt (A file to contain a list of all the test words that should be flagged as improperly spelled.)
    • tutorial_good.txt (A file to contain a list of words that should be flagged as properly spelled. If any of them are flagged as misspelled, it means something went wrong in your dictionary definition.)
    • license.txt (The basic license for use in your spell-check plugins is "All rights reserved." You can edit this in the <metadata> section.)
    • Firefox/ (This directory contains the spell-check plugin that will work with Firefox.)
    • LibreOffice/ (This directory contains the spell-check plugin that will work with LibreOffice.)
    • Opera/ (This directory contains the spell-check plugin that will work with Opera.)

Adding Word Lists

Right now, the spell-check dictionary doesn't do anything. We haven't specified any affixation rules, and more importantly, we haven't included any dictionary words!

In HunspellXML-Converter, if you try typing words in the spell-check test box at the lower left, they will all show up with red underlines, meaning they are not recognized as correctly spelled words. Try it out.

Let's add some words now. The easiest way to do that for a simple test like ours is to include the dictionary words directly into the HunspellXML file. (The other method is to include words from a separate file using the <include> element.)

  • Modify the <dictionaryFile> section of the tutorial.xml file to include a set of basic nouns:
<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
	</affixFile>
	

	<dictionaryFile>
		<words>
		bunny
		butterfly
		cat
		cow
		dog
		fish
		fly
		frog
		fox
		goose
		horse
		monkey
		jay
		moose
		mouse
		ox
		pony
		puppy
		walrus
		</words>
	</dictionaryFile>
</hunspell>
  • Drop the tutorial.xml file on the HunspellXML-Converter window again. If everything runs fine, you should be able to type some of the words in the test box. Good words should be underlined in green.
  • Type cow, dog, and cat. They should be underlined in green.
  • Type cows, dogs, and cats. They should be underlined in red. These words aren't in the dictionary.

Adding a Simple Affix Rule

What if we want our dictionary to recognize the plural forms of nouns? We could add every plural form to the dictionary, but that would almost double its size, so it is not a very good strategy. Verbs are worse. English has a lot of verbs, and besides the infinitive, each one has a 3rd Person Present form, a Gerund (-ing), a Past Tense (-ed), and sometimes a distinct Past Participle (-en etc.). That would require entering up to 5 different forms for each verb. Other languages have much richer morphology. Can you imagine entering every form of a Spanish verb or a Greek verb? There might be dozens, scores, or hundreds of forms per verb.

Fortunately, Hunspell allows us to define affixation rules that specify what affixes can attach to certain words. The Hunspell spell-checker computes possible word-forms based on these rules, which means we can have a much smaller dictionary file.

Let's see how this works by working on a set of noun pluralization rules for English.

A First Attempt

In English, most nouns become plural by adding an "s" to the end of the word (a suffix). Cat becomes cats, dog becomes dogs, cow becomes cows, and so on.

Inside the <affixFile> section, we need to add an <affixes> section and a suffix rule.

  • Modify the tutorial.xml file so that the <affixFile> block looks like this (leave the rest of the file unchanged):
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
		
		<affixes>
			<suffix flag="NP">
				<rule add="s" />
			</suffix>
		</affixes>
	</affixFile>

This suffix rule (nicknamed "NP" for "Noun Plurals") states that we can add "s" to the end of words that take the "NP" rule.

If you were to save the file and drop it into the converter at this point, none of the plural forms of words (cats, dogs, cows) would work yet. That's because we haven't applied the "NP" rule to any of the words in our dictionary!

  • Make the following change to the <dictionaryFile> section:
	<dictionaryFile>
		<words flags="NP">
		bunny

By adding the flags="NP" attribute to the <words> element, we are telling Hunspell that the "NP" suffix rule can apply to any of the words inside the <words> element.

  • Now save the tutorial.xml file and drop it onto the converter window.
  • Test some words and their plurals, e.g.:
    • cat, cats
    • dog, dogs
    • cow, cows
    • fox, foxes
    • bunny, bunnies
    • mouse, mice
    • ox, oxen

As you can see, some of the words for plurals still show up with red underlines, because our simplistic "plural in -s" rule doesn't cover all the necessary cases in English.

Now try some of these English misspellings:

  • bunnys
  • fishs
  • flys
  • foxs
  • mouses
  • oxs
  • walruss

These incorrect words are marked as correctly spelled! We have some work to do on our plural rules still.

Breaking down the problems from our first attempt, we can see that there are several different plural rules that we'll eventually need to get pluralization right:

  • -s for most nouns (e.g. cats)
  • -es for most nouns that end in "ch", "sh", "s", "x", "z" (e.g. foxes)
  • -ies for some nouns that end in "y" (e.g. ponies)
  • -s for nouns that end in a vowel + "y" (e.g. jays, monkeys)
  • nouns with irregular plurals (e.g. moose, geese, mice, oxen)

Adding Some Tests

The thing to understand is that when you write new rules, it is possible to:

  • flag words as correctly spelled that are actually incorrect
  • flag words as incorrectly spelled that are actually correct

One way to help deal with this problem is to create tests for each new type of rule you create. When you have a complicated task, like adding rules that cover all of the English noun pluralization patterns, it can be helpful to create a series of tests before you even start writing the affixation rules. Every time you make changes and drop the file on the converter, the conversion process will run the tests, and you'll be able to see your progress towards your goal.

Let's write some tests. What tests would you add to see if our plural rule behaves properly? We can try to check for words that Hunspell shows as correct that we know are incorrect, and words that Hunspell shows as incorrect that we know are correct.

At the bottom of the <hunspell> element, after the <dictionaryFile> block, create a new <tests> block like this:

	<tests>
		<!-- Regular plural -s -->
		<good>cats cows dogs frogs monkeys</good>
		<bad>caties cowen froges monkies gooses mooses mouses walruss</bad>
		<!-- plural in -es -->
		<good>fishes foxes horses walruses</good>
		<bad>fishs foxs oxs</bad>
		<!-- plural in -ies -->
		<good>bunnies butterflies flies ponies puppies</good>
		<bad>cowies monkies monkeies mousies</bad>
		<!-- irregular plurals -->
		<good>geese mice oxen</good>
		<bad>gooses mouses oxes foxen meese</bad>
	</tests>
</hunspell>

When you save the file and drop it onto the conversion window, you should see messages like this at the end of the log:

Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_good.txt...
WARNING: Some words listed in tutorial_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  fishes :: suggest:[fishs, fish]
	*  foxes :: suggest:[foxs]
	*  walruses :: suggest:[walruss, walrus]
	*  bunnies :: suggest:[bunnys]
	*  butterflies :: suggest:[butterflys, butterfly]
	*  flies :: suggest:[flys]
	*  ponies :: suggest:[ponys]
	*  puppies :: suggest:[puppys]
	*  geese :: suggest:[goose]
	*  mice :: suggest:[moose]
	*  oxen :: suggest:[ox]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_bad.txt...
WARNING: Some words listed in tutorial_bad.txt (which should contain only incorrect spellings) are accepted as correctly spelled by the current Hunspell dictionary:
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mooses :: morph:[ st:moose fl:NP] stem:[moose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]
	*  walruss :: morph:[ st:walrus fl:NP] stem:[walrus]
	*  fishs :: morph:[ st:fish fl:NP] stem:[fish]
	*  foxs :: morph:[ st:fox fl:NP] stem:[fox]
	*  oxs :: morph:[ st:ox fl:NP] stem:[ox]
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]

Hunspell is identifying the places where words from your <good> lists show up as incorrectly spelled, and where words from your <bad> lists show up as correctly spelled. These are the problem areas you still need to fix.
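The two kinds of failure can be modelled in a few lines of Python. This is only a sketch of the idea: the accepts function stands in for the first-attempt dictionary (a stem, or a stem plus "s"), and run_tests and the small word lists are made up for illustration. The converter's real test runner uses Hunspell itself.

```python
def accepts(word, stems, suffix="s"):
    """Naive first-attempt model: accept a stem, or a stem followed by "s"."""
    if word in stems:
        return True
    return word.endswith(suffix) and word[: -len(suffix)] in stems


def run_tests(stems, good, bad):
    """Return (good words that were rejected, bad words that were accepted)."""
    rejected_good = [w for w in good if not accepts(w, stems)]
    accepted_bad = [w for w in bad if accepts(w, stems)]
    return rejected_good, accepted_bad


stems = {"cat", "fox", "goose", "mouse", "ox", "pony"}
good = "cats foxes ponies geese".split()
bad = "foxs gooses mouses caties foxen".split()

rejected_good, accepted_bad = run_tests(stems, good, bad)

if __name__ == "__main__":
    print("rejected but correct:", rejected_good)
    print("accepted but wrong:", accepted_bad)
```

Both lists need to be empty before a rule set can be considered finished, which is the same condition the converter reports as "completed without errors".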

Of course, when creating these tests, you don't want to include every word in the dictionary. You should just choose a small sampling of words that you feel adequately represents the entire range of options for a particular class of words, from the normal (cats and dogs) to more problematic (meese and foxen - I mean, geese and oxen). You also need to use a bit of creativity as you imagine what problems you might accidentally introduce with the rules you'll be writing. "What if I accidentally apply the rule for 'pony->ponies' to 'monkey' and get 'monkeies'?"

Adding More Plural Rules

Let's try adding another plural rule.

  • -es for most nouns that end in "ch", "sh", "s", "x", "z" (e.g. foxes)

We can add this rule inside the <suffix flag="NP"> block. What would this rule look like?

<rule where="ch" add="es" />
<rule where="sh" add="es" />
<rule where="s" add="es" />
<rule where="x" add="es" />
<rule where="z" add="es" />

The where="..." in each of these rules tells what the end of a word needs to look like if the rule is going to apply. (If this were a prefix rule, it would show what the start of the word needed to look like.) So for the second rule <rule where="sh" add="es" />, the rule could apply to "fish" but not to "fox". The rule <rule where="x" add="es" /> could apply to "fox" but not to "fish" or "walrus".

Why don't we just write a rule that says <rule where="h" add="es"/>? This would allow invalid plurals like "moth" -> "mothes".

Before we add these rules into our tutorial.xml file, we should really try to do a little work to combine them as much as possible. Hunspell actually gives us access to simplified regular expressions to use in the where attribute of rules. Besides an exact character match like where="ch" that we mentioned above, here are the additional expressions we can use:

  • . - A dot/period matches any single character. Within bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".
  • [ ] - A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed.
  • [^ ] - A bracket exclusion expression. Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.
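The three expression forms listed here are also valid in Python's re module, so one way to experiment with a where value before putting it in a rule is to anchor it to the end of a word (for a suffix) or the start (for a prefix). This is a sketch under the assumption that the condition only uses letters, ".", "[ ]" and "[^ ]". The function names are invented for this example and are not part of HunspellXML.

```python
import re


def suffix_condition_matches(where, word):
    """True if the condition matches the END of the word (suffix rules)."""
    return re.search("(?:" + where + ")$", word) is not None


def prefix_condition_matches(where, word):
    """True if the condition matches the START of the word (prefix rules)."""
    return re.match("(?:" + where + ")", word) is not None


if __name__ == "__main__":
    for where in ["ch", "[cs]h", "[sxz]", "[^cs]h"]:
        hits = [w for w in ["fish", "fox", "moth", "walrus", "match"]
                if suffix_condition_matches(where, w)]
        print(where, "->", hits)
```

Note that "moth" matches [^cs]h but not [cs]h, which is the distinction the plain where="h" rule could not make.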

So let's simplify our "-es" rules a little bit.

<rule where="[cs]h" add="es" />
<rule where="[sxz]" add="es" />
  • Add these rules into the "NP" suffix rule set. Your <affixes> block should now look like this:
		<affixes>
			<suffix flag="NP">
				<rule add="s" />
				<rule where="[cs]h" add="es" />
				<rule where="[sxz]" add="es" />
			</suffix>
		</affixes>

However, just adding these new rules is not enough. We need to change the original <rule add="s"/> rule. Right now, it will still happily allow the creation of words like "fishs" and "foxs". We need to restrict where the "s" rule can apply.

<rule where="[^hsxz]" add="s"/>
<rule where="[^cs]h" add="s"/>

The first new "s" rule allows an s anywhere the word does not end in an "h", "s", "x", or "z". The second rule is an exception to the "no h" rule. It allows an "s" on words that end in "h" but don't end in "ch" or "sh".
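To see why the unconditional "s" rule had to go, it can help to list every plural that each rule set would allow for a given word. The sketch below models a rule as a (where, add) pair and generates forms. Hunspell itself works in the other direction, by stripping affixes from the word being checked, so treat this purely as a thinking aid. OLD_RULES, NEW_RULES and plurals are illustrative names.

```python
import re

# (where, add) pairs; an empty where means "no condition".
OLD_RULES = [("", "s"), ("[cs]h", "es"), ("[sxz]", "es")]
NEW_RULES = [("[cs]h", "es"), ("[sxz]", "es"), ("[^hsxz]", "s"), ("[^cs]h", "s")]


def plurals(word, rules):
    """Return every plural form the rule set would allow for this word."""
    forms = []
    for where, add in rules:
        if re.search("(?:" + where + ")$", word):
            forms.append(word + add)
    return forms


if __name__ == "__main__":
    for word in ["cat", "fish", "fox", "moth", "walrus"]:
        print(word, plurals(word, OLD_RULES), plurals(word, NEW_RULES))
```

With the old set, "fish" and "fox" each get two plurals, one of them wrong. With the new set, each of these sample words matches exactly one rule.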

  • Change the suffix rule to look like this:
		<affixes>
			<suffix flag="NP">
				<rule where="[cs]h" add="es" />
				<rule where="[sxz]" add="es" />
				<rule where="[^hsxz]" add="s" />
				<rule where="[^cs]h" add="s" />
			</suffix>
		</affixes>
  • Save the tutorial.xml file and drop it on the converter. You should see test results like this:
Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_good.txt...
WARNING: Some words listed in tutorial_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  bunnies :: suggest:[bunnys]
	*  butterflies :: suggest:[butterflys, butterfly]
	*  flies :: suggest:[fishes]
	*  ponies :: suggest:[ponys]
	*  puppies :: suggest:[puppys]
	*  geese :: suggest:[goose]
	*  mice :: suggest:[mouse]
	*  oxen :: suggest:[ox]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_bad.txt...
WARNING: Some words listed in tutorial_bad.txt (which should contain only incorrect spellings) are accepted as correctly spelled by the current Hunspell dictionary:
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mooses :: morph:[ st:moose fl:NP] stem:[moose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]
	*  oxes :: morph:[ st:ox fl:NP] stem:[ox]

This is progress! We aren't seeing problem words like "foxs" and "fishs" anymore. And correct words like "fishes", "foxes", and "walruses" are being correctly recognized. But we still have problems with plurals that should end in "ies" and with irregular plurals.

Adding the -ies Rule

Let's think about the -ies rule. It applies to words that end in "y". But not all words that end in "y". Witness "jay", "donkey", "monkey". In fact, words that end in a vowel + "y" have a regular plural in "s". Words that end in a consonant + "y" should drop the "y" and add "ies".

Here is a rule that will add "ies" as a plural to the right words:

<rule where="[^aeiou]y" remove="y" add="ies"/>

The [^aeiou] portion means "one character that is not a vowel". So the rule will match when there is a word like "puppy", where the last two letters are "py". "P" is not a vowel, and the final "y" matches. It wouldn't match for "monkey" because the next-to-last letter is an "e" which is disallowed by the [^aeiou] portion of the rule, even though the final "y" matches.

How can we modify our "s" plural rules so that they will apply for words like "monkey" but not words like "puppy"?

<rule where="[^hsxyz]" add="s" />
<rule where="[^cs]h" add="s" />
<rule where="[aeiou]y" add="s"/>

First, we modify the rule for where="[^hsxz]" and make it where="[^hsxyz]". This forbids adding an "s" to a word ending in "y". Then, we add an exception rule where="[aeiou]y" that allows an "s" to be added to a word ending in "y" only when the next-to-last letter is a vowel.

  • Modify the affix section of your tutorial.xml file so that it looks like this:
		<affixes>
			<suffix flag="NP">
				<rule where="[cs]h" add="es" />
				<rule where="[sxz]" add="es" />
				<rule where="[^aeiou]y" remove="y" add="ies" />
				<rule where="[^hsxyz]" add="s" />
				<rule where="[^cs]h" add="s" />
				<rule where="[aeiou]y" add="s"/>
			</suffix>
		</affixes>
  • Save the file and drop it on the converter. You should see test output like this:
Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_good.txt...
WARNING: Some words listed in tutorial_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  geese :: suggest:[goose]
	*  mice :: suggest:[mouse]
	*  oxen :: suggest:[ox]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_bad.txt...
WARNING: Some words listed in tutorial_bad.txt (which should contain only incorrect spellings) are accepted as correctly spelled by the current Hunspell dictionary:
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mooses :: morph:[ st:moose fl:NP] stem:[moose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]
	*  gooses :: morph:[ st:goose fl:NP] stem:[goose]
	*  mouses :: morph:[ st:mouse fl:NP] stem:[mouse]
	*  oxes :: morph:[ st:ox fl:NP] stem:[ox]

More progress! "Bunnies" and "monkeys" both work properly. It's not incorrectly accepting "bunnys" or "monkeies". All the errors now have to do with irregular plurals like "oxen" and "mice".
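The log lines above report a stem for each accepted word (for example "st:goose"). A simplified model of how a checker can get from a surface form back to its stem is to strip the add text, put the remove text back, and then test the where condition against the restored stem. The sketch below does that for the six "NP" rules. It assumes every stem carries the "NP" flag, and find_stem is a name invented here, not a Hunspell or HunspellXML function.

```python
import re

# (where, remove, add) triples mirroring the six NP rules in this section.
NP_RULES = [
    ("[cs]h", "", "es"),
    ("[sxz]", "", "es"),
    ("[^aeiou]y", "y", "ies"),
    ("[^hsxyz]", "", "s"),
    ("[^cs]h", "", "s"),
    ("[aeiou]y", "", "s"),
]


def find_stem(word, stems, rules=NP_RULES):
    """Return the dictionary stem that licenses this word, or None."""
    if word in stems:
        return word
    for where, remove, add in rules:
        if not word.endswith(add):
            continue
        # Undo the rule: drop what was added, restore what was removed.
        candidate = word[: len(word) - len(add)] + remove
        if candidate in stems and re.search("(?:" + where + ")$", candidate):
            return candidate
    return None


stems = {"bunny", "monkey", "fish", "fox", "cat", "moth"}

if __name__ == "__main__":
    for word in ["bunnies", "bunnys", "monkeys", "monkeies", "fishes", "mothes"]:
        print(word, "->", find_stem(word, stems))
```

"monkeies" is a useful case to trace by hand: stripping "ies" and restoring "y" does give "monkey", but the [^aeiou]y condition fails on "ey", so the word is rejected.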

Adding Irregular Plurals

English no longer has a significant number of nouns that undergo pluralization processes like these:

  • goose -> geese
  • moose -> moose
  • mouse -> mice
  • ox -> oxen

It would also be somewhat ridiculous to specify an individual affixation rule for each of them, although it would be possible in some cases, e.g.

<rule where="goose" remove="oose" add="eese"/>
<rule where="moose" add=""/>
<rule where="mouse" remove="ouse" add="ice"/>

And a rule like <rule where="ox" add="en"/> wouldn't even work, because it would improperly match "fox" and "box" too. The Hunspell implementation of regular expressions isn't sophisticated enough to allow us to make the distinction.
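The "ox" problem is easy to reproduce. Because a suffix condition is only compared against the end of the word, where="ox" cannot tell "ox" apart from any longer word that happens to end the same way. The sketch below uses Python's re module as a stand-in for that end-of-word match. The helper name is made up for this example.

```python
import re


def suffix_condition_matches(where, word):
    """Model of a suffix rule condition: it is tested against the word's end."""
    return re.search("(?:" + where + ")$", word) is not None


words = ["ox", "fox", "box", "cow"]
matched = [w for w in words if suffix_condition_matches("ox", w)]

if __name__ == "__main__":
    # Every word ending in "ox" qualifies, so the rule would also allow
    # "foxen" and "boxen".
    print(matched)
```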

A better way to handle these exceptions is to put both the singular and plural words in the dictionary. And the singular word-forms should not have the "NP" affixation rule flag attached to them in the dictionary like they currently do.

  • Modify the <dictionaryFile> block so that it looks like this:
	<dictionaryFile>
		<words flags="NP">
		bunny
		butterfly
		cat
		cow
		dog
		fish
		fly
		frog
		fox
		horse
		jay
		monkey
		moth
		pony
		puppy
		walrus
		</words>
		<!-- List of irregular nouns and their plurals -->
		<words>
		goose
		geese
		moose
		mouse
		mice
		ox
		oxen
		</words>
	</dictionaryFile>
  • Now save the tutorial.xml file and drop it on the converter. You should see test results like these:
Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_good.txt...
Correctly spelled words test completed without errors.
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\tutorial\tutorial_bad.txt...
Misspelled words test completed without errors.

Success! We've eliminated all of our spelling errors and false positives. We've handled the vast majority of English noun pluralization rules. If we find more irregular nouns, we can put them in the irregular noun list.

An Alternative Solution

There's often more than one way to solve a problem. Representing noun pluralization in English is no exception. The above solution used regular expressions to instruct Hunspell how to analyze the phonological/orthographic context of each word to decide what pluralization rule should be applied. But just like we separated the irregular nouns out into their own group at the end, we could have separated all the words out into different groups depending on the pluralization rule they should use.

This could be a useful technique if the rules became too complicated to represent using regular expressions, for example. Let's try it out and see how it would look.

  • Create a new HunspellXML file called alternative_plural.xml.
  • Paste the following dictionary definition into it and save it.
<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
		
		<affixes>
			<suffix flag="PS">
				<rule add="s"/>
			</suffix>
			<suffix flag="PE">
				<rule add="es"/>
			</suffix>
			<suffix flag="PI">
				<rule remove="y" add="ies"/>
			</suffix>
		</affixes>
	</affixFile>
	

	<dictionaryFile>
		<!-- List of nouns with plurals in -s -->
		<words flags="PS">
		cat
		cow
		dog
		frog
		horse
		jay
		monkey
		moth		
		</words>
		<!-- List of nouns with plurals in -es -->
		<words flags="PE">
		fish
		fox
		walrus
		</words>
		<!-- List of nouns with plurals in -ies -->
		<words flags="PI">
		bunny
		butterfly
		fly
		pony
		puppy
		</words>
		<!-- List of irregular nouns and their plurals -->
		<words>
		goose
		geese
		moose
		mouse
		mice
		ox
		oxen
		</words>
	</dictionaryFile>
	
	<tests>
		<!-- Regular plural -s -->
		<good>cats cows dogs frogs monkeys moths</good>
		<bad>caties cowen froges monkies gooses mooses mouses walruss</bad>
		<!-- plural in -es -->
		<good>fishes foxes walruses</good>
		<bad>horsees fishs foxs mothes oxs</bad>
		<!-- plural in -ies -->
		<good>bunnies butterflies flies ponies puppies</good>
		<bad>cowies monkies monkeies mousies</bad>
		<!-- irregular plurals -->
		<good>geese mice oxen</good>
		<bad>gooses mouses oxes foxen meese</bad>
	</tests>
</hunspell>
  • Save the file, drop it onto the converter, and note that all of the tests pass.

Now note the changes from our original example in the tutorial.xml file. In this file, we specify three pluralization rules:

  1. -s (the "PS" rule)
  2. -es (the "PE" rule)
  3. -ies (the "PI" rule)

And in the <dictionaryFile> section, we split up the word lists into 4 groups instead of 2.

  1. One group for nouns whose plurals are -s (with the affixation rule "PS")
  2. One group for nouns whose plurals are -es (with the affixation rule "PE")
  3. One group for nouns whose plurals are -ies (with the affixation rule "PI")
  4. As before, one group for irregular plural and singular nouns.

To some extent, it is up to you in each case to decide which method is better.

  1. Try to specify all the necessary rules using regular expressions (not always possible as we saw with "ox -> oxen")
  2. Use different affixation rules for different sets of words. (But this will have added complexity when you start chaining one affixation rule to another. You'll have more chaining rules to keep track of this way, since you'll have more affixation rule sets overall.)
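The second method can be pictured as a lookup from each word to the one rule set it is allowed to use, with no conditions involved at all. The sketch below models the three flags from alternative_plural.xml that way and lists the forms that end up accepted. The data structures and the accepted_forms helper are illustrative only, and the word list is a subset of the one in the file.

```python
# flag -> (remove, add); these rules carry no where condition.
RULES = {"PS": ("", "s"), "PE": ("", "es"), "PI": ("y", "ies")}

# word -> flag (None for the irregular list, which has no flags).
ENTRIES = {
    "cat": "PS", "horse": "PS", "moth": "PS",
    "fish": "PE", "fox": "PE",
    "bunny": "PI",
    "goose": None, "geese": None,
}


def accepted_forms(entries, rules):
    """Return every word form licensed by the entries and their flags."""
    forms = set()
    for word, flag in entries.items():
        forms.add(word)
        if flag is None:
            continue
        remove, add = rules[flag]
        if remove and not word.endswith(remove):
            continue  # nothing to strip, so the rule cannot apply
        forms.add(word[: len(word) - len(remove)] + add)
    return forms


FORMS = accepted_forms(ENTRIES, RULES)

if __name__ == "__main__":
    print(sorted(FORMS))
```

Here "mothes" and "horsees" are rejected not because a condition blocks them, but because "moth" and "horse" were never given the "PE" flag. The correctness of the result depends entirely on each word being filed in the right list.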

Note: If you're using FLEx to manage your data, you could create a Hunspell field to keep track of the rule flags that can apply to each word. Then you could use FLEx's powerful bulk editing features to assign the proper rules to words, and export the resulting word lists (see the wiki page "Other Tips: Export a Hunspell dictionary from FLEx").

Chaining Affix Rules

In this section, we'll work on adding another affix to nouns - the possessive "-'s" suffix. There are various opinions on what the "rules" for English apostrophization should be, but here are the rules we'll use for this example:

  1. If a noun is singular, add -'s even if it ends in -s already: (dog's, fox's, walrus's)
  2. If a plural noun ends in -s, add an apostrophe: (cats', foxes', walruses')
  3. Otherwise, add -'s (geese's, mice's)

Add More Tests

If you want to make sure you're doing it right, you need to have some tests. And although you can add tests after you add your rules, it doesn't hurt to add the tests first so you'll have your goal in mind.

  • Create a new file called possessive.xml. Copy the contents of your tutorial.xml file into it and save it.
  • Now add these tests to the <tests> block in your file:
		<!-- possessives -->
		<good>dog's fox's walrus's cats' fox's foxes' walruses' goose's geese's mice's oxen's</good>
		<bad>dog' walrus' fox' goose' geese' mices' oxens'</bad>

Define New Suffix Rules

Now we will define two new suffix rules:

  1. "SP" for Singular Possessives
  2. "PP" for Plural Possessives

They should look like this:

			<suffix flag="SP">
				<rule add="'s"/>
			</suffix>
			<suffix flag="PP">
				<rule where="s" add="'"/>
				<rule where="[^s]" add="'s"/>
			</suffix>

In our original suffix rule for noun pluralization, we'll add a chaining rule by adding the combineFlags="PP" attribute to each of the individual rules. This tells Hunspell that once a particular rule has been applied, it can continue on to apply another rule, in this case the Plural Possessive (PP) rule.

The whole <affixes> block should now look like this:

		<affixes>
			<suffix flag="NP">
				<rule where="[cs]h" add="es" combineFlags="PP"/>
				<rule where="[sxz]" add="es" combineFlags="PP"/>
				<rule where="[^aeiou]y" remove="y" add="ies" combineFlags="PP"/>
				<rule where="[^hsxyz]" add="s" combineFlags="PP"/>
				<rule where="[^cs]h" add="s" combineFlags="PP"/>
				<rule where="[aeiou]y" add="s" combineFlags="PP"/>
			</suffix>
			
			<suffix flag="SP">
				<rule add="'s"/>
			</suffix>
			<suffix flag="PP">
				<rule where="s" add="'"/>
				<rule where="[^s]" add="'s"/>
			</suffix>
		</affixes>

That takes care of adding possessives to regular plural nouns.
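Chaining can be thought of as feeding the output of one rule set into another. The sketch below applies the "NP" rules to a stem and then the "PP" rules to each resulting plural, which is a simplified picture of what combineFlags="PP" permits. It generates forms rather than checking them, and apply_rules and plural_possessives are names invented for this example.

```python
import re

# (where, remove, add) triples; an empty where means "no condition".
NP_RULES = [
    ("[cs]h", "", "es"),
    ("[sxz]", "", "es"),
    ("[^aeiou]y", "y", "ies"),
    ("[^hsxyz]", "", "s"),
    ("[^cs]h", "", "s"),
    ("[aeiou]y", "", "s"),
]
SP_RULES = [("", "", "'s")]
PP_RULES = [("s", "", "'"), ("[^s]", "", "'s")]


def apply_rules(word, rules):
    """Return every form produced by a rule whose condition matches the word."""
    forms = []
    for where, remove, add in rules:
        if re.search("(?:" + where + ")$", word):
            forms.append(word[: len(word) - len(remove)] + add)
    return forms


def plural_possessives(stem):
    """Chain the rule sets: PP is applied to the output of NP."""
    return [pp for plural in apply_rules(stem, NP_RULES)
            for pp in apply_rules(plural, PP_RULES)]


if __name__ == "__main__":
    for stem in ["cat", "fox", "pony"]:
        print(stem, apply_rules(stem, SP_RULES), plural_possessives(stem))
    print("geese", apply_rules("geese", PP_RULES))
```

The last line shows why "PP" has a second rule: an irregular plural such as "geese" does not end in "s", so it takes "'s" rather than a bare apostrophe.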

Test the New Suffix Rules

  • Save the file and drop it onto the converter. You should see test results like this:
Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\possessive\possessive_good.txt...
WARNING: Some words listed in possessive_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  dog's :: suggest:[dogs', dogs]
	*  fox's :: suggest:[fox]
	*  walrus's :: suggest:[walrus]
	*  fox's :: suggest:[fox]
	*  goose's :: suggest:[goose]
	*  geese's :: suggest:[geese]
	*  mice's :: suggest:[mice]
	*  oxen's :: suggest:[oxen]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\possessive\possessive_bad.txt...
Misspelled words test completed without errors.

We see that none of our "bad" tests failed by showing up as correctly spelled words. But our singular possessives and our possessives of irregular plurals are still not working.

Adding New Rules to the Dictionary Words

Our regular singular noun list in the dictionary file already links to the "NP" suffix rule to do Noun Pluralization. Those rules in turn link to the "PP" suffix rule for Plural Possessives. But the regular noun list doesn't link to the "SP" rule for Singular Possessives. And the words in the irregular nouns list don't link to the singular or plural possessive rules either.

For the list of regular singular nouns, we can just add a second flag, so that the <words> element looks like this:

		<words flags="NP SP">
  • Make that change and save the file. Test the changes.
Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\possessive\possessive_good.txt...
WARNING: Some words listed in possessive_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  goose's :: suggest:[goose]
	*  geese's :: suggest:[geese]
	*  mice's :: suggest:[mice]
	*  oxen's :: suggest:[oxen]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\possessive\possessive_bad.txt...
Misspelled words test completed without errors.

Now the only remaining errors have to do with the irregular nouns.

One problem now is that the list of irregular noun forms contains both singular and plural nouns. We could put a flag on each word to say whether it links to the "SP" or "PP" rule, like this (with the "long" flag type, each flag is two characters, so "SPPP" on moose is read as the two flags "SP" and "PP"):

		<words>
		goose/SP
		geese/PP
		moose/SPPP
		mouse/SP
		mice/PP
		ox/SP
		oxen/PP
		</words>

That would solve the problem. But it might be more difficult to maintain over the long term.

We can also split the word list in two, one list for irregular singular forms and one list for irregular plural forms, like below. (Note that "moose" needs to go in both lists.)

  • Change your <dictionaryFile> block to look like this:
	<dictionaryFile>
		<words flags="NP SP">
		bunny
		butterfly
		cat
		cow
		dog
		fish
		fly
		frog
		fox
		horse
		jay
		monkey
		moth
		pony
		puppy
		walrus
		</words>
		<!-- List of irregular singular nouns -->
		<words flags="SP">
		goose
		moose
		mouse
		ox
		</words>
		<!-- List of irregular plural nouns -->
		<words flags="PP">
		geese
		moose
		mice
		oxen
		</words>
	</dictionaryFile>
  • Save the file and test it. You shouldn't see any errors.

Now you've seen the basics of how to chain one suffix rule set to another, and different ways of linking the dictionary words to affix rule sets.

Exercise: Regular English Verbs

As an exercise, add the following list of regular English verbs to your HunspellXML file:

  • add
  • subtract
  • multiply
  • divide
  • array
  • match
  • push
  • amass

Create affixation rules so that Hunspell will also recognize the present 3rd person singular, past, and gerund forms of these words, that is:

  • add, adds, added, adding
  • subtract, subtracts, subtracted, subtracting
  • multiply, multiplies, multiplied, multiplying
  • divide, divides, divided, dividing
  • array, arrays, arrayed, arraying
  • match, matches, matched, matching
  • push, pushes, pushed, pushing
  • amass, amasses, amassed, amassing

Don't forget to define your tests first! What invalid forms might your rules accidentally produce? Test for those too!
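
For instance, a first draft of the tests for this exercise might look like the sketch below. It uses the same <good> and <bad> elements as the other tests in this tutorial, and it belongs wherever your existing tests are defined. The invalid forms listed are only guesses at mistakes a rule set could make (forgetting the -y to -ie- change, keeping a final -e before -ed or -ing, leaving out the -e- after -ch, -sh, or -ss); add your own as you find them.

```xml
<!-- Regular verb tests (a sketch, not the contents of basic_verbs.xml) -->
<good>adds added adding subtracts subtracted subtracting multiplies multiplied multiplying divides divided dividing arrays arrayed arraying matches matched matching pushes pushed pushing amasses amassed amassing</good>
<bad>addes multiplys multiplyed multipliing divideed divideing arraies arraied matchs pushs amasss</bad>
```

Writing the <bad> list first is a cheap way to find out which spelling conditions your rules will need before you write them.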

Hint: Can you reuse some of the work you did on noun plurals, especially for the present 3rd person singular -s suffix?

When you have finished working on the problem, you can compare your answer to the basic_verbs.xml file in the samples/ directory for this project. Remember, there's more than one way to solve a problem!

Chaining from a Suffix to a Prefix

Up to now, we've only looked at suffix rules in our examples. In this section, we'll talk about what you need to do to chain affixation rules from a suffix to a prefix or from a prefix to a suffix.

We'll use the example of a fairly regular construction in English. The suffix -able (or -ible) can be added to most transitive verbs in English, e.g. break -> breakable, drink -> drinkable, speak -> speakable. Only after the -able suffix has been added can the prefix un- be added to the front of most of these words, e.g. unbreakable but not *unbreak, undrinkable but not *undrink, unspeakable but not *unspeak.

To allow Hunspell to recognize words like unbreakable and undrinkable as correctly spelled, we need the dictionary word to link to the "-able" suffix rule, and we need the "-able" rule to link to the "un-" prefix rule.

Let's test our rules on a set of regular verbs (break, drink, and speak all have irregular past forms). Let's use the list of verbs from the exercise above:

  • add
  • subtract
  • multiply
  • divide
  • array
  • match
  • push
  • amass

First, let's create the test that will help us define whether our rules work successfully or not. Something like this:

		<!-- Un-verb-able tests -->
		<good>addable multipliable arrayable unpushable unmatchable undividable unamassable unsubtractable unarrayable unmultipliable</good>
		<bad>unadd unpush multiplyable undivideable arraiable</bad>

We need to add our list of verbs to the dictionary file, with a flag that links them to the "-able" suffix rule. Let's call that rule "BL". (If you already created this list of words in the exercise, you can simply add the "BL" flag to the flag(s) you already used. In the example below, the "VB" rule already links to suffixes for regular verbs like -s, -ed, -ing.)

		<words flags="VB BL">
		add
		subtract
		multiply
		divide
		array
		match
		push
		amass
		</words>

Now let's create a simple "BL" suffix rule for the "-able" suffix. We'll handle the special spelling rules for verbs that end in -e and -y. We'll also add a "UN" rule for the "un-" prefix. And of course we'll use the combineFlags="UN" attribute on all the rules in the "BL" suffix rule so that they will point to the "UN" prefix rule.

			<suffix flag="BL">
				<rule where="[^aeiou]y" remove="y" add="iable" combineFlags="UN"/>
				<rule where="[aeiou]y" add="able" combineFlags="UN"/>
				<rule where="e" remove="e" add="able" combineFlags="UN"/>
				<rule where="[^ey]" add="able" combineFlags="UN"/>
			</suffix>
			
			<prefix flag="UN">
				<rule add="un"/>
			</prefix>

If you test it at this point, you'll get these results:

Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\un-verb-able\un-verb-able_good.txt...
WARNING: Some words listed in un-verb-able_good.txt (which should contain only correct spellings) are rejected as misspellings by the current Hunspell dictionary:
	*  unpushable :: suggest:[pushable]
	*  unmatchable :: suggest:[matchable]
	*  undividable :: suggest:[dividable]
	*  unamassable :: suggest:[amassable]
	*  unsubtractable :: suggest:[subtractable]
	*  unarrayable :: suggest:[arrayable]
	*  unmultipliable :: suggest:[multipliable]
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\HunspellXML-MinimalFile\un-verb-able\un-verb-able_bad.txt...
Misspelled words test completed without errors.

Basically, none of the "un-" forms work. Why? We did link all the word-forms produced by the "BL" rule so they could chain to the "UN" rule. As it turns out, that isn't enough: Hunspell also requires us to explicitly state when a suffix rule is allowed to combine with a prefix rule and vice versa. We do this using the cross="true" attribute in the <prefix> and <suffix> elements. Think of "cross" as being needed whenever an affixation rule chain needs to cross the dictionary word. (The Hunspell documentation talks about "cross-multiplying" the prefix and suffix.)

So we need to add cross="true" to both the "BL" and "UN" rules:

			<suffix flag="BL" cross="true">
				<rule where="[^aeiou]y" remove="y" add="iable" combineFlags="UN"/>
				<rule where="[aeiou]y" add="able" combineFlags="UN"/>
				<rule where="e" remove="e" add="able" combineFlags="UN"/>
				<rule where="[^ey]" add="able" combineFlags="UN"/>
			</suffix>
			
			<prefix flag="UN" cross="true">
				<rule add="un"/>
			</prefix>

Now when you save the file and drop it onto HunspellXML-Converter, it should tell you that all the tests passed.
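
Because "UN" is reachable only through the combineFlags on the "BL" rules, the un- prefix should not attach to the other verb forms. A few extra test words can guard against that as your rule set grows. This is only a sketch: it assumes your "VB" rules from the exercise produce forms like adds, adding, and pushes, and it reuses the <good>/<bad> test format shown above.

```xml
<!-- Extra un-verb-able tests (sketch): un- should only follow -able -->
<good>pushable matchable dividable amassable subtractable</good>
<bad>unadds unadding unpushes unpushing unmultiplies unarrays</bad>
```

If any of the <bad> words are accepted, check whether "UN" has ended up in the flags of the dictionary words or in the combineFlags of a rule other than "BL".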

See the un-verb-able.xml file in the samples/ directory for this project.
