Getting Started

TrnsltLife edited this page Dec 10, 2013 · 34 revisions

UNDER CONSTRUCTION

Getting Started Tutorial

This tutorial will try to get you off to a good start learning HunspellXML. It's not going to cover every aspect of HunspellXML, but it should teach you enough to get started. If you are willing to experiment, you should be able to figure the rest out as you go along by consulting the rest of the wiki.

Getting Started

  • Download the latest version (Google Code) of HunspellXML-Converter.
  • Unzip the contents.
  • Run the HunspellXML-Converter-[version].jar file. (You need to have Java installed. If you don't, download it from Oracle and install it first.)

Minimal HunspellXML File

  • Create a new file called "tutorial.xml".
  • Open it with a simple text editor such as Notepad++.
  • Set the file's encoding to UTF-8. In Notepad++, you do this through the Encoding->Convert to UTF-8 without BOM menu item.
  • Copy and paste the following HunspellXML definition into your file and save it.

This is the minimal valid HunspellXML file. If you leave any of these elements out, the file won't validate.

<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
	</affixFile>
	

	<dictionaryFile>
	</dictionaryFile>
</hunspell>
  • You should change the language and country code values in <languageCode>...</languageCode> to match your language and country.
  • You can change the value in <characterSet>...</characterSet> but "UTF-8" is the recommended value. (See info about <characterSet>.)
  • You can change the value in <flagType>...</flagType> but "long" is the recommended value. (See <flagType>.)
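
If you are not sure whether your editor really saved the file as UTF-8 without a BOM, a small script can check the raw bytes. The following is a minimal Python sketch and is not part of HunspellXML-Converter; the helper names `has_utf8_bom` and `is_valid_utf8` are made up for this illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Demonstration on in-memory bytes. For a real file you would read the bytes
# with open("tutorial.xml", "rb").read() instead.
plain = "<characterSet>UTF-8</characterSet>".encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

print(has_utf8_bom(plain), is_valid_utf8(plain))        # False True
print(has_utf8_bom(with_bom), is_valid_utf8(with_bom))  # True True
```

A file that reports a BOM can be re-saved from Notepad++ using the same Encoding menu item described above.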

Converting the File with HunspellXML-Converter

  • If HunspellXML-Converter isn't running, run it by double-clicking the HunspellXML-Converter-[version].jar file that you downloaded.
  • Drag the tutorial.xml file onto the open HunspellXML-Converter window, or use HunspellXML-Converter's File->Open and Convert... menu to find and open the tutorial.xml file.
  • You should see output like this:
Opening file D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial.xml
Validate HunspellXML...
Reading the HunspellXML...
Read.
Character Set: UTF-8
(Java Character Set: UTF-8)
Hunspell flag type: long
Validating the HunspellXML file...
HunspellXMLValidator: successfully validated
Parsing the HunspellXML file...
Exporting the XML file...
Exporting to D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial
Affix file created.
Dictionary file created.
Test files created.
License file created.
Readme file created.
Firefox plugin file created.
LibreOffice plugin file created.
Opera plugin file created.
Finished!
Find your files in: D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial

Testing 'correctly' spelled words in D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial\tutorial_good.txt...
Correctly spelled words test completed without errors.
Testing 'misspelled' words in D:\Dev\LangDev\SpellCheck\tutorialFile\tutorial\tutorial_bad.txt...
Misspelled words test completed without errors.

The converter reports each step as it works. It opens, reads, validates, and parses the file, then exports the affix and dictionary files and creates the plugin files. Finally, it runs the tests (there aren't any yet), which complete without errors.

Normal informational messages will show up in black or blue. If you see a message in red (as you scroll back through the messages) that usually means that something went wrong and you'll need to figure out what happened and fix it.

Dictionary Files and Plugins

If you look back in the directory where your tutorial.xml file is, you should see that some new files and directories have been created. Specifically:

  • tutorial_HunspellXML-Converter.log (This contains log messages about your latest HunspellXML conversion)
  • tutorial/
    • tutorial.aff (The Hunspell .aff file. It will eventually contain all the affix rules, suggestion rules, etc. used to make your dictionary useful. It currently only has the language code, character encoding, and flag type settings.)
    • tutorial.dic (This will eventually contain all your dictionary words. It currently has 0 words.)
    • tutorial_bad.txt (A file to contain a list of all the test words that should be flagged as improperly spelled.)
    • tutorial_good.txt (A file to contain a list of words that should be flagged as properly spelled. If any of them are flagged as misspelled, it means something went wrong in your dictionary definition.)
    • license.txt (The basic license for use in your spell-check plugins is "All rights reserved." You can edit this in the <metadata> section.)
    • Firefox/ (This directory contains the spell-check plugin that will work with Firefox.)
    • LibreOffice/ (This directory contains the spell-check plugin that will work with LibreOffice.)
    • Opera/ (This directory contains the spell-check plugin that will work with Opera.)

Adding Word Lists

Right now, the spell-check dictionary doesn't do anything. We haven't specified any affixation rules, and more importantly, we haven't included any dictionary words!

In HunspellXML-Converter, if you try typing words in the spell-check test box at the lower left, they will all show up with red underlines, meaning they are not recognized as correctly spelled words. Try it out.

Let's add some words now. The easiest way to do that for a simple test like ours is to put the dictionary words directly in the HunspellXML file. (The other method is to include words from a separate file.)

  • Modify the <dictionaryFile> section of the tutorial.xml file to include a set of basic nouns:
<hunspell>
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
	</affixFile>
	

	<dictionaryFile>
		<words>
		bunny
		butterfly
		cat
		cow
		dog
		fish
		fly
		frog
		fox
		goose
		horse
		monkey
		moose
		mouse
		ox
		pony
		puppy
		</words>
	</dictionaryFile>
</hunspell>
  • Drop the tutorial.xml file on the HunspellXML-Converter window again. If everything runs fine, you should be able to type some of the words in the test box. Good words should be underlined in green.
  • Type cow, dog, and cat. They should be underlined in green.
  • Type cows, dogs, and cats. They should be underlined in red. These words aren't in the dictionary.

Adding a Simple Affix Rule (Part 1)

What if we want our dictionary to recognize the plural forms of nouns? We could add every plural form to the dictionary, but that would almost double its size, and the approach scales badly. Consider verbs: besides the base form, an English verb has a 3rd Person Present form, a Gerund (-ing), a Past Tense (-ed), and sometimes a distinct Past Participle (-en etc.). That would mean entering up to 5 different forms for each verb. Other languages have much richer morphology. Can you imagine entering every form of a Spanish verb or a Greek verb? There might be dozens, scores, or hundreds of forms per verb.

Fortunately, Hunspell allows us to define affixation rules that specify what affixes can attach to certain words. The Hunspell spell-checker computes possible word-forms based on these rules, which means we can have a much smaller dictionary file.

Let's see how this works by working on a set of noun pluralization rules for English.

A First Attempt

In English, most nouns become plural by adding an "s" to the end of the word (a suffix). Cat becomes cats, dog becomes dogs, cow becomes cows, and so on.

Inside the <affixFile> section, we need to add an <affixes> section and a suffix rule.

  • Modify the tutorial.xml file so that the <affixFile> block looks like this (leave the rest of the file unchanged):
	<affixFile>
		<settings>
			<languageCode>eng_US</languageCode>
			<characterSet>UTF-8</characterSet>
			<flagType>long</flagType>
		</settings>
		
		<affixes>
			<suffix flag="NP">
				<rule add="s" />
			</suffix>
		</affixes>
	</affixFile>

This suffix rule (nicknamed "NP" for "Noun Plurals") states that we can add "s" to the end of words that take the "NP" rule.

If you were to save the file and drop it into the converter at this point, none of the plural forms of words (cats, dogs, cows) would work yet. That's because we haven't applied the "NP" rule to any of the words in our dictionary!

  • Make the following change to the <dictionaryFile> section:
	<dictionaryFile>
		<words flags="NP">
		bunny

By adding the flags="NP" attribute to the <words> element, we are telling Hunspell that the "NP" suffix rule can apply to any of the words inside the <words> element.
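
To picture what the flag does, it helps to know that a standard Hunspell .dic file starts with a word count and then lists one entry per line, with any flags written after a slash. The Python sketch below builds lines in that shape. It is only an illustration: the function name `dic_lines` is hypothetical, and the assumption that HunspellXML-Converter writes tutorial.dic in exactly this layout has not been checked against its output.

```python
def dic_lines(words, flags=""):
    """Build the lines of a Hunspell-style .dic file.

    The first line is the number of entries. Each following line is a word,
    followed by "/" and its flags when flags are given.
    """
    entries = [f"{word}/{flags}" if flags else word for word in words]
    return [str(len(entries))] + entries


print("\n".join(dic_lines(["bunny", "cat", "cow"], flags="NP")))
# 3
# bunny/NP
# cat/NP
# cow/NP
```

Opening tutorial/tutorial.dic in a text editor after the next conversion is a quick way to compare this sketch with what the converter really produces.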

  • Now save the tutorial.xml file and drop it onto the converter window.
  • Test some words and their plurals, e.g.:
    • cat, cats
    • dog, dogs
    • cow, cows
    • fox, foxes
    • bunny, bunnies
    • mouse, mice
    • ox, oxen

As you can see, some of the plural forms still show up with red underlines, because our simplistic "plural in -s" rule doesn't cover all the necessary cases in English.

Now try some of these English misspellings:

  • bunnys
  • fishs
  • flys
  • foxs
  • mouses
  • oxs

These incorrect words are marked as correctly spelled! We have some work to do on our plural rules still.
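
The behavior is easy to reproduce outside Hunspell. The Python sketch below is a toy model of our current setup, assuming that a word is accepted if it is a stem from the word list or a stem plus "s". It is not how Hunspell is implemented internally, and the names `STEMS` and `accepts` are invented for this example.

```python
# The stems from the <words> block in tutorial.xml.
STEMS = {
    "bunny", "butterfly", "cat", "cow", "dog", "fish", "fly", "frog", "fox",
    "goose", "horse", "monkey", "moose", "mouse", "ox", "pony", "puppy",
}


def accepts(word: str) -> bool:
    """Toy model of the "NP" rule: accept a stem, or a stem followed by "s"."""
    if word in STEMS:
        return True
    return word.endswith("s") and word[:-1] in STEMS


for word in ["cats", "foxes", "bunnies", "mice", "bunnys", "foxs", "mouses"]:
    print(word, accepts(word))
# cats True
# foxes False
# bunnies False
# mice False
# bunnys True
# foxs True
# mouses True
```

The unconditional rule fails in both directions: it rejects correct forms such as foxes and accepts incorrect forms such as foxs.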

Breaking down the problems from our first attempt, we can see that there are several different plural rules that we'll eventually need to get pluralization right:

  • -s for most nouns (e.g. cats)
  • -es for most nouns that end in "ch", "sh", "s", or "x" (e.g. foxes)
  • -ies for nouns that end in a consonant + "y" (e.g. ponies)
  • -s for nouns that end in a vowel + "y" (e.g. monkeys)
  • nouns with irregular plurals (e.g. moose, geese, mice, oxen)
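
Before writing these as affix rules, it can help to see the same decisions written as ordinary code. The Python sketch below models each rule as "what to strip, what to add, and a condition on the end of the word", which is loosely how Hunspell suffix rules are organized. Everything here is illustrative: the names `IRREGULAR`, `RULES`, and `pluralize` are made up, the first-match ordering is a simplification, and none of it is HunspellXML syntax.

```python
import re

# Irregular plurals are looked up directly and bypass the rules.
IRREGULAR = {"goose": "geese", "moose": "moose", "mouse": "mice", "ox": "oxen"}

# Each rule is (text to strip, text to add, regex the word must match).
# Rules are tried in order; the first match wins.
RULES = [
    ("y", "ies", r"[^aeiou]y$"),   # consonant + y: pony -> ponies
    ("", "es", r"(ch|sh|s|x)$"),   # sibilant endings: fox -> foxes
    ("", "s", r""),                # everything else: cat -> cats
]


def pluralize(word: str) -> str:
    """Return the plural of an English noun under the simplified rules above."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for strip, add, condition in RULES:
        if re.search(condition, word):
            stem = word[: len(word) - len(strip)]
            return stem + add
    return word  # unreachable: the last rule matches every word


for noun in ["cat", "fox", "fish", "pony", "monkey", "mouse", "ox"]:
    print(noun, "->", pluralize(noun))
# cat -> cats
# fox -> foxes
# fish -> fishes
# pony -> ponies
# monkey -> monkeys
# mouse -> mice
# ox -> oxen
```

Note that "monkey" falls through to the plain -s rule because the letter before its "y" is a vowel, which is exactly the distinction between the third and fourth bullets above.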

Adding Some Tests

Keep in mind that a new rule can go wrong in two ways. It can:

  • flag words as correctly spelled that are actually incorrect
  • flag words as incorrectly spelled that are actually correct

One way to help deal with this problem is to create tests for each new type of rule you create. When you have a complicated task, like adding rules that cover all of the English noun pluralization patterns, it can be helpful to create a series of tests before you even start writing the affixation rules. Every time you make changes and drop the file on the converter, the conversion process will run the tests, and you'll be able to see your progress towards your goal.

Let's write some tests. What tests would you add to see if our plural rule behaves properly? We can try to check for words that Hunspell shows as correct that we know are incorrect, and words that Hunspell shows as incorrect that we know are correct.

At the bottom of the <hunspell> element, after the <dictionaryFile> block, create a new <tests> block like this:

	<tests>
		<!-- Regular plural -s -->
		<good>cats cows dogs frogs monkeys</good>
		<bad>caties cowen froges monkies gooses mooses mouses</bad>
		<!-- plural in -es -->
		<good>fishes foxes horses</good>
		<bad>fishs foxs oxs</bad>
		<!-- plural in -ies -->
		<good>bunnies butterflies flies ponies puppies</good>
		<bad>cowies monkies mousies</bad>
		<!-- irregular plurals -->
		<good>geese mice oxen</good>
		<bad>gooses mouses oxes foxen meese</bad>
	</tests>
</hunspell>
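
The idea behind the good and bad lists can be sketched in a few lines of Python: every good word that is rejected and every bad word that is accepted counts as a failure. This is only a model of the concept, not the converter's actual test code. The names `run_tests`, `naive_accepts`, and `STEMS` are hypothetical, and `naive_accepts` stands in for the unconditional "add s" rule we have so far.

```python
STEMS = {
    "bunny", "butterfly", "cat", "cow", "dog", "fish", "fly", "frog", "fox",
    "goose", "horse", "monkey", "moose", "mouse", "ox", "pony", "puppy",
}


def naive_accepts(word: str) -> bool:
    """Stand-in spell-checker: accept a stem, or a stem followed by "s"."""
    return word in STEMS or (word.endswith("s") and word[:-1] in STEMS)


def run_tests(accepts, good, bad):
    """Return (good words that were rejected, bad words that were accepted)."""
    rejected_good = [word for word in good if not accepts(word)]
    accepted_bad = [word for word in bad if accepts(word)]
    return rejected_good, accepted_bad


good = "cats foxes bunnies mice".split()
bad = "foxs mouses cowen meese".split()

rejected_good, accepted_bad = run_tests(naive_accepts, good, bad)
print("good words rejected:", rejected_good)
print("bad words accepted:", accepted_bad)
# good words rejected: ['foxes', 'bunnies', 'mice']
# bad words accepted: ['foxs', 'mouses']
```

Both lists should be empty once the pluralization rules are complete, which is the goal the <tests> block lets you track after each conversion.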