TrnsltLife edited this page Mar 19, 2018 · 39 revisions
HunspellXML

Introduction

HunspellXML defines an XML format for creating Hunspell dictionaries, and a Java/Groovy library for transforming dictionaries described in HunspellXML into the standard Hunspell format.

Purpose of HunspellXML

Hunspell is a flexible and powerful spell-check dictionary engine used in a wide variety of programs, including Firefox, LibreOffice, OpenOffice, and Opera. However, the file format for specifying a Hunspell dictionary, although documented, is complex and difficult to master. HunspellXML aims to make creating Hunspell dictionaries easier by:

  • providing a simple XML file format which is more human-readable than raw Hunspell files
  • converting the XML to valid Hunspell affix and dictionary files
  • creating Firefox, LibreOffice, OpenOffice, and Opera spell-check plugins automatically

Benefits of Using HunspellXML

Defining your dictionary first in HunspellXML provides the following advantages over defining it directly in the raw Hunspell format:

  • Human-readable - A HunspellXML file is easier to read than raw Hunspell files, which makes it a good choice for dictionary source code. You do not have to learn every formatting option required to write a raw Hunspell dictionary and affix file.
  • Error checking - The HunspellXML library provides some error checking for affix rules, including checks for some restrictions that the Hunspell documentation does not currently cover.
  • Plugin packaging - The HunspellXML library provides utilities for creating packaged Hunspell dictionary plugins for Firefox, LibreOffice/OpenOffice, and Opera.
  • MyThes thesaurus - HunspellXML also provides basic support for creating MyThes thesaurus files.
  • Testing - In HunspellXML, you can define and export tests (correctly and incorrectly spelled words) to help verify that the Hunspell dictionary you create does what you intended.
  • Affix multiplication - Hunspell can represent only 3 levels of affixes. One workaround is to combine multiple affixes into a single Hunspell affix slot. For example, the Lingala verb extensions (-am, -an, -el, -is, -ol) can combine with the verb tense markers (-a, -i, -aka, -aki), which requires typing 20 rules (5 x 4) in a raw Hunspell affix file. HunspellXML provides a <multiply> feature that generates the combinations for you, so you only enter the rules for each affix group (9 rules instead of 20 in the Lingala example). For languages that combine many affix rules, this can significantly improve readability and maintainability.
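
The arithmetic behind affix multiplication can be sketched in a few lines. The example below is written in Python for brevity and is not part of the HunspellXML library. It illustrates the idea only and does not show how <multiply> is implemented. It assumes that combined suffixes are formed by simple concatenation, and it uses a hypothetical flag name "A" and an unconditional rule (strip "0", condition ".") when formatting the raw Hunspell SFX lines.

```python
from itertools import product

# Lingala suffix groups taken from the example above.
EXTENSIONS = ["am", "an", "el", "is", "ol"]
TENSES = ["a", "i", "aka", "aki"]


def multiply(*groups):
    """Return every suffix formed by taking one entry from each group, in order.

    Assumption: affixes combine by plain concatenation.
    """
    return ["".join(parts) for parts in product(*groups)]


def to_sfx_rules(flag, suffixes):
    """Format suffixes as a raw Hunspell SFX block: a header line, then one rule each.

    Assumption: nothing is stripped ("0") and the rule applies to any stem (".").
    """
    header = f"SFX {flag} Y {len(suffixes)}"
    rules = [f"SFX {flag} 0 {suffix} ." for suffix in suffixes]
    return [header] + rules


combined = multiply(EXTENSIONS, TENSES)
sfx_lines = to_sfx_rules("A", combined)  # "A" is a hypothetical flag name

if __name__ == "__main__":
    entered = len(EXTENSIONS) + len(TENSES)
    print(f"{entered} rules entered, {len(combined)} rules generated")
    print("\n".join(sfx_lines))
```

The two input groups hold 5 + 4 = 9 entries, while the generated block holds 5 x 4 = 20 rules. Each additional affix group multiplies the size of the output but only adds to the size of the input, which is why the gap grows quickly.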

Requirements

  • Java
  • The groovy-all-[version].jar library from the Groovy distribution
  • The RELAX NG library (jing.jar) from the Thai Open Source Software Center
  • The hunspell.jar library and its jna.jar dependency from HunspellJNA

User Interface

To Do

Command Line Utility

Version 1.8 and later, available from the releases, includes a command line utility that converts HunspellXML to Hunspell and vice versa. Read more about it on the Command Line Tool page.

Getting Started

HunspellXML File Format Reference

Tips for Designing Your Dictionary Definition