Brill and Moore Noisy Channel Spelling Correction

This is a Java implementation of the noisy channel spell checking approach presented in:

Brill and Moore (2000). An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the ACL 2000.

The spell checker's error model is trained on a list of pairs of misspellings with corrections, considering generic character edits up to a specified maximum edit length (e.g., the edit ant→ent from the pair dependant→dependent).

To use this spell checker you need:

a list of misspellings with corrections
a list of potential corrections (i.e., a dictionary of real words)

The spell checker does not know anything about morphology or sentence-initial capitalization, so it expects all possible forms of a word (inflected, capitalized, lowercase, mixed case, etc.) to appear in the list of potential corrections. The command-line wrapper includes flags to expand a provided dictionary with lowercase and capitalized versions of all words.

Command Line Usage

Compile and Package

$ mvn package

Run

$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar

Usage

usage: java -jar brillmoore-0.1-jar-with-dependencies.jar
 -a,--minatoa <arg>      minimum a -> a probability (default 0.8)
 -c,--candidates <arg>   number of candidates to output (default 10)
 -d,--dict <arg>         dictionary file
 -h,--help               this help message
 -l,--lowercase          expand dictionary with lowercase versions of all
                         words
 -p,--train <arg>        training file
 -s,--single             add training instances for all single character
                         edits
 -t,--test <arg>         testing file
 -u,--capitalized        expand dictionary with capitalized versions of
                         all words
 -w,--window <arg>       window for expanding alignments (Brill and
                         Moore's N; default 3)

Data Formats

Tab-separated values are used for input and output.

Training/Testing

counts are optional, assumed to be 1 if no count provided
the test counts are merely copied into the output for further use

misspelling TAB target TAB count

Dictionary

without probabilities (one word per line, all words equally likely):

word

with probabilities:

word TAB probability

Output

The output echoes the test input columns (misspelling, target, count) and appends the ranked candidate corrections as pairs of columns containing the candidate correction and the -log(prob) of the candidate.

misspelling TAB target TAB count TAB candidate1 TAB -log(prob1) TAB candidate2 TAB -log(prob2) ...

Example

Sample input files based on the Aspell common misspellings test data are provided in data/. See data/README.md for details.

$ java -jar target/brillmoore-0.1-jar-with-dependencies.jar -d data/aspell-wordlist-en_USGBsGBz.70-1.txt -p data/aspell-common.train -t data/aspell-common.dev.first10 -c 3 > data/aspell-common.dev.first10.USGBsGBz.70-1.out

Sample output:

pumkin  pumpkin 1       pumpkin 4.38    pumpkin's       6.67    bumkin  7.32
reorganision    reorganisation  1       reorganisation  2.88    reorganisation's        5.20    reorganisations 7.09
gallaxies       galaxies        1       galaxies        4.01    galaxy's        13.26   galaxy  17.45
superceeded     superseded      1       superseded      7.91    supersede       14.46   succeeded       18.34
millenia        millennia       1       millennia       2.11    millennial      6.23    millennial's    8.52
pseudonyn       pseudonym       1       pseudonym       4.69    pseudonym's     6.98    pseudonyms      8.87
synonymns       synonyms        1       synonyms        6.46    synonym's       8.29    synonym 12.49
prominant       prominent       1       predominant     1.76    prominent       2.71    preeminent      10.01
manouver        maneuver        1       maneuver        1.93    manoeuvre       3.76    maneuver's      4.27
obediance       obedience       1       obedience       1.98    obedience's     4.33    obeisance       10.12

Evaluation for sample output:

$ data/eval.py data/aspell-common.dev.first10.USGBsGBz.70-1.out

NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 3)
--------------------------------------------------------------------
0	10	90.0	100.0	100.0	100.0	100.0	100.0

Evaluation for the whole dev set output in data/aspell-common.dev.USGBsGBz.70-1.out considering the first 100 suggestions:

NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 100)
----------------------------------------------------------------------
18	403	84.1	93.1	94.8	95.5	95.7	95.7

(Compare to: http://aspell.net/test/common-all/)

Evaluation with default paramemeters training on all Aspell common misspellings (data/aspell-common.all) and testing on Aspell current test data (data/aspell-current.all), which focuses on difficult misspellings:

NotFnd	Found	First	1-5	1-10	1-25	1-50	Any (Max: 100)
----------------------------------------------------------------------
43	504	56.3	78.4	83.7	88.8	91.2	92.1

(Compare to: http://aspell.net/test/cur/)

Note: some target corrections aren't found in the provided dictionary due to capitalization (e.g., The, muslims) and run-on errors (incase). The flags -l and -u could be used to expand the base word list with lowercase and capitalized versions respectively.

Java Usage

// create a list of pairs of misspellings and corrections
List<Misspelling> trainMisspellings = new ArrayList<>();
trainMisspellings.add(new Misspelling("Abril", "April", 1));

// create a dictionary
Map<String, Double> dict = new HashMap<>();
dict.put("April", 1.0);
dict.put("Arzt", 1.0);
dict.put("Altstadt", 1.0);

// set the parameters
int window = 3;
double minAtoA = 0.8;

try {
    // train spell checker
    SpellChecker spellchecker = new SpellChecker(trainMisspellings, dict, window, minAtoA);

    // run spell checker
    List<Candidate> candidates = spellchecker.getRankedCandidates("Abril");

    // iterate over top ten candidates
    for (Candidate cand : candidates.subList(0, Math.min(candidates.size(), 10))) {
        System.out.println(cand.getTarget() + "\t" + cand.getProb());
    }
} catch (ParseException e) {
    System.err.println(e.getMessage());
}

Output

April	1.6094379124341005
Altstadt	Infinity
Arzt	Infinity

Using Maven

Install in the local maven archive:

$ mvn install

Add the maven dependency:

<dependency>
	<groupId>de.unituebingen.sfs</groupId>
	<artifactId>brillmoore</artifactId>
	<version>0.1</version>
</dependency>

Credits

This code includes modified versions of:

Trie by Ryan Gantt (further documentation)
Damerau Levenshtein Algorithm by Kevin L. Stern

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
data		data
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

pom.xml

pom.xml

Repository files navigation

Brill and Moore Noisy Channel Spelling Correction

Command Line Usage

Compile and Package

Run

Usage

Data Formats

Training/Testing

Dictionary

Output

Example

Java Usage

Output

Using Maven

Credits

About

Releases

Packages

Languages

License

adrianeboyd/BrillMooreSpellChecker

Folders and files

Latest commit

History

Repository files navigation

Brill and Moore Noisy Channel Spelling Correction

Command Line Usage

Compile and Package

Run

Usage

Data Formats

Training/Testing

Dictionary

Output

Example

Java Usage

Output

Using Maven

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Languages