### Spell Checker

To check spelling, we need a dictionary.<br/>
For this program we will draw on the `wordnet` corpus and the `words.words()` word list from the `nltk` (Natural Language Toolkit) module.

In [27]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.corpus import words

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Akbar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now we import the regex package `re`.

In [28]:
import re

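We will use `SortedSet` from `sortedcontainers`: it keeps the words in sorted order, so we can slice and index them later, while membership tests stay fast.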

In [29]:
from sortedcontainers import SortedSet

Unfortunately, `wordnet` does **NOT** include:<br/> `determiners`, `prepositions`, `pronouns`, `conjunctions`, `particles`, `auxiliary verbs`.<br/>
Let's add these to our dictionary manually.

In [30]:
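ACCEPTED = SortedSet()     # words known to be spelled correctly
notACCEPTED = SortedSet()  # words not (yet) found in any dictionary
hardfiles = ["determiners.txt",
             "prepositions.txt",
             "pronouns.txt",
             "conjunctions.txt",
             "particles.txt",
             "auxiliaryverbs.txt"]
for filenm in hardfiles:
    with open(filenm, 'r') as file:
        for line in file:
            # Remove all whitespace and lower-case the entry,
            # because words are lower-cased before they are looked up
            entry = "".join(line.split()).lower()
            if entry:  # ignore blank lines
                ACCEPTED.add(entry)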
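
The loading step can be tried out without the real word-list files. The following is a minimal sketch, not the notebook's own code: `load_word_list` and the file name `pronouns_sample.txt` are hypothetical, a plain `set` stands in for `SortedSet`, and the helper assumes one entry per line. It lower-cases each entry so that it can match the lower-cased words looked up during parsing.

```python
import tempfile
from pathlib import Path


def load_word_list(path):
    """Return the non-empty, lower-cased entries of a one-word-per-line file."""
    entries = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Remove every whitespace character, then normalize the case.
            entry = "".join(line.split()).lower()
            if entry:  # skip blank lines
                entries.add(entry)
    return entries


# Demo on a throwaway file (hypothetical contents, not the real pronouns.txt).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pronouns_sample.txt"
    sample.write_text("I\nyou\n\n  They \n", encoding="utf-8")
    loaded = load_word_list(sample)

print(sorted(loaded))
```

The blank line and the surrounding spaces in the sample file are dropped, and `I` is stored as `i`.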

Time to start parsing our file!

In [48]:
# A word: at least one ASCII letter, optionally surrounded by
# word characters, hyphens, or apostrophes
pattern = re.compile(r"[\w\-']*[a-zA-Z]+[\w\-']*")
with open("mobydick.txt") as file:
    for line in file:
        for match in pattern.finditer(line):
            word = match.group().lower()
            if word in ACCEPTED or word in notACCEPTED:
                continue  # already classified
            # Look the word up as adjective (a, s), adverb (r), noun (n), or verb (v)
            if wn.synsets(word, 'asrnv'):
                ACCEPTED.add(word)
            else:
                notACCEPTED.add(word)
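
To see what the parsing loop does to individual lines, here is a small self-contained sketch. It uses a pattern of the same shape as the one above (optional word characters, hyphens, or apostrophes around at least one ASCII letter). The names `tokens`, `classify`, and `toy_dictionary` are made up for illustration, and `is_known` is a stand-in for the WordNet lookup so that the example runs without `nltk`.

```python
import re

# Optional word characters, hyphens, or apostrophes around at least one ASCII letter.
TOKEN = re.compile(r"[\w\-']*[a-zA-Z]+[\w\-']*")


def tokens(line):
    """Return the lower-cased candidate words found in one line of text."""
    return [match.group().lower() for match in TOKEN.finditer(line)]


def classify(lines, is_known):
    """Sort every distinct token into an accepted or a rejected set.

    `is_known` is a hypothetical stand-in for the dictionary lookup.
    """
    accepted, rejected = set(), set()
    for line in lines:
        for word in tokens(line):
            if word in accepted or word in rejected:
                continue  # already classified, no second lookup needed
            (accepted if is_known(word) else rejected).add(word)
    return accepted, rejected


toy_dictionary = {"the", "whale", "mate"}  # made-up entries for the demo
sample = ["The whale, the 1st mate", "bookbinder's half-inch addi-", "Chapter 42"]
accepted, rejected = classify(sample, toy_dictionary.__contains__)

print(sorted(accepted))
print(sorted(rejected))
```

Two things stand out in this sketch: a token needs at least one letter, so `42` is skipped entirely, and a word that ends in a hyphen, such as `addi-`, is kept with the hyphen attached and is therefore looked up as a fragment rather than as a whole word.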

Let's look at what we have so far.

In [25]:
print(f"{len(notACCEPTED)} words not 'yet' accepted")
# Show every 100th rejected word, labelled with its 1-based position
for i, word in enumerate(notACCEPTED[::100]):
    print(f"{i * 100 + 1:04d} | {word}")
    print("...")

2588 words not 'yet' accepted
0001 | '-gallant-cross-trees
...
0101 | -in
...
0201 | addi-
...
0301 | attend-
...
0401 | bookbinder's
...
0501 | cellini's
...
0601 | corre-
...
0701 | dinner-time
...
0801 | exasper-
...
0901 | freebooting
...
1001 | half-inch
...
1101 | how-
...
1201 | jocularly
...
1301 | lo
...
1401 | mazeppa
...
1501 | nescio
...
1601 | paint-
...
1701 | ponder-
...
1801 | relin-
...
1901 | sea-salt
...
2001 | sleeper's
...
2101 | stopping-places
...
2201 | terrorem
...
2301 | trous
...
2401 | val
...
2501 | whaleboning
...
