## Spell Checker
*by Mohammad Akbar*

In order to check spelling we need a dictionary.<br/>
For this program we will use the `wordnet` corpus, alongside the `words.words()` word list, from the `nltk` (Natural Language Toolkit) module.

In [27]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.corpus import words as words

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Akbar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now we import the regex package `re`.

In [28]:
import re

We will use `SortedSet` from `sortedcontainers` so that membership tests stay fast and the words come out in sorted order when we iterate over them.

In [29]:
from sortedcontainers import SortedSet

Unfortunately, `wordnet` does **NOT** include:<br/> `determiners`, `prepositions`, `pronouns`, `conjunctions`, `particles`, `auxiliary verbs`.<br/>
Let's add these to our dictionary manually.

In [30]:
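# Words confirmed as valid, and words that no lookup has accepted yet.
ACCEPTED = SortedSet()
notACCEPTED = SortedSet()
# Hand-made word lists, one entry per line, for the categories WordNet lacks.
hardfiles = ["determiners.txt",
             "prepositions.txt",
             "pronouns.txt",
             "conjunctions.txt",
             "particles.txt",
             "auxiliaryverbs.txt"]
for filenm in hardfiles:
    with open(filenm, 'r') as file:
        for line in file:
            # Remove all whitespace, including the trailing newline.
            ACCEPTED.add("".join(line.split()))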
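The loading loop treats every line as one entry, and `"".join(line.split())` removes all whitespace, so a blank line would add an empty string to the set. Here is a minimal sketch of a slightly more defensive loader. The helper name `load_words`, the temporary file, and its contents are assumptions made for illustration. Lowercasing on load is also an assumption, made so that entries can match tokens that are lowercased later on.

```python
import os
import tempfile

def load_words(path):
    """Hypothetical helper: return the set of words in a one-entry-per-line file."""
    words = set()  # plain set for the sketch; the notebook uses SortedSet
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            # Remove all whitespace, then lowercase (an assumption, see above).
            word = "".join(line.split()).lower()
            if word:  # skip blank lines instead of adding ""
                words.add(word)
    return words

# Stand-in for a file such as "pronouns.txt"; the contents are invented.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("I\nyou\n\n  we \n")
    sample_path = tmp.name

sample_words = load_words(sample_path)
os.remove(sample_path)
print(sorted(sample_words))
```

The same function could be called once per file in `hardfiles`, with the results merged into `ACCEPTED` via `ACCEPTED.update(...)`.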

Time to start parsing our file!

In [48]:
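# A token needs at least one ASCII letter and may also contain word characters,
# hyphens, and apostrophes (the leading part also admits backslashes).
pattern = re.compile(r"([\w\-\\']*[a-zA-Z]+[\w\-\']*)")
with open("mobydick.txt") as file:
    for line in file:
        for match in pattern.finditer(line):
            word = match.group().lower()
            # Skip words that have already been classified.
            if word in ACCEPTED or word in notACCEPTED:
                continue
            # Look the word up in WordNet as an adjective, satellite adjective,
            # adverb, noun, or verb.
            if wn.synsets(word, 'asrnv'):
                ACCEPTED.add(word)
            else:
                notACCEPTED.add(word)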
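To make the pattern's behaviour concrete, here is a small sketch that runs the same regular expression on a made-up sentence. The sample text and the helper names `tokenize` and `strip_edges` are assumptions for illustration, not part of the notebook. It shows that a token must contain at least one ASCII letter, and that a leading apostrophe or hyphen stays attached to the token unless it is stripped explicitly.

```python
import re

# Same pattern as in the notebook cell.
pattern = re.compile(r"([\w\-\\']*[a-zA-Z]+[\w\-\']*)")

def tokenize(line):
    """Hypothetical helper: lowercase tokens found by the pattern."""
    return [m.group().lower() for m in pattern.finditer(line)]

def strip_edges(token):
    """Hypothetical normalization: drop apostrophes and hyphens at both ends."""
    return token.strip("'-")

sample = "Call me Ishmael; 'tis well-known, isn't it? 1851"
tokens = tokenize(sample)
print(tokens)
# ['call', 'me', 'ishmael', "'tis", 'well-known', "isn't", 'it']
print([strip_edges(t) for t in tokens])
# ['call', 'me', 'ishmael', 'tis', 'well-known', "isn't", 'it']
```

Whether to strip edge punctuation before the dictionary lookup is a design choice: it would let `'tis` be checked as `tis`, but it also discards the hint that the word was written as an elision.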

Let's look at what we have so far.

In [50]:
print(f"{len(notACCEPTED)} words not 'yet' accepted")
# Number the rejected words from 1, zero-padded to four digits.
for i, word in enumerate(notACCEPTED, start=1):
    print(f"{i:04d} | {word}")

2539 words not 'yet' accepted
0001 | '-gallant-cross-trees
0002 | '-gallant-mast
0003 | '-gallant-sails
0004 | '-sails
0005 | '-west
0006 | '-wester
0007 | 'a
0008 | 'about
0009 | 'all
0010 | 'balmed
0011 | 'beat
0012 | 'bout
0013 | 'corrupt
0014 | 'd
0015 | 'dinner
0016 | 'em
0017 | 'ere
0018 | 'gainst
0019 | 'he's
0020 | 'heart
0021 | 'if
0022 | 'it
0023 | 'landlord
0024 | 'm
0025 | 'mong
0026 | 'most
0027 | 'mrs
0028 | 'narwhale
0029 | 'nothing
0030 | 'often
0031 | 're
0032 | 'ready
0033 | 's
0034 | 'sheaves
0035 | 'silly
0036 | 'spouters'
0037 | 'struck
0038 | 'tell
0039 | 'that
0040 | 'the
0041 | 'this
0042 | 'thou
0043 | 'tis
0044 | 'twas
0045 | 'twill
0046 | 've
0047 | 'way
0048 | 'whale'
0049 | -alive
0050 | -apprehensions
0051 | -bag
0052 | -belt
0053 | -biscuit
0054 | -bits
0055 | -bo
0056 | -board
0057 | -bodied
0058 | -bolted
0059 | -bone
0060 | -bow
0061 | -bowl
0062 | -boy
0063 | -browed
0064 | -captains
0065 | -condemning
0066 | -covered
0067 | -crackers
0068 | -craft
00