## Spell Checker
*by Mohammad Akbar*

In order to check spelling we need a dictionary.<br/>
For this program we will use dictionaries from the `nltk` (Natural Language Toolkit) module: the lookups below go through `wordnet`, and the `words` corpus (`words.words()`) is imported alongside it.

In [1]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.corpus import words as words

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Akbar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now we import the regex package `re`.

In [2]:
import re

We will use `sortedcontainers` to improve performance.

In [3]:
from sortedcontainers import SortedSet

Unfortunately, `wordnet` does **NOT** include:<br/> `determiners`, `prepositions`, `pronouns`, `conjunctions`, `particles`, `auxiliary verbs`.<br/>
Let's add these to our dictionary manually.

In [14]:
import os

ACCEPTED = SortedSet([])
notACCEPTED = SortedSet([])
hardfiles = ["determiners.txt", 
             "prepositions.txt", 
             "pronouns.txt", 
             "conjunctions.txt", 
             "particles.txt",
             "auxiliaryverbs.txt",
             "contractions.txt", 
             "irregularverbs.txt"]
HARDCODE_DIR = "./hardcode"
print(os.listdir(HARDCODE_DIR))
for filenm in hardfiles:                                   # only the listed files, so '.ipynb_checkpoints' is never opened
    with open(os.path.join(HARDCODE_DIR, filenm), 'r') as file:   # the path must include the directory
        for line in file:
            word = "".join(line.split()).lower()           # drop all whitespace, force lowercase
            if word and not wn.synsets(word, 'asrnv'):     # skip blank lines and words wordnet already knows
                ACCEPTED.add(word)

['.ipynb_checkpoints', 'auxiliaryverbs.txt', 'conjunctions.txt', 'contractions.txt', 'determiners.txt', 'irregularverbs.txt', 'modalverbs.txt', 'particles.txt', 'prepositions.txt', 'pronouns.txt', 'staticverbs.txt']

Time to start parsing our file!

In [None]:
pattern = re.compile(r"([\w\-']*[a-zA-Z]+[\w\-']*)")     # regex for words with at least 1 a-zA-Z
with open("mobydick.txt") as file:                         # open input file
    for line in file:                                         # foreach line
        for match in pattern.finditer(line):                     # foreach word in line
            word = match.group().lower()                            # word found in line, forced lowercase
            if word in ACCEPTED or word in notACCEPTED:             # if word already memoized
                continue                                               # go to next word
            if wn.synsets(word,'asrnv'):                            # if word in wordnet, 'asrnv' means nouns,verbs,... 
                ACCEPTED.add(word)                                     # memoize as ACCEPTED
            else:                                                   # if word NOT in wordnet
                notACCEPTED.add(word)                                  # memoize as notACCEPTED
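
To see what the pattern picks up, here is a minimal sketch that runs without `nltk`. The names `tokenize`, `classify` and `TOY_WORDS` are made up for this illustration, and the toy set only stands in for the `wordnet` lookup; the memoization idea is the same as in the cell above.

```python
import re

# Same shape as the pattern above: a run of word characters, hyphens or
# apostrophes that contains at least one ASCII letter.
WORD_PATTERN = re.compile(r"[\w\-']*[a-zA-Z]+[\w\-']*")

def tokenize(line):
    """Return the lowercase word tokens found in one line of text."""
    return [match.group().lower() for match in WORD_PATTERN.finditer(line)]

def classify(lines, in_dictionary):
    """Split distinct tokens into (accepted, rejected), looking each one up once."""
    accepted, rejected = set(), set()
    for line in lines:
        for word in tokenize(line):
            if word in accepted or word in rejected:
                continue  # already memoized, skip the lookup
            if in_dictionary(word):
                accepted.add(word)
            else:
                rejected.add(word)
    return accepted, rejected

# Hypothetical toy dictionary standing in for wordnet.
TOY_WORDS = {"call", "me", "the", "crew", "sailed", "in"}
sample = ["Call me Ishmael: the whale-ship's crew sailed in 1851."]

accepted, rejected = classify(sample, lambda w: w in TOY_WORDS)
print(tokenize(sample[0]))
print(sorted(rejected))  # ['ishmael', "whale-ship's"]
```

Note that `1851` produces no token at all, because the pattern requires at least one letter, while `whale-ship's` comes through as a single token and is left for the later steps.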

Great! We have our file parsed. However, there are some false negatives in `notACCEPTED`.<br/>
Let's account for words ending with `'s` or `s'`.

In [None]:
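def goodApostrophe(word):
    word_no_apst = re.sub(r"'s$|s'$", '', word)      # strip a trailing 's or s'
    if word == word_no_apst:                          # nothing was stripped, so not a possessive
        return False
    # accept if the remaining base word is memoized or found in wordnet
    return bool(word_no_apst in ACCEPTED or wn.synsets(word_no_apst, 'asrnv'))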
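
A small stand-alone sketch of the same stripping rule, with a toy set in place of `wordnet`. The names `strip_possessive`, `good_apostrophe` and `TOY_WORDS` are assumptions made for this example only.

```python
import re

# Trailing 's (singular possessive) or s' (plural possessive).
POSSESSIVE = re.compile(r"'s$|s'$")

def strip_possessive(word):
    """Remove a trailing 's or s'; return the word unchanged otherwise."""
    return POSSESSIVE.sub("", word)

def good_apostrophe(word, dictionary):
    """True if the word is a possessive whose base form is in the dictionary."""
    base = strip_possessive(word)
    return base != word and base in dictionary

# Hypothetical toy dictionary standing in for wordnet.
TOY_WORDS = {"whale", "sailor"}

for word in ["whale's", "sailors'", "whales", "ahab's"]:
    print(word, "->", strip_possessive(word), good_apostrophe(word, TOY_WORDS))
```

One limitation of this rule: the `s'` branch also removes the `s`, so a name such as `james'` is reduced to `jame` and will not be found.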

In [None]:
APOSTROPHES = SortedSet([])
for word in notACCEPTED:
    if goodApostrophe(word):
        APOSTROPHES.add(word)

ACCEPTED = ACCEPTED.union(APOSTROPHES)
notACCEPTED = notACCEPTED.difference(APOSTROPHES)

In [None]:
from IPython.display import display, Markdown, Latex
display(Markdown("**"
                 + format(len(APOSTROPHES), ',d')
                 + "** words found in dictionary, when `'s` or `s'` was removed"
                ))

We've gone as far as we can with dictionaries, but there are still more words to recognize.<br/>
Let's handle compound words next, for example *gallant-cross-tree*.

In [None]:
COMPOUNDWORDS = SortedSet([])
pattern_compound = re.compile(r"([^\-\s]+)")
for word in notACCEPTED:
    accept_compound = True
    roots = re.findall(pattern_compound, word)
    for r , root in enumerate(roots):
        if root in ACCEPTED or wn.synsets(root,'asrnv') or goodApostrophe(root):
            continue
        else:
            accept_compound = False
            break
    if word.startswith('-') or word.endswith('-'):
        accept_compound = False
    if accept_compound:
        COMPOUNDWORDS.add(word)
print(str(len(COMPOUNDWORDS)) + " compound words found!")
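
The compound check can be tried out on its own with a toy dictionary. This is only a sketch: `is_compound` and `TOY_WORDS` are names invented here, and unlike the cell above it also requires at least two roots so that a plain word is never reported as a compound.

```python
def is_compound(word, dictionary):
    """True if every hyphen-separated root of the word is in the dictionary."""
    if word.startswith("-") or word.endswith("-"):
        return False  # a dangling hyphen is not a compound word
    roots = [root for root in word.split("-") if root]  # ignore empty pieces
    return len(roots) > 1 and all(root in dictionary for root in roots)

# Hypothetical toy dictionary standing in for wordnet.
TOY_WORDS = {"gallant", "cross", "tree"}

for word in ["gallant-cross-tree", "cross-", "gallant-xyzzy", "tree"]:
    print(word, is_compound(word, TOY_WORDS))
```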

In [None]:
ACCEPTED = ACCEPTED.union(COMPOUNDWORDS)
notACCEPTED = notACCEPTED.difference(COMPOUNDWORDS)


In [None]:
from IPython.display import display, Markdown, Latex
display(Markdown( "**" 
      + format(len(ACCEPTED), ',d')
      + "** (*correctly spelled*) + **"
      + format(len(notACCEPTED), ',d')
      + "** (*NOT in dictionary*) = **" 
      + format(len(ACCEPTED)+len(notACCEPTED), ',d')
      + "** (*total words*)<br/>**"
      + '{0:.2%}'.format(float(len(ACCEPTED))/float(len(ACCEPTED)+len(notACCEPTED))) 
      + "** *correctly spelled*"))

Let's look at what we have so far.

In [None]:
print(str(len(notACCEPTED))+ " words not 'yet' accepted")
for i , word in enumerate(notACCEPTED):
    print(str(i+1).rjust(5) +" " + word)

We still need to account for tenses (past, present, future) and irregular verbs.
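
One possible direction for regular inflections is naive suffix stripping. The sketch below is not part of the program above: `SUFFIX_RULES`, `candidate_bases`, `is_inflected_form` and `TOY_WORDS` are all hypothetical, the rule list is deliberately short, and irregular forms such as *went* are not handled. NLTK's WordNet interface also offers `wn.morphy`, which may cover much of this.

```python
# (suffix, replacement) pairs tried in order; a deliberately small rule set.
SUFFIX_RULES = [
    ("ies", "y"),   # carries -> carry
    ("ing", ""),    # sailing -> sail
    ("ing", "e"),   # chasing -> chase
    ("ed", ""),     # sailed  -> sail
    ("ed", "e"),    # chased  -> chase
    ("es", ""),     # reaches -> reach
    ("s", ""),      # sails   -> sail
]

def candidate_bases(word):
    """Return every base form the suffix rules can produce for the word."""
    bases = []
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            bases.append(word[:-len(suffix)] + replacement)
    return bases

def is_inflected_form(word, dictionary):
    """True if at least one candidate base form is in the dictionary."""
    return any(base in dictionary for base in candidate_bases(word))

# Hypothetical toy dictionary standing in for wordnet.
TOY_WORDS = {"sail", "chase", "carry"}

for word in ["sailed", "chasing", "carries", "went"]:
    print(word, is_inflected_form(word, TOY_WORDS))
```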