## Spelling Recommender

This notebook builds two spelling recommenders, one based on the Jaccard distance and one based on the edit distance.

Each recommender suggests a correction for the three default words: `['cormulent', 'incendenece', 'validrate']`.

In [3]:
import nltk
nltk.download('popular')
nltk.download('nps_chat')
nltk.download('webtext')
from nltk.corpus import words

correct_spellings = words.words()


### Jaccard distance

For this recommender, we will use the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words (the entry word and a word from the NLTK corpus).**
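To make the metric concrete, here is a minimal sketch in plain Python (no NLTK) that builds the character trigrams of two words and computes their Jaccard distance, 1 - |A ∩ B| / |A ∪ B|. The helper names `trigrams` and `jaccard_distance` are illustrative and are not NLTK functions. Returning 0 for two empty sets is an assumption of this sketch.

```python
def trigrams(word):
    """Return the set of character trigrams of `word`."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1 - len(a & b) / len(union)


entry = trigrams('cormulent')
candidate = trigrams('corpulent')
shared = entry & candidate

print(sorted(shared))                      # trigrams present in both words
print(jaccard_distance(entry, candidate))  # 0 = identical sets, 1 = disjoint
```

Here `cormulent` and `corpulent` each have 7 trigrams and share 4 of them (`cor`, `ule`, `len`, `ent`), so the union holds 10 trigrams and the distance is 1 - 4/10 = 0.6.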

In [9]:
import time

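entries = ['cormulent', 'incendenece', 'validrate']

# Timing the execution
start_time = time.time()

# Character trigrams of each misspelled entry
cormu_tri = set(nltk.ngrams('cormulent', n=3))
incen_tri = set(nltk.ngrams('incendenece', n=3))
vali_tri = set(nltk.ngrams('validrate', n=3))

# Best distance found so far; 1 is the maximum possible Jaccard distance
dist_cormu = 1
dist_incen = 1
dist_vali = 1

# Keep, for each entry, the closest corpus word that starts with the same letter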
for x in correct_spellings:
    if x[0] == entries[0][0] and nltk.jaccard_distance(cormu_tri, set(nltk.ngrams(x, n=3))) < dist_cormu:
        min_cormu = x
        dist_cormu = nltk.jaccard_distance(cormu_tri, set(nltk.ngrams(x, n=3)))
            
    if x[0] == entries[1][0] and nltk.jaccard_distance(incen_tri, set(nltk.ngrams(x, n=3))) < dist_incen:
        min_incen = x
        dist_incen = nltk.jaccard_distance(incen_tri, set(nltk.ngrams(x, n=3)))
            
    if x[0] == entries[2][0] and nltk.jaccard_distance(vali_tri, set(nltk.ngrams(x, n=3))) < dist_vali:
        min_vali = x
        dist_vali = nltk.jaccard_distance(vali_tri, set(nltk.ngrams(x, n=3)))

    
print(min_cormu, min_incen, min_vali)
print("execution time : %s seconds" % (time.time() - start_time))

corpulent indecence validate
execution time : 0.4411618709564209 seconds


### Edit distance

For this recommender, we will use the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**
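As a rough illustration of what an edit distance with transpositions measures, the sketch below implements the optimal string alignment variant (insertions, deletions, substitutions, and swaps of adjacent characters) in plain Python. The name `osa_distance` is hypothetical. This is not NLTK's implementation, and its results may differ from `nltk.edit_distance(..., transpositions=True)` on some inputs.

```python
def osa_distance(s, t):
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] = distance between the first i chars of s and the first j chars of t
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                # swap of two adjacent characters counts as one edit
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance('validrate', 'validate'))  # one deletion ('r')
print(osa_distance('form', 'from'))           # one adjacent swap
```

The table has `len(s) * len(t)` cells, so each comparison costs time proportional to the product of the two word lengths.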

In [10]:
# Timing the execution
start_time = time.time()

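# Best distance found so far; 50 is an initial bound above any distance expected here
dist_cormu = 50
dist_incen = 50
dist_vali = 50

# Same scan as before, with the edit distance (transpositions allowed) as the metric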
for x in correct_spellings:
    if x[0] == entries[0][0] and nltk.edit_distance('cormulent', x, transpositions=True) < dist_cormu:
        min_cormu = x
        dist_cormu = nltk.edit_distance('cormulent', x, transpositions=True)
            
    if x[0] == entries[1][0] and nltk.edit_distance('incendenece', x, transpositions=True) < dist_incen:
        min_incen = x
        dist_incen = nltk.edit_distance('incendenece', x, transpositions=True)
            
    if x[0] == entries[2][0] and nltk.edit_distance('validrate', x, transpositions=True) < dist_vali:
        min_vali = x
        dist_vali = nltk.edit_distance('validrate', x, transpositions=True)

print(min_cormu, min_incen, min_vali)
print("execution time : %s seconds" % (time.time() - start_time))

corpulent intendence validate
execution time : 6.542162179946899 seconds


### Conclusion

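The Jaccard distance method is much faster than the edit distance method: about 0.44 seconds versus 6.54 seconds in the runs above, roughly 15 times faster. The results are almost the same. Both methods return corpulent and validate, and they differ only on incendenece, where the edit distance result may be slightly more accurate: intendence is closer to incendenece than indecence is.

For a live spelling corrector, where response time matters, the Jaccard distance seems to be the better choice. For uses where accuracy matters more than speed, such as verifying a finished text, the edit distance may be better.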
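Since the two recommenders differ only in the metric, that choice can be made a parameter. The sketch below is one possible refactoring, not the code used for the timings above: `recommend`, `trigram_jaccard`, and `tiny_vocab` are hypothetical names, and the four-word vocabulary stands in for the NLTK word list so that the example runs without any download.

```python
def recommend(entry, candidates, distance):
    """Return the candidate closest to `entry` under `distance`.

    Only candidates sharing the entry's first letter are considered,
    mirroring the filter used in the cells above. Returns None when
    no candidate starts with that letter.
    """
    same_initial = [w for w in candidates if w and w[0] == entry[0]]
    if not same_initial:
        return None
    # min() keeps the first candidate in case of ties
    return min(same_initial, key=lambda w: distance(entry, w))


def trigram_jaccard(a, b):
    """Jaccard distance between the character trigram sets of two words."""
    tri_a = {a[i:i + 3] for i in range(len(a) - 2)}
    tri_b = {b[i:i + 3] for i in range(len(b) - 2)}
    union = tri_a | tri_b
    if not union:
        return 0.0  # assumption: two words without trigrams count as identical
    return 1 - len(tri_a & tri_b) / len(union)


# Hypothetical stand-in for words.words()
tiny_vocab = ['validate', 'valid', 'corpulent', 'violate']

print(recommend('validrate', tiny_vocab, trigram_jaccard))
print(recommend('cormulent', tiny_vocab, trigram_jaccard))
```

Any function taking two words and returning a number can be passed as `distance`, so swapping in an edit distance only changes the third argument.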