Create Your Own Spell Checker
Objective: Create a spell checker that corrects the misspelled words in a given sentence.
Problem Statement: When typing or sending a message, people often make 
spelling mistakes. Write a script that corrects the misspelled words in a sentence. 
The input will be a raw string, and the output will be a string with the case normalized 
and the misspelled words corrected.
Domain: General
Analysis to be done: Words availability in corpus
Content: 
Dataset: None
We will use NLTK’s built-in corpora (words, stop words, etc.) rather than a specific dataset.
Steps to perform:
While there are several approaches to correcting spelling, you will use the Levenshtein 
(edit) distance approach. 
The approach will be straightforward for correcting a word: 
▪ If the word is present in a list of valid words, the word is correct.
▪ If the word is absent from the valid word list, we will find the correct 
word, i.e., the word from the valid word list which has the lowest edit 
distance from the target word.
Once you define a function, you will iterate over the terms in the given sentence, 
correct the words identified as incorrect, and return a joined string with all the terms. 
To help speed up execution, you won’t be applying the spell check on the stop words
and punctuation.
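As an illustration of the edit distance idea described above, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming table. The function name `levenshtein` is hypothetical and is not part of the assignment; the notebook itself relies on NLTK's `edit_distance` instead.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = number of edits needed to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


print(levenshtein("abacos", "abacus"))   # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A distance of 1 between "abacos" and "abacus" reflects the single substitution of "o" with "u".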
Tasks: 
1. Get a list of valid words in the English language using NLTK’s list of words. (Hint: 
use nltk.download('words') to get the raw list.)
2. Look at the first 20 words in the list. Is the casing normalized?
3. Normalize the casing for all the terms.
4. Normalizing will have introduced some duplicates; create a unique list after normalizing.
5. Create a list of stop words which should include: 
i. Stop words from NLTK
ii. All punctuation characters (Hint: use 'punctuation' from the string module)
iii. Final list should be a combination of these two
6. Define a function to correct a single term
• For a given term, find its edit distance with each term in the valid word 
list. To speed up execution, you can use the first 20,000 entries in the 
valid word list.
• Store the result in a dictionary, with the term as the key and the edit 
distance as the value.
• Sort the dictionary in ascending order of the values.
• Return the first entry in the sorted result (the term with the minimum edit 
distance).
• Using the function, get the correct word for committee.
7. Make a set from the list of valid words for faster lookup when checking whether a word 
is valid.
8. Define a function for spelling correction in any given input sentence:
i. Convert the sentence to lowercase, then tokenize it.
ii. For each term in the tokenized sentence:
• Check if the term is in the list of valid words (valid_words_set).
• If yes, keep the word as is.
• If no, get the correct word using the get_correct_term function.
iii. Return the joined string as output.
9. Test the function for the input sentence “The new abacos is great”.

In [1]:
import nltk

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SIRISHA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\SIRISHA\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


True

In [8]:
from nltk.corpus import words, stopwords
from string import punctuation

In [24]:
# Task 1: Get a list of valid words in the English language
valid_words = words.words()

In [25]:
print(valid_words[:10])

['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron']


In [26]:
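# Task 3: Normalize the casing for all the terms
valid_words_lower = [word.lower() for word in valid_words]
# Task 4: Remove the duplicates introduced by lowercasing (set order is arbitrary)
valid_words_lower = list(set(valid_words_lower))
print(valid_words_lower[:10])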

['botryopterid', 'ridgingly', 'scattered', 'refashionment', 'menacement', 'patriarch', 'monte', 'salicylamide', 'trichogynial', 'uncatholic']
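The output above shows that converting a set back to a list gives no guaranteed order, so a slice such as "the first 20,000 entries" can differ between runs. As a hedged sketch, two ways to deduplicate with a reproducible order are shown below; `raw_words` is a hypothetical toy list standing in for the NLTK corpus.

```python
raw_words = ["A", "a", "aa", "Aaron", "aaron", "aal"]  # toy stand-in for words.words()

lowered = [w.lower() for w in raw_words]

# Option 1: alphabetical order, independent of the original ordering
unique_sorted = sorted(set(lowered))

# Option 2: keep the first occurrence of each word in its original position
unique_in_order = list(dict.fromkeys(lowered))

print(unique_sorted)    # ['a', 'aa', 'aal', 'aaron']
print(unique_in_order)  # ['a', 'aa', 'aaron', 'aal']
```

Either option makes tie-breaking between equally distant candidates repeatable, which a plain `list(set(...))` does not.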


In [27]:
# Task 5: Combine NLTK's English stop words with all punctuation characters
stop_words = set(stopwords.words('english')).union(set(punctuation))
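To make the contents of this combined set concrete, here is a small sketch that needs no NLTK download. The list `nltk_stop_words` is a hypothetical four-word stand-in for `stopwords.words('english')`; only `string.punctuation` is the real thing.

```python
from string import punctuation

# Hypothetical tiny stand-in for stopwords.words('english')
nltk_stop_words = ["the", "is", "a", "of"]

# set(punctuation) splits the string into its individual characters
skip_tokens = set(nltk_stop_words) | set(punctuation)

print(len(punctuation))                                 # 32
print("!" in skip_tokens, "is" in skip_tokens)          # True True
print("abacus" in skip_tokens, "..." in skip_tokens)    # False False
```

Note that only single punctuation characters end up in the set, so a multi-character token such as "..." is not covered and would need separate handling.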

In [28]:
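from nltk.metrics import edit_distance

def get_correct_term(target_word, valid_words):
    """Return the valid word closest to target_word by edit distance."""
    # Only compare against words sharing the first letter; this speeds up the
    # search but assumes the first character was typed correctly.
    edit_distances = {
        word: edit_distance(target_word, word)
        for word in valid_words
        if word and target_word and word[0] == target_word[0]
    }
    # No candidates (e.g. punctuation or digits): return the input unchanged
    # instead of raising an IndexError.
    if not edit_distances:
        return target_word
    # Key with the smallest distance; ties go to the first one encountered.
    return min(edit_distances, key=edit_distances.get)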
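The function above narrows the candidates by first letter, whereas step 6 of the brief describes comparing against the first 20,000 entries and sorting a dictionary of distances. The following is a sketch of that variant under stated assumptions: `get_correct_term_brief`, `toy_words`, and the misspelling "comittee" are hypothetical, and the local `edit_distance` is a small stand-in for `nltk.metrics.edit_distance` (unit costs, no transpositions) so that the example runs without NLTK.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Stand-in for nltk.metrics.edit_distance with default settings."""
    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Distance between the prefixes a[:i] and b[:j]
        if i == 0 or j == 0:
            return i + j
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(solve(i - 1, j) + 1,
                   solve(i, j - 1) + 1,
                   solve(i - 1, j - 1) + cost)
    return solve(len(a), len(b))


def get_correct_term_brief(term, valid_words, limit=20000):
    # Distance from the term to each of the first `limit` valid words
    distances = {word: edit_distance(term, word) for word in valid_words[:limit]}
    # Sort the (word, distance) pairs in ascending order of distance
    ranked = sorted(distances.items(), key=lambda item: item[1])
    # Fall back to the input when there is nothing to compare against
    return ranked[0][0] if ranked else term


toy_words = ["coffee", "comet", "commit", "committee", "commute"]
print(get_correct_term_brief("comittee", toy_words))  # committee
```

Because the slice takes the first entries of the list, the result depends on how the valid word list is ordered.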


In [29]:
valid_words_set = set(valid_words_lower)
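The reason for building a set is that a membership test on a list scans the elements one by one, while a set uses hashing. A rough way to see the difference is sketched below with `timeit`; the vocabulary is synthetic (`word0` to `word19999`), and the actual timings depend on the machine, so no numbers are quoted.

```python
import timeit

words_list = [f"word{i}" for i in range(20000)]  # synthetic stand-in vocabulary
words_set = set(words_list)
probe = "word19999"  # last element: the worst case for a list scan

# Time 200 membership checks against each container
list_time = timeit.timeit(lambda: probe in words_list, number=200)
set_time = timeit.timeit(lambda: probe in words_set, number=200)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```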

In [30]:
def correct_spelling(sentence):
    # word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
    tokenized_sentence = nltk.word_tokenize(sentence.lower())
    corrected_sentence = []

    for word in tokenized_sentence:
        # Stop words, punctuation, and known words pass through unchanged
        if word in stop_words or word in valid_words_set:
            corrected_sentence.append(word)
        else:
            correct_word = get_correct_term(word, valid_words_lower)
            corrected_sentence.append(correct_word)

    # Return the joined string as output
    return ' '.join(corrected_sentence)


In [36]:
# Task 9: Test the function for the input sentence "The new abacos is great"
sentence = "The new abacos is great"
corrected_sentence = correct_spelling(sentence)
print(corrected_sentence)

the new abacus is great
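The sentence-level loop can also be tried without any NLTK data, including on input that contains punctuation. The sketch below makes several assumptions: `correct_sentence` is a hypothetical name, a simple regular expression stands in for `nltk.word_tokenize`, the word sets are toy data, and `stub_correct_term` is a lookup-table stub in place of a real edit distance search.

```python
import re
from string import punctuation


def correct_sentence(sentence, valid_words, skip_tokens, correct_term):
    # Lowercase, then split into runs of word characters or single symbols
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    corrected = []
    for token in tokens:
        if token in skip_tokens or token in valid_words:
            corrected.append(token)  # stop word, punctuation, or known word
        else:
            corrected.append(correct_term(token))
    return " ".join(corrected)


valid = {"new", "abacus", "great"}            # toy valid word set
skip = {"the", "is"} | set(punctuation)       # toy stop words plus punctuation


def stub_correct_term(token):
    # Stub: a fixed lookup table instead of an edit distance search
    return {"abacos": "abacus"}.get(token, token)


print(correct_sentence("The new abacos is great!", valid, skip, stub_correct_term))
# the new abacus is great !
```

Joining with spaces leaves a gap before the exclamation mark; reattaching punctuation would need a separate detokenizing step.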
