# Create Your Own Spell Checker

Objective: Create a spell checker that corrects the misspelled words in a given sentence.

Problem Statement: When typing or sending a message, people often make spelling mistakes. Write a script that corrects the misspelled words in a sentence. The input is a raw string, and the output is a string with the casing normalized and the misspelled words corrected.

Domain: General

Analysis to be done: Word availability in the corpus

Content:

Dataset: None
We will use NLTK’s built-in corpora (words, stop words, etc.) rather than a specific dataset.

Steps to perform:
While there are several approaches to correcting spelling, you will use the Levenshtein (edit) distance approach.

The approach will be straightforward for correcting a word:
▪ If the word is present in a list of valid words, the word is correct.
▪ If the word is absent from the valid word list, we will find the correct word, i.e., the word from the valid word list which has the lowest edit distance from the target word.
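
The two rules above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the notebook relies on `nltk.edit_distance`, while `edit_distance` and `closest_word` below are hypothetical helpers written out so the dynamic-programming idea behind Levenshtein distance is visible.

```python
def edit_distance(source, target):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn source into target."""
    # previous[j] is the distance between the source prefix processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # reaching an empty target prefix costs i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def closest_word(term, valid_words):
    """Return term if it is valid, else the valid word with the lowest distance."""
    if term in valid_words:
        return term
    return min(valid_words, key=lambda word: edit_distance(term, word))


print(edit_distance("kitten", "sitting"))                          # 3
print(closest_word("speling", ["spelling", "selling", "sapling"]))  # spelling
```

When several words share the lowest distance, `min` returns the first one it meets, so the order of the word list decides ties.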

Once you define a function, you will iterate over the terms in the given sentence, correct the words identified as incorrect, and return a joined string with all the terms. To help speed up execution, you won’t be applying the spell check on the stop words and punctuation.

In [1]:
import nltk
from nltk.corpus import words, stopwords
from string import punctuation
from collections import Counter

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ramesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Tasks:

# 1. Get a list of valid words in the English language using NLTK’s list of words (Hint: use nltk.download('words') to get the raw list).

In [2]:
nltk.download('words')

valid_words = nltk.corpus.words.words()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Ramesh\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


# 2. Look at the first 20 words in the list. Is the casing normalized?

In [3]:
print("First 20 words in the list:", valid_words[:20])

First 20 words in the list: ['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac']
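
The output above already answers the question: the casing is not normalized, since capitalized entries such as 'Aaron' sit next to lowercase ones such as 'aardvark'. A small sketch of how this could be checked programmatically, using the 20 entries copied from that output (the variable names are illustrative):

```python
# The first 20 entries, copied from the output above.
first_20 = ['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark',
            'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite',
            'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac']

# An entry is not normalized if lowercasing changes it.
capitalized = [word for word in first_20 if word != word.lower()]
print(len(capitalized), "of", len(first_20), "entries contain uppercase letters")

# 'A' and 'a' become the same string once lowercased, which is how
# normalization introduces duplicates.
print(first_20[0].lower() == first_20[1])
```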


# 3. Normalize the casing for all the terms.

In [4]:
normalized_valid_words = [word.lower() for word in valid_words]
normalized_valid_words

['a',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron',
 'aaronic',
 'aaronical',
 'aaronite',
 'aaronitic',
 'aaru',
 'ab',
 'aba',
 'ababdeh',
 'ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abadite',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abama',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abanic',
 'abantes',
 'abaptiston',
 'abarambo',
 'abaris',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abasgi',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abassin',
 'abastardize',
 'abatable',
 'abate

# 4. Normalizing will have introduced some duplicates. Create a list of unique words after normalizing.

In [5]:
unique_normalized_valid_words = list(set(normalized_valid_words))
unique_normalized_valid_words

['uncorrespondent',
 'verbous',
 'persuasiveness',
 'ommiades',
 'baptismally',
 'smidge',
 'uncognized',
 'gymnasiast',
 'leotard',
 'panheaded',
 'carapaced',
 'gunnery',
 'abutilon',
 'pedestrian',
 'multispindle',
 'treater',
 'macropia',
 'repealable',
 'cola',
 'cistvaen',
 'prerealize',
 'objectionist',
 'phascolarctinae',
 'anorthitite',
 'benthic',
 'threadbareness',
 'schizogenous',
 'carroch',
 'jiggers',
 'rotatory',
 'collective',
 'airship',
 'dinsome',
 'sucrose',
 'nonfiduciary',
 'audaciously',
 'evergreenite',
 'prognose',
 'railroadana',
 'spirographidin',
 'unprogressively',
 'billbroking',
 'supradural',
 'dispeople',
 'barriness',
 'septoic',
 'deoxidator',
 'bagattino',
 'trisyllable',
 'copalm',
 'rhodanate',
 'saccharomycosis',
 'rechaser',
 'emmetropy',
 'unendangered',
 'vacuum',
 'patrician',
 'metis',
 'thirstful',
 'trashily',
 'handhole',
 'tetractinellida',
 'variolation',
 'lardizabalaceae',
 'acromyotonia',
 'prelusory',
 'belduque',
 'bantery',
 'clev
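
Note that `list(set(...))` returns the words in an arbitrary order, as the output above shows, and for strings that order can change between interpreter runs. This matters whenever a later step takes "the first N entries" of the list. If a stable order is wanted, one possible alternative is an order-preserving de-duplication; `unique_in_order` below is a hypothetical helper, and the outputs shown in this notebook come from the set-based version, not from this sketch.

```python
def unique_in_order(words):
    """Lowercase every word and drop duplicates, keeping first-seen order."""
    # Dictionary keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(word.lower() for word in words))


sample = ['A', 'a', 'aa', 'Aaron', 'aaron', 'aba']
print(unique_in_order(sample))  # ['a', 'aa', 'aaron', 'aba']
```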

# 5. Create a list of stop words which should include:

    i.   Stop words from NLTK
    
    ii.  All punctuation marks (Hint: use ‘punctuation’ from the string module)
    
    iii. Final list should be a combination of these two

In [6]:
from nltk.corpus import stopwords
from string import punctuation

stop_words = set(stopwords.words('english'))
punctuation_set = set(punctuation)
stop_words_combined = stop_words.union(punctuation_set)
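
One detail worth knowing about the combined set: `set(punctuation)` splits the punctuation string into single characters, so only one-character tokens such as "," match. A multi-character token such as "..." is not in the set. The sketch below illustrates this with a small hypothetical stand-in for the NLTK stop word list, so it runs without downloading any corpus.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english').
sample_stop_words = {"the", "is", "a", "an", "of"}
combined = sample_stop_words | set(punctuation)

# Membership is an exact match on the whole token.
for token in ["the", ",", "...", "abacos"]:
    print(repr(token), token in combined)
```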

# 6. Define a function to correct a single term

   • For a given term, find its edit distance with each term in the valid word list. To speed up execution, you can use the first 20,000 entries in the valid word list.
   
   • Store the result in a dictionary, the key as the term, and edit distance as value.
   
   • Sort the dictionary in ascending order of the values.
   
   • Return the first entry in the sorted result (the word with the minimum edit distance).
   
   • Using the function, get the correct word for committee.

In [7]:
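def get_correct_term(term, valid_word_list):
    # Search only the first 20,000 entries to keep the run time manageable
    valid_word_list = valid_word_list[:20000]
    # Map each candidate word to its edit distance from the term
    edit_distances = {word: nltk.edit_distance(term, word) for word in valid_word_list}
    # Sort by distance (ascending) and return the closest candidate
    sorted_distances = dict(sorted(edit_distances.items(), key=lambda x: x[1]))
    return list(sorted_distances.keys())[0]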

# Using the function, get the correct word for 'committee'
correct_committee = get_correct_term('committee', unique_normalized_valid_words)
print("Correct word for 'committee':", correct_committee)

Correct word for 'committee': commingle
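
The result above is 'commingle' rather than 'committee' because the function searches only a 20,000-word slice of an arbitrarily ordered list and never checks whether the term is already valid. Had 'committee' been inside that slice, its distance of 0 would have won. The sketch below shows the effect on a toy list and one possible guard. `correct_term` and `mismatch_count` are hypothetical names, and `mismatch_count` is only a crude stand-in for `nltk.edit_distance` so that the sketch stays self-contained.

```python
def correct_term(term, valid_words, distance, limit=None):
    """Return term if it is valid, else the closest of the first `limit` words."""
    if term in valid_words:           # an exact match anywhere in the list wins
        return term
    candidates = valid_words[:limit]  # limit=None keeps the whole list
    return min(candidates, key=lambda word: distance(term, word))


def mismatch_count(a, b):
    """Crude stand-in for edit distance: differing positions plus length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


toy_words = ["commingle", "comma", "committee"]  # 'committee' is the third entry

# Searching only the first two entries cannot return 'committee'.
print(min(toy_words[:2], key=lambda w: mismatch_count("committee", w)))
# Checking membership first returns the valid word unchanged.
print(correct_term("committee", toy_words, mismatch_count, limit=2))
```

In the notebook itself, the later sentence-level function performs this membership check against `valid_words_set` before calling `get_correct_term`, so the issue only shows up when `get_correct_term` is called directly on a valid word.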


# 7. Make a set from the list of valid words for faster lookup when checking whether a word is valid.

In [8]:
valid_words_set = set(unique_normalized_valid_words)
valid_words_set

{'uncorrespondent',
 'verbous',
 'persuasiveness',
 'ommiades',
 'baptismally',
 'smidge',
 'uncognized',
 'gymnasiast',
 'leotard',
 'panheaded',
 'carapaced',
 'gunnery',
 'abutilon',
 'pedestrian',
 'multispindle',
 'treater',
 'macropia',
 'repealable',
 'cola',
 'cistvaen',
 'prerealize',
 'objectionist',
 'phascolarctinae',
 'anorthitite',
 'benthic',
 'threadbareness',
 'schizogenous',
 'carroch',
 'jiggers',
 'rotatory',
 'collective',
 'airship',
 'dinsome',
 'sucrose',
 'nonfiduciary',
 'audaciously',
 'evergreenite',
 'prognose',
 'railroadana',
 'spirographidin',
 'unprogressively',
 'billbroking',
 'supradural',
 'dispeople',
 'barriness',
 'septoic',
 'deoxidator',
 'bagattino',
 'trisyllable',
 'copalm',
 'rhodanate',
 'saccharomycosis',
 'rechaser',
 'emmetropy',
 'unendangered',
 'vacuum',
 'patrician',
 'metis',
 'thirstful',
 'trashily',
 'handhole',
 'tetractinellida',
 'variolation',
 'lardizabalaceae',
 'acromyotonia',
 'prelusory',
 'belduque',
 'bantery',
 'clev

# 8. Define a function for spelling correction in any given input sentence:

       1. Convert the sentence to lowercase and tokenize it.
   
   For each term in the tokenized sentence:

       2. Check if the term is in the list of valid words (valid_words_set).

       3. If yes, keep the word as is.

       4. If no, get the correct word using the get_correct_term function.

       5. Return the joined string as output.

In [9]:
def correct_spelling(sentence):
    tokenized_sentence = nltk.word_tokenize(sentence.lower())
    # Stop words, punctuation, and valid words pass through; anything else is corrected
    corrected_sentence = [word if word in stop_words_combined or word in valid_words_set
                          else get_correct_term(word, unique_normalized_valid_words)
                          for word in tokenized_sentence]
    return ' '.join(corrected_sentence)

# 9. Test the function for the input sentence “The new abacos is great”.

In [10]:
input_sentence = "The new abacos is great"
output_sentence = correct_spelling(input_sentence)
print("Input Sentence:", input_sentence)
print("Corrected Sentence:", output_sentence)

Input Sentence: The new abacos is great
Corrected Sentence: the new abas is great
