# Create Your Own Spell Checker
Objective: Create a spell checker that corrects the misspelled words in a given sentence.
Problem Statement: While typing or sending a message to someone, we often make 
spelling mistakes. Write a script that corrects the misspelled words in a sentence. 
The input will be a raw string, and the output will be a string with the case normalized 
and the misspelled words corrected.
Domain: General
Analysis to be done: Word availability in the corpus
Content: 
Dataset: None
We will be using NLTK’s inbuilt corpora (words, stop words, etc.) and no specific dataset.
Steps to perform:
While there are several approaches to correcting spelling, you will use the Levenshtein 
(edit distance) approach. 
The approach will be straightforward for correcting a word: 
▪ If the word is present in a list of valid words, the word is correct.
▪ If the word is absent from the valid word list, we will find the correct 
word, i.e., the word from the valid word list which has the lowest edit 
distance from the target word.
Once you define this function, you will iterate over the terms in the given sentence, 
correct the terms identified as incorrect, and return a joined string with all the terms. 
To help speed up execution, you won’t apply the spell check to stop words
and punctuation.
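
To make the edit distance idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance and of picking the closest word from a list. The function name `levenshtein` and the toy vocabulary are illustrative assumptions; the notebook itself relies on NLTK's `edit_distance` and NLTK's word list.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous[j] is the distance between the characters of source seen
    # so far (minus the current one) and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical toy vocabulary, used only for illustration.
toy_vocabulary = ["abacus", "aback", "great", "new"]
word = "abacos"

# Distance from the misspelled word to every candidate.
distances = {candidate: levenshtein(word, candidate) for candidate in toy_vocabulary}

# The candidate with the lowest distance is the suggested correction.
closest = min(distances, key=distances.get)
print(distances)
print(closest)
```

With this toy list, "abacus" is one substitution away from "abacos", so it is chosen; on a real word list the result depends entirely on which candidates are searched.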

Tasks: 
1. Get a list of valid words in the English language using NLTK’s list of words (Hint: 
use nltk.download('words') to get the raw list).
2. Look at the first 20 words in the list. Is the casing normalized?
3. Normalize the casing for all the terms.
4. Normalizing may have introduced some duplicates; create a unique list after normalizing.
5. Create a list of stop words which should include: 
i. Stop words from NLTK
ii. All punctuation (Hint: use 'punctuation' from the string module)
iii. Final list should be a combination of these two
6. Define a function to correct a single term
• For a given term, find its edit distance with each term in the valid word 
list. To speed up execution, you can use the first 20,000 entries in the 
valid word list.
• Store the results in a dictionary, with the term as the key and the edit distance as 
the value.
• Sort the dictionary in ascending order of the values.
• Return the first entry in the sorted result (the term with the minimum edit 
distance).
• Using the function, get the correct word for committee.
7. Make a set from the list of valid words, for faster lookup of whether a word is in the 
valid list or not.
8. Define a function for spelling correction in any given input sentence:
1. Convert the sentence to lowercase and tokenize it.
For each term in the tokenized sentence:
2. Check if the term is in the list of valid words (valid_words_set).
3. If yes, keep the word as is.
4. If no, get the correct word using the get_correct_term function.
5. Return the joined string as output.
9. Test the function for the input sentence “The new abacos is great”

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.




In [9]:
from nltk import wsd
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
from spacy.cli import download
from spacy import load
import warnings

nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('wordnet2022')


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet2022 to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet2022 is already up-to-date!


# 1

In [11]:
import nltk

# Download the words corpus if not already downloaded
nltk.download('words')

# Get the list of valid English words
valid_words = nltk.corpus.words.words()

# Print the first 20 words in the list
print("First 20 words in the list:", valid_words[:20])


First 20 words in the list: ['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab', 'aba', 'Ababdeh', 'Ababua', 'abac']


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
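
To answer the question of whether the casing is normalized, the 20 entries printed above can be checked directly. This is a small sketch that hard-codes those 20 words from the output; the variable names are illustrative.

```python
# First 20 entries, copied from the output above.
first_20 = ['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf',
            'Aaron', 'Aaronic', 'Aaronical', 'Aaronite', 'Aaronitic', 'Aaru', 'Ab',
            'aba', 'Ababdeh', 'Ababua', 'abac']

# A word counts as normalized here if lowercasing leaves it unchanged.
not_lowercase = [word for word in first_20 if word != word.lower()]
is_normalized = len(not_lowercase) == 0

print("Casing normalized:", is_normalized)
print("Entries that are not lowercase:", not_lowercase)
```

Several entries (for example 'A', 'Aani', and 'Aaron') start with a capital letter, so the raw list is not case-normalized.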


# 2

In [12]:
import nltk

# Download the words corpus if not already downloaded
nltk.download('words')

# Get the list of valid English words and normalize the casing
valid_words = [word.lower() for word in nltk.corpus.words.words()]

# Print the first 20 words in the list with normalized casing
for word in valid_words[:20]:
    print(word)


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


a
a
aa
aal
aalii
aam
aani
aardvark
aardwolf
aaron
aaronic
aaronical
aaronite
aaronitic
aaru
ab
aba
ababdeh
ababua
abac


# 3

In [13]:
import nltk

# Download the words corpus if not already downloaded
nltk.download('words')

# Get the list of valid English words and normalize the casing
valid_words = [word.lower() for word in nltk.corpus.words.words()]

# Print the first 20 words in the list with normalized casing
print("First 20 words in the list with normalized casing:", valid_words[:20])



[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


First 20 words in the list with normalized casing: ['a', 'a', 'aa', 'aal', 'aalii', 'aam', 'aani', 'aardvark', 'aardwolf', 'aaron', 'aaronic', 'aaronical', 'aaronite', 'aaronitic', 'aaru', 'ab', 'aba', 'ababdeh', 'ababua', 'abac']


# 4

In [14]:
import nltk

# Download the words corpus if not already downloaded
nltk.download('words')

# Get the list of valid English words and normalize the casing
valid_words = [word.lower() for word in nltk.corpus.words.words()]

# Create a unique list after normalizing casing
unique_valid_words = list(set(valid_words))

# Print the first 20 words in the unique list
print("First 20 words in the unique list with normalized casing:", unique_valid_words[:20])


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


First 20 words in the unique list with normalized casing: ['tonometry', 'dishonorable', 'anklejack', 'phaenogamia', 'nivellation', 'bidential', 'pachyvaginitis', 'stagnum', 'letteret', 'lepidosaurian', 'inedibility', 'placentalian', 'kuroshio', 'flatfish', 'bridebowl', 'taranchi', 'cataloguer', 'latrine', 'inferrer', 'aphakial']
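
Note that `list(set(...))` returns the words in an arbitrary order, as the output above shows. If a stable order matters (for example, when a later step slices the first 20,000 entries), the following sketch shows two order-preserving alternatives. The short sample list is taken from the normalized output earlier in the notebook; the variable names are illustrative.

```python
# A few normalized entries, including the duplicate 'a' created by lowercasing 'A'.
normalized = ['a', 'a', 'aa', 'aal', 'aalii', 'aam', 'aani', 'aardvark']

# Option 1: dict keys are unique and keep insertion order.
unique_in_order = list(dict.fromkeys(normalized))

# Option 2: deduplicate with a set, then sort alphabetically.
unique_sorted = sorted(set(normalized))

print(unique_in_order)
print(unique_sorted)
```

Either option gives a reproducible list, whereas the order of a plain `list(set(...))` can differ from one run to the next.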


# 5

In [15]:
import nltk
import string

# Download the NLTK stop words if not already downloaded
nltk.download('stopwords')

# Get the list of stop words from NLTK
stop_words_nltk = set(nltk.corpus.stopwords.words('english'))

# Get all punctuations from the string module
punctuations = set(string.punctuation)

# Create the final list of stop words by combining the two sets
stop_words_final = stop_words_nltk.union(punctuations)

# Print the first 20 words in the final list
print("First 20 words in the final list of stop words:", list(stop_words_final)[:20])


First 20 words in the final list of stop words: ['too', 'your', 'were', 'll', 't', "aren't", 'during', 'up', 'them', 'other', "won't", '<', 'over', "it's", 'don', ':', 'him', "hasn't", 'against', 'each']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 6

In [16]:
import nltk
from nltk.metrics import edit_distance

# Download the words corpus if not already downloaded
nltk.download('words')

def get_correct_term(target_term, valid_words_list):
    # Use the first 20,000 entries in the valid word list
    valid_words_list = valid_words_list[:20000]

    # Store the edit distances in a dictionary
    edit_distances = {word: edit_distance(target_term, word) for word in valid_words_list}

    # Sort the dictionary in ascending order of edit distances
    sorted_distances = sorted(edit_distances.items(), key=lambda x: x[1])

    # Return the first entry in the sorted result (minimum edit distance)
    return sorted_distances[0][0]

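# Example: Get the correct word for 'committee'
# valid_words_list is the lowercased word list; it is built in the cell for
# step 7 and must already exist in the session before this call.
correct_word_committee = get_correct_term("committee", valid_words_list)
print("Correct word for 'committee':", correct_word_committee)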


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Correct word for 'committee': commitment
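
Building a dictionary and sorting it works, but only the minimum is needed. The sketch below is one possible alternative, not the notebook's method: it scans the candidates once and skips any word whose length difference alone already rules it out. The helper `edit_distance` here is a pure-Python stand-in for NLTK's function, and `closest_term`, the misspelling "commitee", and the candidate list are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (pure-Python stand-in for nltk's edit_distance)."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (a_char != b_char),  # substitute
                               previous[j] + 1,                       # delete
                               current[j - 1] + 1))                   # insert
        previous = current
    return previous[-1]


def closest_term(target, candidates):
    """Return (best_word, best_distance); ties go to the earliest candidate."""
    best_word, best_distance = None, None
    for word in candidates:
        # The distance is never smaller than the length difference, so a
        # candidate failing this check cannot beat the current best.
        if best_distance is not None and abs(len(word) - len(target)) >= best_distance:
            continue
        distance = edit_distance(target, word)
        if best_distance is None or distance < best_distance:
            best_word, best_distance = word, distance
            if distance == 0:  # exact match, nothing can be better
                break
    return best_word, best_distance


# Hypothetical candidates and a hypothetical misspelling of "committee".
toy_candidates = ["commit", "commitment", "committal", "committee", "comity"]
print(closest_term("commitee", toy_candidates))
```

Like the sorted-dictionary version, this returns the earliest candidate when several share the minimum distance, since Python's sort is stable and dictionaries keep insertion order.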


# 7

In [17]:
import nltk

# Download the words corpus if not already downloaded
nltk.download('words')

# Get the list of valid English words and normalize the casing
valid_words_list = [word.lower() for word in nltk.corpus.words.words()]

# Create a set from the list of valid words for faster lookup
valid_words_set = set(valid_words_list)

# Example: Check if a word is in the valid words set
word_to_check = 'example'
if word_to_check in valid_words_set:
    print(f"'{word_to_check}' is in the valid words list.")
else:
    print(f"'{word_to_check}' is not in the valid words list.")


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


'example' is in the valid words list.
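
To see why a set is preferred for membership tests, the following sketch times the same lookup against a list and a set using the standard `timeit` module. The vocabulary is made up of placeholder tokens rather than the NLTK word list, and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in vocabulary: 50,000 made-up tokens.
vocabulary_list = [f"word{i}" for i in range(50_000)]
vocabulary_set = set(vocabulary_list)

probe = "word49999"  # worst case for the list: the last element

# A list is scanned element by element; a set uses a hash lookup.
list_seconds = timeit.timeit(lambda: probe in vocabulary_list, number=200)
set_seconds = timeit.timeit(lambda: probe in vocabulary_set, number=200)

print(f"list lookup: {list_seconds:.4f} s for 200 checks")
print(f"set lookup:  {set_seconds:.6f} s for 200 checks")
```

Both containers give the same answer; only the cost of getting it differs, and the gap grows with the size of the vocabulary.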


# 8

In [18]:
import nltk
from nltk.metrics import edit_distance

# Download the words corpus if not already downloaded
nltk.download('words')

def get_valid_words_set():
    # Get the list of valid English words and normalize the casing
    valid_words_list = [word.lower() for word in nltk.corpus.words.words()]
    # Create a set from the list of valid words for faster lookup
    return set(valid_words_list)

def get_correct_term(target_term, valid_words_list):
    # Use the first 20,000 entries in the valid word list
    valid_words_list = valid_words_list[:20000]

    # Store the edit distances in a dictionary
    edit_distances = {word: edit_distance(target_term, word) for word in valid_words_list}

    # Sort the dictionary in ascending order of edit distances
    sorted_distances = sorted(edit_distances.items(), key=lambda x: x[1])

    # Return the first entry in the sorted result (minimum edit distance)
    return sorted_distances[0][0]

def correct_spelling(input_sentence, valid_words_set):
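    # Tokenize the input sentence after making all terms lowercase
    # (nltk.word_tokenize needs NLTK's Punkt tokenizer data to be available)
    tokenized_sentence = nltk.word_tokenize(input_sentence.lower())

    # Correct spelling for each term in the tokenized sentence.
    # Note: list(valid_words_set) has no guaranteed order, so the 20,000-word slice
    # taken inside get_correct_term (and therefore the suggestion) can vary between runs.
    corrected_sentence = [term if term in valid_words_set else get_correct_term(term, list(valid_words_set)) for term in tokenized_sentence]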

    # Return the joined string as output
    return ' '.join(corrected_sentence)

# Example usage:
input_sentence = "The new abacos is great"
valid_words_set = get_valid_words_set()
corrected_sentence = correct_spelling(input_sentence, valid_words_set)
print("Corrected Sentence:", corrected_sentence)


[nltk_data] Downloading package words to
[nltk_data]     C:\Users\TmC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Corrected Sentence: the new abatis is great


# 9

In [19]:
# Test the function with the input sentence, reusing get_valid_words_set and
# correct_spelling as defined in the previous cell
input_sentence = "The new abacos is great"
valid_words_set = get_valid_words_set()
corrected_sentence = correct_spelling(input_sentence, valid_words_set)
print("Input Sentence:", input_sentence)
print("Corrected Sentence:", corrected_sentence)


Input Sentence: The new abacos is great
Corrected Sentence: the new abatis is great
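
The problem statement asks for stop words and punctuation to be skipped, which the function above does not do explicitly. The following is a minimal, self-contained sketch of that variant. Everything in it is an assumption made for illustration: the tiny vocabulary and stop word set are hypothetical, the regex tokenizer replaces `nltk.word_tokenize`, and `edit_distance` is a pure-Python stand-in for NLTK's function.

```python
import re
import string


def edit_distance(a, b):
    """Plain Levenshtein distance (pure-Python stand-in for nltk's edit_distance)."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (a_char != b_char),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


# Hypothetical toy data; the notebook uses NLTK's words and stopwords corpora.
toy_vocabulary = ["abacus", "abatis", "aback", "great", "new"]
toy_stop_words = {"the", "is"} | set(string.punctuation)


def correct_sentence(sentence, vocabulary, stop_words):
    """Lowercase, tokenize, and replace unknown terms with the closest valid word."""
    vocabulary_set = set(vocabulary)  # fast membership checks
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    corrected = []
    for token in tokens:
        if token in stop_words or token in vocabulary_set:
            corrected.append(token)  # skip stop words, punctuation, valid words
        else:
            closest = min(vocabulary, key=lambda word: edit_distance(token, word))
            corrected.append(closest)
    return " ".join(corrected)


print(correct_sentence("The new abacos is great!", toy_vocabulary, toy_stop_words))
```

With this toy vocabulary the misspelled term maps to "abacus", which is one edit away. On the full NLTK list the suggestion depends on which 20,000 entries are searched, which is consistent with the "abatis" result shown above.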
