# Spell Checker

Another possible use of text similarity is correcting spelling mistakes. For instance, let's say we've got the following misspelled text:

"it could be a greal busines"

Humans can easily spot spelling mistakes by checking a dictionary, but this task is more challenging for a computer. In this section, we will use **NLTK** and the **Jaccard distance** to implement a basic spell checker.

### What Do We Need to Create a Spell Checker?
1. **A Corpus**: This serves as the ground truth or the source of correct spellings. In our case, the corpus is the dictionary we consult for corrections.
2. **A Method**: A way to find the corpus word most similar to a misspelled word. We will use the Jaccard similarity for this task (equivalently, we look for the smallest Jaccard distance, which is one minus the similarity).
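As a rough illustration of the method in item 2, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that value. The helper names below (`jaccard_similarity`, `jaccard_dist`) are hypothetical and written in plain Python; NLTK's `jaccard_distance` is expected to give the same distance for non-empty sets.

```python
from typing import Hashable, Set


def jaccard_similarity(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Share of elements common to both sets: |A & B| / |A | B|."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_dist(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard distance: 0.0 for identical sets, 1.0 for disjoint sets."""
    return 1.0 - jaccard_similarity(a, b)


# Two of the four distinct elements are shared.
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
print(jaccard_dist({"a", "b", "c"}, {"b", "c", "d"}))        # 0.5
```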

### Using n-grams:
Instead of comparing words letter by letter, we will compare them by breaking them into **n-grams**, which are sequences of `n` consecutive characters. Two words that differ by a single typo still share most of their n-grams, so the overlap between their n-gram sets is a useful measure of how similar they are.
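To make the idea concrete, here is a minimal plain-Python sketch of character n-grams. The function name `char_ngrams` is hypothetical and not part of NLTK; it returns tuples of characters so that the result has the same shape as the tuples produced by `nltk.util.ngrams`.

```python
from typing import Set, Tuple


def char_ngrams(word: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of character n-grams of `word`, each as a tuple."""
    # Slide a window of width n over the word; words shorter than n
    # produce an empty set because the range is empty.
    return {tuple(word[i:i + n]) for i in range(len(word) - n + 1)}


print(sorted(char_ngrams("great", 3)))
# [('e', 'a', 't'), ('g', 'r', 'e'), ('r', 'e', 'a')]

# "greal" and "great" share two of their three trigrams.
print(len(char_ngrams("greal", 3) & char_ngrams("great", 3)))  # 2
```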

### Computing n-grams using NLTK:

In [1]:
from typing import Set, Tuple
from nltk.util import ngrams

def ngram_gen(word: str, n: int) -> Set[Tuple[str, ...]]:
    """
    Generate character n-grams from a word.
    
    Parameters
    ----------
    word: str
        The word to generate n-grams from.
    n: int
        The size of the n-grams.

    Returns
    -------
    Set[Tuple[str, ...]]
        A set of n-grams, each represented as a tuple of characters.
    """
    return ...

# test the function
word = "test"
ngram_gen(word, 2)

{('e', 's'), ('s', 't'), ('t', 'e')}

### Suggestions for a Misspelled Word:

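To suggest a correction for a misspelled word, we need the two ingredients listed earlier:
- A corpus (NLTK provides a corpus of English words).
- A method to find the corpus word most similar to the misspelled word. We again use the Jaccard similarity, computed here through NLTK's `jaccard_distance`.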
    

In [7]:
from nltk.metrics.distance import jaccard_distance
import nltk

nltk.download('words')

from nltk.corpus import words

def get_recommended_word(word: str, n: int = 3) -> str:
    """
    Get the most similar word from the NLTK `words` corpus based on the
    Jaccard similarity of character n-grams.
    
    Parameters
    ----------
    word: str
        The misspelled word. Words shorter than n are returned unchanged.
    n: int
        The size of the n-grams.

    Returns
    -------
    str
        The most similar word from the corpus.
    """
    if len(word) < n:
        return word
    
    # get our corpus of words
    corpus = set(words.words())
    
    # generate n-grams for the misspelled word
    word_ngrams = ...
    
    # compute the Jaccard distance between the n-grams of the misspelled word
    # and the n-grams of each word in the corpus
    similarities = ...
    
    # get the word with the highest Jaccard similarity (i.e., the lowest Jaccard distance)
    recommended_word = ...
    
    return recommended_word

# test the function
sentence = "it could be a greal busines"
corrected_sentence = " ".join([get_recommended_word(word) for word in sentence.split()])
corrected_sentence

[nltk_data] Downloading package words to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package words is already up-to-date!


'it could be a great business'

In [9]:
# another test (a sentence with a misspelled word)
sentence = "i love to eat choclate"
corrected_sentence = " ".join([get_recommended_word(word) for word in sentence.split()])
corrected_sentence

'i love to eat chocolate'

In [13]:
# another test (a sentence with a misspelled word)
# this one fails: "appls" is corrected to "apply" instead of a form of "apple"
sentence = "i love to eat appls"
corrected_sentence = " ".join([get_recommended_word(word) for word in sentence.split()])
corrected_sentence

'i love to eat apply'
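The failure above is consistent with how trigram overlap behaves: "appls" shares exactly two trigrams with both "apple" and "apply", so the two candidates are tied and the winner depends on iteration order. The sketch below reproduces this on a small hand-made corpus. Everything in it is an assumption for illustration: `char_ngrams`, `jaccard_dist`, `closest_word`, and `TOY_CORPUS` are hypothetical names, the corpus is not the NLTK word list, and the code is plain Python rather than the notebook's own solution.

```python
from typing import Sequence, Set, Tuple

NGram = Tuple[str, ...]


def char_ngrams(word: str, n: int) -> Set[NGram]:
    """Set of character n-grams of `word` (empty if the word is shorter than n)."""
    return {tuple(word[i:i + n]) for i in range(len(word) - n + 1)}


def jaccard_dist(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard distance: 1 - |A & B| / |A | B| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_word(word: str, corpus: Sequence[str], n: int = 3) -> str:
    """Return the corpus word whose n-gram set is closest to that of `word`."""
    if len(word) < n:
        return word  # too short to form an n-gram; leave it untouched
    target = char_ngrams(word, n)
    # min() keeps the first candidate when several share the lowest distance.
    return min(corpus, key=lambda cand: jaccard_dist(target, char_ngrams(cand, n)))


# Hypothetical toy corpus, used only for this illustration.
TOY_CORPUS = ["apple", "apply", "business", "chocolate", "great"]

print(closest_word("greal", TOY_CORPUS))    # great
print(closest_word("busines", TOY_CORPUS))  # business

# "apple" and "apply" are equally distant from "appls".
for candidate in ("apple", "apply"):
    print(candidate, jaccard_dist(char_ngrams("appls", 3), char_ngrams(candidate, 3)))
# apple 0.5
# apply 0.5
```

Because of the tie, reordering the corpus changes the answer: `closest_word("appls", ["apply", "apple"])` returns `"apply"`. Breaking ties with extra information, such as word frequency or edit distance, is one possible refinement.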