# Spell Correction

Spelling correction is a common preprocessing step in NLP: it cleans and standardizes text before downstream tasks see it. This notebook tries the following methods and tools on the same misspelled sentence:

- TextBlob
- PySpellChecker
- SymSpell
- Hunspell
- BERT for Contextual Spell Checking
- ContextualSpellCheck

## TextBlob
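TextBlob is a simple library for processing textual data. Its `correct()` method offers a straightforward way to perform spell correction; according to the TextBlob documentation, it follows Peter Norvig's spelling corrector and is about 70% accurate.
To get the most out of TextBlob, download its corpora: `python -m textblob.download_corpora`.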

In [1]:
from textblob import TextBlob

def correct_spelling(text):
    corrected_text = str(TextBlob(text).correct())
    return corrected_text

text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
print(corrected_text)  # Output: I have a spelling mistake

I have a spelling mistake


## PySpellChecker

PySpellChecker is a pure-Python spell checker (`pip install pyspellchecker`). It looks for known words within a small edit distance of the input and returns the most frequent one.

In [2]:
from spellchecker import SpellChecker

spell = SpellChecker()

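def correct_spelling(text):
    words = text.split()
    # correction() can return None in recent versions when no candidate is found,
    # so fall back to the original word to keep join() from failing
    corrected_text = " ".join([spell.correction(word) or word for word in words])
    return corrected_text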

text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
print(corrected_text)  # Output: I have a spelling mistake


I have a spelling mistake
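
The PySpellChecker documentation describes its approach as generating the strings within a small Levenshtein distance of the input and keeping the known word with the highest frequency. The sketch below illustrates that idea at edit distance 1. It is not the library's code: `WORD_FREQ`, its counts, and the helper names `edits1` and `correct_word` are assumptions made up for illustration.

```python
import string

# Toy frequency table; the words and counts are invented for this sketch.
WORD_FREQ = {"i": 500, "have": 300, "a": 900, "spelling": 40, "mistake": 35}

def edits1(word):
    """Return every string one delete, transpose, replace, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word):
    """Pick the most frequent known word within one edit; else keep the input."""
    lowered = word.lower()
    if lowered in WORD_FREQ:
        return word
    candidates = [w for w in edits1(lowered) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(" ".join(correct_word(w) for w in "I havv a speling mistakke".split()))
# I have a spelling mistake
```

A real dictionary contains many near neighbours for each misspelling, which is where the frequency ranking starts to matter.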


## SymSpell

SymSpell is a very efficient spell checker and corrector for processing large amounts of text. It is designed for high performance: fast lookups with low memory consumption.

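Note: The cell below needs `frequency_dictionary_en_82_765.txt`, which is found in the inner `symspellpy` directory. Copy it into your project and point `dictionary_path` at that location (this notebook keeps it under `../data/dict/`).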

In [4]:
from symspellpy import SymSpell, Verbosity

# Initialize the spell checker
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Load the dictionary
dictionary_path = "../data/dict/frequency_dictionary_en_82_765.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

def correct_spelling(text):
    suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
    if suggestions:
        return suggestions[0].term
    return text

text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
print(corrected_text)  # Output (lowercased): i have a spelling mistake


i have a spelling mistake
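
SymSpell gets its speed from the Symmetric Delete idea: delete variants of every dictionary term are precomputed once, and a lookup only generates deletes of the input and checks for a match. The sketch below is a simplified illustration under stated assumptions, not the library's implementation: the toy `WORD_FREQ` table and the helper names are invented, it verifies candidates with plain Levenshtein distance, and it skips the prefix indexing that the real library uses.

```python
from collections import defaultdict
from itertools import combinations

# Toy dictionary with invented counts (an assumption for this sketch).
WORD_FREQ = {"i": 500, "have": 300, "a": 900, "spelling": 40, "mistake": 35}
MAX_DISTANCE = 2

def deletes(word, max_distance=MAX_DISTANCE):
    """All strings obtained by deleting up to max_distance characters."""
    results = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            results.add("".join(word[i] for i in keep))
    return results

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Precompute once: every delete variant points back to its dictionary terms.
DELETE_INDEX = defaultdict(set)
for term in WORD_FREQ:
    for variant in deletes(term):
        DELETE_INDEX[variant].add(term)

def lookup(word):
    """Return the closest (then most frequent) term within MAX_DISTANCE."""
    lowered = word.lower()
    candidates = set()
    for variant in deletes(lowered):
        candidates |= DELETE_INDEX.get(variant, set())
    scored = [(levenshtein(lowered, c), -WORD_FREQ[c], c) for c in candidates]
    scored = [s for s in scored if s[0] <= MAX_DISTANCE]
    return min(scored)[2] if scored else word

print([lookup(w) for w in ["havv", "speling", "mistakke"]])
# ['have', 'spelling', 'mistake']
```

Because only deletes are generated at lookup time, the candidate set stays small even for long words, at the cost of a larger precomputed index.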


## Hunspell

[Hunspell](https://hunspell.github.io/) is a spell checker and morphological analyzer designed for languages with rich morphology, complex word compounding, or complex character encoding.

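To install it on Mac, follow this [guide](https://pankdm.github.io/hunspell.html). The Python bindings imported below are a separate install (for example `pip install hunspell`); without them the import fails with `ModuleNotFoundError`, as in the recorded output of this cell.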

In [5]:
import hunspell

# Initialize the spell checker (adjust the paths to where your dictionaries are installed)
h = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')

def correct_word(word):
    # Leave correctly spelled words alone; otherwise take the first suggestion
    if h.spell(word):
        return word
    suggestions = h.suggest(word)
    return suggestions[0] if suggestions else word

def correct_spelling(text):
    return " ".join(correct_word(word) for word in text.split())

text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
print(corrected_text)


ModuleNotFoundError: No module named 'hunspell'
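
The PySpellChecker and Hunspell cells split on whitespace, so punctuation stays glued to words and a token such as `mistakke.` is looked up with its period. A minimal sketch of one way around this is to correct only alphabetic tokens with a regular expression and leave everything else in place. The names `correct_tokens`, `toy_corrector`, and the `FIXES` table are hypothetical stand-ins; any function that maps a word to a suggestion (or `None`) could be passed instead.

```python
import re

# Stand-in corrector for the sketch; replace with a real library call.
FIXES = {"havv": "have", "speling": "spelling", "mistakke": "mistake"}

def toy_corrector(word):
    """Return a suggestion for a lowercase word, or None if there is none."""
    return FIXES.get(word)

def correct_tokens(text, corrector):
    """Correct alphabetic tokens only, leaving punctuation and spacing intact."""
    def fix(match):
        word = match.group(0)
        suggestion = corrector(word.lower()) or word
        # Restore the leading capital of the original word, if it had one.
        if word[0].isupper() and suggestion != word:
            suggestion = suggestion.capitalize()
        return suggestion
    # Words, optionally with an internal apostrophe (e.g. "I'm").
    return re.sub(r"[A-Za-z]+(?:'[A-Za-z]+)?", fix, text)

print(correct_tokens("Havv you seen my speling mistakke?", toy_corrector))
# Have you seen my spelling mistake?
```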

## BERT for Contextual Spell Checking

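For more advanced, context-aware spell checking, you can use transformer models like BERT. The approach masks a word and asks a pre-trained language model to predict it from the surrounding context. Note the limitation: the model never sees the masked word, so on its own it proposes words that fit the sentence rather than corrections of the original spelling, as the outputs in this section show.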

In [30]:
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

In [29]:
from transformers import pipeline

# Build the pipeline once; loading the model on every call is slow
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

def correct_spelling(text):
    words = text.split()
    corrected_text = []
    for i in range(len(words)):
        # Mask by position: text.replace(word, '[MASK]') would also mask
        # substrings of other words (e.g. the "a" inside "havv"), and with
        # several masks the pipeline returns a nested list instead of dicts
        masked_words = words.copy()
        masked_words[i] = fill_mask.tokenizer.mask_token
        predictions = fill_mask(" ".join(masked_words))
        # With a single mask, predictions is a list of dicts sorted by score
        corrected_text.append(predictions[0]['token_str'] if predictions else words[i])
    return " ".join(corrected_text)

text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
# The model never sees the masked word, so this prints a guess from context, not a correction
print(corrected_text)


In [28]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

def correct_spelling(text):
    words = text.split()
    corrected_text = []

    for i, word in enumerate(words):
        masked_sentence = words.copy()
        masked_sentence[i] = tokenizer.mask_token
        masked_sentence = " ".join(masked_sentence)
        
        # Encode the masked sentence
        input_ids = tokenizer.encode(masked_sentence, return_tensors='pt')
        
        # Get predictions
        with torch.no_grad():
            outputs = model(input_ids)
            predictions = outputs.logits
        
        # Get the predicted token
        mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
        predicted_token_id = predictions[0, mask_token_index, :].argmax(dim=-1)
        predicted_token = tokenizer.decode(predicted_token_id)
        
        # The mask hides the original word, so the prediction is used as-is
        corrected_text.append(predicted_token)
    
    return " ".join(corrected_text)

# Example usage
text = "I havv a speling mistakke"
corrected_text = correct_spelling(text)
print(corrected_text)  # Fluent but wrong: masking hides the misspelled word from the model


i am a good .


In [27]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
# Mask only the misspelled word "speling" to inspect BERT's top candidates
unmasker("I have a [MASK] mistakke.")

[{'score': 0.08409816026687622,
  'token': 2047,
  'token_str': 'new',
  'sequence': 'i have a new mistakke.'},
 {'score': 0.06373114138841629,
  'token': 2210,
  'token_str': 'little',
  'sequence': 'i have a little mistakke.'},
 {'score': 0.05808905139565468,
  'token': 2204,
  'token_str': 'good',
  'sequence': 'i have a good mistakke.'},
 {'score': 0.020978424698114395,
  'token': 2919,
  'token_str': 'bad',
  'sequence': 'i have a bad mistakke.'},
 {'score': 0.019894922152161598,
  'token': 2307,
  'token_str': 'great',
  'sequence': 'i have a great mistakke.'}]
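
The predictions above fit the context but none resembles `speling`, because the masked model ignores the original spelling. One possible remedy is to keep only predictions that look like the original word. The sketch below does this with `difflib` from the standard library. The function name `pick_candidate`, the 0.7 threshold, and the extra `spelling` candidate with its score are assumptions for illustration; the first three entries are the recorded predictions with rounded scores.

```python
from difflib import SequenceMatcher

def pick_candidate(original, predictions, min_similarity=0.7):
    """Return the highest-scoring prediction that resembles the original word.

    `predictions` is a list of dicts with 'token_str' and 'score' keys, the
    shape returned by the fill-mask pipeline for a single mask. If nothing is
    similar enough, the original word is returned unchanged.
    """
    best = None
    for pred in predictions:
        similarity = SequenceMatcher(None, original.lower(), pred["token_str"]).ratio()
        if similarity >= min_similarity and (best is None or pred["score"] > best["score"]):
            best = pred
    return best["token_str"] if best else original

# Top predictions recorded above for "I have a [MASK] mistakke." (scores rounded)
recorded = [
    {"token_str": "new", "score": 0.0841},
    {"token_str": "little", "score": 0.0637},
    {"token_str": "good", "score": 0.0581},
]
print(pick_candidate("speling", recorded))      # speling (nothing is close enough)

# Hypothetical longer candidate list; the extra entry and its score are invented.
hypothetical = recorded + [{"token_str": "spelling", "score": 0.001}]
print(pick_candidate("speling", hypothetical))  # spelling
```

In practice this means requesting many more candidates from the model than the default five, since the right word may rank low on context alone.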

## ContextualSpellCheck

[contextualSpellCheck](https://github.com/R1j1t/contextualSpellCheck) is a spaCy pipeline component based on BERT. It currently focuses on correcting out-of-vocabulary (OOV) words, also called non-word errors (NWE), and uses BERT so that the surrounding context informs the correction.

The package is still being developed: planned work includes detecting real-word errors (RWE), optimising the package, and improving the documentation.

To install this package you need:
- `pip install git+https://github.com/roy-ht/editdistance.git@v0.6.2`
- `pip install contextualSpellCheck`
- `python -m spacy download en_core_web_sm` (the spaCy model loaded in the cells below)

In [23]:
import spacy
import contextualSpellCheck

nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)
doc = nlp('I havv a speling mistakke.')

print(doc._.performed_spellCheck)  # True if the spell checker changed anything
print(doc._.outcome_spellCheck)    # The text after correction

True
I have av attack.


In [26]:
import spacy
import contextualSpellCheck

# Load a spaCy model
nlp = spacy.load('en_core_web_sm')

# Add the contextual spell checker to the spaCy pipeline
contextualSpellCheck.add_to_pipe(nlp)

# Example text with a spelling error
text = "I havv a speling mistakke."

# Process the text with spaCy
doc = nlp(text)

# Check for spelling errors
if doc._.performed_spellCheck:
    corrected_text = doc._.outcome_spellCheck
    print("Corrected Text:", corrected_text)
else:
    print("No spelling errors found.")


Corrected Text: I have av attack.


## Some other packages

- [NeuSpell](https://github.com/neuspell/neuspell): a neural spelling correction toolkit.

## Grammar Error Correction (GEC) with BERT

The [GECwBERT](https://github.com/sunilchomal/GECwBERT) library applies BERT to grammatical error correction.