# Spelling Corrector

### About the dataset
The public version of the Corpus of Linguistic Acceptability (CoLA) contains 9,594 sentences drawn from its training and development sets, each labeled for grammatical acceptability. The dataset used in this project is derived from CoLA by removing the grammatically correct sentences; each remaining sentence (`input_text`) is paired with a corrected version (`target_text`).

##### Dataset download links:
- in_domain_train.tsv: https://github.com/nyu-mll/CoLA-baselines/blob/master/acceptability_corpus/cola_public/raw/in_domain_train.tsv
- in_domain_dev.tsv: https://github.com/nyu-mll/CoLA-baselines/blob/master/acceptability_corpus/cola_public/raw/in_domain_dev.tsv

In [1]:
import pandas as pd
from symspellpy import SymSpell
import torch
from transformers import AutoTokenizer
from language_tool_python import LanguageTool

In [2]:
df = pd.read_csv(r'../data/grammar_data.csv')
df.head()

Unnamed: 0,input_text,target_text
0,"As you eat the most, you want the least.","As you eat more, you desire less."
1,"The more you would want, the less you would eat.","The more you desire, the less you eat."
2,"I demand that the more John eat, the more he p...","I demand that the more John eats, the more he ..."
3,"The more does Bill smoke, the more Susan hates...","The more Bill smokes, the more Susan hates him."
4,Who does John visit Sally because he likes?,Whom does John visit because he likes Sally?


In [3]:
from datasets import load_dataset
dataset = load_dataset("csv", data_files=r"../data/grammar_data.csv")
split_dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

In [4]:
print(train_dataset.shape)
print(test_dataset.shape)

(412, 2)
(104, 2)


The dictionary used by SymSpell is built by combining the following files:
- core-wordnet.txt: https://wordnetcode.princeton.edu/standoff-files/core-wordnet.txt
- en-80k.txt: blob:https://github.com/f5083720-193f-45aa-8469-2b47b81d9314
- morphosemantic-links.xls: https://wordnetcode.princeton.edu/standoff-files/morphosemantic-links.xls
- teleological-links.xls: https://wordnetcode.princeton.edu/standoff-files/teleological-links.xls

In [5]:
en_80k = pd.read_csv(r"../data/en-80k.txt", sep="\t", header=None, names=["word1", "frequency"])
core_wordnet = pd.read_csv(r"../data/core-wordnet.txt", sep="\t", header=None, names=["word1", "relation", "word2", "gloss1", "gloss2"])
teleological = pd.read_csv(r"../data/teleological-links.txt", sep="\t", header=None, names=["word1", "relation", "word2"])
morphosemantic = pd.read_csv(r"../data/morphosemantic-links.txt", sep="\t", header=None, names=["word1", "relation", "word2", "gloss1", "gloss2"])

teleological["gloss1"] = ""
teleological["gloss2"] = ""

en_80k["relation"] = "frequency"
en_80k["word2"] = ""
en_80k["gloss1"] = ""
en_80k["gloss2"] = ""

# Combine all datasets
combined = pd.concat([teleological, morphosemantic, core_wordnet, en_80k], ignore_index=True)

# Save the combined file
combined.to_csv(r"../data/combined-dictionary.txt", sep="\t", index=False)

print("Combined dictionary saved as 'combined-dictionary.txt'")


Combined dictionary saved as 'combined-dictionary.txt'


In [6]:
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

dictionary_path = r"../data/combined-dictionary.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True
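
To my understanding, `load_dictionary` expects one term and one integer count per line, split on a separator that defaults to a single space, and it returns `True` as long as the file can be opened, so the `True` above does not by itself show that any entries were loaded. A quick sanity check on the file can catch a format mismatch early. The sketch below is a stdlib-only approximation of that parsing rule, not symspellpy's own code; `count_parsable_lines` is a hypothetical helper and the sample lines are made up.

```python
def count_parsable_lines(lines, term_index=0, count_index=1, separator=" "):
    """Count lines that yield a term and an integer count (hypothetical helper)."""
    parsable = 0
    for line in lines:
        parts = line.rstrip().split(separator)
        # Skip lines that do not have both requested columns
        if len(parts) <= max(term_index, count_index):
            continue
        try:
            int(parts[count_index])
        except ValueError:
            # The count column is not an integer, so the line would be unusable
            continue
        parsable += 1
    return parsable


# Made-up sample: a tab-separated header, a tab-separated row, a space-separated row
sample = [
    "word1\trelation\tword2\n",
    "abandon\tverb\tabandonment\n",
    "the 23135851162\n",
]
print(count_parsable_lines(sample))                  # split on spaces
print(count_parsable_lines(sample, separator="\t"))  # split on tabs
```

To check the real file, pass an open file handle for `dictionary_path` with the same `term_index`, `count_index`, and separator used in the `load_dictionary` call, and compare the result with the total number of lines.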

In [7]:
def correct_spelling(texts, sym_spell):
    corrected_texts = []
    for text in texts:
        suggestions = sym_spell.lookup_compound(text, max_edit_distance=2)
        corrected_texts.append(suggestions[0].term if suggestions else text)
    return corrected_texts

In [8]:
# LanguageTool for grammar correction (if needed)
tool = LanguageTool('en')

def correct_grammar(texts):
    # Apply LanguageTool's suggested corrections to each text
    corrected_texts = []
    for text in texts:
        corrected_texts.append(tool.correct(text))
    return corrected_texts

In [9]:
# Preprocess and tokenize the dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def preprocess_function(examples):
    inputs = tokenizer(examples['input_text'], max_length=128, truncation=True, padding='max_length')
    targets = tokenizer(examples['target_text'], max_length=128, truncation=True, padding='max_length')
    
    # Adjust labels for training (replace padding token ID with -100)
    labels = [
        [(token_id if token_id != tokenizer.pad_token_id else -100) for token_id in label_ids]
        for label_ids in targets["input_ids"]
    ]
    
    # Add labels to inputs for training
    inputs["labels"] = labels
    
    return inputs
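
The label adjustment above replaces padding positions with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions would not contribute to a training loss. A minimal sketch on plain lists, assuming a pad token ID of 0 (in practice it should be read from `tokenizer.pad_token_id`); `mask_padding` is a hypothetical helper and the token IDs are illustrative:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(batch_input_ids, pad_token_id=0):
    """Replace pad token IDs with IGNORE_INDEX in every sequence (hypothetical helper)."""
    return [
        [token_id if token_id != pad_token_id else IGNORE_INDEX for token_id in seq]
        for seq in batch_input_ids
    ]


# Two illustrative padded sequences; 0 is the assumed pad token ID
batch = [[101, 2023, 102, 0, 0], [101, 102, 0, 0, 0]]
masked = mask_padding(batch)
print(masked)
```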

In [10]:
# Apply preprocessing to datasets
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

In [11]:
# Verifying preprocessing
decoded_refs = []
for label in tokenized_test_dataset['labels']:
    filtered_label = [token if token >= 0 else tokenizer.pad_token_id for token in label]
    decoded_refs.append(tokenizer.decode(filtered_label, skip_special_tokens=True))

In [12]:
# Apply spelling correction
input_texts = test_dataset['input_text']  # Extract input texts from test dataset
spelling_corrected = correct_spelling(input_texts, sym_spell)

In [13]:
# Apply grammar correction
final_corrected = correct_grammar(spelling_corrected)

In [14]:
for corrected_pred, ref in zip(final_corrected[:5], test_dataset['target_text'][:5]):
    print(f"Corrected Prediction: {corrected_pred}")
    print(f"Reference: {ref}")
    print()

Corrected Prediction: Bill pushed harry off the sofa for hours
Reference: Bill pushed Harry off the sofa repeatedly for hours.  

Corrected Prediction: Sharon came the room
Reference: Sharon entered the room.  

Corrected Prediction: The bottle drained the liquid free
Reference: The bottle was drained of its liquid.  

Corrected Prediction: Sam gave the ball out of the basket
Reference: Sam took the ball out of the basket.  

Corrected Prediction: The more pictures of himself appear in the news the more likely john is to get arrested
Reference: The more pictures of himself that appear in the news, the more likely John is to get arrested.  
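
Comparing five pairs by eye only goes so far; a rough per-pair similarity score makes the comparison less anecdotal. The sketch below uses `difflib.SequenceMatcher` from the standard library on lowercased, punctuation-stripped word lists. This is not a metric the notebook computes; `normalize` and `similarity` are hypothetical helpers, and the normalization choices are assumptions.

```python
import string
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def similarity(prediction, reference):
    """Word-level similarity ratio in [0, 1] between two sentences."""
    pred_words = normalize(prediction).split()
    ref_words = normalize(reference).split()
    return SequenceMatcher(None, pred_words, ref_words).ratio()


# Two prediction/reference pairs copied from the output above
pairs = [
    ("Sharon came the room", "Sharon entered the room.  "),
    ("Bill pushed harry off the sofa for hours",
     "Bill pushed Harry off the sofa repeatedly for hours.  "),
]
for pred, ref in pairs:
    print(f"{similarity(pred, ref):.2f}  {pred!r}")
```

Averaging `similarity` over `zip(final_corrected, test_dataset['target_text'])` would give a single coarse score for the whole test split.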

