# Grammar & Spelling Corrector

### About the dataset
The public version of the Corpus of Linguistic Acceptability (CoLA) contains 9,594 sentences drawn from its training and development sets, each labeled for grammatical acceptability. The dataset used in this project is derived from the original CoLA dataset, with the grammatically correct sentences removed.

##### Dataset download links:
- in_domain_train.tsv: https://github.com/nyu-mll/CoLA-baselines/blob/master/acceptability_corpus/cola_public/raw/in_domain_train.tsv
- in_domain_dev.tsv: https://github.com/nyu-mll/CoLA-baselines/blob/master/acceptability_corpus/cola_public/raw/in_domain_dev.tsv
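
As a rough illustration of "grammatically correct sentences removed", the sketch below filters rows by their acceptability label. It assumes the raw CoLA files are tab-separated with no header and four columns (source code, label with 1 = acceptable and 0 = unacceptable, the original author's annotation, and the sentence). The two sample rows and the helper name `unacceptable_sentences` are made up for the example. The notebook does not say how the corrected `target_text` column was produced, so that step is not covered here.

```python
import csv
import io

# Two made-up rows in the assumed raw CoLA layout (tab-separated, no header):
# source code, acceptability label, original annotation, sentence.
SAMPLE_TSV = (
    "xx00\t1\t\tSharon entered the room.\n"
    "xx00\t0\t*\tWho does John visit Sally because he likes?\n"
)

def unacceptable_sentences(tsv_file):
    """Return the sentences whose acceptability label is 0 (hypothetical helper)."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[3] for row in reader if len(row) == 4 and row[1] == "0"]

sentences = unacceptable_sentences(io.StringIO(SAMPLE_TSV))
print(sentences)
```

To run this on a downloaded file, pass an open file handle (for example `open("in_domain_train.tsv", encoding="utf-8", newline="")`) instead of the `io.StringIO` sample.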


In [1]:
import pandas as pd

In [2]:
df= pd.read_csv(r'../data/grammar_data.csv')
df.head()

| | input_text | target_text |
|---|---|---|
| 0 | As you eat the most, you want the least. | As you eat more, you desire less. |
| 1 | The more you would want, the less you would eat. | The more you desire, the less you eat. |
| 2 | I demand that the more John eat, the more he p... | I demand that the more John eats, the more he ... |
| 3 | The more does Bill smoke, the more Susan hates... | The more Bill smokes, the more Susan hates him. |
| 4 | Who does John visit Sally because he likes? | Whom does John visit because he likes Sally? |


In [3]:
df.shape

(516, 2)

In [4]:
df.isnull().sum()

input_text     0
target_text    0
dtype: int64

In [5]:
df.describe()


| | input_text | target_text |
|---|---|---|
| count | 516 | 516 |
| unique | 516 | 507 |
| top | As you eat the most, you want the least. | The lions ate the meat raw. |
| freq | 1 | 2 |


### Preprocessing

In [6]:
import wandb
wandb.init(mode='disabled')

In [7]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files="../data/grammar_data.csv")  # same file as read with pandas above
split_dataset = dataset['train'].train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Train dataset size: {len(train_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")

Train dataset size: 412
Test dataset size: 104
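
The printed sizes can be checked by hand. The sketch below assumes the split rounds the test share up to the next whole row, which is consistent with the 412/104 output above; the variable names are only for illustration.

```python
import math

n_rows = 516      # rows in grammar_data.csv, from df.shape
test_size = 0.2   # fraction passed to train_test_split

# 516 * 0.2 = 103.2, which rounds up to 104 test rows (assumed rounding rule).
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```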


In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')  # match the t5-base checkpoint fine-tuned below

def preprocess_function(examples):
    # Tokenize inputs and targets
    inputs = tokenizer(examples['input_text'], max_length=128, truncation=True, padding='max_length')
    targets = tokenizer(examples['target_text'], max_length=128, truncation=True, padding='max_length')
    
    # Adjust labels for training (replace padding token ID with -100)
    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels]
        for labels in targets["input_ids"]
    ]
    
    # Add labels to inputs for training
    inputs["labels"] = labels
    
    return inputs

tokenized_train_dataset= train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset= test_dataset.map(preprocess_function, batched=True)
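
The label masking in `preprocess_function` can be seen on a toy sequence without loading a tokenizer. This is a minimal sketch: the token ids are made up, `pad_and_mask` is a hypothetical helper, and it relies on T5 using 0 as its padding id and 1 as its end-of-sequence id. The value -100 is the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the training loss.

```python
PAD_TOKEN_ID = 0      # T5's padding token id
IGNORE_INDEX = -100   # label value skipped by the loss

def pad_and_mask(token_ids, max_length):
    """Pad (or truncate) to max_length, then mask padding positions in the labels."""
    padded = token_ids[:max_length] + [PAD_TOKEN_ID] * max(0, max_length - len(token_ids))
    labels = [tok if tok != PAD_TOKEN_ID else IGNORE_INDEX for tok in padded]
    return padded, labels

toy_ids = [363, 19, 48, 1]   # made-up ids; the trailing 1 stands for end-of-sequence
padded, labels = pad_and_mask(toy_ids, max_length=8)
print(padded)   # [363, 19, 48, 1, 0, 0, 0, 0]
print(labels)   # [363, 19, 48, 1, -100, -100, -100, -100]
```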

In [9]:
# Verify the preprocessing by decoding the labels back to text
decoded_refs = []
for label in tokenized_test_dataset['labels']:
    # Replace -100 with pad_token_id to make it decodable
    filtered_label = [token if token >= 0 else tokenizer.pad_token_id for token in label]
    decoded_refs.append(tokenizer.decode(filtered_label, skip_special_tokens=True))

print(decoded_refs[:5])  # Print a few decoded references for verification

['Bill pushed Harry off the sofa repeatedly for hours.', 'Sharon entered the room.', 'The bottle was drained of its liquid.', 'Sam took the ball out of the basket.', 'The more pictures of himself that appear in the news, the more likely John is to get arrested.']


### Fine-tuning the T5 model

#### 1- Finding the best Learning Rate

In [10]:
import logging
import tensorflow as tf

# Suppress TensorFlow warnings
tf.get_logger().setLevel(logging.ERROR)

In [11]:
from transformers import TrainingArguments, Trainer, AutoModelForSeq2SeqLM

learning_rates = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
epochs = [1, 2, 3, 5]  # Different values of epochs to test

results = []  # To store learning rate, epoch, and evaluation loss

for lr in learning_rates:
    for epoch in epochs:
        # Initialize model for each iteration
        model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')

        # Update training arguments
        lr_range_args = TrainingArguments(
            output_dir=f"./lr_test_lr{lr}_epoch{epoch}",
            fp16=True,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            num_train_epochs=epoch,
            logging_steps=5,
            save_strategy="no",  # Do not write checkpoints during the sweep
            learning_rate=lr,
        )
        # Initialize Trainer
        trainer = Trainer(
            model=model,
            args=lr_range_args,
            train_dataset=tokenized_train_dataset,
            eval_dataset=tokenized_test_dataset,
            processing_class=tokenizer,
        )
        # Train and evaluate
        trainer.train()
        metrics = trainer.evaluate()
        results.append((lr, epoch, metrics['eval_loss']))


In [12]:
# Save results
results_df = pd.DataFrame(results, columns=["Learning Rate", "Epochs", "Eval Loss"])
results_df.to_csv("lr_epoch_results.csv", index=False)



In [13]:
import seaborn as sns
import matplotlib.pyplot as plt

pivot_table = results_df.pivot(index="Learning Rate", columns="Epochs", values="Eval Loss")

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt=".4f", cmap="viridis")
plt.title("Evaluation Loss for Different Learning Rates and Epochs")
plt.xlabel("Epochs")
plt.ylabel("Learning Rate")
plt.show()



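- A learning rate between 1e-5 and 1e-4 seems to be optimal, since the evaluation loss decreases consistently in that range.
- Hence I'll use 1e-4, the upper bound of that range. Note that the training arguments in the next cell set `learning_rate=1e-6` instead, and the training log reflects that lower value.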
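
Besides reading the heatmap, the lowest-loss configuration can be picked programmatically. The sketch below uses tuples in the same `(learning rate, epochs, eval loss)` layout as `results`, but the numbers are invented placeholders, not values measured in this notebook.

```python
# Placeholder (learning rate, epochs, eval loss) tuples in the same layout as
# `results`; the numbers are invented for illustration, not measured.
toy_results = [
    (1e-5, 3, 1.30),
    (1e-5, 5, 1.10),
    (1e-4, 3, 0.90),
    (1e-4, 5, 0.95),
]

# Select the row with the smallest evaluation loss.
best_lr, best_epochs, best_loss = min(toy_results, key=lambda row: row[2])
print(best_lr, best_epochs, best_loss)
```

Because the sweep evaluates on the same split that is later used for testing, a configuration chosen this way may look slightly better than it would on unseen data.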

#### 2- Training the model

In [14]:
#setting up the training arguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5_corrector",
    run_name= "grammar_corrector",
    eval_strategy="epoch",
    learning_rate=1e-6,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=10,
)

In [15]:
from transformers import Seq2SeqTrainer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
# Collates examples into batches (inputs are already padded to max_length)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    processing_class=tokenizer,
    data_collator=data_collator,
)

In [16]:
trainer.train()


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


{'loss': 1.6762, 'grad_norm': 8.218499183654785, 'learning_rate': 9.047619047619047e-07, 'epoch': 0.29}
{'loss': 1.6476, 'grad_norm': 9.330354690551758, 'learning_rate': 8.095238095238095e-07, 'epoch': 0.57}
{'loss': 1.6445, 'grad_norm': 11.684385299682617, 'learning_rate': 7.142857142857143e-07, 'epoch': 0.86}



{'eval_loss': 1.601723313331604, 'eval_runtime': 190.2338, 'eval_samples_per_second': 0.547, 'eval_steps_per_second': 0.047, 'epoch': 1.0}
{'loss': 1.6168, 'grad_norm': 8.2537260055542, 'learning_rate': 6.19047619047619e-07, 'epoch': 1.14}
{'loss': 1.6556, 'grad_norm': 8.574418067932129, 'learning_rate': 5.238095238095238e-07, 'epoch': 1.43}
{'loss': 1.5864, 'grad_norm': 9.601585388183594, 'learning_rate': 4.285714285714285e-07, 'epoch': 1.71}
{'loss': 1.4564, 'grad_norm': 11.905344009399414, 'learning_rate': 3.333333333333333e-07, 'epoch': 2.0}



{'eval_loss': 1.5402783155441284, 'eval_runtime': 215.0359, 'eval_samples_per_second': 0.484, 'eval_steps_per_second': 0.042, 'epoch': 2.0}
{'loss': 1.6306, 'grad_norm': 8.841704368591309, 'learning_rate': 2.3809523809523806e-07, 'epoch': 2.29}
{'loss': 1.4687, 'grad_norm': 10.055954933166504, 'learning_rate': 1.4285714285714285e-07, 'epoch': 2.57}
{'loss': 1.4933, 'grad_norm': 12.520830154418945, 'learning_rate': 4.7619047619047613e-08, 'epoch': 2.86}



{'eval_loss': 1.5202873945236206, 'eval_runtime': 220.2286, 'eval_samples_per_second': 0.472, 'eval_steps_per_second': 0.041, 'epoch': 3.0}
{'train_runtime': 13226.0786, 'train_samples_per_second': 0.093, 'train_steps_per_second': 0.008, 'train_loss': 1.5859757832118444, 'epoch': 3.0}


TrainOutput(global_step=105, training_loss=1.5859757832118444, metrics={'train_runtime': 13226.0786, 'train_samples_per_second': 0.093, 'train_steps_per_second': 0.008, 'total_flos': 188167988183040.0, 'train_loss': 1.5859757832118444, 'epoch': 3.0})
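
The step counts and learning rates in the log above can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, no warmup, and a linear decay of the learning rate to zero; those assumptions are not stated in the notebook, but they match the logged values (105 total steps, 9 evaluation batches, and a learning rate of about 9.05e-07 at step 10).

```python
import math

train_rows, eval_rows = 412, 104   # split sizes printed earlier
batch_size, epochs = 12, 3         # from the training arguments
base_lr = 1e-6                     # learning_rate in the training arguments

steps_per_epoch = math.ceil(train_rows / batch_size)   # 35
total_steps = steps_per_epoch * epochs                 # 105
eval_batches = math.ceil(eval_rows / batch_size)       # 9

def linear_lr(step):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return base_lr * (total_steps - step) / total_steps

print(total_steps, eval_batches)
print(linear_lr(10), linear_lr(100))
```

With `logging_steps=10`, the first logged rate is `linear_lr(10)` and the last one before the end of training is `linear_lr(100)`, which explains why the logged learning rate is already below 1e-6 in the first entry.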

### Post-processing

In [17]:
import torch

input_ids = torch.tensor(tokenized_test_dataset['input_ids'])
attention_mask = torch.tensor(tokenized_test_dataset['attention_mask'])

# Obtain predictions from the trained model using beam search
predictions = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_beams=7, 
    early_stopping=True
)

In [24]:
# Decoding predictions
decoded_predictions = [
    tokenizer.decode(pred, skip_special_tokens=True)
    for pred in predictions
]

In [25]:
# Decoding references, replacing -100 with pad_token_id for valid decoding
decoded_refs = []
for label in tokenized_test_dataset['labels']:
    # Replace -100 with pad_token_id so the label can be decoded
    filtered_label = [token if token >= 0 else tokenizer.pad_token_id for token in label]
    decoded_refs.append(tokenizer.decode(filtered_label, skip_special_tokens=True))

In [29]:
from symspellpy import SymSpell
# Initialize SymSpell for spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Load dictionary files for SymSpell
# Replace these paths with the actual paths to your dictionary files
dictionary_path = r"..\ml\data\en-80k.txt"

sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)


True

In [30]:
# SymSpell for spelling correction
def correct_spelling(texts, sym_spell):
    corrected_texts = []
    for text in texts:
        suggestion = sym_spell.lookup_compound(text, max_edit_distance=2)
        corrected_texts.append(suggestion[0].term if suggestion else text)
    return corrected_texts
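
To make `max_edit_distance=2` more concrete, the sketch below computes an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent characters (the optimal string alignment variant of Damerau-Levenshtein). To my knowledge this is close to the default metric in symspellpy, but treat that as an assumption; the function `osa_distance` is written for illustration and is not part of the library.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("grammer", "grammar"))  # 1
print(osa_distance("teh", "the"))          # 1
print(osa_distance("kitten", "sitting"))   # 3
```

Under this measure, a dictionary word is a candidate correction only if it is at most two edits away from the input word, so "kitten" would not be corrected to "sitting".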

In [31]:
# LanguageTool for grammar correction
from language_tool_python import LanguageTool
tool = LanguageTool('en')

def correct_grammar(texts):
    # Apply LanguageTool's suggested corrections to each text
    corrected_texts = []
    for text in texts:
        corrected_texts.append(tool.correct(text))
    return corrected_texts

In [32]:
# Correct predictions
spelling_corrected = correct_spelling(decoded_predictions, sym_spell)
final_corrected = correct_grammar(spelling_corrected)

In [33]:
# Compare predictions to references
for corrected_pred, ref in zip(final_corrected[:5], decoded_refs[:5]):
    print(f"Corrected Prediction: {corrected_pred}")
    print(f"Reference: {ref}")
    print()

Corrected Prediction: Harry off the sofa for hours bill pushed harry off the sofa for hours
Reference: Bill pushed Harry off the sofa repeatedly for hours.

Corrected Prediction: Sharon came the room
Reference: Sharon entered the room.

Corrected Prediction: Drained the liquid free the bottle drained the liquid free
Reference: The bottle was drained of its liquid.

Corrected Prediction: Sam gave the ball out of the basket
Reference: Sam took the ball out of the basket.

Corrected Prediction: The more pictures of himself appear in the news the more likely john is to get arrested
Reference: The more pictures of himself that appear in the news, the more likely John is to get arrested.



### Model Evaluation

In [34]:
import numpy as np
from evaluate import load

metric = load("sacrebleu")

# Generate predictions
predictions = trainer.predict(tokenized_test_dataset)
decoded_preds = tokenizer.batch_decode(predictions.predictions, skip_special_tokens=True)

decoded_refs = [
    [tokenizer.decode([label for label in labels if label >= 0], skip_special_tokens=True)]
    for labels in tokenized_test_dataset['labels']
]

# Compute BLEU score
bleu_score = metric.compute(predictions=decoded_preds, references=decoded_refs)
print(bleu_score)





OverflowError: can't convert negative int to unsigned
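
The `OverflowError` above most likely comes from `tokenizer.batch_decode` receiving negative ids: when predictions from several batches are gathered, shorter sequences appear to be filled with -100, which the tokenizer cannot convert to a token. A minimal sketch of a workaround is to replace every -100 with the pad token id before decoding. The helper `replace_ignore_index` and the toy ids are hypothetical, and the BLEU score is not reproduced here because the cell did not finish.

```python
PAD_TOKEN_ID = 0      # T5's padding token id (tokenizer.pad_token_id)
IGNORE_INDEX = -100   # filler value that cannot be decoded

def replace_ignore_index(batch, pad_token_id=PAD_TOKEN_ID):
    """Swap every -100 in a batch of token-id rows for the pad token id."""
    return [
        [pad_token_id if int(tok) == IGNORE_INDEX else int(tok) for tok in row]
        for row in batch
    ]

# Made-up generated ids: the shorter row was filled with -100.
toy_predictions = [[0, 363, 19, 1], [0, 48, 1, -100]]
cleaned = replace_ignore_index(toy_predictions)
print(cleaned)   # [[0, 363, 19, 1], [0, 48, 1, 0]]
```

In the evaluation cell this would be applied as `tokenizer.batch_decode(replace_ignore_index(predictions.predictions), skip_special_tokens=True)`. Since `predictions.predictions` is a NumPy array, `np.where(predictions.predictions != -100, predictions.predictions, tokenizer.pad_token_id)` is an equivalent array-based form.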