# Word Corrector

- We use the Birkbeck dataset, as it gives us a large and varied collection of misspellings on which to test the chosen models.

- We will work with pre-trained models and analyze their performance.

## STEP 1:- CHOOSING AND MODIFYING THE DATASET

- We choose the Birkbeck dataset for its large and diverse collection of misspelled words.

- I downloaded it and converted it into a CSV file.

- I need to make some changes to the CSV file to make testing easier.

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv("./data/wrong_words.csv")

- In the file, each correct word is marked with a dollar sign at the start, and the rows that follow it are its misspellings.

- So the idea is to add another column that holds the correct word for each misspelled word.

In [3]:
data.head()

Unnamed: 0,Word
0,$Albert
1,Ab
2,$America
3,Ameraca
4,Amercia


In [4]:
subset_data = data[0:9]

subset_data

Unnamed: 0,Word
0,$Albert
1,Ab
2,$America
3,Ameraca
4,Amercia
5,$American
6,Ameracan
7,$April
8,Apirl


In [5]:
data["Word"][23551]

'$manage'

In [6]:
# Assumes the first row is a "$"-prefixed (correct) word.
correct_words = []
n = data.shape[0]
i = 0  # index of the most recent correct ("$") word
j = 1  # scanning index

while j < n:
    # A new "$" word at j closes the group that started at i.
    if "$" in data["Word"][j]:
        correct_word = data["Word"][i].replace("$", "")
        correct_words += [correct_word] * (j - i)
        i = j
    j += 1

# Label the final group, which has no later "$" word to close it.
# (The label must come from row i, not row i + 1.)
if i < n:
    correct_word = data["Word"][i].replace("$", "")
    correct_words += [correct_word] * (j - i)

data["Correct_word"] = correct_words

In [7]:
data.head()

Unnamed: 0,Word,Correct_word
0,$Albert,Albert
1,Ab,Albert
2,$America,America
3,Ameraca,America
4,Amercia,America
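
The same labelling can also be sketched without an explicit loop by forward-filling the "$" rows. This is only an illustrative alternative, not the code used in this notebook: the `toy` frame and the `pairs` name are made up for the example, and it assumes the "$" marker only ever appears at the start of a correct word.

```python
import pandas as pd

# Toy frame in the same layout as the CSV (hypothetical stand-in for `data`).
toy = pd.DataFrame({"Word": ["$Albert", "Ab", "$America", "Ameraca", "Amercia"]})

# True for the rows that hold a correct word.
is_correct = toy["Word"].str.startswith("$")

# Keep only the "$" rows (the rest become missing), drop the marker,
# then carry each correct word down over the misspellings that follow it.
toy["Correct_word"] = (
    toy["Word"].where(is_correct).str.replace("$", "", regex=False).ffill()
)

# Misspelling/correction pairs, without the "$" rows themselves.
pairs = toy[~is_correct].reset_index(drop=True)
print(pairs)
```

Forward filling labels the last group as well, so no separate step is needed after the scan.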


- We will remove the rows that hold the correct words, since they no longer need a row of their own.

In [8]:
# Match the literal "$"; regex=False avoids the invalid "\$" escape sequence.
data = data[~data["Word"].str.contains("$", regex=False)]

In [9]:
data.head()

Unnamed: 0,Word,Correct_word
1,Ab,Albert
3,Ameraca,America
4,Amercia,America
6,Ameracan,American
8,Apirl,April


## STEP 2:- CHOOSING THE METRICS 

The chosen Metrics:-
 
1) Accuracy: The fraction of misspellings for which the spell checker’s top suggestion is exactly the correct word. It is straightforward and intuitive, and it serves as an overall performance indicator. I chose Accuracy because it gives a simple assessment of the spell checker’s ability to make the right correction without needing to analyze ranks or multiple options.

2) Mean Reciprocal Rank (MRR): MRR is particularly useful for spell checkers that generate multiple suggestions. It averages 1/rank of the correct word over all test words (counting 0 when the correct word is absent), so it rewards tools that place the correct answer higher on the list. I chose MRR because it captures not only whether the correct answer is present but also how well it is ranked, which matters when the user can choose from several suggested corrections.

Reasons for choosing these over other metrics:

- Precision & Recall: While these metrics are useful for evaluating over-correction and under-correction, they are better suited to applications where false positives and false negatives carry different weights, such as classification tasks. Here every test word is a known misspelling, so Accuracy and MRR give a more direct view of performance.
- Edit Distance: This metric measures the number of changes needed to transform the misspelled word into the correct word. However, it does not capture the quality of the spell checker’s suggestions, especially when multiple suggestions are offered. MRR, in contrast, focuses on the rank of the correct suggestion, which is more relevant for assessing user-facing spell checkers.

These choices balance the need for an overall correctness measure (Accuracy) with a focus on ranking quality (MRR), making them well suited to evaluating spell checkers in a practical, user-oriented context.
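
As a minimal sketch of how the two metrics could be computed, the helpers below use exact string matching and treat a missing correct word as a reciprocal rank of 0. The function names and the toy inputs are assumptions made for illustration; they are not part of the notebook.

```python
def accuracy(predictions, targets):
    """Fraction of top-1 predictions that exactly match the target word."""
    if not targets:
        return 0.0
    hits = sum(pred == target for pred, target in zip(predictions, targets))
    return hits / len(targets)


def mean_reciprocal_rank(candidate_lists, targets):
    """Average of 1/rank of the target in each ranked candidate list.

    A target that does not appear among the candidates contributes 0.
    """
    if not targets:
        return 0.0
    total = 0.0
    for candidates, target in zip(candidate_lists, targets):
        if target in candidates:
            total += 1.0 / (candidates.index(target) + 1)
    return total / len(targets)


# Toy example: the first word is ranked 1st, the second is ranked 2nd.
targets = ["America", "April"]
suggestions = [["America", "Amerika"], ["Apple", "April"]]
top1 = [candidates[0] for candidates in suggestions]

print(accuracy(top1, targets))                     # 0.5
print(mean_reciprocal_rank(suggestions, targets))  # 0.75
```

With a model that returns a single output per word, MRR reduces to Accuracy, so MRR only adds information once several ranked candidates are available.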

## STEP 3:- CHOOSING THE MODELS

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load a grammar and spell correction model
tokenizer = AutoTokenizer.from_pretrained("prithivida/grammar_error_correcter_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter_v1")



In [12]:
sentences = ["Amerca", "Aperl"]

for sentence in sentences:
    # Encode and generate correction
    inputs = tokenizer.encode("gec: " + sentence, return_tensors="pt")
    outputs = model.generate(inputs, max_length=128, num_beams=5, early_stopping=True)
    corrected_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Original: {sentence}, Corrected: {corrected_sentence}")


Original: Amerca, Corrected: Amerca
Original: Aperl, Corrected: Aperl.
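
The second output above came back with a trailing period ("Aperl."), which would count as a mismatch under exact string comparison even if the word itself were right. One possible way to handle this, sketched here under the assumption that surrounding punctuation should be ignored when scoring, is to normalize each prediction first. The helper name `normalize_prediction` is hypothetical.

```python
import string


def normalize_prediction(text: str) -> str:
    """Strip surrounding whitespace and punctuation from a model output.

    Punctuation inside the word (for example an apostrophe) is left alone.
    """
    return text.strip().strip(string.punctuation).strip()


print(normalize_prediction("Aperl."))     # Aperl
print(normalize_prediction(" America "))  # America
```

Case is kept as-is because the dataset contains capitalized words such as "America"; whether to compare case-insensitively is a separate design choice.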


In [13]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load FLAN-T5 (an instruction-tuned T5) to prompt for spell correction
tokenizer_T5 = AutoTokenizer.from_pretrained("google/flan-t5-base")
model_T5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")




In [14]:
# Define a few sentences with spelling errors
sentences = ["Amerca is a continet.", "Aperil is in the sping."]

for sentence in sentences:
    # Use T5 for spell correction by framing the input with a prompt
    inputs = tokenizer_T5("Correct the spelling: " + sentence, return_tensors="pt")
    outputs = model_T5.generate(inputs.input_ids, max_length=50, num_beams=5, early_stopping=True)
    corrected_sentence = tokenizer_T5.decode(outputs[0], skip_special_tokens=True)
    print(f"Original: {sentence}, Corrected: {corrected_sentence}")


Original: Amerca is a continet., Corrected: America is a continent.
Original: Aperil is in the sping., Corrected: Aperil is in the sping.
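
MRR needs several ranked suggestions per word, whereas the cells above keep only the first output. A possible sketch, assuming the beams returned by `generate` are treated as a best-first ranking, is to request several sequences with `num_return_sequences` and remove duplicates. The function name `top_k_corrections` and its default prompt are illustrative assumptions; the tokenizer and model are passed in so the helper works with either model loaded above.

```python
def top_k_corrections(word, tokenizer, model, k=5, prompt="Correct the spelling: "):
    """Return up to k distinct candidate corrections, in the order generated."""
    inputs = tokenizer(prompt + word, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,
        max_length=50,
        num_beams=k,             # beam width
        num_return_sequences=k,  # must not exceed num_beams
        early_stopping=True,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Different beams can decode to the same text; keep the first occurrence.
    candidates = []
    for text in decoded:
        if text not in candidates:
            candidates.append(text)
    return candidates
```

For example, `top_k_corrections("Aperil", tokenizer_T5, model_T5, k=5)` would return a list whose position of the correct word (if present) gives the rank used by MRR.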
