# Word Corrector

- We use the Birkbeck dataset, as it gives us a large and varied set of misspellings to test the chosen models.

- We will work with two pre-trained models, and analyze and compare their performances.

## STEP 1:- CHOOSING AND MODIFYING THE DATASET

- We choose the Birkbeck dataset for its large and diverse collection of misspelled words.

- I downloaded it and converted it into a CSV file.

- I need to make some changes to the CSV file to make testing easier.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("./data/wrong_words.csv")

- Each correct word is marked with a dollar sign at the start; the rows that follow it, up to the next marked word, are its misspellings.

- So the idea is to add another column that holds the correct word for each misspelled word.

In [3]:
data.head()

Unnamed: 0,Word
0,$Albert
1,Ab
2,$America
3,Ameraca
4,Amercia


In [4]:
subset_data = data[0:9]

subset_data

Unnamed: 0,Word
0,$Albert
1,Ab
2,$America
3,Ameraca
4,Amercia
5,$American
6,Ameracan
7,$April
8,Apirl


In [5]:
data["Word"][23551]

'$manage'

In [6]:
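correct_words = []
n_rows = data.shape[0]
i = 0  # index of the current "$"-marked (correct) word; row 0 is assumed to be marked
j = 1  # scans forward for the next "$"-marked word

while j < n_rows:
    if "$" in data["Word"][j]:
        # Rows i..j-1 (the marked row and its misspellings) share one correct word
        correct_word = data["Word"][i].replace("$", "")
        correct_words += [correct_word] * (j - i)
        i = j
    j += 1

# The last group has no following "$" row, so close it explicitly.
# Its correct word sits at index i (not i + 1, which is its first misspelling).
correct_word = data["Word"][i].replace("$", "")
correct_words += [correct_word] * (n_rows - i)

data["Correct_word"] = correct_words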

In [7]:
data.head()

Unnamed: 0,Word,Correct_word
0,$Albert,Albert
1,Ab,Albert
2,$America,America
3,Ameraca,America
4,Amercia,America
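
The same column can also be built without an explicit loop. The following is a minimal sketch on a hypothetical miniature of the CSV (the `toy` frame and the `pairs` name are illustrative, not part of the notebook), assuming the `$` marker always appears as the first character of a correct word:

```python
import pandas as pd

# Hypothetical miniature of the CSV: "$" marks a correct word, and the
# rows beneath it are its misspellings.
toy = pd.DataFrame(
    {"Word": ["$Albert", "Ab", "$America", "Ameraca", "Amercia", "$April", "Apirl"]}
)

# Rows that carry the "$" marker hold the reference spelling.
is_correct = toy["Word"].str.startswith("$")

# Keep the marked rows (minus the "$"), blank out the rest, then forward-fill
# so every misspelling inherits the most recent correct word above it.
toy["Correct_word"] = toy["Word"].str[1:].where(is_correct).ffill()

# Drop the marked rows so only (misspelling, correct word) pairs remain.
pairs = toy[~is_correct].reset_index(drop=True)
print(pairs)
```

Forward-filling relies only on row order, so it does not need special handling for the final group of misspellings.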


- We remove the rows that hold the correct words, as they no longer need their own row.

In [8]:
data = data[~data["Word"].str.contains("$", regex=False)]  # literal "$" match; avoids the invalid "\$" escape

In [9]:
data.head()

Unnamed: 0,Word,Correct_word
1,Ab,Albert
3,Ameraca,America
4,Amercia,America
6,Ameracan,American
8,Apirl,April


## STEP 2:- CHOOSING THE METRICS 

The chosen metrics:
 
1) Accuracy: Accuracy is a straightforward and intuitive metric that directly reflects how often the spell checker’s top suggestion is the correct word. It is useful as an overall indicator of the tool’s effectiveness. I chose Accuracy because it offers a simple assessment of the spell checker’s ability to make the right correction without needing to analyze the rank or multiple options.

2) Mean Reciprocal Rank (MRR): MRR is particularly beneficial for spell checkers that generate multiple suggestions, as it rewards tools that place the correct answer higher on the list. This metric is ideal for applications where the user can choose from several suggested corrections. I chose MRR because it captures not only if the correct answer is present but also how well-ranked it is, adding a layer of quality assessment to the tool’s suggestions.

Reasons for Choosing These Over Other Metrics:

- Precision & Recall: While these metrics are useful for evaluating over-correction and under-correction, they are more suited for applications where false positives and false negatives carry different weights, such as in classification tasks. For spell checking, Accuracy and MRR provide a more holistic view of performance.
- Edit Distance: This metric measures the number of changes needed to transform the misspelled word into the correct word. However, it may not fully capture the quality of the spell checker’s suggestions, especially when multiple suggestions are offered. MRR, in contrast, focuses on the rank of correct suggestions, which is more relevant for assessing user-facing spell checkers.

These choices balance the need for an overall accuracy assessment (Accuracy) with a focus on ranking quality (MRR), making them well-suited to evaluating spell checkers in a practical, user-oriented context.
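
For reference, the two metrics can be written as follows. The notation is introduced here for illustration only: $N$ is the number of misspelled test words, $y_i$ the correct word for item $i$, $\hat{y}_i$ the top suggestion, and $\text{rank}_i$ the 1-based position of $y_i$ in the ranked suggestion list, with $1/\text{rank}_i$ taken as $0$ when $y_i$ is absent.

```latex
\[
\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\hat{y}_i = y_i\right],
\qquad
\text{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{\text{rank}_i}
\]
```

As a small worked case, if the correct word appears at rank 1 for one item, at rank 2 for a second, and not at all for a third, then Accuracy is $1/3$ and MRR is $(1 + 1/2 + 0)/3 = 0.5$. When a model returns only a single suggestion per word, every reciprocal rank is either 1 or 0, so MRR coincides with Accuracy.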

## STEP 3:- CHOOSING THE MODELS

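- BERT Model (the name used in this notebook for the `prithivida/grammar_error_correcter_v1` checkpoint, which is loaded below as a sequence-to-sequence model)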

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load a grammar and spell correction model
tokenizer = AutoTokenizer.from_pretrained("prithivida/grammar_error_correcter_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter_v1")




In [27]:
subset_data = data.iloc[:3000, :].copy()  # copy so new columns can be added without a SettingWithCopyWarning

words = subset_data["Word"]

In [28]:
sentences = []

for sentence in words:
    # Encode and generate correction
    inputs = tokenizer.encode("gec: " + sentence, return_tensors="pt")
    outputs = model.generate(inputs, max_length=128, num_beams=5, early_stopping=True)
    # Keep only the best beam as the model's single suggestion
    corrected_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    sentences.append(corrected_sentence)

subset_data["BERT_word"] = sentences



- T5 Model

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load T5 model for spell correction
tokenizer_T5 = AutoTokenizer.from_pretrained("google/flan-t5-base")
model_T5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")


In [None]:
sentences = []

for sentence in words:
    # Encode and generate correction
    inputs = tokenizer_T5.encode("gec: " + sentence, return_tensors="pt")
    outputs = model_T5.generate(inputs, max_length=128, num_beams=5, early_stopping=True)
    corrected_sentence = tokenizer_T5.decode(outputs[0], skip_special_tokens=True)
    sentences.append(corrected_sentence)

subset_data["T5_word"] = sentences
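
Both loops above keep only `outputs[0]`, so each model contributes a single suggestion per word. Since MRR is defined over a ranked list, one option is to ask `generate` for several beams and keep them all. The helper below is a hedged sketch rather than the notebook's method: `ranked_suggestions` and its parameters `k` and `prefix` are hypothetical names, and it assumes the returned beams arrive best-scoring first, which is the usual behaviour of beam search in `transformers` but is worth verifying for the installed version.

```python
def ranked_suggestions(word, tokenizer, model, k=5, prefix="gec: "):
    """Return up to k distinct corrections for `word`, in the order generated."""
    inputs = tokenizer.encode(prefix + word, return_tensors="pt")
    outputs = model.generate(
        inputs,
        max_length=128,
        num_beams=k,             # beam width must be >= num_return_sequences
        num_return_sequences=k,  # keep every beam instead of only outputs[0]
        early_stopping=True,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Different beams can decode to the same string; keep first occurrences only.
    suggestions = []
    for text in decoded:
        text = text.strip()
        if text and text not in suggestions:
            suggestions.append(text)
    return suggestions
```

With the objects loaded earlier, a call would look like `ranked_suggestions("Ameraca", tokenizer, model)` for the first model or `ranked_suggestions("Ameraca", tokenizer_T5, model_T5)` for the second; the resulting lists could then be stored in a column and scored with a rank-aware metric.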


## STEP 4:- EVALUATING THE MODELS

- We first need to create functions that calculate the Accuracy and the MRR.

In [43]:
def calculate_accuracy(predictions, ground_truth):
    """Fraction of rows whose prediction exactly equals the ground-truth word."""
    correct_count = 0
    total = len(ground_truth)
    for i, correct_word in enumerate(ground_truth):
        # Exact string match between the model output and the expected word
        if predictions.iloc[i] == correct_word:
            correct_count += 1
    
    accuracy = correct_count / total
    return accuracy

def calculate_mrr(predictions, ground_truth):
    """Mean reciprocal rank of the ground-truth word within each prediction.

    Each prediction is expected to be a ranked list of suggestions. If it is a
    plain string instead, `in` is a substring test and `.index` returns a
    character offset, so the value is not a rank over suggestions.
    """
    reciprocal_ranks = []
    
    for i, correct_word in enumerate(ground_truth):
        if correct_word in predictions.iloc[i]:
            rank = predictions.iloc[i].index(correct_word) + 1  # +1 because ranks are 1-based
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)  # No correct suggestion in the list
    
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return mrr
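
`calculate_mrr` applies `in` and `.index` to whatever each row of `predictions` holds, so its result depends on whether a row is a list of suggestions or a single string. The sketch below shows a rank-based variant under the assumption that each prediction is a ranked list of whole-word suggestions; `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical names, and the sample data is made up for illustration.

```python
def reciprocal_rank(suggestions, correct_word):
    """1/rank of `correct_word` in a ranked list, or 0.0 if it is absent."""
    # A bare string is treated as a one-item list so that the comparison is
    # between whole words rather than substrings and character offsets.
    if isinstance(suggestions, str):
        suggestions = [suggestions]
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == correct_word:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_suggestions, ground_truth):
    """Average reciprocal rank over paired suggestion lists and references."""
    scores = [reciprocal_rank(s, g) for s, g in zip(all_suggestions, ground_truth)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranked outputs for three misspellings.
suggestions = [["America", "Americas"], ["Aprils", "April"], ["Albertt"]]
references = ["America", "April", "Albert"]
print(mean_reciprocal_rank(suggestions, references))  # (1 + 1/2 + 0) / 3 = 0.5
```

Because both functions iterate with `zip`, they accept plain lists as well as pandas Series.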

- Accuracy and MRR of the **BERT Model**

In [49]:
predictions_BERT = subset_data["BERT_word"]
ground_truth = subset_data["Word"]  # note: "Word" holds the misspelled inputs; the correct spellings are in "Correct_word"

accuracy_BERT = calculate_accuracy(predictions_BERT, ground_truth)

In [50]:
mrr_BERT = calculate_mrr(predictions_BERT, ground_truth)

In [51]:
print(f"Accuracy for BERT: {accuracy_BERT:.2f}")
print(f"MRR for BERT: {mrr_BERT:.2f}")

Accuracy for BERT: 0.25
MRR for BERT: 0.81


- Accuracy and MRR of the **T5 Model**

In [52]:
predictions_T5 = subset_data["T5_word"]

accuracy_T5 = calculate_accuracy(predictions_T5, ground_truth)

In [53]:
mrr_T5 = calculate_mrr(predictions_T5, ground_truth)

In [54]:
print(f"Accuracy for T5: {accuracy_T5:.2f}")
print(f"MRR for T5: {mrr_T5:.2f}")

Accuracy for T5: 0.20
MRR for T5: 0.32
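
In the cells above, `ground_truth` is read from the `"Word"` column, which holds the misspelled inputs, so the scores describe agreement with the input rather than with the reference spelling. A hedged sketch of scoring against the reference instead is given below. The helper names `normalize` and `exact_match_accuracy` are hypothetical, and the normalisation (lower-casing, trimming surrounding whitespace and punctuation) is an assumption about what a sequence-to-sequence model might add to a one-word input, not something measured in this notebook.

```python
import string


def normalize(text):
    """Lower-case and strip surrounding whitespace and punctuation."""
    return text.strip().strip(string.punctuation).strip().lower()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalising."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in pairs)
    return hits / len(pairs)


# Hypothetical model outputs scored against the correct spellings.
predicted = ["America.", "Apirl", " albert "]
correct = ["America", "April", "Albert"]
print(exact_match_accuracy(predicted, correct))
```

Applied to the notebook's data, the call would be `exact_match_accuracy(subset_data["BERT_word"], subset_data["Correct_word"])`, and likewise for `"T5_word"`. Lower-casing is a design choice: it should be dropped if capitalisation errors are meant to count as misspellings.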


## STEP 5:- ANALYZING THE RESULTS

1) Strengths and Weaknesses

**BERT Model:**

- Accuracy: The BERT model achieved an accuracy of 0.25, meaning it correctly identified the exact spelling for 25% of the cases. This indicates that while BERT is somewhat effective in suggesting the correct spelling, its primary strength lies not in providing a high number of exact matches but in its contextual understanding.
- MRR: The Mean Reciprocal Rank for BERT is 0.81, suggesting that even when BERT doesn’t provide the correct word as the top suggestion, it often ranks it highly among its suggestions. This high MRR value highlights BERT's ability to produce relevant corrections that may not be perfect but are close to the correct answer. The strong MRR score implies that BERT’s suggestions could be useful if further optimized or if the tool allows users to select from multiple suggestions.
- Weakness: Although BERT performs well in ranking relevant suggestions, its relatively low accuracy indicates that it often fails to provide the exact correct word as the first suggestion.

**T5 Model:**

- Accuracy: The T5 model achieved a lower accuracy of 0.20, meaning it correctly identified the exact spelling for only 20% of the cases. This lower accuracy compared to BERT suggests that T5 might not be as effective in making precise corrections.
- MRR: The MRR for T5 is 0.32, which is considerably lower than that of BERT. This indicates that T5’s suggestions are generally less relevant or useful compared to BERT’s, as the correct answer is less likely to be ranked highly among its suggestions.
- Weakness: T5’s lower accuracy and MRR suggest that it may lack both the precision and the ranking quality seen in BERT, making it less reliable for tasks requiring top-ranked correct suggestions.

2) Improvement Suggestions

- Model Ensembling: Given that BERT has a high MRR but moderate accuracy, combining it with other models (e.g., T5 or rule-based spell checkers) might improve the overall accuracy by leveraging the strengths of multiple models. An ensemble approach could enhance the chances of obtaining the correct answer in the top suggestions.
- Fine-Tuning: Fine-tuning the BERT model on a specialized spelling correction dataset could help increase its accuracy for exact matches. Tailoring the model to correct specific types of spelling errors might address its shortcomings in providing precise corrections.
- Context-Based Re-Ranking: Since BERT provides relatively high-quality suggestions, implementing a context-aware re-ranking mechanism could improve the rank of the correct answer. By taking into account additional linguistic features or common spelling patterns, this approach could optimize the order of BERT’s suggestions for better usability.
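
As an illustration of the ensembling idea, ranked candidate lists from several models can be merged with reciprocal rank fusion. This is one possible combination rule swapped in for the sake of example, not something evaluated in this notebook; `fuse_rankings` is a hypothetical helper, the candidate lists are made up, and the constant `k=60` is a conventional default that would need tuning.

```python
from collections import defaultdict


def fuse_rankings(rankings, k=60):
    """Merge several ranked candidate lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            # Candidates ranked highly by several models accumulate more score.
            scores[candidate] += 1.0 / (k + rank)
    # Higher fused score first; ties fall back to alphabetical order.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical ranked candidates from the two models for one misspelling.
bert_candidates = ["Americas", "America", "Amerika"]
t5_candidates = ["America", "Ameraca"]
print(fuse_rankings([bert_candidates, t5_candidates]))
```

Here "America" moves to the top because it is the only candidate proposed by both lists, which is the effect an ensemble is meant to exploit.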