<a href="https://colab.research.google.com/github/AndreRab/Research-of-the-existing-spell-checking-tools-using-NLP/blob/main/jetBrains.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Abstract**

This research aims to evaluate and compare three spell-checking tools, focusing on their effectiveness and accuracy in correcting spelling errors. Using a diverse dataset, we assess the performance of GPT-4o, the language_tool_python library, and a fine-tuned T5 model. Through this study, we seek to identify the best-performing tool for spell-checking tasks and provide insights into potential improvements.



# **Introduction**

Spell-checking tools are essential in ensuring effective written communication. With the increasing reliance on digital text, the demand for reliable and efficient spell-checkers has grown. This research investigates the performance of several popular spell-checking tools, aiming to provide a comparative analysis that can guide users in selecting the most suitable tool for their needs.

# **Data preparation**

For this task, I will use a Kaggle dataset, [Grammar Correction](https://www.kaggle.com/datasets/satishgunjal/grammar-correction), which consists of ungrammatical sentences and their corrected versions. I chose this type of data because checking grammar typically requires only a single sentence. To review a longer text, we can split it into sentences and correct each one individually.
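
The sentence-by-sentence idea can be sketched as follows. This is only a minimal illustration: `split_sentences` and `correct_text` are hypothetical helper names, the regular expression is a naive splitter that breaks on abbreviations such as "dr. Smith", and `correct_sentence` stands for whichever correction tool is plugged in. A tokenizer such as `nltk.sent_tokenize` would likely be more robust.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


def correct_text(text, correct_sentence):
    """Apply a single-sentence corrector to every sentence and rejoin the result."""
    return ' '.join(correct_sentence(sentence) for sentence in split_sentences(text))


text = "I goes to the store everyday. They was playing soccer last night."
print(split_sentences(text))
```

Any of the tools evaluated later can be passed as `correct_sentence`, as long as it maps one sentence string to one corrected string.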

In [None]:
!pip install kaggle
!kaggle datasets download -d satishgunjal/grammar-correction -p ./data

Dataset URL: https://www.kaggle.com/datasets/satishgunjal/grammar-correction
License(s): apache-2.0
grammar-correction.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip data/grammar-correction.zip

Archive:  data/grammar-correction.zip
replace Grammar Correction.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
!pip install datasets



In [None]:
import pandas as pd
from datasets import Dataset

df = pd.read_csv("Grammar Correction.csv")

In [None]:
df.rename(columns = {'Serial Number': 'Index', 'Error Type': 'error_type', 'Ungrammatical Statement': 'error_sentence', 'Standard English': 'correct_sentence'}, inplace=True)
df.drop(columns=['Index'], inplace=True)
df.head()

Unnamed: 0,error_type,error_sentence,correct_sentence
0,Verb Tense Errors,I goes to the store everyday.,I go to the store everyday.
1,Verb Tense Errors,They was playing soccer last night.,They were playing soccer last night.
2,Verb Tense Errors,She have completed her homework.,She has completed her homework.
3,Verb Tense Errors,He don't know the answer.,He doesn't know the answer.
4,Verb Tense Errors,The sun rise in the east.,The sun rises in the east.


As we can see, this dataset contains not only sentences but also labels indicating the types of mistakes. I will explore different models, some of which I will fine-tune, so let's select a few examples from each category for our validation set.

In [None]:
val_indx = []
train_indx = []
number_of_samples_per_mistake = 2

for error_type in df.error_type.unique():
  val_indx += list(df.loc[df.error_type == error_type].index)[:number_of_samples_per_mistake]
  train_indx += list(df.loc[df.error_type == error_type].index)[number_of_samples_per_mistake:]

df_train = df.iloc[train_indx, :]
df_val = df.iloc[val_indx, :]

In [None]:
df_train.head()

Unnamed: 0,error_type,error_sentence,correct_sentence
2,Verb Tense Errors,She have completed her homework.,She has completed her homework.
3,Verb Tense Errors,He don't know the answer.,He doesn't know the answer.
4,Verb Tense Errors,The sun rise in the east.,The sun rises in the east.
5,Verb Tense Errors,I am eat pizza for lunch.,I am eating pizza for lunch.
6,Verb Tense Errors,The students studies for the exam.,The students study for the exam.


In [None]:
df_val.head()

Unnamed: 0,error_type,error_sentence,correct_sentence
0,Verb Tense Errors,I goes to the store everyday.,I go to the store everyday.
1,Verb Tense Errors,They was playing soccer last night.,They were playing soccer last night.
100,Subject-Verb Agreement,The dogs runs quickly to the park.,The dogs run quickly to the park.
101,Subject-Verb Agreement,She go to the library every Tuesday.,She goes to the library every Tuesday.
200,Article Usage,I went to a school yesterday.,I went to school yesterday.


Creating datasets for fine-tuning.

In [None]:
df_dataset = df[['error_sentence', 'correct_sentence']]

df_dataset_train = df_dataset.iloc[train_indx, :]
df_dataset_val = df_dataset.iloc[val_indx, :]

df_dataset_train.columns = ['input_text', 'target_text']
df_dataset_val.columns = ['input_text', 'target_text']

dataset_train = Dataset.from_pandas(df_dataset_train)
dataset_val = Dataset.from_pandas(df_dataset_val)

# **Metrics**
As metrics for my research, I will use the BLEU and METEOR metrics.

BLEU is a good metric for us because people often make small mistakes or typos, which makes it important to examine n-grams.

METEOR was chosen because it can handle synonyms. A sentence can often be corrected in more than one way, and several different synonyms may be equally acceptable, so we need a metric that does not penalize an acceptable synonym as heavily as exact n-gram matching does.

In [None]:
!pip install nltk



In [None]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate import meteor_score
from nltk import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Let's define functions for calculating the BLEU and METEOR scores.

In [None]:
def calculate_bleu(candidate, reference):
    reference = word_tokenize(reference)
    candidate = word_tokenize(candidate)
    smoothing_function = SmoothingFunction().method1
    score = sentence_bleu([reference], candidate, smoothing_function=smoothing_function)
    return score

print(f'BLEU metric for the correct sentence: {calculate_bleu(df_train.correct_sentence[2], df_train.correct_sentence[2])}')
print(f'BLEU metric for the wrong sentence: {calculate_bleu(df_train.error_sentence[2], df_train.correct_sentence[2])}')

BLEU metric for the correct sentence: 1.0
BLEU metric for the wrong sentence: 0.537284965911771


In [None]:
def calculate_meteor(candidate, reference):
    reference = word_tokenize(reference.lower())
    candidate = word_tokenize(candidate.lower())
    score = meteor_score.meteor_score([reference], candidate)
    return score

print(f'METEOR metric for the correct sentence: {calculate_meteor(df_train.correct_sentence[2], df_train.correct_sentence[2])}')
print(f'METEOR metric for the wrong sentence: {calculate_meteor(df_train.error_sentence[2], df_train.correct_sentence[2])}')


METEOR metric for the correct sentence: 0.9976851851851852
METEOR metric for the wrong sentence: 0.8066666666666668


Let's look at our sentences and try to understand why the METEOR metric is higher than BLEU.

In [None]:
df_train.loc[2, ['error_sentence', 'correct_sentence']]

Unnamed: 0,2
error_sentence,She have completed her homework.
correct_sentence,She has completed her homework.


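As we can see, the two sentences differ in a single word ('have' instead of 'has'). BLEU combines 1-gram to 4-gram precision, so one wrong token also breaks every longer n-gram that contains it, and the score drops to about 0.54. METEOR is based on unigram matches and applies only a mild penalty when the matches are split into several chunks, so it stays at about 0.81. In short, METEOR is more tolerant of small local changes, while BLEU rewards exact precision. For our task, it is better to have metrics for both, but I believe BLEU will be the more important one.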
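
The two scores printed above can be reproduced by hand, which makes the difference between the metrics concrete. The sketch below is a simplified re-implementation, not the NLTK code: `bleu4` assumes every n-gram precision is non-zero (so no smoothing is needed), and `meteor_same_length` assumes equal-length sentences aligned position by position with exact matches only. The parameters `alpha=0.9`, `beta=3.0`, `gamma=0.5` are the NLTK defaults.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())


def bleu4(candidate, reference):
    """Geometric mean of 1- to 4-gram precision times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, 5)]
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = exp(1 - len(reference) / len(candidate))
    return brevity_penalty * exp(sum(log(p) for p in precisions) / 4)


def meteor_same_length(candidate, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR for two equal-length token lists using exact positional matches."""
    matched = [c == r for c, r in zip(candidate, reference)]
    m = sum(matched)
    if m == 0:
        return 0.0
    # A chunk is a maximal run of consecutive matched positions.
    chunks = sum(1 for i, hit in enumerate(matched) if hit and (i == 0 or not matched[i - 1]))
    precision, recall = m / len(candidate), m / len(reference)
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / m) ** beta
    return (1 - penalty) * fmean


wrong = ["She", "have", "completed", "her", "homework", "."]
right = ["She", "has", "completed", "her", "homework", "."]

print(bleu4(wrong, right))               # precisions 5/6, 3/5, 2/4, 1/3
print(meteor_same_length(wrong, right))  # 5 matches in 2 chunks
print(meteor_same_length(right, right))  # 6 matches in 1 chunk
```

The n-gram precisions for the wrong sentence are 5/6, 3/5, 2/4 and 1/3, whose geometric mean is about 0.537. The last line also shows why an identical sentence scores 0.9977 rather than 1.0 on METEOR: the fragmentation penalty is 0.5 · (1/6)³ even when all six tokens match in a single chunk.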

# **Model evaluation**

In this section, I will explore different models and libraries and try to fine-tune my own model.

In [None]:
df_model_comparison = pd.DataFrame(columns = ['model', 'avg_BLEU', 'avg_METEOR', 'description'])
df_model_comparison

Unnamed: 0,model,avg_BLEU,avg_METEOR,description


### **LLM - GPT-4o model**

Our first model will be GPT-4o, which will receive the following prompt:

In [None]:
prompt = "Correct all sentences. Sentences:"
for sentence in df_val.error_sentence:
  prompt += '\n' + sentence

print(prompt)

Correct all sentences. Sentences:
I goes to the store everyday.
They was playing soccer last night.
The dogs runs quickly to the park.
She go to the library every Tuesday.
I went to a school yesterday.
She bought the ice cream for her daughter.
In Friday, I will go to the party.
He is interested to learn guitar.
I be going to the store to buy groceries.
He no likes pineapple on his pizza.
1. She always be triying new things.
2. John and me went to the store yesterday.
I dont know what your talking about.
She said that she was going to the store but she never came back.
i have a meeting on Monday with dr. Smith.
my iphone suddenly stopped working, can you help?
The weather outside is very extreem today.
She didn't excepted the job offer.
I went to the store I bought groceries.
She loves reading she reads every night.
Eating an apple, hungry.
She late for meeting.
I went to the store store.
At 3 PM in the afternoon, I have a meeting.
1. Let's not open up that can of worms in the same boa

I got the following answer from GPT-4o:

In [None]:
result = """I go to the store every day.
They were playing soccer last night.
The dogs run quickly to the park.
She goes to the library every Tuesday.
I went to school yesterday.
She bought ice cream for her daughter.
On Friday, I will go to the party.
He is interested in learning guitar.
I am going to the store to buy groceries.
He does not like pineapple on his pizza.
She is always trying new things.
John and I went to the store yesterday.
I don’t know what you’re talking about.
She said that she was going to the store, but she never came back.
I have a meeting on Monday with Dr. Smith.
My iPhone suddenly stopped working. Can you help?
The weather outside is very extreme today.
She didn't accept the job offer.
I went to the store and bought groceries.
She loves reading; she reads every night.
Eating an apple, I felt hungry.
She is late for the meeting.
I went to the store.
At 3 PM, I have a meeting.
1. Let's not open up that can of worms.
2. You're comparing apples to oranges while the cat's away.
The car was driven by Sarah to the supermarket.
The cake was eaten by the guests at the party.
He and I went to the store yesterday.
When you see Sarah, give her this letter.
I like pizza, but I don't like ice cream.
She can sing, and she can dance.
Walking incredibly, John finished the race.
She served sandwiches on plates made of whole wheat to her guests.
He runs faster than I do.
This is the tallest building I've ever seen.
She enjoys reading, watching movies, and cooking.
John needs to clean his room, his car, and organize his desk.
Some people like to play basketball.
Many cars are electric nowadays.
I need to repeat what I said earlier.
She always exaggerates her problems.
Yo, how's it going? I need to know the time of the meeting.
Could you kindly provide me with a burger, fries, and a Coke, please?
She goes to the store to buy bread.
He ran the race quickly.
If I had a car, I would go on a road trip.
If he finishes his work, he could go to the party.
Apples are better than oranges.
She is taller than her brother, but not as tall as her sister.
How are you doing today?
She doesn't know the answer to the question.
I don't know anything about cars.
She never visited the museum.
I went to the store and bought milk.
She was so tired that she couldn't keep her eyes open.
1. There's no way I'm going to that party.
2. I'm going to pick up some food. Are you hungry?
He's at the end of his rope.
She got quick on the draw.
He works at Facebook, but I work at Microsoft.
My brother just joined the USMC yesterday.
Its door is locked; we can't get in.
I don't know where its key is.
The man who bought the car was happy.
The cake I ate for dessert was delicious.
He decided to finish his homework quickly.
She wants to study computer science in the future.
I enjoy reading books.
She is excited about the trip.
She enjoys swimming, dancing, and reading books.
To become a doctor, you need to study hard, be patient, and have good communication skills."""

Let's compute the average BLEU and METEOR scores for GPT-4o.


In [None]:
bleu_array = []
meteor_array = []

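sentences = result.split('\n')
# Each dataset reference is paired with the GPT-4o line at the same position.
# Note: the dataset reference is passed as `candidate` and the GPT-4o output as
# `reference` here, the reverse of the other cells; the scores shown below were
# produced with this argument order.
for (reference_sentence, gpt_sentence) in zip(df_val.correct_sentence, sentences):
  bleu_array.append(calculate_bleu(reference_sentence, gpt_sentence))
  meteor_array.append(calculate_meteor(reference_sentence, gpt_sentence))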

new_row = {'model': 'GPT-4o',
           'avg_BLEU': sum(bleu_array)/len(bleu_array),
           'avg_METEOR': sum(meteor_array)/len(meteor_array),
           'description': 'Chat GPT-4o with one prompt for all sentences'}

df_model_comparison.loc[len(df_model_comparison), :] = new_row
df_model_comparison

Unnamed: 0,model,avg_BLEU,avg_METEOR,description
0,GPT-4o,0.813936,0.928173,Chat GPT-4o with one prompt for all sentences


As we can see, the average BLEU score is 0.81, so the output does not always match the reference exactly. One likely reason is that the model had to correct all sentences in a single prompt, which is a harder task than correcting one sentence at a time. Additionally, the model occasionally replaced some words with synonyms. However, the METEOR score of 0.93 suggests that the meaning of the sentences was largely preserved.
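
Because all sentences go through one prompt, the scoring relies on line i of the reply corresponding to sentence i of the input, and `zip` silently truncates if the counts differ. A small guard like the one below could make that assumption explicit. `pair_outputs` is a hypothetical helper, not part of the original notebook, and it only checks the line count, not the content.

```python
def pair_outputs(inputs, response):
    """Pair each input sentence with the reply line at the same position.

    Raises ValueError when the reply does not contain exactly one non-empty
    line per input, since a positional pairing would then be misaligned.
    """
    outputs = [line.strip() for line in response.split('\n') if line.strip()]
    if len(outputs) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} corrected lines, got {len(outputs)}"
        )
    return list(zip(inputs, outputs))


pairs = pair_outputs(
    ["I goes to the store everyday.", "She late for meeting."],
    "I go to the store every day.\nShe is late for the meeting.",
)
print(pairs[0])
```

A check like this would fail loudly if the model merged two sentences or added an extra line, instead of shifting every later pair by one.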

### **Library - language_tool_python**

In [None]:
!pip install language_tool_python
import language_tool_python



Example usage:

In [None]:
tool = language_tool_python.LanguageTool('en-US')
text = "I goes to the store everyday."
matches = tool.check(text)
corrected_text = language_tool_python.utils.correct(text, matches)

print(f"Corrected Sentence: {corrected_text}")

Corrected Sentence: I go to the store every day.


Let's compute the average BLEU and METEOR scores for the language_tool_python library.

In [None]:
bleu_array = []
meteor_array = []

for (error_sentence, correct_sentence) in zip(df_val.error_sentence, df_val.correct_sentence):
  matches = tool.check(error_sentence)
  generated_sentence = language_tool_python.utils.correct(error_sentence, matches)
  bleu_array.append(calculate_bleu(generated_sentence, correct_sentence))
  meteor_array.append(calculate_meteor(generated_sentence, correct_sentence))

new_row = {'model': 'language_tool_python ',
           'avg_BLEU': sum(bleu_array)/len(bleu_array),
           'avg_METEOR': sum(meteor_array)/len(meteor_array),
           'description': 'Python library that does not publicly specify a particular model architecture.'}

df_model_comparison.loc[len(df_model_comparison), :] = new_row
df_model_comparison

Unnamed: 0,model,avg_BLEU,avg_METEOR,description
0,GPT-4o,0.813936,0.928173,Chat GPT-4o with one prompt for all sentences
1,language_tool_python,0.62425,0.891729,Python library that does not publicly specify ...


I couldn't find a description of any model architecture used inside this library. language_tool_python is a Python wrapper around LanguageTool, an open-source checker that relies mainly on hand-written rules rather than on a single neural model. As we can see, GPT-4o has better scores on both metrics, which is plausible because a model has to capture grammar and sentence structure well in order to generate fluent text.

### **Fine-tuning using transformers**

In [None]:
!pip install transformers
import torch
import numpy as np



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

As my base model for fine-tuning, I chose T5. The entire process of fine-tuning is located in the following cells.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq

model_name = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)

def preprocess_function(examples):
    # Prefix every input so T5 treats correction as its own text-to-text task.
    inputs = ['correct: ' + text for text in examples['input_text']]
    model_inputs = tokenizer(inputs, max_length=32, padding=True, truncation=True)

    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples['target_text'], max_length=32, padding=True, truncation=True)

    # Padded label positions keep the pad token id; they are not masked to -100 here.
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_dataset_train = dataset_train.map(preprocess_function, batched=True)
tokenized_dataset_val = dataset_val.map(preprocess_function, batched=True)

Map:   0%|          | 0/1946 [00:00<?, ? examples/s]



Map:   0%|          | 0/72 [00:00<?, ? examples/s]

In [None]:
def compute_metrics(pred):
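    # For T5 the Trainer returns a tuple; the first element holds the decoder logits.
    predictions, labels = pred.predictions[0], pred.label_ids
    # Greedy argmax over logits computed with the reference as decoder input
    # (teacher forcing), not free-running generation.
    predictions = np.argmax(predictions, axis=-1)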

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculate BLEU and METEOR scores for each pair
    bleu_scores = [calculate_bleu(pred, label) for pred, label in zip(decoded_preds, decoded_labels)]
    meteor_scores = [calculate_meteor(pred, label) for pred, label in zip(decoded_preds, decoded_labels)]

    # Return average scores
    return {
        'avg_bleu': sum(bleu_scores) / len(bleu_scores),
        'avg_meteor': sum(meteor_scores) / len(meteor_scores),
    }


In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-small')
model.to(device)

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_train,
    eval_dataset=tokenized_dataset_val,
    data_collator = data_collator,
    compute_metrics=compute_metrics
)

trainer.train()


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Avg Bleu,Avg Meteor
1,No log,0.665701,0.540558,0.748479
2,No log,0.342933,0.620384,0.82886
3,No log,0.271172,0.678337,0.861901
4,No log,0.253401,0.679571,0.862786
5,1.137700,0.243963,0.690929,0.864494
6,1.137700,0.237761,0.705979,0.868241
7,1.137700,0.233834,0.708816,0.869671
8,1.137700,0.231095,0.708816,0.869671
9,0.234400,0.22966,0.71185,0.871382
10,0.234400,0.229295,0.71185,0.871382


TrainOutput(global_step=1220, training_loss=0.6016781134683578, metrics={'train_runtime': 215.9204, 'train_samples_per_second': 90.126, 'train_steps_per_second': 5.65, 'total_flos': 164609466040320.0, 'train_loss': 0.6016781134683578, 'epoch': 10.0})

The validation loss is now decreasing only slowly, so I believe we can stop fine-tuning and look at the results. First, let's add the fine-tuned model to the comparison DataFrame.
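
The "loss is flattening" judgement can be made explicit with a simple rule. The sketch below uses the validation losses from the training table and reports the first epoch whose relative improvement over the previous epoch falls below a threshold. The function name and the 1% threshold are assumptions chosen for illustration; the `transformers` library also ships an `EarlyStoppingCallback` that could automate a similar decision during training.

```python
def first_plateau_epoch(val_losses, min_rel_improvement=0.01):
    """Return the 1-based epoch at which the relative improvement in
    validation loss first drops below the threshold, or None if it never does."""
    for i in range(1, len(val_losses)):
        previous, current = val_losses[i - 1], val_losses[i]
        relative_improvement = (previous - current) / previous
        if relative_improvement < min_rel_improvement:
            return i + 1
    return None


# Validation loss per epoch, copied from the training log above.
val_losses = [0.665701, 0.342933, 0.271172, 0.253401, 0.243963,
              0.237761, 0.233834, 0.231095, 0.22966, 0.229295]

print(first_plateau_epoch(val_losses))  # 9
```

With a 1% threshold the rule fires at epoch 9, which is consistent with stopping after 10 epochs.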

In [None]:
new_row = {'model': 'Fine-tuned T5-small model',
           'avg_BLEU': 0.711850,
           'avg_METEOR': 0.871382,
           'description': 'Fine-tuned small-T5 model using transformers library.'}

df_model_comparison.loc[len(df_model_comparison), :] = new_row
df_model_comparison

Unnamed: 0,model,avg_BLEU,avg_METEOR,description
0,GPT-4o,0.813936,0.928173,Chat GPT-4o with one prompt for all sentences
1,language_tool_python,0.62425,0.891729,Python library that does not publicly specify ...
2,Fine-tuned T5-small model,0.71185,0.871382,Fine-tuned small-T5 model using transformers l...


During my training, I encountered some issues that we can address for better results:

* The validation set was too small. I used only 72 of the 2,018 sentences (about 3.6%) for evaluation, which isn't ideal because such a small sample may not be representative. I did this so that all validation sentences could be sent to GPT-4o in a single prompt. To compare the fine-tuned models reliably, we need a larger validation set.

* We should have more data. While the current dataset is suitable for comparing trained models, we need a more diverse dataset to fine-tune a model effectively.

* Finding optimal hyperparameters is essential. We can experiment with different hyperparameters to identify the best configuration.

* We should consider trying other transformers as the base for our model.
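
For the first point, a larger validation set could still keep every error type represented by sampling a fixed fraction per category instead of a fixed count. The following is a minimal sketch using only the standard library; `stratified_split`, the 10% fraction and the seed are illustrative assumptions, and the toy labels stand in for the `error_type` column.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Split positions 0..len(labels)-1 into train/validation index lists,
    taking roughly `val_fraction` of each label (at least one) for validation."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx += indices[:n_val]
        train_idx += indices[n_val:]
    return sorted(train_idx), sorted(val_idx)


# Toy labels standing in for df.error_type.
labels = ["Verb Tense Errors"] * 50 + ["Article Usage"] * 30
train_idx, val_idx = stratified_split(labels, val_fraction=0.1)
print(len(train_idx), len(val_idx))  # 72 8
```

The returned positions could be passed to `df.iloc[...]` in the same way as `train_indx` and `val_indx` earlier in the notebook.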

# **Final model comparison**

In [None]:
df_model_comparison

Unnamed: 0,model,avg_BLEU,avg_METEOR,description
0,GPT-4o,0.813936,0.928173,Chat GPT-4o with one prompt for all sentences
1,language_tool_python,0.62425,0.891729,Python library that does not publicly specify ...
2,Fine-tuned T5-small model,0.71185,0.871382,Fine-tuned small-T5 model using transformers l...


Let's see the best model for each metric.

In [None]:
df_model_comparison.loc[:, ['model', 'avg_BLEU']].sort_values(by = ['avg_BLEU'], ascending = False)

Unnamed: 0,model,avg_BLEU
0,GPT-4o,0.813936
2,Fine-tuned T5-small model,0.71185
1,language_tool_python,0.62425


As we can see, the best model for the BLEU metric is the GPT model, which is not surprising because it is the largest one. The second model is the fine-tuned T5, and I believe this is due to the fact that I trained it using this dataset, so the model already knows what the input data will look like, while language_tool_python does not.

In [None]:
df_model_comparison.loc[:, ['model', 'avg_METEOR']].sort_values(by = ['avg_METEOR'], ascending = False)

Unnamed: 0,model,avg_METEOR
0,GPT-4o,0.928173
1,language_tool_python,0.891729
2,Fine-tuned T5-small model,0.871382


For the METEOR metric, every model scored higher than it did on BLEU, and once again GPT-4o is the winner. However, language_tool_python took second place here, which suggests that its corrections preserve the words and overall meaning of the reference sentences better than those of the T5 model, even though fewer of them match the reference exactly.

## How can I improve each model?

The T5 and GPT models can be fine-tuned on more varied data, with different training techniques and better-tuned hyperparameters.

language_tool_python can be extended with our own custom grammar rules.

# **Conclusion**

During the research, I tested three spell-checking tools and evaluated them to determine which one is the best.

I started with GPT-4o, which turned out to be the best of the tools I tried during this research. However, the main issue is that this model is very large, and I couldn't fine-tune it locally. That's why I used a web application to evaluate it. Because I couldn't write very large prompts there, my validation set wasn't extensive enough for all the models.

Then I tried the language_tool_python library, which worked well on my examples. I expected it to outperform GPT-4o, but in reality it did not.

The last tool I tested was a fine-tuned T5 model, with which I encountered some training difficulties. As mentioned in the fine-tuning section, there are issues that we can address to achieve better results.

In conclusion, I believe that if we use a more diverse dataset, identify a better base model, and optimize the hyperparameters, we may be able to build a model that outperforms GPT-4o in spell-checking tasks.