# Grammar Error Correction Using LLM (Large Language Models)

## Introduction to Grammar Error Correction

Grammar error correction (GEC) refers to the process of detecting and correcting grammatical errors in written text. This includes a wide range of mistakes, such as verb tense errors, subject-verb agreement, incorrect word order, preposition mistakes, and punctuation errors. Effective GEC is crucial for improving the readability, understanding, and professionalism of written communication. It plays a significant role in various applications such as content creation, learning English as a second language, business communications, and academic writing.

The importance of grammar error correction stems from its ability to ensure clarity and precision in communication. By reducing the incidence of errors, GEC helps maintain the author's credibility and ensures that the message is conveyed effectively without misunderstandings. Additionally, automated grammar correction tools empower individuals by saving time and effort that would otherwise be spent in manual proofreading and editing.

# Project Overview

In this project, titled "Grammar Error Correction Using LLM," I will leverage the capabilities of Large Language Models (LLMs) to fine-tune a model specifically for the task of correcting grammatical errors in English text. Using state-of-the-art techniques in machine learning and natural language processing, this project aims to develop a robust system that can automatically correct a wide array of grammatical mistakes with high accuracy.

# Fine-Tuning the Model

To achieve this, the project involves fine-tuning a pre-trained model on a large dataset of grammatically incorrect sentences paired with their corrected versions. The choice of model for this task is the T5 (Text-to-Text Transfer Transformer) model, known for its effectiveness in handling diverse text-based tasks through a unified text-to-text approach.

## Detailed Breakdown of the Project Components

1. **Data Preprocessing**:
    *  I Have used the data of lang-8 then i have preprocess the data on my local machine
    * The dataset comprises approximately 5 million rows of data, each containing a pair of 'processed_input' (incorrect grammar) and 'processed_output' (corrected grammar).
    * The data is split into training and validation sets, with the training set used to fine-tune the model and the validation set to evaluate its performance.


In [22]:
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Load your dataset
df = pd.read_csv('/kaggle/input/gramer-checker/processed_data.csv')


In [23]:
df.dropna(inplace=True)

In [24]:
# Split data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.1)
 
# Convert dataframes into Hugging Face dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# **Model Selection and Tokenization**:
   * The T5 model is chosen for its versatility and ability to generate text based on the context provided in the input.
   * The T5Tokenizer is used to convert text into tokens that serve as input for the model. This involves encoding the text sequences and setting up the model to interpret these sequences correctly.

In [25]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

def tokenize_function(examples):
    model_inputs = tokenizer(examples['processed_input'], max_length=128, truncation=True, padding="max_length")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples['processed_output'], max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/453477 [00:00<?, ? examples/s]



Map:   0%|          | 0/50387 [00:00<?, ? examples/s]

In [26]:
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-small')


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

# Training the Model:

   * The model is trained using the Hugging Face Trainer API, which simplifies the training process with powerful GPU acceleration and efficient memory management.
   * Training parameters such as batch size, number of epochs, and learning rate are configured to optimize the learning process.

In [29]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()


Epoch,Training Loss,Validation Loss
1,0.1019,0.085451
2,0.0944,0.083351
3,0.1021,0.082894


TrainOutput(global_step=85029, training_loss=0.09249421293084507, metrics={'train_runtime': 14957.6744, 'train_samples_per_second': 90.952, 'train_steps_per_second': 5.685, 'total_flos': 4.603079557958861e+16, 'train_loss': 0.09249421293084507, 'epoch': 3.0})

In [30]:
# Save the trained model and tokenizer
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_tokenizer')


('./saved_tokenizer/tokenizer_config.json',
 './saved_tokenizer/special_tokens_map.json',
 './saved_tokenizer/spiece.model',
 './saved_tokenizer/added_tokens.json')

# Model Evaluation and Correction Function:

   * After training, the model's ability to correct new sentences is evaluated on the validation dataset to ensure that it generalizes well to unseen data.
   * A function correct_grammar is created to take an input sentence, process it through the model, and output the corrected version of the sentence.

In [40]:
# Load the trained model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('/kaggle/working/saved_model')
tokenizer = T5Tokenizer.from_pretrained('/kaggle/working/saved_tokenizer')

def correct_grammar(sentence):
    # Prefixing the input text with 'grammar:' is optional and depends on how the model was trained.
    # If you used a specific prefix during training (like "grammar correction:"), use the same here.
    inputs = tokenizer.encode("grammar: " + sentence, return_tensors="pt", max_length=512, truncation=True)
    
    # Generate output using the model
    outputs = model.generate(inputs, max_length=512, num_beams=5, early_stopping=True)
    
    # Decode the generated ids to a string
    corrected_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return corrected_sentence

# Test the function
sentence = "i was watched tv then my father has call me for do her work"
corrected_sentence = correct_grammar(sentence)
print("Original:", sentence)
print("Corrected:", corrected_sentence)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Original: i was watched tv then my father has call me for do her work
Corrected: grammar: i watched tv then my father called me to do her work


In [32]:
import zipfile
import os

def zip_directory(folder_path, output_path):
    """Zips the contents of an entire directory."""
    with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                # Create a proper archive path by getting the path relative to the folder to be zipped
                archive_path = os.path.relpath(os.path.join(root, file), os.path.join(folder_path, '..'))
                zipf.write(os.path.join(root, file), arcname=archive_path)
    print(f"Created zip file: {output_path}")


In [34]:
# Path to the directory where the model is saved
model_directory = '/kaggle/working/saved_tokenizer'

# Path to the output zip file
zip_output_path = '/kaggle/working/tokenize.zip'

# Call the function
zip_directory(model_directory, zip_output_path)


Created zip file: /kaggle/working/tokenize.zip
