# Fine-Tune a Transformer Model for Grammar Correction

# Installation

First, I will try to fine-tune the "grammer_correction" model from Hugging Face.
Link: https://huggingface.co/HamadML/grammer_correction

Training dataset: https://huggingface.co/datasets/jhu-clsp/jfleg?ref=vennify.ai

In [1]:
! pip install datasets



# Model

In [3]:
# Load the model and tokenizer from Hugging Face
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("HamadML/grammer_correction")
model = AutoModelForSeq2SeqLM.from_pretrained("HamadML/grammer_correction")

# Dataset Collection

In [4]:
# Load the dataset from Hugging Face
from datasets import load_dataset

ds = load_dataset("jhu-clsp/jfleg")


In [5]:
ds.shape

{'validation': (755, 2), 'test': (748, 2)}

In [6]:
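# JFLEG only ships validation and test splits, so the validation split
# is used for training and the test split for evaluation
train_ds = ds['validation']
val_ds = ds['test']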

# Data Examination

In [7]:
train_ds.column_names, val_ds.column_names

(['sentence', 'corrections'], ['sentence', 'corrections'])

In [8]:
print(train_ds["sentence"][0])
print("Corrections : ")
for i in train_ds["corrections"][0]:
    print(f" - {i}")

So I think we can not live if old people could not find siences and tecnologies and they did not developped . 
Corrections : 
 - So I think we would not be alive if our ancestors did not develop sciences and technologies . 
 - So I think we could not live if older people did not develop science and technologies . 
 - So I think we can not live if old people could not find science and technologies and they did not develop . 
 - So I think we can not live if old people can not find the science and technology that has not been developed . 


# Data PreProcessing

We need to structure both the training and the evaluation data in the same format: a CSV file with two columns, input and target. The input column contains grammatically incorrect text, and the target column contains the corrected version of the text from the input column.

Below is code that processes data into the proper format. We must specify the task we wish to perform by adding the same prefix to each input. In this case, **we'll use the prefix "grammar: ".** This is done because T5 models are able to perform multiple tasks like translation and summarization with a single model, and a unique prefix is used for each task so that the model learns which task to perform. We also need to skip over cases that contain a blank string to avoid errors while fine-tuning.

In [9]:
import csv

def generate_csv(csv_path, dataset):
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task prefix to the input
            input_text = "grammar: " + case["sentence"]
            # Write one row per correction of the same sentence
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them
                if input_text and correction:
                    writer.writerow([input_text, correction])



generate_csv("train.csv", train_ds)
generate_csv("eval.csv", val_ds)
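
As a quick check of the row layout, the sketch below applies the same logic as `generate_csv` to a single hand-made record and writes to an in-memory buffer instead of a file. The record and the helper name `rows_for_case` are made up for illustration; they are not part of the JFLEG dataset or of the notebook.

```python
import csv
import io

PREFIX = "grammar: "  # same task prefix as in generate_csv

def rows_for_case(case):
    """Yield one [input, target] row per non-blank correction."""
    input_text = PREFIX + case["sentence"]
    for correction in case["corrections"]:
        if correction:  # skip blank corrections
            yield [input_text, correction]

# Hypothetical record in the same shape as a JFLEG example
sample = {
    "sentence": "She go to school .",
    "corrections": ["She goes to school .", "", "She went to school ."],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
writer.writerows(rows_for_case(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

Each sentence is repeated once per correction, which is why the CSV files have more rows than the dataset has examples.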

# Before Training Evaluation
Evaluate the model before fine-tuning so that the loss can be compared with the loss after training. A lower loss after training indicates that the model has learned.

## Load Data

In [11]:
import pandas as pd

# Load the evaluation data and rename the columns
eval_data = pd.read_csv("eval.csv").rename(columns={"input": "original", "target": "corrected"})
eval_data.head()

Unnamed: 0,original,corrected
0,grammar: New and new technology has been intro...,New technology has been introduced to society .
1,grammar: New and new technology has been intro...,New technology has been introduced into the so...
2,grammar: New and new technology has been intro...,Newer and newer technology has been introduced...
3,grammar: New and new technology has been intro...,Newer and newer technology has been introduced...
4,grammar: One possible outcome is that an envir...,One possible outcome is that an environmentall...


## Tokenize input Sentences

In [12]:
from datasets import Dataset
from transformers import DataCollatorForSeq2Seq

# Convert the pandas DataFrame into a Hugging Face Dataset
eval_dataset = Dataset.from_pandas(eval_data[['original', 'corrected']])

# Tokenize inputs and targets, padding or truncating both to 128 tokens
def tokenize_function(examples):
    inputs = tokenizer(examples["original"], max_length=128, truncation=True, padding="max_length")
    targets = tokenizer(examples["corrected"], max_length=128, truncation=True, padding="max_length")
    # Use the target token ids as labels for the seq2seq loss
    inputs["labels"] = targets["input_ids"]
    return inputs

eval_dataset = eval_dataset.map(tokenize_function, batched=True)
eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)



Map:   0%|          | 0/2988 [00:00<?, ? examples/s]
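
One detail worth knowing about `tokenize_function`: because the targets are padded to `max_length`, the label sequences contain pad token ids, and the loss is then also computed on those padding positions. A common alternative is to replace pad ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on plain lists. `mask_pad_labels` is a hypothetical helper, and the pad id 0 in the example is an assumption (check `tokenizer.pad_token_id`). It is not what the cells in this notebook do, so the loss values reported here would change if it were applied.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(batch_label_ids, pad_token_id):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in batch_label_ids
    ]

# Toy label ids, padded to length 6 with an assumed pad id of 0
labels = [[37, 1712, 19, 1, 0, 0],
          [100, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(labels, pad_token_id=0)
print(masked)
```

Inside `tokenize_function`, this would be applied to `targets["input_ids"]` before assigning `inputs["labels"]`.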

In [31]:
from transformers import Trainer, TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    per_device_eval_batch_size=8,    # Adjust based on available RAM
    fp16=True  # Enable mixed precision training
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    eval_dataset=eval_dataset
)

# Evaluate the model
results = trainer.evaluate()
print(results)





{'eval_loss': 16.770992279052734, 'eval_runtime': 26.2515, 'eval_samples_per_second': 113.822, 'eval_steps_per_second': 14.247}


## GLEU Score

In [None]:
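from nltk.translate.gleu_score import sentence_gleu
from transformers import Trainer, TrainingArguments

import numpy as np
import torch

torch.cuda.empty_cache()

def preprocess_logits_for_metrics(logits, labels):
    # Trainer hands over raw logits (for T5, a tuple whose first item is the LM logits).
    # Reduce them to token ids per batch so the full logits are never accumulated.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

def compute_metrics(eval_pred):
    # predictions are argmax token ids from a teacher-forced forward pass,
    # not the output of model.generate()
    predictions, labels = eval_pred
    # Replace the -100 ignore index (if present) so the ids can be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Sentence-level GLEU of each prediction against its reference, then averaged
    gleu_scores = []
    for pred, label in zip(decoded_preds, decoded_labels):
        gleu_scores.append(sentence_gleu([label.split()], pred.split()))

    average_gleu_score = sum(gleu_scores) / len(gleu_scores)
    return {"gleu": average_gleu_score}

training_args = TrainingArguments(
    output_dir='./results',
    per_device_eval_batch_size=4,  # Reduced batch size
    fp16=True  # Enable mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)

# Evaluate the model
results = trainer.evaluate()
print(results)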
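
To make the metric concrete, here is a small pure-Python sketch of what `sentence_gleu` computes for a single reference, as described in the NLTK documentation: count all 1- to 4-grams in the reference and the hypothesis, then take the minimum of n-gram precision and recall. `simple_gleu` is an illustrative re-implementation, not the NLTK code, and the example sentences are made up.

```python
from collections import Counter

def ngram_counts(tokens, min_n=1, max_n=4):
    """Count every n-gram of length min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def simple_gleu(reference, hypothesis):
    """min(precision, recall) over 1- to 4-gram matches for one reference."""
    ref = ngram_counts(reference.split())
    hyp = ngram_counts(hypothesis.split())
    overlap = sum((ref & hyp).values())          # matching n-grams
    denom = max(sum(ref.values()), sum(hyp.values()))
    return overlap / denom if denom else 0.0

score = simple_gleu("she is going to school", "she are going to school")
print(score)  # 0.5
```

A single wrong word breaks every n-gram that spans it, so here only 7 of the 14 n-grams match.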



# Training

In [13]:
train_data = pd.read_csv("train.csv")
train_data["original"] = train_data["input"]
train_data["corrected"] = train_data["target"]
train_data = train_data.drop(columns=["input", "target"])

# Convert the pandas DataFrame into a Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_data[['original', 'corrected']])

# Tokenize the training data
train_dataset = train_dataset.map(tokenize_function, batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


Map:   0%|          | 0/3016 [00:00<?, ? examples/s]

In [19]:
from transformers import Trainer, TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=8,    # Adjust based on available RAM
    per_device_eval_batch_size=8,     # Adjust based on available RAM
    learning_rate=5e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,  # Enable mixed precision training
    evaluation_strategy="epoch",  # Evaluate after each epoch (called eval_strategy in newer transformers releases)
    save_strategy="epoch",        # Save model after each epoch
    logging_dir='./logs',         # Directory for storing logs
    logging_steps=10,
    save_total_limit=2,           # Limit the total amount of checkpoints
    load_best_model_at_end=True,  # Load the best model at the end of training
    report_to="none"              # Disable all reporting integrations
)




In [20]:
import os

# Disable W&B logging
os.environ["WANDB_DISABLED"] = "true"


In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # If you have an evaluation dataset
    data_collator=data_collator,
    tokenizer=tokenizer
)

# Start training
trainer.train()





Epoch,Training Loss,Validation Loss
1,0.0833,0.080039
2,0.0777,0.079767
3,0.0777,0.079379


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=1131, training_loss=0.21880814693308215, metrics={'train_runtime': 513.3162, 'train_samples_per_second': 17.627, 'train_steps_per_second': 2.203, 'total_flos': 1377462748446720.0, 'train_loss': 0.21880814693308215, 'epoch': 3.0})
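
The `global_step=1131` in the output can be reproduced from the dataset size and the batch size. The sketch assumes a single device and no gradient accumulation (the `TrainingArguments` defaults), and takes the 3016 training rows from the `Map` progress line of the training data.

```python
import math

num_examples = 3016   # rows in train.csv after skipping blank corrections
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 377 1131
```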

In [22]:
# Save the final model
trainer.save_model("final_model")

# Save the tokenizer
tokenizer.save_pretrained("final_tokenizer")


('final_tokenizer/tokenizer_config.json',
 'final_tokenizer/special_tokens_map.json',
 'final_tokenizer/spiece.model',
 'final_tokenizer/added_tokens.json',
 'final_tokenizer/tokenizer.json')


In [24]:
import shutil

# Zip the final_model directory
shutil.make_archive('final_model', 'zip', 'final_model')

# Zip the final_tokenizer directory
shutil.make_archive('final_tokenizer', 'zip', 'final_tokenizer')


'/kaggle/working/final_tokenizer.zip'

In [26]:
from IPython.display import FileLink

# Create download links for the zipped files
FileLink(r'final_model.zip')



# Testing the model 

In [31]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Check if GPU is available and set device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("final_tokenizer")

# Load the model and move it to the correct device
model = AutoModelForSeq2SeqLM.from_pretrained("final_model").to(device)

# Function to correct text
def correct_text(input_text):
    # Tokenize the input text
    inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True, padding="max_length")

    # Move input tensors to the correct device
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Generate prediction
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=128)

    # Decode the output
    corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return corrected_text



In [32]:
# Example input
input_text = "she are going to school by bus."

# Get the corrected output
corrected_text = correct_text(input_text)

print("Original:", input_text)
print("Corrected:", corrected_text)

Original: she are going to school by bus.
Corrected: she is going to school by bus.
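
Note that the training inputs all started with the prefix "grammar: ", while the example above passes the raw sentence. A small helper like the following (hypothetical, not part of the original notebook) keeps inference inputs in the same format as the training data. Whether it changes the output for this particular model has not been checked here.

```python
TASK_PREFIX = "grammar: "  # prefix used when building train.csv and eval.csv

def add_task_prefix(text, prefix=TASK_PREFIX):
    """Prepend the task prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text

prefixed = add_task_prefix("she are going to school by bus.")
print(prefixed)  # grammar: she are going to school by bus.
```

The result can then be passed to `correct_text(prefixed)`.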
