# [Fine-Tune a Transformer Model for Grammar Correction](https://www.vennify.ai/fine-tune-grammar-correction/)
|-> The main objective is to correct the grammar of text (English only).

|-> Fine-tune a pretrained T5 model for the task of grammar correction.

|-> Save the model and run inference.

|-> Further improvement.

# Example:

![](https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png)

# Table of contents:
- Introduction
- Installation
- Data Collection
- Data Examination
- Data Preprocessing
- Before Training Evaluating
- Training
- After Training Evaluating
- Inference

# Introduction:
- In linguistics, the grammar of a natural language is its set of structural constraints on speakers' or writers' composition of clauses, phrases, and words.
- A grammar checker, in computing terms, is a program, or part of a program, that attempts to verify written text for grammatical correctness.
- For grammar correction, we will use the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) (English only).
- T5 was created by Google AI and released to the world for anyone to download and use.
- T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that for training, we always need an input sequence and a corresponding target sequence.
- We'll use a Python package called [Happy Transformer](https://happytransformer.com/).
- Happy Transformer is built on top of Hugging Face's Transformers library and makes it easy to implement and train transformer models with just a few lines of code.
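
To make the text-to-text idea concrete, here is a minimal sketch of how one training pair could be represented: an input sequence carrying a task prefix and a corresponding target sequence. The `grammar: ` prefix matches the one used later in this notebook; the `Seq2SeqExample` class and `make_example` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Seq2SeqExample:
    """One (input, target) pair, as needed for teacher-forced training."""
    input_text: str
    target_text: str


def make_example(sentence, correction, prefix="grammar: "):
    # The prefix tells the text-to-text model which task to perform.
    return Seq2SeqExample(prefix + sentence.strip(), correction.strip())


example = make_example("I has a apple .", "I have an apple .")
print(example.input_text)
print(example.target_text)
```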

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Installation:
- We need to install Happy Transformer with the following command:
- `pip install happytransformer`
- Read more on [PyPI](https://pypi.org/project/happytransformer/)
- [Documentation](https://happytransformer.com/)

In [None]:
!pip install happytransformer

Collecting happytransformer
  Downloading happytransformer-3.0.0-py3-none-any.whl (24 kB)
Collecting datasets<3.0.0,>=2.13.1 (from happytransformer)
  Downloading datasets-2.16.0-py3-none-any.whl (507 kB)
Collecting sentencepiece (from happytransformer)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting accelerate<1.0.0,>=0.20.1 (from happytransformer)
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Collecting wandb (from happytransformer)
  Downloading wandb-0.16.1-py3-none-any.whl (2.1 MB)

In [None]:
!pip uninstall tensorflow

Found existing installation: tensorflow 2.12.0
Uninstalling tensorflow-2.12.0:
  Would remove:
    /usr/local/bin/estimator_ckpt_converter
    /usr/local/bin/import_pb_to_tensorboard
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python3.10/dist-packages/tensorflow-2.12.0.dist-info/*
    /usr/local/lib/python3.10/dist-packages/tensorflow/*
Proceed (Y/n)? y
  Successfully uninstalled tensorflow-2.12.0


In [None]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.15.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow)
  Downloading tensorboard-2.15.1-py3-none-any.whl (5.5 MB)
Collecting tensorflow-estimator<2.16,>=2.15.0 (from tensorflow)
  Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl (441 kB)
Collecting keras<2.16,>=2.15.0 (from tensorflow)
  Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
Collecting google-auth-oa

In [None]:
# The module must be imported before its version can be printed.
import tensorflow

print(tensorflow.__version__)

In [None]:
""" Imports are mentioned here """

import csv
from datasets import load_dataset
from happytransformer import TTSettings
from happytransformer import TTTrainArgs
from happytransformer import HappyTextToText

# Model
- T5 comes in several different sizes, and we'll use the base model, which has 220 million parameters.
- T5 is a text-to-text model, meaning that, given text, it generates a standalone piece of text based on the input.
- Thus, we'll import a class called HappyTextToText from Happy Transformer, which we'll use to load the model.
- We'll provide the model type (T5) as the first positional parameter and the model name (t5-base) as the second.
- If you want to read more about T5, see the Additional Resources section below.


In [None]:
""" Model """

happy_tt = HappyTextToText("T5", "t5-base")

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


# Data Collection
- The [dataset](https://huggingface.co/datasets/jfleg) is available on Hugging Face's datasets distribution network and can be accessed using their Datasets library.
- Since this library is a dependency for Happy Transformer, we do not need to install it and can go straight to importing a function called load_dataset from the library.  

In [None]:
train_dataset = load_dataset("jfleg", split='validation[:]')

eval_dataset = load_dataset("jfleg", split='test[:]')

Downloading data:   0%|          | 0.00/148k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/141k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/755 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/748 [00:00<?, ? examples/s]

# Data Examination  
- We just successfully downloaded the dataset.
- Let's now explore it by iterating over some cases. Both the train and eval datasets are structured the same way and have two features: sentence and corrections.
- The sentence feature contains a single string for each case, while the corrections feature contains a list of 4 human-generated corrections.

In [None]:
for case in train_dataset["corrections"][:2]:
    print(case)
    print(case[0])
    print("--------------------------------------------------------")

['So I think we would not be alive if our ancestors did not develop sciences and technologies . ', 'So I think we could not live if older people did not develop science and technologies . ', 'So I think we can not live if old people could not find science and technologies and they did not develop . ', 'So I think we can not live if old people can not find the science and technology that has not been developed . ']
So I think we would not be alive if our ancestors did not develop sciences and technologies . 
--------------------------------------------------------
['Not for use with a car . ', 'Do not use in the car . ', 'Car not for use . ', 'Can not use the car . ']
Not for use with a car . 
--------------------------------------------------------


# Data Preprocessing  
- Now, we must process the data into the proper format for Happy Transformer.
- We need to structure both the training and evaluation data in the same format: a CSV file with two columns, input and target.
- The input column contains grammatically incorrect text, and the target column contains the corrected version of the text from the input column.

In [None]:
def generate_csv(csv_path, dataset):
    """Write one (input, target) row per correction to a CSV file."""
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task prefix to the input.
            input_text = "grammar: " + case["sentence"]
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them.
                if input_text and correction:
                    writer.writerow([input_text, correction])

In [None]:
generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)
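
As a quick sanity check on the format, the sketch below applies the same row-writing logic to two made-up cases and reads the result back. It writes to an in-memory buffer instead of a file, and `toy_cases` and `write_pairs` are hypothetical names; the point is only to show that each sentence expands into one row per non-blank correction.

```python
import csv
import io

# Made-up cases mimicking the dataset's structure (not real JFLEG data).
toy_cases = [
    {"sentence": "She go to school .",
     "corrections": ["She goes to school .", "She went to school .",
                     "", "She is going to school ."]},
    {"sentence": "He have a car .",
     "corrections": ["He has a car .", "He has a car .",
                     "He owns a car .", "He had a car ."]},
]


def write_pairs(csvfile, dataset):
    """Same logic as generate_csv, but writes to an open file-like object."""
    writer = csv.writer(csvfile)
    writer.writerow(["input", "target"])
    for case in dataset:
        input_text = "grammar: " + case["sentence"]
        for correction in case["corrections"]:
            if input_text and correction:  # skip blank corrections
                writer.writerow([input_text, correction])


buffer = io.StringIO(newline="")
write_pairs(buffer, toy_cases)
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), "rows; first row:", rows[0])
```

Here the blank correction in the first case is dropped, so 8 candidate pairs become 7 rows.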

# Before Training Evaluating
- We'll evaluate the model before and after fine-tuning using a common metric called loss.
- Loss can be described as how "wrong" the model's predictions are compared to the correct answers.
- So, if the loss decreases after fine-tuning, then that suggests the model learned.
- It's important that we use separate data for training and evaluating to show that the model can generalize its obtained knowledge to solve unseen cases.
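
To give "loss" a concrete shape, here is a small worked sketch. It assumes the reported loss is the average token-level cross-entropy, which is the usual training objective for T5 in Transformers; the notebook itself does not state this, and the probabilities below are invented for illustration.

```python
import math


def mean_cross_entropy(correct_token_probs):
    """Average negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


# Made-up probabilities a model assigns to the correct next token.
confident = mean_cross_entropy([0.9, 0.8, 0.95])  # mostly right
unsure = mean_cross_entropy([0.3, 0.2, 0.25])     # mostly wrong

print(f"confident model loss: {confident:.4f}")
print(f"unsure model loss:    {unsure:.4f}")
```

The more probability the model puts on the correct tokens, the closer the loss gets to zero.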

In [None]:
before_result = happy_tt.eval("eval.csv")

Generating eval split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2988 [00:00<?, ? examples/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


- The result is a dataclass object with a single variable called loss, which we can isolate as shown below.

In [None]:
print("Before loss:", before_result.loss)

Before loss: 1.2803919315338135


# Training
- Let's now train the model.
- We can do so by calling happy_tt's train() method.
- For simplicity, we'll use the default parameters other than the batch size, which we'll increase to 8.
- If you experience an out-of-memory error, I suggest you reduce the batch size.
- You can visit this [webpage](https://happytransformer.com/text-to-text/finetuning/) to learn how to modify various parameters like the learning rate and the number of epochs.

In [None]:
args = TTTrainArgs(batch_size=8)
happy_tt.train("train.csv", args=args)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2714 [00:00<?, ? examples/s]

Map:   0%|          | 0/302 [00:00<?, ? examples/s]

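| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1    | 1.5583        | 1.120213        |
| 34   | 0.8185        | 0.667035        |
| 68   | 0.7035        | 0.584954        |
| 102  | 0.6639        | 0.555662        |
| 136  | 0.6128        | 0.533301        |
| 170  | 0.6313        | 0.515759        |
| 204  | 0.568         | 0.513847        |
| 238  | 0.6002        | 0.501848        |
| 272  | 0.5771        | 0.498818        |
| 306  | 0.6506        | 0.495666        |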
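
The step numbers in the log can be reproduced with a little arithmetic. The sketch below assumes that 2714 (the larger `Map` count above) is the number of training examples, that training runs for a single epoch, and that evaluation happens roughly every tenth of the run; these are inferences from the output, not documented facts.

```python
import math

train_examples = 2714  # assumption: the larger "Map" count is the training split
batch_size = 8         # from TTTrainArgs(batch_size=8)
num_epochs = 1         # assumption: a single epoch

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_interval = total_steps // 10  # assumption: evaluate about every 10% of training

print("total steps:", total_steps)
print("evaluation every", eval_interval, "steps")
```

Under these assumptions the evaluation interval matches the spacing of the logged steps (34, 68, 102, and so on).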


# After Training Evaluating
- Like before, let's determine the model's loss.

In [None]:
after_result = happy_tt.eval("eval.csv")

print("After loss: ", after_result.loss)

Map:   0%|          | 0/2988 [00:00<?, ? examples/s]

After loss:  0.47910746932029724
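
To put the two numbers side by side, the sketch below computes the relative reduction in loss and the corresponding perplexities. It assumes the loss is a mean cross-entropy measured in nats, so that perplexity is `exp(loss)`; that interpretation is not confirmed by the notebook.

```python
import math

loss_before = 1.2803919315338135  # reported before fine-tuning
loss_after = 0.47910746932029724  # reported after fine-tuning

relative_drop = (loss_before - loss_after) / loss_before

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexity_before = math.exp(loss_before)
perplexity_after = math.exp(loss_after)

print(f"Relative loss reduction: {relative_drop:.1%}")
print(f"Perplexity before: {perplexity_before:.2f}, after: {perplexity_after:.2f}")
```

The loss falls by roughly 63% on the held-out evaluation file, which suggests the fine-tuning generalized beyond the training cases.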


# Inference
- Let's now use the model to correct the grammar of examples we'll provide it.
- To accomplish this, we'll use happy_tt's generate_text() method.
- We'll also use an algorithm called beam search for the generation.
- You can view the different text generation parameters you can modify on this [webpage](https://happytransformer.com/text-to-text/settings/), along with different configurations you could use for common algorithms.
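
For intuition about beam search, here is a toy, self-contained sketch. It is not how Happy Transformer or Transformers implement generation: `TOY_MODEL` is a made-up lookup table of next-token probabilities, and the sketch ignores end-of-sequence handling and length penalties. It only shows why keeping several candidates (`num_beams`) can find a more probable sequence than always taking the single best next token.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.5, "y": 0.3, "z": 0.2},
}


def beam_search(model, num_beams, steps):
    """Return the best (sequence, log-probability) after a fixed number of steps."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, steps=2)

print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

With one beam the search commits to "A" first and ends with a less probable sequence; with two beams it keeps "B" alive long enough to find the better continuation.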

In [None]:
beam_settings = TTSettings(num_beams=5, min_length=1, max_length=20)

In [None]:
""" Example1: """
example_1 = "grammar: This sentences, has bads grammar and spelling!"
result_1 = happy_tt.generate_text(example_1, args=beam_settings)
print(result_1.text)

This sentences, has bad grammar and spelling!


In [None]:
""" Example2: """

example_2 = "grammar: I am enjoys, writtings articles ons AI and I also enjoyed write articling on AI."

result_2 = happy_tt.generate_text(example_2, args=beam_settings)
print(result_2.text)

I enjoy writing articles on AI and I also enjoyed writing articles on AI.


In [None]:
happy_tt.save('/content/drive/MyDrive/mymodel')

# Further Improvement:
- I suggest transferring some of the evaluation cases to the training data and then optimizing the hyperparameters with a technique like grid search.
- You can then include the evaluation cases in the training set to fine-tune a final model using your best set of hyperparameters.
- We can also try other languages to add multilingual support.
- Add custom layers to refine the output.
- Try other models as well.
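
The grid search idea can be sketched as follows. `train_and_eval` is a hypothetical stand-in that returns a made-up loss so the example runs on its own; in practice it would fine-tune a fresh model with the given settings and return its evaluation loss. The candidate values in `grid` are also arbitrary.

```python
import math
from itertools import product


def train_and_eval(learning_rate, batch_size):
    """Hypothetical stand-in: returns a made-up loss instead of training a model."""
    return abs(math.log10(learning_rate) + 4) + abs(batch_size - 8) / 16


# Arbitrary candidate values, chosen only for illustration.
grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "batch_size": [4, 8, 16],
}

results = []
for lr, bs in product(grid["learning_rate"], grid["batch_size"]):
    config = {"learning_rate": lr, "batch_size": bs}
    results.append((config, train_and_eval(lr, bs)))

# Pick the configuration with the lowest loss.
best_config, best_loss = min(results, key=lambda r: r[1])
print("best configuration:", best_config)
```

Every combination is tried once, so the cost grows with the product of the list lengths; keeping the grid small matters when each trial is a full fine-tuning run.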

# Additional Resources:
- [Transformers](https://towardsdatascience.com/transformers-89034557de14)
- [T5](https://paperswithcode.com/method/t5)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
- [Hugging Face](https://huggingface.co/)

In [None]:
mymodel = HappyTextToText("T5", model_name="vennify/t5-base-grammar-correction")

config.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

In [None]:
example_2 = "grammar: Me and him went to the store yesterday and we buyed a new phone. It was so excited! We didn't seen that model before, but it was on sale so we couldn't resist. Then, we brings it home and played with it all night. I think I will uses it for everything now, it's way better than my old phone"

result_2 = mymodel.generate_text(example_2, args=beam_settings)
print(result_2.text)

Me and him went to the store yesterday and bought a new phone. It was so excited
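
The output above stops partway through the text, most likely because `beam_settings` caps generation at `max_length=20`. One possible workaround, sketched under that assumption, is to split the input into sentences and correct each one separately. The regex-based splitter and the helper name `split_into_prompts` are illustrative and not part of Happy Transformer.

```python
import re


def split_into_prompts(text, prefix="grammar: "):
    """Split text on sentence-ending punctuation and prefix each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [prefix + sentence for sentence in sentences if sentence]


text = ("Me and him went to the store yesterday and we buyed a new phone. "
        "It was so excited! We didn't seen that model before.")
prompts = split_into_prompts(text)

for prompt in prompts:
    # With the model loaded, each prompt could be passed to
    # mymodel.generate_text(prompt, args=beam_settings) and the results joined.
    print(prompt)
```

Raising `max_length` in `TTSettings` is the other obvious option; splitting keeps each generation short but loses context across sentences.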


# The End

In [1]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large")

  from .autonotebook import tqdm as notebook_tqdm
tokenizer_config.json: 100%|██████████| 2.50k/2.50k [00:00<?, ?B/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
spiece.model: 100%|██████████| 792k/792k [00:05<00:00, 140kB/s]
tokenizer.json: 100%|██████████| 2.42M/2.42M [00:01<00:00, 1.45MB/s]
special_tokens_map.json: 100%|██████████| 2.20k/2.20k [00:00<?, ?B/s]
config.json: 100%|██████████| 787/787 [00:00<?, ?B/s] 
model.safetensors: 100%|██████████| 3.13G/3.13G [03:19<00:00, 15.7MB/s]
generation_config.json: 100%|██████████| 142/142 [00:00<?, ?B/s] 


In [2]:
# from_pt is a loading option, not a saving option, so it is dropped here.
model.save_pretrained('C:/Users/hitor/Github/grammar-correction-flask/t5normal')

In [3]:
new_model = T5ForConditionalGeneration.from_pretrained("C:/Users/hitor/Github/grammar-correction-flask/t5normal")

In [4]:
input_text = "Me and him went to the store yesterday and we buyed a new phone. It was so excited! We didn't seen that model before, but it was on sale so we couldn't resist. Then, we brings it home and played with it all night. I think I will uses it for everything now, it's way better than my old phone."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = new_model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

In [5]:
print(edited_text)

We went to the store yesterday and bought a new phone. We hadn't seen it before, but it was on sale. We took it home and used it all night. I think I will use it for everything now.
