**Install the required libraries**

In [None]:
!pip install happytransformer
from IPython.display import clear_output
clear_output()

**Import the required packages**

In [None]:
import csv
from datasets import load_dataset
from happytransformer import TTSettings
from happytransformer import TTTrainArgs
from happytransformer import HappyTextToText

**Model**

In [None]:
happy_tt = HappyTextToText("T5", "t5-base")


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


**Data Collection**

In [None]:
train_dataset = load_dataset("jfleg", split='validation[:]')

eval_dataset = load_dataset("jfleg", split='test[:]')


Generating validation split:   0%|          | 0/755 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/748 [00:00<?, ? examples/s]

**Data Examination**

In [None]:
for case in train_dataset["corrections"][:2]:
    print(case)
    print(case[0])
    print("--------------------------------------------------------")

['So I think we would not be alive if our ancestors did not develop sciences and technologies . ', 'So I think we could not live if older people did not develop science and technologies . ', 'So I think we can not live if old people could not find science and technologies and they did not develop . ', 'So I think we can not live if old people can not find the science and technology that has not been developed . ']
So I think we would not be alive if our ancestors did not develop sciences and technologies . 
--------------------------------------------------------
['Not for use with a car . ', 'Do not use in the car . ', 'Car not for use . ', 'Can not use the car . ']
Not for use with a car . 
--------------------------------------------------------


**Data Preprocessing**

In [None]:
def generate_csv(csv_path, dataset):
    """Write one (input, target) row per non-blank correction in the dataset."""
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task prefix to the input sentence.
            input_text = "grammar: " + case["sentence"]
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them.
                if input_text and correction:
                    writer.writerow([input_text, correction])

In [None]:
generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)
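To make the resulting CSV layout concrete, here is a minimal sketch of how a single record expands into rows. The `rows_for_case` helper and the `sample` record are hypothetical: they are shaped like a JFLEG example but are not taken from the dataset, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io


def rows_for_case(case, prefix="grammar: "):
    """Yield (input, target) pairs for one record, skipping blank corrections."""
    input_text = prefix + case["sentence"]
    for correction in case["corrections"]:
        if correction:
            yield input_text, correction


# Hypothetical record shaped like a JFLEG example (not real dataset content).
sample = {
    "sentence": "She go to school .",
    "corrections": ["She goes to school .", "", "She went to school ."],
}

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
writer.writerows(rows_for_case(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

Each source sentence is repeated once per non-blank correction, which is why the CSV files hold several times as many rows as the dataset has examples.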

**Evaluation Before Training**

In [None]:
before_result = happy_tt.eval("eval.csv")


Map:   0%|          | 0/2988 [00:00<?, ? examples/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [None]:
print("Before loss:", before_result.loss)

Before loss: 1.2803919315338135


**Training**

In [None]:
args = TTTrainArgs(batch_size=8)
happy_tt.train("train.csv", args=args)


Map:   0%|          | 0/2714 [00:00<?, ? examples/s]

Map:   0%|          | 0/302 [00:00<?, ? examples/s]

| Step | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.3217 | 1.138514 |
| 34 | 0.834 | 0.685736 |
| 68 | 0.7519 | 0.577079 |
| 102 | 0.6762 | 0.543681 |
| 136 | 0.6373 | 0.523778 |
| 170 | 0.64 | 0.51312 |
| 204 | 0.6128 | 0.503484 |
| 238 | 0.6141 | 0.498517 |
| 272 | 0.5694 | 0.495701 |
| 306 | 0.5508 | 0.493465 |


**Evaluation After Training**

In [None]:
after_result = happy_tt.eval("eval.csv")

print("After loss: ", after_result.loss)

Map:   0%|          | 0/2988 [00:00<?, ? examples/s]

After loss:  0.47985807061195374
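One way to read the two numbers is to convert them to perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for T5 training but is not stated in the output above. The loss values are copied from the two evaluation cells.

```python
import math

# Loss values copied from the evaluation outputs above.
before_loss = 1.2803919315338135
after_loss = 0.47985807061195374

# Assumption: the loss is mean token-level cross-entropy in nats,
# so exp(loss) gives the per-token perplexity.
before_ppl = math.exp(before_loss)
after_ppl = math.exp(after_loss)

# Fraction by which the evaluation loss dropped after fine-tuning.
relative_drop = (before_loss - after_loss) / before_loss

print(f"Perplexity before: {before_ppl:.2f}")
print(f"Perplexity after:  {after_ppl:.2f}")
print(f"Relative loss reduction: {relative_drop:.1%}")
```

Under that assumption, perplexity falls from roughly 3.6 to roughly 1.6 on the evaluation set.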


In [None]:
beam_settings = TTSettings(num_beams=5, min_length=1, max_length=20)

**Inference**

In [None]:
example_1 = "grammar: This sentences, has bads grammar and spelling!"
result_1 = happy_tt.generate_text(example_1, args=beam_settings)
print(result_1.text)

This sentence has bad grammar and spelling!


In [None]:
example_2 = "grammar: I am enjoys, writtings articles ons AI and I also enjoyed write articling on AI."
result_2 = happy_tt.generate_text(example_2, args=beam_settings)
print(result_2.text)

I enjoy writing articles on AI and I also enjoyed writing articles on AI.
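The two cells above repeat the same steps: add the `grammar: ` prefix, call `generate_text`, and read `.text`. A small wrapper can do this for a list of sentences. `correct_sentences` is a hypothetical helper, not part of Happy Transformer; it only relies on the `generate_text(text, args=...)` call and the `.text` attribute already used in this notebook.

```python
def correct_sentences(model, sentences, settings=None, prefix="grammar: "):
    """Return corrected versions of `sentences`.

    `model` is expected to behave like the fine-tuned `happy_tt` object:
    it must expose generate_text(text, args=...) returning an object
    with a `.text` attribute.
    """
    corrected = []
    for sentence in sentences:
        text = prefix + sentence.strip()
        if settings is None:
            # Fall back to the model's own default generation settings.
            result = model.generate_text(text)
        else:
            result = model.generate_text(text, args=settings)
        corrected.append(result.text)
    return corrected


# Usage with the objects defined earlier in this notebook:
# corrected = correct_sentences(
#     happy_tt,
#     ["This sentences, has bads grammar and spelling!"],
#     settings=beam_settings,
# )
```

Note that `beam_settings` uses `max_length=20`, so corrections longer than 20 tokens would likely be cut off; a larger value may be needed for longer inputs.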
