**A SPELLING CORRECTOR**


Many English words confuse learners of the language and tend to be misspelt in writing. Examples are homophones (words that are pronounced the same but spelt differently, such as *their* and *there*) and homographs (words that are spelt the same but differ in meaning). The goal of this project is to build an automatic spelling corrector that corrects misspelt English words with the aid of deep learning and PyTorch.

The Transformer is a state-of-the-art deep learning architecture that produces excellent results on NLP sequence-to-sequence tasks because it learns the context of sentences and tracks the relationships between linguistic units in sequential data.

For our model, we will fine-tune T5, an encoder-decoder language model that takes text as input and returns text as output, which suits our need. The T5 model has been pre-trained on huge amounts of text-to-text data.

The sections below show how the spelling corrector was built with a T5 model.

# **Install libraries**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install datasets tqdm pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
Collecting responses<0.19
  Downloa

In [None]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.26.0


In [None]:
!pip install wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.13.9-py2.py3-none-any.whl (2.0 MB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... done
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.30-py3-none-any.whl (184 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.14.0-py2.py3-none-any.whl (178 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting setproctitle
  Downloading setproctitle-1.3.2-cp38

# **Import Libraries**

In [None]:
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm

In [None]:
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)
 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# **Load The Dataset**

I collected a list of commonly confused English words from https://github.com/cameronehrlich/homz and modified it by creating a CSV file with `input` and `output` columns. The `input` column holds a sentence that is ungrammatical in context because it contains a misspelt English word, and the `output` column holds the same sentence with the correct word for that context.
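To make the shape of the CSV concrete, here is a minimal sketch of how one (input, output) pair could be produced from a correct sentence. It is not how the dataset above was built (that was done by hand); the `CONFUSIONS` mapping and the helper names `make_pair` and `to_csv` are hypothetical and only illustrate the two-column layout.

```python
import csv
import io

# Hypothetical confusion pairs: correct word -> commonly confused spelling.
CONFUSIONS = {"there": "their", "four": "for", "sight": "site"}


def make_pair(correct_sentence, confusions=CONFUSIONS):
    """Return (input, output), where input has one confusable word swapped."""
    words = correct_sentence.split()
    for i, word in enumerate(words):
        wrong = confusions.get(word.lower())
        if wrong is not None:
            corrupted = words[:i] + [wrong] + words[i + 1:]
            return " ".join(corrupted), correct_sentence
    return None  # no confusable word found, so no pair is produced


def to_csv(pairs):
    """Serialise pairs into CSV text with `input` and `output` columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["input", "output"])
    writer.writerows(pairs)
    return buffer.getvalue()


sentences = ["I have four in my family", "We lost sight of them"]
pairs = [pair for pair in map(make_pair, sentences) if pair]
print(to_csv(pairs))
```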

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Spelling/data.csv')
df.shape

(1245, 2)

In [None]:
df = df.dropna()
df.shape

(1245, 2)

In [None]:
df.head()

|   | input | output |
|---|---|---|
| 0 | I have for in my Family Dad Mum and sister . | I have four in my Family Dad Mum and sister . |
| 1 | My Dad work's at Melton | My Dad works at Melton |
| 2 | My siter go to Tonbury | My sister goes to Tonbury . |
| 3 | My Mum goes out some_times. | My Mum goes out sometimes. |
| 4 | i go too the Youth clob. | i go to the Youth club . |


# **Preprocess the dataset for T5 Model**

Preprocess the data into the format the model expects: instantiate the tokenizer of the T5-base model, then return for each example a dictionary with `input_ids` (the token ids) and `attention_mask` (the attention mask) arrays.
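The structure of that dictionary can be illustrated without downloading any model. The sketch below uses a toy whitespace tokenizer, which is an assumption made purely for illustration: the real T5 tokenizer is a SentencePiece subword model and produces different ids. Only the pad id `0` and end-of-sequence id `1` match the T5 configuration shown later in this notebook; the function name `toy_tokenize` is hypothetical.

```python
def toy_tokenize(text, vocab, max_length=8, pad_id=0, eos_id=1):
    """Map words to ids, append EOS, pad to max_length, and build the mask."""
    # Unknown words get the next free id; ids 0 and 1 are reserved (pad, EOS).
    ids = [vocab.setdefault(word, len(vocab) + 2) for word in text.split()]
    ids = ids[: max_length - 1] + [eos_id]
    attention_mask = [1] * len(ids)  # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


encoded = toy_tokenize("I miss hit", vocab={})
print(encoded["input_ids"])       # [2, 3, 4, 1, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```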

In [None]:
from transformers import (
    T5ForConditionalGeneration, T5Tokenizer, 
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
  )

from torch.utils.data import Dataset, DataLoader
     

In [None]:
model_name = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



In [None]:
def calculate_token_length(example):
  return len(tokenizer(example).input_ids)

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape

((1120, 2), (125, 2))
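The 1120/125 split follows from the 1,245 rows and `test_size=0.10`. The sketch below reproduces the arithmetic under the assumption that the test share is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(1245, 0.10))  # (1120, 125)
```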

In [None]:
test_df['input_token_length'] = test_df['input'].apply(calculate_token_length)

In [None]:
test_df.head()

|   | input | output | input_token_length |
|---|---|---|---|
| 784 | I hope this letter will chair you up . | I hope this letter will cheer you up . | 11 |
| 509 | I miss hit | I miss it | 4 |
| 1162 | THE man whore a facemark | THE man wore a facemask | 9 |
| 483 | I hate the food that you served | I ate the food that you served | 8 |
| 701 | The shepherd soon lost site of them in the dar... | The shepherd soon lost sight of them in the da... | 12 |


In [None]:
test_df['input_token_length'].describe()

count    125.000000
mean      13.120000
std        7.978398
min        4.000000
25%        8.000000
50%       12.000000
75%       16.000000
max       50.000000
Name: input_token_length, dtype: float64
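These statistics are what justify a sequence limit such as the `max_len = 64` used in `SpellingDataset` below: the longest test input has 50 tokens, so nothing should be cut off. A small sketch of that check follows; `truncation_rate` is a hypothetical helper, and the sample lengths are the five values from `test_df.head()` plus the reported maximum, not the full column.

```python
def truncation_rate(token_lengths, max_len):
    """Fraction of examples whose token count exceeds max_len."""
    too_long = sum(1 for n in token_lengths if n > max_len)
    return too_long / len(token_lengths)


# Five lengths from the table above plus the reported maximum of 50.
lengths = [11, 4, 9, 8, 12, 50]
for candidate in (16, 32, 64):
    print(candidate, truncation_rate(lengths, candidate))
```

On the real data, the same function could be called with `test_df['input_token_length']` to compare candidate limits.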

In [None]:
# convert the train and test DataFrames into Hugging Face Dataset objects
from datasets import Dataset
train_data = Dataset.from_pandas(train_df)
test_data = Dataset.from_pandas(test_df)

In [None]:
test_data

Dataset({
    features: ['input', 'output', 'input_token_length', '__index_level_0__'],
    num_rows: 125
})

In [None]:
from torch.utils.data import Dataset, DataLoader
class SpellingDataset(Dataset):
    def __init__(self, dataset, tokenizer,print_text=False):         
        self.dataset = dataset
        self.pad_to_max_length = False
        self.tokenizer = tokenizer
        self.print_text = print_text
        self.max_len = 64
  
    def __len__(self):
        return len(self.dataset)


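    def tokenize_data(self, example):
        input_, target_ = example['input'], example['output']

        # `padding` replaces the deprecated `pad_to_max_length` argument
        padding = 'max_length' if self.pad_to_max_length else False

        # tokenize inputs with the tokenizer passed to the constructor;
        # truncation=True makes the cut-off at max_len explicit
        tokenized_inputs = self.tokenizer(input_, padding=padding,
                                          truncation=True,
                                          max_length=self.max_len,
                                          return_attention_mask=True)

        # tokenize targets the same way; their ids become the labels
        tokenized_targets = self.tokenizer(target_, padding=padding,
                                           truncation=True,
                                           max_length=self.max_len,
                                           return_attention_mask=True)

        inputs = {"input_ids": tokenized_inputs['input_ids'],
                  "attention_mask": tokenized_inputs['attention_mask'],
                  "labels": tokenized_targets['input_ids']}

        return inputs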

  
    def __getitem__(self, index):
        inputs = self.tokenize_data(self.dataset[index])
        
        if self.print_text:
            for k in inputs.keys():
                print(k, len(inputs[k]))

        return inputs
     

In [None]:
dataset = SpellingDataset(test_data, tokenizer, True)
print(dataset[12])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


input_ids 34
attention_mask 34
labels 34
{'input_ids': [27, 1219, 33, 116, 27, 3666, 45, 82, 4537, 7, 16, 8, 7788, 27, 31, 26, 394, 253, 82, 385, 7669, 4354, 42, 255, 228, 1605, 160, 2039, 67, 274, 160, 2053, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [27, 1219, 160, 116, 27, 3666, 45, 82, 4537, 7, 16, 8, 7788, 27, 31, 26, 394, 253, 82, 385, 7669, 4354, 42, 255, 228, 1605, 160, 2039, 67, 274, 160, 2053, 5, 1]}


# **Getting Ready to Train the Model**

**Evaluation Metrics**

ROUGE, computed with the *rouge_score* package, is the metric that will be used to evaluate the model during training.
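To give a feel for what ROUGE measures, here is a simplified ROUGE-1 F-measure (unigram overlap between prediction and reference). It is a sketch under stated assumptions: it lowercases and splits on whitespace, with no stemming, so its numbers will not match the `rouge_score` package exactly. The function name `rouge1_f` is made up for this example.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure based on whitespace unigrams."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("I miss hit", "I miss it"))
print(rouge1_f("I miss it", "I miss it"))
```

Because a misspelt sentence usually differs from its correction by a single word, an uncorrected prediction can still score fairly high, so the metric is best read alongside actual example outputs.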

In [None]:
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=ab6beaa0dd9c90da09211c20891d959cfbd396393b5c29f0ed7bed1630078ab6
  Stored in directory: /root/.cache/pip/wheels/24/55/6f/ebfc4cb176d1c9665da4e306e1705496206d08215c1acd9dde
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
from datasets import load_metric
rouge_metric = load_metric("rouge")


# **Defining Arguments for Training**

The arguments needed for training the model were defined, and the Seq2Seq trainer was instantiated with them.

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='longest', return_tensors='pt')
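With `padding='longest'`, every batch is padded only up to its longest example. The pure-Python sketch below shows the idea; it is not the collator's actual code and leaves out tensor conversion and the decoder inputs the real collator can add. The helper name `pad_batch` is hypothetical. The label padding value `-100` is the default the collator uses so that padded label positions are ignored by the loss, which is why `compute_metrics` below swaps `-100` back to the pad token before decoding.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids, attention_mask and labels to the longest item in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        extra_in = max_input - len(f["input_ids"])
        extra_lab = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * extra_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * extra_in)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * extra_lab)
    return batch


# Two made-up tokenized examples of different lengths.
features = [
    {"input_ids": [27, 1], "attention_mask": [1, 1], "labels": [27, 5, 1]},
    {"input_ids": [27, 1219, 5, 1], "attention_mask": [1, 1, 1, 1], "labels": [27, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[27, 1, 0, 0], [27, 1219, 5, 1]]
print(batch["labels"])     # [[27, 5, 1], [27, 1, -100]]
```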

In [None]:
batch_size = 16
args = Seq2SeqTrainingArguments(output_dir="/content/drive/MyDrive/Spelling/weights",
                        evaluation_strategy="steps",
                        per_device_train_batch_size=batch_size,
                        per_device_eval_batch_size=batch_size,
                        learning_rate=2e-5,
                        num_train_epochs=1,
                        weight_decay=0.01,
                        save_total_limit=2,
                        predict_with_generate=True,
                        gradient_accumulation_steps = 6,
                        eval_steps = 500,
                        save_steps = 500,
                        load_best_model_at_end=True,
                        logging_dir="/logs",
                        report_to="wandb")

In [None]:
import nltk
nltk.download('punkt')
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
     

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# defining trainer using 🤗
trainer = Seq2SeqTrainer(model=model, 
                args=args, 
                train_dataset= SpellingDataset(train_data, tokenizer),
                eval_dataset=SpellingDataset(test_data, tokenizer),
                tokenizer=tokenizer,
                data_collator=data_collator,
                compute_metrics=compute_metrics)

# **Fine-tune the T5 Model**

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1120
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 96
  Gradient Accumulation steps = 6
  Total optimization steps = 11
  Number of trainable parameters = 222903552
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=11, training_loss=1.5620061700994319, metrics={'train_runtime': 31.5312, 'train_samples_per_second': 35.52, 'train_steps_per_second': 0.349, 'total_flos': 50048726630400.0, 'train_loss': 1.5620061700994319, 'epoch': 0.94})
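The log reports `Total optimization steps = 11`, and that figure can be reproduced from the training arguments. The sketch below is my reading of how the step count comes out, not the Trainer's actual code; `optimization_steps` is a hypothetical helper.

```python
import math


def optimization_steps(num_examples, batch_size, grad_accum, epochs=1):
    """Estimate optimizer updates: batches per epoch, grouped by accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


steps = optimization_steps(num_examples=1120, batch_size=16, grad_accum=6)
print(steps)         # 11
print(steps >= 500)  # False
```

Since 11 is far below `eval_steps = 500` and `save_steps = 500`, no intermediate evaluation appears to have been triggered, which would be consistent with the empty Step / Training Loss / Validation Loss table above. Lowering `eval_steps` or training for more epochs would be one way to get validation metrics during training.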

In [None]:
trainer.save_model('t5_spell_model')

Saving model checkpoint to t5_spell_model
Configuration saved in t5_spell_model/config.json
Configuration saved in t5_spell_model/generation_config.json
Model weights saved in t5_spell_model/pytorch_model.bin
tokenizer config file saved in t5_spell_model/tokenizer_config.json
Special tokens file saved in t5_spell_model/special_tokens_map.json


In [None]:
!zip -r 't5_spell_model.zip' 't5_spell_model'

  adding: t5_spell_model/ (stored 0%)
  adding: t5_spell_model/special_tokens_map.json (deflated 86%)
  adding: t5_spell_model/tokenizer_config.json (deflated 82%)
  adding: t5_spell_model/pytorch_model.bin (deflated 13%)
  adding: t5_spell_model/spiece.model (deflated 48%)
  adding: t5_spell_model/generation_config.json (deflated 29%)
  adding: t5_spell_model/training_args.bin (deflated 49%)
  adding: t5_spell_model/config.json (deflated 62%)


In [None]:
!mv t5_spell_model.zip /content/drive/MyDrive/Spelling


# **Test the Model**

In [None]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 't5_spell_model'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_spelling(input_text, num_return_sequences):
    """Return `num_return_sequences` candidate corrections for `input_text`."""
    # tokenize a single-sentence batch, padded/truncated to 64 tokens
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    # beam search with 4 beams; temperature only takes effect when sampling
    # (do_sample=True), so it is not expected to change the output here
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)
    # decode the generated token ids back to text
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text

loading file spiece.model
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file t5_spell_model/config.json
Model config T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_rep

In [None]:
text = 'Their is a bag here.'
print(correct_spelling(text, num_return_sequences=2))
     

Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}



['There is a bag here.', "There's a bag here."]


In [None]:
text = 'he his my man.'
print(correct_spelling(text, num_return_sequences=2))

Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.26.0"
}



['he his my man.', 'he is my man.']
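The two results come back ranked because `generate` is run with beam search, and the first candidate here is the unchanged input. The toy sketch below shows how beam search keeps the highest-scoring partial sequences and returns the top few. The per-position probabilities are invented for illustration and are not outputs of the trained model; a real decoder also conditions each step on the tokens generated so far, which this simplified version does not.

```python
import math


def beam_search(step_probs, num_beams=4, num_return_sequences=2):
    """step_probs: list of dicts mapping token -> probability at each position."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for probs in step_probs:
        # Extend every kept beam with every possible next token.
        candidates = [
            (tokens + [token], score + math.log(p))
            for tokens, score in beams
            for token, p in probs.items()
        ]
        # Keep only the num_beams best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return [" ".join(tokens) for tokens, _ in beams[:num_return_sequences]]


# Invented probabilities: the toy "model" slightly prefers copying "his".
step_probs = [
    {"he": 1.0},
    {"his": 0.55, "is": 0.45},
    {"my": 1.0},
    {"man.": 1.0},
]
print(beam_search(step_probs))  # ['he his my man.', 'he is my man.']
```

Under this picture, asking for more than one returned sequence is useful when the top-ranked beam simply copies the input, as in the last example.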
