## Install Requirements

In [None]:
%tensorflow_version 1.13
!pip install numpy==1.19.5
!pip install bert-tensorflow==1.0.1

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.13`. This will be interpreted as: `1.x`.


After that, `%tensorflow_version 1.x` will throw an error.

Your notebook should be updated to use Tensorflow 2.
See the guide at https://www.tensorflow.org/guide/migrate#migrate-from-tensorflow-1x-to-tensorflow-2.

TensorFlow 1.x selected.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import drive
import pandas as pd
from sklearn.model_selection import train_test_split
import re
from collections import Counter
import nltk
nltk.download('words')
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization
from bert import modeling
import tensorflow as tf
import numpy as np
import random
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!





## Load Data

We load the data gathered in the first phase and split it into train and test sets, which gives two separate dataframes. IMPORTANT: the dataframe must not contain any null values, so rows with nulls are dropped before the split.

In [None]:
drive.mount('drive/')

df = pd.read_csv('drive/MyDrive/NPL-Project-Data/dataset.csv')
df = df.dropna()

train_df, test_df = train_test_split(df, train_size=0.8, shuffle=True)
test_df = test_df.reset_index()
train_df = train_df.reset_index()

Drive already mounted at drive/; to attempt to forcibly remount, call drive.mount("drive/", force_remount=True).


We download the pretrained BERT model that we need (uncased, 12 layers, hidden size 768, 12 attention heads) from Google's public storage.

In [None]:
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip

--2022-07-11 18:09:46--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.187.128, 64.233.188.128, 64.233.189.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.187.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip.1’


2022-07-11 18:09:48 (208 MB/s) - ‘uncased_L-12_H-768_A-12.zip.1’ saved [407727028/407727028]

Archive:  uncased_L-12_H-768_A-12.zip
replace uncased_L-12_H-768_A-12/bert_model.ckpt.meta? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


## Corrector Class

Here we build a dictionary of all correct words in the training set, together with the number of times each one occurs. The result is shown below.

In [None]:
all_words = []
for row in train_df['label']:
  all_words += row.split(' ')
words = dict(Counter(all_words))
words

{'The': 34505,
 'production': 1469,
 'growth': 1068,
 'was': 15063,
 'almost': 717,
 'entirely': 290,
 'due': 2023,
 'to': 103595,
 'increased': 1090,
 'productivity': 197,
 'by': 36650,
 'developing': 655,
 'nations': 379,
 'where': 4110,
 'As': 2005,
 'a': 111239,
 'window': 197,
 'the': 249869,
 'unconscious': 198,
 'interpreting': 111,
 'intentions': 344,
 'behind': 352,
 'such': 13791,
 'phenomena': 344,
 'and': 135038,
 'raising': 150,
 'patient': 317,
 'awareness': 379,
 'of': 178701,
 'them': 3291,
 'are': 36589,
 'important': 2294,
 'aspects': 599,
 'Freudian': 19,
 'psychoanalysis': 43,
 'Many': 1445,
 'Victorian': 71,
 'novels': 122,
 'begin': 302,
 'with': 28887,
 'childhood': 192,
 'their': 12777,
 'heroine': 10,
 'as': 47828,
 'Jane': 65,
 'Eyre': 8,
 'an': 21845,
 'orphan': 5,
 'who': 5937,
 'suffers': 25,
 'ill': 84,
 'treatment': 620,
 'from': 24199,
 'her': 1086,
 'guardians': 23,
 'then': 2589,
 'at': 11997,
 'girls': 322,
 'boarding': 17,
 'school': 1211,
 'On': 105

This class has several methods, all of which serve to build a list of candidate replacements for a word.

We keep only candidates at edit distance 1 or 2 from the token, on the assumption that about 99% of human typing errors fall within that range.

For distance 1 we consider four error types: deletion, insertion, transposition, and replacement.

Candidates at distance 2 are found by applying the distance-1 function again to each of its results. By default (`fast=True`) only distance-1 candidates are used.

Finally, we create an instance of the class.

In [None]:
class SpellCorrector:
    """
    The SpellCorrector extends the functionality of the Peter Norvig's
    spell-corrector in http://norvig.com/spell-correct.html
    """

    def __init__(self):
        """
        Uses the global `words` dictionary (word counts from the training set)
        as the corpus statistics, extended with the NLTK word list for lookups.
        """
        super().__init__()
        self.WORDS = words
        self.N = sum(self.WORDS.values())
        self.words_set = set(words.keys()).union(set(nltk.corpus.words.words()))
        
    @staticmethod
    def tokens(text):
        return text.split(' ')

    def P(self, word):
        """
        Probability of `word`.
        """
        return self.WORDS[word] / self.N

    def most_probable(self, words):
        _known = self.known(words)
        if _known:
            return max(_known, key=self.P)
        else:
            return []

    @staticmethod
    def edit_step(word):
        """
        All edits that are one edit away from `word`.
        """
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        """
        All edits that are two edits away from `word`.
        """
        return (e2 for e1 in self.edit_step(word)
                for e2 in self.edit_step(e1))

    def known(self, words):
        """
        The subset of `words` that appear in the dictionary of WORDS.
        """
        return set(w for w in words if w in self.WORDS)

    def edit_candidates(self, word, assume_wrong=False, fast=True):
        """
        Generate possible spelling corrections for word.
        """

        if fast:
            ttt = self.known(self.edit_step(word)) or {word}
        else:
            ttt = self.known(self.edit_step(word)) or self.known(self.edits2(word)) or {word}
        
        ttt = self.known([word]) | ttt
        return list(ttt)

corrector = SpellCorrector()
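
To see what candidate generation produces, here is a minimal standalone sketch of the one-edit step. It assumes a toy word-count dictionary; `one_edit_away` and `toy_counts` are illustrative names, not part of the notebook.

```python
from collections import Counter

def one_edit_away(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """All strings reachable from `word` with one delete, transpose, replace, or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | replaces | inserts

# Toy dictionary standing in for the training-set word counts (an assumption).
toy_counts = Counter("the ship will pass through the stages".split())

# Keep only the generated strings that are known words.
candidates = sorted(w for w in one_edit_away("sjip") if w in toy_counts)
print(candidates)
```

With this toy dictionary the only known word one edit away from `sjip` is `ship`. With the full training-set dictionary there would usually be several candidates, which is why the language model is needed to rank them.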

## Bert Model

BERT is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context.

The output of BERT is a hidden state vector of a pre-defined hidden size for each token in the input sequence. These hidden states from the last layer of BERT are then used for various NLP tasks.

BERT can be used for a variety of NLP tasks, such as text or sentence classification, semantic similarity between pairs of sentences, question answering over a paragraph, text summarization, and machine translation.

Because we want to map an input sequence to an output sequence, this is a many-to-many model, which makes the task similar to machine translation.



Here we set the paths to the vocabulary, checkpoint, and config files of the Google BERT model that we want to use, and load its configuration.

In [None]:
BERT_VOCAB = 'uncased_L-12_H-768_A-12/vocab.txt'
BERT_INIT_CHKPNT = 'uncased_L-12_H-768_A-12/bert_model.ckpt'
BERT_CONFIG = 'uncased_L-12_H-768_A-12/bert_config.json'
bert_config = modeling.BertConfig.from_json_file(BERT_CONFIG)




Here we create the BERT tokenizer, which the later methods need.

In [None]:
tokenization.validate_case_matches_checkpoint(True,BERT_INIT_CHKPNT)
tokenizer = tokenization.FullTokenizer(vocab_file=BERT_VOCAB, do_lower_case=True)

In [None]:
def tokens_to_masked_ids(tokens, mask_ind):
    masked_tokens = tokens[:]
    masked_tokens[mask_ind] = "[MASK]"
    masked_tokens = ["[CLS]"] + masked_tokens + ["[SEP]"]
    masked_ids = tokenizer.convert_tokens_to_ids(masked_tokens)
    return masked_ids
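
As a rough illustration of the masking step, the sketch below builds one masked copy of a sentence per token position. It stops before the conversion to ids, so it needs no BERT vocabulary; `mask_each_position` is a hypothetical helper name, not part of the notebook.

```python
def mask_each_position(tokens):
    """Return one masked copy of `tokens` per position, wrapped in [CLS] and [SEP]."""
    variants = []
    for i in range(len(tokens)):
        masked = tokens[:]          # copy so the original list is untouched
        masked[i] = "[MASK]"
        variants.append(["[CLS]"] + masked + ["[SEP]"])
    return variants

variants = mask_each_position(["a", "ship", "will"])
for v in variants:
    print(v)
```

Because of the leading `[CLS]`, the token at position `i` of the original sentence sits at position `i + 1` of each masked sequence.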

In [None]:
class Model:
    def __init__(self):
        self.X = tf.placeholder(tf.int32, [None, None])
        
        model = modeling.BertModel(
                                    config=bert_config,
                                    is_training=False,
                                    input_ids=self.X,
                                    use_one_hot_embeddings=False
                                   )
        
        output_layer = model.get_sequence_output()
        embedding = model.get_embedding_table()
        
        with tf.variable_scope('cls/predictions'):
            with tf.variable_scope('transform'):
                input_tensor = tf.layers.dense(
                    output_layer,
                    units = bert_config.hidden_size,
                    activation = modeling.get_activation(bert_config.hidden_act),
                    kernel_initializer = modeling.create_initializer(
                        bert_config.initializer_range
                    ),
                )
                input_tensor = modeling.layer_norm(input_tensor)
            
            output_bias = tf.get_variable(
                                          'output_bias',
                                          shape = [bert_config.vocab_size],
                                          initializer = tf.zeros_initializer(),
                                          )
            logits = tf.matmul(input_tensor, embedding, transpose_b = True)
            self.logits = tf.nn.bias_add(logits, output_bias)

We create the model, start a TensorFlow session, and restore the pretrained weights from the checkpoint.

In [None]:
tf.reset_default_graph()
sess = tf.InteractiveSession()
model = Model()

sess.run(tf.global_variables_initializer())
var_lists = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = 'bert')
cls = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = 'cls')
saver = tf.train.Saver(var_list = var_lists + cls)
saver.restore(sess, BERT_INIT_CHKPNT)




The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.
INFO:tensorflow:Restoring parameters from uncased_L-12_H-768_A-12/bert_model.ckpt


## Correction Method

This method takes a candidate sentence and converts it to token ids. It then feeds the ids to the pretrained model and combines the predictions into a single score for the sentence.

In [None]:
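def get_score(mask):
    # `mask` is a candidate sentence; split it into WordPiece tokens.
    tokens = tokenizer.tokenize(mask)
    # One input per token position, each with that position replaced by [MASK].
    input_ids = [tokens_to_masked_ids(tokens, i) for i in range(len(tokens))]
    preds = sess.run(tf.nn.softmax(model.logits), feed_dict = {model.X: input_ids})
    tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Multiply the probability of each original token at its masked position (offset by 1 for [CLS]).
    return np.prod([preds[i, i + 1, x] for i, x in enumerate(tokens_ids)])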
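
The scoring idea can be shown in isolation with made-up predictions. The sketch below uses plain nested lists instead of model output, and the log-space variant is an addition for illustration only (the notebook multiplies raw probabilities); all names here are hypothetical.

```python
import math

def pseudo_likelihood(preds, token_ids):
    """preds[i][j][v]: probability of vocab id v at position j when token i is masked."""
    # Position i + 1 skips the leading [CLS] token.
    return math.prod(preds[i][i + 1][t] for i, t in enumerate(token_ids))

def log_pseudo_likelihood(preds, token_ids):
    """Same score in log space, which avoids underflow on long sentences."""
    return sum(math.log(preds[i][i + 1][t]) for i, t in enumerate(token_ids))

# Toy predictions: 2 tokens, sequence length 4 ([CLS] t0 t1 [SEP]), vocabulary of 3 ids.
uniform = [1 / 3] * 3
preds = [[uniform[:] for _ in range(4)] for _ in range(2)]
preds[0][1] = [0.2, 0.3, 0.5]   # distribution at position 1 when token 0 is masked
preds[1][2] = [0.2, 0.7, 0.1]   # distribution at position 2 when token 1 is masked

token_ids = [2, 0]
score = pseudo_likelihood(preds, token_ids)        # 0.5 * 0.2
log_score = log_pseudo_likelihood(preds, token_ids)
print(score, log_score)
```

Since the score is a product of one probability per token, longer sentences get smaller scores; this matters less here because the candidate sentences being compared differ in a single word.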

In this method we iterate over all tokens of a text. For each token we take the following steps:


*   Find the possible replacements with the corrector class.
*   Replace the token with each of the replacements to create new sentences.
*   Call `get_score` for each of the new sentences.
*   Normalize the returned scores and find the candidate with the highest probability.
*   If the original token is not a known word, return the most probable candidate. If it is a known word, replace it only when the best candidate is more than 5 times as probable as the original token; otherwise keep it.
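
The decision rule in the last step can be sketched on its own, without the model. The function name, the `ratio` parameter, and the example scores are illustrative assumptions; only the factor of 5 comes from the notebook.

```python
def choose_candidate(token, candidates, scores, known_words, ratio=5.0):
    """Pick a replacement for `token` from `candidates` given their sentence scores."""
    total = sum(scores)
    probs = [s / total for s in scores]          # normalize scores to sum to 1
    best_index = max(range(len(probs)), key=probs.__getitem__)
    best = candidates[best_index]

    # Unknown token: always take the most probable candidate.
    if token not in known_words or token not in candidates:
        return best
    # Known token: replace it only if the best candidate is far more probable.
    if probs[best_index] > ratio * probs[candidates.index(token)]:
        return best
    return token

known = {"ship", "slip", "form", "from"}
print(choose_candidate("sjip", ["sjip", "ship", "slip"], [1e-9, 6e-6, 1e-6], known))
print(choose_candidate("form", ["form", "from"], [0.4, 0.6], known))
print(choose_candidate("form", ["form", "from"], [0.1, 0.9], known))
```

The threshold makes the corrector conservative with real words: a valid word is kept unless the context strongly favors a neighbor.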



In [None]:
def spell_correct(text):
  # Work on a token list so that only the token at position i is replaced;
  # str.replace would also rewrite matching substrings inside other words.
  tokens = corrector.tokens(text)
  for i, token in enumerate(tokens):
      possible_states = corrector.edit_candidates(token)

      # One candidate sentence per possible replacement of token i.
      replaced_masks = [' '.join(tokens[:i] + [state] + tokens[i + 1:])
                        for state in possible_states]

      scores = [get_score(mask) for mask in replaced_masks]

      prob_scores = np.array(scores) / np.sum(scores)

      if token not in corrector.words_set or token not in possible_states:
        best_state = possible_states[np.argmax(prob_scores)]
      else:
        if np.max(prob_scores) > 5 * prob_scores[possible_states.index(token)]:
          best_state = possible_states[np.argmax(prob_scores)]
        else:
          best_state = token

      tokens[i] = best_state

  return ' '.join(tokens)

## Test Model

In [None]:
def evaluate(samples_number):
  true_count = 0
  total_count = 0
  while total_count<samples_number:
    r = random.randrange(test_df.shape[0])  # randint's upper bound is inclusive and could index past the last row
    text = test_df['noise_sentence'][r]
    if(len(text)>120):
      continue
    print(text)
    corrected = spell_correct(text)
    print(corrected)
    print(test_df['label'][r])
    if test_df['label'][r].lower()==corrected.lower():
      true_count += 1
      print(True)
    total_count += 1
    print('-----------------------------\n')
  print('ACCURACY:', true_count/samples_number)

In [None]:
evaluate(10)

Stripping one of their own bodily agency and sexuality as ewll as humaniy
Stripping one of their own bodily agency and sexuality as well as humanity
Stripping one of their own bodily agency and sexuality as well as humanity
True
-----------------------------

Then therf is the mysid shrimp Dioptromysis pacuispinosa
Then there is the mysid shrimp Dioptromysis pacuispinosa
Then there is the mysid shrimp Dioptromysis paucispinosa
-----------------------------

Sedentary behaviors and ehalth outcomes amogg adults a systematic review 9f prospective studies
Sedentary behaviors and health outcomes among adults a systematic review of prospective studies
Sedentary behaviors and health outcomes among adults a systematic review of prospective studies
True
-----------------------------

Lifecycle A sjip rill pass through several stages during itC career
Lifecycle A sip rill pass through several stages during its career
Lifecycle A ship will pass through several stages during its career
-----------