<a href="https://colab.research.google.com/github/NITHISHM2410/Text_Preprocessing/blob/NLP/Text%20Encoding/encoding_for_bert/TextPreprocessing_For_Bert_CustomVocabulary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import tensorflow as tf
import tensorflow_hub as hub

Import a pretrained BERT model from TensorFlow Hub.

In [6]:
bert_model = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1")

In [7]:
bert_model.trainable = True

This module is meant for cases where a BERT model is trained on a custom vocabulary instead of BERT's built-in vocabulary. The built-in vocabulary is large, so predicting tokens during decoding needs a correspondingly large softmax layer. In other cases a domain-specific vocabulary is simply preferable.
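To make the softmax-size argument concrete, the sketch below counts the parameters of a dense output layer that projects BERT hidden states to vocabulary logits. It assumes a hidden size of 512 (the `H-512` in the model name above), the 30,522-token vocabulary of the standard English uncased BERT, and the 1,005-token custom vocabulary used in this notebook. The function name `softmax_layer_params` is illustrative and is not part of TextBertED.

```python
def softmax_layer_params(hidden_size: int, vocab_size: int) -> int:
    """Weights plus biases of a dense layer mapping hidden states to vocabulary logits."""
    return hidden_size * vocab_size + vocab_size


HIDDEN_SIZE = 512          # H-512 in the TF Hub model name
BERT_VOCAB_SIZE = 30_522   # standard English uncased BERT WordPiece vocabulary
CUSTOM_VOCAB_SIZE = 1_005  # size of the custom vocabulary in this notebook

full = softmax_layer_params(HIDDEN_SIZE, BERT_VOCAB_SIZE)
custom = softmax_layer_params(HIDDEN_SIZE, CUSTOM_VOCAB_SIZE)

print(full)    # 15657786
print(custom)  # 515565
```

Under these assumptions the output layer shrinks by roughly a factor of 30 when the custom vocabulary is used.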

In [8]:
import TextBertED as txtbert

Define your custom vocabulary with the 'vocab' parameter and the maximum length of an input sentence with the 'max_len' parameter.


In [9]:
preprocess = txtbert.BertED(
    vocab = "/content/words.txt",
    max_len = 40,
    max_tokens = 1_000_000
)
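The notebook does not show how `words.txt` was produced. As a rough sketch, and assuming the file is plain text with one token per line (an assumption, not something confirmed here), a vocabulary file could be built from a corpus by counting whitespace-separated words. The helper `build_vocab_file` and the tiny corpus are hypothetical, and whether special tokens such as '[UNK]' must be listed in the file is not stated in this notebook.

```python
import tempfile
from collections import Counter
from pathlib import Path


def build_vocab_file(corpus, path, max_words=1000):
    """Write the most frequent whitespace-separated words to `path`, one per line.

    Assumption: the vocabulary file is plain text with one token per line.
    """
    counts = Counter(word for line in corpus for word in line.split())
    words = [word for word, _ in counts.most_common(max_words)]
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


# Hypothetical two-sentence corpus, written to a temporary directory.
corpus = ["I am not in danger", "I am the danger"]
vocab_path = Path(tempfile.mkdtemp()) / "words.txt"
words = build_vocab_file(corpus, vocab_path)

print(sorted(words))
```

The resulting path could then be passed as the 'vocab' argument, provided the one-token-per-line assumption holds for the module.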

To return the vocabulary, call preprocess.return_vocab():

In [10]:
len(preprocess.return_vocab())

1005

In [11]:
preprocess.return_vocab()[:30]

['',
 '[UNK]',
 '[MASK]',
 '[END]',
 '[START]',
 'the',
 'of',
 'to',
 'and',
 'a',
 'in',
 'is',
 'it',
 'you',
 'that',
 'he',
 'was',
 'for',
 'on',
 'are',
 'with',
 'as',
 'I',
 'his',
 'they',
 'be',
 'at',
 'one',
 'have',
 'this']

Pass a few input sentences to encode them with our custom vocabulary.

In [12]:
sentences = [["I am not in danger"],["I am the danger"]]
sentences = tf.convert_to_tensor(sentences)
sentences

<tf.Tensor: shape=(2, 1), dtype=string, numpy=
array([[b'I am not in danger'],
       [b'I am the danger']], dtype=object)>

In [13]:
results = preprocess(sentences)
results

{'input_word_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[  1, 511,  34,  10, 798,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0],
        [  1, 511,   5, 798,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0]], dtype=int32)>,
 'input_mask': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[0, 0, 

The format above is the standard input format for BERT models. The module also inherits from the Keras Layer class, so it can be used directly as a preprocessing layer.
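To illustrate what the three fields hold, here is a minimal pure-Python sketch of this kind of encoding: whitespace tokenization, vocabulary lookup with id 1 for unknown words, and zero padding up to a maximum length. It is not TextBertED's implementation. The `encode` function and the toy vocabulary are assumptions made for illustration, so the ids differ from those printed above.

```python
def encode(sentence, token_to_id, max_len, unk_id=1, pad_id=0):
    """Map a sentence to BERT-style inputs (illustrative sketch only)."""
    ids = [token_to_id.get(tok, unk_id) for tok in sentence.split()][:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "input_word_ids": ids + [pad_id] * n_pad,  # token ids, padded with 0
        "input_mask": [1] * n_real + [0] * n_pad,  # 1 for real tokens, 0 for padding
        "input_type_ids": [0] * max_len,           # single segment, so all zeros
    }


# Toy vocabulary: index 0 is padding and index 1 is [UNK], as in the notebook's vocabulary.
toy_vocab = ["", "[UNK]", "am", "not", "in", "danger"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

encoded = encode("I am not in danger", token_to_id, max_len=8)
print(encoded["input_word_ids"])  # [1, 2, 3, 4, 5, 0, 0, 0]
print(encoded["input_mask"])      # [1, 1, 1, 1, 1, 0, 0, 0]
```

In this toy vocabulary 'I' is missing, so it maps to the unknown id 1.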

In [14]:
# 'inputs' avoids shadowing Python's built-in input()
inputs = tf.keras.Input(shape = (1,), batch_size = 32, dtype = tf.string)
output = txtbert.BertED(vocab = '/content/words.txt', max_len = 100)(inputs)  # our preprocessing layer
output = bert_model(output)['pooled_output']  # pooled sentence representation
output = tf.keras.layers.Dense(5, activation = 'softmax')(output)  # 5-class classifier head
model = tf.keras.Model(inputs = [inputs], outputs = output)

In [15]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(32, 1)]            0           []                               
                                                                                                  
 bert_ed_1 (BertED)             {'input_word_ids':   0           ['input_1[0][0]']                
                                (None, 100),                                                      
                                 'input_mask': (Non                                               
                                e, 100),                                                          
                                 'input_type_ids':                                                
                                (None, 100)}                                                  

Passing an input to the model

In [16]:
sentences.shape

TensorShape([2, 1])

In [17]:
outputs = model(sentences)

In [18]:
outputs.shape

TensorShape([2, 5])

Decoding with the same module (integer tokens -> string)

In [19]:
integer_token_sentence = tf.convert_to_tensor([[100,102,104,912],[256,413,235,154]])

In [20]:
preprocess.back_to_string(integer_token_sentence)

['down been find solution', 'seem cry hard tell']
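Conceptually, decoding is a reverse lookup into the vocabulary list. The sketch below shows one way this could work, assuming padding id 0 is dropped and the remaining tokens are joined with spaces. The `decode` function and the toy vocabulary are illustrative and do not reproduce the internals of `back_to_string`.

```python
def decode(ids, vocab, pad_id=0):
    """Turn a row of integer ids back into a space-separated string (sketch)."""
    return " ".join(vocab[i] for i in ids if i != pad_id)


# Toy vocabulary: index 0 is padding and index 1 is [UNK].
toy_vocab = ["", "[UNK]", "am", "not", "in", "danger"]
rows = [
    [1, 2, 3, 4, 5, 0, 0, 0],
    [1, 2, 5, 0, 0, 0, 0, 0],
]

decoded = [decode(row, toy_vocab) for row in rows]
print(decoded)  # ['[UNK] am not in danger', '[UNK] am danger']
```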

Let's decode the encoded sentences and check whether the decoder can recover the original sentences.

In [21]:
sentences

<tf.Tensor: shape=(2, 1), dtype=string, numpy=
array([[b'I am not in danger'],
       [b'I am the danger']], dtype=object)>

In [22]:
encoded_sentences = preprocess(sentences)
encoded_sentences

{'input_word_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[  1, 511,  34,  10, 798,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0],
        [  1, 511,   5, 798,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0]], dtype=int32)>,
 'input_mask': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
 array([[0, 0, 

In [23]:
decoded_sentences = preprocess.back_to_string(encoded_sentences['input_word_ids'])
decoded_sentences

['[UNK] am not in danger', '[UNK] am the danger']

The decoded sentences are almost the same as the originals, except for the token 'I', which was encoded as id 1 and therefore decodes to '[UNK]'.
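A quick way to see which words were lost in the round trip is to compare the original and decoded sentences word by word. The sketch below uses the exact strings shown above and assumes both sides split into the same number of whitespace-separated tokens. The helper `lost_tokens` is hypothetical.

```python
def lost_tokens(original, decoded, unk="[UNK]"):
    """Return the original words whose decoded counterpart is the unknown token."""
    return [o for o, d in zip(original.split(), decoded.split()) if d == unk]


originals = ["I am not in danger", "I am the danger"]
round_trip = ["[UNK] am not in danger", "[UNK] am the danger"]

lost = [lost_tokens(o, d) for o, d in zip(originals, round_trip)]
print(lost)  # [['I'], ['I']]
```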

# SUMMARY

*   **Import:**

        import TextBertED as txtbert

*   **Initialize:**

        preprocess = txtbert.BertED(
            vocab = "words.txt",
            max_len = 40,
            max_tokens = 1_000_000
        )

*   **To encode (call the object directly; it can also be used as a Keras layer):**

        encoded_sentences = preprocess(tensor_of_sentences)

*   **To decode:**

        decoded_sentences = preprocess.back_to_string(integer_tensor)

*   **To get the vocabulary:**

        vocab = preprocess.return_vocab()
                           






