# KERAS-BERT: predict masked words
This notebook uses the keras-bert Python module, which can also be accessed through R.
The code is modified from this demo:
https://github.com/CyberZHG/keras-bert/blob/master/demo/load_model/keras_bert_load_and_predict.ipynb

Download Pretrained Weights

In [9]:
!pip install -q keras-bert

In [10]:
!wget -q https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

/bin/bash: wget: command not found


In [11]:
!unzip -o chinese_L-12_H-768_A-12.zip

unzip:  cannot find or open chinese_L-12_H-768_A-12.zip, chinese_L-12_H-768_A-12.zip.zip or chinese_L-12_H-768_A-12.zip.ZIP.


### 1. Build Model & Dictionary

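Set the paths to the config, checkpoint, and vocabulary files of the pretrained model (here the English uncased_L-12_H-768_A-12 model, stored in a local directory):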

In [18]:
import os
pretrained_path = '../../../../uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

Enable tf.keras by setting the TF_KERAS environment variable before importing keras_bert:

In [19]:
os.environ['TF_KERAS'] = '1'

Build the dictionary & inverse dictionary:

In [20]:
import codecs

token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
token_dict_inv = {v: k for k, v in token_dict.items()}
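
To make the two mappings concrete, here is a minimal sketch of the same loop run on a tiny in-memory vocabulary. The function name `load_vocab` and the six sample lines are illustrative assumptions; they are not the contents of the real `vocab.txt`.

```python
import io

def load_vocab(reader):
    """Map each token to its line number, exactly as the loop above does."""
    token_dict = {}
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
    return token_dict

# Made-up vocabulary for illustration only; one token per line.
sample = io.StringIO('[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nthe\n')
vocab = load_vocab(sample)
vocab_inv = {v: k for k, v in vocab.items()}

print(vocab['[CLS]'])  # token -> index
print(vocab_inv[5])    # index -> token
```

The forward dictionary is what turns tokens into model inputs; the inverse dictionary is only needed at the end, to turn predicted indices back into words.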

Build the model:

In [21]:
from keras_bert import load_trained_model_from_checkpoint
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)
model.summary(line_length=120)

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Model: "model"
________________________________________________________________________________________________________________________
Layer (type)                           Output Shape               Param #       Connected to                            
Input-Token (InputLayer)               [(None, 512)]              0                                                     
________________________________________________________________________________________________________________________
Input-Segment (InputLayer)             [(None, 512)]              0                                                     
________________________________________________________________________________________________________________________
Embedding-Token (TokenEmbedding)       [(None, 512, 768), (30522, 23440896      Input-Token[0][0]                       
________________________________________________________________________________________________________________________
Embedding-Segment

### 2. Mask certain words
Mask the tokens at positions 13 and 33 so that the model predicts them. The original words at these positions are "intention" and "Kingdom". Here, the positions to mask are selected manually. In a text-cleaning pipeline, they should be picked by a spell checker or another error-detection process.

In [22]:
from keras_bert import Tokenizer
tokenizer = Tokenizer(token_dict)
text = 'His Majesty’s Government wish to add that they have no invitation of requesting the establishment of military bases in peace time within the area of Palestine now united to the freedom of Jordan .'
tokens = tokenizer.tokenize(text)
tokens[13] = tokens[33] = '[MASK]'
print('Tokens:', tokens)

Tokens: ['[CLS]', 'his', 'majesty', '’', 's', 'government', 'wish', 'to', 'add', 'that', 'they', 'have', 'no', '[MASK]', 'of', 'requesting', 'the', 'establishment', 'of', 'military', 'bases', 'in', 'peace', 'time', 'within', 'the', 'area', 'of', 'palestine', 'now', 'united', 'to', 'the', '[MASK]', 'of', 'jordan', '.', '[SEP]']
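
The positions 13 and 33 above were counted by hand. As a rough sketch, they could instead be looked up from the token list. `mask_tokens` is a hypothetical helper, not part of keras-bert, and it matches whole tokens only: a word that the tokenizer splits into several WordPiece tokens would need extra handling.

```python
def mask_tokens(tokens, targets, mask_token='[MASK]'):
    """Return a masked copy of `tokens` and the positions that were masked.

    Hypothetical helper for illustration; every token equal to one of
    `targets` is replaced, so repeated words are all masked.
    """
    targets = set(targets)
    masked = list(tokens)  # copy, so the input list is left unchanged
    positions = []
    for i, token in enumerate(tokens):
        if token in targets:
            masked[i] = mask_token
            positions.append(i)
    return masked, positions

# Shortened example sentence, already tokenized and lower-cased.
example = ['[CLS]', 'they', 'have', 'no', 'invitation', 'of', 'requesting', '[SEP]']
masked, positions = mask_tokens(example, ['invitation'])
print(masked)
print(positions)
```

Returning the positions alongside the masked list means the same indices can be reused later to read out the predictions.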


Build the model inputs: arrays of token indices (word indices), segment indices (sentence segments), and mask indicators. Because of a limitation of the package, the arrays are built manually. The three arrays have the same shape, 1 by 512: one sentence, padded to 512 positions (the maximum input length, as set by the tutorial). Each element corresponds to one token position. The sentence has only 38 tokens, so the remaining positions are padded with 0.

In [23]:
import numpy as np

# Token ids, padded with 0 up to the maximum input length of 512.
indices = np.array([[token_dict[token] for token in tokens] + [0] * (512 - len(tokens))])
# Single-sentence input: every position belongs to segment 0.
segments = np.array([[0] * 512])
# 1 at each '[MASK]' position (13 and 33), 0 everywhere else.
masks = np.array([[1 if token == '[MASK]' else 0 for token in tokens] + [0] * (512 - len(tokens))])
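
The padding logic above can be wrapped in a function so that any token list up to the maximum length yields the three inputs. This is a minimal sketch in plain Python lists, using a made-up toy vocabulary; `build_inputs` and `toy_vocab` are illustrative names, not part of keras-bert. Wrapping each returned list as `np.array([row])` gives arrays in the same form as above.

```python
def build_inputs(tokens, token_dict, max_len=512, mask_token='[MASK]'):
    """Return (indices, segments, masks) as lists of length `max_len`.

    Illustrative helper: assumes a single sentence (segment 0 everywhere)
    and pads unused positions with 0.
    """
    if len(tokens) > max_len:
        raise ValueError('sequence has %d tokens, maximum is %d' % (len(tokens), max_len))
    pad = max_len - len(tokens)
    indices = [token_dict[t] for t in tokens] + [0] * pad
    segments = [0] * max_len
    masks = [1 if t == mask_token else 0 for t in tokens] + [0] * pad
    return indices, segments, masks

# Toy vocabulary and a short maximum length, for illustration only.
toy_vocab = {'[PAD]': 0, '[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'no': 4, 'of': 5}
indices, segments, masks = build_inputs(
    ['[CLS]', 'no', '[MASK]', 'of', '[SEP]'], toy_vocab, max_len=8)
print(indices)  # [1, 4, 3, 5, 2, 0, 0, 0]
print(masks)    # [0, 0, 1, 0, 0, 0, 0, 0]
```

Deriving the mask from the token list keeps it aligned with the masked positions, which is easy to get wrong when the array is typed out by hand.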

Run the model to predict the masked words. The first output holds, for every position, a score for each vocabulary token. Taking the argmax over the vocabulary gives the index of the highest-scoring token, which is then mapped back to a word through the inverse token dictionary. Here, the predictions for the two masked positions are "intention" and "state". The original words are "intention" and "Kingdom".

In [25]:
predicts = model.predict([indices, segments, masks])

In [26]:
predicts_dict = predicts[0].argmax(axis=-1).tolist()
print('Fill with: ', token_dict_inv[predicts_dict[0][13]], token_dict_inv[predicts_dict[0][33]])

Fill with:  intention state
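
Taking the argmax keeps only the single most likely token at each position. For error correction it may help to look at several candidates instead. The sketch below assumes `scores` is the vector of per-vocabulary scores for one position (for example `predicts[0][0][33]` from the cell above). The helper `top_k_tokens`, the toy vocabulary, and the toy scores are made up for illustration and are not model output.

```python
def top_k_tokens(scores, token_dict_inv, k=3):
    """Return the k highest-scoring (token, score) pairs, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(token_dict_inv[i], scores[i]) for i in ranked[:k]]

# Made-up inverse vocabulary and scores, for illustration only.
toy_inv = {0: 'state', 1: 'kingdom', 2: 'city', 3: 'river'}
toy_scores = [0.5, 0.3, 0.15, 0.05]
print(top_k_tokens(toy_scores, toy_inv, k=2))
```

A candidate list like this could then be filtered, for instance by similarity to the word that was flagged as a possible error.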
