# Load & Predict

## Download Pretrained Weights

In [1]:
!pip install -q keras-bert

In [2]:
!wget -q https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

In [3]:
!unzip -o chinese_L-12_H-768_A-12.zip

Archive:  chinese_L-12_H-768_A-12.zip
  inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: chinese_L-12_H-768_A-12/vocab.txt  
  inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: chinese_L-12_H-768_A-12/bert_config.json  


## Build Model & Dictionary

Set the paths to the config, checkpoint, and vocabulary files extracted above:

In [4]:
import os

pretrained_path = 'chinese_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

Make keras-bert use `tf.keras` by setting the `TF_KERAS` environment variable. Set it before `keras_bert` is imported:

In [5]:
os.environ['TF_KERAS'] = '1'

Build the token dictionary (token to index) and its inverse (index to token) from the vocabulary file:

In [6]:
import codecs

token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
token_dict_inv = {v: k for k, v in token_dict.items()}
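
The mapping built above assigns each token the 0-based number of the line it occupies in the vocabulary file. A minimal sketch of the same logic on a toy vocabulary follows; `build_token_dict` and the contents of `toy_vocab` are hypothetical stand-ins for illustration, not part of keras-bert or the real `vocab.txt`.

```python
def build_token_dict(lines):
    """Map each vocabulary line to its 0-based line number."""
    token_dict = {}
    for line in lines:
        token = line.strip()
        token_dict[token] = len(token_dict)
    return token_dict


# Toy vocabulary standing in for vocab.txt (hypothetical contents)
toy_vocab = ['[PAD]\n', '[UNK]\n', '[CLS]\n', '[SEP]\n', '[MASK]\n', '数\n', '学\n']
toy_dict = build_token_dict(toy_vocab)
toy_dict_inv = {index: token for token, index in toy_dict.items()}

# Round trip: tokens -> ids -> tokens
ids = [toy_dict[t] for t in ['[CLS]', '数', '学', '[SEP]']]
print(ids)                             # [2, 5, 6, 3]
print([toy_dict_inv[i] for i in ids])  # ['[CLS]', '数', '学', '[SEP]']
```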

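Build the model. Passing `training=True` keeps the masked-language-model and next-sentence-prediction outputs, which the predictions below rely on: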

In [7]:
from keras_bert import load_trained_model_from_checkpoint

model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)
model.summary(line_length=120)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
________________________________________________________________________________________________________________________
Layer (type)                           Output Shape               Param #       Connected to                            
Input-Token (InputLayer)               (None, 512)                0                                                     
________________________________________________________________________________________________________________________
Input-Segment (InputLayer)             (None, 512)                0                                                     
________________________________________________________________________________________________________________________
Embedding-Token (TokenEmbedding)       [(None, 512, 768), (21128, 16226304      Input-T

## Predict Masked

In [8]:
from keras_bert import Tokenizer

tokenizer = Tokenizer(token_dict)
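# "Mathematics is a discipline that uses symbolic language to study concepts
# such as quantity, structure, change, and space"
text = '数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科'
tokens = tokenizer.tokenize(text)
# Mask the first two characters ('数', '学'); index 0 holds '[CLS]'
tokens[1] = tokens[2] = '[MASK]'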
print('Tokens:', tokens)

Tokens: ['[CLS]', '[MASK]', '[MASK]', '是', '利', '用', '符', '号', '语', '言', '研', '究', '数', '量', '、', '结', '构', '、', '变', '化', '以', '及', '空', '间', '等', '概', '念', '的', '一', '门', '学', '科', '[SEP]']


In [9]:
import numpy as np

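# Pad the token ids with 0 up to the model's input length of 512
indices = np.array([[token_dict[token] for token in tokens] + [0] * (512 - len(tokens))])
# Single-sentence input: every position belongs to segment 0
segments = np.array([[0] * 512])
# 1 marks the positions to predict (the two '[MASK]' tokens at indices 1 and 2)
masks = np.array([[0, 1, 1] + [0] * (512 - 3)])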
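
The three inputs above follow one pattern: pad the token ids to the sequence length, use segment 0 throughout, and flag the masked positions with 1. The sketch below wraps that pattern in a helper using plain Python lists. `build_mlm_inputs` is a hypothetical name, and the ids in the example are placeholders, not values read from `vocab.txt`.

```python
def build_mlm_inputs(token_ids, masked_positions, seq_len=512, pad_id=0):
    """Return (indices, segments, masks) lists for one single-sentence example."""
    if len(token_ids) > seq_len:
        raise ValueError('sequence is longer than seq_len')
    masked = set(masked_positions)
    if any(p < 0 or p >= len(token_ids) for p in masked):
        raise ValueError('masked position outside the token sequence')
    # Pad the token ids up to seq_len
    indices = list(token_ids) + [pad_id] * (seq_len - len(token_ids))
    # Single sentence: segment 0 everywhere
    segments = [0] * seq_len
    # 1 at every position the model should predict, 0 elsewhere
    masks = [1 if i in masked else 0 for i in range(seq_len)]
    return indices, segments, masks


# Placeholder ids for '[CLS]', '[MASK]', '[MASK]', '[SEP]' (illustrative only)
indices, segments, masks = build_mlm_inputs([101, 103, 103, 102], [1, 2], seq_len=8)
print(indices)   # [101, 103, 103, 102, 0, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(masks)     # [0, 1, 1, 0, 0, 0, 0, 0]
```

Wrapping each returned list as `np.array([values])` adds the batch dimension that `model.predict` expects.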

In [10]:
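# Output 0 is the masked-language-model head; keep the most likely token id at each position
predicts = model.predict([indices, segments, masks])[0].argmax(axis=-1).tolist()
# Positions 1 and 2 are the masked tokens
print('Fill with: ', [token_dict_inv[x] for x in predicts[0][1:3]])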

Fill with:  ['数', '学']
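
Taking the `argmax` keeps only the single best token per position. As a sketch, assuming the masked-language-model output holds one score per vocabulary entry at each position, the top few candidates can be listed instead. `top_k_tokens` is a hypothetical helper, and the scores and vocabulary below are made-up toy values, not model output.

```python
def top_k_tokens(scores, index_to_token, k=3):
    """Return the k highest-scoring (token, score) pairs, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(index_to_token[i], scores[i]) for i in ranked[:k]]


# Toy inverse dictionary and toy scores for one masked position (illustrative only)
toy_inv = {0: '[PAD]', 1: '数', 2: '学', 3: '科'}
toy_scores = [0.01, 0.70, 0.09, 0.20]

candidates = top_k_tokens(toy_scores, toy_inv, k=2)
print(candidates)  # [('数', 0.7), ('科', 0.2)]
```

With the real model, `scores` would be the vector at one masked position of the first output, taken before `argmax` is applied, and `index_to_token` would be `token_dict_inv`.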


## Predict Next Sentence

In [11]:
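# "Mathematics is a discipline that uses symbolic language to study concepts
# such as quantity, structure, change, and space."
sentence_1 = '数学是利用符号语言研究數量、结构、变化以及空间等概念的一門学科。'
# "From a certain point of view, it is a kind of formal science."
sentence_2 = '从某种角度看屬於形式科學的一種。'
print('Tokens:', tokenizer.tokenize(first=sentence_1, second=sentence_2))
# encode() returns token ids and segment ids, both padded to max_len
indices, segments = tokenizer.encode(first=sentence_1, second=sentence_2, max_len=512)
# Nothing is masked for next-sentence prediction
masks = np.array([[0] * 512])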

Tokens: ['[CLS]', '数', '学', '是', '利', '用', '符', '号', '语', '言', '研', '究', '數', '量', '、', '结', '构', '、', '变', '化', '以', '及', '空', '间', '等', '概', '念', '的', '一', '門', '学', '科', '。', '[SEP]', '从', '某', '种', '角', '度', '看', '屬', '於', '形', '式', '科', '學', '的', '一', '種', '。', '[SEP]']


In [12]:
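# Output 1 is the next-sentence head: argmax 0 means sentence_2 follows sentence_1,
# argmax 1 means sentence_2 is a random sentence
predicts = model.predict([np.array([indices]), np.array([segments]), masks])[1]
print('%s is random next: ' % sentence_2, bool(np.argmax(predicts, axis=-1)[0]))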

从某种角度看屬於形式科學的一種。 is random next:  False


In [13]:
# "Every Hilbert space has an orthonormal basis." (unrelated to sentence_1)
sentence_2 = '任何一个希尔伯特空间都有一族标准正交基。'
print('Tokens:', tokenizer.tokenize(first=sentence_1, second=sentence_2))
indices, segments = tokenizer.encode(first=sentence_1, second=sentence_2, max_len=512)

Tokens: ['[CLS]', '数', '学', '是', '利', '用', '符', '号', '语', '言', '研', '究', '數', '量', '、', '结', '构', '、', '变', '化', '以', '及', '空', '间', '等', '概', '念', '的', '一', '門', '学', '科', '。', '[SEP]', '任', '何', '一', '个', '希', '尔', '伯', '特', '空', '间', '都', '有', '一', '族', '标', '准', '正', '交', '基', '。', '[SEP]']


In [14]:
predicts = model.predict([np.array([indices]), np.array([segments]), masks])[1]
print('%s is random next: ' % sentence_2, bool(np.argmax(predicts, axis=-1)[0]))

任何一个希尔伯特空间都有一族标准正交基。 is random next:  True
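
Both next-sentence checks above read the result the same way: the head returns two scores per sentence pair, and index 1 winning means the second sentence is judged random. A minimal sketch of that reading follows; `is_random_next` is a hypothetical helper, and the score pairs are made-up toy values, not model output.

```python
def is_random_next(scores):
    """scores: [follows_score, random_score] for one sentence pair.

    Mirrors bool(argmax(scores)): True only when index 1 is strictly larger.
    """
    follows_score, random_score = scores
    return random_score > follows_score


# Toy score pairs (illustrative only)
print(is_random_next([0.98, 0.02]))  # False: the second sentence follows the first
print(is_random_next([0.10, 0.90]))  # True: the second sentence looks random
```

With the real model, `scores` would be `predicts[0]`, the row for the first (and only) sentence pair in the batch.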
