# Bahasa Indonesia POS Tagger using BLSTM-CNN-CRF
I build a part-of-speech (POS) tagger for Bahasa Indonesia with a network based on [this paper](https://arxiv.org/abs/1603.01354). The idea is to feed both word embeddings and character embeddings to an LSTM. Before reaching the LSTM, the character vectors pass through convolution and max pooling layers. Finally, a CRF layer predicts the tag sequence.

The workflow is as follows:
1. [Preprocessing Data](#Preprocessing-Data): I do most of the data preprocessing in NumPy rather than in TensorFlow. Besides word and character vectors, I also use word shape as a feature.
2. [Building the Network](#Building-the-Network): I also apply dropout in the network.
3. [Training the Network](#Training-the-Network): I use mini-batch training with the Adam optimizer, learning rate decay, and gradient clipping.
4. [Evaluating the Network](#Evaluating-the-Network)

I also write summaries for TensorBoard.

## Library
Note that [Sastrawi](https://github.com/har07/PySastrawi) is the stemmer library for Bahasa Indonesia.

In [1]:
import re
import string
import time
from tqdm._tqdm_notebook import tqdm_notebook as tqdm

from gensim.models import Word2Vec
import numpy as np
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import tensorflow as tf

Using TensorFlow backend.


## Preprocessing Data
The tagged Bahasa Indonesia corpus is obtained from [here](https://github.com/famrashel/idn-tagged-corpus). It contains around 10k labeled sentences. The tagset, with a description of each tag, can be found [here](http://bahasa.cs.ui.ac.id/postag/downloads/Tagset.pdf).

First, I extract the token and label of each sentence. Then, I use the stemmer for each token.

In [2]:
#initialize stemmer
stemmer = StemmerFactory().create_stemmer()

In [3]:
raw_sentences = open('INPUT-THE-FILE-LOCATION-HERE', 'rb') \
                        .read().split('\n\n')
print '#sentences:', len(raw_sentences)

#sentences: 10030


In [4]:
sentences = list() #list of raw tokenized sentences 
tags = list() #list of labels
stem_sentences = list() #list of stemmed tokenized sentences
for sentence_ in tqdm(raw_sentences):
    sentences_ = list()
    tags_ = list()
    stem_sentences_ = list()
    for words_tags_ in sentence_.split('\n'):
        _ = words_tags_.split('\t')
        sentences_.append(_[0])
        tags_.append(_[1].upper())
        stem_sentences_.append(stemmer.stem(_[0]))
    sentences.append(sentences_)
    tags.append(tags_)
    stem_sentences.append(stem_sentences_)
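
As a small illustration of the parsing step above, here is a sketch that runs on a made-up two-sentence string in the same layout (one `token<TAB>tag` pair per line, a blank line between sentences). The sample text and the helper name `parse_corpus` are assumptions for illustration only; stemming is left out so the example has no external dependency.

```python
# Hypothetical sample in the "token<TAB>tag" layout read by the cell above.
sample = "Saya\tPRP\npergi\tVB\n.\tZ\n\nDia\tPRP\ntidur\tvb"

def parse_corpus(text):
    """Split raw text into parallel lists of tokens and upper-cased tags."""
    sentences, tags = [], []
    for block in text.strip().split("\n\n"):
        pairs = [line.split("\t") for line in block.split("\n") if line]
        sentences.append([token for token, _ in pairs])
        tags.append([tag.upper() for _, tag in pairs])
    return sentences, tags

sample_sentences, sample_tags = parse_corpus(sample)
# sample_sentences -> [['Saya', 'pergi', '.'], ['Dia', 'tidur']]
# sample_tags      -> [['PRP', 'VB', 'Z'], ['PRP', 'VB']]
```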


I use **gensim's Word2Vec** to create the initial word embedding for each token. Although the embedding is trained further along with the network, Word2Vec provides a better starting point than random initialization. I set *min_count=2* so that tokens with only one occurrence are dropped from the vocabulary; they are then mapped to the unknown token, which gives the network examples of OOV tokens during training. Note that I use index 0 for the pad token and 1 for the unknown token.

Then, I preprocess each sentence with post padding and post truncation so each sentence has equal length.

In [5]:
w2v = Word2Vec(stem_sentences, sg=1, size=128, min_count=2, seed=100)
print '#vocab:', len(w2v.wv.vocab)
word_vec_embed = np.vstack((np.zeros(128), np.zeros(128), w2v.wv.syn0))

#vocab: 6737


In [6]:
word2index = {token_: idx_+2 for idx_, token_ in \
                              enumerate(w2v.wv.index2word)}
word2index['<PAD>'] = 0
word2index['<UNK>'] = 1
index2word = {idx_: token_ for token_, idx_ in word2index.items()}
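
The effect of the frequency cutoff and the two reserved indices can be sketched without gensim. The tiny corpus below is made up, and the alphabetical ordering of the vocabulary is only for illustration (gensim orders its vocabulary differently).

```python
from collections import Counter

# Hypothetical stemmed corpus, for illustration only.
corpus = [["saya", "pergi", "ke", "pasar"], ["saya", "ke", "sekolah"]]
counts = Counter(token for sent in corpus for token in sent)

# Keep tokens seen at least twice; indices 0 and 1 are reserved.
vocab = sorted(t for t, c in counts.items() if c >= 2)
word_index = {"<PAD>": 0, "<UNK>": 1}
word_index.update({t: i + 2 for i, t in enumerate(vocab)})

# Tokens below the cutoff fall back to the unknown index 1.
encoded = [[word_index.get(t, 1) for t in sent] for sent in corpus]
# encoded -> [[3, 1, 2, 1], [3, 2, 1]]
```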

In [7]:
SEQ_LENGTH = 30
index_sentences = list()
for sentence_ in stem_sentences:
    index_sentences.append([word2index[token]
                            if token in word2index else 1
                            for token in sentence_])
index_sentences = tf.contrib.keras.preprocessing.sequence \
                    .pad_sequences(index_sentences, SEQ_LENGTH,
                                   padding='post', truncating='post')
#sentence length will be used in calculating loss and prediction
seq_length = np.array([min(SEQ_LENGTH, len(sentence_))
                       for sentence_ in sentences])
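
For readers unfamiliar with post padding and post truncation, the following NumPy sketch shows the idea on two toy sequences. The helper `pad_post` is a hypothetical stand-in written for this example, not the Keras function used above.

```python
import numpy as np

def pad_post(sequences, max_len, pad_value=0):
    """Right-pad or right-truncate each sequence to exactly max_len items."""
    out = np.full((len(sequences), max_len), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]           # post truncation: drop the tail
        out[row, :len(kept)] = kept    # post padding: pad values stay on the right
    return out

toy = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
padded = pad_post(toy, max_len=4)
lengths = np.array([min(4, len(s)) for s in toy])
# padded  -> [[5, 8, 2, 0], [7, 1, 4, 9]]
# lengths -> [3, 4]
```

The clipped lengths play the same role as `seq_length` above: they tell later layers how many positions of each row are real tokens.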

With character embedding, features can still be extracted from the characters of an OOV token.

Actually, I could use gensim to provide the initial character embedding (just like I do for words), but I prefer not to since I don't think it would make a significant difference. Hence, I use random initialization for the character embedding. Note that, just like for the word embedding, I use index 0 for the pad character and 1 for an unknown character.

Also, I preprocess each word with post padding and post truncation so each word has equal length.

In [8]:
char2index = {char_: idx_+2 for idx_, char_ in
              enumerate(list(string.lowercase))}
for i in range(10):
    char2index[str(i)] = i+28
char2index['<PAD>'] = 0
char2index['<UNK>'] = 1

In [9]:
TOKEN_LENGTH = 15
def convert_token(token):
    if token == '<PAD>': #all zero sequence
        return tf.contrib.keras.preprocessing.sequence \
                    .pad_sequences([[0]], TOKEN_LENGTH,
                                   padding='post', truncating='post')
    else:
        chars = [char2index[char] if char in char2index else 1
                 for char in token]
        return tf.contrib.keras.preprocessing.sequence \
                    .pad_sequences([chars], TOKEN_LENGTH,
                                   padding='post', truncating='post')

In [10]:
char_sentences = list()
for sentence_ in stem_sentences:
    char_sentence = list()
    count = 0
    for token_ in sentence_:
        char_sentence.append(convert_token(token_))
        count += 1
        if count == SEQ_LENGTH: #keep all SEQ_LENGTH tokens, like the word indices
            break
    for i in range(SEQ_LENGTH-count):
        char_sentence.append(convert_token('<PAD>'))
    char_sentences.append(char_sentence)
#the output is reshaped to fit input of convolution layer
char_sentences = np.array(char_sentences).reshape(-1, SEQ_LENGTH,
                                                  TOKEN_LENGTH)
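
To make the shape of the character input concrete, here is a NumPy-only sketch that encodes one short sentence into a grid of character indices. The sizes are deliberately small (not the notebook's 30 and 15), and `encode_sentence` is a hypothetical helper; the index layout follows the one described above.

```python
import string
import numpy as np

SEQ_LEN, TOK_LEN = 4, 6   # small illustrative sizes, not the notebook's values

# Same layout as above: 0 = pad, 1 = unknown, letters from 2, digits from 28.
char_index = {c: i + 2 for i, c in enumerate(string.ascii_lowercase)}
char_index.update({str(d): d + 28 for d in range(10)})

def encode_sentence(tokens):
    """Return a (SEQ_LEN, TOK_LEN) grid of character indices."""
    grid = np.zeros((SEQ_LEN, TOK_LEN), dtype=np.int32)
    for row, token in enumerate(tokens[:SEQ_LEN]):
        ids = [char_index.get(c, 1) for c in token[:TOK_LEN]]
        grid[row, :len(ids)] = ids
    return grid

grid = encode_sentence(["ke", "pasar", "-nya"])
# grid[0] -> [12, 6, 0, 0, 0, 0]       ("ke")
# grid[2] -> [1, 15, 26, 2, 0, 0]      ("-" is unknown)
# grid[3] -> all zeros                 (padding token)
```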

Next, I build a label encoder for the POS tags. The label sequences have to be padded and truncated as well. Note that 0 is the label for padding: `LabelEncoder` assigns indices in sorted order, and `<PAD>` sorts before all of the uppercase tag names.

In [11]:
tag_list = ['<PAD>']
for tag_ in tags:
    tag_list += tag_
tag_list = set(tag_list)
print '#tag:', len(tag_list)

#tag: 24


In [12]:
tag_encoder = LabelEncoder().fit(list(tag_list))
index_tags = list()
for tag_ in tags:
    index_tags.append(list(tag_encoder.transform(tag_)))
index_tags = tf.contrib.keras.preprocessing.sequence \
                    .pad_sequences(index_tags, SEQ_LENGTH,
                                   padding='post', truncating='post')
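
The reason the padding label ends up at index 0 can be checked with plain Python. The sketch below uses `sorted()` to mimic the sorted class order of `LabelEncoder`; the tag lists are a made-up subset of the tagset.

```python
# Hypothetical subset of the tagset, for illustration only.
sample_tags = [["PRP", "VB", "Z"], ["NN", "CC", "NN"]]

# sorted() stands in for the sorted class order used by LabelEncoder.
classes = sorted({"<PAD>"} | {tag for sent in sample_tags for tag in sent})
tag2index = {tag: idx for idx, tag in enumerate(classes)}
encoded = [[tag2index[tag] for tag in sent] for sent in sample_tags]
# classes -> ['<PAD>', 'CC', 'NN', 'PRP', 'VB', 'Z']
# encoded -> [[3, 4, 5], [2, 1, 2]]
```

Because `<` precedes every uppercase letter in ASCII, `<PAD>` always sorts first, so padded label positions (value 0) and the padding class coincide.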

I include word shape as an additional feature, and pad and truncate the word shape sequences as well. Note that 0 is the label for padding.

In [13]:
def word_shape(token):
    if re.sub('[^A-Za-z\s]+', '', token) != token:
        return 1 #other
    elif token.lower().title() == token:
        return 2 #upperInitial
    elif token.lower() == token:
        return 3 #lowercase
    elif token.upper() == token:
        return 4 #uppercase
    else:
        return 5 #mixed
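
A quick check of the five shape classes on a few sample tokens is shown below. The function is repeated (with a raw-string regex) only so that the example is self-contained; the sample tokens are made up.

```python
import re

def word_shape(token):
    # Same rules as the cell above, repeated so the example runs on its own.
    if re.sub(r'[^A-Za-z\s]+', '', token) != token:
        return 1  # other: contains digits, punctuation, ...
    elif token.lower().title() == token:
        return 2  # upperInitial
    elif token.lower() == token:
        return 3  # lowercase
    elif token.upper() == token:
        return 4  # uppercase
    else:
        return 5  # mixed

samples = ['Saya', 'pergi', 'NATO', 'iPhone', '2017', '.']
shapes = [word_shape(t) for t in samples]
# shapes -> [2, 3, 4, 5, 1, 1]
```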

In [14]:
sentence_word_shapes = list()
for sentence_ in sentences:
    _ = [word_shape(token) for token in sentence_]
    sentence_word_shapes.append(_)
sentence_word_shapes = tf.contrib.keras.preprocessing.sequence \
                        .pad_sequences(sentence_word_shapes,
                        SEQ_LENGTH, padding='post', truncating='post')

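Finally, I split the data into train and test sets. Since `train_test_split` is called without `test_size`, it uses scikit-learn's default of 75% train and 25% test. Strictly speaking, I should prepare a dev set as well, but I don't think it's necessary for my case. So basically, my test set is my dev set.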

In [15]:
X_train, X_test, Y_train, Y_test, X_ws_train, X_ws_test, \
    seq_length_train, seq_length_test, X_char_train, X_char_test = \
        train_test_split(index_sentences, index_tags,
                         sentence_word_shapes, seq_length,
                         char_sentences, random_state=100)

## Building the Network
Now, I start building the network. Since this is quite a complex network, there are many hyperparameters. I also write the hyperparameter values to TensorBoard.

In [66]:
#resetting the graph
tf.reset_default_graph()

In [68]:
SEED = 100
TAG_SIZE = len(tag_list) #number of POS tag
SHAPE_SIZE = np.max(sentence_word_shapes) + 1 #number of wordshape type
DATA_SIZE = X_train.shape[0] #size of train data
CHAR_SIZE = len(char2index) #number of character type
CHAR_DIM = 32 #dimension of character embedding
COMB_CHAR = 5 #convolution filter size
OUT_CHAR = 8 #output dimension from convolution layer
LSTM_DIM = 128 #output dimension of LSTM layer
KEEP_PROB = 0.5 #on dropout
LEARNING_RATE = 1e-2
DECAY_RATE = 0.95 #on learning rate
DECAY_STEP = 50 #on learning rate
GRAD_MAX = 5.0 #for gradient clipping
EPOCHS = 16 #number of iterations; can be manually interrupted
MINI_BATCH_SIZE = 2**11
PRINT_INTERVAL = 1 #and interval for summary writer
SAVER_INTERVAL = 100

In [69]:
with tf.name_scope('hyperparameter'):
    #need keep_prob placeholder since it should be 1 during prediction
    keep_prob = tf.placeholder(tf.float32)

    #summaries for hyperparameter
    keep_prob_summ = tf.summary.scalar('keep_prob', keep_prob)
    seed_summ = tf.summary.scalar('seed', tf.convert_to_tensor(SEED))
    lstm_dim_summ = tf.summary.scalar('lstm_dim',
                                      tf.convert_to_tensor(LSTM_DIM))
    decay_step_summ = tf.summary.scalar('decay_step',
                                    tf.convert_to_tensor(DECAY_STEP))
    decay_rate_summ = tf.summary.scalar('decay_rate',
                                    tf.convert_to_tensor(DECAY_RATE))
    grad_max_summ = tf.summary.scalar('grad_max',
                                     tf.convert_to_tensor(GRAD_MAX))
    data_size_summ = tf.summary.scalar('data_size',
                                tf.convert_to_tensor(DATA_SIZE))
    tag_size_summ = tf.summary.scalar('tag_size',
                                tf.convert_to_tensor(TAG_SIZE))
    shape_size_summ = tf.summary.scalar('shape_size',
                                tf.convert_to_tensor(SHAPE_SIZE))
    mini_batch_summ = tf.summary.scalar('mini_batch_size',
                                tf.convert_to_tensor(MINI_BATCH_SIZE))
    seq_length_summ = tf.summary.scalar('seq_length',
                                      tf.convert_to_tensor(SEQ_LENGTH))
    token_length_summ = tf.summary.scalar('token_length',
                                    tf.convert_to_tensor(TOKEN_LENGTH))
    char_size_summ = tf.summary.scalar('char_size',
                                      tf.convert_to_tensor(CHAR_SIZE))
    char_dim_summ = tf.summary.scalar('char_dim',
                                      tf.convert_to_tensor(CHAR_DIM))
    comb_char_summ = tf.summary.scalar('comb_char',
                                      tf.convert_to_tensor(COMB_CHAR))
    out_char_summ = tf.summary.scalar('out_char',
                                      tf.convert_to_tensor(OUT_CHAR))

I like to write notes of the experiment I am going to do. So here it is!

In [70]:
notes = 'W1: Xavier; b1: zeros; ' + \
        'loss: crf log likelihood; ' + \
        'optimizer: adam'
comments = tf.placeholder(dtype=tf.string) #will be filled later
with tf.name_scope('experiment_notes'):
    notes_summ = tf.summary.text('notes', tf.convert_to_tensor(notes))
    comments_summ = tf.summary.text('comments', comments)

I create placeholders for the data I prepared earlier.

In [71]:
with tf.name_scope('input'):
    X_input = tf.placeholder(name='X_input', dtype=tf.int32,
                         shape=[None, SEQ_LENGTH])
    X_word_shape = tf.placeholder(name='X_word_shape', dtype=tf.int32,
                                  shape=[None, SEQ_LENGTH])
    X_char_input = tf.placeholder(name='X_char_input', dtype=tf.int32,
                                  shape=[None, SEQ_LENGTH,
                                         TOKEN_LENGTH])
    Y_input = tf.placeholder(name='Y_input', dtype=tf.int32,
                         shape=[None, SEQ_LENGTH])
    seq_length_input = tf.placeholder(dtype=tf.int32, shape=[None])
    mask_input = tf.cast(tf.sign(X_input), dtype=tf.int32)

I initialize the character embedding randomly using the **Xavier initializer**. Then, the character vectors are fed to **convolution** and **max pooling** layers. Later, the result is concatenated with the word vectors.

In [72]:
with tf.name_scope('char'):
    char_embed = tf.get_variable(name='char_embed', shape=[CHAR_SIZE,
                            CHAR_DIM], initializer=tf.contrib.layers \
                            .xavier_initializer(seed=SEED))
    X_char = tf.nn.embedding_lookup(char_embed, X_char_input)
    char_filter = tf.get_variable(name='char_filter', shape=[1,
                    COMB_CHAR, CHAR_DIM, OUT_CHAR], initializer= \
                    tf.contrib.layers.xavier_initializer(seed=SEED))
    conv = tf.nn.conv2d(X_char, char_filter, [1, 1, 1, 1], 'VALID')
    maxpool = tf.nn.max_pool(conv, [1, 1, TOKEN_LENGTH-COMB_CHAR+1,
                                    1], [1, 1, 1, 1], 'VALID')
    maxpool = tf.reshape(maxpool,shape= [-1, SEQ_LENGTH, OUT_CHAR])
    
    #summary
    char_embed_summ = tf.summary.histogram('char_embed_summary',
                                           char_embed)
    char_filter_summ = tf.summary.histogram('char_filter_summary',
                                            char_filter)
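
What the convolution and max pooling do to a single token can be sketched in NumPy. The random arrays below are stand-ins for the embedded characters and for `char_filter`; only the sizes are taken from the hyperparameter cell. This is a sketch of the computation for one token, not the TensorFlow ops themselves.

```python
import numpy as np

TOKEN_LEN, CHAR_DIM, WIDTH, OUT = 15, 32, 5, 8   # sizes from the hyperparameters
rng = np.random.RandomState(0)
token_chars = rng.randn(TOKEN_LEN, CHAR_DIM)     # embedded characters of one token
filters = rng.randn(WIDTH, CHAR_DIM, OUT)        # stand-in for char_filter

# 'VALID' convolution: slide a WIDTH-character window along the token.
windows = np.stack([token_chars[i:i + WIDTH]
                    for i in range(TOKEN_LEN - WIDTH + 1)])   # (11, 5, 32)
conv = np.einsum('pwc,wco->po', windows, filters)             # (11, 8)

# Max pooling over all window positions leaves one vector per token.
token_feature = conv.max(axis=0)                              # (8,)
```

Each of the 8 output values is the strongest response of one filter anywhere in the token, which is why the pooling window above spans all `TOKEN_LENGTH-COMB_CHAR+1` positions.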

Word Embedding is initialized using the pre-trained Word2Vec. However, the embedding is still trainable.

In [73]:
with tf.name_scope('word'):
    word_embed = tf.Variable(name='word_embed',
                             initial_value=word_vec_embed,
                             trainable=True, dtype=tf.float32)
    X_word = tf.nn.embedding_lookup(word_embed, X_input)
    X_word = tf.concat([tf.one_hot(X_word_shape, SHAPE_SIZE), X_word],
                       axis=2)
#     X_embed = X_word
    X_embed = tf.concat([X_word, maxpool], axis=2, name='X_embed')
    
    #summary
    word_embed_summ = tf.summary.histogram('word_embed_summary',
                                           word_embed)
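
The size of the feature vector that reaches the LSTM follows from the two concatenations above. The sketch below reproduces the arithmetic with zero-filled placeholder arrays; `SHAPE_SIZE = 6` is an assumption that holds only if all five word shapes occur in the data.

```python
import numpy as np

# SHAPE_SIZE = 6 assumes padding (0) plus all five word shapes are present.
SHAPE_SIZE, WORD_DIM, OUT_CHAR, SEQ_LEN = 6, 128, 8, 30

shape_ids = np.zeros(SEQ_LEN, dtype=np.int64)
shape_ids[:3] = [2, 3, 1]                        # three real tokens, rest padding
one_hot = np.eye(SHAPE_SIZE)[shape_ids]          # (30, 6)
word_vecs = np.zeros((SEQ_LEN, WORD_DIM))        # placeholder word embeddings
char_feats = np.zeros((SEQ_LEN, OUT_CHAR))       # placeholder CNN features

features = np.concatenate([one_hot, word_vecs, char_feats], axis=1)
# features.shape -> (30, 142), i.e. 6 + 128 + 8 values per token
```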

I initialize the weights and biases for the layer after the LSTM.

In [74]:
with tf.name_scope('weights'):
    W1 = tf.get_variable(name='W1', shape=[2*LSTM_DIM, TAG_SIZE],
                         initializer=tf.contrib.layers \
                                     .xavier_initializer(seed=SEED))
    b1 = tf.get_variable(name='b1', shape=[TAG_SIZE],
                         initializer=tf.zeros_initializer())
    
    #summary
    W1_summ = tf.summary.histogram('W1_summary', W1)
    b1_summ = tf.summary.histogram('b1_summary', b1)

I use a basic LSTM with the default activation (**tanh**) and a dropout wrapper for both the forward and backward cells. Then, those cells are passed to a **bidirectional dynamic RNN**, which is why the sequence length is necessary. After that, the LSTM outputs are multiplied by the weights and the biases are added.

In [75]:
with tf.name_scope('lstm'):
    cell_fw = tf.nn.rnn_cell.BasicLSTMCell(num_units=LSTM_DIM)
    cell_fw = tf.nn.rnn_cell.DropoutWrapper(cell_fw, seed=SEED,
                                            output_keep_prob=keep_prob)
    cell_bw = tf.nn.rnn_cell.BasicLSTMCell(num_units=LSTM_DIM)
    cell_bw = tf.nn.rnn_cell.DropoutWrapper(cell_bw, seed=SEED,
                                            output_keep_prob=keep_prob)
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
                        X_embed, seq_length_input, dtype=tf.float32)
    outputs = tf.concat([outputs[0], outputs[1]], axis=-1)

In [76]:
with tf.name_scope('output'):
    seq_outputs = tf.reshape(tf.nn.xw_plus_b(tf.reshape( \
                    outputs, shape=[-1, 2*LSTM_DIM]),
                    W1, b1), shape=[-1, SEQ_LENGTH, TAG_SIZE],
                    name='seq_outputs')
    softmax = tf.nn.softmax(seq_outputs)
    #prediction without crf
    seq_predict = tf.cast(tf.argmax(softmax, axis=2), tf.float32,
                          name='seq_predict')
    
    #summary
    seq_outputs_summ = tf.summary.histogram('seq_outputs_summary',
                                            seq_outputs)
    seq_pred_summ = tf.summary.histogram('seq_predict_summary',
                                         seq_predict)

The **CRF** module in TensorFlow is pretty amazing! I just need to write a few lines of code to calculate the log likelihood and the predicted sequence. Note that the **transition matrix** here holds the learned scores (not normalized probabilities) for moving from one tag to another, and the **Viterbi sequence** is the predicted sequence of tags.

In [77]:
with tf.name_scope('loss'):
    log_likelihood, transition_matrix = tf.contrib.crf \
        .crf_log_likelihood(seq_outputs, Y_input, seq_length_input)
    loss = tf.reduce_mean(-log_likelihood)
    viterbi_seq, viterbi_score = tf.contrib.crf.crf_decode( \
        seq_outputs, transition_matrix, seq_length_input)

    #summary
    loss_summ = tf.summary.scalar('loss_function', loss)
    trans_matrix_summ = tf.summary.histogram('trans_matrix_summ',
                                             transition_matrix)
    viterbi_seq_summ = tf.summary.histogram('viterbi_seq_summary',
                                             viterbi_seq)
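
To show what the decoding step computes, here is a minimal NumPy sketch of Viterbi decoding for a single sentence. It is an illustration of the algorithm under simple assumptions (no start or end transitions, made-up scores), not the `tf.contrib.crf` implementation.

```python
import numpy as np

def viterbi_decode(scores, transitions):
    """Return the highest-scoring tag path and its score.

    scores:      (seq_len, num_tags) per-token tag scores
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    seq_len, num_tags = scores.shape
    best = scores[0].copy()                 # best score of a path ending in each tag
    backptr = np.zeros((seq_len, num_tags), dtype=np.int64)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i, followed by transition i -> j
        candidate = best[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        best = candidate.max(axis=0) + scores[t]
    path = [int(best.argmax())]
    for t in range(seq_len - 1, 0, -1):     # follow the back-pointers
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(best.max())

scores = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
transitions = np.array([[0.0, -3.0], [-3.0, 0.0]])
path, score = viterbi_decode(scores, transitions)
# path -> [1, 1, 1], score -> 2.5
```

Picking the best tag per token independently would give `[0, 1, 1]` here; the transition penalty makes the consistent path `[1, 1, 1]` score higher, which is the kind of correction the CRF layer adds on top of the softmax prediction.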

Now, I set up the training. I use **Adam** as my optimizer. I understand that Adam adapts the step size, but when I tried to extract the learning rate from Adam, it just did not change. That is because Adam adapts the effective step per parameter through its moment estimates, while the base learning rate it is given stays constant. Thus, to be sure, I use a learning rate with **exponential decay**. I also use **gradient clipping** to ensure the updates are not too big.

In [78]:
with tf.name_scope('train'):
    global_step = tf.Variable(0, trainable=False, name='global_step')
    inc_global_step_op = global_step.assign_add(1)
    learning_rate = tf.train.exponential_decay(LEARNING_RATE,
                                        global_step, DECAY_STEP,
                                        DECAY_RATE, staircase=True)
    optimizer = tf.train.AdamOptimizer(learning_rate, name='optimizer')
    grads, variables = zip(*optimizer.compute_gradients(loss))
    capped_grads, _ = tf.clip_by_global_norm(grads, GRAD_MAX)
    train_op = optimizer.apply_gradients(zip(capped_grads, variables))

    #summary
    learning_summ = tf.summary.scalar('learning_rate', learning_rate)
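
The two schedules used above can be written out in a few lines of NumPy. Both helpers below are hypothetical re-implementations for illustration: staircase exponential decay as `base_lr * decay_rate ** floor(step / decay_steps)`, and clipping by global norm as a joint rescaling of all gradients.

```python
import numpy as np

def staircase_decay(base_lr, step, decay_steps, decay_rate):
    """base_lr * decay_rate ** floor(step / decay_steps)."""
    return base_lr * decay_rate ** (step // decay_steps)

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = max_norm / max(global_norm, max_norm)   # 1.0 when already small enough
    return [g * scale for g in grads], global_norm

rates = [staircase_decay(1e-2, step, 50, 0.95) for step in (0, 16, 50, 100)]
# rates -> approximately [0.01, 0.01, 0.0095, 0.009025]

clipped, norm = clip_by_global_norm([np.array([3.0, 4.0]), np.array([12.0])], 5.0)
# norm -> 13.0; the clipped gradients have a joint norm of 5.0
```

Note that the training loop in the next section increments `global_step` once per epoch, so with `DECAY_STEP = 50` the rate would only drop after 50 epochs and stays at its initial value during the 16 epochs run here.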

I use two accuracy measures. The first is **sentence accuracy**. I think this measure is too strict since the predicted sequence has to be completely equal to the true sequence. Moreover, sentence lengths vary and some sentences might contain many OOV words. Therefore, I also use another measure, **token accuracy**.

In [79]:
with tf.name_scope('accuracy'):
    viterbi_seq = tf.multiply(viterbi_seq, mask_input)
    output_equal = tf.cast(tf.equal(tf.cast(viterbi_seq, tf.int32),
                        Y_input), dtype=tf.float32)
    accuracy = tf.reduce_mean(tf.reduce_min(output_equal, axis=1),
                              name='acc')
    accuracy_token = tf.reduce_sum(tf.multiply(output_equal,
                        tf.cast(mask_input, tf.float32)))/tf.cast( \
                        tf.reduce_sum(mask_input), tf.float32)

    #summary
    #I create separate summary for train and test
    train_acc_summ = tf.summary.scalar('train_accuracy', accuracy)
    test_acc_summ = tf.summary.scalar('test_accuracy', accuracy)
    train_acc_tkn_summ = tf.summary.scalar('train_token_accuracy',
                                           accuracy_token)
    test_acc_tkn_summ = tf.summary.scalar('test_token_accuracy',
                                          accuracy_token)
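
The difference between the two measures is easy to see on a toy batch. The arrays below are made up; the mask marks real tokens with 1 and padding with 0, as `mask_input` does.

```python
import numpy as np

y_true = np.array([[3, 5, 2, 0], [4, 1, 0, 0]])   # 0 is the padding label
y_pred = np.array([[3, 5, 2, 0], [4, 2, 0, 0]])   # one wrong tag in sentence 2
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

equal = (y_pred * mask == y_true).astype(np.float32)
sentence_acc = equal.min(axis=1).mean()           # a sentence counts only if every tag matches
token_acc = (equal * mask).sum() / mask.sum()     # padded positions are ignored
# sentence_acc -> 0.5, token_acc -> 0.8
```

A single wrong tag costs the whole second sentence under sentence accuracy, while token accuracy still credits the four correct tags out of five.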

Finally, I create the merge ops for the summaries and a saver object.

In [80]:
with tf.name_scope('summary'):
    train_merger = tf.summary.merge([train_acc_summ, loss_summ,
                                     train_acc_tkn_summ,
                                     learning_summ])
    test_merger = tf.summary.merge([test_acc_summ,
                                    test_acc_tkn_summ])
    histo_merger = tf.summary.merge([word_embed_summ, W1_summ, b1_summ,
                                     char_embed_summ, char_filter_summ,
                                     seq_outputs_summ, seq_pred_summ,
                                     viterbi_seq_summ,
                                     trans_matrix_summ])
    hyper_merger = tf.summary.merge([seed_summ, lstm_dim_summ,
                                data_size_summ, shape_size_summ,
                                tag_size_summ, grad_max_summ,
                                mini_batch_summ, keep_prob_summ,
                                decay_rate_summ, decay_step_summ,
                                seq_length_summ, token_length_summ,
                                char_size_summ, char_dim_summ, 
                                comb_char_summ, out_char_summ])
    notes_merger = tf.summary.merge([notes_summ])
    comments_merger = tf.summary.merge([comments_summ])

In [81]:
with tf.name_scope('saver'):
    saver = tf.train.Saver()

## Training the Network
Let's start training the network! First, I prepare the writer object and set a common seed to ensure reproducibility. I also prepare separate feed dictionaries for train and test.

In [82]:
#save target
LOG_DIR = 'INPUT-YOUR-LOG-DIRECTORY-HERE'

In [83]:
tf.set_random_seed(SEED)
np.random.seed(SEED)
sess = tf.Session()
writer = tf.summary.FileWriter(LOG_DIR, sess.graph)
sess.run(tf.global_variables_initializer())

In [84]:
train_dict = {X_input: X_train, X_word_shape: X_ws_train,
              Y_input: Y_train, seq_length_input: seq_length_train,
              X_char_input: X_char_train, keep_prob: 1.0}
test_dict = {X_input: X_test, X_word_shape: X_ws_test,
             Y_input: Y_test, seq_length_input: seq_length_test,
             X_char_input: X_char_test, keep_prob: 1.0}

In [85]:
#comment before training
start_comments = 'best setup'

Here is the training algorithm. I shuffle the data each epoch to reduce overfitting.

In [86]:
# writing notes
summ = sess.run(hyper_merger, {keep_prob: KEEP_PROB})
writer.add_summary(summ, 0)
summ = sess.run(notes_merger)
writer.add_summary(summ, 0)
summ = sess.run(comments_merger, {comments: start_comments})
writer.add_summary(summ, 0)

# training
summ = sess.run(train_merger, train_dict)
writer.add_summary(summ, 0)
summ = sess.run(histo_merger, train_dict)
writer.add_summary(summ, 0)
summ = sess.run(test_merger, test_dict)
writer.add_summary(summ, 0)
print 'step 0'
START_TIME = time.time()

for step in range(1, EPOCHS+1):
    r = np.random.permutation(DATA_SIZE) #to shuffle the data
    #ceiling division so the last partial batch is never skipped
    n_batches = (DATA_SIZE + MINI_BATCH_SIZE - 1) // MINI_BATCH_SIZE
    for no_batch in range(n_batches):
        start = no_batch*MINI_BATCH_SIZE
        end = min((no_batch+1)*MINI_BATCH_SIZE, DATA_SIZE)
        batch_dict = {X_input: X_train[r][start: end],
                    X_word_shape: X_ws_train[r][start: end],
                    Y_input: Y_train[r][start: end],
                    seq_length_input: seq_length_train[r][start: end],
                    X_char_input: X_char_train[r][start: end],
                    keep_prob: KEEP_PROB}
        sess.run(train_op, batch_dict)
    i_step = sess.run(inc_global_step_op)
    
    if step%PRINT_INTERVAL == 0:
        summ = sess.run(train_merger, train_dict)
        writer.add_summary(summ, step)
        summ = sess.run(histo_merger, train_dict)
        writer.add_summary(summ, step)
        summ = sess.run(test_merger, test_dict)
        writer.add_summary(summ, step)
        print 'step {0}: {1:.2f} mins'.format(step,
                                        (time.time() - START_TIME)/60)
        START_TIME = time.time()

    if step%SAVER_INTERVAL == 0:
        saver.save(sess, LOG_DIR + 'model', global_step=step)

step 0
step 1: 0.17 mins
step 2: 0.17 mins
step 3: 0.19 mins
step 4: 0.17 mins
step 5: 0.16 mins
step 6: 0.19 mins
step 7: 0.16 mins
step 8: 0.17 mins
step 9: 0.17 mins
step 10: 0.19 mins
step 11: 0.16 mins
step 12: 0.17 mins
step 13: 0.19 mins
step 14: 0.18 mins
step 15: 0.18 mins
step 16: 0.21 mins
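
The shuffling and batching logic of the loop above can be isolated in a small generator. This is a sketch with a hypothetical helper name and toy sizes; it yields index arrays instead of slicing the data itself.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover one random permutation of the data."""
    order = rng.permutation(n_samples)
    n_batches = (n_samples + batch_size - 1) // batch_size   # ceiling division
    for b in range(n_batches):
        yield order[b * batch_size:(b + 1) * batch_size]

rng = np.random.RandomState(100)
batches = list(iterate_minibatches(n_samples=10, batch_size=4, rng=rng))
sizes = [len(b) for b in batches]
# sizes -> [4, 4, 2]; every sample index appears exactly once per epoch
```

Indexing each array once per batch (for example `X_train[idx]`) avoids building a fully shuffled copy with `X_train[r]` for every batch, which may matter for larger datasets.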


## Evaluating the Network
To evaluate the network, I mainly look at the logs in TensorBoard. Here, I also inspect some of the incorrect predictions by hand.

In [130]:
# print incorrect prediction
predict = sess.run(tf.cast(viterbi_seq, tf.int32), test_dict)
mask_pred = sess.run(tf.cast(mask_input, tf.int32), test_dict)
for i in range(len(X_test[:5])):
    if not np.array_equal(Y_test[i], predict[i]):
        print np.array([index2word[idx_] for idx_ in X_test[i]]) \
                    [mask_pred[i] == 1]
        print tag_encoder.inverse_transform(Y_test[i]) \
                    [mask_pred[i] == 1] #correct sequence
        print tag_encoder.inverse_transform(predict[i]) \
                    [mask_pred[i] == 1] #predicted sequence

['kita' 'lihat' 'daya beli' 'masyarakat' 'masih' 'batas' '' 'indikasi'
 '-nya' 'rasio' 'tabung' 'hadap' 'pdb' 'cenderung' 'turun' '']
['PRP' 'VB' 'NN' 'NN' 'MD' 'JJ' 'Z' 'NN' 'PRP' 'NN' 'NN' 'IN' 'NN' 'JJ'
 'VB' 'Z']
['PRP' 'VB' 'NN' 'NN' 'MD' 'VB' 'Z' 'VB' 'PRP' 'NN' 'NN' 'IN' 'NN' 'JJ'
 'VB' 'Z']
['juru bicara' 'nato' 'tolak' 'beri' 'detail' 'jadi' '' 'dan' 'kata'
 'bahwa' 'dengan' 'sebut' 'provinsi' 'dapat' 'ungkap' 'kewarganegaraan'
 'prajurit' 'itu' '']
['NN' 'NNP' 'VB' 'VB' 'NN' 'NN' 'Z' 'CC' 'VB' 'SC' 'SC' 'VB' 'NN' 'MD' 'VB'
 'NN' 'NN' 'PR' 'Z']
['NN' 'NNP' 'VB' 'VB' 'NN' 'VB' 'Z' 'CC' 'VB' 'SC' 'IN' 'PR' 'NN' 'MD' 'VB'
 'NN' 'NN' 'PR' 'Z']


To evaluate the network further, I create a function that predicts the POS tags of a custom sentence.

In [91]:
def custom_predict(sentence_raw):
    sentence_raw = sentence_raw.split()

    sentence_ws = [word_shape(token) for token in sentence_raw]
    sentence_ws = tf.contrib.keras.preprocessing.sequence \
                        .pad_sequences([sentence_ws], SEQ_LENGTH,
                                    padding='post', truncating='post')
    
    sentence = [stemmer.stem(token) for token in sentence_raw]
    print sentence
    
    char_sentence = list()
    count = 0
    for token_ in sentence:
        char_sentence.append(convert_token(token_))
        count += 1
        if count == SEQ_LENGTH: #keep all SEQ_LENGTH tokens, like the word indices
            break
    for i in range(SEQ_LENGTH-count):
        char_sentence.append(convert_token('<PAD>'))
    char_sentence = np.array(char_sentence).reshape(-1, SEQ_LENGTH,
                                                    TOKEN_LENGTH)

    sentence = [word2index[token] if token in word2index else 1
                for token in sentence]
    print sentence
    sentence = tf.contrib.keras.preprocessing.sequence \
                        .pad_sequences([sentence], SEQ_LENGTH,
                                    padding='post', truncating='post')

    custom_dict = {X_input: sentence, X_word_shape: sentence_ws,
                   #clip the length, since the inputs are truncated
                   seq_length_input: np.array([min(SEQ_LENGTH,
                                                   len(sentence_raw))]),
                   X_char_input: char_sentence,
                   keep_prob: 1.0}
    predictor = sess.run(viterbi_seq, custom_dict)
    masked = sess.run(mask_input, custom_dict)
    return tag_encoder.inverse_transform(predictor[0])[masked[0] == 1]

In [131]:
stc = 'Saya pergi ke pasar .'
print custom_predict(stc)

['saya', 'pergi', 'ke', 'pasar', '']
[189, 1100, 53, 70, 2]
['PRP' 'VB' 'IN' 'NN' 'Z']


Finally, I add closing comments consisting of the evaluation result and plans for further experiments. I also save the final model.

In [99]:
end_comments = 'works nice'
summ = sess.run(comments_merger, {comments: end_comments})
writer.add_summary(summ, 1)
writer.close()
saver.save(sess, LOG_DIR + 'model-fin', global_step=step)

'/home/av170602/log/tensorflow/pos_tagger/171206/01/model-fin-16'

That's all. Hope you enjoy!