# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- trained sentencepiece model (model and vocab files)
- pretrained Japanese BERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available from https://www.rondhuit.com/download.html.  
The data is split into train and dev sets at a ratio of 8:2.
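
The 8:2 split can be sketched as follows. This is a minimal illustration, not the preprocessing actually used for this notebook: the helper name `split_train_dev`, the fixed seed, and the toy rows are all assumptions.

```python
import random

def split_train_dev(rows, train_ratio=0.8, seed=0):
    """Shuffle rows reproducibly and split them into (train, dev)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Toy (text, label) rows standing in for the real articles.
rows = [("article {}".format(i), "label-{}".format(i % 3)) for i in range(10)]
train_rows, dev_rows = split_train_dev(rows)
print(len(train_rows), len(dev_rows))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models on the same dev set.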

In [6]:
import configparser
import glob
import os
import pandas as pd
import subprocess
import sys
import tarfile 
from urllib.request import urlretrieve

CURDIR = os.getcwd()
CONFIGPATH = os.path.join(CURDIR, os.pardir, 'config.ini')
config = configparser.ConfigParser()
config.read(CONFIGPATH)
print("Done reading config")

Done reading config
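
`ConfigParser.read` skips files it cannot open and returns the list of files it actually parsed, so a wrong `CONFIGPATH` only surfaces later as a missing section. A small guard makes that failure explicit. The helper name `load_config` and the demo file contents are hypothetical.

```python
import configparser
import os
import tempfile

def load_config(path):
    """Read an INI file and fail loudly if it could not be parsed."""
    parser = configparser.ConfigParser()
    parsed = parser.read(path, encoding="utf-8")
    if not parsed:
        raise FileNotFoundError("config file not found: {}".format(path))
    return parser

# Demo with a throwaway file; the key and value are illustrative only.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "config.ini")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("[BERT-CONFIG]\nhidden_size = 768\n")
    demo = load_config(demo_path)
    print(demo["BERT-CONFIG"]["hidden_size"])
```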


## Finetune pre-trained model

Running the following cells in a CPU environment takes many hours.  
You can also use Colab to take advantage of a TPU; in that case, upload the created data to your GCS bucket.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zZH2GWe0U-7GjJ2w2duodFfEUptvHjcx)

In [2]:
PRETRAINED_MODEL_PATH = '../model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = '../model/livedoor_output'

In [4]:
%%time
# This takes many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --do_train=true \
  --do_eval=true \
  --data_dir=../data/livedoor \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

W1008 19:16:31.047955 139977052022528 deprecation_wrapper.py:119] From /data2/m-taketani/practice/text_classification/sentencepiece/bert-japanese/src/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W1008 19:16:31.049230 139977052022528 deprecation_wrapper.py:119] From ../src/run_classifier.py:854: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W1008 19:16:31.049881 139977052022528 deprecation_wrapper.py:119] From ../src/run_classifier.py:659: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W1008 19:16:31.050039 139977052022528 deprecation_wrapper.py:119] From ../src/run_classifier.py:659: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W1008 19:16:31.050537 139977052022528 deprecation_wrapper.py:119] From /data2/m-taketani/practice/text_classification/sentencepiece/bert-japanese/src/modeling.py:92:

I1008 19:16:42.631214 139977052022528 run_classifier.py:744] ***** Running training *****
I1008 19:16:42.631454 139977052022528 run_classifier.py:745]   Num examples = 5893
I1008 19:16:42.631561 139977052022528 run_classifier.py:746]   Batch size = 4
I1008 19:16:42.631637 139977052022528 run_classifier.py:747]   Num steps = 14732
W1008 19:16:42.631808 139977052022528 deprecation_wrapper.py:119] From ../src/run_classifier.py:389: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W1008 19:16:42.639947 139977052022528 deprecation.py:323] From /home/m-taketani/.pyenv/versions/anaconda3-5.2.0/envs/py37/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1008 19:1

W1008 19:16:46.501877 139977052022528 deprecation.py:323] From /home/m-taketani/.pyenv/versions/anaconda3-5.2.0/envs/py37/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:1205: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
I1008 19:16:54.575535 139977052022528 estimator.py:1147] Done calling model_fn.
I1008 19:16:54.577021 139977052022528 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I1008 19:16:57.888806 139977052022528 monitored_session.py:240] Graph was finalized.
2019-10-08 19:16:57.889260: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropria

^C
CPU times: user 15.3 s, sys: 1.83 s, total: 17.1 s
Wall time: 12min 2s
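
The `Num steps = 14732` line in the log is consistent with the step formula of the original BERT `run_classifier.py`, which truncates `examples / batch_size * epochs` to an integer. Whether `../src/run_classifier.py` uses exactly this formula is an assumption; the sketch below only checks the arithmetic against the logged values.

```python
num_examples = 5893   # "Num examples" from the log
batch_size = 4        # --train_batch_size
num_epochs = 10       # --num_train_epochs

# Truncating conversion, as in the original BERT run_classifier.py (assumed here).
num_train_steps = int(num_examples / batch_size * num_epochs)
print(num_train_steps)  # 14732
```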


## Predict using the finetuned model

Let's predict labels for the test data using the finetuned model.  

In [None]:
import sys
sys.path.append("../src")

import tokenization_sentencepiece as tokenization
from run_classifier import LivedoorProcessor
from run_classifier import model_fn_builder
from run_classifier import file_based_input_fn_builder
from run_classifier import file_based_convert_examples_to_features
from utils import str_to_value

In [None]:
sys.path.append("../bert")

import modeling
import optimization
import tensorflow as tf

In [None]:
import configparser
import json
import glob
import os
import pandas as pd
import tempfile

bert_config_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.json')
bert_config_file.write(json.dumps({k:str_to_value(v) for k,v in config['BERT-CONFIG'].items()}))
bert_config_file.seek(0)
bert_config = modeling.BertConfig.from_json_file(bert_config_file.name)
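
The cell above converts the string values of the `[BERT-CONFIG]` section into a JSON file that `BertConfig.from_json_file` can load. The implementation of `str_to_value` is not shown in this notebook, so the stand-in below, named `parse_value`, is a hypothetical approximation, and the section contents are made-up examples.

```python
import configparser
import json

def parse_value(text):
    """Hypothetical stand-in for str_to_value: string -> int, float, bool, or str."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    return text

parser = configparser.ConfigParser()
parser.read_string("""
[BERT-CONFIG]
hidden_size = 768
hidden_dropout_prob = 0.1
hidden_act = gelu
""")

# INI values are always strings, so each one is converted before serialization.
bert_config_dict = {k: parse_value(v) for k, v in parser["BERT-CONFIG"].items()}
bert_config_json = json.dumps(bert_config_dict, sort_keys=True)
print(bert_config_json)
```

Without such a conversion every value would be written to JSON as a string, including numeric fields such as the hidden size.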

In [None]:
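output_ckpts = glob.glob("{}/model.ckpt*data*".format(FINETUNE_OUTPUT_DIR))
# Pick the checkpoint with the largest global step. A plain string sort would
# rank e.g. model.ckpt-9000 after model.ckpt-14732.
latest_ckpt = max(output_ckpts, key=lambda p: int(p.split('model.ckpt-')[1].split('.')[0]))
FINETUNED_MODEL_PATH = latest_ckpt.split('.data-00000-of-00001')[0]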
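
Checkpoint file names embed the global step, and string order differs from numeric order once step counts have different numbers of digits. Below is a minimal sketch of picking the newest checkpoint by step number. The file names are invented for illustration and `step_of` is a hypothetical helper.

```python
import re

def step_of(path):
    """Extract the global step from a name like 'model.ckpt-9000.data-00000-of-00001'."""
    match = re.search(r"model\.ckpt-(\d+)", path)
    if match is None:
        raise ValueError("no step number in {!r}".format(path))
    return int(match.group(1))

paths = [
    "out/model.ckpt-9000.data-00000-of-00001",
    "out/model.ckpt-14732.data-00000-of-00001",
]

print(sorted(paths)[-1])        # string order picks ckpt-9000
print(max(paths, key=step_of))  # numeric order picks ckpt-14732
```

`tf.train.latest_checkpoint(FINETUNE_OUTPUT_DIR)` is another option; it reads the `checkpoint` index file that the Estimator writes to the output directory.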

In [None]:
class FLAGS(object):
    '''Parameters.'''
    def __init__(self):
        self.model_file = "../model/wiki-ja.model"
        self.vocab_file = "../model/wiki-ja.vocab"
        self.do_lower_case = True
        self.use_tpu = False
        self.output_dir = "/dummy"
        self.data_dir = "../data/livedoor"
        self.max_seq_length = 512
        self.init_checkpoint = FINETUNED_MODEL_PATH
        self.predict_batch_size = 4
        
        # The following parameters are not used for prediction.
        # They are set only so that RunConfig and the estimator can be constructed.
        self.master = None
        self.save_checkpoints_steps = 1
        self.iterations_per_loop = 1
        self.num_tpu_cores = 1
        self.learning_rate = 0
        self.num_warmup_steps = 0
        self.num_train_steps = 0
        self.train_batch_size = 0
        self.eval_batch_size = 0

In [None]:
FLAGS = FLAGS()

In [None]:
processor = LivedoorProcessor()
label_list = processor.get_labels()

In [None]:
tokenizer = tokenization.FullTokenizer(
    model_file=FLAGS.model_file, vocab_file=FLAGS.vocab_file,
    do_lower_case=FLAGS.do_lower_case)

tpu_cluster_resolver = None

is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))

In [None]:
model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)


estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

In [None]:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.tf_record')

file_based_convert_examples_to_features(predict_examples, label_list,
                                        FLAGS.max_seq_length, tokenizer,
                                        predict_file.name)

predict_drop_remainder = True if FLAGS.use_tpu else False

predict_input_fn = file_based_input_fn_builder(
    input_file=predict_file.name,
    seq_length=FLAGS.max_seq_length,
    is_training=False,
    drop_remainder=predict_drop_remainder)

In [None]:
result = estimator.predict(input_fn=predict_input_fn)

In [None]:
%%time
# This takes a few hours in a CPU environment.

result = list(result)

In [None]:
result[:2]

Read the test dataset and add the prediction results.

In [None]:
import pandas as pd

In [None]:
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [None]:
test_df['predict'] = [ label_list[elem['probabilities'].argmax()] for elem in result ]

In [None]:
test_df.head()

In [None]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)
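
Each element of `result` is a dict whose `probabilities` entry holds one score per label, and the prediction is the label at the position of the highest score. The sketch below shows that mapping and the accuracy computation, using plain lists in place of NumPy arrays. The labels and scores are made-up values, not the output of this model; the real label list comes from `processor.get_labels()`.

```python
label_list = ["it-life-hack", "movie-enter", "sports-watch"]  # illustrative labels

# Stand-in for list(estimator.predict(...)): one dict per test example.
result = [
    {"probabilities": [0.7, 0.2, 0.1]},
    {"probabilities": [0.1, 0.3, 0.6]},
    {"probabilities": [0.4, 0.5, 0.1]},
]
true_labels = ["it-life-hack", "sports-watch", "sports-watch"]

def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

predicted = [label_list[argmax(r["probabilities"])] for r in result]
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print(predicted, accuracy)
```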

A slightly more detailed check using `sklearn.metrics`.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
print(classification_report(test_df['label'], test_df['predict']))

In [None]:
print(confusion_matrix(test_df['label'], test_df['predict']))
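
`confusion_matrix` puts true labels on the rows and predicted labels on the columns, and the per-class precision and recall shown by `classification_report` follow from its column and row sums. Below is a sketch with a made-up 3x3 matrix; the counts are invented and `precision_recall` is a hypothetical helper.

```python
# Rows are true classes, columns are predicted classes (invented counts).
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

def precision_recall(matrix, k):
    """Precision and recall for class k from a square confusion matrix."""
    true_positive = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: everything predicted as k
    actual_k = sum(matrix[k])                    # row sum: everything that truly is k
    return true_positive / predicted_k, true_positive / actual_k

for k in range(len(matrix)):
    precision, recall = precision_recall(matrix, k)
    print(k, round(precision, 3), round(recall, 3))
```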

### Simple baseline model.

In [None]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
train_df = pd.read_csv("../data/livedoor/train.tsv", sep='\t')
dev_df = pd.read_csv("../data/livedoor/dev.tsv", sep='\t')
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [None]:
!apt-get install -q -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

In [None]:
!pip install mecab-python3==0.7

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [None]:
m = MeCab.Tagger("-Owakati")

In [None]:
train_dev_df = pd.concat([train_df, dev_df])

In [None]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [None]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
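
One detail worth knowing here: `TfidfVectorizer` tokenizes with the default pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so single-character tokens in MeCab's space-separated output are dropped. The sketch applies that pattern directly with `re`. The sample sentence is an invented example of wakati-style output.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

wakati = "私 は 猫 が 好き です"  # invented sample of space-separated tokens
tokens = TOKEN_PATTERN.findall(wakati)
print(tokens)  # ['好き', 'です']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` keeps single-character tokens if they matter for the task.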

The following setup is not exactly identical to that of BERT, because `GradientBoostingClassifier` internally uses `train_test_split` with shuffling to hold out its validation set.  
In addition, the parameters are not well tuned; however, we think this is enough to gauge the power of BERT.

In [None]:
%%time

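# validation_fraction is a share of the data passed to fit (train + dev),
# so divide by the combined size to hold out as many rows as the dev set.
model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_dev_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)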

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(dev_df)/len(train_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

In [None]:
print(classification_report(test_ys, model.predict(test_xs_)))

In [None]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))

In [1]:
!cp ../../preprocess_data_for_transformer_with_sp.ipynb ./