<a href="https://colab.research.google.com/github/HisakaKoji/bert-japanese/blob/master/finetune_to_livedoor_corpus_20191221.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- trained sentencepiece model (model and vocab files)
- pretrained Japanese BERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available from https://www.rondhuit.com/download.html.  
We split it into test:dev:train = 2:2:6.
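
A minimal sketch of a 2:2:6 split, using only the standard library. The function name `split_2_2_6`, the seed, and the toy records are assumptions for illustration and are not taken from the preprocessing code this notebook relies on.

```python
import random

def split_2_2_6(records, seed=0):
    """Shuffle records and split them into test:dev:train = 2:2:6."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = n * 2 // 10
    n_dev = n * 2 // 10
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]  # the remainder, about 60%
    return test, dev, train

# Toy (label, text) records; real records come from the corpus.
records = [("label_%d" % (i % 3), "text %d" % i) for i in range(100)]
test, dev, train = split_2_2_6(records)
print(len(test), len(dev), len(train))  # 20 20 60
```

Fixing the seed keeps the three sets reproducible across runs, which matters when results are compared between models.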

**This Colab notebook assumes the above models are stored in a GCS bucket whose objects you can access.**

In [18]:
!git clone --recursive https://github.com/HisakaKoji/bert-japanese.git

Cloning into 'bert-japanese'...

In [0]:
!pip install -q -r bert-japanese/requirements.txt

In [0]:
from google.colab import auth
auth.authenticate_user()

In [61]:
!gsutil cp gs://hisaka/model/wiki-ja.model ../model/
!gsutil cp gs://hisaka/model/wiki-ja.vocab ../model/

Copying gs://hisaka/model/wiki-ja.model...
/ [1 files][786.8 KiB/786.8 KiB]                                                
Operation completed over 1 objects/786.8 KiB.                                    
Copying gs://hisaka/model/wiki-ja.vocab...
/ [1 files][581.7 KiB/581.7 KiB]                                                
Operation completed over 1 objects/581.7 KiB.                                    


In [22]:
%cd bert-japanese/notebook

/content/bert-japanese/notebook/bert-japanese/notebook


In [23]:
%ls

check-extract-features.ipynb       finetune-to-livedoor-corpus.ipynb
check-trained-tokenizer.ipynb      pretraining.ipynb
finetune_to_livedoor_corpus.ipynb


Check TPU.

In [24]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.71.64.130:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 14088410830267968636),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 12423116974813516050),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 4624085173356679015),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 13274030027517492596),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 12061088206839378290),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 10392662510017880913),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 12878912449468424336),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 5515305794850066910),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 1358021792
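
The cell above stops with an assertion when no TPU runtime is attached. If a softer check is preferred, the address lookup can be separated into a small helper. This is only a sketch; the helper name `tpu_address` is hypothetical.

```python
import os

def tpu_address(environ=os.environ):
    """Return the gRPC address of the Colab TPU, or None if no TPU runtime is attached."""
    addr = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None

# Passing a plain dict makes the helper easy to try without a TPU.
print(tpu_address({"COLAB_TPU_ADDR": "10.71.64.130:8470"}))  # grpc://10.71.64.130:8470
print(tpu_address({}))  # None
```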

## Data preparation

You need to put preprocessed data on your GCS bucket.  
To create preprocessed data, follow https://github.com/yoheikikuta/bert-japanese/blob/master/notebook/finetune-to-livedoor-corpus.ipynb.
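
Later cells read `train.tsv`, `dev.tsv`, and `test.tsv` with pandas and use the columns `label` and `text`, so each file is presumably a tab-separated table with a header row. A minimal sketch of writing such a file follows; the column order, the helper name `write_tsv`, and the whitespace cleanup are assumptions, not the linked notebook's exact procedure.

```python
import csv
import io

def write_tsv(rows, fileobj):
    """Write (label, text) rows as TSV with a header line."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["label", "text"])
    for label, text in rows:
        # Tabs and newlines inside the text would break the one-row-per-example layout,
        # so collapse every run of whitespace into a single space.
        clean = " ".join(text.split())
        writer.writerow([label, clean])

buf = io.StringIO()
write_tsv([("sports", "first line\nsecond\tline"), ("food", "ramen")], buf)
tsv_text = buf.getvalue()
print(tsv_text)
```

In practice the file object would be an opened file that is then copied to the bucket with `gsutil cp`.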

## Finetune pre-trained model

In [0]:
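# The second assignment overrides the first; keep only the checkpoint you want to start from.
PRETRAINED_MODEL_PATH = 'gs://hisaka/model/model.ckpt-1400000'  # GCS bucket
PRETRAINED_MODEL_PATH = 'gs://hisaka/rurubu/output/model.ckpt-419'  # GCS bucket (this value is used)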
INPUT_DATA_GCS = 'gs://hisaka/rurubu'  # GCS bucket
FINETUNE_OUTPUT_DIR = 'gs://hisaka/rurubu/output1219' # GCS bucket

In [82]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=true \
  --do_eval=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=10.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W1221 05:29:39.378244 140413960005504 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1221 05:29:39.378485 140413960005504 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1221 05:29:39.378938 140413960005504 module_wrapper.py:139] From /content/bert-japanese/notebook/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1221 05:29:39.379760 140413960005504 module_wrapper.py:139] From ../src/run_classifier.py:682: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-
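
To get a feeling for how long the run above takes, it helps to translate `--train_batch_size` and `--num_train_epochs` into optimizer steps. The sketch below follows the arithmetic used by the upstream BERT `run_classifier.py` (steps from examples, batch size, and epochs, then a warmup fraction of 0.1). That this repository's script is unchanged in this respect is an assumption, and the example count of 6000 is purely hypothetical.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (num_train_steps, num_warmup_steps) for a finetuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps

# Hypothetical training set of 6000 examples with the flags used above.
steps, warmup = training_schedule(num_examples=6000, batch_size=64, num_epochs=10.0)
print(steps, warmup)  # 937 93
```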

## Predict using the finetuned model

Let's predict test data using the finetuned model.  

In [83]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=false \
  --do_eval=false \
  --do_predict=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W1221 05:36:15.619418 140202889467776 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1221 05:36:15.619671 140202889467776 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1221 05:36:15.620127 140202889467776 module_wrapper.py:139] From /content/bert-japanese/notebook/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1221 05:36:15.620966 140202889467776 module_wrapper.py:139] From ../src/run_classifier.py:682: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-

## Evaluation

Download the prediction results and the original data.

In [84]:
!gsutil cp {FINETUNE_OUTPUT_DIR}/test_results.tsv .
!gsutil cp {INPUT_DATA_GCS}/train.tsv .
!gsutil cp {INPUT_DATA_GCS}/dev.tsv .
!gsutil cp {INPUT_DATA_GCS}/test.tsv .

Copying gs://hisaka/rurubu/output1219/test_results.tsv...
- [1 files][395.1 KiB/395.1 KiB]                                                
Operation completed over 1 objects/395.1 KiB.                                    
Copying gs://hisaka/rurubu/train.tsv...
/ [1 files][  1.2 MiB/  1.2 MiB]                                                
Operation completed over 1 objects/1.2 MiB.                                      
Copying gs://hisaka/rurubu/dev.tsv...
/ [1 files][408.5 KiB/408.5 KiB]                                                
Operation completed over 1 objects/408.5 KiB.                                    
Copying gs://hisaka/rurubu/test.tsv...
/ [1 files][398.3 KiB/398.3 KiB]                                                
Operation completed over 1 objects/398.3 KiB.                                    


### Trained model

Check accuracy.

In [0]:
import numpy as np
import pandas as pd

In [0]:
import sys
sys.path.append("../src")

from run_classifier import LivedoorProcessor

processor = LivedoorProcessor()
label_list = processor.get_labels()

In [0]:
# Index i is the label name for output column i of test_results.tsv.
# Several indices share one name, so multiple model outputs map to the same label.
label_list = [
    'Traditional-Festivalsand-annual-events', 'Traditional-Festivalsand-annual-events',
    'Traditional-performing-arts-and-dance', 'festival', 'food', 'festival',
    'flower-nature', 'festival', 'fireworks', 'snow', 'illumination', 'music',
    'sports', 'museum', 'museum', 'festival', 'festival', 'experience',
    'school', 'talk', 'stage', 'animal-fish-park', 'animal-fish-park',
    'anniversary', 'fair', 'other', 'Industry', 'festival', 'festival', 'other',
]

In [0]:
result = pd.read_csv("./test_results.tsv", sep='\t', header=None)

In [89]:
result.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
0,6.9e-05,4.8e-05,1.9e-05,4e-05,6.1e-05,4.3e-05,0.000111,7.1e-05,8.2e-05,3.5e-05,0.998055,4.7e-05,7.8e-05,3.5e-05,2.7e-05,2.2e-05,4.8e-05,4.3e-05,3.9e-05,1.8e-05,2e-05,3.1e-05,3.3e-05,7.9e-05,3e-05,0.000648,3.8e-05,4.4e-05,4.6e-05,3.9e-05
1,0.000158,0.000185,0.000323,0.000182,0.000277,0.000248,0.000154,0.000371,0.000296,0.989966,0.00014,0.000255,0.002022,0.000127,0.000119,0.00081,0.000337,0.000423,0.000139,0.000112,0.000301,0.000141,0.000195,0.000135,0.000124,0.0018,0.000154,0.000204,0.000147,0.000153


In [0]:
test_df = pd.read_csv("./test.tsv", sep='\t')

In [91]:
label_list

['Traditional-Festivalsand-annual-events',
 'Traditional-Festivalsand-annual-events',
 'Traditional-performing-arts-and-dance',
 'festival',
 'food',
 'festival',
 'flower-nature',
 'festival',
 'fireworks',
 'snow',
 'illumination',
 'music',
 'sports',
 'museum',
 'museum',
 'festival',
 'festival',
 'experience',
 'school',
 'talk',
 'stage',
 'animal-fish-park',
 'animal-fish-park',
 'anniversary',
 'fair',
 'other',
 'Industry',
 'festival',
 'festival',
 'other']

In [0]:
# Map each row's highest-probability column to its label name.
test_df['predict'] = [label_list[i] for i in result.values.argmax(axis=1)]

In [93]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

0.8303393213572854
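
The accuracy above comes from two steps: take the argmax over each row of class probabilities, then compare the mapped label name with the gold label. A toy sketch of the same computation is shown below. The four-output model, its probabilities, and the gold labels are made up; as in `label_list`, two output indices share one label name.

```python
import numpy as np

# Hypothetical 4-output model whose outputs 1 and 3 share the label "festival".
labels = ["food", "festival", "snow", "festival"]
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.20, 0.10, 0.60],
    [0.05, 0.15, 0.70, 0.10],
])
gold = ["food", "festival", "music"]

pred_idx = probs.argmax(axis=1)        # index of the most probable output per row
pred = [labels[i] for i in pred_idx]   # many-to-one mapping from index to label name
accuracy = float(np.mean(np.array(pred) == np.array(gold)))
print(pred, accuracy)
```

Because the mapping is many-to-one, a prediction counts as correct whenever any output index carrying the gold label name wins the argmax.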

In [0]:
### 1/5 of full training data.
# sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A little more detailed check using `sklearn.metrics`.

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [40]:
print(classification_report(test_df['label'], test_df['predict']))

                                        precision    recall  f1-score   support

                              Industry       0.73      0.59      0.65        27
Traditional-Festivalsand-annual-events       0.86      0.86      0.86       320
 Traditional-performing-arts-and-dance       0.76      0.71      0.74        70
                           anniversary       0.00      0.00      0.00         1
                            experience       0.76      0.76      0.76        17
                                  fair       0.00      0.00      0.00         1
                              festival       0.61      0.58      0.59       104
                             fireworks       0.91      0.96      0.93       100
                         flower-nature       0.91      0.95      0.93       152
                                  food       0.75      0.88      0.81        24
                          illumination       0.91      0.96      0.93        92
                                museum 

  'precision', 'predicted', average, warn_for)
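
The rows with 0.00 precision (for example `anniversary` and `fair`, each with support 1) are classes the model never predicted, and the trailing warning fragment appears to be scikit-learn reporting that precision is undefined in that case. The following sketch recomputes the per-class numbers by hand to make this visible. It is a simplified stand-in for `classification_report`, not its implementation, and it only reports labels that occur in the gold data.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)}, using 0.0 when a ratio is undefined."""
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    support = Counter(y_true)
    report = {}
    for label in sorted(support):
        # Precision has a zero denominator when the label is never predicted.
        precision = tp[label] / predicted[label] if predicted[label] else 0.0
        recall = tp[label] / support[label]
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report

# Toy data: "fair" occurs once in the gold labels and is never predicted.
y_true = ["fair", "food", "food", "snow"]
y_pred = ["food", "food", "food", "snow"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, row)
```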


In [0]:
### 1/5 of full training data.
# print(classification_report(test_df['label'], test_df['predict']))

In [42]:
print(confusion_matrix(test_df['label'], test_df['predict']))

[[ 16   0   1   0   0   0   4   0   1   2   0   0   0   3   0   0   0   0]
 [  0 274   9   0   1   0  19   4   7   0   1   0   0   4   0   1   0   0]
 [  1  13  50   0   0   0   3   1   1   0   0   0   0   1   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0  13   0   0   0   0   1   0   0   2   0   0   0   0   1]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1]
 [  1  21   4   0   0   0  60   4   4   3   1   0   2   2   0   1   1   0]
 [  0   1   0   0   0   0   1  96   0   0   1   0   0   1   0   0   0   0]
 [  0   2   1   0   0   0   1   0 144   0   1   0   0   1   0   2   0   0]
 [  1   0   0   0   1   0   1   0   0  21   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  88   0   0   3   1   0   0   0]
 [  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   1   0   0   0   1   0   0   0   0   0  18   0   0   0   0   0]
 [  2   6   0   0   0   0
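
To read the matrix: row i is the true label and column j is the predicted label, with labels in sorted order. That is why `Industry` and the `Traditional-...` classes come first, since uppercase letters sort before lowercase ones. A small pure-Python sketch of the same bookkeeping, with made-up labels:

```python
def confusion(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true label i predicted as label j."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

y_true = ["food", "food", "snow", "fair"]
y_pred = ["food", "snow", "snow", "food"]
labels, matrix = confusion(y_true, y_pred)
print(labels)
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions, so the accuracy is the diagonal sum divided by the total count.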

In [0]:
### 1/5 of full training data.
# print(confusion_matrix(test_df['label'], test_df['predict']))

In [44]:
!ls

check-extract-features.ipynb	   pretraining.ipynb
check-trained-tokenizer.ipynb	   test_results.tsv
dev.tsv				   test.tsv
finetune_to_livedoor_corpus.ipynb  train.tsv
finetune-to-livedoor-corpus.ipynb


In [0]:
test_df.to_csv('../test20191220.csv')

In [46]:
# test_df.to_csv above wrote the file to the parent directory, so copy it from there.
!gsutil cp -r ../test20191220.csv gs://hisaka/20191220


### Simple baseline model.

In [0]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
train_df = pd.read_csv("./train.tsv", sep='\t')
dev_df = pd.read_csv("./dev.tsv", sep='\t')
test_df = pd.read_csv("./test.tsv", sep='\t')

In [49]:
!apt-get -q install -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

Reading package lists...
Building dependency tree...
Reading state information...
The following package was automatically installed and is no longer required:
  libnvidia-common-430
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libmecab2 mecab-utils
The following NEW packages will be installed:
  libmecab-dev libmecab2 mecab mecab-ipadic mecab-ipadic-utf8 mecab-utils
0 upgraded, 6 newly installed, 0 to remove and 7 not upgraded.
Need to get 12.8 MB of archives.
After this operation, 60.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libmecab2 amd64 0.996-5 [257 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libmecab-dev amd64 0.996-5 [308 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab-utils amd64 0.996-5 [4,856 B]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab-ipadic all 2.7.0-20070801+main-1 [12.1 MB]
Get:5 http://archive.ub

In [50]:
!pip install -q mecab-python3==0.7

  Building wheel for mecab-python3 (setup.py) ... done


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [0]:
m = MeCab.Tagger("-Owakati")

In [0]:
train_dev_df = pd.concat([train_df, dev_df])

In [0]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [0]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
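
For intuition, here is a small TF-IDF computation over space-separated tokens, the form that `MeCab.Tagger("-Owakati")` produces. It uses the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of scikit-learn's defaults. It does not reproduce `max_features` or the vectorizer's default token pattern, which as far as I know skips single-character tokens. The sample sentences are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 normalisation over whitespace-separated tokens."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}
    vectors = []
    for toks in tokenized:
        weights = {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return vectors

docs = ["花火 大会 花火", "雪 まつり 大会"]
vecs = tfidf(docs)
for vec in vecs:
    print(vec)
```

In the first document `花火` outweighs `大会` because it occurs twice there and in no other document.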

The following setup is not exactly identical to that of BERT, because the classifier internally holds out its validation data using `train_test_split` with shuffling, so the validation set differs from dev.tsv.  
In addition, the parameters are not well tuned, but we think this is enough to gauge the power of BERT.

In [56]:
%%time

model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(train_df)/len(dev_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

CPU times: user 19.3 s, sys: 76.9 ms, total: 19.4 s
Wall time: 19.4 s
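
The parameters `n_iter_no_change=5` and `tol=0.01` ask for early stopping on the held-out validation fraction. The sketch below shows a simplified patience rule with the same two knobs: stop once the validation loss has failed to beat the best value by more than `tol` for `n_iter_no_change` consecutive iterations. scikit-learn's internal bookkeeping may differ in detail, and the loss values are made up.

```python
def stop_iteration(val_losses, n_iter_no_change=5, tol=0.01):
    """Return how many iterations run before a simple patience rule stops training."""
    best = float("inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    for i, loss in enumerate(val_losses, start=1):
        if loss < best - tol:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= n_iter_no_change:
                return i
    return len(val_losses)

# The loss plateaus after the third iteration, so training stops at iteration 8
# and never reaches the final value.
losses = [1.00, 0.80, 0.70, 0.695, 0.694, 0.693, 0.692, 0.691, 0.50]
print(stop_iteration(losses))  # 8
```

With a rule like this, `n_estimators=200` acts as an upper bound rather than the number of trees actually fitted.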


In [57]:
print(classification_report(test_ys, model.predict(test_xs_)))

                                        precision    recall  f1-score   support

                              Industry       0.43      0.33      0.38        27
Traditional-Festivalsand-annual-events       0.63      0.85      0.72       320
 Traditional-performing-arts-and-dance       0.65      0.43      0.52        70
                           anniversary       0.00      0.00      0.00         1
                            experience       0.50      0.29      0.37        17
                                  fair       0.00      0.00      0.00         1
                              festival       0.56      0.36      0.44       104
                             fireworks       0.86      0.93      0.89       100
                         flower-nature       0.76      0.80      0.78       152
                                  food       0.53      0.33      0.41        24
                          illumination       0.85      0.87      0.86        92
                                museum 

  'precision', 'predicted', average, warn_for)


In [0]:
### 1/5 of full training data.
# print(classification_report(test_ys, model.predict(test_xs_)))

In [59]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))

[[  9   5   0   0   0   0   5   0   2   1   0   2   1   1   1   0   0   0]
 [  2 271  11   0   1   1  13   6  10   1   0   0   1   2   0   1   0   0]
 [  0  25  30   0   0   0   2   3   5   2   0   0   2   0   0   0   0   1]
 [  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   4   0   0   5   1   0   0   2   0   0   0   2   1   0   1   1   0]
 [  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  1  40   4   0   0   0  37   5   6   2   5   0   1   1   0   2   0   0]
 [  0   5   0   0   0   0   0  93   0   0   1   1   0   0   0   0   0   0]
 [  4  18   1   0   2   0   2   0 122   0   1   0   0   1   0   1   0   0]
 [  4   5   0   0   0   0   1   0   5   8   0   0   0   1   0   0   0   0]
 [  0   9   0   0   0   0   0   0   1   0  80   0   1   0   1   0   0   0]
 [  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   7   0   0   1   0   1   0   0   0   0   0   9   1   0   1   0   0]
 [  0  17   0   0   0   1

In [0]:
### 1/5 of full training data.
# print(confusion_matrix(test_ys, model.predict(test_xs_)))