<a href="https://colab.research.google.com/github/HisakaKoji/bert-japanese/blob/master/finetune_to_livedoor_corpus_20191220.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve a multi-class classification problem.  
This notebook requires the following objects:
- a trained SentencePiece model (model and vocab files)
- a pretrained Japanese BERT model

The dataset is the livedoor ニュースコーパス (livedoor news corpus), available at https://www.rondhuit.com/download.html.  
We split it into test:dev:train = 2:2:6.
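
The 2:2:6 split can be sketched as follows. This is only an illustration of the ratio, not the code that built the dataset: `split_dataset`, the seed, and the toy rows are assumptions introduced here.

```python
import random


def split_dataset(rows, seed=0):
    """Shuffle rows and split them into test:dev:train = 2:2:6."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) * 2 // 10
    n_dev = len(shuffled) * 2 // 10
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]  # everything that is left, about 60%
    return test, dev, train


# Toy (label, text) rows; a real run would use the articles of the corpus.
rows = [("label-%d" % (i % 3), "text %d" % i) for i in range(50)]
test_rows, dev_rows, train_rows = split_dataset(rows)
print(len(test_rows), len(dev_rows), len(train_rows))  # 10 10 30
```

Fixing the seed keeps the three subsets reproducible across runs, which matters when the BERT model and a baseline are compared on the same test set.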

**This Colab notebook assumes that the above models are stored in a GCS bucket whose objects you can access.**

In [15]:
!git clone --recursive https://github.com/HisakaKoji/bert-japanese.git

Cloning into 'bert-japanese'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (9/9

In [0]:
!pip install -q -r bert-japanese/requirements.txt

In [0]:
from google.colab import auth
auth.authenticate_user()

In [24]:
!gsutil cp gs://hisaka/model/wiki-ja.model ../model/
!gsutil cp gs://hisaka/model/wiki-ja.vocab ../model/

Copying gs://hisaka/model/wiki-ja.model...
/ [1 files][786.8 KiB/786.8 KiB]                                                
Operation completed over 1 objects/786.8 KiB.                                    
Copying gs://hisaka/model/wiki-ja.vocab...
/ [1 files][581.7 KiB/581.7 KiB]                                                
Operation completed over 1 objects/581.7 KiB.                                    


In [19]:
%cd bert-japanese/notebook

/content/bert-japanese/notebook/bert-japanese/notebook


In [20]:
%ls

check-extract-features.ipynb       finetune-to-livedoor-corpus.ipynb
check-trained-tokenizer.ipynb      pretraining.ipynb
finetune_to_livedoor_corpus.ipynb


Check TPU.

In [21]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; switch the Colab runtime type to TPU and rerun this cell.'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.52.188.250:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 318455824359339261),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 3629279527469628366),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 3562479057597909262),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 7386736292922555039),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 275932638923328478),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 5872655942254759084),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 548088030640216769),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 5646976345987013185),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 957618970633257419

## Data preparation

You need to put preprocessed data on your GCS bucket.  
To create preprocessed data, follow https://github.com/yoheikikuta/bert-japanese/blob/master/notebook/finetune-to-livedoor-corpus.ipynb.
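
As a rough sketch of what such a preprocessed file may look like, the snippet below writes a tab-separated file with a header. The column names `label` and `text` are inferred from the cells further down that read `test.tsv`. The helper `write_tsv` and the sample rows are hypothetical, and the linked notebook may lay out the columns differently.

```python
import csv
import io


def write_tsv(rows, fileobj):
    """Write (label, text) rows as TSV with a header line."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(["label", "text"])
    for label, text in rows:
        # Tabs and newlines inside the text would break the one-row-per-line
        # layout, so collapse every whitespace run into a single space.
        clean = " ".join(text.split())
        writer.writerow([label, clean])


buffer = io.StringIO()
write_tsv([("festival", "夏祭り\tの お知らせ"), ("food", "ラーメン\nフェア")], buffer)
print(buffer.getvalue())
```

In practice the file object would be a local file that is then copied to the bucket with `gsutil cp`.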

## Finetune pre-trained model

In [0]:
PRETRAINED_MODEL_PATH = 'gs://hisaka/model/model.ckpt-1400000'  # GCS bucket
INPUT_DATA_GCS = 'gs://hisaka/rurubu'  # GCS bucket
FINETUNE_OUTPUT_DIR = 'gs://hisaka/rurubu/output1219' # GCS bucket

In [0]:
# Template: uncomment and fill in your own GCS paths to replace the values above.
# PRETRAINED_MODEL_PATH = 'gs://'  # GCS bucket
# INPUT_DATA_GCS = 'gs://'  # GCS bucket
# FINETUNE_OUTPUT_DIR = 'gs://'  # GCS bucket
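
A bare `'gs://'` placeholder left in one of these variables makes the training command fail only after it has started. A small check like the one below can catch it earlier. `is_gcs_path` is a hypothetical helper, not part of the repository.

```python
def is_gcs_path(path):
    """Return True for a path of the form gs://<bucket>[/<object>]."""
    prefix = "gs://"
    if not path.startswith(prefix):
        return False
    bucket = path[len(prefix):].split("/", 1)[0]
    return bucket != ""


print(is_gcs_path("gs://"))               # False: placeholder not filled in
print(is_gcs_path("gs://hisaka/rurubu"))  # True
```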

In [25]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=true \
  --do_eval=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=10.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W1220 02:42:09.310641 140646735132544 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1220 02:42:09.310820 140646735132544 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1220 02:42:09.311222 140646735132544 module_wrapper.py:139] From /content/bert-japanese/notebook/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1220 02:42:09.311962 140646735132544 module_wrapper.py:139] From ../src/run_classifier.py:682: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-
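
For orientation, the number of optimizer steps implied by `--train_batch_size` and `--num_train_epochs` can be estimated as below. The formula follows the upstream BERT `run_classifier.py`. That this fork keeps the same logic and the default `warmup_proportion` of 0.1 is an assumption, and the example count of 3000 is hypothetical.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (num_train_steps, num_warmup_steps) as upstream BERT computes them."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical training-set size; batch size and epochs match the command above.
steps, warmup = training_schedule(num_examples=3000, batch_size=64, num_epochs=10.0)
print(steps, warmup)  # 468 46
```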

## Predict using the finetuned model

Predict the labels of the test data with the finetuned model.

In [26]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=false \
  --do_eval=false \
  --do_predict=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W1220 02:49:19.770990 139922199115648 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1220 02:49:19.771212 139922199115648 module_wrapper.py:139] From ../src/run_classifier.py:661: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1220 02:49:19.771607 139922199115648 module_wrapper.py:139] From /content/bert-japanese/notebook/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1220 02:49:19.772351 139922199115648 module_wrapper.py:139] From ../src/run_classifier.py:682: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-

## Evaluation

Download the prediction results and the original data.

In [27]:
!gsutil cp {FINETUNE_OUTPUT_DIR}/test_results.tsv .
!gsutil cp {INPUT_DATA_GCS}/train.tsv .
!gsutil cp {INPUT_DATA_GCS}/dev.tsv .
!gsutil cp {INPUT_DATA_GCS}/test.tsv .

Copying gs://hisaka/rurubu/output1219/test_results.tsv...
/ [1 files][394.1 KiB/394.1 KiB]                                                
Operation completed over 1 objects/394.1 KiB.                                    
Copying gs://hisaka/rurubu/train.tsv...
/ [1 files][  1.2 MiB/  1.2 MiB]                                                
Operation completed over 1 objects/1.2 MiB.                                      
Copying gs://hisaka/rurubu/dev.tsv...
/ [1 files][408.5 KiB/408.5 KiB]                                                
Operation completed over 1 objects/408.5 KiB.                                    
Copying gs://hisaka/rurubu/test.tsv...
/ [1 files][398.3 KiB/398.3 KiB]                                                
Operation completed over 1 objects/398.3 KiB.                                    


### Trained model

Check the accuracy on the test set.

In [0]:
import numpy as np
import pandas as pd

In [0]:
import sys
sys.path.append("../src")

from run_classifier import LivedoorProcessor

processor = LivedoorProcessor()
label_list = processor.get_labels()

In [0]:
# Label names for the 30 output units; this overrides the livedoor labels loaded above.
label_list = [
    'Traditional-Festivalsand-annual-events', 'Traditional-Festivalsand-annual-events', 'Traditional-performing-arts-and-dance',
    'festival', 'food', 'festival', 'flower-nature', 'festival', 'fireworks', 'snow', 'illumination', 'music',
    'sports', 'museum', 'museum', 'festival', 'festival', 'experience', 'school', 'talk', 'stage',
    'animal-fish-park', 'animal-fish-park', 'anniversary', 'fair', 'other', 'Industry', 'festival', 'festival', 'other']

In [0]:
result = pd.read_csv("./test_results.tsv", sep='\t', header=None)

In [31]:
result.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
0,0.00016,0.000107,0.000159,0.000147,0.000119,8e-05,0.000118,0.000105,0.00013,0.000306,0.993441,0.000179,0.000117,0.000185,0.000254,0.000114,0.000127,0.000161,0.000122,0.000104,0.00038,0.00018,8.8e-05,0.000189,8e-05,0.002308,0.000161,0.000117,0.00015,0.000115
1,0.000333,0.000847,0.000244,0.000587,0.000972,0.00101,0.000565,0.001145,0.000481,0.969476,0.000576,0.000687,0.008519,0.000687,0.000576,0.00061,0.000713,0.001146,0.000985,0.000495,0.001073,0.000539,0.000472,0.000946,0.000782,0.002886,0.000331,0.000698,0.001066,0.000554


In [0]:
test_df = pd.read_csv("./test.tsv", sep='\t')

In [37]:
label_list

['Traditional-Festivalsand-annual-events',
 'Traditional-Festivalsand-annual-events',
 'Traditional-performing-arts-and-dance',
 'festival',
 'food',
 'festival',
 'flower-nature',
 'festival',
 'fireworks',
 'snow',
 'illumination',
 'music',
 'sports',
 'museum',
 'museum',
 'festival',
 'festival',
 'experience',
 'school',
 'talk',
 'stage',
 'animal-fish-park',
 'animal-fish-park',
 'anniversary',
 'fair',
 'other',
 'Industry',
 'festival',
 'festival',
 'other']

In [0]:
# For each row of class probabilities, take the label at the index of the largest value.
test_df['predict'] = [label_list[np.array(elem[1]).argmax()] for elem in result.iterrows()]

In [39]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

0.812375249500998
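
The two cells above can be mirrored in plain Python to make the logic explicit. Because several indices of `label_list` carry the same name, distinct output units collapse into one predicted label. The toy labels and probabilities below are made up for illustration, and `predict_labels` and `accuracy` are hypothetical helper names.

```python
def predict_labels(prob_rows, label_list):
    """Pick, for each row of class probabilities, the label with the highest score."""
    predictions = []
    for probs in prob_rows:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(label_list[best_index])
    return predictions


def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Units 0 and 2 share the name 'festival', as in the notebook's label_list.
labels = ["festival", "food", "festival"]
probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]
predicted = predict_labels(probs, labels)
print(predicted)  # ['food', 'festival', 'festival']
print(accuracy(["food", "festival", "food"], predicted))
```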

In [0]:
### 1/5 of full training data.
# sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A slightly more detailed check using `sklearn.metrics`.

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [41]:
print(classification_report(test_df['label'], test_df['predict']))

                                        precision    recall  f1-score   support

                              Industry       0.73      0.59      0.65        27
Traditional-Festivalsand-annual-events       0.86      0.86      0.86       320
 Traditional-performing-arts-and-dance       0.76      0.71      0.74        70
                           anniversary       0.00      0.00      0.00         1
                            experience       0.76      0.76      0.76        17
                                  fair       0.00      0.00      0.00         1
                              festival       0.61      0.58      0.59       104
                             fireworks       0.91      0.96      0.93       100
                         flower-nature       0.91      0.95      0.93       152
                                  food       0.75      0.88      0.81        24
                          illumination       0.91      0.96      0.93        92
                                museum 

  'precision', 'predicted', average, warn_for)
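
The trailing fragment above appears to be the tail of scikit-learn's `UndefinedMetricWarning`, which is raised when a class is never predicted, so its precision has a zero denominator and is reported as 0.00. The rows for `anniversary` and `fair` (support 1 each) fit that pattern. A minimal pure-Python sketch of the per-class scores, with the toy labels as assumptions:

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)}, using 0.0 for undefined ratios."""
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in predicted_labels)
        actual = sum(t == label for t in true_labels)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


y_true = ["fair", "food", "food", "snow"]
y_pred = ["food", "food", "food", "snow"]
scores = per_class_scores(y_true, y_pred)
for label, values in scores.items():
    print(label, values)  # 'fair' is never predicted, so all three scores are 0.0
```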


In [0]:
### 1/5 of full training data.
# print(classification_report(test_df['label'], test_df['predict']))

In [42]:
print(confusion_matrix(test_df['label'], test_df['predict']))

[[ 16   0   1   0   0   0   4   0   1   2   0   0   0   3   0   0   0   0]
 [  0 274   9   0   1   0  19   4   7   0   1   0   0   4   0   1   0   0]
 [  1  13  50   0   0   0   3   1   1   0   0   0   0   1   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0  13   0   0   0   0   1   0   0   2   0   0   0   0   1]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1]
 [  1  21   4   0   0   0  60   4   4   3   1   0   2   2   0   1   1   0]
 [  0   1   0   0   0   0   1  96   0   0   1   0   0   1   0   0   0   0]
 [  0   2   1   0   0   0   1   0 144   0   1   0   0   1   0   2   0   0]
 [  1   0   0   0   1   0   1   0   0  21   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  88   0   0   3   1   0   0   0]
 [  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   1   0   0   0   1   0   0   0   0   0  18   0   0   0   0   0]
 [  2   6   0   0   0   0
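
In this matrix, rows are true labels and columns are predicted labels, both in sorted label order. The first row therefore belongs to `Industry`: 16 of its 27 test examples lie on the diagonal, which matches the recall of 0.59 in the report above. The sketch below rebuilds the same layout in plain Python with toy labels. `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(true_labels, predicted_labels):
    """Build a matrix whose rows are true labels and columns are predicted labels."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


y_true = ["food", "food", "snow", "fair"]
y_pred = ["food", "snow", "snow", "food"]
labels, matrix = confusion_counts(y_true, y_pred)
print(labels)  # ['fair', 'food', 'snow']
for row in matrix:
    print(row)
```

Dividing a diagonal entry by its row sum gives the recall of that class; dividing by its column sum gives the precision.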

In [0]:
### 1/5 of full training data.
# print(confusion_matrix(test_df['label'], test_df['predict']))

In [48]:
!ls

check-extract-features.ipynb	   finetune-to-livedoor-corpus.ipynb  test.tsv
check-trained-tokenizer.ipynb	   pretraining.ipynb		      train.tsv
dev.tsv				   test20191220.csv
finetune_to_livedoor_corpus.ipynb  test_results.tsv


In [0]:
# Save next to the notebook so that the `gsutil cp` below finds the file.
test_df.to_csv('test20191220.csv')

In [49]:
!gsutil cp -r  test20191220.csv  gs://hisaka/20191220 

Copying file://test20191220.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/423.1 KiB.                                    


### Simple baseline model.

In [0]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
train_df = pd.read_csv("./train.tsv", sep='\t')
dev_df = pd.read_csv("./dev.tsv", sep='\t')
test_df = pd.read_csv("./test.tsv", sep='\t')

In [0]:
!apt-get -q install -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

In [0]:
!pip install -q mecab-python3==0.7

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [0]:
m = MeCab.Tagger("-Owakati")

In [0]:
train_dev_df = pd.concat([train_df, dev_df])

In [0]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [0]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
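
One detail worth knowing here: with its default `token_pattern`, `TfidfVectorizer` keeps only tokens of two or more word characters, so single-character tokens in the MeCab output (many particles, for example) are dropped before the 750 features are chosen. The sketch below reproduces that tokenization and the default smoothed IDF in plain Python. The sample sentences and the helper names `tokenize` and `smooth_idf` are assumptions for illustration.

```python
import math
import re

# Default token pattern of scikit-learn's TfidfVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(wakati_text):
    """Split space-separated MeCab output the way the default pattern would."""
    return TOKEN_PATTERN.findall(wakati_text)


def smooth_idf(n_documents, document_frequency):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed form scikit-learn uses by default."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


docs = ["私 は 東京 に 行き ます", "東京 の 花火 大会"]
tokens = [tokenize(doc) for doc in docs]
print(tokens)  # [['東京', '行き', 'ます'], ['東京', '花火', '大会']]
print(smooth_idf(2, 2))  # '東京' occurs in both documents -> 1.0
print(smooth_idf(2, 1))  # a term in one of two documents gets a higher weight
```

If single-character tokens should count as features, a custom `token_pattern` or `tokenizer` can be passed to the vectorizer.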

The following setup is not exactly identical to that of BERT: `GradientBoostingClassifier` internally uses `train_test_split` with shuffling to carve out its validation set, so it does not validate on the same dev set.  
The hyperparameters are also not well tuned, but we think this is enough to gauge the power of BERT.

In [0]:
%%time

model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(train_df)/len(dev_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)
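
In scikit-learn, `validation_fraction` is the share of the data passed to `fit` that is held out for early stopping. Since `fit` receives train and dev concatenated here, the ratio `len(dev_df)/len(train_df)` holds out somewhat more than a dev-sized portion. The arithmetic below illustrates this with hypothetical counts in the 6:2 proportion of the split.

```python
def validation_fractions(n_train, n_dev):
    """Compare the ratio used in the cell above with the dev share of train + dev."""
    ratio_in_notebook = n_dev / n_train
    dev_share_of_fit_data = n_dev / (n_train + n_dev)
    return ratio_in_notebook, dev_share_of_fit_data


# Hypothetical sizes following train:dev = 6:2.
ratio, share = validation_fractions(n_train=600, n_dev=200)
print(round(ratio, 3), round(share, 3))  # 0.333 0.25
```

Either value is a valid setting; the second would hold out a validation set of the same size as the dev set.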

In [0]:
print(classification_report(test_ys, model.predict(test_xs_)))

In [0]:
### 1/5 of full training data.
# print(classification_report(test_ys, model.predict(test_xs_)))

In [0]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))

In [0]:
### 1/5 of full training data.
# print(confusion_matrix(test_ys, model.predict(test_xs_)))