<a href="https://colab.research.google.com/github/HisakaKoji/bert-japanese/blob/master/20200930_bert_sentence_twitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- trained SentencePiece model (model and vocab files)
- pretrained Japanese BERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available at https://www.rondhuit.com/download.html.  
We split it into test:dev:train = 2:2:6.
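A 2:2:6 split can be produced with two successive calls to scikit-learn's `train_test_split`. The sketch below is only an assumption about how such a split might be done, not the code that built this notebook's data; the helper name `split_2_2_6`, the seed, and the toy frame are all hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_2_2_6(df, seed=23):
    """Hypothetical helper: shuffle and split a frame into train/dev/test = 6:2:2."""
    # First carve off 40% of the rows, leaving 60% for training.
    train_df, rest_df = train_test_split(df, test_size=0.4, random_state=seed, shuffle=True)
    # Then halve the remaining 40% into dev and test (20% each).
    dev_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=seed, shuffle=True)
    return train_df, dev_df, test_df


# Toy data standing in for the real corpus.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})
train_df, dev_df, test_df = split_2_2_6(toy)
print(len(train_df), len(dev_df), len(test_df))
```

For heavily imbalanced labels, passing `stratify=df["label"]` to the first call (and `stratify=rest_df["label"]` to the second) keeps the class ratio similar across the three parts.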

**This Colab notebook assumes the above models are stored in a GCS bucket whose objects you can access.**

In [1]:
!git clone --recursive https://github.com/HisakaKoji/bert-japanese.git

Cloning into 'bert-japanese'...
remote: Enumerating objects: 314, done.
remote: Total 314 (delta 0), reused 0 (delta 0), pack-reused 314
Receiving objects: 100% (314/314), 637.48 KiB | 4.87 MiB/s, done.
Resolving deltas: 100% (195/195), done.
Submodule 'bert' (https://github.com/google-research/bert.git) registered for path 'bert'
Cloning into '/content/bert-japanese/bert'...
remote: Enumerating objects: 340, done.        
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340        
Receiving objects: 100% (340/340), 317.85 KiB | 4.13 MiB/s, done.
Resolving deltas: 100% (185/185), done.
Submodule path 'bert': checked out '88a817c37f788702a363ff935fd173b6dc6ac0d6'


In [2]:
!pip install -q -r bert-japanese/requirements.txt

  Building wheel for gast (setup.py) ... done
ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.


In [3]:
from google.colab import auth
auth.authenticate_user()

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [4]:
%cd bert-japanese/notebook

/content/bert-japanese/notebook


In [5]:
!gsutil cp gs://hisaka/model/wiki-ja.model ../model/
!gsutil cp gs://hisaka/model/wiki-ja.vocab ../model/

Copying gs://hisaka/model/wiki-ja.model...
- [1 files][786.8 KiB/786.8 KiB]                                                
Operation completed over 1 objects/786.8 KiB.                                    
Copying gs://hisaka/model/wiki-ja.vocab...
- [1 files][581.7 KiB/581.7 KiB]                                                
Operation completed over 1 objects/581.7 KiB.                                    


In [None]:
%ls

check-extract-features.ipynb       finetune-to-livedoor-corpus.ipynb
check-trained-tokenizer.ipynb      pretraining.ipynb
finetune_to_livedoor_corpus.ipynb


Check TPU.

In [49]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.79.9.122:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 14679878147523842301),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 10836455019740514999),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8013382682266160208),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 18416605877308808535),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 9895975787705464943),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 3634706843308867205),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 15620743593460322940),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 1960286348963760415),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 3631170411885

## Data preparation

You need to put preprocessed data on your GCS bucket.  
To create preprocessed data, follow https://github.com/yoheikikuta/bert-japanese/blob/master/notebook/finetune-to-livedoor-corpus.ipynb.

## Finetune pre-trained model

In [47]:
PRETRAINED_MODEL_PATH = 'gs://hisaka/model/model.ckpt-1400000'  # GCS bucket
#PRETRAINED_MODEL_PATH = 'gs://hisaka/rurubu/output/model.ckpt-419'  # GCS bucket
#INPUT_DATA_GCS = 'gs://hisaka/rurubu'  # GCS bucket
#FINETUNE_OUTPUT_DIR = 'gs://hisaka/rurubu/output1219' # GCS bucket

INPUT_DATA_GCS = 'gs://hisaka/input'  # GCS bucket
FINETUNE_OUTPUT_DIR = 'gs://hisaka/input/output' # GCS bucket

In [85]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=titanic \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=true \
  --do_eval=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=10.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W0930 09:50:40.890435 140431686952832 module_wrapper.py:139] From ../src/run_classifier.py:701: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0930 09:50:40.890684 140431686952832 module_wrapper.py:139] From ../src/run_classifier.py:701: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0930 09:50:40.891158 140431686952832 module_wrapper.py:139] From /content/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0930 09:50:40.892464 140431686952832 module_wrapper.py:139] From ../src/run_classifier.py:723: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * h
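The `--train_batch_size` and `--num_train_epochs` flags together determine how many optimizer steps the run performs. The sketch below mirrors the arithmetic used by the upstream google-research/bert `run_classifier.py`, under the assumption that this repository's `src/run_classifier.py` keeps it unchanged and uses the upstream default `warmup_proportion` of 0.1. The function name `schedule` and the example count of 6400 are hypothetical.

```python
def schedule(num_train_examples, train_batch_size, num_train_epochs, warmup_proportion=0.1):
    """Return (total training steps, warmup steps) for a finetuning run."""
    # Each step consumes one batch; epochs may be fractional, hence the int() truncation.
    num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)
    # The learning rate ramps up linearly over the first fraction of the steps.
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical example: 6400 training rows with the flags used above.
steps, warmup = schedule(num_train_examples=6400, train_batch_size=64, num_train_epochs=10.0)
print(steps, warmup)
```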

## Predict using the finetuned model

Let's predict test data using the finetuned model.  

In [86]:
%%time
!python3 ../src/run_classifier.py \
  --task_name=titanic \
  --use_tpu=True \
  --tpu_name={TPU_ADDRESS} \
  --do_train=false \
  --do_eval=false \
  --do_predict=true \
  --data_dir={INPUT_DATA_GCS} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=64 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir={FINETUNE_OUTPUT_DIR}




W0930 09:52:56.750323 140261715154816 module_wrapper.py:139] From ../src/run_classifier.py:701: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0930 09:52:56.750587 140261715154816 module_wrapper.py:139] From ../src/run_classifier.py:701: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0930 09:52:56.751042 140261715154816 module_wrapper.py:139] From /content/bert-japanese/src/../bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0930 09:52:56.751878 140261715154816 module_wrapper.py:139] From ../src/run_classifier.py:723: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

Loaded a trained SentencePiece model.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * h

## Evaluation

Download the prediction results and the original data.

In [87]:
!gsutil cp {FINETUNE_OUTPUT_DIR}/test_results.tsv .
!gsutil cp {INPUT_DATA_GCS}/train.tsv .
!gsutil cp {INPUT_DATA_GCS}/dev.tsv .
!gsutil cp {INPUT_DATA_GCS}/test.tsv .

Copying gs://hisaka/input/output/test_results.tsv...
/ [1 files][ 11.3 KiB/ 11.3 KiB]                                                
Operation completed over 1 objects/11.3 KiB.                                     
Copying gs://hisaka/input/train.tsv...
- [1 files][  3.0 MiB/  3.0 MiB]                                                
Operation completed over 1 objects/3.0 MiB.                                      
Copying gs://hisaka/input/dev.tsv...
/ [1 files][752.0 KiB/752.0 KiB]                                                
Operation completed over 1 objects/752.0 KiB.                                    
Copying gs://hisaka/input/test.tsv...
/ [1 files][ 89.2 KiB/ 89.2 KiB]                                                
Operation completed over 1 objects/89.2 KiB.                                     


### Trained model

Check accuracy.

In [88]:
import numpy as np
import pandas as pd

In [89]:
import sys
sys.path.append("../src")

from run_classifier import TitanicProcessor

processor = TitanicProcessor()
label_list = processor.get_labels()

In [None]:
label_list = ['Traditional-Festivalsand-annual-events' , 'Traditional-Festivalsand-annual-events'  , 'Traditional-performing-arts-and-dance' ,'festival' ,'food'   ,'festival'   , \
          'flower-nature'   ,'festival'   ,'fireworks'   ,'snow'  ,'illumination'  ,'music'  ,'sports'  ,'museum'  ,'museum'  ,'festival'  ,'festival'  ,'experience'  ,  \
           'school'  ,'talk'  ,'stage'  ,'animal-fish-park'  ,'animal-fish-park'  ,'anniversary'  ,'fair'  ,'other'  ,'Industry'  ,'festival'  ,'festival' ,'other']

In [90]:
result = pd.read_csv("./test_results.tsv", sep='\t', header=None)

In [91]:
result.head(2)

Unnamed: 0,0,1
0,0.003751,0.996249
1,0.002986,0.997014


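In [103]:
# Column 1 of test_results.tsv is the predicted probability of label '1'.
test_df['nega_conf'] = result[1]

In [104]:
# Column 0 of test_results.tsv is the predicted probability of label '0'.
test_df['posi_conf'] = result[0]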

In [92]:
test_df = pd.read_csv("./test.tsv", sep='\t')

In [93]:
label_list

['0', '1']

In [94]:
# For each row, pick the label whose probability column is largest.
test_df['predict'] = [label_list[i] for i in result.values.argmax(axis=1)]

In [96]:
test_df['predict'] = test_df['predict'].astype(int)
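The prediction step above turns each row of class probabilities into a label by taking the column with the highest value. As a minimal, self-contained illustration, the sketch below uses the two rows shown in `result.head(2)` plus one made-up third row; the variable names `pred_idx` and `predict` are only for this example.

```python
import numpy as np
import pandas as pd

label_list = ['0', '1']

# The first two rows come from result.head(2); the third row is made up for contrast.
result = pd.DataFrame([
    [0.003751, 0.996249],
    [0.002986, 0.997014],
    [0.900000, 0.100000],
])

# Index of the most probable class in each row.
pred_idx = result.to_numpy().argmax(axis=1)

# Map the column index back to its label and cast to int, as the notebook does.
predict = [int(label_list[i]) for i in pred_idx]
print(predict)
```

With only two classes, this is the same as predicting label 1 whenever column 1 exceeds 0.5.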

In [97]:
test_df[ test_df['predict'] == 1 ]

Unnamed: 0,text,label,predict
0,SHARPのSHV42を使ってるんだけど、勝手にカメラアプリの写真の保存先がSDカードから本...,1,1
1,@SHARP_JP SoftBankのAQUOS R2を使用してますが、Android 10...,1,1
4,AQUOS R5G君が「低速充電中」のまま、まともに充電されておらず気づいたら電池切れ寸前だ...,1,1
9,結局、AQUOSsense3は、水には強かったけど、音はイマイチだしすぐ割れてしまいました。...,1,1
11,携帯のヒンジがまた壊れた。AQUOSケータイ2。 閉じても勝手に開く。ボタンも利かない。 下...,1,1
55,AQUOS R5Gに変えてから正常にゲームできないこと多発(｡•́︿•̀｡) ゲーミング設定...,0,1
121,shv45っていうスマホ使ってるんだけどまーーーじで文字変換がカスなのだ。音質もゴミなのだ。...,0,1
215,#AQUOSsense3basic 使ってみました② 少し斜めに撮ってしまった風景写真です...,0,1
285,@mura_neko zero2、明るさ自動にしてると照度センサーが効きすぎて違和感がありま...,0,1


In [63]:
test_df['label'] 

0      1
1      1
2      1
3      1
4      1
      ..
477    0
478    0
479    0
480    0
481    0
Name: label, Length: 482, dtype: int64

In [82]:
test_df['predict'] = test_df['predict'].astype(int)

In [67]:
test_df[test_df['predict'] == 1 ]

Unnamed: 0,text,label,predict
13,AQUOS RとGalaxy A41のスマホとしてのサイズはほぼ同じなのだが、Galaxy ...,0,1
15,@RED_prideofeden スマホ変えてから一向にこんな感じです。対応してくれないとイ...,0,1
46,AQUOS R5G simフリー スマホ 本体 新品 スマートフォン 本体 楽天モバイル 端...,0,1
54,AQUOS R5Gの方が良い説あるな,0,1
114,【スマートフォン】ベスト10 シャープ AQUOS sense2 SH-M08 アーバンブル...,0,1
211,AQUOS sense3 lite、悪くないねえ。気に入ったよ。大きさも丁度ええし。,0,1
216,モイ！Android AQUOS sense3 liteからキャス配信中。コメント来たら喋り...,0,1
283,@senecal11717 Zero2の背面も…😱😱😱😱,0,1
363,ドコモ、10月1日に端末割引を変更　「らくらくホン」「AQUOS ケータイ」などが対象に -...,0,1


In [98]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

0.975103734439834

In [74]:
test_df[ test_df['label'] == test_df['predict'] ]

Unnamed: 0,text,label,predict
14,という訳でAQUOS RからGalaxy A41使いにジョブチェンジと相成りました。 AQ...,0,0
16,@canned_may_03 はぜる 出やんかった(605sh) これ古いしなぁ,0,0
17,@masachappy Android version はどちらも9 端末名はHW-02Lと...,0,0
18,ＡＱＵＯＳR 5G買ったぜ https://t.co/Tih27XFDTL,0,0
19,やばいAQUOS R 5Gの画質良すぎるwwwww,0,0
...,...,...,...
477,ちびっ子のいとこ二人に キッズケータイを持たせました… お姉ちゃんは 君たちのスマホの維持費...,0,0
478,家にパソコンもタブレットもあるのにそっちに執着しないってことはスマホ依存なんだろうね いっそ...,0,0
479,もしかして全てがうまくいって明後日から働くことになるかもしれないから、今日はやりたいこと全部...,0,0
480,@sachiho_0912 @__k5__ikuji__ | 'ω'){そこはむしろキッズケ...,0,0


In [130]:
test_df = pd.read_excel('20200930result3_人の目チェック.xlsx')

In [132]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [133]:
y_test = test_df['label'].astype(float).to_list()
y_pred = test_df['predict'].astype(float).to_list()

In [134]:
print("Accuracy: %.5f" % accuracy_score(y_test, y_pred))
print("Precision: %.5f" % precision_score(y_test, y_pred))
print("recall: %.5f" % recall_score(y_test, y_pred))
print("f1: %.5f" % f1_score(y_test, y_pred))

Accuracy: 0.98133
Precision: 0.88889
recall: 0.50000
f1: 0.64000
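The high accuracy next to a recall of 0.5 is easier to read once the scores are broken down into counts. The counts below are not taken from the notebook; they are inferred as one set of integers that reproduces the four printed scores, assuming 482 evaluated rows as in `test.tsv`.

```python
# Inferred counts (an assumption, see the note above): positives are label 1.
tp, fp, fn, tn = 8, 1, 8, 465

precision = tp / (tp + fp)   # of the rows predicted 1, how many are truly 1
recall = tp / (tp + fn)      # of the rows truly 1, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Accuracy: %.5f" % accuracy)
print("Precision: %.5f" % precision)
print("recall: %.5f" % recall)
print("f1: %.5f" % f1)
```

Because label 0 dominates, accuracy stays near 0.98 even though half of the label-1 rows are missed, so recall and F1 are the more informative scores here.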


In [105]:
test_df.to_excel('20200930result3.xlsx')

In [None]:
### 1/5 of full training data.
# sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A little more detailed check using `sklearn.metrics`.

In [25]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [26]:
print(classification_report(test_df['label'], test_df['predict']))

              precision    recall  f1-score   support

           0       1.00      0.97      0.99      8732
           1       0.00      0.00      0.00         0

    accuracy                           0.97      8732
   macro avg       0.50      0.49      0.49      8732
weighted avg       1.00      0.97      0.99      8732



  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
### 1/5 of full training data.
# print(classification_report(test_df['label'], test_df['predict']))

In [None]:
print(confusion_matrix(test_df['label'], test_df['predict']))

[[ 16   1   0   0   0   0   6   0   0   2   0   1   0   1   0   0   0   0]
 [  0 283  10   0   1   0  18   3   3   1   1   0   0   0   0   0   0   0]
 [  1  11  51   0   0   0   5   1   0   0   0   0   0   1   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0]
 [  0   0   0   0  14   0   0   0   0   0   0   0   2   0   0   1   0   0]
 [  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  2  15   4   1   0   0  64   4   5   2   2   0   3   1   0   1   0   0]
 [  0   1   0   0   0   0   2  96   0   0   1   0   0   0   0   0   0   0]
 [  0   3   1   0   0   0   2   0 143   0   1   0   0   0   0   2   0   0]
 [  2   0   0   0   0   0   2   0   0  20   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0  87   0   0   3   1   0   0   0]
 [  1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   2   0   0   0   0   0  18   0   0   0   0   0]
 [  0   4   1   0   0   0

In [None]:
### 1/5 of full training data.
# print(confusion_matrix(test_df['label'], test_df['predict']))

In [None]:
!ls

check-extract-features.ipynb	   pretraining.ipynb
check-trained-tokenizer.ipynb	   test_results.tsv
dev.tsv				   test.tsv
finetune_to_livedoor_corpus.ipynb  train.tsv
finetune-to-livedoor-corpus.ipynb


In [None]:
test_df.to_csv('../test20191221.csv')

In [None]:
!gsutil cp -r  ../test20191221.csv  gs://hisaka/20191220 

Copying file://../test20191221.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/423.1 KiB.                                    


In [125]:
!pip install asari

Collecting asari
  Downloading https://files.pythonhosted.org/packages/b1/b9/e328f6ef94596517417491d4d77548a6413252545a62c2a4668ca6df0561/asari-0.0.4-py3-none-any.whl (10.4MB)
Collecting Janome>=0.3.7
  Downloading https://files.pythonhosted.org/packages/a8/63/98858cbead27df7536c7e300c169da0999e9704d02220dc6700b804eeff0/Janome-0.4.1-py2.py3-none-any.whl (19.7MB)
Installing collected packages: Janome, asari
Successfully installed Janome-0.4.1 asari-0.0.4


In [126]:
from asari.api import Sonar

In [127]:
sonar = Sonar()



In [129]:
sonar.ping(text="広告多すぎる♡")

TypeError: ignored

### Simple baseline model.

In [106]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [107]:
train_df = pd.read_csv("./train.tsv", sep='\t')
dev_df = pd.read_csv("./dev.tsv", sep='\t')
test_df = pd.read_csv("./test.tsv", sep='\t')

In [108]:
!apt-get -q install -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libmecab2 mecab-utils
The following NEW packages will be installed:
  libmecab-dev libmecab2 mecab mecab-ipadic mecab-ipadic-utf8 mecab-utils
0 upgraded, 6 newly installed, 0 to remove and 21 not upgraded.
Need to get 12.8 MB of archives.
After this operation, 60.4 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libmecab2 amd64 0.996-5 [257 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libmecab-dev amd64 0.996-5 [308 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab-utils amd64 0.996-5 [4,856 B]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab-ipadic all 2.7.0-20070801+main-1 [12.1 MB]
Get:5 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab amd64 0.996-5 [132 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic/universe amd64 mecab-

In [109]:
!pip install -q mecab-python3==0.7

  Building wheel for mecab-python3 (setup.py) ... done


In [110]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [111]:
m = MeCab.Tagger("-Owakati")

In [112]:
train_dev_df = pd.concat([train_df, dev_df])

In [113]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [114]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
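One detail worth knowing when feeding space-separated Japanese tokens to `TfidfVectorizer`: its default `token_pattern` only keeps tokens of two or more word characters, so single-character tokens such as particles are dropped. The sketch below shows this on two hypothetical pre-tokenized sentences that stand in for MeCab `-Owakati` output; the alternative pattern is just one possible choice, not something this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized sentences (tokens separated by spaces).
docs = [
    "スマホ の 充電 が 遅い",
    "カメラ の 画質 が 良い",
]

# Default pattern r"(?u)\b\w\w+\b": tokens need at least two word characters.
default_vec = TfidfVectorizer()
default_vec.fit(docs)
print(sorted(default_vec.vocabulary_))

# Splitting on whitespace only keeps single-character tokens as well.
keep_all_vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
keep_all_vec.fit(docs)
print(sorted(keep_all_vec.vocabulary_))
```

Dropping particles is often harmless for a bag-of-words baseline, but it also discards meaningful single-kanji words, which is a trade-off to keep in mind.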

The following setup is not exactly identical to BERT's, because `GradientBoostingClassifier` internally carves out its validation set with a shuffled `train_test_split` instead of using the fixed dev set.  
The parameters are also not well tuned, but we think this is enough to gauge how much BERT improves over a simple baseline.

In [115]:
%%time

model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(train_df)/len(dev_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

CPU times: user 509 ms, sys: 955 µs, total: 510 ms
Wall time: 514 ms
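The combination of `validation_fraction`, `n_iter_no_change`, and `tol` in the cell above enables early stopping: part of the training data is held out, and boosting stops once the validation score fails to improve by at least `tol` for `n_iter_no_change` consecutive stages. The sketch below illustrates this on synthetic data from `make_classification`; the data and the `validation_fraction=0.25` value are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the TF-IDF features.
X, y = make_classification(n_samples=500, n_features=20, random_state=23)

model = GradientBoostingClassifier(
    n_estimators=200,          # upper bound on the number of boosting stages
    validation_fraction=0.25,  # share of the data held out for early stopping
    n_iter_no_change=5,        # patience, in stages
    tol=0.01,                  # minimum improvement that counts as progress
    random_state=23,
)
model.fit(X, y)

# n_estimators_ reports how many stages were actually fitted.
print("stages requested:", model.n_estimators)
print("stages fitted:", model.n_estimators_)
```

A short wall time like the one printed above is consistent with early stopping ending training well before all 200 stages are built.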


In [123]:
model.score(test_xs_, test_ys)

0.970954356846473

In [124]:
model.predict(test_xs_)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [116]:
print(classification_report(test_ys, model.predict(test_xs_)))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99       469
           1       0.00      0.00      0.00        13

    accuracy                           0.97       482
   macro avg       0.49      0.50      0.49       482
weighted avg       0.95      0.97      0.96       482



<1x750 sparse matrix of type '<class 'numpy.float64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [117]:
test_xs_

<482x750 sparse matrix of type '<class 'numpy.float64'>'
	with 6302 stored elements in Compressed Sparse Row format>

In [None]:
### 1/5 of full training data.
# print(classification_report(test_ys, model.predict(test_xs_)))

In [118]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))

[[468   1]
 [ 13   0]]
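Reading the confusion matrix above (rows are true labels, columns are predicted labels) shows why the baseline's accuracy is misleading. The sketch below recomputes the key numbers directly from the printed matrix; the variable names are only for this example.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = np.array([[468, 1],
               [13, 0]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all rows
recall_pos = cm[1, 1] / cm[1].sum()         # label-1 rows that were found
majority_baseline = cm[0].sum() / cm.sum()  # accuracy of always predicting 0

print("accuracy:", accuracy)
print("recall for label 1:", recall_pos)
print("always-0 accuracy:", majority_baseline)
```

The accuracy matches the `model.score` value above, yet none of the 13 label-1 rows is detected, and simply predicting 0 everywhere would score slightly higher.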


In [None]:
### 1/5 of full training data.
# print(confusion_matrix(test_ys, model.predict(test_xs_)))