# BERT VS TFIDF
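Purpose: Measure how much a traditional TF-IDF + machine-learning model differs from the more recent BERT.  
Motivation: 1. I know the theory behind BERT well but have never actually run it.  
2. BERT has been shown to perform well on language understanding, but I want to check how it fares on an ordinary classification problem and how large the gap to the traditional approach is.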

Comparison task: sentiment analysis on IMDB reviews  
https://www.kaggle.com/c/imdb-review/data


# Inspecting the data

In [1]:
!pip uninstall -y tensorflow tensorflow-cpu tensorflow-gpu keras
!pip install -U tensorflow==1.15 tensorflow_datasets

Uninstalling tensorflow-2.2.0rc4:
  Successfully uninstalled tensorflow-2.2.0rc4
Uninstalling Keras-2.3.1:
  Successfully uninstalled Keras-2.3.1
Collecting tensorflow==1.15
  Downloading https://files.pythonhosted.org/packages/3f/98/5a99af92fb911d7a88a0005ad55005f35b4c1ba8d75fba02df726cd936e6/tensorflow-1.15.0-cp36-cp36m-manylinux2010_x86_64.whl (412.3MB)
Collecting tensorflow_datasets
  Downloading https://files.pythonhosted.org/packages/bd/99/996b15ff5d11166c3516012838f569f78d57b71d4aac051caea826f6c7e0/tensorflow_datasets-3.1.0-py3-none-any.whl (3.3MB)
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading https://files.pythonhosted.org/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
Collecting gast==0.2.2
  Do

In [2]:
! unzip imdb-review.zip

Archive:  imdb-review.zip
  inflating: test_data.csv           
  inflating: train_data.csv          


In [0]:
import pandas as pd 
import numpy as np
import nltk

In [0]:
train_df = pd.read_csv('train_data.csv')

In [5]:
train_df.head()

Unnamed: 0,ID,SentimentText,Sentiment
0,0,first think another disney movie might good it...,1
1,1,put aside dr house repeat missed desperate hou...,0
2,2,big fan stephen king s work film made even gre...,1
3,3,watched horrid thing tv needless say one movie...,0
4,4,truly enjoyed film acting terrific plot jeff c...,1


In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
from nltk.tokenize import word_tokenize
train_df['tokenized'] = train_df['SentimentText'].apply(word_tokenize)

In [8]:
train_df.head()

Unnamed: 0,ID,SentimentText,Sentiment,tokenized
0,0,first think another disney movie might good it...,1,"[first, think, another, disney, movie, might, ..."
1,1,put aside dr house repeat missed desperate hou...,0,"[put, aside, dr, house, repeat, missed, desper..."
2,2,big fan stephen king s work film made even gre...,1,"[big, fan, stephen, king, s, work, film, made,..."
3,3,watched horrid thing tv needless say one movie...,0,"[watched, horrid, thing, tv, needless, say, on..."
4,4,truly enjoyed film acting terrific plot jeff c...,1,"[truly, enjoyed, film, acting, terrific, plot,..."


Word counts (total tokens and vocabulary size)

In [0]:
import itertools
tokenized_list = list(itertools.chain.from_iterable(train_df['tokenized'].to_list()))
vocab = set(tokenized_list)


In [10]:
len(tokenized_list),len(vocab)

(3217535, 72987)
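
The totals above can be complemented with a frequency table. Below is a minimal sketch using `collections.Counter`; the two toy reviews are made up for illustration and stand in for `train_df['tokenized']`.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for train_df['tokenized'] (hypothetical data, not from the dataset)
tokenized_reviews = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
]

# Flatten all reviews into one token stream, as in the cell above
all_tokens = list(chain.from_iterable(tokenized_reviews))
token_counts = Counter(all_tokens)

total_tokens = len(all_tokens)    # total number of tokens
vocab_size = len(token_counts)    # number of distinct tokens
top_tokens = token_counts.most_common(2)
print(total_tokens, vocab_size, top_tokens)
```

On the real data, `most_common` would show which words dominate the corpus, which is useful context when capping the TF-IDF vocabulary later.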

Share of positive vs. negative labels

In [11]:
train_df['Sentiment'].mean()

0.499

Split the data into train and validation sets, holding out 30% for validation

In [0]:
valid_size = int(train_df.shape[0]*0.3)

In [0]:
df_valid = train_df.sample(valid_size,random_state = 0)

In [0]:
df_train = train_df.loc[~(train_df.index.isin(df_valid.index)),:]

In [15]:
df_train.head()

Unnamed: 0,ID,SentimentText,Sentiment,tokenized
0,0,first think another disney movie might good it...,1,"[first, think, another, disney, movie, might, ..."
1,1,put aside dr house repeat missed desperate hou...,0,"[put, aside, dr, house, repeat, missed, desper..."
2,2,big fan stephen king s work film made even gre...,1,"[big, fan, stephen, king, s, work, film, made,..."
3,3,watched horrid thing tv needless say one movie...,0,"[watched, horrid, thing, tv, needless, say, on..."
4,4,truly enjoyed film acting terrific plot jeff c...,1,"[truly, enjoyed, film, acting, terrific, plot,..."


In [16]:
df_valid.head()

Unnamed: 0,ID,SentimentText,Sentiment,tokenized
5118,5118,i m surprised even cowgirls get blues movie an...,0,"[i, m, surprised, even, cowgirls, get, blues, ..."
10284,10284,pretty standard b movie stuff seriously anyone...,0,"[pretty, standard, b, movie, stuff, seriously,..."
6208,6208,i ve watch films pang brothers eye one take on...,1,"[i, ve, watch, films, pang, brothers, eye, one..."
3361,3361,vampires vs humansmilitary reject roughneck sq...,0,"[vampires, vs, humansmilitary, reject, roughne..."
7068,7068,n b spoilers within assigning artistic directo...,0,"[n, b, spoilers, within, assigning, artistic, ..."


In [17]:
df_valid['Sentiment'].mean()

0.49930555555555556
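
The split above samples at random, and the check shows the positive rate stayed near 0.5. If one wanted to guarantee that ratio rather than verify it afterwards, a stratified hold-out is one option. A minimal pure-Python sketch, assuming a hypothetical helper `stratified_split` and toy labels (neither appears in the original notebook):

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_fraction=0.3, seed=0):
    """Return (train_idx, valid_idx), holding out each class at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = int(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

labels = [0, 1] * 50  # toy labels: 50 negative, 50 positive
train_idx, valid_idx = stratified_split(labels)
valid_pos_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), valid_pos_rate)
```

With a near 50/50 dataset like this one, plain random sampling is usually close enough, so this matters more for imbalanced data.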

# BERT
Library: keras_bert

In [18]:
!pip install -q keras-bert keras-rectified-adam
!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip -o uncased_L-12_H-768_A-12.zip

In [0]:
# Constants

SEQ_LEN = 128
BATCH_SIZE = 128
EPOCHS = 5
LR = 1e-4

In [0]:
# Path settings
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# TF_KERAS must be added to environment variables in order to use TPU
os.environ['TF_KERAS'] = '1'

In [21]:
# TPU environment setup

import tensorflow as tf
from keras_bert import get_custom_objects

TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Initializing the TPU system: 10.117.67.122:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Querying Tensorflow master (grpc://10.117.67.122:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 15770819735683517156)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 11738655311101867708)

In [22]:
# Load BERT
import codecs
from keras_bert import load_trained_model_from_checkpoint

token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
print('dict_readed')
# Fine-tuning, so set training=True and trainable=True
with strategy.scope():
    model = load_trained_model_from_checkpoint(
        config_path,
        checkpoint_path,
        training=True,
        trainable=True,
        seq_len=SEQ_LEN,
    )

dict_readed
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [23]:
# Check the original BERT model once
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      [(None, 128)]        0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 128, 768), ( 23440896    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 128, 768)     1536        Input-Segment[0][0]              
______________________________________________________________________________________________

### Convert the IMDB text data to vectors

In [24]:
from tqdm import tqdm
from keras_bert import Tokenizer
tokenizer = Tokenizer(token_dict)

def pddf_to_tfdataset(df):
  indices, sentiments = [], []
  for text,sentiment in tqdm(zip(df['SentimentText'],df['Sentiment'])):
    ids, segments = tokenizer.encode(text, max_len=SEQ_LEN)
    indices.append(ids)
    sentiments.append(sentiment)

  items = list(zip(indices, sentiments))
  np.random.shuffle(items)
  indices, sentiments = zip(*items)
  indices = np.array(indices)
  mod = indices.shape[0] % BATCH_SIZE
  if mod > 0:
    indices, sentiments = indices[:-mod], sentiments[:-mod]
  return [indices, np.zeros_like(indices)], np.array(sentiments)
  
  
train_x, train_y = pddf_to_tfdataset(df_train)
test_x, test_y = pddf_to_tfdataset(df_valid)

16800it [00:27, 600.94it/s]
7200it [00:12, 597.53it/s]
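
The progress output shows 16800 training and 7200 validation reviews, while `pddf_to_tfdataset` drops the trailing rows that do not fill a whole batch. The sketch below illustrates the two ideas involved, fixed-length sequences and batch-aligned row counts, in plain Python. `pad_or_truncate` and `drop_remainder` are hypothetical helpers written for this illustration; the real `Tokenizer.encode` also handles special tokens, which this simplification ignores.

```python
def pad_or_truncate(ids, seq_len, pad_id=0):
    """Cut a token-id list to seq_len, or right-pad it with pad_id."""
    return ids[:seq_len] + [pad_id] * max(0, seq_len - len(ids))

def drop_remainder(rows, batch_size):
    """Keep only as many rows as fill complete batches."""
    usable = len(rows) - len(rows) % batch_size
    return rows[:usable]

BATCH_SIZE = 128  # same value as the notebook constant

padded = pad_or_truncate([5, 6, 7], seq_len=5)
truncated = pad_or_truncate([1, 2, 3, 4, 5, 6], seq_len=4)

# Row counts after dropping the incomplete final batch
n_valid_rows = len(drop_remainder(list(range(7200)), BATCH_SIZE))
n_train_rows = len(drop_remainder(list(range(16800)), BATCH_SIZE))
print(padded, truncated, n_valid_rows, n_train_rows)
```

Since 7200 leaves a remainder of 32 when divided by 128, the BERT model ends up being evaluated on 7168 validation reviews rather than all 7200.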


In [25]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Masked:0' shape=(?, 128) dtype=float32>]

In [26]:
model.get_layer('NSP-Dense').output

<tf.Tensor 'NSP-Dense/Tanh:0' shape=(?, 768) dtype=float32>

In [0]:
# Turn it into a classifier
from tensorflow.python import keras
from keras_radam import RAdam

with strategy.scope():
    inputs = model.inputs[:2]
    dense = model.get_layer('NSP-Dense').output
    outputs = keras.layers.Dense(units=2, activation='softmax')(dense)
    
    model = keras.models.Model(inputs, outputs)
    model.compile(
        RAdam(lr=LR),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy'],
    )
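
The new head is a 2-unit softmax trained with sparse categorical cross-entropy, meaning the labels stay as integers 0/1 rather than one-hot vectors. A small worked sketch of both pieces in plain Python (the functions are written by hand for illustration and are not the Keras implementations):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the integer class label."""
    return -math.log(probs[label])

# Equal logits: the model is undecided between negative (0) and positive (1)
undecided = softmax([0.0, 0.0])
loss_undecided = sparse_categorical_crossentropy(1, undecided)

# Logits favouring class 1: the loss for a true label of 1 gets smaller
confident = softmax([-2.0, 2.0])
loss_confident = sparse_categorical_crossentropy(1, confident)
print(undecided, loss_undecided, loss_confident)
```

An undecided prediction costs ln 2 per example, so a training loss well below that value indicates the head has learned something beyond chance.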

In [28]:
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      [(None, 128)]        0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 128, 768), ( 23440896    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 128, 768)     1536        Input-Segment[0][0]              
____________________________________________________________________________________________

In [29]:
# Initialize the variables that are still uninitialized
import tensorflow as tf
import tensorflow.keras.backend as K

sess = K.get_session()
uninitialized_variables = set([i.decode('ascii') for i in sess.run(tf.report_uninitialized_variables())])
init_op = tf.variables_initializer(
    [v for v in tf.global_variables() if v.name.split(':')[0] in uninitialized_variables]
)
sess.run(init_op)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [30]:
# Train
model.fit(
    train_x,
    train_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f00066e43c8>

In [35]:
y_pred = model.predict(test_x, verbose=True)




In [0]:
y_pred_cat = y_pred.argmax(axis=1)


In [41]:
len(test_y)

7168

### Evaluation

In [43]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, y_pred_cat)


0.8505859375
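
The evaluation takes the class with the highest softmax probability and compares it with the true label. A minimal sketch of that computation in plain Python, using made-up probabilities (the helper names are hypothetical):

```python
def argmax_rows(probabilities):
    """Index of the largest value in each row, like y_pred.argmax(axis=1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy softmax outputs for four reviews: [P(negative), P(positive)]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 1]

y_hat = argmax_rows(probs)
acc = accuracy(y_true, y_hat)
print(y_hat, acc)
```

Accuracy is symmetric in its two arguments, so swapping true and predicted labels does not change the score, although other metrics such as precision would change.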

# TF-IDF × LightGBM


Convert each review to a TF-IDF vector (500 dimensions)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 500)
X = vectorizer.fit_transform(train_df['SentimentText'])
X = X.toarray()  # plain ndarray instead of the np.matrix returned by todense()
y = train_df['Sentiment']
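
To make the weighting concrete, here is a hand-written sketch of TF-IDF in plain Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the `TfidfVectorizer` defaults as I understand them, but it uses naive whitespace tokenization and no `max_features` cap. The function `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["good movie", "bad movie"])
print(vocab)
print(vectors[0])
```

In the first toy document, "good" receives a higher weight than "movie" because "movie" appears in every document and therefore carries less information.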


In [0]:
X_train_tfidf = X[df_train.index,:]
y_train_tfidf = y[df_train.index]
X_valid_tfidf = X[df_valid.index,:]
y_valid_tfidf = y[df_valid.index]

### Train and predict

In [0]:
import lightgbm as lgb

In [47]:
lgb_model = lgb.LGBMClassifier(n_estimators=150)
lgb_model.fit(X_train_tfidf, y_train_tfidf)


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=150, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [0]:
y_pred = lgb_model.predict(X_valid_tfidf)

Check accuracy

In [51]:
from sklearn.metrics import accuracy_score
accuracy_score(y_valid_tfidf, y_pred)


0.825

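# Conclusion
Even on a task like sentiment analysis (at least on the IMDB dataset), BERT is more accurate than TF-IDF.  
However, comparing the two with basic setups gives  
TF-IDF : BERT = 0.825 : 0.85 or so, which cannot be called an overwhelming advantage.  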
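
One rough way to gauge whether a gap of this size exceeds sampling noise is a two-proportion z-statistic. The sketch below plugs in the accuracies and validation sizes reported above (7168 reviews for BERT, 7200 for TF-IDF). It assumes the two evaluations are independent samples, which is not strictly true here because both models were scored on largely the same reviews, so treat it as an approximation; a paired test would be more appropriate. The helper `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(acc_a, n_a, acc_b, n_b):
    """z-statistic for the difference between two accuracies (pooled variance)."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (acc_a - acc_b) / std_err

# Accuracies and sample counts taken from the evaluation cells above
z = two_proportion_z(0.8505859375, 7168, 0.825, 7200)
print(round(z, 2))
```

A magnitude well above roughly 2 would suggest the difference is unlikely to be noise alone, which is a separate question from whether the improvement is large enough to justify BERT's extra cost.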

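Open issues:  
・The data was relatively clean this time, so I did no cleansing. Ideally, steps such as stop-word removal should be applied.  
・RNN-family models such as LSTM should also be compared against BERT's accuracy.  
・How this carries over to Japanese is still unknown.  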
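
As an illustration of the first item, stop-word removal can be sketched as a simple filter over tokens. The stop-word set below is a tiny hand-picked example, not a standard list; in practice a curated list such as the one shipped with NLTK would likely be used instead.

```python
# Hypothetical, deliberately small stop-word set for illustration only
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["This", "movie", "is", "a", "masterpiece"]
filtered = remove_stop_words(tokens)
print(filtered)
```

This kind of filtering mainly affects the TF-IDF side, since the capped vocabulary would otherwise spend some of its 500 slots on uninformative words.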


Reference URL
https://github.com/CyberZHG/keras-bert/blob/master/demo/tune/keras_bert_classification_tpu.ipynb