# Improved LSTM baseline

### https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout

## What is LSTM

https://www.hellocybernetics.tech/entry/2017/05/06/182757

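LSTM stands for Long Short-Term Memory.
Short-term memory is the state a recurrent network carries from one step to the next; the key achievement of the LSTM is that it lets this short-term memory be used over long spans.
You can think of an LSTM as a single hidden layer. It performs fairly complex processing internally, but otherwise plays the role of an ordinary hidden layer.

The LSTM was devised as a building block for recurrent neural networks (RNNs), and can be thought of as a layer specialized in making good use of earlier information.
LSTMs have been refined in many ways and their internals have changed over time, but the goal is always the same: handling sequence data well.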
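To make the "layer with internal processing" idea concrete, here is a minimal sketch of one LSTM time step in plain Python. It assumes a hidden size of 1 and hand-picked weights (the `weights` dictionary and the function names are hypothetical, not taken from Keras), so it only illustrates the gate arithmetic of the standard LSTM formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell (hidden size 1, for illustration only)."""
    # Each gate looks at the current input and the previous hidden state.
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])    # forget gate
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])    # input gate
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])    # output gate
    g = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new content
    h = o * math.tanh(c)     # hidden state handed to the next step and the next layer
    return h, c

# Hypothetical weights chosen by hand, not learned.
weights = {k: 0.5 for k in ("f_x", "f_h", "f_b", "i_x", "i_h", "i_b",
                            "o_x", "o_h", "o_b", "g_x", "g_h", "g_b")}

h, c = 0.0, 0.0
states = []
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, weights)
    states.append((h, c))
```

When the forget gate is close to 1 and the input gate close to 0, the cell state `c` passes through a step almost unchanged, which is how the cell can hold information over many steps.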

<img src="LSTM.png">

## What is a Bi-directional RNN

https://deepage.net/deep_learning/2017/05/23/recurrent-neural-networks.html

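A bi-directional RNN is a model that improves accuracy by taking future information into account as well as past information. An ordinary RNN learns only in the past-to-future direction, whereas a bi-directional RNN also learns in the future-to-past direction at the same time.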
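The two reading directions can be sketched with a tiny scalar tanh RNN in plain Python. This is a simplification under stated assumptions: the weights `w_x` and `w_h` are made up, and both directions share them here, whereas a real bidirectional layer learns a separate set of weights per direction.

```python
import math

def run_rnn(inputs, w_x=0.8, w_h=0.5):
    """Run a scalar tanh RNN over the inputs and return the hidden state at every step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs):
    forward = run_rnn(inputs)                # reads left to right
    backward = run_rnn(inputs[::-1])[::-1]   # reads right to left, then re-aligned
    # Pair the two states for each time step (concatenation of the two directions).
    return [(f, b) for f, b in zip(forward, backward)]

sequence = [1.0, 0.0, 0.0, -1.0]
outputs = run_bidirectional(sequence)
```

At the first time step the forward state has seen only the first input, while the backward state already summarizes everything that follows it.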

## What is GloVe (Global Vectors for Word Representation)

https://nlp.stanford.edu/projects/glove/

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

http://nonbiri-tereka.hatenablog.com/entry/2015/10/25/223430

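GloVe stands for Global Vectors for Word Representation. As the name suggests, it computes global feature vectors that represent words. Each word is mapped to a D-dimensional vector, which makes it possible to compute the distance between one word and another.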
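As an illustration of "distance between words", the sketch below computes cosine similarity between word vectors. The three-dimensional vectors in `toy_vectors` are invented for the example; real GloVe vectors in this notebook have 50 dimensions, and cosine similarity is only one common choice of measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, not real GloVe values.
toy_vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
```

With these toy values, `sim_king_queen` comes out larger than `sim_king_apple`, which is the kind of relationship pretrained vectors are meant to capture.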

In [1]:
# Load libraries

import sys, os, gc, re, csv, codecs, numpy as np, pandas as pd
import keras

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook, and search 'glove' in the 'datasets' section.

In [2]:
# Paths to the GloVe dictionary and the data files

path = '../data/'
EMBEDDING_FILE=f'{path}glove6b50d/glove.6B.50d.txt'
# GloVe word vectors (https://www.kaggle.com/watts2/glove6b50dtxt)

TRAIN_DATA_FILE=f'{path}train.csv'
TEST_DATA_FILE=f'{path}test.csv'

Set some basic config parameters:

In [3]:
# Constants

embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a comment to use

Read in our data and replace missing values:

In [4]:
# Read the files

train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

In [5]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [6]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [7]:
# Fill missing values
# https://stats.stackexchange.com/questions/381110/text-preprocessing-using-keras/381111

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

Standard Keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [8]:
# Tokenize the text (convert it to integer sequences) --> standard Keras functionality
# https://keras.io/ja/preprocessing/text/
# https://qiita.com/tomiyou/items/da0b4cc85b89eb0b6d1d

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

# Pad or truncate the sequences so that they all have the same length
# https://keras.io/ja/preprocessing/sequence/
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
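
The tokenize-then-pad step can be approximated in plain Python as follows. This is a simplified stand-in, not the Keras implementation: the helper names are hypothetical, punctuation filtering is omitted, and it only assumes the documented behaviour that index 0 is reserved for padding, that only words with an index below `num_words` are kept, and that `pad_sequences` pads and truncates at the front by default.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    # Words that are unknown, or ranked at num_words or beyond, are dropped.
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad(sequence, maxlen):
    # Keep the last maxlen tokens, then left-pad with zeros up to maxlen.
    sequence = sequence[-maxlen:]
    return [0] * (maxlen - len(sequence)) + sequence

texts = ["the cat sat", "the cat sat on the mat", "the dog"]
word_index = build_word_index(texts)
padded = [pad(to_sequence(t, word_index, num_words=5), maxlen=4) for t in texts]
```

Short comments end up with leading zeros, and long comments keep only their final `maxlen` tokens, so every row has the same length.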

Read the GloVe word vectors (space-delimited strings) into a dictionary from word->vector.

In [9]:
# Load the GloVe dictionary
# https://pycarnival.com/dict/

def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
# A single star (*) in a call unpacks a list/collection into positional arguments.
# A function defined with a "*x" parameter accepts any number of extra positional arguments without declaring each one.
# https://codeday.me/jp/qa/20181122/11424.html

with open(EMBEDDING_FILE, encoding="utf-8_sig") as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)
# split() with no arguments splits the string on whitespace and returns the pieces as a list

# To avoid "UnicodeDecodeError: 'cp932' codec can't decode byte 0x93 in position 3136: illegal multibyte sequence",
# open(EMBEDDING_FILE) --> open(EMBEDDING_FILE, encoding="utf-8_sig")
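
The parsing step can be tried without the real GloVe file. The sketch below assumes the text layout described above (a word followed by space-separated numbers on each line) and uses an in-memory `fake_file` with two illustrative three-dimensional lines; `parse_glove_line` is a hypothetical helper playing the role of `get_coefs`.

```python
import io

def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into (word, list of floats)."""
    word, *values = line.strip().split()   # star-unpacking collects the numbers
    return word, [float(v) for v in values]

# Two illustrative lines in the GloVe text layout (3 dimensions instead of 50).
fake_file = io.StringIO("the 0.418 0.24968 -0.41242\ncat 0.45281 -0.50108 -0.53714\n")

embeddings = dict(parse_glove_line(line) for line in fake_file)
```

Passing the generator of `(word, vector)` pairs straight to `dict()` builds the word->vector lookup in one pass over the file.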

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. When generating the random init, we'll use the same mean and standard deviation as the GloVe embeddings.

In [10]:
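# numpy.stack() joins NumPy arrays along a new axis (dimension)
# list(...) is needed because recent NumPy versions do not accept a dict view here
all_embs = np.stack(list(embeddings_index.values()))

# Mean and standard deviation
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std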

  


(0.020940498, 0.6441043)

In [11]:
# Prepare the weight matrix passed to Embedding()

word_index = tokenizer.word_index # word -> integer index mapping (indices start at 1)
nb_words = min(max_features, len(word_index) + 1) # +1 because index 0 is reserved for padding
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= nb_words: continue # skip words outside the kept vocabulary
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
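
The "random init, then overwrite with pretrained rows" idea can be checked on a toy vocabulary. This sketch uses plain Python lists instead of NumPy, and the function name, the toy `word_index`, and the toy `pretrained` dictionary are all hypothetical; reserving one extra row for the padding index 0 is an assumption of the sketch.

```python
import random

def build_embedding_matrix(word_index, pretrained, max_features, dim, mean, std, seed=0):
    rng = random.Random(seed)
    n_rows = min(max_features, len(word_index) + 1)   # +1 because index 0 is padding
    # Start every row from N(mean, std) so words missing from GloVe get plausible values.
    matrix = [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(n_rows)]
    for word, i in word_index.items():
        if i >= n_rows:
            continue                                  # outside the kept vocabulary
        vector = pretrained.get(word)
        if vector is not None:
            matrix[i] = list(vector)                  # overwrite with the pretrained row
    return matrix

word_index = {"the": 1, "cat": 2, "zzyzx": 3, "rare": 4}
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4], "rare": [0.5, 0.6]}
matrix = build_embedding_matrix(word_index, pretrained, max_features=4, dim=2,
                                mean=0.02, std=0.64)
```

Here `"zzyzx"` keeps its random row because it has no pretrained vector, and `"rare"` is skipped because its index falls outside the first `max_features` rows.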

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

Reference: Using pre-trained word embeddings in a Keras model

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Reference: Classification with a Bidirectional LSTM in Python

https://paper.hatenadiary.jp/entry/2016/10/19/231911

In [12]:
# Build the model containing the LSTM

inp = Input(shape=(maxlen,))

# Embedding() converts positive integers (indices) into dense vectors of fixed size
# https://keras.io/ja/layers/embeddings/
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)

# Bidirectional wrapper for RNNs
# https://keras.io/ja/layers/wrappers/
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)

x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)  # six labels (multi-label classification), so the final layer outputs 6 values
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Check the model structure before training
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 50)           1000000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 100)          40400     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 306       
Total para
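
The final layer uses six sigmoid units with `binary_crossentropy`, so each label is scored as its own yes/no decision. The following plain-Python sketch shows that loss for a single comment; the function is a hand-written approximation (including the clipping constant `eps`), and the probabilities are made up.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Label order follows list_classes; the probabilities are invented for the example.
y_true = [1, 0, 1, 0, 1, 0]
good_pred = [0.9, 0.1, 0.8, 0.1, 0.7, 0.2]
bad_pred = [0.1, 0.9, 0.2, 0.9, 0.3, 0.8]

loss_good = binary_crossentropy(y_true, good_pred)
loss_bad = binary_crossentropy(y_true, bad_pred)
```

Because the labels are independent, a comment can be both `toxic` and `insult` at once, which a softmax output with categorical cross-entropy would not allow.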

In [13]:
# Alternative way to write the Keras model (I am more used to this style)

model = Sequential()
model.add(Embedding(max_features, embed_size, input_length=maxlen, weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(6, activation="sigmoid")) # six labels (multi-label classification), so the final layer outputs 6 values

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Check the model structure before training
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 50)           1000000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 100, 100)          40400     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 306       
Total params: 1,045,756
Trainable params: 1,045,756
Non-trainable params: 0
_________________________________________________________________
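
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias); the helper functions are hypothetical and exist only to show the arithmetic.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    return inputs * outputs + outputs

embedding = embedding_params(20000, 50)   # max_features x embed_size
bilstm = 2 * lstm_params(50, 50)          # one LSTM per direction
dense_hidden = dense_params(100, 50)      # pooled 100-dim vector -> 50 units
dense_out = dense_params(50, 6)           # 50 units -> 6 labels
total = embedding + bilstm + dense_hidden + dense_out
```

The values are 1,000,000, 40,400, 5,050, and 306, which add up to the 1,045,756 total reported by `model.summary()`. Almost all of the parameters sit in the embedding layer.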


Now we're ready to fit our model! Use validation_split when not submitting.

In [14]:
# Run this cell if memory is not a problem
model.fit(X_t, y, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x23868412eb8>
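
The sample counts in the log follow from `validation_split=0.1`. Keras documents that the validation rows are taken from the end of the data before shuffling; the sketch below assumes the split index is obtained by truncating `n_samples * (1 - validation_split)`, which reproduces the numbers above. `split_for_validation` is a hypothetical helper.

```python
def split_for_validation(n_samples, validation_split):
    """Return (n_train, n_val) when the last fraction of the rows is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

# Total number of training rows, taken from the log: 143613 + 15958.
n_train, n_val = split_for_validation(143613 + 15958, 0.1)
```

Since the held-out rows are simply the last 10% of `train.csv`, shuffling the data first is worth considering if the file is ordered in some meaningful way.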

In [13]:
# If training crashes from lack of memory, save the model after each epoch, restart the kernel --> resume training
gc.collect()

model.fit(X_t, y, batch_size=32, initial_epoch=0, epochs=1, validation_split=0.1)

# Save the model (for resuming training)
model.save('model_tmp1.h5', include_optimizer=False)

gc.collect()

Train on 143613 samples, validate on 15958 samples
Epoch 1/1


0

In [15]:
# Save the model weights
model.save_weights('param_imp_LSTM.hdf5')

In [37]:
# Resume training
# Kernel Restart --> run everything from the top through model construction --> skip the first training run and continue from here

gc.collect()

# Load the saved model
from keras.models import load_model
model = load_model('model_tmp1.h5', compile=False)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_t, y, batch_size=32, initial_epoch=1, epochs=2, validation_split=0.1)

gc.collect()

Train on 143613 samples, validate on 15958 samples
Epoch 2/2


4
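
The `initial_epoch` and `epochs` arguments are easy to misread: `epochs` is the index at which training stops, not the number of epochs to run. The sketch below mimics the progress labels under that reading; `epochs_to_run` is a hypothetical helper, not a Keras function.

```python
def epochs_to_run(initial_epoch, epochs):
    """Return the 1-based epoch labels that a fit call would print."""
    return [f"Epoch {e + 1}/{epochs}" for e in range(initial_epoch, epochs)]

first_run = epochs_to_run(initial_epoch=0, epochs=1)     # before the kernel restart
resumed_run = epochs_to_run(initial_epoch=1, epochs=2)   # after reloading the model
```

This gives `Epoch 1/1` for the first call and `Epoch 2/2` for the resumed call, matching the two logs. Note that the model was saved with `include_optimizer=False`, so the resumed run starts with a fresh Adam state.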

In [15]:
# Save the model weights
model.save_weights('param_imp_LSTM.hdf5')

In [16]:
# Load the trained weights
#model.load_weights('param_imp_LSTM.hdf5') 

And finally, get predictions for the test set and prepare a submission CSV:

In [17]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)



In [18]:
y_test

array([[9.96388197e-01, 2.19282776e-01, 9.43920493e-01, 6.54944330e-02,
        8.16871285e-01, 1.26922444e-01],
       [1.33802605e-04, 5.00541887e-07, 2.67888681e-05, 1.15297830e-07,
        2.21633236e-05, 1.94943277e-06],
       [8.59084714e-04, 5.15509464e-06, 1.89128608e-04, 4.49071649e-06,
        1.25537394e-04, 1.20692139e-05],
       ...,
       [2.84381531e-04, 8.21534059e-07, 5.74402657e-05, 3.97292297e-07,
        3.92255133e-05, 3.09141615e-06],
       [5.74679638e-04, 6.29436090e-06, 1.26549858e-04, 2.17145134e-06,
        1.12253794e-04, 9.45359716e-05],
       [9.74714994e-01, 1.73104275e-02, 8.42632592e-01, 5.47187962e-03,
        5.37742078e-01, 2.44914903e-03]], dtype=float32)
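
Each row of `y_test` holds six independent probabilities, one per entry of `list_classes`, so a row does not have to sum to 1. As a reading aid, the sketch below turns the first two rows (rounded) into label names using a 0.5 cutoff. The cutoff and the helper `flagged_labels` are hypothetical; the submission itself keeps the raw probabilities.

```python
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def flagged_labels(probabilities, threshold=0.5):
    """Return the labels whose predicted probability reaches the threshold."""
    return [name for name, p in zip(list_classes, probabilities) if p >= threshold]

# First and second rows of y_test from the output above, rounded.
row_0 = [0.9964, 0.2193, 0.9439, 0.0655, 0.8169, 0.1269]
row_1 = [1.3e-04, 5.0e-07, 2.7e-05, 1.2e-07, 2.2e-05, 1.9e-06]

labels_0 = flagged_labels(row_0)
labels_1 = flagged_labels(row_1)
```

The first comment is flagged as `toxic`, `obscene`, and `insult` at this cutoff, while the second is flagged for nothing.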

In [19]:
sample_submission = pd.read_csv(f'{path}sample_submission.csv')
sample_submission[list_classes] = y_test

In [20]:
sample_submission.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.996388,0.2192828,0.94392,0.06549443,0.816871,0.126922
1,0000247867823ef7,0.000134,5.005419e-07,2.7e-05,1.152978e-07,2.2e-05,2e-06
2,00013b17ad220c46,0.000859,5.155095e-06,0.000189,4.490716e-06,0.000126,1.2e-05
3,00017563c3f7919a,0.000457,1.79263e-06,0.000114,7.521456e-07,0.000103,3e-06
4,00017695ad8997eb,0.002536,1.117108e-05,0.000456,1.262597e-05,0.000398,2.7e-05


In [21]:
sample_submission.to_csv(f'{path}submission.csv', index=False)