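## Building a text classification model end-to-end with Keras
- Data processing
    - Use a Keras Embedding layer so the model learns the semantics of the text data during training
- Models
    - CNN-based
    - RNN-based
    - Transformer-based
- Dataset
    - IMDB movie review data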

In [1]:
import tensorflow as tf
import numpy as np

### Getting the data

In [2]:
# Download and unpack the dataset with shell commands
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  13.3M      0  0:00:06  0:00:06 --:--:-- 18.9M


In [3]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [4]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [5]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [6]:
# cat reads each file in order and writes its contents to standard output (the screen), e.g. `cat filename` prints the contents of filename.
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

In [7]:
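# rm deletes files (-r recurses into directories); drop the unlabeled reviews so only the pos/neg class folders remain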
!rm -r aclImdb/train/unsup

### Build the datasets from the directory with tf.keras.preprocessing.text_dataset_from_directory
- Generates a tf.data.Dataset from text files in a directory.
    - Intended for plain-text (.txt) files

In [8]:
batch_size = 64

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    directory="aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=42
)
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    directory="aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=42
)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


In [9]:
# Number of batches per epoch

print(
    "Number of batches in raw_train_ds: %d"
    % tf.data.experimental.cardinality(raw_train_ds)
)
print(
    "Number of batches in raw_val_ds: %d" % tf.data.experimental.cardinality(raw_val_ds)
)
print(
    "Number of batches in raw_test_ds: %d"
    % tf.data.experimental.cardinality(raw_test_ds)
)

Number of batches in raw_train_ds: 313
Number of batches in raw_val_ds: 79
Number of batches in raw_test_ds: 391
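These counts follow directly from the split sizes and `batch_size`: the last, partial batch is kept, so the number of batches is the ceiling of samples divided by batch size. A minimal check in plain Python (the helper name `num_batches` is illustrative and not part of TensorFlow):

```python
import math


def num_batches(num_samples, batch_size):
    """Batches per epoch when the final, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


batch_size = 64
# Split sizes reported by text_dataset_from_directory above
for name, num_samples in [("train", 20000), ("val", 5000), ("test", 25000)]:
    print(name, num_batches(num_samples, batch_size))
```

For example, 20000 / 64 = 312.5, so the training split needs 313 batches and the last one holds only 32 reviews.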


In [10]:
# Inspect one batch

for text, label in raw_train_ds.take(count=1):
    for i in range(5):
        print(text.numpy()[i])         # tensor to numpy
        print(label.numpy()[i])

b"First of all, I liked very much the central idea of locating the '' intruders'', Others in the fragile Self, on various levels - mainly subconscious but sometimes more allegorical. In fact the intruders are omnipresent throughout the film : in the Swiss-French border where the pretagonist leads secluded life; in the his recurring daydream and nightmare; inside his ailing body after heart transplantation.... In the last half of the film, he becomes intruder himself, returning in ancient french colony in the hope of atoning for the past. <br /><br />The overall tone is bitter rather than pathetic, full of regrets and guilts, sense of failure being more or less dominant. This is a quite grim picture of an old age, ostensibly self-dependent but hopelessly void and lonely inside. The directer composes the images more to convey passing sensations of anxiety and desire than any explicit meanings. Some of them are mesmerizing, not devoid of humor though, kind of absurdist play only somnambul

> Note that the raw text contains `<br />` tags.

### Data processing and preparation

In [12]:
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re

def text_preprocessing(input_data):
    input_data = tf.strings.lower(input=input_data)     # lowercase
    processed_input_data = tf.strings.regex_replace(    # replace the <br /> tag with a space
        input=input_data,
        pattern="<br />",
        rewrite=" "
    )
    return tf.strings.regex_replace(                    # strip punctuation
        input=processed_input_data,
        pattern="[%s]" % re.escape(string.punctuation),
        rewrite=''
    )

# Text-processing settings; max_tokens and sequence_length are passed to TextVectorization
max_tokens = 20000
embedding_dim = 128
sequence_length = 500


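# The main Keras class for text preprocessing
vectorizer_layer = TextVectorization(
    max_tokens=max_tokens,
    standardize=text_preprocessing,     # the default is lowercase + strip punctuation
    output_mode='int',                  # map each token to an integer index; index 0 is reserved for padding (masked positions)
    output_sequence_length=sequence_length      # pad or truncate every sequence to this length
)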


# Vectorization
text_ds = raw_train_ds.map(lambda x, y: x)  # keep the text only
vectorizer_layer.adapt(text_ds)             # builds the vocabulary from the data (comparable to fitting)
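To make the vectorization step concrete, here is a framework-free sketch of roughly what the layer does in `'int'` mode: standardize, split on whitespace, build a frequency-ranked vocabulary, then map tokens to indices and pad. The helper names (`standardize`, `build_vocab`, `encode`) are made up for illustration, and the tie-breaking between equally frequent tokens is a simplification. The convention that index 0 is padding and index 1 is the out-of-vocabulary token follows the Keras layer.

```python
import re
import string
from collections import Counter

PAD, OOV = 0, 1  # index 0 = padding, index 1 = out-of-vocabulary


def standardize(text):
    """Mirror text_preprocessing: lowercase, drop <br />, strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)


def build_vocab(texts, max_tokens):
    """Rank tokens by frequency; two slots are reserved for PAD and OOV."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def encode(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(tok, OOV) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))


# Toy corpus, purely illustrative
corpus = ["Great movie!<br />Great cast.", "Terrible movie..."]
vocab = build_vocab(corpus, max_tokens=10)
print(standardize(corpus[0]))
print(encode("Great unknown movie", vocab, sequence_length=5))
```

The unseen word maps to the OOV index and the short sentence is right-padded with zeros, which is why the vectorized batches later in the notebook end in runs of `0`.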

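### Two ways to vectorize the text data
- *Make it part of the model*
    - In practice an all-in-one pipeline is more convenient
- Apply it to the dataset
    - This option makes better use of the CPU: preprocessing runs asynchronously in the tf.data pipeline while the model trains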

In [14]:
# Option 2: vectorize inside the dataset pipeline

def vectorizer_text(text, label):
    text = tf.expand_dims(input=text, axis=-1)
    return vectorizer_layer(text), label


# Vectorize each split
train_ds = raw_train_ds.map(vectorizer_text)
val_ds = raw_val_ds.map(vectorizer_text)
test_ds = raw_test_ds.map(vectorizer_text)

# Performance trick: cache the vectorized data and prefetch batches
train_ds = train_ds.cache().prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=tf.data.experimental.AUTOTUNE)   # fixed: previously reused train_ds as the validation set
test_ds = test_ds.cache().prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

### Building the models

In [15]:
from tensorflow.keras import layers

# 1. CNN-based, using the functional API

inputs = tf.keras.Input(shape=(sequence_length, ), dtype='int64')
x = layers.Embedding(input_dim=max_tokens, output_dim=embedding_dim, input_length=sequence_length)(inputs)
x = layers.Dropout(0.5)(x)
x = layers.Conv1D(filters=128, kernel_size=7, padding='valid', activation='relu', strides=3)(x)
x = layers.Conv1D(filters=128, kernel_size=7, padding='valid', activation='relu', strides=3)(x)
x = layers.GlobalAveragePooling1D()(x)      # average over the time axis to get one fixed-size vector (used here instead of Flatten)
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation='sigmoid', name='predictions')(x)

cnn_model = tf.keras.Model(inputs, outputs)

# compile
cnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

In [16]:
cnn_model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 500)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 500, 128)          2560000   
_________________________________________________________________
dropout (Dropout)            (None, 500, 128)          0         
_________________________________________________________________
conv1d (Conv1D)              (None, 165, 128)          114816    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 53, 128)           114816    
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               16512 
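The shapes and parameter counts in this summary can be reproduced by hand. With `padding='valid'`, a Conv1D output length is `floor((L - kernel_size) / strides) + 1`, and its parameter count is `kernel_size * in_channels * filters + filters`. A quick check in plain Python (the helper names are illustrative, not Keras functions):

```python
def conv1d_out_len(length, kernel_size, strides):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // strides + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_features, units):
    """Weight matrix plus one bias per unit."""
    return in_features * units + units


max_tokens, embedding_dim, sequence_length = 20000, 128, 500

print("embedding params:", max_tokens * embedding_dim)
len1 = conv1d_out_len(sequence_length, kernel_size=7, strides=3)
len2 = conv1d_out_len(len1, kernel_size=7, strides=3)
print("conv output lengths:", len1, len2)
print("conv params:", conv1d_params(7, 128, 128))
print("dense params:", dense_params(128, 128))
```

Both convolutions have the same parameter count because each sees 128 input channels: the first from the embedding dimension, the second from the 128 filters of the first.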

In [None]:
# Train

epochs = 5
cnn_model.fit(
    x=train_ds,
    validation_data=val_ds,
    epochs=epochs
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f94064b49d0>

In [None]:
cnn_model.evaluate(test_ds)



[0.41873037815093994, 0.8611999750137329]

In [None]:
# predict returns probabilities
# p > 0.5 ---> class 1, otherwise class 0
cnn_model.predict(test_ds)

array([[0.57410717],
       [0.39040452],
       [0.98512137],
       ...,
       [0.10803606],
       [0.9998933 ],
       [0.01814001]], dtype=float32)
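Turning these probabilities into class labels is a simple threshold. A minimal sketch in plain Python, using the six values printed above (`to_labels` is an illustrative helper; on the NumPy array returned by `predict`, `(probs > 0.5).astype("int32")` does the same thing):

```python
def to_labels(probabilities, threshold=0.5):
    """Class 1 when the probability exceeds the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# The six probabilities visible in the output above
probs = [0.57410717, 0.39040452, 0.98512137, 0.10803606, 0.9998933, 0.01814001]
print(to_labels(probs))
```

The first prediction (about 0.57) is barely above the threshold, so it is a low-confidence positive compared with the 0.9999 further down.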

In [None]:
for x, y in test_ds.take(count=1):
    print(x, y)

tf.Tensor(
[[  29  517  875 ... 1186    6    4]
 [   2  198 2443 ...    0    0    0]
 [  11   28  201 ...    0    0    0]
 ...
 [  45   22   25 ...    0    0    0]
 [4083 2890    7 ...    0    0    0]
 [  11   19    7 ...    0    0    0]], shape=(64, 500), dtype=int64) tf.Tensor(
[1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 0
 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 1 0 1 0 1 0 1], shape=(64,), dtype=int32)


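### RNN-based model
- Each time step depends on the previous one, so the sequence cannot be processed in parallel and training is slow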

In [None]:
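def create_model(model_type='rnn'):
    """
        Select the architecture via model_type; this could later be refactored with a design pattern.
    """
    if model_type == 'cnn':
        # the CNN is built separately with the functional API above
        raise NotImplementedError("model_type='cnn' is not implemented in create_model")
    elif model_type == 'rnn':
        inputs = layers.Input(shape=(sequence_length, ), dtype='int64')
        x = layers.Embedding(input_dim=max_tokens, output_dim=embedding_dim, input_length=sequence_length)(inputs)
        # x = layers.GRU(units=64, dropout=0.5, return_sequences=True, recurrent_dropout=0.2)(x)
        x = layers.GRU(units=32, dropout=0.5, recurrent_dropout=0.2, activation='relu')(x)
        outputs = layers.Dense(1, activation='sigmoid')(x)
    else:
        # transformer-based: not implemented in this function.
        # Raise instead of `pass`, otherwise the global `inputs`/`outputs`
        # left over from the CNN cell would be picked up silently below.
        raise NotImplementedError("model_type=%r is not implemented in create_model" % model_type)

    model = tf.keras.Model(inputs, outputs)
    return model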

In [None]:
epochs = 5

rnn_model = create_model()
rnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
rnn_model.summary()

Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, 500)]             0         
_________________________________________________________________
embedding_12 (Embedding)     (None, 500, 128)          2560000   
_________________________________________________________________
gru_7 (GRU)                  (None, 32)                15552     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33        
Total params: 2,575,585
Trainable params: 2,575,585
Non-trainable params: 0
_________________________________________________________________
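The GRU's 15,552 parameters can be derived by hand. A GRU has three weight sets (update gate, reset gate, candidate state), each with an input kernel, a recurrent kernel, and a bias. The sketch below assumes the TF2 Keras default `reset_after=True`, which keeps two bias vectors per weight set; `gru_params` is an illustrative helper, not a Keras function.

```python
def gru_params(input_dim, units, reset_after=True):
    """Parameter count of a GRU layer.

    Three weight sets (update, reset, candidate), each with an input kernel,
    a recurrent kernel, and bias. With reset_after=True there are separate
    input and recurrent biases, i.e. 2 * units bias terms per set.
    """
    bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + bias)


embedding_params = 20000 * 128          # max_tokens * embedding_dim
gru = gru_params(input_dim=128, units=32)
dense = 32 * 1 + 1                      # weights + bias of the sigmoid output
print(gru, dense, embedding_params + gru + dense)
```

Almost all of the 2.58M parameters sit in the embedding table; the recurrent layer itself is small, so the slow training comes from the sequential computation over 500 time steps rather than from model size.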


In [None]:
# too slow to train here
# rnn_model.fit(
#     x=train_ds,
#     validation_data=val_ds,
#     epochs=epochs
# )

> to be continued...

In [21]:
from transformer_block import TransformerBlock, TokenAndPositionEmbedding


def create_model(num_transformers=6):
    inputs = layers.Input(shape=(sequence_length,), dtype='int64')
    x = TokenAndPositionEmbedding(maxlen=sequence_length, vocab_size=max_tokens, embed_dim=32)(inputs)
    # the original Transformer paper stacks 6 blocks
    for _ in range(num_transformers):
        x = TransformerBlock(embed_dim=32, num_heads=8, ff_dim=32)(x)
    x = layers.GlobalAveragePooling1D()(x)      # dimension reduction or flatten
    x = layers.Dropout(0.5)(x)
    output = layers.Dense(units=1, activation='sigmoid')(x)

    model = tf.keras.Model(inputs, output)
    return model

transformer_model = create_model()
transformer_model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 500)]             0         
_________________________________________________________________
token_and_position_embedding (None, 500, 32)           656000    
_________________________________________________________________
transformer_block_8 (Transfo (None, 500, 32)           6464      
_________________________________________________________________
transformer_block_9 (Transfo (None, 500, 32)           6464      
_________________________________________________________________
transformer_block_10 (Transf (None, 500, 32)           6464      
_________________________________________________________________
transformer_block_11 (Transf (None, 500, 32)           6464      
_________________________________________________________________
transformer_block_12 (Transf (None, 500, 32)           6464
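The embedding count in this summary is straightforward: one table for tokens and one for positions. The per-block count of 6,464 depends on the `transformer_block` module, which is not shown here, so the breakdown below is an assumption: it supposes each block has four `embed_dim x embed_dim` projections in attention (query, key, value, output), a two-layer feed-forward network, and two LayerNormalization layers, as in the Keras "text classification with Transformer" example. Under that assumption the numbers match the summary.

```python
def dense_params(in_features, units):
    """Weight matrix plus one bias per unit."""
    return in_features * units + units


maxlen, vocab_size, embed_dim, ff_dim = 500, 20000, 32, 32

# TokenAndPositionEmbedding: token table + position table
token_emb = vocab_size * embed_dim
pos_emb = maxlen * embed_dim
print("embedding:", token_emb + pos_emb)

# Assumed TransformerBlock layout (the module itself is not shown)
attention = 4 * dense_params(embed_dim, embed_dim)   # Q, K, V, output projections
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)                    # gamma and beta, two layers
block = attention + ffn + layer_norms
print("per block:", block)
```

In this layout the projection sizes depend only on `embed_dim`, so changing `num_heads` splits the same 32 dimensions across more heads without changing the parameter count.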

In [23]:
transformer_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
transformer_model.fit(
    x=train_ds,
    validation_data=val_ds,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f5301ff97d0>

In [24]:
transformer_model.evaluate(test_ds)



[0.5669289231300354, 0.8374000191688538]