<a href="https://colab.research.google.com/github/TA-aiacademy/course_3.0/blob/v2-5_nlp/09_v2-5_NLP/Part3/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNN text classification

In [None]:
import numpy as np
import os
import pandas as pd
import time

from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import tensorflow as tf
import tensorflow_datasets as tfds
print('tensorflow version: ', tf.__version__)

# Use only the first GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
# Download and unzip the data
!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part3.zip
!unzip -q NLP_part3.zip

In [None]:
output_dir = "Data"
zh_vocab_file = os.path.join(output_dir, "zh_vocab")
checkpoint_path = os.path.join(output_dir, "checkpoints.h5")

## Load Data

In [None]:
ptt_gossip = pd.read_csv('Data/ptt_gossip.csv')
ptt_gossip.drop(columns='idx', inplace=True)
print(ptt_gossip.shape)
ptt_gossip.head()

## Filter sentence length

Keep only the sentences that are shorter than `max_length` characters.

In [None]:
max_length = 256

ptt_gossip = ptt_gossip[ptt_gossip.sentence.str.len() < max_length]
ptt_gossip.reset_index(drop=True, inplace=True)
print(ptt_gossip.shape)
ptt_gossip.head()

## Train validation split

In [None]:
valid_size = 0.2
X_train, X_valid, y_train, y_valid = train_test_split(ptt_gossip['sentence'],
                                                      ptt_gossip['label'],
                                                      test_size=valid_size,
                                                      shuffle=True)

## Pre-processing

1. Convert the data into a `tf.data.Dataset` of tensors.
2. Tokenize with `tfds.deprecated.text.SubwordTextEncoder`. Setting `max_subword_length=1` makes the tokenization `character-level`.
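To make the idea of `character-level` tokenization concrete, here is a minimal pure-Python sketch. It is a simplified stand-in, not the algorithm `SubwordTextEncoder` actually uses: the helper names (`build_char_vocab`, `encode_chars`, `decode_chars`), the single unknown id, and the toy corpus are all assumptions made for illustration.

```python
# Simplified stand-in for a character-level tokenizer (hypothetical helpers,
# not the tfds implementation).
PAD_ID = 0  # id 0 is kept free so it can be used for padding later


def build_char_vocab(corpus):
    """Assign an integer id (starting at 1) to every distinct character."""
    chars = sorted({ch for sentence in corpus for ch in sentence})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode_chars(sentence, vocab, unk_id):
    """Map each character to its id; unseen characters map to unk_id."""
    return [vocab.get(ch, unk_id) for ch in sentence]


def decode_chars(token_ids, vocab):
    """Invert encode_chars, dropping padding and unknown ids."""
    id_to_char = {idx: ch for ch, idx in vocab.items()}
    return "".join(id_to_char.get(i, "") for i in token_ids)


corpus = ["今天天氣很好", "天氣不好"]  # toy corpus, for illustration only
vocab = build_char_vocab(corpus)
unk_id = len(vocab) + 1  # assumption: one shared id for unseen characters

ids = encode_chars("天氣好", vocab, unk_id)
print(ids)
print(decode_chars(ids, vocab))
```

Each sentence becomes a list with one id per character, so its encoded length equals its character count. This is why filtering by `str.len()` earlier also bounds the sequence length seen by the model, at least approximately for the real encoder.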

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))

In [None]:
%%time
try:
    tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.load_from_file(zh_vocab_file)
    print('Load Chinese vocabulary: %s' % zh_vocab_file)
except Exception:  # loading failed (e.g. no saved vocabulary yet), so build one
    print('Build Chinese vocabulary: %s' % zh_vocab_file)
    tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((x.numpy() for x, y in train_dataset),
                                                                             max_subword_length=1,
                                                                             target_vocab_size=2**13)
    tokenizer_zh.save_to_file(zh_vocab_file)

In [None]:
print('Vocabulary size: ', tokenizer_zh.vocab_size)

In [None]:
tokenizer_zh

### Example

In [None]:
sentence = '文瑋助教真壯'
token_id = tokenizer_zh.encode(sentence)

print('Sentence token_id: ', token_id)
print('Tokenization: ', [tokenizer_zh.decode([t]) for t in token_id])

## Convert to token_id

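Training requires every character to be converted to a `token_id`, so we use `.map` to convert `train_dataset` and `valid_dataset` into sequences of `token_id`.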

In [None]:
def encode(sentence, label):
    zh_id = tokenizer_zh.encode(sentence.numpy())
    return (tf.cast(zh_id, tf.int32), tf.cast(label, tf.int32))

In [None]:
def tf_encode(sentence, label):
    """
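    `Dataset.map` traces its function in graph mode, where `sentence.numpy()` is unavailable.
    Wrapping `encode` in `tf.py_function` lets it run eagerly and return Eager Tensors.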
    """
    return tf.py_function(encode, [sentence, label], [tf.int32, tf.int32])

In [None]:
train_dataset = train_dataset.map(tf_encode)
valid_dataset = valid_dataset.map(tf_encode)

In [None]:
tmp_valid = next(iter(valid_dataset))

In [None]:
pprint(tmp_valid)

In [None]:
pprint(tokenizer_zh.decode(tmp_valid[0].numpy()))

## Input pipeline

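Here we use the `tf.data.Dataset` created with `from_tensor_slices` as a `generator` that yields `batch_size` sentences at each training step. A `generator` is usually needed when the data is too large to fit in memory at once; here it is used mainly for demonstration.

1. `.shuffle()`: fills a `buffer` with `buffer_size` elements and then draws elements from it at random, refilling it from the dataset. Because `.shuffle()` is applied before batching here, the elements are individual sentences. A suitable `buffer_size` matters: if it is too small, the `buffer` may hold sentences of only one class (for example, when the data is sorted by label). The simplest choice is to set `buffer_size` to the number of training sentences (`len(X_train)`), which gives a full shuffle.

2. `.padded_batch()`: groups `batch_size` sentences into a `batch` and pads each one to the length of the longest sentence in that `batch`, so the batch has the rectangular shape that training requires.

3. `.repeat()`: repeats the dataset `epochs` times, since training consumes one pass over the data per epoch.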
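The behavior of a shuffle buffer and of padded batching can be sketched in plain Python. This is only an illustration of the semantics, under the assumption that padding uses id `0`; the functions `shuffle_buffer` and `padded_batch` below are hypothetical helpers, not the `tf.data` implementation.

```python
import random


def shuffle_buffer(items, buffer_size, seed=0):
    """Yield items in an order randomized through a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


def _pad(batch, pad_id):
    """Pad every sequence in the batch to the longest one in that batch."""
    max_len = max(len(ids) for ids, _ in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return padded, labels


def padded_batch(examples, batch_size, pad_id=0, drop_remainder=False):
    """Group (token_ids, label) pairs into padded batches."""
    batch = []
    for token_ids, label in examples:
        batch.append((token_ids, label))
        if len(batch) == batch_size:
            yield _pad(batch, pad_id)
            batch = []
    if batch and not drop_remainder:
        yield _pad(batch, pad_id)


# Toy (token_ids, label) pairs, for illustration only.
examples = [([5, 2], 0), ([7], 1), ([3, 9, 4], 1), ([8, 1], 0), ([6], 1)]

shuffled = list(shuffle_buffer(examples, buffer_size=len(examples)))
for tokens, labels in padded_batch(shuffled, batch_size=2, drop_remainder=True):
    print(tokens, labels)
```

With `buffer_size=1` the sketch returns the items in their original order, which shows why a small buffer gives little randomness. Note also that the padded length differs from batch to batch, because each batch is padded only to its own longest sequence.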

In [None]:
buffer_size = len(X_train)

embedding_size = 256
rnn_units = 512

batch_size = 64
epochs = 10

In [None]:
train_dataset = train_dataset.shuffle(buffer_size).padded_batch(batch_size, padded_shapes=([-1], []), drop_remainder=True).repeat(epochs)
valid_dataset = valid_dataset.padded_batch(batch_size, padded_shapes=([-1], []))

### Example

Here we call `iter` on the `generator` to look at a single `batch`.

In [None]:
tmp_generator = iter(train_dataset)
tmp_x, tmp_y = next(tmp_generator)

print('Sentence.shape: ', tmp_x.shape)
print(tmp_x)
print('-'*20)
print('Label.shape: ', tmp_y.shape)
print(tmp_y)

## Define LSTM model

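`tensorflow2.0.0` runs in `eager mode` by default, which makes it easier to `debug` a model and inspect intermediate values while writing it.

The model is built with `tf.keras`. Note that an `lstm` layer expects input of shape `(batch_size, timesteps, feature_size)`. Four parameters deserve attention: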

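1. `embedding_size`: the dimension of each character's embedding vector.
2. `rnn_units`: the number of units in the `lstm` layer, which is also the dimension of its `hidden_state`.
3. `return_sequences`: whether to return the `hidden_state` at every `timestep`. If so, the output has shape `(batch_size, timesteps, rnn_units)`.
4. `return_state`: whether to additionally return the states of the last `timestep` (`hidden_state` and `cell_state`).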

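Items `3.` and `4.` overlap somewhat, since usually only the last `timestep` is used as the output. Here we set `return_sequences` to `True` and use a `slice` to take out the last `hidden_state`. Because each batch is padded to its longest sentence and no masking is applied, the last `timestep` of a shorter sentence is a padding position.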
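The relationship between `return_sequences`, `return_state`, and the last-timestep slice can be checked with a small NumPy sketch of an LSTM forward pass. This is an illustrative re-implementation under simplifying assumptions (random weights, no masking, gates packed in the order input, forget, candidate, output); `lstm_forward` is a hypothetical helper, not the Keras layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(x, W, U, b):
    """Run a single-layer LSTM over x of shape (batch, timesteps, features).

    W: (features, 4 * units), U: (units, 4 * units), b: (4 * units,).
    Returns every hidden state (batch, timesteps, units) plus the final
    hidden state and cell state, each of shape (batch, units).
    """
    batch, timesteps, _ = x.shape
    units = U.shape[0]
    h = np.zeros((batch, units))
    c = np.zeros((batch, units))
    hidden_states = []
    for t in range(timesteps):
        z = x[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(z, 4, axis=-1)  # the four gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        hidden_states.append(h)
    return np.stack(hidden_states, axis=1), h, c


rng = np.random.default_rng(0)
batch, timesteps, features, units = 2, 5, 3, 4  # toy sizes
x = rng.normal(size=(batch, timesteps, features))
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

all_h, last_h, last_c = lstm_forward(x, W, U, b)
print(all_h.shape)   # what return_sequences=True would give
print(last_h.shape)  # the final hidden state
print(np.allclose(all_h[:, -1, :], last_h))
```

The final check confirms that slicing `[:, -1, :]` from the full sequence of hidden states recovers the final `hidden_state`, which is why the two options overlap when only the last `timestep` is needed.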

Finally, a `tf.keras.layers.Dense` layer outputs the probabilities of the `2` classes.
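As a rough illustration of what the `softmax` output layer and the `sparse_categorical_crossentropy` loss used in this notebook compute for a single example, here is a pure-Python sketch. The logits are made-up values, and the functions are simplified (for instance, no probability clipping), so treat this as an assumption-laden sketch rather than the Keras implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(probs[label])


logits = [1.0, -1.0]  # hypothetical scores for class 0 and class 1
probs = softmax(logits)
loss_if_0 = sparse_categorical_crossentropy(probs, 0)
loss_if_1 = sparse_categorical_crossentropy(probs, 1)

print(probs)
print(loss_if_0, loss_if_1)
```

The loss is small when the true class receives high probability and large otherwise. The "sparse" variant takes integer labels such as `0` or `1` directly, so the labels do not need to be one-hot encoded.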

In [None]:
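def rnn_model(batch_size, rnn_units):
    # Variable-length sequences of token ids: (batch_size, timesteps)
    input_layer = tf.keras.Input(shape=[None], batch_size=batch_size)
    # One embedding vector per token id: (batch_size, timesteps, embedding_size)
    embedding_layer = tf.keras.layers.Embedding(tokenizer_zh.vocab_size, embedding_size)(input_layer)

    lstm = tf.keras.layers.LSTM(units=rnn_units,
                                activation='tanh',
                                recurrent_activation='sigmoid',
                                use_bias=True,
                                return_sequences=True,
                                return_state=False,
                                recurrent_initializer='glorot_uniform')

    # Hidden state at every timestep: (batch_size, timesteps, rnn_units)
    lstm_hidden_states = lstm(embedding_layer)

    # Keep only the last timestep: (batch_size, rnn_units)
    lstm_last_state = lstm_hidden_states[:, -1, :]

    # Class probabilities: (batch_size, 2)
    output = tf.keras.layers.Dense(2, activation='softmax', name='output')(lstm_last_state)

    return input_layer, output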

In [None]:
input_layer, output = rnn_model(batch_size,rnn_units)
model = tf.keras.Model(inputs=input_layer, outputs=output)

In [None]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
history = model.fit(train_dataset,
                    epochs=epochs,
                    steps_per_epoch=len(X_train) // batch_size,
                    validation_data=valid_dataset,
                    validation_steps=len(X_valid) // batch_size)

In [None]:
model.save(checkpoint_path)

## Testing prediction

Inspect `precision`, `recall`, `f1-score`, and the `confusion matrix` on the held-out validation set, which serves as the `testing` data here.
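To show how these metrics relate to the confusion matrix, the following pure-Python sketch computes them by hand for one class. The labels and predictions are toy values, and `confusion_counts` and `precision_recall_f1` are hypothetical helpers written for illustration. Rows are actual classes and columns are predicted classes, the same layout `confusion_matrix` uses.

```python
def confusion_counts(y_true, y_pred, num_classes=2):
    """Count examples per (actual class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def precision_recall_f1(matrix, cls):
    """Compute precision, recall, and F1 for one class from the matrix."""
    true_positive = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum
    actually_cls = sum(matrix[cls])                     # row sum
    precision = true_positive / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_positive / actually_cls if actually_cls else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [0, 0, 1, 1, 1, 0]  # toy labels, for illustration only
y_pred = [0, 1, 1, 1, 0, 0]  # toy predictions

matrix = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(matrix, cls=1)
print(matrix)
print(precision, recall, f1)
```

Precision asks how many of the sentences predicted as a class really belong to it, while recall asks how many of the sentences in that class were found. F1 is the harmonic mean of the two.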

In [None]:
valid_pred = model.predict(valid_dataset)
valid_pred_id = np.argmax(valid_pred, axis=-1)
valid_true_id = np.array(y_valid)

In [None]:
print(classification_report(y_pred = valid_pred_id, y_true = valid_true_id))

In [None]:
confm = confusion_matrix(y_pred = valid_pred_id, y_true = valid_true_id)
pd.DataFrame(confm, index=['Actual_0', 'Actual_1'], columns=['Pred_0', 'Pred_1'])