<a href="https://colab.research.google.com/github/TA-aiacademy/course_3.0/blob/v2-5_nlp/09_v2-5_NLP/Part5/02_Bert_finetune_ptt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PTT gossip classification

In this section we fine-tune the Chinese pretrained model `bert-base-chinese`.

In [None]:
!pip install transformers

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.model_selection import train_test_split
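# Import only the names this notebook uses instead of a wildcard import
from transformers import (BertTokenizer,
                          TFBertForSequenceClassification,
                          glue_convert_examples_to_features)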

In [None]:
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

### Data overview

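The data were crawled from the PTT Gossiping board (八卦版) and cleaned. A label of $0$ means the comment received fewer pushes (upvotes) than boos (downvotes), and $1$ means it received more pushes than boos, so this is a `Text classification` task (binary classification).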
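
The labeling rule can be illustrated with a minimal sketch. The function name `label_comment` and the toy counts are hypothetical and not part of the dataset; the notebook does not say how ties are handled, so the sketch returns `None` for them as an assumption.

```python
def label_comment(push_count, boo_count):
    """Return 1 if pushes outnumber boos, 0 if boos outnumber pushes.

    Ties are not covered by the description, so None is returned
    for them (an assumption made for this sketch only).
    """
    if push_count > boo_count:
        return 1
    if push_count < boo_count:
        return 0
    return None


# Toy (push, boo) pairs, invented for illustration
examples = [(12, 3), (1, 8), (4, 4)]
labels = [label_comment(push, boo) for push, boo in examples]
print(labels)  # [1, 0, None]
```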

In [None]:
# Download and unpack the data
!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part5.zip
!unzip -q NLP_part5.zip

In [None]:
ptt = pd.read_csv('Data/ptt_gossip.csv')

bert_max_length = 512
ptt['sentence'] = [t[:bert_max_length] for t in ptt.sentence]

In [None]:
ptt.head()

In [None]:
"""
80% training set, 20% validation set
"""
train_size = 0.8

mask = np.random.rand(len(ptt)) < train_size
train_dataset = ptt[mask]
valid_dataset = ptt[~mask]
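
The random mask above gives a split that is only approximately 80/20 and changes on every run. As a hedged alternative, the sketch below shuffles row indices with a fixed seed so the split is exact and reproducible. The helper `split_indices` and the seed value are hypothetical choices for illustration, not part of the original notebook.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them at train_fraction."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    # Sort each part so rows keep their original relative order
    return sorted(indices[:cut]), sorted(indices[cut:])


train_idx, valid_idx = split_indices(10)
print(len(train_idx), len(valid_idx))  # 8 2
```

With a DataFrame, the two index lists could then be applied as `ptt.iloc[train_idx]` and `ptt.iloc[valid_idx]`.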

In [None]:
train_size = len(train_dataset)
valid_size = len(valid_dataset)

In [None]:
print('Train size: ', train_size)
print('Valid size: ', valid_size)

### Convert to tensor

The pretrained `Transformer` models accept `tf.Tensor` inputs, so the dataset needs to be converted to `tf.Tensor` format.

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_dataset))
valid_dataset = tf.data.Dataset.from_tensor_slices(dict(valid_dataset))

### Training data format

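We use `glue_convert_examples_to_features` to convert the dataset into a format the model can read. Because this is binary classification, we use the `cola` task: `cola` is one of the tasks `bert` is fine-tuned on and is also a binary classification task, so we can reuse its input format for the conversion. For Chinese, current pretrained models tokenize at the `character-level`, which yields roughly one token per character, so we raise `max_length` to $256$. The table below lists the parameter limits for `finetune` on a `Titan X 12G`: for each model and sentence length it gives the largest usable `batch_size`. Keep these hardware limits in mind; a `1080ti` has `11G`, which allows a sentence length of `256` with a `batch_size` of 16. Note that the cells below actually set `max_length = 512`, the longest input `bert-base-chinese` accepts, with a `batch_size` of 6.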

<img src="https://hackmd.io/_uploads/Hybk-351p.png" alt="Drawing" style="width: 250px;"/>
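
To make the converted format concrete, here is a minimal, dependency-free sketch of what character-level encoding with padding looks like. It is only an illustration: the function `encode_chars`, the toy vocabulary, and its ids are all invented here, and the real `BertTokenizer` uses its own vocabulary file and WordPiece rules.

```python
def encode_chars(text, vocab, max_length):
    """Toy character-level encoder that mimics BERT-style input features."""
    # Reserve two positions for [CLS] and [SEP], then truncate
    chars = list(text)[: max_length - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens

    # Pad on the right up to max_length; padding gets attention 0
    pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": [0] * max_length,  # single-sentence input
    }


# Hypothetical vocabulary; real ids in bert-base-chinese differ
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "好": 4, "壯": 5}
features = encode_chars("好壯啊", toy_vocab, max_length=8)
print(features["input_ids"])       # [2, 4, 5, 1, 3, 0, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every example is padded to the same `max_length`, which is why memory use grows with `max_length` and the usable `batch_size` shrinks.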

In [None]:
max_length = 512
task = 'cola'

train_dataset = glue_convert_examples_to_features(train_dataset,
                                                  tokenizer,
                                                  max_length,
                                                  task)
valid_dataset = glue_convert_examples_to_features(valid_dataset,
                                                  tokenizer,
                                                  max_length,
                                                  task)

In [None]:
train_temp = next(iter(train_dataset))

In [None]:
train_temp

In [None]:
buffer_size = 100
train_bz = 6
epochs = 3
valid_bz = 6

train_gen = train_dataset.shuffle(buffer_size).batch(train_bz).repeat(epochs)
valid_gen = valid_dataset.batch(valid_bz)

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,
                                     epsilon=1e-8,
                                     clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                     reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
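
For intuition about the loss configured above, the sketch below computes sparse categorical cross-entropy directly from logits and averages it over the batch, which is what `from_logits=True` with `SUM_OVER_BATCH_SIZE` amounts to. The helper name and the toy logits are hypothetical; Keras performs the equivalent computation on tensors.

```python
import math


def sparse_ce_from_logits(logits, labels):
    """Mean over the batch of -log softmax(logits)[label]."""
    losses = []
    for row, label in zip(logits, labels):
        # Numerically stable log-sum-exp
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Toy batch: two examples, two classes (values invented)
batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [0, 1]
loss_value = sparse_ce_from_logits(batch_logits, batch_labels)
print(loss_value)
```

The first example is predicted confidently and correctly, so it contributes a small loss; the second has equal logits, so it contributes $\log 2$.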

In [None]:
history = model.fit(train_gen,
                    epochs=epochs,
                    steps_per_epoch=train_size//train_bz,
                    validation_data=valid_gen,
                    validation_steps=valid_size//valid_bz)
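
The interplay between `.repeat(epochs)` and `steps_per_epoch` can be checked with simple arithmetic. The sketch assumes `.batch()` keeps the final partial batch (its default) and that `fit` consumes `steps_per_epoch` batches per epoch without restarting the dataset; the function names and the example size of 1000 rows are hypothetical.

```python
def batches_available(n_examples, batch_size, epochs):
    """Batches produced by dataset.batch(batch_size).repeat(epochs)."""
    # Each pass yields ceil(n_examples / batch_size) batches
    per_pass = -(-n_examples // batch_size)
    return per_pass * epochs


def batches_requested(n_examples, batch_size, epochs):
    """Batches consumed by fit with steps_per_epoch = n_examples // batch_size."""
    return (n_examples // batch_size) * epochs


available = batches_available(1000, 6, 3)
requested = batches_requested(1000, 6, 3)
print(available, requested)  # 501 498
```

Because the available count is never smaller than the requested count, training under these assumptions does not run out of data before the last epoch ends.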

## Save model

In [None]:
save_path = 'save_ptt'
# Create the output directory if it does not already exist
os.makedirs(save_path, exist_ok=True)

In [None]:
model.save_pretrained(save_path)

## Evaluation

Report `precision`, `recall`, `f1-score`, and the `confusion matrix` to evaluate model performance.
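
As a reminder of how these metrics relate, the sketch below derives precision, recall, and F1 for class 1 from a 2x2 confusion matrix laid out with actual classes as rows and predicted classes as columns. The helper `binary_metrics` and the toy counts are invented for illustration and are not results from this model.

```python
def binary_metrics(confusion):
    """Precision, recall and F1 for class 1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy counts: rows are Actual_0, Actual_1; columns are Pred_0, Pred_1
toy_confusion = [[50, 10], [5, 35]]
precision, recall, f1 = binary_metrics(toy_confusion)
print(precision, recall, f1)
```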

In [None]:
valid_pred = model.predict(valid_gen)
valid_pred_ids = np.argmax(valid_pred.logits, axis=-1)

In [None]:
# Each element of valid_dataset is a (features, label) pair
valid_label = [label.numpy() for _, label in valid_dataset]

In [None]:
print(classification_report(y_pred=valid_pred_ids, y_true=valid_label))

In [None]:
confm = confusion_matrix(y_pred=valid_pred_ids, y_true=valid_label)

index = ['Actual_0', 'Actual_1']
columns = ['Pred_0', 'Pred_1']
pd.DataFrame(confm, index=index, columns=columns)

## Load model and predict

In [None]:
# Load from the directory the model was saved to above
new_model = TFBertForSequenceClassification.from_pretrained('save_ptt/')
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

In [None]:
sentence = ["文瑋助教好壯"]

test_dataset = pd.DataFrame(dict(idx=list(range(len(sentence))),
                                 label=[0]*len(sentence),
                                 sentence=sentence))

In [None]:
test_dataset

In [None]:
test_gen = tf.data.Dataset.from_tensor_slices(dict(test_dataset))

In [None]:
max_length = 512
task = 'cola'
test_gen = glue_convert_examples_to_features(test_gen, tokenizer, max_length, task)

In [None]:
test_gen = test_gen.batch(1)

In [None]:
next(iter(test_gen))

In [None]:
pred = new_model.predict(test_gen)

In [None]:
pred_ids = np.argmax(pred.logits, axis=-1)

In [None]:
print(pred_ids[0])
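
`argmax` returns only the predicted class. If a confidence score is also wanted, the logits can be passed through a softmax. The following is a minimal pure-Python sketch; the logit values are made up and are not output from the trained model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-0.3, 1.1]  # hypothetical logits for classes 0 and 1
probs = softmax(toy_logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)
print(pred_id, probs)
```

On the real output, the same idea could be applied with `tf.nn.softmax(pred.logits, axis=-1)`.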