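## News Classification Project
- Traditional machine learning
    - svm: 0.77
- Deep learning (the sample is fairly small, so the first two methods may not do well!)
    - cnn: about 0.77
    - lstm: about 0.7
    - bert: 0.90-0.92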

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']  # use a font with CJK glyphs so Chinese labels render in plots

## Load the data


In [None]:
# Load the training set
train_data = pd.read_csv('./datasets/news_clustering_train.tsv', sep='\t').drop(columns=['index'])
train_data.head()

## Explore the data

In [None]:
# Check for missing values

train_data.info()

In [None]:
print('rows: ', train_data.shape[0])
print('classes: ', train_data['class'].nunique())

In [None]:
# Title length in characters
train_data['title_length'] = train_data['title'].apply(len)
train_data.head()

In [None]:
# Range of title lengths (max, min)

print(max(train_data['title_length']), min(train_data['title_length']))

In [None]:
# Distribution of title lengths
train_data['title_length'].hist()

In [None]:
# Which classes are present
classes = train_data['class'].unique()
classes

In [None]:
# Check the class balance

train_data['class'].value_counts().plot(kind='bar', facecolor='b')

In [None]:
# Map class name -> integer id, and integer id -> class name
classes = dict(zip(classes, range(len(classes))))
num_to_class = dict(zip(range(len(classes)), classes))
print(classes, num_to_class, sep='\n')

In [None]:
# Replace each class name with its integer id
train_data['class'] = train_data['class'].apply(func=lambda x: classes[x])
train_data.head()

## Word segmentation
- jieba
- ckip

In [None]:
import jieba

train_data['jieba'] = train_data['title'].apply(func=lambda x: list(jieba.cut(x)))
train_data.head()

In [None]:
from ckiptagger import data_utils, WS

ws = WS(r'C:\Users\aband\OneDrive\桌面\NLP_marathon\NLP_practice\data')

# This is slow: the segmenter is called once per title
train_data['ckiptagger'] = train_data['title'].apply(func=lambda x: ws([x], sentence_segmentation=True, segment_delimiter_set={',','。', '，', '?', ':', '!'})[0])
train_data.head()

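## Zipf's law
- for fun

In [None]:
"""
    Zipf's law: in a natural-language corpus, a word's frequency is roughly inversely
    proportional to its frequency rank, so the top word occurs about twice as often as the second.
"""
corpus_jieba = []
for e in train_data['jieba']:
    # Collect every token into one flat list
    corpus_jieba += e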

dict_word_to_count = {}
for e in corpus_jieba:
    dict_word_to_count[e] = dict_word_to_count.get(e, 0) + 1

dict_word_to_count = sorted(dict_word_to_count.items(), key=lambda x: x[1], reverse=True)
dict_word_to_count

In [None]:
# Clearly it does not hold here!

count_all = 0
for e in dict_word_to_count:
    count_all += e[1]

for e in dict_word_to_count:
    # Share of all tokens, expressed as a percentage
    print(e[0], ':', round(e[1] / count_all * 100, 3), '%', 'count: ', e[1])
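
A rough way to test Zipf's law more directly than reading the printed list is to multiply each word's rank by its count: under an ideal Zipf distribution the product stays roughly constant. The sketch below is only illustrative. `zipf_table` is a hypothetical helper, and the toy token list is an assumption standing in for `corpus_jieba`.

```python
from collections import Counter

def zipf_table(tokens, top_n=5):
    """Return (rank, word, count, rank * count) for the top_n most frequent tokens."""
    counts = Counter(tokens).most_common(top_n)
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(counts, start=1)]

# Toy corpus (assumption): counts of 12, 6, 4, 3 follow count = 12 / rank exactly.
toy_tokens = ['a'] * 12 + ['b'] * 6 + ['c'] * 4 + ['d'] * 3

for row in zipf_table(toy_tokens, top_n=4):
    print(row)
```

On the real data, `zipf_table(corpus_jieba, top_n=20)` would show how far the product drifts. Punctuation tokens are likely to dominate the top ranks, so filtering them out first may give a fairer picture.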

In [None]:
count_all

In [None]:
import wordcloud

def draw_wordcloud(corpus):
    """
        functionality:
            draw the wordcloud
        Args:
            corpus: str
    """
    words = list(jieba.cut(corpus, cut_all = False))   # segment into words
    words = [e for e in words if e.isalpha()]        # keep alphabetic tokens only (drops digits and punctuation)
    words = [e for e in words if len(e) != 1]        # keep only words of length 2 or more
    words = " ".join(words)

    cloud = wordcloud.WordCloud(background_color = "black",
                                font_path = r"C:\Windows\Fonts\kaiu",  # font path (raw string keeps the backslashes literal)
                                scale = 2,                             # resolution multiplier
                                width = 1000,                          # image width in pixels
                                height = 600,                          # image height in pixels
                                min_font_size = 20,
                                max_words = 200)
    cloud = cloud.generate(words)                    # build the word cloud
    plt.figure(figsize = (15, 15))
    plt.axis("off")
    plt.imshow(cloud)
    plt.show()

# Concatenate every title into one string
corpus = ''.join(train_data['title'])
draw_wordcloud(corpus)

## Word cloud for each class

In [None]:
def get_class_wordcloud(df):
    # Draw one word cloud per class id
    for i in range(df['class'].nunique()):
        print(num_to_class[i])
        corpus = ''.join(df.loc[df['class'] == i, 'title'])
        draw_wordcloud(corpus)
get_class_wordcloud(train_data)

## TF-IDF
1. Use CountVectorizer to turn each sentence into term counts (TF).
2. Use TfidfTransformer to compute TF-IDF for each sentence from the scipy.sparse.csr_matrix produced in step 1.
- References
    - [TFIDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)
        - [TF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
    - [Chinese TF-IDF](https://www.cnblogs.com/cheesezh/p/8644893.html)
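
As a small illustration of the two steps, the sketch below checks on a toy corpus that CountVectorizer followed by TfidfTransformer gives the same matrix as scikit-learn's one-step TfidfVectorizer. The toy sentences are assumptions. They stand in for the space-joined jieba tokens used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy pre-segmented corpus (assumption): tokens are already separated by spaces.
toy_corpus = ['nba finals tonight', 'bank rates rise', 'nba trade rumor']

# Two-step route: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(toy_corpus)      # scipy sparse matrix
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer combines both steps.
one_step = TfidfVectorizer().fit_transform(toy_corpus)

print(counts.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

One detail worth keeping in mind: the default `token_pattern` of these vectorizers only keeps tokens of two or more word characters, so single-character words produced by the segmenter are dropped unless that pattern is changed.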

### Convert the corpus to bag-of-words vectors

In [None]:
# Join each title's jieba tokens with spaces so that CountVectorizer can split them again
corpus = [' '.join(tokens) for tokens in train_data['jieba']]
corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer    # computes term counts (TF)

# 1. Initialize the vectorizer
vectorizer = CountVectorizer(min_df=1, max_df=1.0)
# 2. Learn the vocabulary from the corpus
vectorizer.fit(corpus)
# 3. Inspect the vocabulary (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+)
bag_of_words = vectorizer.get_feature_names_out()
print('Bag of words:')
print(bag_of_words)
print(len(bag_of_words))
# 4. Convert the corpus into bag-of-words vectors
X = vectorizer.transform(corpus)
print('Vectorized corpus')
print(X.toarray())
# 5. Look up a word's index in the vocabulary
print('index of `詹姆斯` is {}'.format(vectorizer.vocabulary_.get('詹姆斯')))

In [None]:
# 1800 rows, one column per vocabulary word
X.shape

In [None]:
corpus[0]

In [None]:
# X is a scipy sparse matrix at this point, not an ndarray
X[0].toarray()

In [None]:
# Dictionary ---> word: index
print(len(vectorizer.vocabulary_))
vectorizer.vocabulary_

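### Compute TF-IDF from the bag-of-words vectors

1. Initialize a TF-IDF transformer.
2. Fit it on the corpus's bag-of-words vectors to learn the IDF weights.
3. Print the TF-IDF information: combined with the vocabulary, this shows the IDF weight of each word.
4. Convert the corpus's bag-of-words vectors into TF-IDF vectors.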
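
For reference, with its default `smooth_idf=True`, scikit-learn's TfidfTransformer computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below recomputes that by hand on a toy corpus (an assumption, not the news data) and compares it with the fitted `idf_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy pre-segmented corpus (assumption).
toy_corpus = ['nba finals tonight', 'bank rates rise', 'nba trade rumor']
counts = CountVectorizer().fit_transform(toy_corpus)

transformer = TfidfTransformer()          # defaults: smooth_idf=True, norm='l2'
transformer.fit(counts)

n_docs = counts.shape[0]
doc_freq = (counts.toarray() > 0).sum(axis=0)             # documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed IDF

print(np.allclose(manual_idf, transformer.idf_))
```

Rare terms get a larger weight: here a term that appears in one of three documents has IDF `ln(4/2) + 1`, while `nba`, which appears in two, has `ln(4/3) + 1`.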

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

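# 1. Initialize the transformer
tfidf_transfomer = TfidfTransformer()
# 2. Learn the IDF weights (the sparse matrix can be passed directly; toarray() is not needed)
tfidf_transfomer.fit(X)
# 3. Print each word with its IDF weight
for idx, word in enumerate(vectorizer.get_feature_names_out()):
    print('{}\t{}'.format(word, tfidf_transfomer.idf_[idx]))

# 4. Transform the counts into TF-IDF vectors
tfidf = tfidf_transfomer.transform(X)
print(tfidf.toarray())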

In [None]:
print(tfidf.toarray().shape)

In [None]:
x_train = tfidf.toarray()
y_train = train_data['class']

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


# Training accuracy. It says nothing about generalization, but it does confirm that each model can fit the training data.

models = [RandomForestClassifier(), KNeighborsClassifier(), LogisticRegression(), MultinomialNB(), SVC()]
for model in models:
    model.fit(x_train, y_train)
    print('score: {}'.format(accuracy_score(y_train, model.predict(x_train))))

score: 1.0
score: 0.8161111111111111
score: 0.9855555555555555
score: 0.9833333333333333
score: 0.9994444444444445


## [Cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)
- Result
    - MultinomialNB is the most even across folds and also performs best!

In [29]:
# Now add cross_validate
from sklearn.model_selection import cross_validate
import numpy as np

def cv():
    models = [RandomForestClassifier(), KNeighborsClassifier(), LogisticRegression(), MultinomialNB(), SVC()]
    for model in models:
        cv_results = cross_validate(model, x_train, y_train, cv=5)
        print(np.mean(cv_results['test_score']), cv_results['test_score'])
cv()

0.6316666666666666 [0.63888889 0.62777778 0.59166667 0.66666667 0.63333333]
0.6466666666666667 [0.65555556 0.64166667 0.66111111 0.64166667 0.63333333]
0.71 [0.73055556 0.70555556 0.69722222 0.72222222 0.69444444]
0.721111111111111 [0.725      0.73333333 0.73333333 0.71111111 0.70277778]
0.6899999999999998 [0.725      0.68055556 0.65277778 0.7        0.69166667]
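
One caveat about the scores above: `x_train` was produced by a vectorizer and TF-IDF transformer fitted on the whole training set, so each validation fold had already influenced the vocabulary and IDF weights. A possible remedy, sketched below under toy assumptions (`toy_texts` and `toy_labels` stand in for `corpus` and `y_train`), is to put the vectorizer and the classifier into one scikit-learn pipeline so that the features are refit inside every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy pre-segmented titles and labels (assumptions).
toy_texts = ['nba finals tonight', 'nba trade rumor', 'nba playoff score', 'nba draft pick',
             'bank rates rise', 'bank loan rates', 'bank stock falls', 'bank profit grows']
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer is refit inside every fold, on that fold's training part only.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv_results = cross_validate(pipeline, toy_texts, toy_labels, cv=2)
print(cv_results['test_score'])
```

With only 1800 short titles the difference may be small, but the pipeline form also guarantees that the same steps are applied to new data at prediction time.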


## Evaluate on the test data

In [None]:
# Load the test data
test_data = pd.read_csv('./datasets/news_clustering_test.tsv', sep='\t').drop(columns=['index'])
test_data['class'] = test_data['class'].apply(func=lambda x: classes[x])

# Word segmentation
test_data['jieba'] = test_data['title'].apply(func=lambda x: list(jieba.cut(x)))
test_data.head()

In [None]:
# Compute TF-IDF

# Build the test corpus from the jieba tokens
test_corpus = [' '.join(tokens) for tokens in test_data['jieba']]

# Reuse the vectorizer and transformer that were fitted on the training data
x_test = tfidf_transfomer.transform(vectorizer.transform(test_corpus)).toarray()
y_test = test_data['class']

In [None]:
print(x_test.shape)
print(y_test.shape)

In [34]:
for model in models:
    print('score: {}'.format(accuracy_score(y_test, model.predict(x_test))))

score: 0.6766666666666666
score: 0.685
score: 0.765
score: 0.765
score: 0.755


## Inspect the distribution of results with a confusion matrix

In [34]:
# Look at how the predictions are distributed across classes
from sklearn.metrics import confusion_matrix
from time import time

model_name = ['Random forest', 'KNN', 'Logistic regression', 'Naive Bayes', 'SVM']
for name, model in zip(model_name, models):
    t = time()

    # Predictions are passed first, so rows are predicted classes and columns are true classes
    print(confusion_matrix(model.predict(x_test), y_test))
    print(name, f'elapsed: {round(time() - t, 3)} s')

[[73  3  3  4  1  3]
 [ 3 65 13  5 12  3]
 [ 3 12 61  4  6  3]
 [ 3  6  3 64  9  2]
 [17 14 17 23 72 14]
 [ 1  0  3  0  0 75]]
Random forest elapsed: 0.101 s
[[85  7  9 12  7  5]
 [ 4 66 10  7 10  0]
 [ 4 17 68 13 12  7]
 [ 3  2  3 59 13  6]
 [ 0  3  3  2 56  5]
 [ 4  5  7  7  2 77]]
KNN elapsed: 11.737 s
[[87  1  1  1  1  7]
 [ 2 77 12  9 10  3]
 [ 2 16 72  6  7  2]
 [ 1  2  3 72 13  1]
 [ 7  3  7 12 69  5]
 [ 1  1  5  0  0 82]]
Logistic regression elapsed: 0.01 s
[[91  2  3  6  6  5]
 [ 3 71  8  5 10  2]
 [ 3 21 76  8 10  3]
 [ 1  0  3 68  9  0]
 [ 0  4  4  6 64  1]
 [ 2  2  6  7  1 89]]
Naive Bayes elapsed: 0.01 s
[[84  1  1  0  1  6]
 [ 5 81 17 10 11  4]
 [ 2  9 66  6  4  3]
 [ 1  3  4 68  9  1]
 [ 8  6 10 14 74  6]
 [ 0  0  2  2  1 80]]
SVM elapsed: 8.887 s
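
Because the call above passes the predictions as the first argument, each matrix has predicted classes on the rows and true classes on the columns (every column sums to 100, the number of test titles per class). Under that reading, per-class recall and precision can be derived as sketched below, using the fourth matrix above (naive Bayes) copied by hand.

```python
import numpy as np

# Naive Bayes matrix copied from the output above.
# Rows = predicted class, columns = true class (predictions were passed first).
cm = np.array([[91,  2,  3,  6,  6,  5],
               [ 3, 71,  8,  5, 10,  2],
               [ 3, 21, 76,  8, 10,  3],
               [ 1,  0,  3, 68,  9,  0],
               [ 0,  4,  4,  6, 64,  1],
               [ 2,  2,  6,  7,  1, 89]])

diag = np.diag(cm)
recall = diag / cm.sum(axis=0)       # correct / all titles that truly belong to the class
precision = diag / cm.sum(axis=1)    # correct / all titles predicted as the class
accuracy = diag.sum() / cm.sum()

print('recall   :', np.round(recall, 2))
print('precision:', np.round(precision, 2))
print('accuracy :', accuracy)
```

The accuracy recovered this way, 459 / 600 = 0.765, agrees with the test score printed earlier for the same model. Classes 3 and 4 have the lowest recall.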


## Hyperparameter tuning
- Models and the hyperparameters considered (defaults in parentheses)
    - xgboost
    - knn
        - n_neighbors(5)
    - svc
        - kernel, random_state
    - nb
        - none
    - rf
        - n_estimators(100), criterion, max_depth(None), random_state
    

In [35]:
from sklearn.model_selection import GridSearchCV

hyper_params = {
    'knn': {
        'n_neighbors': [3, 5, 7]
    },
    'svc': {
        'kernel': ['linear', 'rbf']
    },
    'rf': {
        'n_estimators': [100, 150, 200], 'criterion': ['gini', 'entropy'], 'max_depth': [None, 3, 5, 7], 'random_state': [0]
    }
}

rf = RandomForestClassifier()
knn = KNeighborsClassifier()
svc = SVC()

model_dict = {
    'knn': knn,
    'svc': svc,
    'rf': rf,
}

for k in model_dict:
    gs = GridSearchCV(model_dict[k], hyper_params[k])
    gs.fit(x_train, y_train)
    model_dict[k] = gs
    print(k, gs.score(x_test, y_test))

knn 0.6866666666666666
svc 0.77
rf 0.6883333333333334


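## Notes after finishing the baseline
- Preprocessing
    - Only simple word segmentation so far; digits, stopwords, and so on have not been removed yet.
    - Remove the words that show up in every class's word cloud.
- Feature engineering
    - TF-IDF
        - 6773 dimensions, which is rather high.
    - Other methods
- Model training and selection
    - Traditional ML models (because the dataset is small)
    - [BERT fine tune](https://zhuanlan.zhihu.com/p/72448986)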

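## Finer-grained handling of the segmented text
- Remove stopwords
- Remove meaningless words
- Remove punctuation
- Result
    - Removing only the items below did improve the results, but I forgot to apply the same removal to the test set... a lesson in why a pipeline matters.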
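
One way to avoid forgetting a preprocessing step on the test set is to build it into the vectorizer itself, so that `fit_transform` on the training text and `transform` on the test text run the same code. The sketch below is only an illustration. `STOPWORDS` and the whitespace `tokenize` function are hypothetical stand-ins for the stopword file and `jieba.cut` used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stopword set (assumption); the notebook would load its stopword file instead.
STOPWORDS = {'the', 'of', 'a'}

def tokenize(text):
    """Split on whitespace and drop stopwords; applied to train and test alike."""
    return [tok for tok in text.split() if tok not in STOPWORDS]

# token_pattern=None because the custom tokenizer replaces the default regex.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
x_train_counts = vectorizer.fit_transform(['the price of oil', 'a win for the team'])
x_test_counts = vectorizer.transform(['the team of the year'])

print(sorted(vectorizer.vocabulary_))   # ['for', 'oil', 'price', 'team', 'win']
print(x_test_counts.toarray())          # [[0 0 0 1 0]]
```

The stopwords never reach the vocabulary, and the test sentence is cleaned by the same function without any extra call.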

In [None]:
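def remove_stopwords(review, stopwords):
    """Segment each sentence with jieba, drop the stopwords, and rejoin the rest into one string."""
    stopwords = set(stopwords)    # set membership is much faster than scanning a list
    corpus = []

    for sentence in review:
        s_remove = [word for word in jieba.cut(sentence) if word not in stopwords]
        corpus.append(''.join(s_remove))
    return corpus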

In [None]:
## Remove stopwords

with open('./datasets/停用詞-繁體中文.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split()

review = list(train_data['title'])
corpus = remove_stopwords(review, stopwords)
print(corpus[:10])
train_data['title'] = corpus

In [38]:
# In theory the stopword list above already removes all of these
# (re-enabling the lines below also requires `import re`)

# train_data['title'] = train_data['title'].apply(func=lambda x: re.sub('[，、、。？！,.\'：?:]', '', x))
# train_data['title'] = train_data['title'].apply(func=lambda x: re.sub('[0-9]', '', x))    # digits are not important
# train_data.head()

In [None]:
train_data['jieba'] = train_data['title'].apply(func=lambda x: list(jieba.cut(x)))
train_data.head()

## [BERT - tf2.0](https://blog.csdn.net/xiaoniu0991/article/details/108243733)
- Trained parameters
    - All layers fine-tuned
- Hyperparameters
    - epochs: 8
    - batch_size: 64
    - lr: 2e-05
- Result
    - acc: 0.9111
    

In [None]:
from transformers import BertTokenizer

############################# Walk through how the model input is built by hand

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')      # WordPiece tokenizer

max_length = 50
test_sentence = '亞洲杯奪冠賠率日本伊朗領銜，中國竟與泰國並列'

# Add the special tokens the architecture expects; the classification output is read from [CLS]
test_sentence_with_special_tokens = '[CLS]' + test_sentence + '[SEP]'  # CLS: classification token, SEP: end of sequence (separator)
tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)
print('tokenized: ', tokenized)


# Map the WordPiece tokens to vocabulary ids
input_ids = tokenizer.convert_tokens_to_ids(tokenized)

# Padding (id 0 is [PAD])
padding_length = max_length - len(input_ids)
input_ids += padding_length * [0]

# Attention mask: 1 for the real tokens, 0 for the padding
attention_mask = [1] * (max_length - padding_length) + [0] * padding_length

# token types, needed for example for question answering, for our purpose we will just set 0 as we have just one sequence
token_type_ids = [0] * max_length
bert_input = {
    "input_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask
}
print(bert_input)
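
The padding and attention-mask logic from the manual walk-through can be isolated into a few lines of plain Python. This is a minimal sketch: `pad_and_mask` is a hypothetical helper, and the ids in the example are placeholders rather than real tokenizer output.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or truncate) token_ids to max_length and build the matching attention mask."""
    token_ids = list(token_ids)[:max_length]
    padding_length = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * padding_length
    attention_mask = [1] * len(token_ids) + [0] * padding_length
    return input_ids, attention_mask

# Placeholder ids for "[CLS] x y [SEP]" (assumption: real values come from the tokenizer).
ids, mask = pad_and_mask([101, 7, 8, 102], max_length=6)
print(ids)   # [101, 7, 8, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Note that this naive truncation would cut off the trailing `[SEP]` for an over-long input, which is one reason to let the tokenizer handle truncation itself in practice.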

In [None]:
################### In practice this single call is enough; compare it with the cell above, the result is identical
bert_input = tokenizer.encode_plus(
                        test_sentence,                      
                        add_special_tokens = True, # add [CLS], [SEP]
                        max_length = max_length, # max length of the text that can go to BERT
                        pad_to_max_length = True, # add [PAD] tokens (deprecated since transformers 3.0; newer versions use padding='max_length', truncation=True)
                        return_attention_mask = True, # add attention mask to not focus on pad tokens
              )
print('encoded', bert_input)

In [None]:
train_data

In [None]:
# Dataset preparation

from sklearn.model_selection import train_test_split
import pandas as pd

def split_dataset(df):
    # 80% train; the remaining 20% is split evenly into validation and test, stratified by label
    train_set, x = train_test_split(df, 
        stratify=df['label'],
        test_size=0.2,
        random_state=2)
    val_set, test_set = train_test_split(x, 
        stratify=x['label'],
        test_size=0.5, 
        random_state=2)

    return train_set,val_set, test_set

train_data = pd.read_csv('./datasets/news_clustering_train.tsv', sep='\t').drop(columns=['index'])
# Shuffle the dataset
from sklearn.utils import shuffle
train_data = shuffle(train_data)

df_raw = train_data[['class', 'title']].copy()
df_raw.columns = ['label', 'text']
# label ids
# {'體育': 0, '財經': 1, '科技': 2, '旅遊': 3, '農業': 4, '遊戲': 5}
df_label = pd.DataFrame({"label":['體育', '財經', '科技', '旅遊', '農業', '遊戲'],"y":list(range(6))})
df_raw = pd.merge(df_raw,df_label,on="label",how="left")

train_data,val_data, test_data = split_dataset(df_raw)

In [None]:
df_raw

In [None]:
train_data

In [None]:
## text ---> token ids
## Apply the BERT tokenizer to every sample. encode_plus maps each text to token ids, token type ids, and an attention mask.
import tensorflow as tf

def convert_example_to_feature(review):
    return tokenizer.encode_plus(review, 
                                 add_special_tokens = True, # add [CLS], [SEP]
                                 max_length = max_length, # max length of the text that can go to BERT
                                 pad_to_max_length = True, # add [PAD] tokens
                                 return_attention_mask = True, # add attention mask to not focus on pad tokens
                                )

# map to the expected input to TFBertForSequenceClassification
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

def encode_examples(ds, limit=-1):
    # prepare list, so that we can build up final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if (limit > 0):
        ds = ds.head(limit)    # ds is a pandas DataFrame here; keep only the first `limit` rows

    for index, row in ds.iterrows():
        review = row["text"]
        label = row["y"]
        bert_input = convert_example_to_feature(review)

        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

In [None]:
import tensorflow as tf
batch_size = 64
# Pack the data into the input format the BERT model expects

# train dataset
ds_train_encoded = encode_examples(train_data).shuffle(10000).batch(batch_size)
# val dataset
ds_val_encoded = encode_examples(val_data).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(test_data).batch(batch_size)

In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=6)

In [64]:
# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# train for 8 epochs; more epochs may help as long as the model does not overfit
number_of_epochs = 8

# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=6)

# optimizer Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate,epsilon=1e-08, clipnorm=1)

# the labels are integer ids, not one-hot vectors, so use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# fit model
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_val_encoded)
# evaluate test set
model.evaluate(ds_test_encoded)

I0124 12:49:20.741760 10980 configuration_utils.py:265] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json from cache at C:\Users\aband/.cache\torch\transformers\8a3b1cfe5da58286e12a0f5d7d182b8d6eca88c08e26c332ee3817548cf7e60a.f12a4f986e43d8b328f5b067a641064d67b91597567a06c7b122d1ca7dfd9741
I0124 12:49:20.755438 10980 configuration_utils.py:301] Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddin

Epoch 1/8
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'

[0.45325931906700134, 0.8777777552604675]

In [65]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  102267648 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  4614      
Total params: 102,272,262
Trainable params: 102,272,262
Non-trainable params: 0
_________________________________________________________________


In [67]:
model.save_weights('./models/bert_fine_tuning.h5')

In [68]:
# Evaluate the BERT model on the test data
# The text should probably be cleaned first...

# 1. prepare data
final_test_data = pd.read_csv('./datasets/news_clustering_test.tsv', sep='\t').drop(columns=['index'])
df_raw_2 = final_test_data[['class', 'title']]
df_raw_2.columns = ['label', 'text']
# label ids
# {'體育': 0, '財經': 1, '科技': 2, '旅遊': 3, '農業': 4, '遊戲': 5}
df_label_2 = pd.DataFrame({"label":['體育', '財經', '科技', '旅遊', '農業', '遊戲'],"y":list(range(6))})
df_raw_2 = pd.merge(df_raw_2,df_label_2,on="label",how="left")
test_data_2 = df_raw_2

# 2. batch data
ds_test_encoded_2 = encode_examples(test_data_2).batch(batch_size)


# 3. evaluate

model.evaluate(ds_test_encoded_2)



[0.3585672974586487, 0.9049999713897705]

## Results
- Without preprocessing
    - 10/10 [==============================] - 87s 9s/step - loss: 0.3335 - accuracy: 0.9133
    - [0.333496630191803, 0.9133333563804626]
- With preprocessing
    - (no result recorded)

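## LSTM & CNN deep learning models with TensorFlow
- CNN model
- LSTM model
- Note
    - I fell into a pit with vocab_size + 1: the Embedding layer's input_dim must be the vocabulary size plus one. From the Keras documentation for input_dim:
        - Integer. Size of the vocabulary, i.e. maximum integer index + 1.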
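
The reason for the extra row can be shown without Keras at all. The Keras `Tokenizer` numbers words from 1 and `pad_sequences` pads with 0, so the largest id equals the vocabulary size and the lookup table needs `vocab_size + 1` rows. The NumPy sketch below mimics an embedding lookup. The tiny `word_index` is an assumption.

```python
import numpy as np

# Keras-style word index (toy assumption): ids start at 1 because 0 is the padding id.
word_index = {'nba': 1, 'bank': 2, 'game': 3}
vocab_size = len(word_index)            # 3
embedding_dim = 4

# The lookup table needs one row per possible id: 0 (padding) through vocab_size.
table = np.zeros((vocab_size + 1, embedding_dim))

padded_sequence = np.array([1, 3, 0, 0])    # "nba game" padded to length 4
vectors = table[padded_sequence]            # works: the largest id, 3, is a valid row
print(vectors.shape)                        # (4, 4)

# With only vocab_size rows, the largest id is out of range.
too_small = np.zeros((vocab_size, embedding_dim))
try:
    too_small[np.array([3])]
except IndexError as err:
    print('IndexError:', err)
```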

#### [Preprocessing with Keras](https://blog.csdn.net/DBC_121/article/details/108038858)

#### CNN model
- Use Conv1D to extract features
    - The key setting to vary is the kernel_size length

In [35]:
import tensorflow as tf
from tensorflow.keras.layers import Bidirectional

In [36]:
corpus[:10]

['亞洲杯 奪冠 賠率 ： 日本 、 伊朗 領銜   中國 竟 與 泰國 並列',
 '9 輪 4 球 本土 射手 僅次 武磊   黃 紫昌要 搶 最強 U23 頭銜',
 '如果 今年 勇士 奪冠 ， 下賽季 詹姆斯 何去 何 從 ？',
 '超級 替 補 ！ 科斯塔 本賽 季替 補 出場 貢獻 7 次 助攻',
 '騎士 6 天里 發生 了 啥 ？ 從 首輪 搶七到 次 輪 3 - 0 猛龍',
 '如果 朗多 進入 轉會 市場 ， 哪些 球隊 適合 他 ？',
 '詹姆斯 G3 決殺 ， 你 怎麼 看 ？',
 '大 魔王 帶頭 唱歌 ！ 火箭 這 像是 打季後賽 ？ 爵士 神帥 這話 已 提前 投降 了',
 '馬 夏爾要 去 切爾西 ？ 可以 商量 ， 不過 穆里尼 奧 的 要 價是 4000 萬加 威廉',
 '利希施 泰納宣 佈 賽季 結束 後 離隊 ： 我 需要 新 的 挑戰']

In [37]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

# Build the vocabulary
tokenizer.fit_on_texts(corpus)

# Inspect the word counts
tokenizer.word_counts

OrderedDict([('亞洲杯', 8),
             ('奪冠', 14),
             ('賠率', 1),
             ('：', 242),
             ('日本', 19),
             ('、', 80),
             ('伊朗', 1),
             ('領銜', 1),
             ('中國', 68),
             ('竟', 3),
             ('與', 30),
             ('泰國', 5),
             ('並列', 1),
             ('9', 10),
             ('輪', 12),
             ('4', 22),
             ('球', 5),
             ('本土', 2),
             ('射手', 4),
             ('僅次', 3),
             ('武磊', 1),
             ('黃', 25),
             ('紫昌要', 1),
             ('搶', 3),
             ('最強', 9),
             ('u23', 2),
             ('頭銜', 1),
             ('如果', 41),
             ('今年', 15),
             ('勇士', 14),
             ('，', 1269),
             ('下賽季', 1),
             ('詹姆斯', 39),
             ('何去', 1),
             ('何', 3),
             ('從', 22),
             ('？', 975),
             ('超級', 2),
             ('替', 1),
             ('補', 3),
             ('！', 299),
     

In [38]:
print(tokenizer.word_index)
print(len(tokenizer.word_index))

{'，': 1, '？': 2, '的': 3, '！': 4, '：': 5, '了': 6, '有': 7, '是': 8, '你': 9, '什麼': 10, '」': 11, '在': 12, '「': 13, '嗎': 14, '為': 15, '怎麼': 16, '都': 17, '如何': 18, '不': 19, '人': 20, '、': 21, '和': 22, '後': 23, '去': 24, '被': 25, '《': 26, '》': 27, '中國': 28, '王者': 29, '看': 30, '5': 31, '哪些': 32, '上': 33, '錢': 34, '農村': 35, '年': 36, '好': 37, '世界': 38, '最': 39, '說': 40, '手機': 41, '榮耀': 42, '3': 43, '到': 44, '可以': 45, '2018': 46, '能': 47, '就': 48, '誰': 49, '月': 50, '英雄': 51, '怎樣': 52, '個': 53, '—': 54, '來': 55, '馬': 56, '會': 57, '如果': 58, '一個': 59, '做': 60, '吃': 61, '詹姆斯': 62, '要': 63, '中': 64, '2': 65, '遊戲': 66, '多': 67, '讓': 68, '我': 69, '這個': 70, '沒': 71, '他': 72, '1': 73, '第一': 74, '卻': 75, '哪個': 76, '又': 77, '想': 78, '萬': 79, '支付': 80, '對': 81, '也': 82, '與': 83, '大': 84, '用': 85, '呢': 86, '6': 87, '旅遊': 88, '玩': 89, '現在': 90, '銀行': 91, '更': 92, '買': 93, '0': 94, 'dnf': 95, '黃': 96, '需要': 97, '新': 98, '哪裡': 99, '該': 100, '小米': 101, '猛龍': 102, '還是': 103, '有什麼': 104, '這': 105, '決賽': 106, '自己': 107

In [39]:
help(tokenizer)

Help on Tokenizer in module keras_preprocessing.text object:

class Tokenizer(builtins.object)
 |  Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0, **kwargs)
 |  
 |  Text tokenization utility class.
 |  
 |  This class allows to vectorize a text corpus, by turning each
 |  text into either a sequence of integers (each integer being the index
 |  of a token in a dictionary) or into a vector where the coefficient
 |  for each token could be binary, based on word count, based on tf-idf...
 |  
 |  # Arguments
 |      num_words: the maximum number of words to keep, based
 |          on word frequency. Only the most common `num_words-1` words will
 |          be kept.
 |      filters: a string where each element is a character that will be
 |          filtered from the texts. The default is all punctuation, plus
 |          tabs and line breaks, minus the `'` character.
 |      lower: boolea

In [40]:
# Convert the texts to integer sequences
tokenized_text = tokenizer.texts_to_sequences(corpus)
tokenized_text[:5]

[[335, 173, 2412, 5, 128, 21, 2413, 2414, 28, 862, 83, 521, 2415],
 [272, 219, 111, 522, 1316, 640, 863, 2416, 96, 2417, 864, 296, 1317, 2418],
 [58, 163, 174, 173, 1, 2419, 62, 2420, 865, 112, 2],
 [1318, 2421, 866, 4, 2422, 2423, 2424, 866, 867, 641, 164, 523, 1319],
 [120, 87, 2425, 642, 6, 369, 2, 112, 1320, 2426, 523, 219, 43, 94, 102]]

In [41]:
# Find the longest sequence: the maximum length turns out to be 36

dict_seq_length_to_count = {}
for seq in tokenized_text:
    length = len(seq)
    dict_seq_length_to_count[length] = dict_seq_length_to_count.get(length, 0) + 1

dict_seq_length_to_count

{13: 117,
 14: 137,
 11: 113,
 15: 162,
 8: 100,
 17: 122,
 22: 19,
 16: 140,
 7: 79,
 9: 124,
 12: 134,
 18: 98,
 4: 13,
 19: 74,
 5: 34,
 20: 53,
 21: 42,
 10: 116,
 23: 16,
 24: 8,
 6: 73,
 3: 7,
 25: 8,
 28: 2,
 26: 3,
 27: 3,
 2: 1,
 36: 1,
 29: 1}

In [42]:
# Pad every sequence to the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 36
x_train_dl = pad_sequences(tokenized_text, maxlen=max_length, padding='post')  # the default is padding='pre'
print(x_train_dl.shape)
print(x_train_dl[:5])

(1800, 36)
[[ 335  173 2412    5  128   21 2413 2414   28  862   83  521 2415    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]
 [ 272  219  111  522 1316  640  863 2416   96 2417  864  296 1317 2418
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]
 [  58  163  174  173    1 2419   62 2420  865  112    2    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]
 [1318 2421  866    4 2422 2423 2424  866  867  641  164  523 1319    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]
 [ 120   87 2425  642    6  369    2  112 1320 2426  523  219   43   94
   102    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]]


In [43]:
# Prepare the labels (one-hot encoding)
from tensorflow.keras.utils import to_categorical

y_train_dl = to_categorical(y_train, 6)
y_train_dl[:10]

array([[1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.]], dtype=float32)

In [44]:
print(x_train_dl.shape)
print(y_train_dl.shape)

(1800, 36)
(1800, 6)


In [45]:
# Prepare the test data

test_data.head()

Unnamed: 0,class,title,jieba
0,0,如果騎士火箭進入總決賽，誰的勝算大？,"[如果, 騎士, 火箭, 進入, 總, 決賽, ，, 誰, 的, 勝算大, ？]"
1,0,從個人競技狀態來看，三個階段的詹姆斯，哪個最強？,"[從個, 人, 競技, 狀態, 來, 看, ，, 三個, 階段, 的, 詹姆斯, ，, 哪個..."
2,0,騎士總冠軍！地球人誰能阻擋詹姆斯？史上最佳就是他！打服所有人,"[騎士, 總冠, 軍, ！, 地球, 人, 誰, 能, 阻擋, 詹姆斯, ？, 史上, 最佳..."
3,0,詹姆斯絕殺，騎士3比0，猛龍懷疑人生,"[詹姆斯, 絕殺, ，, 騎士, 3, 比, 0, ，, 猛龍, 懷疑, 人生]"
4,0,騎士和步行者戰成搶七險勝，而猛龍即將被橫掃，步行者跟猛龍的區別在哪裡？,"[騎士, 和, 步行者, 戰成, 搶七險勝, ，, 而, 猛龍, 即將, 被, 橫掃, ，,..."


In [46]:
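# pad_sequences truncates from the front by default (truncating='pre') when a test sequence exceeds maxlen
x_test_dl =  pad_sequences(tokenizer.texts_to_sequences(test_corpus), maxlen=36, padding='post')
y_test_dl = to_categorical(test_data['class'].to_numpy(), 6)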

print(x_test_dl.shape)
print(y_test_dl.shape)

(600, 36)
(600, 6)


In [47]:
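# Just a first attempt; there is plenty left to tune!
vocab_size = len(tokenizer.word_index)
embedding_dim = 256
input_length = max_length

nlp_cnn_model = tf.keras.Sequential([
    # word ids start at 1 and 0 is the padding id, so the embedding needs vocab_size + 1 rows
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length),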
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(6, activation='softmax')
])

In [48]:
# Compile the model and print its summary

nlp_cnn_model.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy(), metrics = ['acc'])
nlp_cnn_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 36, 256)           1965312   
_________________________________________________________________
conv1d (Conv1D)              (None, 32, 128)           163968    
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 32)                4128      
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 198       
Total params: 2,133,606
Trainable params: 2,133,606
Non-trainable params: 0
______________________________________________
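
The parameter counts in the summary above can be reproduced by hand, which is a useful check on how each layer is wired. The sketch below uses hypothetical helper functions. The embedding row count of 7677 (that is, `vocab_size + 1`) is inferred from the summary as 1,965,312 / 256 rather than stated in the notebook.

```python
def embedding_params(num_rows, embedding_dim):
    return num_rows * embedding_dim

def conv1d_params(in_channels, filters, kernel_size):
    # one kernel of shape (kernel_size, in_channels) per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    return in_units * out_units + out_units

def conv1d_output_length(input_length, kernel_size):
    # padding='valid' and stride 1, the Conv1D defaults
    return input_length - kernel_size + 1

num_rows = 1965312 // 256          # 7677, inferred from the summary (assumption)
counts = [
    embedding_params(num_rows, 256),   # 1965312
    conv1d_params(256, 128, 5),        # 163968
    dense_params(128, 32),             # 4128
    dense_params(32, 6),               # 198
]
total = sum(counts)

print(conv1d_output_length(36, 5))     # 32
print(counts)
print(total)                           # 2133606
```

The embedding table holds over 90% of the weights, which helps explain why such a model can overfit quickly on 1800 titles.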

In [49]:
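# Note: the test set is used as validation data here, so val_acc should not also drive choices such as the number of epochs
nlp_cnn_model.fit(x_train_dl, y_train_dl, batch_size=64, epochs=50, validation_data=(x_test_dl, y_test_dl))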

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x16ca1f1e550>

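#### LSTM model
- With the Python 3.7.3 64-bit kernel and TF 2.3.0 this raises an error
    - The error comes from the LSTM layer!
- It ran fine when I tried it on Colab first, so I switched to TF 2.4.1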

In [54]:
nlp_lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax')
])

In [55]:
nlp_lstm_model.compile(optimizer='adam', loss=tf.keras.losses.CategoricalCrossentropy(), metrics=['acc'])
nlp_lstm_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 36, 256)           1965312   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 32, 128)           163968    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense_5 (Dense)              (None, 512)               66048     
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_7 (Dense)              (None, 6)                
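
The bidirectional LSTM row in the summary above can be checked the same way. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias, and the Bidirectional wrapper holds two such layers. The helper names below are hypothetical.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)

def dense_params(in_units, out_units):
    return in_units * out_units + out_units

# The LSTM reads the 128 Conv1D filters; forward and backward layers have 64 units each.
bilstm = 2 * lstm_params(input_dim=128, units=64)

print(bilstm)                     # 98816
print(dense_params(128, 512))     # 66048
print(dense_params(512, 512))     # 262656
```

The output width of 128 reported for the bidirectional layer is the concatenation of the two 64-unit directions.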

In [56]:
nlp_lstm_model.fit(x_train_dl, y_train_dl, batch_size=64, epochs=50, validation_data=(x_test_dl, y_test_dl))

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x16ca37530f0>

In [57]:
import tensorflow as tf
tf.__version__

'2.4.1'

In [58]:
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
cpus = tf.config.experimental.list_physical_devices(device_type='CPU')
print("Available GPUs:", gpus, "\nAvailable CPUs:", cpus)

Available GPUs: [] 
Available CPUs: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
