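### Classification task
Use the bert-base-chinese model to compute embeddings for a news dataset, then train a classifier on those embeddings. (See the week 7 code for reference.)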

In [None]:
import os

from google.colab import drive
drive.mount('/content/drive')

os.chdir('path/to/your/Drive/folder')  # switch to that directory (replace with your own path)
os.listdir()  # check the directory contents

In [1]:
import pandas as pd
import re

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
# !pip install -U sentence-transformers

In [4]:
from sentence_transformers import SentenceTransformer, models, util



<font color=#ffa>Load BERT-base</font>

In [5]:
# Chinese model: bert-base-chinese
bert_ch = SentenceTransformer('google-bert/bert-base-chinese')

# Returns the number of tokens added; 0 means [PAD] was already in the vocabulary
bert_ch.tokenizer.add_special_tokens({'pad_token': '[PAD]'})

No sentence-transformers model found with name google-bert/bert-base-chinese. Creating a new one with MEAN pooling.


0
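The log message above says the model falls back to MEAN pooling, that is, each article vector is the average of its token vectors. Below is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask. The function name `mean_pool` and the toy vectors are illustrative only and are not part of sentence-transformers.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors at positions where attention_mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical): three 2-dimensional token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

With bert-base-chinese the same averaging would run over 768-dimensional token vectors, which matches the 768 columns (0 to 767) seen later in the notebook.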

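## <font color=#ffa>Dataset
Apple Daily (蘋果日報) categories: 娛樂時尚 (entertainment and fashion), 3C車市 (consumer electronics and cars), 國際 (international), 生活 (lifestyle), 社會 (society), 政治 (politics), 體育 (sports)  
Collection period: April 2024  
Number of records: 4,492
</font>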

In [12]:
udn = pd.read_csv("./raw_data/sna2024s_5_db95ac6af0_29.csv")
udn.head(3)

Unnamed: 0,system_id,artTitle,artDate,artCatagory,artUrl,artContent,dataSource
0,1,CITIZEN抗磁飛行錶端酷帥鋼鐵灰、曜石黑　韋禮安搶先戴上手,2024-04-01 18:41:00,娛樂時尚,https://tw.nextapple.com/entertainment/2024040...,【記者劉旻君／台北報導】若是經常出遊的outdoor派，也許對日本鐘錶品牌CITIZEN的P...,appleDaily
1,2,劉香慈高齡懷第3胎　不敢驗「唐氏症」！曝心中有堅定答案,2024-04-01 07:52:00,娛樂時尚,https://tw.nextapple.com/entertainment/2024040...,【王怡人／綜合報導】40歲「最美士官長」劉香慈和永信藥品總經理鍾威凱育有2子，兩人婚姻剛跨越...,appleDaily
2,3,南拳媽媽張傑與「他」穩交6年　最新戀情成果曝光,2024-04-01 18:10:00,娛樂時尚,https://tw.nextapple.com/entertainment/2024040...,【記者林秭渝／台北報導】曾奪下亞洲、東亞、世界運動會的健美比賽拿過銀牌、金牌和銅牌的許家豪，...,appleDaily


In [13]:
# Drop rows whose title or content is NaN
udn = udn.dropna(subset=['artTitle'])
udn = udn.dropna(subset=['artContent'])
# Remove URLs (note: `.*` also drops any text that follows the URL on the same line)
udn["artContent"] = udn.artContent.apply(
    lambda x: re.sub("(http|https)://.*", "", x)
)
udn["artTitle"] = udn["artTitle"].apply(
    lambda x: re.sub("(http|https)://.*", "", x)
)
# Keep only Chinese characters (U+4E00 to U+9FA5)
udn["artContent"] = udn.artContent.apply(
    lambda x: re.sub("[^\u4e00-\u9fa5]+", "", x)
)
udn["artTitle"] = udn["artTitle"].apply(
    lambda x: re.sub("[^\u4e00-\u9fa5]+", "", x)
)

# Combine title and body into a single content column
udn["content"] = udn["artTitle"] + udn["artContent"]
udn = udn[["content", "artUrl", "artCatagory"]]  # article text, article URL, category
udn.head()

Unnamed: 0,content,artUrl,artCatagory
0,抗磁飛行錶端酷帥鋼鐵灰曜石黑韋禮安搶先戴上手記者劉旻君台北報導若是經常出遊的派也許對日本鐘錶...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚
1,劉香慈高齡懷第胎不敢驗唐氏症曝心中有堅定答案王怡人綜合報導歲最美士官長劉香慈和永信藥品總經理...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚
2,南拳媽媽張傑與他穩交年最新戀情成果曝光記者林秭渝台北報導曾奪下亞洲東亞世界運動會的健美比賽拿...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚
3,顫慄秒日本電視台突插播北韓閱兵網怒愚人節不好笑吳惠菁綜合報導愚人節搞笑日本電視台生活資訊節目...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚
4,劇時赤坂見假戲真愛裸身出擊露骨邀上床記者陳薇安綜合報導打造出經典劇如果歲還是處男似乎就能成為...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚
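To see what the two regular expressions in the cleaning cell do to a single string, here is a small standalone sketch using the same patterns. The sample sentence and the `example.com` URL are made up for illustration, and the helper name `clean` is hypothetical.

```python
import re

URL_TAIL = re.compile(r"(http|https)://.*")   # same pattern as the cleaning cell
NON_CJK = re.compile(r"[^\u4e00-\u9fa5]+")    # anything outside U+4E00..U+9FA5


def clean(text: str) -> str:
    """Strip the URL (and everything after it on the line), then keep only Chinese characters."""
    text = URL_TAIL.sub("", text)
    return NON_CJK.sub("", text)


# Hypothetical sample: Latin letters, digits, punctuation, a URL, and trailing text.
sample = "CITIZEN抗磁飛行錶　韋禮安搶先戴上手 2024! https://example.com/a 後續文字"
print(clean(sample))   # 抗磁飛行錶韋禮安搶先戴上手
print(clean("懷第3胎"))  # 懷第胎
```

Two side effects are worth keeping in mind: digits and Latin brand names disappear (which is why "第3胎" becomes "第胎" in the output table), and any text that follows a URL on the same line is discarded along with the URL.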


## <font color=#ffa>Feed the cleaned DataFrame into BERT to encode each article</font>

In [14]:
udn["embeddings"] = udn.content.apply(lambda x: bert_ch.encode(x))
udn.head(3)

Unnamed: 0,content,artUrl,artCatagory,embeddings
0,抗磁飛行錶端酷帥鋼鐵灰曜石黑韋禮安搶先戴上手記者劉旻君台北報導若是經常出遊的派也許對日本鐘錶...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚,"[0.5629351, -0.084517315, -0.6042539, -0.10897..."
1,劉香慈高齡懷第胎不敢驗唐氏症曝心中有堅定答案王怡人綜合報導歲最美士官長劉香慈和永信藥品總經理...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚,"[0.6024722, -0.14360781, -0.36352628, 0.174739..."
2,南拳媽媽張傑與他穩交年最新戀情成果曝光記者林秭渝台北報導曾奪下亞洲東亞世界運動會的健美比賽拿...,https://tw.nextapple.com/entertainment/2024040...,娛樂時尚,"[0.5377075, -0.08188572, -0.18106245, 0.341444..."


In [15]:
import numpy as np
from ast import literal_eval

## <font color=#ffa>Use 70% for training and 30% for testing</font>

In [32]:
data = udn.copy()

X = data["embeddings"].apply(pd.Series)
y = data["artCatagory"]

# Split the whole dataset 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=777
)

print(X_train.head())
print(y_train.head())

           0         1         2         3         4         5         6    \
317   0.513953 -0.178706 -0.574484  0.244219  0.259221 -0.085194 -0.363229   
121   0.322818 -0.262457 -0.377609  0.386087 -0.003611 -0.049434  0.021079   
568   0.537200 -0.135376 -0.321614  0.316921  0.086563 -0.175800  0.017494   
134   0.422602 -0.045941 -0.486148  0.110890 -0.060581  0.020933 -0.003441   
3898  0.463599 -0.115300 -0.560857  0.082238  0.264982  0.030440 -0.078990   

           7         8         9    ...       758       759       760  \
317   0.284722 -0.139646 -0.077878  ...  0.025199 -0.123573  0.338039   
121   0.057958 -0.141957 -0.236466  ...  0.338110 -0.329229  0.153125   
568   0.247430 -0.112659 -0.083443  ... -0.053425 -0.106651  0.192610   
134   0.088530 -0.232469 -0.138608  ... -0.119394 -0.239330  0.104855   
3898  0.076101 -0.272206 -0.357696  ... -0.023360  0.056205  0.240736   

           761       762       763       764       765       766       767  
317   0.099929 

## <font color=#ffa>Check the training and test set sizes</font>

In [51]:
print('Training set: {} records\nTest set: {} records'.format(len(X_train), len(X_test)))

Training set: 3144 records
Test set: 1348 records
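The 3,144 / 1,348 split can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set when only a float `test_size` is passed; that is my understanding of scikit-learn's behaviour, not something shown in this notebook. The helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(4492, 0.3))  # (3144, 1348)
```

Since 0.3 × 4492 = 1347.6, the test set gets 1,348 rows and the remaining 3,144 go to training, which agrees with the counts printed above.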


In [17]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
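The warning above means the default solver stopped at its iteration limit before converging. Following the two suggestions in the message, one option is to standardize the features and raise `max_iter`. The sketch below does this on synthetic data: `X_demo`, `y_demo`, the 50 dimensions, and `max_iter=1000` are all illustrative assumptions, not the notebook's embeddings or a tuned setting, so the reported 0.88 accuracy would need to be re-measured if the real classifier were changed this way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: 3 classes, 300 rows, 50 dimensions.
centers = rng.normal(size=(3, 50))
y_demo = rng.integers(0, 3, size=300)
X_demo = centers[y_demo] + rng.normal(scale=2.0, size=(300, 50))

# Scale each feature to zero mean / unit variance, then allow more solver iterations.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used.
n_iter = model.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning should no longer appear.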


## <font color=#ffa>Predict the categories and show the first 10 predictions</font>

In [41]:
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)
print(y_pred[:10])
print(len(y_pred))

['國際' '娛樂時尚' '娛樂時尚' '社會' '國際' '政治' '生活' '娛樂時尚' '政治' '社會']
1348


In [19]:
from sklearn.metrics import classification_report

In [20]:
## Accuracy, Precision, Recall, F1-score
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        3C車市       1.00      0.62      0.76        13
          國際       0.92      0.94      0.93       187
        娛樂時尚       0.94      0.94      0.94       387
          政治       0.89      0.90      0.90       177
          生活       0.86      0.82      0.84       319
          社會       0.76      0.82      0.79       187
          體育       0.89      0.87      0.88        78

    accuracy                           0.88      1348
   macro avg       0.90      0.84      0.86      1348
weighted avg       0.88      0.88      0.88      1348
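The two summary rows differ only in how the per-class scores are averaged: `macro avg` is a plain mean over the seven classes, while `weighted avg` weights each class by its support. The sketch below recomputes both from the rounded per-class values printed above, so the results only approximately match the report, which averages unrounded scores.

```python
# (precision, recall, f1, support) copied from the report above.
rows = {
    "3C車市": (1.00, 0.62, 0.76, 13),
    "國際": (0.92, 0.94, 0.93, 187),
    "娛樂時尚": (0.94, 0.94, 0.94, 387),
    "政治": (0.89, 0.90, 0.90, 177),
    "生活": (0.86, 0.82, 0.84, 319),
    "社會": (0.76, 0.82, 0.79, 187),
    "體育": (0.89, 0.87, 0.88, 78),
}

total_support = sum(r[3] for r in rows.values())


def macro(i: int) -> float:
    """Unweighted mean of metric i over all classes."""
    return sum(r[i] for r in rows.values()) / len(rows)


def weighted(i: int) -> float:
    """Mean of metric i weighted by each class's support."""
    return sum(r[i] * r[3] for r in rows.values()) / total_support


print("support total:", total_support)
print("macro    P/R/F1:", [round(macro(i), 3) for i in range(3)])
print("weighted P/R/F1:", [round(weighted(i), 3) for i in range(3)])
```

The small 3C車市 class (13 test articles, recall 0.62) pulls the macro recall down to about 0.84, while it barely affects the weighted figures. Support-weighted recall is mathematically the same quantity as overall accuracy, which is why both show 0.88.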



## <font color=#ffa>The classifier reaches an accuracy of about 0.88 on the 1,348-article test set</font>