### Classification task
Use the bert-base-chinese model to compute embeddings for a news dataset, then train a classifier on those embeddings.

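# **Dataset description**
sna2024s_2_eb4bb8bde2_9.csv is the **聯合新聞網 (udn.com) news data** that we analyze.

**Sections:** 股市 (stock market), 產經 (industry and economy), 要聞 (top news)

**Time range:** 2024/01 to 2024/03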

In [None]:
import pandas as pd
import re
import torch

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer, models, util

In [None]:
bert_ch = SentenceTransformer('google-bert/bert-base-chinese')

bert_ch.tokenizer.add_special_tokens({'pad_token': '[PAD]'})

No sentence-transformers model found with name google-bert/bert-base-chinese. Creating a new one with MEAN pooling.


0

In [None]:
print(torch.backends.mps.is_available())
print(torch.backends.mps.is_built())

True
True


In [None]:
raw_news = pd.read_csv("raw_data/sna2024s_2_eb4bb8bde2_9.csv")  # load the data
raw_news.head(3)

Unnamed: 0,system_id,artTitle,artDate,artCatagory,artSecondCatagory,artUrl,artContent,dataSource
0,1,中國拖船闖台水域 海巡署：伴航監控和廣播驅離,2024-01-02 11:09:00,要聞,https://udn.com/news/story/10930/7680054,軍事粉專記錄兩艘中國籍拖船於跨年期間闖台灣水域，其中「寧海拖5001」今天凌晨位置在鵝鑾鼻東...,,UDN
1,2,確保幻象機戰力 國軍斥資102億元採購發動機零附件,2024-01-04 16:24:00,要聞,https://udn.com/news/story/10930/7686167,為確保幻象戰機零附件及戰力無虞，政府電子採購網今天公布決標資訊，國防部國防採購室駐歐採購組與...,,UDN
2,3,國防部：2枚中共空飄氣球昨穿越台灣本島中南部上空,2024-01-09 10:06:00,要聞,https://udn.com/news/story/10930/7695181,國防部上午發布中共解放軍台海周邊海、空域動態，情資顯示昨（8）日有4枚中共空飄氣球逾越海峽中...,,UDN


**Check the number of articles per section**

Rename the columns, because the column names from the scrape do not match the data they hold.

In [None]:
print(f"number of posts: {raw_news.shape[0]}")
print(f"date range: {(raw_news['artDate'].min(), raw_news['artDate'].max())}")
print(f"category: \n{raw_news['artCatagory'].value_counts()}")
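# The scraped headers are shifted by one column, so rename them to match their data.
raw_news.rename(columns={'artSecondCatagory': 'artURl', 'artUrl': 'artcontent', 'artContent': 'content'}, inplace=True)
# Note: drop() returns a new DataFrame; the result is not assigned here, so 'content' stays in raw_news.
raw_news.drop('content', axis='columns')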
raw_news.head()

number of posts: 6150
date range: ('2024-01-01 00:11:00', '2024-03-31 23:59:00')
category: 
artCatagory
產經    3067
股市    2578
要聞     505
Name: count, dtype: int64


Unnamed: 0,system_id,artTitle,artDate,artCatagory,artURl,artcontent,content,dataSource
0,1,中國拖船闖台水域 海巡署：伴航監控和廣播驅離,2024-01-02 11:09:00,要聞,https://udn.com/news/story/10930/7680054,軍事粉專記錄兩艘中國籍拖船於跨年期間闖台灣水域，其中「寧海拖5001」今天凌晨位置在鵝鑾鼻東...,,UDN
1,2,確保幻象機戰力 國軍斥資102億元採購發動機零附件,2024-01-04 16:24:00,要聞,https://udn.com/news/story/10930/7686167,為確保幻象戰機零附件及戰力無虞，政府電子採購網今天公布決標資訊，國防部國防採購室駐歐採購組與...,,UDN
2,3,國防部：2枚中共空飄氣球昨穿越台灣本島中南部上空,2024-01-09 10:06:00,要聞,https://udn.com/news/story/10930/7695181,國防部上午發布中共解放軍台海周邊海、空域動態，情資顯示昨（8）日有4枚中共空飄氣球逾越海峽中...,,UDN
3,4,因應中共空飄氣球 國防部執行反偵蒐隱掩蔽部署,2024-01-09 11:43:00,要聞,https://udn.com/news/story/10930/7695498,針對中共空飄氣球近期密集飄越海峽中線，部分飛越本島上空，國防部情次室情報次長室情研中心情報官...,,UDN
4,5,國防部：不會擊毀中共空飄氣球,2024-01-09 12:51:00,要聞,https://udn.com/news/story/10930/7695730,國防部作計室聯合作戰計畫處副處長王家駿上校今天表示，中共對我施放空飄氣球，國軍會保持全程監控...,,UDN


# **Data cleaning**
**Remove URLs and keep only Chinese characters**

In [None]:
# Drop rows whose title or body is NaN
raw_news = raw_news.dropna(subset=['artTitle'])
raw_news = raw_news.dropna(subset=['artcontent'])
# Remove URLs (the greedy `.*` also removes any text after the URL on the same line)
raw_news["artcontent"] = raw_news.artcontent.apply(
    lambda x: re.sub("(http|https)://.*", "", x)
)
raw_news["artTitle"] = raw_news["artTitle"].apply(
    lambda x: re.sub("(http|https)://.*", "", x)
)
# Keep only Chinese characters (U+4E00 to U+9FA5); digits, Latin letters, and punctuation are removed
raw_news["artcontent"] = raw_news.artcontent.apply(
    lambda x: re.sub("[^\u4e00-\u9fa5]+", "", x)
)
raw_news["artTitle"] = raw_news["artTitle"].apply(
    lambda x: re.sub("[^\u4e00-\u9fa5]+", "", x)
)
raw_news.head(3)

Unnamed: 0,system_id,artTitle,artDate,artCatagory,artURl,artcontent,content,dataSource
0,1,中國拖船闖台水域海巡署伴航監控和廣播驅離,2024-01-02 11:09:00,要聞,https://udn.com/news/story/10930/7680054,軍事粉專記錄兩艘中國籍拖船於跨年期間闖台灣水域其中寧海拖今天凌晨位置在鵝鑾鼻東面不到浬海巡署...,,UDN
1,2,確保幻象機戰力國軍斥資億元採購發動機零附件,2024-01-04 16:24:00,要聞,https://udn.com/news/story/10930/7686167,為確保幻象戰機零附件及戰力無虞政府電子採購網今天公布決標資訊國防部國防採購室駐歐採購組與法國...,,UDN
2,3,國防部枚中共空飄氣球昨穿越台灣本島中南部上空,2024-01-09 10:06:00,要聞,https://udn.com/news/story/10930/7695181,國防部上午發布中共解放軍台海周邊海空域動態情資顯示昨日有枚中共空飄氣球逾越海峽中線其中兩枚穿...,,UDN
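The URL pattern in the cell above, `(http|https)://.*`, is greedy: `.` matches everything up to the next newline, so any text that follows a URL on the same line is removed with it. The sketch below contrasts it with a narrower pattern, `https?://\S+`, which stops at the first whitespace. The sample string and the helper `keep_chinese` are made up for illustration; whether the narrower pattern suits this dataset depends on how URLs actually appear in the articles.

```python
import re

def keep_chinese(text):
    """Drop every character outside U+4E00 to U+9FA5, as in the cleaning cell."""
    return re.sub("[^\u4e00-\u9fa5]+", "", text)

# Hypothetical sample, not taken from the dataset
sample = "詳見 https://udn.com/news 今天股市上漲"

# Pattern used in the notebook: removes the URL and everything after it
greedy = keep_chinese(re.sub(r"(http|https)://.*", "", sample))
# Narrower alternative: removes only the URL itself
narrow = keep_chinese(re.sub(r"https?://\S+", "", sample))

print(greedy)  # 詳見
print(narrow)  # 詳見今天股市上漲
```

If article bodies rarely contain URLs, the two patterns give the same result and the difference is moot.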


**Combine each article's title and body into a `content` column for analysis**

In [None]:
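raw_news["content"] = raw_news["artTitle"] + raw_news["artcontent"]
# Keep the article body, URL, and section. `content` is not selected here,
# so the steps below work on `artcontent` (body only).
raw_news = raw_news[["artcontent", "artURl", "artCatagory"]]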
raw_news.head()

Unnamed: 0,artcontent,artURl,artCatagory
0,軍事粉專記錄兩艘中國籍拖船於跨年期間闖台灣水域其中寧海拖今天凌晨位置在鵝鑾鼻東面不到浬海巡署...,https://udn.com/news/story/10930/7680054,要聞
1,為確保幻象戰機零附件及戰力無虞政府電子採購網今天公布決標資訊國防部國防採購室駐歐採購組與法國...,https://udn.com/news/story/10930/7686167,要聞
2,國防部上午發布中共解放軍台海周邊海空域動態情資顯示昨日有枚中共空飄氣球逾越海峽中線其中兩枚穿...,https://udn.com/news/story/10930/7695181,要聞
3,針對中共空飄氣球近期密集飄越海峽中線部分飛越本島上空國防部情次室情報次長室情研中心情報官黃明...,https://udn.com/news/story/10930/7695498,要聞
4,國防部作計室聯合作戰計畫處副處長王家駿上校今天表示中共對我施放空飄氣球國軍會保持全程監控視其...,https://udn.com/news/story/10930/7695730,要聞


# **Encoding with BERT**

In [None]:
raw_news["embeddings"] = raw_news.artcontent.apply(lambda x: bert_ch.encode(x))
raw_news.head(3)

Unnamed: 0,artcontent,artURl,artCatagory,embeddings
0,軍事粉專記錄兩艘中國籍拖船於跨年期間闖台灣水域其中寧海拖今天凌晨位置在鵝鑾鼻東面不到浬海巡署...,https://udn.com/news/story/10930/7680054,要聞,"[0.56159174, -0.0055728243, -0.6307354, 0.0209..."
1,為確保幻象戰機零附件及戰力無虞政府電子採購網今天公布決標資訊國防部國防採購室駐歐採購組與法國...,https://udn.com/news/story/10930/7686167,要聞,"[0.42235366, 0.018447105, -0.4759608, 0.065503..."
2,國防部上午發布中共解放軍台海周邊海空域動態情資顯示昨日有枚中共空飄氣球逾越海峽中線其中兩枚穿...,https://udn.com/news/story/10930/7695181,要聞,"[0.41948006, 0.007725761, -0.53348315, 0.07277..."
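The load message earlier reports that sentence-transformers wrapped `bert-base-chinese` with MEAN pooling, so each article's vector is an average of its token vectors. A minimal sketch of that idea with toy numbers follows; `mean_pool` is a hypothetical helper, not the library's implementation, and the 3-dimensional vectors stand in for BERT's 768 dimensions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]

# Toy token vectors (invented values); the last row is a padding position
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [9.0, 9.0, 9.0],
]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 2.0, 2.0]
```

Because BERT models of this kind accept a limited number of tokens (512 for `bert-base-chinese`), long articles are presumably truncated before pooling, so the embedding may reflect only the opening part of each article.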


In [None]:
import numpy as np
from ast import literal_eval

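# **Training workflow for the classifier**
Use sklearn's train_test_split function to randomly split `data` (a copy of `raw_news`) into training and test sets at a 7:3 ratio, setting random_state so that the split is identical on every run. `y_train` and `y_test` are the prediction targets (the section labels) for the training and test data, respectively.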

In [None]:
data = raw_news.copy()

X = data["embeddings"].apply(pd.Series)
y = data["artCatagory"]

# Split the whole dataset 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=777
)

print(X_train.head())
print(y_train.head())

           0         1         2         3         4         5         6    \
5521  0.486500  0.249898 -0.453145 -0.187781  0.055304 -0.273429  0.095390   
235   0.313146  0.027624 -0.493724  0.065583  0.040519  0.117715  0.024672   
2391  0.417224 -0.151663 -0.412026  0.306600  0.061421 -0.065624 -0.070339   
4343  0.313999  0.035254 -0.465185  0.171159 -0.106247 -0.167248  0.005695   
3706  0.469487  0.045671 -0.437358  0.276495 -0.114427 -0.126056 -0.092725   

           7         8         9    ...       758       759       760  \
5521 -0.247338 -0.212979 -0.468973  ...  0.112426  0.153317  0.141328   
235   0.254579 -0.310649 -0.231608  ...  0.066327 -0.278821  0.171456   
2391  0.186531 -0.120282 -0.261008  ...  0.211049 -0.013568  0.384136   
4343 -0.135370 -0.230250 -0.409921  ...  0.303683 -0.054738  0.191702   
3706 -0.122082 -0.135393 -0.221340  ...  0.233908 -0.098839  0.361630   

           761       762       763       764       765       766       767  
5521 -0.050128 
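The section counts printed earlier are imbalanced (要聞 has 505 of 6,150 articles before cleaning), and a plain random split does not guarantee that each section keeps the same share in the training and test sets. The sketch below shows the idea of a seeded, stratified 70/30 split in plain Python. `stratified_split` is a hypothetical helper written for illustration, not sklearn's algorithm, and the label counts are invented. In sklearn itself, `train_test_split` accepts a `stratify` argument for this purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=777):
    """Return (train_idx, test_idx) so each label keeps roughly the same test share."""
    rng = random.Random(seed)  # fixed seed, so the split is the same on every run
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_label):  # sorted for a deterministic processing order
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a similar kind of imbalance (counts are invented)
labels = ["產經"] * 50 + ["股市"] * 40 + ["要聞"] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))               # 70 30
print(sum(labels[i] == "要聞" for i in test_idx))  # 3
```

Calling `stratified_split(labels)` again returns the same indices, which is the role `random_state=777` plays in the notebook's split.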

**Build the classifier**

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
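The warning above means the solver hit its iteration limit before converging (sklearn's default `max_iter` for `LogisticRegression` is 100). The message itself suggests two remedies: raise `max_iter` or scale the features. The sketch below applies both to synthetic data from `make_classification`. The data, the value `max_iter=1000`, and the pipeline are assumptions for illustration and were not run on the news embeddings, so the scores reported later could change if this were adopted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding matrix: 300 samples, 20 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=777
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=777
)

# Standardize each feature, then fit with a higher iteration limit
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print(pipe[-1].n_iter_)        # iterations the solver actually used
print(pipe.score(X_te, y_te))  # accuracy on the held-out 30%
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into training.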


In [None]:
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)
print(y_pred[:10])

['產經' '產經' '產經' '產經' '產經' '產經' '股市' '股市' '產經' '股市']


In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          產經       0.87      0.88      0.88       897
          股市       0.88      0.88      0.88       795
          要聞       0.93      0.87      0.90       153

    accuracy                           0.88      1845
   macro avg       0.89      0.88      0.88      1845
weighted avg       0.88      0.88      0.88      1845
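To make the report easier to read, the sketch below shows how per-class precision and recall are computed from two label lists, and then reproduces the `macro avg` and `weighted avg` precision rows from the per-class values printed above. `per_class_precision_recall` is a hypothetical helper and the toy labels are invented. The averages use the rounded two-decimal values from the report, so they are a consistency check rather than an exact recomputation.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label lists."""
    true_pos = Counter()
    predicted = Counter(y_pred)  # how often each label was predicted
    actual = Counter(y_true)     # how often each label truly occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            true_pos[t] += 1
    scores = {}
    for label in sorted(actual):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label]
        scores[label] = (precision, recall)
    return scores

# Toy labels, invented for illustration only
toy_true = ["產經", "產經", "股市", "股市", "要聞"]
toy_pred = ["產經", "股市", "股市", "股市", "要聞"]
toy_scores = per_class_precision_recall(toy_true, toy_pred)
print(toy_scores["產經"])  # (1.0, 0.5)

# Per-class precision and support copied from the report above
report_precision = {"產經": 0.87, "股市": 0.88, "要聞": 0.93}
support = {"產經": 897, "股市": 795, "要聞": 153}

total = sum(support.values())
# Macro average: every class counts equally
macro = sum(report_precision.values()) / len(report_precision)
# Weighted average: every class counts in proportion to its support
weighted = sum(report_precision[k] * support[k] for k in support) / total

print(total, round(macro, 2), round(weighted, 2))  # 1845 0.89 0.88
```

The macro average is slightly higher than the weighted one because 要聞, the smallest class, has the highest precision and counts as much as the larger classes in the macro figure.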



## Comparison with the Week 7 results
Precision for 產經 and 股市 rose by 0.01<br>
Precision for 要聞 fell by 0.03<br>
Overall, the <font color="red">results differ little</font> from what we obtained in Week 7