# News Preprocessing Using Ckiplab NLP Package

Segment multiple news articles, compute top-keyword statistics and summaries, then organize and save the results.

# All-in-one

Run the whole pipeline in a single cell.

In [2]:
%%time
import pandas as pd
import numpy
from collections import Counter
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

df = pd.read_csv('cna_category_news.csv', sep='|')

# CKIP Lab word segmentation (Academia Sinica)
# Initialize drivers
# It takes time to download ckiplab models

# The default is model="bert-base"
# ws = CkipWordSegmenter() 
# pos = CkipPosTagger()
# ner = CkipNerChunker()

# model="albert-tiny" is a small model: faster segmentation at some cost in accuracy
ws = CkipWordSegmenter(model="albert-tiny") 
pos = CkipPosTagger(model="albert-tiny")
ner = CkipNerChunker(model="albert-tiny")


## Word Segmentation
tokens = ws(df.content)

## POS
tokens_pos = pos(tokens)

## Word-POS pairs (each word with its POS tag)
word_pos_pair = [list(zip(w, p)) for w, p in zip(tokens, tokens_pos)]

## NER (named entity recognition)
entity_list = ner(df.content)

# Remove stop words and filter using POS tag (tokens_v2)
#with open('stops_chinese_traditional.txt', 'r', encoding='utf8') as f:
#    stops = f.read().split('\n')

# Filter: keep words of two or more characters
# whose POS tag is listed in allowPOS
allowPOS = ['Na', 'Nb', 'Nc', 'VC']

tokens_v2 = []
for wp in word_pos_pair:
    tokens_v2.append([w for w, p in wp if (len(w) >= 2) and p in allowPOS])

# Insert tokens into the dataframe (add segmentation columns)
df['tokens'] = tokens
df['tokens_v2'] = tokens_v2
df['entities'] = entity_list
df['token_pos'] = word_pos_pair

# Calculate word frequency (counts)


def word_frequency(wp_pair):
    filtered_words = []
    # `tag` avoids shadowing the global `pos` tagger
    for word, tag in wp_pair:
        if tag in allowPOS and len(word) >= 2:
            filtered_words.append(word)
        #print('%s %s' % (word, tag))
    counter = Counter(filtered_words)
    return counter.most_common(200)


keyfreqs = []
for wp in word_pos_pair:
    topwords = word_frequency(wp)
    keyfreqs.append(topwords)

df['top_key_freq'] = keyfreqs

# Summary and sentiment score (placeholders for now)
summary = []
sentiment = []
for text in df.content:
    summary.append("暫無")
    sentiment.append("暫無")

df['summary'] = summary
df['sentiment'] = sentiment

# Rearrange the column order for readability
df = df[[
    'item_id', 'date','category', 'title', 'content', 'sentiment', 'summary',
    'top_key_freq', 'tokens', 'tokens_v2', 'entities', 'token_pos', 'link',
    'photo_link'
]]

# Save data to disk
df.to_csv('cna_news_preprocessed.csv', sep='|', index=False)

## Read it back to check
#df = pd.read_csv('cna_news_preprocessed.csv', sep='|')
#df.head(1)

print("Tokenize OK!")

Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1126.79it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  1.62it/s]
Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1331.95it/s]
Inference: 100%|██████████| 3/3 [00:12<00:00,  4.02s/it]
Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1140.32it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  1.76it/s]

Tokenize OK!
CPU times: total: 1min 15s
Wall time: 26.6 s





In [3]:
## Read it back to check
df = pd.read_csv('cna_news_preprocessed.csv', sep='|')
df.head(1)


Unnamed: 0,item_id,date,category,title,content,sentiment,summary,top_key_freq,tokens,tokens_v2,entities,token_pos,link,photo_link
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,暫無,暫無,"[('活動', 20), ('尋訪', 7), ('燈塔', 5), ('時間', 5), ...","['[', ""'"", '由', '龍成', '網路', '代理', '營運', '的', '...","['龍成', '網路', '代理', '型態', '戰術', '攻防', '明日方舟', '...","[NerToken(word='明日方舟', ner='WORK_OF_ART', idx=...","[('[', 'PARENTHESISCATEGORY'), (""'"", 'FW'), ('...",https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...


# Demonstration Step by Step 

# Read data from file

In [4]:
import pandas as pd
import numpy

In [5]:
df = pd.read_csv('cna_category_news.csv', sep='|')

In [6]:
df.shape

(12, 7)

In [7]:
df.head()

Unnamed: 0,item_id,date,category,title,content,link,photo_link
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...
1,game-news_2025-03-06_2,2025-03-06,遊戲新聞,《GTAV》PC強化版正式推出再度引起風潮！與原版遊戲一起成Steam玩家同上前10名遊戲寫成就！,['《俠盜獵車手5》（GTAV）對於許多玩家來說可說是這個世代的經典之作了，在推出 12 年...,https://tw.news.yahoo.com/%E3%80%8Agtav%E3%80%...,https://s.yimg.com/os/creatr-uploaded-images/2...
2,game-news_2025-03-06_3,2025-03-06,遊戲新聞,《魔物獵人 荒野》真的太簡單？玩家投票僅3%認為遊戲有難度,['卡普空（Capcom）的新作《魔物獵人 荒野》（Monster Hunter Wilds...,https://tw.news.yahoo.com/%E3%80%8A%E9%AD%94%E...,https://s.yimg.com/os/creatr-uploaded-images/2...
3,game-news_2025-03-06_4,2025-03-06,遊戲新聞,《殺戮空間3》變英雄射擊遊戲？玩家分享封測心得：玩具槍手感、角色綁職業、最佳化不好,['Tripwire Interactive 睽違 7 年的《殺戮空間》系列第三部續作《殺戮...,https://tw.news.yahoo.com/%E3%80%8A%E6%AE%BA%E...,https://s.yimg.com/os/creatr-uploaded-images/2...
4,game-tips_2025-03-03_1,2025-03-03,遊戲攻略,《魔物獵人 荒野》超實用技巧！1塊生肉烤出12個全熟肉的方法,['卡普空（Capcom）遊戲大作《魔物獵人 荒野》在 2 月 28 日上市後創造熱潮，光是...,https://tw.news.yahoo.com/%E3%80%8A%E9%AD%94%E...,https://s.yimg.com/os/creatr-uploaded-images/2...


In [8]:
df.content[0:5]

0    ['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...
1    ['《俠盜獵車手5》（GTAV）對於許多玩家來說可說是這個世代的經典之作了，在推出 12 年...
2    ['卡普空（Capcom）的新作《魔物獵人 荒野》（Monster Hunter Wilds...
3    ['Tripwire Interactive 睽違 7 年的《殺戮空間》系列第三部續作《殺戮...
4    ['卡普空（Capcom）遊戲大作《魔物獵人 荒野》在 2 月 28 日上市後創造熱潮，光是...
Name: content, dtype: object
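Each `content` value above appears to be a stringified Python list of paragraphs (it starts with `['` and contains `', '` separators). A minimal sketch of flattening such a value into plain text before segmentation follows. The helper name `flatten_content` and the choice to join paragraphs with a newline are assumptions, not part of the original notebook.

```python
import ast


def flatten_content(raw):
    """Turn a stringified list of paragraphs into one plain-text string.

    Hypothetical helper: values that are not a list literal are returned unchanged.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # not a Python literal: keep the text as it is
    if isinstance(parsed, (list, tuple)):
        # Join the paragraphs; the newline separator is an assumption
        return "\n".join(str(p) for p in parsed)
    return raw


sample = "['活動時間：3月6日~3月26日', '本次聯名活動中']"
clean = flatten_content(sample)
print(clean)
```

If this matches the data, something like `df['content'] = df['content'].map(flatten_content)` before segmentation should keep the list brackets and quote characters out of the token stream.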

# CKIP Lab word segmentation (Academia Sinica)

# Load tokenization models

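    We provide three levels (1–3) of drivers. Level 1 is the fastest, and level 3 (default) is the most accurate.

The cells below select the model by name rather than by level: model="albert-tiny" is small and fast at some cost in accuracy, while the default is model="bert-base".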




In [9]:
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

In [10]:
# model="albert-tiny" is a small model: faster segmentation at some cost in accuracy
ws = CkipWordSegmenter(model="albert-tiny") 
pos = CkipPosTagger(model="albert-tiny")
ner = CkipNerChunker(model="albert-tiny")

## Word Segmentation

In [11]:
%%time
# It takes time.
tokens=ws(df.content)

Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1115.26it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]

CPU times: total: 3.38 s
Wall time: 621 ms





In [12]:
len(tokens)

12

In [13]:
len(tokens[0])

729

In [14]:
tokens[0]

['[',
 "'",
 '由',
 '龍成',
 '網路',
 '代理',
 '營運',
 '的',
 '新',
 '型態',
 '戰術',
 '攻防',
 ' RPG',
 '《',
 '明日方舟',
 '》',
 '今',
 '（6）',
 '日',
 '宣布',
 '與',
 '知名',
 '連鎖',
 '豬排店',
 '「',
 '吉豚屋',
 '」',
 '聯名',
 '推出',
 '合作',
 '套餐',
 '，',
 '自',
 '3月6日',
 '至',
 '3月',
 '26日',
 '止限',
 '時',
 '販售',
 '。',
 '此外',
 '，',
 '主題曲',
 ' EP14',
 '「',
 '慈悲',
 '燈塔',
 '」',
 '活動',
 '限時',
 '登場',
 '，',
 '「',
 '何以為',
 '我',
 '」',
 '限定',
 '尋訪',
 '開啟',
 '，',
 '六星',
 '幹員',
 '「',
 '維什戴爾',
 '(',
 '限定',
 ')',
 '」',
 '、',
 '「',
 '邏各斯',
 '」',
 '同步',
 '登場',
 '。',
 "', '",
 '活動',
 '時間',
 '：',
 '3月',
 '6日',
 '~',
 '3月',
 '26日',
 "', '",
 '本',
 '次',
 '聯名',
 '活動',
 '中',
 '，',
 '以',
 '人氣',
 '招牌',
 '『',
 'Katsud',
 'on』',
 '、',
 '『',
 '鹹',
 '甜',
 '交',
 '織',
 '日式',
 '壽喜',
 '牛',
 '』',
 '、',
 '『',
 '職人',
 '南蠻',
 '炸雞',
 '』',
 '，',
 '三重',
 '組合',
 '讓',
 '你',
 '一',
 '次',
 '享有',
 '三',
 '種',
 '不同',
 '的',
 '滋味',
 '的',
 '「',
 '魂靈之影',
 '」',
 '套餐',
 '，',
 '以及',
 '由',
 '8',
 '0g',
 '酥',
 '脆里肌',
 '豬排',
 '，',
 '淋上',
 '香濃',
 '的',
 '日本',
 '進口',
 '咖哩醬',
 '，',
 '

In [15]:
tokens[0][0]

'['

## POS

In [16]:
%%time
tokens_pos = pos(tokens)

Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1328.43it/s]
Inference: 100%|██████████| 3/3 [00:12<00:00,  4.00s/it]

CPU times: total: 1min 5s
Wall time: 12.1 s





In [17]:
len(tokens_pos)

12

In [18]:
len(tokens_pos[0])

729

In [19]:
tokens_pos[0]

['PARENTHESISCATEGORY',
 'FW',
 'P',
 'Nb',
 'Na',
 'VC',
 'VA',
 'DE',
 'VH',
 'Na',
 'Na',
 'VC',
 'FW',
 'PARENTHESISCATEGORY',
 'Nb',
 'PARENTHESISCATEGORY',
 'Nd',
 'PARENTHESISCATEGORY',
 'Nd',
 'VE',
 'Caa',
 'VH',
 'A',
 'Nc',
 'PARENTHESISCATEGORY',
 'Nc',
 'PARENTHESISCATEGORY',
 'D',
 'VC',
 'VH',
 'Na',
 'COMMACATEGORY',
 'P',
 'Nd',
 'P',
 'Nd',
 'Nd',
 'Ng',
 'Ng',
 'VD',
 'PERIODCATEGORY',
 'Cbb',
 'COMMACATEGORY',
 'Na',
 'DASHCATEGORY',
 'PARENTHESISCATEGORY',
 'VH',
 'Na',
 'PARENTHESISCATEGORY',
 'Na',
 'D',
 'VA',
 'COMMACATEGORY',
 'PARENTHESISCATEGORY',
 'D',
 'Nh',
 'PARENTHESISCATEGORY',
 'VK',
 'VC',
 'VC',
 'COMMACATEGORY',
 'Nb',
 'Na',
 'PARENTHESISCATEGORY',
 'Nb',
 'PARENTHESISCATEGORY',
 'VK',
 'PARENTHESISCATEGORY',
 'PARENTHESISCATEGORY',
 'PAUSECATEGORY',
 'PARENTHESISCATEGORY',
 'Na',
 'PARENTHESISCATEGORY',
 'VH',
 'VA',
 'PERIODCATEGORY',
 'FW',
 'Na',
 'Na',
 'COLONCATEGORY',
 'Nd',
 'Nd',
 'FW',
 'Nd',
 'Nd',
 'FW',
 'Nes',
 'Nf',
 'A',
 'Na',
 'N

## Word-POS pairs (each word with its POS tag)

In [20]:
word_pos_pair = [list(zip(w,p)) for w, p in zip(tokens, tokens_pos)]

In [21]:
len(word_pos_pair)

12

In [22]:
word_pos_pair[0][0]

('[', 'PARENTHESISCATEGORY')
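The pairing step relies on the segmenter and the tagger returning lists of equal length for every article (729 and 729 for the first one above). The sketch below shows the same `zip` pairing on toy data with an explicit length check. The function name `pair_words_with_tags` is hypothetical, and the sample words and tags are copied from the output above.

```python
def pair_words_with_tags(words_per_doc, tags_per_doc):
    """Pair each word with its POS tag, article by article."""
    pairs = []
    for words, tags in zip(words_per_doc, tags_per_doc):
        if len(words) != len(tags):
            # zip() would silently truncate, so fail loudly instead
            raise ValueError("word and tag lists differ in length")
        pairs.append(list(zip(words, tags)))
    return pairs


toy_words = [["龍成", "網路", "代理"], ["型態", "戰術"]]
toy_tags = [["Nb", "Na", "VC"], ["Na", "Na"]]
toy_pairs = pair_words_with_tags(toy_words, toy_tags)
print(toy_pairs[0])
```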

In [23]:
word_pos_pair

[[('[', 'PARENTHESISCATEGORY'),
  ("'", 'FW'),
  ('由', 'P'),
  ('龍成', 'Nb'),
  ('網路', 'Na'),
  ('代理', 'VC'),
  ('營運', 'VA'),
  ('的', 'DE'),
  ('新', 'VH'),
  ('型態', 'Na'),
  ('戰術', 'Na'),
  ('攻防', 'VC'),
  (' RPG', 'FW'),
  ('《', 'PARENTHESISCATEGORY'),
  ('明日方舟', 'Nb'),
  ('》', 'PARENTHESISCATEGORY'),
  ('今', 'Nd'),
  ('（6）', 'PARENTHESISCATEGORY'),
  ('日', 'Nd'),
  ('宣布', 'VE'),
  ('與', 'Caa'),
  ('知名', 'VH'),
  ('連鎖', 'A'),
  ('豬排店', 'Nc'),
  ('「', 'PARENTHESISCATEGORY'),
  ('吉豚屋', 'Nc'),
  ('」', 'PARENTHESISCATEGORY'),
  ('聯名', 'D'),
  ('推出', 'VC'),
  ('合作', 'VH'),
  ('套餐', 'Na'),
  ('，', 'COMMACATEGORY'),
  ('自', 'P'),
  ('3月6日', 'Nd'),
  ('至', 'P'),
  ('3月', 'Nd'),
  ('26日', 'Nd'),
  ('止限', 'Ng'),
  ('時', 'Ng'),
  ('販售', 'VD'),
  ('。', 'PERIODCATEGORY'),
  ('此外', 'Cbb'),
  ('，', 'COMMACATEGORY'),
  ('主題曲', 'Na'),
  (' EP14', 'DASHCATEGORY'),
  ('「', 'PARENTHESISCATEGORY'),
  ('慈悲', 'VH'),
  ('燈塔', 'Na'),
  ('」', 'PARENTHESISCATEGORY'),
  ('活動', 'Na'),
  ('限時', 'D'),
  ('登場', 'VA')

## NER (named entity recognition)

In [24]:
%%time
entity_list = ner(df.content)

Tokenization: 100%|██████████| 12/12 [00:00<00:00, 1070.20it/s]
Inference: 100%|██████████| 1/1 [00:00<00:00,  1.47it/s]

CPU times: total: 3.5 s
Wall time: 766 ms





In [25]:
entity_list[0]

[NerToken(word='明日方舟', ner='WORK_OF_ART', idx=(24, 28)),
 NerToken(word='80g', ner='CARDINAL', idx=(235, 238)),
 NerToken(word='日本', ner='GPE', idx=(250, 252)),
 NerToken(word='明日方舟', ner='PERSON', idx=(324, 328)),
 NerToken(word='台北', ner='GPE', idx=(396, 398)),
 NerToken(word='兩', ner='CARDINAL', idx=(400, 401)),
 NerToken(word='吉豚屋館前店', ner='FAC', idx=(405, 411)),
 NerToken(word='北館', ner='FAC', idx=(438, 440)),
 NerToken(word='台北市', ner='GPE', idx=(519, 522)),
 NerToken(word='中正區信陽街', ner='LOC', idx=(522, 528)),
 NerToken(word="27號'", ner='QUANTITY', idx=(528, 532)),
 NerToken(word='第一', ner='ORDINAL', idx=(557, 559)),
 NerToken(word='40', ner='CARDINAL', idx=(581, 583)),
 NerToken(word='第二', ner='ORDINAL', idx=(588, 590)),
 NerToken(word='3', ner='CARDINAL', idx=(594, 595)),
 NerToken(word='40', ner='CARDINAL', idx=(612, 614)),
 NerToken(word='一日', ner='DATE', idx=(644, 646)),
 NerToken(word='3月20日', ner='DATE', idx=(772, 777)),
 NerToken(word='慈悲燈塔', ner='FAC', idx=(906, 910)),
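Each entity above carries a word, an entity type, and a character span. A small sketch of filtering entities by type and converting them to plain tuples follows. The `NerToken` defined here is a stand-in namedtuple with the same three fields shown in the output, so the example runs without loading a model, and `entities_of_type` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the library's token type; field names follow the output above
NerToken = namedtuple("NerToken", ["word", "ner", "idx"])

sample_entities = [
    NerToken("明日方舟", "WORK_OF_ART", (24, 28)),
    NerToken("日本", "GPE", (250, 252)),
    NerToken("台北", "GPE", (396, 398)),
    NerToken("中正區信陽街", "LOC", (522, 528)),
]


def entities_of_type(entities, wanted):
    """Return the words whose entity type is in `wanted`."""
    return [e.word for e in entities if e.ner in wanted]


places = entities_of_type(sample_entities, {"GPE", "LOC", "FAC"})
# Plain tuples contain only literals, which makes them easier to parse
# back after the data has been written to CSV
plain = [(e.word, e.ner, e.idx) for e in sample_entities]
print(places)
```

Storing `plain` rather than the token objects may be worth considering, since the saved CSV otherwise contains text such as `NerToken(word=...)` that `ast.literal_eval` cannot parse.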
 

# Remove stop words and filter using POS (tokens_v2)

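Filter tokens by POS tag and length. Stop-word removal is currently disabled: the loading code below is commented out.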

In [None]:
# with open('stops_chinese_traditional.txt', 'r', encoding='utf8') as f:
#     stops = f.read().split('\n') 
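If stop-word removal is switched on later, it could be combined with the POS and length filter in one pass. The sketch below assumes a stop-word file with one word per line. The helper names `load_stop_words` and `filter_tokens` and the two sample stop words are hypothetical.

```python
def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(word_pos, allow_pos, stop_words, min_len=2):
    """Keep words that are long enough, have an allowed tag, and are not stop words."""
    return [
        w for w, p in word_pos
        if len(w) >= min_len and p in allow_pos and w not in stop_words
    ]


# Hypothetical stop words; a real list would come from the stop-word file
stop_words = load_stop_words(["活動\n", "套餐\n", "\n"])
sample_pairs = [("龍成", "Nb"), ("網路", "Na"), ("的", "DE"),
                ("活動", "Na"), ("套餐", "Na")]
kept = filter_tokens(sample_pairs, {"Na", "Nb", "Nc"}, stop_words)
print(kept)
```

With a real file, passing the open file object straight to `load_stop_words` should work, because iterating a text file yields its lines.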

In [26]:
# Filter: words of two or more characters with one of these POS tags
allowPOS=['Na','Nb','Nc']

tokens_v2 =[]
for wp in word_pos_pair:
    tokens_v2.append([w for w,p in wp if (len(w) >= 2) and p in allowPOS])

In [27]:
tokens_v2[0]

['龍成',
 '網路',
 '型態',
 '戰術',
 '明日方舟',
 '豬排店',
 '吉豚屋',
 '套餐',
 '主題曲',
 '燈塔',
 '活動',
 '六星',
 '幹員',
 '維什戴爾',
 '邏各斯',
 '活動',
 '時間',
 '活動',
 '人氣',
 '招牌',
 '壽喜',
 '職人',
 '三重',
 '滋味',
 '魂靈之影',
 '套餐',
 '豬排',
 '日本',
 '咖哩醬',
 '麗菜絲',
 '肉味',
 '組合',
 '願景',
 '方舟精美',
 '周邊',
 '禮品',
 '虛寶卡',
 '數量',
 '玩家',
 '吉豚',
 '台北',
 '門市',
 '吉豚屋館前店',
 '吉豚屋',
 '信義',
 '旗艦店',
 '主題',
 '形象店',
 '台北',
 '店舉',
 '活動',
 '活動',
 '周邊',
 '活動',
 '活動',
 '地點',
 '吉豚屋',
 '台北',
 '主題',
 '店址',
 '台北市',
 '中正區',
 '信陽街',
 '活動',
 '時間',
 '場次',
 '場次',
 '場次',
 '場次',
 '人氣',
 '穎兒',
 '店長',
 '玩家',
 '時光',
 '店長',
 '活動',
 '名單',
 '官方',
 '訊息請',
 '官方',
 '公告',
 '內容',
 '主題曲',
 '活動',
 '時間',
 '活動',
 '期間',
 '題曲',
 '燈塔',
 '門檻',
 '玩家',
 '主線',
 '燈塔',
 '抑制劑',
 '戰略',
 '倉庫',
 '物品',
 '塵封密室',
 '活動',
 '活動',
 '玩家',
 '燈塔',
 '關卡',
 '玩法',
 '任務',
 '塵封密室',
 '留言',
 '星幹員',
 '魔王',
 '魔王',
 '活動',
 '獎勵',
 '活動',
 '時間',
 '月20日',
 '活動',
 '期間',
 '慶典',
 '幹員',
 '維什戴爾',
 '邏各斯',
 '繆爾賽思',
 '緘默',
 '德克薩斯',
 '幽靈鯊',
 '幹員',
 '機率',
 '機會',
 '幹員',
 '羅德島',
 '戰力',
 '慶典',
 '每日',
 '活動',
 '時間',
 '活動',
 '期

# Insert tokens into the dataframe (add segmentation columns)

In [28]:
df['tokens'] = tokens

In [29]:
df['tokens_v2'] = tokens_v2

In [30]:
df['entities'] = entity_list

In [31]:
df['token_pos'] = word_pos_pair

In [32]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,link,photo_link,tokens,tokens_v2,entities,token_pos
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...,"[[, ', 由, 龍成, 網路, 代理, 營運, 的, 新, 型態, 戰術, 攻防, R...","[龍成, 網路, 型態, 戰術, 明日方舟, 豬排店, 吉豚屋, 套餐, 主題曲, 燈塔, ...","[(明日方舟, WORK_OF_ART, (24, 28)), (80g, CARDINAL...","[([, PARENTHESISCATEGORY), (', FW), (由, P), (龍..."


# Calculate word frequency (counts)

In [33]:
# allowPOS filter: words of two or more characters with one of these POS tags
allowPOS=['Na','Nb','Nc','VA','VAC','VB','VC']

from collections import Counter
def word_frequency( wp_pair ):
    filtered_words =[]
    # `tag` avoids shadowing the global `pos` tagger
    for word, tag in wp_pair:
        if tag in allowPOS and len(word) >= 2:
            filtered_words.append(word)
        #print('%s %s' % (word, tag))
    counter = Counter( filtered_words)
    return counter.most_common(20)

In [34]:
word_pos_pair[0]

[('[', 'PARENTHESISCATEGORY'),
 ("'", 'FW'),
 ('由', 'P'),
 ('龍成', 'Nb'),
 ('網路', 'Na'),
 ('代理', 'VC'),
 ('營運', 'VA'),
 ('的', 'DE'),
 ('新', 'VH'),
 ('型態', 'Na'),
 ('戰術', 'Na'),
 ('攻防', 'VC'),
 (' RPG', 'FW'),
 ('《', 'PARENTHESISCATEGORY'),
 ('明日方舟', 'Nb'),
 ('》', 'PARENTHESISCATEGORY'),
 ('今', 'Nd'),
 ('（6）', 'PARENTHESISCATEGORY'),
 ('日', 'Nd'),
 ('宣布', 'VE'),
 ('與', 'Caa'),
 ('知名', 'VH'),
 ('連鎖', 'A'),
 ('豬排店', 'Nc'),
 ('「', 'PARENTHESISCATEGORY'),
 ('吉豚屋', 'Nc'),
 ('」', 'PARENTHESISCATEGORY'),
 ('聯名', 'D'),
 ('推出', 'VC'),
 ('合作', 'VH'),
 ('套餐', 'Na'),
 ('，', 'COMMACATEGORY'),
 ('自', 'P'),
 ('3月6日', 'Nd'),
 ('至', 'P'),
 ('3月', 'Nd'),
 ('26日', 'Nd'),
 ('止限', 'Ng'),
 ('時', 'Ng'),
 ('販售', 'VD'),
 ('。', 'PERIODCATEGORY'),
 ('此外', 'Cbb'),
 ('，', 'COMMACATEGORY'),
 ('主題曲', 'Na'),
 (' EP14', 'DASHCATEGORY'),
 ('「', 'PARENTHESISCATEGORY'),
 ('慈悲', 'VH'),
 ('燈塔', 'Na'),
 ('」', 'PARENTHESISCATEGORY'),
 ('活動', 'Na'),
 ('限時', 'D'),
 ('登場', 'VA'),
 ('，', 'COMMACATEGORY'),
 ('「', 'PARENTHESISCATEGO

In [35]:
word_frequency(word_pos_pair[3])

[('殺戮', 7),
 ('空間', 7),
 ('測試', 7),
 ('遊戲', 6),
 ('玩家', 5),
 ('推出', 4),
 ('開發商', 2),
 ('角色', 2),
 ('綁定', 2),
 ('英雄', 2),
 ('修改', 2),
 ('系列', 1),
 ('評價', 1),
 ('批評', 1),
 ('體驗', 1),
 ('取消', 1),
 ('預購', 1),
 ('修正', 1),
 ('問題', 1),
 ('系統', 1)]
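The ranking above comes from `Counter.most_common`, which sorts by count in descending order and keeps first-seen order among ties. A toy run makes the filter and the ordering easy to check by hand. The function name `top_words` and the sample pairs are illustrative only.

```python
from collections import Counter


def top_words(word_pos, allow_pos, n=20, min_len=2):
    """Count words that pass the POS and length filter; return the n most common."""
    counter = Counter(
        word for word, tag in word_pos
        if tag in allow_pos and len(word) >= min_len
    )
    return counter.most_common(n)


toy_pairs = [("活動", "Na"), ("燈塔", "Na"), ("活動", "Na"), ("的", "DE"),
             ("推出", "VC"), ("活動", "Na"), ("燈塔", "Na")]
top = top_words(toy_pairs, {"Na", "VC"}, n=2)
print(top)
```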

In [None]:
## Compute word frequencies article by article

In [36]:
%%time
keyfreqs =[]
for wp in word_pos_pair:
    topwords = word_frequency(wp)
    keyfreqs.append(topwords)  

CPU times: total: 0 ns
Wall time: 998 μs


In [37]:
keyfreqs[0:1]

[[('活動', 20),
  ('尋訪', 7),
  ('燈塔', 5),
  ('時間', 5),
  ('玩家', 5),
  ('開啟', 4),
  ('幹員', 4),
  ('場次', 4),
  ('吉豚屋', 3),
  ('推出', 3),
  ('登場', 3),
  ('台北', 3),
  ('官方', 3),
  ('內容', 3),
  ('期間', 3),
  ('提升', 3),
  ('套餐', 2),
  ('主題曲', 2),
  ('維什戴爾', 2),
  ('邏各斯', 2)]]

In [38]:
df['top_key_freq'] = keyfreqs

In [39]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,link,photo_link,tokens,tokens_v2,entities,token_pos,top_key_freq
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...,"[[, ', 由, 龍成, 網路, 代理, 營運, 的, 新, 型態, 戰術, 攻防, R...","[龍成, 網路, 型態, 戰術, 明日方舟, 豬排店, 吉豚屋, 套餐, 主題曲, 燈塔, ...","[(明日方舟, WORK_OF_ART, (24, 28)), (80g, CARDINAL...","[([, PARENTHESISCATEGORY), (', FW), (由, P), (龍...","[(活動, 20), (尋訪, 7), (燈塔, 5), (時間, 5), (玩家, 5),..."


In [40]:
df.iloc[0].top_key_freq

[('活動', 20),
 ('尋訪', 7),
 ('燈塔', 5),
 ('時間', 5),
 ('玩家', 5),
 ('開啟', 4),
 ('幹員', 4),
 ('場次', 4),
 ('吉豚屋', 3),
 ('推出', 3),
 ('登場', 3),
 ('台北', 3),
 ('官方', 3),
 ('內容', 3),
 ('期間', 3),
 ('提升', 3),
 ('套餐', 2),
 ('主題曲', 2),
 ('維什戴爾', 2),
 ('邏各斯', 2)]
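The per-article lists can also be merged into corpus-wide keyword counts, which is one way to find the hot keywords across many articles. The sketch below sums the `(word, count)` pairs with a `Counter`. The function name `corpus_top_words` is hypothetical, and the sample counts are a subset of the two outputs shown earlier. Because each article list is truncated to its top entries, the totals are approximate.

```python
from collections import Counter


def corpus_top_words(per_article_freqs, n=10):
    """Sum per-article (word, count) lists and return the n most common words."""
    total = Counter()
    for freqs in per_article_freqs:
        total.update(dict(freqs))  # updating from a mapping adds the counts
    return total.most_common(n)


article_a = [("活動", 20), ("玩家", 5), ("推出", 3)]
article_b = [("遊戲", 6), ("玩家", 5), ("推出", 4)]
corpus_top = corpus_top_words([article_a, article_b], n=3)
print(corpus_top)
```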

# Summary and sentiment score

The summary is not used in the later application yet, but the columns are prepared for future use.

SnowNLP can produce both a summary and a sentiment score. In this notebook the call is not wired in: both columns are filled with the placeholder string "暫無" ("not yet available").

In [None]:
# pip install snownlp

In [41]:
%%time
summary=[]
sentiment=[]
for text in df.content: # process piece by piece
    summary.append("暫無")  
    sentiment.append("暫無")

CPU times: total: 0 ns
Wall time: 0 ns
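A sketch of how the placeholder loop could be swapped for real values follows. It assumes the `snownlp` package is installed and uses its `SnowNLP(text).summary(n)` and `.sentiments` interface. The wrapper names `snownlp_analyzer` and `summarize_all` are hypothetical, and the quality of SnowNLP on Traditional Chinese news text is not verified here.

```python
PLACEHOLDER = "暫無"  # the same placeholder string the notebook stores


def snownlp_analyzer(text, n_sentences=3):
    """Return (summary, sentiment) for one article; needs `pip install snownlp`."""
    from snownlp import SnowNLP  # imported lazily so this file loads without it
    doc = SnowNLP(text)
    # summary() returns a list of key sentences; sentiments is a float in [0, 1]
    return "。".join(doc.summary(n_sentences)), doc.sentiments


def summarize_all(texts, analyzer=None):
    """Apply `analyzer` to every text, or fill placeholders when it is None."""
    summaries, sentiments = [], []
    for text in texts:
        if analyzer is None:
            summary, score = PLACEHOLDER, PLACEHOLDER
        else:
            summary, score = analyzer(text)
        summaries.append(summary)
        sentiments.append(score)
    return summaries, sentiments


# Placeholder mode reproduces what the notebook currently stores
summary, sentiment = summarize_all(["first article", "second article"])
print(summary, sentiment)
```

Calling `summarize_all(df.content, analyzer=snownlp_analyzer)` would be the switch-over point once the package is available.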


In [42]:
df['summary'] = summary

In [43]:
df['sentiment'] = sentiment

In [44]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,link,photo_link,tokens,tokens_v2,entities,token_pos,top_key_freq,summary,sentiment
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...,"[[, ', 由, 龍成, 網路, 代理, 營運, 的, 新, 型態, 戰術, 攻防, R...","[龍成, 網路, 型態, 戰術, 明日方舟, 豬排店, 吉豚屋, 套餐, 主題曲, 燈塔, ...","[(明日方舟, WORK_OF_ART, (24, 28)), (80g, CARDINAL...","[([, PARENTHESISCATEGORY), (', FW), (由, P), (龍...","[(活動, 20), (尋訪, 7), (燈塔, 5), (時間, 5), (玩家, 5),...",暫無,暫無


# Rearrange the column order for readability

In [45]:
df.columns

Index(['item_id', 'date', 'category', 'title', 'content', 'link', 'photo_link',
       'tokens', 'tokens_v2', 'entities', 'token_pos', 'top_key_freq',
       'summary', 'sentiment'],
      dtype='object')

In [46]:
# Selecting with a list of column names returns the columns in that order
df=df[['item_id', 'date', 'category', 'title', 'content','sentiment', 'summary', 'top_key_freq', 'tokens',
       'tokens_v2', 'entities', 'token_pos', 'link', 'photo_link'
       ]]

In [47]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,sentiment,summary,top_key_freq,tokens,tokens_v2,entities,token_pos,link,photo_link
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,暫無,暫無,"[(活動, 20), (尋訪, 7), (燈塔, 5), (時間, 5), (玩家, 5),...","[[, ', 由, 龍成, 網路, 代理, 營運, 的, 新, 型態, 戰術, 攻防, R...","[龍成, 網路, 型態, 戰術, 明日方舟, 豬排店, 吉豚屋, 套餐, 主題曲, 燈塔, ...","[(明日方舟, WORK_OF_ART, (24, 28)), (80g, CARDINAL...","[([, PARENTHESISCATEGORY), (', FW), (由, P), (龍...",https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...


# Save data to disk

Write the result to a file.

In [48]:
df.to_csv('cna_news_preprocessed.csv', sep='|', index=False)

## Read it back to check

In [49]:
df = pd.read_csv('cna_news_preprocessed.csv', sep='|')

In [50]:
df.head(1)

Unnamed: 0,item_id,date,category,title,content,sentiment,summary,top_key_freq,tokens,tokens_v2,entities,token_pos,link,photo_link
0,game-news_2025-03-06_1,2025-03-06,遊戲新聞,《明日方舟》聯名「吉豚屋」推出合作套餐，隨餐贈周邊禮品及虛寶卡,['由龍成網路代理營運的新型態戰術攻防 RPG《明日方舟》今（6）日宣布與知名連鎖豬排店「吉...,暫無,暫無,"[('活動', 20), ('尋訪', 7), ('燈塔', 5), ('時間', 5), ...","['[', ""'"", '由', '龍成', '網路', '代理', '營運', '的', '...","['龍成', '網路', '型態', '戰術', '明日方舟', '豬排店', '吉豚屋',...","[NerToken(word='明日方舟', ner='WORK_OF_ART', idx=...","[('[', 'PARENTHESISCATEGORY'), (""'"", 'FW'), ('...",https://tw.news.yahoo.com/%E3%80%8A%E6%98%8E%E...,https://s.yimg.com/os/creatr-uploaded-images/2...
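After the round trip the list columns come back as strings: the values above are now quoted text such as `"[('活動', 20), ...]"`. A minimal sketch of restoring them with `ast.literal_eval` through the `converters` argument of `pd.read_csv` follows. It uses a tiny in-memory frame instead of the real file, and the column values are a shortened, illustrative subset.

```python
import ast
import io

import pandas as pd

demo = pd.DataFrame({
    "item_id": ["demo_1"],
    "tokens_v2": [["活動", "燈塔"]],
    "top_key_freq": [[("活動", 20), ("尋訪", 7)]],
})

# Write to an in-memory buffer with the same separator the notebook uses
buffer = io.StringIO()
demo.to_csv(buffer, sep="|", index=False)

# A plain read returns the list columns as strings
buffer.seek(0)
raw = pd.read_csv(buffer, sep="|")

# Converters parse the list literals back into Python objects
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    sep="|",
    converters={"tokens_v2": ast.literal_eval,
                "top_key_freq": ast.literal_eval},
)
print(type(raw.loc[0, "tokens_v2"]), type(restored.loc[0, "tokens_v2"]))
```

This approach only works for columns that contain plain literals. The `entities` column is saved as `NerToken(...)` text, so it would need a different treatment.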
