<a href="https://colab.research.google.com/github/405620294/classwork/blob/main/Naive_Bayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# load data

In [165]:
from urllib.request import urlretrieve
import pandas as pd
url = 'https://github.com/Elwing-Chou/ml0223/raw/main/poem_train.csv'
urlretrieve(url, 'train.csv')
url = 'https://github.com/Elwing-Chou/ml0223/raw/main/poem_test.csv'
urlretrieve(url, 'test.csv')

train_df = pd.read_csv('train.csv', encoding='UTF-8')
test_df = pd.read_csv('test.csv', encoding='UTF-8')

In [166]:
train_df.head()

Unnamed: 0,作者,詩名,內容
0,李白,菩薩蠻·平林漠漠煙如織,平林漠漠煙如織，寒山一帶傷心碧。\r\n暝色入高樓，有人樓上愁。玉階空佇立，宿鳥歸飛急。\r...
1,李白,把酒問月,青天有月來幾時，我今停杯一問之：人攀明月不可得，月行卻與人相隨？皎如飛鏡臨丹闕，綠煙滅儘清輝...
2,李白,春思,燕草如碧絲，秦桑低綠枝。當君懷歸日，是妾斷腸時。春風不相識，何事入羅幃。
3,李白,春夜洛城聞笛,誰家玉笛暗飛聲，散入春風滿洛城。此夜曲中聞折柳，何人不起故園情。
4,李白,古風 其十九,西上蓮花山，迢迢見明星。(西上 一作：西嶽)素手把芙蓉，虛步躡太清。霓裳曳廣帶，飄拂升天行。...


In [167]:
test_df.head()

Unnamed: 0,作者,詩名,內容
0,李白,望廬山瀑布,日照香爐生紫煙，遙看瀑布掛前川。飛流直下三千尺，疑是銀河落九天。
1,李白,早發白帝城,朝辭白帝彩雲間，千裡江陵一日還。兩岸猿聲啼不住，輕舟已過萬重山。
2,李白,贈汪倫,李白乘舟將欲行，忽聞岸上踏歌聲。桃花潭水深千尺，不及汪倫送我情。
3,李白,送孟浩然之廣陵,故人西辭黃鶴樓，煙花三月下揚州。孤帆遠影碧空儘，唯見長江天際流。
4,李白,夜宿山寺,危樓高百尺，手可摘星辰。不敢高聲語，恐驚天上人。


In [168]:
train_df['作者'].value_counts()

杜甫     1157
李白      969
白居易     605
Name: 作者, dtype: int64
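
The classes are unbalanced, so it helps to know what a trivial classifier would score before reading any accuracy figure. The sketch below hard-codes the counts printed above (it does not reload the CSV) and derives the empirical class priors and the majority-class baseline; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {'杜甫': 1157, '李白': 969, '白居易': 605}

total = sum(counts.values())

# Empirical class priors: the share of training poems written by each author.
priors = {author: n / total for author, n in counts.items()}

# A classifier that always predicts the most frequent author would reach
# this accuracy on data with the same class mix as the training set.
majority_author = max(counts, key=counts.get)
baseline = counts[majority_author] / total

for author, p in priors.items():
    print(f'{author}: {p:.3f}')
print('majority baseline:', majority_author, round(baseline, 3))
```

This baseline applies to data distributed like the training set; a test set with a different class mix has a different baseline.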

# preprocessing

In [169]:
author_list = train_df['作者'].unique()
list(enumerate(author_list))

[(0, '李白'), (1, '杜甫'), (2, '白居易')]

In [170]:
# encode author names as integer labels
author2num = {name: index for index, name in enumerate(author_list)}
y_train = train_df['作者'].map(author2num)
y_test = test_df['作者'].map(author2num)
print(train_df['作者'].value_counts())

杜甫     1157
李白      969
白居易     605
Name: 作者, dtype: int64


In [171]:
# words: segment each poem with jieba, then join the tokens with spaces
import jieba
def word_cut(s):
  return ' '.join(jieba.cut(s))

x_train = train_df['內容'].apply(word_cut)
x_test = test_df['內容'].apply(word_cut)

In [172]:
x_train.head()

0    平林 漠漠 煙如織 ， 寒山 一帶 傷心 碧 。 \r\n 暝 色入 高樓 ， 有人 樓上 ...
1    青天 有 月 來 幾時 ， 我今 停杯 一問 之 ： 人攀 明月 不可 得 ， 月行 卻 與...
2    燕草 如碧絲 ， 秦桑低 綠枝 。 當君 懷歸日 ， 是 妾 斷腸時 。 春風 不 相識 ，...
3    誰 家玉笛 暗飛聲 ， 散 入春 風滿 洛城 。 此 夜曲 中聞折 柳 ， 何人 不起 故園情 。
4    西上 蓮 花山 ， 迢迢 見 明星 。 ( 西 上   一作 ： 西 嶽 ) 素 手把 芙蓉...
Name: 內容, dtype: object

# feature extraction: bag-of-words counts (CountVectorizer)

In [173]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
train_transform = vectorizer.fit_transform(x_train)
test_transform = vectorizer.transform(x_test)
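
One detail worth checking: with default settings, `CountVectorizer` tokenizes with the regular expression `(?u)\b\w\w+\b`, which only keeps runs of two or more word characters. After jieba segmentation many Chinese words are a single character, so they would never reach the vocabulary. The sketch below applies that pattern with the standard `re` module to a line taken from the `x_train.head()` output, next to an alternative pattern that keeps single characters. It only illustrates the tokenization step; whether keeping single-character words improves accuracy on this dataset is not tested here.

```python
import re

# CountVectorizer's documented default token_pattern: 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

# A space-joined line as produced by word_cut (from the x_train.head() output).
sample = '青天 有 月 來 幾時 ， 我今 停杯 一問 之'

default_tokens = re.findall(DEFAULT_PATTERN, sample)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, sample)

print(default_tokens)  # multi-character words only
print(all_tokens)      # single-character words are kept as well
```

To try the second behavior in the pipeline, the pattern can be passed as `CountVectorizer(token_pattern=r"(?u)\b\w+\b")`.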

In [174]:
# vectorizer.vocabulary_ maps word -> column index; invert it to index -> word
inv_vocab = {index: word for word, index in vectorizer.vocabulary_.items()}

In [175]:
train_transform  # sparse matrix
# print(train_transform)
# densify for display only; the dense copy has one column per vocabulary word
sparse_df = pd.DataFrame(train_transform.toarray())
sparse_df.rename(columns=inv_vocab, inplace=True)
sparse_df

Unnamed: 0,198,38,77,一一,一丈,一上,一下,一不中,一不存,一世,一丘樂,一丘藏,一串,一丸,一主,一乘,一事,一二,一人,一人佩,一人來,一人出,一人字,一人常,一代,一以,一以合,一以貫,一任,一伸,一何,一何樂,一何適,一何麗,一作,一作勳,一作千裡,一作結,一使,一來,...,龍顏顧,龍飛,龍飛入,龍馬,龍駒,龍駕具,龍駕空,龍騎,龍驚,龍驤,龍驤塋,龍驤詔,龍髯,龍髯幸,龍鱗,龍鱗識,龍鳳脫,龍鳴,龍鵬,龍鸞炳,龍龜,龐人,龐公,龐公不浪出,龐公任,龐公竟,龐公隱,龐德公,龐眉,龔勝亡,龔子,龔黃,龜山蔽,龜新,龜緣入,龜蒙,龜遊蓮葉,龜靈,龜頭,龜鶴年
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2726,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2727,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2728,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2729,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# training

In [179]:
from sklearn.naive_bayes import MultinomialNB

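# alpha is the additive smoothing strength (default alpha=1.0);
# smaller values give unseen words a lower probability
nb_clf = MultinomialNB()
nb_clf2 = MultinomialNB(alpha=0.1)
nb_clf3 = MultinomialNB(alpha=0.01)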

nb_clf.fit(train_transform, y_train)
nb_clf2.fit(train_transform, y_train)
nb_clf3.fit(train_transform, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
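
To make the role of `alpha` concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing, using the smoothed estimate (count + alpha) / (class total + alpha × vocabulary size). It is not scikit-learn's implementation, and the function names `train_nb` and `predict_nb` and the toy corpus are hypothetical, made up for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenized documents (lists of words)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts.
        denom = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / denom) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    log_prior, log_likelihood = model
    scores = {
        # Words outside the training vocabulary are ignored.
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus, not taken from the poem dataset.
toy_docs = [['明月', '青天'], ['明月', '春風'], ['國破', '山河'], ['山河', '春風']]
toy_labels = ['A', 'A', 'B', 'B']

for alpha in (1.0, 0.1, 0.01):
    model = train_nb(toy_docs, toy_labels, alpha=alpha)
    # Probability that class A assigns to a word it never saw in training.
    p_unseen = math.exp(model[1]['A']['國破'])
    print(alpha, round(p_unseen, 4), predict_nb(model, ['國破', '山河', '明月']))
```

As `alpha` shrinks, the probability a class gives to words it never saw shrinks with it, so a single unseen word penalizes that class more heavily.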

# evaluating

In [181]:
from sklearn.metrics import accuracy_score
pre = nb_clf.predict(test_transform)
pre2 = nb_clf2.predict(test_transform)
pre3 = nb_clf3.predict(test_transform)

acc_alpha1 = accuracy_score(y_test, pre)
acc_alpha01 = accuracy_score(y_test, pre2)
acc_alpha001 = accuracy_score(y_test, pre3)

print('alpha=1', acc_alpha1, '\nalpha=0.1', acc_alpha01, '\nalpha=0.01', acc_alpha001)

alpha=1 0.8 
alpha=0.1 0.8333333333333334 
alpha=0.01 0.8333333333333334


In [190]:
from sklearn.metrics import confusion_matrix
cm_df = pd.DataFrame(confusion_matrix(y_test, pre2),
            columns=[i+' (predicted)' for i in author_list],
            index=[i+' (true)' for i in author_list])
cm_df

Unnamed: 0,李白 (predicted),杜甫 (predicted),白居易 (predicted)
李白 (true),8,1,1
杜甫 (true),0,9,1
白居易 (true),1,1,8
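
Accuracy alone hides which authors get confused with each other. The sketch below re-enters the confusion matrix printed above by hand (rows are true authors, columns are predictions) and derives overall accuracy plus per-author precision and recall in plain Python. The variable names are illustrative; `sklearn.metrics.classification_report` would be the usual way to get the same figures directly from `y_test` and `pre2`.

```python
# Confusion matrix copied from the output above:
# rows = true author, columns = predicted author.
authors = ['李白', '杜甫', '白居易']
cm = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 1, 8],
]

n = len(authors)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

metrics = {}
for i, author in enumerate(authors):
    predicted_as = sum(cm[r][i] for r in range(n))  # column sum
    actually = sum(cm[i])                           # row sum
    metrics[author] = {
        'precision': cm[i][i] / predicted_as,
        'recall': cm[i][i] / actually,
    }

print('accuracy:', round(accuracy, 4))
for author, m in metrics.items():
    print(author, 'precision', round(m['precision'], 3), 'recall', round(m['recall'], 3))
```

The diagonal sums to 25 of 30 test poems, which matches the accuracy reported for alpha=0.1.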
