<h1 align="center">Naive Bayes Classifier for Text Classification</h1>

<hr>
<h3>Emotion categories (perceptual adjectives):</h3>
<pre>
'清新', '憂愁', '真誠', '青春', '成熟',
'有趣', '無聊', '溫和', '剛強', '科技',
'熱情', '冷漠', '正義', '甜美', '苦澀',
'浪漫', '科幻', '現代', '陳腐', '驚悚',
'舒服', '活潑', '悠閒'
</pre>

In [1]:
# Emotion categories (perceptual adjectives)

alst = [
    '清新', '憂愁', '真誠', '青春', '成熟',
    '有趣', '無聊', '溫和', '剛強', '科技',
    '熱情', '冷漠', '正義', '甜美', '苦澀',
    '浪漫', '科幻', '現代', '陳腐', '驚悚',
    '舒服', '活潑', '悠閒'
]

<hr>
<h3>Load the emo23 likelihood data</h3>

In [2]:
# Load the emo23 likelihood data

import pickle

# The with statement closes the file automatically, so no explicit close() is needed
with open('likelihood_emo23.pkl', 'rb') as fp:
    likelihood = pickle.load(fp)

print(len(likelihood))


230023


<hr>
<h3>Input the test sentence</h3>

In [3]:
# Input the test sentence

txt = '今天心情很好，在社團跟大家玩得很開心，晚上打算去吃大餐，一切都安排得很好。'

<hr>
<h3>
Remove all non-Chinese characters and symbols
</h3>

In [4]:
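# First remove all non-Chinese characters and symbols

import re

def remove_punctuation(line):
    # Replace anything that is not a Chinese character (U+4E00..U+9FA5) or whitespace.
    # Note: '|' inside a character class is a literal pipe, not alternation, so it is left out here.
    rule = re.compile(r'[^\u4e00-\u9fa5\s]')
    line = rule.sub(' ', line)
    return line

def remove_redundant_space(line):
    # Collapse runs of spaces into a single space
    line = re.sub(' +', ' ', line)
    return line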
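
As a small sanity check on the cleanup step, the sketch below applies the same two substitutions to a mixed string. The helper name `clean` and the sample string are illustrative assumptions, not part of the notebook. It also shows that a `|` placed inside a character class is treated as a literal character rather than as alternation.

```python
import re

def clean(line):
    # Replace everything that is not a Chinese character (U+4E00..U+9FA5) or whitespace
    line = re.sub(r'[^\u4e00-\u9fa5\s]', ' ', line)
    # Collapse runs of spaces into a single space
    return re.sub(r' +', ' ', line)

# Hypothetical sample mixing Latin letters, digits, punctuation, and a pipe
sample = 'Hello，世界! 2024|測試'
cleaned = clean(sample)
print(repr(cleaned))

# With '|' listed inside the brackets, the pipe is excluded from removal and survives
with_pipe = re.sub(r'[^\u4e00-\u9fa5|\s]', ' ', sample)
print('|' in with_pipe)
```

Only the Chinese characters and single separating spaces remain in `cleaned`, which is the form the segmentation step expects.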


<hr>
<h3>Lexical analysis (text cleanup)</h3>

In [5]:
# Lexical analysis (text cleanup)

txt = remove_punctuation(txt)
txt = remove_redundant_space(txt)

print(txt)

今天心情很好 在社團跟大家玩得很開心 晚上打算去吃大餐 一切都安排得很好 


<hr>
<h3>
Lexical analysis (word segmentation)
</h3>

In [6]:
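# Word segmentation

import jieba

# Load dedicated segmentation dictionaries if needed
jieba.set_dictionary('dict.txt.big')
jieba.load_userdict('user.txt')

# jieba.cut returns a generator; materialize it as a list
res = jieba.cut(txt)

lst = list(res)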

print(lst)


Building prefix dict from C:\Users\trchou\Dropbox\000 teaching\aisd10a\jupyter\unit_09_text\unit_09_text_naive_bayes_classifier\dict.txt.big ...
Loading model from cache C:\Users\trchou\AppData\Local\Temp\jieba.ua8b55bb30a5d8be72a6e9488d1c37541.cache
Loading model cost 0.903 seconds.
Prefix dict has been built successfully.


['今天', '心情', '很', '好', ' ', '在', '社團', '跟', '大家', '玩得', '很', '開心', ' ', '晚上', '打算', '去', '吃', '大餐', ' ', '一切', '都', '安排', '得', '很', '好', ' ']


<hr>
<h3>
Remove function words such as 『的』 and other uninformative words (stopwords), and filter out single-character words
</h3>
<pre>
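Stopwords are very common (high-frequency) words that contribute little to distinguishing meaning.
<p>The final output of lexical analysis is a list of <span style="color:red">tokens</span>, which serves as the basis for classification.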
</pre>

In [7]:
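# Remove function words such as 『的』 and other stopwords

# Extend the stopword list as needed
stopwords = [ '的', '之', '乎', '者', '也', ' ' ]

# Keep only words of two or more characters that are not stopwords
tokens = []
for x in lst:
    if x not in stopwords and len(x) >= 2:
        tokens.append(x)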

print(tokens)


['今天', '心情', '社團', '大家', '玩得', '開心', '晚上', '打算', '大餐', '一切', '安排']


<hr>
<h3>Definition of the likelihood probability function (prob_likely)</h3>
<pre>
Prob { w | a } = p
p = likelihood[(w,a)], if w is in the metaphors of a (note that a itself is in the metaphors of a)
p = alpha = 0.000001, otherwise
</pre>

In [8]:
# Definition of the likelihood probability function (prob_likely)

# Fallback probability for (word, category) pairs missing from the table
alpha = 0.000001

def prob_likely(w, a):
    try:
        p = likelihood[(w,a)]
    except KeyError:
        p = alpha
    return p


<hr>
<h3>Using the likelihood function</h3>

In [9]:
# Example category: 有趣

a0 = '有趣'

# The accumulated probability should be 1 - alpha (0.999999)
acc = 0.0
for w, a in likelihood:
    if (a == a0):
        p = prob_likely(w, a)
        acc = acc + p
print('acc = ', acc)

p1 = prob_likely(a0, a0)
p2 = prob_likely('烏龜', a0)

print('p1 = ', p1)
print('p2 = ', p2)


acc =  0.9999989999999979
p1 =  0.0003659794954630224
p2 =  1e-06
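
The check above shows the stored likelihoods for one category summing to 1 - alpha, with alpha returned for unseen words. The following is a minimal sketch of one way a table with that property could be built from word counts. It is an assumption for illustration only, not necessarily how `likelihood_emo23.pkl` was produced; the counts and the names `counts`, `toy_likelihood`, and `toy_prob` are hypothetical.

```python
from collections import Counter

alpha = 0.000001

# Hypothetical toy counts of words observed with each category
counts = {
    '有趣': Counter({'有趣': 3, '好玩': 5, '開心': 2}),
    '無聊': Counter({'無聊': 4, '沒事': 1}),
}

toy_likelihood = {}
for a, counter in counts.items():
    total = sum(counter.values())
    for w, n in counter.items():
        # Relative frequency, scaled so each category sums to 1 - alpha
        toy_likelihood[(w, a)] = (1 - alpha) * n / total

def toy_prob(w, a):
    # Unseen (word, category) pairs fall back to alpha
    return toy_likelihood.get((w, a), alpha)

# Per-category totals of the stored likelihoods
sums = {}
for a in counts:
    sums[a] = sum(p for (w, c), p in toy_likelihood.items() if c == a)
    print(a, sums[a])

print(toy_prob('烏龜', '有趣'))
```

Reserving a small alpha for unseen words keeps a single unknown token from driving a category's product to exactly zero.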


<hr>
<h3>Likelihood test for a specific word</h3>

In [10]:
# Compute the likelihood of w under every emo23 category

w = '沒事'

dic = dict()

for a in alst:
    p = prob_likely(w, a)
    dic[a] = p
    
# Sort by likelihood in descending order and print every category

sorted_list = sorted(dic.items(), key=lambda x: x[1], reverse=True)
# print(sorted_list)

for x in sorted_list:
    c, p = x[0], x[1]
    print('p = %12.8f, p = Prob{ %s | %s }' % (p, w, c))


p =   0.00012659, p = Prob{ 沒事 | 無聊 }
p =   0.00012577, p = Prob{ 沒事 | 舒服 }
p =   0.00008879, p = Prob{ 沒事 | 憂愁 }
p =   0.00000100, p = Prob{ 沒事 | 清新 }
p =   0.00000100, p = Prob{ 沒事 | 真誠 }
p =   0.00000100, p = Prob{ 沒事 | 青春 }
p =   0.00000100, p = Prob{ 沒事 | 成熟 }
p =   0.00000100, p = Prob{ 沒事 | 有趣 }
p =   0.00000100, p = Prob{ 沒事 | 溫和 }
p =   0.00000100, p = Prob{ 沒事 | 剛強 }
p =   0.00000100, p = Prob{ 沒事 | 科技 }
p =   0.00000100, p = Prob{ 沒事 | 熱情 }
p =   0.00000100, p = Prob{ 沒事 | 冷漠 }
p =   0.00000100, p = Prob{ 沒事 | 正義 }
p =   0.00000100, p = Prob{ 沒事 | 甜美 }
p =   0.00000100, p = Prob{ 沒事 | 苦澀 }
p =   0.00000100, p = Prob{ 沒事 | 浪漫 }
p =   0.00000100, p = Prob{ 沒事 | 科幻 }
p =   0.00000100, p = Prob{ 沒事 | 現代 }
p =   0.00000100, p = Prob{ 沒事 | 陳腐 }
p =   0.00000100, p = Prob{ 沒事 | 驚悚 }
p =   0.00000100, p = Prob{ 沒事 | 活潑 }
p =   0.00000100, p = Prob{ 沒事 | 悠閒 }


<hr>
<h3>Naive Bayes classification</h3>

In [11]:
# Naive Bayes classification

print(tokens)

dic = dict()
for a in alst:
    q = 1
    for t in tokens:
        p = prob_likely(t, a)
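        # Unscaled form: q = q * p
        # Every factor is multiplied by the same constant (10000) for every category,
        # so the scores are larger but the ranking is unchanged
        q = q * p * 10000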
    dic[a] = q

sorted_list = sorted(dic.items(), key=lambda x: x[1], reverse=True)

for x in sorted_list:
    a, p = x[0], x[1]
    print('%s: %20.18f' % (a, p))    


['今天', '心情', '社團', '大家', '玩得', '開心', '晚上', '打算', '大餐', '一切', '安排']
有趣: 0.000000000215915116
無聊: 0.000000000003032493
悠閒: 0.000000000002468391
熱情: 0.000000000000931487
舒服: 0.000000000000013163
浪漫: 0.000000000000000130
真誠: 0.000000000000000101
活潑: 0.000000000000000071
苦澀: 0.000000000000000001
清新: 0.000000000000000001
青春: 0.000000000000000001
甜美: 0.000000000000000001
冷漠: 0.000000000000000001
成熟: 0.000000000000000001
現代: 0.000000000000000000
憂愁: 0.000000000000000000
正義: 0.000000000000000000
陳腐: 0.000000000000000000
溫和: 0.000000000000000000
剛強: 0.000000000000000000
科技: 0.000000000000000000
科幻: 0.000000000000000000
驚悚: 0.000000000000000000
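
The product of many small likelihoods shrinks quickly, which is why most scores above print as zero even after scaling. A common alternative, sketched here as an assumption rather than as the notebook's method, is to sum log-likelihoods: the logarithm is monotonic, so the ranking is the same as for the product. The toy table and the name `log_score` are hypothetical stand-ins.

```python
import math

alpha = 0.000001

# Hypothetical toy likelihood table standing in for likelihood_emo23.pkl
toy_likelihood = {
    ('開心', '有趣'): 0.002,
    ('心情', '有趣'): 0.001,
    ('心情', '憂愁'): 0.003,
}

def prob_likely(w, a):
    # Stored likelihood, or alpha for an unseen (word, category) pair
    return toy_likelihood.get((w, a), alpha)

def log_score(tokens, a):
    # Sum of log-likelihoods; equals the log of the product of likelihoods
    return sum(math.log(prob_likely(t, a)) for t in tokens)

tokens = ['心情', '開心', '大餐']
scores = {a: log_score(tokens, a) for a in ['有趣', '憂愁']}
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

for a, s in ranked:
    print('%s: %.4f' % (a, s))
```

Log scores are negative and stay in a comfortable floating-point range regardless of how many tokens the sentence contains.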


<h3 style="color:orange">Text Classification based on Naive Bayes Theory (integrated version)</h3>

In [12]:
# Text Classification based on Naive Bayes Theory (integrated version)

import pickle
import jieba
import re

# Classification categories

alst = [
    '清新', '憂愁', '真誠', '青春', '成熟',
    '有趣', '無聊', '溫和', '剛強', '科技',
    '熱情', '冷漠', '正義', '甜美', '苦澀',
    '浪漫', '科幻', '現代', '陳腐', '驚悚',
    '舒服', '活潑', '悠閒'
]

# Extend the stopword list as needed

stopwords = [ '的', '之', '乎', '者', '也', ' ' ]

# Function to load the language model (word2vec, jieba, ...)

def load_language_model():
    # Load the emo23 likelihood data; the with statement closes the file
    with open('likelihood_emo23.pkl', 'rb') as fp:
        likelihood = pickle.load(fp)
    # Configure the jieba segmentation dictionaries
    jieba.set_dictionary('dict.txt.big')
    jieba.load_userdict('user.txt')
    return likelihood

# Function to remove all non-Chinese characters and symbols

def remove_non_chinese(line):
    # Remove English letters and digits
    rule = re.compile('[a-zA-Z0-9]')
    line = rule.sub(' ', line)
    # Remove special symbols (including some full-width symbols)
    rule = re.compile(r'[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\]^_`{|}~\s]+')
    line = rule.sub(' ', line)
    # Remove invisible control characters
    rule = re.compile('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+')
    line = rule.sub(' ', line)
    # Remove anything left that is not a Chinese character (including all full-width symbols)
    rule = re.compile('[^\u4e00-\u9fa5]')
    line = rule.sub(' ', line)
    return line

def remove_redundant_space(line):
    line = re.sub(' +', ' ', line)
    return line

# Lexical analyzer function

def lexical_analyzer(txt):
    # Strip non-Chinese characters, then collapse repeated spaces
    txt = remove_non_chinese(txt)
    txt = remove_redundant_space(txt)
    # Segment the cleaned text into words
    lst = list(jieba.cut(txt))
    # Keep only words of two or more characters that are not stopwords
    tokens = [x for x in lst if x not in stopwords and len(x) >= 2]
    return tokens

# Definition of the likelihood probability function (prob_likely)

# Fallback probability for (word, category) pairs missing from the table
alpha = 0.000001

def prob_likely(w, a):
    try:
        p = likelihood[(w,a)]
    except KeyError:
        p = alpha
    return p

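# Text classifier function

def text_classifier(tokens):
    # Naive Bayes classification
    dic = dict()
    for a in alst:
        q = 1
        for t in tokens:
            p = prob_likely(t, a)
            # Unscaled form: q = q * p
            # The constant factor 10000 is the same for every category, so the ranking is unchanged
            q = q * p * 10000
        dic[a] = q
    # Sort the dictionary into a list of (category, score) pairs, highest score first
    lst = sorted(dic.items(), key=lambda x: x[1], reverse=True)
    return lst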

# Load the language model

likelihood = load_language_model()


Building prefix dict from C:\Users\trchou\Dropbox\000 teaching\aisd10a\jupyter\unit_09_text\unit_09_text_naive_bayes_classifier\dict.txt.big ...
Loading model from cache C:\Users\trchou\AppData\Local\Temp\jieba.ua8b55bb30a5d8be72a6e9488d1c37541.cache
Loading model cost 0.916 seconds.
Prefix dict has been built successfully.


<hr>
<h3>Integration test</h3>

In [13]:
# Integration test

# Input

# Sample test sentences
# txt = '不在乎天長地久，只在乎曾經擁有'
# txt = '實踐是檢驗真理的唯一方法。'
txt = '在新的一年裡，希望所有朋友都幸福快樂心想事成！'

# Classify the text

tokens = lexical_analyzer(txt)
result = text_classifier(tokens)

# Display the results

print(txt)
for a, score in result:
    print('%s: %16.14f' % (a, score))


在新的一年裡，希望所有朋友都幸福快樂心想事成！
熱情: 0.00011783294370
真誠: 0.00009897563048
悠閒: 0.00000150008651
無聊: 0.00000097769708
青春: 0.00000001041810
浪漫: 0.00000000945896
憂愁: 0.00000000806715
甜美: 0.00000000801321
有趣: 0.00000000736224
冷漠: 0.00000000670429
正義: 0.00000000010392
苦澀: 0.00000000009002
清新: 0.00000000008542
活潑: 0.00000000008512
舒服: 0.00000000008360
驚悚: 0.00000000008047
溫和: 0.00000000007678
成熟: 0.00000000000100
剛強: 0.00000000000100
科技: 0.00000000000100
科幻: 0.00000000000100
現代: 0.00000000000100
陳腐: 0.00000000000100
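
The scores printed above are unnormalized products, and the classifier includes no prior term, which amounts to treating every category as equally likely beforehand. To read the scores as relative weights, one option is to divide each by the total. The sketch below assumes a hypothetical helper named `normalize` and, for brevity, uses only the top four scores copied from the output above, so the resulting shares are relative to those four categories only.

```python
def normalize(result):
    # result: list of (category, score) pairs, as returned by text_classifier
    total = sum(score for _, score in result)
    return [(a, score / total) for a, score in result]

# Top four scores copied from the output above; the other 19 categories are omitted
result = [
    ('熱情', 0.00011783294370),
    ('真誠', 0.00009897563048),
    ('悠閒', 0.00000150008651),
    ('無聊', 0.00000097769708),
]

shares = normalize(result)
for a, p in shares:
    print('%s: %.4f' % (a, p))
```

Passing the full 23-entry list from `text_classifier` instead would give shares over all categories.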
