<h1 align="center">Word Similarity Classifier for Text Classification</h1>
<h3 align="center">based on similarity from word2vec</h3>

<hr>
<h3>Emotion categories (perceptual adjectives):</h3>
<pre>
'清新', '憂愁', '真誠', '青春', '成熟',
'有趣', '無聊', '溫和', '剛強', '科技',
'熱情', '冷漠', '正義', '甜美', '苦澀',
'浪漫', '科幻', '現代', '陳腐', '驚悚',
'舒服', '活潑', '悠閒'
</pre>

In [1]:
alst = [
    '清新', '憂愁', '真誠', '青春', '成熟',
    '有趣', '無聊', '溫和', '剛強', '科技',
    '熱情', '冷漠', '正義', '甜美', '苦澀',
    '浪漫', '科幻', '現代', '陳腐', '驚悚',
    '舒服', '活潑', '悠閒'
]

<hr>
<h3>Load the word2vec model</h3>
<pre>
(Substitute your own trained model file or a downloaded pretrained model file.)
</pre>

In [2]:
import gensim

print(gensim.__version__)

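# Self-trained model (a full Word2Vec object; query it through model.wv)
# model = gensim.models.Word2Vec.load('c:/python/w2v_model/cna_xin_wiki_cis180.model.bin')

# TMU pretrained model (loaded as KeyedVectors; query it directly)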
model = gensim.models.KeyedVectors.load_word2vec_format('c:/python/w2v_model/y_360W_cbow_2D_300dim_2020v1.bin', unicode_errors='ignore', binary=True)


4.0.1


<hr>
<h3>Input test sentence</h3>

In [3]:
txt = '不在乎天長地久，只在乎曾經擁有'
# txt = '公平是我們追求的真理'
# txt = '實踐是檢驗真理的唯一方法'
# txt = '航向宇宙探索未知太空世界是每一個少年的兒時夢想'
# txt = '上課好無趣，整天都昏昏欲睡神遊太虛'

<hr>
<h3>
Remove all non-Chinese characters and symbols
</h3>

In [4]:
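# Remove all non-Chinese characters and symbols

import re

def remove_non_chinese(line):
    # Remove English letters and digits
    rule = re.compile('[a-zA-Z0-9]')
    line = rule.sub(' ', line)
    # Remove special symbols (including some full-width symbols)
    rule = re.compile('[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~\\s]+')
    line = rule.sub(' ', line)
    # Remove invisible control characters
    rule = re.compile('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+')
    line = rule.sub(' ', line)
    # Remove every remaining character outside the CJK range U+4E00 to U+9FA5
    # (this also covers any full-width symbols not listed above)
    rule = re.compile('[^\u4e00-\u9fa5]')
    line = rule.sub(' ', line)
    return line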

def remove_redundant_space(line):
    line = re.sub(' +', ' ', line)
    return line


<hr>
<h3>Lexical analysis (text cleanup)</h3>

In [5]:
# Lexical analysis (text cleanup)

txt = remove_non_chinese(txt)
txt = remove_redundant_space(txt)

print(txt)


不在乎天長地久 只在乎曾經擁有
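<p>The last substitution in <code>remove_non_chinese</code> already replaces every character outside U+4E00 to U+9FA5, so the whole cleanup can also be sketched as a single pattern. The name <code>keep_chinese</code> is hypothetical and not part of the notebook. This is a minimal sketch, not a drop-in replacement: unlike the original pair of functions it also strips leading and trailing spaces.</p>

```python
import re

# Hypothetical helper: collapse every run of characters outside the
# CJK range U+4E00 to U+9FA5 into a single space in one pass.
NON_CHINESE_RUN = re.compile(r'[^\u4e00-\u9fa5]+')

def keep_chinese(line):
    """Return line with non-Chinese runs replaced by one space, ends trimmed."""
    return NON_CHINESE_RUN.sub(' ', line).strip()

print(keep_chinese('不在乎天長地久，只在乎曾經擁有'))
print(keep_chinese('AI 2021: 深度學習 (deep learning) 入門!'))
```

<p>The first call reproduces the cleaned sentence shown above; the second shows that Latin letters, digits, and punctuation between Chinese words leave only one separating space.</p>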


<hr>
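<h3>
Lexical analysis (word segmentation)
</h3>

In [6]:
# Lexical analysis (word segmentation)

import jieba

# Load the main dictionary if needed
jieba.set_dictionary('dict.txt.big')
# Load a custom user dictionary if needed
jieba.load_userdict('user.txt')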

res = jieba.cut(txt)

lst = [ x for x in res ]

print(lst)


Building prefix dict from C:\Users\trchou\Dropbox\000 teaching\aisd10a\jupyter\unit_09_text\unit_09_text_word_similarity_classifier\dict.txt.big ...
Loading model from cache C:\Users\trchou\AppData\Local\Temp\jieba.u451fe25cd6e3a6f27d5c8b0d57922321.cache
Loading model cost 0.906 seconds.
Prefix dict has been built successfully.


['不在乎', '天長地久', ' ', '只在乎', '曾經', '擁有']


<hr>
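<h3>
Remove stopwords such as the particle 『的』 and filter out single-character words
</h3>
<pre>
Stopwords are very common (high-frequency) words that contribute little to distinguishing meaning.
The final output of lexical analysis is the list of <span style="color:red">tokens</span>, which serves as the basis for classification.
</pre>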

In [7]:
# Remove stopwords such as the particle '的' and filter out single-character words

# Extend the stopword list as needed
stopwords = [ '的', '之', '乎', '者', '也', ' ' ]

tokens = []
for x in lst:
    if (x not in stopwords):
        if (len(x) >= 2):
            tokens.append(x)

print(tokens)


['不在乎', '天長地久', '只在乎', '曾經', '擁有']
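<p>The filtering step can be sketched as one reusable function. The names <code>STOPWORDS</code>, <code>filter_tokens</code>, and <code>min_len</code> are hypothetical; the stopword entries and the two-character minimum are the ones used in the cell above.</p>

```python
# Hypothetical names; the stopword entries are copied from the notebook cell.
STOPWORDS = {'的', '之', '乎', '者', '也', ' '}

def filter_tokens(words, stopwords=STOPWORDS, min_len=2):
    """Keep words that are not stopwords and have at least min_len characters."""
    return [w for w in words if w not in stopwords and len(w) >= min_len]

# Segmentation result printed by the previous cell
segmented = ['不在乎', '天長地久', ' ', '只在乎', '曾經', '擁有']
print(filter_tokens(segmented))
```

<p>Every entry in the current stopword list is a single character, so the length filter alone would already remove them; the stopword check starts to matter once multi-character stopwords are added.</p>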


<hr>
<h3>Direct word-similarity comparison</h3>
<p>based on word2vec (word vector)</p>

<p>(mean)</p>

In [8]:
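# Direct word-similarity comparison (mean)

n = len(tokens)

d = dict()
for a in alst:
    score = 0
    for t in tokens:
        try:
            # sim = model.wv.similarity(t, a)
            sim = model.similarity(t, a)
        except KeyError:
            # a word missing from the model vocabulary counts as similarity 0
            sim = 0
        # negative similarities are clipped to 0
        if (sim < 0):
            sim = 0
        score = score + sim
    if (n == 0):
        score = 0
    else:
        score = score / n
    d[a] = score

# Sort the dictionary by score (the result is a list of pairs)
lst = sorted(d.items(), key=lambda x: x[1], reverse=True)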

for a, score in lst:
    print('%s: %10.8f' % (a, score))


剛強: 0.05947875
真誠: 0.05890341
苦澀: 0.05329786
浪漫: 0.05108019
青春: 0.04941345
憂愁: 0.04810110
科幻: 0.04617314
成熟: 0.04439275
現代: 0.04321548
甜美: 0.04060926
清新: 0.04026775
冷漠: 0.03941980
熱情: 0.03852817
科技: 0.03419882
悠閒: 0.03249943
有趣: 0.02969757
活潑: 0.02849381
陳腐: 0.02677233
無聊: 0.02552254
溫和: 0.01957277
驚悚: 0.01376841
正義: 0.01255617
舒服: 0.00000000
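<p>The mean score is the average, over all tokens, of the cosine similarity to the category word with negative values clipped to 0. The sketch below reproduces that arithmetic on made-up 2-D vectors rather than on the model; <code>cosine</code> and <code>mean_clipped_similarity</code> are hypothetical helper names, and <code>None</code> stands in for a token that is missing from the vocabulary.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_clipped_similarity(label_vec, token_vecs):
    """Average of max(0, cosine) over all tokens; None marks a missing token."""
    if not token_vecs:
        return 0.0
    total = 0.0
    for vec in token_vecs:
        if vec is not None:
            total += max(0.0, cosine(vec, label_vec))
    # Missing tokens add 0 to the sum but still count in the denominator.
    return total / len(token_vecs)

# Toy vectors, invented for illustration only
label = [1.0, 0.0]
token_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], None]
print(mean_clipped_similarity(label, token_vecs))  # 0.25
```

<p>The four similarities are 1, 0, 0 (clipped from -1), and 0 (missing), giving 1/4. As in the notebook cell, unrelated or out-of-vocabulary tokens pull the mean down, which is one reason the mean scores stay small.</p>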


<p>(max)</p>

In [9]:
# Direct word-similarity comparison (max)

n = len(tokens)

d = dict()
for a in alst:
    score_max = 0
    for t in tokens:
        try:
            # sim = model.wv.similarity(t, a)
            sim = model.similarity(t, a)
        except KeyError:
            # a word missing from the model vocabulary counts as similarity 0
            sim = 0
        if (sim < 0):
            sim = 0
        if (sim > score_max):
            score_max = sim
    d[a] = score_max

# Sort the dictionary by score (the result is a list of pairs)
lst = sorted(d.items(), key=lambda x: x[1], reverse=True)

for a, score in lst:
    print('%s: %10.8f' % (a, score))


真誠: 0.20308638
浪漫: 0.19319998
憂愁: 0.18272440
成熟: 0.17329067
剛強: 0.16913190
苦澀: 0.16646215
青春: 0.14741611
科技: 0.14216813
甜美: 0.12066689
科幻: 0.11840704
冷漠: 0.11783428
有趣: 0.10830958
清新: 0.10613213
現代: 0.10499865
活潑: 0.09939733
溫和: 0.09786385
熱情: 0.09003606
無聊: 0.08784638
陳腐: 0.08621894
悠閒: 0.06781122
正義: 0.06278086
驚悚: 0.04591414
舒服: 0.00000000
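<p>With max aggregation a category's score is decided by its single best-matching token. The sketch below shows the same rule with a made-up lookup table standing in for <code>model.similarity</code>; <code>max_clipped_similarity</code>, <code>TABLE</code>, and <code>toy_similarity</code> are hypothetical names and the numbers are invented.</p>

```python
def max_clipped_similarity(label, tokens, similarity):
    """Largest non-negative similarity between label and any token (0.0 if none)."""
    best = 0.0
    for t in tokens:
        try:
            sim = similarity(t, label)
        except KeyError:
            continue  # unknown word: contributes nothing
        best = max(best, sim)
    return best

# Invented similarity values standing in for model.similarity
TABLE = {('擁有', '真誠'): 0.20, ('曾經', '真誠'): -0.05, ('擁有', '浪漫'): 0.10}

def toy_similarity(w1, w2):
    return TABLE[(w1, w2)]  # raises KeyError for unknown pairs

scores = {a: max_clipped_similarity(a, ['曾經', '擁有'], toy_similarity)
          for a in ['真誠', '浪漫', '舒服']}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

<p>A category with no positive match ends at 0.0, as 舒服 does in the listing above.</p>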


<hr>
<h3>Expanded word-similarity comparison</h3>
<pre>
based on word2vec (word vector)
<span style="color:red">topn=5</span>
</pre>

<p>(mean)</p>

In [10]:
# Expanded word-similarity comparison (mean)

# Build the expanded vocabulary: the top-5 neighbours of every token
ext = []
for t0 in tokens:
    try:
        # lst = model.wv.most_similar(t0, topn=5)
        lst = model.most_similar(t0, topn=5)
    except KeyError:
        # token missing from the model vocabulary: nothing to expand
        lst = []
    for t, _ in lst:
        if (t not in ext):
            ext.append(t)
# print(ext)

n = len(ext)

d = dict()
for a in alst:
    score = 0
    for t in ext:
        try:
            # sim = model.wv.similarity(t, a)
            sim = model.similarity(t, a)
        except KeyError:
            sim = 0
        if (sim < 0):
            sim = 0
        score = score + sim
    if (n == 0):
        score = 0
    else:
        score = score / n
    d[a] = score

# Sort the dictionary by score (the result is a list of pairs)
lst = sorted(d.items(), key=lambda x: x[1], reverse=True)

for a, score in lst:
    print('%s: %10.8f' % (a, score))


苦澀: 0.07565614
真誠: 0.06792980
浪漫: 0.06718439
憂愁: 0.06481532
冷漠: 0.06002934
悠閒: 0.05795810
成熟: 0.05166683
剛強: 0.04832403
青春: 0.04400143
熱情: 0.04246446
甜美: 0.04052958
溫和: 0.03740396
陳腐: 0.03620513
清新: 0.03527851
正義: 0.03379348
有趣: 0.03313669
科技: 0.03078090
活潑: 0.02878336
無聊: 0.02807757
現代: 0.02434960
舒服: 0.01744028
科幻: 0.01729825
驚悚: 0.01413314
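<p>The expansion step gathers the nearest neighbours of each token and removes duplicates while keeping first-seen order. A minimal sketch, using a made-up neighbour table in place of <code>model.most_similar</code>; <code>expand_tokens</code>, <code>NEIGHBOURS</code>, and <code>toy_most_similar</code> are hypothetical names and the neighbour lists are invented.</p>

```python
def expand_tokens(tokens, most_similar, topn=5):
    """Collect each token's nearest neighbours, dropping duplicates but keeping order."""
    seen = {}  # dict keys preserve insertion order and give fast membership tests
    for t0 in tokens:
        try:
            neighbours = most_similar(t0, topn=topn)
        except KeyError:
            neighbours = []  # token not in the vocabulary
        for word, _score in neighbours:
            seen.setdefault(word, None)
    return list(seen)

# Invented neighbour lists standing in for model.most_similar
NEIGHBOURS = {
    '曾經': [('曾', 0.8), ('過去', 0.7)],
    '擁有': [('具有', 0.8), ('過去', 0.5)],
}

def toy_most_similar(word, topn=5):
    return NEIGHBOURS[word][:topn]  # raises KeyError for unknown words

print(expand_tokens(['曾經', '擁有', '天長地久'], toy_most_similar))
```

<p>As in the notebook cell, the expanded list holds only the neighbours: the original tokens are scored only if they happen to appear among another token's neighbours.</p>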


<p>(max)</p>

In [11]:
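# Expanded word-similarity comparison (max)

ext = []
for t0 in tokens:
    try:
        # lst = model.wv.most_similar(t0, topn=5)
        lst = model.most_similar(t0, topn=5)
    except KeyError:
        # token missing from the model vocabulary: nothing to expand
        lst = []
    for t, _ in lst:
        if (t not in ext):
            ext.append(t)
# print(ext)

n = len(ext)

d = dict()
for a in alst:
    score_max = 0
    # NOTE: this loop iterates over `tokens`, not `ext`, so the scores it
    # produces are identical to the direct (max) comparison. Iterate over
    # `ext` instead to score the expanded vocabulary.
    for t in tokens:
        try:
            # sim = model.wv.similarity(t, a)
            sim = model.similarity(t, a)
        except KeyError:
            sim = 0
        if (sim < 0):
            sim = 0
        if (sim > score_max):
            score_max = sim
    d[a] = score_max

# Sort the dictionary by score (the result is a list of pairs)
lst = sorted(d.items(), key=lambda x: x[1], reverse=True)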

for a, score in lst:
    print('%s: %10.8f' % (a, score))


真誠: 0.20308638
浪漫: 0.19319998
憂愁: 0.18272440
成熟: 0.17329067
剛強: 0.16913190
苦澀: 0.16646215
青春: 0.14741611
科技: 0.14216813
甜美: 0.12066689
科幻: 0.11840704
冷漠: 0.11783428
有趣: 0.10830958
清新: 0.10613213
現代: 0.10499865
活潑: 0.09939733
溫和: 0.09786385
熱情: 0.09003606
無聊: 0.08784638
陳腐: 0.08621894
悠閒: 0.06781122
正義: 0.06278086
驚悚: 0.04591414
舒服: 0.00000000


<h3 style="color:orange">Text Classification based on similarity from word2vec (integrated version)</h3>
<p>Expanded word-similarity comparison (mean)</p>

In [12]:
# Text Classification based on similarity from word2vec (integrated version)

import gensim
import jieba
import re

# Classification categories

alst = [
    '清新', '憂愁', '真誠', '青春', '成熟',
    '有趣', '無聊', '溫和', '剛強', '科技',
    '熱情', '冷漠', '正義', '甜美', '苦澀',
    '浪漫', '科幻', '現代', '陳腐', '驚悚',
    '舒服', '活潑', '悠閒'
]

# Extend the stopword list as needed

stopwords = [ '的', '之', '乎', '者', '也', ' ' ]

# Function that loads the language resources (word2vec model, jieba dictionaries)

def load_language_model():
    # model = gensim.models.Word2Vec.load('c:/python/w2v_model/cna_xin_wiki_cis180.model.bin')
    model = gensim.models.KeyedVectors.load_word2vec_format('c:/python/w2v_model/y_360W_cbow_2D_300dim_2020v1.bin', unicode_errors='ignore', binary=True)
    jieba.set_dictionary('dict.txt.big')
    jieba.load_userdict('user.txt')
    return model

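# Functions that remove all non-Chinese characters and symbols

def remove_non_chinese(line):
    # Remove English letters and digits
    rule = re.compile('[a-zA-Z0-9]')
    line = rule.sub(' ', line)
    # Remove special symbols (including some full-width symbols)
    rule = re.compile('[’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~\\s]+')
    line = rule.sub(' ', line)
    # Remove invisible control characters
    rule = re.compile('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+')
    line = rule.sub(' ', line)
    # Remove every remaining character outside the CJK range U+4E00 to U+9FA5
    # (this also covers any full-width symbols not listed above)
    rule = re.compile('[^\u4e00-\u9fa5]')
    line = rule.sub(' ', line)
    return line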

def remove_redundant_space(line):
    line = re.sub(' +', ' ', line)
    return line

# Lexical analyzer: cleanup, segmentation, stopword and length filtering

def lexical_analyzer(txt):
    # print(txt)
    txt = remove_non_chinese(txt)
    txt = remove_redundant_space(txt)
    res = jieba.cut(txt)
    lst = [ x for x in res ]
    # print(lst)
    tokens = []
    for x in lst:
        if (x not in stopwords):
            if (len(x) >= 2):
                tokens.append(x)
    # print(tokens)
    return tokens

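# Text classifier: ranks the categories in alst for a list of tokens
# (uses the module-level `model` loaded below)

def text_classifier(tokens):
    # Expand the vocabulary with the top-5 neighbours of every token
    ext = []
    for t0 in tokens:
        try:
            # lst = model.wv.most_similar(t0, topn=5)
            lst = model.most_similar(t0, topn=5)
        except KeyError:
            # token missing from the model vocabulary: nothing to expand
            lst = []
        for t, _ in lst:
            if (t not in ext):
                ext.append(t)
    # print(ext)
    # Score each category by its mean clipped similarity to the expanded words
    n = len(ext)
    d = dict()
    for a in alst:
        score = 0
        for t in ext:
            try:
                # sim = model.wv.similarity(t, a)
                sim = model.similarity(t, a)
            except KeyError:
                sim = 0
            if (sim < 0):
                sim = 0
            score = score + sim
        if (n == 0):
            score = 0
        else:
            score = score / n
        d[a] = score
    # Sort the dictionary by score (the result is a list of pairs)
    lst = sorted(d.items(), key=lambda x: x[1], reverse=True)
    return lst

# Load the language model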

model = load_language_model()


Building prefix dict from C:\Users\trchou\Dropbox\000 teaching\aisd10a\jupyter\unit_09_text\unit_09_text_word_similarity_classifier\dict.txt.big ...
Loading model from cache C:\Users\trchou\AppData\Local\Temp\jieba.u451fe25cd6e3a6f27d5c8b0d57922321.cache
Loading model cost 0.929 seconds.
Prefix dict has been built successfully.


<hr>
<h3>Integration test</h3>

In [13]:
# Integration test

# Input

# Test sentences
txt = '不在乎天長地久，只在乎曾經擁有'
# txt = '實踐是檢驗真理的唯一方法。'
# txt = '在新的一年裡，希望所有朋友都幸福快樂心想事成！'

# Classify the text

tokens = lexical_analyzer(txt)
result = text_classifier(tokens)

# Show the results

print(txt)
for a, score in result:
    print('%s: %10.8f' % (a, score))


不在乎天長地久，只在乎曾經擁有
苦澀: 0.07565614
真誠: 0.06792980
浪漫: 0.06718439
憂愁: 0.06481532
冷漠: 0.06002934
悠閒: 0.05795810
成熟: 0.05166683
剛強: 0.04832403
青春: 0.04400143
熱情: 0.04246446
甜美: 0.04052958
溫和: 0.03740396
陳腐: 0.03620513
清新: 0.03527851
正義: 0.03379348
有趣: 0.03313669
科技: 0.03078090
活潑: 0.02878336
無聊: 0.02807757
現代: 0.02434960
舒服: 0.01744028
科幻: 0.01729825
驚悚: 0.01413314
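<p>The classifier returns the full ranking; picking final labels is left to the caller. One possible way to do that, sketched under the assumption that the list is already sorted by descending score (as <code>text_classifier</code> returns it). The function <code>top_labels</code> and its parameters <code>k</code> and <code>min_score</code> are hypothetical, not part of the notebook.</p>

```python
def top_labels(result, k=3, min_score=0.0):
    """Return up to k leading (label, score) pairs whose score exceeds min_score."""
    return [(label, score) for label, score in result[:k] if score > min_score]

# First rows of the ranking printed above
result = [
    ('苦澀', 0.07565614),
    ('真誠', 0.06792980),
    ('浪漫', 0.06718439),
    ('憂愁', 0.06481532),
]

for label, score in top_labels(result, k=3):
    print('%s: %10.8f' % (label, score))
```

<p>The scores of neighbouring categories are close together in this example, so any threshold passed as <code>min_score</code> would need to be tuned on real data.</p>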
