H. Hsu and N. Huang, “Xiao-Shih: A Self-enriched Question Answering Bot With Machine Learning on Chinese-Based MOOCs,” <I>IEEE Trans. Learning Technologies</I>. (Under Review)

# Spreading Question Similarity (SQS)
The SQS algorithm computes question similarities based on keyword networks. As the name suggests, it spreads the degree of relationship from a question's keywords to their most relevant neighbors by iterating over the neighbors in the keyword network. As a result, the vector of a question contains not only the keywords that appear in the question but also other relevant keywords, with their similarities integrated into the vector.
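
The spreading idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `xiaoshih` implementation: the function `spread`, the `TOY_NETWORK` table, and its similarity values are all hypothetical, and the rule used here (each keyword passes `weight * similarity` to its neighbors) is an assumption, so it will not reproduce the numbers printed later in this notebook.

```python
from collections import defaultdict

# Hypothetical keyword network: keyword -> list of (neighbor, similarity).
# In the real system the neighbors would presumably come from a word2vec
# model; these values are made up for illustration.
TOY_NETWORK = {
    "plot": [("seaborn", 0.6), ("boxplot", 0.7)],
    "heatmap": [("seaborn", 0.65), ("colormap", 0.64)],
    "seaborn": [("hue", 0.5)],
}


def spread(vsm, network, depth=1, topn=10):
    """Spread weights from the keywords in `vsm` to their neighbors.

    Assumed rule: at each level, a keyword adds `weight * similarity`
    to each of its first `topn` neighbors.
    """
    result = defaultdict(float, vsm)
    frontier = dict(vsm)  # keywords whose weight is spread at this level
    for _ in range(depth):
        next_frontier = defaultdict(float)
        for keyword, weight in frontier.items():
            for neighbor, similarity in network.get(keyword, [])[:topn]:
                contribution = weight * similarity
                result[neighbor] += contribution
                next_frontier[neighbor] += contribution
        frontier = next_frontier
    return dict(result)


toy_vsm = spread({"plot": 1, "heatmap": 1}, TOY_NETWORK, depth=2)
print(toy_vsm)
```

In this toy network, `seaborn` is a neighbor of both `plot` and `heatmap`, so it accumulates weight from both, and at depth 2 part of that weight is passed on to `hue`.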

### xiaoshih.SQS(wv_model_path, keywords)
<b>Parameters:</b>
- wv_model_path: str, <I>path to a word2vec model</I>
- keywords: set, <I>set of keywords used for tokenizing text</I>

In [1]:
import warnings
warnings.filterwarnings('ignore')

## 1. Preparing keywords and word2vec model
Course "Python for Data Science" (PDS): 
- word2vec model: word2vec_model/pds
- keywords: corpus/keywords_pds.txt

Course "Introduction to Computer Networks" (ICN): 
- word2vec model: word2vec_model/icn
- keywords: corpus/keywords_icn.txt

In [2]:
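# Load one keyword per line; the file holds Chinese text and is assumed to be UTF-8
keywords = set()
with open('corpus/keywords_pds.txt', 'r', encoding='utf-8') as f:
    for line in f:
        keyword = line.strip()
        if keyword:  # skip blank lines
            keywords.add(keyword)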

In [3]:
from xiaoshih import SQS
sqs = SQS(wv_model_path='word2vec_model/pds', keywords=keywords)

## 2. Executing SQS and generating VSMs (vector space models) of questions
### SQS.text2ngram(text, n)
<b>Parameters:</b>
- text: str, <I>the text of a question</I>
- n: int, <I>the n used by the n-gram algorithm</I>
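
As a rough picture of keyword-based n-gram tokenizing, the sketch below scans word n-grams of up to `n` words and keeps those found in the keyword set. It is an assumption about the general technique, not the library's code: `keyword_ngrams` is a hypothetical helper, it splits on word boundaries and lowercases the text, and it does not perform Chinese word segmentation the way the real method does.

```python
import re


def keyword_ngrams(text, keywords, n=5):
    """Return the keywords found in `text`, preferring longer n-grams."""
    # Assumption: whitespace/punctuation-separated words, lowercased.
    words = re.findall(r"\w+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest n-gram starting at position i first.
        for size in range(min(n, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if candidate in keywords:
                tokens.append(candidate)
                i += size  # continue after the matched n-gram
                break
        else:
            i += 1  # no keyword starts here; move on one word
    return tokens


toy_keywords = {"plot", "heatmap", "confusion matrix"}
print(keyword_ngrams("how to plot a heatmap?", toy_keywords))
print(keyword_ngrams("plot a confusion matrix", toy_keywords))
```

Matching longer n-grams first lets multi-word keywords such as `confusion matrix` survive as a single token instead of being split into separate words.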

#### An example of how SQS works with English text

In [4]:
question = 'how to plot a heatmap?'

In [5]:
tokens = sqs.text2ngram(text=question, n=5)
print(tokens)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/cr/_xbyk0wn69jdw48ygj_6w_w00000gn/T/jieba.cache
Loading model cost 0.650 seconds.
Prefix dict has been built successfully.


['plot', 'heatmap']


### SQS.ngram2vsm(tokens)
<b>Parameters:</b>
- tokens: list, <I>the list of tokens generated from a question</I>

In [6]:
vsm = sqs.ngram2vsm(tokens)
print(vsm)

{'plot': 1, 'heatmap': 1}
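
The output above is consistent with a plain term-frequency vector, where each token maps to the number of times it occurs. A minimal sketch under that assumption follows; `tokens_to_vsm` is a hypothetical stand-in, not the library method.

```python
from collections import Counter


def tokens_to_vsm(tokens):
    """Map each token to its occurrence count (assumed term-frequency VSM)."""
    return dict(Counter(tokens))


print(tokens_to_vsm(["plot", "heatmap"]))
print(tokens_to_vsm(["plot", "heatmap", "plot"]))
```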


### SQS.spreading_similarity(depth, tokens, topn, vsm)
<b>Parameters:</b>
- depth: int, <I>the depth to which question similarity is spread over the keyword network</I>
- tokens: list, <I>the list of tokens generated from a question</I>
- topn: int, <I>the number of most similar keywords to extract from the word2vec model</I>
- vsm: dict, <I>vector space model of a question</I>

<b>SQS with depth=1</b>

In [7]:
vsm = sqs.spreading_similarity(depth=1, tokens=tokens, topn=10, vsm=vsm)
print(vsm)

{'bitmap': 0.8547123670578003, 'html input': 0.7219830751419067, 'colormap': 0.6446496248245239, 'seaborn': 1.2536998987197876, 'html5': 0.77767014503479, 'graph': 0.569450855255127, 'mplot3d': 0.5833823680877686, 'dml': 0.6419522166252136, 'gml': 0.6470009088516235, 'figsize': 0.6039561033248901, 'ggplot': 0.687764585018158, 'lxml': 0.615425705909729, 'boxplot': 0.7325916290283203, 'html': 0.7161680459976196, 'agents': 0.6629743576049805, 'trix': 0.5658433437347412, 'plot': 2, 'axes': 0.6319538354873657, 'aes': 0.5761990547180176, 'xgboost': 0.6083373427391052, 'heatmap': 2}


<b>SQS with depth=2</b>

In [8]:
vsm = sqs.spreading_similarity(depth=2, tokens=tokens, topn=10, vsm=vsm)
print(vsm)

{'文化部': 0.8027802109718323, 'uml': 0.7251225709915161, 'Seaborn': 1.3778547048568726, '羅吉斯': 0.799808144569397, 'contour': 1.1097870767116547, 'mplot3d': 4.131114661693573, 'web.xml': 3.7222737669944763, 'gas': 0.7223443984985352, 'yaxis': 1.3975439071655273, 'ggplot': 6.6922173500061035, '作圖': 0.7126445174217224, 'tm': 2.1172672510147095, 'ax1': 3.385937988758087, '收錄國內展演': 1.5479412078857422, '3d': 0.6745867729187012, '再加': 0.7167825102806091, '最大值': 0.7477145791053772, 'bitmap': 4.265951991081238, 'confusion matrix': 0.6560948491096497, 'seaborn': 6.709187626838684, 'ml': 0.8418811559677124, 'wt': 0.5851527452468872, '故事': 0.8029372096061707, 'mask': 0.604655385017395, 'fusion': 0.6689865589141846, '國內展演空間': 0.7920407056808472, 'agents': 7.2940661907196045, 'html': 5.269179821014404, 'att': 0.5665087103843689, 'hue': 1.3775079846382141, 'NLP': 0.7477859258651733, 'igraph': 0.6725866198539734, '建物': 0.7144760489463806, 'xml': 2.235849916934967, 'kwds': 0.6766580939292908, 'dml': 7.55

#### Another example of how SQS works with Chinese text

In [9]:
# English: "How do I read a Chinese CSV file?"
question = '如何讀取中文的 CSV 檔?'
tokens = sqs.text2ngram(text=question, n=5)
print(tokens)

['讀取', '中文', 'CSV']


In [10]:
vsm = sqs.ngram2vsm(tokens)
print(vsm)

{'中文': 1, '讀取': 1, 'CSV': 1}


<b>SQS with depth=1</b>

In [11]:
vsm = sqs.spreading_similarity(depth=1, tokens=tokens, topn=10, vsm=vsm)
print(vsm)

{'科學': 0.8575551509857178, 'Q2': 1.7136414051055908, '中文': 2, '讀取': 2.913317620754242, '資料科學實作': 0.8548364639282227, '請參考': 0.853739857673645, 'Excel 檔': 0.868510901927948, '匯出': 1.7170578837394714, '政府': 0.8685304522514343, '編碼': 0.8212184906005859, '解決': 1.7170233726501465, '一段': 0.8692190647125244, '混淆矩陣函式': 0.8130865097045898, 'CSV': 2.913317561149597, '標題': 0.8905748128890991, '斜線': 0.8246365189552307, '錯誤': 0.8125832080841064, '文件': 0.8597566485404968, '顯示中文': 0.9127604961395264, '英文': 0.9175906181335449, '字體': 0.9368928670883179, '資料科學': 0.8830405473709106, '確認': 0.8986414670944214, '呈現': 1.7149540781974792, '電腦': 0.8579882383346558, '案例': 0.8618677854537964, '快捷': 0.8264524936676025}


<b>SQS with depth=2</b>

In [12]:
vsm = sqs.spreading_similarity(depth=2, tokens=tokens, topn=10, vsm=vsm)
print(vsm)

{'科學': 5.4731199741363525, '符號': 3.7402597665786743, '匯出': 12.702610492706299, '小寫': 0.9085090756416321, 'Q2': 6.1409242153167725, '中文': 7.592450499534607, '網頁': 0.8878343105316162, '修改': 0.8456957936286926, 'ascii': 0.8152944445610046, '進階': 1.7766331434249878, '留下': 0.886789858341217, '工程師': 2.841284453868866, '提到': 0.946760892868042, '博客': 0.8902795314788818, '分配': 3.779073417186737, '收錄國內展演': 2.827899932861328, '異常值': 0.9356635212898254, '影片': 0.915725827217102, '麻煩': 0.9059646129608154, '字元': 0.8977420926094055, 'one hot encoding': 0.8175873160362244, '篇': 5.2325122356414795, '斜線': 8.963612794876099, '錯誤': 3.4377496242523193, '文件': 4.383580267429352, '現象': 0.9510537385940552, '位數': 1.8975083827972412, '人數': 0.7670204639434814, '提示': 1.8082762956619263, '答案': 0.9394747018814087, '經典案例': 0.9228866100311279, '資料科學': 5.375200867652893, '社團': 1.8752697706222534, '程式結構': 2.7501797676086426, '詞': 3.6768754720687866, '案例': 4.359405815601349, '空白': 0.951610803604126, '國內展演空間': 2.8462900519

## 3. Computing Question Similarity
### SQS.cosine_similarity(vsm1, vsm2)
<b>Parameters:</b>
- vsm1: dict, <I>vector space model of a question 1</I>
- vsm2: dict, <I>vector space model of a question 2</I>
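
For reference, cosine similarity between two sparse vectors stored as dicts can be computed as below. This is the standard formula written as a standalone sketch; whether the library handles empty vectors the same way is an assumption, and the free function `cosine_similarity` here is not the `SQS` method itself.

```python
import math


def cosine_similarity(vsm1, vsm2):
    """Cosine similarity of two sparse vectors given as {keyword: weight}."""
    # Only keywords present in both vectors contribute to the dot product.
    shared = set(vsm1) & set(vsm2)
    dot = sum(vsm1[k] * vsm2[k] for k in shared)
    norm1 = math.sqrt(sum(v * v for v in vsm1.values()))
    norm2 = math.sqrt(sum(v * v for v in vsm2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm1 * norm2)


print(cosine_similarity({"plot": 1, "heatmap": 1}, {"plot": 1}))
```

Because the vectors are keyed by keyword, two questions that share no keywords score 0, which is why spreading weights to related keywords before comparing can raise the similarity of questions phrased in different words.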

<b>An example of duplicate questions phrased differently by two students.</b>

In [13]:
# English: "Question about !dot -Tpng tree.dot -o tree.png. Hello teacher: when I run the
# decision tree classifier, executing !dot -Tpng tree.dot -o tree.png gives: 'dot' is not
# recognized as an internal or external command, operable program or batch file.
# I don't know what causes this. Thank you for your help."
q1 = "!dot -Tpng tree.dot -o tree.png 的問題 老師好:  我在執行決策分類樹時，執行!dot -Tpng tree.dot -o tree.png跑出來的結果是:'dot' 不是內部或外部命令、可執行的程式或批次檔。不知道是什麼原因造成這樣，麻煩老師了。"
tokens1 = sqs.text2ngram(text=q1, n=5)
vsm1 = sqs.ngram2vsm(tokens1)
vsm1 = sqs.spreading_similarity(depth=1, tokens=tokens1, topn=10, vsm=vsm1)

In [14]:
# English: "dot command not found. While watching the course video, I ran into a problem
# converting dot to a png file. Command: !dot -Tpng tree.dot -o tree.png. Error message:
# 'dot' is not recognized as an internal or external command, operable program or batch file.
# I then downloaded graphviz, but it is still not solved. How can I install it and fix
# this? PS: the computer runs Windows 10."
q2 = "dot command not found 在觀看課程影片的時候，dot轉換成png檔時發生問題，執行程式: !dot -Tpng tree.dot -o tree.png錯誤訊息: 'dot' 不是內部或外部命令、可執行的程式或批次檔。後來我去上網下載graphviz後，依然沒辦法解決。想請問有什麼方法可以下載和解決?PS: 電腦是使用windows 10"
tokens2 = sqs.text2ngram(text=q2, n=5)
vsm2 = sqs.ngram2vsm(tokens2)
vsm2 = sqs.spreading_similarity(depth=1, tokens=tokens2, topn=10, vsm=vsm2)

In [15]:
sqs.cosine_similarity(vsm1, vsm2)

0.8578461340775784