# Text Similarity

https://github.com/MuJiang0618/NLP/blob/master/%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97%E4%B8%A4%E4%B8%AA%E6%96%87%E6%A1%A3%E7%9A%84%E7%9B%B8%E4%BC%BC%E5%BA%A6.pdf

In [1]:
from gensim import corpora, models, similarities
import logging



In [2]:
logging.basicConfig(format=' %(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
documents = ["Shipment of gold damaged in a fire",
            "Delivery of silver arrived in a silver truck",
             "Shipment of gold arrived in a truck"]

Convert everything to lowercase and split on whitespace

In [4]:
texts = [document.lower().split() for document in documents]
print(texts)

[['shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'], ['delivery', 'of', 'silver', 'arrived', 'in', 'a', 'silver', 'truck'], ['shipment', 'of', 'gold', 'arrived', 'in', 'a', 'truck']]


In [6]:
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

 2018-10-08 21:35:41,060 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
 2018-10-08 21:35:41,060 : INFO : built Dictionary(11 unique tokens: ['a', 'damaged', 'fire', 'gold', 'in']...) from 3 documents (total 22 corpus positions)


{'a': 0, 'damaged': 1, 'fire': 2, 'gold': 3, 'in': 4, 'of': 5, 'shipment': 6, 'arrived': 7, 'delivery': 8, 'silver': 9, 'truck': 10}


In [7]:
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)]]
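
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of what `doc2bow` produces for this corpus. The helper `to_bow` is a hypothetical stand-in written for illustration, not gensim's implementation, and `token2id` is copied by hand from the dictionary output above.

```python
from collections import Counter

texts = [
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
    "shipment of gold arrived in a truck".split(),
]

# Copied from the dictionary.token2id output above (assumed fixed for this sketch).
token2id = {'a': 0, 'damaged': 1, 'fire': 2, 'gold': 3, 'in': 4, 'of': 5,
            'shipment': 6, 'arrived': 7, 'delivery': 8, 'silver': 9, 'truck': 10}


def to_bow(tokens, token2id):
    """Hypothetical helper: return sorted (token_id, count) pairs.

    Tokens missing from token2id are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


corpus = [to_bow(text, token2id) for text in texts]
for doc in corpus:
    print(doc)
```

The pair `(9, 2)` in the second document records that "silver" (id 9) occurs twice there; every other word occurs once in its document.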


In [9]:
tfidf = models.TfidfModel(corpus)

 2018-10-08 22:08:06,121 : INFO : collecting document frequencies
 2018-10-08 22:08:06,123 : INFO : PROGRESS: processing document #0
 2018-10-08 22:08:06,125 : INFO : calculating IDF weights for 3 documents and 10 features (21 matrix non-zeros)


In [17]:
corpus_tfidf = tfidf[corpus]   # tf-idf weight of each word in each document; a multi-document corpus yields one vector per document
for doc in corpus_tfidf:
    print(doc)

[(1, 0.6633689723434505), (2, 0.6633689723434505), (3, 0.2448297500958463), (6, 0.2448297500958463)]
[(7, 0.16073253746956623), (8, 0.4355066251613605), (9, 0.871013250322721), (10, 0.16073253746956623)]
[(3, 0.5), (6, 0.5), (7, 0.5), (10, 0.5)]
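
The numbers above can be reproduced by hand. The sketch below assumes the weighting is raw term count times `log2(N / df)`, followed by L2 normalization of each document vector. That assumption matches the printed values but is inferred from them, not read from gensim's source. The function name `tfidf_vector` is made up for this example.

```python
import math

# Bag-of-words corpus copied from the doc2bow output above.
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
    [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)],
]

num_docs = len(corpus)

# Document frequency: in how many documents each term id appears.
dfs = {}
for doc in corpus:
    for term_id, _count in doc:
        dfs[term_id] = dfs.get(term_id, 0) + 1

# Assumed idf formula: log2(N / df), with no smoothing.
idfs = {term_id: math.log2(num_docs / df) for term_id, df in dfs.items()}


def tfidf_vector(doc):
    """Weight each term by count * idf, drop zero weights, then L2-normalize."""
    weights = [(term_id, count * idfs[term_id]) for term_id, count in doc]
    weights = [(term_id, w) for term_id, w in weights if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    return [(term_id, w / norm) for term_id, w in weights]


corpus_tfidf = [tfidf_vector(doc) for doc in corpus]
for doc in corpus_tfidf:
    print(doc)
```

In the third document all four surviving terms have the same idf and a count of 1, so after normalization each one gets weight 1/sqrt(4) = 0.5.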


In [13]:
print(tfidf.dfs)   # 0: 3 means the word with id 0 appears in 3 documents

{0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}


In [14]:
print(tfidf.idfs)   # idf of each word (note: not the tf-idf value)

{0: 0.0, 1: 1.5849625007211563, 2: 1.5849625007211563, 3: 0.5849625007211562, 4: 0.0, 5: 0.0, 6: 0.5849625007211562, 7: 0.5849625007211562, 8: 1.5849625007211563, 9: 1.5849625007211563, 10: 0.5849625007211562}


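The words with ids 0, 4, and 5 each appear in all 3 documents (df = 3), and the corpus has 3 documents in total, so their idf works out to log2(3/3) = 0. That is why "a", "in", and "of" are missing from the tf-idf vectors printed earlier. It appears that gensim does not smooth the idf by adding 1 to the numerator.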
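
As a rough illustration of the point about smoothing, the sketch below compares the unsmoothed `log2(N / df)` with a variant that adds 1 to the numerator. Both function names are hypothetical. The smoothed formula is just one possible choice, not something gensim applies by default. `TfidfModel` does accept a `wglobal` callable for custom global weights, which would be one place to try such a function, but that is not exercised here.

```python
import math

# Document frequencies copied from the tfidf.dfs output above.
dfs = {0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}
num_docs = 3


def plain_idf(doc_freq, total_docs):
    """Unsmoothed idf, which matches the idf values printed above."""
    return math.log2(total_docs / doc_freq)


def smoothed_idf(doc_freq, total_docs):
    """Assumed variant: add 1 to the numerator so idf never reaches 0."""
    return math.log2((total_docs + 1) / doc_freq)


for term_id in sorted(dfs):
    df = dfs[term_id]
    print(term_id, df, plain_idf(df, num_docs), smoothed_idf(df, num_docs))
```

With the smoothed variant, words that occur in every document keep a small positive weight instead of vanishing from the tf-idf vectors.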

In [18]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(2)

 2018-10-08 22:24:22,607 : INFO : using serial LSI version on this node
 2018-10-08 22:24:22,609 : INFO : updating model with new documents
 2018-10-08 22:24:22,611 : INFO : preparing a new chunk of documents
 2018-10-08 22:24:22,612 : INFO : using 100 extra samples and 2 power iterations
 2018-10-08 22:24:22,613 : INFO : 1st phase: constructing (11, 102) action matrix
 2018-10-08 22:24:22,619 : INFO : orthonormalizing (11, 102) action matrix
 2018-10-08 22:24:22,641 : INFO : 2nd phase: running dense svd on (11, 3) matrix
 2018-10-08 22:24:22,652 : INFO : computing the final decomposition
 2018-10-08 22:24:22,653 : INFO : keeping 2 factors (discarding 23.571% of energy spectrum)
 2018-10-08 22:24:22,655 : INFO : processed documents up to #3
 2018-10-08 22:24:22,657 : INFO : topic #0(1.137): -0.438*"gold" + -0.438*"shipment" + -0.366*"arrived" + -0.366*"truck" + -0.345*"fire" + -0.345*"damaged" + -0.297*"silver" + -0.149*"delivery" + 0.000*"a" + 0.000*"in"
 2018-10-08 22:24:22,658 : INF

[(0,
  '-0.438*"gold" + -0.438*"shipment" + -0.366*"arrived" + -0.366*"truck" + -0.345*"fire" + -0.345*"damaged" + -0.297*"silver" + -0.149*"delivery" + 0.000*"a" + 0.000*"in"'),
 (1,
  '-0.728*"silver" + -0.364*"delivery" + 0.364*"fire" + 0.364*"damaged" + -0.134*"arrived" + -0.134*"truck" + 0.134*"gold" + 0.134*"shipment" + 0.000*"a" + -0.000*"in"')]

In [25]:
corpus_lsi = lsi[corpus_tfidf]
for doc in corpus_lsi:
    print(doc)

[(0, -0.6721146880987869), (1, 0.5488068211935582)]
[(0, -0.4412482520869763), (1, -0.8359492048033912)]
[(0, -0.804013789637927)]


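As the output shows, documents 1 and 2 both load strongly on topic 2 (with opposite signs), while document 3 loads only on topic 1, with a negative weight. Document 3 has no topic 2 entry at all, presumably because gensim leaves near-zero components out of the sparse output.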
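
Since the goal is document similarity, one natural next step is to compare the documents by the cosine of their LSI vectors. The sketch below does this in plain Python so the arithmetic stays visible. In gensim the usual tool would be `similarities.MatrixSimilarity`, which is not used here. The vectors are copied (rounded) from the output above, and the missing topic entry for document 3 is assumed to be 0.0.

```python
import math

# LSI vectors copied (rounded) from the output above.
# Assumption: the entry missing from document 3 is treated as 0.0.
corpus_lsi = [
    [-0.6721, 0.5488],   # document 1
    [-0.4412, -0.8359],  # document 2
    [-0.8040, 0.0],      # document 3
]


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


for i in range(len(corpus_lsi)):
    for j in range(i + 1, len(corpus_lsi)):
        sim = cosine(corpus_lsi[i], corpus_lsi[j])
        print(f"doc {i + 1} vs doc {j + 1}: {sim:.3f}")
```

Under these assumptions, documents 1 and 3 (the two gold shipments) come out as the most similar pair, documents 2 and 3 next, and documents 1 and 2 as the least similar.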

Next, let's run LDA and see what it gives

In [27]:
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(2)

 2018-10-08 22:33:14,158 : INFO : using symmetric alpha at 0.5
 2018-10-08 22:33:14,160 : INFO : using symmetric eta at 0.5
 2018-10-08 22:33:14,161 : INFO : using serial LDA version on this node
 2018-10-08 22:33:14,168 : INFO : running online (single-pass) LDA training, 2 topics, 1 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
 2018-10-08 22:33:14,176 : INFO : -4.106 per-word bound, 17.2 perplexity estimate based on a held-out corpus of 3 documents with 5 words
 2018-10-08 22:33:14,177 : INFO : PROGRESS: pass 0, at document #3/3
 2018-10-08 22:33:14,179 : INFO : topic #0 (0.500): 0.117*"damaged" + 0.115*"silver" + 0.112*"fire" + 0.096*"delivery" + 0.094*"truck" + 0.093*"gold" + 0.090*"arrived" + 0.089*"shipment" + 0.065*"a" + 0.065*"in"
 2018-10-08 22:33:14,180 : INFO : topic #1 (0.500): 0.121*"shipment" + 0.118*"gold" + 0.113*"silver" + 0.111*"arri

[(0,
  '0.117*"damaged" + 0.115*"silver" + 0.112*"fire" + 0.096*"delivery" + 0.094*"truck" + 0.093*"gold" + 0.090*"arrived" + 0.089*"shipment" + 0.065*"a" + 0.065*"in"'),
 (1,
  '0.121*"shipment" + 0.118*"gold" + 0.113*"silver" + 0.111*"arrived" + 0.108*"truck" + 0.091*"fire" + 0.087*"damaged" + 0.079*"delivery" + 0.057*"a" + 0.057*"in"')]

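Within each topic, the larger a word's weight, the more strongly that word is associated with the topic. Here, though, each word gets nearly the same weight in both topics, so the result is not convincing. That is likely a consequence of training on only three short documents in a single pass.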