# Corpora and Vector Spaces

First, set up the environment.

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

Folder "/tmp" will be used to save temporary dictionary and corpus.


## From Strings to Bag-of-Words Vectors

In [3]:
from gensim import corpora

2017-11-07 10:38:32,174 : INFO : 'pattern' package not found; tag filters are not available for English


In [4]:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

This list can be treated as a small corpus: each string is one document, and there are 9 documents in total.

First, preprocess the documents: remove stopwords and drop the words that appear only once in the whole corpus.

In [5]:
# Stopwords
stoplist = set('for a of the and to in'.split())
texts = [ [word for word in document.lower().split() if word not in stoplist ]
         for document in documents ]

# Remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1 ] for text in texts]

from pprint import pprint
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


Next, convert the texts to the bag-of-words model.

In [6]:
dictionary = corpora.Dictionary(texts)
# Save the dictionary to disk
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))
print(dictionary)

2017-11-07 10:38:32,247 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-11-07 10:38:32,257 : INFO : built Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...) from 9 documents (total 29 corpus positions)
2017-11-07 10:38:32,260 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None
2017-11-07 10:38:32,263 : INFO : saved /tmp/deerwester.dict


Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)


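Here, the gensim.corpora.dictionary.Dictionary class assigns an integer id to each word and also collects word counts and related statistics. The output shows that 12 distinct words occur across all documents. dictionary.token2id shows the id assigned to each word.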

In [7]:
print(dictionary.token2id)

{u'minors': 11, u'graph': 10, u'system': 6, u'trees': 9, u'eps': 8, u'computer': 1, u'survey': 5, u'user': 7, u'human': 2, u'time': 4, u'interface': 0, u'response': 3}


Now take a new document and see what it looks like as a bag-of-words vector.

In [8]:
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
pprint(new_vec)

[(1, 1), (2, 1)]


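The word "interaction" does not appear in the dictionary, so it is silently ignored and does not show up in the sparse vector.
In scikit-learn terms, doc2bow() behaves like calling transform() on a fitted [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html): the vocabulary is already fixed, and the call only counts known words.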
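
To make the "unseen words are dropped" behavior concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation: build_token2id and to_bow are hypothetical helpers, and ids are assigned in order of first appearance, which need not match the numbering gensim produces.

```python
from collections import Counter

def build_token2id(texts):
    """Assign each distinct token an integer id in order of first appearance (hypothetical helper)."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Count only tokens already in the mapping; unseen tokens are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# The mapping is fixed first, from two of the preprocessed texts above
toy_token2id = build_token2id([["human", "interface", "computer"],
                               ["system", "human", "system", "eps"]])

# "interaction" is not in the mapping, so it contributes nothing
toy_vec = to_bow("human computer interaction".split(), toy_token2id)
print(toy_vec)
```

The result is a list of (id, count) pairs sorted by id, the same shape as the vectors printed by doc2bow().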

Next, use the dictionary built above to convert the whole corpus to sparse vectors.

In [9]:
corpus = [dictionary.doc2bow(text) for text in texts] 
# Save the bag-of-words corpus to disk
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'deerwester.mm'), corpus)
for c in corpus:
    print(c)

2017-11-07 10:38:32,320 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2017-11-07 10:38:32,322 : INFO : saving sparse matrix to /tmp/deerwester.mm
2017-11-07 10:38:32,324 : INFO : PROGRESS: saving document #0
2017-11-07 10:38:32,326 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2017-11-07 10:38:32,327 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index


[(0, 1), (1, 1), (2, 1)]
[(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (6, 1), (7, 1), (8, 1)]
[(2, 1), (6, 2), (8, 1)]
[(3, 1), (4, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(5, 1), (10, 1), (11, 1)]


## Corpus Streaming

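If the training corpus is large, it is impractical to hold all of it in memory.
Assume the documents are stored in a file on disk, one document per line.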

In [10]:
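class MyCorpus(object):
    def __iter__(self):
        # Read one line (one document) at a time, so the whole file is never in memory
        with open('/home/david/code/jupyter/gensim/mycorpus.txt') as f:
            for line in f:
                # Assumes tokens are separated by whitespace
                yield dictionary.doc2bow(line.lower().split())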

In [11]:
corpus_memory_friendly = MyCorpus()
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x7fbdfac13210>


In [12]:
for vector in corpus_memory_friendly:
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(1, 1), (3, 1), (5, 1), (6, 1), (7, 1)]
[(0, 1), (7, 1), (8, 1)]
[(2, 1), (6, 2)]
[(3, 1), (4, 1), (7, 1)]
[]
[(10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(5, 1), (10, 1), (11, 1)]
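
One design point worth spelling out: the corpus is a class with an __iter__ method rather than a plain generator, so it can be iterated more than once (a generator is exhausted after a single pass). The sketch below illustrates this with the standard library only. LineCorpus and the demo file name are hypothetical and not part of gensim.

```python
import os
import tempfile

class LineCorpus(object):
    """Hypothetical re-iterable corpus: yields one token list per line of a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated repeatedly
        with open(self.path) as f:
            for line in f:
                yield line.lower().split()

# Write a two-line demo file to the temporary folder (file name is illustrative)
demo_path = os.path.join(tempfile.gettempdir(), 'line_corpus_demo.txt')
with open(demo_path, 'w') as f:
    f.write("Graph minors A survey\nThe EPS user interface\n")

demo_corpus = LineCorpus(demo_path)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a second pass works, unlike with an exhausted generator
print(first_pass == second_pass)
```

This matters in practice because training code often needs several passes over the same corpus.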


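Next, build the dictionary in the same streaming fashion, without loading the whole dataset into memory. Then collect the token ids of the stopwords, along with the ids of the words that appear in only one document (document frequency of 1). Finally, filter all of these ids out of the dictionary.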

Note that dictionary.filter_tokens (and some other methods, such as dictionary.filter_extremes) calls dictionary.compactify() to close the gaps in the id sequence, so the token ids of the remaining words may change.
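
A minimal sketch of why ids shift after filtering, in plain Python. remove_and_compact is a hypothetical helper, not gensim code, and it assumes the surviving tokens keep their relative order when renumbered.

```python
def remove_and_compact(token2id, bad_ids):
    """Drop the given ids, then renumber the survivors 0..n-1 in order of old id (hypothetical helper)."""
    bad = set(bad_ids)
    kept = sorted((old_id, token) for token, old_id in token2id.items() if old_id not in bad)
    return {token: new_id for new_id, (old_id, token) in enumerate(kept)}

before = {'a': 0, 'graph': 1, 'of': 2, 'trees': 3}
after = remove_and_compact(before, bad_ids=[0, 2])  # remove the stopwords 'a' and 'of'
print(after)
```

Here 'graph' moves from id 1 to id 0 and 'trees' from id 3 to id 1. Any bag-of-words vectors computed before the filtering step refer to the old ids and should be recomputed.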

In [13]:
# Collect statistics about all tokens, streaming the file line by line
with open('/home/david/code/jupyter/gensim/mycorpus.txt') as f:
    dictionary2 = corpora.Dictionary(line.lower().split() for line in f)

# Find the ids of the stopwords and of the words that appear in only one document
stop_ids = [dictionary2.token2id[stopword] for stopword in stoplist if stopword in dictionary2.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary2.dfs.items() if docfreq == 1]

# Remove those ids from the dictionary (this also compacts the remaining ids)
dictionary2.filter_tokens(stop_ids + once_ids)
print(dictionary2)

2017-11-07 10:38:32,422 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-11-07 10:38:32,425 : INFO : built Dictionary(46 unique tokens: [u'and', u'paths', u'minors', u'generation', u'testing']...) from 9 documents (total 69 corpus positions)


Dictionary(10 unique tokens: [u'minors', u'graph', u'trees,', u'system', u'computer']...)


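Several file formats exist for saving a vector-space corpus to disk. Gensim implements them through the streaming corpus interface described above, so documents are read from or written to disk one at a time.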
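
One such format is Matrix Market, a plain-text coordinate layout. As a rough illustration of what it looks like, the sketch below renders a bag-of-words corpus as Matrix Market coordinate text. to_matrix_market is a hypothetical helper written for this example; it is not gensim's writer, and it assumes the number of terms is passed in explicitly.

```python
def to_matrix_market(corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text (illustrative helper)."""
    entries = []
    for doc_no, doc in enumerate(corpus):
        for term_id, weight in doc:
            # Matrix Market indices are 1-based: row = document, column = term
            entries.append("%d %d %s" % (doc_no + 1, term_id + 1, weight))
    header = "%%MatrixMarket matrix coordinate real general"
    # Size line: number of rows, number of columns, number of non-zero entries
    size_line = "%d %d %d" % (len(corpus), num_terms, len(entries))
    return "\n".join([header, size_line] + entries)

mm_text = to_matrix_market([[(1, 0.5)], []], num_terms=2)
print(mm_text)
```

Only non-zero entries are stored, which is why an empty document adds a row to the size line but no entry line.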

**Saving the corpus**

In [14]:
# Create a toy corpus of two documents; the second one is empty
corpus = [[(1,0.5)], []]
corpora.MmCorpus.serialize(os.path.join(TEMP_FOLDER, 'corpus.mm'), corpus)

2017-11-07 10:38:32,451 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2017-11-07 10:38:32,453 : INFO : saving sparse matrix to /tmp/corpus.mm
2017-11-07 10:38:32,467 : INFO : PROGRESS: saving document #0
2017-11-07 10:38:32,468 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2017-11-07 10:38:32,470 : INFO : saving MmCorpus index to /tmp/corpus.mm.index


**Loading the corpus**

In [15]:
corpus = corpora.MmCorpus(os.path.join(TEMP_FOLDER, 'corpus.mm'))
print(corpus)

2017-11-07 10:38:32,478 : INFO : loaded corpus index from /tmp/corpus.mm.index
2017-11-07 10:38:32,480 : INFO : initializing corpus reader from /tmp/corpus.mm
2017-11-07 10:38:32,482 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries


MmCorpus(2 documents, 2 features, 1 non-zero entries)


In [16]:
print(list(corpus))

[[(1, 0.5)], []]


In [17]:
# Alternatively, iterate over the corpus and print one document at a time
for doc in corpus:
    print(doc)

[(1, 0.5)]
[]


## Compatibility with NumPy and SciPy

Omitted.