
# Preparing the corpus

In [5]:
from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

from pprint import pprint



## Removing stop words

In [18]:
# Remove stop words (common words that carry little meaning)
stoplist = set('for the a and an if in on of to '.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
pprint(texts)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]


## Removing words that appear only once

In [19]:
# Remove words that appear only once in the whole corpus
from collections import defaultdict
frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]  # the threshold sets the minimum number of occurrences
pprint(texts)


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
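
The frequency filter above can also be written with an adjustable threshold. The following is a minimal sketch using only the standard library; the helper name `drop_rare_tokens` and its `min_count` parameter are assumptions introduced for illustration and are not part of gensim.

```python
from collections import Counter


def drop_rare_tokens(texts, min_count=2):
    """Keep only tokens that occur at least `min_count` times across all texts."""
    # Count every token occurrence over the whole corpus.
    frequency = Counter(token for text in texts for token in text)
    return [[token for token in text if frequency[token] >= min_count]
            for text in texts]


sample = [["graph", "minors", "trees"], ["graph", "trees"], ["trees"]]
print(drop_rare_tokens(sample))               # [['graph', 'trees'], ['graph', 'trees'], ['trees']]
print(drop_rare_tokens(sample, min_count=3))  # [['trees'], ['trees'], ['trees']]
```

With `min_count=2` this behaves like the `frequency[token] > 1` condition; raising the value prunes the vocabulary more aggressively.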


# Converting documents to vectors

## Conversion principle
* Each document is represented by one vector, and each element of that vector is a question-answer pair of the form:

* “How many times does the word *system* appear in the document? Once.”
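
The question-answer idea can be sketched in plain Python. This is only an illustration of the bag-of-words principle, not gensim's implementation; the helper name `to_bow` and the three-word vocabulary are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical vocabulary: each known word gets an integer ID.
token2id = {"human": 0, "system": 1, "eps": 2}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in a document."""
    # Words missing from the vocabulary are silently ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


doc = ["system", "human", "system", "engineering", "testing", "eps"]
print(to_bow(doc, token2id))  # [(0, 1), (1, 2), (2, 1)]
```

The pair `(1, 2)` answers the question "how many times does *system* appear in this document?" with "twice".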


## Creating the dictionary

In [22]:
dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)


## The ID assigned to each token in the dictionary

In [25]:
print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


## Converting to vectors with doc2bow

In [26]:
corpus = [dictionary.doc2bow(text) for text in texts]
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


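* Each row corresponds to the same row of the `texts` list above. For example, the pair `(10, 1)` in the last row means that token ID 10 (`graph`) appears once in the last document.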
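
To read a vector back as words, the token-to-ID mapping can be inverted. The sketch below copies the mapping from the `token2id` output above into a plain dict; the helper name `decode_bow` is an assumption for illustration.

```python
# Token-to-ID mapping copied from the dictionary output above.
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}

# Invert the mapping so that IDs can be looked up.
id2token = {token_id: token for token, token_id in token2id.items()}


def decode_bow(bow):
    """Translate (token_id, count) pairs into (token, count) pairs."""
    return [(id2token[token_id], count) for token_id, count in bow]


last_document = [(4, 1), (10, 1), (11, 1)]
print(decode_bow(last_document))  # [('survey', 1), ('graph', 1), ('minors', 1)]
```

With a gensim `Dictionary`, indexing by ID (`dictionary[10]`) should give the same lookup without building the inverse mapping by hand.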

# Streaming documents one at a time instead of loading them all

In [29]:
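class Mycorpus(object):
    """Stream bag-of-words vectors from a file, one document per line."""

    def __iter__(self):
        # Read lazily so that only one line is held in memory at a time.
        with open('mycorpus.txt') as f:
            for line in f:
                yield dictionary.doc2bow(line.lower().split())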


mycorpus = Mycorpus()
print(mycorpus)

<__main__.Mycorpus object at 0x1a1c7edc88>


* The resulting corpus is an object, so printing it shows only its type and memory address.

In [30]:
for vector in mycorpus:
    print(vector)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]


* The output is the same as before, but only one document is held in memory at a time.
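
One reason to wrap the stream in a class with `__iter__`, rather than using a bare generator, is that the corpus can then be traversed more than once. The following is a self-contained sketch of that pattern; the class name `StreamedCorpus` and the temporary file contents are assumptions for illustration, and the example yields token lists instead of calling `doc2bow` so that it runs without gensim.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one tokenized document per line, reading the file lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every iteration, so the corpus is reusable.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Graph minors A survey\nThe EPS user interface\n")
    path = tmp.name

stream = StreamedCorpus(path)
first_pass = list(stream)
second_pass = list(stream)  # a second traversal works as well
os.remove(path)

print(first_pass)
print(first_pass == second_pass)
```

A plain generator would be exhausted after the first pass, which matters for algorithms that need several passes over the corpus.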

# Building the dictionary without loading the whole file into memory

In [33]:
# collect statistics about all tokens, reading the file one line at a time
with open('mycorpus.txt') as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in f)
# IDs of stop words that made it into the dictionary
stop_ids = [
    dictionary.token2id[stopword] for stopword in stoplist
    if stopword in dictionary.token2id
]
# IDs of words that appear in only one document
once_ids = [
    tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1
]
# remove stop words and words that appear in only one document
dictionary.filter_tokens(stop_ids + once_ids)
# remove gaps in the ID sequence left by the removed words
dictionary.compactify()
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
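
Note that `dictionary.dfs` holds document frequencies (how many documents contain a token), whereas the `frequency` dict built earlier counted total occurrences. For the nine documents used here both filters appear to keep the same 12 tokens, but the two counts can differ in general. A minimal sketch of the difference, using made-up sample documents and only the standard library:

```python
from collections import Counter

# Hypothetical sample documents, already tokenized.
docs = [["system", "human", "system", "eps"],
        ["graph", "trees"],
        ["graph", "minors"]]

# Total occurrences across the corpus.
total_freq = Counter(token for doc in docs for token in doc)

# Number of documents containing each token (set() removes repeats per document).
doc_freq = Counter(token for doc in docs for token in set(doc))

print(total_freq["system"], doc_freq["system"])  # 2 1
print(total_freq["graph"], doc_freq["graph"])    # 2 2
```

Here `system` occurs twice but in a single document, so a `docfreq == 1` filter would remove it while a total-count filter would keep it.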


# Saving and loading a corpus

In [34]:
corpora.MmCorpus.serialize('corpus.mm', corpus)

In [36]:
corpus = corpora.MmCorpus('corpus.mm')
print(corpus)

MmCorpus(9 documents, 12 features, 28 non-zero entries)


* To view the contents of a corpus:

In [37]:
pprint(list(corpus))

[[(0, 1.0), (1, 1.0), (2, 1.0)],
 [(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
 [(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)],
 [(1, 1.0), (5, 2.0), (8, 1.0)],
 [(3, 1.0), (6, 1.0), (7, 1.0)],
 [(9, 1.0)],
 [(9, 1.0), (10, 1.0)],
 [(9, 1.0), (10, 1.0), (11, 1.0)],
 [(4, 1.0), (10, 1.0), (11, 1.0)]]


# Converting to and from NumPy matrices

In [38]:
import numpy as np
numpy_matrix = np.random.randint(10, size=[5, 2])
numpy_matrix

array([[6, 3],
       [4, 8],
       [2, 6],
       [0, 1],
       [4, 6]])

In [43]:
import gensim
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
pprint(list(corpus))

[[(0, 6.0), (1, 4.0), (2, 2.0), (4, 4.0)],
 [(0, 3.0), (1, 8.0), (2, 6.0), (3, 1.0), (4, 6.0)]]
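
The output above has two documents with up to five entries each, which suggests that each column of the 5x2 matrix is treated as one document and that zero entries are omitted. The plain-Python sketch below reproduces that layout for the same matrix; the helper name `dense_columns_to_corpus` is an assumption for illustration and is not how gensim implements the conversion.

```python
def dense_columns_to_corpus(matrix):
    """Turn each column of a row-major matrix into sparse (row_index, value) pairs."""
    num_cols = len(matrix[0])
    return [
        # Skip zero entries, since a sparse vector stores only non-zero values.
        [(row_idx, float(row[col]))
         for row_idx, row in enumerate(matrix) if row[col] != 0]
        for col in range(num_cols)
    ]


# Same values as the random matrix shown above.
matrix = [[6, 3], [4, 8], [2, 6], [0, 1], [4, 6]]
for document in dense_columns_to_corpus(matrix):
    print(document)
# [(0, 6.0), (1, 4.0), (2, 2.0), (4, 4.0)]
# [(0, 3.0), (1, 8.0), (2, 6.0), (3, 1.0), (4, 6.0)]
```

The zero at row 3 of the first column is why the first document has no entry with index 3.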
