/
dictionary.html
221 lines (191 loc) · 12 KB
/
dictionary.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>corpora.dictionary – Construct word<->id mappings — gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: '../',
VERSION: '0.7.7',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<link rel="top" title="gensim documentation" href="../index.html" />
<link rel="up" title="API Reference" href="../apiref.html" />
<link rel="next" title="corpora.dmlcorpus – Corpus in DML-CZ format" href="dmlcorpus.html" />
<link rel="prev" title="corpora.bleicorpus – Corpus in Blei’s LDA-C format" href="bleicorpus.html" />
<script type="text/javascript" src="_static/widget.js"></script>
</head>
<body>
<!--
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<a href="../index.html"><img src="../_static/logo.png" border="0" alt="gensim logo"/></a>
</div>
--!>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="../py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="dmlcorpus.html" title="corpora.dmlcorpus – Corpus in DML-CZ format"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="bleicorpus.html" title="corpora.bleicorpus – Corpus in Blei’s LDA-C format"
accesskey="P">previous</a> |</li>
<li><a href="../index.html">Gensim home</a>| </li>
<li><a href="../apiref.html">API reference</a>| </li>
<li><a href="../tutorial.html">Tutorials</a> »</li>
<li><a href="../apiref.html" accesskey="U">API Reference</a> »</li>
</ul>
</div>
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
<p class="topless"><a href="bleicorpus.html"
title="previous chapter"><tt class="docutils literal docutils literal"><span class="pre">corpora.bleicorpus</span></tt> – Corpus in Blei’s LDA-C format</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="dmlcorpus.html"
title="next chapter"><tt class="docutils literal"><span class="pre">corpora.dmlcorpus</span></tt> – Corpus in DML-CZ format</a></p>
<div id="searchbox" style="display: none">
<h3>Quick search</h3>
<form class="search" action="../search.html" method="get">
<input type="text" name="q" size="18" />
<input type="submit" value="Go" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
<p class="searchtip" style="font-size: 90%">
Enter search terms or a module, class or function name.
</p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
<div class="section" id="module-gensim.corpora.dictionary">
<span id="corpora-dictionary-construct-word-id-mappings"></span><h1><tt class="xref py py-mod docutils literal"><span class="pre">corpora.dictionary</span></tt> – Construct word<->id mappings<a class="headerlink" href="#module-gensim.corpora.dictionary" title="Permalink to this headline">¶</a></h1>
<p>This module implements the concept of Dictionary – a mapping between words and
their integer ids.</p>
<p>Dictionaries can be created from a corpus and can later be pruned according to
document frequency (removing (un)common words via the <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.filterExtremes" title="gensim.corpora.dictionary.Dictionary.filterExtremes"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.filterExtremes()</span></tt></a> method),
save/loaded from disk via <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.save" title="gensim.corpora.dictionary.Dictionary.save"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.save()</span></tt></a> and <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.load" title="gensim.corpora.dictionary.Dictionary.load"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.load()</span></tt></a> methods etc.</p>
<dl class="class">
<dt id="gensim.corpora.dictionary.Dictionary">
<em class="property">class </em><tt class="descclassname">gensim.corpora.dictionary.</tt><tt class="descname">Dictionary</tt><big>(</big><em>documents=None</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary" title="Permalink to this definition">¶</a></dt>
<dd><p>Dictionary encapsulates mappings between normalized words and their integer ids.</p>
<p>The main function is <cite>doc2bow</cite>, which converts a collection of words to its
bag-of-words representation, optionally also updating the dictionary mapping
with newly encountered words and their ids.</p>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.addDocuments">
<tt class="descname">addDocuments</tt><big>(</big><em>documents</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.addDocuments" title="Permalink to this definition">¶</a></dt>
<dd><p>Build dictionary from a collection of documents. Each document is a list
of tokens (<strong>tokenized and normalized</strong> utf-8 encoded strings).</p>
<p>This is only a convenience wrapper for calling <cite>doc2bow</cite> on each document
with <cite>allowUpdate=True</cite>.</p>
<div class="highlight-python"><div class="highlight"><pre><span class="gp">>>> </span><span class="k">print</span> <span class="n">Dictionary</span><span class="o">.</span><span class="n">fromDocuments</span><span class="p">([</span><span class="s">"máma mele maso"</span><span class="o">.</span><span class="n">split</span><span class="p">(),</span> <span class="s">"ema má máma"</span><span class="o">.</span><span class="n">split</span><span class="p">()])</span>
<span class="go">Dictionary(5 unique tokens)</span>
</pre></div>
</div>
</dd></dl>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.doc2bow">
<tt class="descname">doc2bow</tt><big>(</big><em>document</em>, <em>allowUpdate=False</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.doc2bow" title="Permalink to this definition">¶</a></dt>
<dd><p>Convert <cite>document</cite> (a list of words) into the bag-of-words format = list of
<cite>(tokenId, tokenCount)</cite> 2-tuples. Each word is assumed to be a
<strong>tokenized and normalized</strong> utf-8 encoded string.</p>
<p>If <cite>allowUpdate</cite> is set, then also update of dictionary in the process: create ids
for new words. At the same time, update document frequencies – for
each word appearing in this document, increase its <cite>self.docFreq</cite> by one.</p>
<p>If <cite>allowUpdate</cite> is <strong>not</strong> set, this function is <cite>const</cite>, i.e. read-only.</p>
</dd></dl>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.filterExtremes">
<tt class="descname">filterExtremes</tt><big>(</big><em>noBelow=5</em>, <em>noAbove=0.5</em>, <em>keepN=None</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.filterExtremes" title="Permalink to this definition">¶</a></dt>
<dd><p>Filter out tokens that appear in</p>
<ol class="arabic simple">
<li>less than <cite>noBelow</cite> documents (absolute number) or</li>
<li>more than <cite>noAbove</cite> documents (fraction of total corpus size, <em>not</em>
absolute number).</li>
<li>after (1) and (2), keep only the first <cite>keepN’ most frequent tokens (or
all if `None</cite>).</li>
</ol>
<p>After the pruning, shrink resulting gaps in word ids.</p>
<p><strong>Note</strong>: Due to the gap shrinking, the same word may have a different
word id before and after the call to this function!</p>
</dd></dl>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.filterTokens">
<tt class="descname">filterTokens</tt><big>(</big><em>badIds=None</em>, <em>goodIds=None</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.filterTokens" title="Permalink to this definition">¶</a></dt>
<dd><p>Remove the selected <cite>badIds</cite> tokens from all dictionary mappings, or, keep
selected <cite>goodIds</cite> in the mapping and remove the rest.</p>
<p><cite>badIds</cite> and <cite>goodIds</cite> are collections of word ids to be removed.</p>
</dd></dl>
<dl class="classmethod">
<dt id="gensim.corpora.dictionary.Dictionary.load">
<em class="property">classmethod </em><tt class="descname">load</tt><big>(</big><em>fname</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.load" title="Permalink to this definition">¶</a></dt>
<dd><p>Load a previously saved object from file (also see <cite>save</cite>).</p>
</dd></dl>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.rebuildDictionary">
<tt class="descname">rebuildDictionary</tt><big>(</big><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.rebuildDictionary" title="Permalink to this definition">¶</a></dt>
<dd><p>Assign new word ids to all words.</p>
<p>This is done to make the ids more compact, e.g. after some tokens have
been removed via <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.filterTokens" title="gensim.corpora.dictionary.Dictionary.filterTokens"><tt class="xref py py-func docutils literal"><span class="pre">filterTokens()</span></tt></a> and there are gaps in the id series.
Calling this method will remove the gaps.</p>
</dd></dl>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.save">
<tt class="descname">save</tt><big>(</big><em>fname</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.save" title="Permalink to this definition">¶</a></dt>
<dd><p>Save the object to file via pickling (also see <cite>load</cite>).</p>
</dd></dl>
</dd></dl>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="related">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="../genindex.html" title="General Index"
>index</a></li>
<li class="right" >
<a href="../py-modindex.html" title="Python Module Index"
>modules</a> |</li>
<li class="right" >
<a href="dmlcorpus.html" title="corpora.dmlcorpus – Corpus in DML-CZ format"
>next</a> |</li>
<li class="right" >
<a href="bleicorpus.html" title="corpora.bleicorpus – Corpus in Blei’s LDA-C format"
>previous</a> |</li>
<li><a href="../index.html">Gensim home</a>| </li>
<li><a href="../apiref.html">API reference</a>| </li>
<li><a href="../tutorial.html">Tutorials</a> »</li>
<li><a href="../apiref.html" >API Reference</a> »</li>
</ul>
</div>
<div class="footer">
© Copyright 2011, Radim Řehůřek <radimrehurek(at)seznam.cz>.
Last updated on Feb 19, 2011.
</div>
</body>
</html>