
Cleanup whitespace

* remove ^M stale characters
* strip trailing whitespace

this commit is generated with:
find . -type f | grep -vF ./.git/ | grep -v \.png$ | xargs sed -i 's/\s\+$//'
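(GNU sed's \s matches any whitespace character, including the carriage return, so this single substitution covers both bullets: it strips trailing spaces/tabs as well as stray ^M at line ends.)

A quick way to verify that a commit like this really is whitespace-only is to re-show it with whitespace ignored; this is just a sketch and assumes a local clone of the gensim repository:

  # show the commit but ignore all whitespace differences;
  # an empty diff body means nothing functional changed
  git show -w a22dca45afccb75475c3778da488eca357574e39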
1 parent 3a81762 · commit a22dca45afccb75475c3778da488eca357574e39 · Dieter Plaetinck committed Feb 28, 2011
Showing with 3,289 additions and 3,291 deletions.
  1. +1 −1 CHANGELOG.txt
  2. +7 −7 MANIFEST.in
  3. +6 −6 README.txt
  4. +0 −1 docs/_sources/apiref.txt
  5. +6 −6 docs/_sources/dist_lda.txt
  6. +22 −22 docs/_sources/dist_lsi.txt
  7. +19 −19 docs/_sources/distributed.txt
  8. +5 −5 docs/_sources/index.txt
  9. +16 −16 docs/_sources/install.txt
  10. +43 −43 docs/_sources/intro.txt
  11. +1 −1 docs/_sources/models/models.txt
  12. +36 −36 docs/_sources/tut1.txt
  13. +48 −48 docs/_sources/tut2.txt
  14. +23 −23 docs/_sources/tut3.txt
  15. +24 −24 docs/_sources/tutorial.txt
  16. +1 −1 docs/_sources/utils.txt
  17. +33 −33 docs/_sources/wiki.txt
  18. +10 −10 docs/apiref.html
  19. +10 −10 docs/corpora/bleicorpus.html
  20. +10 −10 docs/corpora/corpora.html
  21. +22 −22 docs/corpora/dictionary.html
  22. +18 −18 docs/corpora/dmlcorpus.html
  23. +17 −17 docs/corpora/lowcorpus.html
  24. +12 −12 docs/corpora/mmcorpus.html
  25. +15 −15 docs/corpora/svmlightcorpus.html
  26. +17 −17 docs/corpora/wikicorpus.html
  27. +10 −10 docs/dist_lda.html
  28. +10 −10 docs/dist_lsi.html
  29. +10 −10 docs/distributed.html
  30. +7 −7 docs/genindex.html
  31. +9 −9 docs/index.html
  32. +10 −10 docs/install.html
  33. +19 −19 docs/interfaces.html
  34. +10 −10 docs/intro.html
  35. +20 −20 docs/matutils.html
  36. +11 −11 docs/models/lda_dispatcher.html
  37. +10 −10 docs/models/lda_worker.html
  38. +41 −41 docs/models/ldamodel.html
  39. +11 −11 docs/models/lsi_dispatcher.html
  40. +10 −10 docs/models/lsi_worker.html
  41. +45 −45 docs/models/lsimodel.html
  42. +11 −11 docs/models/models.html
  43. +12 −12 docs/models/rpmodel.html
  44. +12 −12 docs/models/tfidfmodel.html
  45. +6 −6 docs/modindex.html
  46. +5 −5 docs/py-modindex.html
  47. +7 −7 docs/search.html
  48. +26 −26 docs/similarities/docsim.html
  49. +4 −4 docs/src/_templates/page.html
  50. +0 −1 docs/src/apiref.rst
  51. +6 −6 docs/src/dist_lda.rst
  52. +22 −22 docs/src/dist_lsi.rst
  53. +19 −19 docs/src/distributed.rst
  54. +5 −5 docs/src/index.rst
  55. +16 −16 docs/src/install.rst
  56. +43 −43 docs/src/intro.rst
  57. +1 −1 docs/src/models/models.rst
  58. +36 −36 docs/src/tut1.rst
  59. +48 −48 docs/src/tut2.rst
  60. +23 −23 docs/src/tut3.rst
  61. +24 −24 docs/src/tutorial.rst
  62. +1 −1 docs/src/utils.rst
  63. +33 −33 docs/src/wiki.rst
  64. +10 −10 docs/tut1.html
  65. +10 −10 docs/tut2.html
  66. +10 −10 docs/tut3.html
  67. +10 −10 docs/tutorial.html
  68. +29 −29 docs/utils.html
  69. +10 −10 docs/wiki.html
  70. +1 −1 ez_setup.py
  71. +10 −10 setup.py
  72. +2 −2 src/gensim/__init__.py
  73. +16 −16 src/gensim/corpora/bleicorpus.py
  74. +44 −44 src/gensim/corpora/dictionary.py
  75. +47 −47 src/gensim/corpora/dmlcorpus.py
  76. +16 −16 src/gensim/corpora/indexedcorpus.py
  77. +34 −34 src/gensim/corpora/lowcorpus.py
  78. +3 −3 src/gensim/corpora/mmcorpus.py
  79. +73 −73 src/gensim/corpora/sources.py
  80. +19 −19 src/gensim/corpora/svmlightcorpus.py
  81. +46 −46 src/gensim/corpora/wikicorpus.py
  82. +9 −9 src/gensim/dmlcz/gensim_build.py
  83. +4 −4 src/gensim/dmlcz/gensim_genmodel.py
  84. +9 −9 src/gensim/dmlcz/gensim_xml.py
  85. +19 −19 src/gensim/dmlcz/geteval.py
  86. +3 −3 src/gensim/dmlcz/runall.sh
  87. +37 −37 src/gensim/interfaces.py
  88. +487 −487 src/gensim/matutils.py
  89. +1 −1 src/gensim/models/__init__.py
  90. +25 −25 src/gensim/models/lda_dispatcher.py
  91. +11 −11 src/gensim/models/lda_worker.py
  92. +116 −116 src/gensim/models/ldamodel.py
  93. +22 −22 src/gensim/models/lsi_dispatcher.py
  94. +10 −10 src/gensim/models/lsi_worker.py
  95. +784 −784 src/gensim/models/lsimodel.py
  96. +101 −101 src/gensim/models/rpmodel.py
  97. +96 −96 src/gensim/models/tfidfmodel.py
  98. +48 −48 src/gensim/similarities/docsim.py
  99. +7 −7 src/gensim/test/test_corpora.py
  100. +29 −29 src/gensim/test/test_models.py
  101. +66 −66 src/gensim/utils.py
2 CHANGELOG.txt
@@ -52,7 +52,7 @@ Changes
0.6.0
* added option for online LSI training (yay!). the transformation can now be
- used after any amount of training, and training can be continued at any time
+ used after any amount of training, and training can be continued at any time
with more data.
* optimized the tf-idf transformation, so that it is a strictly one-pass algorithm in all cases (thx to Brian Merrell).
* fixed Windows-specific bug in handling binary files (thx to Sutee Sudprasert)
14 MANIFEST.in
@@ -1,7 +1,7 @@
-recursive-include docs *
-recursive-include src/gensim/test testcorpus*
-recursive-include src *.sh
-prune docs/src*
-include COPYING
-include COPYING.LESSER
-include ez_setup.py
+recursive-include docs *
+recursive-include src/gensim/test testcorpus*
+recursive-include src *.sh
+prune docs/src*
+include COPYING
+include COPYING.LESSER
+include ez_setup.py
12 README.txt
@@ -4,7 +4,7 @@ gensim -- Python Framework for Topic Modelling
-Gensim is a Python library for *Vector Space Modelling* with very large corpora.
+Gensim is a Python library for *Vector Space Modelling* with very large corpora.
Target audience is the *Natural Language Processing* (NLP) community.
@@ -17,14 +17,14 @@ Features
* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)
-* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis**,
+* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis**,
**Latent Dirichlet Allocation** or **Random Projections**
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `HTML documentation and tutorials <http://nlp.fi.muni.cz/projekty/gensim/>`_.
-If this feature list left you scratching your head, you can first read more about the `Vector
-Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
+If this feature list left you scratching your head, you can first read more about the `Vector
+Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.
Installation
@@ -37,14 +37,14 @@ The simple way to install `gensim` is::
sudo easy_install gensim
-Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
+Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'll need to run::
python setup.py test
sudo python setup.py install
-For alternative modes of installation (without root priviledges, development
+For alternative modes of installation (without root priviledges, development
installation, optional install features), see the `documentation <http://nlp.fi.muni.cz/projekty/gensim/install.html>`_.
This version has been tested under Python 2.5 and 2.6, but should run on any 2.5 <= Python < 3.0.
1 docs/_sources/apiref.txt
@@ -28,4 +28,3 @@ Modules:
models/lda_worker
similarities/docsim
-
12 docs/_sources/dist_lda.txt
@@ -12,25 +12,25 @@ Setting up the cluster
_______________________
See the tutorial on :doc:`dist_lsi`; setting up a cluster for LDA is completely
-analogous, except you want to run `lda_worker` and `lda_dispatcher` scripts instead
+analogous, except you want to run `lda_worker` and `lda_dispatcher` scripts instead
of `lsi_worker` and `lsi_dispatcher`.
Running LDA
____________
-Run LDA like you normally would, but turn on the `distributed=True` constructor
+Run LDA like you normally would, but turn on the `distributed=True` constructor
parameter::
>>> # extract 100 LDA topics, using default parameters
>>> lda = LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)
using distributed version with 4 workers
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3199665 documets, updating model once every 40000 documents
..
-
+
In serial mode (no distribution), creating this online LDA :doc:`model of Wikipedia <wiki>`
-takes 10h56m on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `libVec`).
-In distributed mode with four workers (Linux, Xeons of 2Ghz, 4GB RAM
+takes 10h56m on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `libVec`).
+In distributed mode with four workers (Linux, Xeons of 2Ghz, 4GB RAM
with `ATLAS <http://math-atlas.sourceforge.net/>`_), the wallclock time taken drops to 3h20m.
To run standard batch LDA (no online updates of mini-batches) instead, you would similarly
@@ -73,7 +73,7 @@ and then, some two days later::
topic #17: 0.027*book + 0.021*published + 0.020*books + 0.014*isbn + 0.010*author + 0.010*magazine + 0.009*press + 0.009*novel + 0.009*writers + 0.008*story
topic #18: 0.027*football + 0.024*players + 0.023*cup + 0.019*club + 0.017*fc + 0.017*footballers + 0.017*league + 0.011*season + 0.007*teams + 0.007*goals
topic #19: 0.032*band + 0.024*album + 0.014*albums + 0.013*guitar + 0.013*rock + 0.011*records + 0.011*vocals + 0.009*live + 0.008*bass + 0.008*track
-
+
If you used the distributed LDA implementation in `gensim`, please let me know (my
44 docs/_sources/dist_lsi.txt
@@ -11,7 +11,7 @@ Distributed Latent Semantic Analysis
Setting up the cluster
_______________________
-We will show how to run distributed Latent Semantic Analysis by means of an example.
+We will show how to run distributed Latent Semantic Analysis by means of an example.
Let's say we have 5 computers at our disposal, all in the same broadcast domain.
To start with, install `gensim` and `Pyro` on each one of them with::
@@ -21,41 +21,41 @@ and run Pyro's name server on exactly *one* of the machines (doesn't matter whic
$ python -m Pyro.naming &
-Let's say our example cluster consists of dual-core computers with loads of
-memory. We will therefore run **two** worker scripts on four of the physical machines,
+Let's say our example cluster consists of dual-core computers with loads of
+memory. We will therefore run **two** worker scripts on four of the physical machines,
creating **eight** logical worker nodes::
$ python -m gensim.models.lsi_worker &
This will execute `gensim`'s `lsi_worker.py` script (to be run twice on each of the
four computer).
-This lets `gensim` know that it can run two jobs on each of the four computers in
-parallel, so that the computation will be done faster, while also taking up twice
+This lets `gensim` know that it can run two jobs on each of the four computers in
+parallel, so that the computation will be done faster, while also taking up twice
as much memory on each machine.
-Next, pick one computer that will be a job scheduler in charge of worker
-synchronization, and on it, run `LSA dispatcher`. In our example, we will use the
+Next, pick one computer that will be a job scheduler in charge of worker
+synchronization, and on it, run `LSA dispatcher`. In our example, we will use the
fifth computer to act as the dispatcher and from there run::
$ python -m gensim.models.lsi_dispatcher &
-In general, the dispatcher can be run on the same machine as one of the worker nodes, or it
+In general, the dispatcher can be run on the same machine as one of the worker nodes, or it
can be another, distinct computer within the same broadcast domain. The dispatcher
won't be doing much with CPU most of the time, but pick a computer with ample memory.
And that's it! The cluster is set up and running, ready to accept jobs. To remove
a worker later on, simply terminate its `lsi_worker` process. To add another worker, run another
`lsi_worker` (this will not affect a computation that is already running). If you terminate
-`lsi_dispatcher`, you won't be able to run computations until you run it again
+`lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving workers can be re-used though).
Running LSA
____________
-So let's test our setup and run one computation of distributed LSA. Open a Python
+So let's test our setup and run one computation of distributed LSA. Open a Python
shell on one of the five machines (again, this can be done on any computer
-in the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_,
+in the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_,
our choice is incidental) and try::
>>> from gensim import corpora, models, utils
@@ -81,13 +81,13 @@ To check the LSA results, let's print the first two latent topics::
topic #1(2.542): -0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system"
Success! But a corpus of nine documents is no challenge for our powerful cluster...
-In fact, we had to lower the job size (`chunks` parameter above) to a single document
+In fact, we had to lower the job size (`chunks` parameter above) to a single document
at a time, otherwise all documents would be processed by a single worker all at once.
So let's run LSA on **one million documents** instead::
>>> # inflate the corpus to 1M documents, by repeating its documents over&over
- >>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
+ >>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on 1 million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, numTopics=200, chunks=10000, distributed=True)
@@ -115,12 +115,12 @@ Latent Semantic Analysis on the English Wikipedia.
Distributed LSA on Wikipedia
++++++++++++++++++++++++++++++
-First, download and prepare the Wikipedia corpus as per :doc:`wiki`, then load
+First, download and prepare the Wikipedia corpus as per :doc:`wiki`, then load
the corpus iterator with::
-
+
>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
-
+
>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.wikicorpus.WikiCorpus.loadDictionary('wiki_en_wordids.txt')
>>> # load corpus iterator
@@ -134,7 +134,7 @@ Now we're ready to run distributed LSA on the English Wikipedia::
>>> # extract 400 LSI topics, using a cluster of nodes
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400, chunks=20000, distributed=True)
-
+
>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
@@ -148,10 +148,10 @@ Now we're ready to run distributed LSA on the English Wikipedia::
2010-11-03 16:08:27,807 : INFO : topic #8(78.981): 0.588*"film" + 0.460*"films" + -0.130*"album" + -0.127*"station" + 0.121*"television" + 0.115*"poster" + 0.112*"directed" + 0.110*"actors" + -0.096*"railway" + 0.086*"movie"
2010-11-03 16:08:27,834 : INFO : topic #9(78.620): 0.502*"kategori" + 0.282*"categoria" + 0.248*"kategorija" + 0.234*"kategorie" + 0.172*"категория" + 0.165*"categoría" + 0.161*"kategoria" + 0.148*"categorie" + 0.126*"kategória" + 0.121*"catégorie"
-In serial mode, creating the LSI model of Wikipedia with this **one-pass algorithm**
-takes about 5.25h on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `libVec`).
-In distributed mode with four workers (Linux, dual-core Xeons of 2Ghz, 4GB RAM
-with `ATLAS`), the wallclock time taken drops to 1 hour and 41 minutes. You can
-read more about various internal settings and experiments in my `research
+In serial mode, creating the LSI model of Wikipedia with this **one-pass algorithm**
+takes about 5.25h on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `libVec`).
+In distributed mode with four workers (Linux, dual-core Xeons of 2Ghz, 4GB RAM
+with `ATLAS`), the wallclock time taken drops to 1 hour and 41 minutes. You can
+read more about various internal settings and experiments in my `research
paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.
38 docs/_sources/distributed.txt
@@ -7,8 +7,8 @@ Why distributed computing?
---------------------------
Need to build semantic representation of a corpus that is millions of documents large and it's
-taking forever? Have several idle machines at your disposal that you could use?
-`Distributed computing <http://en.wikipedia.org/wiki/Distributed_computing>`_ tries
+taking forever? Have several idle machines at your disposal that you could use?
+`Distributed computing <http://en.wikipedia.org/wiki/Distributed_computing>`_ tries
to accelerate computations by splitting a given task into several smaller subtasks,
passing them on to several computing nodes in parallel.
@@ -22,15 +22,15 @@ much communication going on), so the network is allowed to be of relatively high
most of the time consuming stuff is done inside low-level routines for linear algebra, inside
NumPy, independent of any `gensim` code.
**Installing a fast** `BLAS (Basic Linear Algebra) <http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms>`_ **library
- for NumPy can improve performance up to 15 times!** So before you start buying those extra computers,
- consider installing a fast, threaded BLAS that is optimized for your particular machine
+ for NumPy can improve performance up to 15 times!** So before you start buying those extra computers,
+ consider installing a fast, threaded BLAS that is optimized for your particular machine
(as opposed to a generic, binary-distributed library).
- Options include your vendor's BLAS library (Intel's MKL,
+ Options include your vendor's BLAS library (Intel's MKL,
AMD's ACML, OS X's vecLib, Sun's Sunperf, ...) or some open-source alternative (GotoBLAS, ALTAS).
To see what BLAS and LAPACK you are using, type into your shell::
-
- python -c 'import scipy; scipy.show_config()'
+
+ python -c 'import scipy; scipy.show_config()'
Prerequisites
-----------------
@@ -61,33 +61,33 @@ inside `gensim` will automatically try to look for and enslave all available wor
If at least one worker is found, things will run in the distributed mode; if not, in serial node.
.. glossary::
-
+
Node
- A logical working unit. Can correspond to a single physical machine, but you
+ A logical working unit. Can correspond to a single physical machine, but you
can also run multiple workers on one machine, resulting in multiple
logical nodes.
-
+
Cluster
- Several nodes which communicate over TCP/IP. Currently, network broadcasting
- is used to discover and connect all communicating nodes, so the nodes must lie
+ Several nodes which communicate over TCP/IP. Currently, network broadcasting
+ is used to discover and connect all communicating nodes, so the nodes must lie
within the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_.
-
+
Worker
- A process which is created on each node. To remove a node from your cluster,
- simply kill its worker process.
-
+ A process which is created on each node. To remove a node from your cluster,
+ simply kill its worker process.
+
Dispatcher
- The dispatcher will be in charge of negotiating all computations, queueing and
+ The dispatcher will be in charge of negotiating all computations, queueing and
distributing ("dispatching") individual jobs to the workers. Computations never
"talk" to worker nodes directly, only through this dispatcher. Unlike workers,
there can only be one active dispatcher at a time in the cluster.
-
+
Available distributed algorithms
---------------------------------
.. toctree::
:maxdepth: 1
-
+
dist_lsi
dist_lda
10 docs/_sources/index.txt
@@ -9,10 +9,10 @@ Gensim -- Python Framework for Vector Space Modelling
.. admonition:: What's new in version |version|?
* faster and leaner **Latent Semantic Indexing (LSI)** and **Latent Dirichlet Allocation (LDA)**:
-
+
* :doc:`Processing the English Wikipedia <wiki>`, 3.2 million documents (`NIPS workshop paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_)
* :doc:`dist_lsi` & :doc:`dist_lda`
-
+
* Input corpus iterators can come from a compressed file (**bzip2**, **gzip**, ...), to save disk space when dealing with
very large corpora.
* `gensim` code now resides on `github <https://github.com/piskvorky/gensim/>`_.
@@ -23,7 +23,7 @@ For **installation** and **troubleshooting**, see the :doc:`installation <instal
For **examples** on how to use it, try the :doc:`tutorials <tutorial>`.
-When **citing** `gensim` in academic papers, please use
+When **citing** `gensim` in academic papers, please use
`this BibTeX entry <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_.
@@ -40,7 +40,7 @@ Quick Reference Example
>>>
>>> # convert another corpus to the latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[another_corpus])
->>>
+>>>
>>> # perform similarity query of a query in LSI space against the whole corpus
>>> sims = index[query]
@@ -49,7 +49,7 @@ Quick Reference Example
.. toctree::
:hidden:
:maxdepth: 1
-
+
intro
install
tutorial
32 docs/_sources/install.txt
@@ -1,10 +1,10 @@
-.. _install:
+.. _install:
=============
Installation
=============
-Gensim is known to run on Linux, Windows and Mac OS X and should run on any other
+Gensim is known to run on Linux, Windows and Mac OS X and should run on any other
platform that supports Python 2.5 and NumPy. Gensim depends on the following software:
* 3.0 > `Python <http://www.python.org>`_ >= 2.5. Tested with versions 2.5 and 2.6.
@@ -26,7 +26,7 @@ You can download Python 2.5 from http://python.org/download.
Install SciPy & NumPy
----------------------
-These are quite popular Python packages, so chances are there are pre-built binary
+These are quite popular Python packages, so chances are there are pre-built binary
distributions available for your platform. You can try installing from source using easy_install::
sudo easy_install numpy
@@ -46,35 +46,35 @@ That's it! Congratulations, you can proceed to the :doc:`tutorials <tutorial>`.
-----
-If you also want to run the algorithms over a cluster
+If you also want to run the algorithms over a cluster
of computers, in :doc:`distributed`, you should install with::
sudo easy_install gensim[distributed]
-The optional `distributed` feature installs `Pyro (PYthon Remote Objects) <http://pypi.python.org/pypi/Pyro>`_.
+The optional `distributed` feature installs `Pyro (PYthon Remote Objects) <http://pypi.python.org/pypi/Pyro>`_.
If you don't know what distributed computing means, you can ignore it:
`gensim` will work fine for you anyway.
This optional extension can also be installed separately later with::
-
+
sudo easy_install Pyro
-----
There are also alternative routes to install:
-
+
1. If you have downloaded and unzipped the `tar.gz source <http://pypi.python.org/pypi/gensim>`_
- for `gensim` (or you're installing `gensim` from `github <https://github.com/piskvorky/gensim/>`_),
+ for `gensim` (or you're installing `gensim` from `github <https://github.com/piskvorky/gensim/>`_),
you can run::
-
- sudo python setup.py install
-
+
+ sudo python setup.py install
+
to install `gensim` into your ``site-packages`` folder.
-2. If you wish to make local changes to the `gensim` code (`gensim` is, after all, a
- package which targets research prototyping and modifications), a preferred
+2. If you wish to make local changes to the `gensim` code (`gensim` is, after all, a
+ package which targets research prototyping and modifications), a preferred
way may be installing with::
-
+
sudo python setup.py develop
-
+
This will only place a symlink into your ``site-packages`` directory. The actual
files will stay wherever you unpacked them.
3. If you don't have root priviledges (or just don't want to put the package into
@@ -95,5 +95,5 @@ Contact
--------
Use the `gensim discussion group <http://groups.google.com/group/gensim/>`_ for
-any questions and troubleshooting. For private enquiries, you can also send
+any questions and troubleshooting. For private enquiries, you can also send
me an email to the address at the bottom of this page.
86 docs/_sources/intro.txt
@@ -1,7 +1,7 @@
.. _intro:
============
-Introduction
+Introduction
============
Gensim is a Python framework designed to automatically extract semantic
@@ -11,40 +11,40 @@ topics from documents, as naturally and painlessly as possible.
Gensim aims at processing raw, unstructured digital texts ("*plain text*").
The unsupervised algorithms in `gensim`, such as **Latent Semantic Analysis**, **Latent Dirichlet Allocation** or **Random Projections**,
discover hidden (*latent*) semantic structure, based on word co-occurrence patterns within a corpus of training documents.
-Once these statistical patterns are found, any plain text documents can be succinctly
-expressed in the new, semantic representation, and queried for topical similarity
+Once these statistical patterns are found, any plain text documents can be succinctly
+expressed in the new, semantic representation, and queried for topical similarity
against other documents and so on.
-If the previous paragraphs left you confused, you can read more about the `Vector
-Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
+If the previous paragraphs left you confused, you can read more about the `Vector
+Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.
.. _design:
Design objectives
------------------
-
+
`gensim` offers the following features:
-* **Memory independence** -- there is no need for the whole training corpus to
+* **Memory independence** -- there is no need for the whole training corpus to
reside fully in RAM at any one time (can process large, web-scale corpora).
-* Efficient implementations for several popular vector space algorithms,
- including **Tf-Idf**, distributed incremental **Latent Semantic Analysis**,
+* Efficient implementations for several popular vector space algorithms,
+ including **Tf-Idf**, distributed incremental **Latent Semantic Analysis**,
distributed incremental **Latent Dirichlet Allocation (LDA)** or **Random Projection**; adding new ones is easy (really!).
* I/O wrappers and converters around **several popular data formats**.
* **Similarity queries** for documents in their latent, topical representation.
-
-Creation of `gensim` was motivated by a perceived lack of available, scalable software
+
+Creation of `gensim` was motivated by a perceived lack of available, scalable software
frameworks that realize topic modelling, and/or their overwhelming internal complexity (hail java!).
You can read more about the motivation in our `LREC 2010 workshop paper <http://nlp.fi.muni.cz/projekty/gensim/lrec2010_final.pdf>`_.
If you want to cite `gensim` in your own work, please refer to that article (`BibTeX <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_).
The **principal design objectives** behind `gensim` are:
1. Straightforward interfaces and low API learning curve for developers. Good for prototyping.
-2. Memory independence with respect to the size of the input corpus; all intermediate
- steps and algorithms operate in a streaming fashion, accessing one document
+2. Memory independence with respect to the size of the input corpus; all intermediate
+ steps and algorithms operate in a streaming fashion, accessing one document
at a time.
@@ -53,76 +53,76 @@ Availability
.. seealso::
- See the :doc:`install <install>` page for more info on `gensim` deployment.
+ See the :doc:`install <install>` page for more info on `gensim` deployment.
-Gensim is licensed under the OSI-approved `GNU LPGL license <http://www.gnu.org/licenses/lgpl.html>`_
+Gensim is licensed under the OSI-approved `GNU LPGL license <http://www.gnu.org/licenses/lgpl.html>`_
and can be downloaded either from its `github repository <https://github.com/piskvorky/gensim/>`_
-or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.
+or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.
-Core concepts
+Core concepts
-------------
-The whole gensim package revolves around the concepts of :term:`corpus`, :term:`vector` and
+The whole gensim package revolves around the concepts of :term:`corpus`, :term:`vector` and
:term:`model`.
.. glossary::
-
+
Corpus
- A collection of digital documents. This collection is used to automatically
+ A collection of digital documents. This collection is used to automatically
infer structure of the documents, their topics etc. For
- this reason, the collection is also called a *training corpus*. The inferred
- latent structure can be later used to assign topics to new documents, which did
+ this reason, the collection is also called a *training corpus*. The inferred
+ latent structure can be later used to assign topics to new documents, which did
not appear in the training corpus.
- No human intervention (such as tagging the documents by hand, or creating
+ No human intervention (such as tagging the documents by hand, or creating
other metadata) is required.
-
+
Vector
- In the Vector Space Model (VSM), each document is represented by an
- array of features. For example, a single feature may be thought of as a
+ In the Vector Space Model (VSM), each document is represented by an
+ array of features. For example, a single feature may be thought of as a
question-answer pair:
-
+
1. How many times does the word *splonge* appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.
-
+
The question is usually represented only by its integer id, so that the
representation of a document becomes a series of pairs like ``(1, 0.0), (2, 2.0), (3, 5.0)``.
- If we know all the questions in advance, we may leave them implicit
+ If we know all the questions in advance, we may leave them implicit
and simply write ``(0.0, 2.0, 5.0)``.
This sequence of answers can be thought of as a high-dimensional (in this case 3-dimensional)
*vector*. For practical purposes, only questions to which the answer is (or
- can be converted to) a single real number are allowed.
-
- The questions are the same for each document, so that looking at two
+ can be converted to) a single real number are allowed.
+
+ The questions are the same for each document, so that looking at two
vectors (representing two documents), we will hopefully be able to make
- conclusions such as "The numbers in these two vectors are very similar, and
- therefore the original documents must be similar, too". Of course, whether
+ conclusions such as "The numbers in these two vectors are very similar, and
+ therefore the original documents must be similar, too". Of course, whether
such conclusions correspond to reality depends on how well we picked our questions.
-
+
Sparse vector
Typically, the answer to most questions will be ``0.0``. To save space,
- we omit them from the document's representation, and write only ``(2, 2.0),
+ we omit them from the document's representation, and write only ``(2, 2.0),
(3, 5.0)`` (note the missing ``(1, 0.0)``).
Since the set of all questions is known in advance, all the missing features
in a sparse representation of a document can be unambiguously resolved to zero, ``0.0``.
-
+
Gensim is specific in that it doesn't prescribe any specific corpus format;
a corpus is anything that, when iterated over, successively yields these sparse vectors.
- For example, `set([(2, 2.0), (3, 5.0)], ([0, -1.0], [3, -1.0]))` is a trivial
+ For example, `set([(2, 2.0), (3, 5.0)], ([0, -1.0], [3, -1.0]))` is a trivial
corpus of two documents, each with two non-zero `feature-answer` pairs.
-
-
-
+
+
+
Model
For our purposes, a model is a transformation from one document representation
- to another (or, in other words, from one vector space to another).
+ to another (or, in other words, from one vector space to another).
Both the initial and target representations are
still vectors -- they only differ in what the questions and answers are.
The transformation is automatically learned from the traning :term:`corpus`, without human
supervision, and in hopes that the final document representation will be more compact
and more useful: with similar documents having similar representations.
-
+
.. seealso::
For some examples on how this works out in code, go to :doc:`tutorials <tutorial>`.
2 docs/_sources/models/models.txt
@@ -1,4 +1,4 @@
-:mod:`models` -- Package for transformation models
+:mod:`models` -- Package for transformation models
======================================================
.. automodule:: gensim.models
72 docs/_sources/tut1.txt
@@ -33,12 +33,12 @@ This time, let's start from documents represented as strings:
This is a tiny corpus of nine documents, each consisting of only a single sentence.
-First, let's tokenize the documents, remove common words (using a toy stoplist)
+First, let's tokenize the documents, remove common words (using a toy stoplist)
as well as words that only appear once in the corpus:
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
->>> texts = [[word for word in document.lower().split() if word not in stoplist]
+>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>> for document in documents]
>>>
>>> # remove words that appear only once
@@ -48,36 +48,36 @@ as well as words that only appear once in the corpus:
>>> for text in texts]
>>>
>>> print texts
-[['human', 'interface', 'computer'],
- ['survey', 'user', 'computer', 'system', 'response', 'time'],
- ['eps', 'user', 'interface', 'system'],
- ['system', 'human', 'system', 'eps'],
- ['user', 'response', 'time'],
- ['trees'],
- ['graph', 'trees'],
- ['graph', 'minors', 'trees'],
+[['human', 'interface', 'computer'],
+ ['survey', 'user', 'computer', 'system', 'response', 'time'],
+ ['eps', 'user', 'interface', 'system'],
+ ['system', 'human', 'system', 'eps'],
+ ['user', 'response', 'time'],
+ ['trees'],
+ ['graph', 'trees'],
+ ['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
Your way of processing the documents will likely vary; here, I only split on whitespace
-to tokenize, followed by lowercasing each word. In fact, I use this particular
-(simplistic and inefficient) setup to mimick the experiment done in Deerwester et al.'s
+to tokenize, followed by lowercasing each word. In fact, I use this particular
+(simplistic and inefficient) setup to mimick the experiment done in Deerwester et al.'s
original LSA article [1]_.
The ways to process documents are so varied and application- and language-dependent that I
decided to *not* constrain them by any interface. Instead, a document is represented
by the features extracted from it, not by its "surface" string form: how you get to
-the features is up to you. Below I describe one common, general-purpose approach (called
-:dfn:`bag-of-words`), but keep in mind that different application domains call for
+the features is up to you. Below I describe one common, general-purpose approach (called
+:dfn:`bag-of-words`), but keep in mind that different application domains call for
different features, and, as always, it's `garbage in, garbage out <http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out>`_...
-To convert documents to vectors, we'll use a document representation called
-`bag-of-words <http://en.wikipedia.org/wiki/Bag_of_words>`_. In this representation,
+To convert documents to vectors, we'll use a document representation called
+`bag-of-words <http://en.wikipedia.org/wiki/Bag_of_words>`_. In this representation,
each document is represented by one vector where each vector element represents
a question-answer pair, in the style of:
"How many times does the word `system` appear in the document? Once."
-It is advantageous to represent the questions only by their (integer) ids. The mapping
+It is advantageous to represent the questions only by their (integer) ids. The mapping
between the questions and ids is called a dictionary:
>>> dictionary = corpora.Dictionary(texts)
@@ -86,13 +86,13 @@ between the questions and ids is called a dictionary:
Dictionary(12 unique tokens)
Here we assigned a unique integer id to all words appearing in the corpus with the
-:class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts
-and relevant statistics. In the end, we see there are twelve distinct words in the
-processed corpus, which means each document will be represented by twelve numbers (ie., by a 12-D vector).
+:class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts
+and relevant statistics. In the end, we see there are twelve distinct words in the
+processed corpus, which means each document will be represented by twelve numbers (ie., by a 12-D vector).
To see the mapping between words and their ids:
>>> print dictionary.token2id
-{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
+{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
To actually convert tokenized documents to vectors:
@@ -102,7 +102,7 @@ To actually convert tokenized documents to vectors:
>>> print newVec # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]
-The function :func:`doc2bow` simply counts the number of occurences of
+The function :func:`doc2bow` simply counts the number of occurences of
each distinct word, converts the word to its integer word id
and returns the result as a sparse vector. The sparse vector ``[(0, 1), (1, 1)]``
therefore reads: in the document `"Human computer interaction"`, the words `computer`
@@ -121,16 +121,16 @@ therefore reads: in the document `"Human computer interaction"`, the words `comp
[(9, 1.0), (10, 1.0), (11, 1.0)],
[(8, 1.0), (10, 1.0), (11, 1.0)]]
-By now it should be clear that the vector feature with ``id=10`` stands for the question "How many
-times does the word `graph` appear in the document?" and that the answer is "zero" for
-the first six documents and "one" for the remaining three. As a matter of fact,
+By now it should be clear that the vector feature with ``id=10`` stands for the question "How many
+times does the word `graph` appear in the document?" and that the answer is "zero" for
+the first six documents and "one" for the remaining three. As a matter of fact,
we have arrived at exactly the same corpus of vectors as in the :ref:`first-example`.
-And that is all there is to it! At least as far as bag-of-words representation is concerned.
+And that is all there is to it! At least as far as bag-of-words representation is concerned.
Of course, what we do with such corpus is another question; it is not at all clear
-how counting the frequency of distinct words could be useful. As it turns out, it isn't, and
+how counting the frequency of distinct words could be useful. As it turns out, it isn't, and
we will need to apply a transformation on this simple representation first, before
-we can use it to compute any meaningful document vs. document similarities.
+we can use it to compute any meaningful document vs. document similarities.
Transformations are covered in the :doc:`next tutorial <tut2>`, but before that, let's
briefly turn our attention to *corpus persistency*.
@@ -142,7 +142,7 @@ Corpus Formats
There exist several file formats for storing a Vector Space corpus (~sequence of vectors) to disk.
`Gensim` implements them via the *streaming corpus interface* mentioned earlier:
-documents are read from (resp. stored to) disk in a lazy fashion, one document at
+documents are read from (resp. stored to) disk in a lazy fashion, one document at
a time, without the whole corpus being read into main memory at once.
One of the more notable file formats is the `Market Matrix format <http://math.nist.gov/MatrixMarket/formats.html>`_.
@@ -154,9 +154,9 @@ To save a corpus in the Matrix Market format:
>>>
>>> corpora.MmCorpus.saveCorpus('/tmp/corpus.mm', corpus)
-Other formats include `Joachim's SVMlight format <http://svmlight.joachims.org/>`_,
-`Blei's LDA-C format <http://www.cs.princeton.edu/~blei/lda-c/>`_ and
-`GibbsLDA++ format <http://gibbslda.sourceforge.net/>`_.
+Other formats include `Joachim's SVMlight format <http://svmlight.joachims.org/>`_,
+`Blei's LDA-C format <http://www.cs.princeton.edu/~blei/lda-c/>`_ and
+`GibbsLDA++ format <http://gibbslda.sourceforge.net/>`_.
>>> corpora.SvmLightCorpus.saveCorpus('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.saveCorpus('/tmp/corpus.lda-c', corpus)
@@ -193,18 +193,18 @@ To save the same Matrix Market document stream in Blei's LDA-C format,
>>> corpora.BleiCorpus.saveCorpus('/tmp/corpus.lda-c', corpus)
-In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:
+In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:
just load a document stream using one format and immediately save it in another format.
-Adding new formats is dead easy, check out the `code for the SVMlight corpus
+Adding new formats is dead easy, check out the `code for the SVMlight corpus
<http://my-trac.assembla.com/gensim/browser/trunk/src/gensim/corpora/svmlightcorpus.py>`_ for an example.
-------------
-For a complete reference (Want to prune the dictionary to a smaller size?
+For a complete reference (Want to prune the dictionary to a smaller size?
Convert between corpora and NumPy/SciPy arrays?), see the :doc:`API documentation <apiref>`.
Or continue to the next tutorial on :doc:`tut2`.
-.. [1] This is the same corpus as used in
+.. [1] This is the same corpus as used in
`Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.
96 docs/_sources/tut2.txt
@@ -27,41 +27,41 @@ In this tutorial, I will show how to transform documents from one vector represe
into another. This process serves two goals:
1. To bring out hidden structure in the corpus, discover relationships between
- words and use them to describe the documents in a new and
+ words and use them to describe the documents in a new and
(hopefully) more realistic way.
2. To make the document representation more compact. This both improves efficiency
- (new representation consumes less resources) and efficacy (marginal data
- trends are ignored, noise-reduction).
+ (new representation consumes less resources) and efficacy (marginal data
+ trends are ignored, noise-reduction).
Creating a transformation
++++++++++++++++++++++++++
-The transformations are standard Python objects, typically initialized by means of
+The transformations are standard Python objects, typically initialized by means of
a :dfn:`training corpus`:
>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
We used our old corpus to initialize (train) the transformation model. Different
-transformations may require different initialization parameters; in case of TfIdf, the
+transformations may require different initialization parameters; in case of TfIdf, the
"training" consists simply of going through the supplied corpus once and computing document frequencies
of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet
Allocation, is much more involved and, consequently, takes much more time.
.. note::
- Transformations always convert between two specific vector
- spaces. The same vector space (= the same set of feature ids) must be used for training
- as well as for subsequent vector transformations. Failure to use the same input
- feature space, such as applying a different string preprocessing, using different
- feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will
- result in feature mismatch during transformation calls and consequently in either
+ Transformations always convert between two specific vector
+ spaces. The same vector space (= the same set of feature ids) must be used for training
+ as well as for subsequent vector transformations. Failure to use the same input
+ feature space, such as applying a different string preprocessing, using different
+ feature ids, or using bag-of-words input vectors where TfIdf vectors are expected, will
+ result in feature mismatch during transformation calls and consequently in either
garbage output and/or runtime exceptions.
Transforming vectors
+++++++++++++++++++++
-From now on, ``tfidf`` is treated as a read-only object that can be used to convert
+From now on, ``tfidf`` is treated as a read-only object that can be used to convert
any vector from the old representation (bag-of-words integer counts) to the new representation
(TfIdf real-valued weights):
@@ -75,18 +75,18 @@ Or to apply a transformation to a whole corpus:
>>> for doc in corpus_tfidf:
>>> print doc
-In this particular case, we are transforming the same corpus that we used
+In this particular case, we are transforming the same corpus that we used
for training, but this is only incidental. Once the transformation model has been initialized,
it can be used on any vectors (provided they come from the same vector space, of course),
even if they were not used in the training corpus at all. This is achieved by a process called
folding-in for LSA, by topic inference for LDA etc.
.. note::
Calling ``model[corpus]`` only creates a wrapper around the old ``corpus``
- document stream -- actual conversions are done on-the-fly, during document iteration.
- We cannot convert the entire corpus at the time of calling ``corpus_transformed = model[corpus]``,
+ document stream -- actual conversions are done on-the-fly, during document iteration.
+ We cannot convert the entire corpus at the time of calling ``corpus_transformed = model[corpus]``,
because that would mean storing the result in main memory, and that contradicts gensim's objective of memory-indepedence.
- If you will be iterating over the transformed ``corpus_transformed`` multiple times, and the
+ If you will be iterating over the transformed ``corpus_transformed`` multiple times, and the
transformation is costly, :ref:`serialize the resulting corpus to disk first <corpus-formats>` and continue
using that.
@@ -96,17 +96,17 @@ Transformations can also be serialized, one on top of another, in a sort of chai
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
-into a latent 2-D space (2-D because we set ``numTopics=2``). Now you're probably wondering: what do these two latent
+into a latent 2-D space (2-D because we set ``numTopics=2``). Now you're probably wondering: what do these two latent
dimensions stand for? Let's inspect with :func:`models.LsiModel.printTopics`:
>>> lsi.printTopics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
-It appears that according to LSI, "trees", "graph" and "minors" are all related
-words (and contribute the most to the direction of the first topic), while the
-second topic practically concerns itself with all the other words. As expected,
-the first five documents are more strongly related to the second topic while the
+It appears that according to LSI, "trees", "graph" and "minors" are all related
+words (and contribute the most to the direction of the first topic), while the
+second topic practically concerns itself with all the other words. As expected,
+the first five documents are more strongly related to the second topic while the
remaining four documents to the first topic:
>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
@@ -141,76 +141,76 @@ Available transformations
Gensim implements several popular Vector Space Model algorithms:
* `Term Frequency * Inverse Document Frequency, Tf-Idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_
- expects a bag-of-words (integer values) training corpus during initialization.
- During transformation, it will take a vector and return another vector of the
- same dimensionality, except that features which were rare in the training corpus
+ expects a bag-of-words (integer values) training corpus during initialization.
+ During transformation, it will take a vector and return another vector of the
+ same dimensionality, except that features which were rare in the training corpus
will have their value increased.
- It therefore converts integer-valued vectors into real-valued ones, while leaving
+ It therefore converts integer-valued vectors into real-valued ones, while leaving
the number of dimensions intact. It can also optionally normalize the resulting
vectors to (Euclidean) unit length.
>>> model = tfidfmodel.TfidfModel(bow_corpus, normalize=True)
* `Latent Semantic Indexing, LSI (or sometimes LSA) <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
transforms documents from either bag-of-words or (preferrably) TfIdf-weighted space into
- a latent space of a lower dimensionality. For the toy corpus above we used only
+ a latent space of a lower dimensionality. For the toy corpus above we used only
2 latent dimensions, but on real corpora, target dimensionality of 200--500 is recommended
as a "golden standard" [1]_.
-
+
>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary.id2word, numTopics=300)
- LSI training is unique in that we can continue "training" at any point, simply
- by providing more training documents. This is done by incremental updates to
+ LSI training is unique in that we can continue "training" at any point, simply
+ by providing more training documents. This is done by incremental updates to
the underlying model, in a process called `online training`. Because of this feature, the
input document stream may even be infinite -- just keep feeding LSI new documents
as they arrive, while using the computed transformation model as read-only in the meanwhile!
-
+
>>> model.addDocuments(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
- >>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
+ >>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
>>> ...
>>> model.addDocuments(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> lsi_vec = model[tfidf_vec]
>>> ...
-
+
See the :mod:`gensim.models.lsimodel` documentation for details on how to make
LSI gradually "forget" old observations in infinite streams and how to tweak parameters
affecting speed vs. memory footprint vs. numerical precision of the algorithm.
-
- `gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
- which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
- from Halko et al. [4]_ internally, to accelerate in-core part
+
+ `gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
+ which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
+ from Halko et al. [4]_ internally, to accelerate in-core part
of the computations.
- See also :doc:`wiki` for further speed-ups by distributing the computation across
+ See also :doc:`wiki` for further speed-ups by distributing the computation across
a cluster of computers.
* `Random Projections, RP <http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf>`_ aim to
reduce vector space dimensionality. This is a very efficient (both memory- and
- CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
+ CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.
>>> model = rpmodel.RpModel(tfidf_corpus, numTopics=500)
* `Latent Dirichlet Allocation, LDA <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>`_
- is yet another transformation from bag-of-words counts into a topic space of lower
- dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA),
+ is yet another transformation from bag-of-words counts into a topic space of lower
+ dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA),
so LDA's topics can be interpreted as probability distributions over words. These distributions are,
just like with LSA, inferred automatically from a training corpus. Documents
are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
-
+
>>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary.id2word, numTopics=100)
-
- `gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
+
+ `gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
modified to run in :doc:`distributed mode <distributed>` on a cluster of computers.
Adding new :abbr:`VSM (Vector Space Model)` transformations (such as different weighting schemes) is rather trivial;
see the :doc:`API reference <apiref>` or directly the Python code for more info and examples.
-It is worth repeating that these are all unique, **incremental** implementations,
+It is worth repeating that these are all unique, **incremental** implementations,
which do not require the whole training corpus to be present in main memory all at once.
-With memory taken care of, I am now improving :doc:`distributed`,
-to improve CPU efficiency, too.
-If you feel you could contribute (by testing, providing use-cases or even, gasp!, code),
-please `let me know <mailto:radimrehurek@seznam.cz>`_.
+With memory taken care of, I am now improving :doc:`distributed`,
+to improve CPU efficiency, too.
+If you feel you could contribute (by testing, providing use-cases or even, gasp!, code),
+please `let me know <mailto:radimrehurek@seznam.cz>`_.
------
46 docs/_sources/tut3.txt
@@ -14,15 +14,15 @@ if you want to see logging events.
Similarity interface
--------------------------
-In the previous tutorials on :doc:`tut1` and :doc:`tut2`, we covered what it means
+In the previous tutorials on :doc:`tut1` and :doc:`tut2`, we covered what it means
to create a corpus in the Vector Space Model and how to transform it between different
-vector spaces. A common reason for such a charade is that we want to determine
+vector spaces. A common reason for such a charade is that we want to determine
**similarity between pairs of documents**, or the **similarity between a specific document
and a set of other documents** (such as a user query vs. indexed documents).
-To show how this can be done in gensim, let us consider the same corpus as in the
-previous examples (which really originally comes from Deerwester et al.'s
-`"Indexing by Latent Semantic Analysis" <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_
+To show how this can be done in gensim, let us consider the same corpus as in the
+previous examples (which really originally comes from Deerwester et al.'s
+`"Indexing by Latent Semantic Analysis" <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_
seminal 1990 article):
>>> from gensim import corpora, models, similarities
@@ -31,14 +31,14 @@ seminal 1990 article):
>>> print corpus
MmCorpus(9 documents, 12 features, 28 non-zero entries)
-To follow Deerwester's example, we first use this tiny corpus to define a 2-dimensional
+To follow Deerwester's example, we first use this tiny corpus to define a 2-dimensional
LSI space:
>>> lsi = models.LsiModel(corpus, id2word=dictionary.id2word, numTopics=2)
-
-Now suppose a user typed in the query `"Human computer interaction"`. We would
-like to sort our nine corpus documents in decreasing order of relevance to this query.
-Unlike modern search engines, here we only concentrate on a single aspect of possible
+
+Now suppose a user typed in the query `"Human computer interaction"`. We would
+like to sort our nine corpus documents in decreasing order of relevance to this query.
+Unlike modern search engines, here we only concentrate on a single aspect of possible
similarities---on apparent semantic relatedness of their texts (words). No hyperlinks,
no random-walk static ranks, just a semantic extension over the boolean keyword match:
@@ -49,16 +49,16 @@ no random-walk static ranks, just a semantic extension over the boolean keyword
[(0, -0.461821), (1, 0.070028)]
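(The conversion of the query itself falls outside this hunk; a sketch of how it might be done, reusing the ``dictionary`` and ``lsi`` objects built above, would be:)

>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow]   # convert the query to LSI space
>>> print vec_lsi

which yields the 2-dimensional vector shown above.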
In addition, we will be considering `cosine similarity <http://en.wikipedia.org/wiki/Cosine_similarity>`_
-to determine the similarity of two vectors. Cosine similarity is a standard measure
-in Vector Space Modeling, but wherever the vectors represent probability distributions,
+to determine the similarity of two vectors. Cosine similarity is a standard measure
+in Vector Space Modeling, but wherever the vectors represent probability distributions,
`different similarity measures <http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence>`_
may be more appropriate.
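(Purely as an illustration of the measure — not part of the original tutorial — cosine similarity between two sparse vectors can be computed by hand:)

>>> import math
>>> def cossim(vec1, vec2):
...     d1, d2 = dict(vec1), dict(vec2)
...     dot = sum(weight * d2.get(featureId, 0.0) for featureId, weight in d1.iteritems())
...     norm1 = math.sqrt(sum(w * w for w in d1.itervalues()))
...     norm2 = math.sqrt(sum(w * w for w in d2.itervalues()))
...     return dot / (norm1 * norm2)
...
>>> print cossim([(0, 1.0), (1, 1.0)], [(1, 1.0), (2, 1.0)])
0.5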
Initializing query structures
++++++++++++++++++++++++++++++++
To prepare for similarity queries, we need to enter all documents which we want
-to compare against subsequent queries. In our case, they are the same nine documents
+to compare against subsequent queries. In our case, they are the same nine documents
used for training LSI, converted to 2-D LSA space. But that's only incidental, we
might also be indexing a different corpus altogether.
@@ -67,9 +67,9 @@ might also be indexing a different corpus altogether.
.. warning::
The class :class:`similarities.MatrixSimilarity` is only appropriate when the whole
set of vectors fits into memory. For example, a corpus of one million documents
- would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.
+ would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.
Without 2GB of free RAM, you would need to use the :class:`similarities.Similarity` class.
- This class operates in constant memory, in a streaming (and more gensim-like)
+ This class operates in constant memory, in a streaming (and more gensim-like)
fashion, but is also much slower than :class:`similarities.MatrixSimilarity`, which uses
fast level-2 `BLAS routines <http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms>`_
to determine similarities.
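(The indexing call itself is not visible in this hunk; assuming the API matches the class names referenced in the warning above, it might look like:)

>>> index = similarities.MatrixSimilarity(lsi[corpus])   # transform the corpus to LSI space and index it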
@@ -87,13 +87,13 @@ To obtain similarities of our query document against the nine indexed documents:
>>> sims = index[vec_lsi] # perform a similarity query against the corpus
>>> print list(enumerate(sims)) # print (document_number, document_similarity) 2-tuples
-[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
+[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
(5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]
Cosine measure returns similarities in the range `<-1, 1>` (the greater, the more similar),
so that the first document has a score of 0.99809301 etc.
-With some standard Python magic we sort these similarities into descending
+With some standard Python magic we sort these similarities into descending
order, and obtain the final answer to the query `"Human computer interaction"`:
>>> sims = sorted(enumerate(sims), key = lambda item: -item[1])
@@ -108,17 +108,17 @@ order, and obtain the final answer to the query `"Human computer interaction"`:
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
-(I added the original documents in their "string form" to the output comments, to
+(I added the original documents in their "string form" to the output comments, to
improve clarity.)
The thing to note here is that documents no. 2 (``"The EPS user interface management system"``)
and 4 (``"Relation of user perceived response time to error measurement"``) would never be returned by
-a standard boolean fulltext search, because they do not share any common words with ``"Human
-computer interaction"``. However, after applying LSI, we can observe that both of
-them received quite high similarity scores (no. 2 is actually the most similar!),
+a standard boolean fulltext search, because they do not share any common words with ``"Human
+computer interaction"``. However, after applying LSI, we can observe that both of
+them received quite high similarity scores (no. 2 is actually the most similar!),
which corresponds better to our intuition of
them sharing a "computer-human" related topic with the query. In fact, this semantic
-generalization is the reason why we apply transformations and do topic modelling
+generalization is the reason why we apply transformations and do topic modelling
in the first place.
@@ -136,5 +136,5 @@ This means that:
* your **feedback is most welcome** and appreciated, be it in code and idea contributions, bug reports or just user stories.
Gensim has no ambition to become a production-level tool, with robust failure handling
-and error recoveries. Its main goal is to help NLP newcomers try out popular algorithms
+and error recoveries. Its main goal is to help NLP newcomers try out popular algorithms
and to facilitate prototyping of new algorithms for NLP researchers.
48 docs/_sources/tutorial.txt
@@ -4,15 +4,15 @@ Tutorial
========
-This tutorial is organized as a series of examples that highlight various features
-of `gensim`. It is assumed that the reader is familiar with the Python language
+This tutorial is organized as a series of examples that highlight various features
+of `gensim`. It is assumed that the reader is familiar with the Python language
and has read the :doc:`intro`.
The examples are divided into parts on:
.. toctree::
:maxdepth: 2
-
+
tut1
tut2
tut3
@@ -22,12 +22,12 @@ The examples are divided into parts on:
Preliminaries
--------------
-All the examples can be directly copied to your Python interpreter shell (assuming
-you have :doc:`gensim installed <install>`, of course).
-`IPython <http://ipython.scipy.org>`_'s ``cpaste`` command is especially handy for copypasting code fragments which include superfluous
+All the examples can be directly copied to your Python interpreter shell (assuming
+you have :doc:`gensim installed <install>`, of course).
+`IPython <http://ipython.scipy.org>`_'s ``cpaste`` command is especially handy for copypasting code fragments which include superfluous
characters, such as the leading ``>>>``.
-Gensim uses Python's standard :mod:`logging` module to log various stuff at various
+Gensim uses Python's standard :mod:`logging` module to log various stuff at various
priority levels; to activate logging (this is optional), run
>>> import logging
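>>> # (the configuration line is not shown in this hunk; a minimal setup using the standard
>>> # library follows -- the exact format string is only a guess at what the docs used)
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)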
@@ -42,7 +42,7 @@ Quick Example
First, let's import gensim and create a small corpus of nine documents [1]_:
>>> from gensim import corpora, models, similarities
->>>
+>>>
>>> corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
>>> [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
>>> [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
@@ -54,36 +54,36 @@ First, let's import gensim and create a small corpus of nine documents [1]_:
>>> [(8, 1.0), (10, 1.0), (11, 1.0)]]
:dfn:`Corpus` is simply an object which, when iterated over, returns its documents represented
-as sparse vectors.
+as sparse vectors.
If you're familiar with the `Vector Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_,
-you'll probably know that the way you parse your documents and convert them to vectors
+you'll probably know that the way you parse your documents and convert them to vectors
has major impact on the quality of any subsequent applications. If you're not familiar
-with :abbr:`VSM (Vector Space Model)`, we'll bridge the gap between **raw strings**
-and **sparse vectors** in the next tutorial
+with :abbr:`VSM (Vector Space Model)`, we'll bridge the gap between **raw strings**
+and **sparse vectors** in the next tutorial
on :doc:`tut1`.
.. note::
- In this example, the whole corpus is stored in memory, as a Python list. However,
- the corpus interface only dictates that a corpus must support iteration over its
- constituent documents. For very large corpora, it is advantageous to keep the
- corpus on disk, and access its documents sequentially, one at a time. All the
- operations and transformations are implemented in such a way that makes
+ In this example, the whole corpus is stored in memory, as a Python list. However,
+ the corpus interface only dictates that a corpus must support iteration over its
+ constituent documents. For very large corpora, it is advantageous to keep the
+ corpus on disk, and access its documents sequentially, one at a time. All the
+ operations and transformations are implemented in such a way that makes
them independent of the size of the corpus, memory-wise.
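(To make the note concrete — this sketch is not part of the original tutorial — a corpus that streams its documents from disk only needs to implement ``__iter__``; the file name and line format here are purely illustrative:)

>>> class StreamingCorpus(object):
...     def __iter__(self):
...         for line in open('my_corpus.txt'):   # hypothetical file: one document per line
...             # parse e.g. "0:1.0 2:3.0" into the sparse vector [(0, 1.0), (2, 3.0)]
...             yield [(int(fid), float(weight)) for fid, weight in
...                    (pair.split(':') for pair in line.split())]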
Next, let's initialize a :dfn:`transformation`:
>>> tfidf = models.TfidfModel(corpus)
-A transformation is used to convert documents from one vector representation into another:
+A transformation is used to convert documents from one vector representation into another:
>>> vec = [(0, 1), (4, 1)]
>>> print tfidf[vec]
[(0, 0.8075244), (4, 0.5898342)]
-Here, we used `Tf-Idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, a simple
-transformation which takes documents represented as bag-of-words counts and applies
+Here, we used `Tf-Idf <http://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, a simple
+transformation which takes documents represented as bag-of-words counts and applies
a weighting which discounts common terms (or, equivalently, promotes rare terms).
It also scales the resulting vector to unit length (in the `Euclidean norm <http://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm>`_).
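(A quick check of the unit-length claim — not in the original text — using the transformed vector printed above:)

>>> import math
>>> print round(math.sqrt(sum(weight ** 2 for _, weight in tfidf[vec])), 6)
1.0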
@@ -99,17 +99,17 @@ and to query the similarity of our query vector ``vec`` against every document i
>>> print list(enumerate(sims))
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
-How to read this output? Document number zero (the first document) has a similarity score of 0.466=46.6\%,
+How to read this output? Document number zero (the first document) has a similarity score of 0.466=46.6\%,
the second document has a similarity score of 19.1\% etc.
-Thus, according to TfIdf document representation and cosine similarity measure,
-the most similar to our query document `vec` is document no. 3, with a similarity score of 82.1%.
+Thus, according to the TfIdf document representation and the cosine similarity measure,
+the document most similar to our query `vec` is document no. 3, with a similarity score of 82.1%.
Note that in the TfIdf representation, any documents which do not share any common features
with ``vec`` at all (documents no. 4--8) get a similarity score of 0.0. See the :doc:`tut3` tutorial for more detail.
------
-.. [1] This is the same corpus as used in
+.. [1] This is the same corpus as used in
`Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.
2 docs/_sources/utils.txt
@@ -1,4 +1,4 @@
-:mod:`utils` -- Various utility functions
+:mod:`utils` -- Various utility functions
==========================================
.. automodule:: gensim.utils
66 docs/_sources/wiki.txt
@@ -1,6 +1,6 @@
.. _wiki:
-Experiments on the English Wikipedia
+Experiments on the English Wikipedia
============================================
To test `gensim` performance, we run it against the English version of Wikipedia.
@@ -13,21 +13,21 @@ anyone can reproduce the results. It is assumed you have `gensim` properly :doc:
Preparing the corpus
----------------------
-1. First, download the dump of all Wikipedia articles from http://download.wikimedia.org/enwiki/
+1. First, download the dump of all Wikipedia articles from http://download.wikimedia.org/enwiki/
(you want a file like `enwiki-latest-pages-articles.xml.bz2`). This file is about 6GB in size
and contains (a compressed version of) all articles from the English Wikipedia.
-2. Convert the articles to plain text (process Wiki markup) and store the result as
- sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don't
+2. Convert the articles to plain text (process Wiki markup) and store the result as
+ sparse TF-IDF vectors. In Python, this is easy to do on-the-fly and we don't
even need to uncompress the whole archive to disk. There is a script included in
`gensim` that does just that, run::
$ python -m gensim.corpora.wikicorpus
.. note::
- This pre-processing step makes two passes over the 6GB wiki dump (one to extract
- the dictionary, one to create and store the sparse vectors) and takes about
- 15 hours on my laptop, so you may want to go have a coffee or two.
+ This pre-processing step makes two passes over the 6GB wiki dump (one to extract
+ the dictionary, one to create and store the sparse vectors) and takes about
+ 15 hours on my laptop, so you may want to go have a coffee or two.
Also, you will need about 15GB of free disk space to store the sparse output vectors.
Latent Semantic Analysis
@@ -47,15 +47,15 @@ First let's load the corpus iterator and dictionary, created in the second step
>>> print mm
MmCorpus(3199665 documents, 100000 features, 495547400 non-zero entries)
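(The loading step is not visible in this hunk; a sketch, with purely illustrative file names and assuming the dictionary was stored via ``Dictionary.save()``, might read:)

>>> import gensim
>>> dictionary = gensim.corpora.dictionary.Dictionary.load('wiki_en.dict')    # hypothetical path
>>> id2word = dictionary.id2word
>>> mm = gensim.corpora.mmcorpus.MmCorpus('wiki_en_tfidf.mm')                 # hypothetical path
>>> print mm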
-We see that our corpus contains 3.2M documents, 100K features (distinct
+We see that our corpus contains 3.2M documents, 100K features (distinct
tokens) and 0.5G non-zero entries in the sparse TF-IDF matrix. The corpus contains
about 1.92 billion tokens in total.
Now we're ready to compute LSA of the English Wikipedia::
>>> # extract 400 LSI topics; use the default one-pass algorithm
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400)
-
+
>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
topic #0(200.540): 0.475*"delete" + 0.383*"deletion" + 0.275*"debate" + 0.223*"comments" + 0.221*"edits" + 0.213*"modify" + 0.208*"appropriate" + 0.195*"subsequent" + 0.155*"wp" + 0.116*"notability"
@@ -72,11 +72,11 @@ Now we're ready to compute LSA of the English Wikipedia::
Creating the LSI model of Wikipedia takes about 5 hours and 14 minutes on my laptop [1]_.
If you need your results even faster, see the tutorial on :doc:`distributed`.
-We see that the total processing time is dominated by the preprocessing step of
+We see that the total processing time is dominated by the preprocessing step of
preparing the TF-IDF corpus, which took 15h. [2]_
-The algorithm used in `gensim` only needs to see each input document once, so it
-is suitable for environments where the documents come as a non-repeatable stream,
+The algorithm used in `gensim` only needs to see each input document once, so it
+is suitable for environments where the documents come as a non-repeatable stream,
or where the cost of storing/iterating over the corpus multiple times is too high.
@@ -97,15 +97,15 @@ As with Latent Semantic Analysis above, first load the corpus iterator and dicti
>>> print mm
MmCorpus(3199665 documents, 100000 features, 495547400 non-zero entries)
-We will run online LDA (see Hoffman et al. [3]_), which is an algorithm that takes a chunk of documents,
+We will run online LDA (see Hoffman et al. [3]_), which is an algorithm that takes a chunk of documents,
updates the LDA model, takes another chunk, updates the model etc. Online LDA can be contrasted
-with batch LDA, which processes the whole corpus (one full pass), then updates
-the model, then another pass, another update... The difference is that given a
-reasonably stationary document stream (not much topic drift), the online updates
-over the smaller chunks (subcorpora) are pretty good in themselves, so that the
+with batch LDA, which processes the whole corpus (one full pass), then updates
+the model, then another pass, another update... The difference is that given a
+reasonably stationary document stream (not much topic drift), the online updates
+over the smaller chunks (subcorpora) are pretty good in themselves, so that the
model estimation converges faster. As a result, we will perhaps only need a single full
-pass over the corpus: if the corpus has 3 million articles, and we update once after
-every 10,000 articles, this means we will have done 300 updates in one pass, quite likely
+pass over the corpus: if the corpus has 3 million articles, and we update once after
+every 10,000 articles, this means we will have done 300 updates in one pass, quite likely
enough to have a very accurate topics estimate::
>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
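>>> # (the call itself lies outside this hunk; in this sketch, every keyword argument except
>>> # `numTopics` is an assumption based on the comment above, not confirmed API)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, numTopics=100,
...                                        chunks=10000, passes=1)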
@@ -143,18 +143,18 @@ Creating this LDA model of Wikipedia takes about 11 hours on my laptop [1]_.
If you need your results faster, consider running :doc:`dist_lda` on a cluster of
computers.
-Note two differences between the LDA and LSA runs: we asked LSA
-to extract 400 topics, LDA only 100 topics (so the difference in speed is in fact
+Note two differences between the LDA and LSA runs: we asked LSA
+to extract 400 topics, LDA only 100 topics (so the difference in speed is in fact
even greater). Secondly, the LSA implementation in `gensim` is truly online: if the nature of the input
stream changes in time, LSA will re-orient itself to reflect these changes, in a reasonably
small amount of updates. In contrast, LDA is not truly online (the name of the [3]_
-article notwithstanding), as the impact of later updates on the model gradually
-diminishes. If there is topic drift in the input document stream, LDA will get
+article notwithstanding), as the impact of later updates on the model gradually
+diminishes. If there is topic drift in the input document stream, LDA will get
confused and be increasingly slower at adjusting itself to the new state of affairs.
-In short, be careful if using LDA to incrementally add new documents to the model
+In short, be careful if using LDA to incrementally add new documents to the model
over time. **Batch usage of LDA**, where the entire training corpus is either known beforehand or does
-not exihibit topic drift, **is ok and not affected**.
+not exhibit topic drift, **is OK and not affected**.
To run batch LDA (not online), train `LdaModel` with::
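(The exact call is not shown in this hunk. A sketch — in which everything beyond ``numTopics`` is an assumption rather than confirmed API — might be:)

>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, numTopics=100, passes=20)
>>> # several full passes over the corpus, with no incremental updates in between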
@@ -168,27 +168,27 @@ To run batch LDA (not online), train `LdaModel` with::
.. [2]
Here we're mostly interested in performance, but it is interesting to look at the
- retrieved LSA concepts, too. I am no Wikipedia expert and don't see into Wiki's bowels,
+ retrieved LSA concepts, too. I am no Wikipedia expert and don't see into Wiki's bowels,
but Brian Mingus had this to say about the result::
There appears to be a lot of noise in your dataset. The first three topics
in your list appear to be meta topics, concerning the administration and
cleanup of Wikipedia. These show up because you didn't exclude templates
such as these, some of which are included in most articles for quality
control: http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup
-
+
The fourth and fifth topics clearly shows the influence of bots that import
massive databases of cities, countries, etc. and their statistics such as
population, capita, etc.
-
+
The sixth shows the influence of sports bots, and the seventh of music bots.
-
- So the top ten concepts are apparently dominated by Wikipedia robots and expanded
- templates; this is a good reminder that LSA is a powerful tool for data analysis,
- but no silver bullet. As always, it's `garbage in, garbage out
+
+ So the top ten concepts are apparently dominated by Wikipedia robots and expanded
+ templates; this is a good reminder that LSA is a powerful tool for data analysis,
+ but no silver bullet. As always, it's `garbage in, garbage out
<http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out>`_...
By the way, improvements to the Wiki markup parsing code are welcome :-)
-.. [3] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation
+.. [3] Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation
[`pdf <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>`_] [`code <http://www.cs.princeton.edu/~mdhoffma/>`_]
20 docs/apiref.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>API Reference &mdash; gensim documentation</title>
<link rel="stylesheet" href="_static/default.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
@@ -53,12 +53,12 @@
<li><a href="index.html">Gensim home</a>|&nbsp;</li>
<li><a href="#">API reference</a>|&nbsp;</li>
<li><a href="tutorial.html">Tutorials</a> &raquo;</li>
-
+
</ul>
</div>
-
+
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
@@ -82,15 +82,15 @@
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
-
-
+
+
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
-
+
<div class="section" id="api-reference">
<span id="apiref"></span><h1>API Reference<a class="headerlink" href="#api-reference" title="Permalink to this headline">¶</a></h1>
<p>Modules:</p>
@@ -125,7 +125,7 @@
</div>
<div class="clearer"></div>
</div>
-
+
<div class="related">
<h3>Navigation</h3>
<ul>
@@ -144,11 +144,11 @@
<li><a href="index.html">Gensim home</a>|&nbsp;</li>
<li><a href="#">API reference</a>|&nbsp;</li>
<li><a href="tutorial.html">Tutorials</a> &raquo;</li>
-
+
</ul>
</div>
-
-
+
+
<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
20 docs/corpora/bleicorpus.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>corpora.bleicorpus – Corpus in Blei’s LDA-C format &mdash; gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
@@ -55,12 +55,12 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
</ul>
</div>
-
+
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
@@ -84,15 +84,15 @@
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
-
-
+
+
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
-
+
<div class="section" id="module-gensim.corpora.bleicorpus">
<span id="corpora-bleicorpus-corpus-in-blei-s-lda-c-format"></span><h1><tt class="xref py py-mod docutils literal"><span class="pre">corpora.bleicorpus</span></tt> &#8211; Corpus in Blei&#8217;s LDA-C format<a class="headerlink" href="#module-gensim.corpora.bleicorpus" title="Permalink to this headline">¶</a></h1>
<p>Blei&#8217;s LDA-C format.</p>
@@ -140,7 +140,7 @@
</div>
<div class="clearer"></div>
</div>
-
+
<div class="related">
<h3>Navigation</h3>
<ul>
@@ -160,11 +160,11 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
</ul>
</div>
-
-
+
+
<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
20 docs/corpora/corpora.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>corpora – Package for corpora I/O &mdash; gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
@@ -45,12 +45,12 @@
<li><a href="../index.html">Gensim home</a>|&nbsp;</li>
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
-
+
</ul>
</div>
-
+
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<div id="searchbox" style="display: none">
@@ -68,15 +68,15 @@
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
-
-
+
+
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
-
+
<div class="section" id="module-gensim.corpora">
<span id="corpora-package-for-corpora-i-o"></span><h1><tt class="xref py py-mod docutils literal"><span class="pre">corpora</span></tt> &#8211; Package for corpora I/O<a class="headerlink" href="#module-gensim.corpora" title="Permalink to this headline">¶</a></h1>
<p>This package contains implementations of various streaming corpus I/O format.</p>
@@ -88,7 +88,7 @@
</div>
<div class="clearer"></div>
</div>
-
+
<div class="related">
<h3>Navigation</h3>
<ul>
@@ -101,11 +101,11 @@
<li><a href="../index.html">Gensim home</a>|&nbsp;</li>
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
-
+
</ul>
</div>
-
-
+
+
<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
44 docs/corpora/dictionary.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>corpora.dictionary – Construct word&lt;-&gt;id mappings &mdash; gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
@@ -55,12 +55,12 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
</ul>
</div>
-
+
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
@@ -84,33 +84,33 @@
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
-
-
+
+
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
-
+
<div class="section" id="module-gensim.corpora.dictionary">
<span id="corpora-dictionary-construct-word-id-mappings"></span><h1><tt class="xref py py-mod docutils literal"><span class="pre">corpora.dictionary</span></tt> &#8211; Construct word&lt;-&gt;id mappings<a class="headerlink" href="#module-gensim.corpora.dictionary" title="Permalink to this headline">¶</a></h1>
-<p>This module implements the concept of Dictionary &#8211; a mapping between words and
+<p>This module implements the concept of Dictionary &#8211; a mapping between words and
their integer ids.</p>
<p>Dictionaries can be created from a corpus and can later be pruned according to
-document frequency (removing (un)common words via the <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.filterExtremes" title="gensim.corpora.dictionary.Dictionary.filterExtremes"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.filterExtremes()</span></tt></a> method),
+document frequency (removing (un)common words via the <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.filterExtremes" title="gensim.corpora.dictionary.Dictionary.filterExtremes"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.filterExtremes()</span></tt></a> method),
save/loaded from disk via <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.save" title="gensim.corpora.dictionary.Dictionary.save"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.save()</span></tt></a> and <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.load" title="gensim.corpora.dictionary.Dictionary.load"><tt class="xref py py-func docutils literal"><span class="pre">Dictionary.load()</span></tt></a> methods etc.</p>
<dl class="class">
<dt id="gensim.corpora.dictionary.Dictionary">
<em class="property">class </em><tt class="descclassname">gensim.corpora.dictionary.</tt><tt class="descname">Dictionary</tt><big>(</big><em>documents=None</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary" title="Permalink to this definition">¶</a></dt>
<dd><p>Dictionary encapsulates mappings between normalized words and their integer ids.</p>
-<p>The main function is <cite>doc2bow</cite>, which converts a collection of words to its
-bag-of-words representation, optionally also updating the dictionary mapping
+<p>The main function is <cite>doc2bow</cite>, which converts a collection of words to its
+bag-of-words representation, optionally also updating the dictionary mapping
with newly encountered words and their ids.</p>
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.addDocuments">
<tt class="descname">addDocuments</tt><big>(</big><em>documents</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.addDocuments" title="Permalink to this definition">¶</a></dt>
-<dd><p>Build dictionary from a collection of documents. Each document is a list
+<dd><p>Build dictionary from a collection of documents. Each document is a list
of tokens (<strong>tokenized and normalized</strong> utf-8 encoded strings).</p>
<p>This is only a convenience wrapper for calling <cite>doc2bow</cite> on each document
with <cite>allowUpdate=True</cite>.</p>
@@ -123,11 +123,11 @@
<dl class="method">
<dt id="gensim.corpora.dictionary.Dictionary.doc2bow">
<tt class="descname">doc2bow</tt><big>(</big><em>document</em>, <em>allowUpdate=False</em><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.doc2bow" title="Permalink to this definition">¶</a></dt>
-<dd><p>Convert <cite>document</cite> (a list of words) into the bag-of-words format = list of
-<cite>(tokenId, tokenCount)</cite> 2-tuples. Each word is assumed to be a
+<dd><p>Convert <cite>document</cite> (a list of words) into the bag-of-words format = list of
+<cite>(tokenId, tokenCount)</cite> 2-tuples. Each word is assumed to be a
<strong>tokenized and normalized</strong> utf-8 encoded string.</p>
-<p>If <cite>allowUpdate</cite> is set, then also update of dictionary in the process: create ids
-for new words. At the same time, update document frequencies &#8211; for
+<p>If <cite>allowUpdate</cite> is set, then also update of dictionary in the process: create ids
+for new words. At the same time, update document frequencies &#8211; for
each word appearing in this document, increase its <cite>self.docFreq</cite> by one.</p>
<p>If <cite>allowUpdate</cite> is <strong>not</strong> set, this function is <cite>const</cite>, i.e. read-only.</p>
</dd></dl>
@@ -138,13 +138,13 @@
<dd><p>Filter out tokens that appear in</p>
<ol class="arabic simple">
<li>less than <cite>noBelow</cite> documents (absolute number) or</li>
-<li>more than <cite>noAbove</cite> documents (fraction of total corpus size, <em>not</em>
+<li>more than <cite>noAbove</cite> documents (fraction of total corpus size, <em>not</em>
absolute number).</li>
<li>after (1) and (2), keep only the first <cite>keepN&#8217; most frequent tokens (or
all if `None</cite>).</li>
</ol>
<p>After the pruning, shrink resulting gaps in word ids.</p>
-<p><strong>Note</strong>: Due to the gap shrinking, the same word may have a different
+<p><strong>Note</strong>: Due to the gap shrinking, the same word may have a different
word id before and after the call to this function!</p>
</dd></dl>
@@ -166,7 +166,7 @@
<dt id="gensim.corpora.dictionary.Dictionary.rebuildDictionary">
<tt class="descname">rebuildDictionary</tt><big>(</big><big>)</big><a class="headerlink" href="#gensim.corpora.dictionary.Dictionary.rebuildDictionary" title="Permalink to this definition">¶</a></dt>
<dd><p>Assign new word ids to all words.</p>
-<p>This is done to make the ids more compact, e.g. after some tokens have
+<p>This is done to make the ids more compact, e.g. after some tokens have
been removed via <a class="reference internal" href="#gensim.corpora.dictionary.Dictionary.filterTokens" title="gensim.corpora.dictionary.Dictionary.filterTokens"><tt class="xref py py-func docutils literal"><span class="pre">filterTokens()</span></tt></a> and there are gaps in the id series.
Calling this method will remove the gaps.</p>
</dd></dl>
@@ -187,7 +187,7 @@
</div>
<div class="clearer"></div>
</div>
-
+
<div class="related">
<h3>Navigation</h3>
<ul>
@@ -207,11 +207,11 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
</ul>
</div>
-
-
+
+
<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
36 docs/corpora/dmlcorpus.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>corpora.dmlcorpus – Corpus in DML-CZ format &mdash; gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
@@ -55,12 +55,12 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" accesskey="U">API Reference</a> &raquo;</li>
</ul>
</div>
-
+
<div class="sphinxsidebar">
<div class="sphinxsidebarwrapper">
<h4>Previous topic</h4>
@@ -84,39 +84,39 @@
<script type="text/javascript">$('#searchbox').show(0);</script>
</div>
</div>
-
-
+
+
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body">
-
+
<div class="section" id="module-gensim.corpora.dmlcorpus">
<span id="corpora-dmlcorpus-corpus-in-dml-cz-format"></span><h1><tt class="xref py py-mod docutils literal"><span class="pre">corpora.dmlcorpus</span></tt> &#8211; Corpus in DML-CZ format<a class="headerlink" href="#module-gensim.corpora.dmlcorpus" title="Permalink to this headline">¶</a></h1>
<p>Corpus for the DML-CZ project.</p>
<dl class="class">
<dt id="gensim.corpora.dmlcorpus.DmlConfig">
<em class="property">class </em><tt class="descclassname">gensim.corpora.dmlcorpus.</tt><tt class="descname">DmlConfig</tt><big>(</big><em>configId</em>, <em>resultDir</em>, <em>acceptLangs=None</em><big>)</big><a class="headerlink" href="#gensim.corpora.dmlcorpus.DmlConfig" title="Permalink to this definition">¶</a></dt>
-<dd><p>DmlConfig contains parameters necessary for the abstraction of a &#8216;corpus of
+<dd><p>DmlConfig contains parameters necessary for the abstraction of a &#8216;corpus of
articles&#8217; (see the <cite>DmlCorpus</cite> class).</p>
<p>Articles may come from different sources (=different locations on disk/netword,
different file formats etc.), so the main purpose of DmlConfig is to keep all
sources in one place.</p>
<p>Apart from glueing sources together, DmlConfig also decides where to store
-output files and which articles to accept for the corpus (= an additional filter
+output files and which articles to accept for the corpus (= an additional filter
over the sources).</p>
</dd></dl>
<dl class="class">
<dt id="gensim.corpora.dmlcorpus.DmlCorpus">
<em class="property">class </em><tt class="descclassname">gensim.corpora.dmlcorpus.</tt><tt class="descname">DmlCorpus</tt><a class="headerlink" href="#gensim.corpora.dmlcorpus.DmlCorpus" title="Permalink to this definition">¶</a></dt>
<dd><p>DmlCorpus implements a collection of articles. It is initialized via a DmlConfig
-object, which holds information about where to look for the articles and how
+object, which holds information about where to look for the articles and how
to process them.</p>
<p>Apart from being a regular corpus (bag-of-words iterable with a <cite>len()</cite> method),
-DmlCorpus has methods for building a dictionary (mapping between words and
+DmlCorpus has methods for building a dictionary (mapping between words and
their ids).</p>
<dl class="method">
<dt id="gensim.corpora.dmlcorpus.DmlCorpus.articleDir">
@@ -129,7 +129,7 @@
<tt class="descname">buildDictionary</tt><big>(</big><big>)</big><a class="headerlink" href="#gensim.corpora.dmlcorpus.DmlCorpus.buildDictionary" title="Permalink to this definition">¶</a></dt>
<dd><p>Populate dictionary mapping and statistics.</p>
<p>This is done by sequentially retrieving the article fulltexts, splitting
-them into tokens and converting tokens to their ids (creating new ids as
+them into tokens and converting tokens to their ids (creating new ids as
necessary).</p>
</dd></dl>
@@ -149,10 +149,10 @@
<dt id="gensim.corpora.dmlcorpus.DmlCorpus.processConfig">
<tt class="descname">processConfig</tt><big>(</big><em>config</em>, <em>shuffle=False</em><big>)</big><a class="headerlink" href="#gensim.corpora.dmlcorpus.DmlCorpus.processConfig" title="Permalink to this definition">¶</a></dt>
<dd><p>Parse the directories specified in the config, looking for suitable articles.</p>
-<p>This updates the self.documents var, which keeps a list of (source id,
+<p>This updates the self.documents var, which keeps a list of (source id,
article uri) 2-tuples. Each tuple is a unique identifier of one article.</p>
-<p>Note that some articles are ignored based on config settings (for example
-if the article&#8217;s language doesn&#8217;t match any language specified in the
+<p>Note that some articles are ignored based on config settings (for example
+if the article&#8217;s language doesn&#8217;t match any language specified in the
config etc.).</p>
</dd></dl>
@@ -196,7 +196,7 @@
</div>
<div class="clearer"></div>
</div>
-
+
<div class="related">
<h3>Navigation</h3>
<ul>
@@ -216,11 +216,11 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>
- <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
+ <li><a href="../apiref.html" >API Reference</a> &raquo;</li>
</ul>
</div>
-
-
+
+
<div class="footer">
&copy; Copyright 2011, Radim Řehůřek &lt;radimrehurek(at)seznam.cz&gt;.
34 docs/corpora/lowcorpus.html
@@ -6,7 +6,7 @@
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
-
+
<title>corpora.lowcorpus – Corpus in List-of-Words format &mdash; gensim documentation</title>
<link rel="stylesheet" href="../_static/default.css" type="text/css" />
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
@@ -55,12 +55,12 @@
<li><a href="../apiref.html">API reference</a>|&nbsp;</li>
<li><a href="../tutorial.html">Tutorials</a> &raquo;</li>