Topic Modelling for Humans
anotherbugmaster and mpenkov: NMF metrics and wikipedia (#2371)
* Fix random seed again

* Optimize E/M step

* Add an eval_every option, use softmax for normalization

* Fixes

* Improve notebook examples a bit

* Fix eval_every

* Return outliers

* Optimizations

* Experimenting with loss

* Fix PEP8

* Return nmf import

* Revert "Return nmf import"

This reverts commit 1c3a064

* Fix

* Fix minimum_probability & info -> debug logs

* Compute metrics

* Count error on-the-fly

* Speed optimizations, changed error functions

* Beat LDA

* Outperform sklearn in speed (WTF)

* Remove redundant arg

* Add Olivetti faces

* Remove redundant code

* Add Topics

* Make it pretty

* Fix wrapper

* Save corpus & dict, minor fixes

* Add RandomCorpus

* Dense -> sparse

* First doc2dense

* Fix csc again

* Fix len

* Experimenting

* Revert "Experimenting"

This reverts commit 7a3ef47.

* Fix evaluation

* Sparse speedup

* Improve performance

* Divide A and B again

* Fix A and B computation bug

* Sparsify W init

* Experimenting

* New norm

* Sparse threshold -> sparse coefficient

* Optimize residuals computation

* Fix residuals bug

* W speedup

* Experiment

* Revert changes a bit

* Fix corpus

* Fix init error

* Resolve conflict

* Fix corpus iteration issue

* Switch to numpy algos

* Train on wikipedia

* Sparse coef -> density. More stable way to sparsify W matrix

* Return old sparse algo

* Max

* Optimizations

* Fix A and B computation

* Fix A and B normalization

* Add random_state

* Infer id2word

* Fix tests

* Document __init__

* Document whole nmf

* Remove unnecessary comments

* Add tutorial notebook

* Document __init__

* Fix flake version

* Fix flake warning

* Remove comments, reverse parallelization order

* Add NMF's cython extension to setup.py

* Fix imports, add solve_r function

* Remove comments

* Add docstrings

* Common corpus and common dictionary

* Remove redundant test

* Add signature flag

* Add files to manifest

* Fix flake8

* Fix atol value

* Implement top topics

* Add rst files

* Fix appveyor issue

* Fix cython error

* Fix fmax/fmin not being available on win-python27

* Add word transformation test

* Improve readability of residuals computation

* Fix tests

* A few fixes

* Blank line at the end of each docstring

* Add blank line

* Add the paper reference

* Fix long line

* Add log_perplexity

* Add NMF and LDA comparison table

* Change the sign of log perplexity

* Add Sklearn NMF comparison

* Merge sklearn and tm tables

* Add F1

* Remove _solve_r

* Merge tutorial and benchmark

* Indentation's back

* Optimize optimizers

* Remove unnecessary pic

* Optimize memory consumption

* Add docstring

* Optimize get_topic_words

* Fix tests

* Fix flake8

* Add missing test

* Code review fixes

* n_tokens -> num_tokens

* [skip ci] Add explicit normalize parameter

* [skip ci] Add explicit normalize parameter[2]

* [skip ci] Update tutorial notebook

* [skip ci] [WIP] Update wikipedia notebook

* Add more description and metrics

* [skip ci] Fix log_probability

* Multiple format fixes in notebook, outputs cleared till tomorrow

* Train on full corpus

* [skip ci] Remove disclaimer

* Add RAM usage stats

* Native 20-newsgroups and additional text

* Truncate outputs

* Fix last cell formatting

* [skip ci] Change model hyperparameters back

* [skip ci] Add module docstring

* [skip ci] Massive speedups

Replaced some sparse matrices with dense.

* Checkout nmf_wikipedia from develop

* Fix tests

* Fix corpus description

* Add components permutation to coordinate descent

* Fix tests

* Fix dictionary highlight

* Fix tests again

* Remove r, it's not used for the time being

* Deprecate use_r

* [skip ci] Rearrange params

* [skip ci] Add disclaimer about `r`

* Fix `normalize` and `minimum_probability` docstring

* Remove unused params

* Add csc support

* Add examples to the docstring

* Update tutorial notebook

* [skip ci] Update tutorial again

* [skip ci] fix PEP

* Explicitly cast permutations to int32

* [skip ci] Fix a typo

* [skip ci] Remove clip and fix error count in update

* [skip ci] Fix error computation

* [skip ci] Fix error counting again

* [skip ci] Remove redundant imports

* Fix grouper for csc matrices

* Fix module docstring

* Fix training corpus description

* Fix pep8

* Fix flake8 for real

* Normalize, sparsity and dictionary fixes

* Updated module docstring in the notebook

* Update wiki

* Changed the loss shown in logs

* Fix wikipedia metrics

* Set chunksize to 1000

* Sklearn topics

* Fix the issues in the PR

* Re-order metrics, add more explanations

* Fix the compatibility test

* Fix flake8

* Add comment about sparsity

* Add nmf_model

* Fix indent and corpus type

* Fix the flake8

* Fix smart_open import

* Fix flake8, docstring and comments

* Truncate wikipedia

* [skip ci] Fix indent

* Fix CI

* Fix initialization

* Fix flake8 and sklearn topics in the tutorial

* [skip ci] Update wiki notebook

* Truncate wiki, remove autoreload

* Remove autoreload and line_profiler

* Fix type checks

* [skip ci] Add the comment

* update language in the NMF tutorial

* WIP: NMF tutorial fixes + add Wikipedia section

* more NMF tutorial fixes

* more NMF tutorial fixes

* NMF notebook fixes

* more NMF tutorial fixes

* more NMF fixes

* NMF tutorial fixes

* NMF tutorial fixes

* [skip ci] O(n_topics^2) -> O(n_topics)

It turns out the complexity was too high; I've optimized it.

* [skip ci] Improve training logs

* [skip ci] Turn on BLAS

* [skip ci] Optimize conversion to csc

* [skip ci] Optimize conversion to csc [2]

* [skip ci] Speed up corpus2csc

* [skip ci] Fix sparsity normalization

* [skip ci] Fix indentation

* [skip ci] Remove optimize option (numpy version issues)

* [skip ci] Re-serialize the model

The attribute name has changed.

* [skip ci] Fix sparse matrix length computation

* [skip ci] Print topics only after eval_every batches

* [skip ci] Cosmetic fix

* [skip ci] Fix length method for csc corpus

* Lots of changes to the notebook

- Update metrics
- Refactor metrics functions
- TF-IDF for NMFs

* Generator test

* [skip ci] Flake fix

* Notebook execution is finished

* [skip ci] Fix objections

* Trimmed the notebook and added more info about the metrics
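
For reference, here is a minimal sketch of training the gensim.models.nmf.Nmf model this PR delivers. The eval_every, normalize and random_state parameters follow the commit log above; the corpus is gensim's bundled toy data and the hyperparameter values are illustrative, not recommendations:

from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)  # map tokens to integer ids
corpus = [dictionary.doc2bow(text) for text in common_texts]  # bag-of-words vectors

nmf = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # number of latent topics to factorize into
    passes=5,          # full passes over the corpus
    eval_every=10,     # log the training error every 10 batches
    normalize=True,    # return probability-normalized topic vectors
    random_state=42,   # reproducible factorization
)

print(nmf.show_topics())           # top words per topic
print(nmf.log_perplexity(corpus))  # the perplexity metric added in this PR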

gensim – Topic Modelling in Python


Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
  • Intuitive interfaces
    • easy to plug in your own input corpus/datastream (trivial streaming API; see the sketch after this list)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.
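
As a quick illustration of the streaming and transformation APIs, here is a minimal sketch using gensim's bundled toy data (the model choice and num_topics are arbitrary here):

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)  # build the token -> id mapping
corpus = [dictionary.doc2bow(text) for text in common_texts]  # any iterable of BoW vectors works

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)  # train a topic model
print(lda.print_topics())  # inspect the learned topics
print(lda[corpus[0]])      # transformation API: topic distribution of one document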

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Support

Please raise potential bugs on GitHub; see the Contribution Guide before opening an issue.

If you have an open-ended or a research question, the mailing list is the best option.

Installation

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up its BLAS (Accelerate) automatically, so you don’t need to do anything special.

The simple way to install gensim is:

pip install -U gensim

Or, if you have instead downloaded and unzipped the source tar.gz package, you’d run:

python setup.py test
python setup.py install
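
Either way, you can sanity-check the result from Python (a quick verification, not part of the official instructions):

import gensim
print(gensim.__version__)  # prints the installed version, e.g. 3.7.x for this release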

For alternative modes of installation (without root privileges, development installation, optional install features), see the documentation.

This version has been tested under Python 2.7, 3.5 and 3.6. Gensim’s GitHub repo is hooked against Travis CI for automated testing on every commit push and pull request. Support for Python 2.6, 3.3 and 3.4 was dropped in gensim 1.0.0; install gensim 0.13.4 if you must use Python 2.6, 3.3 or 3.4. Support for Python 2.5 was dropped in gensim 0.10.0; install gensim 0.9.1 if you must use Python 2.5.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).
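
You can check which BLAS your NumPy build is linked against (a diagnostic sketch; the output format varies across NumPy versions):

import numpy
numpy.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against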

Memory-wise, gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim’s design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
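
Concretely, a corpus can be any iterable that yields one bag-of-words document at a time, so only a single document is ever held in memory. A minimal sketch, assuming a hypothetical mycorpus.txt with one whitespace-tokenized document per line:

from gensim.corpora import Dictionary

class StreamedCorpus:
    """Stream documents from disk, one bag-of-words vector at a time."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path) as infile:
            for line in infile:  # one document per line; never loads the whole file
                yield self.dictionary.doc2bow(line.lower().split())

# 'mycorpus.txt' is a hypothetical example file
dictionary = Dictionary(line.lower().split() for line in open('mycorpus.txt'))
corpus = StreamedCorpus('mycorpus.txt', dictionary)  # usable wherever gensim expects a corpus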

Documentation


Adopters

| Company | Industry | Use of Gensim |
|---------|----------|---------------|
| RARE Technologies | ML & NLP consulting | Creators of Gensim – this is us! |
| Amazon | Retail | Document similarity. |
| National Institutes of Health | Health | Processing grants and publications with word2vec. |
| Cisco Security | Security | Large-scale fraud detection. |
| Mindseye | Legal | Similarities in legal documents. |
| Channel 4 | Media | Recommendation engine. |
| Talentpair | HR | Candidate matching in high-touch recruiting. |
| Juju | HR | Provide non-obvious related job suggestions. |
| Tailwind | Media | Post interesting and relevant content to Pinterest. |
| Issuu | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| Search Metrics | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| 12K Research | Media | Document similarity analysis on media articles. |
| Stillwater Supercomputing | Hardware | Document comprehension and association with word2vec. |
| SiteGround | Web hosting | An ensemble search engine which uses different embedding models and similarities, including word2vec, WMD and LDA. |
| Capital One | Finance | Topic modeling for customer complaints exploration. |

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}