Ideas & Feature proposals

Menshikh Ivan edited this page Feb 15, 2018 · 65 revisions

A list of ideas for new functionality and projects in Gensim, topic modelling for humans, a scientific Python package for efficient, large-scale topic modeling.

Each entry on this page is a short, initial description of a project.

Gensim's design philosophy builds on data streaming to process very large datasets (larger than RAM; potentially infinite). Data points are processed one at a time, in constant RAM.

This places stringent requirements on the internal algorithms used (online learning, single pass methods) as well as their implementation, to achieve top performance and robustness.
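A minimal sketch of the streaming idea, using only the standard library. The class name `StreamedCorpus` and the one-document-per-line file layout are assumptions made for illustration, not part of gensim's API. The point is that a corpus can be any iterable that yields one document at a time, so memory use does not grow with corpus size.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical corpus: yields one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated repeatedly
        # while only a single line is held in memory at a time.
        with open(self.path, encoding="utf8") as handle:
            for line in handle:
                yield line.lower().split()


# Tiny demonstration file; a real corpus could be far larger than RAM.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("Human machine interface\nGraph minors survey\n")

corpus = StreamedCorpus(tmp.name)
total_tokens = sum(len(doc) for doc in corpus)               # first pass
vocabulary = sorted({tok for doc in corpus for tok in doc})  # second pass
os.remove(tmp.name)
```

Both passes consume the documents one by one; nothing in the loop requires the full corpus to be materialized as a list.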

If you'd like to work on any of the topics below, or have your own ideas, get in touch on the gensim mailing list.


Online NNMF

Background:

Non-negative matrix factorization, NNMF [1], is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm. [2]

While implementations of NNMF in Python exist [3, 4], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. You will contribute a scalable implementation of NNMF to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.

Goals:

  1. Demonstrate understanding of matrix factorization theory and practice, by describing, implementing and evaluating a scalable version of the NNMF algorithm.

  2. Implement streamed NNMF [5] that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally also implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
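To make the mini-batch requirement in goal 2 concrete, here is a simplified sketch in NumPy. It is not the algorithm from [5] or [10]: it uses plain multiplicative updates on accumulated statistics, and the class name `OnlineNMFSketch` and all parameter names are assumptions for illustration. The model keeps only the dictionary `W` and two fixed-size accumulators, so memory does not depend on how many documents have been seen.

```python
import numpy as np


class OnlineNMFSketch:
    """Illustrative only: mini-batch NMF using accumulated sufficient statistics."""

    def __init__(self, num_features, num_topics, seed=0, eps=1e-9):
        rng = np.random.default_rng(seed)
        self.W = rng.random((num_features, num_topics)) + eps  # non-negative dictionary
        self.A = np.zeros((num_topics, num_topics))    # running sum of H @ H.T
        self.B = np.zeros((num_features, num_topics))  # running sum of X @ H.T
        self.eps = eps

    def transform(self, X, iterations=50):
        """Find non-negative H with X ~= W @ H, keeping W fixed."""
        H = np.full((self.W.shape[1], X.shape[1]), 1.0)
        WtX = self.W.T @ X
        WtW = self.W.T @ self.W
        for _ in range(iterations):
            H *= WtX / (WtW @ H + self.eps)  # multiplicative update keeps H >= 0
        return H

    def partial_fit(self, X, iterations=50):
        """Update the model from one mini-batch X of shape (num_features, batch_size)."""
        H = self.transform(X, iterations)
        self.A += H @ H.T  # the batch itself can be discarded after this point
        self.B += X @ H.T
        for _ in range(iterations):
            self.W *= self.B / (self.W @ self.A + self.eps)
        return self


# Synthetic non-negative data, streamed as ten mini-batches that are each seen once.
rng = np.random.default_rng(42)
true_W = rng.random((20, 3))
model = OnlineNMFSketch(num_features=20, num_topics=3)
for _ in range(10):
    batch = true_W @ rng.random((3, 8))  # 8 "documents" per batch
    model.partial_fit(batch)
```

A real implementation would also need sparse input support, a stopping criterion, and some way to down-weight old batches; those are left out here.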

Deliverables:

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

  2. Report: timings and accuracy of your NNMF implementation on English Wikipedia and the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your NNMF implementation. You can also evaluate the NNMF factorization quality against other factorization methods, such as SVD and LDA [9] in collaborative filtering settings (optional).

Resources:

[1] NNMF on Wikipedia

[2] Online algorithm

[3] Christian Thurau et al. "Python Matrix Factorisation"

[4] Sklearn NMF code

[5] Online NMF on Wikipedia

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010

[10] Wang, Tan, König, Li. "Efficient Document Clustering via Online Nonnegative Matrix Factorizations." 2011

[11] Topics extraction with Non-Negative Matrix Factorization in sklearn

[12] Gensim github issue #132.

Explicit Semantic Analysis

Background: Explicit Semantic Analysis [1, 2] is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter [3].

While implementations of ESA exist in Python [4] and other languages [5], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.

You will contribute a scalable implementation of ESA to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.

Goals:

  1. Demonstrate understanding of semantic interpretation theory and practice, by describing, implementing and evaluating a scalable version of the ESA algorithm.

  2. Implement streamed ESA that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
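For orientation, the core ESA representation can be sketched in a few lines of plain Python. This is an in-memory toy, not the streamed implementation the project asks for: the three "articles" are made-up stand-ins for Wikipedia, and the function names `interpret` and `cosine` are assumptions. Each word maps to TF-IDF-weighted concepts through an inverted index, and a text is represented as the sum of its words' concept vectors.

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins for Wikipedia articles (hypothetical data): concept title -> text.
concepts = {
    "Cat": "cat feline pet purr",
    "Dog": "dog canine pet bark",
    "Car": "car engine wheel road",
}

# Document frequency of each word across the concepts.
doc_freq = Counter(word for text in concepts.values() for word in set(text.split()))

# Inverted index: word -> {concept: TF-IDF weight of the word in that concept}.
inverted_index = defaultdict(dict)
for title, text in concepts.items():
    for word, tf in Counter(text.split()).items():
        idf = math.log(len(concepts) / doc_freq[word])
        if idf > 0:
            inverted_index[word][title] = tf * idf


def interpret(text):
    """Map a text to a weighted vector of concepts (its ESA-style interpretation)."""
    vector = defaultdict(float)
    for word, count in Counter(text.lower().split()).items():
        for title, weight in inverted_index.get(word, {}).items():
            vector[title] += count * weight
    return dict(vector)


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors stored as dicts."""
    dot = sum(weight * v.get(title, 0.0) for title, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

Two texts are then compared in concept space, for example `cosine(interpret("cat purr"), interpret("feline pet"))`, even when they share no words.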

Deliverables:

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

  2. Report: timings and accuracy of your ESA implementation on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your ESA implementation. You can also evaluate the ESA against other methods of semantic analysis, such as Latent Semantic Analysis [9, 10] in an event classification task (optional).

Resources:

[1] Evgeniy Gabrilovich and Shaul Markovitch "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, 34:443–498, 2009

[2] Explicit Semantic Analysis.

[3] Musaev, A.; De Wang; Shridhar, S.; Chien-An Lai; Pu, C., "Toward a Real-Time Service for Landslide Detection: Augmented Explicit Semantic Analysis and Clustering Composition Approaches," 2015 IEEE International Conference on Web Services (ICWS), pp. 511-518, June 27 - July 2, 2015

[4] Python implementation of ESA

[5] Gabrilovich's page on ESA

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] "Latent Semantic Analysis" article on Wikipedia

[10] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188

Distributed computing

Background: Gensim contains distributed implementations of several algorithms. The implementations use Pyro4 for network communication and are fairly low-level.

To do: Investigate + integrate one of the higher level frameworks for distributed computation, so that gensim can plug into them without reinventing the wheel. Implement one of the algorithms in gensim in this framework: for example, adapt the distributed online Latent Semantic Analysis or online LDA that are already in gensim.

The solution must support online computation (data coming as a non-repeatable stream of documents); batch processing is not enough.
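The online requirement can be illustrated with a small standard-library sketch. The helper name `iter_chunks` and the toy `document_stream` generator are assumptions for illustration. The stream is consumed exactly once, in fixed-size chunks, which is the unit a framework would hand to a worker.

```python
from itertools import islice


def iter_chunks(stream, chunk_size):
    """Yield lists of up to `chunk_size` items from a one-shot iterator."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def document_stream():
    # Stand-in for a non-repeatable source such as a socket or message queue.
    for i in range(7):
        yield ["token%d" % i]


chunk_sizes = []
for chunk in iter_chunks(document_stream(), chunk_size=3):
    # A distributed implementation would dispatch `chunk` to a worker here
    # and merge the worker's partial model state back in.
    chunk_sizes.append(len(chunk))
```

Because the source cannot be rewound, any candidate framework has to accept work in this incremental form, without requiring a complete dataset up front.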

Integration with a framework that plays well with Python (i.e. avoids Spark's serialisation to disk) would be better, so Ibis is a good candidate.

Resources: Ibis, Celery, Spark, Disco, Storm, Samza. Get in touch on the mailing list/@radimrehurek/@ogrisel.

Sanity checks

Background: Gensim newcomers sometimes mistakenly pass a single document where a whole corpus is expected, or a string where a list of tokens is expected.

This results in runtime errors, which confuse novices. Even worse, where gensim expects a sequence (~list of tokens) and the user passes a string, there is no error (a string is also an iterable!) but the individual letters are silently treated as whole tokens, leading to unexpected results.

To do: Collect/survey common newbie errors and then implement sanity checks that catch them early. This could be a method/set of methods that accept an object and check whether it's really a streamed corpus, whether it's empty, contains strange ids, ids in right order, strange feature weights... In case there's anything fishy, log a warning.

Resources: utils.is_corpus() checks whether the input is a corpus, without destroying the data in case the content is streamed and non-repeatable. It can serve as a starting point.
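A minimal sketch of one such check, assuming a hypothetical helper named `looks_like_tokens` (not an existing gensim function). It demonstrates the string-as-tokens pitfall and logs a warning when it is detected.

```python
import logging

logger = logging.getLogger(__name__)


def looks_like_tokens(document):
    """Return True if `document` is a non-string sequence of string tokens.

    Assumes `document` can be iterated more than once (e.g. a list); a one-shot
    generator would be consumed by this check.
    """
    if isinstance(document, (str, bytes)):
        logger.warning("got a string where a list of tokens was expected: %r", document[:20])
        return False
    return all(isinstance(token, str) for token in document)


# The silent failure described above: iterating a string yields single characters.
tokens_from_string = list("human interface")
```

Here `tokens_from_string` starts with `'h'`, `'u'`, `'m'`, which is what a model would receive as "tokens" if the string were passed through unchecked.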

Documentation Tooling

Difficulty: Medium; requires excellent UX skills and native English

Background: We already have a large number of models, so we want to pay more attention to model quality (documentation and model discovery being the main concerns here). If we have a great model that users don't know how (or when) to use, they won't use it! For this reason, we want to significantly improve our documentation.

To do:

  • [already underway, WIP] Consistent docstrings for all methods and classes in Gensim
  • An updated new "beginner tutorial chain": an API-centric walk through the Gensim terminology, design choices, ways of doing things the Gensim way, best practices, FAQ
  • Use-case-centric user guides for major models and use-case pipelines (sphinx-gallery), focusing on how to solve a concrete, popular task X
  • A new, slick project website: the current website https://radimrehurek.com/gensim/ is very popular in terms of visitors, but looks embarrassingly dated.
  • Improved UX: analysis of visitor flow, minimizing clicks for common documentation patterns, a logical structure for all documentation, intuitive navigation, improving information discovery for the different types of visitors (newbies, API docs, use-case docs, power users…)

Resources:

SparseTools package

See https://github.com/scikit-learn/scikit-learn/issues/6186

A package for working with sparse matrices, built on top of scipy, optimized with Cython, memory-efficient and fast. An improvement on, and replacement for, scipy's recently deprecated sparsetools package.

Should also include faster (Cythonized) transformations between the "gensim streamed corpus" and various formats (scipy.sparse, numpy...), similar to matutils (https://radimrehurek.com/gensim/matutils.html#gensim.matutils.corpus2csc).
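As a rough illustration of the kind of transformation meant here, the sketch below builds the three CSC arrays from a streamed corpus in a single pass. It is plain Python, not the Cython implementation being proposed, and the function name `corpus_to_csc_arrays` is an assumption. Documents become columns and term ids become row indices.

```python
def corpus_to_csc_arrays(corpus):
    """Build CSC (data, indices, indptr) arrays in a single pass over `corpus`.

    Each document is an iterable of (term_id, weight) pairs and becomes one column.
    """
    data, indices, indptr = [], [], [0]
    num_terms = 0
    for document in corpus:
        for term_id, weight in sorted(document):  # CSC expects sorted row indices
            indices.append(term_id)
            data.append(weight)
            num_terms = max(num_terms, term_id + 1)
        indptr.append(len(indices))  # marks where the next column starts
    shape = (num_terms, len(indptr) - 1)
    return data, indices, indptr, shape


corpus = iter([[(0, 1.0), (2, 3.0)], [], [(1, 2.0)]])  # one-shot stream of three documents
data, indices, indptr, shape = corpus_to_csc_arrays(corpus)
```

The result can be handed to `scipy.sparse.csc_matrix((data, indices, indptr), shape=shape)`. A Cythonized version would write into preallocated typed arrays instead of growing Python lists.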

Automatic topic labeling

Implement the algorithm from the paper "Automatic Labeling of Multinomial Topic Models" by Qiaozhu Mei et al. Suggestion from Jason Liu.

Pivoted normalization for tfidf model

See https://github.com/RaRe-Technologies/gensim/issues/220


Word2Vec/Doc2Vec: Add 'Adagrad' Gradient-Descent Option

Some Word2Vec/Doc2Vec papers or projects suggest they've used 'Adagrad' to speed gradient-descent. Having it as an option (for comparative evaluation) and then possibly the default (if it's a clear speed win) would be nice for Word2Vec/Doc2Vec.
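For reference, the Adagrad update itself is short. The sketch below applies it to a toy quadratic objective, not to Word2Vec's training loop; the function name `adagrad_step` and the hyperparameter values are assumptions for illustration. Each coordinate's step size shrinks with the accumulated squared gradients seen for that coordinate.

```python
import numpy as np


def adagrad_step(params, grad, grad_sq_sum, learning_rate=0.1, eps=1e-8):
    """Apply one in-place Adagrad update and return the updated arrays."""
    grad_sq_sum += grad ** 2  # per-coordinate history of squared gradients
    params -= learning_rate * grad / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum


# Toy objective (an assumption for illustration): f(w) = ||w - target||^2 / 2.
target = np.array([1.0, -2.0, 0.5])
weights = np.zeros(3)
history = np.zeros(3)
initial_loss = 0.5 * np.sum((weights - target) ** 2)
for _ in range(500):
    gradient = weights - target
    weights, history = adagrad_step(weights, gradient, history, learning_rate=0.5)
final_loss = 0.5 * np.sum((weights - target) ** 2)
```

In a Word2Vec/Doc2Vec setting the accumulator would need one entry per vector coordinate, which roughly doubles the memory used by the trained weights; that cost would be part of any comparative evaluation.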

Test Online Word2vec better

From a mailing list comment:

The testing that's occurred with this new feature has really only verified that new tokens are available with at-a-glance somewhat-meaningful vectors. The effect on existing tokens, or relations with tokens that don't appear in later training batches, hasn't been evaluated. (I'm also not sure it's doing the best thing with respect to features like frequent-word downsampling.)
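One possible starting point for such an evaluation, sketched under the assumption that vectors are snapshotted into plain dicts before and after the incremental update (the `vector_drift` helper and the snapshot format are hypothetical, not gensim API): measure how far each pre-existing token's vector moved.

```python
import numpy as np


def vector_drift(before, after):
    """Cosine similarity between old and new vectors for tokens present in both.

    `before` and `after` are plain dicts mapping token -> 1-D numpy array
    (a hypothetical snapshot format, not a gensim data structure).
    """
    drift = {}
    for token in before.keys() & after.keys():
        old, new = before[token], after[token]
        drift[token] = float(old @ new / (np.linalg.norm(old) * np.linalg.norm(new)))
    return drift


before = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
after = {"cat": np.array([2.0, 0.0]), "dog": np.array([1.0, 0.0]), "ferret": np.array([1.0, 1.0])}
drift = vector_drift(before, after)
```

Values near 1 mean a token kept its direction; low values for tokens that never appeared in the later batches would be the kind of side effect the comment is worried about. Comparing nearest-neighbour lists before and after would be a natural complement.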

Decompose dense embeddings into sparse interpretable components

Similar to this blog on images of faces

Structural Topic Models

See https://github.com/RaRe-Technologies/gensim/issues/1038

LazySVD

Fast SVD algorithm. Requires knowledge of C and algorithmic optimizations.

https://arxiv.org/abs/1607.03463

Integration with shorttext supervised learning package

This integration with sklearn and keras should be a part of gensim: https://github.com/stephenhky/PyShortTextCategorization

SentencePiece: Unsupervised language-agnostic tokenization

Google has recently released code for the SentencePiece algorithm.

Implement a gensim wrapper if it benchmarks well against supervised tokenization.

It fits in the same space as the existing Gensim module Phrases. This module is useful to a lot of people even though it is simple and general.

Port ldatuning metrics to gensim

Very useful metrics for selecting the number of topics in LDA: http://rpubs.com/siri/ldatuning

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge

Implement a gensim version of this algorithm. Main features: topic seeding through "anchor words" and hierarchical TM.

Paper: Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

Implementation: /gregversteeg/corex_topic