A Python library for topic modeling.
Clone or download
Pull request Compare This branch is 1006 commits ahead, 8 commits behind pielstroem:master.

README.md

Topics – Easy Topic Modeling in Python

Topics is a Python library for topic modeling. Furthermore, this repository provides a convenient, modular workflow that can be entirely controlled from within a well documented Jupyter notebook. Users not yet familiar with programming in Python can test basic topic modeling in a standalone GUI demonstrator, which does not require a Python interpreter or any extra installations.

At the moment, this library supports three LDA implementations:

  • lda, which is lightweight and provides basic LDA.
  • MALLET, which is known to be very robust.
  • Gensim, which is attractive because of its multi-core support.

Resources

Installation

To install the latest stable version of the library dariah_topics:

$ pip install dariah_topics

To install the latest development version:

$ pip install --upgrade git+https://github.com/DARIAH-DE/Topics.git@testing

Example code

In only 15 lines of code from plain text files to a visualization of the topic model output.

>>> from cophi_toolbox import preprocessing
>>> from dariah_topics import modeling, postprocessing, visualization
>>> pathlist = ['corpus/dickens_bleak.txt', 'corpus/thackeray_vanity.txt']
>>> labels = ['dickens_bleak', 'thackeray_vanity']
>>> corpus = preprocessing.read_files(pathlist)
>>> tokens = [preprocessing.tokenize(document) for document in corpus]
>>> matrix = preprocessing.create_document_term_matrix(tokens, labels)
>>> stopwords = preprocessing.list_mfw(matrix)
>>> clean_matrix = preprocessing.remove_features(stopwords, matrix)
>>> vocabulary = clean_matrix.columns
>>> model = modeling.lda(topics=10, iterations=1000, implementation='mallet')
>>> topics = postprocessing.show_topics(model, vocabulary)
>>> document_topics = postprocessing.show_document_topics(model, topics, labels)
>>> PlotDocumentTopics = visualization.PlotDocumentTopics(document_topics)
>>> PlotDocumentTopics.static_heatmap().show()

Working with Jupyter Notebooks

If you wish to work through the tutorials, you can clone the repository using Git:

$ git clone https://github.com/DARIAH-DE/Topics.git

or download the ZIP-archive (don't forget to unzip it) and install dariah_topics from its source code:

$ pip install -r requirements.txt

As a server-client application, Jupyter allows you to edit and run Python code interactively from within so-called notebooks via a web browser.

To install Jupyter:

$ pip install jupyter

Python distributions like Anaconda come with Jupyter by default.

You can run Jupyter via:

$ jupyter notebook

Working with MALLET

MALLET is a Java-based package for statistical natural language processing. The MALLET Topic Model package includes an extremely fast and highly scalable implementation of Gibbs sampling and tools for inferring topics for new documents given trained models.

To call MALLET from within the Python environment, dariah_topics provides a convenient wrapper.

You can download MALLET here. For more detailed instructions, have a look at this.

Troubleshooting

If you are confronted with any issues regarding installation or usability, please use GitHub issues.

This library requires Python 3.4 or higher.

Windows-specific Issues

  • You will have to install future‑0.16.0‑py3‑none‑any.whl from this resource. Download the appropriate file and run pip install future‑0.16.0‑py3‑none‑any.whl.
  • In case of the error Microsoft Visual C++ 10.0 is required, check if you are using Python 3.6 or higher with python -V. If you do, you have to install Microsoft Windows SDK from this resource. If you do not, upgrade to Python 3.6 or higher and try installing the library again.

UNIX-specific Issues

  • In case of PermissionError: [Errno 13] Permission denied, try pip install --user or python setup.py install --user, respectively.
  • Due to several visualization dependencies, you might have to install the distribution packages libfreetype6-dev and libpng-dev (e.g. using sudo apt-get install).
  • Make sure to install Python 3.6 correctly and adjust the selection of the Python interpreter in your editor accordingly. See also the Python documentation.

About DARIAH-DE

DARIAH-DE supports research in the humanities and cultural sciences with digital methods and procedures. The research infrastructure of DARIAH-DE consists of four pillars: teaching, research, research data and technical components. As a partner in DARIAH-EU, DARIAH-DE helps to bundle and network state-of-the-art activities of the digital humanities. Scientists use DARIAH, for example, to make research data available across Europe. The exchange of knowledge and expertise is thus promoted across disciplines and the possibility of discovering new scientific discourses is encouraged.

This application has been developed with support from the DARIAH-DE initiative, the German branch of DARIAH-EU, the European Digital Research Infrastructure for the Arts and Humanities consortium. Funding has been provided by the German Federal Ministry for Research and Education (BMBF) under the identifier 01UG1610J.

DARIAH-DE BMBF