Topics – Easy Topic Modeling in Python
Topics is a Python library for topic modeling. This repository also provides a convenient, modular workflow that can be controlled entirely from within a well-documented Jupyter notebook. Users not yet familiar with programming in Python can try basic topic modeling in a standalone GUI demonstrator, which requires neither a Python interpreter nor any extra installations.
At the moment, this library supports three LDA implementations:
- lda, which is lightweight and provides basic LDA.
- MALLET, which is known to be very robust.
- Gensim, which is attractive because of its multi-core support.
- Topics website
- Topics API documentation
- Topics paper
- Standalone Demonstrator releases
- An introduction to topic modeling using lda
- An introduction to topic modeling using MALLET
- An introduction to topic modeling using Gensim
- A demonstration of all available visualizations
To install the latest stable version of the library:
$ pip install dariah_topics
To install the latest development version:
$ pip install --upgrade git+https://github.com/DARIAH-DE/Topics.git@testing
It takes only 15 lines of code to get from plain text files to a visualization of the topic model output:
>>> from cophi_toolbox import preprocessing
>>> from dariah_topics import modeling, postprocessing, visualization
>>> pathlist = ['corpus/dickens_bleak.txt', 'corpus/thackeray_vanity.txt']
>>> labels = ['dickens_bleak', 'thackeray_vanity']
>>> corpus = preprocessing.read_files(pathlist)
>>> tokens = [preprocessing.tokenize(document) for document in corpus]
>>> matrix = preprocessing.create_document_term_matrix(tokens, labels)
>>> stopwords = preprocessing.list_mfw(matrix)
>>> clean_matrix = preprocessing.remove_features(stopwords, matrix)
>>> vocabulary = clean_matrix.columns
>>> model = modeling.lda(topics=10, iterations=1000, implementation='mallet')
>>> topics = postprocessing.show_topics(model, vocabulary)
>>> document_topics = postprocessing.show_document_topics(model, topics, labels)
>>> PlotDocumentTopics = visualization.PlotDocumentTopics(document_topics)
>>> PlotDocumentTopics.static_heatmap().show()
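The preprocessing steps in the example above (tokenizing, building a document-term matrix, removing the most frequent words) can be pictured with a small plain-Python sketch. This is only an illustration using the standard library. The function bodies and the two toy documents are assumptions made for this example and are not the cophi_toolbox implementation, which may tokenize and store the matrix differently.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def document_term_matrix(tokenized_docs, labels):
    """Map each document label to a Counter of its token frequencies."""
    return {label: Counter(tokens) for label, tokens in zip(labels, tokenized_docs)}


def most_frequent_words(matrix, n):
    """Return the n most frequent words across the whole corpus."""
    totals = Counter()
    for counts in matrix.values():
        totals.update(counts)
    return [word for word, _ in totals.most_common(n)]


def remove_features(features, matrix):
    """Return a copy of the matrix without the given words."""
    drop = set(features)
    return {
        label: Counter({w: c for w, c in counts.items() if w not in drop})
        for label, counts in matrix.items()
    }


# Toy corpus, invented for this sketch.
docs = ["the cat chased the mouse", "the dog chased the cat"]
labels = ["doc_a", "doc_b"]

tokens = [tokenize(doc) for doc in docs]
matrix = document_term_matrix(tokens, labels)
stopwords = most_frequent_words(matrix, 1)
clean_matrix = remove_features(stopwords, matrix)
```

Here the single most frequent word is treated as a stopword and dropped before modeling, which mirrors the idea of removing very common words so that they do not dominate every topic.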
Working with Jupyter Notebooks
If you wish to work through the tutorials, you can clone the repository using Git:
$ git clone https://github.com/DARIAH-DE/Topics.git
or download the ZIP archive (don't forget to unzip it) and install dariah_topics from its source code:
$ pip install -r requirements.txt
As a server-client application, Jupyter allows you to edit and run Python code interactively from within so-called notebooks via a web browser.
To install Jupyter:
$ pip install jupyter
Python distributions like Anaconda come with Jupyter by default.
You can run Jupyter via:
$ jupyter notebook
Working with MALLET
MALLET is a Java-based package for statistical natural language processing. The MALLET Topic Model package includes an extremely fast and highly scalable implementation of Gibbs sampling and tools for inferring topics for new documents given trained models.
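To give an idea of what Gibbs sampling does in topic modeling, here is a toy collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under simplifying assumptions (symmetric priors, no optimizations) and is not MALLET's implementation. The function name `gibbs_lda` and the default values for `alpha` and `beta` are chosen for this example only.

```python
import random


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocabulary = sorted({word for doc in docs for word in doc})
    word_ids = {word: i for i, word in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token to a random topic to start with.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word_ids[word]] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, word in enumerate(doc):
                w = word_ids[word]
                old = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Unnormalized probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and put the token back into the counts.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return doc_topic, topic_word, vocabulary
```

The returned `doc_topic` and `topic_word` tables hold raw counts. Normalizing their rows, with the priors added, would give per-document topic proportions and per-topic word distributions.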
dariah_topics provides a convenient wrapper for calling MALLET from within the Python environment.
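For orientation, a wrapper around MALLET essentially has to build and run two command lines: one that imports a corpus and one that trains the topic model. The sketch below only assembles these commands and does not run them. The function name `mallet_commands`, the output file names, and the directory layout are assumptions for this illustration and do not describe how dariah_topics wraps MALLET internally.

```python
from pathlib import Path


def mallet_commands(mallet_bin, corpus_dir, output_dir, num_topics=10, num_iterations=1000):
    """Build the MALLET import and training commands as argument lists."""
    output_dir = Path(output_dir)
    # Hypothetical file name for the binary corpus that MALLET writes.
    corpus_file = output_dir / "corpus.mallet"

    # Step 1: convert a directory of plain text files into MALLET's format.
    import_cmd = [
        str(mallet_bin), "import-dir",
        "--input", str(corpus_dir),
        "--output", str(corpus_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]

    # Step 2: train the topic model on the imported corpus.
    train_cmd = [
        str(mallet_bin), "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", str(output_dir / "topic_keys.txt"),
        "--output-doc-topics", str(output_dir / "doc_topics.txt"),
    ]
    return import_cmd, train_cmd
```

With MALLET and Java installed, each list could be passed to `subprocess.run(cmd, check=True)` in turn. The first argument would be the path to the `mallet` executable on your system.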
If you run into any problems with installation or usage, please open a GitHub issue.
This library requires Python 3.4 or higher.
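If you are unsure which interpreter your editor or notebook is using, a quick check from within Python can help. This is a minimal sketch. The helper name `meets_minimum` is made up for this example, and the minimum of 3.4 is taken from the requirement stated above.

```python
import sys

# Minimum version stated in this README.
MINIMUM_VERSION = (3, 4)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the given (major, minor, ...) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    print("Python {}.{} or higher is required, found {}.{}".format(
        MINIMUM_VERSION[0], MINIMUM_VERSION[1],
        sys.version_info[0], sys.version_info[1]))
```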
- You will have to install future-0.16.0-py3-none-any.whl from this resource. Download the appropriate file and run pip install future-0.16.0-py3-none-any.whl.
- In case of the error Microsoft Visual C++ 10.0 is required, check with python -V whether you are using Python 3.6 or higher. If you are, install the Microsoft Windows SDK from this resource. If you are not, upgrade to Python 3.6 or higher and try installing the library again.
- In case of PermissionError: [Errno 13] Permission denied, try pip install --user or python setup.py install --user, respectively.
- Due to several visualization dependencies, you might have to install additional distribution packages (using sudo apt-get install).
- Make sure to install Python 3.6 correctly and adjust the selection of the Python interpreter in your editor accordingly. See also the Python documentation.
DARIAH-DE supports research in the humanities and cultural sciences with digital methods and procedures. The research infrastructure of DARIAH-DE consists of four pillars: teaching, research, research data and technical components. As a partner in DARIAH-EU, DARIAH-DE helps to bundle and network state-of-the-art activities of the digital humanities. Scientists use DARIAH, for example, to make research data available across Europe. The exchange of knowledge and expertise is thus promoted across disciplines and the possibility of discovering new scientific discourses is encouraged.
This application has been developed with support from the DARIAH-DE initiative, the German branch of DARIAH-EU, the European Digital Research Infrastructure for the Arts and Humanities consortium. Funding has been provided by the German Federal Ministry of Education and Research (BMBF) under the identifier 01UG1610J.