
<a id='index-0'></a>

<a id='topic-model-mallet'></a>

# Topic modeling with MALLET

This section illustrates how to use [MALLET](http://mallet.cs.umass.edu/) to
model a corpus of texts using a topic model and how to analyze the results using
Python.

A topic model is a probabilistic model of the words appearing in a corpus of
documents.  (There are a number of general introductions to topic models
available, such as [[Ble12]](references.ipynb#blei-introduction-2012).) The particular topic model
used in this section is Latent Dirichlet Allocation (LDA), a model introduced in
the context of text analysis in 2003 [[BNJ03]](references.ipynb#blei-latent-2003). LDA is an
instance of a more general class of models called mixed-membership models. While
LDA involves a greater number of distributions and parameters than the Bayesian
model introduced in the section on [group comparison](feature_selection.ipynb#bayesian-group-comparison), both are instances of a Bayesian probabilistic
model. In fact, posterior inference for both models is typically performed in
precisely the same manner, using Gibbs sampling with conjugate priors.

This section assumes prior exposure to topic modeling and proceeds as follows:

1. MALLET is downloaded and used to fit a topic model of six novels, three by
  Brontë and three by Austen. Because these are lengthy texts, the novels are split
  up into smaller sections—a preprocessing step which improves results considerably.  
1. The output of MALLET is loaded into Python as a document-topic matrix (a
  2-dimensional array) of topic shares.  
1. Topics, discrete distributions over the vocabulary, are analyzed.  
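
The splitting mentioned in the first step can be sketched as follows. This is a minimal illustration and not necessarily how the corpus used here was prepared: the function name `split_into_chunks`, the chunk size, and the sample sentence are all assumptions made for the example.

```python
def split_into_chunks(text, chunk_size=1000):
    """Split a text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Toy example: a ten-word "novel" split into sections of four words each.
novel = "it is a truth universally acknowledged that a single man"
sections = split_into_chunks(novel, chunk_size=4)
print(sections)
```

The last chunk may be shorter than the others; whether to keep or merge it is a judgment call that depends on the corpus.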


Note that [an entire section](topic_model_visualization.ipynb#topic-model-visualization) is devoted to
visualizing topic models. This section focuses on using MALLET and processing
the results.

This section uses six novels by Brontë and Austen. These novels are divided into
parts as follows:

## Consider whether the text below fits here

The first two columns of `doc-topics.txt` record the document number
(0-based indexing) and the full path to the file. The remaining columns are best
read as (topic-number, topic-share) pairs. There are as many of these
pairs as there are topics. All columns are separated by tabs (there is even
a trailing tab at the end of each line). With the exception of the header (the
first line), this file records data using [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values). There are two challenges
in parsing this file into a document-topic matrix. The first is sorting.
The texts do not appear in a consistent order in `doc-topics.txt`, and the
topic-number and share pairs appear in different columns depending on the
document. We will need to reorder these pairs before assembling them into
a matrix.[[1]](#fnmapreduce) The second challenge is that the number of columns
varies with the number of topics specified (`--num-topics`). Fortunately, the
documentation for the Python library [itertools](http://docs.python.org/dev/library/itertools.html) describes a function
called `grouper`, built on `itertools.zip_longest` (`izip_longest` in Python 2), that solves our problem.
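
The parsing steps described above can be sketched as follows. The sample lines, file names, and header text are invented to match the layout described in the text; they are not actual MALLET output. `grouper` follows the recipe in the itertools documentation.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length groups: grouper('ABCD', 2) -> AB CD."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Hypothetical excerpt: a header line, then the document number, the file
# name, and (topic, share) pairs, each line ending in a trailing tab.
sample = (
    "#doc name topic proportion ...\n"
    "0\t/data/Bronte_Jane0001.txt\t2\t0.6\t1\t0.3\t0\t0.1\t\n"
    "1\t/data/Austen_Emma0001.txt\t1\t0.7\t0\t0.2\t2\t0.1\t\n"
)

doc_names = []
rows = []
for line in sample.splitlines()[1:]:        # skip the header
    fields = line.rstrip("\t").split("\t")  # drop the trailing tab
    name, pairs = fields[1], fields[2:]
    shares = {int(topic): float(share) for topic, share in grouper(pairs, 2)}
    doc_names.append(name)
    # Order the shares by topic number so every row has the same columns.
    rows.append([shares[topic] for topic in sorted(shares)])

# Order the documents by file name so the rows appear in a consistent order.
ordered = sorted(zip(doc_names, rows))
doc_names = [name for name, _ in ordered]
doctopic = [row for _, row in ordered]
print(doc_names)
print(doctopic)
```

Because the pairs are grouped two at a time, the same code works for any number of topics. The nested list can be converted with `numpy.array` if a 2-dimensional array is wanted.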

### Write text on the current state of MALLET

In [14]:
from pathlib import Path

import dariah
import cophi

jupyter_path = Path.cwd()
directory = Path(jupyter_path.resolve().parent.parent, 'data', 'austen-brontë-split')

corpus = cophi.corpus(directory,
                      lowercase=True,
                      token_pattern=r"\p{Letter}+\p{Connector_Punctuation}?\p{Letter}+",
                      metadata=False)

In [15]:
mfw = corpus.mfw(50)
features = mfw + corpus.hapax
dtm = corpus.drop(corpus.dtm, features).fillna(0).astype(int)

### Setting the path to MALLET as a global variable

In [16]:
import os

mallet_path = os.environ.get("MALLET_HOME")
# Point to the MALLET executable in 'bin/mallet', not to the installation
# directory. os.path.join avoids backslash escapes such as '\b'.
mallet_path = os.path.join(mallet_path, 'bin', 'mallet')

In [17]:
model = dariah.core.LDA(num_topics=20,
                        num_iterations=1000,
                        mallet=mallet_path)
model

In [None]:
model.fit(dtm)

## Inspecting the topic model

The first thing we should appreciate about our topic model is that the twenty
shares do a remarkably good job of summarizing our corpus. For example, they
preserve the distances between novels (see figures below). By this measure, LDA
is good at dimensionality reduction: we have taken a matrix of dimensions 813 by
14862 (occupying almost three megabytes of memory if stored in a sparse matrix)
and fashioned a representation that preserves important features in a matrix
that is 813 by 20 (5% the size of the original).

Even though a topic model “discards” the “fine-grained” information recorded in
the matrix of word frequencies, it preserves salient details of the underlying
matrix. That is, the topic shares associated with a document have an
interpretation in terms of word frequencies. This is best illustrated by
examining the present topic model.

First let us identify the most significant topics for each text in the corpus.
This procedure does not differ in essence from the procedure for identifying the
most frequent words in each text.

In [None]:
model.topic_document
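
As a minimal sketch of the ranking idea described above, the following uses a small invented document-topic matrix with one row per document. The document names and shares are assumptions made for illustration and are not output of the model fitted above; a matrix oriented the other way would need to be transposed first.

```python
import numpy as np

# Toy document-topic matrix (rows: documents, columns: topics).
doc_names = ["Austen_Emma", "Bronte_Jane"]
doctopic = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.50, 0.05, 0.40, 0.05],
])

n_top = 2
top_topics = {}
for name, shares in zip(doc_names, doctopic):
    # argsort sorts in ascending order, so reverse it and keep the
    # indices of the n_top largest shares.
    top_topics[name] = np.argsort(shares)[::-1][:n_top].tolist()

print(top_topics)
```

This is the same sort-and-slice pattern one would use to find the most frequent words in a row of a document-term matrix.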

In [None]:
!ls /tmp/dariah-topics

In [None]:
model.topic_document.to_csv('doc_topics_austen_brontë_20.csv', index=True)
model.topics.to_csv('topics_austen_brontë_20.csv', index=True)
model.topic_word.to_csv('word_austen_brontë_20.csv', index=True)

## From here to the end = old code

Now we will calculate the average of the topic shares associated with each
novel. Recall that we have been working with small sections of novels. The
following step combines the topic shares for sections associated with the same
novel.
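
A minimal sketch of this averaging step follows. The section names and shares are invented, and the sketch assumes that each section name is the novel's name followed by a four-digit counter; the actual file names in the corpus may follow a different scheme.

```python
from collections import defaultdict

# Toy topic shares for four sections of two novels (two topics each).
section_shares = {
    "Austen_Emma0000": [0.2, 0.8],
    "Austen_Emma0001": [0.4, 0.6],
    "Bronte_Jane0000": [0.9, 0.1],
    "Bronte_Jane0001": [0.7, 0.3],
}

# Group the sections by novel.
grouped = defaultdict(list)
for section, shares in section_shares.items():
    novel = section[:-4]  # strip the assumed four-digit section counter
    grouped[novel].append(shares)

# Average each topic's share over the sections of a novel.
novel_shares = {
    novel: [sum(column) / len(column) for column in zip(*rows)]
    for novel, rows in grouped.items()
}
print(novel_shares)
```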

In order to fit into the space available, the table above displays the first 15
of 20 topics.

We need to parse this file into something we can work with. Fortunately this
task is not difficult.

Now we have everything we need to list the words associated with each topic.
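
Listing the words for each topic amounts to sorting each topic's word weights and keeping the largest. The sketch below uses an invented vocabulary and invented weights; it does not read any MALLET output.

```python
# Toy topic-word weights (rows: topics, columns: vocabulary entries).
vocab = ["sister", "letter", "moor", "school", "ball"]
topic_word = [
    [0.30, 0.40, 0.02, 0.03, 0.25],
    [0.06, 0.04, 0.45, 0.40, 0.05],
]

n_top_words = 3
top_words_per_topic = []
for topic_number, weights in enumerate(topic_word):
    # Pair each weight with its word and sort from largest to smallest.
    ranked = sorted(zip(weights, vocab), reverse=True)
    top_words = [word for _, word in ranked[:n_top_words]]
    top_words_per_topic.append(top_words)
    print(f"Topic {topic_number}: {' '.join(top_words)}")
```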

There are many ways to inspect and to visualize topic models. Some of the more
common methods are covered in the [next section](topic_model_visualization.ipynb#topic-model-visualization).

## From here on relevant again = new code

### Distinctive topics

Finding distinctive topics is analogous to the task of [finding distinctive
words](feature_selection.ipynb#feature-selection). The topic model does an excellent job of focusing
attention on recurrent patterns (of co-occurrence) in the word frequencies
appearing in a corpus. To the extent that we are interested in these kinds of
patterns (rather than the rare or isolated feature of texts), working with
topics tends to be easier than working with word frequencies.

Consider the task of finding the distinctive topics in Austen’s novels. Here the
simple difference-in-averages provides an easy way of finding topics that tend
to be associated more strongly with Austen’s novels than with Brontë’s.

In [None]:
bronte_cols = ["Bronte" in col for col in model.topic_document.columns]
austen_cols = ["Austen" in col for col in model.topic_document.columns]

In [None]:
bronte_avg = model.topic_document.iloc[:, bronte_cols].mean(axis=1)
austen_avg = model.topic_document.iloc[:, austen_cols].mean(axis=1)

In [None]:
import numpy as np
keyness = np.abs(austen_avg - bronte_avg)

In [None]:
ranking = np.argsort(keyness)[::-1]  # from highest to lowest; [::-1] reverses order in Python sequences
ranking
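
Taking the absolute value ranks topics that are distinctive in either direction, so the ranking alone does not say which author a topic is associated with. A small sketch with invented averages shows how the sign of the difference recovers that information; the numbers are assumptions and not results from the model above.

```python
import numpy as np

# Invented per-topic average shares for three topics.
austen_avg = np.array([0.50, 0.10, 0.40])
bronte_avg = np.array([0.20, 0.55, 0.25])

difference = austen_avg - bronte_avg             # positive: higher share in Austen
ranking = np.argsort(np.abs(difference))[::-1]   # most distinctive topic first

for topic in ranking:
    author = "Austen" if difference[topic] > 0 else "Brontë"
    print(f"Topic {topic}: larger average share in {author}")
```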

<a id='fnmapreduce'></a>
**[1]** Those familiar with [MapReduce](https://en.wikipedia.org/wiki/MapReduce) may recognize the pattern of splitting a dataset into smaller pieces and then (re)ordering them.