
<a id='index-0'></a>

<a id='visualizing-trends'></a>

# Visualizing trends

Texts often have a sequence. Newspapers and periodicals have volumes.
Novels have chapters. Personal diaries have dated entries. Visualizations of
topic models may benefit from incorporating information about where a text falls
in a sequence.

As a motivating example, consider Victor Hugo’s *Les Misérables*. Over 500,000
words long, the book counts as a lengthy text by any
standard.<sup>[1](#fn-les-mis)</sup> The novel comes in five volumes (“Fantine”, “Cosette”,
“Marius”, “The Idyll in the Rue Plumet and the Epic in the Rue St. Denis”, and
“Jean Valjean”). And within each volume we have a sequence of chapters. (And
within each chapter we have a sequence of paragraphs, …). In this section we
will address how to visualize topic shares in sequence.

To whet your appetite, consider the rise and fall of a topic associated with
revolutionary activity in *Les Misérables*:

![_static/plot_topics_over_time_series_les_misérables.png](static/plot_topics_over_time_series_les_misérables.png)

([Enjolras](https://en.wikipedia.org/wiki/Enjolras) is the leader of the
revolutionary *Les Amis de l’ABC*.)

>**Note**
>
>Probabilistic models such as topic models often benefit from
incorporating information about where an individual text falls in a larger
sequence of texts [[BL06]](references.ipynb#blei-dynamic-2006).

## Plotting trends

As always, we first need to fit a topic model to the corpus. As MALLET has no
built-in French stopword list we need to provide one. We will use the [French
stopword list](http://svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt)
from the Snowball stemmer package. Additionally, because we are dealing with
non-English text we need to use an alternate regular expression for
tokenization, for example MALLET's `--token-regex '[\p{L}\p{M}]+'` option, which
matches runs of Unicode letters and combining marks.
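
To see why the tokenization pattern matters for French text, the following minimal sketch compares an ASCII-only pattern with a Unicode-aware one. It uses only Python's standard `re` module, where `[^\W\d_]+` is a common idiom for "one or more letters". This is an illustration under those assumptions, not the pattern MALLET or `cophi` uses internally, and the `tokenize` helper is hypothetical.

```python
import re
import unicodedata

# ASCII-only pattern: breaks words apart at accented characters.
ASCII_LETTERS = re.compile(r"[a-zA-Z]+")
# Unicode-aware pattern: \w minus digits and underscore, i.e. letters.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")


def tokenize(text, pattern=UNICODE_LETTERS):
    """Normalize `text` to NFC, lowercase it, and return all matches."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return pattern.findall(normalized)


sample = "Les Misérables: la barricade de la rue Saint-Denis"
ascii_tokens = tokenize(sample, ASCII_LETTERS)
unicode_tokens = tokenize(sample, UNICODE_LETTERS)
print(ascii_tokens)    # the accented word is split in two
print(unicode_tokens)  # the accented word survives intact
```

The NFC normalization step is a design choice for this sketch: it ensures accented letters are single code points, since the `re` module does not treat standalone combining marks as word characters.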

Fitting the model yields a topic-document matrix (`model.topic_document` below)
in which each column holds the topic proportions associated with a document.

In [5]:
from pathlib import Path

import dariah
import cophi

jupyter_path = Path.cwd()
directory = jupyter_path.resolve().parent.parent / 'data' / 'hugo-les-misérables-split'

corpus = cophi.corpus(directory,
                      lowercase=True,
                      token_pattern=r"\p{Letter}+\p{Connector_Punctuation}?\p{Letter}+",
                      metadata=False)

In [6]:
# drop the 50 most frequent words and all words that occur only once (hapax legomena)
mfw = corpus.mfw(50)
features = mfw + corpus.hapax
dtm = corpus.drop(corpus.dtm, features).fillna(0).astype(int)
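
The filtering step above can be sketched without `cophi`, using only the standard library. The helper name `select_vocabulary` and the toy token lists are hypothetical and serve purely as illustration; this is not how `cophi` implements `mfw` or `hapax`.

```python
from collections import Counter


def select_vocabulary(documents, n_most_frequent):
    """Return the words left after dropping the `n_most_frequent` most
    common words and every word whose corpus frequency is exactly 1."""
    counts = Counter(token for doc in documents for token in doc)
    most_frequent = {word for word, _ in counts.most_common(n_most_frequent)}
    hapax = {word for word, count in counts.items() if count == 1}
    return sorted(set(counts) - most_frequent - hapax)


# Toy, made-up token lists standing in for the novel's segments.
toy_documents = [
    ["la", "barricade", "la", "rue", "fusil"],
    ["la", "rue", "barricade", "évêque"],
    ["la", "barricade", "rue", "la"],
]
vocabulary = select_vocabulary(toy_documents, n_most_frequent=1)
print(vocabulary)
```

Very frequent words carry little topical signal and words seen once cannot co-occur with anything twice, which is the usual motivation for removing both before fitting.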

### Reading the MALLET path from an environment variable

In [7]:
import os

mallet_path = os.environ.get("MALLET_HOME")
mallet_path

'C:\\mallet'

### We get to use DARIAH Tools again to simplify the process

In [8]:
model = dariah.core.LDA(num_topics=50,
                        num_iterations=1000,
                        mallet=mallet_path)
model

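OSError: 'C:\mallet' is not a file. Point to the 'mallet/bin/mallet' file.

The error arises because `MALLET_HOME` names the installation directory here,
whereas `dariah.core.LDA` expects the path to the MALLET executable itself. Set
`mallet_path` to that file before running the remaining cells.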
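
One way to derive the executable path that the error message asks for is sketched below. It assumes the standard MALLET layout, in which the launcher sits at `bin/mallet` (with a `bin/mallet.bat` variant for Windows). The helper `find_mallet_executable` is hypothetical, and whether `dariah` accepts the `.bat` launcher is an assumption to verify on your system.

```python
import os
from pathlib import Path


def find_mallet_executable(mallet_home):
    """Return the path to the MALLET launcher inside `mallet_home`.

    Assumes the launcher is `bin/mallet` or `bin/mallet.bat`; on
    Windows the `.bat` file is tried first.
    """
    bin_dir = Path(mallet_home) / "bin"
    names = ("mallet.bat", "mallet") if os.name == "nt" else ("mallet", "mallet.bat")
    for name in names:
        candidate = bin_dir / name
        if candidate.is_file():
            return str(candidate)
    raise FileNotFoundError(f"no MALLET launcher found in {bin_dir}")


mallet_home = os.environ.get("MALLET_HOME")
if mallet_home is not None:
    try:
        mallet_path = find_mallet_executable(mallet_home)
        print(mallet_path)
    except FileNotFoundError as error:
        print(error)
```

Failing early with a clear message keeps a misconfigured `MALLET_HOME` from surfacing only when the model is constructed.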

In [None]:
model.fit(dtm)

In [None]:
model.topic_document

In [None]:
model.topic_document.to_csv('doc_topics_hugo-les-misérables_50.csv', index=True)
model.topics.to_csv('topics_hugo-les-misérables_50.csv', index=True)
model.topic_word.to_csv('word_hugo-les-misérables_50.csv', index=True)

Among the fifty topics there is one topic (#35 using 0-based indexing) that
jumps out as characteristic of events towards the close of the novel. The words
most strongly connected with this topic include “barricade”, “fusil”, and
“cartouches” (“barricade”, “rifle”, and “cartridges”).

In [None]:
model.topics.iloc[35]
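
As a rough sketch of what "most strongly connected" means, the top words of a topic are simply the words with the largest weights in that topic. The weights below are made up for illustration and are not values from the fitted model; `top_words` is a hypothetical helper, not part of `dariah`.

```python
def top_words(word_weights, n=3):
    """Return the `n` words with the largest weights, heaviest first."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Made-up weights for illustration only; not taken from the model.
toy_topic = {
    "rue": 0.02,
    "barricade": 0.09,
    "nuit": 0.01,
    "cartouches": 0.04,
    "fusil": 0.05,
}
print(top_words(toy_topic))
```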

Because the documents are ordered in a sequence, we can plot the fate, so to
speak, of topic #35 over time with the following lines of code:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# topic_document has one row per topic; transpose to get documents as rows
series = model.topic_document.T.iloc[:, 35].values
plt.figure(figsize=(8,8))
plt.plot(series, '.')  # '.' specifies the type of mark to use on the graph
plt.xticks(rotation=90)
plt.title("Documents over time for topic 35")
# save only after everything has been drawn, otherwise the file is blank
plt.savefig('plot_topics_over_time_series_simple.png') #width=7in

While this visualization communicates the essential information about the
prevalence of a topic in the corpus, it is not perfect. We can improve it. It
would, for instance, be useful to include an indication of where the various
volumes start and end. Another enhancement would add some kind of “smoothing” to
the time series in order to better communicate the underlying trend.

A rolling average of the topic shares turns out to be a useful form of smoothing in
this case. We are interested in the prevalence of a topic over time; whether
a topic disappears completely in one 500-word chunk of text (only to reappear in
the next) does not interest us. We want to visualize the underlying trend. That
is, we need some model or heuristic capable of capturing the idea
that the topic (or any similar feature) has an underlying propensity to appear at
varying points of the novel and that, while this propensity may change over time, it
does not fluctuate wildly. <sup>[2](#fn-lowess)</sup>

Recall that a rolling or moving average of a time series associates with each
point in the series the average of some fixed number of previous
observations (including the current observation). This fixed number of
observations is often
called a “window”. The idea of a rolling mean is effectively communicated visually:

In [None]:
import pandas as pd
import numpy as np

z = np.array([  3.,   2.,   3.,   6.,   2.,   3.,   1.,   3.,   8.,   3.,   5.,
               8.,   7.,   8.,   7.,   6.,   8.,   7.,   7.,   5.,   8.,   6.,
              11.,   6.,   7.,   8.,   8.,   6.,   9.,  15.,  13.,  10.,   9.])

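def rolling_mean(a, n=3):
    """Mean of each window of n consecutive values (len(a) - n + 1 results)."""
    ret = np.cumsum(a, dtype=float)
    # subtracting the cumulative sum n steps back leaves the sum of each window
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n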

rolling_mean(z, 3)
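
As a cross-check on the definition, here is a deliberately naive version in plain Python that averages each window explicitly. The name `trailing_mean` is hypothetical; the observations are the first five values of `z`. It also makes explicit which position of the original series each average belongs to, which matters when the smoothed values are plotted against the raw ones.

```python
def trailing_mean(values, n=3):
    """Average each full window of `n` consecutive observations.

    Output element i is the mean of values[i : i + n], so it belongs to
    position i + n - 1 of the original series (the window's last point).
    """
    if n < 1 or n > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


window = 3
observations = [3.0, 2.0, 3.0, 6.0, 2.0]
means = trailing_mean(observations, n=window)
# x positions of the averages: the last index of each window
positions = list(range(window - 1, len(observations)))
for position, mean in zip(positions, means):
    print(position, round(mean, 3))
```

A series of length 5 with a window of 3 yields only 3 averages, because the first two observations do not yet have a full window behind them.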

In [None]:
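plt.plot(z, '.', alpha=0.5)
# a window of 5 yields len(z) - 4 values; start them at x = 4 so that each
# mean lines up with the last observation in its window
plt.plot(np.arange(4, len(z)), rolling_mean(z, 5), '-', linewidth=2)
# save after both layers are drawn so the file includes the trend line
plt.savefig('plot_topics_over_time_rolling_mean.png') #width=5in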

After making these two improvements—marking the volume boundaries and adding
a trend line based on a rolling average—the time series for our topic does
a better job of orienting us in the novel and communicating the points in the
novel where the topic appears:

In [None]:
docnames = list(model.topic_document.columns)

In [None]:
# the values on the x-axis (xs) are simply a sequence of integers
# corresponding to the texts (the columns of model.topic_document)
xs = np.arange(len(series))

window = 15  # 15 seems to work well here
series_smooth = rolling_mean(series, window)

# now we need to calculate at what index each volume starts
# there are many ways to do this, two methods are shown below
# method #1
volume_names = ["tome-1-fantine", "tome-2-cosette", "tome-3-marius", "tome-4", "tome-5-jean-valjean"]
volume_indexes = []
for volname in volume_names:
    for i, docname in enumerate(docnames):
        if volname in docname:
            volume_indexes.append(i)
            break

volume_indexes_prev = volume_indexes

# method #2, use NumPy functions
volume_indexes = []
for volname in volume_names:
    volume_indexes.append(int(np.min(np.nonzero([volname in docname for docname in docnames]))))

# both methods should give the same result
assert volume_indexes == volume_indexes_prev

fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(xs, series, '.', alpha=0.5)
# the rolling mean only exists once a full window is available
ax.plot(xs[window - 1:], series_smooth, '-', linewidth=2)
# dotted vertical lines mark where each volume starts
for idx in volume_indexes:
    ax.axvline(idx, linestyle=':', color='gray')
ax.set_xticks(volume_indexes)
ax.set_xticklabels(volume_names, rotation=90)
ax.set_title("Les Misérables, Topic #35 (barricade enjolras ...)")
ax.set_xlabel("Novel segment")
ax.set_ylabel("Topic share")
fig.savefig("plot_topics_over_time_series_les_misérables.png")
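
A third way to find where each volume starts, sketched here with the standard library only, is a generator expression with `next()`. Unlike method #1, which silently skips a volume whose name matches no document, this version raises a clear error. The helper `first_index_containing` and the document names below are hypothetical; the real names come from the corpus directory.

```python
def first_index_containing(names, fragment):
    """Return the index of the first name that contains `fragment`."""
    index = next((i for i, name in enumerate(names) if fragment in name), None)
    if index is None:
        raise ValueError(f"no document name contains {fragment!r}")
    return index


# Hypothetical document names, for illustration only.
toy_docnames = [
    "tome-1-fantine_0000", "tome-1-fantine_0001",
    "tome-2-cosette_0000", "tome-2-cosette_0001", "tome-2-cosette_0002",
    "tome-3-marius_0000",
]
toy_volumes = ["tome-1-fantine", "tome-2-cosette", "tome-3-marius"]
toy_starts = [first_index_containing(toy_docnames, v) for v in toy_volumes]
print(toy_starts)
```

All of these approaches assume the document names sort in reading order, so that the first match for a volume really is its first segment.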

There are of course many other topics that appear in our fit of the corpus. Looping
over the topics and saving an image for each topic is straightforward:

In [None]:
window = 15  # rolling mean window
for i in range(model.num_topics):
    fig = plt.figure(figsize=(20,20))
    series = model.topic_document.iloc[i].values
    xs = np.arange(len(series))
    series_smooth = rolling_mean(series, window)
    plt.plot(xs, series, '.')
    # align each mean with the last observation in its window
    plt.plot(xs[window - 1:], series_smooth, '-', linewidth=2)
    plt.title("Topic {}: {}".format(i, ','.join(model.topics.iloc[i, :10])))
    savefig_fn = "/tmp/hugo-topic{}.pdf".format(i)
    plt.savefig(savefig_fn, format='pdf')
    plt.close(fig)  # release the figure before moving on to the next topic

<a id='fn-les-mis'></a>
**[1]** The text of Les Misérables has been used in a variety of
(interactive) visualization projects, including [Les Misérables
Co-occurrence](http://bost.ocks.org/mike/miserables/) and [Novel Views:
Les Miserables](http://neoformix.com/2013/NovelViews.html).

<a id='fn-lowess'></a>
**[2]** For generic smoothing, those accustomed to using R will be
familiar with the function `loess()`, which implements the most common form
of scatterplot smoothing. In Python a similar function
(`statsmodels.api.nonparametric.lowess()`) is available in the `statsmodels`
package. While we might be tempted to use such a function to communicate
visually the basic trend, we will be better served if we think of the
sequence of topic shares as a proper time series rather than (merely)
a sequence of dependent and independent variables suitable for visualization
in a scatter plot.