# Topic Modelling and Interactive Visuals in R

The [*Catalogue of Political and Personal Satires*](https://www.britishmuseum.org/collection/term/BIB294) is the primary reference work for the study of British satirical prints of the 18th and 19th centuries hosted by the Department of Prints and Drawings in the British Museum. It features things like ["Mr Justic Bull's Decision in the case of Genuine Tea versus anti Genuine"](https://www.britishmuseum.org/collection/object/P_1868-0808-8405):

![Mr Justic Bull's Decision in the case of Genuine Tea versus anti Genuine](1274772001.jpg)

In 2018, there was a project by the [Sussex Humanities Lab](https://www.sussex.ac.uk/research/centres/sussex-humanities-lab/) titled [*Curatorial Voice: legacy descriptions of art objects and their contemporary uses*](https://curatorialvoice.github.io/). It looked at the metadata of the BM Satire catalogue and focused on using methods to analyze the 1.5 million words written by the historian [Mary Dorothy George](https://en.wikipedia.org/wiki/M._Dorothy_George) between 1935 and 1954 to describe 12,552 Georgian satirical prints.

For this tutorial, we will be using some of the [data collected](https://github.com/CuratorialVoice/code/tree/v1.0) during this project, and rather than looking specifically at the descriptions of Mary Dorothy George, we'll look at descriptions from the BM Satire catalogue of prints which were published during the early modern period (1500-1800).

[Here's a video explaining topic modelling](https://youtu.be/gN2x_KjJI1o) by Prof. Graham. Essentially, topic modelling is the process of identifying themes or "topics" in a set of documents: the text is broken down into however many topics you specify, and what each topic actually is gets worked out from how words relate to each other (if you want to learn more, [see here](http://www.scottbot.net/HIAL/index.html@p=19113.html)). LDA topic modelling is most commonly used for analyzing large text corpora, but by applying it to the metadata of items in the BM Satire collection, we can look at both the language the curators used to describe the items in the collection and how the topics found in the collection are spread across the early modern period, based on their proportion and original publication date.
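
To make that idea a little more concrete, here is a minimal sketch in base R with made-up numbers (the words, topics, and probabilities are all assumptions for illustration, not output from the BM Satire data). It shows the two pieces an LDA model works with: each topic is a probability distribution over words, and each document is a probability distribution over topics.

```r
# Toy example with made-up numbers: 2 topics over a 4-word vocabulary.
# Each row of toy_beta is one topic: a probability distribution over words.
toy_beta <- matrix(c(0.50, 0.40, 0.05, 0.05,
                     0.05, 0.05, 0.50, 0.40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("topic1", "topic2"),
                                   c("king", "crown", "tea", "tax")))

# Each row of toy_theta is one document: a probability distribution over topics.
toy_theta <- matrix(c(0.9, 0.1,
                      0.2, 0.8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2"),
                                    c("topic1", "topic2")))

# The word probabilities the model expects for each document are the topic
# distributions weighted by that document's topic proportions.
toy_word_probs <- toy_theta %*% toy_beta
print(round(toy_word_probs, 3))

# every row is still a probability distribution (sums to 1)
print(rowSums(toy_word_probs))
```

Fitting a topic model runs this in reverse: it starts from the observed words and estimates the two tables.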

**How this data has already been cleaned**:
- Converted to CSV
- Removed documents outside of 1500-1800
- Removed formatters
- Removed duplicates (marked by █)
- Made double quotes within text singular

(Make a habit of documenting how you cleaned your data; it allows those using your notebook to evaluate the changes you've made and determine how these changes may affect the data.)


## Creating the Topic Models

Since this is a tutorial and we know the packages we'll be using, let's start by loading those, then read in our data!

In [None]:
# cleaning things up
library(tidyverse)
library(tidytext)

# a package for stopwords that goes beyond what tidytext provides
library(stopwords)

# for creating topic models
library(reshape2)
library(topicmodels)

# palettes 
library(RColorBrewer)
library(pals)

# graphing and making it interactive
library(ggplot2)
library(ggiraph)

# making streamgraphs!
library(streamgraph)

# for saving the graph
library(htmlwidgets)
library(IRdisplay)

# read in data
bmSatire <- read_csv("data/BMSatire-EM.csv")

And now that we have all that, we can place our CSV data into a data frame and break it down into individual words. By placing it all in a `tibble`, we can ensure the data in each column is uniformly "typed" (text is all text, years are all numbers) so there's no unexpected behaviours/errors down the line.

In [None]:
# In our tibble "bmSatire_df":
# the column "id" will correspond to the id in the CSV
# and we are stripping the text column of numbers so it is uniformly "text"

bmSatire_df <- tibble(id = bmSatire$id, 
                text = (str_remove_all(bmSatire$desc, "[0-9]")), 
                date = bmSatire$publicationDate)

# check out what it looks like now!
print("Showing: bmSatire_df")
bmSatire_df

tidy_bmSatire <- bmSatire_df %>%
  unnest_tokens(word, text)

# all of the words from the document descriptions
print("tidy_bmSatire")
tidy_bmSatire

Even just from scrolling through the table of words generated through tokenization, you can get an idea of what these documents might be about, and perhaps what topics might appear. Of course, all of those "a" and "of" words are obscuring the view! So, it's time to remove the stop words.
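
If you'd like to see how quickly those small words take over, here is a rough base R sketch on two made-up descriptions (they are not real catalogue records). It lower-cases the text, splits it on anything that isn't a letter, and counts the tokens. `unnest_tokens()` is more careful than this, so treat it as an illustration of the idea only.

```r
# Two made-up descriptions standing in for the catalogue text (not real records).
toy_desc <- c("A portrait of the King and the Queen",
              "The King at the head of a procession of ministers")

# lower-case, then split on anything that is not a letter
toy_tokens <- unlist(strsplit(tolower(toy_desc), "[^a-z]+"))
toy_tokens <- toy_tokens[toy_tokens != ""]

# count each token and list the most frequent first
toy_counts <- sort(table(toy_tokens), decreasing = TRUE)
print(head(toy_counts, 5))
```

Even with only two sentences, "the" and "of" come out on top, which is exactly why they need to go before modelling.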

**Note**: There is both Spanish and French included in the descriptions of these documents, so we will be combining some multilingual stopword lists from [Quanteda](https://github.com/quanteda/stopwords) with the English set from Tidytext.

In [None]:
# get Spanish and French stopwords
langSet <- c(stopwords(language = "es", source = "nltk"),
                  stopwords(language = "fr", source = "nltk"))

# format them to match our data aka put them in a tibble 
my_stopwords <- tibble(word = langSet, lexicon = "nltk")

# combine with tidytext stop_words
# (named all_stopwords so it doesn't share a name with the stopwords() function)
all_stopwords <- bind_rows(stop_words, my_stopwords)

# delete stopwords from our data, matching on the "word" column
tidy_bmSatire <- tidy_bmSatire %>%
  anti_join(all_stopwords, by = "word")

# look at tidy_bmSatire now!
tidy_bmSatire

Now that all the stop words are gone, does anything catch your eye? When looking at your own data, this step is an excellent place to pause and explore, as it may reveal unique details that can put the work you're studying into context. For example, in this dataset, you may notice the token "ca". I initially thought it was an error neither I nor Tidytext caught, but when investigating further by returning to the original dataset and searching "ca", I discovered this was in reference to the French revolutionary war song, ["Ça Ira"](https://en.wikipedia.org/wiki/%C3%87a_Ira).

Now we can start prepping for topic modelling by first counting the word occurrences, and then placing these counts into a [document term matrix](https://bookdown.org/Maxine/tidy-text-mining/tidying-a-document-term-matrix.html).
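
If you haven't met a document term matrix before, here is a tiny base R sketch of the shape we're aiming for. The ids, words, and counts are made up for illustration, and `xtabs()` is used here only as a stand-in: `cast_dtm()` produces a sparse object meant for large collections, but the rows-are-documents, columns-are-terms layout is the same idea.

```r
# Made-up word counts in the same long format as a count(id, word) result.
toy_counts_df <- data.frame(
  id   = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  word = c("king", "crown", "king", "tea", "tea"),
  n    = c(3, 1, 1, 2, 4)
)

# xtabs() spreads the long table into a documents-by-terms matrix,
# filling in 0 wherever a word never appears in a document.
toy_dtm <- xtabs(n ~ id + word, data = toy_counts_df)
print(toy_dtm)
```

Most cells in a real document term matrix are zero, since any one description only uses a small slice of the vocabulary.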

In [None]:
# this line will take a couple of seconds to finish!
bmSatire_words <- tidy_bmSatire %>%
  count(id, word, sort = TRUE)

# take a look at the frequencies... lots of "eye" mentions at the top, why?
bmSatire_words

# turn that into a matrix
# dtm = document term matrix
dtm <- bmSatire_words %>%
  cast_dtm(id, word, n)

And now... the moment you've all been waiting for... topic modelling! Let's look at it step-by-step:

In [None]:
# this is how many topics we're going to generate
K <- 15

# topic models need random numbers for calculating the topics 
# setting the "seed" (the base number) lets anyone running this code generate the same sequence of random numbers every time these calculations are done
# so we'll always have the same results!
set.seed(9161)

# here's a solid, simple explanation of what LDA modelling is: http://bit.do/ELI5-LDA
# compute the LDA model, inference via 1000 iterations of Gibbs sampling
# this will also take a bit of time to run
topicModel <- LDA(dtm, K, method="Gibbs", control=list(iter = 1000, verbose = 25))
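
If the seed idea feels abstract, this small base R sketch shows it with plain random draws instead of a topic model (the variable names are made up for this example). Resetting the same seed before each draw gives the same "random" numbers both times.

```r
# draw 5 random numbers after setting the seed
set.seed(9161)
toy_first <- sample(1:100, 5)

# reset the same seed and draw again
set.seed(9161)
toy_second <- sample(1:100, 5)

# the two draws are identical because they started from the same seed
print(identical(toy_first, toy_second))
```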

In [None]:
# have a look at some of the results (posterior distributions)
tmResult <- posterior(topicModel)

# format of the resulting object
attributes(tmResult)

In [None]:
# the number of columns is the lengthOfVocab
ncol(dtm)

# topics are probability distributions over the entire lengthOfVocab
beta <- tmResult$terms   # get beta from results
dim(beta)                # K distributions over ncol(dtm) terms

# the number of rows is size of collection
nrow(dtm)

# for every item in the dataset we have a probability distribution of its contained topics
theta <- tmResult$topics 
dim(theta)                # number of documents distributed over K topics

In [None]:
# the way this works means that every topic we generated has every word, but in different proportions
# so let's pull the top 7 terms from every topic to see how the proportions were distributed and thus
# figure out what each topic is about
top7termsPerTopic <- terms(topicModel, 7)
topicNames <- apply(top7termsPerTopic, 2, paste, collapse=" ")

# look at your topics:
for (t in topicNames) {
    print(t)
}
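
To see what "pulling the top terms" means under the hood, here is a hedged base R sketch on a made-up topic-by-word table (the numbers and words are assumptions, and this is only an illustration of the idea, not how `terms()` is implemented). For each topic we sort the words by probability and keep the first few.

```r
# Made-up topic-word probabilities: rows are topics, columns are words.
toy_topic_words <- matrix(c(0.50, 0.40, 0.05, 0.05,
                            0.05, 0.05, 0.50, 0.40),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("topic1", "topic2"),
                                          c("king", "crown", "tea", "tax")))

# for each topic (row), sort the words by probability and keep the top 2
# the result has one column per topic
toy_top_terms <- apply(toy_topic_words, 1,
                       function(p) names(sort(p, decreasing = TRUE))[1:2])

# glue each topic's top words together to use as a readable label
toy_topic_names <- apply(toy_top_terms, 2, paste, collapse = " ")
print(toy_topic_names)
```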

**Note**: If you find that the model captures too much (or not enough) variability, try changing up the number of topics generated or the number of iterations in the LDA. 

Now that we have these topics generated, let's put the data into some form of context. You can learn from it just by looking at the topics, but how do the topic proportions change over time?

## Making a `ggplot2` Graph Interactive

First things first, we need to get our topics and their proportions associated with the date range from the original dataset:

In [None]:
# this is creating groupings of decades from the original list of dates
# using substr() we grab the first 3 characters of the listed year, then attach a "0" to the end of that
# ex. from 1743, substr() takes just "174" then pastes a "0", so it becomes 1740
bmSatire$decade <- paste0(substr(bmSatire$publicationDate, 1, 3), "0")

# the rows of theta follow the document order of the dtm (see rownames(theta)),
# which is not guaranteed to match the row order of the CSV, so we line each document up with its decade by id
docDecade <- bmSatire$decade[match(rownames(theta), as.character(bmSatire$id))]

# we get the mean topic proportions per decade by averaging
# each document's topic proportions (theta) within each decade
topic_proportion_per_decade <- aggregate(theta, by = list(decade = docDecade), mean)

# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(K+1)] <- topicNames

# reshape the data frame created for visualization
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")

# make sure nothing looks fishy:
vizDataFrame
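
Here is the same decade-averaging step on a tiny made-up example, in case you want to check the logic by hand (all ids, years, and proportions are assumptions for illustration). The toy proportions are deliberately stored in a different order from the metadata, so `match()` is used to line each row up with the right decade before averaging.

```r
# Made-up metadata: four documents and their publication years.
toy_meta <- data.frame(id = c("a", "b", "c", "d"),
                       year = c(1743, 1748, 1751, 1759))

# 1743 -> "174" -> "1740"
toy_meta$decade <- paste0(substr(toy_meta$year, 1, 3), "0")

# Made-up topic proportions, with rows in a different order than toy_meta.
toy_doc_topics <- matrix(c(0.2, 0.8,
                           0.6, 0.4,
                           0.4, 0.6,
                           0.8, 0.2),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(c("c", "a", "d", "b"), c("t1", "t2")))

# look up each row's decade by document id rather than by position
toy_decade <- toy_meta$decade[match(rownames(toy_doc_topics), toy_meta$id)]

# average the topic proportions within each decade
toy_mean <- aggregate(toy_doc_topics, by = list(decade = toy_decade), FUN = mean)
print(toy_mean)
```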

All your data looks good? Okay! Here's how to place it in a simple stacked barplot:

In [None]:
# make a custom colour palette!
# brewer.pal() asks for the number of colours to pull from a palette, then the name of the palette from the ColorBrewer palette library
# colorRampPalette() then stretches those colours out; the argument after, "(15)", is how many colours we need for this graph (one per topic)
mycolors <- colorRampPalette(brewer.pal(11, "Spectral"))(15)

options(repr.plot.width = 17, repr.plot.height = 11)

# a stacked bar plot of topic proportions per decade
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
    geom_bar(stat = "identity") +   # we want the heights of the bars to represent values in the data, so we set stat to "identity"
    ylab("proportion") + # then label the y axis
    scale_fill_manual(values = mycolors) + # add our palette
    theme(axis.text.x = element_text(angle = 90, hjust = 1), text = element_text(size=17)) # this angles the labels (the dates) that are on the x-axis
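
If you change the number of topics later, the palette needs to grow or shrink with it. Here is a small base R sketch of how `colorRampPalette()` stretches a few colours into as many as you ask for. The three hex colours and the `toy_` names are assumptions picked for illustration; they are not the full "Spectral" palette.

```r
# how many colours we want (one per topic)
toy_K <- 15

# colorRampPalette() returns a FUNCTION that interpolates between the colours given
toy_ramp <- colorRampPalette(c("#D53E4F", "#FFFFBF", "#3288BD"))

# calling that function with a number returns that many hex colours
toy_colors <- toy_ramp(toy_K)
print(toy_colors)
print(length(toy_colors))
```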

Honestly, that's a pretty nice looking graph! It's clearly labelled and has a legend that you can reference to know what colour represents which topic. But wouldn't it be nice if, instead of having to reference a legend to understand what you're looking at, you could just mouse over each bar to see what they mean? This could make looking at the topic proportions a lot easier, so let's do it!

In [None]:
# this is to make a "tooltip", a box that will appear when you hover over a bar
# we will have it show the topic name and the specific proportion 
# '<br/>' is the html line break-- it places the topic and proportion on separate lines for easy reading
ttText <- paste(vizDataFrame$variable, "<br/>", vizDataFrame$value)

# put our graph into a variable
gg_bar <- ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
    # we change the "geom_bar" (the part that makes the bars) to geom_bar_interactive 
    # I removed the legend for this example, but you can keep it there if you want 
    geom_bar_interactive(stat = "identity", show.legend = FALSE, tooltip = ttText, data_id = ttText) + 
    ylab("proportion") +
    scale_fill_manual(values = mycolors) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

# optional: make the tooltip a bit more aesthetically pleasing with some CSS styling
tooltip_css <- "background-color:gray;color:white;font-style:italic;padding:10px;border-radius:5px;font-family:Arial;"

stackedGph <- girafe(ggobj = gg_bar) %>%
    # this pipe is optional, but can make exactly what you're looking at clearer
    # try removing it and compare the plots generated (don't forget to actually remove the pipe, "%>%", too!)
    girafe_options(
      opts_hover_inv(css = "opacity:0.5;"),
      opts_hover(css = "fill:clear;stroke:black;"),
      opts_tooltip(css = tooltip_css)
    )

# to save an interactive plot we need to export it as an HTML widget
# we tell saveWidget that we want to save 'stackedGph'
# and it saves in our working directory
saveWidget(stackedGph, 'stackedGph.html')

# to include this visual on a webpage or in your markdown documents, paste what is between the parentheses where you want the visual in your document!
# aka use iframes: https://www.w3schools.com/html/html_iframe.asp
display_html('<iframe src="stackedGph.html" width="900px" height="770px"></iframe>')
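
One optional tweak: the raw proportions in the tooltip have a lot of decimal places. This base R sketch shows one way to build tidier tooltip text by formatting the values as percentages. The toy data frame and its values are made up for illustration, and rounding to one decimal place is just an assumption about what reads well.

```r
# Made-up rows in the same shape as the reshaped plotting data.
toy_viz <- data.frame(decade   = c("1740", "1750"),
                      variable = c("king crown", "tea tax"),
                      value    = c(0.07342, 0.2519))

# topic name, an HTML line break, then the proportion as a rounded percentage
toy_tooltip <- paste0(toy_viz$variable, "<br/>",
                      sprintf("%.1f%%", 100 * toy_viz$value))
print(toy_tooltip)
```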

This can be done with almost any plot made using the `ggplot2` package! You can add `_interactive` to almost all forms of geometry like `geom_bar`, and create a plot with tooltips and other forms of interactivity that can make it more engaging for those exploring your visualisation. Check out the [`ggiraph` documentation](https://davidgohel.github.io/ggiraph/index.html) for more ways you can interact with your visuals in R!

Overall, this visual is now very easy to explore and clearly shows how the topic proportions are broken down over each decade; however, some topics are visually broken up because of how the graph is separated into bars, making it more difficult to follow how each topic changes across the Early Modern period. How can we show this same data, but with more flow?

## The `streamgraph` library

The answer is streamgraphs! For a detailed overview of streamgraphs, check out this article by [Andy Kirk](https://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/), but to summarize, streamgraphs are a form of visualisation meant to show the "flow" of topics over a period of time. One of the features that makes this visualisation unique is that there is **no y-axis**. The vertical position in the graph (so, whether the topic is on the top, in the middle, or on the bottom) is NOT significant; only the proportions are, and these proportions "ebb and flow" over time in relation to each other.
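
To see why the vertical position carries no meaning, here is a rough base R sketch of one common way to stack layers around a centre line instead of up from zero. The proportions are made up, and this only illustrates the general centred-stacking idea; it is not a description of how any particular package draws its streams.

```r
# Made-up proportions: rows are decades, columns are topics.
toy_props <- matrix(c(0.2, 0.5, 0.3,
                      0.6, 0.3, 0.1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("1740", "1750"), c("t1", "t2", "t3")))

# A stacked bar starts every stack at 0. A centred stream starts each stack
# at minus half its total, so the layers sit symmetrically around a centre line.
toy_baseline <- -rowSums(toy_props) / 2

# upper edge of each layer = baseline + running total of the layers so far
toy_upper <- toy_baseline + t(apply(toy_props, 1, cumsum))

# lower edge of each layer = its upper edge minus its own thickness
toy_lower <- toy_upper - toy_props

print(toy_lower)
print(toy_upper)
```

Look at `t2`: its lower edge moves from -0.3 to 0.1 between the two decades purely because `t1` underneath it grew. Only the thickness of each layer (upper minus lower) reflects the topic's proportion.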

Let's take our data and create a streamgraph so you can better understand what you just read:

In [None]:
strgph <- streamgraph(vizDataFrame, key="variable", value="value", date="decade",
                  width="920px", height="350px") %>%
                  sg_fill_manual(values = mycolors) %>%
                  sg_axis_y(0) %>%   # make sure to hide the y-axis so significance is not misattributed
                  sg_legend(TRUE, "Topic: ")

# check out your streamgraph!
saveWidget(strgph, 'strgph.html')
display_html('<iframe src="strgph.html" width="991" height="450px"></iframe>') 

Nice and simple to plot thanks to the [dedicated streamgraph package](https://github.com/hrbrmstr/streamgraph) for R! As you can see, this graph resembles a timeline, but instead of plotting events, it shows each topic's proportion in relation to the others over time, and adds the connectivity that the stacked bar graph lacks. If you want to look at one specific topic without having to select it manually by hovering, you can select it from the dropdown and it will highlight by itself!

Time to explore! What insight can you gain on this data from looking at these visualisations? What could be the significance of that large spike around 1570? Do you notice any language in these descriptions that seems to be persistent? You can record any observations you make in your weekly notes.

Now that you've learned all that, why not download this notebook onto your own machine and try running your own data through this code to see what happens? You can even go further than that and download the full R script for this notebook [here](https://github.com/ChantalMB/SaPP-workbook/blob/master/notebooks/topic-models/tm-sg.R) to open in RStudio and experiment with. Play with how it's visualized, or WHAT is visualized.

Here are some resources to get you started:
- [R Graph Gallery](https://www.r-graph-gallery.com/index.html) for more ways to visualize data
- [Tidytext](https://www.tidytextmining.com/) for cleaning your data
- [Colour palettes in R](https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/)
