
helper functions for assessing the quality and utility of structural topic models


ABindoff/stmQuality


A collection of helper functions used to assess the "quality" or utility of structural topic models (Roberts, Stewart, & Tingley 2018: "stm: R Package for Structural Topic Models").

Background: To assist in the development of dementia knowledge and dementia literacy assessment tools to inform health and community policy, a large body of discussion forum posts relating to dementia was analysed. Topics were identified, along with their prevalence in different community cohorts, using structural topic models.

There is no way of knowing a priori how many topics exist within a corpus, so for each analysis several fits were attempted over a range of topic counts K using the manyTopics function in the stm R package. The Pareto-dominant fit for each value of K was selected (the method is described in the help file for stm::manyTopics), which leaves the decision of which K to choose. Ideally, the model should identify meaningful and interpretable topics, so choosing the "best" model is a subjective assessment undertaken by an informed reader. The aim of this package is to reduce the candidate set of models that an informed reader needs to assess.
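As a rough sketch of the fitting step described above, the wrapper below calls stm::manyTopics over several values of K. It assumes `out` is the list returned by stm::prepDocuments and that the metadata contains a column named `cohort`; both names, and the default K values and number of runs, are hypothetical choices for illustration, not those used in the motivating project.

```r
# Sketch only: `out` is assumed to be the result of stm::prepDocuments(), and
# `cohort` is a hypothetical metadata column used as a prevalence covariate.
fit_candidates <- function(out, K = c(10, 15, 20), runs = 20) {
  stm::manyTopics(
    documents  = out$documents,
    vocab      = out$vocab,
    K          = K,            # one selected model is returned per value of K
    prevalence = ~ cohort,
    data       = out$meta,
    runs       = runs          # candidate fits attempted for each K
  )
}

# Usage (requires the stm package and a prepared corpus):
# candidates <- fit_candidates(out)
# candidates$out[[1]]          # the selected model for the first K
# candidates$semcoh[[1]]       # per-topic semantic coherence for that model
# candidates$exclusivity[[1]]  # per-topic exclusivity for that model
```

Fitting many runs for many values of K can be slow on a large corpus, so smaller values of `runs` may be preferable for a first pass.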

The stm::topicQuality function conveniently plots semantic coherence and exclusivity for each topic, and so provides a method of assessing topic quality within a single topic model. It does not provide an assessment of model quality across a set of candidate models. My first attempt at addressing this problem was to standardise these measures and plot semantic coherence against exclusivity for each model on a common scale. This was a reasonable approach, and interested users can refer to the example code.
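The standardisation idea can be sketched in base R as follows. The scores are made-up numbers standing in for the per-topic output of stm::semanticCoherence and stm::exclusivity, and pooling all topics from all candidate models before standardising is an assumption about how the common scale is formed; the repository's example code may differ.

```r
# Hypothetical per-topic scores for two candidate models (K = 3 and K = 4).
# With fitted models these would come from stm::semanticCoherence() and
# stm::exclusivity(); the values here are invented for illustration.
quality <- data.frame(
  K      = c(3, 3, 3, 4, 4, 4, 4),
  semcoh = c(-80, -95, -110, -70, -90, -120, -100),
  exclus = c(9.1, 9.4, 9.8, 9.0, 9.5, 9.9, 9.6)
)

# Standardise each measure over the pooled topics so that models with
# different K can be compared on a common scale (mean 0, sd 1).
quality$semcoh_z <- as.numeric(scale(quality$semcoh))
quality$exclus_z <- as.numeric(scale(quality$exclus))

# Plot standardised exclusivity against standardised semantic coherence,
# labelling each topic with the K of the model it belongs to.
if (interactive()) {
  plot(quality$semcoh_z, quality$exclus_z, type = "n",
       xlab = "Semantic coherence (standardised)",
       ylab = "Exclusivity (standardised)")
  text(quality$semcoh_z, quality$exclus_z, labels = quality$K)
}
```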

topic quality plot

It quickly became apparent that decisions based on these relationships alone didn't necessarily favour models with strong exemplars (readily identified using stm::findThoughts). Due to the nature of the motivating project, the marginal thetas were less important than the conditional thetas. In other words, topics that pervade the corpus but that no one is really discussing in a direct way are probably not that important in assessing community dementia literacy. For example, people might be worried about their parents developing dementia, but if we don't know what their specific concerns are (residential care, medication, community support, palliation, etc.) then we can't develop useful literacy assessment tools.

One compromise that so far seems to favour models with strong exemplars is to take, for each topic, the mean theta over some number of exemplars (found using the stm::findThoughts or stmQuality::findThoughts0 functions) and perform principal components analysis on three variables: semantic coherence, exclusivity, and mean theta. A visual assessment of each model can then be made by plotting the transformed values. A convenience function, pcaPlot.stm, is provided for this.
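The following base R sketch illustrates the idea on simulated data. It is not the implementation of pcaPlot.stm; the helper `mean_top_theta`, the simulated theta matrix, and the quality scores are all hypothetical. Exemplars are taken to be the n documents with the highest theta for each topic, which is how stm::findThoughts selects them.

```r
# Mean theta over the n strongest exemplar documents for each topic.
# theta: documents x topics matrix of topic proportions, shaped like the
# $theta element of a fitted stm model.
mean_top_theta <- function(theta, n = 3) {
  apply(theta, 2, function(col) {
    mean(sort(col, decreasing = TRUE)[seq_len(min(n, length(col)))])
  })
}

# Simulated theta for 20 documents and 4 topics (each row sums to 1).
set.seed(1)
raw   <- matrix(rgamma(20 * 4, shape = 0.5), nrow = 20)
theta <- raw / rowSums(raw)

# One row per topic; semcoh and exclus are invented values standing in for
# stm::semanticCoherence() and stm::exclusivity() output.
topics <- data.frame(
  semcoh     = c(-80, -95, -110, -70),
  exclus     = c(9.1, 9.4, 9.8, 9.0),
  mean_theta = mean_top_theta(theta, n = 3)
)

# PCA on the three centred and scaled variables.
pca <- prcomp(topics, center = TRUE, scale. = TRUE)

# Plot topics on the first two principal components.
if (interactive()) {
  plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
}
```

Scaling the variables before the PCA matters here because semantic coherence, exclusivity, and theta are measured on very different scales.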

topic quality PCA plot
