topiclabels

Automated Topic Labeling with Language Models

topiclabels leverages (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios (see also our Vignette).

References

Related work:

Topic models (selection):

Contribution

This R package is licensed under the GPLv3. For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the issue tracker. Pull requests are welcome and will be included at the discretion of the author.

Installation

You can install the latest CRAN release using

install.packages("topiclabels")

To install the development version, use devtools:

devtools::install_github("PetersFritz/topiclabels")

(Quick Start) Example

library("topiclabels")

First, store your Hugging Face token in the variable token. If you do not have a token, create a Hugging Face account and generate a token based on this guideline.

token = "" # set your hf token here
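To avoid hard-coding the token in a script, one option is to read it from an environment variable. The sketch below assumes the token has been stored under the name HF_TOKEN; that name is chosen for illustration only and is not something the package prescribes.

```r
# Read the Hugging Face token from an environment variable instead of
# hard-coding it. HF_TOKEN is an assumed variable name, not a package default.
token = Sys.getenv("HF_TOKEN", unset = "")

if (!nzchar(token)) {
  message('HF_TOKEN is not set; set it with Sys.setenv(HF_TOKEN = "<your token>") or in ~/.Renviron.')
}
```

With this approach the token stays out of version-controlled scripts.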

We would now like to label two topics, one with the three top terms zidane, figo, kroos and the other with the three top terms gas, power, wind. Here, we show three typical variants of topic representation:

topwords_matrix = matrix(c("zidane", "figo", "kroos", "gas", "power", "wind"), ncol = 2)
topwords_list = list(c("zidane", "figo", "kroos"), c("gas", "power", "wind"))
topwords_vector = c("zidane, figo, kroos", "gas, power, wind")

A common way to represent top terms is a matrix structure.

topwords_matrix
     [,1]     [,2]   
[1,] "zidane" "gas"  
[2,] "figo"   "power"
[3,] "kroos"  "wind"

Topics do not need to be characterized by the same number of top terms. In that case, the input must be given in the following list format:

topwords_list
[[1]]
[1] "zidane" "figo"   "kroos" 

[[2]]
[1] "gas"   "power" "wind"
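As a small illustration of the list format, the following base-R sketch converts a term matrix with one topic per column into a list and builds a list whose topics have different numbers of terms. The extra term "solar" and the variable names topwords_from_matrix and topwords_uneven are made up for this example.

```r
topwords_matrix = matrix(c("zidane", "figo", "kroos", "gas", "power", "wind"), ncol = 2)

# One list element per matrix column (i.e., per topic)
topwords_from_matrix = lapply(seq_len(ncol(topwords_matrix)),
                              function(j) topwords_matrix[, j])

# Topics with different numbers of top terms are passed as a list
topwords_uneven = list(c("zidane", "figo", "kroos"),
                       c("gas", "power", "wind", "solar"))
lengths(topwords_uneven)
# [1] 3 4
```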

If you have stored the top terms of each topic as a single comma-separated string (e.g., in a column of a data table), the input may look like this:

topwords_vector
[1] "zidane, figo, kroos" "gas, power, wind"
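If you would rather pass each term as a separate element, such strings can be split with base R. This is only a sketch of one possible preprocessing step under the assumption that terms are separated by commas; the variable name topwords_split is made up, and we make no claim that splitting changes the resulting labels.

```r
topwords_vector = c("zidane, figo, kroos", "gas, power, wind")

# Split on commas and trim surrounding whitespace;
# the result holds one character vector per topic
topwords_split = lapply(strsplit(topwords_vector, ","), trimws)
topwords_split[[2]]
# [1] "gas"   "power" "wind"
```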

Any of the following three calls generates the labels for the two topics:

label_topics(topwords_matrix, token = token)
label_topics(topwords_list, token = token)
label_topics(as.list(topwords_vector), token = token)

The printed result looks like this:

lm_topic_labels object generated using mistralai/Mixtral-8x7B-Instruct-v0.1
 1: Real Madrid Midfielders [zidane, figo, kroos]
 2: Renewable Energy [gas, power, wind]

The returned object also stores the raw outputs generated by the language model, which can be helpful if the default postprocessing function did not produce a proper label for some topics.

obj = label_topics(topwords_matrix, token = token)
names(obj)
# [1] "terms"       
# [2] "prompts"     
# [3] "model"       
# [4] "params"      
# [5] "with_token"  
# [6] "time"        
# [7] "model_output"
# [8] "labels"      
obj$model_output
# [1] "\n\n{\n\"label\": \"Real Madrid Midfielders\"\n}"
# [2] "\n\n{\n\"label\": \"Renewable Energy\"\n}"
obj$labels
# [1] "Real Madrid Midfielders"
# [2] "Renewable Energy"
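If the default postprocessing does not return a usable label, the raw outputs can be parsed by hand. The following base-R sketch assumes the output contains a "label": "..." pair as in the example above; extract_label is a hypothetical helper written for this illustration, not a function of the package.

```r
# Hypothetical helper: pull the value of "label" out of raw model outputs.
# Returns NA for outputs that do not contain a "label": "..." pair.
extract_label = function(model_output) {
  pattern = '"label"[[:space:]]*:[[:space:]]*"([^"]*)"'
  hits = regmatches(model_output, regexec(pattern, model_output))
  vapply(hits, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

raw = c("\n\n{\n\"label\": \"Real Madrid Midfielders\"\n}",
        "no structured output")
extract_label(raw)
# [1] "Real Madrid Midfielders" NA
```

Outputs that come back as NA are the ones worth inspecting manually.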

Feel free to try the following examples as well, and see the package Vignette for further reading.

label_topics(list(c("zidane", "figo", "ronaldo"), c("gas", "power", "wind")), token = token)
label_topics(list("wind", "greta", "hambach"), token = token)
label_topics(list("wind", "fire", "air"), token = token)
label_topics(list("wind", "feuer", "luft"), token = token)
label_topics(list("wind", "feuer", "luft"), context = "Elements of the Earth", token = token)
