Merge branch 'master' into similarities
JonasRieger committed Aug 26, 2020
2 parents e0fb1fb + f645b5d commit 0a7068a
Showing 30 changed files with 252 additions and 93 deletions.
3 changes: 2 additions & 1 deletion .Rbuildignore
@@ -3,4 +3,5 @@
ignore
\.travis\.yml
LICENSE
CODE_OF_CONDUCT.md
CODE_OF_CONDUCT.md
paper
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: ldaPrototype
Type: Package
Title: Prototype of Multiple Latent Dirichlet Allocation Runs
Version: 0.1.1
Date: 2020-01-08
Version: 0.2.0
Date: 2020-06-23
Authors@R: person("Jonas", "Rieger", email="jonas.rieger@tu-dortmund.de", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-0007-4478"))
Description: Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) measuring its similarities with S-CLOP: A procedure to select the LDA run with highest mean pairwise similarity, which is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by its assignments leading to estimators for distribution parameters. Repeated runs lead to different results, which we encounter by choosing the most representative LDA run as prototype.
URL: https://github.com/JonasRieger/ldaPrototype, https://doi.org/10.5281/zenodo.3597978
1 change: 1 addition & 0 deletions NAMESPACE
@@ -20,6 +20,7 @@ S3method(getID,LDARep)
S3method(getID,PrototypeLDA)
S3method(getJob,LDABatch)
S3method(getJob,LDARep)
S3method(getJob,PrototypeLDA)
S3method(getK,LDA)
S3method(getLDA,LDABatch)
S3method(getLDA,LDARep)
2 changes: 1 addition & 1 deletion R/LDA.R
@@ -94,7 +94,7 @@ is.LDA = function(obj, verbose = FALSE){
return(FALSE)
}

emptyLDA = LDA(param = .getDefaultParameters())
emptyLDA = LDA(param = .getDefaultParameters(1))
if (length(setdiff(names(obj), names(emptyLDA))) != 0 ||
length(intersect(names(obj), names(emptyLDA))) != 6){
if (verbose) message("object does not contain exactly the list elements of an \"LDA\" object")
5 changes: 4 additions & 1 deletion R/LDABatch.R
@@ -20,6 +20,7 @@
#' Documents as received from \code{\link[tosca]{LDAprep}}.
#' @param vocab [\code{character}]\cr
#' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
#' @param n [\code{integer(1)}]\cr
#' Number of Replications.
#' @param seeds [\code{integer(n)}]\cr
@@ -34,6 +35,8 @@
#' Computational resources for the jobs to submit. See \code{\link[batchtools]{submitJobs}}.
#' @param ... additional arguments passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
#' Arguments will be coerced to a vector of length \code{n}.
#' Default parameters are \code{alpha = eta = 1/K} and \code{num.iterations = 200}.
#' There is no default for \code{K}.
#'
#' @return [\code{named list}] with entries \code{id} for the registry's folder name,
#' \code{jobs} for the submitted jobs' ids and its parameter settings and
@@ -89,7 +92,7 @@ LDABatch = function(docs, vocab, n = 100, seeds, id = "LDABatch", load = FALSE,
moreArgs = data.table(do.call(cbind, .paramList(n = n, ...)))

if (missing(seeds) || length(seeds) != n){
message("No seeds given or length of given seeds differs from number of replications: sample seeds")
message("No seeds given or length of given seeds differs from number of replications: sample seeds. Sampled seeds can be obtained via getJob().")
if (!exists(".Random.seed", envir = globalenv())){
runif(1)
}
3 changes: 2 additions & 1 deletion R/LDAPrototype.R
@@ -1,6 +1,6 @@
#' @title Determine the Prototype LDA
#'
#' @description Performs multiple runs of LDA and returns the Prototype LDA of
#' @description Performs multiple runs of LDA and computes the Prototype LDA of
#' this set of LDAs.
#'
#' @details While \code{LDAPrototype} marks the overall shortcut for performing
@@ -21,6 +21,7 @@
#' @inheritParams LDARep
#' @param vocabLDA [\code{character}]\cr
#' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
#' @param vocabMerge [\code{character}]\cr
#' Vocabularies taken into consideration for merging topic matrices.
#' @param limit.rel [0,1]\cr
5 changes: 4 additions & 1 deletion R/LDARep.R
@@ -18,6 +18,7 @@
#' Documents as received from \code{\link[tosca]{LDAprep}}.
#' @param vocab [\code{character}]\cr
#' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
#' @param n [\code{integer(1)}]\cr
#' Number of Replications.
#' @param seeds [\code{integer(n)}]\cr
@@ -34,6 +35,8 @@
#' default is determined by \code{\link[future]{availableCores}}.
#' @param ... additional arguments passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
#' Arguments will be coerced to a vector of length \code{n}.
#' Default parameters are \code{alpha = eta = 1/K} and \code{num.iterations = 200}.
#' There is no default for \code{K}.
#' @return [\code{named list}] with entries \code{id} for computation's name,
#' \code{jobs} for the parameter settings and \code{lda} for the results itself.
#'
@@ -56,7 +59,7 @@ LDARep = function(docs, vocab, n = 100, seeds, id = "LDARep", pm.backend, ncpus,

args = .paramList(n = n, ...)
if (missing(seeds) || length(seeds) != n){
message("No seeds given or length of given seeds differs from number of replications: sample seeds")
message("No seeds given or length of given seeds differs from number of replications: sample seeds. Sampled seeds can be obtained via getJob().")
if (!exists(".Random.seed", envir = globalenv())){
runif(1)
}
2 changes: 1 addition & 1 deletion R/as.LDABatch.R
@@ -96,7 +96,7 @@ is.LDABatch = function(obj, verbose = FALSE){
if (verbose) message("jobs: ", appendLF = FALSE)
job = getJob(obj)
if (!is.data.table(job) ||
!all(c(names(.getDefaultParameters()), "job.id", "seed") %in% colnames(job))){
!all(c(names(.getDefaultParameters(1)), "job.id", "seed") %in% colnames(job))){
if (verbose) message("not a data.table with standard parameters")
return(FALSE)
}
6 changes: 3 additions & 3 deletions R/as.LDARep.R
@@ -72,13 +72,13 @@ as.LDARep.default = function(lda, job, id, ...){
setcolorder(job, "job.id")
}else{
if (is.vector(job)){
if (all(names(.getDefaultParameters()) %in% names(job))){
if (all(names(.getDefaultParameters(1)) %in% names(job))){
job = data.table(job.id = as.integer(names(lda)), t(job))
}else{
stop("Not all standard parameters are specified.")
}
}else{
if (all(c(names(.getDefaultParameters()), "job.id") %in% colnames(job))){
if (all(c(names(.getDefaultParameters(1)), "job.id") %in% colnames(job))){
job = as.data.table(job)
if (!all(union(job$job.id, names(lda)) %in% intersect(job$job.id, names(lda))) ||
nrow(job) != length(lda)){
@@ -153,7 +153,7 @@ is.LDARep = function(obj, verbose = FALSE){
if (verbose) message("jobs: ", appendLF = FALSE)
job = getJob(obj)
if (!is.data.table(job) ||
!all(c(names(.getDefaultParameters()), "job.id") %in% colnames(job))){
!all(c(names(.getDefaultParameters(1)), "job.id") %in% colnames(job))){
if (verbose) message("not a data.table with standard parameters")
return(FALSE)
}
2 changes: 1 addition & 1 deletion R/data.R
@@ -11,7 +11,7 @@
#' @format
#' \code{reuters_docs} is a list of documents of length 91 prepared by \code{\link[tosca]{LDAprep}}.
#'
#' \code{reuters_vocab} is a \code{character} vector of length 2141.
#' \code{reuters_vocab} is
#'
#' @usage data(reuters_docs)
#'
47 changes: 21 additions & 26 deletions R/getPrototype.R
@@ -32,6 +32,12 @@
#' @param id [\code{character(1)}]\cr
#' A name for the computation. If not passed, it is set to "LDARep".
#' Not considered for \code{\link{LDABatch}} or \code{\link{LDARep}} objects.
#' @param job [\code{\link{data.frame}} or \code{named vector}]\cr
#' A data.frame or data.table with named columns (at least)
#' "job.id" (\code{integerish}), "K", "alpha", "eta" and "num.iterations"
#' or a named vector with entries (at least) "K", "alpha", "eta" and "num.iterations".
#' If not passed, it is interpreted from \code{param} of each LDA.
#' Not considered for \code{\link{LDABatch}} or \code{\link{LDARep}} objects.
#' @param limit.rel [0,1]\cr
#' See \code{\link{jaccardTopics}}. Default is \code{1/500}.
#' Not considered for calculation, if \code{sclop} is passed. But should be
@@ -74,10 +80,11 @@
#'
#' @return [\code{named list}] with entries
#' \describe{
#' \item{\code{id}}{[\code{character(1)}] See above.}
#' \item{\code{protoid}}{[\code{character(1)}] Name (ID) of the determined Prototype LDA.}
#' \item{\code{lda}}{List of \code{\link{LDA}} objects of the determined Prototype LDA
#' and - if \code{keepLDAs} is \code{TRUE} - all considered LDAs.}
#' \item{\code{protoid}}{[\code{character(1)}] Name (ID) of the determined Prototype LDA.}
#' \item{\code{id}}{[\code{character(1)}] See above.}
#' \item{\code{jobs}}{[\code{data.table}] with parameter specifications for the LDAs.}
#' \item{\code{param}}{[\code{named list}] with parameter specifications for
#' \code{limit.rel} [0,1], \code{limit.abs} [\code{integer(1)}] and
#' \code{atLeast} [\code{integer(1)}]. See above for explanation.}
@@ -122,7 +129,7 @@ getPrototype.PrototypeLDA = function(x, ...){

#' @rdname getPrototype
#' @export
getPrototype.LDABatch = function(x, vocab, limit.rel, limit.abs, atLeast,
getPrototype.LDARep = function(x, vocab, limit.rel, limit.abs, atLeast,
progress = TRUE, pm.backend, ncpus,
keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){

@@ -135,37 +142,20 @@ getPrototype.LDABatch = function(x, vocab, limit.rel, limit.abs, atLeast,
if (missing(sclop)) sclop = NULL
lda = getLDA(x)
id = getID(x)

NextMethod("getPrototype", lda = lda, vocab = vocab, id = id,
job = getJob(x)
NextMethod("getPrototype", lda = lda, vocab = vocab, id = id, job = job,
limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast,
progress = progress, pm.backend = pm.backend, ncpus = ncpus,
keepTopics = keepTopics, keepSims = keepSims, keepLDAs = keepLDAs, sclop = sclop)
}

#' @rdname getPrototype
#' @export
getPrototype.LDARep = function(x, vocab, limit.rel, limit.abs, atLeast,
progress = TRUE, pm.backend, ncpus,
keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){

if (missing(limit.rel)) limit.rel = .defaultLimit.rel()
if (missing(limit.abs)) limit.abs = .defaultLimit.abs()
if (missing(atLeast)) atLeast = .defaultAtLeast()
if (missing(vocab)) vocab = .defaultVocab(x)
if (missing(pm.backend)) pm.backend = NULL
if (missing(ncpus)) ncpus = NULL
if (missing(sclop)) sclop = NULL
lda = getLDA(x)
id = getID(x)
NextMethod("getPrototype", lda = lda, vocab = vocab, id = id,
limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast,
progress = progress, pm.backend = pm.backend, ncpus = ncpus,
keepTopics = keepTopics, keepSims = keepSims, keepLDAs = keepLDAs, sclop = sclop)
}
getPrototype.LDABatch = getPrototype.LDARep

#' @rdname getPrototype
#' @export
getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
getPrototype.default = function(lda, vocab, id, job, limit.rel, limit.abs, atLeast,
progress = TRUE, pm.backend, ncpus,
keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){

@@ -175,7 +165,12 @@ getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
if (missing(vocab)) vocab = .defaultVocab(lda)
if (missing(pm.backend)) pm.backend = NULL
if (missing(ncpus)) ncpus = NULL
if (missing(id)) id = "LDARep"
x = as.LDARep.default(lda = lda, job = job, id = id)
lda = getLDA(x)
id = getID(x)
job = getJob(x)
if(length(unique(job[,K])) > 1)
warning("Determination of a Prototype based on different number of topics (K) is not recommended!")
if (missing(sclop) || is.null(sclop)){
topics = mergeRepTopics(lda = lda, vocab = vocab, id = id, progress = progress)
sims = jaccardTopics(topics = topics, limit.rel = limit.rel, limit.abs = limit.abs,
@@ -198,7 +193,7 @@ getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
}
protoid = as.integer(names(lda)[which.max(colSums(sclop, na.rm = TRUE))])
if (!keepLDAs) lda = lda[which.max(colSums(sclop, na.rm = TRUE))]
res = list(lda = lda, protoid = protoid, id = id,
res = list(id = id, protoid = protoid, lda = lda, jobs = job,
param = list(limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast),
topics = topics, sims = sims, wordslimit = wordslimit,
wordsconsidered = wordsconsidered, sclop = sclop)
6 changes: 6 additions & 0 deletions R/getSCLOP.R
@@ -85,3 +85,9 @@ getID.PrototypeLDA = function(x){
getParam.PrototypeLDA = function(x){
x$param
}

#' @rdname getSCLOP
#' @export
getJob.PrototypeLDA = function(x){
x$jobs
}
6 changes: 3 additions & 3 deletions R/getTopics.R
@@ -20,9 +20,9 @@
#' number of assigned tokens in text \eqn{m} and \eqn{n_k} the total number of
#' assigned tokens to topic \eqn{k}.
#'
#' @references Griffiths, Thomas L. and Steyvers, Mark (2004). "Finding scientific topics".
#' In: \emph{Proceedings of the National Academy of Sciences} 101 (suppl 1), p.5228--5235,
#' DOI 10.1073/pnas.0307752101, URL \url{http://www.pnas.org/content/101/suppl_1/5228}
#' @references Griffiths, Thomas L. and Mark Steyvers (2004). "Finding scientific topics".
#' In: \emph{Proceedings of the National Academy of Sciences} \bold{101} (suppl 1), pp.5228--5235,
#' DOI 10.1073/pnas.0307752101, URL \url{https://doi.org/10.1073/pnas.0307752101}.
#'
#' @family getter functions
#' @family LDA functions
21 changes: 19 additions & 2 deletions R/ldaPrototype-package.R
@@ -8,7 +8,8 @@
#' distribution parameters. Repeated runs lead to different results, which we
#' encounter by choosing the most representative LDA run as prototype.\cr
#' For bug reports and feature requests please use the issue tracker:
#' \url{https://github.com/JonasRieger/ldaPrototype/issues}.
#' \url{https://github.com/JonasRieger/ldaPrototype/issues}. Also have a look at
#' the (detailed) example at \url{https://github.com/JonasRieger/ldaPrototype}.
#'
#' @section Data:
#' \code{\link{reuters}} Example Dataset (91 articles from Reuters) for testing.
@@ -41,6 +42,21 @@
#' \code{\link{LDAPrototype}} Shortcut which performs multiple LDAs and
#' determines their Prototype.
#'
#' @references
#' Rieger, Jonas (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent
#' Dirichlet Allocations. Journal of Open Source Software, \bold{5}(51), 2181,
#' DOI 10.21105/joss.02181, URL \url{https://doi.org/10.21105/joss.02181}.
#'
#' Rieger, Jonas, Jörg Rahnenführer and Carsten Jentsch (2020).
#' "Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype."
#' In: \emph{Natural Language Processing and Information Systems, NLDB 2020.} LNCS 12089, pp. 118--125,
#' DOI 10.1007/978-3-030-51310-8_11, URL \url{https://doi.org/10.1007/978-3-030-51310-8_11}.
#'
#' Rieger, Jonas, Lars Koppers, Carsten Jentsch and Jörg Rahnenführer (2020).
#' "Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs."
#' arXiv 2003.04980, URL \url{https://arxiv.org/abs/2003.04980}.
#'
#'
#' @import data.table
#' @import stats
#' @import checkmate
@@ -55,7 +71,8 @@

.getDefaultParameters = function(K){
if (missing(K)){
return(list(K = 100, alpha = 0.01, eta = 0.01, num.iterations = 200))
stop("Parameter K (number of modeled topics) must be set, no default!")
#return(list(K = 100, alpha = 0.01, eta = 0.01, num.iterations = 200))
}else{
return(list(K = K, alpha = 1/K, eta = 1/K, num.iterations = 200))
}
34 changes: 20 additions & 14 deletions README.md
@@ -8,6 +8,25 @@
## Prototype of Multiple Latent Dirichlet Allocation Runs
Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) measuring its similarities with S-CLOP: A procedure to select the LDA run with highest mean pairwise similarity, which is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by its assignments leading to estimators for distribution parameters. Repeated runs lead to different results, which we encounter by choosing the most representative LDA run as prototype.
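
The selection rule described above can be illustrated with a minimal base-R sketch. The similarity matrix is made-up toy data for four hypothetical runs (not output of the package), and `toy_sclop`, `mean_sim` and `toy_protoid` are illustrative names only. The snippet just shows the idea of picking the run with the highest mean pairwise similarity to all other runs.

```{R}
# Toy symmetric matrix of pairwise similarities between four hypothetical LDA runs.
# The diagonal is NA because a run is not compared with itself.
toy_sclop = matrix(c(  NA, 0.80, 0.60, 0.70,
                     0.80,   NA, 0.50, 0.90,
                     0.60, 0.50,   NA, 0.40,
                     0.70, 0.90, 0.40,   NA),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(1:4, 1:4))

# Mean similarity of each run to all other runs (NA diagonal is ignored)
mean_sim = colMeans(toy_sclop, na.rm = TRUE)

# The prototype is the run with the highest mean pairwise similarity
toy_protoid = as.integer(names(which.max(mean_sim)))
toy_protoid
```

Because every column has the same number of non-missing entries, ranking by column means and ranking by column sums select the same run.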

## References
* Rieger, J. (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations. [Journal of Open Source Software](https://doi.org/10.21105/joss.02181), 5(51), 2181.
* Rieger, J., Rahnenführer, J. & Jentsch, C. (2020). Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype. [Natural Language Processing and Information Systems, NLDB 2020.](https://doi.org/10.1007/978-3-030-51310-8_11) LNCS 12089, pp. 118-125.
* Rieger, J., Koppers, L., Jentsch, C. & Rahnenführer, J.: Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs. [working paper](https://arxiv.org/abs/2003.04980)

## Related Software
* [tm](https://CRAN.R-project.org/package=tm) is useful for preprocessing text data.
* [lda](https://CRAN.R-project.org/package=lda) offers a fast implementation of the Latent Dirichlet Allocation and is used by ``ldaPrototype``.
* [quanteda](https://quanteda.io/) is a framework for "Quantitative Analysis of Textual Data".
* [stm](https://www.structuraltopicmodel.com/) is a framework for Structural Topic Models.
* [tosca](https://github.com/Docma-TU/tosca) is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ``ldaPrototype``.
* [topicmodels](https://CRAN.R-project.org/package=topicmodels) is another framework for various topic models based on the Latent Dirichlet Allocation and Correlated Topics Models.
* [mallet](https://github.com/mimno/RMallet) provides an interface for the Java based machine learning tool [MALLET](http://mallet.cs.umass.edu/).

## Contribution
This R package is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the [issue tracker](https://github.com/JonasRieger/ldaPrototype/issues).
Pull requests are welcome and will be included at the discretion of the author.

## Installation
```{R}
install.packages("ldaPrototype")
@@ -103,7 +122,7 @@ n2 = getConsideredWords(jacc)
#### Step 3.1: Representation of Topic Similarities as Dendrogram
It is possible to represent the calculated pairwise topic similarities as a dendrogram using ``dendTopics`` and related ``plot`` options.
```{R}
dend = dendTopics(sims)
dend = dendTopics(jacc)
plot(dend)
```
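
To give a rough idea of what a dendrogram over topic similarities looks like, here is a base-R sketch on toy data. It assumes complete-linkage hierarchical clustering on the distance `1 - similarity`; that choice is an assumption for illustration and may differ from what ``dendTopics`` does internally. The matrix values and the names `toy_sims`, `toy_hc`, `toy_dend` and `toy_clusters` are hypothetical.

```{R}
# Toy similarity matrix for four topics from two hypothetical LDA runs
topic_names = c("run1.topic1", "run2.topic1", "run1.topic2", "run2.topic2")
toy_sims = matrix(c(1.0, 0.7, 0.1, 0.2,
                    0.7, 1.0, 0.2, 0.1,
                    0.1, 0.2, 1.0, 0.6,
                    0.2, 0.1, 0.6, 1.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(topic_names, topic_names))

# Turn similarities into distances and cluster hierarchically
toy_hc = hclust(as.dist(1 - toy_sims), method = "complete")
toy_dend = as.dendrogram(toy_hc)

# Cutting the tree into two groups pairs up the matching topics of both runs
toy_clusters = cutree(toy_hc, k = 2)
toy_clusters
```

Calling `plot(toy_dend)` draws the tree. In this toy example each cluster contains exactly one topic from each run, which is the kind of grouping that indicates similar runs.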
The S-CLOP algorithm results in a pruning state of the dendrogram, which can be retrieved by calling ``pruneSCLOP``. By default, each topic is colored according to the LDA run it belongs to; alternatively, cluster membership can be visualized by colors or by vertical lines with freely chosen parameters.
@@ -126,16 +145,3 @@ There are several possibilities for using shortcut functions to summarize steps o
```{R}
res3 = getPrototype(reps, atLeast = 3)
```
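
With this commit the result of ``getPrototype`` also stores the parameter settings of the considered runs in a ``jobs`` entry, accessible via ``getJob``. The following is a hedged sketch of how that might be used. It assumes ``ldaPrototype`` is installed in version 0.2.0 or later, and the small settings `n = 4`, `K = 10` and `seeds = 1:4` are chosen here only to keep the run short.

```{R}
# Only run if the package is available in a version that has getJob() for prototypes
if (requireNamespace("ldaPrototype", quietly = TRUE) &&
    utils::packageVersion("ldaPrototype") >= "0.2.0") {
  library(ldaPrototype)
  data(reuters_docs)
  data(reuters_vocab)

  # Four replicated LDA runs; K has no default and must be passed explicitly
  reps = LDARep(docs = reuters_docs, vocab = reuters_vocab,
                n = 4, K = 10, seeds = 1:4)
  res3 = getPrototype(reps, atLeast = 3)

  # Parameter settings (K, alpha, eta, num.iterations, ...) of the considered runs
  jobs = getJob(res3)
  print(jobs)
}
```

If all runs share the same ``K``, no warning about differing numbers of topics is raised when the prototype is determined.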

## Related Software
* [tm](https://CRAN.R-project.org/package=tm) is useful for preprocessing text data.
* [lda](https://CRAN.R-project.org/package=lda) offers a fast implementation of the Latent Dirichlet Allocation and is used by ``ldaPrototype``.
* [quanteda](https://quanteda.io/) is a framework for "Quantitative Analysis of Textual Data".
* [stm](https://www.structuraltopicmodel.com/) is a framework for Structural Topic Models.
* [tosca](https://CRAN.R-project.org/package=tosca) is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ``ldaPrototype``.
* [topicmodels](https://CRAN.R-project.org/package=topicmodels) is another framework for various topic models based on the Latent Dirichlet Allocation and Correlated Topics Models.

## Contribution
This R package is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the [issue tracker](https://github.com/JonasRieger/ldaPrototype/issues).
Pull requests are welcome and will be included at the discretion of the author.
