Merge branch 'master' into similarities

JonasRieger · Aug 26, 2020 · 0a7068a
2 parents e0fb1fb + f645b5d
Showing 30 changed files with 252 additions and 93 deletions.
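
One behavioural change in this diff is easy to miss: `.getDefaultParameters()` no longer falls back to `K = 100` and instead stops when `K` is missing, which appears to be why the validators now call it with a placeholder `K = 1` just to obtain the parameter names. A minimal sketch of that behaviour follows. The helper is copied from the changed lines in R/ldaPrototype-package.R (with the commented-out legacy return dropped) so that it runs without the package installed; the variable `msg` is introduced here for illustration only.

```r
# Copy of the internal helper as it reads after this commit.
.getDefaultParameters = function(K){
  if (missing(K)){
    stop("Parameter K (number of modeled topics) must be set, no default!")
  }else{
    return(list(K = K, alpha = 1/K, eta = 1/K, num.iterations = 200))
  }
}

# With K supplied, alpha and eta default to 1/K.
str(.getDefaultParameters(10))

# Without K the call now fails; tryCatch() captures the error message.
msg = tryCatch(.getDefaultParameters(), error = function(e) conditionMessage(e))
print(msg)

# Validators only need the parameter names, hence the placeholder K = 1.
names(.getDefaultParameters(1))
```

Because the placeholder is only passed through `names()`, its value does not influence the checks in `is.LDA`, `is.LDABatch`, `is.LDARep` or `as.LDARep.default`.
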
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -3,4 +3,5 @@
 ignore
 \.travis\.yml
 LICENSE
-CODE_OF_CONDUCT.md
+CODE_OF_CONDUCT.md
+paper
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: ldaPrototype
 Type: Package
 Title: Prototype of Multiple Latent Dirichlet Allocation Runs
-Version: 0.1.1
-Date: 2020-01-08
+Version: 0.2.0
+Date: 2020-06-23
 Authors@R: person("Jonas", "Rieger", email="jonas.rieger@tu-dortmund.de", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-0007-4478"))
 Description: Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) measuring its similarities with S-CLOP: A procedure to select the LDA run with highest mean pairwise similarity, which is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by its assignments leading to estimators for distribution parameters. Repeated runs lead to different results, which we encounter by choosing the most representative LDA run as prototype.
 URL: https://github.com/JonasRieger/ldaPrototype, https://doi.org/10.5281/zenodo.3597978

diff --git a/NAMESPACE b/NAMESPACE
@@ -20,6 +20,7 @@ S3method(getID,LDARep)
 S3method(getID,PrototypeLDA)
 S3method(getJob,LDABatch)
 S3method(getJob,LDARep)
+S3method(getJob,PrototypeLDA)
 S3method(getK,LDA)
 S3method(getLDA,LDABatch)
 S3method(getLDA,LDARep)

diff --git a/R/LDA.R b/R/LDA.R
@@ -94,7 +94,7 @@ is.LDA = function(obj, verbose = FALSE){
     return(FALSE)
   }
 
-  emptyLDA = LDA(param = .getDefaultParameters())
+  emptyLDA = LDA(param = .getDefaultParameters(1))
   if (length(setdiff(names(obj), names(emptyLDA))) != 0  ||
       length(intersect(names(obj), names(emptyLDA))) != 6){
     if (verbose) message("object does not contain exactly the list elements of an \"LDA\" object")

diff --git a/R/LDABatch.R b/R/LDABatch.R
@@ -20,6 +20,7 @@
 #' Documents as received from \code{\link[tosca]{LDAprep}}.
 #' @param vocab [\code{character}]\cr
 #' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
+#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
 #' @param n [\code{integer(1)}]\cr
 #' Number of Replications.
 #' @param seeds [\code{integer(n)}]\cr
@@ -34,6 +35,8 @@
 #' Computational resources for the jobs to submit. See \code{\link[batchtools]{submitJobs}}.
 #' @param ... additional arguments passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
 #' Arguments will be coerced to a vector of length \code{n}.
+#' Default parameters are \code{alpha = eta = 1/K} and \code{num.iterations = 200}.
+#' There is no default for \code{K}.
 #'
 #' @return [\code{named list}] with entries \code{id} for the registry's folder name,
 #' \code{jobs} for the submitted jobs' ids and its parameter settings and
@@ -89,7 +92,7 @@ LDABatch = function(docs, vocab, n = 100, seeds, id = "LDABatch", load = FALSE,
   moreArgs = data.table(do.call(cbind, .paramList(n = n, ...)))
 
   if (missing(seeds) || length(seeds) != n){
-    message("No seeds given or length of given seeds differs from number of replications: sample seeds")
+    message("No seeds given or length of given seeds differs from number of replications: sample seeds. Sampled seeds can be obtained via getJob().")
     if (!exists(".Random.seed", envir = globalenv())){
       runif(1)
     }

diff --git a/R/LDAPrototype.R b/R/LDAPrototype.R
@@ -1,6 +1,6 @@
 #' @title Determine the Prototype LDA
 #'
-#' @description Performs multiple runs of LDA and returns the Prototype LDA of
+#' @description Performs multiple runs of LDA and computes the Prototype LDA of
 #' this set of LDAs.
 #'
 #' @details While \code{LDAPrototype} marks the overall shortcut for performing
@@ -21,6 +21,7 @@
 #' @inheritParams LDARep
 #' @param vocabLDA [\code{character}]\cr
 #' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
+#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
 #' @param vocabMerge [\code{character}]\cr
 #' Vocabularies taken into consideration for merging topic matrices.
 #' @param limit.rel [0,1]\cr

diff --git a/R/LDARep.R b/R/LDARep.R
@@ -18,6 +18,7 @@
 #' Documents as received from \code{\link[tosca]{LDAprep}}.
 #' @param vocab [\code{character}]\cr
 #' Vocabularies passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
+#' For additional (and necessary) arguments passed, see ellipsis (three-dot argument).
 #' @param n [\code{integer(1)}]\cr
 #' Number of Replications.
 #' @param seeds [\code{integer(n)}]\cr
@@ -34,6 +35,8 @@
 #' default is determined by \code{\link[future]{availableCores}}.
 #' @param ... additional arguments passed to \code{\link[lda]{lda.collapsed.gibbs.sampler}}.
 #' Arguments will be coerced to a vector of length \code{n}.
+#' Default parameters are \code{alpha = eta = 1/K} and \code{num.iterations = 200}.
+#' There is no default for \code{K}.
 #' @return [\code{named list}] with entries \code{id} for computation's name,
 #' \code{jobs} for the parameter settings and \code{lda} for the results itself.
 #'
@@ -56,7 +59,7 @@ LDARep = function(docs, vocab, n = 100, seeds, id = "LDARep", pm.backend, ncpus,
 
   args = .paramList(n = n, ...)
   if (missing(seeds) || length(seeds) != n){
-    message("No seeds given or length of given seeds differs from number of replications: sample seeds")
+    message("No seeds given or length of given seeds differs from number of replications: sample seeds. Sampled seeds can be obtained via getJob().")
     if (!exists(".Random.seed", envir = globalenv())){
       runif(1)
     }

diff --git a/R/as.LDABatch.R b/R/as.LDABatch.R
@@ -96,7 +96,7 @@ is.LDABatch = function(obj, verbose = FALSE){
   if (verbose) message("jobs: ", appendLF = FALSE)
   job = getJob(obj)
   if (!is.data.table(job) ||
-      !all(c(names(.getDefaultParameters()), "job.id", "seed") %in% colnames(job))){
+      !all(c(names(.getDefaultParameters(1)), "job.id", "seed") %in% colnames(job))){
     if (verbose) message("not a data.table with standard parameters")
     return(FALSE)
   }

diff --git a/R/as.LDARep.R b/R/as.LDARep.R
@@ -72,13 +72,13 @@ as.LDARep.default = function(lda, job, id, ...){
     setcolorder(job, "job.id")
   }else{
     if (is.vector(job)){
-      if (all(names(.getDefaultParameters()) %in% names(job))){
+      if (all(names(.getDefaultParameters(1)) %in% names(job))){
         job = data.table(job.id = as.integer(names(lda)), t(job))
       }else{
         stop("Not all standard parameters are specified.")
       }
     }else{
-      if (all(c(names(.getDefaultParameters()), "job.id") %in% colnames(job))){
+      if (all(c(names(.getDefaultParameters(1)), "job.id") %in% colnames(job))){
         job = as.data.table(job)
         if (!all(union(job$job.id, names(lda)) %in% intersect(job$job.id, names(lda))) ||
             nrow(job) != length(lda)){
@@ -153,7 +153,7 @@ is.LDARep = function(obj, verbose = FALSE){
   if (verbose) message("jobs: ", appendLF = FALSE)
   job = getJob(obj)
   if (!is.data.table(job) ||
-      !all(c(names(.getDefaultParameters()), "job.id") %in% colnames(job))){
+      !all(c(names(.getDefaultParameters(1)), "job.id") %in% colnames(job))){
     if (verbose) message("not a data.table with standard parameters")
     return(FALSE)
   }

diff --git a/R/data.R b/R/data.R
@@ -11,7 +11,7 @@
 #' @format
 #' \code{reuters_docs} is a list of documents of length 91 prepared by \code{\link[tosca]{LDAprep}}.
 #'
-#' \code{reuters_vocab} is a \code{character} vector of length 2141.
+#' \code{reuters_vocab} is
 #'
 #' @usage data(reuters_docs)
 #'

diff --git a/R/getPrototype.R b/R/getPrototype.R
@@ -32,6 +32,12 @@
 #' @param id [\code{character(1)}]\cr
 #' A name for the computation. If not passed, it is set to "LDARep".
 #' Not considered for \code{\link{LDABatch}} or \code{\link{LDARep}} objects.
+#' @param job [\code{\link{data.frame}} or \code{named vector}]\cr
+#' A data.frame or data.table with named columns (at least)
+#' "job.id" (\code{integerish}), "K", "alpha", "eta" and "num.iterations"
+#' or a named vector with entries (at least) "K", "alpha", "eta" and "num.iterations".
+#' If not passed, it is interpreted from \code{param} of each LDA.
+#' Not considered for \code{\link{LDABatch}} or \code{\link{LDARep}} objects.
 #' @param limit.rel [0,1]\cr
 #' See \code{\link{jaccardTopics}}. Default is \code{1/500}.
 #' Not considered for calculation, if \code{sclop} is passed. But should be
@@ -74,10 +80,11 @@
 #'
 #' @return [\code{named list}] with entries
 #'  \describe{
+#'   \item{\code{id}}{[\code{character(1)}] See above.}
+#'   \item{\code{protoid}}{[\code{character(1)}] Name (ID) of the determined Prototype LDA.}
 #'   \item{\code{lda}}{List of \code{\link{LDA}} objects of the determined Prototype LDA
 #'   and - if \code{keepLDAs} is \code{TRUE} - all considered LDAs.}
-#'   \item{\code{protoid}}{[\code{character(1)}] Name (ID) of the determined Prototype LDA.}
-#'   \item{\code{id}}{[\code{character(1)}] See above.}
+#'   \item{\code{jobs}}{[\code{data.table}] with parameter specifications for the LDAs.}
 #'   \item{\code{param}}{[\code{named list}] with parameter specifications for
 #'   \code{limit.rel} [0,1], \code{limit.abs} [\code{integer(1)}] and
 #'   \code{atLeast} [\code{integer(1)}]. See above for explanation.}
@@ -122,7 +129,7 @@ getPrototype.PrototypeLDA = function(x, ...){
 
 #' @rdname getPrototype
 #' @export
-getPrototype.LDABatch = function(x, vocab, limit.rel, limit.abs, atLeast,
+getPrototype.LDARep = function(x, vocab, limit.rel, limit.abs, atLeast,
   progress = TRUE, pm.backend, ncpus,
   keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){
 
@@ -135,37 +142,20 @@ getPrototype.LDABatch = function(x, vocab, limit.rel, limit.abs, atLeast,
   if (missing(sclop)) sclop = NULL
   lda = getLDA(x)
   id = getID(x)
-
-  NextMethod("getPrototype", lda = lda, vocab = vocab, id = id,
+  job = getJob(x)
+  NextMethod("getPrototype", lda = lda, vocab = vocab, id = id, job = job,
     limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast,
     progress = progress, pm.backend = pm.backend, ncpus = ncpus,
     keepTopics = keepTopics, keepSims = keepSims, keepLDAs = keepLDAs, sclop = sclop)
 }
 
 #' @rdname getPrototype
 #' @export
-getPrototype.LDARep = function(x, vocab, limit.rel, limit.abs, atLeast,
-  progress = TRUE, pm.backend, ncpus,
-  keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){
-
-  if (missing(limit.rel)) limit.rel = .defaultLimit.rel()
-  if (missing(limit.abs)) limit.abs = .defaultLimit.abs()
-  if (missing(atLeast)) atLeast = .defaultAtLeast()
-  if (missing(vocab)) vocab = .defaultVocab(x)
-  if (missing(pm.backend)) pm.backend = NULL
-  if (missing(ncpus)) ncpus = NULL
-  if (missing(sclop)) sclop = NULL
-  lda = getLDA(x)
-  id = getID(x)
-  NextMethod("getPrototype", lda = lda, vocab = vocab, id = id,
-    limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast,
-    progress = progress, pm.backend = pm.backend, ncpus = ncpus,
-    keepTopics = keepTopics, keepSims = keepSims, keepLDAs = keepLDAs, sclop = sclop)
-}
+getPrototype.LDABatch = getPrototype.LDARep
 
 #' @rdname getPrototype
 #' @export
-getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
+getPrototype.default = function(lda, vocab, id, job, limit.rel, limit.abs, atLeast,
   progress = TRUE, pm.backend, ncpus,
   keepTopics = FALSE, keepSims = FALSE, keepLDAs = FALSE, sclop, ...){
 
@@ -175,7 +165,12 @@ getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
   if (missing(vocab)) vocab = .defaultVocab(lda)
   if (missing(pm.backend)) pm.backend = NULL
   if (missing(ncpus)) ncpus = NULL
-  if (missing(id)) id = "LDARep"
+  x = as.LDARep.default(lda = lda, job = job, id = id)
+  lda = getLDA(x)
+  id = getID(x)
+  job = getJob(x)
+  if(length(unique(job[,K])) > 1)
+    warning("Determination of a Protoype based on different number of topics (K) is not recommended!")
   if (missing(sclop) || is.null(sclop)){
     topics = mergeRepTopics(lda = lda, vocab = vocab, id = id, progress = progress)
     sims = jaccardTopics(topics = topics, limit.rel = limit.rel, limit.abs = limit.abs,
@@ -198,7 +193,7 @@ getPrototype.default = function(lda, vocab, id, limit.rel, limit.abs, atLeast,
   }
   protoid = as.integer(names(lda)[which.max(colSums(sclop, na.rm = TRUE))])
   if (!keepLDAs) lda = lda[which.max(colSums(sclop, na.rm = TRUE))]
-  res = list(lda = lda, protoid = protoid, id = id,
+  res = list(id = id, protoid = protoid, lda = lda, jobs = job,
     param = list(limit.rel = limit.rel, limit.abs = limit.abs, atLeast = atLeast),
     topics = topics, sims = sims, wordslimit = wordslimit,
     wordsconsidered = wordsconsidered, sclop = sclop)

diff --git a/R/getSCLOP.R b/R/getSCLOP.R
@@ -85,3 +85,9 @@ getID.PrototypeLDA = function(x){
 getParam.PrototypeLDA = function(x){
   x$param
 }
+
+#' @rdname getSCLOP
+#' @export
+getJob.PrototypeLDA = function(x){
+  x$jobs
+}
diff --git a/R/getTopics.R b/R/getTopics.R
@@ -20,9 +20,9 @@
 #' number of assigned tokens in text \eqn{m} and \eqn{n_k} the total number of
 #' assigned tokens to topic \eqn{k}.
 #'
-#' @references Griffiths, Thomas L. and Steyvers, Mark (2004). "Finding scientific topics".
-#' In: \emph{Proceedings of the National Academy of Sciences} 101 (suppl 1), p.5228--5235,
-#' DOI 10.1073/pnas.0307752101, URL \url{http://www.pnas.org/content/101/suppl_1/5228}
+#' @references Griffiths, Thomas L. and Mark Steyvers (2004). "Finding scientific topics".
+#' In: \emph{Proceedings of the National Academy of Sciences} \bold{101} (suppl 1), pp.5228--5235,
+#' DOI 10.1073/pnas.0307752101, URL \url{https://doi.org/10.1073/pnas.0307752101}.
 #'
 #' @family getter functions
 #' @family LDA functions

diff --git a/R/ldaPrototype-package.R b/R/ldaPrototype-package.R
@@ -8,7 +8,8 @@
 #' distribution parameters. Repeated runs lead to different results, which we
 #' encounter by choosing the most representative LDA run as prototype.\cr
 #' For bug reports and feature requests please use the issue tracker:
-#' \url{https://github.com/JonasRieger/ldaPrototype/issues}.
+#' \url{https://github.com/JonasRieger/ldaPrototype/issues}. Also have a look at
+#' the (detailed) example at \url{https://github.com/JonasRieger/ldaPrototype}.
 #'
 #' @section Data:
 #' \code{\link{reuters}} Example Dataset (91 articles from Reuters) for testing.
@@ -41,6 +42,21 @@
 #' \code{\link{LDAPrototype}} Shortcut which performs multiple LDAs and
 #' determines their Prototype.
 #'
+#' @references
+#' Rieger, Jonas (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent
+#' Dirichlet Allocations. Journal of Open Source Software, \bold{5}(51), 2181,
+#' DOI 10.21105/joss.02181, URL \url{https://doi.org/10.21105/joss.02181}.
+#'
+#' Rieger, Jonas, Jörg Rahnenführer and Carsten Jentsch (2020).
+#' "Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype."
+#' In: \emph{Natural Language Processing and Information Systems, NLDB 2020.} LNCS 12089, pp. 118--125,
+#' DOI 10.1007/978-3-030-51310-8_11, URL \url{https://doi.org/10.1007/978-3-030-51310-8_11}.
+#'
+#' Rieger, Jonas, Lars Koppers, Carsten Jentsch and Jörg Rahnenführer (2020).
+#' "Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs."
+#' arXiv 2003.04980, URL \url{https://arxiv.org/abs/2003.04980}.
+#'
+#'
 #' @import data.table
 #' @import stats
 #' @import checkmate
@@ -55,7 +71,8 @@
 
 .getDefaultParameters = function(K){
   if (missing(K)){
-    return(list(K = 100, alpha = 0.01, eta = 0.01, num.iterations = 200))
+    stop("Parameter K (number of modeled topics) must be set, no default!")
+    #return(list(K = 100, alpha = 0.01, eta = 0.01, num.iterations = 200))
   }else{
     return(list(K = K, alpha = 1/K, eta = 1/K, num.iterations = 200))
   }

diff --git a/README.md b/README.md
@@ -8,6 +8,25 @@
 ## Prototype of Multiple Latent Dirichlet Allocation Runs
 Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) measuring its similarities with S-CLOP: A procedure to select the LDA run with highest mean pairwise similarity, which is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by its assignments leading to estimators for distribution parameters. Repeated runs lead to different results, which we encounter by choosing the most representative LDA run as prototype.
 
+## References
+* Rieger, J. (2020). ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations. [Journal of Open Source Software](https://doi.org/10.21105/joss.02181), 5(51), 2181.
+* Rieger, J., Rahnenführer, J. & Jentsch, C. (2020). Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype. [Natural Language Processing and Information Systems, NLDB 2020.](https://doi.org/10.1007/978-3-030-51310-8_11) LNCS 12089, pp. 118-125.
+* Rieger, J., Koppers, L., Jentsch, C. & Rahnenführer, J.: Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs. [working paper](https://arxiv.org/abs/2003.04980)
+
+## Related Software
+* [tm](https://CRAN.R-project.org/package=tm) is useful for preprocessing text data.
+* [lda](https://CRAN.R-project.org/package=lda) offers a fast implementation of the Latent Dirichlet Allocation and is used by ``ldaPrototype``.
+* [quanteda](https://quanteda.io/) is a framework for "Quantitative Analysis of Textual Data".
+* [stm](https://www.structuraltopicmodel.com/) is a framework for Structural Topic Models.
+* [tosca](https://github.com/Docma-TU/tosca) is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ``ldaPrototype``.
+* [topicmodels](https://CRAN.R-project.org/package=topicmodels) is another framework for various topic models based on the Latent Dirichlet Allocation and Correlated Topics Models.
+* [mallet](https://github.com/mimno/RMallet) provides an interface for the Java based machine learning tool [MALLET](http://mallet.cs.umass.edu/).
+
+## Contribution
+This R package is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
+For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the [issue tracker](https://github.com/JonasRieger/ldaPrototype/issues).
+Pull requests are welcome and will be included at the discretion of the author.
+
 ## Installation
 ```{R}
 install.packages("ldaPrototype")
@@ -103,7 +122,7 @@ n2 = getConsideredWords(jacc)
 #### Step 3.1: Representation of Topic Similarities as Dendrogram
 It is possible to represent the calulcated pairwise topic similarities as dendrogram using ``dendTopics`` and related ``plot`` options.
 ```{R}
-dend = dendTopics(sims)
+dend = dendTopics(jacc)
 plot(dend)
 ```
 The S-CLOP algorithm results in a pruning state of the dendrogram, which can be retrieved calling ``pruneSCLOP``. By default each of the topics is colorized by its LDA run belonging; but the cluster belongings can also be visualized by the colors or by vertical lines with freely chosen parameters.
@@ -126,16 +145,3 @@ There are several possibilites for using shortcut functions to summarize steps o
 ```{R}
 res3 = getPrototype(reps, atLeast = 3)
 ```
-
-## Related Software
-* [tm](https://CRAN.R-project.org/package=tm) is useful for preprocessing text data.
-* [lda](https://CRAN.R-project.org/package=lda) offers a fast implementation of the Latent Dirichlet Allocation and is used by ``ldaPrototype``.
-* [quanteda](https://quanteda.io/) is a framework for "Quantitative Analysis of Textual Data".
-* [stm](https://www.structuraltopicmodel.com/) is a framework for Structural Topic Models.
-* [tosca](https://CRAN.R-project.org/package=tosca) is a framework for statistical methods in content analysis including visualizations and validation techniques. It is also useful for managing and manipulating text data to a structure requested by ``ldaPrototype``.
-* [topicmodels](https://CRAN.R-project.org/package=topicmodels) is another framework for various topic models based on the Latent Dirichlet Allocation and Correlated Topics Models.
-
-## Contribution
-This R package is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
-For bug reports (lack of documentation, misleading or wrong documentation, unexpected behaviour, ...) and feature requests please use the [issue tracker](https://github.com/JonasRieger/ldaPrototype/issues).
-Pull requests are welcome and will be included at the discretion of the author.