dirichlet -> Dirichlet
lkoppers committed Aug 31, 2018
1 parent 4098e4a commit 846f442
Showing 2 changed files with 8 additions and 8 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -2,12 +2,12 @@ Package: tosca
Type: Package
Title: Tools for Statistical Content Analysis
Version: 0.1-0
- Date: 2018-08-30
+ Date: 2018-08-31
Authors@R: c(person("Lars", "Koppers", email="koppers@statistik.tu-dortmund.de", role=c("aut", "cre"), comment = c(ORCID = "0000-0002-1642-9616")),
person("Jonas", "Rieger", email="riegerjonas@gmx.de", role=c("aut")),
person("Karin", "Boczek", email="karin.boczek@tu-dortmund.de", role=c("ctb"), comment = c(ORCID = "0000-0003-1516-4094")),
person("Gerret", "von Nordheim", email="gerret.vonnordheim@tu-dortmund.de", role=c("ctb"), comment = c(ORCID = "0000-0001-7553-3838")))
- Description: A framework for statistical analysis in content analysis. In addition to a pipeline for preprocessing text corpora and linking to the latent dirichlet allocation from the 'lda' package, plots are offered for the descriptive analysis of text corpora and topic models. In addition, an implementation of Chang's intruder words and intruder topics is provided.
+ Description: A framework for statistical analysis in content analysis. In addition to a pipeline for preprocessing text corpora and linking to the latent Dirichlet allocation from the 'lda' package, plots are offered for the descriptive analysis of text corpora and topic models. In addition, an implementation of Chang's intruder words and intruder topics is provided.
URL: https://github.com/Docma-TU/tosca
License: GPL (>= 2)
Encoding: UTF-8
12 changes: 6 additions & 6 deletions vignettes/Vignette.Rmd
@@ -17,7 +17,7 @@ vignette: >
\newpage
# Introduction

- This package provides different functions to explore text corpora with topic models. The package focuses on the visualisation and validation of content analysis. Therefore it provides some filters for preprocessing and a wrapper for the latent dirichlet allocation (lda) from the lda-package to include a topic model.
+ This package provides different functions to explore text corpora with topic models. The package focuses on the visualisation and validation of content analysis. Therefore it provides some filters for preprocessing and a wrapper for the latent Dirichlet allocation (lda) from the lda-package to include a topic model.
Most visualisations aim at the presentation of measures for corpora, subcorpora or topics from lda over time. To use this functionality every document needs a date specification as metadata. To harmonize different text sources we provide the S3 object \texttt{textmeta}.

The following table gives an overview of the functions in the package.
@@ -148,7 +148,7 @@ dups <- duplist(corpus)
```
Duplicates can be visualized over time with the function \texttt{plotScot}, which is explained in section 3.2.

- For further analysis, especially for performing the Latent Dirichlet Allocation, it is important that for each duplicate only one page is considered. The aim is therefore to reduce the corpus so that it contains all pages which appear only once and one representative page for each set of pages which appear twice or more. In our example the only duplicated texts contain the empty string \texttt{""} or short relics like \texttt{"\_\_NOTOC\_\_"} or \texttt{"* * *"}.
+ For further analysis, especially for performing the latent Dirichlet allocation, it is important that for each duplicate only one page is considered. The aim is therefore to reduce the corpus so that it contains all pages which appear only once and one representative page for each set of pages which appear twice or more. In our example the only duplicated texts contain the empty string \texttt{""} or short relics like \texttt{"\_\_NOTOC\_\_"} or \texttt{"* * *"}.

## Clean Corpus - \texttt{cleanTexts}
For further preprocessing of text corpora tosca offers the function \texttt{cleanTexts}. It removes punctuation, numbers and stopwords. By default it removes English stopwords, using the stopword list of the function \texttt{stopwords} from the \texttt{tm} package. For the German stopword list some additional words (different spellings) are included (e.g. "dass" and "fuer"). You can control which stopwords should be removed with the argument \texttt{sw}. In addition the function changes all words to lowercase and tokenizes the documents. The result is a \texttt{list} of \texttt{character} vectors, or, if \texttt{paragraph} is set to \texttt{TRUE} (default), a \texttt{list} of \texttt{lists} of \texttt{character} vectors. The sublists represent additional text structure like paragraphs of a document. If you pass a \texttt{textmeta} object instead of a \texttt{list} of texts you will also receive a \texttt{textmeta} object back. In this case you have to pass it to the parameter \texttt{object} instead of \texttt{text}.
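For readers without the package at hand, the kind of preprocessing \texttt{cleanTexts} performs can be sketched in a few lines of base R. This is an illustrative imitation only, not tosca's implementation; the function name \texttt{toy\_clean} and the stopword list are made up for the example:

```r
# Illustrative base-R sketch of cleanTexts-style preprocessing;
# tosca's real implementation additionally handles paragraph
# structure and the full tm stopword lists.
toy_stopwords <- c("the", "a", "and", "is")

toy_clean <- function(text, sw = toy_stopwords) {
  text <- tolower(text)                             # lowercase
  text <- gsub("[[:punct:][:digit:]]+", " ", text)  # drop punctuation/numbers
  tokens <- strsplit(text, "[[:space:]]+")[[1]]     # tokenize
  tokens[tokens != "" & !(tokens %in% sw)]          # remove stopwords
}

toy_clean("The LDA model, version 2, is popular!")
# c("lda", "model", "version", "popular")
```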
@@ -334,10 +334,10 @@ print(corpusFiltered)
The date and word filtered corpus consists of 451 documents compared to 3909 documents in the original \texttt{corpusDate} corpus.

# Latent Dirichlet Allocation
- The central analytical functionality in this package is to perform and analyse a Latent Dirichlet Allocation. The package provides the function \texttt{LDAgen} for performing the LDA, functions for validating the LDA results and various functions for visualizing the results in different ways, especially over time. It is possible to analyse individual articles as well as their topic allocations. In addition, tosca provides a function for preparing your corpus for a Latent Dirichlet Allocation. This function creates an object which can be passed to the function used for the LDA.
+ The central analytical functionality in this package is to perform and analyse a latent Dirichlet allocation. The package provides the function \texttt{LDAgen} for performing the LDA, functions for validating the LDA results and various functions for visualizing the results in different ways, especially over time. It is possible to analyse individual articles as well as their topic allocations. In addition, tosca provides a function for preparing your corpus for a latent Dirichlet allocation. This function creates an object which can be passed to the function used for the LDA.

## Transform Corpus - \texttt{LDAprep}
- The last step before performing a Latent Dirichlet Allocation is to create corpus data which can be passed to the function \texttt{lda.collapsed.gibbs.sampler} from the \texttt{lda} package or to the function \texttt{LDAgen} from this package, respectively. This is done by using the function \texttt{LDAprep} with its arguments \texttt{text} (the \texttt{text} component of a \texttt{textmeta} object) and \texttt{vocab} (a \texttt{character} vector containing the vocabulary). These are the words which are taken into account for the LDA.
+ The last step before performing a latent Dirichlet allocation is to create corpus data which can be passed to the function \texttt{lda.collapsed.gibbs.sampler} from the \texttt{lda} package or to the function \texttt{LDAgen} from this package, respectively. This is done by using the function \texttt{LDAprep} with its arguments \texttt{text} (the \texttt{text} component of a \texttt{textmeta} object) and \texttt{vocab} (a \texttt{character} vector containing the vocabulary). These are the words which are taken into account for the LDA.

You can have a look at the documentation of \texttt{lda.collapsed.gibbs.sampler} for further information about the LDA. The function \texttt{LDAprep} offers the option \texttt{reduce}, set to \texttt{TRUE} by default. The returned value is a \texttt{list} in which every entry represents an article and contains a matrix with two rows. The first row holds the index of each word in \texttt{vocab} minus one (the index starts at 0); the second row always contains a one, and the number of appearances of a word is given by the number of columns belonging to it. This structure is needed by \texttt{lda.collapsed.gibbs.sampler}.
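The two-row structure can be made concrete with a small hand-built example (constructed directly in base R, not produced by \texttt{LDAprep} itself); the vocabulary and document here are invented for illustration:

```r
# Toy example of the document encoding expected by
# lda.collapsed.gibbs.sampler: one integer matrix per document,
# row 1 = 0-based index into vocab, row 2 = always 1;
# a word occurring n times contributes n columns.
vocab <- c("topic", "model", "text")

doc <- matrix(as.integer(c(0, 1,    # "topic" (vocab index 1, minus one)
                           0, 1,    # "topic" again
                           2, 1)),  # "text"  (vocab index 3, minus one)
              nrow = 2)
documents <- list(doc)

# Recover the word counts from the encoding:
table(vocab[documents[[1]][1, ] + 1])
# "topic" occurs twice, "text" once
```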

@@ -353,7 +353,7 @@ head(sort(wordtableFiltered$wordtable, decreasing = TRUE))
words5 <- wordtableFiltered$words[wordtableFiltered$wordtable > 5]
pagesLDA <- LDAprep(text = corpusFiltered$text, vocab = words5)
```
- After receiving the words which appear at least six times in the whole filtered corpus, the function \texttt{LDAprep} is applied to the example corpus with \texttt{vocab = words5}. The object \texttt{pagesLDA} will be committed to the function which performs a Latent Dirichlet Allocation.
+ After receiving the words which appear at least six times in the whole filtered corpus, the function \texttt{LDAprep} is applied to the example corpus with \texttt{vocab = words5}. The object \texttt{pagesLDA} will be committed to the function which performs a latent Dirichlet allocation.

## Performing LDA - \texttt{LDAgen}
The function that has to be applied first to the corpus prepared by \texttt{LDAprep} is \texttt{LDAgen}. The function offers the options \texttt{K} (\texttt{integer}, default: \texttt{K = 100L}) to set the number of topics, \texttt{vocab} (\texttt{character} vector) for specifying the words which are considered in the preparation of the corpus, and several more, e.g. the number of iterations for the burn-in (default: \texttt{burnin = 70}) and the number of iterations for the Gibbs sampler (default: \texttt{num.iterations = 200}). The result is saved in an \texttt{R} workspace; the first part of the result's name can be specified by setting the option \texttt{folder} (default: \texttt{folder = paste0(tempdir(),"/lda-result")}). If you want to save your data permanently, you have to change the path to a non-temporary one.
@@ -570,7 +570,7 @@ LDAresult <- LDAgen(documents = pagesLDA, K = 10L, vocab = words5)
After generating the LDA model, further analysis depends on the specific aims of the project.

# Conclusion
- Our package tosca is an addition to the existing text mining packages on CRAN. It contains functions for a typical pipeline used for content analysis and uses the implementations of standard preprocessing from existing packages. Additionally, tosca provides functionality for visual exploration of corpora and of topics resulting from the latent dirichlet allocation. tosca focuses on analysis over time, so it needs texts with a date as metadata. The current version of the package offers an implementation of Chang's intruder topics and intruder words. For future versions a framework for effective sampling in (sub-)corpora is under preparation. There are plans for a better connection to the frameworks of the tm and the quanteda packages.
+ Our package tosca is an addition to the existing text mining packages on CRAN. It contains functions for a typical pipeline used for content analysis and uses the implementations of standard preprocessing from existing packages. Additionally, tosca provides functionality for visual exploration of corpora and of topics resulting from the latent Dirichlet allocation. tosca focuses on analysis over time, so it needs texts with a date as metadata. The current version of the package offers an implementation of Chang's intruder topics and intruder words. For future versions a framework for effective sampling in (sub-)corpora is under preparation. There are plans for a better connection to the frameworks of the tm and the quanteda packages.

```{r, include = FALSE, eval = FALSE}
library(knitr)
