Size of Corpus != size of summed up speeches #86

ChristophLeonhardt · 2019-05-28T12:09:09Z

I am not sure whether this is an actual issue, but I observed the following odd behavior in some corpora:

I get the size of a corpus as follows:

size_of_corpus <- size("SL")

If I split the corpus into speeches and merge the resulting partition bundle again, the size is the same:

sl_speeches <- as.speeches("SL", s_attribute_name = "speaker")
size_of_remerged_speeches <- merge(sl_speeches) %>% size()

size_of_corpus == size_of_remerged_speeches

The odd part is: If I iterate over the sl_speeches partition bundle and sum up the size of each speech, the resulting sum of sizes is slightly smaller than the actual size of the merged partition bundle.

size_of_summed_up_speeches <- lapply(sl_speeches@objects, function(x) x@size) %>% Reduce("+", .)

The entire sample script:

library(polmineR)
use("PopParl")
size_of_corpus <- size("SL")
sl_speeches <- as.speeches("SL", s_attribute_name = "speaker") 
size_of_remerged_speeches <- merge(sl_speeches) %>% size()

size_of_corpus == size_of_remerged_speeches # TRUE

size_of_summed_up_speeches <- lapply(sl_speeches@objects, function(x) x@size) %>% Reduce("+", .)

size_of_corpus == size_of_summed_up_speeches # FALSE

version: polmineR 0.7.11.9023


PolMine · 2019-05-31T09:19:54Z

This is certainly an issue! Thanks for raising it. A new polmineR version on the development branch (v0.7.11.9024) addresses this and removes the bug.

devtools::install_github("PolMine/polmineR", ref = "dev")

To explain: Two or three updates ago, I substantially reworked the as.speeches() method, which improved performance considerably. However, the new procedure did not adequately handle the case where a speaker has contributed only one single speech to the corpus and that speech is not interrupted by anybody. In that case there is only one region of corpus positions. It does not make sense to split this up, but polmineR failed to check for this, tried anyway, produced a warning, and did not create a subcorpus for that speaker. As a result, the tokens of these speeches were missing from the output speeches_bundle.
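To make the edge case concrete, here is a minimal base R sketch of splitting a speaker's regions into speeches. It is not the code polmineR uses: split_regions() and region_size() are hypothetical helpers, the corpus positions are made up, and the gap threshold only loosely mirrors the idea that regions lying close together belong to the same speech. The point is that a one-row region matrix has to be returned as a single speech rather than dropped, otherwise the summed sizes fall short of the total.

```r
# Illustrative sketch only. split_regions() and region_size() are hypothetical
# helpers, not part of polmineR.

# cpos: two-column integer matrix, one row per region (start, end).
split_regions <- function(cpos, gap = 500L) {
  # A speaker with a single region has exactly one speech: nothing to split.
  if (nrow(cpos) == 1L) return(list(cpos))
  # Distance between the start of each region and the end of the previous one.
  distance <- cpos[-1L, 1L] - cpos[-nrow(cpos), 2L]
  # A new speech starts wherever the distance exceeds the gap.
  speech_id <- cumsum(c(TRUE, distance > gap))
  lapply(
    split(seq_len(nrow(cpos)), speech_id),
    function(i) cpos[i, , drop = FALSE]
  )
}

# Number of tokens covered by a region matrix.
region_size <- function(cpos) sum(cpos[, 2L] - cpos[, 1L] + 1L)

# Made-up data: one speaker with three regions, one speaker with a single region.
multi <- matrix(c(0L, 99L, 100L, 199L, 5000L, 5099L), ncol = 2L, byrow = TRUE)
single <- matrix(c(200L, 299L), ncol = 2L)

speeches_multi <- split_regions(multi)
speeches_single <- split_regions(single)

# Summed speech sizes should equal the size of all regions taken together.
summed <- sum(vapply(c(speeches_multi, speeches_single), region_size, integer(1)))
total <- region_size(rbind(multi, single))
summed == total
```

If the single-region speaker were skipped instead of returned as one speech, summed would be smaller than total by exactly the size of that region, which is the kind of discrepancy reported above.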

In the "SL"-corpus, this concerns the speakers "Beck", "LawaIl" (typo!) and "Weisweiler", see:

library(polmineR)
use("PopParl")
corpus("SL") %>% subset(speaker == "Beck") %>% slot("cpos")
corpus("SL") %>% subset(speaker == "LawaIl") %>% slot("cpos")
corpus("SL") %>% subset(speaker == "Weisweiler") %>% slot("cpos")

Not handling this situation adequately is definitely a bug. With the new polmineR version, the bug is gone. See the following example, which is a somewhat simplified version of the instructive example you provided:

library(polmineR)
use("PopParl")

corpus("SL") %>% size()
sl_speeches <- corpus("SL") %>% as.speeches(s_attribute_name = "speaker")
merge(sl_speeches) %>% size()
summary(sl_speeches)[["size"]] %>% sum()

ChristophLeonhardt · 2019-07-24T13:17:44Z

This indeed seems to solve the issue. Thank you for the in-depth explanation which makes a lot of sense.

ablaette mentioned this issue Jul 19, 2019

as.speeches: verbose has no effect #64

Closed

ablaette closed this as completed Jul 24, 2019
