Size of Corpus != size of summed up speeches #86

Closed
ChristophLeonhardt opened this issue May 28, 2019 · 2 comments

Comments

@ChristophLeonhardt
Contributor

ChristophLeonhardt commented May 28, 2019

I am not sure if that is an actual issue but I observed the following odd behavior in some corpora:

I get the size of a corpus as follows:

size_of_corpus <- size("SL")

If I split the corpus into speeches and merge the resulting partition bundle again, the size is the same:

sl_speeches <- as.speeches("SL", s_attribute_name = "speaker")
size_of_remerged_speeches <- merge(sl_speeches) %>% size()

size_of_corpus == size_of_remerged_speeches

The odd part is: If I iterate over the sl_speeches partition bundle and sum up the size of each speech, the resulting sum of sizes is slightly smaller than the actual size of the merged partition bundle.

size_of_summed_up_speeches <- lapply(sl_speeches@objects, function(x) x@size) %>% Reduce("+", .)

The entire sample script:

library(polmineR)
use("PopParl")
size_of_corpus <- size("SL")
sl_speeches <- as.speeches("SL", s_attribute_name = "speaker") 
size_of_remerged_speeches <- merge(sl_speeches) %>% size()

size_of_corpus == size_of_remerged_speeches # TRUE

size_of_summed_up_speeches <- lapply(sl_speeches@objects, function(x) x@size) %>% Reduce("+", .)

size_of_corpus == size_of_summed_up_speeches # FALSE

version: polmineR 0.7.11.9023

@PolMine
Collaborator

PolMine commented May 31, 2019

This is certainly an issue! Thanks for raising it. A new polmineR version on the development branch (v0.7.11.9024) addresses this and removes the bug.

devtools::install_github("PolMine/polmineR", ref = "dev")

To explain: two or three updates ago, I substantially reworked the as.speeches() method, which improved performance considerably. However, the new procedure did not adequately handle the case where a speaker contributed only a single speech to the corpus and that speech is not interrupted by anybody. In that case there is only one region of corpus positions. It does not make sense to split this up, but polmineR failed to check for this, tried anyway, produced a warning, and did not create a subcorpus for that speaker. As a result, the tokens of those speakers were missing from the output speeches_bundle.
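
The accounting effect of such a failure can be sketched with a toy example in base R. This is not polmineR's implementation, only an illustration under stated assumptions: the `regions` data frame, the speaker names "A", "B" and "C", and the helpers `region_size()` and `split_speeches()` (with its `skip_single` switch) are all hypothetical. Each row stands for one region of corpus positions, and every region is treated as one speech.

```r
# Toy data (invented): one row per region of corpus positions, bounds inclusive.
regions <- data.frame(
  speaker    = c("A", "B", "A", "C", "B"),
  cpos_left  = c(0L, 10L, 25L, 40L, 50L),
  cpos_right = c(9L, 24L, 39L, 49L, 59L)
)

# Number of tokens covered by a set of regions.
region_size <- function(df) sum(df$cpos_right - df$cpos_left + 1L)

# Hypothetical splitter: every region becomes one "speech".
# With skip_single = TRUE it mimics the bug by dropping any speaker
# who has exactly one region, so no object is created for that speaker.
split_speeches <- function(regions, skip_single) {
  by_speaker <- split(regions, regions$speaker)
  speeches <- list()
  for (name in names(by_speaker)) {
    df <- by_speaker[[name]]
    if (skip_single && nrow(df) == 1L) next
    for (i in seq_len(nrow(df))) {
      speeches[[paste(name, i, sep = "_")]] <- df[i, ]
    }
  }
  speeches
}

total <- region_size(regions)
buggy <- split_speeches(regions, skip_single = TRUE)
fixed <- split_speeches(regions, skip_single = FALSE)

# Sum the sizes of the individual speeches, as in the original report.
summed_buggy <- sum(vapply(buggy, region_size, integer(1)))
summed_fixed <- sum(vapply(fixed, region_size, integer(1)))

total - summed_buggy   # tokens lost: the single region of speaker "C"
total == summed_fixed  # TRUE once single-region speakers are kept
```

In this toy data, only speaker "C" has a single region, so the shortfall equals the size of that one region. That matches the pattern reported above, where the summed size is only slightly smaller than the corpus size.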

In the "SL" corpus, this concerns the speakers "Beck", "LawaIl" (typo!) and "Weisweiler"; see:

library(polmineR)
use("PopParl")
corpus("SL") %>% subset(speaker == "Beck") %>% slot("cpos")
corpus("SL") %>% subset(speaker == "LawaIl") %>% slot("cpos")
corpus("SL") %>% subset(speaker == "Weisweiler") %>% slot("cpos")

Not handling this situation adequately is definitely a bug. With the new polmineR version, the bug is gone. See the following example, a somewhat simplified version of the instructive example you provided:

library(polmineR)
use("PopParl")

corpus("SL") %>% size()
sl_speeches <- corpus("SL") %>% as.speeches(s_attribute_name = "speaker")
merge(sl_speeches) %>% size()
summary(sl_speeches)[["size"]] %>% sum()

@ChristophLeonhardt
Contributor Author

This indeed seems to solve the issue. Thank you for the in-depth explanation which makes a lot of sense.
