Creating a subcorpus bundle while ignoring the values of the structural attribute #263
Comments
I did this now. This is a very basic example that I use in the unit tests.

    library(polmineR)
    use("GermaParl2")
    corpus("GERMAPARL2MINI") %>%
      split(s_attribute = "p", values = FALSE, verbose = FALSE)

This is a modification of your scenario:

    paragraphs <- corpus("GERMAPARL2") |>
      subset(protocol_year == 2000) |>
      subset(p_type == "speech") |>
      split(s_attribute = "p")

Here, I will have 107691 paragraphs. A lot, but plausible.
Thank you very much for the quick response. I think the final chunk is supposed to be:
This seems like a handy solution to the problem I described. As far as I am concerned, this issue can be closed.
Splitting a subcorpus using a structural attribute without values - at least in principle - seems to be possible now (but see issue #262).
While having a look at this, I encountered a different scenario in which splitting a corpus by the value of a structural attribute might not lead to the desired output.
Use case: There might be scenarios in which I want to ignore the distinct values of a structural attribute and split the corpus every time a structure changes regardless of the value of the structural attribute. An example would be to split GermaParl2 into paragraphs:
The desired output would be a subcorpus bundle containing all paragraphs of type "speech" separately.
However, the structural attribute "p" has values (containing the type of the paragraph). So, split() behaves as expected, splitting the subcorpus by these values. In this case, because of the previous subset(), there is only one value left and thus the result is a subcorpus bundle containing a single subcorpus with all "speech" paragraphs.
I am not entirely sure whether this use case is too specific to warrant an intervention for polmineR. Maybe it should be addressed by improving the data instead. But maybe there could be a solution that uses the mechanism introduced for structural attributes without values (such as "s", i.e. sentences in GermaParl2) to create a "paragraph" bundle regardless of the actual values of the structural attribute.
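To make the difference between the two splitting modes concrete, here is a minimal base R sketch. It does not use polmineR or its internals. The `regions` table, its column names and the corpus positions are hypothetical stand-ins for the regions of the structural attribute "p" that remain after subsetting to "speech".

```r
# Hypothetical region table: one row per <p> region with its corpus
# positions and the value of the structural attribute (illustrative data).
regions <- data.frame(
  cpos_left  = c(0L, 12L, 30L, 41L),
  cpos_right = c(11L, 29L, 40L, 55L),
  p_type     = c("speech", "speech", "speech", "speech")
)

# Splitting by value: all regions that share a value land in one group.
by_value <- split(regions, regions$p_type)

# Splitting by region: every region becomes its own group, values ignored.
by_region <- split(regions, seq_len(nrow(regions)))

length(by_value)   # 1 group holding all four regions
length(by_region)  # 4 groups, one per paragraph
```

The first result corresponds to the behaviour described above (one element containing all "speech" paragraphs), the second to the desired paragraph-wise bundle.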