Split a corpus using a structural attribute without values #262

Closed
ChristophLeonhardt opened this issue Sep 6, 2023 · 1 comment

@ChristophLeonhardt
Contributor

If I am not mistaken, recent changes to polmineR should make it possible, or at least more robust, to split a corpus or subcorpus based on structural attributes without values. This is relevant when the goal is to split a corpus into individual sentences: in some corpora, such as GermaParl2, sentences are annotated as structural attributes that do not have values.

polmineR reports the absence of values when such an attribute is queried:

s_attributes("GERMAPARL2", "s")
# ! s-attribute `s` does not have values, returning NA

I noticed something interesting. With the most recent development version of polmineR, it is possible to split a subcorpus into sentences:

sentences <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |>
  split(s_attribute = "s")

The same call does not seem to work when applied to a full corpus:

corpus("GERMAPARL2") |>
  split(s_attribute = "s")

This returns a subcorpus bundle of length 1, and the size of the returned object is identical to the size of the entire corpus, i.e. the corpus does not appear to have been split.

This is at least somewhat unexpected.

@ablaette
Collaborator

I have implemented this in the meantime. See the following example:

library(polmineR)
use("GermaParl2")

corpus("GERMAPARL2MINI") |>
  split(s_attribute = "s")
