Highlighting tokens in a text sample using the workflow from UCSSR #99
This is a perfectly valid feature request. A new version of polmineR I just pushed to the dev branch includes this feature; get it via the dev branch. I somewhat edited your code to make it more efficient. This should work now:

```r
library(polmineR)
library(pbapply)
library(data.table)

use("GermaParl")

coi_l <- corpus("GERMAPARL") %>%
  subset(party %in% c("SPD", "GRUENE", "FDP", "LINKE", "PDS")) %>%
  subset(interjection == FALSE) %>%
  subset(role %in% c("mp", "government"))

# Note that I switched to the new corpus()/split() workflow. The initial
# initialization is not significantly faster, but getting the subcorpora is
# much, much faster.
scb_l <- split(coi_l, s_attribute = "date", progress = TRUE)
nested_l <- pblapply(scb_l@objects, split, s_attribute = "agenda_item", verbose = TRUE)

# The flatten() auxiliary function does not yet work for subcorpus_bundle
# objects, so I resort to this snippet.
debates_l <- new(
  "subcorpus_bundle",
  objects = unlist(lapply(nested_l, function(x) x@objects))
)
names(debates_l) <- paste(
  blapply(debates_l, function(x) s_attributes(x, "date")),
  blapply(debates_l, function(x) name(x)),
  sep = "_"
)

q1 <- c(
  '"[Dd]oppelstaat.*"',
  '"[Mm]ehrstaat.*"',
  '".*[Ss]taatsbürger.*"',
  '".*[Ss]taats(an|zu)gehörig.*"',
  '"[Ss]taatenlos.*"',
  '"[Aa]us.*bürger.*"',
  '"[Ee]in.*bürger.*"',
  '"Doppelpa(ss|ß).*"',
  '"Pa(ss|ß)"',
  '"[Oo]ptionspflicht.*"',
  '"[Oo]ptionszwang.*"',
  '"Blutsrecht.*"',
  '"Geburts(recht|prinzip)"',
  '"[Ii]us"',
  '"Abstammungs(recht|prinzip).*"'
)

dt_l <- count(debates_l, query = q1, regex = TRUE, cqp = TRUE) %>%
  setorderv(cols = "TOTAL", order = -1L)

debates_citizen_l <- debates_l[[ subset(dt_l, TOTAL >= 25)[["partition"]] ]]

# This is important! The CQP syntax requires tokens to be wrapped into
# additional quotation marks, but ordinary regular expressions don't. So we
# need to get rid of the quotation marks again.
q1_regex <- gsub('^\\"(.*?)\\"$', '\\1', q1)

# ... and this is what the whole exercise was about
debates_citizen_l[[20]] %>%
  as("plpr_subcorpus") %>%
  html() %>%
  highlight(lightgreen = q1_regex, regex = TRUE, perl = TRUE)
```

As you will also see, the corpus()/split() workflow is much faster and should be used. Anyway, I should like to mention: the partition() call in your example included a statement of two different encodings, which is odd. Using the argument …
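To illustrate the quotation-mark issue in isolation: CQP queries wrap each token in double quotation marks, while ordinary regular expressions do not, so the quotes have to be stripped before the patterns can be reused for highlighting. A minimal base-R sketch of that gsub() step, using two of the queries from the list above:

```r
# CQP-style queries: each token is wrapped in double quotation marks
q <- c('"[Dd]oppelstaat.*"', '"Pa(ss|ß)"')

# Strip the outer quotation marks to obtain plain regular expressions
q_regex <- gsub('^"(.*?)"$', '\\1', q)
print(q_regex)  # "[Dd]oppelstaat.*" "Pa(ss|ß)"

# The stripped pattern now works with ordinary regex matching
print(grepl(q_regex[1], "Doppelstaatler"))  # TRUE
```

Because the pattern is anchored with `^` and `$`, only the outermost pair of quotation marks is removed; quotation marks inside a query would survive.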
Thank you for your answer. The offered workflow works really well and fast, as you said, but the result in the viewer is not highlighted, and the encoding is latin1, so the regex does not match umlauts. When I use my workflow with UTF-8 combined with your update, it works. The result in the viewer, though, does not seem to be reliable: in some cases Asyl.* is matched and highlighted, but in other cases it is not. Kind regards
Returning to the issue of incomplete highlighting, I worried that highlighting might be buggy when the encoding of the terminal is "latin-1", i.e. that something may go systematically wrong on Windows machines. Checking the code on my old Windows machine actually yields a different result: everything is fine and reliable. Indeed, it would have been very surprising if "Asyl.*" had been matched and highlighted, because it was not part of the list of queries. Everything is fine, you just have to add it. I think I want to close this issue.
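As a side note on the encoding question raised in this thread, here is a small base-R sketch (my own illustration, not from polmineR) of why umlaut patterns can fail to match: a latin1-encoded string contains different bytes than its UTF-8 counterpart, so byte-wise matching fails until the encodings are aligned, e.g. via iconv():

```r
# "bürger" in UTF-8 and its latin1-encoded counterpart
x_utf8 <- "b\u00fcrger"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Byte-wise comparison fails: the umlaut is encoded as different bytes
print(grepl(x_utf8, x_latin1, useBytes = TRUE))  # FALSE

# After converting back to UTF-8, the pattern matches again
print(grepl(x_utf8, iconv(x_latin1, from = "latin1", to = "UTF-8")))  # TRUE
```

Without `useBytes = TRUE`, R's regex functions translate the inputs to a common encoding themselves, which is why the problem tends to surface only when strings cross an encoding boundary unnoticed.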
I've got an error when highlighting a returned text sample. I adapted the code from UCSSR for my approach:

Everything is working so far. But when I want to highlight tokens within the text, it is not possible to highlight regex-based nodes (although it would be very nice to do so).

So I decided to use the following test dict for highlighting:

When I inspected the text with the viewer, I saw that not every "Doppelstaatler" is highlighted, only the first one. Then I used the example code from UCSSR. In this case only "Asyl" and "Flucht" are highlighted, because of the "latin1" encoding. Additionally, when I use

```r
debates_citizen_l[[20]] %>%
  read(interjection = FALSE) %>%
  highlight(lightgreen = dict)
```

some parts of sentences are excluded, not only the interjections. Especially highlighting the regex-based terms would be very helpful to get fast and reliable matching.
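The incomplete highlighting described here is consistent with dictionary entries being compared literally rather than interpreted as regular expressions. A base-R sketch of the difference (the token vector and dict entry are made up for illustration):

```r
tokens <- c("Doppelstaatler", "Doppelstaat", "Staat")
dict <- "Doppelstaat.*"

# Literal matching: the pattern string is compared as-is, so nothing matches
print(tokens %in% dict)  # FALSE FALSE FALSE

# Regex matching anchored to whole tokens: both Doppelstaat* forms match
print(grepl(paste0("^", dict, "$"), tokens))  # TRUE TRUE FALSE
```

Anchoring with `^` and `$` keeps the regex from matching mere substrings of a token, mirroring how a token-level concordancer treats a query.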