highlighting tokens in text sample by using workflow from UCSSR #99

KevinGlock · 2019-09-10T12:45:21Z

I've run into a problem when highlighting tokens in a returned text sample. I adapted the code from UCSSR for my approach:

## getting the dictionary samples

# The following workflow creates two partitions from the GermaParl corpus,
# subsetted by the parties' ideological position (left/right or progressive/conservative)
# regarding issues of national and transnational citizenship.

## for progressive parties

coi_l <- partition("GERMAPARL",
                   party = c(
                     "SPD", "GRUENE", "FDP", "LINKE", "PDS"),
                   interjection = FALSE,
                   encoding = c("UTF-8", "latin1"),
                   p_attribute = c("word", "lemma"),
                   role = c("mp", "government")
)


pb_l <- partition_bundle(coi_l, s_attribute = "date")

nested_l <- lapply(
  pb_l@objects,
  function(x) partition_bundle(x, s_attribute = "agenda_item", verbose = FALSE)
)

debates_l <- flatten(nested_l)

names(debates_l) <- paste(
  blapply(debates_l, function(x) s_attributes(x, "date")),
  blapply(debates_l, function(x) name(x)), 
  sep = "_"
)

q1 <- c('"[Dd]oppelstaat.*"', '"[Mm]ehrstaat.*"', '".*[Ss]taatsbürger.*"',
        '".*[Ss]taats(an|zu)gehörig.*"', '"[Ss]taatenlos.*"', '"[Aa]us.*bürger.*"',
        '"[Ee]in.*bürger.*"', '"Doppelpa(ss|ß).*"', '"Pa(ss|ß)"', '"[Oo]ptionspflicht.*"',
        '"[Oo]ptionszwang.*"', '"Blutsrecht.*"', '"Geburts(recht|prinzip)"',
        '"[Ii]us"', '"Abstammungs(recht|prinzip).*"')


dt_l <- count(debates_l, query = q1, regex = TRUE, cqp = TRUE) %>%
  setorderv(cols = "TOTAL", order = -1L)

debates_citizen_l <- debates_l[[ subset(dt_l, TOTAL >= 25)[["partition"]] ]]

Everything works up to this point. However, when I want to highlight tokens within the text, it is not possible to highlight nodes matched by regular expressions (which would be very useful).
So I decided to use the following test dictionary for highlighting:

dict <- c("Doppelstaatler", "Abstammungsprinzip")

debates_citizen_l[[20]] %>% read() %>% highlight(lightgreen = dict)

warnings()

When I inspected the text in the viewer, I saw that not every "Doppelstaatler" is highlighted, only the first one. Then I used the example code from UCSSR. In this case only "Asyl" and "Flucht" are highlighted because of the "latin1" encoding. Additionally, when I use
debates_citizen_l[[20]] %>% read(interjection = F) %>% highlight(lightgreen = dict)
some parts of sentences are excluded, not only the interjections.
Highlighting the regex-matched terms in particular would be very helpful for fast and reliable matching.
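
To illustrate the difference between literal dictionary matching and the regex matching requested here, a minimal base R sketch may help. The token vector is made up for illustration and is not taken from GermaParl, and the code does not call polmineR at all. It only shows why a fixed dictionary can miss inflected or derived forms that a pattern would catch.

```r
# Hypothetical tokens, invented for this illustration (not corpus data)
tokens <- c("Doppelstaatler", "Doppelstaatlern", "doppelstaatliche",
            "Abstammungsprinzip", "Bundestag")

# The test dictionary from the post above: literal strings only
dict <- c("Doppelstaatler", "Abstammungsprinzip")

# Literal matching: a token counts only if it is identical to a dictionary entry
literal_hits <- tokens %in% dict

# Regex matching: one pattern from q1, anchored so it has to span the whole token
regex_hits <- grepl("^[Dd]oppelstaat.*$", tokens, perl = TRUE)

# Side-by-side comparison of both approaches
data.frame(token = tokens, literal = literal_hits, regex = regex_hits)
```

The literal dictionary finds only the exact forms, while the pattern also covers "Doppelstaatlern" and "doppelstaatliche".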


ablaette · 2019-10-02T09:40:50Z

This is a perfectly valid feature request. A while ago, the argument regex was introduced so that regular expressions could be used when highlighting terms in a keyword-in-context display, but it was not available for fulltext output.

A new version of polmineR I just pushed to the dev branch includes this feature. Get it via devtools::install_github("PolMine/polmineR", ref = "dev").

I edited your code somewhat to make it more efficient. This should work now:

library(polmineR)
library(pbapply)
library(data.table)
use("GermaParl")

coi_l <- corpus("GERMAPARL") %>%
  subset(party %in% c("SPD", "GRUENE", "FDP", "LINKE", "PDS")) %>%
  subset(interjection == FALSE) %>%
  subset(role %in% c("mp", "government"))

# Note that I switched to the new corpus()/split() workflow. The
# initialization is not significantly faster, but getting the subcorpora is
# much, much faster.

scb_l <- split(coi_l, s_attribute = "date", progress = TRUE)

nested_l <- pblapply(scb_l@objects, split, s_attribute = "agenda_item", verbose = TRUE)

# the flatten() auxiliary function does not yet work for subcorpus_bundle objects
# so I resort to this snippet
debates_l <- new("subcorpus_bundle", objects = unlist(lapply(nested_l, function(x) x@objects)))

names(debates_l) <- paste(
  blapply(debates_l, function(x) s_attributes(x, "date")),
  blapply(debates_l, function(x) name(x)), 
  sep = "_"
)

q1 <- c(
  '"[Dd]oppelstaat.*"',
  '"[Mm]ehrstaat.*"',
  '".*[Ss]taatsbürger.*"',
  '".*[Ss]taats(an|zu)gehörig.*"',
  '"[Ss]taatenlos.*"',
  '"[Aa]us.*bürger.*"',
  '"[Ee]in.*bürger.*"',
  '"Doppelpa(ss|ß).*"',
  '"Pa(ss|ß)"',
  '"[Oo]ptionspflicht.*"',
  '"[Oo]ptionszwang.*"',
  '"Blutsrecht.*"',
  '"Geburts(recht|prinzip)"',
  '"[Ii]us"',
  '"Abstammungs(recht|prinzip).*"'
)


dt_l <- count(debates_l, query = q1, regex = TRUE, cqp = TRUE) %>%
  setorderv(cols = "TOTAL", order = -1L)

debates_citizen_l <- debates_l[[ subset(dt_l, TOTAL >= 25)[["partition"]] ]]

# This is important! The CQP syntax requires tokens to be wrapped into 
# additional quotation marks, but usual regular expressions don't. So we
# need to get rid of the quotation marks again.
q1_regex <- gsub('^\\"(.*?)\\"$', '\\1', q1)


# ... and this is what the whole exercise was about
debates_citizen_l[[20]] %>%
  as("plpr_subcorpus") %>%
  html() %>%
  highlight(lightgreen = q1_regex, regex = TRUE, perl = TRUE)

As you will see, the corpus()/split() workflow is much faster and should be used. I would also like to mention that the partition() call in your example specified two different encodings, which is odd. The argument p_attribute is not necessary and triggers a count that is not used, so it should be omitted.
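
The quote stripping can be tried out in isolation with base R. The sketch below is only an illustration under assumptions: the token vector is invented, the helper matches_any() is hypothetical, and wrapping each pattern in ^(...)$ is an assumed way of mimicking CQP's whole-token matching, not something highlight() is confirmed to do internally. Umlauts are written as \u escapes so the example does not depend on the encoding of the script file.

```r
# Three of the CQP queries from above; each token pattern is wrapped in
# double quotes, as the CQP syntax requires
q1 <- c('"[Dd]oppelstaat.*"', '"Pa(ss|\u00df)"', '".*[Ss]taatsb\u00fcrger.*"')

# Strip the surrounding quotation marks to get plain regular expressions
q1_regex <- gsub('^"(.*)"$', '\\1', q1)

# Hypothetical tokens, invented for this illustration
tokens <- c("Doppelstaatler", "Pass", "Staatsb\u00fcrgerschaft", "Asyl")

# Hypothetical helper: TRUE if any pattern matches the complete token
matches_any <- function(token, patterns) {
  any(vapply(
    patterns,
    function(rx) grepl(paste0("^(", rx, ")$"), token, perl = TRUE),
    logical(1)
  ))
}

hits <- vapply(tokens, matches_any, logical(1), patterns = q1_regex)
hits
```

In this sketch "Asyl" is not matched, simply because none of the three patterns covers it.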

KevinGlock · 2019-10-02T16:54:22Z

Thank you for your answer.

The suggested workflow works really well and is as fast as you said, but the result in the viewer is not highlighted, and the encoding is latin1, so the regex does not match umlauts. When I use my workflow with UTF-8 combined with your update, it works. However, the result in the viewer doesn't seem reliable, because Asyl.* is matched and highlighted in some cases but not in others.

Kind regards
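
One way to check whether an encoding mismatch is involved is to look at the declared encoding and the raw bytes of a string. The following base R sketch is a diagnostic under assumptions: it does not touch polmineR or the corpus, the example word is chosen for illustration, and it only shows that the same umlaut is stored as different bytes in latin1 and UTF-8, which is the kind of difference that could keep a pattern from matching if an encoding were declared wrongly.

```r
# The same word, once in UTF-8 and once converted to latin1
word_utf8 <- "Staatsb\u00fcrger"
word_latin1 <- iconv(word_utf8, from = "UTF-8", to = "latin1")

# Declared encodings of both strings
Encoding(word_utf8)
Encoding(word_latin1)

# The umlaut takes two bytes in UTF-8 (c3 bc) and one byte in latin1 (fc)
bytes_utf8 <- charToRaw("\u00fc")
bytes_latin1 <- charToRaw(iconv("\u00fc", from = "UTF-8", to = "latin1"))

# Converting back to UTF-8 restores the original byte sequence
roundtrip <- iconv(word_latin1, from = "latin1", to = "UTF-8")
same_bytes <- identical(charToRaw(roundtrip), charToRaw(word_utf8))
same_bytes
```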

PolMine · 2020-03-04T09:14:53Z

Returning to the issue of incomplete highlighting, I was worried that highlighting might be buggy when the encoding of the terminal is "latin-1", i.e. that something might go systematically wrong on Windows machines.

Checking the code on my old Windows machine actually yields a different result: Everything is fine and reliable.

Indeed, it would have been very surprising if "Asyl.*" had been matched and highlighted, because it was not part of the list of queries. Everything is fine; you just have to add it.

I think I want to close this issue.

PolMine closed this as completed Mar 4, 2020
