highlighting tokens in text sample by using workflow from UCSSR #99

Closed
KevinGlock opened this issue Sep 10, 2019 · 3 comments


@KevinGlock

I run into a problem when highlighting tokens in a returned text sample. I adapted the code from UCSSR for my approach:

## getting the dictionary samples

# The following workflow creates two partitions from the GermaParl corpus,
# subsetted by the parties' ideological position (left/right or progressive/conservative)
# regarding issues of national and transnational citizenship.

## for progressive parties

coi_l <- partition("GERMAPARL",
                   party = c(
                     "SPD", "GRUENE", "FDP", "LINKE", "PDS"),
                   interjection= F,
                   encoding = c("UTF-8", "latin1"),
                   p_attribute = c("word", "lemma"),
                   role = c("mp", "government")
)


pb_l <- partition_bundle(coi_l, s_attribute = "date")

nested_l <- lapply(
  pb_l@objects,
  function(x) partition_bundle(x, s_attribute = "agenda_item", verbose = F)
)

debates_l <- flatten(nested_l)

names(debates_l) <- paste(
  blapply(debates_l, function(x) s_attributes(x, "date")),
  blapply(debates_l, function(x) name(x)), 
  sep = "_"
)

q1 <- c('"[Dd]oppelstaat.*"', '"[Mm]ehrstaat.*"', '".*[Ss]taatsbürger.*"',
        '".*[Ss]taats(an|zu)gehörig.*"', '"[Ss]taatenlos.*"', '"[Aa]us.*bürger.*"',
        '"[Ee]in.*bürger.*"', '"Doppelpa(ss|ß).*"', '"Pa(ss|ß)"', '"[Oo]ptionspflicht.*"',
        '"[Oo]ptionszwang.*"', '"Blutsrecht.*"', '"Geburts(recht|prinzip)"',
        '"[Ii]us"', '"Abstammungs(recht|prinzip).*"')


dt_l <- count(debates_l, query = q1, regex = T, cqp = T) %>% setorderv(cols = "TOTAL", order = -1L)

debates_citizen_l <- debates_l[[ subset(dt_l, TOTAL >= 25)[["partition"]] ]]

Everything works up to this point. However, when I want to highlight tokens within the text, it is not possible to highlight nodes matched by regular expressions (although that would be very nice to have).
So I decided to use the following test dictionary for highlighting:

dict <- c("Doppelstaatler", "Abstammungsprinzip")

debates_citizen_l[[20]] %>% read() %>% highlight(lightgreen = dict)

warnings()

When I inspected the text in the viewer, I saw that only the first occurrence of "Doppelstaatler" is highlighted, not every one. Then I tried the example code from UCSSR. In that case only "Asyl" and "Flucht" are highlighted, which I attribute to the "latin1" encoding. Additionally, when I use
debates_citizen_l[[20]] %>% read(interjection = F) %>% highlight(lightgreen = dict)
some parts of sentences are excluded, not only the interjections.
Highlighting the regex-matched terms in particular would be very helpful for fast and reliable matching.

@ablaette
Collaborator

ablaette commented Oct 2, 2019

This is a perfectly valid feature request. A while ago, the argument regex was introduced so that regular expressions could be used when highlighting terms in a keyword-in-context display, but it was not available for fulltext output.

A new version of polmineR I just pushed to the dev branch includes this feature. Get it via devtools::install_github("PolMine/polmineR", ref = "dev").

I edited your code somewhat to make it more efficient. This should work now:

library(polmineR)
library(pbapply)
library(data.table)
use("GermaParl")

coi_l <- corpus("GERMAPARL") %>%
  subset(party %in% c("SPD", "GRUENE", "FDP", "LINKE", "PDS")) %>%
  subset(interjection == FALSE) %>%
  subset(role %in% c("mp", "government"))

# Note that I switched to the new corpus()/split() workflow. The initial
# initialization is not significantly faster, but getting the subcorpora is
# much, much faster.

scb_l <- split(coi_l, s_attribute = "date", progress = TRUE)

nested_l <- pblapply(scb_l@objects, split, s_attribute = "agenda_item", verbose = TRUE)

# the flatten() auxiliary function does not yet work for subcorpus_bundle objects
# so I resort to this snippet
debates_l <- new("subcorpus_bundle", objects = unlist(lapply(nested_l, function(x) x@objects)))

names(debates_l) <- paste(
  blapply(debates_l, function(x) s_attributes(x, "date")),
  blapply(debates_l, function(x) name(x)), 
  sep = "_"
)

q1 <- c(
  '"[Dd]oppelstaat.*"',
  '"[Mm]ehrstaat.*"',
  '".*[Ss]taatsbürger.*"',
  '".*[Ss]taats(an|zu)gehörig.*"',
  '"[Ss]taatenlos.*"',
  '"[Aa]us.*bürger.*"',
  '"[Ee]in.*bürger.*"',
  '"Doppelpa(ss|ß).*"',
  '"Pa(ss|ß)"',
  '"[Oo]ptionspflicht.*"',
  '"[Oo]ptionszwang.*"',
  '"Blutsrecht.*"',
  '"Geburts(recht|prinzip)"',
  '"[Ii]us"',
  '"Abstammungs(recht|prinzip).*"'
)


dt_l <- count(debates_l, query = q1, regex = TRUE, cqp = TRUE) %>%
  setorderv(cols = "TOTAL", order = -1L)

debates_citizen_l <- debates_l[[ subset(dt_l, TOTAL >= 25)[["partition"]] ]]

# This is important! The CQP syntax requires tokens to be wrapped into 
# additional quotation marks, but usual regular expressions don't. So we
# need to get rid of the quotation marks again.
q1_regex <- gsub('^\\"(.*?)\\"$', '\\1', q1)


# ... and this is what the whole exercise was about
debates_citizen_l[[20]] %>%
  as("plpr_subcorpus") %>%
  html() %>%
  highlight(lightgreen = q1_regex, regex = TRUE, perl = TRUE)

As you will see, the corpus()/split() workflow is much faster and should be used. Two further remarks on your example: the partition() call stated two different encodings, which is odd. And using the argument p_attribute is not necessary; it triggers a count that is not used, so it should be omitted.
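
To illustrate the quote-stripping step in isolation, here is a minimal base R sketch that does not call polmineR at all. The token vector is hypothetical, and the anchoring with ^ and $ is only my assumption for mimicking CQP's whole-token matching; the thread does not say whether highlight() anchors its patterns.

```r
# A few of the CQP queries from above: each token regex is wrapped in double quotes
q1 <- c('"[Dd]oppelstaat.*"', '"Geburts(recht|prinzip)"', '"[Ii]us"')

# Strip the enclosing quotation marks to obtain plain regular expressions
q1_regex <- gsub('^"(.*)"$', '\\1', q1)

# Hypothetical tokens, for illustration only
tokens <- c("Doppelstaatler", "Geburtsrecht", "Ius", "Radius", "Asyl")

# CQP matches a regex against the whole token, so anchor each pattern
# (assumption: this mimics CQP; it is not taken from polmineR's code)
anchored <- sprintf("^(%s)$", q1_regex)

# TRUE for every token that is matched by at least one anchored pattern
hits <- vapply(
  tokens,
  function(tok) any(vapply(anchored, grepl, logical(1), x = tok, perl = TRUE)),
  logical(1)
)
print(hits)
```

Without the anchors, "Radius" would also be matched by "[Ii]us", which is worth keeping in mind when a CQP query list is reused as plain regular expressions.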

@KevinGlock
Author

Thank you for your answer.

The workflow you offered works really well and is as fast as you said, but the result in the viewer is not highlighted, and the encoding is latin1, so the regex does not match terms with umlauts. When I use my workflow with UTF-8 combined with your update, it works. The result in the viewer does not seem to be reliable, though: in some cases Asyl.* is matched and highlighted, in other cases it is not.
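
One way to check whether an encoding mismatch is involved at all is to compare declared encodings and raw bytes in base R. This is only a generic diagnostic sketch with a hypothetical token; it does not reproduce what polmineR does internally.

```r
# Hypothetical token with an umlaut, written with an escape so the source stays ASCII
utf8_token <- "Staatsb\u00fcrger"
latin1_token <- iconv(utf8_token, from = "UTF-8", to = "latin1")

# The declared encodings differ ...
encodings <- c(Encoding(utf8_token), Encoding(latin1_token))

# ... and so do the underlying bytes (the umlaut takes two bytes in UTF-8, one in latin1)
byte_diff <- length(charToRaw(utf8_token)) - length(charToRaw(latin1_token))

# Converting to UTF-8 before matching removes the discrepancy
normalized <- enc2utf8(latin1_token)
same_bytes <- identical(charToRaw(normalized), charToRaw(utf8_token))
matches <- grepl("b\u00fcrger", normalized, fixed = TRUE)

print(encodings)
print(c(byte_diff = byte_diff, same_bytes = same_bytes, matches = matches))
```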

Kind regards

@PolMine
Collaborator

PolMine commented Mar 4, 2020

Returning to the issue of incomplete highlighting: I was worried that highlighting might be buggy when the encoding of the terminal is "latin-1", i.e. that something might go systematically wrong on Windows machines.

Checking the code on my old Windows machine actually yields a different result: everything is fine and reliable.

Indeed, it would have been very surprising if "Asyl.*" had been matched and highlighted, because it was not part of the list of queries. Everything is fine; you just have to add it.
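
As a quick base R check of that point (a sketch only, with a shortened pattern list and a hypothetical token), a term is matched only once its pattern has been added to the list:

```r
# Shortened list of plain regular expressions, taken from the queries above
q1_regex <- c("[Dd]oppelstaat.*", "Abstammungs(recht|prinzip).*")

# Hypothetical token, for illustration only
token <- "Asylrecht"

# Is the pattern in the list, and does any listed pattern match the token?
asyl_listed <- "Asyl.*" %in% q1_regex
matched_before <- any(vapply(q1_regex, grepl, logical(1), x = token, perl = TRUE))

# Add the missing pattern and check again
q1_extended <- union(q1_regex, "Asyl.*")
matched_after <- any(vapply(q1_extended, grepl, logical(1), x = token, perl = TRUE))

print(c(asyl_listed = asyl_listed, before = matched_before, after = matched_after))
```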

I think I want to close this issue.

@PolMine PolMine closed this as completed Mar 4, 2020