diff --git a/DESCRIPTION b/DESCRIPTION index 2da1472..4b6ba67 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,14 +1,15 @@ Package: LinkTools Type: Package Title: LinkTools -Version: 0.0.1.9003 -Date: 2023-02-07 +Version: 0.0.1.9005 +Date: 2023-04-22 Author: Christoph Leonhardt -Maintainer: Christoph Leonhardt -Description: This package facilitates the linkage of datasets via shared unique identifiers. Four steps are integrated into this package: a) the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of shared unique identifiers, b) the merge of datasets based on these identifiers, c) the encoding or enrichment of the data with three output formats (data.table, XML or CWB). In addition, d), the package includes a wrapper for the Named Entity Linking of textual data based on DBPedia Spotlight. +Maintainer: Christoph Leonhardt +Description: This package facilitates the linkage of datasets. Once finished, four steps should be integrated into this package: a) the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of new information, in particular shared unique identifiers, b) the merge of datasets based on these identifiers, c) the encoding or enrichment of the data with different output formats (data.table, XML or CWB). In addition, d), the package includes a wrapper for the Named Entity Linking of textual data based on DBpedia Spotlight. Depends: R (>= 3.5.0) Imports: + cli, data.table, cwbtools, polmineR, @@ -22,12 +23,16 @@ Imports: rhandsontable, stringr Suggests: + btmp, knitr, devtools, - DT + DT, + testthat (>= 3.0.0), + withr VignetteBuilder: knitr LazyData: yes License: GPL (>= 3) Encoding: UTF-8 Roxygen: list(markdown = TRUE) -RoxygenNote: 7.2.2 +RoxygenNote: 7.2.3 +Config/testthat/edition: 3 diff --git a/LinkTools_interactive_matching_gui_README.png b/LinkTools_interactive_matching_gui_README.png new file mode 100644 index 0000000..4f5782c Binary files /dev/null and b/LinkTools_interactive_matching_gui_README.png differ diff --git a/NAMESPACE b/NAMESPACE index 01c4e47..a264dff 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -4,12 +4,19 @@ export(LTDataset) importFrom(R6,R6Class) importFrom(RcppCWB,cl_cpos2struc) importFrom(RcppCWB,cl_struc2str) +importFrom(cli,cli_abort) +importFrom(cli,cli_alert_info) +importFrom(cli,cli_alert_success) +importFrom(cli,cli_alert_warning) +importFrom(cli,cli_progress_message) +importFrom(cli,cli_progress_update) importFrom(cwbtools,registry_file_parse) importFrom(cwbtools,s_attribute_encode) importFrom(data.table,":=") importFrom(data.table,data.table) importFrom(data.table,is.data.table) importFrom(data.table,rleid) +importFrom(data.table,rleidv) importFrom(data.table,setnames) importFrom(data.table,setorder) importFrom(fuzzyjoin,fuzzy_join) diff --git a/NEWS.md b/NEWS.md index bbeb867..a62decf 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,23 @@ +V0.0.1.9005 +* replaced `\code{}` tags in the documentation +* reduced verbosity of intermediate steps +* introduced a check if the external dataset contains NA values in significant columns before fuzzy matching +* introduced messages from the `cli` package + +[2023-04-19] +* addressed a quite comprehensive issue in `external_attribute_to_region_matrix()` that potentially obscured speakers which were not matched, making them unavailable for both the fuzzy matching and manual inspection (issue #14) +* introduced tests +* modified vignette, uses GERMAPARLMINI as sample data and the 
`btmp` package for linking +* removed data and data-raw which contained the external data now provided by the `btmp` package +* made the addition more robust for subcorpora + +[2023-04-12] +* started rework to address more diverging dataset-text data combinations +* added the `additional_attributes` argument to `create_attribute_region_datatable()` to make manual inspection more meaningful +* made `check_and_add_missing_values()` more flexible by passing `check_for_groups` as a list and adding the `negate` argument +* added capability to use more than one fuzzy matched variable in `fuzzy_join_missing_values()` +* made more explicit use of fuzzyjoin::stringdist_join() + [2023-02-07] # v0.0.1.9003 * Added `Depends: R (>= 3.5.0)` to DESCRIPTION to avoid warning when build the package #2. diff --git a/R/LTDataset.R b/R/LTDataset.R index a72e6b1..da8f374 100644 --- a/R/LTDataset.R +++ b/R/LTDataset.R @@ -42,13 +42,13 @@ #' to add missing values interactively.} #' \item{`create_attribute_region_datatable()`}{returns a data.table of matched #' attributes. Can be used for further manual checks.} -#' \item{`encode_new_s_attribute(add_temporarily)`}{encodes the new attribute +#' \item{`encode_new_s_attribute()`}{encodes the new attribute #' as a structural attribute (for CWB)} #' } #' @export LTDataset #' @importFrom magrittr %>% #' @importFrom R6 R6Class -#' @importFrom data.table data.table := is.data.table setnames setorder rleid +#' @importFrom data.table data.table := is.data.table setnames setorder rleid rleidv #' @importFrom polmineR size s_attributes corpus decode #' @importFrom cwbtools s_attribute_encode registry_file_parse #' @importFrom RcppCWB cl_cpos2struc cl_struc2str @@ -58,6 +58,7 @@ #' @importFrom stringr str_to_lower str_length #' @importFrom stringdist stringdist #' @importFrom fuzzyjoin fuzzy_join +#' @importFrom cli cli_abort cli_alert_info cli_alert_success cli_alert_warning cli_progress_message cli_progress_update LTDataset <- R6Class( "LTDataset", @@ -66,48 +67,48 @@ LTDataset <- R6Class( # fields - #' @field textual_data a \code{character vector} of the corpus the IDs + #' @field textual_data a `character vector` of the corpus the IDs #' should be added to. - #' @field textual_data_type a \code{character vector} indicating the type of + #' @field textual_data_type a `character vector` indicating the type of #' textual data (cwb or xml). - #' @field external_resource a \code{data.frame or data.table} of an external + #' @field external_resource a `data.frame` or `data.table` of an external #' dataset the IDs are coming from. - #' @field attr_to_add a \code{character vector} indicating the ID that + #' @field attr_to_add a `character vector` indicating the ID that #' should be added. If named, the name indicates the desired name of the #' ID as an structural attribute in the corpus data. The value of the #' character vector indicates the column of the external dataset the ID is #' stored in. - #' @field split_by a \code{character vector} indicating a s-attribute the + #' @field split_by a `character vector` indicating an s-attribute the #' corpus data should be split by to reduce memory usage. - #' @field match_by a \code{character vector} indicating the metadata used to + #' @field match_by a `character vector` indicating the metadata used to #' match the corpus data and the external dataset.
If named, the names of #' the character vector indicate the names of the s-attributes in the #' corpus data and the values indicate the corresponding column names in #' the external dataset. - #' @field forced_encoding a \code{character vector} of the desired output + #' @field forced_encoding a `character vector` of the desired output #' encoding of the textual data. This might be useful when the original #' encoding and the locale differ. - #' @field attrs_by_region_dt a \code{data.table} of regions that should be + #' @field attrs_by_region_dt a `data.table` of regions that should be #' matched in the merge. - #' @field verbose \code{logical} whether or not to print messages. Defaults + #' @field verbose `logical` whether or not to print messages. Defaults #' to FALSE. - #' @field attribute_name_in_corpus a \code{character} the desired name of the + #' @field attribute_name_in_corpus a `character` the desired name of the #' attribute to add in the text data. - #' @field attribute_in_external_resource a \code{character} the name of the + #' @field attribute_in_external_resource a `character` the name of the #' attribute to add in the external resource. - #' @field text_dt a \code{data.table} in case of a cwb corpus, this + #' @field text_dt a `data.table` in case of a cwb corpus, this #' contains the decoded corpus. - #' @field encoding_method a \code{character vector} of the encoding method + #' @field encoding_method a `character vector` of the encoding method #' used to create the corpus (R or CWB). - #' @field values a \code{character vector} of values to encode after the + #' @field values a `character vector` of values to encode after the #' merge. - #' @field region_matrix a \code{matrix} of corpus positions corresponding to + #' @field region_matrix a `matrix` of corpus positions corresponding to #' the values in the value field. - #' @field cpos_left a \code{integer} of the left cpos boundary of the corpus + #' @field cpos_left an `integer` of the left cpos boundary of the corpus #' object. - #' @field cpos_right a \code{integer} of the right cpos boundary of the + #' @field cpos_right an `integer` of the right cpos boundary of the #' corpus object. - #' @field missing_after_check a \code{data.table} of observations still + #' @field missing_after_check a `data.table` of observations still #' missing after the manual check textual_data = NULL, textual_data_type = NULL, @@ -121,7 +122,7 @@ LTDataset <- R6Class( attribute_name_in_corpus = NULL, attribute_in_external_resource = NULL, - + text_dt = NULL, encoding_method = "R", values = NULL, @@ -129,11 +130,11 @@ LTDataset <- R6Class( cpos_left = NULL, cpos_right = NULL, missing_after_check = NULL, - + # methods # R6 documentation with Roxygen - - #' @description initialize a new object of class \code{LTDataset}. + + #' @description initialize a new object of class `LTDataset`. #' @param textual_data a character vector of the corpus the IDs should be #' added to. #' @param textual_data_type a character vector indicating the type of @@ -164,27 +165,27 @@ LTDataset <- R6Class( split_by = NULL, verbose = TRUE, forced_encoding = NULL){ - + stopifnot(!is.null(textual_data), !is.null(textual_data_type), is.data.frame(external_resource), is.character(attr_to_add), is.character(match_by)) - + self$textual_data <- textual_data self$textual_data_type <- textual_data_type self$split_by <- split_by self$forced_encoding <- forced_encoding - + # if the columns are the same, there might not be a name.
if (is.null(names(match_by))) { names(match_by) <- match_by } - + self$match_by <- match_by - + # Checks which are independent from the corpus file format - + if (!is.null(names(attr_to_add))) { self$attribute_name_in_corpus <- names(attr_to_add) self$attribute_in_external_resource <- as.character(attr_to_add) @@ -195,13 +196,13 @@ LTDataset <- R6Class( self$attribute_in_external_resource <- attr_to_add names(self$attribute_in_external_resource) <- attr_to_add # naming is just to rename the columns accordingly. } - + if (!is.null(self$textual_data) & self$textual_data_type == "cwb") { - + if (!is.character(self$textual_data)) { # if textual_data is no character, it might be a subcorpus, etc. Take - # the corpus from slot? + # the corpus from slot. if ("corpus" %in% slotNames(self$textual_data)) { self$cpos_left = self$textual_data@cpos[1, 1] # take first row, first column from the cpos slot of the object @@ -216,28 +217,34 @@ if (!is.null(self$textual_data) & self$attribute_name_in_corpus %in% polmineR::s_attributes(self$textual_data)) { + cli_abort(c("x" = "Attribute {.var {self$attribute_name_in_corpus}} that should be added already exists.")) + # the new var already exists as an s-attribute. What to do? modify, # overwrite, stop? Modification would be best, keeping entries which aren't # "NA"? } - if (any(!names(self$match_by) %in% polmineR::s_attributes(self$textual_data))) stop("... not all structural attributes in match_by are in corpus.") - if (any(!as.character(self$match_by) %in% colnames(external_resource))) stop("... not all variables in match_by are in the external dataset.") + if (any(!names(self$match_by) %in% polmineR::s_attributes(self$textual_data))) cli_abort(c("x" = "Not all structural attributes in match_by are in corpus.")) + if (any(!as.character(self$match_by) %in% colnames(external_resource))) cli_abort(c("x" = "Not all variables in match_by are in the external dataset.")) if (!is.null(self$split_by)) { - if (!self$split_by %in% polmineR::s_attributes(self$textual_data)) stop(sprintf("... the variable defined in split_by (%s) is no structural attribute in %s", self$split_by, self$textual_data)) + if (!self$split_by %in% polmineR::s_attributes(self$textual_data)) { + cli_abort(c("x" = "The variable {.var {self$split_by}} defined in split_by is not a structural attribute in {.var {self$textual_data}}.")) + } } - } else if (is.character(self$textual_data) & self$textual_data_type == "xml") { - # in this case, it is assumed that the input is a path to either a xml directory or an xml_file - stop("xml not yet implemented.") + # in this case, it is assumed that the input is a path to either an xml + # directory or an xml_file. + cli_abort(c("x" = "xml not yet implemented.")) + } else if (is.character(self$textual_data) & self$textual_data_type == "quanteda") { - stop("quanteda corpora not yet implemented.") + cli_abort(c("x" = "quanteda corpora not yet implemented.")) } else { - stop("Type not provided or not supported.") + cli_abort(c("x" = "Type {.var {self$textual_data_type}} not yet implemented.")) } - + if (!data.table::is.data.table(external_resource)) { external_resource <- data.table::as.data.table(external_resource) # otherwise these ".." assignments don't work. @@ -254,7 +261,7 @@ LTDataset <- R6Class( invisible(self) }, - #' @description print class \code{LTDataset}. + #' @description print class `LTDataset`.
print = function() { if (inherits(self$textual_data, "subcorpus")) { @@ -270,26 +277,32 @@ LTDataset <- R6Class( cat("LTDataset Object: \n") cat(" Corpus: ", corpus_obj_name, "\n", sep = "") cat(" Attribute to be added: ", self$attribute_name_in_corpus, "\n", sep = "") - cat(" Via: ", paste(names(self$match_by), collapse = ", "), "\n", sep = "") + cat(" via: ", paste(names(self$match_by), collapse = ", "), "\n", sep = "") invisible(self) }, #' @description perform the actual merge of the data. - #' @param na_value a \code{character vector} indicating which value + #' @param na_value a `character vector` indicating which value #' attributes should have that aren't merged. + #' @references When formulating the data.table join functions in + #' `join_textual_and_external_data()`, the following Stack Overflow links + #' were useful: + #' https://stackoverflow.com/questions/45043600/merging-all-column-by-reference-in-a-data-table + #' and https://stackoverflow.com/questions/44433451/r-data-table-update-join join_textual_and_external_data = function(na_value = "NA") { if (self$textual_data_type == "cwb") { if (!is.null(self$split_by)) { + if (self$verbose) cli_alert_info("Splitting data by {.var {self$split_by}}") + # we create a data.table which already has all rows and columns and # update-join the real data later. This should be a bit faster and more # memory efficient. - - if (self$verbose) message("... creating empty cpos data.table.") - self$text_dt <- data.table::data.table(cpos = self$cpos_left:self$cpos_right, - key = "cpos") + + self$text_dt <- data.table::data.table(cpos = self$cpos_left:self$cpos_right, + key = "cpos") self$text_dt[, eval(names(self$match_by)) := "NA"] self$text_dt[, eval(self$attribute_name_in_corpus) := na_value] @@ -305,259 +318,237 @@ LTDataset <- R6Class( # it might be a subcorpus, etc. split_text_object <- self$textual_data %>% polmineR::split(s_attribute = self$split_by) - - # now we probably don't need the subcorpus anymore because we split - # before. - self$textual_data <- self$textual_data@corpus } - if (self$verbose) message("... for each split, retrieve s-attributes for comparison.") + if (self$verbose) cli_alert_info("Processing data in {length(split_text_object)} split{?s} based on s-attribute {.var {self$split_by}}") garbage <- lapply(1:length(split_text_object), function(i_split) { - if (self$verbose) message(sprintf("... processing split %s out of %s based on s-attribute %s.", i_split, length(split_text_object), self$split_by)) - - # decode doesn't work that well for nested corpora with attributes on - # both protocol and speaker level. Use RcppCWB directly. - - if (self$verbose) message("...... extract s-attributes the data should be matched by.") + if (self$verbose) cli_alert_info("Processing split {i_split}") current_split_cpos <- min(split_text_object[[i_split]]@cpos):max(split_text_object[[i_split]]@cpos) + if (is.character(self$textual_data)) { + corpus_name <- self$textual_data + } else { + corpus_name <- self$textual_data@corpus + } + s_attr_streams_for_split <- lapply(names(self$match_by), function(s_attr) { - retval <- RcppCWB::cl_cpos2struc(corpus = self$textual_data, - s_attribute = s_attr, - cpos = current_split_cpos) %>% - RcppCWB::cl_struc2str(corpus = self$textual_data, - s_attribute = s_attr, + retval <- RcppCWB::cl_cpos2struc(corpus = corpus_name, + s_attribute = s_attr, + cpos = current_split_cpos) %>% + RcppCWB::cl_struc2str(corpus = corpus_name, + s_attribute = s_attr, struc = .) 
# sometimes if the corpus is latin1 and the locale is UTF-8 - if (!is.null(self$forced_encoding)) retval <- iconv(retval, polmineR::encoding(self$textual_data), self$forced_encoding) + if (!is.null(self$forced_encoding)) retval <- iconv(retval, polmineR::encoding(corpus_name), self$forced_encoding) return(retval) } ) - if (self$verbose) message("...... bind decoded s-attributes.") - s_attr_streams_for_split_df <- do.call("cbind", s_attr_streams_for_split) %>% as.data.table() %>% data.table::setnames(names(self$match_by)) rm(s_attr_streams_for_split) - if (self$verbose) message("...... add cpos to split decoded stream.") - # add cpos and split from above s_attr_streams_for_split_df[, cpos := current_split_cpos] - if (self$verbose) message(sprintf('...... split based on s-attribute %s with column names "%s" finished.', self$split_by, paste(names(s_attr_streams_for_split_df), collapse = ", "))) - # now merge. # this should be done in the split because this join can be quite costly. - - if (self$verbose) message(sprintf('...... merge external attribute with decoded split stream.')) - + s_attr_streams_for_split_df[self$external_resource, on = names(self$match_by), (self$attribute_name_in_corpus) := get(paste0("i.", self$attribute_name_in_corpus))] - + # Note: We renamed the columns in the external resource above which is why # we use the self$attribute_name_in_corpus name for both sides. - - if (self$verbose) message("...... bind split stream to the nearly empty data.table.") + data.table::setkey(s_attr_streams_for_split_df, cpos) - # https://stackoverflow.com/questions/45043600/merging-all-column-by-reference-in-a-data-table - cols_to_add <- setdiff(names(s_attr_streams_for_split_df), "cpos") self$text_dt[s_attr_streams_for_split_df, (cols_to_add) := mget(paste0("i.", cols_to_add))] - # https://stackoverflow.com/questions/44433451/r-data-table-update-join - - rm(s_attr_streams_for_split_df) + return(NULL) } ) - } else { + cli_alert_success("Processed {length(split_text_object)} split{?s} of data.") - if (self$verbose) message("... decoding the entire object at once.") - - self$text_dt <- polmineR::decode(self$textual_data, - s_attributes = names(self$match_by), - p_attributes = character(), - to = "data.table") - - if (self$verbose) message(sprintf('...... merge external attribute with decoded token stream.')) + } else { - self$text_dt[self$external_resource, on = names(self$match_by), - (self$attribute_name_in_corpus) := get(paste0("i.", self$attribute_name_in_corpus))] + self$text_dt <- polmineR::decode(self$textual_data, + s_attributes = names(self$match_by), + p_attributes = character(), + to = "data.table") - # it might make sense to use the approach from above here anyway + # we remove struc which introduces a lot more regions than probably necessary. + if ("struc" %in% colnames(self$text_dt)) { + self$text_dt[, struc := NULL] + } - } + self$text_dt[self$external_resource, on = names(self$match_by), + (self$attribute_name_in_corpus) := get(paste0("i.", self$attribute_name_in_corpus))] + + } # if CWB, then make region matrix self$external_attribute_to_region_matrix() } else { - message("... not yet implemented.") + cli_abort("Processing data type {.var {self$textual_data_type}} not yet implemented.") } - + + cli_alert_success("Joined textual and external data.") + invisible(self) }, #' @description transform the matched data to a matrix for encoding. external_attribute_to_region_matrix = function() { - if (self$verbose) message("... 
reduce data.table to minimum.") - - columns_to_drop <- setdiff(colnames(self$text_dt), c("cpos", self$attribute_name_in_corpus)) - self$text_dt[, (columns_to_drop) := NULL] - - # if only parts of the corpus are encoded, there might be corpus positions - # not covered by self$cpos_left:self$cpos_right. It is probably wise to - # fill the corpus positions. - - # Note: If the object is a subcorpus, this only applies to the range of - # the subcorpus. This is why the size of the corpus, not the object, is - # taken now. - - if (isTRUE(is.character(self$textual_data))) { - all_cpos_in_corpus <- 0L:(polmineR::size(self$textual_data) - 1) - } else { - # otherwise, its a subcorpus and the name of the corpus is taken from - # the slot. - all_cpos_in_corpus <- 0L:(polmineR::size(self$textual_data@corpus) - 1) - } - - cpos_missing <- which(!all_cpos_in_corpus %in% self$cpos_left:self$cpos_right) - 1 - - # This is slow but setdiff seems to be even slower here - - # Note: Actually one should probably allow for the additions of new IDs - # even if some already exist? This could be done by looking if the corpus - # to add is already in the corpus. - - if (self$attribute_name_in_corpus %in% polmineR::s_attributes(self$textual_data)) { - stop("... attribute to add already exists.") - } - - if (length(cpos_missing) > 0) { - # make data.table with all missing cpos and rbind - - missing_dt <- data.table::data.table( - cpos = cpos_missing - ) - - missing_dt[, (self$attribute_name_in_corpus) := NA] - self$text_dt <- rbind(self$text_dt, missing_dt) - data.table::setorder(self$text_dt, cpos) - } + if (self$verbose) cli_alert_info("Preparing region matrix for encoding.") cpos_vec <- self$text_dt[["cpos"]] - if (self$verbose) message("... preparing breaks for the encoding process.") - - breaks <- match(self$text_dt[[self$attribute_name_in_corpus]], - unique(self$text_dt[[self$attribute_name_in_corpus]])) - 1 - - breaks_rle <- rle(breaks) - + # prepare breaks every time the speaker changes + self$text_dt[, row_index := data.table::rleidv(self$text_dt, cols = setdiff(colnames(self$text_dt), c("cpos", self$attribute_name_in_corpus)))] + + breaks <- match(unique(self$text_dt$row_index), self$text_dt$row_index) + cpos_breaks <- c(cpos_vec[breaks], cpos_vec[length(cpos_vec)]) + id_factor <- cut( x = cpos_vec, - breaks = cumsum(c(0, breaks_rle$lengths)), # adding the first one here is important + breaks = cpos_breaks, include.lowest = TRUE, right = FALSE ) - rm(breaks, breaks_rle) - - if (self$verbose) message("... preparing the region matrix for the encoding process.") - id_cpos <- unname(split(x = cpos_vec, f = id_factor)) self$region_matrix <- do.call(rbind, lapply(id_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)]))) - # now we need to get the values of the ids per row of the region matrix - self$text_dt[, row_in_region_matrix := data.table::rleid(get(self$attribute_name_in_corpus))] - - # remove cpos column and then unique self$text_dt[, cpos := NULL] self$text_dt <- unique(self$text_dt) # this might not be very memory effective self$values <- as.character(self$text_dt[[self$attribute_name_in_corpus]]) - # last sanity checks - + # Last Sanity Checks + # the last cpos of the region matrix must be equal to the size of - # the corpus (-1 for 0-indexation) - - stopifnot(self$region_matrix[nrow(self$region_matrix), 2] == (polmineR::size(self$textual_data) - 1)) - + # the corpus (-1 for 0-indexation). 
+ + cpos_start <- self$region_matrix[1, 1] + cpos_end <- self$region_matrix[nrow(self$region_matrix), 2] + + if ((cpos_end - cpos_start) != (polmineR::size(self$textual_data) - 1)) { + cli_abort(c("x" = "The region matrix does not cover the full cpos range of the corpus.")) + } + # there must be as many values as regions - stopifnot(length(self$values) == nrow(self$region_matrix)) + if (length(self$values) != nrow(self$region_matrix)) { + cli_abort(c("x" = "The number of values and regions is not the same.")) + } + invisible(self) }, #' @description encodes the new attribute as a structural attribute (for CWB). - #' @param add_temporarily whether to add the structural attribute temporarily - #' or permanently. #' @return A new `LTDataset` object. - encode_new_s_attribute = function(add_temporarily) { + encode_new_s_attribute = function() { + # if this is only a subcorpus, then we might want to add all cpos which + # are not part of the subcorpus and add a value of "NA" to the missing + # cpos. + + self$add_missing_regions() + + # replace all missing/NA values with literal "NA". Not done earlier + # because in the shiny application user input is still possible. + + self$values[is.na(self$values)] <- "NA" + corpus_name <- ifelse(is.character(self$textual_data), self$textual_data, self$textual_data@corpus) - - data_dir_path <- cwbtools::registry_file_parse(corpus = corpus_name)[["home"]] - registry_dir_path <- RcppCWB::corpus_registry_dir(corpus_name) - - if (isTRUE(add_temporarily)) { - delete_param <- TRUE + corpus_registry <- RcppCWB::corpus_registry_dir(corpus = corpus_name) + + corpus_encoding <- cwbtools::registry_file_parse(corpus = corpus_name, + registry = corpus_registry)[["properties"]][["charset"]] + + if (is.character(self$textual_data)) { + data_dir_path <- RcppCWB::corpus_data_dir(corpus = self$textual_data, + registry = corpus_registry) } else { - delete_param <- FALSE + data_dir_path <- RcppCWB::corpus_data_dir(corpus = self$textual_data@corpus, + registry = corpus_registry) } - - if (self$verbose) message("... start encoding the s-attribute.") - - s_attribute_encode( + + if (self$verbose) cli_alert_info("Start encoding new attribute.") + + cwbtools::s_attribute_encode( values = self$values, data_dir = data_dir_path, s_attribute = tolower(self$attribute_name_in_corpus), corpus = corpus_name, region_matrix = self$region_matrix, method = self$encoding_method, - registry_dir = registry_dir_path, - encoding = registry_file_parse(corpus = corpus_name)[["properties"]][["charset"]], - delete = delete_param, + registry_dir = corpus_registry, + encoding = corpus_encoding, + delete = TRUE, verbose = FALSE ) - - if (self$verbose) message("... done encoding the s-attribute.") + + if (self$verbose) cli_alert_success("Done encoding new attribute.") + invisible(self) }, #' @description returns a data.table of matched attributes. Can be used for - #' further manual checks. - #' @param verbose \code{logical} - create_attribute_region_datatable = function(verbose = FALSE) { - - if (verbose) message("... get attributes which were used for matching.") - - attrs_by_region <- lapply(names(self$match_by), function(attr_to_check) { - retval <- RcppCWB::cl_cpos2struc(self$textual_data, s_attribute = attr_to_check, cpos = self$region_matrix[, 1]) %>% - RcppCWB::cl_struc2str(corpus = self$textual_data, s_attribute = attr_to_check, struc = .) + #' further manual checks. 
+ #' @param additional_attributes a `character vector` of additional + #' structural attributes which should be considered when evaluating the + #' results of the linkage. + #' @details `additional_attributes` might be useful when information of a + #' dataset is added which only covers a part of the corpus, such as a + #' specific period of time or specific groups of speakers. + create_attribute_region_datatable = function(additional_attributes = NULL) { + + if (!is.null(additional_attributes)) { + stopifnot(additional_attributes %in% polmineR::s_attributes(self$textual_data)) + } + + attributes_to_check <- c(names(self$match_by), additional_attributes) + + if (is.character(self$textual_data)) { + corpus_name <- self$textual_data + cwb_registry <- RcppCWB::corpus_registry_dir(self$textual_data) + } else { + corpus_name <- self$textual_data@corpus + cwb_registry <- RcppCWB::corpus_registry_dir(self$textual_data@corpus) + } + + attrs_by_region <- lapply(attributes_to_check, function(attr_to_check) { + retval <- RcppCWB::cl_cpos2struc(corpus_name, s_attribute = attr_to_check, cpos = self$region_matrix[, 1], registry = cwb_registry) %>% + RcppCWB::cl_struc2str(corpus = corpus_name, s_attribute = attr_to_check, struc = ., registry = cwb_registry) - if (!is.null(self$forced_encoding)) retval <- iconv(retval, polmineR::encoding(self$textual_data), self$forced_encoding) + if (!is.null(self$forced_encoding)) retval <- iconv(retval, polmineR::encoding(corpus_name), self$forced_encoding) return(retval) - + } ) - + self$attrs_by_region_dt <- data.table::as.data.table(do.call("cbind", attrs_by_region)) rm(attrs_by_region) - colnames(self$attrs_by_region_dt) <- names(self$match_by) + if (!is.null(additional_attributes)) { + colnames(self$attrs_by_region_dt) <- c(names(self$match_by), additional_attributes) + } else { + colnames(self$attrs_by_region_dt) <- names(self$match_by) + } + self$attrs_by_region_dt[, (self$attribute_name_in_corpus) := self$values] invisible(self) @@ -565,27 +556,63 @@ #' @description check the completeness of the merging operation and add #' missing values interactively. - #' @param check_for_groups a \code{character vector}; in case not all - #' elements are expected to have been matched, filter which elements - #' should be checked. - #' @param modify \code{logical}; whether missing values should not only be + #' @param check_for_groups a named `list`; in case not all elements are + #' expected to have been matched, select which elements should be checked. + #' @param negate `logical`; whether the selection in `check_for_groups` + #' should be negated, i.e. if these elements should be filtered instead of + #' selected. + #' @param modify `logical`; whether missing values should not only be #' inspected but also modified interactively. - #' @param match_fuzzily_by a \code{character vector}; if not NULL, a fuzzy + #' @param match_fuzzily_by a `character vector`; if not NULL, a fuzzy #' match will be performed on the column indicated. - #' @param doc_dir a \code{character vector}; Indicating a directory in which + #' @param doc_dir a `character vector`; Indicating a directory in which #' a text file is created which documents manual changes to the merge. - #' @param verbose \code{logical}; whether to print more comprehensive + #' @param ignore_case `logical` whether to ignore case when matching. 
This is an + #' argument of `stringdist_join` which is used in + #' `fuzzy_join_missing_values` + #' @param dist_method A `character`; the distance measure used by + #' `stringdist_join`. See `stringdist-metrics` in the `stringdist` package + #' which is used in `fuzzy_join_missing_values`. + #' @param max_dist An `integer` value indicating the maximum distance + #' between the two input vectors. This is an argument of `stringdist_join` + #' which is used in `fuzzy_join_missing_values`. + #' @param verbose `logical`; whether to print more comprehensive #' messages. - check_and_add_missing_values = function(check_for_groups = NULL, - modify = FALSE, + #' @references For fuzzy matching `fuzzyjoin` is a crucial dependency: + #' + #' David Robinson (2020). fuzzyjoin: Join Tables Together on + #' Inexact Matching. R package version 0.1.6. + #' https://CRAN.R-project.org/package=fuzzyjoin + #' @references Using `stringdist_join` in combination with `fuzzy_join` in + #' `fuzzy_join_missing_values()` was inspired by the second answer here: + #' https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching + #' @references Using `formals` to reset the default values of + #' `stringdist_join` in `fuzzy_join_missing_values()` was inspired by the + #' second answer here: + #' https://stackoverflow.com/questions/27673415/store-function-arguments-inside-the-function-and-apply-them-for-future-use-in-r + #' @references Large parts of `add_missing_attributes_via_shiny()` are taken + #' from the implementation of rhandsontable and shinyWidgets in: Blaette, + #' Andreas (2023). polmineR: Verbs and Nouns for Corpus Analysis. R + #' package version v0.8.8. + check_and_add_missing_values = function(check_for_groups = NULL, + negate = FALSE, + modify = FALSE, match_fuzzily_by = NULL, - doc_dir = NULL, + doc_dir = NULL, + ignore_case = TRUE, + dist_method = "lv", + max_dist = 4L, verbose = TRUE) { if (!is.null(check_for_groups)) { - if (verbose) message("... subsetting by groups which should have been matched.") + if (verbose) cli_alert_info("Subsetting data by groups which should have been matched.") for (i in 1:length(check_for_groups)) { - attrs_by_region_dt_min <- self$attrs_by_region_dt[get(names(check_for_groups)[[i]]) == check_for_groups[[i]], ] + group_name <- names(check_for_groups)[[i]] + if (isTRUE(negate)) { + attrs_by_region_dt_min <- self$attrs_by_region_dt[!get(group_name) %in% check_for_groups[[i]], ] + } else { + attrs_by_region_dt_min <- self$attrs_by_region_dt[get(group_name) %in% check_for_groups[[i]], ] + } attrs_by_region_dt_min <- unique(attrs_by_region_dt_min[is.na(get(self$attribute_name_in_corpus))]) } } else { @@ -594,20 +621,16 @@ if (nrow(attrs_by_region_dt_min) > 0) { - if (modify) { - - # if there are missing values and modify is TRUE, then we might add values - # manually. 
+ if (isTRUE(modify)) { - add_manually <- menu(title = "After inspecting the results, do you want to add values for the missing attributes manually?\n\nThese additions are added to the value vector used for encoding.\n\nThese manual additions are documented in a log file.\n\nAlternatively, modify ID resource and redo?", + add_manually <- menu(title = "Do you want to add missing attributes manually?", choices = c("Yes", "No")) if (!exists("add_manually") || add_manually == 2) { # do nothing } else { - if (is.null(doc_dir)) stop("No existing directory provided.") - + if (is.null(doc_dir)) cli_abort(c("x" = "No existing directory provided.")) if (!is.null(match_fuzzily_by)) { @@ -615,80 +638,81 @@ LTDataset <- R6Class( # attribute is matched via fuzzy matching while all other # variables are used for literate matching. - attrs_by_region_dt_min <- self$fuzzy_join_missing_values(attrs_by_region_dt_min, match_fuzzily_by) - + attrs_by_region_dt_min_mod <- self$fuzzy_join_missing_values(attrs_by_region_dt_min = attrs_by_region_dt_min, + match_fuzzily_by = match_fuzzily_by, + ignore_case = ignore_case, + dist_method = dist_method, + max_dist = max_dist) + + attrs_by_region_dt_min_mod <- self$add_missing_attributes_via_shiny(y = attrs_by_region_dt_min_mod, + doc_dir = doc_dir) + } else { + + attrs_by_region_dt_min_mod <- self$add_missing_attributes_via_shiny(y = attrs_by_region_dt_min, + doc_dir = doc_dir) } - - # make shiny / rhandsontable with log: LTD$attribute_name_in_corpus - # added manually in the following instances: ... (simply by subsetting - # changed rows?) - - attrs_by_region_dt_min <- self$add_missing_attributes_via_shiny(y = attrs_by_region_dt_min, doc_dir = doc_dir) - + # first check if any of the values should not be kept - keep_not_idx <- which(attrs_by_region_dt_min[["keep"]] == FALSE) - if (length(keep_not_idx) > 0) attrs_by_region_dt_min <- attrs_by_region_dt_min[-keep_not_idx, ] - + keep_not_idx <- which(attrs_by_region_dt_min_mod[["keep"]] == FALSE) + if (length(keep_not_idx) > 0) attrs_by_region_dt_min_mod <- attrs_by_region_dt_min_mod[-keep_not_idx, ] + # then we need to check if each speaker only occurs once with the # actual attribute columns. cols_to_keep_after_shiny <- c(names(self$match_by), self$attribute_name_in_corpus) - - if (nrow(unique(attrs_by_region_dt_min[, ..cols_to_keep_after_shiny])) != nrow(attrs_by_region_dt_min)) { - stop("... 
there seems to be a problem as there are more rows than unique attribute combinations, meaning that a single value has been added more than once.") - - # IDEALLY THIS WOULD TRIGGER THE SHINY APPLICATION AGAIN + + if (nrow(unique(attrs_by_region_dt_min_mod[, ..cols_to_keep_after_shiny])) != nrow(attrs_by_region_dt_min_mod)) { + cli_abort(c("x" = "There seems to be a problem as there are more rows than unique attribute combinations, meaning that a single value has been added more than once.")) } else { - attrs_by_region_dt_min <- attrs_by_region_dt_min[, ..cols_to_keep_after_shiny] + attrs_by_region_dt_min_mod <- attrs_by_region_dt_min_mod[, ..cols_to_keep_after_shiny] } - - attrs_by_region_dt_min_added <- attrs_by_region_dt_min[!is.na(get(self$attribute_name_in_corpus)), ] - self$missing_after_check <- attrs_by_region_dt_min[is.na(get(self$attribute_name_in_corpus)), ] - + + attrs_by_region_dt_min_added <- attrs_by_region_dt_min_mod[!is.na(get(self$attribute_name_in_corpus)), ] + self$missing_after_check <- attrs_by_region_dt_min_mod[is.na(get(self$attribute_name_in_corpus)), ] + # indicate that this is added attrs_by_region_dt_min_added[, added := TRUE] - + # we have to change the values in self$values here - join_cols <- setdiff(names(attrs_by_region_dt_min), self$attribute_name_in_corpus) - + join_cols <- setdiff(names(attrs_by_region_dt_min_mod), self$attribute_name_in_corpus) + self$attrs_by_region_dt[attrs_by_region_dt_min_added, - c((self$attribute_name_in_corpus), "added") := .(get(paste0("i.", (self$attribute_name_in_corpus))), i.added), - on = join_cols] - + c((self$attribute_name_in_corpus), "added") := .(get(paste0("i.", (self$attribute_name_in_corpus))), i.added), + on = join_cols] + # now which rows in attrs_by_region_dt are those added. - rows_to_add_idx <- self$attrs_by_region_dt[added == TRUE, which = TRUE] - + if (length(rows_to_add_idx) > 0) { - if (verbose) message("... adding manually identified attributes as values.") - + if (verbose) cli_alert_info("Adding manually identified attributes as values.") + # final sanity checks. # (in case something was added twice nevertheless) - if (nrow(self$attrs_by_region_dt) != length(self$values)) stop("... region table and values vector aren't the same length.") - if (!all(is.na(self$values[rows_to_add_idx]))) stop("... region table contains non NA values which aren't expected.") - + if (nrow(self$attrs_by_region_dt) != length(self$values)) cli_abort(c("x" = "Region table and values vector aren't the same length.")) + if (!all(is.na(self$values[rows_to_add_idx]))) cli_abort(c("x" = "Region table contains non NA values which aren't expected.")) + self$values[rows_to_add_idx] <- self$attrs_by_region_dt[added == TRUE, get(self$attribute_name_in_corpus)] } } } } - + invisible(self) }, - #' @description Add missing values with shiny and rhandsontable. Called by `check_and_add_missing_values()`. - #' @param y a \code{data.table} containing incomplete rows after the merge. - #' @param doc_dir a \code{character vector}; Indicating a directory in which + #' @description Add missing values with shiny and rhandsontable. Called by + #' `check_and_add_missing_values()`. + #' @param y a `data.table` containing incomplete rows after the merge. + #' @param doc_dir a `character vector`; Indicating a directory in which #' a text file is created which documents manual changes to the merge. 
add_missing_attributes_via_shiny = function(y, doc_dir) { - + # add keep column and set to TRUE per default y[, keep := TRUE] - - ##### This is taken from the internal and never really used - ##### WikiParliamentaryLookup package (check_and_correct_incomplete_mps.R) which - ##### is based on the implementation of rhandsontable and shinyWidgets used in - ##### polmineR. - + + # the first attempts to make this work were done in the internal + # and never really used WikiParliamentaryLookup package + # (check_and_correct_incomplete_mps.R). + # prepare shiny gadget .editing_gadget_ui <- function() { miniUI::miniPage( @@ -696,18 +720,18 @@ LTDataset <- R6Class( miniUI::miniContentPanel(shiny::fillRow(rhandsontable::rHandsontableOutput("hot"))) ) } - + server <- function(input, output, session) { values <- shiny::reactiveValues() reactiveData <- shiny::reactive(y) - + .reset_values <- function(df){ values[["hot"]] <- df y[, (self$attribute_name_in_corpus) := df[[self$attribute_name_in_corpus]]] y[, keep := df[["keep"]]] } - + output$hot <- rhandsontable::renderRHandsontable({ data <- reactiveData() # Identical result with rhandsontable:::isErrorMessage(data), @@ -721,7 +745,7 @@ LTDataset <- R6Class( rht <- rhandsontable::hot_col(rht, col = (1L:ncol(df))[which(!colnames(df) %in% c(self$attribute_name_in_corpus, "keep"))], readOnly = TRUE) rht }) - + shiny::observeEvent(input$done, shiny::stopApp(returnValue = y)) } @@ -729,7 +753,6 @@ LTDataset <- R6Class( y <- shiny::runGadget(.editing_gadget_ui(), server, viewer = shiny::paneViewer(minHeight = 550)) # now extract the added attributes and document. - document_log_file <- sprintf("%s/manual_addition_of_attribute_while_linking_%s.txt", doc_dir, Sys.Date()) @@ -751,124 +774,204 @@ LTDataset <- R6Class( #' @description Add missing values using a fuzzy join. Called by #' `check_and_add_missing_values()`. - #' @param attrs_by_region_dt_min a \code{data.table} containing incomplete - #' rows after the merge. - #' @param match_fuzzily_by a \code{character vector}; if not NULL, a fuzzy - #' match will be performed on the column indicated. - #' @references See the second answer here: - #' https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching - fuzzy_join_missing_values = function(attrs_by_region_dt_min, match_fuzzily_by) { - - message("... performing fuzzy matching.") - - # first, define a match function - # https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching - - match_fun_stringdist <- function(v1, v2) { - - ignore_case = TRUE - method = "lv" - max_dist = 4 - distance_col = "dist" - - if (ignore_case) { - v1 <- stringr::str_to_lower(v1) - v2 <- stringr::str_to_lower(v2) - } - - # maybe handle NAs like that? 
- if (any(is.na(v1))) v1[is.na(v1)] <- "" - if (any(is.na(v2))) v2[is.na(v2)] <- "" - - # shortcut for Levenshtein-like methods: if the difference in - # string length is greater than the maximum string distance, the - # edit distance must be at least that large - - # length is much faster to compute than string distance - if (method %in% c("osa", "lv", "dl")) { - length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2)) - include <- length_diff <= max_dist - - dists <- rep(NA, length(v1)) - - dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method) - } else { - # have to compute them all - dists <- stringdist::stringdist(v1, v2, method = method) - } - ret <- dplyr::tibble(include = (dists <= max_dist)) - if (!is.null(distance_col)) { - ret[[distance_col]] <- dists - } - ret - } - - fuzzy_in_textual_data <- as.character(self$match_by[which(names(self$match_by) == match_fuzzily_by)]) - fuzzy_in_external_data <- names(self$match_by)[which(names(self$match_by) == match_fuzzily_by)] - - non_fuzzy_in_textual_data <- as.character(self$match_by[which(names(self$match_by) != match_fuzzily_by)]) - non_fuzzy_in_external_data <- names(self$match_by)[which(names(self$match_by) != match_fuzzily_by)] - - by_x_vector <- c(fuzzy_in_textual_data, non_fuzzy_in_textual_data) - by_y_vector <- c(fuzzy_in_external_data, non_fuzzy_in_external_data) - - + #' @param attrs_by_region_dt_min a `data.table` containing incomplete rows + #' after the merge. + #' @param match_fuzzily_by a `character vector`; if not NULL, a fuzzy match + #' will be performed on the column indicated. + #' @param ignore_case `logical` whether to ignore case when matching. This is an + #' argument of `stringdist_join`. + #' @param dist_method A `character`; the distance measure used by + #' `stringdist_join`. See `stringdist-metrics` in the `stringdist` + #' package. + #' @param max_dist An `integer` value indicating the maximum distance + #' between the two input vectors. This is an argument of `stringdist_join`. + fuzzy_join_missing_values = function(attrs_by_region_dt_min, + match_fuzzily_by, + ignore_case = TRUE, + dist_method = "lv", + max_dist = 4L) { + + cli_alert_info("Performing fuzzy matching.") + + # column names in external data are renamed in the class slot. + fuzzy_var_name <- names(self$match_by)[match(match_fuzzily_by, names(self$match_by))] + non_fuzzy_var_name <- names(self$match_by)[which(!names(self$match_by) %in% match_fuzzily_by)] + + by_vector <- c(fuzzy_var_name, non_fuzzy_var_name) + target_var <- colnames(self$external_resource)[which(!colnames(self$external_resource) %in% by_vector)] + # this might not be very robust. 
- match_fun_list <- vector("list", length(c(fuzzy_in_textual_data, non_fuzzy_in_textual_data))) - - for (i in 1:length(fuzzy_in_textual_data)) { - match_fun_list[[i]] <- match_fun_stringdist + match_fun_list <- vector("list", length(c(fuzzy_var_name, non_fuzzy_var_name))) + + # replacing the default values of fuzzyjoin::stringdist_join + stringdist_join_match_fun_for_list <- self$stringdist_join_match_fun + + formals(stringdist_join_match_fun_for_list)[["ignore_case"]] <- ignore_case + formals(stringdist_join_match_fun_for_list)[["method"]] <- dist_method + formals(stringdist_join_match_fun_for_list)[["max_dist"]] <- max_dist + + for (i in 1:length(fuzzy_var_name)) { + match_fun_list[[i]] <- stringdist_join_match_fun_for_list } - - for (i in (length(fuzzy_in_textual_data) + 1):length(c(fuzzy_in_textual_data, non_fuzzy_in_textual_data))) { + + for (i in (length(fuzzy_var_name) + 1):length(c(fuzzy_var_name, non_fuzzy_var_name))) { match_fun_list[[i]] <- eval(`==`) } - + + # NA values in by_vector attributes cause errors in fuzzy_join. Issue a warning. + columns_with_na_bool <- apply(self$external_resource[, ..by_vector], MARGIN = 2, function(x) any(is.na(x))) + + if (any(columns_with_na_bool)) { + columns_with_na_names <- names(which(columns_with_na_bool)) + cli_alert_warning(text = "Found {length(columns_with_na_names)} column{?s} in external dataset with NA values in matching variable{?s} {.var {columns_with_na_names}}. It is likely that {.fn fuzzyjoin::fuzzy_join} will not work.", + wrap = TRUE) + } + attrs_by_region_dt_min_joined <- fuzzyjoin::fuzzy_join( x = attrs_by_region_dt_min, - y = self$external_resource, - by = list(x = by_x_vector, - y = by_y_vector - ), - + y = self$external_resource, + by = by_vector, match_fun = match_fun_list, mode = "left" ) - + # clean up. Remove all ".y" columns except for the fuzzily joined # attribute and all dist column. + attrs_by_region_dt_min_joined <- data.table::as.data.table(attrs_by_region_dt_min_joined) - columns_to_omit <- c(paste0(non_fuzzy_in_textual_data, ".y"), - paste0(c(fuzzy_in_textual_data, non_fuzzy_in_textual_data), ".dist") - ) - + + columns_to_omit <- c(paste0(by_vector, ".y"), paste0(by_vector, ".distance_col")) + + fuzzy_var_name_joined <- paste0(fuzzy_var_name, ".y") + columns_to_omit <- columns_to_omit[columns_to_omit != fuzzy_var_name_joined] attrs_by_region_dt_min_joined[, (columns_to_omit) := NULL] - + # then use the joined new attribute to add to the missing values - attribute_name_in_corpus_in_x <- paste0(self$attribute_name_in_corpus, ".x") attribute_name_in_corpus_in_y <- paste0(self$attribute_name_in_corpus, ".y") - + # we remove the column with the old, missing values attrs_by_region_dt_min_joined[, (attribute_name_in_corpus_in_x) := NULL] - + # and rename it accordingly - data.table::setnames(attrs_by_region_dt_min_joined, old = attribute_name_in_corpus_in_y, new = self$attribute_name_in_corpus) - + data.table::setnames(attrs_by_region_dt_min_joined, + old = attribute_name_in_corpus_in_y, + new = self$attribute_name_in_corpus) + # and do this for all the other textual data columns (".x") as well. 
data.table::setnames(attrs_by_region_dt_min_joined, old = paste0(names(self$match_by), ".x"), new = names(self$match_by)) - - # finally, we rename the added fuzzily matched match variable to make + + # finally, we rename the fuzzily matched variable to make this # transparent - data.table::setnames(attrs_by_region_dt_min_joined, old = paste0(fuzzy_in_external_data, ".y"), new = paste0(fuzzy_in_external_data, "_fuzzy_matched")) - data.table::setcolorder(attrs_by_region_dt_min_joined, c(names(self$match_by), - self$attribute_name_in_corpus, - paste0(fuzzy_in_external_data, "_fuzzy_matched"))) - + + data.table::setnames(attrs_by_region_dt_min_joined, + old = fuzzy_var_name_joined, + new = paste0(fuzzy_var_name, "_fuzzy_matched")) + + data.table::setcolorder(attrs_by_region_dt_min_joined, c(names(self$match_by), + paste0(fuzzy_var_name, "_fuzzy_matched"), + target_var)) + return(attrs_by_region_dt_min_joined) + }, + + #' @description The function used for those attributes which are matched + #' fuzzily. Called by `fuzzy_join_missing_values()`. This function is + #' identical to the "match_fun" of `stringdist_join()` of the `fuzzyjoin` + #' package with changed defaults. + #' @details The defaults were changed. "lv" is now the default method, + #' max_dist is 4 and ignore_case is now TRUE. + #' @param v1,v2 `character vectors` to be compared fuzzily. + #' @param ignore_case `logical` whether to ignore case when matching. This is an + #' argument of `stringdist_join`. + #' @param method A `character`; the distance measure used by + #' `stringdist_join`. See `stringdist-metrics` in the `stringdist` + #' package. + #' @param max_dist An `integer` value indicating the maximum distance + #' between the two input vectors. This is an argument of `stringdist_join`. + stringdist_join_match_fun = function(v1, + v2, + max_dist = 4L, + method = c("lv", "osa", "dl", + "hamming", "lcs", "qgram", + "cosine", "jaccard", "jw", + "soundex"), + ignore_case = TRUE) { + method <- match.arg(method) + + if (method == "soundex") { + max_dist <- 0.5 + } + + if (ignore_case) { + v1 <- stringr::str_to_lower(v1) + v2 <- stringr::str_to_lower(v2) + } + if (method %in% c("osa", "lv", "dl")) { + length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2)) + include <- length_diff <= max_dist + dists <- rep(NA, length(v1)) + dists[include] <- stringdist::stringdist(v1[include], + v2[include], + method = method) + } + else { + dists <- stringdist::stringdist(v1, v2, method = method) + } + ret <- tibble::tibble(include = (dists <= max_dist)) + ret[["distance_col"]] <- dists + return(ret) + }, + + #' @description The function checks if the entire corpus is covered by the + #' region matrix. If not, the missing regions of the corpus are added to + #' the region matrix and the value vector. This is needed to encode the + #' new data. + add_missing_regions = function() { + + if (self$verbose) cli_alert_info("Adding missing regions to region matrix and value vector.") + + if (isTRUE(is.character(self$textual_data))) { + + all_cpos_in_corpus <- 0L:(polmineR::size(self$textual_data) - 1) + + } else { + + # otherwise, it's a subcorpus and the name of the corpus is taken from + # the slot. 
+ + all_cpos_in_corpus <- 0L:(polmineR::size(self$textual_data@corpus) - 1) + } + + cpos_missing <- which(!all_cpos_in_corpus %in% self$cpos_left:self$cpos_right) - 1 + + if (length(cpos_missing) > 0) { + + # see stack overflow for cumsum: https://stackoverflow.com/questions/24837401/find-consecutive-values-in-vector-in-r + missing_consecutive <- split(cpos_missing, cumsum(c(1, diff(cpos_missing) != 1))) + missing_matrix <- t(sapply(missing_consecutive, function(x) c(min(x), max(x)))) + rownames(missing_matrix) <- NULL + + # bind it at the end of the region_matrix + self$region_matrix <- rbind(self$region_matrix, missing_matrix) + + # and add NA value to the values slot + self$values <- c(self$values, + rep(x = NA, times = nrow(missing_matrix))) + + # reorder by first cpos per region + order_vector <- order(self$region_matrix[, 1]) + self$region_matrix <- self$region_matrix[order_vector, ] + self$values <- self$values[order_vector] + + # last sanity check here + stopifnot(nrow(self$region_matrix) == length(self$values)) + } + + invisible(self) } ) ) diff --git a/R/LinkTools.R b/R/LinkTools.R index a82e39d..9f23b71 100644 --- a/R/LinkTools.R +++ b/R/LinkTools.R @@ -1,36 +1,23 @@ #' R-package 'LinkTools' -#' -#' Tools for linking data sets via shared unique identifiers. -#' -#' Four steps are integrated into this package: -#' +#' +#' Tools for linking data sets. +#' +#' Once finalized, four steps should be integrated into this package: +#' #' a) the preparation of data sets which should be linked, i.e. the -#' transformation into a comparable format and the assignment of shared unique -#' identifiers, -#' +#' transformation into a comparable format and the assignment of new +#' information, in particular shared unique identifiers, +#' #' b) the merge of data sets based on these identifiers, -#' -#' c) the encoding or enrichment of the data with three output formats +#' +#' c) the encoding or enrichment of the data with different output formats #' (data.table, XML or CWB). -#' +#' #' d) the package includes a wrapper for the Named Entity Linking of textual -#' data based on DBPedia Spotlight. +#' data based on DBpedia Spotlight. #' @keywords package #' @docType package #' @aliases LinkTools LinkTools-package #' @name LinkTools-package #' @rdname LinkTools-package NULL - -#' Stammdaten with WikiData-IDs -#' -#' A minimized version of the Stammdaten of the German Bundestag of the 13th and -#' 14th legislative period with added WikiData IDs retrieved via the Wikidata Query -#' Service and added party affiliations specific for the legislative period retrieved -#' from Wikipedia. For preparation see bt_stammdaten. -#' @source https://www.bundestag.de/services/opendata (Creation Date 2021-11-04) -#' @source https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(13._Wahlperiode) (Information Retrieved on 2021-11-23) -#' @source https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(14._Wahlperiode) (Information Retrieved on 2021-11-23) -#' @docType data -#' @keywords datasets -"stammdaten_wikidatafied_2022_02_01_min" diff --git a/README.Rmd b/README.Rmd index 98d1b85..384bcda 100644 --- a/README.Rmd +++ b/README.Rmd @@ -14,22 +14,194 @@ output: github_document ## About LinkTools +### Motivation -## Dependencies +`LinkTools` facilitates the linkage of datasets by providing accessible approaches to two tasks: Record Linkage and Entity Linking. 
It is developed in the measure [Linking Textual Data](https://www.konsortswd.de/en/konsortswd/the-consortium/services/linking-textual-data/) in [KonsortSWD](https://www.konsortswd.de/en/) within the [National Research Data Infrastructure Germany](https://www.nfdi.de/?lang=en) (NFDI) and has the goal to make linking textual data more accessible. -* running the Vignette requires the availability of the GermaParl CWB corpus +### Purpose +Once finalized, four steps should be integrated into this R package: -## Installation +* the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of new information, in particular shared unique identifiers +* the merge of datasets based on these identifiers +* the encoding or enrichment of the data with different output formats (data.table, XML or CWB) +* the package includes a wrapper for the Named Entity Linking of textual data based on DBpedia Spotlight -```{r install_from_github, eval = FALSE} -remotes::install_github("PolMine/LinkTools") +A major focus of this package is the provision of an intuitive workflow with transparency and robustness. Documentation, validity and user experience as well as training and education are at the heart of this development. Consequently, the processes of linkage and linking should be designed as approachable as possible - using GUIs and feedback - as well as robust and repeatable. + +## Current Status + +The Record Linkage functionality is maturing and the current state is documented in the package vignette. Record Linkage with CWB corpora is currently implemented. Entity Linking using [DBpedia Spotlight](https://www.dbpedia-spotlight.org/) as the backend is currently in development. + +## Core functionality of LinkTools + +```{r load_LinkTools_package} +library(LinkTools) ``` -```{r install_devversion_from_github, eval = FALSE} -remotes::install_github("PolMine/LinkTools", ref = "dev") +### Record Linkage + +The `LTDataset` class, an R6 class, is the main driver of the Record Linkage process within the `LinkTools` package. Aside from the wrangling of the textual data input, its core functionality includes the merge of metadata found in text corpora and external datasets. To this end, first, a strict merge of exact matches is performed. Thereafter, if observations could not be joined directly, a fuzzy matching approach is possible in which `LinkTools` suggests matches based on different measures of string distance. In this stage, the manual inspection of these suggestions and the addition of missing values is possible. + +The vignette shows this for larger data and a real external dataset. In the following a short artificial example is provided. It is assumed that the following two resources should be linked: + +### Text corpus + +```{r load_polmineR_and_germaparlmini} +library(polmineR) +use("polmineR") ``` -## Current Status +GermaParlMini is a sample corpus of the larger GermaParl corpus of parliamentary debates. It is provided by the `polmineR` R package. It contains a number of metadata such as the speaker name and the party of a speaker. To show the capabilities of the tool, a small sample of this dataset is used. 
+
+```{r subset_germaparlmini}
+germaparlmini_session <- polmineR::corpus("GERMAPARLMINI") |>
+  polmineR::subset(date == "2009-11-12")
+```
+
+The metadata used for linking looks like the following:
+
+```{r retrieve_attributes, echo = FALSE}
+speaker_s_attributes <- germaparlmini_session |>
+  polmineR::s_attributes(c("speaker", "party")) |>
+  data.table::setorder(speaker)

-* Currently implemented: Record Linkage with CWB corpora

+speaker_s_attributes |>
+  knitr::kable(format = "markdown")
+```
+
+### External Data
+
+To show the (fuzzy) matching possibilities of the package, an artificial dataset is created on the spot by modifying the speaker data found in the textual data and introducing some deviation. As a consequence, it contains most of the speakers in the textual data shown above as well as the same metadata, plus a variable called "id" which represents the additional information we want to add to the textual data. To showcase the fuzzy matching, this artificial dataset also contains some typos in the names of the speakers as well as a differently named column for the speaker names.
+
+For a real dataset, please see the package vignette.
+
+```{r make_artificial_dataset, echo = FALSE}
+set.seed(24)
+artificial_id_data <- data.table::copy(speaker_s_attributes)
+
+# to show what happens if not all observations can be matched, remove one
+# random row
+row_to_remove_idx <- sample(1:nrow(artificial_id_data), 1)
+artificial_id_data <- artificial_id_data[-row_to_remove_idx, ]
+
+# a single-purpose function to mildly modify speaker names
+scramble_name <- function(x) {
+  to_scramble <- sample(c(1, 2), 1)
+  # only modify about every second name
+  if (to_scramble == 1) {
+
+    # separate the first and last elements of the name
+    name_elements_split <- unlist(strsplit(x, " "))
+    last_name <- name_elements_split[length(name_elements_split)]
+    first_name <- name_elements_split[1:(length(name_elements_split) - 1)]
+
+    # split into characters
+    name_as_chars <- unlist(strsplit(last_name, ""))
+
+    # to scramble the area in the middle of the name, select two characters
+    # in the middle
+    min <- length(name_as_chars) - floor(length(name_as_chars) / 2)
+    max <- min + 1
+    middle_sector <- name_as_chars[min:max]
+    middle_sector_scrambled <- sample(middle_sector, length(middle_sector))
+
+    # get the rest of the name; very short names might cause errors here
+    start_sector <- name_as_chars[1:(min - 1)]
+    if (max == length(name_as_chars)) {
+      end_sector <- NULL
+    } else {
+      end_sector <- name_as_chars[(max + 1):length(name_as_chars)]
+    }
+
+    # paste again
+    last_name_mod <- paste(c(start_sector,
+                             middle_sector_scrambled,
+                             end_sector),
+                           collapse = "")
+
+    name_scrambled <- paste(c(first_name, last_name_mod), collapse = " ")
+    return(name_scrambled)
+  } else {
+    return(x)
+  }
+}
+
+artificial_id_data[, speaker := scramble_name(speaker), by = seq_len(nrow(artificial_id_data))]
+artificial_id_data[, id := sprintf("ID_%s", 1:nrow(artificial_id_data))]
+data.table::setnames(artificial_id_data, old = "speaker", new = "name")
+
+artificial_id_data |>
+  knitr::kable(format = "markdown")
+```
+
+### Step 1: Instantiating the `LTDataset` class
+
+The `LTDataset` class is instantiated with the names of the two resources and some additional information. The package vignette shows additional arguments. Also see `?LTDataset` for more in-depth documentation of these arguments.
+
+```{r instantiate_LTDataset}
+LTD <- LTDataset$new(textual_data = germaparlmini_session,
+                     textual_data_type = "cwb",
+                     external_resource = artificial_id_data,
+                     attr_to_add = c("id_in_corpus" = "id"),
+                     match_by = c("speaker" = "name",
+                                  "party" = "party"),
+                     forced_encoding = "UTF-8")
+```
+
+### Step 2: Join strictly
+
+With the shared attributes provided in the `match_by` argument, the two datasets are joined.
+
+```{r join_data}
+LTD$join_textual_and_external_data()
+```
+
+After this join, a data.table object can be created. It can be used to inspect the results of the join directly. Each row in which the ID column is "NA" was not matched.
+
+```{r create_region_datatable}
+LTD$create_attribute_region_datatable()
+```
+
+```{r print_region_datatable}
+LTD$attrs_by_region_dt |>
+  unique() |>
+  data.table::setorder(speaker) |>
+  knitr::kable(format = "markdown")
+```
+
+### Step 3: Check and add missing values
+
+If there are cases in which a match was not possible - for example because of slight differences between the two datasets - manual annotation supported by fuzzy matching is possible.
+
+In an interactive process, realized with `shiny`, suggestions based on the fuzzy matching of attributes are provided and can be checked manually. Remaining missing attributes can be added to the metadata of the corpus.
+
+```{r check_missing, eval = FALSE}
+LTD$check_and_add_missing_values(modify = TRUE,
+                                 match_fuzzily_by = "speaker",
+                                 doc_dir = tempdir())
+```
+
+The screenshot below illustrates this interactive element.
+
+![Interactive and manual control of suggestions in LinkTools v0.0.1.9005](LinkTools_interactive_matching_gui_README.png)
+
+A final method writes the new attribute back into the CWB corpus.
+
+## LinkTools and other Resources
+
+The purpose of `LinkTools` is to provide an integrated user experience for anyone who wants to link textual data with other types of data. The package will thus handle different data types which are important in the realm of textual data, taking care of preprocessing and the enrichment of the data in a robust and transparent manner. It provides access to different existing approaches to linking and linkage, lowering barriers to making use of available resources.
+
+Focusing on textual data, the code base of the [PolMine project](https://polmine.github.io/) is at the core of `LinkTools`. In particular, the R package [polmineR](https://cran.r-project.org/package=polmineR) is used to manage large CWB corpora. In the future, a broader scope of input will be covered. The strict merging in the record linkage workflow is mainly realized with [data.table](https://cran.r-project.org/package=data.table). The fuzzy - or probabilistic - joins are mainly realized using the R packages [fuzzyjoin](https://cran.r-project.org/package=fuzzyjoin) and [stringdist](https://cran.r-project.org/package=stringdist).
+
+## Dependencies
+
+Running the vignette requires the availability of the GermaParlMini CWB corpus and the `btmp` data package, which provides the external data for linking.
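+
+Neither GermaParlMini nor `btmp` are CRAN packages. A possible way to make both available is sketched below; note that the GitHub location of the `btmp` data package is an assumption, not a confirmed repository:
+
+```{r install_data_dependencies, eval = FALSE}
+# GermaParlMini ships with polmineR and is activated via use("polmineR");
+# the btmp data package is assumed to be installable from GitHub
+# (the repository name is hypothetical).
+install.packages("polmineR")
+remotes::install_github("PolMine/btmp")
+```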
+ +## Installation + +```{r install_from_github, eval = FALSE} +remotes::install_github("PolMine/LinkTools") +``` diff --git a/README.md b/README.md index 027beda..8f8a6e5 100644 --- a/README.md +++ b/README.md @@ -14,21 +14,269 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-red.svg)](http ## About LinkTools -## Dependencies +### Motivation -- running the Vignette requires the availability of the GermaParl CWB - corpus +`LinkTools` facilitates the linkage of datasets by providing accessible +approaches to two tasks: Record Linkage and Entity Linking. It is +developed in the measure [Linking Textual +Data](https://www.konsortswd.de/en/konsortswd/the-consortium/services/linking-textual-data/) +in [KonsortSWD](https://www.konsortswd.de/en/) within the [National +Research Data Infrastructure Germany](https://www.nfdi.de/?lang=en) +(NFDI) and has the goal to make linking textual data more accessible. -## Installation +### Purpose + +Once finalized, four steps should be integrated into this R package: + +- the preparation of datasets which should be linked, i.e. the + transformation into a comparable format and the assignment of new + information, in particular shared unique identifiers +- the merge of datasets based on these identifiers +- the encoding or enrichment of the data with different output formats + (data.table, XML or CWB) +- the package includes a wrapper for the Named Entity Linking of textual + data based on DBpedia Spotlight + +A major focus of this package is the provision of an intuitive workflow +with transparency and robustness. Documentation, validity and user +experience as well as training and education are at the heart of this +development. Consequently, the processes of linkage and linking should +be designed as approachable as possible - using GUIs and feedback - as +well as robust and repeatable. + +## Current Status + +The Record Linkage functionality is maturing and the current state is +documented in the package vignette. Record Linkage with CWB corpora is +currently implemented. Entity Linking using [DBpedia +Spotlight](https://www.dbpedia-spotlight.org/) as the backend is +currently in development. + +## Core functionality of LinkTools ``` r -remotes::install_github("PolMine/LinkTools") +library(LinkTools) ``` +### Record Linkage + +The `LTDataset` class, an R6 class, is the main driver of the Record +Linkage process within the `LinkTools` package. Aside from the wrangling +of the textual data input, its core functionality includes the merge of +metadata found in text corpora and external datasets. To this end, +first, a strict merge of exact matches is performed. Thereafter, if +observations could not be joined directly, a fuzzy matching approach is +possible in which `LinkTools` suggests matches based on different +measures of string distance. In this stage, the manual inspection of +these suggestions and the addition of missing values is possible. + +The vignette shows this for larger data and a real external dataset. In +the following a short artificial example is provided. It is assumed that +the following two resources should be linked: + +### Text corpus + ``` r -remotes::install_github("PolMine/LinkTools", ref = "dev") +library(polmineR) +use("polmineR") ``` -## Current Status + ## ✔ corpus loaded: GERMAPARLMINI (version: 0.1.0 | build date: 2023-04-16) + +GermaParlMini is a sample corpus of the larger GermaParl corpus of +parliamentary debates. It is provided by the `polmineR` R package. 
It +contains a number of metadata such as the speaker name and the party of +a speaker. To show the capabilities of the tool, a small sample of this +dataset is used. + +``` r +germaparlmini_session <- polmineR::corpus("GERMAPARLMINI") |> + polmineR::subset(date == "2009-11-12") +``` + +The metadata used for linking looks like the following: + +| speaker | party | +|:-----------------------|:----------------| +| Alexander Bonde | B90_DIE_GRUENEN | +| Barbara Hendricks | SPD | +| Birgitt Bender | B90_DIE_GRUENEN | +| Carl-Ludwig Thiele | FDP | +| Carola Reimann | SPD | +| Elisabeth Scharfenberg | B90_DIE_GRUENEN | +| Elke Ferner | SPD | +| Gerda Hasselfeldt | NA | +| Gesine Lötzsch | DIE_LINKE | +| Hermann Otto Solms | NA | +| Jens Spahn | CDU_CSU | +| Joachim Poß | SPD | +| Norbert Lammert | NA | +| Philipp Rösler | FDP | +| Rolf Koschorrek | CDU_CSU | +| Ulrike Flach | FDP | +| Wolfgang Schäuble | CDU_CSU | +| Wolfgang Zöller | CDU_CSU | + +### External Data + +To show the (fuzzy) matching possibilities of the package, an artificial +dataset is created on the spot. It is created by modifying the speaker +data found in the textual data by introducing some deviation. In +consequence, it contains most of the speakers in the textual data shown +above as well as the same metadata plus a variable called “ID” which +represents the additional information we want to add to the textual +data. To showcase the fuzzy matching, this artificial dataset also +contains some typos in the names of the speakers as well as a +differently named column for the speaker names. + +For a real dataset, please see the package vignette. + +| name | party | id | +|:-----------------------|:----------------|:------| +| Alexander Bodne | B90_DIE_GRUENEN | ID_1 | +| Barbara Hendricks | SPD | ID_2 | +| Birgitt Bender | B90_DIE_GRUENEN | ID_3 | +| Carl-Ludwig Thiele | FDP | ID_4 | +| Carola Reimann | SPD | ID_5 | +| Elisabeth Scharfenberg | B90_DIE_GRUENEN | ID_6 | +| Gerda Hassefleldt | NA | ID_7 | +| Gesine Lötzsch | DIE_LINKE | ID_8 | +| Hermann Otto Somls | NA | ID_9 | +| Jens Sphan | CDU_CSU | ID_10 | +| Joachim Poß | SPD | ID_11 | +| Norbert Lamemrt | NA | ID_12 | +| Philipp Rösler | FDP | ID_13 | +| Rolf Koschorrek | CDU_CSU | ID_14 | +| Ulrike Flcah | FDP | ID_15 | +| Wolfgang Schäuble | CDU_CSU | ID_16 | +| Wolfgang Zöller | CDU_CSU | ID_17 | + +### Step 1: Instantiating the `LTDataset` class -- Currently implemented: Record Linkage with CWB corpora +The `LTDataset` class is instantiated with the names of the two +resources and some additional information. The package vignette shows +some more arguments. Also see `?LTDataset` for more in-depth +documentation of these arguments. + +``` r +LTD <- LTDataset$new(textual_data = germaparlmini_session, + textual_data_type = "cwb", + external_resource = artificial_id_data, + attr_to_add = c("id_in_corpus" = "id"), + match_by = c("speaker" = "name", + "party" = "party"), + forced_encoding = "UTF-8") +``` + +### Step 2: Join strictly + +With the shared attributes provided in the `match_by` argument, the two +datasets are joined. + +``` r +LTD$join_textual_and_external_data() +``` + + ## ... decoding s_attribute speaker + + ## ... decoding s_attribute party + + ## ℹ Preparing region matrix for encoding. + + ## ✔ joined textual and external data. + +After this join, a data.table object can be created. This can be used to +inspect the results of the join directly. Each row in which the ID +column is “NA” was not matched. 
+ +``` r +LTD$create_attribute_region_datatable() +``` + +``` r +LTD$attrs_by_region_dt |> + unique() |> + data.table::setorder(speaker) |> + knitr::kable(format = "markdown") +``` + +| speaker | party | id_in_corpus | +|:-----------------------|:----------------|:-------------| +| Alexander Bonde | B90_DIE_GRUENEN | NA | +| Barbara Hendricks | SPD | ID_2 | +| Birgitt Bender | B90_DIE_GRUENEN | ID_3 | +| Carl-Ludwig Thiele | FDP | ID_4 | +| Carola Reimann | SPD | ID_5 | +| Elisabeth Scharfenberg | B90_DIE_GRUENEN | ID_6 | +| Elke Ferner | SPD | NA | +| Gerda Hasselfeldt | NA | NA | +| Gesine Lötzsch | DIE_LINKE | ID_8 | +| Hermann Otto Solms | NA | NA | +| Jens Spahn | CDU_CSU | NA | +| Joachim Poß | SPD | ID_11 | +| Norbert Lammert | NA | NA | +| Philipp Rösler | FDP | ID_13 | +| Rolf Koschorrek | CDU_CSU | ID_14 | +| Ulrike Flach | FDP | NA | +| Wolfgang Schäuble | CDU_CSU | ID_16 | +| Wolfgang Zöller | CDU_CSU | ID_17 | + +### Step 3: Check and add missing values + +If there are cases in which a match was not possible - for example +because of slight differences in the two datasets - manual annotation +supported by fuzzy matching is possible. + +In an interactive process, realized with `shiny`, suggestions based on +the fuzzy matching of attributes are provided and can be checked +manually. Remaining missing attributes can be added to the metadata of +the corpus. + +``` r +LTD$check_and_add_missing_values(modify = TRUE, + match_fuzzily_by = "speaker", + doc_dir = tempdir()) +``` + +The screenshot below should serve as an illustration of this interactive +element. + +![Interactive and manual control of suggestions in LinkTools +v0.0.1.9005](LinkTools_interactive_matching_gui_README.png) + +A final functionality allows to write the new attribute back into the +CWB corpus. + +## LinkTools and other Resources + +The purpose of `LinkTools` is to provide an integrated user experience +for interested persons wanting to link textual data with other types of +data. The package thus will handle different data types which are +important in the realm of textual data, taking care of preprocessing and +the enrichment of the data in a robust and transparent manner. It +provides access to different existing approaches of linking and linkage, +lowering barriers to make use of available resources. + +Focusing on textual data, the code base of the [PolMine +project](https://polmine.github.io/) is at the core of `LinkTools`. In +particular, the R package +[polmineR](https://cran.r-project.org/package=polmineR) is used to +manage large CWB corpora. In the future, a broader scope of input will +be covered. The strict merging in the record linkage workflow is mainly +realized with +[data.table](https://cran.r-project.org/package=data.table). The fuzzy - +or probabilistic - joins are mainly realized using the R packages +[fuzzyjoin](https://cran.r-project.org/package=fuzzyjoin) and +[stringdist](https://cran.r-project.org/package=stringdist). + +## Dependencies + +Running the Vignette requires the availability of the GermaParlMini CWB +corpus and the `btmp` data package which provides the external data for +linking. 
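+
+Neither GermaParlMini nor `btmp` are CRAN packages. A possible way to make
+both available is sketched below; note that the GitHub location of the
+`btmp` data package is an assumption, not a confirmed repository:
+
+``` r
+# GermaParlMini ships with polmineR and is activated via use("polmineR");
+# the btmp data package is assumed to be installable from GitHub
+# (the repository name is hypothetical).
+install.packages("polmineR")
+remotes::install_github("PolMine/btmp")
+```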
+
+## Installation
+
+``` r
+remotes::install_github("PolMine/LinkTools")
+```
diff --git a/data-raw/stammdaten_wikidatafied_2022-02-01.R b/data-raw/stammdaten_wikidatafied_2022-02-01.R
deleted file mode 100644
index c2b8a3f..0000000
--- a/data-raw/stammdaten_wikidatafied_2022-02-01.R
+++ /dev/null
@@ -1,23 +0,0 @@
-# prepare and select Stammdaten (2023-02-06)
-
-library(dplyr)
-library(tidyr)
-library(data.table)
-
-stammdaten_file <- read.csv("~/lab/gitlab/bt_stammdaten/stammdaten_wikidatafied_2022-02-0")
-
-stammdaten_wikidatafied_2022_02_01_min <- stammdaten_file |>
-  filter(lp %in% c(13, 14)) |>
-  select(c(first_name, name_adel, name_praefix, family_name, party_by_lp, lp, parliamentary_group, QID)) |>
-  mutate(role = "mp") |>
-  mutate(speaker_full_name = trimws(paste(first_name, name_adel, name_praefix, family_name))) |>
-  mutate(speaker_full_name = gsub("\\s+", " ", speaker_full_name)) |>
-  mutate(lp = as.character(lp)) |>
-  separate_rows(parliamentary_group, sep = "\\|") |>
-  select(speaker_full_name, party_by_lp, lp, role, QID) |>
-  rename("party" = "party_by_lp",
-         "speaker" = "speaker_full_name",
-         "wikidata_id" = "QID") |>
-  as.data.table()
-
-save(stammdaten_wikidatafied_2022_02_01_min, file = "~/lab/github/LinkTools/data/stammdaten_wikidatafied_2022_02_01_min.rda")
diff --git a/data/stammdaten_wikidatafied_2022_02_01_min.rda b/data/stammdaten_wikidatafied_2022_02_01_min.rda
deleted file mode 100644
index 35cb45b..0000000
Binary files a/data/stammdaten_wikidatafied_2022_02_01_min.rda and /dev/null differ
diff --git a/man/LTDataset.Rd b/man/LTDataset.Rd
index 3cafc3a..90ca561 100644
--- a/man/LTDataset.Rd
+++ b/man/LTDataset.Rd
@@ -18,6 +18,13 @@
of datasets which should be linked, i.e. the transformation into a
comparable format and the assignment of shared unique identifiers, b) the
merge of datasets based on these identifiers, c) the encoding or
enrichment of the data with three output formats (data.table, XML or CWB).
+
+\code{additional_attributes} might be useful when information from a
+dataset is added which only covers a part of the corpus, such as a
+specific period of time or specific groups of speakers.
+
+The defaults of the fuzzy matching were changed: "lv" is now the default
+method, max_dist is 4 and ignore_case is now TRUE.
}
\section{Usage}{
LTDataset <- LTDataset$new()
@@ -59,43 +66,66 @@
operation and add missing values interactively.}
to add missing values interactively.}
\item{\code{create_attribute_region_datatable()}}{returns a data.table of
matched attributes. Can be used for further manual checks.}
-\item{\code{encode_new_s_attribute(add_temporarily)}}{encodes the new attribute
+\item{\code{encode_new_s_attribute()}}{encodes the new attribute
as a structural attribute (for CWB)}
}
}
\references{
-See the second answer here:
+When formulating the data.table join functions in
+\code{join_textual_and_external_data()}, the following Stack Overflow links
+were useful:
+https://stackoverflow.com/questions/45043600/merging-all-column-by-reference-in-a-data-table
+and https://stackoverflow.com/questions/44433451/r-data-table-update-join
+
+For fuzzy matching, \code{fuzzyjoin} is a crucial dependency:
+
+David Robinson (2020). fuzzyjoin: Join Tables Together on
+Inexact Matching. R package version 0.1.6.
+https://CRAN.R-project.org/package=fuzzyjoin + +Using \code{stringdist_join} in combination with \code{fuzzy_join} in +\code{fuzzy_join_missing_values()} was inspired by the second answer here: https://stackoverflow.com/questions/48008903/combined-fuzzy-and-exact-matching + +Using \code{formals} to reset the default values of +\code{stringdist_join} in \code{fuzzy_join_missing_values()} was inspired by the +second answer here: +https://stackoverflow.com/questions/27673415/store-function-arguments-inside-the-function-and-apply-them-for-future-use-in-r + +Large parts of \code{add_missing_attributes_via_shiny()} are taken +from the implementation of rhandsontable and shinyWidgets in: Blaette, +Andreas (2023). polmineR: Verbs and Nouns for Corpus Analysis. R +package version v0.8.8. } \section{Public fields}{ \if{html}{\out{
}} \describe{ -\item{\code{textual_data}}{a \code{character vector} of the corpus the IDs +\item{\code{textual_data}}{a \verb{character vector} of the corpus the IDs should be added to.} -\item{\code{textual_data_type}}{a \code{character vector} indicating the type of +\item{\code{textual_data_type}}{a \verb{character vector} indicating the type of textual data (cwb or xml).} -\item{\code{external_resource}}{a \code{data.frame or data.table} of an external +\item{\code{external_resource}}{a \code{data.frame} or \code{data.table} of an external dataset the IDs are coming from.} -\item{\code{attr_to_add}}{a \code{character vector} indicating the ID that +\item{\code{attr_to_add}}{a \verb{character vector} indicating the ID that should be added. If named, the name indicates the desired name of the ID as an structural attribute in the corpus data. The value of the character vector indicates the column of the external dataset the ID is stored in.} -\item{\code{split_by}}{a \code{character vector} indicating a s-attribute the +\item{\code{split_by}}{a \verb{character vector} indicating a s-attribute the corpus data should be split by to reduce memory usage.} -\item{\code{match_by}}{a \code{character vector} indicating the metadata used to +\item{\code{match_by}}{a \verb{character vector} indicating the metadata used to match the corpus data and the external dataset. If named, the names of the character vector indicate the names of the s-attributes in the corpus data and the values indicate the corresponding column names in the external dataset.} -\item{\code{forced_encoding}}{a \code{character vector} of the desired output +\item{\code{forced_encoding}}{a \verb{character vector} of the desired output encoding of the textual data. This might be useful when the original encoding and the locale differ.} @@ -114,10 +144,10 @@ attribute to add in the external resource.} \item{\code{text_dt}}{a \code{data.table} in case of a cwb corpus, this contains the decoded corpus.} -\item{\code{encoding_method}}{a \code{character vector} of the encoding method +\item{\code{encoding_method}}{a \verb{character vector} of the encoding method used to create the corpus (R or CWB).} -\item{\code{values}}{a \code{character vector} of values to encode after the +\item{\code{values}}{a \verb{character vector} of values to encode after the merge.} \item{\code{region_matrix}}{a \code{matrix} of corpus positions corresponding to @@ -146,6 +176,8 @@ missing after the manual check} \item \href{#method-LTDataset-check_and_add_missing_values}{\code{LTDataset$check_and_add_missing_values()}} \item \href{#method-LTDataset-add_missing_attributes_via_shiny}{\code{LTDataset$add_missing_attributes_via_shiny()}} \item \href{#method-LTDataset-fuzzy_join_missing_values}{\code{LTDataset$fuzzy_join_missing_values()}} +\item \href{#method-LTDataset-stringdist_join_match_fun}{\code{LTDataset$stringdist_join_match_fun()}} +\item \href{#method-LTDataset-add_missing_regions}{\code{LTDataset$add_missing_regions()}} \item \href{#method-LTDataset-clone}{\code{LTDataset$clone()}} } } @@ -227,7 +259,7 @@ perform the actual merge of the data. \subsection{Arguments}{ \if{html}{\out{
}} \describe{ -\item{\code{na_value}}{a \code{character vector} indicating which value +\item{\code{na_value}}{a \verb{character vector} indicating which value attributes should have that aren't merged.} } \if{html}{\out{
}} @@ -249,17 +281,9 @@ transform the matched data to a matrix for encoding. \subsection{Method \code{encode_new_s_attribute()}}{ encodes the new attribute as a structural attribute (for CWB). \subsection{Usage}{ -\if{html}{\out{
}}\preformatted{LTDataset$encode_new_s_attribute(add_temporarily)}\if{html}{\out{
}} +\if{html}{\out{
}}\preformatted{LTDataset$encode_new_s_attribute()}\if{html}{\out{
}} } -\subsection{Arguments}{ -\if{html}{\out{
}} -\describe{ -\item{\code{add_temporarily}}{whether to add the structural attribute temporarily -or permanently.} -} -\if{html}{\out{
}} -} \subsection{Returns}{ A new \code{LTDataset} object. } @@ -271,13 +295,15 @@ A new \code{LTDataset} object. returns a data.table of matched attributes. Can be used for further manual checks. \subsection{Usage}{ -\if{html}{\out{
}}\preformatted{LTDataset$create_attribute_region_datatable(verbose = FALSE)}\if{html}{\out{
}} +\if{html}{\out{
}}\preformatted{LTDataset$create_attribute_region_datatable(additional_attributes = NULL)}\if{html}{\out{
}} } \subsection{Arguments}{ \if{html}{\out{
}} \describe{ -\item{\code{verbose}}{\code{logical}} +\item{\code{additional_attributes}}{\verb{a character vector} of additional +structural attributes which should be considered when evaluating the +results of the linkage.} } \if{html}{\out{
}} } @@ -291,9 +317,13 @@ missing values interactively. \subsection{Usage}{ \if{html}{\out{
}}\preformatted{LTDataset$check_and_add_missing_values( check_for_groups = NULL, + negate = FALSE, modify = FALSE, match_fuzzily_by = NULL, doc_dir = NULL, + ignore_case = TRUE, + dist_method = "lv", + max_dist = 4L, verbose = TRUE )}\if{html}{\out{
}} } @@ -301,19 +331,34 @@ missing values interactively. \subsection{Arguments}{ \if{html}{\out{
}}
\describe{
-\item{\code{check_for_groups}}{a \code{character vector}; in case not all
-elements are expected to have been matched, filter which elements
-should be checked.}
+\item{\code{check_for_groups}}{a named \code{list}; in case not all elements are
+expected to have been matched, select which elements should be checked.}
+
+\item{\code{negate}}{\code{logical}; whether the selection in \code{check_for_groups}
+should be negated, i.e. whether these elements should be filtered out
+instead of selected.}

\item{\code{modify}}{\code{logical}; whether missing values should not only be
inspected but also modified interactively.}

-\item{\code{match_fuzzily_by}}{a \code{character vector}; if not NULL, a fuzzy
+\item{\code{match_fuzzily_by}}{a \verb{character vector}; if not NULL, a fuzzy
match will be performed on the column indicated.}

-\item{\code{doc_dir}}{a \code{character vector}; Indicating a directory in which
+\item{\code{doc_dir}}{a \verb{character vector}; indicating a directory in which
a text file is created which documents manual changes to the merge.}

+\item{\code{ignore_case}}{\code{logical}; whether to ignore the case when matching.
+This is an argument of \code{stringdist_join} which is used in
+\code{fuzzy_join_missing_values}.}
+
+\item{\code{dist_method}}{a \code{character} vector; the measure of distance used by
+\code{stringdist_join}, which is used in \code{fuzzy_join_missing_values}. See
+\code{stringdist-metrics} in the \code{stringdist} package.}
+
+\item{\code{max_dist}}{an \code{integer} value indicating the maximum distance
+between the two input vectors. This is an argument of \code{stringdist_join}
+which is used in \code{fuzzy_join_missing_values}.}
+
\item{\code{verbose}}{\code{logical}; whether to print more comprehensive
messages.}
}
@@ -324,7 +369,8 @@
\if{html}{\out{}}
\if{latex}{\out{\hypertarget{method-LTDataset-add_missing_attributes_via_shiny}{}}}
\subsection{Method \code{add_missing_attributes_via_shiny()}}{
-Add missing values with shiny and rhandsontable. Called by \code{check_and_add_missing_values()}.
+Add missing values with shiny and rhandsontable. Called by
+\code{check_and_add_missing_values()}.
\subsection{Usage}{
\if{html}{\out{
}}\preformatted{LTDataset$add_missing_attributes_via_shiny(y, doc_dir)}\if{html}{\out{
}} } @@ -334,7 +380,7 @@ Add missing values with shiny and rhandsontable. Called by \code{check_and_add_m \describe{ \item{\code{y}}{a \code{data.table} containing incomplete rows after the merge.} -\item{\code{doc_dir}}{a \code{character vector}; Indicating a directory in which +\item{\code{doc_dir}}{a \verb{character vector}; Indicating a directory in which a text file is created which documents manual changes to the merge.} } \if{html}{\out{
}} @@ -347,20 +393,86 @@ a text file is created which documents manual changes to the merge.} Add missing values using a fuzzy join. Called by \code{check_and_add_missing_values()}. \subsection{Usage}{ -\if{html}{\out{
}}\preformatted{LTDataset$fuzzy_join_missing_values(attrs_by_region_dt_min, match_fuzzily_by)}\if{html}{\out{
}} +\if{html}{\out{
}}\preformatted{LTDataset$fuzzy_join_missing_values( + attrs_by_region_dt_min, + match_fuzzily_by, + ignore_case = TRUE, + dist_method = "lv", + max_dist = 4L +)}\if{html}{\out{
}} } \subsection{Arguments}{ \if{html}{\out{
}}
\describe{
-\item{\code{attrs_by_region_dt_min}}{a \code{data.table} containing incomplete
-rows after the merge.}
+\item{\code{attrs_by_region_dt_min}}{a \code{data.table} containing incomplete rows
+after the merge.}

-\item{\code{match_fuzzily_by}}{a \code{character vector}; if not NULL, a fuzzy
-match will be performed on the column indicated.}
+\item{\code{match_fuzzily_by}}{a \verb{character vector}; if not NULL, a fuzzy match
+will be performed on the column indicated.}
+
+\item{\code{ignore_case}}{\code{logical}; whether to ignore the case when matching.
+This is an argument of \code{stringdist_join}.}
+
+\item{\code{dist_method}}{a \code{character} vector; the measure of distance used by
+\code{stringdist_join}. See \code{stringdist-metrics} in the \code{stringdist}
+package.}
+
+\item{\code{max_dist}}{an \code{integer} value indicating the maximum distance
+between the two input vectors. This is an argument of \code{stringdist_join}.}
}
\if{html}{\out{
}} } +} +\if{html}{\out{
}} +\if{html}{\out{}} +\if{latex}{\out{\hypertarget{method-LTDataset-stringdist_join_match_fun}{}}} +\subsection{Method \code{stringdist_join_match_fun()}}{ +The function used for those attributes which are matched +fuzzily. Called by \code{fuzzy_join_missing_values()}. This function is +identical to the "match_fun" of \code{stringdist_join()} of the \code{fuzzyjoin} +package with changed defaults. +\subsection{Usage}{ +\if{html}{\out{
}}\preformatted{LTDataset$stringdist_join_match_fun( + v1, + v2, + max_dist = 4L, + method = c("lv", "osa", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", + "soundex"), + ignore_case = TRUE +)}\if{html}{\out{
}} +} + +\subsection{Arguments}{ +\if{html}{\out{
}}
+\describe{
+\item{\code{v1, v2}}{\verb{character vectors} to be compared fuzzily.}
+
+\item{\code{max_dist}}{an \code{integer} value indicating the maximum distance
+between the two input vectors. This is an argument of \code{stringdist_join}.}
+
+\item{\code{method}}{a \code{character} vector; the measure of distance used by
+\code{stringdist_join}. See \code{stringdist-metrics} in the \code{stringdist}
+package.}
+
+\item{\code{ignore_case}}{\code{logical}; whether to ignore the case when matching.
+This is an argument of \code{stringdist_join}.}
+}
+\if{html}{\out{
}} +} +} +\if{html}{\out{
}} +\if{html}{\out{}} +\if{latex}{\out{\hypertarget{method-LTDataset-add_missing_regions}{}}} +\subsection{Method \code{add_missing_regions()}}{ +The function checks if the entire corpus is covered by the +region matrix. If not, the missing regions of the corpus are added to +the region matrix and the value vector. This is needed to encode the +new data. +\subsection{Usage}{ +\if{html}{\out{
}}\preformatted{LTDataset$add_missing_regions()}\if{html}{\out{
}} +} + } \if{html}{\out{
}} \if{html}{\out{}} diff --git a/man/LinkTools-package.Rd b/man/LinkTools-package.Rd index 723703c..7698070 100644 --- a/man/LinkTools-package.Rd +++ b/man/LinkTools-package.Rd @@ -6,21 +6,21 @@ \alias{LinkTools} \title{R-package 'LinkTools'} \description{ -Tools for linking data sets via shared unique identifiers. +Tools for linking data sets. } \details{ -Four steps are integrated into this package: +Once finalized, four steps should be integrated into this package: a) the preparation of data sets which should be linked, i.e. the -transformation into a comparable format and the assignment of shared unique -identifiers, +transformation into a comparable format and the assignment of new +information, in particular shared unique identifiers, b) the merge of data sets based on these identifiers, -c) the encoding or enrichment of the data with three output formats +c) the encoding or enrichment of the data with different output formats (data.table, XML or CWB). d) the package includes a wrapper for the Named Entity Linking of textual -data based on DBPedia Spotlight. +data based on DBpedia Spotlight. } \keyword{package} diff --git a/man/stammdaten_wikidatafied_2022_02_01_min.Rd b/man/stammdaten_wikidatafied_2022_02_01_min.Rd deleted file mode 100644 index 96cf525..0000000 --- a/man/stammdaten_wikidatafied_2022_02_01_min.Rd +++ /dev/null @@ -1,26 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/LinkTools.R -\docType{data} -\name{stammdaten_wikidatafied_2022_02_01_min} -\alias{stammdaten_wikidatafied_2022_02_01_min} -\title{Stammdaten with WikiData-IDs} -\format{ -An object of class \code{data.table} (inherits from \code{data.frame}) with 1397 rows and 5 columns. -} -\source{ -https://www.bundestag.de/services/opendata (Creation Date 2021-11-04) - -https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(13._Wahlperiode) (Information Retrieved on 2021-11-23) - -https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(14._Wahlperiode) (Information Retrieved on 2021-11-23) -} -\usage{ -stammdaten_wikidatafied_2022_02_01_min -} -\description{ -A minimized version of the Stammdaten of the German Bundestag of the 13th and -14th legislative period with added WikiData IDs retrieved via the Wikidata Query -Service and added party affiliations specific for the legislative period retrieved -from Wikipedia. For preparation see bt_stammdaten. -} -\keyword{datasets} diff --git a/tests/testthat.R b/tests/testthat.R new file mode 100644 index 0000000..9c224ce --- /dev/null +++ b/tests/testthat.R @@ -0,0 +1,12 @@ +# This file is part of the standard setup for testthat. +# It is recommended that you do not modify it. +# +# Where should you do additional test configuration? 
+# Learn more about the roles of various files in: +# * https://r-pkgs.org/tests.html +# * https://testthat.r-lib.org/reference/test_package.html#special-files + +library(testthat) +library(LinkTools) + +test_check("LinkTools") \ No newline at end of file diff --git a/tests/testthat/test-LTDataset.R b/tests/testthat/test-LTDataset.R new file mode 100644 index 0000000..5db859a --- /dev/null +++ b/tests/testthat/test-LTDataset.R @@ -0,0 +1,258 @@ +test_that("all unique speaker attribute pairs from the text data are kept and no observations are dropped implicitly when enriching entire corpora", { + + withr::local_package("polmineR") + use("polmineR") + + # all unique attribute combinations in GERMAPARLMINI + + s_attributes_gpmini <- s_attributes("GERMAPARLMINI", c("speaker", "party", "protocol_lp")) + s_attribute_combinations <- s_attributes_gpmini[, c("value", "party", "protocol_lp")] |> + unique() |> + nrow() + + # all unique attribute combinations in the enriched data.frame + + LTD <- LTDataset$new(textual_data = "GERMAPARLMINI", + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia", + "protocol_lp" = "legislative_period"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + LTD$create_attribute_region_datatable() + n_attributes_by_region <- LTD$attrs_by_region_dt |> + unique() |> + nrow() + + RcppCWB::cqp_reset_registry() + + expect_equal(n_attributes_by_region, s_attribute_combinations) +}) + +test_that("all unique speaker attribute pairs from the text data are kept and no observations are dropped implicitly when enriching subcorpora", { + + withr::local_package("polmineR") + use("polmineR") + + gpmini_day <- corpus("GERMAPARLMINI") |> + polmineR::subset(date == "2009-11-12") + + s_attributes_gpmini <- s_attributes(gpmini_day, c("speaker", "party")) |> + nrow() + + LTD <- LTDataset$new(textual_data = gpmini_day, + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + LTD$create_attribute_region_datatable() + + n_attributes_by_region <- LTD$attrs_by_region_dt |> + unique() |> + nrow() + + RcppCWB::cqp_reset_registry() + + expect_equal(n_attributes_by_region, s_attributes_gpmini) +} +) + + +test_that("the output is identical regardless of the usage of split_by for corpora", { + + withr::local_package("polmineR") + use("polmineR") + + LTD <- LTDataset$new(textual_data = "GERMAPARLMINI", + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia", + "protocol_lp" = "legislative_period"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + + LTD_split <- LTDataset$new(textual_data = "GERMAPARLMINI", + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia", + "protocol_lp" = "legislative_period"), + split_by = "date", + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD_split$join_textual_and_external_data() + + RcppCWB::cqp_reset_registry() + + expect_identical(LTD_split$text_dt, LTD$text_dt) +}) + 
+ +test_that("the output is identical regardless of the usage of split_by for subcorpora", { + + withr::local_package("polmineR") + use("polmineR") + + gpmini_day <- corpus("GERMAPARLMINI") |> + polmineR::subset(date == "2009-11-12") + + LTD <- LTDataset$new(textual_data = gpmini_day, + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + + LTD_split <- LTDataset$new(textual_data = gpmini_day, + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia"), + split_by = "party", + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD_split$join_textual_and_external_data() + + RcppCWB::cqp_reset_registry() + + expect_identical(LTD_split$text_dt, LTD$text_dt) +}) + + +test_that("the region matrix in the end only contains each cpos once and that the length of unique cpos is equal to the corpus size", { + + withr::local_package("polmineR") + use("polmineR") + + gpmini_day <- corpus("GERMAPARLMINI") |> + polmineR::subset(date == "2009-11-10") + + LTD <- LTDataset$new(textual_data = gpmini_day, + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + LTD$create_attribute_region_datatable() + LTD$add_missing_regions() + + cpos_vector <- apply(LTD$region_matrix, MARGIN = 1, function(row) c(row[[1]]:row[[2]])) |> + unlist() + + RcppCWB::cqp_reset_registry() + + expect_equal(length(cpos_vector), length(unique(cpos_vector))) + expect_equal(length(unique(cpos_vector)), size("GERMAPARLMINI")) +} +) + + +test_that("The new attribute is correctly encoded in the corpus", { + + withr::local_package("polmineR") + use("polmineR") + + # copy corpus to temp dir, using the current default values of + # cwbtools::corpus_copy explicitly. 
+ + new_registry_dir <- fs::path(tempdir(), "cwb", "registry") + new_data_dir <- fs::path(tempdir(), "cwb", "indexed_corpora", tolower("GERMAPARLMINI")) + + cwbtools::corpus_copy(corpus = "GERMAPARLMINI", + registry_dir = RcppCWB::corpus_registry_dir(corpus = "GERMAPARLMINI"), + data_dir = RcppCWB::corpus_data_dir(corpus = "GERMAPARLMINI", + registry = RcppCWB::corpus_registry_dir(corpus = "GERMAPARLMINI")), + registry_dir_new = new_registry_dir, + data_dir_new = new_data_dir) + + cwbtools::corpus_rename( + old = "GERMAPARLMINI", + new = "GERMAPARLMINI_TMP", + registry_dir = new_registry_dir, + verbose = TRUE + ) + + old_corpus_registry <- Sys.getenv("CORPUS_REGISTRY") + + RcppCWB::cqp_initialize(registry = new_registry_dir) + stopifnot("GERMAPARLMINI_TMP" %in% corpus()[["corpus"]]) + + s_attributes_before <- s_attributes("GERMAPARLMINI_TMP") + + gpmini_day <- corpus("GERMAPARLMINI_TMP") |> + polmineR::subset(date == "2009-11-10") + + LTD <- LTDataset$new(textual_data = gpmini_day, + textual_data_type = "cwb", + external_resource = btmp::btmp_de, + attr_to_add = c("id" = "wikidata_id"), + match_by = c("speaker" = "full_name", + "party" = "party_wikipedia"), + split_by = NULL, + verbose = TRUE, + forced_encoding = "UTF-8") + + LTD$join_textual_and_external_data() + LTD$create_attribute_region_datatable() + + LTD$encode_new_s_attribute() + + RcppCWB::cqp_reset_registry() + + s_attributes_after <- s_attributes("GERMAPARLMINI_TMP") + + # expect that new s_attribute is there + expect_true(setdiff(s_attributes_after, s_attributes_before) == "id") + + + # except that this corresponds to the joins in the object + speaker_id_combination_encoded <- s_attributes(gpmini_day, c("speaker", "id")) |> + unique() + + data.table::setorder(speaker_id_combination_encoded, speaker) + + speaker_id_combination_ltd <- LTD$attrs_by_region_dt |> + unique() |> + data.table::setorder(speaker) + + speaker_id_combination_ltd <- speaker_id_combination_ltd[, c("speaker", "id")] + speaker_id_combination_ltd[is.na(speaker_id_combination_ltd$id), "id"] <- "NA" + + speaker_id_data_table <- merge(speaker_id_combination_encoded, speaker_id_combination_ltd, by = "speaker") + + expect_equal(speaker_id_data_table$id.x, speaker_id_data_table$id.y) + + RcppCWB::cqp_reset_registry(registry = old_corpus_registry) +} +) + diff --git a/vignettes/vignette.Rmd b/vignettes/vignette.Rmd index f291129..7fa6be0 100644 --- a/vignettes/vignette.Rmd +++ b/vignettes/vignette.Rmd @@ -10,15 +10,22 @@ vignette: > ```{r libraries} library(LinkTools) library(polmineR) +library(btmp) ``` +```{r} +use("polmineR") +set.seed(343) +``` + + # Linktools -The motivation of this set of tools is to link textual data to existing external data sets. The textual data might come in form of XML files, CWB-indexed corpora or Quanteda corpora. +The motivation of this set of tools is to link textual data to existing external data sets. The textual data might come in form of XML files, CWB-indexed corpora or quanteda corpora. -The external datasets might comprise of biographical data such as in the "Stammdaten des Deutschen Bundestages" [@...] or substantial findings in their own right, such as the "BT Vote MP Characteristics" dataset [@...]. +The external datasets might comprise of biographical data such as in the "Stammdaten des Deutschen Bundestages" [@...], substantial findings in their own right, such as the "BT Vote MP Characteristics" dataset [@...] or other structured information of knowledge bases such as Wikidata or DBpedia. 
-# Linktools as a suit of three functions
+# Linktools as a suite of three functions

This R package facilitates three steps:

@@ -28,37 +35,50 @@

# The LTDataset class

-Textual data rarely is just a collection of tokens but in most cases enriched with metadata. Plenary speeches for example contain information about speakers or the date they were delivered. The `LTDataset` class realizes the linking of textual data and external datasets based on this metadata and essentially *joins two rectangular datasets by specific columns*. In addition, it contains some functionality to check if the data is completely linked (i.e. if the join results in missing values).
+Textual data rarely is just a collection of tokens but is in most cases enriched with metadata. Plenary speeches, for example, contain information about speakers or the date they were delivered. The `LTDataset` class realizes the linking of textual data and external datasets based on this metadata and essentially *joins two rectangular datasets by specific columns*. In addition, it contains some functionality to check if the data is completely linked (i.e. if the join results in missing values) and provides possibilities to add missing information both fuzzily and manually.

## Requirements

-The `LTDataset` class merges the textual data and the external data by joining them based on different attributes in both datasets.
+The `LTDataset` class merges the textual data and the external data by joining them based on different attributes in both datasets. As argued previously, a robust way to link external datasets and textual data is the use of shared unique identifiers. Following this intuition, we want to add Wikidata-IDs to these speakers. As pointed out in previous considerations [@WORKINGPAPER], these are generally available for a vast number of entities, stable and extensible by users.

-The textual data we want to add external information to is a collection of speeches by members of parliament in the German Bundestag taken from the GermaParl corpus of parliamentary debates [@BlaetteBlessing2018]. The textual data contains several attributes which can be used for merging. To keep the examples a bit smaller, only speeches of the 13th and 14th legislative period are used.
+To realize this, the `LTDataset` class also needs the information about which speaker is associated with which ID. We can use a speaker's name, the respective party affiliation and the legislative period as the left-hand side input of the merge.

-```{r germaparl_speaker_with_metadata_sample, echo = FALSE}
-germaparl_lp13_lp14 <- subset("GERMAPARL", lp %in% c(13, 14))
+### Textual Data
+
+In this vignette, the textual data we want to add external information to is a collection of speeches by members of parliament in the German Bundestag. We use a part of the GermaParl corpus of parliamentary debates [@BlaetteBlessing2018] which is provided as sample data in the `polmineR` R package. This sample corpus, GermaParlMini, contains several attributes which can be used for merging.
+
+While in the example below the entire corpus is enriched with the external data, it is also possible to enrich only a part of the corpus (i.e. a subcorpus), as sketched below.
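+
+A minimal sketch of creating such a subcorpus with `polmineR` (a single sitting of GermaParlMini, mirroring the package tests); the resulting object can then be passed as `textual_data`:
+
+```{r subcorpus_for_linking, eval = FALSE}
+# restrict the corpus to one day of debates before linking
+gpmini_day <- corpus("GERMAPARLMINI") |>
+  polmineR::subset(date == "2009-11-12")
+```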
-germaparl_speaker_with_metadata <- s_attributes(germaparl_lp13_lp14, c("speaker", "party", "lp", "role"))
-germaparl_speaker_with_metadata_sample <- germaparl_speaker_with_metadata[role == "mp"][sample(.N, 10)]
-data.table::setorder(germaparl_speaker_with_metadata_sample, lp, speaker)
+```{r germaparl_speaker_with_metadata_sample, echo = FALSE}
+speakers_germaparlmini <- corpus("GERMAPARLMINI") |>
+  s_attributes(c("speaker", "protocol_lp", "party"))
+speakers_germaparlmini <- speakers_germaparlmini[, !c("struc", "cpos_left", "cpos_right")] |>
+  unique()
+data.table::setnames(speakers_germaparlmini, old = "value", new = "speaker")
+speakers_germaparlmini_sample <- speakers_germaparlmini[party != "NA"][sample(.N, 10)]
+data.table::setorder(speakers_germaparlmini_sample, protocol_lp, speaker)

-DT::datatable(germaparl_speaker_with_metadata_sample,
-              caption = "10 randomly selected members of parliament from the
-              GermaParl corpus, 13th and 14th legislative period",
+DT::datatable(speakers_germaparlmini_sample,
+              caption = "10 randomly selected members of parliament from the GermaParlMini corpus",
              options = list(dom = "t"),
              rownames = FALSE)
```

-As argued previously, a robust way to link external datasets and textual data is the use of shared unique identifiers. Following this intuition, we want to add Wikidata-IDs to the these speakers. As pointed out in previous considerations [@WORKINGPAPER], these are generally available for a vast amount of entities, stable and extendible by users.
+### External Data
+
+The external data is a set of observations which contains both information that can be used for matching against the metadata in the text corpus and additional data that should be added to the corpus.
+
+In other words, some overlap between the datasets must exist to perform the matching. Often, the overlap exists naturally between different data sources - such as names or party affiliations for members of parliament - but there certainly are instances in which external datasets must be prepared beforehand to facilitate the matching.

-To realize this, the class also needs the information about what speaker is associated with which ID. As seen above, we can use a speaker's name, the respective party affiliation, the legislative period and the role as the left sided input of the merge. Preparing the external data in a way these two tables correspond is an upstream task in this sense that it must be prepared beforehand.
+In the following, we use data gathered from the `Stammdaten des Deutschen Bundestages` which were enriched with both Wikidata-IDs and the speakers' party affiliation specific to the individual legislative period as per Wikipedia (the Stammdaten themselves only contain a static party affiliation, i.e. the most recent party affiliation of a speaker, regardless of politicians switching parties). This data has been prepared earlier and is provided as the R data package `btmp` [@btmp_package]; the preparation of the data is discussed there.

+The following table shows the external dataset concerning those speakers which have been sampled randomly above.
```{r show_external_data_min_example, echo = FALSE} -external_data_min_example <- LinkTools::stammdaten_wikidatafied_2022_02_01_min[speaker %in% germaparl_speaker_with_metadata_sample[["speaker"]]] -data.table::setorder(external_data_min_example, lp, speaker) +external_data_min_example <- btmp::btmp_de |> + subset(legislative_period == 17) |> + subset(full_name %in% speakers_germaparlmini_sample[["speaker"]]) |> + data.table::setorder(legislative_period, full_name) DT::datatable(external_data_min_example, caption = "Speakers in External Data", @@ -67,60 +87,87 @@ DT::datatable(external_data_min_example, rownames = FALSE) ``` -With these two resources, the task now is to merge the two datasets based on these two tables. +*Note:* This example also illustrates that matching by identical names only is quite challenging. Not all ten speakers could be retrieved by name only because the names in the Stammdaten file (provided by the `btmp` package) and the names in the textual data of GermaParlMini are not identical and thus not all speakers can be matched by their names. The vignette shows how to perform fuzzy matching to address these problems. ## Joining Textual Data and Metadata -The `LTDataset` dataset uses the information provided in the textual data and this external resource to perform matching. Aside from the name of or the path to the textual data (the `textual_data`), its type (for now: "cwb" or "xml") and the external resource (a data.table object), it must be indicated which attribute of the external resource should be added. If the name of the column of the external dataset should not be used as the name of the attribute to be added in the textual data, a named vector could be used instead. In addition, it is possible that the textual data and the external resource do not share the same column names. In this case, a named character vector is used to indicate which column of the textual data corresponds to which column of the external dataset. This might make sense to be more transparent about the variables used from the external dataset. +With these two resources, the task now is to merge the two datasets based on these two tables. + +### Instantiate the Class + +The `LTDataset` dataset uses the information provided in the textual data and this external resource to perform matching. Aside from the name of or the path to the textual data (the `textual_data`), its type (for now: "cwb") and the external resource (a data.table object), it must be indicated which attribute of the external resource should be added. -A challenge is that the occurrence of the combination of a speaker's name, the party affiliation and the legislative period needs to be identified before the linkage can be performed for the entire corpus. This can result in rather large tables. Thus, merging can be done in multiple splits instead of merging the entire corpus. +If the name of the added attribute in the final data should be different than the column name of the external data, a named character vector can be used instead. The name of the vector then represents the final name of the attribute in the enriched textual data. + +In addition, it is possible that the name of the attributes used for matching is different between the textual data and the external data containing the additional information. In this case, a named character vector is used to indicate which column of the textual data corresponds to which column of the external dataset. 
+
+Since identifying all regions of text described by the combination of metadata used to match the data can result in quite large data structures, merging can be done in multiple splits, reducing the memory demand. Given the rather small size of the GERMAPARLMINI corpus, this is not done here.

```{r initialize_class}
-LTD <- LTDataset$new(textual_data = germaparl_lp13_lp14,
+LTD <- LTDataset$new(textual_data = "GERMAPARLMINI",
                     textual_data_type = "cwb",
-                     external_resource = LinkTools::stammdaten_wikidatafied_2022_02_01_min,
-                     attr_to_add = c("wikidata_id_speaker" = "wikidata_id"),
-                     match_by = c("speaker", "party", "lp", "role"),
-                     split_by = "lp",
+                     external_resource = btmp::btmp_de,
+                     attr_to_add = c("id" = "wikidata_id"),
+                     match_by = c("speaker" = "full_name",
+                                  "party" = "party_wikipedia",
+                                  "protocol_lp" = "legislative_period"),
+                     split_by = NULL,
                      verbose = TRUE,
                      forced_encoding = "UTF-8")
```

+### Join Textual and External Data
+
After initializing the class, we can run the first method, which identifies the parts of the textual data which should be linked and actually performs the linkage. Depending on the `textual_data_type`, different tasks are performed.

```{r join_textual_and_external_data}
LTD$join_textual_and_external_data()
```

-For CWB corpora the output of this method is a region matrix and a vector of values which can be used to create structural attributes for corpora stored in the Corpus Workbench.
+For CWB corpora, the output of this method is a region matrix and a vector of values which can be used to create structural attributes for corpora stored in the CWB format.

-For XML, the return value is a data.table with one individual row per unique combination of attributes which is then used to append the XML files.
+For XML, the return value should be a data.table with one individual row per unique combination of attributes which is then used to append the XML files.

-To validate the matching, a attribute region data.table can be created.
+To validate the matching, an attribute region data.table can be created using the `create_attribute_region_datatable()` function. The function also makes it possible to define additional attributes which were not used for the earlier verbatim matching but might be useful for both the fuzzy matching and the manual inspection. For example, specific periods of time or groups of persons which are not part of the external data and thus cannot be matched in any case can be excluded this way. We omit this step here.

```{r create_attribute_region_datatable}
LTD$create_attribute_region_datatable()
```

+Looking at the example data, we see that not all speakers could be matched based on the speaker's name, her or his party affiliation and the legislative period. This is often due to speakers not having a party affiliation because they are not members of parliament. In GermaParlMini, the party metadata contains information about parliamentary groups, which only concern members of parliament and not other types of speakers such as presidential speakers or governmental actors. These are not part of the dataset we want to link with, and thus the linking fails in these instances.
+
+Looking at other speakers who have not been matched reveals that it is mostly the name of the party affiliation in the textual data which is slightly different.
+
```{r print_attributes_by_region_datatable}
head(LTD$attrs_by_region_dt)
```

+To provide a systematic way to inspect missing values manually and to add missing attributes, the function `check_and_add_missing_values()` is provided. Knowing that speakers without a party affiliation cannot be matched in any case, these observations can be excluded from the manual inspection.
+
+```{r}
+LTD$attrs_by_region_dt[party != "NA"][is.na(id)]
+```
+
+If the parameter "match_fuzzily_by" is not NULL, the attribute name provided there can be used to perform a fuzzy match. All other attributes in "match_by" are matched literally. The result of this matching is then shown in an interactive shiny session in which the suggestions can be accepted and kept, modified, or refused and omitted.
+
+The fuzzy match is facilitated by the `fuzzy_join()` function of the package of the same name [@fuzzyjoin2020]. The main driver in the following match is a part of the `stringdist_join()` function, to which the arguments `ignore_case`, `dist_method` and `max_dist` can be passed. The following chunk shows the chosen default values (a Levenshtein distance with a maximum of 4, with casing being ignored).

-This also allows for the manual inspection of missing values as well as the manual addition of missing attributes. If the parameter "match_fuzzily_by" is not NULL, the attribute name provided there there can be used to perform a fuzzy match. All other attributes in "match_by" are still matched literally. The result of this matching is then shown in an interactive shiny session in which they can be accepted and kept or refused and omitted.
+To document the changes made manually, a log file is created in the directory defined by the `doc_dir` argument.

```{r check_and_add_missing_values, eval = interactive()}
-LTD$check_and_add_missing_values(check_for_groups = c("role" = "mp"),
+LTD$check_and_add_missing_values(check_for_groups = list("party" = "NA"),
+                                 negate = TRUE,
                                  modify = TRUE,
-                                 match_fuzzily_by = "speaker",
-                                 doc_dir = "~/lab/gitlab/tmp",
+                                 match_fuzzily_by = c("speaker", "party"),
+                                 doc_dir = "~/lab/gitlab/tmp",
+                                 ignore_case = TRUE,
+                                 dist_method = "lv",
+                                 max_dist = 4L,
                                  verbose = TRUE)
```

Finally, the values can be added to the corpus by encoding them as a structural attribute.

-~~Adding additional attributes can be realized only temporarily (for experimental reasons or if only a single analysis should be performed instead of changing the corpus forever) or persistently.
-
```{r encode_new_s_attribute, eval = FALSE}
-LTD$encode_new_s_attribute(add_temporarily = TRUE)
-``` \ No newline at end of file
+LTD$encode_new_s_attribute()
+```
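+
+Assuming `encode_new_s_attribute()` has been run, one way to verify the result is to list the structural attributes of the corpus again - the newly encoded attribute should now appear among them. A minimal sketch, mirroring the checks in the package tests:
+
+```{r verify_encoded_attribute, eval = FALSE}
+# "id" should be listed among the structural attributes after encoding
+"id" %in% polmineR::s_attributes("GERMAPARLMINI")
+```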