Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decode() on partitions with cpos pairs of equal length #124

Closed
ChristophLeonhardt opened this issue Feb 23, 2020 · 2 comments
Closed

decode() on partitions with cpos pairs of equal length #124

ChristophLeonhardt opened this issue Feb 23, 2020 · 2 comments

Comments

@ChristophLeonhardt
Copy link
Contributor

This is probably a very special scenario but because I encountered it in a real-life application, I report it. I needed to decode speeches.

# creating speeches
library(polmineR)
use("GermaParl")
speeches <- partition("GERMAPARL", year = 2016, interjection = "FALSE") %>%
  as.speeches(s_attribute_name = "speaker", mc = TRUE)

However, the decode() method for slices does not work when all cpos pairs in the partition have exactly the same length/distance. In the speech bundle above, this is the case for speech 8094 (among many others).

speeches@objects[[8094]]@cpos

          [,1]      [,2]
[1,] 100995643 100995653
[2,] 100996018 100996028

Both pairs comprise exactly ten tokens. In this case, the decode() method does not work as intended:

x <- decode(speeches@objects[[8094]])
... decoding p_attribute lemma
 Error in `[.data.table`(y, , `:=`((p_attr), get_token_stream(.Object,  : 
  Supplied 22 items to be assigned to 11 items of column 'lemma'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code. 

And I think it is this line that does the damage:

y <- data.table(cpos = unlist(apply(.Object@cpos, 1, function(row) row[1]:row[2])))

I think, the apply() part says: for every row in matrix, return the sequence from value in column one to value in column two. In case all resulting integer vectors have the same length, apply() apparently returns a matrix instead of a list, which then cannot be unlisted. In the next stop of the decode() method, only the number of rows of the matrix is taken into account.

I guess, the easiest way, albeit not necessarily the most elegant or fastest one to solve this issue is to check the return value of apply before creating y.

cpos_from_slice <- apply(.Object@cpos, 1, function(row) row[1]:row[2])

if (is.matrix(cpos_from_slice)) {
  cpos_from_slice <- as.integer(cpos_from_slice)
} else {
  cpos_from_slice <- unlist(cpos_from_slice)
}

y <- data.table(cpos = cpos_from_slice)

As for the decode method, see also issue #120.

@PolMine
Copy link
Collaborator

PolMine commented Feb 25, 2020

Thanks for reporting this issue in such a comprehensible and informed manner.

Indeed, I think I had seen this issue before. But I am guilty of not having followed the DRY principle (don't repeat yourself). So I used various varions of the snippet apply(.Object@cpos, 1, function(row) row[1]:row[2]) to unfold a region matrix. Not all of them were able to deal with the matrix that is returned when all regions have the same length, including the one you detected.

The best solution I had found (robust and fast) is still a cpos() method for matrix input I had written a few weeks ago:

setMethod("cpos", "matrix", function(.Object)
  do.call(c, lapply(1L:nrow(.Object), function(i) .Object[i,1]:.Object[i,2]))
)

A new version of polmineR available at the dev branch uses this method whenever a region matrix needs to be turned into single corpus positions. I have also written a unit test as a safeguard that we do not fall behind the current state of affairs.

The not-so-beautiful part of the solution is that the primary purpose of the cpos()-method is to get a region matrix. Ultimately, the method should be renamed. get_region_matrix() would be a plausible solution, I guess.

@ChristophLeonhardt
Copy link
Contributor Author

After some initial testing I would agree that this approach works nicely. I also would think that get_region_matrix() seems like a reasonably expressive name for the initial purpose of the cpos() method.

As far as I am concerned, this issue is solved and can be closed. Thank you very much for the swift fix.

@PolMine PolMine closed this as completed Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant