Feature Request | as.DocumentTermMatrix with KWIC-object | Different p_attributes in KWIC and DTM #103

ChristophLeonhardt · 2019-09-30T16:57:11Z

The as.DocumentTermMatrix method offers a p_attribute parameter.

tdm <- partition("GERMAPARL", date = "2013-03-13") %>% 
  partition_bundle(s_attribute = "speaker") %>% 
  as.TermDocumentMatrix(p_attribute = "lemma", verbose = FALSE)

This does work as expected.

The workflow presented here illustrates that this is also the case when transforming KWIC objects. However, when passing anything other than word, I run into a problem:

library(polmineR)
use("GermaParl")
bt15_sanktionen <- kwic("GERMAPARL", query = "Sanktionen")
dtm <- as.DocumentTermMatrix(bt15_sanktionen, p_attribute = "lemma")

This does not work with the error:

Error in eval(bysub, x, parent.frame()) : object 'lemma_id' not found

It makes sense that the lemma isn't actually stored in the KWIC object, and I see that I can get the desired result by specifying the p_attribute when performing the KWIC analysis, like this:

bt15_sanktionen <- kwic("GERMAPARL", query = "Sanktion", p_attribute = "lemma")
dtm <- as.DocumentTermMatrix(bt15_sanktionen, p_attribute = "lemma")

So this doesn't seem to be an actual error but expected behavior.

However, I can think of use cases in which I want to perform a KWIC analysis on the regular words (p_attribute = "word") and then use another positional attribute to create a Document-Term-Matrix, for example to create cleaner input for machine learning workflows.

So I have a feature request: Maybe a solution which makes it possible to use different p_attributes in these steps would be worth considering?


ablaette · 2019-10-01T11:28:22Z

There are two answers, one that would not have caused me work, and one that did.

The first one: The kwic()-method accepts an argument p_attribute that can be used to work with something other than the p-attribute "word". If you have prepared your kwic-object accordingly, as.DocumentTermMatrix() will work without errors or warnings.

library(polmineR)
use("GermaParl")

k1 <- kwic("GERMAPARL", query = 'Sanktion', cqp = FALSE, p_attribute = "lemma")
dtm1 <- as.DocumentTermMatrix(k1, p_attribute = "lemma")

Note that without using the CQP syntax, the query needs to be "Sanktion", because the lemmatisation does away with the plural. If matching the plural is important, you could use the CQP syntax, however.

k2 <- kwic("GERMAPARL", query = '[word = "Sanktionen"]', cqp = TRUE, p_attribute = "lemma")
dtm2 <- as.DocumentTermMatrix(k2, p_attribute = "lemma")

This would have worked all along. But you pointed out a problem anyway that required some thought and some refactoring, resulting in a new polmineR development version (v0.7.11.9040).

When running as.DocumentTermMatrix on the kwic object prepared for the p-attribute "word", you certainly saw a message "... adding token ids for p-attribute: lemma", and then the error occurred anyway.

This message was misleading, because an enrich()-operation tried to add the information for the (missing) p-attribute, but without any effect: The argument p_attribute is not defined for the enrich()-method for kwic-objects that is called. But you do not notice this because the unused argument disappears in the sea of three dots (the ... argument). A further complication was that I tried to have the same as.TermDocumentMatrix()-method for kwic and context-objects and managed to do so by introducing a virtual class called neighborhood.
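To illustrate how an argument can vanish in the three dots, here is a minimal sketch in plain R. The class toy_kwic and the generic toy_enrich are hypothetical stand-ins, not polmineR's actual classes or methods; the only point is that an S4 method whose signature ends in ... accepts an argument it never uses, without any error or warning.

```r
library(methods)

# Hypothetical stand-in class, NOT the polmineR kwic class
setClass("toy_kwic", representation(tokens = "character"))

# Hypothetical generic with '...' in its signature
setGeneric("toy_enrich", function(.Object, ...) standardGeneric("toy_enrich"))

# The method does not define p_attribute, so a p_attribute passed by the
# caller ends up in '...' and is never looked at.
setMethod("toy_enrich", "toy_kwic", function(.Object, ...) {
  .Object  # returned unchanged
})

k <- new("toy_kwic", tokens = c("Sanktionen", "gegen"))
k_enriched <- toy_enrich(k, p_attribute = "lemma")  # no error, no warning

identical(k, k_enriched)  # TRUE: the object comes back unchanged
```

Because the call succeeds silently, a later step that expects the extra information only fails further downstream, which matches the pattern of the 'lemma_id' error reported above.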

To cut a long story short: The virtual class neighborhood is gone now, there is a new coerce method that turns kwic objects into context objects, and the as.DocumentTermMatrix()-method for kwic objects now turns the input kwic object into a context object. And as the enrich()-method for context-objects processes the argument p_attribute nicely, your initial code should work now:

kwic_word <- kwic("GERMAPARL", query = "Sanktionen", p_attribute = "word")
dtm_lemma <- as.DocumentTermMatrix(kwic_word, p_attribute = "lemma")

And please note that the messages ("... checking that all p-attributes are available" / "... getting token id for p-attribute: lemma") now refer to something that is really happening.

Discarding the neighborhood virtual class may incur a minimal performance cost, but the code and the documentation are easier to understand without this class, and as everything works now, I hope this is a good solution.
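The refactoring described above (coerce the kwic object to a context object, then let the enrich()-method for context objects handle p_attribute) can be pictured with a toy S4 sketch. All names here (demo_kwic, demo_context, demo_enrich) are hypothetical and only mimic the pattern; this is not the actual polmineR implementation.

```r
library(methods)

# Hypothetical toy classes, NOT the polmineR classes
setClass("demo_kwic", representation(cpos = "data.frame"))
setClass("demo_context",
         representation(cpos = "data.frame", p_attribute = "character"))

# Coerce method: turns a demo_kwic object into a demo_context object
setAs("demo_kwic", "demo_context", function(from) {
  new("demo_context", cpos = from@cpos, p_attribute = character())
})

setGeneric("demo_enrich", function(.Object, ...) standardGeneric("demo_enrich"))

# Unlike a method that only has '...', this one names p_attribute
# explicitly, so the argument is actually processed.
setMethod("demo_enrich", "demo_context",
          function(.Object, p_attribute = NULL, ...) {
  if (!is.null(p_attribute)) {
    .Object@p_attribute <- union(.Object@p_attribute, p_attribute)
  }
  .Object
})

k <- new("demo_kwic",
         cpos = data.frame(match_id = c(1L, 1L), cpos = c(10L, 11L)))
ctx <- demo_enrich(as(k, "demo_context"), p_attribute = "lemma")
ctx@p_attribute
```

With a single class carrying the enrichment logic, there is no need for a shared virtual parent class, at the price of one extra coercion step.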

ChristophLeonhardt · 2019-11-27T15:23:44Z

Thank you for the comprehensive explanation. This is very helpful. So is your solution. It certainly seems to work and I didn't notice any real loss in performance. Can be closed, if you like.

PolMine closed this as completed Feb 29, 2020
