Feature Request | as.DocumentTermMatrix with KWIC-object | Different p_attributes in KWIC and DTM #103

Closed
ChristophLeonhardt opened this issue Sep 30, 2019 · 2 comments

Comments

@ChristophLeonhardt
Contributor

The as.DocumentTermMatrix method offers a p_attribute parameter.

tdm <- partition("GERMAPARL", date = "2013-03-13") %>% 
  partition_bundle(s_attribute = "speaker") %>% 
  as.TermDocumentMatrix(p_attribute = "lemma", verbose = FALSE)

This does work as expected.

The workflow presented here illustrates that this is also the case when transforming KWIC objects. However, when passing anything other than "word" as the p_attribute, I run into a problem:

library(polmineR)
use("GermaParl")
bt15_sanktionen <- kwic("GERMAPARL", query = "Sanktionen")
dtm <- as.DocumentTermMatrix(bt15_sanktionen, p_attribute = "lemma")

This does not work with the error:

Error in eval(bysub, x, parent.frame()) : object 'lemma_id' not found

It makes sense that the lemma isn't actually stored in the KWIC object, and I see that I can get the desired result by specifying the p_attribute when performing the KWIC analysis, like this:

bt15_sanktionen <- kwic("GERMAPARL", query = "Sanktion", p_attribute = "lemma")
dtm <- as.DocumentTermMatrix(bt15_sanktionen, p_attribute = "lemma")

So this doesn't seem to be an actual error but expected behavior.

However, I can think of use cases in which I want to perform a KWIC analysis on the regular words (p_attribute = "word") and then use another positional attribute to create a Document-Term-Matrix, for example to create cleaner input for machine learning workflows.

So I have a feature request: Maybe a solution which makes it possible to use different p_attributes in these steps would be worth considering?

@ablaette
Collaborator

ablaette commented Oct 1, 2019

There are two answers, one that would not have caused me work, and one that did.

The first one: The kwic()-method accepts an argument p_attribute that can be used to work with something other than the p-attribute "word". If you have prepared your kwic-object accordingly, as.DocumentTermMatrix() will work without errors or warnings.

library(polmineR)
use("GermaParl")

k1 <- kwic("GERMAPARL", query = 'Sanktion', cqp = FALSE, p_attribute = "lemma")
dtm1 <- as.DocumentTermMatrix(k1, p_attribute = "lemma")

Note that without using the CQP syntax, the query needs to be "Sanktion", because the lemmatisation does away with the plural. If matching the plural is important, you could use the CQP syntax, however.

k2 <- kwic("GERMAPARL", query = '[word = "Sanktionen"]', cqp = TRUE, p_attribute = "lemma")
dtm2 <- as.DocumentTermMatrix(k2, p_attribute = "lemma")

This would have worked all along. But you pointed out a problem anyway that required some thought and some refactoring, resulting in a new polmineR development version (v0.7.11.9040).

When running as.DocumentTermMatrix on the kwic object prepared for the p-attribute "word", you certainly saw a message "... adding token ids for p-attribute: lemma", and then the error occurred anyway.

This message was misleading, because an enrich()-operation tried to add the information for the (missing) p-attribute, but without any effect: The argument p_attribute is not defined for the enrich()-method for kwic-objects that is called. You do not notice this, because the unused argument disappears in the sea of three dots (the ... argument) without an error or a warning. A further complication was that I tried to have the same as.TermDocumentMatrix()-method for kwic and context-objects and managed to do so by introducing a virtual class called neighborhood.
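To illustrate how an argument can vanish into the three dots, here is a minimal base R sketch using S4. It is not the polmineR code: the generic enrich_demo and the classes kwic_demo and context_demo are hypothetical names invented for this example. The method for kwic_demo does not define p_attribute, so the argument is accepted and silently ignored, while the method for context_demo defines it and acts on it.

```r
library(methods)

# Hypothetical generic with a "..." argument, standing in for enrich()
setGeneric("enrich_demo", function(x, ...) standardGeneric("enrich_demo"))

# Hypothetical stand-ins for kwic and context objects
setClass("kwic_demo", representation(tokens = "character"))
setClass("context_demo", representation(tokens = "character", p_attribute = "character"))

# This method does not define p_attribute: anything passed under
# that name ends up in "..." and is never looked at
setMethod("enrich_demo", "kwic_demo", function(x, ...) {
  x
})

# This method defines p_attribute explicitly and uses it
setMethod("enrich_demo", "context_demo", function(x, p_attribute = NULL, ...) {
  if (!is.null(p_attribute)) x@p_attribute <- p_attribute
  x
})

k <- new("kwic_demo", tokens = c("Sanktionen", "gegen"))
k_enriched <- enrich_demo(k, p_attribute = "lemma")     # no error, no effect

ctx <- new("context_demo", tokens = c("Sanktionen", "gegen"), p_attribute = "word")
ctx_enriched <- enrich_demo(ctx, p_attribute = "lemma") # slot is updated
```

In this sketch, k_enriched is identical to k even though p_attribute = "lemma" was passed, which mirrors the situation where the message announced an operation that never took place.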

To cut a long story short: The virtual class neighborhood is gone now, there is a new coerce method that turns kwic objects into context objects, and the as.DocumentTermMatrix()-method for kwic objects now turns the input kwic object into a context object. And as the enrich()-method for context-objects processes the argument p_attribute nicely, your initial code should work now:

kwic_word <- kwic("GERMAPARL", query = "Sanktionen", p_attribute = "word")
dtm_lemma <- as.DocumentTermMatrix(kwic_word, p_attribute = "lemma")
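The refactoring described above (coerce the kwic object to a context object, then let the context method do the work) can be sketched in base R as follows. This is only an illustration of the pattern, not the actual polmineR implementation: all class, generic and slot names here (kwic_sketch, context_sketch, enrich_sketch, as_dtm_sketch, cpos) are hypothetical, and the final matrix construction is left out.

```r
library(methods)

# Hypothetical stand-ins for the kwic and context classes
setClass("kwic_sketch", representation(cpos = "integer"))
setClass("context_sketch", representation(cpos = "integer", p_attribute = "character"))

# Coerce method: turn a kwic-like object into a context-like object
setAs("kwic_sketch", "context_sketch", function(from) {
  new("context_sketch", cpos = from@cpos, p_attribute = "word")
})

# enrich method for the context-like class that really handles p_attribute
setGeneric("enrich_sketch", function(x, ...) standardGeneric("enrich_sketch"))
setMethod("enrich_sketch", "context_sketch", function(x, p_attribute = NULL, ...) {
  if (!is.null(p_attribute)) x@p_attribute <- union(x@p_attribute, p_attribute)
  x
})

setGeneric("as_dtm_sketch", function(x, ...) standardGeneric("as_dtm_sketch"))

# The context method makes sure the requested p-attribute is present;
# a real implementation would build the matrix here, this sketch
# just returns the enriched object
setMethod("as_dtm_sketch", "context_sketch", function(x, p_attribute = "word", ...) {
  x <- enrich_sketch(x, p_attribute = p_attribute)
  stopifnot(p_attribute %in% x@p_attribute)
  x
})

# The kwic method only coerces and delegates
setMethod("as_dtm_sketch", "kwic_sketch", function(x, p_attribute = "word", ...) {
  as_dtm_sketch(as(x, "context_sketch"), p_attribute = p_attribute, ...)
})

k <- new("kwic_sketch", cpos = 1:5)
res <- as_dtm_sketch(k, p_attribute = "lemma")
```

With this layout there is only one place where p_attribute is processed, so the kwic and context code paths cannot drift apart.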

And please note that the messages ("... checking that all p-attributes are available" / "... getting token id for p-attribute: lemma") now refer to something that is really happening.

Discarding the neighborhood virtual class may incur a minimal performance cost, but the code and the documentation are easier to understand without this class, and as everything works now, I hope this is a good solution.

@ChristophLeonhardt
Contributor Author

Thank you for the comprehensive explanation. This is very helpful. So is your solution. It certainly seems to work and I didn't notice any real loss in performance. Can be closed, if you like.

@PolMine PolMine closed this as completed Feb 29, 2020