Feature Request | as.DocumentTermMatrix with KWIC-object | Different p_attributes in KWIC and DTM #103
Comments
There are two answers, one that would not have caused me work, and one that did. The first one: [...]
library(polmineR)
use("GermaParl")
k1 <- kwic("GERMAPARL", query = 'Sanktion', cqp = FALSE, p_attribute = "lemma")
dtm1 <- as.DocumentTermMatrix(k1, p_attribute = "lemma")
Note that without using the CQP syntax, the query needs to be "Sanktion", because lemmatisation does away with the plural. If matching the plural is important, however, you could use the CQP syntax:
k2 <- kwic("GERMAPARL", query = '[word = "Sanktionen"]', cqp = TRUE, p_attribute = "lemma")
dtm2 <- as.DocumentTermMatrix(k2, p_attribute = "lemma")
This would have worked all along. But you pointed out a problem anyway that required some thought and some refactoring, resulting in a new polmineR development version (v0.7.11.9040). When running [...] This message was misleading, because an [...] To cut a long story short: The virtual class [...]
kwic_word <- kwic("GERMAPARL", query = "Sanktionen", p_attribute = "word")
dtm_lemma <- as.DocumentTermMatrix(kwic_word, p_attribute = "lemma")
And please note that the messages ("... checking that all p-attributes are available" / "... getting token id for p-attribute: lemma") now refer to something that is really happening. Discarding the [...]
Thank you for the comprehensive explanation. This is very helpful. So is your solution. It certainly seems to work and I didn't notice any real loss in performance. Can be closed, if you like.
The as.DocumentTermMatrix method offers a p_attribute parameter.
This does work as expected.
The workflow presented here illustrates that this is also the case when transforming KWIC objects. However, when passing anything other than word, I run into a problem:
This fails with the error:
It makes sense that the lemma isn't actually stored in the KWIC object and I see that I can get the wanted result by specifying the p_attribute when performing the KWIC analysis such as:
So this doesn't seem to be an actual error but expected behavior.
However, I can think of use cases in which I want to perform a KWIC analysis on the regular words (p_attribute = "word") and then use another positional attribute to create a Document-Term-Matrix, for example to create cleaner input for machine learning workflows.
So I have a feature request: Maybe a solution which makes it possible to use different p_attributes in these steps would be worth considering?