Danger when using tokens from an unordered corpus
#25
Comments
I thought the wrong ordering was fixed in commit 6efe367 and version 0.8.2. The function as.sento_corpus() relies on sento_corpus() for the ordering. What version are you using? What is the output when you do
The issue is not about a wrong order: the re-ordering done by sento_corpus is correct. The danger comes from the fact that the initial corpus is unordered, and so are the tokens constructed from it. Carelessly passing these tokens to compute_sentiment creates the issue. I believe there should be some sort of warning or check to prevent using the tokens argument if the order does not match. Version is 0.8.4, and here are some other outputs:
> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          3                   1
2: text1 2000-01-26          5                  -1
> compute_sentiment(st, lex, tokens = as.list(tokens(st)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          5                  -1
2: text1 2000-01-26          3                   1
> compute_sentiment(corp, lex)
      id word_count GI_en
1: text1          3     1
2: text2          5    -1
Alright, got it, that's good news at least! The outputs make sense. In the documentation of What do you suggest to efficiently compare the corpus input with the tokens input? Possible clean solution: whenever tokens is not NULL, print a message() saying "Make sure the tokens are constructed from (the texts from) the x argument!". Not sure if there's more we can do.
I think a printed message could help, especially when sento_corpus() re-orders the corpus.
Sounds good. I prefer the first option, it’s the least invasive one, although the second one is a bit safer. Feel free to adapt and file a pull request, otherwise we’ll take this up later.
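For illustration, the kind of order check discussed in this thread could be sketched in base R as follows. This is only a sketch under stated assumptions: check_token_order is a hypothetical helper, not part of sentometrics, and it assumes the tokens are a named list whose names are the document ids (which is what as.list() on a quanteda tokens object returns). The ids and words are toy data mirroring the two-document example above.

```r
# Hypothetical helper (NOT part of sentometrics): verifies that a named list of
# tokens follows the same document order as the corpus ids, and re-orders it
# with a warning when it does not.
check_token_order <- function(ids, tokens) {
  ids <- as.character(ids)
  if (is.null(names(tokens))) {
    # Without names there is nothing to compare against.
    warning("tokens are unnamed; cannot verify that they match the corpus order")
    return(tokens)
  }
  if (length(ids) != length(tokens) || !setequal(ids, names(tokens))) {
    stop("tokens and corpus do not contain the same documents")
  }
  if (!identical(ids, names(tokens))) {
    warning("tokens are not in corpus order; re-ordering them by document id")
    tokens <- tokens[ids]
  }
  tokens
}

# Toy data: the corpus was re-ordered by date (text2 first), while the tokens
# were built from the original, unordered corpus (text1 first).
ids  <- c("text2", "text1")
toks <- list(text1 = c("a", "b", "c"),
             text2 = c("d", "e", "f", "g", "h"))

aligned <- suppressWarnings(check_token_order(ids, toks))
names(aligned)
```

In real use, ids would be the document names of the sento_corpus and the aligned list would be what gets passed as the tokens argument. Whether such a check should warn, stop, or silently re-order is a design choice left open here.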
I like tokenizing text myself before using compute_sentiment. My usual workflow is to start from a quanteda::corpus, from which I create a sento_corpus and a quanteda::tokens object.
I just realized that, since as.sento_corpus re-orders the quanteda::corpus, the order of the sento_corpus and that of the tokens object no longer match. This leads to sentiment being allocated to the wrong texts.
I realize that, in an ideal world, the safest way would be to use as.list(tokens(x)) when calling compute_sentiment. But I feel that this error is very difficult to notice as there is no warning, and I can see situations where you would handle tokenization separately from the sento_corpus object.
Reproducible example: