Danger when using tokens from an un-ordered corpus #25

Open
odelmarcelle opened this issue May 17, 2021 · 5 comments
@odelmarcelle
Contributor

odelmarcelle commented May 17, 2021

I like to tokenize text myself before calling compute_sentiment. My usual workflow is to start from a quanteda::corpus, from which I create a sento_corpus and a quanteda::tokens object.

I just realized that since as.sento_corpus reorders the quanteda::corpus, the order of the sento_corpus no longer matches the order of the tokens object. This leads to sentiment being allocated to the wrong texts.

I realize that in an ideal world, the safest way would be to use as.list(tokens(x)), with x being the sento_corpus, when calling compute_sentiment. But I feel that this error is very difficult to notice as there is no warning, and I can see situations where you would handle tokenization separately from the sento_corpus object.

Reproducible example:

> library(quanteda)
> library(sentometrics)
> e <- data.frame(text = c("good good good", "bad bad bad bad bad"), date = c("2000-01-26", "2000-01-03"))
> 
> corp <- corpus(e)
> st <- as.sento_corpus(corp)
We detected no features, so we added a dummy feature 'dummyFeature'.
> lex <- sento_lexicons(list_lexicons["GI_en"])
> 
> compute_sentiment(st, lex)
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          5                  -1
2: text1 2000-01-26          3                   1
> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          3                   1
2: text1 2000-01-26          5                  -1
@sborms
Collaborator

sborms commented May 17, 2021

I thought the wrong ordering was fixed in commit 6efe367 and version 0.8.2. The function as.sento_corpus() refers to sento_corpus() for the ordering.

What version are you using?

What is the output when you do compute_sentiment(corp, lex)?

@odelmarcelle
Contributor Author

odelmarcelle commented May 17, 2021

The issue is not about a wrong order - the re-ordering done by sento_corpus is correct. The danger comes from the fact that the initial corpus is un-ordered, and so are the tokens constructed from it. Carelessly passing these tokens to compute_sentiment creates the issue.

I believe there should be some sort of warning or check to prevent using the tokens argument if the order does not match.

Version is 0.8.4, and here are some other outputs:

> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          3                   1
2: text1 2000-01-26          5                  -1
> compute_sentiment(st, lex, tokens = as.list(tokens(st)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          5                  -1
2: text1 2000-01-26          3                   1
> compute_sentiment(corp, lex)
      id word_count GI_en
1: text1          3     1
2: text2          5    -1

@sborms
Collaborator

sborms commented May 18, 2021

Alright, got it, that's good news at least! Outputs make sense.

In the documentation of compute_sentiment(), the description of the tokens argument already says: "... Make sure the tokens are constructed from (the texts from) the x argument, are unigrams, and preferably set to lowercase, otherwise, results may be spurious and errors could occur. ...". It hints to the user that this is their own responsibility.

What do you suggest to efficiently compare the corpus input with the tokens input?

Possible clean solution: whenever tokens is not null, print a message() saying "Make sure the tokens are constructed from (the texts from) the x argument!". Not sure if there's more we can do.
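
As a rough illustration of how cheap such a comparison could be, here is a minimal base-R sketch. It is not sentometrics code: `check_tokens_order()` and its arguments are hypothetical, and it assumes the tokens list may carry document names (which seems to be the case for `as.list()` on a quanteda tokens object, but treat that as an assumption). When names are absent, it falls back to the message proposed above.

```r
# Hypothetical helper, not part of sentometrics.
# ids:    document ids in the order used by the (re-ordered) corpus
# tokens: list of character vectors (one per document), or NULL
check_tokens_order <- function(ids, tokens) {
  if (is.null(tokens)) return(invisible(TRUE))
  if (length(tokens) != length(ids)) {
    stop("'tokens' must have one element per document.")
  }
  if (is.null(names(tokens))) {
    # Nothing to compare against, so only remind the user.
    message("Make sure the tokens are constructed from (the texts from) the x argument!")
    return(invisible(TRUE))
  }
  if (!identical(names(tokens), ids)) {
    warning("The names of 'tokens' do not match the order of the document ids.")
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Toy data mirroring the reproducible example: the corpus is sorted by date,
# the tokens are still in the original order.
ids  <- c("text2", "text1")
toks <- list(text1 = c("good", "good", "good"),
             text2 = c("bad", "bad", "bad", "bad", "bad"))
ok <- suppressWarnings(check_tokens_order(ids, toks))
```

Comparing names only is a deliberate shortcut: it avoids re-tokenizing the corpus, at the cost of trusting that the names were not altered by the user.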

@odelmarcelle
Contributor Author

I think a printed message could help, especially when sento_corpus() reorders the corpus.
Alternatively, tokens could expect a named list whose names are the texts' IDs.
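
To make the named-list idea concrete, here is a small base-R sketch of what matching by ID might look like. `align_tokens()` is a hypothetical helper, not an existing sentometrics function, and the toy ids and tokens simply mirror the reproducible example from the first comment.

```r
# Hypothetical helper, not part of sentometrics.
# Reorders a named list of tokens so that it follows the given document ids.
align_tokens <- function(tokens, ids) {
  if (is.null(names(tokens))) {
    stop("'tokens' must be a named list.")
  }
  missing_ids <- setdiff(ids, names(tokens))
  if (length(missing_ids) > 0) {
    stop("No tokens found for: ", paste(missing_ids, collapse = ", "))
  }
  tokens[ids]  # subsetting by name returns the elements in the order of 'ids'
}

toks <- list(text1 = c("good", "good", "good"),
             text2 = c("bad", "bad", "bad", "bad", "bad"))
ids <- c("text2", "text1")  # order of the corpus after sorting by date

aligned <- align_tokens(toks, ids)
```

With names available, the input order of the tokens no longer matters, which is why this option is safer than a message alone.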

@sborms
Collaborator

sborms commented May 21, 2021

Sounds good. I prefer the first option, it’s the least invasive one, although the second one is a bit safer. Feel free to adapt and file a pull request, otherwise we’ll take this up later.
