Danger when using `tokens` from an un-ordered corpus #25

odelmarcelle · 2021-05-17T14:00:44Z

I like tokenizing text by myself before using compute_sentiment. My usual framework is to start from a quanteda::corpus, from which I create a sento_corpus and a quanteda::tokens object.

I just realized that since as.sento_corpus re-orders the quanteda::corpus by date, the order of the sento_corpus and the order of the tokens object no longer match. This leads to sentiment being allocated to the wrong texts.

I realize that in an ideal world, the safest way would be to use as.list(tokens(x)) when calling compute_sentiment. But I feel that this error is very difficult to notice as there is no warning, and I see situations where you would handle tokenization separately from the sento_corpus object.

Reproducible example:

> library(quanteda)
> library(sentometrics)
> e <- data.frame(text = c("good good good", "bad bad bad bad bad"), date = c("2000-01-26", "2000-01-03"))
> 
> corp <- corpus(e)
> st <- as.sento_corpus(corp)
We detected no features, so we added a dummy feature 'dummyFeature'.
> lex <- sento_lexicons(list_lexicons["GI_en"])
> 
> compute_sentiment(st, lex)
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          5                  -1
2: text1 2000-01-26          3                   1
> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          3                   1
2: text1 2000-01-26          5                  -1


sborms · 2021-05-17T16:00:03Z

I thought the wrong ordering was fixed in this commit 6efe367, and version 0.8.2. The function as.sento_corpus() refers to sento_corpus() for the ordering.

What version are you using?

What is the output when you do compute_sentiment(corp, lex)?

odelmarcelle · 2021-05-17T17:11:55Z

The issue is not about a wrong order: the re-ordering done by sento_corpus is correct. The danger comes from the fact that the initial corpus is un-ordered, and so are the tokens constructed from that corpus. Carelessly passing these tokens to compute_sentiment creates the issue.

I believe there should be some sort of warning or check to prevent using the tokens argument if the order does not match.

Version is 0.8.4, and here are some other outputs:

> compute_sentiment(st, lex, tokens = as.list(tokens(corp)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          3                   1
2: text1 2000-01-26          5                  -1
> compute_sentiment(st, lex, tokens = as.list(tokens(st)))
      id       date word_count GI_en--dummyFeature
1: text2 2000-01-03          5                  -1
2: text1 2000-01-26          3                   1
> compute_sentiment(corp, lex)
      id word_count GI_en
1: text1          3     1
2: text2          5    -1

sborms · 2021-05-18T20:55:24Z

Alright, got it, that's good news at least! Outputs make sense.

In the documentation of compute_sentiment(), the description of the tokens argument already says: "... Make sure the tokens are constructed from (the texts from) the x argument, are unigrams, and preferably set to lowercase, otherwise, results may be spurious and errors could occur. ...". It hints that getting this right is the user's own responsibility.

What do you suggest to efficiently compare the corpus input with the tokens input?

Possible clean solution: whenever tokens is not null, print a message() saying "Make sure the tokens are constructed from (the texts from) the x argument!". Not sure if there's more we can do.
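As a rough illustration of the message idea, here is a minimal base-R sketch. The helper name check_tokens_order and its ids argument are hypothetical and not part of sentometrics. It assumes the document ids of x are available as a character vector in corpus order, and it can only compare orders when tokens happens to be a named list.

```r
# Hypothetical helper (not part of sentometrics): remind the user about the
# tokens/x correspondence and, when names are available, compare the order.
check_tokens_order <- function(tokens, ids) {
  if (is.null(tokens)) return(invisible(TRUE))
  message("Make sure the tokens are constructed from (the texts from) the x argument!")
  # The comparison is only possible if the token list carries document names.
  if (!is.null(names(tokens)) && !identical(names(tokens), ids)) {
    warning("Names of `tokens` do not match the document order of `x`.")
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Ids in the date-sorted order of the sento_corpus from the example above.
ids <- c("text2", "text1")
# Tokens built from the original, un-ordered corpus.
toks <- list(
  text1 = c("good", "good", "good"),
  text2 = c("bad", "bad", "bad", "bad", "bad")
)
check_tokens_order(toks, ids)  # prints the message and raises a warning
```

An unnamed list would pass silently apart from the message, which is the main limitation of this option.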

odelmarcelle · 2021-05-20T11:48:51Z

I think a printed message could help, especially when sento_corpus() re-orders the corpus.
Alternatively, tokens could be expected to be a named list whose names are the texts' IDs.
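
The named-list option could look roughly like the base-R sketch below. The helper align_tokens is hypothetical and not part of sentometrics. It assumes the ids of the re-ordered corpus are available as a character vector and that every document has an entry in the token list.

```r
# Hypothetical helper (not part of sentometrics): reorder a named token list
# so that it follows the document ids of the (re-ordered) corpus.
align_tokens <- function(tokens, ids) {
  if (is.null(names(tokens))) {
    stop("`tokens` must be a named list keyed by document id.")
  }
  missing_ids <- setdiff(ids, names(tokens))
  if (length(missing_ids) > 0) {
    stop("No tokens supplied for: ", paste(missing_ids, collapse = ", "))
  }
  tokens[ids]  # subsetting by name puts the tokens in corpus order
}

# Tokens in the original (un-ordered) corpus order, as in the example above.
toks <- list(
  text1 = c("good", "good", "good"),
  text2 = c("bad", "bad", "bad", "bad", "bad")
)
# Ids in the date-sorted order used by the sento_corpus.
ids <- c("text2", "text1")

aligned <- align_tokens(toks, ids)
lengths(aligned)  # word counts now follow the sento_corpus order
```

Until something like this exists in the package, the same reordering by name could be done by hand before calling compute_sentiment, provided the token list keeps the document names and the ids of the sento_corpus can be extracted in its sorted order.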

sborms · 2021-05-21T17:13:53Z

Sounds good. I prefer the first option, it’s the least invasive one, although the second one is a bit safer. Feel free to adapt and file a pull request, otherwise we’ll take this up later.
