
Error in sentenceSimil #11

Closed
Monduiz opened this issue Dec 10, 2017 · 4 comments
Monduiz commented Dec 10, 2017

Apologies if you have covered this before. I am trying to understand what causes lexRank to fail on some texts; I am running into this with a lot of text material.

Here is a reproducible example with the CNN dataset:

library(tidyverse)  # readr, purrr, dplyr, tibble
library(lexRankr)

data_path <- "cnn/stories"

files <- list.files(data_path, pattern = "story$")

# files_sample <- sample(files, 30)

file_two <- files[2]

# read the story file(s) line by line, one line per row
data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>%
  unlist() %>%
  tibble(articles = .)

data <- data %>%
  mutate(doc_id = seq_along(articles))

sent <- data %>%
  pull(articles) %>%
  as.character()

lexRank(sent)

I am trying to determine the cause of the error, which I hit in a high number of articles from different sources. Could it be a pre-processing problem?

AdamSpannbauer (Owner) commented
I'm assuming this is related to the way tfidf was being calculated before computing sentence similarity (I did not download the Google Drive file).

The inverse document frequency was calculated as idf(d, t) = log( n / df(d, t) ); this is 0 when a term appears in every document, which forces that term's tfidf to 0 as well. A zero lower bound doesn't make much sense here, so the idf calculation has been changed and will go out with the next release to CRAN (I had been meaning to change it but forgot to make a note). The updated idf calculation has a minimum of 1: idf(d, t) = log( n / df(d, t) ) + 1.

You can see if this fixes your issue by installing from github using devtools::install_github("AdamSpannbauer/lexRankr").
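For intuition, the difference between the two idf formulas can be sketched in a few lines of R (toy counts for illustration, not lexRankr's internal code):

```r
# toy example: 3 documents; "the" appears in all 3, "cat" in only 1
n  <- 3
df <- c(the = 3, cat = 1)  # document frequency per term

idf_old <- log(n / df)      # "the" -> log(1) = 0, zeroing its tfidf
idf_new <- log(n / df) + 1  # "the" -> 1, so its tfidf stays nonzero

idf_old[["the"]]  # 0
idf_new[["the"]]  # 1
```

With the old formula, a term present in every document contributes nothing to similarity; the +1 floor keeps such terms from zeroing out entire sentences.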


Monduiz commented Dec 11, 2017

Thank you for looking into this! I installed the dev version from GitHub and I got this error:

Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, :
Only one sentence had nonzero tfidf scores. Similarities would return as NaN

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1
lexRankr_0.4.1

AdamSpannbauer (Owner) commented

In your call to lexRankr::lexRank, the function assumes each element of the input character vector is a separate document, which is what's creating the confusion. Here are some options.

#read text
text_lines = trimws(readLines(file_two))
#rm blank lines
text_lines = text_lines[text_lines != '']
#rm lines that == "@highlight" (shown as bad lines during manual inspection)
text_lines = text_lines[text_lines != "@highlight" ]

########################################
# OPTION 1
########################################
#assume line ends are sentence ends; add periods to help the parser
collapsed = paste0(text_lines, collapse=". ")
#fix double periods introduced by the collapse
collapsed = gsub("..", ".", collapsed, fixed = TRUE)
#call lexrank
lexRankr::lexRank(collapsed)

########################################
# OPTION 2
########################################
#create df; only 1 doc, so only one doc_id
dt = data.table::data.table(doc_id=1, text_lines=text_lines)
#parse sentences
dt = lexRankr::unnest_sentences(dt, sents, text_lines)
#correct sentence ids (do within doc_id if multiple docs)
#something that needs to be fixed in unnest_sentences function
dt[,sent_id := 1:.N]
#lexrank sentences
ranked = lexRankr::bind_lexrank(dt, sents, doc_id, sent_id)
#extract top 3
ranked[order(-lexrank), ][1:3,]

The 2nd option surfaces an issue with the unnest_sentences function: sentence ids are not assigned correctly. I've created #12 to fix the problem.
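Until #12 lands, the sent_id correction above can be generalized to multiple documents by grouping the renumbering by doc_id (same data.table idiom; toy data standing in for unnest_sentences output):

```r
library(data.table)

# toy stand-in for parsed-sentence output spanning two documents
dt <- data.table(doc_id = c(1, 1, 2, 2, 2),
                 sents  = c("A.", "B.", "C.", "D.", "E."))

# renumber sentences within each document
dt[, sent_id := seq_len(.N), by = doc_id]

dt$sent_id  # 1 2 1 2 3
```

The `by = doc_id` restarts the `seq_len(.N)` counter per document, which is what bind_lexrank expects of sentence ids.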


Monduiz commented Dec 11, 2017

Adam, thanks! It works with bind_lexrank. #12 will help with this!

Monduiz closed this as completed Dec 11, 2017