
Error in sentenceSimil #11

Closed
Monduiz opened this issue Dec 10, 2017 · 4 comments
Monduiz commented Dec 10, 2017

Apologies if you have covered this before. I am trying to understand what causes lexRank to fail on some texts; I am running into this with a lot of text material.

Here is a reproducible example with the CNN dataset:

library(tidyverse)  # readr, purrr, dplyr, tibble
library(lexRankr)

data_path <- "cnn/stories"

files <- list.files(data_path, pattern = "story$")

# files_sample <- sample(files, 30)

file_two <- files[2]

# read the story file(s) line by line, one line per row
data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>%
  unlist() %>%
  tibble(articles = .)

data <- data %>%
  mutate(doc_id = seq_along(articles))

sent <- data %>%
  pull(articles) %>%
  as.character()

lexRank(sent)

I am trying to determine the cause of the error, which I hit in a high number of articles from different sources. Could it be a pre-processing problem?

AdamSpannbauer (Owner) commented
I'm assuming this is related to the way tfidf was being calculated before computing sentence similarity (I did not download the Google Drive file).

The inverse document frequency was calculated as idf(d, t) = log( n / df(d, t) ); this is 0 when a term appears in every document, which forces that term's tfidf to 0 as well. A zero lower bound doesn't make much sense here, so the idf calculation has been changed and will go out with the next release to CRAN (I had been meaning to change it but forgot to make a note). The updated idf calculation has a minimum of 1: idf(d, t) = log( n / df(d, t) ) + 1.

You can see if this fixes your issue by installing from github using devtools::install_github("AdamSpannbauer/lexRankr").
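For intuition, the difference between the two idf formulas can be sketched in a few lines of R (toy counts for illustration, not lexRankr's internal code):

```r
# toy example: 3 documents; "the" appears in all 3, "cat" in only 1
n  <- 3
df <- c(the = 3, cat = 1)  # document frequency per term

idf_old <- log(n / df)      # "the" -> log(1) = 0, zeroing its tfidf
idf_new <- log(n / df) + 1  # "the" -> 1, so its tfidf stays nonzero

idf_old[["the"]]  # 0
idf_new[["the"]]  # 1
```

With the old formula, a term present in every document contributes nothing to similarity; the +1 floor keeps such terms from zeroing out entire sentences.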


Monduiz commented Dec 11, 2017

Thank you for looking into this! I installed the dev version from GitHub and I got this error:

Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, :
Only one sentence had nonzero tfidf scores. Similarities would return as NaN

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1
lexRankr_0.4.1

AdamSpannbauer (Owner) commented

In your call to lexRankr::lexRank, the function assumes each element of the input character vector is a separate document, which is what's creating the confusion. Here are some options.

#read text
text_lines = trimws(readLines(file_two))
#rm blank lines
text_lines = text_lines[text_lines != '']
#rm lines that == "@highlight" (shown as bad lines during manual inspection)
text_lines = text_lines[text_lines != "@highlight" ]

########################################
# OPTION 1
########################################
#assume line ends are sentence ends; add periods to help the parser
collapsed = paste0(text_lines, collapse=". ")
#fix double periods introduced by the collapse
collapsed = gsub("..", ".", collapsed, fixed = TRUE)
#call lexrank
lexRankr::lexRank(collapsed)

########################################
# OPTION 2
########################################
#create df; only 1 doc, so only one doc_id
dt = data.table::data.table(doc_id=1, text_lines=text_lines)
#parse sentences
dt = lexRankr::unnest_sentences(dt, sents, text_lines)
#correct sentence ids (do within doc_id if multiple docs)
#something that needs to be fixed in unnest_sentences function
dt[,sent_id := 1:.N]
#lexrank sentences
ranked = lexRankr::bind_lexrank(dt, sents, doc_id, sent_id)
#extract top 3
ranked[order(-lexrank), ][1:3,]

The 2nd option surfaces an issue with the unnest_sentences function: sentence ids are not assigned correctly. I've created #12 to fix the problem.
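Until #12 lands, the sent_id correction above can be generalized to multiple documents by grouping the renumbering by doc_id (same data.table idiom; toy data standing in for unnest_sentences output):

```r
library(data.table)

# toy stand-in for parsed-sentence output spanning two documents
dt <- data.table(doc_id = c(1, 1, 2, 2, 2),
                 sents  = c("A.", "B.", "C.", "D.", "E."))

# renumber sentences within each document
dt[, sent_id := seq_len(.N), by = doc_id]

dt$sent_id  # 1 2 1 2 3
```

The `by = doc_id` restarts the `seq_len(.N)` counter per document, which is what bind_lexrank expects of sentence ids.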


Monduiz commented Dec 11, 2017

Adam, thanks! It works with bind_lexrank. #12 will help with this!

Monduiz closed this as completed Dec 11, 2017