Issue w/ Encoding while using Fedora #39

Open
keenan-smith-data opened this issue Aug 1, 2022 · 4 comments
Labels
dependency Issue with the dependency package

Comments

@keenan-smith-data

keenan-smith-data commented Aug 1, 2022

Received a warning while trying to scrape text data from various websites.

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
input string '^()(\s)#' cannot be translated to UTF-8, is it valid in
'ANSI_X3.4-1968'?

This warning appears on every subsequent call after the first:

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
restarting interrupted promise evaluation

jacobin_pull <- function(hyperlink) {
  session <- polite::bow(hyperlink)
  temp <- polite::scrape(session)
  text_data <-
    temp |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_nodes("p") |>
    rvest::html_text2() |>
    dplyr::as_tibble() |>
    dplyr::rename(text = value)
  return(text_data)
}

jacobin_pull_try <- function(hyperlink) {
  tryCatch(
    expr = {
      message(paste("Trying", hyperlink))
      jacobin_pull(hyperlink)
    },
    error = function(cond) {
      message(paste("This URL has caused an error:", hyperlink))
      message(cond)
    },
    warning = function(cond) {
      message(paste("URL has a warning:", hyperlink))
      message(cond)
    },
    finally = {
      message(paste("Processed URL:", hyperlink))
    }
  )
}
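A note on the wrapper above: a `warning` handler in `tryCatch()` abandons the call as soon as the first warning is raised, so `jacobin_pull_try()` returns the value of `message()` (`NULL`) rather than the scraped text. Below is a minimal sketch of a variant that logs warnings and still returns the result. The names `pull_with_logging`, `pull_fn`, and `fake_pull` are hypothetical and not part of the original code; `fake_pull` only stands in for `jacobin_pull()` so the sketch runs without network access.

```r
# Hypothetical variant of jacobin_pull_try(): logs warnings without
# discarding the result. `pull_fn` stands in for jacobin_pull().
pull_with_logging <- function(hyperlink, pull_fn) {
  tryCatch(
    withCallingHandlers(
      pull_fn(hyperlink),
      warning = function(cond) {
        message("URL has a warning: ", hyperlink)
        message(conditionMessage(cond))
        invokeRestart("muffleWarning")  # resume evaluation after logging
      }
    ),
    error = function(cond) {
      message("This URL has caused an error: ", hyperlink)
      message(conditionMessage(cond))
      NULL  # explicit "no result" when an error occurs
    },
    finally = message("Processed URL: ", hyperlink)
  )
}

# Stand-in scraper (assumption): raises a warning, then returns a value.
fake_pull <- function(url) {
  warning("cannot be translated to UTF-8")
  data.frame(text = "paragraph")
}

result <- pull_with_logging("https://example.com", fake_pull)
```

With this shape the warning text is still reported, but the data frame produced after the warning is kept, which makes it easier to tell whether the warning actually affects the scraped text.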

jacobin_test_link <- "https://jacobin.com/2022/07/we-still-have-to-take-donald-trump-seriously"

jacobin_test_link_2 <- "https://jacobin.com/2022/07/ukraine-russia-war-debt-forgiveness-us-eu"

jacobin_test_3 <- "https://jacobin.com/2022/06/american-exceptionalism-off-the-rails"

jac_test <- jacobin_pull_try(jacobin_test_link)
jac_test_2 <- jacobin_pull_try(jacobin_test_link_2)
jac_test_3 <- jacobin_pull_try(jacobin_test_3)

Sys Environment:

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Xfce)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
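The warning names 'ANSI_X3.4-1968', which is the codeset glibc reports for the C/POSIX (ASCII) locale, while the locale output above shows UTF-8. A small base-R sketch for checking what a given R process actually sees is below; the helper name `encoding_report` is hypothetical. Running it in the interactive session and in any background or worker R process might show whether the two disagree, though that is only a guess at where the mismatch comes from.

```r
# Report how the current R process sees its locale and native encoding.
encoding_report <- function() {
  info <- l10n_info()
  list(
    lc_ctype = Sys.getlocale("LC_CTYPE"),
    utf8     = isTRUE(info[["UTF-8"]]),
    # `codeset` is only reported on some platforms and R versions
    codeset  = if (is.null(info$codeset)) NA_character_ else info$codeset,
    lang     = Sys.getenv("LANG", unset = NA),
    lc_all   = Sys.getenv("LC_ALL", unset = NA)
  )
}

report <- encoding_report()
str(report)
```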

I tried to look into the source code to discover the issue but it's outside of my current understanding.

rvest::read_html() does not trigger the same error.

EDIT: Forgot to mention, I ran the same code on Windows and did not have the same issue.

@dmi3kno
Owner

dmi3kno commented Aug 2, 2022

I believe this is {robotstxt} issue. @petermeissner can you please look into it?
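One way to check this would be to call {robotstxt} directly, without {polite}, and see whether the same warning appears. The sketch below assumes the robotstxt package is installed and uses its `get_robotstxt()` function; the helper name `check_robots` is hypothetical, and "jacobin.com" is simply the domain from the report.

```r
# Fetch robots.txt for a domain via robotstxt and collect any warnings raised.
# Assumption: the robotstxt package is installed.
check_robots <- function(domain) {
  if (!requireNamespace("robotstxt", quietly = TRUE)) {
    stop("the robotstxt package is needed for this check")
  }
  seen <- character()
  txt <- withCallingHandlers(
    robotstxt::get_robotstxt(domain = domain),
    warning = function(w) {
      seen <<- c(seen, conditionMessage(w))  # record the warning text
      invokeRestart("muffleWarning")         # then continue
    }
  )
  list(robotstxt = txt, warnings = seen)
}

# Not run here because it needs network access:
# str(check_robots("jacobin.com")$warnings)
```

If the UTF-8 warning shows up in `warnings` here, that would point at {robotstxt} or something below it rather than at {polite} itself.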

dmi3kno added the "dependency" label (Issue with the dependency package) Aug 2, 2022
@petermeissner

Hmmm, sounds like a robotstxt issue or even something deeper … did you try running it on another Linux distribution (Debian/Ubuntu)?

@petermeissner

I will look into it

@keenan-smith-data
Author

I did not try any Debian-based distro; I only tried Windows and Fedora 36. I got it to work once on Fedora, but I couldn't reproduce that. I tried changing the locale settings on Fedora, with no effect. When I switched the scraping to read_html(), I got no encoding issue.

I tried other sites and had the same issue. I have a series of functions like the one written above, and each one gave the same series of errors.
