Feature Request - Create a setter method to set the encoding when parsing the response as a Document #997

Open
julianomqs opened this issue Dec 22, 2017 · 2 comments

Comments

julianomqs commented Dec 22, 2017

Hi,

I'm using your library in some projects here, and it's great.

It would be great to have a setter method to set the encoding when parsing a response as a document.

For example, if I need to execute a POST request and parse the response as a Document with the ISO-8859-1 encoding, I have to do this:

private Document executeRequest(String value2) throws IOException {
    return Jsoup.connect(DEFAULT_URL)
        .timeout(DEFAULT_TIMEOUT)
        .data("param1", "value1")
        .data("param2", value2)
        .userAgent(DEFAULT_USER_AGENT)
        .method(Method.POST)
        .execute()
        .charset(CHARSET_ISO_8859_1)
        .parse();
}

Something like this would be great:

private Document executeRequest(String value2) throws IOException {
    return Jsoup.connect(DEFAULT_URL)
        .timeout(DEFAULT_TIMEOUT)
        .data("param1", "value1")
        .data("param2", value2)
        .userAgent(DEFAULT_USER_AGENT)
        .responseEncoding("ISO-8859-1") // <-- the setter method I'm suggesting, or something like it
        .post();
}

The postDataCharset method sets the charset when sending a POST request, but not for parsing the response as a document.
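
The request-side half of that distinction can be illustrated without jsoup. This is a minimal JDK-only sketch, assuming a request charset only controls how form values are percent-encoded before they are sent. The class and method names are hypothetical, and it is not jsoup's implementation:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RequestCharsetDemo {
    // Percent-encodes a form value with the given charset, the way a
    // form-encoded POST body would carry it. Hypothetical helper.
    static String encodeFormValue(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "S\u00e3o"; // "Sao" with a-tilde as the second letter
        System.out.println(encodeFormValue(value, "ISO-8859-1")); // S%E3o
        System.out.println(encodeFormValue(value, "UTF-8"));      // S%C3%A3o
    }
}
```

Neither call determines how the bytes that come back from the server are decoded, which is the gap this request is about.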

Of course the method name is your choice.

What do you think?

P.S: @krystiangorecki This is the issue with the correct description.

@jhy
Owner

jhy commented Dec 22, 2017

Thanks, makes sense. Is the site not setting the response encoding in a header or meta though, or is jsoup parsing it incorrectly? Trying to understand the root issue.

@julianomqs
Author

In my tests, the site I was scraping didn't declare the encoding in either the response headers or the HTML, but I knew beforehand that the proper encoding was ISO-8859-1.

As far as I know, jsoup parses documents as UTF-8 when it can't detect the document encoding, right?
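
To show why a UTF-8 fallback would garble a page like this, here is a minimal JDK-only sketch. It does not call jsoup. It assumes the server sends ISO-8859-1 bytes with no charset declared, and the class, method names, and sample markup are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class CharsetFallbackDemo {
    // Hypothetical response body: ISO-8859-1 bytes containing a non-ASCII letter.
    static byte[] sampleBody() {
        return "<p>S\u00e3o Paulo</p>".getBytes(StandardCharsets.ISO_8859_1);
    }

    // What a UTF-8 fallback does: the lone 0xE3 byte is not valid UTF-8,
    // so the decoder substitutes the replacement character U+FFFD.
    static String decodeAsUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    // What the caller wants when the real encoding is known in advance.
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] body = sampleBody();
        System.out.println(decodeAsUtf8(body));   // accented letter is lost
        System.out.println(decodeAsLatin1(body)); // original text is recovered
    }
}
```

Once the bytes have been decoded with the wrong charset, the original character cannot be recovered from the resulting string, so the charset has to be chosen before parsing.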

I don't think it's a jsoup bug; more likely it's a problem with the site.

This is the site if you want to check it out.
