Error on parse an URL #5

Closed
stardiviner opened this issue May 12, 2018 · 12 comments

@stardiviner

(require 'elquery)
(with-current-buffer (url-retrieve-synchronously "https://nginx.org/en/docs/dirindex.html")
  (message "%S"
           (elquery-$ "div #content"
                      (elquery-read-string (buffer-string)))))

This returns nil, even though <div id="content"> is not empty.

@AdamNiederer
Owner

AdamNiederer commented May 12, 2018

That's super weird - Emacs is retrieving the page over the network correctly, but that element is stripped when elquery calls libxml-parse-html-region to turn the HTML into a list.

(require 'cl-lib)
(cl-search "\"content" ; changing to "\"menu" reveals the <div id="menu">
           (prin1-to-string
            (with-temp-buffer
              (insert (with-current-buffer (url-retrieve-synchronously "https://nginx.org/en/docs/dirindex.html")
                        (buffer-string)))
              (libxml-parse-html-region (point-min) (point-max)))))

Could this be an issue with libxml-parse-html-region? elquery-read-string definitely isn't seeing the div in question.

EDIT: Yeah, removing the two billion nodes within div#content fixes the issue. I think libxml is hitting some sort of internal DoS protection mechanism.

@stardiviner
Author

I tested this too and can confirm that the element is stripped. Retrieving the same page with Python gives the correct result.
So this looks like a problem on the libxml side. I will submit a bug report to the Emacs mailing list.
Thanks.

@AdamNiederer
Owner

No problem; thanks for pointing it out!

@npostavs

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31427#8

elquery-read-string should probably throw an error (or warn, at least) when

(multibyte-string-p string) => nil
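
A minimal sketch of such a check, not taken from elquery itself. The names my/undecoded-bytes-p and my/check-decoded are hypothetical. One caveat shapes the design: multibyte-string-p also returns nil for plain ASCII string literals, so the sketch only complains when a unibyte string actually contains bytes above 127.

```elisp
(require 'cl-lib)

;; Hypothetical helper, not part of elquery.
(defun my/undecoded-bytes-p (string)
  "Return t if STRING is unibyte and contains a byte above 127."
  (and (not (multibyte-string-p string))
       (cl-some (lambda (byte) (> byte 127)) string)
       t))

;; Hypothetical wrapper illustrating the suggested error.
(defun my/check-decoded (string)
  "Return STRING, or signal an error if it looks like undecoded bytes."
  (when (my/undecoded-bytes-p string)
    (error "String is unibyte and non-ASCII; decode it first"))
  string)
```

A caller could pass the retrieved text through my/check-decoded before handing it to the parser; replacing error with display-warning would give the softer variant.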

@AdamNiederer
Owner

It sounds like simply setting the temp buffer to unibyte would fix this. 59f93b8 appears to work for both multibyte and unibyte strings.

@AdamNiederer AdamNiederer reopened this May 13, 2018
@AdamNiederer AdamNiederer added bug and removed upstream labels May 13, 2018
@stardiviner
Author

Yes, confirmed.

@npostavs

It could fail for web pages which are encoded in something other than utf-8. Although utf-8 is probably the most common encoding. Also, you might have trouble passing multibyte strings now (e.g., if the source of the string is not from a web page).

(let ((string "<html><body>α</body></html>"))
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert string)
    (libxml-parse-html-region (point-min) (point-max))))
;=> (html nil (body nil "\261"))

(let ((string "<html><body>α</body></html>"))
  (with-temp-buffer
    (insert string)
    (libxml-parse-html-region (point-min) (point-max))))
;=> (html nil (body nil "α"))

@stardiviner
Author

You could try detecting whether the buffer string is multibyte with multibyte-string-p.

@stardiviner
Author

BTW @AdamNiederer In function elquery-read-string, there is no insert-string. Is it a bug or an Emacs version compatibility issue?

@AdamNiederer
Owner

AdamNiederer commented May 14, 2018

> BTW @AdamNiederer In function elquery-read-string, there is no insert-string. Is it a bug or an Emacs version compatibility issue?

Oh, looks like it's deprecated in 25.x. Thanks for the heads up.

> It could fail for web pages which are encoded in something other than utf-8. Although utf-8 is probably the most common encoding. Also, you might have trouble passing multibyte strings now (e.g., if the source of the string is not from a web page).

I tried converting the unibyte string with string-to-multibyte, but it looks like that has the same issues as inserting a unibyte string into a multibyte buffer. Is there a reasonable way to do the right thing if all I'm given is a string? Decoding the results of an HTTP request isn't really within the scope of this library.
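
To illustrate why string-to-multibyte does not help here: it only changes the string's representation and keeps every byte as a separate raw-byte character, whereas decode-coding-string interprets the bytes in a given encoding. A small sketch, assuming the input is UTF-8 (the helper name my/conversion-lengths is hypothetical):

```elisp
;; Hypothetical helper: compare the two conversions by resulting length.
(defun my/conversion-lengths (bytes)
  "Return the lengths of BYTES after representation change and after decoding."
  (list
   ;; Each byte stays its own raw-byte character; nothing is decoded.
   (length (string-to-multibyte bytes))
   ;; The bytes are interpreted as UTF-8 text.
   (length (decode-coding-string bytes 'utf-8))))

(my/conversion-lengths "\316\261") ; the two UTF-8 bytes of "α"
;=> (2 1)
```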

@AdamNiederer AdamNiederer reopened this May 14, 2018
@npostavs

> Decoding the results of an HTTP request isn't really within the scope of this library.

Right, that's why I suggested signaling an error or warning. You could possibly decode with undecided as the coding system to let Emacs guess the encoding, although that's not completely reliable of course.
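
A rough sketch of the undecided approach, under the assumption that unibyte input is undecoded text and multibyte input is already fine. The helper name my/ensure-decoded and its optional argument are hypothetical and not part of elquery:

```elisp
;; Hypothetical helper, not part of elquery.
(defun my/ensure-decoded (string &optional coding-system)
  "Return STRING as decoded text.
Multibyte input is returned unchanged.  Unibyte input is decoded with
CODING-SYSTEM, defaulting to `undecided' so that Emacs guesses."
  (if (multibyte-string-p string)
      string
    (decode-coding-string string (or coding-system 'undecided))))
```

When the caller knows the encoding, for example from an HTTP Content-Type header, passing it explicitly avoids the guess: (my/ensure-decoded bytes 'utf-8). With the default, the result depends on Emacs's coding-system priorities, so it may be wrong for ambiguous input.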

@AdamNiederer
Owner

This exact case, as well as the issue described in #9, has been fixed in b74e2a6. However, there are still less-common strings which will cause problems with elquery-read-string. I'm going to track that problem in #10.
