Error on parse an URL #5

stardiviner · 2018-05-12T06:46:02Z

(require 'elquery)
(with-current-buffer (url-retrieve-synchronously "https://nginx.org/en/docs/dirindex.html")
  (message "%S"
           (elquery-$ "div #content"
                      (elquery-read-string (buffer-string)))))

The result is nil, but <div id="content"> is not empty.

AdamNiederer · 2018-05-12T07:20:49Z

That's super weird - Emacs is retrieving the page over the network correctly, but that element is stripped when elquery calls libxml-parse-html-region to turn the HTML into a list.

(require 'cl-lib)
(cl-search "\"content" ; changing to "\"menu" reveals the <div id="menu">
           (prin1-to-string (with-temp-buffer
                              ;; `insert' replaces the obsolete `insert-string'
                              (insert (with-current-buffer (url-retrieve-synchronously "https://nginx.org/en/docs/dirindex.html")
                                        (buffer-string)))
                              (libxml-parse-html-region (point-min) (point-max)))))

Could this be an issue with libxml-parse-html-region? elquery-read-string definitely isn't seeing the div in question.

EDIT: Yeah, removing the two billion nodes within div#content fixes the issue. I think libxml is hitting some sort of internal DoS protection mechanism.

stardiviner · 2018-05-12T07:29:01Z

I tested this too and can confirm the element is stripped. Retrieving the same page with Python gives the correct result.
So this looks like a problem on the libxml side. I will report the bug to the Emacs mailing list.
Thanks.

AdamNiederer · 2018-05-12T07:29:58Z

No problem; thanks for pointing it out!

npostavs · 2018-05-13T19:57:16Z

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31427#8

elquery-read-string should probably throw an error (or warn, at least) when

(multibyte-string-p string) => nil
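
A minimal sketch of such a guard, assuming it runs before the string is handed to the parser. The helper name `my/ensure-decoded` and its error message are hypothetical and not part of elquery. Because a pure-ASCII string is also unibyte, the sketch rejects only unibyte strings that contain bytes above 127:

```elisp
;; Hypothetical helper, not part of elquery: the name and the error
;; message are made up for illustration.
(require 'seq)

(defun my/ensure-decoded (string)
  "Return STRING, or signal an error if it looks like undecoded bytes.
A unibyte string holding only ASCII is harmless, so only unibyte
strings with bytes above 127 are rejected."
  (when (and (not (multibyte-string-p string))
             (seq-some (lambda (byte) (> byte 127)) string))
    (error "Expected a decoded (multibyte) string, got raw bytes"))
  string)
```

A caller could wrap its input as `(my/ensure-decoded html)`. Whether to signal an error or only warn is a design choice left open in this thread.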

AdamNiederer · 2018-05-13T23:26:38Z

It sounds like simply setting the temp buffer to unibyte would fix this. 59f93b8 appears to work for both multibyte and unibyte strings.

stardiviner · 2018-05-14T01:15:27Z

Yes, confirmed.

npostavs · 2018-05-14T02:35:25Z

It could fail for web pages which are encoded in something other than utf-8. Although utf-8 is probably the most common encoding. Also, you might have trouble passing multibyte strings now (e.g., if the source of the string is not from a web page).

(let ((string "<html><body>α</body></html>"))
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert string)
    (libxml-parse-html-region (point-min) (point-max))))
;=> (html nil (body nil "\261"))

(let ((string "<html><body>α</body></html>"))
  (with-temp-buffer
    (insert string)
    (libxml-parse-html-region (point-min) (point-max))))
;=> (html nil (body nil "α"))

stardiviner · 2018-05-14T02:48:53Z

You could try checking whether the buffer string is multibyte with multibyte-string-p.

stardiviner · 2018-05-14T02:49:38Z

BTW @AdamNiederer In function elquery-read-string, there is no insert-string. Is it a bug or an Emacs version compatibility issue?

AdamNiederer · 2018-05-14T02:58:46Z

> BTW @AdamNiederer In function elquery-read-string, there is no insert-string. Is it a bug or an Emacs version compatibility issue?

Oh, looks like it's deprecated in 25.x. Thanks for the heads up.

> It could fail for web pages which are encoded in something other than utf-8. Although utf-8 is probably the most common encoding. Also, you might have trouble passing multibyte strings now (e.g., if the source of the string is not from a web page).

I tried converting the unibyte string with string-to-multibyte, but it looks like that has the same issues as inserting a unibyte string into a multibyte buffer. Is there a reasonable way to do the right thing if all I'm given is a string? Decoding the results of an HTTP request isn't really within the scope of this library.

npostavs · 2018-05-14T03:34:00Z

> Decoding the results of an HTTP request isn't really within the scope of this library.

Right, that's why I suggested signaling an error or warning. You could possibly decode with undecided as the coding system to let Emacs guess the encoding, although that's not completely reliable of course.
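
As a rough illustration of that decoding idea, here is a sketch that decodes only when the input is unibyte. The helper name `my/decode-html-bytes` is hypothetical and this is not what elquery does. `decode-coding-string` and the `undecided` coding system are standard Emacs, but the guess made by `undecided` depends on Emacs's detection heuristics and coding priorities, so passing an explicit coding system is safer when the encoding is known:

```elisp
;; Hypothetical helper, not part of elquery.
(defun my/decode-html-bytes (string &optional coding-system)
  "Return STRING decoded to characters if it is unibyte.
CODING-SYSTEM defaults to `undecided', which lets Emacs guess.
Multibyte strings are assumed to be decoded already and are
returned unchanged."
  (if (multibyte-string-p string)
      string
    (decode-coding-string string (or coding-system 'undecided))))

;; "\316\261" is a unibyte string holding the UTF-8 bytes of U+03B1.
;; `string-to-multibyte' would keep these as two raw-byte characters,
;; whereas decoding as UTF-8 yields a single character.
(my/decode-html-bytes "\316\261" 'utf-8)
```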

AdamNiederer · 2020-06-28T16:11:54Z

This exact case, as well as the issue described in #9, has been fixed in b74e2a6. However, there are still less-common strings which will cause problems with elquery-read-string. I'm going to track that problem in #10.

AdamNiederer closed this as completed May 12, 2018

AdamNiederer added the upstream label May 12, 2018

AdamNiederer reopened this May 13, 2018

AdamNiederer added bug and removed upstream labels May 13, 2018

stardiviner closed this as completed May 14, 2018

AdamNiederer reopened this May 14, 2018

AdamNiederer closed this as completed in b74e2a6 Jun 28, 2020

AdamNiederer mentioned this issue Jun 28, 2020

Re-encode non-UTF-8 unibyte strings, or refuse to read them #10

Open
