Page (possibly UTF content) makes plugins/web/titleSnarfer breaks (Uncaught exception) #1359

Rodrigo-NH · 2019-02-01T00:02:13Z

Hi. While checking why titleSnarfer won't return the page title for http://lastsummer.de/creating-custom-packages-on-freebsd, I found this in the logs:

ERROR 2019-01-31T18:23:44 supybot Printed to stderr after daemonization: Exception in thread Thread #62 (for snarfing http://lastsummer.de/creating-custom-packages-on-freebsd/):
 Traceback (most recent call last):
   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
     self.run()
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 166, in run
     super(UrlSnarfThread, self).run()
   File "/usr/local/lib/python2.7/threading.py", line 754, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 217, in doSnarf
     f(self, irc, msg, match, *L, **kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 194, in titleSnarfer
     r = self.getTitle(irc, url, False)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 166, in getTitle
     parser.feed(text)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 117, in feed
     self.goahead(0)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 161, in goahead
     k = self.parse_starttag(i)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
     attrvalue = self.unescape(attrvalue)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 476, in unescape
     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
   File "/usr/local/lib/python2.7/re.py", line 155, in sub
     return _compile(pattern, flags).sub(repl, string, count)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
 INFO 2019-01-31T18:23:45 supybot Flushers flushed and garbage collected.
 ERROR 2019-01-31T18:24:15 supybot Unhandled error message from server: IrcMsg(prefix="cherryh.freenode.net", command="412", args=('VimDiesel', 'No text to send'))
 ERROR 2019-01-31T18:24:16 supybot Unhandled error message from server: IrcMsg(prefix="cherryh.freenode.net", command="412", args=('VimDiesel', 'No text to send'))

In this block from plugins/Web/plugin.py:

        try:
            text = text.decode(utils.web.getEncoding(text) or 'utf8',
                    'replace')
        except UnicodeDecodeError:
            pass
        parser = Title()
        if minisix.PY3 and isinstance(text, bytes):
            if raiseErrors:
                irc.error(_('Could not guess the page\'s encoding. (Try '
                        'installing python-charade.)'), Raise=True)
            else:
                return None
        parser.feed(text)
        parser.close()

It seems that 'text' is the whole page to be parsed by HTMLParser. Anyway, changing line 166 (in my copy of plugin.py) from
parser.feed(text)
to
parser.feed(text.encode('utf-8'))

fixed the problem for this specific page, while other pages (as far as I tested) keep working as usual.
I can't determine what the underlying problem is or how relevant it might be; I'm reporting it here in case this example is useful.
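
As a side note, the decode step in the block above and the error message from the log can be tried in isolation. The following is only a minimal Python 3 sketch, not the plugin's code: decode_page is a hypothetical helper that mirrors the text.decode(... or 'utf8', 'replace') call, and the sample bytes are made up. It shows that decoding with 'replace' does not raise for undecodable bytes, while a strict ASCII decode of byte 0xbb produces the same message as in the traceback.

```python
def decode_page(raw, declared_encoding=None):
    """Hypothetical helper: decode page bytes, falling back to UTF-8.

    With the 'replace' error handler, undecodable bytes become U+FFFD
    instead of raising UnicodeDecodeError.
    """
    return raw.decode(declared_encoding or "utf8", "replace")


# A lone 0xbb byte is not valid UTF-8, so it is replaced, not rejected.
decoded = decode_page(b"caf\xc3\xa9 \xbb")

# A strict ASCII decode of the same byte fails with the message from the log.
try:
    b"\xbb".decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(repr(decoded))
print(message)
```

This suggests that the failure is not in the decode step itself but in a later implicit ASCII conversion, although the sketch does not show where that conversion happens.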

The current (running) version of this Limnoria is installed on 2018-12-22T04-00-46, running on Python 2.7.15 (default, Dec 20 2018, 01:13:53) [GCC 4.2.1 Compatible FreeBSD Clang 6.0.0 (tags/RELEASE_600/final 326565)]. The newest versions available online are 2019.01.27 (in testing), 2018.12.19 (in master).
Salute!

progval · 2019-02-01T09:52:59Z

Hi,

This is caused by a bug in Python 2's re module that Limnoria already triggers elsewhere.
I'll commit a fix based on your solution later this week.

You should also upgrade to Python 3 if possible (sudo pip uninstall limnoria; cd /usr/ports/irc/py-limnoria && sudo PYTHON_VERSION=3.6 make install clean)
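
For comparison, here is a minimal sketch of title extraction on Python 3 using the standard html.parser module. TitleParser and get_title are hypothetical names, not the Web plugin's classes, and the sample page is made up. It only illustrates that a str containing &raquo; in both an attribute value and the title is parsed without mixing bytes and text.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical parser that collects the text of the first <title>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.data.append(data)


def get_title(text):
    """Return the whitespace-normalized title, or None if there is none."""
    parser = TitleParser()
    parser.feed(text)
    parser.close()
    title = " ".join("".join(parser.data).split())
    return title or None


# Made-up page with a named entity in an attribute value and in the title.
page = (
    '<html><head><link title="Blog &raquo; Feed" href="/feed">'
    "<title>Custom &raquo; packages</title></head><body></body></html>"
)
print(get_title(page))
```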

Thanks for the report

Rodrigo-NH · 2019-02-03T04:34:19Z

Hi there.
Unfortunately, the fix didn't stop the error from happening for other links, at least for a link to another site that someone pasted into the channel.
Also, I'm not sure where this comes from (my setup or Limnoria), but after the link that triggered the uncaught error was posted in the channel, Limnoria keeps repeating exactly the same error once an hour, referring to the same offending link, even though it wasn't pasted in the channels again and even after unloading the Web plugin.
The new link that resulted in the error is https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks
I tried some more coherent alternatives to get rid of the error, but only got it working after importing lxml and using it to retrieve the page's title. Somehow it has done the trick so far.

The page source says 'meta charset="UTF-8"', yet the ".encode('utf-8')" still failed. I have searched but couldn't find python 2 functions to retrieve url encoding from source (like requests.get.encoding in Python 3) to experiment further.

I changed it from this block:

    try:
        parser = Title()
        parser.feed(text)
    except UnicodeDecodeError:
        # Workaround for Python 2
        # https://github.com/ProgVal/Limnoria/issues/1359
        parser = Title()
        parser.feed(text.encode('utf8'))
    parser.close()
    title = utils.str.normalizeWhitespace(''.join(parser.data).strip())

To:

    # Workaround for Python 2
    # https://github.com/ProgVal/Limnoria/issues/1359
    if sys.version_info.major == 2:
        page = urlopen(url)
        pageparse = lxml.html.parse(page)
        title = utils.str.normalizeWhitespace(pageparse.find(".//title").text)
    else:
        parser = Title()
        parser.feed(text)
        parser.close()
        title = utils.str.normalizeWhitespace(''.join(parser.data).strip())

As I said, I had to add a new import (lxml.html) to Limnoria :(
This works for http://lastsummer.de/creating-custom-packages-on-freebsd and https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks and other links I tested at random.
I will leave this running and watch for new errors (I hope they never show up again for this :))
Reference: https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte

Thanks!

progval · 2019-02-03T09:40:18Z

Could you post the new error message?

> I have searched but couldn't find python 2 functions to retrieve url encoding from source

There's supybot.utils.web.getEncoding.
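
To illustrate the idea only: a helper of this kind can look for a charset declaration in the first bytes of the page. The sketch below is a hypothetical sniff_charset function written for this comment; it is not the implementation of supybot.utils.web.getEncoding, and it assumes the declaration appears in a meta tag within the first 4096 bytes.

```python
import re

# Hypothetical pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(raw):
    """Return the lowercased charset declared in a meta tag, or None."""
    match = _CHARSET_RE.search(raw[:4096])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(sniff_charset(b'<html><head><meta charset="UTF-8"></head></html>'))
```

A caller would typically fall back to UTF-8 when the function returns None.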

Rodrigo-NH · 2019-02-03T15:39:16Z

Hi ProgVal.
Thanks for the tip, I will try using supybot.utils.web.getEncoding.
The new error is similar to the old one:

ERROR 2019-02-03T04:11:18 supybot Printed to stderr after daemonization: Exception in thread Thread #299 (for snarfing https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks/):
 Traceback (most recent call last):
   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
     self.run()
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 166, in run
     super(UrlSnarfThread, self).run()
   File "/usr/local/lib/python2.7/threading.py", line 754, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 217, in doSnarf
     f(self, irc, msg, match, *L, **kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 199, in titleSnarfer
     r = self.getTitle(irc, url, False)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 171, in getTitle
     parser.feed(text.encode('utf8'))
   File "/usr/local/lib/python2.7/HTMLParser.py", line 117, in feed
     self.goahead(0)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 161, in goahead
     k = self.parse_starttag(i)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
     attrvalue = self.unescape(attrvalue)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 476, in unescape
     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
   File "/usr/local/lib/python2.7/re.py", line 155, in sub
     return _compile(pattern, flags).sub(repl, string, count)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
 INFO 2019-02-03T04:11:18 supybot Flushers flushed and garbage collected.

With the new version of the code block that I'm testing, I got 'AttributeError: 'NoneType' object has no attribute 'text'' in 'title = utils.str.normalizeWhitespace(pageparse.find(".//title").text)', but this one is easier to understand and fix.
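
A possible guard for that case, as a minimal sketch: find() returns None when the page has no title element, so the result needs a check before reading .text. The example uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that lxml's find() behaves the same way for a missing element; title_or_none and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def title_or_none(root):
    """Return the normalized <title> text, or None if it is missing or empty."""
    node = root.find(".//title")
    if node is None or node.text is None:
        return None
    # Collapse runs of whitespace, similar in spirit to normalizeWhitespace.
    return " ".join(node.text.split()) or None


# Made-up documents: one with a title, one without.
with_title = ET.fromstring(
    "<html><head><title>  Jail   host </title></head><body/></html>"
)
without_title = ET.fromstring("<html><head/><body/></html>")

print(title_or_none(with_title))
print(title_or_none(without_title))
```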

Thanks!

progval · 2019-02-03T15:58:12Z

> I will try using supybot.utils.web.getEncoding.

The Web plugin already does... And everything we're dealing with here is either ASCII or UTF-8 so the problem's not there.

I'm sorry, but I don't know what to do other than upgrading to Python 3

Rodrigo-NH · 2019-02-03T17:10:17Z

Nice! Thanks for the explanation.

Backport fixes for the Web plugin [1][2][3].
[1] progval/Limnoria#1371
[2] progval/Limnoria#1362
[3] progval/Limnoria#1359
Submitted by: DanDare (GitHub: Rodrigo-NH, via IRC)
git-svn-id: svn+ssh://svn.freebsd.org/ports/head@513446 35697150-7ecd-e111-bb59-0022644237b5

progval added Bug Python 2 compatibility labels Feb 1, 2019

progval closed this as completed in 0f82f89 Feb 21, 2019

Rodrigo-NH mentioned this issue Sep 30, 2019

Fix 'Uncaught exception', makes check for 'charade' dep to logs, others #1371

Closed

Rodrigo-NH mentioned this issue Oct 11, 2019

CatchTitleSnarferErrors #1377

Merged
