Page (possibly UTF content) makes plugins/Web titleSnarfer break (uncaught exception) #1359

Closed
Rodrigo-NH opened this issue Feb 1, 2019 · 6 comments

Comments

@Rodrigo-NH
Contributor

Rodrigo-NH commented Feb 1, 2019

Hi. While checking why titleSnarfer won't return the page title for http://lastsummer.de/creating-custom-packages-on-freebsd I found this in the logs:

ERROR 2019-01-31T18:23:44 supybot Printed to stderr after daemonization: Exception in thread Thread #62 (for snarfing http://lastsummer.de/creating-custom-packages-on-freebsd/):
 Traceback (most recent call last):
   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
     self.run()
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 166, in run
     super(UrlSnarfThread, self).run()
   File "/usr/local/lib/python2.7/threading.py", line 754, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 217, in doSnarf
     f(self, irc, msg, match, *L, **kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 194, in titleSnarfer
     r = self.getTitle(irc, url, False)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 166, in getTitle
     parser.feed(text)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 117, in feed
     self.goahead(0)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 161, in goahead
     k = self.parse_starttag(i)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
     attrvalue = self.unescape(attrvalue)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 476, in unescape
     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
   File "/usr/local/lib/python2.7/re.py", line 155, in sub
     return _compile(pattern, flags).sub(repl, string, count)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
 INFO 2019-01-31T18:23:45 supybot Flushers flushed and garbage collected.
 ERROR 2019-01-31T18:24:15 supybot Unhandled error message from server: IrcMsg(prefix="cherryh.freenode.net", command="412", args=('VimDiesel', 'No text to send'))
 ERROR 2019-01-31T18:24:16 supybot Unhandled error message from server: IrcMsg(prefix="cherryh.freenode.net", command="412", args=('VimDiesel', 'No text to send'))

In this block from plugins/Web/plugin.py:

        try:
            text = text.decode(utils.web.getEncoding(text) or 'utf8',
                    'replace')
        except UnicodeDecodeError:
            pass
        parser = Title()
        if minisix.PY3 and isinstance(text, bytes):
            if raiseErrors:
                irc.error(_('Could not guess the page\'s encoding. (Try '
                        'installing python-charade.)'), Raise=True)
            else:
                return None
        parser.feed(text)
        parser.close()

It seems that 'text' is the whole page to be parsed by HTMLParser. Anyway, changing line 166 (in my copy of plugin.py) from
parser.feed(text)
to
parser.feed(text.encode('utf-8'))

fixed the problem for this specific page, while other pages (as far as I tested) keep working as usual.
I can't tell what the underlying problem is or how relevant it is, so I'm reporting it here in case this example is useful.
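
For readers on Python 3, the decode-then-feed order used in the plugin block above can be sketched roughly as follows. This is only an illustration under assumptions: `TitleCollector` and `get_title` are hypothetical names standing in for the plugin's `Title` parser and `getTitle` method, and the UTF-8 fallback with 'replace' mirrors the quoted snippet rather than Limnoria's actual code.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical stand-in for the plugin's Title parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.data.append(data)


def get_title(raw, encoding=None):
    """Decode bytes to text first, then feed only text to the parser."""
    if isinstance(raw, bytes):
        # Assumption: fall back to UTF-8 when no encoding was detected.
        text = raw.decode(encoding or "utf-8", "replace")
    else:
        text = raw
    parser = TitleCollector()
    parser.feed(text)
    parser.close()
    # Collapse runs of whitespace, similar in spirit to normalizeWhitespace.
    title = " ".join("".join(parser.data).split())
    return title or None
```

The idea is that the parser only ever sees text, never a mix of bytes and text, so entity references in attribute values are unescaped without the implicit ASCII decoding that the traceback above points at.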

The current (running) version of this Limnoria is installed on 2018-12-22T04-00-46, running on Python 2.7.15 (default, Dec 20 2018, 01:13:53) [GCC 4.2.1 Compatible FreeBSD Clang 6.0.0 (tags/RELEASE_600/final 326565)]. The newest versions available online are 2019.01.27 (in testing), 2018.12.19 (in master).
Salute!

@progval
Owner

progval commented Feb 1, 2019

Hi,

This is caused by a bug in Python 2's re module that Limnoria already triggers elsewhere.
I'll commit a fix based on your solution later this week.

You should also upgrade to Python 3 if possible (sudo pip uninstall limnoria; cd /usr/ports/irc/py-limnoria && sudo PYTHON_VERSION=3.6 make install clean)

Thanks for the report

@Rodrigo-NH
Contributor Author

Rodrigo-NH commented Feb 3, 2019

Hi there.
Unfortunately the fix didn't stop the error from happening for other links, at least for a link to another site that someone pasted into the channel.
(Also) I'm not sure where this comes from (my setup or Limnoria), but after the link that triggered the uncaught error was posted in the channel, Limnoria kept repeating exactly the same error once an hour, referring to the same offending link, even though it was not pasted in the channels again and even after unloading the Web plugin.
The new link that resulted in the error is https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks
I tried some more coherent alternatives to get rid of the error, but only got it working after importing lxml and using it to retrieve the page's title. It's doing the magic (so far) somehow.

The page source declares 'meta charset="UTF-8"', yet the ".encode('utf-8')" workaround still failed. I have searched but couldn't find python 2 functions to retrieve url encoding from source (like requests.get.encoding in python 3), to try it more.
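
As a rough illustration of reading the declared encoding from the page source itself, a regular expression over the first bytes of the document is often enough. This is a minimal sketch, not how `supybot.utils.web.getEncoding` is implemented; `sniff_charset` and the 4096-byte window are assumptions made for the example.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Assumption: the declaration, if present, sits near the top of the page.
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

The pattern is a bytes pattern on purpose, so the raw page can be inspected before any decoding takes place.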

I changed it from this block:

    try:
        parser = Title()
        parser.feed(text)
    except UnicodeDecodeError:
        # Workaround for Python 2
        # https://github.com/ProgVal/Limnoria/issues/1359
        parser = Title()
        parser.feed(text.encode('utf8'))
    parser.close()
    title = utils.str.normalizeWhitespace(''.join(parser.data).strip())

To:

    # Workaround for Python 2
    # https://github.com/ProgVal/Limnoria/issues/1359
    if sys.version_info.major == 2:
        page = urlopen(url)
        pageparse = lxml.html.parse(page)
        title = utils.str.normalizeWhitespace(pageparse.find(".//title").text)
    else:
        parser = Title()
        parser.feed(text)
        parser.close()
        title = utils.str.normalizeWhitespace(''.join(parser.data).strip())

As I said, I had to add a new import to Limnoria: lxml.html :(
This works for http://lastsummer.de/creating-custom-packages-on-freebsd and https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks and other links I tested at random.
I will leave this running and watch for new errors (I hope they never show up again :))
Reference: https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte

Thanks!

@progval
Owner

progval commented Feb 3, 2019

Could you post the new error message?

> I have searched but couldn't find python 2 functions to retrieve url encoding from source

There's supybot.utils.web.getEncoding.

@Rodrigo-NH
Contributor Author

Rodrigo-NH commented Feb 3, 2019

Hi ProgVal.
Thanks for the tip, I will try using supybot.utils.web.getEncoding.
The new error is similar to the old one:

ERROR 2019-02-03T04:11:18 supybot Printed to stderr after daemonization: Exception in thread Thread #299 (for snarfing https://savagedlight.me/2014/03/07/freebsd-jail-host-with-multiple-local-networks/):
 Traceback (most recent call last):
   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
     self.run()
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 166, in run
     super(UrlSnarfThread, self).run()
   File "/usr/local/lib/python2.7/threading.py", line 754, in run
     self.__target(*self.__args, **self.__kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/commands.py", line 217, in doSnarf
     f(self, irc, msg, match, *L, **kwargs)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 199, in titleSnarfer
     r = self.getTitle(irc, url, False)
   File "/usr/local/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 171, in getTitle
     parser.feed(text.encode('utf8'))
   File "/usr/local/lib/python2.7/HTMLParser.py", line 117, in feed
     self.goahead(0)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 161, in goahead
     k = self.parse_starttag(i)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
     attrvalue = self.unescape(attrvalue)
   File "/usr/local/lib/python2.7/HTMLParser.py", line 476, in unescape
     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
   File "/usr/local/lib/python2.7/re.py", line 155, in sub
     return _compile(pattern, flags).sub(repl, string, count)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
 INFO 2019-02-03T04:11:18 supybot Flushers flushed and garbage collected.

With the new code block version I'm testing, I got "AttributeError: 'NoneType' object has no attribute 'text'" at the line 'title = utils.str.normalizeWhitespace(pageparse.find(".//title").text)', but this one is easier to understand and fix.
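
That AttributeError is what you get when `find` returns None because the page has no title element. A minimal guard might look like the sketch below. It uses the standard library's ElementTree on well-formed snippets so it runs without extra dependencies; the assumption is that `lxml.html` elements behave the same way here, since they follow the ElementTree API. `title_or_none` is a hypothetical helper, not part of Limnoria.

```python
import xml.etree.ElementTree as ET


def title_or_none(root):
    """Return the whitespace-normalized <title> text, or None if it is absent."""
    node = root.find(".//title")
    # find() returns None when nothing matches; .text is None for an empty element.
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split()) or None


with_title = ET.fromstring("<html><head><title> Hi   there </title></head></html>")
without_title = ET.fromstring("<html><head></head></html>")
```

Calling `title_or_none(without_title)` returns None instead of raising, so the snarfer can simply skip pages that have no title.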

Thanks!

@progval
Owner

progval commented Feb 3, 2019

> I will try using supybot.utils.web.getEncoding.

The Web plugin already does... And everything we're dealing with here is either ASCII or UTF-8 so the problem's not there.

I'm sorry, but I don't know what to do other than upgrading to Python 3

@Rodrigo-NH
Contributor Author

Nice! Thanks for the explanation.

uqs pushed a commit to freebsd/freebsd-ports that referenced this issue Oct 1, 2019
Backport fixes for the Web plugin [1][2][3].

[1] progval/Limnoria#1371
[2] progval/Limnoria#1362
[3] progval/Limnoria#1359

Submitted by:	DanDare (GitHub: Rodrigo-NH, via IRC)


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@513446 35697150-7ecd-e111-bb59-0022644237b5
Jehops pushed a commit to Jehops/freebsd-ports-legacy that referenced this issue Oct 2, 2019
Backport fixes for the Web plugin [1][2][3].

[1] progval/Limnoria#1371
[2] progval/Limnoria#1362
[3] progval/Limnoria#1359

Submitted by:	DanDare (GitHub: Rodrigo-NH, via IRC)


git-svn-id: svn+ssh://svn.freebsd.org/ports/head@513446 35697150-7ecd-e111-bb59-0022644237b5
svmhdvn pushed a commit to svmhdvn/freebsd-ports that referenced this issue Jan 10, 2024
Backport fixes for the Web plugin [1][2][3].

[1] progval/Limnoria#1371
[2] progval/Limnoria#1362
[3] progval/Limnoria#1359

Submitted by:	DanDare (GitHub: Rodrigo-NH, via IRC)