Feed retrieved by urllib2 is sometimes truncated #1

Open
wettenhj opened this Issue May 17, 2013 · 0 comments


I have experienced truncation of feeds retrieved by urllib2 as described here:
http://stackoverflow.com/questions/13222376/urllib2-https-truncated-response
and here:
http://bugs.python.org/issue17569

The behaviour from feedparser's point of view is this:

Python 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012, 20:32:06)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import feedparser
>>> doc = feedparser.parse("https://115.146.87.34:8080/")
>>> if doc.bozo:
...     raise doc.bozo_exception
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
xml.sax._exceptions.SAXParseException: <unknown>:137:14: unclosed token

The feed content that triggers the error above is generated dynamically by a Node.js application. If I save the same feed content to a static document and serve it from an Apache web server instead, the problem does not occur, so it may be a timing issue, e.g. Node.js pausing part-way through serving the Atom feed. One timing issue that could affect urllib2 is case 3 in this question:
http://stackoverflow.com/questions/7174927/when-does-socket-recvrecv-size-return
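
To illustrate the kind of timing behaviour I mean, here is a minimal sketch using only the standard library. It is not feedparser or urllib2 code and I have not confirmed it is the actual cause; the function names and the two-part payload are made up for the example, and it assumes Python 3. A writer thread pauses part-way through sending, much as the Node.js server might, and the reader compares one recv() call with reading until end-of-stream.

```python
import socket
import threading
import time


def send_in_two_parts(sock, first, second, pause=0.2):
    """Hypothetical stand-in for a server that pauses mid-response."""
    sock.sendall(first)
    time.sleep(pause)
    sock.sendall(second)
    sock.close()  # closing tells the reader the stream has ended


def read_until_eof(sock, bufsize=4096):
    """Keep calling recv() until it returns b"" (peer closed)."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def demo():
    writer, reader = socket.socketpair()
    first, second = b"<feed><entry>", b"</entry></feed>"
    thread = threading.Thread(
        target=send_in_two_parts, args=(writer, first, second)
    )
    thread.start()
    # A single recv() returns as soon as some data is available,
    # which may be less than the whole response.
    single = reader.recv(4096)
    rest = read_until_eof(reader)
    thread.join()
    reader.close()
    return single, single + rest
```

Calling demo() returns the bytes from the first recv() and the complete body. A reader that stopped after the first call would typically hold only the opening part of the document, which is the sort of input that makes the XML parser report an unclosed token.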

The truncation could be avoided by replacing feedparser.py's use of the "urllib2" module with the "requests" module, as described here:
http://stackoverflow.com/questions/13222376/urllib2-https-truncated-response
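
As a workaround that does not require changing feedparser.py, the same idea can be applied from the calling side. This is only a sketch under assumptions: the function names are mine, the timeout value is arbitrary, it assumes Python 3 with the third-party "requests" package installed, and I have not verified that it removes the truncation against the server above. It downloads the whole body with requests and passes the bytes to feedparser.parse, which accepts a string as well as a URL.

```python
def fetch_with_requests(url, timeout=30, verify=True):
    """Download the full response body with requests instead of urllib2."""
    import requests  # third-party; imported here so the sketch loads without it

    # verify=True checks the TLS certificate; a server with a self-signed
    # certificate would need verify set to the path of a CA bundle.
    response = requests.get(url, timeout=timeout, verify=verify)
    response.raise_for_status()
    return response.content  # bytes of the complete body


def parse_fetched_feed(url, fetch=fetch_with_requests, parse=None):
    """Fetch the feed first, then hand the bytes to the parser."""
    if parse is None:
        import feedparser  # third-party

        parse = feedparser.parse
    doc = parse(fetch(url))
    if doc.bozo:
        # Same check as in the session above: surface the parser's error.
        raise doc.bozo_exception
    return doc
```

With this, parse_fetched_feed("https://115.146.87.34:8080/") would take the place of the direct feedparser.parse(url) call. The fetch and parse steps are parameters only so that either one can be swapped out when comparing behaviour.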
