Feed retrieved by urllib2 is sometimes truncated #1

Open
wettenhj opened this Issue May 17, 2013 · 0 comments

wettenhj commented May 17, 2013

I have experienced truncation of feeds retrieved by urllib2 as described here:
http://stackoverflow.com/questions/13222376/urllib2-https-truncated-response
and here:
http://bugs.python.org/issue17569

The behaviour from feedparser's point of view is this:

Python 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012, 20:32:06)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import feedparser
>>> doc = feedparser.parse("https://115.146.87.34:8080/")
>>> if doc.bozo:
...     raise doc.bozo_exception
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
xml.sax._exceptions.SAXParseException: <unknown>:137:14: unclosed token

The feed content used to trigger the error above is dynamically generated by a Node.js application. If I instead serve the same feed content (saved as a static document) from an Apache web server, the problem goes away, so it may be a timing issue, i.e. Node.js pausing part-way through serving up the Atom feed. One timing issue which could affect urllib2 is case 3 in this question:
http://stackoverflow.com/questions/7174927/when-does-socket-recvrecv-size-return
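The "case 3" behaviour can be illustrated with a small stdlib-only sketch (the payload and the sleep are made up for illustration): a single recv() returns whatever bytes have arrived so far, not necessarily everything the peer will eventually send, so a sender that pauses mid-stream produces a short read unless the receiver loops until EOF.

```python
import socket
import threading
import time

def slow_sender(conn):
    # Simulate a server (like the Node.js app) that pauses part-way
    # through sending the response body.
    conn.sendall(b"<feed>")
    time.sleep(0.3)
    conn.sendall(b"</feed>")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_conn, _ = listener.accept()
threading.Thread(target=slow_sender, args=(server_conn,)).start()

# A single recv() likely returns only the first burst -- a short read.
first = client.recv(4096)

# Robust pattern: keep calling recv() until it returns b"" (EOF).
data = first
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    data += chunk
client.close()
listener.close()

print(data)  # the complete payload, b"<feed></feed>"
```

A client that treats one short read as the whole response sees exactly the kind of truncation reported above.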

The truncation could be avoided by replacing feedparser.py's use of the "urllib2" module with the "requests" module, as described here:
http://stackoverflow.com/questions/13222376/urllib2-https-truncated-response
