wpull -r fails to change host when grabbing a redirecting robots.txt #58

Closed
ivan opened this issue Mar 5, 2014 · 2 comments

ivan commented Mar 5, 2014

# python3 -m wpull "http://www.techcrunch.com/" -r
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 537.2 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
WARNING Ignoring robots.txt redirect loop.
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 0.0 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO FINISHED.
INFO Time length: 0:00:03.
INFO Downloaded: 0 files, 0.0 B.
INFO Exiting with status 0.

This works fine, though:

# python3 -m wpull "http://www.techcrunch.com/robots.txt"
INFO Fetching ‘http://www.techcrunch.com/robots.txt’.
Requesting http://www.techcrunch.com/robots.txt... 301 Moved Permanently
Length: 178 [text/html]
100.0% [=========================] 178.0 B 0:00:00 637.1 B/s
Bytes received: 178
INFO Fetched ‘http://www.techcrunch.com/robots.txt’: 301 Moved Permanently. Length: 178 [text/html].
INFO Fetching ‘http://techcrunch.com/robots.txt’.
Requesting http://techcrunch.com/robots.txt... 200 OK
Length: None [text/plain; charset=utf-8]
[O                        ] 5.0 B 0:00:00 15.5 B/s
Bytes received: 867
INFO Fetched ‘http://techcrunch.com/robots.txt’: 200 OK. Length: None [text/plain; charset=utf-8].
INFO FINISHED.
INFO Time length: 0:00:00.
INFO Downloaded: 1 file, 855.0 B.
INFO Exiting with status 0.

ivan commented Mar 5, 2014

This happens even with --span-hosts.

python3 -m wpull -r http://crunchbase.com/ has the same problem, except that there the redirect points to the www. host instead of away from it.
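
To make the host change concrete, here is a minimal sketch of resolving a redirect and checking whether it leaves the original host. It uses only the standard library and is not wpull code: `resolve_redirect` is a hypothetical helper, and the `Location` values are assumptions inferred from the logs above (the crunchbase.com target in particular is a guess at what the server sends).

```python
from urllib.parse import urljoin, urlsplit


def resolve_redirect(request_url, location):
    """Return (target_url, host_changed) for a redirect response.

    Hypothetical helper for illustration; wpull's own logic may differ.
    """
    # A Location header may be relative, so resolve it against the request URL.
    target_url = urljoin(request_url, location)
    # Compare hostnames only; urlsplit lowercases them for us.
    host_changed = urlsplit(request_url).hostname != urlsplit(target_url).hostname
    return target_url, host_changed


# Assumed Location values, based on the behaviour described in this issue.
techcrunch = resolve_redirect(
    "http://www.techcrunch.com/robots.txt", "http://techcrunch.com/robots.txt"
)
crunchbase = resolve_redirect(
    "http://crunchbase.com/robots.txt", "http://www.crunchbase.com/robots.txt"
)
same_host = resolve_redirect("http://crunchbase.com/robots.txt", "/other.txt")
```

In both reported cases the redirect target is on a different hostname than the one requested, so the robots.txt fetch has to follow the redirect to the new host rather than retry the original URL.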

chfoo added the bug label Mar 5, 2014
chfoo self-assigned this Mar 5, 2014
chfoo added a commit that referenced this issue Mar 6, 2014
Re: #58

When next_request was called multiple times, the redirect URL was set to
None. This caused it to fetch the same redirecting robots.txt repeatedly.
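
The failure mode described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from the commit: the class names, the `load` method, and the fallback to the original URL are all assumptions; only the name `next_request` comes from the message above.

```python
class BuggyRedirectTracker:
    """Hypothetical redirect tracker; not wpull's actual class."""

    REDIRECT_CODES = (301, 302, 303, 307, 308)

    def __init__(self, original_url):
        self.original_url = original_url
        self._redirect_url = None

    def load(self, status_code, location):
        # Remember where the server told us to go next.
        if status_code in self.REDIRECT_CODES:
            self._redirect_url = location

    def next_request(self):
        # Bug: reading the redirect URL also resets it to None, so a
        # second call falls back to the original, redirecting URL.
        url = self._redirect_url or self.original_url
        self._redirect_url = None
        return url


class FixedRedirectTracker(BuggyRedirectTracker):
    def next_request(self):
        # Reading has no side effect; repeated calls agree.
        return self._redirect_url or self.original_url


buggy = BuggyRedirectTracker("http://www.techcrunch.com/robots.txt")
buggy.load(301, "http://techcrunch.com/robots.txt")
buggy_calls = [buggy.next_request(), buggy.next_request()]

fixed = FixedRedirectTracker("http://www.techcrunch.com/robots.txt")
fixed.load(301, "http://techcrunch.com/robots.txt")
fixed_calls = [fixed.next_request(), fixed.next_request()]
```

With the buggy variant, the second call returns the original www. URL again, which matches the repeated fetches of the same robots.txt in the log. Making the accessor free of side effects is one way to avoid that; the actual fix may have taken a different approach.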

chfoo commented Mar 6, 2014

Commit 0595831 fixes the issue where it was broken even on --span-hosts. Issue #29 will address the redirects in general.

chfoo closed this as completed in 8b540a3 Mar 9, 2014