This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Inappropriately converting / -> /index.html while mirroring sites with Slurping #248

Closed
GoogleCodeExporter opened this issue Apr 6, 2015 · 7 comments

Comments

@GoogleCodeExporter

From jmaessen:

I've been collecting a fresh slurp, since we're now doing more stuff than we 
were the last time I did so.  But in looking at the logs, I've realized we're 
running into an odd problem:

When we ask Apache for a URI ending in /, such as http://www.ibm.com/, Apache 
sees the URL and says "hey, a directory, I'd better append index.html".  So we 
end up asking the web for http://www.ibm.com/index.html, which is great except 
that this page says "302, try http://www.ibm.com/ instead".  So we end up not 
fetching quite a lot of content, because Apache corrupts the URI as it proxies 
it through the slurper.  I presume (but don't know for sure) that mod_proxy 
doesn't have the same flaw.  I'm not 100% sure of the mechanics inside Apache 
that cause this to happen; does anyone know more?

Original issue reported on code.google.com by sligocki@google.com on 21 Mar 2011 at 8:18

@GoogleCodeExporter

Confirmed for trunk build:

$ curl -x localhost:8080 -v http://www.ibm.com/
* About to connect() to proxy localhost port 8080 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 8080 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:31:36 GMT
< Server: Apache/2.2.16 (Unix) DAV/2
< Location: http://www.ibm.com/
< Content-Length: 203
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

But it is not broken in the latest release:

$ curl -x localhost:80 -v http://www.ibm.com/
* About to connect() to proxy localhost port 80 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 80 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:20:07 GMT
< Server: IBM_HTTP_Server
< Location: http://www.ibm.com/us/en/
< Cache-Control: no-cache, must-revalidate
< Pragma: no-cache
< Expires: Mon, 01 Jan 1990 00:00:20 GMT
< Vary: Accept-Encoding
< Content-Length: 209
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/us/en/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

Original comment by sligocki@google.com on 21 Mar 2011 at 8:19

@GoogleCodeExporter

This also affects playback of older slurps -- we simply 404 on them.

Original comment by morlov...@google.com on 21 Mar 2011 at 8:23

@GoogleCodeExporter

I believe I know why this is.  I'm being troubled by this situation now.  It's 
due to this call sequence:
   apache_slurp.cc: SlurpUrl()
   InstawebContext::MakeRequestUrl()
   ap_construct_url()
In my debugger, this occurs with a request that has these fields:
     the_request = 0x7c10d0 "GET http://www.vip-chicks.de/ HTTP/1.1", 
     hostname = 0x7601a0 "www.vip-chicks.de", 
     unparsed_uri = 0x7b9550 "/index.html", 
     uri = 0x7b9570 "/index.html", 
     parsed_uri.path = "/index.html"
     main != NULL
The 'main' field points to a request where:
     unparsed_uri = 0x75f4f0 "http://www.vip-chicks.de/", 
     uri = 0x74e980 "/", 
     parsed_uri.path = 0x74e980 "/", 
So I think a good fix for that is, in MakeRequestUrl, to follow the 'main' 
pointer until it's null before looking at the 'uri' fields.
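
A minimal sketch of that idea, for illustration only and not the actual patch. It assumes a simplified, hypothetical FakeRequest struct that mirrors just the request_rec fields quoted above (hostname, unparsed_uri, uri, main); TopLevelRequest and OriginalUrl are made-up helper names, and the "http://" scheme is hard-coded as an assumption:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the few request_rec fields shown in the debugger
// output above. The real Apache struct has many more members.
struct FakeRequest {
  std::string hostname;
  std::string unparsed_uri;
  std::string uri;
  const FakeRequest* main;  // Parent request, or nullptr at the top level.
};

// Follows the 'main' chain until it is null and returns the top-level
// request, i.e. the one that still carries the URI the client asked for.
const FakeRequest* TopLevelRequest(const FakeRequest* request) {
  assert(request != nullptr);
  while (request->main != nullptr) {
    request = request->main;
  }
  return request;
}

// Hypothetical helper: rebuilds the absolute URL from the top-level request
// rather than from the subrequest whose path was rewritten to /index.html.
// The scheme is assumed to be plain http for this sketch.
std::string OriginalUrl(const FakeRequest* request) {
  const FakeRequest* top = TopLevelRequest(request);
  return "http://" + top->hostname + top->uri;
}
```

Given a subrequest whose uri is "/index.html" and whose main has uri "/", OriginalUrl returns the URL ending in "/" rather than "/index.html".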

Original comment by jmara...@google.com on 23 Mar 2011 at 2:23

@GoogleCodeExporter

Original comment by jmara...@google.com on 23 Mar 2011 at 3:16

@GoogleCodeExporter

fix coming....

Original comment by jmara...@google.com on 23 Mar 2011 at 3:50

  • Changed state: Started

@GoogleCodeExporter

Original comment by jmara...@google.com on 24 Mar 2011 at 12:51

  • Changed state: Fixed

@GoogleCodeExporter

Original comment by jmara...@google.com on 6 May 2011 at 4:39

  • Changed title: Inappropriately converting / -> /index.html while mirroring sites with Slurping
