Siege fetching a non-existent URL not in source #208

Closed
barryhunter opened this issue Apr 8, 2022 · 4 comments

barryhunter commented Apr 8, 2022

I've got a strange issue with Siege fetching a URL that is not in the source of the page.

It can be reproduced with a single '--print' request:
$ siege -p https://www.geograph.org.uk/photo/9 | grep Lane
This shows a fetch of /photo/Lane, (trailing comma included), which doesn't exist:

GET /photo/Lane, HTTP/1.0

Transactions:                      2 hits
Availability:                 100.00 %
Elapsed time:                   0.05 secs

In a normal run (without -p) it shows up as a 404:
HTTP/1.1 404 0.08 secs: 2322 bytes ==> GET /photo/Lane,
The word 'Lane' does appear in the page in lots of places, but nowhere in a URL (and not in any css/js etc. reference, which is what the parser should be extracting). With --no-parser the bogus request isn't made, which shows it comes from the parsing somewhere. I just can't figure out where.

The only place where the word Lane is followed by a comma is in the meta description:

$ siege -p --no-parser https://www.geograph.org.uk/photo/9 2>&1 | grep Lane,
        <meta name="description" content="SO8601 :: Burleigh Lane, near to Minchinhampton, Gloucestershire, Great Britain by Helena Downton" />

I'm not sure why 'Lane,' would be singled out in that text as being worth fetching.

Owner

JoeDog commented Apr 8, 2022 via email

Author

barryhunter commented Apr 8, 2022

Actually, I think I've figured it out. I went digging in the source...

It tries to extract URLs from 'meta refresh' tags:

  /* <meta http-equiv="refresh" content="0; url=http://example.com/" /> */

It seems it just looks for the token 'url' inside the 'content' attribute, assuming it is a 'refresh' tag:

        if (__strcasestr(ptr, "url") != NULL) {

And my description has the token "url" in it: Burleigh! So it then seems to just use the next word as a relative link.
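
To make the false positive concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not Siege's actual parser: the helper names `contains_ci` and `loose_refresh_target` and the " =" delimiter set are assumptions made for illustration. It splits the 'content' value into tokens and, once a token contains "url" (case-insensitively), treats the following token as the link.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive substring test: a portable stand-in (assumption)
 * for the __strcasestr call quoted from the Siege source. */
static int contains_ci(const char *hay, const char *needle)
{
    size_t n = strlen(needle);

    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == n) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical model of the loose check: the token that follows the
 * first token containing "url" is copied into out as the "link".
 * Returns 1 if such a token was found, 0 otherwise. */
static int loose_refresh_target(const char *content, char *out, size_t outlen)
{
    const char *delims = " =";
    const char *p = content;
    int armed = 0; /* set once a token containing "url" has been seen */

    while (*p != '\0') {
        char tok[256];
        size_t len, copy;

        p += strspn(p, delims);          /* skip leading delimiters */
        len = strcspn(p, delims);        /* length of the next token */
        if (len == 0) {
            break;
        }
        copy = len < sizeof tok ? len : sizeof tok - 1;
        memcpy(tok, p, copy);
        tok[copy] = '\0';
        p += len;

        if (armed) {
            snprintf(out, outlen, "%s", tok);
            return 1;
        }
        if (contains_ci(tok, "url")) {
            armed = 1;
        }
    }
    return 0;
}
```

Under this model, a genuine refresh value such as "0; url=http://example.com/" yields the real URL, but the description "SO8601 :: Burleigh Lane, near to Minchinhampton" also matches, because "Burleigh" contains "url", and the next token "Lane," is returned as if it were a relative link.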

I'm not sure I'm up to recompiling the code, but it seems the check would be better changed to something like

   if (__strcasestr(ptr, "; url=") != NULL || __strcasestr(ptr, ";url=") != NULL) {

I'm not sure whether that will work in C or not (my C is very rusty!).
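
As a rough sketch of that idea in standalone C (not a patch against Siege; the function name `refresh_target` is hypothetical), the check below only accepts "url=" when it directly follows a ';', allowing any amount of whitespace in between. That is a slight generalisation of the two literal patterns suggested above, and it avoids relying on the non-standard strcasestr.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stricter check: returns a pointer to the text after
 * "url=" when that key follows a ';' (optional spaces or tabs in
 * between, key matched case-insensitively), or NULL otherwise. */
static const char *refresh_target(const char *content)
{
    static const char key[] = "url=";
    const char *p;

    for (p = content; *p != '\0'; p++) {
        const char *q;
        size_t i = 0;

        if (*p != ';') {
            continue;
        }
        q = p + 1;
        while (*q == ' ' || *q == '\t') {
            q++;                         /* skip whitespace after ';' */
        }
        /* Compare against "url="; stops safely at the end of the string. */
        while (key[i] != '\0' && tolower((unsigned char)q[i]) == key[i]) {
            i++;
        }
        if (key[i] == '\0') {
            return q + i;                /* start of the refresh target */
        }
    }
    return NULL;
}
```

With this version, "0; url=http://example.com/" and "5;URL=/next" both produce a target, while the two meta descriptions from this issue produce NULL. A more complete fix would presumably also confirm that the tag carries http-equiv="refresh" before looking at 'content' at all; that part is not shown here.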

Another example with 'url' in the description (this time in 'Durley'), to confirm:

$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
 <meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0

@barryhunter
Author

Oh, didn't see your reply. Thanks!

Yes, that page should remain online long term :) Feel free to make requests to the domain for testing, although not large numbers of concurrent requests ;p

JoeDog closed this as completed Jul 31, 2022

Owner

JoeDog commented Oct 11, 2022 via email
