Siege fetching a non-existent URL not in source #208

barryhunter · 2022-04-08T16:39:22Z

I've got a strange issue with Siege fetching a URL that is not in the source of the page.

Can be reproduced with a single '--print' request...
$ siege -p https://www.geograph.org.uk/photo/9 | grep Lane
This shows a fetch to /photo/Lane, (trailing comma included), which doesn't exist...

GET /photo/Lane, HTTP/1.0

Transactions:                      2 hits
Availability:                 100.00 %
Elapsed time:                   0.05 secs

A normal run (without -p) shows it as a 404:
HTTP/1.1 404 0.08 secs: 2322 bytes ==> GET /photo/Lane,
The word 'Lane' does appear in the page in lots of places, but nowhere in a URL (and not in any css/js etc. reference, which is what the parser should be extracting). With --no-parser the bogus request isn't made, which shows it is coming from the parsing somewhere. I just can't figure out where.

The only place where the word Lane is followed by a comma is in the meta description:

$ siege -p --no-parser https://www.geograph.org.uk/photo/9 2>&1 | grep Lane,
        <meta name="description" content="SO8601 :: Burleigh Lane, near to Minchinhampton, Gloucestershire, Great Britain by Helena Downton" />

Not sure why Lane, would be singled out in that text as being worthy of fetching.


JoeDog · 2022-04-08T16:49:07Z

That's strange. Will that page be available for a while? I'll try to debug this when I get a chance (hopefully this weekend)


barryhunter · 2022-04-08T16:52:58Z

Actually, I think I've figured it out. I went digging in the source...

It tries to extract URLs from 'meta refresh' links

siege/src/parser.c, line 156 in f69b445:

  /* <meta http-equiv="refresh" content="0; url=http://example.com/" /> */

It seems it just looks for the token 'url' inside the 'content' attribute, assuming the tag is a 'refresh' tag:

        if (__strcasestr(ptr, "url") != NULL) {

And my description has the token "url" in there: B-url-eigh! So it then seems to just use the next word as a relative link.
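To illustrate the false positive, here is a minimal sketch of an unanchored, case-insensitive search for "url". The helper `ci_strstr` is a hypothetical stand-in for siege's `__strcasestr` (whose implementation is not shown in this thread), and `naive_has_refresh_url` is an invented name for the check quoted above.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for siege's __strcasestr:
 * case-insensitive substring search, returns NULL when not found. */
static const char *ci_strstr(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0) return hay;
    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == n) return hay;
    }
    return NULL;
}

/* The unanchored check: any "url" substring in the content attribute counts. */
static int naive_has_refresh_url(const char *content)
{
    return ci_strstr(content, "url") != NULL;
}
```

Called on the description content "SO8601 :: Burleigh Lane, near to ..." this returns 1, because "Burleigh" contains "url", even though the tag is not a refresh tag at all.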

I'm not sure I'm up to recompiling the code, but it seems like it would be better changed to something like

   if (__strcasestr(ptr, "; url=") != NULL || _strcasestr(ptr, ";url=") != NULL) {

Not sure if that will work in C or not. (my C is very rusty!)
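For what it's worth, the idea can be sketched as a standalone C function. This is only an illustration under assumptions: `ci_strstr` is again a hypothetical stand-in for siege's `__strcasestr`, and `strict_has_refresh_url` is an invented name. It only covers the two spellings suggested above, so content with other whitespace (for example "0;  url = ...") would not match.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for siege's __strcasestr:
 * case-insensitive substring search, returns NULL when not found. */
static const char *ci_strstr(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0) return hay;
    for (; *hay != '\0'; hay++) {
        size_t i = 0;
        while (i < n &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == n) return hay;
    }
    return NULL;
}

/* Stricter check: require the "; url=" or ";url=" form used by meta refresh. */
static int strict_has_refresh_url(const char *content)
{
    return ci_strstr(content, "; url=") != NULL ||
           ci_strstr(content, ";url=") != NULL;
}
```

With this check, "0; url=http://example.com/" still matches, while the description text containing "Burleigh" does not.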

Another example with url in the description to confirm...

$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
 <meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0
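The request paths in both examples are consistent with ordinary relative-link resolution: the last segment of the page's path is replaced by the extracted word. A minimal sketch of that merge step follows; `merge_path` is a hypothetical helper written for illustration, not siege's actual code.

```c
#include <string.h>

/* Merge a relative reference onto a base path: keep the base up to and
 * including its last '/', then append the reference.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int merge_path(const char *base, const char *ref, char *out, size_t outlen)
{
    const char *slash = strrchr(base, '/');
    size_t dirlen = (slash != NULL) ? (size_t)(slash - base) + 1 : 0;
    size_t reflen = strlen(ref);

    if (dirlen + reflen + 1 > outlen) return -1;
    memcpy(out, base, dirlen);
    memcpy(out + dirlen, ref, reflen + 1); /* copies the terminating NUL too */
    return 0;
}
```

Merging "Lane," onto "/photo/9" gives "/photo/Lane,", and "Church," onto "/photo/27592" gives "/photo/Church,", matching the requests seen in the output.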

barryhunter · 2022-04-08T17:23:10Z

Oh, didn't see your reply. Thanks!

Yes, that page should remain online long term :) Feel free to make requests to the domain for testing, although not large numbers of concurrent requests ;p

JoeDog · 2022-10-11T09:11:55Z

You want this:

   if (__strcasestr(ptr, "; url=") != NULL || __strcasestr(ptr, ";url=") != NULL) {

(you missed an underscore in the second function call). I'll test it out, but if you put that line in you'll be on the code base if this tests out.


JoeDog closed this as completed Jul 31, 2022
