Siege fetching a non-existent URL not in source #208
That's strange. Will that page be available for a while? I'll try to debug
this when I get a chance (hopefully this weekend)
On Fri, Apr 8, 2022 at 12:39 PM barryhunter wrote:
I've got a strange issue with Siege fetching a URL that is not in the source of the page.
It can be reproduced with a single '--print' request:
$ siege -p https://www.geograph.org.uk/photo/9 | grep Lane
This shows a fetch of /photo/Lane, (trailing comma included), which doesn't exist:
GET /photo/Lane, HTTP/1.0
Transactions: 2 hits
Availability: 100.00 %
Elapsed time: 0.05 secs
A normal run (without -p) shows it as a 404:
HTTP/1.1 404 0.08 secs: 2322 bytes ==> GET /photo/Lane,
The word 'Lane' does appear in lots of places on the page, but nowhere in a URL (and in no css/js etc. reference, which is what the parser should be extracting). With --no-parser the bogus request isn't made, so it must come from the parser somewhere. I just can't figure out where.
The only place where 'Lane' is followed by a comma is the meta description:
$ siege -p --no-parser https://www.geograph.org.uk/photo/9 2>&1 | grep Lane,
I'm not sure why 'Lane,' would be singled out in that text as being worthy of fetching.
--
Jeff Fulmer
1-717-799-8226
https://www.joedog.org/
He codes
|
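Actually, I think I've figured it out. I went digging in the source. It tries to extract URLs from 'meta refresh' tags (src/parser.c, line 156 in f69b445).
It seems to just look for the token 'url' inside the 'content' attribute, assuming it is a 'refresh' tag:
if (__strcasestr(ptr, "url") != NULL) {
And my description has the token "url" in there: Burleigh! So it then seems to just use the next word as a relative link. I'm not sure I'm up to recompiling the code, but it seems it would be better changed to something like:
if (__strcasestr(ptr, "; url=") != NULL || _strcasestr(ptr, ";url=") != NULL) {
I'm not sure whether that will work in C or not (my C is very rusty!). Another example with 'url' in the description, to confirm:
$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
<meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0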
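On whether a tightened condition of that kind works in C: here is a minimal sketch that compares the loose and the strict test side by side. It assumes a case-insensitive substring search that behaves like siege's internal `__strcasestr`; the helper `ci_strstr` and the names `loose_check` and `strict_check` are illustrative stand-ins, not siege functions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for siege's internal __strcasestr():
 * case-insensitive substring search; returns the first match or NULL. */
static const char *ci_strstr(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);

    if (nlen == 0) return haystack;
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        /* Compare character by character, ignoring case. */
        while (i < nlen && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i])) {
            i++;
        }
        if (i == nlen) return haystack;
    }
    return NULL;
}

/* Loose test: any "url" substring counts, so "Durley" or "Burleigh" match. */
static int loose_check(const char *content)
{
    return ci_strstr(content, "url") != NULL;
}

/* Stricter test along the lines proposed in this thread:
 * require the "; url=" or ";url=" form. */
static int strict_check(const char *content)
{
    return ci_strstr(content, "; url=") != NULL ||
           ci_strstr(content, ";url=") != NULL;
}
```

For the description reported in this thread, `loose_check("SU5016 :: Durley Church, near to...")` is true while `strict_check` on the same string is false. A refresh-style value such as `5; URL=https://example.com/next` (a made-up URL) passes both.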
|
Oh, didn't see your reply. Thanks! Yes, that page should remain online long term :) Feel free to make requests to the domain for testing, although not large numbers of concurrent requests ;p
|
You want this:
if (__strcasestr(ptr, "; url=") != NULL || __strcasestr(ptr, ";url=") != NULL) {
(you missed an underscore in the second function call)
I'll test it out, but if you put that line in, you'll be on the code base if this tests out.
On Fri, Apr 8, 2022 at 12:53 PM barryhunter wrote:
Actually, I think I've figured it out. I went digging in the source.
It tries to extract URLs from 'meta refresh' tags:
https://github.com/JoeDog/siege/blob/f69b44511d61db6fd3a0cd6f8684e2eef2406516/src/parser.c#L156
/* */
It seems to just look for the token 'url' inside the 'content' attribute, assuming it is a 'refresh' tag:
if (__strcasestr(ptr, "url") != NULL) {
And my description has the token "url" in there! B*url*eigh
I'm not sure I'm up to recompiling the code, but it seems it would be better changed to something like:
if (__strcasestr(ptr, "; url=") != NULL || _strcasestr(ptr, ";url=") != NULL) {
I'm not sure whether that will work in C or not (my C is very rusty!).
Another example with 'url' in the description, to confirm:
$ siege -p https://www.geograph.org.uk/photo/27592 2>&1 | grep '(GET|description)' -P
<meta name="description" content="SU5016 :: Durley Church, near to..." />
GET /photo/Church, HTTP/1.0
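Going one step beyond a tighter substring test, the check could be anchored to the overall shape of a refresh value (`<seconds>; url=<target>`), so that free text in a description can never match. This is a rough sketch under that assumption: `skip_ws` and `refresh_target` are hypothetical helpers, not siege functions, and it does not reflect how siege's parser is actually structured.

```c
#include <ctype.h>
#include <stddef.h>

/* Advance past any leading whitespace. */
static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/*
 * Hypothetical helper, not part of siege: if `content` looks like
 * "<seconds>; url=<target>", return a pointer to the start of <target>;
 * otherwise return NULL.
 */
static const char *refresh_target(const char *content)
{
    const char *p = skip_ws(content);

    /* The value must begin with a numeric delay. */
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p)) p++;

    /* A semicolon must separate the delay from the url part. */
    p = skip_ws(p);
    if (*p != ';') return NULL;
    p = skip_ws(p + 1);

    /* Case-insensitive "url"; short-circuiting keeps reads in bounds. */
    if (tolower((unsigned char)p[0]) != 'u' ||
        tolower((unsigned char)p[1]) != 'r' ||
        tolower((unsigned char)p[2]) != 'l') {
        return NULL;
    }

    p = skip_ws(p + 3);
    if (*p != '=') return NULL;
    p = skip_ws(p + 1);

    /* Allow one optional opening quote around the target. */
    if (*p == '\'' || *p == '"') p++;

    return (*p != '\0') ? p : NULL;
}
```

With this, `refresh_target("SU5016 :: Durley Church, near to...")` returns NULL because the value does not start with a delay, while `refresh_target("5; url=https://example.com/next")` (a made-up URL) returns a pointer to `https://example.com/next`.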
|