
Cache redirected robots.txt for target host only if path is /robots.txt and query is empty #1057

Merged

Conversation

sebastian-nagel
Contributor

Get or put a redirected robots.txt from/to the cache of the target host only if the URL path including the query part is /robots.txt. The current code does not take the URL query part into consideration. In some (likely rare) cases, this may cause the wrong robots.txt to be cached for the target host, e.g. the response to /robots.txt?from=source.example.com.
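To illustrate the condition described above, here is a minimal sketch, not the code changed in this PR. The class `RobotsRedirectCheck` and the method `isPlainRobotsTxt` are hypothetical names, and the sketch assumes `java.net.URI` is used for parsing, which may differ from the actual implementation.

```java
import java.net.URI;

public class RobotsRedirectCheck {

    /**
     * Hypothetical helper: returns true only if the path is exactly
     * /robots.txt and the URL carries no query part.
     */
    public static boolean isPlainRobotsTxt(URI uri) {
        String query = uri.getQuery();
        // URI.getQuery() returns null when there is no '?' at all;
        // an empty string is treated as "no query" as well.
        boolean noQuery = (query == null || query.isEmpty());
        return "/robots.txt".equals(uri.getPath()) && noQuery;
    }

    public static void main(String[] args) {
        // Redirect target that is safe to cache for the target host.
        System.out.println(isPlainRobotsTxt(
                URI.create("https://target.example.com/robots.txt")));
        // Redirect target with a query part: should not be cached as the
        // target host's robots.txt.
        System.out.println(isPlainRobotsTxt(
                URI.create("https://target.example.com/robots.txt?from=source.example.com")));
    }
}
```

Under these assumptions, the first call prints `true` and the second prints `false`.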

Note: the host port (and protocol) are part of the cache key. If the redirect leads to a different port on the target host, the robots.txt received from a request on this port is cached separately.
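The note about the cache key can be pictured with the following sketch. The key layout (`protocol:host:port`), the class `RobotsCacheKey`, and the method `cacheKey` are assumptions for illustration only; the project's real key format may differ.

```java
import java.net.URI;
import java.util.Locale;

public class RobotsCacheKey {

    /**
     * Hypothetical cache key made of protocol, host and port.
     * The layout "protocol:host:port" is an assumption for this sketch.
     */
    public static String cacheKey(URI uri) {
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port: fall back to the scheme's default port.
            port = "https".equals(scheme) ? 443 : 80;
        }
        return scheme + ":" + host + ":" + port;
    }

    public static void main(String[] args) {
        // Same host, different ports: two distinct cache entries.
        System.out.println(cacheKey(URI.create("http://target.example.com/robots.txt")));
        System.out.println(cacheKey(URI.create("http://target.example.com:8080/robots.txt")));
    }
}
```

With a key of this shape, a redirect to another port on the same host yields a different key, so the robots.txt fetched on that port is stored separately.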

URL path including the query part is `/robots.txt`

Signed-off-by: Sebastian Nagel <sebastian@commoncrawl.org>
@jnioche jnioche added this to the 2.9 milestone Apr 19, 2023
@jnioche jnioche merged commit 6dc24bd into apache:master Apr 19, 2023
3 checks passed
@jnioche
Contributor

jnioche commented Apr 19, 2023

thanks @sebastian-nagel
