Implement robots.txt check before scraping #253
Merged
R-Sandor merged 3 commits into FindFirst-Development:main on Oct 14, 2024
Conversation
R-Sandor (Collaborator) approved these changes on Oct 14, 2024 and left a comment:
Wow this is awesome! Looks really good! Thanks for the test coverage too!
Issue number: resolves #236
Checklist
What is the current behavior?
On bookmark creation, the app tried to scrape no matter what, as long as the client has 'scrapable' toggled on (which is the default as of now).
What is the new behavior?
Before scraping on bookmark creation, the target site's robots.txt is fetched and checked, and scraping is skipped when the bookmark's URL is disallowed.
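A minimal sketch of that pre-check, assuming a plain `java.net.http` fetch and a wildcard-only `Disallow` parser; the class and method names here are hypothetical and not the PR's actual implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper illustrating a robots.txt pre-check; names are not from the PR.
public class RobotsCheck {

  private static final HttpClient CLIENT =
      HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(5)).build();

  /** Returns true if robots.txt does not disallow the given URL for a wildcard user agent. */
  public static boolean isScrapingAllowed(URI pageUrl) {
    try {
      URI robotsUrl = new URI(pageUrl.getScheme(), pageUrl.getHost(), "/robots.txt", null);
      HttpResponse<String> resp = CLIENT.send(
          HttpRequest.newBuilder(robotsUrl).GET().build(),
          HttpResponse.BodyHandlers.ofString());
      // A missing robots.txt (404) conventionally means everything is allowed.
      if (resp.statusCode() == 404) {
        return true;
      }
      return !isDisallowed(resp.body(), pageUrl.getPath());
    } catch (Exception e) {
      // Policy choice (assumption): deny on fetch failure; the project may prefer to allow.
      return false;
    }
  }

  /** Very small parser: only honours "User-agent: *" groups and Disallow path prefixes. */
  private static boolean isDisallowed(String robotsTxt, String path) {
    boolean inWildcardGroup = false;
    for (String line : robotsTxt.split("\\R")) {
      String trimmed = line.strip();
      if (trimmed.toLowerCase().startsWith("user-agent:")) {
        inWildcardGroup = trimmed.substring("user-agent:".length()).strip().equals("*");
      } else if (inWildcardGroup && trimmed.toLowerCase().startsWith("disallow:")) {
        String rule = trimmed.substring("disallow:".length()).strip();
        if (!rule.isEmpty() && path != null && path.startsWith(rule)) {
          return true;
        }
      }
    }
    return false;
  }
}
```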
Does this introduce a breaking change?
Other information
"Google follows at least five redirect hops as defined by RFC 1945 and then stops and treats it as a 404 for the robots.txt. This also applies to any disallowed URLs in the redirect chain, since the crawler couldn't fetch rules due to the redirects." (https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)
--> If you want to, you can open a follow-up issue to implement this too.
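If that follow-up is picked up, one possible shape is to follow redirects manually with a hop limit when fetching robots.txt and treat anything past five hops like a 404 (allow-all / no usable rules). This is only a sketch under those assumptions; the names and the error-handling policy are not part of this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

// Hypothetical sketch: fetch robots.txt following at most five redirect hops,
// mirroring the behaviour described in Google's documentation quoted above.
public class RobotsFetcher {

  private static final int MAX_REDIRECT_HOPS = 5;

  // Never follow redirects automatically so the hops can be counted manually.
  private static final HttpClient CLIENT =
      HttpClient.newBuilder().followRedirects(HttpClient.Redirect.NEVER).build();

  /** Returns the robots.txt body, or empty if it is missing or the redirect limit is hit. */
  public static Optional<String> fetchRobotsTxt(URI robotsUrl) throws Exception {
    URI current = robotsUrl;
    for (int hop = 0; hop <= MAX_REDIRECT_HOPS; hop++) {
      HttpResponse<String> resp = CLIENT.send(
          HttpRequest.newBuilder(current).GET().build(),
          HttpResponse.BodyHandlers.ofString());
      int status = resp.statusCode();
      if (status >= 300 && status < 400) {
        Optional<String> location = resp.headers().firstValue("Location");
        if (location.isEmpty()) {
          return Optional.empty();
        }
        current = current.resolve(location.get()); // follow one more hop
        continue;
      }
      if (status == 200) {
        return Optional.of(resp.body());
      }
      return Optional.empty(); // other 4xx/5xx: no usable robots.txt
    }
    // More than five redirect hops: treat like a 404, per the quoted guidance.
    return Optional.empty();
  }
}
```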