Implement robots.txt check before scraping#253

Merged
R-Sandor merged 3 commits into FindFirst-Development:main from
joelramilison:implement-scrabe-robots-check
Oct 14, 2024

Conversation

@joelramilison
Contributor

Issue number: resolves #236


Checklist

  • Code Formatter (run prettier/spotlessApply)
  • Code has unit tests? (If no explain in other_information)
  • Builds on localhost
  • Builds/Runs in docker compose

What is the current behavior?

On bookmark creation, the server attempted to scrape no matter what, as long as the client had 'scrapable' toggled on (which is the default as of now).

What is the new behavior?

  • Check the robots.txt file of the bookmarked URL's domain
  • Use the Google procedure to parse it and determine whether the URL is scrapable (see the sketch after this list)
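
For reference, a minimal sketch of what such a check could look like on the server side. It uses only the JDK and a small subset of Google's parsing rules (user-agent groups and Disallow prefixes); the class and method names (`RobotsTxtChecker`, `isScrapeAllowed`) are hypothetical and not taken from this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper: decides whether a URL may be scraped based on robots.txt. */
public class RobotsTxtChecker {

  private final HttpClient client = HttpClient.newHttpClient();

  public boolean isScrapeAllowed(URI pageUrl, String userAgent) {
    try {
      // robots.txt always lives at the root of the host.
      URI robotsUrl =
          new URI(pageUrl.getScheme(), pageUrl.getAuthority(), "/robots.txt", null, null);
      HttpResponse<String> resp = client.send(
          HttpRequest.newBuilder(robotsUrl).GET().build(),
          HttpResponse.BodyHandlers.ofString());

      // Per Google's procedure, a missing robots.txt (4xx) means everything is allowed.
      if (resp.statusCode() >= 400 && resp.statusCode() < 500) {
        return true;
      }
      if (resp.statusCode() != 200) {
        return false; // server errors: be conservative, do not scrape
      }
      return isAllowedByRules(resp.body(), userAgent, pageUrl.getPath());
    } catch (Exception e) {
      return false; // be conservative on any failure
    }
  }

  /** Small subset of the parsing procedure: only Disallow lines for '*' or our agent. */
  private boolean isAllowedByRules(String robotsTxt, String userAgent, String path) {
    boolean groupApplies = false;
    for (String raw : robotsTxt.split("\\R")) {
      String line = raw.split("#", 2)[0].trim(); // strip comments
      if (line.isEmpty()) continue;
      String lower = line.toLowerCase();
      if (lower.startsWith("user-agent:")) {
        String agent = line.substring("user-agent:".length()).trim();
        groupApplies = agent.equals("*")
            || userAgent.toLowerCase().contains(agent.toLowerCase());
      } else if (groupApplies && lower.startsWith("disallow:")) {
        String rule = line.substring("disallow:".length()).trim();
        if (!rule.isEmpty() && path.startsWith(rule)) {
          return false;
        }
      }
    }
    return true;
  }
}
```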

Does this introduce a breaking change?

  • Yes
  • No

Other information

"Google follows at least five redirect hops as defined by RFC 1945 and then stops and treats it as a 404 for the robots.txt. This also applies to any disallowed URLs in the redirect chain, since the crawler couldn't fetch rules due to the redirects." (https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)

--> If you want to, you could open a separate issue to implement this as well.
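
Should someone pick that up: one way to honor the five-hop limit is to disable the HTTP client's automatic redirects and walk the Location headers manually. This is only a sketch under that assumption; the `RobotsFetcher` name and the "treat as 404" convention via an empty result are illustrative, not from this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

public final class RobotsFetcher {

  private static final int MAX_REDIRECT_HOPS = 5;

  // Redirects are followed manually so the hop count can be enforced.
  private static final HttpClient CLIENT =
      HttpClient.newBuilder().followRedirects(HttpClient.Redirect.NEVER).build();

  /**
   * Fetches robots.txt, following at most five redirect hops.
   * Returns empty if the limit is exceeded, which callers should treat like a 404
   * (no rules found) per the quoted Google behavior.
   */
  public static Optional<String> fetchRobotsTxt(URI robotsUrl) throws Exception {
    URI current = robotsUrl;
    for (int hop = 0; hop <= MAX_REDIRECT_HOPS; hop++) {
      HttpResponse<String> resp = CLIENT.send(
          HttpRequest.newBuilder(current).GET().build(),
          HttpResponse.BodyHandlers.ofString());
      int status = resp.statusCode();
      if (status >= 300 && status < 400) {
        Optional<String> location = resp.headers().firstValue("Location");
        if (location.isEmpty()) {
          return Optional.empty();
        }
        current = current.resolve(location.get()); // handles relative redirects
        continue;
      }
      return status == 200 ? Optional.of(resp.body()) : Optional.empty();
    }
    return Optional.empty(); // more than five hops: treat as 404
  }
}
```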

Collaborator

@R-Sandor R-Sandor left a comment


Wow this is awesome! Looks really good! Thanks for the test coverage too!

@R-Sandor R-Sandor merged commit 3bfe288 into FindFirst-Development:main Oct 14, 2024

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Server] Before initiating the scrape, check robots.txt on the domain.

2 participants