Implement robots.txt check before scraping#253

Merged
R-Sandor merged 3 commits into FindFirst-Development:main from
joelramilison:implement-scrabe-robots-check
Oct 14, 2024

Conversation

@joelramilison
Contributor

Issue number: resolves #236


Checklist

  • Code Formatter (run prettier/spotlessApply)
  • Code has unit tests? (If no explain in other_information)
  • Builds on localhost
  • Builds/Runs in docker compose

What is the current behavior?

On bookmark creation, the server attempted to scrape no matter what, as long as the client had 'scrapable' toggled on (which is the default as of now).

What is the new behavior?

  • Check the robots.txt file of the bookmarked URL's domain
  • Use the Google procedure to parse it and determine whether the URL is scrapable (see the sketch after this list)
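
For reference, a minimal sketch of what such a check could look like on the server side. It uses only the JDK and a small subset of Google's parsing rules (user-agent groups and Disallow prefixes); the class and method names (`RobotsTxtChecker`, `isScrapeAllowed`) are hypothetical and not taken from this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper: decides whether a URL may be scraped based on robots.txt. */
public class RobotsTxtChecker {

  private final HttpClient client = HttpClient.newHttpClient();

  public boolean isScrapeAllowed(URI pageUrl, String userAgent) {
    try {
      // robots.txt always lives at the root of the host.
      URI robotsUrl =
          new URI(pageUrl.getScheme(), pageUrl.getAuthority(), "/robots.txt", null, null);
      HttpResponse<String> resp = client.send(
          HttpRequest.newBuilder(robotsUrl).GET().build(),
          HttpResponse.BodyHandlers.ofString());

      // Per Google's procedure, a missing robots.txt (4xx) means everything is allowed.
      if (resp.statusCode() >= 400 && resp.statusCode() < 500) {
        return true;
      }
      if (resp.statusCode() != 200) {
        return false; // server errors: be conservative, do not scrape
      }
      return isAllowedByRules(resp.body(), userAgent, pageUrl.getPath());
    } catch (Exception e) {
      return false; // be conservative on any failure
    }
  }

  /** Small subset of the parsing procedure: only Disallow lines for '*' or our agent. */
  private boolean isAllowedByRules(String robotsTxt, String userAgent, String path) {
    boolean groupApplies = false;
    for (String raw : robotsTxt.split("\\R")) {
      String line = raw.split("#", 2)[0].trim(); // strip comments
      if (line.isEmpty()) continue;
      String lower = line.toLowerCase();
      if (lower.startsWith("user-agent:")) {
        String agent = line.substring("user-agent:".length()).trim();
        groupApplies = agent.equals("*")
            || userAgent.toLowerCase().contains(agent.toLowerCase());
      } else if (groupApplies && lower.startsWith("disallow:")) {
        String rule = line.substring("disallow:".length()).trim();
        if (!rule.isEmpty() && path.startsWith(rule)) {
          return false;
        }
      }
    }
    return true;
  }
}
```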

Does this introduce a breaking change?

  • Yes
  • No

Other information

"Google follows at least five redirect hops as defined by RFC 1945 and then stops and treats it as a 404 for the robots.txt. This also applies to any disallowed URLs in the redirect chain, since the crawler couldn't fetch rules due to the redirects." (https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)

--> If you want to, you could open a separate issue to implement this as well.
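
Should someone pick that up: one way to honor the five-hop limit is to disable the HTTP client's automatic redirects and walk the Location headers manually. This is only a sketch under that assumption; the `RobotsFetcher` name and the "treat as 404" convention via an empty result are illustrative, not from this PR:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;

public final class RobotsFetcher {

  private static final int MAX_REDIRECT_HOPS = 5;

  // Redirects are followed manually so the hop count can be enforced.
  private static final HttpClient CLIENT =
      HttpClient.newBuilder().followRedirects(HttpClient.Redirect.NEVER).build();

  /**
   * Fetches robots.txt, following at most five redirect hops.
   * Returns empty if the limit is exceeded, which callers should treat like a 404
   * (no rules found) per the quoted Google behavior.
   */
  public static Optional<String> fetchRobotsTxt(URI robotsUrl) throws Exception {
    URI current = robotsUrl;
    for (int hop = 0; hop <= MAX_REDIRECT_HOPS; hop++) {
      HttpResponse<String> resp = CLIENT.send(
          HttpRequest.newBuilder(current).GET().build(),
          HttpResponse.BodyHandlers.ofString());
      int status = resp.statusCode();
      if (status >= 300 && status < 400) {
        Optional<String> location = resp.headers().firstValue("Location");
        if (location.isEmpty()) {
          return Optional.empty();
        }
        current = current.resolve(location.get()); // handles relative redirects
        continue;
      }
      return status == 200 ? Optional.of(resp.body()) : Optional.empty();
    }
    return Optional.empty(); // more than five hops: treat as 404
  }
}
```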

Collaborator

@R-Sandor R-Sandor left a comment


Wow this is awesome! Looks really good! Thanks for the test coverage too!

@R-Sandor R-Sandor merged commit 3bfe288 into FindFirst-Development:main Oct 14, 2024

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Server] Before initiating the scrape, check robots.txt on the domain.

2 participants