Increase the number of redirects to 5 for Robots.txt fetching #1074

Merged: 2 commits, May 20, 2023

Conversation

michaeldinzinger
Described in issue #1058.
IETF RFC 9309 (Robots Exclusion Protocol) states that crawlers should follow up to 5 consecutive redirects when attempting to fetch a robots.txt file. Until now, StormCrawler (SC) followed only one level of redirects, so this code change might slightly improve the politeness of the crawler.

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
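
For readers who want the idea in code, here is a minimal sketch of a bounded redirect loop, assuming the limit of 5 named in the description above. It is not the code from this pull request: the class `RobotsRedirectSketch`, the method `resolve`, and the in-memory `redirects` map (a stand-in for real HTTP responses) are all hypothetical, and treating an over-long chain as "no result" is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not StormCrawler's actual implementation.
class RobotsRedirectSketch {

    // Limit taken from the PR description (up to 5 consecutive redirects).
    static final int MAX_REDIRECTS = 5;

    /**
     * Follows redirects starting at {@code start}. The map plays the role of
     * the network: a key is a URL that answers with a redirect, and its value
     * is the redirect target. A URL absent from the map is a final response.
     *
     * @return the final URL, or empty if more than MAX_REDIRECTS hops are
     *     needed (assumption: the caller then treats robots.txt as unavailable)
     */
    static Optional<String> resolve(String start, Map<String, String> redirects) {
        String current = start;
        int followed = 0;
        while (redirects.containsKey(current)) {
            if (followed == MAX_REDIRECTS) {
                // Another hop would exceed the limit, so give up. This also
                // terminates redirect loops.
                return Optional.empty();
            }
            current = redirects.get(current);
            followed++;
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/robots.txt", "http://b.example/robots.txt");
        redirects.put("http://b.example/robots.txt", "https://b.example/robots.txt");
        // Two hops: well within the limit.
        System.out.println(resolve("http://a.example/robots.txt", redirects));

        // A redirect loop never reaches a final response.
        redirects.put("https://b.example/robots.txt", "http://a.example/robots.txt");
        System.out.println(resolve("http://a.example/robots.txt", redirects));
    }
}
```

Counting hops rather than tracking visited URLs keeps the loop simple and bounds both long chains and cycles with the same check.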
@jnioche jnioche left a comment

Thanks for adding the tests. Just a minor suggestion

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
@jnioche jnioche merged commit edba0d0 into apache:master May 20, 2023
4 checks passed
@jnioche jnioche added this to the 2.9 milestone May 20, 2023
@jnioche jnioche added the core label May 20, 2023
jnioche commented May 20, 2023

thanks @michaeldinzinger and @rzo1

michaeldinzinger added a commit to michaeldinzinger/storm-crawler that referenced this pull request May 22, 2023
…#1074)

* Issue apache#1058: Allow 5 redirects for Robots.txt fetching

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Minor variable renaming

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

---------

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
jnioche added a commit that referenced this pull request May 23, 2023
* Remove injection from crawl topologies in *Search archetypes, fixes #1065

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 (#1062)

* Fix unmangleQueryString filter.

Fix unmangleQueryString filter. Do not analyze the full URL path, just the last child.

* formatting

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Removed remaining references to ES in OpenSearch module

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Dependency upgrades, fixes #1066 (#1067)

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Automatic creation of index definitions should use the bolt type (#1069)

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Maven plugin upgrades + better handling of plugin versions

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* bugfix: test jar not attached

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Update maven.yml

v3 version of actions

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* mechanism to retrieve more generic value of configuration (#1071)

* mechanism to retrieve more generic value of configuration if a specific one is not found, fixes #1070

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

* minor javadoc fix

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

---------

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Batch requests in DeleterBolt, fixes #1072

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Update README.md

link to docker project

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Create DeletionBolt.java for Solr. #1050 (#1073)

* Create DeletionBolt.java

storm-crawler-solr bug. Missing DeletionBolt bolt code. #1050

* Update DeletionBolt.java

License header added

* Update DeletionBolt.java

formatting

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* SOLR: suppress warnings + minor changes and Javadoc + added deletion to default topology

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Tika 2.8.0, fixes #1066

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Increase the number of redirects to 5 for Robots.txt fetching (#1074)

* Issue #1058: Allow 5 redirects for Robots.txt fetching

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Minor variable renaming

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

---------

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Add test coverage reports with JaCoCo and Coveralls, fixes #1075

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* #1075 - Add test coverage reports with JaCoCo

Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* #1075 - Update GH workflow to reduce log spam by adding -B and --no-transfer-progress maven options

Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Rebase - Issue #1042: Forbid all rules by default

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Modify Robots.txt parsing logic and add test cases

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Parse robots txt rules only for status code 200

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Trying to resolve merge conflicts

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Modify Robots.txt parsing logic and add test cases

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Parse robots txt rules only for status code 200

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

* Merge HttpRobotRulesParserTest

Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>

---------

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Michael Dinzinger <michael.dinzinger@uni-passau.de>
Signed-off-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
Co-authored-by: Julien Nioche <julien@digitalpebble.com>
Co-authored-by: syefimov <syefimov@ptfs.com>
Co-authored-by: Richard Zowalla <richard.zowalla@hs-heilbronn.de>
@michaeldinzinger michaeldinzinger deleted the devIncreaseRedirectsREP branch May 24, 2023 17:17