
WebHost: Add robots.txt to WebHost #3157

Merged: 9 commits merged into main from robots on Apr 21, 2024
Conversation

@LegendaryLinux (Member) commented Apr 16, 2024

What is this fixing or adding?

This adds a robots.txt file to the WebHost, which will prevent any bot that respects the file from including the site in search results. The main site, https://archipelago.gg, should set a config value in config.yaml to allow itself to be indexed.

I also added an ASSET_RIGHTS = false entry to config.yaml. If false, /robots.txt will be served. The app.config value defaults to false if the file is not present.
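
The robots.txt itself isn't shown in this thread; a blanket-disallow file of the kind described would look something like this sketch (not necessarily the exact file from the PR):

    # Ask all compliant crawlers to stay out of every path
    User-agent: *
    Disallow: /

with the corresponding config.yaml entry:

    # Serve the restrictive robots.txt (this host does not assert asset rights)
    ASSET_RIGHTS: false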

How was this tested?

I ran the WebHost locally and accessed the file at http://localhost/robots.txt

@LegendaryLinux added the "affects: webhost" and "waiting-on: peer-review" labels on Apr 16, 2024
@LegendaryLinux self-assigned this on Apr 16, 2024
@mpql commented Apr 16, 2024

> I would like to also add an option in a config file to exclude robots.py from being loaded

It might then be reasonable to just have robots.py check whether the host is archipelago.gg and, in that case, return e.g.:

    return app.send_static_file('robots_archipelago.gg.txt')
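
Spelled out a little, a hypothetical host-based dispatch could look like the following sketch (the request.host check and the filenames are assumptions, not code from this PR):

    from flask import request

    @app.route('/robots.txt')
    def robots():
        # Serve a permissive file on the main site's host,
        # and the blanket-disallow file everywhere else.
        if request.host.endswith('archipelago.gg'):
            return app.send_static_file('robots_archipelago.gg.txt')
        return app.send_static_file('robots.txt')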

@LegendaryLinux (Member, Author)

How do you propose testing that outside of the production environment?

@mpql commented Apr 16, 2024

> How do you propose testing that outside of the production environment?

Not sure what your current tests look like or which file in test/webhost/ would be responsible for this (feel free to point me there and I can write something up), but assuming TDD, I'd think it'd roughly be (a rough sketch follows the list):

  • check that the response from the /robots.txt route matches the contents of robots.txt
  • mock the host as archipelago.gg, then check that the response matches the contents of robots_archipelago.gg.txt
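
A rough sketch of that first check with Flask's test client (the import path and assertions are assumptions, not the project's actual test harness):

    import unittest

    from WebHostLib import app  # assumed import path

    class TestRobots(unittest.TestCase):
        def setUp(self):
            app.config["TESTING"] = True
            self.client = app.test_client()

        def test_robots_txt_served(self):
            # Assumes ASSET_RIGHTS is false, so the blanket-disallow file is served.
            response = self.client.get('/robots.txt')
            self.assertEqual(response.status_code, 200)
            self.assertIn(b"Disallow", response.data)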

@LegendaryLinux (Member, Author)

I went with the config file route. We have a preexisting config.yaml that already controls WebHost behavior, so I just used that.

@mpql commented Apr 16, 2024

Gotcha, makes sense!

@Berserker66 (Member)

I don't like controlling the behaviour via import; someone importing the module later and breaking this seems like a likely future bug.

@ThePhar changed the title from "Add robots.txt to WebHost" to "WebHost: Add robots.txt to WebHost" on Apr 20, 2024
@ThePhar added the "is: enhancement" label on Apr 20, 2024
    @cache.cached()
    @app.route('/robots.txt')
    def robots():
        configpath = os.path.abspath("config.yaml")
A Member commented on this excerpt:

Within the function, none of this is needed; just app.config["ASSET_RIGHTS"] suffices.
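
With the config loaded once at module import, the route could then reduce to something like this sketch (returning 404 on the main site is an assumption about the intended behaviour, not confirmed by the thread):

    from flask import abort

    @cache.cached()
    @app.route('/robots.txt')
    def robots():
        # On a host that asserts asset rights (the main site), expose no
        # robots.txt at all; otherwise serve the blanket-disallow file.
        if app.config["ASSET_RIGHTS"]:
            abort(404)
        return app.send_static_file('robots.txt')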

Comment on lines 11 to 17:

    configpath = os.path.abspath("config.yaml")
    if not os.path.exists(configpath):
        configpath = os.path.abspath(Utils.user_path('config.yaml'))

    if os.path.exists(configpath) and not app.config["TESTING"]:
        import yaml
        app.config.from_file(configpath, yaml.safe_load)

Suggested change (from the Member review): remove these lines from the function and rely on app.config directly.

@LegendaryLinux merged commit ad44512 into main on Apr 21, 2024 (26 checks passed)
@powerlord (Contributor) commented Apr 21, 2024

While this is a great idea, Google's own developer documentation says that it will still index pages blocked by robots.txt if other pages link to them:

> If your web page is blocked with a robots.txt file, its URL can still appear in search results, but the search result will not have a description.
>
> While Google won't crawl or index the content blocked by a robots.txt file, we might still find and index a disallowed URL if it is linked from other places on the web.

They only block a page completely if it has a meta noindex tag, like <meta name="robots" content="noindex">... and they won't even look for a meta noindex tag if the page is already covered by robots.txt.
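
A known alternative without that blind spot is the X-Robots-Tag response header, which Google honours on any page it is allowed to crawl (so it would replace the robots.txt block rather than complement it). A minimal Flask sketch reusing the same ASSET_RIGHTS value; this is not part of this PR:

    @app.after_request
    def add_noindex_header(response):
        # Satellite instances ask crawlers not to index any page they fetch.
        if not app.config["ASSET_RIGHTS"]:
            response.headers["X-Robots-Tag"] = "noindex"
        return response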

@mpql commented Apr 21, 2024

Good catch @powerlord; I didn't know Google did that.

Assuming y'all still want to differentiate, it might make sense to just have that config option show a banner, on by default, so satellite sites would say something at the top like:

Note: You're viewing an Archipelago instance. You can find the main Archipelago site at archipelago.gg.
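
If the banner route is taken, one way to drive it from the existing config is a Flask context processor exposing a flag for the templates; a sketch (the flag name is made up here):

    @app.context_processor
    def inject_instance_banner():
        # Templates can check show_instance_banner to render the notice
        # on satellite instances only.
        return {"show_instance_banner": not app.config["ASSET_RIGHTS"]}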

@Berserker66 deleted the robots branch on April 22, 2024 at 12:37