
Proposal to edit 4.6: Security tooling #69

Open
@codewordcreative

Description

Success Criterion - Security tooling (Machine-testable)

Current:
Web browsing from bots has been steadily increasing in recent years. As such, it is a growing concern for security, performance, and sustainability. Use security tools that automatically block bad actors and minimize bad behavior. This results in substantially less load on the server, fewer logs, less data, less effect due to compromise, and more. The result of compromised websites is a large increase in HTTP, email, and other traffic as malicious code attempts to infiltrate other resources and exfiltrate data. Compromised websites are typically identified by anomalous patterned behavior.

Suggested:
Follow best practices to block unwanted and unnecessary third-party crawlers from accessing or downloading your content. This includes adjustments to both robots.txt and server access rules. In the process, take care to ensure your content remains accessible to search engines and any helpful, welcome crawlers. Preventing access to suspicious user agents, unwanted visitors, and scrapers reduces emissions. Where these scrapers are seeking training data for large language models and similar technologies, there is an additional third-party impact to consider as your data is processed and used to expand their model(s). Blocking unwanted visitors will save energy, preserve performance, and reduce the burden on your hardware, which could reduce your need to provision more resources.
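
For illustration only (not part of the proposed criterion text), a minimal robots.txt sketch along these lines could look like the example below. The agent names (GPTBot, CCBot, Google-Extended) are just commonly cited AI/LLM crawlers and would need to be checked against a maintained list such as the one Dark Visitors publishes; the final wildcard group leaves search engines and other welcome crawlers unaffected.

```
# Example only: deny some commonly cited AI/LLM training crawlers.
# Verify agent names against a maintained list before relying on this.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers, including search engines, remain allowed.
User-agent: *
Allow: /
```

Since robots.txt is purely advisory and is ignored by many scrapers, the server access rules mentioned in the suggested text are the part that actually enforces the block.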

Additional information:
AFAIK, Siddhesh has assembled a list of resources and studies relating to crawling and LLM impact that can be included here to accompany the guidelines.

My comments:
I'm adding this here so we can more easily comment on a more substantial edit.
Actual solutions are a little thin on the ground right now. Cloudflare does a lot. The 8G firewall stuff from Perishable Press does a lot. And Dark Visitors collects a lot of data, although its free solution only covers robots.txt. Overall, many seem not to understand the need for server-level protections or the impact of failed attempts. We may need to add a specific criterion that simply points to the security guidelines to say "do this", with this one being the "also do this, to maximize sustainability".
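
As a concrete illustration of what "server-level protections" could mean in practice, a minimal Apache rule that refuses known scraper user agents might look like the sketch below. This is not the 8G firewall, and the listed agent names are examples that would need maintaining; it only shows the general technique of rejecting a request before it reaches the application.

```
# Minimal sketch of a server-level user-agent block (Apache with mod_rewrite).
# Not the 8G firewall; example agent names only, keep the pattern up to date.
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider) [NC]
  # Return 403 Forbidden and stop processing further rules.
  RewriteRule .* - [F,L]
</IfModule>
```

Refusing the request this early keeps failed attempts cheap, which is exactly the server-load point above; services like Cloudflare move the same decision even further upstream, before the request touches your infrastructure at all.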

Ryan's comments on the security section in general:
I feel like Security is an area where we identified there should be a cross-reference to other W3C guidelines, instead of writing in depth here? And/or I would move this under a separate numbered guideline of its own.

Ryan's comments on this section:
Rose and I worked on this text together. I'm not positive it belongs under Automation, but it's specific enough now to be less about security and more about automatically blocking bad bots.

Credit: Edit drafted directly together with @ryansholin, following discussions with @systemstree (Siddhesh Wagle).

Metadata


    Labels

    taskforce-infra (This issue affects the infrastructure taskforce), technical (Corrections, bugs, or minor omissions)

    Status

    In Progress
