
[Feature Request]: More agents for dark visitors in robots.txt #1314

Open
arthurzenika opened this issue Jan 19, 2024 · 4 comments

Comments

@arthurzenika

Feature Description

First of all, congratulations on providing an out-of-the-box setting for blocking ChatGPT and other bots in robots.txt; it's a super cool feature!

There are some more bots that could be added to the list: https://darkvisitors.com/

(I looked at the code to try to contribute the list directly, and maybe there could be a catch-all setting instead of having a button per bot, as sketched below? Or do you have some users who want to select them one by one?)
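For reference, here is a sketch of what an expanded block list could look like in robots.txt, using a few of the agents listed on darkvisitors.com as of early 2024 (the exact names change over time, so treat this as illustrative rather than complete):

    # illustrative AI-crawler block list; agent names per darkvisitors.com
    User-agent: GPTBot
    User-agent: ChatGPT-User
    User-agent: CCBot
    User-agent: anthropic-ai
    User-agent: Google-Extended
    User-agent: Bytespider
    Disallow: /

Note that the robots.txt protocol has no real catch-all short of User-agent: *, which would also block ordinary search engines, so a per-agent list (or a single toggle that expands into one) seems unavoidable.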

@atomGit

atomGit commented Feb 6, 2024

blocking anything via robots.txt is generally a poor approach - this is like me asking you to cease using the word 'the' in conversation; it's entirely your choice whether you do or don't - i had a chuckle when i saw this option was added to publii because it's pointless

while some language scrapers for so-called "AI" (there is no AI (yet), not that the public can access anyway) may obey robots.txt, others will not, thus attempting to block such requests via robots.txt is nothing but an exercise in enumerating badness

if you really want to block this type of request, the tl;dr answer is: don't bother

the longer answer is that you may have to resort to blocking requests from non-search engines by UA/IP, but the UA/IP could be anything ... including a search engine or a genuine web browser ... so this approach is also useless (see previous answer)

i feel ya and i agree that one should be able to block this crap, but there is literally no reliable way to do so that i'm aware of - every site will be indexed for AI at some point, either directly or indirectly
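to be clear, the kind of server-side filtering i'm talking about would look something like this nginx sketch (the agent names are illustrative, and any scraper can spoof a browser UA, which is exactly the problem):

    # nginx sketch: deny requests whose user agent matches known AI crawlers
    # (map goes in the http{} context; the agent list is illustrative, not exhaustive)
    map $http_user_agent $is_ai_bot {
        default      0;
        ~*GPTBot     1;
        ~*CCBot      1;
        ~*Bytespider 1;
    }
    server {
        listen 80;
        # reject matched agents; anything spoofing a browser UA sails right through
        if ($is_ai_bot) { return 403; }
    }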

@raramuridesign

@arthurzenika We have found that the most effective way to block bots is to use Cloudflare WAF Rules instead, although this does require knowing which ones to block. We have a list of more than 50 bots we actively block on projects.
You can read more here.
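For context, a Cloudflare WAF custom rule along these lines (a sketch, not our actual rule set) blocks by user agent at the edge, before requests ever reach the origin:

    # Cloudflare custom rule expression (action: Block); agent names are examples
    (http.user_agent contains "GPTBot") or
    (http.user_agent contains "CCBot") or
    (http.user_agent contains "Bytespider")

The same caveat applies as with robots.txt: this only catches bots that identify themselves honestly.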

@internettips

Blocking bots may be more trouble than it is worth. Perhaps the best solution is simply to put your valuable content behind a password-protected / Passkey login. Interested parties will register, and in most instances that registration is a hand-raiser signal. With password-protected logins for premium content, casual bots get blocked by default. So do search engines for your valuable content (unless you create search-engine pass-throughs). Then develop a business sales strategy focused on your password-protected content.

If you're concerned that search engines won't be able to index the content behind the password barrier, consider that with typically only eight spots (or fewer) available on page 1 of search results, is more than a token SEO effort still worthwhile, at least for most websites? There are better ways to find and connect with hand-raisers, then turn some of those into buyers. Search engine listings are a nice side benefit, not a core business driver. Or, if you have funds to spare, maybe you could use web ads to promote content. Though with AI developments about to gain another boost this year, the days of old-style SEO, and even web advertising, are probably going to change sooner rather than later.
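As a minimal illustration of the blocked-by-default point, plain HTTP basic auth is already enough to stop crawlers and casual bots alike (an nginx sketch; the path and realm name are made up):

    # nginx sketch: everything under /premium/ requires a login,
    # so crawlers and scrapers get a 401 by default
    location /premium/ {
        auth_basic           "Members";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }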

@atomGit

atomGit commented Feb 15, 2024

We have found that the most effective way to block bots is to use Cloudflare ...

that's mistake no. 2

Stay away from Cloudflare

why you shouldn't use Cloudflare - tiq's tech-blog

Why does cloudflare suck? : CloudFlare

cloudflare can take their gd annoying captcha crap and shove it where the sun don't shine, right along with their stupid custom http error codes

sites using this service are compromising user privacy, and that shouldn't be acceptable to anyone
