kkrugler edited this page Sep 13, 2010 · 2 revisions

General Information

The Bixo open source project is a web mining toolkit. If you are looking at this page, then it’s likely that somebody is running a web crawler based on Bixo. The following user agent names have special meaning:

  • “bixo integration test” – Used for tests that run when the Bixo project is being built and we need to validate proper fetching of real (external) web pages. This should only be an occasional activity, and it targets a handful of high-traffic sites.
  • “bixo test fetcher” – Used for manual tests of specific pages. This should be a very, very low-volume activity.

If you are being crawled, and the user agent name is not one of these two special values, then there are three possibilities:

  1. Somebody is running the Bixo SimpleCrawlTool. This is a demonstration tool that can be used to crawl a single domain. When the tool is run, it requires that the user enter the user agent name, but it leaves the email address and web page set to the standard Bixo configuration.
  2. Somebody is running the “helpful” Bixo example project. This should only be used to fetch Hadoop mailing list logs, so if you’re not the Apache.org webmaster then see possibility #3 below.
  3. Somebody is intentionally trying to mask their crawling activities by re-using the Bixo project information. If this happens, please get in touch with us (see below) so we can track them down and apply appropriate punishment.

To be polite, the Bixo fetching code accesses a given IP address from only one thread at a time. In addition, we impose a minimum delay between requests to the same server, and we honor the Robots Exclusion Standard.
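For readers curious what these rules mean in practice, here is a minimal Java sketch of a per-IP politeness gate. It is an illustration only, not Bixo's actual code: the class name `PolitenessGate`, its methods, and the choice to measure the delay from the end of the previous request are all assumptions made for this example. Robots Exclusion handling is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-IP politeness gate; not part of the Bixo API. */
class PolitenessGate {
    private final long minDelayMillis;
    // IP addresses that currently have a fetch in progress.
    private final Set<String> busy = new HashSet<>();
    // Earliest time (in millis) at which each IP may be fetched again.
    private final Map<String, Long> earliestNext = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns true if the caller may start a fetch from this IP now.
     * A caller that gets true must call release() when the fetch ends.
     */
    synchronized boolean tryAcquire(String ip, long nowMillis) {
        if (busy.contains(ip)) {
            return false; // another thread is already talking to this IP
        }
        if (nowMillis < earliestNext.getOrDefault(ip, Long.MIN_VALUE)) {
            return false; // minimum delay has not elapsed yet
        }
        busy.add(ip);
        return true;
    }

    /** Marks the fetch as finished and starts the minimum-delay clock. */
    synchronized void release(String ip, long nowMillis) {
        busy.remove(ip);
        earliestNext.put(ip, nowMillis + minDelayMillis);
    }
}
```

A fetcher thread would call `tryAcquire(ip, System.currentTimeMillis())` before each request and skip or re-queue the URL when it returns false. Passing the clock value in as a parameter keeps the gate easy to test.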

Contact Us

If code built on top of Bixo is causing any problems for your site, or you have questions about our spider that are not answered by this page, please get in touch with us!

  • Phone: +1 530-210-6378
  • Email: bixo-dev@groups.yahoo.com