Being a Good Robot
Introduction
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a specification used by websites to communicate with web crawlers and other web robots. The specification determines how to inform web robots about which areas of the website should not be processed or scanned.
Because Geoportal Server can be used as a harvester, we have implemented support for reading and respecting the rules of engagement defined in a site's robots.txt file, so that it can be a good robot.
Robots.txt is a de facto standard, which means there is no governing body behind it and each implementation may vary, adhering to the standard more or less strictly. An implementation may also introduce its own extensions as well as its own interpretation of any ambiguous topics. The goal of our implementation is to provide the best solution considering both the standard and the consensus within the community. To read more about robots.txt, refer to the following documents:
- http://www.robotstxt.org/robotstxt.html
- https://en.wikipedia.org/wiki/Robots_exclusion_standard
- http://www.robotstxt.org/norobots-rfc.txt (this is the specification)
In general, there is agreement that only the User-Agent and Disallow directives are widely recognized. Our implementation can also understand and apply Allow, Crawl-Delay, and Host. The Sitemap directive is not applied, although it is recognized and ready to use. There is limited support for pattern matching, and the hash character (#) marks the beginning of a comment.
In Geoportal Server this information is applied during harvesting, and only during harvesting. For example, CSW is used for both search and harvest, but only harvesting makes use of robots.txt. In practice this means the harvester will not attempt to reach URLs that are determined to be denied, it will wait a certain number of seconds between subsequent requests to the same server if Crawl-Delay is defined, and it will substitute the original URL with the information from the Host directive if present. Let's look at an example:
# General Section
User-Agent: *
Disallow: /private/
# Specific section
User-Agent: GeoportalServer
Disallow: /not-for-geoportal/
Disallow: /*.html
Allow: /private/for-geoportal/
Allow: /*.xml$
Crawl-Delay: 10
Host: https://myhost:8443
The above robots.txt allows access to anything except the /private/ folder (the general section), unless the crawler introduces itself as "GeoportalServer", for which there are additional directives. In particular, such a crawler is forbidden to access any HTML file, and is permitted to access any XML file anywhere on the path, even if it is in the /private/ folder. Also, any crawler should wait 10 seconds between requests to the server, and must use the HTTPS protocol no matter what.
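For illustration, here is a minimal Java sketch, not the actual Geoportal Server code, of how a harvester might substitute the Host directive into a target URL and honor a Crawl-Delay of 10 seconds; the class, method, and URL names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: applying the Host and Crawl-Delay directives from the example above.
public class CrawlPolicyExample {

  // Replaces the scheme, host, and port of the original URL with the value of the Host directive.
  static URI applyHostDirective(URI original, URI host) throws URISyntaxException {
    return new URI(host.getScheme(), null, host.getHost(), host.getPort(),
        original.getPath(), original.getQuery(), original.getFragment());
  }

  public static void main(String[] args) throws Exception {
    URI original = new URI("http://myhost/data/records.xml"); // hypothetical harvest target
    URI host = new URI("https://myhost:8443");                // Host: https://myhost:8443
    long crawlDelayMillis = 10 * 1000L;                       // Crawl-Delay: 10 (seconds)

    URI target = applyHostDirective(original, host);
    System.out.println(target); // prints https://myhost:8443/data/records.xml

    // A polite harvester waits at least this long between consecutive
    // requests to the same server.
    Thread.sleep(crawlDelayMillis);
  }
}
```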
The implementation in Geoportal Server allows you to declare the user-agent (this is how Geoportal Server's crawler introduces itself) in gpt.xml. It is also possible to turn off the robots.txt functionality entirely, which makes the geoportal a "bad robot". However, individual users can be allowed, through the site registration, to override that last setting, for example to disable robots.txt even if the geoportal is configured to use it.
The API we use for this is as simple as possible, exposing only a minimal set of functions to the rest of the software. A rich set of information is written to the log file, but most of it is logged at the FINE level.
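As a hypothetical illustration only (these names are not the actual Geoportal Server API), such a minimal surface can be reduced to a few questions: may this path be fetched, how long should the crawler wait between requests, and is there a Host override?

```java
// Hypothetical sketch of a minimal robots.txt facade; not the actual Geoportal Server API.
public interface RobotsPolicy {

  /** @return true if the given path may be fetched by the configured user-agent */
  boolean isAllowed(String path);

  /** @return crawl delay in seconds, or 0 if no Crawl-Delay directive applies */
  long getCrawlDelaySeconds();

  /** @return the Host directive value, or null if none was declared */
  String getHostOverride();
}
```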
Implementation Considerations
As noted above, robots.txt is not a formal standard and has some ambiguities. Below is a brief discussion and explanation of the choices we made in our implementation.
- Pattern matching: the standard does not say anything about a pattern matching algorithm; however, many sites use some form of it. In general, there is a consensus that the pattern matching is NOT regular expression matching. It MIGHT BE glob matching, and some sites specify what kind of pattern matching they use by announcing it on their site (Google does this). Our implementation recognizes only the asterisk (*) as a wildcard character anywhere in the pattern, matching any sequence of characters, and the dollar sign ($) to mark the end of a path. Asterisk matching is "greedy" (vs. "reluctant") in that it tries to match as much as possible. See the sketch after this list.
- Matching priority: suppose a path matches both a "disallow" pattern and an "allow" pattern. Should the path in this case be treated as permitted or denied? There are two common approaches: one where the first match wins, which is suggested in norobots-rfc.txt, and the approach taken by Google, where "allow" wins regardless of position, but only if the length of its pattern is equal to or greater than that of the shortest matching "disallow" pattern. We have chosen to implement the approach defined in the specification, i.e. the first match wins (see the sketch after this list).
- Fall back: in general it is well understood that if a path does not match any pattern on the list, then accessing that path is permitted. But what if a crawler recognizes a specific section as applicable to itself and exhausts the listed patterns in that section without a match? It could either stop and treat the path as permitted, or it could fall back to the general section and continue the matching process. Our implementation falls back to the general section, as is suggested in the specification.
- Default user-agent name: we have implemented the ability to set the user-agent the crawler uses as its signature. The default value is "GeoportalServer". It is used in two cases: for scanning robots.txt to find the applicable section, and as the value of the "User-Agent" header in HTTP requests.
- Override option: there is an option in gpt.xml that enables functionality allowing users to declare, during registration of a site, whether or not to respect robots.txt. If this override setting is on, the user gets an additional UI element (a drop-down list) with three choices: inherit (i.e. use the global setting for the geoportal), always (i.e. read and respect robots.txt even if disabled in gpt.xml), and never (i.e. ignore the site's robots.txt even if the geoportal is configured to read and respect it).
- Multi-machine setup: technically (although not advised) it is possible to configure an architecture with multiple harvesting machines for one geoportal. In such a case it is quite possible that requests to the same server might be submitted more often than allowed by Crawl-Delay.
- Sitemap: currently we ignore the Sitemap directive in robots.txt files.
- WAF harvesting: for WAF harvesting, Geoportal Server relies on being able to access a parent folder so that it can retrieve its contents and sub-folders. A robots.txt could allow access to the sub-folders but not to the parent, which would cause trouble for WAF harvesting. We are working on fine-tuning the approach for WAF harvesting.