
High Availability and Large Number of Records


'High Availability' and 'Large' are arbitrary terms, but in this case we refer to geoportals that must meet a requirement for system failover and/or contain 500,000+ records. If your organization plans to implement such a geoportal, then there are some things you can do to improve the performance and success of your implementation. This topic discusses architectural considerations and settings in the gpt.xml file to accommodate high availability and larger geoportals.

User store and database

The Geoportal architecture typically includes a server hosting a user store (LDAP), a server hosting the geoportal database, and a server hosting the geoportal web application. For a failover environment, you should follow the guidance of your LDAP and RDBMS software for backing up both the user store and the database. This topic does not include steps for configuring this backup.

Architecture Overview

The diagram below provides a visual overview of one way to set up the geoportal environment for high availability. It is recommended that you deploy the geoportal web application on two server instances and use a load balancer to direct web traffic to the endpoint. Each instance should have its own Lucene index; they should not share an index, but each should be configured in its respective gpt.xml file to point to its own index. This means it is possible for searches to return slightly different results for newly published documents, because each index typically synchronizes with the database overnight.
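As a minimal sketch of what this looks like (assuming the standard lucene element found in a default gpt.xml; verify the attribute names and paths against your own file), each instance points to a separate local index directory:

<!-- Geoportal1's gpt.xml: index stored locally on server 1 (path is a placeholder) -->
<lucene indexLocation="C:\geoportal1\lucene-index"
        writeLockTimeout="60000"
        useNativeFSLockFactory="true"
        analyzerClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>

<!-- Geoportal2's gpt.xml: a different, non-shared index location on server 2 -->
<lucene indexLocation="C:\geoportal2\lucene-index"
        writeLockTimeout="60000"
        useNativeFSLockFactory="true"
        analyzerClassName="org.apache.lucene.analysis.standard.StandardAnalyzer"/>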

If you have scheduled harvesting of several repositories - a likely scenario for large geoportal deployments - then you should reduce the workload on the servers hosting the user-facing geoportals by separating out the harvesting functionality. In this setup, you deploy an additional geoportal instance that runs behind the scenes and is used solely for harvesting. You will configure the gpt.xml files for the two user-facing geoportals and the geoportal harvester in a specific way, discussed in the gpt.xml configuration section below.

Note: In the diagram, the Database1 and LDAP1 servers are backed up to the Database2 and LDAP2 servers. This is part of best practice for maintaining user stores and databases, but is not addressed in this topic.

gpt.xml configuration

The gpt.xml configuration for the two geoportal servers and the geoportal harvester instance is shown below. Here we show only the Web Harvester parameters section; the rest of the gpt.xml configuration will be determined by your organization's preferences.

Geoportal1 and Geoportal2

In the example below, you will update and add the following parameters in the gpt.xml file for the two user-facing geoportals. Remember to replace url_to_the_harvester_machine with the URL of the geoportal instance that you want to dedicate solely to harvesting.

<parameter key="webharvester.updateindex" value="false"/>
<thread class="com.esri.gpt.catalog.context.CatalogSynchronizer" at="01:00"/>
<thread class="com.esri.gpt.control.webharvest.engine.ScheduledPause" at="00:45">
  <parameter key="remoteIndexingUrls"
             value="url_to_the_harvester_machine"/>
  <parameter key="connectionTimeout"    value="30[MINUTE]"/>
  <parameter key="responseTimeout"      value="30[MINUTE]"/>
  <parameter key="initialSleepTime"     value="30[MINUTE]"/>
  <parameter key="consecutiveSleepTime" value="15[MINUTE]"/>
</thread>

Geoportal Harvester

On the harvesting geoportal, the configuration is the same except that the remoteIndexingUrls parameter should be set to 'self', as shown in the example below:

<parameter key="webharvester.updateindex" value="false"/>
<thread class="com.esri.gpt.catalog.context.CatalogSynchronizer" at="01:00"/>
<thread class="com.esri.gpt.control.webharvest.engine.ScheduledPause" at="00:45">
  <parameter key="remoteIndexingUrls"
             value="self"/>
  <parameter key="connectionTimeout"    value="30[MINUTE]"/>
  <parameter key="responseTimeout"      value="30[MINUTE]"/>
  <parameter key="initialSleepTime"     value="30[MINUTE]"/>
  <parameter key="consecutiveSleepTime" value="15[MINUTE]"/>
</thread>

Spatial Ranking

Another setting in the gpt.xml file that can be adjusted is the one responsible for spatial ranking. Spatial ranking automatically attempts to rank records in the geoportal catalog by their spatial relevance. When there are many records, this ranking becomes resource-intensive. To change the maximum number of records for which your geoportal applies spatial ranking, update the spatialRelevance.ranking.maxDoc parameter in the gpt.xml file. You may want to set this value lower (e.g., 100,000) so that spatial ranking is not attempted against your large catalog.
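For example, a large catalog could lower the threshold so that ranking is skipped once the index grows past it. The snippet below is a sketch based on a typical gpt.xml; the companion spatialRelevance.ranking.enabled parameter and the exact default values may differ in your version, so check your own file:

<!-- with 'auto', spatial ranking is applied only while the catalog stays below maxDoc -->
<parameter key="spatialRelevance.ranking.enabled" value="auto"/>
<!-- lower this threshold (e.g. 100000) so ranking is not attempted for a large catalog -->
<parameter key="spatialRelevance.ranking.maxDoc" value="100000"/>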


Back to Customizations