Details of Lucene Indexing in the Geoportal

Christine White edited this page Oct 16, 2013 · 6 revisions

The Geoportal Server uses Lucene to index metadata for search. How metadata is indexed is important because it determines what search results are returned when a user submits search criteria to the geoportal. When a metadata document is published, certain content from the document is submitted for indexing. When a user conducts a search, it is the index, not the geoportal database, that is searched. Indexed information is assigned a particular meaning. A 'meaning' refers to a concept you would like to specifically search or query. This 'meaning' determines how Lucene will index the content and how it may be used in searching.

Before a 'meaning' value can be used, it has to be defined in a file called property-meanings.xml, located in \\geoportal\WEB-INF\classes\gpt\metadata. The geoportal references property-meanings.xml to index the metadata value for search and retrieval.

Each geoportal metadata profile definition.xml file can specify the set of properties that will be indexed. These properties are usually captured in that profile's indexables.xml file. The indexables.xml file makes a connection between an element's XML xpath and its associated meaning in the property-meanings.xml file. This in turn defines how that element will be indexed and searched.

Note: The geoportal can be customized so that it automatically indexes all metadata content, regardless of which parameter it is associated with in the metadata. To enable this customization, see Index All Metadata Content.

Determine if a metadata element is already indexed by default

To check if a specific XML element is already indexed, identify the definition.xml file for the profile that references the metadata element. For example, if we want to find out whether the Lineage element from the INSPIRE (Data) profile is indexed, we start by opening the inspire-iso-19115-definition.xml file. Here, we need to identify which indexables.xml file is referenced by this profile. To find the indexables.xml file, look at the fileName attribute of the <indexables fileName=""></indexables> element in definition.xml. In our example, this points to the apiso-indexables.xml file in the \\geoportal\WEB-INF\classes\gpt\metadata\iso folder. Once you have identified which indexables.xml file is referenced, open that file.
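
As a rough illustration, the reference in definition.xml might look something like the line below. Only the indexables element and its fileName attribute come from the description above. The path value is an assumption based on the folder named in this example, so check it against your own definition.xml.

```xml
<!-- Sketch only: the fileName value is assumed from the iso folder mentioned above -->
<indexables fileName="gpt/metadata/iso/apiso-indexables.xml"></indexables>
```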

Now, search in the indexables.xml file for the xpath of the metadata element in which you are interested. If the xpath is not referenced, then it is not indexed. Alternatively, you may see that it is present in the file and therefore indexed, but may want to change how it is indexed.

In our Lineage example, the xpath is /gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString. We do find this xpath in the apiso-indexables.xml file, and see that it is indexed by the property meaning name apiso:Lineage. When we look up the apiso name="apiso:Lineage" in property-meanings.xml, we see that the queriable for this is the text apiso.Lineage. So we could type apiso.Lineage:searchTerm in the Search field on the geoportal search page to search the Lineage elements for searchTerm.

If a metadata element is not already indexed, add it to the indexables.xml file

If the xpath to the metadata element is not provided in indexables.xml, you can add its xpath to one of the property meanings listed in that file. After adding the xpath to the property meaning that matches your metadata element's meaning, save indexables.xml and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
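
A minimal sketch of such an addition is shown here, written in the same form as the property entry used for conditional indexing in this topic. The xpath (the FGDC Supplemental Information element) and the choice of the body meaning are illustrative assumptions, not entries taken from a shipped indexables.xml. Substitute the xpath of your own element and the meaning that fits it, and place the entry alongside the existing property entries in the file.

```xml
<!-- Hypothetical entry: index an element that was not indexed before
     under the existing "body" meaning so that general searches can find it -->
<property meaning="body"
  xpath="/metadata/idinfo/descript/supplinf"/>
```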

Note: It is possible to implement conditional indexing as well. For example, if you wanted to index a URL only if it contained a certain phrase, you could use a contains() predicate in the xpath. In the snippet below, the resource.url is indexed only if it contains the word "thredds". URLs that do not contain the word "thredds" would not be indexed:

<property meaning="resource.url"  
  xpath="/metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/networkr[contains(.,'thredds')]"/>

Instructions for defining a new property meaning are provided later in this topic, under How to define a new property meaning, but first read the section on the property-meanings.xml file below.

The property-meanings.xml file

Before adding new meanings, check the property-meanings.xml file to see if an existing meaning will suit your need. Some of the meanings already defined in the file are listed in the table below, along with any functionality the geoportal code associates with that meaning. Additional meanings defined for ISO-based standards are also found in property-meanings.xml, but are not listed in the table. By using existing meanings, the effort to upgrade to future versions of the Geoportal Server is minimized. The existing meanings should satisfy most of the search needs.

| property-meaning name | description | geoportal function |
|---|---|---|
| uuid | Geoportal's primary key for identifying the document. | Typically you will see this value in URLs. For example: http://host:port/geoportal/rest/document?id=[uuid] |
| fileIdentifier | Represents an identifier from within the metadata document. Not all metadata standards support an internal identifier metadata element. If present, it is recommended that it be globally unique. | Used by the geoportal to avoid duplication of resources and as an alternative identifier for most of the REST-based functions. For example: http://host:port/geoportal/rest/document?id=[fileIdentifier] |
| sys.siteuuid | Internally used by the geoportal, associated with documents that are harvested from remote catalogs. It is the identifier of the remote catalog. Do not alter this. | Available for query. |
| dateModified | Geoportal's modification datestamp associated with the last occurrence that the resource's XML was updated. | Used in the Additional Options dialog on the geoportal Search page, and for sorting by date. |
| geometry | Represents the bounding envelope associated with the resource. | Used for spatial queries. |
| keywords | Keywords associated with the resource. | Available for query. |
| body | Non-specific query; a catch-all for indexing and searching text in a metadata document. | If you want to index a certain element, but do not plan to query for that specific element, index it as body. |
| anytext | Anytext is not actually indexed. It represents a collection of properties that will be searched when the queriable anytext is specified. | General searches that are not directed to a specific property are anytext queries. |
| title | Title of the resource. | Used when the resource's title is displayed, for example in the list of search results on the Search page. |
| title.org | Captures the original title as provided from a resource's GetCapabilities response. | Enables geoportal to search both a user-given title for a registered resource, and its original title as per the GetCapabilities response. |
| abstract | Abstract associated with the resource. | Maps to the information displayed as text below the title of a record in the list of search results. |
| contentType | Esri concept for categorizing resources. | Used for generating the icon for the resource listed in Search page results, and also as a filter on the Additional Options dialog. |
| dataTheme | ISO Topic Category code associated with the resource. ISO has defined the Topic Category codelist in the 19115 standard. | Maps to the ISO Categories in the Additional Options dialog. |
| resource.url | Primary endpoint for accessing the resource through the internet. | Used for generation of links in search results. For example, it is the URL accessed when the Preview or Open link is clicked. It is also sometimes used to determine the Esri contentType for the resource. |
| thumbnail.url | URL to the thumbnail image for the resource. | Used for generation of the thumbnail image next to the resource in the list of search results. |
| website.url | URL to a website associated with the resource. | Used for generation of a website link for the resource in the list of search results. |

Each property-meaning in property-meanings.xml has attributes. These attributes for property-meanings are described below.

Attribute Name Description
name Unique name for the meaning in this file, and should match the meaning="" attribute in the definition.xml file. The name designated becomes a Lucene field that can be used for advanced searches, as per Lucene documentation. For example, designating a name of title and then typing title:water on your geoportal search page will only return items with water in the index Lucene has associated with the property-meaning title.
meaningType Used to flag metadata elements that are tied to functionality within the geoportal. It is good practice to avoid altering the meaningType of a property-meaning.
valueType Data type of the property value, e.g. Double, Geometry, Long, String, or Timestamp.
comparisonType Indicates how Lucene will index the property values. There are three options defined in the property-meanings.xml file:
  • term: phrases associated with this attribute are tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of term, it is stored as two separate words, "San" and "Diego". Terms are also stored in lowercase form, e.g. "san" and "diego".
  • keyword: phrases associated with this attribute are not tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of keyword, it is stored as one phrase. A search for "San" will not return the record; only a search for "San Diego" will. Keywords are also stored in lowercase form, e.g., "san diego".
Note: There is an issue with how Lucene parses values for property-meanings with the comparisonType set to 'keyword' if the value is a phrase containing a dash, a space, or an underscore. The limitation is that the specific field must be targeted, which is why keywords:"surface temperature" may retrieve results but searching for just "surface temperature" may not. The workaround is to index the values twice. This means that you not only index certain metadata fields as keywords (or dataTheme, or other meanings that use comparisonType="keyword"), but you also index them as something more general. For example, you may use a property-meaning like 'body' from the property-meanings.xml file to index the same field as your keywords or dataTheme. You would do this by setting meaning="body" on a property for that field in the indexables.xml file.
  • value: items associated with this attribute are stored as values, not phrases or words. Items are case-sensitive. An example would be the fileIdentifier meaning. Parameters with a meaning="fileIdentifier" likely hold unique identification strings, such as {F56408D6-4325-484C-B753-5E8FD4421E31}. Searching for part of the string, such as "E31", will not retrieve the record because the string is stored as a complete value and not parsed. Searching for the string "{f56408d6-4325-484c-b753-5e8fd4421e31}" will also not return the record because the value stored is case-sensitive.
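
The double-indexing workaround described in the note on keyword values might be sketched in indexables.xml as follows. The xpath used here (the FGDC theme keyword element) is an illustrative assumption. The point is only that the same xpath appears twice, once under the keyword-style meaning and once under body.

```xml
<!-- Indexed as keywords: stored as whole phrases, found by targeted
     searches such as keywords:"surface temperature" -->
<property meaning="keywords"
  xpath="/metadata/idinfo/keywords/theme/themekey"/>
<!-- Hypothetical second entry: the same element indexed as body, so a
     general search for "surface temperature" can also match -->
<property meaning="body"
  xpath="/metadata/idinfo/keywords/theme/themekey"/>
```

The trade-off is a somewhat larger index, since the same values are stored under two meanings.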

Some property-meanings have one or two additional sub-elements, <dc></dc> and <consider></consider>.

  • The <dc></dc> element facilitates the connection of property-meanings to Dublin Core concepts. This is essential to supporting the CS-W OGCORE profile, defining what is queriable and returnable through CS-W. Within the <dc></dc> element, there is an attribute for name and one for aliases. The name attribute defines the name of the Dublin Core element. The aliases attribute defines alternate words that will be recognized when supplied as a CS-W property name.
  • The <consider></consider> element is used only for the anytext property. It defines other property-meanings that should be included when a search target is anytext. For example, the property-meaning for anytext is shown below. Because anytext has six other property-meanings listed in its <consider></consider> element, a search for anytext results in the title, abstract, keywords, body, contentType, and dataTheme properties being searched.
<property-meaning name="anytext" 
  meaningType="anytext" 
  valueType="String" 
  comparisonType="terms" 
  allowLeadingWildcard="true">
  <consider>title,abstract,keywords,body,contentType,dataTheme</consider>
  <dc name="AnyText" aliases="csw:AnyText,any,csw:Any"/>
</property-meaning>

How to define a new property meaning

If you have created a custom metadata profile, or added new elements to an existing geoportal metadata profile, and none of the existing property meanings in property-meanings.xml suit your needs for indexing a specific element, then you may need to define a new property meaning. Follow the instructions below.

  • Choose an existing property meaning from property-meanings.xml that is conceptually similar to the new one you'd like to create. Make a copy of the existing property meaning, and use it as a template to add a new property meaning per the specifications discussed above.
  • Add a reference to your property meaning to indexables.xml for the profile for your metadata. Make sure that the xpath for the property meaning in indexables.xml correctly references the xpath for the element in the metadata.
  • Save the files, and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
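
Taken together, the steps above could be sketched as the two fragments below. The name new.property and its attributes follow the example used in the CSW section of this topic. The xpath is purely hypothetical and stands in for the path to your own element, and a real definition copied from an existing meaning may carry additional attributes.

```xml
<!-- In property-meanings.xml: the new meaning, modeled on an existing one -->
<property-meaning name="new.property" valueType="String" comparisonType="value">
</property-meaning>

<!-- In the profile's indexables.xml: connect the element's xpath to the
     new meaning (hypothetical xpath, replace with your own) -->
<property meaning="new.property"
  xpath="/metadata/myCustomSection/myElement"/>
```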

Configure the property meaning to be returned in the CSW response

If you would like your new property meaning information to be returned when folks query your geoportal through CSW, then there are some additional steps. You will add the property meanings you want returned to the brief, summary, and/or full CSW response. You will also map the property to an appropriate Dublin Core element of your choosing (dct:references in this example; see http://dublincore.org/documents/dcmi-terms/) so it can be returned. Follow the steps below:

  • In property-meanings.xml, find the brief, summary, and full section.
  • Add the property meaning that you want returned in the CSW response to the list(s) of meanings in the meaning-names tags for the brief, summary, and/or full section (depending on whether you want to include it in the brief, summary, or full response). An example for a property meaning called new.property is shown below for the summary section; note that it doesn't really matter which of the meaning-names tags you put the new property in.
<summary>
  <dc>
    <meaning-names>fileIdentifier,uuid,title</meaning-names>
    <meaning-names>contentType,dataTheme</meaning-names>
    <meaning-names>dateModified,abstract</meaning-names>
    <meaning-names>resource.url,website.url,thumbnail.url,xml.url</meaning-names>
    <meaning-names>geometry,date,relation,new.property</meaning-names>
  </dc>
</summary>
  • Now, go to the place in the property-meanings.xml file where the property meaning of interest is defined, for example:
<property-meaning name="new.property" valueType="String" comparisonType="value">
</property-meaning>
  • Map the new property to a Dublin Core element by adding a dc element that is appropriate to your property meaning. Note you'll need to update the scheme attribute as well - e.g., scheme="urn:x-esri:specification:new.property". Because the CSW GetRecords response typically provides Dublin Core elements by default, this maps your property meaning into an acceptable Dublin Core element. The dct:references can be used, or another of your choosing - the element you choose will be holding your property's information in the response. See the example below:
<property-meaning name="new.property" valueType="String" comparisonType="value">
  <dc name="dct:references" scheme="urn:x-esri:specification:new.property"/>
</property-meaning>
  • Save the property-meanings.xml file and restart your geoportal web application. Now when you post a CSW query, you should be able to see the property information in the response in the section corresponding to the dc element you chose.
