Skip to content

Fields (Spider)

Randy Guck edited this page Sep 26, 2014 · 2 revisions

[Table of Contents](https://github.com/dell-oss/Doradus/wiki/Spider Databases: Table-of-Contents) | [Previous](https://github.com/dell-oss/Doradus/wiki/Object IDs (Spider)) | Next
Spider Data Model: Fields


All fields other than _ID are user-defined. Every field has a type, which determines the type of values it holds. Field types fall into three categories: scalar, link, and group.

Scalar Fields

Scalar fields store simple data such as numbers or text. A single-valued (SV) scalar field stores a single value per object. Multi-valued (MV) scalar fields, also called scalar collections, can store multiple values per object. The scalar field types supported by Doradus are summarized below:

  • Text: An arbitrary length string of Unicode characters.

  • Boolean: A logical true or false value.

  • Integer and Long: A signed 64-bit integer value. These two types are synonyms.

  • Float: A 32-bit floating-point value.

  • Double: A 64-bit floating-point value.

  • Timestamp: A date/time value with millisecond precision. Doradus treats all timestamp values as occurring in the UTC time zone. Timestamp values must be given in the format:

yyyy-MM-dd HH:mm:ss.SSS

where:

  • yyyy is a 4-digit year between 0000 and 9999

  • MM is a 1- or 2-digit month between 1 and 12

  • dd is a 1- or 2-digit day-of-month between 1 and 31

  • HH is a 1- or 2-digit hour between 0 and 23

  • mm is a 1- or 2-digit minute between 0 and 59

  • ss is a 1- or 2-digit second between 0 and 59

  • SSS is a 1-to-3 digit millisecond between 0 and 999

Only the year component is required. All other components can be omitted (along with their preceding separator character) in right-to-left order. Omitted time elements default to 0; omitted date elements default to 1. Hence, the following timestamp values are all valid:

2011-02-01 08:50:01.123
2011-02-01 08:50:01	   // same as 2011-02-01 08:50:01.000
2011-02-01 08:50		  // same as 2011-02-01 08:50:00.000
2011-02-01 08			 // same as 2011-02-01 08:00:00.000
2011-02-01				// same as 2011-02-01 00:00:00.000
2011-02				   // same as 2011-02-01 00:00:00.000
2011	 				 // same as 2011-01-01 00:00:00.000
  • Binary: An arbitrary length sequence of bytes. A binary field definition includes an encoding (Base64 or Hex) that defines how values are encoded when sent to or returned by Doradus.

When type names are used in schema definitions, they are case-insensitive (e.g., integer or INTEGER).

The Collection Property

By default, scalar fields are SV. Assigning an SV scalar field for an existing object replaces the existing value, if present. A scalar field can be declared MV by setting its collection property to true. Example:

<field name="Tags" collection="true" type="TEXT"/>

MV scalar field values are added and removed individually and treated as a set: duplicate values are not stored. This preserves idempotent update semantics: adding the same value twice is a no-op.

The Analyzer Property

Doradus Spider uses analyzers to determine how to store and index scalar fields. The analyzer controls two aspects of scalar field management:

  • Indexing: The analyzer determines whether or not the field’s value are indexed.

  • Term extraction: If values are indexed, the analyzer generates one or more terms from each field value that are used to create index records. In other words, the analyzer tokenizes the field for indexing.

Every scalar field defined in the schema is assigned a default analyzer that is optimized for the scalar’s field type. Optionally, the field can explicitly declare an analyzer property that assigns the default or an alternate analyzer as allowed by the chart below:

Field Type Default Analyzer Alternate Analyzers
Text TextAnalyzer NullAnalyzer, OpaqueTextAnalyzer, HTMLAnalyzer
Integer IntegerAnalyzer NullAnalyzer
Long IntegerAnalyzer NullAnalyzer
Boolean BooleanAnalyzer NullAnalyzer
Timestamp DateAnalyzer NullAnalyzer
Binary NullAnalyzer

Every scalar field type can declare the NullAnalyzer to prevent the field from being indexed. Text fields can assign the OpaqueTextAnalyzer to index the entire field as a single value instead of being tokenized into terms, or they can assign the HTMLAnalyzer to index the field as HTML text. Binary fields are not indexed and must always use the NullAnalyzer.

When a scalar field is MV, all values are indexed. For example, consider the following declaration:

<field name="Quotes" type="Text" collection="true" analyzer="TextAnalyzer"/>

Suppose an object is inserted with the following three Quotes values:

{"What time is it?", "Life is a Beach", "It's Happy Hour!"}

Using the TextAnalyzer, each unique term is indexed once ("a", “happy”, “is”, etc.), and each whole value (“life is a beach”) is also indexed. Therefore phrase clauses such as Quotes:Happy will select the object, as will equality searches such as Quotes="life is a beach". If the analyzer is set to OpaqueTextAnalyzer, only whole values are indexed, hence the equality clause will select the object but the term clause won’t. Details of each analyzer are described in the next sections.

TextAnalyzer

The TextAnalyzer assumes that text values are "plain text" (i.e., no metadata or mark-up). For each field value it tokenizes, the TextAnalyzer generates zero or more terms as lowercased letter/digit/apostrophe sequences separated by consecutive whitespace/punctuation sequences. That is, each contiguous sequence of letters, digits, and/or apostrophes becomes a term. A term can contain an apostrophe, but it cannot begin or end with an apostrophe. For example, the apostrophe is included in doesn’t, but outer apostrophes in the sequence ‘tough’ are excluded, yielding the term tough. An apostrophe is any of the following characters:

  • The Unicode APOSTROPHE (0x27)

  • The Windows right single quote (0x92)

  • The Unicode RIGHT SINGLE QUOTATATION MARK (0x2019)

As example of how text fields are tokenized by the TextAnalyzer, suppose an email object is created with the following text field values, all indexed with TextAnalyzer:

From:	   John Smith

To:		 Betty Sue
Subject:	The Office Move
Body:	   Hi Betty,
Just a reminder that you’re scheduled to move to your "fancy" new office tomorrow, number B413. If you have any questions, please let me know.
Thanks, John.

The TextAnalyze``r indexes these fields to generate the following terms:

Field Name Terms
From john smith
To betty sue
Subject move office the
Body a any b413 betty fancy have hi if john just know let me move new number office please questions reminder scheduled thanks that to tomorrow you your you're

As shown, terms are extracted in lowercase, and punctuation and whitespace are removed. As part of down-casing, the TextAnalyzer converts any apostrophe retained within a term to the "straight apostrophe" character (0x27.) Although a term may appear multiple times within a field, it is indexed but once.

Though not shown above, the TextAnalyzer also creates a term equal to the term’s entire field value, down-cased, and enclosed in single quotes. This value is used as an optimization for equality searches. For example, the whole-field value for the field From is 'john smith'. For text fields with large values, this "whole field" value is created as an MD5 value instead of the literal text.

The terms generated by the TextAnalyzer allow efficient execution of a wide range of full text queries: single terms, phrases, wildcard terms, range clauses, etc. Searches are performed without case sensitivity: for example the phrase "You’re Scheduled to MOVE" will match the Body field shown above.

IntegerAnalyzer

The IntegerAnalyzer is the default analyzer for Integer/Long scalar fields. It creates terms using trie values, which allow efficient searching of value ranges in addition to exact value matching. For example, consider the following range clause:

Size:[500 TO 10000]

Text-based searches don’t work on numeric fields: "500" is actually greater than “10000” when compared as text. If only exact values were stored for Size`, the range clause above would require searching 9501 separate values. But using trie values, this range clause requires searching no more than 77 lookups.

BooleanAnalyzer

The BooleanAnalyzer is the default analyzer for Boolean scalar fields. It creates one term record that indexes all fields with a "true" value and another term record for “false” values. This allows all objects with either value to be quickly found.

DateAnalyzer

The DateAnalyzer is the default analyzer for Timestamp scalar fields. Similar to the IntegerAnalyzer, it indexes timestamp values by creating trie values, which allow efficient execution of range clauses. The trie values are chosen to efficiently find timestamp values that match values with various granularities. For example, a search for all field values that fall in a given year, month, date, or date+time requires only one lookup.

NullAnalyzer

Any scalar field can be assigned the NullAnalyzer, which prevents the corresponding field from being indexed. Values for the field are stored as-is and are not indexed, thereby reducing storage space. Searching is not allowed for un-indexed fields. Hence, fields assigned the NullAnalyzer are "stored only". Note that binary fields are not indexed and must use the NullAnalyzer.

OpaqueTextAnalyzer

Text fields can optionally be assigned the OpaqueTextAnalyzer. This causes the corresponding field values to be indexed as a single, opaque, down-cased value instead of as a series of terms. For example, if the text field UserDomain is assigned the OpaqueTextAnalyzer, the field value "NT AUTHORITY" will only match exact value searches for “nt authority”, “NT authority”, etc., but not term clauses for NT or authority.

HTMLAnalyzer

Text fields can optionally be assigned the HTMLAnalyzer. This analyzer indexes the element content of any HTML elements it finds in the text. Tag names and element attributes are ignored. The content of all elements is logically concatenated into a single text field, which is then indexed with the same process as the TextAnalyzer. For example, consider the following HTML document:

<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
	<body>
		<div name="Body">Hi Betty,
		<p>Just a reminder that you’re scheduled to move to your "fancy" new office tomorrow, number <b>B413</b>. If you have any questions, please let me know.
		Thanks, John.
		</div>
	</body>
</html>

If this HTML document is the value of a field named Body, it generates the exact same terms as in the plain text example shown for the TextAnalyzer. Specifically, the terms generated are:

a any b413 betty fancy have hi if john just know let me move new number office please questions reminder scheduled thanks that to tomorrow you your you're

Link Fields

Link fields are pointers that create inter-object relationships. All relationships are bi-directional, therefore every link has an inverse link that defines the same relationship from the opposite direction. A link and its inverse link can be in the same table or they can reside in different tables. An example link declaration in XML is shown below:

<table name="Participant">
	<fields>
		<field name="MessageAddress" table="Address" type="LINK" inverse="Messages"/>
		...
	</fields>
</table>

In this example, the link field MessageAddress is owned by the Participant table and points to the Address table, whose inverse link is called Messages. The table to which a link points is called the link’s extent.

Link fields are always MV: the collection property, if set, is ignored. A link’s values are IDs of objects that belong to its extent table. Relationships are created or deleted by adding IDs to or removing IDs from the link field. Like MV scalar fields, link values are sets, hence duplicates are ignored.

Because relationships are bi-directional, when a link is updated, its inverse link is automatically updated at the same time. For example, if an object ID is added to MessageAddress, connecting the owning participant object to a specific Address object, the Messages link for that address object is updated to point back to the same participant.

One side-effect of this referential integrity behavior is that objects can be *implicitly *created: if an object ID is added to MessageAddress and the corresponding person doesn’t exist, it is created. An implicitly-created object will only have an _ID and automatically-updated link field value(s).

If a link’s owner and extent are the same table, the relationship is reflexive. An example reflexive relationship is the Manager/DirectReports relationship in the Person table:

<table name="Person">
	<fields>
		<field name="DirectReports" table="Person" type="LINK" inverse="Manager"/>
		<field name="Manager" table="Person" type="LINK" inverse="DirectReports"/>
		...
	</fields>
</table>

A link can also be its own inverse: such relationships are called self-reflexive. For example, we could define spouse and friends as self-reflexive relationships (though some may argue friendship is not always reciprocal).

The Sharded Property

The table sharding feature described earlier can benefit certain high fan-out links. These are links that have a large number of values (thousands to millions). When a high fan-out link points to a sharded table, the link can also be declared as sharded by settings its sharded property to true. Example:

<table name="Address">
	<fields>
		<field name="Name" type="TEXT"/>
		<field name="Messages" type="LINK" inverse="MessageAddress" table="Participant" sharded="true"/>
		...
	</fields>
</table>

In this example, the Messages link connects each Address object to related Participant objects. Since a participant exists for each sender or recipient of every message, some address objects could be linked to thousands or millions of participant objects. Because the Messages extent table, Address, is declared as sharded, the Messages link can be declared as sharded. That is, only links whose target table is sharded can be declared as sharded.

A sharded link’s values are stored in sharded term vectors similarly as sharded scalar fields. This improves performance for queries that include a selection clause on the link’s owning table and a clause that uses the referenced table’s sharding field. Example:

GET /Msgs/Address/_query?q=Name:dell AND Messages.ReceiptDate=[2013-01-01 TO 2013-01-31]

This query searches for Address objects whose Name contains the term "dell" and that are connected to participants whose ReceiptDate is in January, 2013. In other words, this query lists all addresses that sent or received a message in a specific time frame. Queries that fit this pattern are more efficient for high fan-out links declared as sharded.

Group Fields

A group field is a named field that contains one or more other fields. The contained fields are called nested fields and may be scalar, link, or group fields. The group field itself does not hold any values; values are stored by the contained leaf scalar and link fields. Note that all field names within a table must be unique, even for those contained within a group field. An example group field is shown below:

<table name="Person">
	<fields>                    <!-- fields belonging to Person -->
		<field name="Location">
			<fields>             <!-- fields belong to Location -->
				<field name="Department" type="TEXT"/>
				<field name="Office" type="TEXT"/>
			</fields>
		</field>
		...
	</fields>
</table>

Here, the group field Location contains the nested text fields Department and Office. In a query, the values of both nested fields are returned when the group field Location is requested. Below is another example consisting of link fields:

<table name="Message">
	<fields>
		<field name="Participants">
			<fields>
				<field name="Sender" type="LINK" table="Participant" inverse="MessageAsSender"/>
				<field name="Recipients">
					<fields>
						<field name="ExternalRecipients" type="link" table="Participant" inverse="MessageAsExternalRecipient"/>
						<field name="InternalRecipients" type="link" table="Participant" inverse="MessageAsInternalRecipient"/>
					</fields>
				</field>
			</fields>
		</field>
		...
	</fields>
</table>

In this example, the group field Participants contains a link called Sender and a nested group called Recipients, which has two additional links ExternalRecipients and InternalRecipients. The values of all three link fields can be retrieved by requesting the Participants in a query.

Dynamically-Added Fields

Doradus Spider allows fields to be dynamically added to objects. That is, objects in Add Batch and Update Batch commands can assign values to fields not declared in the corresponding table. Dynamic fields are treated as text and indexed using the TextAnalyzer. Dynamic fields can be referenced in query clauses and returned in query results, though they cannot be used as grouping fields in aggregate queries.

Clone this wiki locally