-
Notifications
You must be signed in to change notification settings - Fork 22
Fields (Spider)
[Table of Contents](https://github.com/dell-oss/Doradus/wiki/Spider Databases: Table-of-Contents) | [Previous](https://github.com/dell-oss/Doradus/wiki/Object IDs (Spider)) | Next
Spider Data Model: Fields
All fields other than _ID
are user-defined. Every field has a type, which determines the type of values it holds. Field types fall into three categories: scalar, link, and group.
Scalar fields store simple data such as numbers or text. A single-valued (SV) scalar field stores a single value per object. Multi-valued (MV) scalar fields, also called scalar collections, can store multiple values per object. The scalar field types supported by Doradus are summarized below:
-
Text: An arbitrary length string of Unicode characters.
-
Boolean: A logical true or false value.
-
Integer and Long: A signed 64-bit integer value. These two types are synonyms.
-
Float: A 32-bit floating-point value.
-
Double: A 64-bit floating-point value.
-
Timestamp: A date/time value with millisecond precision. Doradus treats all timestamp values as occurring in the UTC time zone. Timestamp values must be given in the format:
yyyy-MM-dd HH:mm:ss.SSS
where:
-
yyyy
is a 4-digit year between 0000 and 9999 -
MM
is a 1- or 2-digit month between 1 and 12 -
dd
is a 1- or 2-digit day-of-month between 1 and 31 -
HH
is a 1- or 2-digit hour between 0 and 23 -
mm
is a 1- or 2-digit minute between 0 and 59 -
ss
is a 1- or 2-digit second between 0 and 59 -
SSS
is a 1-to-3 digit millisecond between 0 and 999
Only the year component is required. All other components can be omitted (along with their preceding separator character) in right-to-left order. Omitted time elements default to 0; omitted date elements default to 1. Hence, the following timestamp values are all valid:
2011-02-01 08:50:01.123
2011-02-01 08:50:01 // same as 2011-02-01 08:50:01.000
2011-02-01 08:50 // same as 2011-02-01 08:50:00.000
2011-02-01 08 // same as 2011-02-01 08:00:00.000
2011-02-01 // same as 2011-02-01 00:00:00.000
2011-02 // same as 2011-02-01 00:00:00.000
2011 // same as 2011-01-01 00:00:00.000
-
Binary: An arbitrary length sequence of bytes. A binary field definition includes an encoding (
Base64
orHex
) that defines how values are encoded when sent to or returned by Doradus.
When type names are used in schema definitions, they are case-insensitive (e.g., integer
or INTEGER
).
By default, scalar fields are SV. Assigning an SV scalar field for an existing object replaces the existing value, if present. A scalar field can be declared MV by setting its collection
property to true
. Example:
<field name="Tags" collection="true" type="TEXT"/>
MV scalar field values are added and removed individually and treated as a set: duplicate values are not stored. This preserves idempotent update semantics: adding the same value twice is a no-op.
Doradus Spider uses analyzers to determine how to store and index scalar fields. The analyzer controls two aspects of scalar field management:
-
Indexing: The analyzer determines whether or not the field’s value are indexed.
-
Term extraction: If values are indexed, the analyzer generates one or more terms from each field value that are used to create index records. In other words, the analyzer tokenizes the field for indexing.
Every scalar field defined in the schema is assigned a default analyzer that is optimized for the scalar’s field type. Optionally, the field can explicitly declare an analyzer
property that assigns the default or an alternate analyzer as allowed by the chart below:
Field Type | Default Analyzer | Alternate Analyzers |
---|---|---|
Text | TextAnalyzer |
NullAnalyzer, OpaqueTextAnalyzer, HTMLAnalyzer |
Integer | IntegerAnalyzer |
NullAnalyzer |
Long | IntegerAnalyzer |
NullAnalyzer |
Boolean | BooleanAnalyzer |
NullAnalyzer |
Timestamp | DateAnalyzer |
NullAnalyzer |
Binary | NullAnalyzer |
Every scalar field type can declare the NullAnalyzer
to prevent the field from being indexed. Text fields can assign the OpaqueTextAnalyzer
to index the entire field as a single value instead of being tokenized into terms, or they can assign the HTMLAnalyzer
to index the field as HTML text. Binary fields are not indexed and must always use the NullAnalyzer
.
When a scalar field is MV, all values are indexed. For example, consider the following declaration:
<field name="Quotes" type="Text" collection="true" analyzer="TextAnalyzer"/>
Suppose an object is inserted with the following three Quotes
values:
{"What time is it?", "Life is a Beach", "It's Happy Hour!"}
Using the TextAnalyzer
, each unique term is indexed once ("a", “happy”, “is”, etc.), and each whole value (“life is a beach”) is also indexed. Therefore phrase clauses such as Quotes:Happy
will select the object, as will equality searches such as Quotes="life is a beach"
. If the analyzer is set to OpaqueTextAnalyzer
, only whole values are indexed, hence the equality clause will select the object but the term clause won’t. Details of each analyzer are described in the next sections.
The TextAnalyzer
assumes that text values are "plain text" (i.e., no metadata or mark-up). For each field value it tokenizes, the TextAnalyzer
generates zero or more terms as lowercased letter/digit/apostrophe sequences separated by consecutive whitespace/punctuation sequences. That is, each contiguous sequence of letters, digits, and/or apostrophes becomes a term. A term can contain an apostrophe, but it cannot begin or end with an apostrophe. For example, the apostrophe is included in doesn’t, but outer apostrophes in the sequence ‘tough’ are excluded, yielding the term tough. An apostrophe is any of the following characters:
-
The Unicode APOSTROPHE (0x27)
-
The Windows right single quote (0x92)
-
The Unicode RIGHT SINGLE QUOTATATION MARK (0x2019)
As example of how text fields are tokenized by the TextAnalyzer
, suppose an email object is created with the following text field values, all indexed with TextAnalyzer
:
From: John Smith
To: Betty Sue
Subject: The Office Move
Body: Hi Betty,
Just a reminder that you’re scheduled to move to your "fancy" new office tomorrow, number B413. If you have any questions, please let me know.
Thanks, John.
The TextAnalyze``r
indexes these fields to generate the following terms:
Field Name | Terms |
---|---|
From |
john smith |
To |
betty sue |
Subject |
move office the |
Body |
a any b413 betty fancy have hi if john just know let me move new number office please questions reminder scheduled thanks that to tomorrow you your you're |
As shown, terms are extracted in lowercase, and punctuation and whitespace are removed. As part of down-casing, the TextAnalyzer
converts any apostrophe retained within a term to the "straight apostrophe" character (0x27.) Although a term may appear multiple times within a field, it is indexed but once.
Though not shown above, the TextAnalyzer
also creates a term equal to the term’s entire field value, down-cased, and enclosed in single quotes. This value is used as an optimization for equality searches. For example, the whole-field value for the field From
is 'john smith'
. For text fields with large values, this "whole field" value is created as an MD5 value instead of the literal text.
The terms generated by the TextAnalyzer
allow efficient execution of a wide range of full text queries: single terms, phrases, wildcard terms, range clauses, etc. Searches are performed without case sensitivity: for example the phrase "You’re Scheduled to MOVE" will match the Body
field shown above.
The IntegerAnalyzer
is the default analyzer for Integer/Long scalar fields. It creates terms using trie values, which allow efficient searching of value ranges in addition to exact value matching. For example, consider the following range clause:
Size:[500 TO 10000]
Text-based searches don’t work on numeric fields: "500" is actually greater than “10000” when compared as text. If only exact values were stored for
Size`, the range clause above would require searching 9501 separate values. But using trie values, this range clause requires searching no more than 77 lookups.
The BooleanAnalyzer
is the default analyzer for Boolean scalar fields. It creates one term record that indexes all fields with a "true" value and another term record for “false” values. This allows all objects with either value to be quickly found.
The DateAnalyzer
is the default analyzer for Timestamp scalar fields. Similar to the IntegerAnalyzer
, it indexes timestamp values by creating trie values, which allow efficient execution of range clauses. The trie values are chosen to efficiently find timestamp values that match values with various granularities. For example, a search for all field values that fall in a given year, month, date, or date+time requires only one lookup.
Any scalar field can be assigned the NullAnalyzer
, which prevents the corresponding field from being indexed. Values for the field are stored as-is and are not indexed, thereby reducing storage space. Searching is not allowed for un-indexed fields. Hence, fields assigned the NullAnalyzer
are "stored only". Note that binary fields are not indexed and must use the NullAnalyzer
.
Text fields can optionally be assigned the OpaqueTextAnalyzer
. This causes the corresponding field values to be indexed as a single, opaque, down-cased value instead of as a series of terms. For example, if the text field UserDomain
is assigned the OpaqueTextAnalyzer
, the field value "NT AUTHORITY
" will only match exact value searches for “nt authority
”, “NT authority
”, etc., but not term clauses for NT
or authority
.
Text fields can optionally be assigned the HTMLAnalyzer
. This analyzer indexes the element content of any HTML elements it finds in the text. Tag names and element attributes are ignored. The content of all elements is logically concatenated into a single text field, which is then indexed with the same process as the TextAnalyzer
. For example, consider the following HTML document:
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<body>
<div name="Body">Hi Betty,
<p>Just a reminder that you’re scheduled to move to your "fancy" new office tomorrow, number <b>B413</b>. If you have any questions, please let me know.
Thanks, John.
</div>
</body>
</html>
If this HTML document is the value of a field named Body
, it generates the exact same terms as in the plain text example shown for the TextAnalyzer
. Specifically, the terms generated are:
a any b413 betty fancy have hi if john just know let me move new number office please questions reminder scheduled thanks that to tomorrow you your you're
Link fields are pointers that create inter-object relationships. All relationships are bi-directional, therefore every link has an inverse link that defines the same relationship from the opposite direction. A link and its inverse link can be in the same table or they can reside in different tables. An example link declaration in XML is shown below:
<table name="Participant">
<fields>
<field name="MessageAddress" table="Address" type="LINK" inverse="Messages"/>
...
</fields>
</table>
In this example, the link field MessageAddress
is owned by the Participant
table and points to the Address
table, whose inverse link is called Messages
. The table to which a link points is called the link’s extent.
Link fields are always MV: the collection
property, if set, is ignored. A link’s values are IDs of objects that belong to its extent table. Relationships are created or deleted by adding IDs to or removing IDs from the link field. Like MV scalar fields, link values are sets, hence duplicates are ignored.
Because relationships are bi-directional, when a link is updated, its inverse link is automatically updated at the same time. For example, if an object ID is added to MessageAddress
, connecting the owning participant object to a specific Address
object, the Messages
link for that address object is updated to point back to the same participant.
One side-effect of this referential integrity behavior is that objects can be *implicitly *created: if an object ID is added to MessageAddress
and the corresponding person doesn’t exist, it is created. An implicitly-created object will only have an _ID
and automatically-updated link field value(s).
If a link’s owner and extent are the same table, the relationship is reflexive. An example reflexive relationship is the Manager
/DirectReports
relationship in the Person
table:
<table name="Person">
<fields>
<field name="DirectReports" table="Person" type="LINK" inverse="Manager"/>
<field name="Manager" table="Person" type="LINK" inverse="DirectReports"/>
...
</fields>
</table>
A link can also be its own inverse: such relationships are called self-reflexive. For example, we could define spouse and friends as self-reflexive relationships (though some may argue friendship is not always reciprocal).
The table sharding feature described earlier can benefit certain high fan-out links. These are links that have a large number of values (thousands to millions). When a high fan-out link points to a sharded table, the link can also be declared as sharded by settings its sharded
property to true. Example:
<table name="Address">
<fields>
<field name="Name" type="TEXT"/>
<field name="Messages" type="LINK" inverse="MessageAddress" table="Participant" sharded="true"/>
...
</fields>
</table>
In this example, the Messages
link connects each Address
object to related Participant
objects. Since a participant exists for each sender or recipient of every message, some address objects could be linked to thousands or millions of participant objects. Because the Messages
extent table, Address
, is declared as sharded, the Messages
link can be declared as sharded. That is, only links whose target table is sharded can be declared as sharded.
A sharded link’s values are stored in sharded term vectors similarly as sharded scalar fields. This improves performance for queries that include a selection clause on the link’s owning table and a clause that uses the referenced table’s sharding field. Example:
GET /Msgs/Address/_query?q=Name:dell AND Messages.ReceiptDate=[2013-01-01 TO 2013-01-31]
This query searches for Address
objects whose Name
contains the term "dell" and that are connected to participants whose ReceiptDate
is in January, 2013. In other words, this query lists all addresses that sent or received a message in a specific time frame. Queries that fit this pattern are more efficient for high fan-out links declared as sharded.
A group field is a named field that contains one or more other fields. The contained fields are called nested fields and may be scalar, link, or group fields. The group field itself does not hold any values; values are stored by the contained leaf scalar and link fields. Note that all field names within a table must be unique, even for those contained within a group field. An example group field is shown below:
<table name="Person">
<fields> <!-- fields belonging to Person -->
<field name="Location">
<fields> <!-- fields belong to Location -->
<field name="Department" type="TEXT"/>
<field name="Office" type="TEXT"/>
</fields>
</field>
...
</fields>
</table>
Here, the group field Location
contains the nested text fields Department
and Office
. In a query, the values of both nested fields are returned when the group field Location
is requested. Below is another example consisting of link fields:
<table name="Message">
<fields>
<field name="Participants">
<fields>
<field name="Sender" type="LINK" table="Participant" inverse="MessageAsSender"/>
<field name="Recipients">
<fields>
<field name="ExternalRecipients" type="link" table="Participant" inverse="MessageAsExternalRecipient"/>
<field name="InternalRecipients" type="link" table="Participant" inverse="MessageAsInternalRecipient"/>
</fields>
</field>
</fields>
</field>
...
</fields>
</table>
In this example, the group field Participants
contains a link called Sender
and a nested group called Recipients
, which has two additional links ExternalRecipients
and InternalRecipients
. The values of all three link fields can be retrieved by requesting the Participants
in a query.
Doradus Spider allows fields to be dynamically added to objects. That is, objects in Add Batch and Update Batch commands can assign values to fields not declared in the corresponding table. Dynamic fields are treated as text and indexed using the TextAnalyzer
. Dynamic fields can be referenced in query clauses and returned in query results, though they cannot be used as grouping fields in aggregate queries.
Technical Documentation
[Doradus OLAP Databases](https://github.com/dell-oss/Doradus/wiki/Doradus OLAP Databases)
- Architecture
- OLAP Database Overview
- OLAP Data Model
- Doradus Query Language (DQL)
- OLAP Object Queries
- OLAP Aggregate Queries
- OLAP REST Commands
- Architecture
- Spider Database Overview
- Spider Data Model
- Doradus Query Language (DQL)
- Spider Object Queries
- Spider Aggregate Queries
- Spider REST Commands
- [Installing and Running Doradus](https://github.com/dell-oss/Doradus/wiki/Installing and Running Doradus)
- [Deployment Guidelines](https://github.com/dell-oss/Doradus/wiki/Deployment Guidelines)
- [Doradus Configuration and Operation](https://github.com/dell-oss/Doradus/wiki/Doradus Configuration and Operation)
- [Cassandra Configuration and Operation](https://github.com/dell-oss/Doradus/wiki/Cassandra Configuration and Operation)