
Table (Spider)

JoeWinter edited this page Feb 18, 2015 · 5 revisions

Spider Data Model: Table


A table is a named set of objects. Table names are identifiers and must be unique within the same application. Example table names are: Message, LogSnapshot, and Security_4xx_Events.

A table can include the following components:

  • Options: Table-level options (described below).

  • Fields: Definitions of scalar, link, and group fields that the table uses.

  • Aliases: Alias definitions, which are schema-defined expressions that can then be used in queries.

The general structure of a table definition in XML is shown below:

<tables>
	<table name="Message">
		<fields>
			<!-- fields -->
		</fields>
		<aliases>
			<!-- aliases -->
		</aliases>
	</table>
	...
</tables>

The same structure in JSON is shown below:

"tables": {
	"Message": {
		"fields": {
			// fields
		},
		"aliases": {
			// aliases
		}
	},
	...
}

Table Options

Spider applications support table-level options, which enable two features: automatic data aging and sharding. Below is an example in XML:

<table name="Message">
	<options>
		<option name="aging-field">SendDate</option>
		<option name="retention-age">5 YEARS</option>
		<option name="sharding-field">SendDate</option>
		<option name="sharding-granularity">DAY</option>
		<option name="sharding-start">2010-07-17</option>
	</options>
	...
</table>

Data Aging Options

Data aging causes objects within the table to be deleted when a timestamp field reaches a defined age. Aging is performed by a background task whose schedule can be controlled. An object is deleted when the data-aging task executes and finds that the object's age is equal to or greater than the defined age.

Data aging is controlled by the following options:

  • aging-field: Defines the field to use for data aging. It is required if a non-zero retention-age is specified. The aging field must be defined in the table’s schema, and its type must be timestamp.

  • retention-age: Enables data aging and defines the retention age. It must be specified in the format:

    <value> [<units>]

    Where <value> is a positive integer and <units>, if provided, is days, months, or years; days is the default. An object’s age is the difference between "now" (when the aging task executes) and the object’s aging field value. When this age is greater than the retention-age, the object is deleted. If retention-age is set to 0, aging is disabled.

  • aging-check-frequency: This option specifies how often a background task should check the table for expired objects. At the table level, this option overrides the default value, if specified, at the application level. The value of this option must be in the form:

    <value> <units>

    Where <value> is a positive integer and <units> is MINUTES, HOURS, or DAYS. (Singular forms of these mnemonics are also allowed.) If a non-zero retention-age is specified but aging-check-frequency is not specified, it defaults to 1 DAY.

When data aging is enabled, each object’s aging field can be modified at any time. An object is eligible for deletion only when its aging field has a value; objects whose aging field is null are not aged.
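The retention-age rules above can be sketched in Python. This is a minimal illustration, not the Doradus implementation: the function names (`parse_retention_age`, `aging_cutoff`, `is_expired`) are hypothetical, case-insensitive unit matching is an assumption, and so is the calendar arithmetic used for months and years. The sketch deletes only when the age is strictly greater than the retention age; how the exact boundary is handled is also an assumption.

```python
import calendar
import re
from datetime import datetime, timedelta

# Format from the text: <value> [<units>], units = days | months | years.
# Case-insensitive matching is an assumption (the XML example uses "5 YEARS").
_RETENTION_RE = re.compile(r"^\s*(\d+)(?:\s+(days|months|years))?\s*$", re.IGNORECASE)


def parse_retention_age(text):
    """Return (value, units) for a retention-age string; units default to days."""
    match = _RETENTION_RE.match(text)
    if match is None:
        raise ValueError("invalid retention-age: %r" % text)
    return int(match.group(1)), (match.group(2) or "days").lower()


def aging_cutoff(now, value, units):
    """Return the timestamp that lies exactly one retention age before 'now'."""
    if units == "days":
        return now - timedelta(days=value)
    # Assumption: months and years are subtracted as calendar units,
    # clamping the day to the last day of the target month.
    months = value * 12 if units == "years" else value
    index = now.year * 12 + (now.month - 1) - months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return now.replace(year=year, month=month0 + 1, day=min(now.day, last_day))


def is_expired(aging_value, now, retention_age):
    """True if an object with this aging-field value would be aged out at 'now'."""
    value, units = parse_retention_age(retention_age)
    if value == 0 or aging_value is None:
        return False  # aging disabled, or the aging field has no value
    return aging_value < aging_cutoff(now, value, units)
```

For example, with `retention-age` set to `5 YEARS` and the aging task running on 2015-02-18, an object whose aging field is 2010-01-01 is past the cutoff, while one dated 2014-01-01 is not.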

Table Sharding Options

Table sharding improves the performance of certain queries for tables with large populations (millions of objects or more). To benefit from sharding, a table must meet the following conditions:

  • Objects have a timestamp field whose value is stable, meaning it is rarely modified. In the example schema, the Message table’s SendDate field works well because it is rarely modified once a message is created. This timestamp field is used as the sharding field.

  • Queries include an equality clause or range clause that uses the sharding field. For example, both of the following queries select objects in specific time frames:

      GET /Msgs/Message/_query?q=SendDate=PERIOD().LASTWEEK AND ...
      GET /Msgs/Message/_query?q=SendDate=[2014-01-01 TO 2014-03-01] AND ...
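When such a query is sent from a client program, the `q` parameter should be URL-encoded because it contains spaces and brackets. The sketch below builds the request URL for the range query above. The base URL (`http://localhost:1123`) and the helper name `query_url` are assumptions for illustration; substitute the address of your own Doradus server.

```python
from urllib.parse import quote

# Hypothetical server address; replace with your own Doradus host and port.
BASE_URL = "http://localhost:1123"


def query_url(application, table, query):
    """Build an object-query URL with each path segment and the query encoded."""
    return "%s/%s/%s/_query?q=%s" % (
        BASE_URL,
        quote(application, safe=""),
        quote(table, safe=""),
        quote(query, safe=""),
    )


# Range clause on the sharding field, taken from the example above.
url = query_url("Msgs", "Message", "SendDate=[2014-01-01 TO 2014-03-01]")
print(url)
```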
    

Normally, Doradus Spider creates a single term vector for each field/term combination. For example, the term vector with key "Body/the" holds references to all objects that use the term “the” in the field Body. For common terms, the term vector may point to every object in the table, and very large term vectors slow query performance. When sharding is enabled, separate term vectors are created for objects in each shard. Queries that include the sharding field can then search only the term vectors of the relevant shards, which is faster.
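The effect of sharding on term vectors can be illustrated with a toy in-memory index. This is only a conceptual sketch: the `(shard, "Field/term")` key layout, the `build_term_vectors` function, and the sample messages are all hypothetical and do not reflect how Doradus actually stores term vectors.

```python
from collections import defaultdict

# Hypothetical sample data: object ID -> (month of SendDate, Body text).
MESSAGES = {
    "m1": ("2014-01", "the quick fox"),
    "m2": ("2014-02", "the lazy dog"),
    "m3": ("2014-02", "over the hill"),
}


def build_term_vectors(objects, sharded):
    """Map (shard, "Body/<term>") to the set of object IDs that use the term."""
    vectors = defaultdict(set)
    for obj_id, (month, body) in objects.items():
        # Unsharded tables keep everything in one shard (0 in this sketch).
        shard = month if sharded else 0
        for term in body.lower().split():
            vectors[(shard, "Body/" + term)].add(obj_id)
    return vectors


unsharded = build_term_vectors(MESSAGES, sharded=False)
sharded = build_term_vectors(MESSAGES, sharded=True)

# Without sharding, one term vector references every object that uses "the".
print(sorted(unsharded[(0, "Body/the")]))
# With sharding, a query restricted to February 2014 reads a smaller vector.
print(sorted(sharded[("2014-02", "Body/the")]))
```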

Sharding is enabled with the following table-level options:

  • sharding-field: This option enables sharding and identifies the sharding field. Its value must be a timestamp field defined in the schema.

  • sharding-granularity: This option specifies what time period causes objects to be assigned to a new shard. It can be HOUR, DAY, WEEK, or MONTH. If not specified, it defaults to MONTH. The value should be chosen so that each shard has a reasonable number of objects (fewer than 1 million).

  • sharding-start: This option specifies the date on which sharding begins for the table. Objects whose sharding-field value is null or less than the sharding-start value are considered "un-sharded" and assigned to shard #0. Objects whose sharding field is greater than or equal to the sharding-start value are assigned a shard number based on the difference between the two values and the sharding-granularity. If not explicitly assigned, sharding-start defaults to “now”, meaning the timestamp of the schema change that enables sharding.

Each object’s sharding field value can be modified at any time. If the modified value does not cause the object to be assigned a new shard number, the update is efficient. However, if the sharding field is assigned a value that changes the object’s shard number, the update is slower since the object’s fields are re-indexed.
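The shard assignment rules can be sketched as follows. The text states only that the shard number is based on the difference between the sharding field and sharding-start together with the granularity, so the exact formula here is an assumption: shard 1 starts at sharding-start, fixed-length periods are used for HOUR, DAY, and WEEK, and calendar-month differences are used for MONTH. The function names are hypothetical.

```python
from datetime import datetime

# Length in seconds of the fixed-size granularities.
_PERIOD_SECONDS = {"HOUR": 3600, "DAY": 86400, "WEEK": 7 * 86400}


def shard_number(value, start, granularity="MONTH"):
    """Return the shard number for a sharding-field value (0 means un-sharded)."""
    if value is None or value < start:
        return 0  # null or earlier than sharding-start: shard #0
    if granularity == "MONTH":
        # Assumption: count calendar-month boundaries crossed since the start.
        return (value.year - start.year) * 12 + (value.month - start.month) + 1
    if granularity not in _PERIOD_SECONDS:
        raise ValueError("unknown sharding-granularity: %r" % granularity)
    elapsed = (value - start).total_seconds()
    return int(elapsed // _PERIOD_SECONDS[granularity]) + 1


def update_changes_shard(old_value, new_value, start, granularity="MONTH"):
    """True if modifying the sharding field would move the object to a new shard."""
    return shard_number(old_value, start, granularity) != shard_number(
        new_value, start, granularity
    )
```

With `sharding-start` set to 2010-07-17 and DAY granularity, changing a message's SendDate from 09:00 to 17:00 on the same day keeps it in the same shard, whereas moving it to the next day changes the shard number and triggers the slower re-indexing path described above.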

Table sharding can also benefit certain links that have very high fan-outs. See the later section on Sharded Links.
