A record is the smallest unit you can load from and store in the database. Records come in three types:
- Document
- Vertex
- Edge
Documents are softly typed and are defined by schema types, but you can also use them in schema-less mode. Documents handle fields in a flexible manner. You can easily import and export them in JSON format. For example:
{
"name":"Jay",
"surname":"Miner",
"job":"Developer",
"creations":[{
"name":"Amiga 1000",
"company":"Commodore Inc."
},{
"name":"Amiga 500",
"company":"Commodore Inc."
}]
}
In graph databases, vertices (also: vertexes), or nodes, represent the main entities that hold the information. A vertex can be a patient, a company, or a product. Vertices are themselves documents with some additional features. This means they can contain embedded records and arbitrary properties exactly like documents. Vertices are connected with other vertices through edges.
An edge, or arc, is the connection between two vertices. Edges can be unidirectional or bidirectional. A single edge can connect only two vertices. Edges, like vertices, are also documents with additional features.
For more information on connecting vertices in general, see Relationships below.
When ArcadeDB generates a record, it auto-assigns a unique identifier called a Record ID, RID for short.
The syntax for the RID is the pound symbol (#) with the bucket identifier, colon (:), and the position like so:
#<bucket-identifier>:<record-position>.
- bucket-identifier: This number indicates the bucket id to which the record belongs. Positive numbers in the bucket identifier indicate persistent records. You can have up to 2,147,483,648 buckets in a database.
- record-position: This number defines the absolute position of the record in the bucket.
A special case is #-1:-1 symbolizing the null RID.
Note: The prefix character # is mandatory.
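To make the RID syntax concrete, here is a minimal Python sketch that parses the textual form into its two parts. The names `RID`, `NULL_RID` and `parse_rid` are hypothetical helpers for illustration only, not part of any ArcadeDB API.

```python
from typing import NamedTuple


class RID(NamedTuple):
    """Illustrative record identifier: bucket id plus position in the bucket."""
    bucket_id: int
    position: int

    def is_null(self) -> bool:
        return self == NULL_RID

    def __str__(self) -> str:
        return f"#{self.bucket_id}:{self.position}"


# The special null RID described above.
NULL_RID = RID(-1, -1)


def parse_rid(text: str) -> RID:
    """Parse '#<bucket-identifier>:<record-position>' into a RID tuple."""
    if not text.startswith("#"):
        raise ValueError(f"RID must start with '#': {text!r}")
    bucket_part, separator, position_part = text[1:].partition(":")
    if not separator:
        raise ValueError(f"RID must contain ':': {text!r}")
    return RID(int(bucket_part), int(position_part))


rid = parse_rid("#12:1000000")
print(rid.bucket_id, rid.position)  # 12 1000000
```

The sketch rejects input without the leading #, in line with the note that the prefix is mandatory.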
Each Record ID is immutable and universal, and is reused only when configured to do so (see bucketReuseSpaceMode).
Additionally, records can be accessed directly through their RIDs at O(1) complexity, which means the lookup time is constant and unaffected by database size.
For this reason, you don’t need to create a field to serve as the primary key as you do in relational databases.
Retrieving a record by RID has O(1) complexity.
This is possible because the RID itself encodes both the file a record is stored in and the position inside it.
In an RID such as #12:1000000, the bucket identifier (here 12) specifies the record’s associated file, while the record position (here 1000000) describes the position inside the file.
Bucket files are organized in pages (with a default size of 64KB), each holding a maximum number of records (2048 by default).
To locate a record in a bucket file, divide the record position by the maximum number of records per page: the quotient, rounded down, yields the page (here ⌊1000000 / 2048⌋ = 488), and the remainder gives the position on the page (here 1000000 % 2048 = 576).
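As a worked example of this arithmetic, the following Python sketch computes the page and the position on the page for the record position used above. The function name `locate` is a hypothetical helper; only the default of 2048 records per page comes from the text.

```python
DEFAULT_MAX_RECORDS_PER_PAGE = 2048  # default stated in the text


def locate(position: int, max_records_per_page: int = DEFAULT_MAX_RECORDS_PER_PAGE):
    """Return (page_id, position_in_page) for a record position."""
    # divmod returns the rounded-down quotient and the remainder in one step.
    page_id, position_in_page = divmod(position, max_records_per_page)
    return page_id, position_in_page


print(locate(1_000_000))  # (488, 576)
```

Because both values follow from two integer operations, the cost does not depend on how many records the bucket holds.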
In pseudo-code this computation is given by:
// the quotient is rounded down; the remainder needs no rounding
int pageId = floor(rid.getPosition() / maxRecordsInPage);
int positionInPage = rid.getPosition() % maxRecordsInPage;
The concept of type is taken from the Object Oriented Programming paradigm, where it is sometimes known as a "Class". In ArcadeDB, types define records. A type is closest to the concept of a "Table" in relational databases and a "Class" in an object database.
Types can be schema-less, schema-full, or a mix. They can inherit from other types, creating a tree of types. Inheritance, in this context, means that a subtype extends a parent type, inheriting all of its properties and attributes. Practically, this is done by extending a type or setting a super-type.
Each type has its own buckets (data files). A type can support multiple buckets. When you execute a query against a type, it automatically fetches from all the buckets that are part of the type. When you create a new record, ArcadeDB selects the bucket to store it in using a configurable strategy.
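The relationship between a type and its buckets can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `ToyType` is a hypothetical class, the round-robin selection stands in for whatever strategy is actually configured, and the bucket names are illustrative.

```python
from itertools import cycle


class ToyType:
    """In-memory stand-in for a type backed by several buckets."""

    def __init__(self, name: str, bucket_count: int):
        self.name = name
        # One list per bucket; names here are illustrative only.
        self.buckets = {f"{name}_{i}": [] for i in range(bucket_count)}
        # Assumed selection strategy: plain round-robin over the buckets.
        self._next_bucket = cycle(list(self.buckets))

    def insert(self, record: dict) -> str:
        """Store the record in the next bucket and return that bucket's name."""
        bucket_name = next(self._next_bucket)
        self.buckets[bucket_name].append(record)
        return bucket_name

    def scan(self):
        """A query against the type reads every bucket that belongs to it."""
        for records in self.buckets.values():
            yield from records


beer = ToyType("Beer", 2)
print([beer.insert({"n": i}) for i in range(4)])
# ['Beer_0', 'Beer_1', 'Beer_0', 'Beer_1']
print(len(list(beer.scan())))  # 4
```

The point of the model is that callers address the type, while the distribution over buckets happens behind it.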
By default, ArcadeDB creates one bucket per type, but this can be changed, for example to as many buckets as the host machine has cores (processors); see typeDefaultBuckets.
In this way, CRUD operations can run at full speed in parallel with zero contention between CPUs and/or cores.
Having many buckets per type means having more files at the file system level.
Check whether your operating system limits the number of files that can exist and be open at the same time (ulimit on Unix-like systems).
Note: You can query the defined types by executing the following SQL query: select from schema:types.
Where types provide you with a logical framework for organizing data, buckets provide physical or in-memory space in which ArcadeDB actually stores the data. Each bucket is one file at file system level. It is comparable to the "collection" in Document databases, the "table" in Relational databases and the "cluster" in OrientDB. You can have up to 2,147,483,648 buckets in a database.
A bucket can be part of only one type. This means two types cannot share the same bucket. Also, sub-types have buckets that are separate from those of their super-types.
When you create a new type, the CREATE TYPE statement automatically creates the physical buckets (files) that serve as the default location in which to store data for that type.
ArcadeDB forms the bucket names by using the type name + underscore + a sequential number starting from 0. For example, the first bucket for the type Beer will be Beer_0 and the corresponding file in the file system will be Beer_0.31.65536.bucket.
By default ArcadeDB creates one bucket per type.
For massive inserts, performance can be improved by creating additional buckets and hence taking advantage of parallelism, for example by creating one bucket for each CPU core on the server.
The combination of types and buckets is very powerful and has a number of use cases. In most cases, you can work with types alone and you will be fine. But if you are able to split your database into multiple buckets, you can address a specific bucket instead of the entire type. Using buckets wisely to divide your database in a way that helps retrieval means using fewer indexes, or none at all. Indexes slow down insertion and take space on disk and in RAM. In most cases you need indexes to speed up your queries, but in some use cases you can totally or partially avoid indexes and still get good query performance.
Consider an example where you create a type Invoice with one bucket per year: Invoice_2015 and Invoice_2016.
You can query all invoices using the type as a target with the SELECT statement.
SELECT FROM Invoice
In addition to this, you can filter the result set by the year.
Since the type Invoice includes a year field, you can filter on it through the WHERE clause.
SELECT FROM Invoice WHERE year = 2016
You can also query specific records from a single bucket.
By splitting the type Invoice across multiple buckets (that is, one per year in our example), you can optimize the query by narrowing the potential result set.
SELECT FROM BUCKET:Invoice_2016
By using the explicit bucket instead of the logical type, this query runs significantly faster, because ArcadeDB can narrow the search to the targeted bucket.
No index is needed on the year, because all the invoices for year 2016 will be stored in the bucket Invoice_2016 by the application.
As in the example above, we could split our records by location, creating one bucket per location. Example:
CREATE BUCKET Customer_Europe
CREATE BUCKET Customer_Americas
CREATE BUCKET Customer_Asia
CREATE BUCKET Customer_Other
CREATE VERTEX TYPE Customer BUCKET Customer_Europe,Customer_Americas,Customer_Asia,Customer_Other
Here we are using the graph model by creating a vertex type, but it works the same way with documents; use CREATE DOCUMENT TYPE instead.
Now in your application, store the vertices or documents in the right bucket, based on the customer’s location. You can use any API and set the bucket. If you’re using SQL, this is how you can insert a new customer into a specific bucket:
INSERT INTO BUCKET:Customer_Europe CONTENT { firstName: 'Enzo', lastName: 'Ferrari' }
Since a bucket can only be part of one type, when you use the bucket notation with SQL, the type is inferred from the bucket, "Customer" in this case.
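The application-side routing described here could look like the following Python sketch. The country-to-bucket mapping and the helper names `bucket_for` and `insert_statement` are hypothetical, and the sketch assumes that a JSON literal is accepted after CONTENT. It only builds the statement text and does not talk to a database.

```python
import json

# Hypothetical mapping from a customer's country to the buckets created above.
COUNTRY_BUCKETS = {
    "Italy": "Customer_Europe",
    "Spain": "Customer_Europe",
    "Brazil": "Customer_Americas",
    "Japan": "Customer_Asia",
}
FALLBACK_BUCKET = "Customer_Other"


def bucket_for(country: str) -> str:
    """Pick the bucket for a country, falling back to Customer_Other."""
    return COUNTRY_BUCKETS.get(country, FALLBACK_BUCKET)


def insert_statement(country: str, content: dict) -> str:
    """Build an INSERT that targets the bucket chosen for the country.

    String building is for illustration only; production code should rely
    on the parameter binding of whichever client API it uses.
    """
    return f"INSERT INTO BUCKET:{bucket_for(country)} CONTENT {json.dumps(content)}"


print(insert_statement("Italy", {"firstName": "Enzo", "lastName": "Ferrari"}))
# INSERT INTO BUCKET:Customer_Europe CONTENT {"firstName": "Enzo", "lastName": "Ferrari"}
```

Keeping the mapping in one place means that adding a finer split later, such as one bucket per country, only changes the table and not the callers.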
When you’re looking for customers based in Europe, you could execute this query:
SELECT FROM BUCKET:Customer_Europe
You can go even more specific by creating a bucket per country, not just per continent, and query from that bucket. Example:
CREATE BUCKET 'Customer_Europe_Italy'
CREATE BUCKET 'Customer_Europe_Spain'
Now get all the customers that live in Italy:
SELECT FROM BUCKET:Customer_Europe_Italy
You can also specify a list of buckets in your query. This is the query to retrieve both Italian and Spanish customers:
SELECT FROM BUCKET:[Customer_Europe_Italy,Customer_Europe_Spain]
ArcadeDB supports three kinds of relationships: connections, referenced, and embedded. It can manage relationships in a schema-full or schema-less scenario.
In a graph database, spanning edges between vertices is one way to express a connection between records. This is the graph model’s natural way of expressing relationships, and it is traversable with the SQL, Gremlin, and Cypher query languages. Internally, ArcadeDB stores a direct (referenced) relationship for vertices connected by edges to ensure fast graph traversals.
Example
vars: {
d2-config: {
sketch: false
}
}
direction: right
A: {
shape: circle
label: "Vertex A\nTYPE Customer\nRID #5:23"
}
X: {
shape: circle
label: "Edge X\nTYPE isBilled\nRID #16:9"
}
B: {
shape: circle
label: "Vertex B\nTYPE Invoice\nRID #10:2"
}
direction: right
A -> X -> B
In ArcadeDB’s SQL, edges are created via the CREATE EDGE command.
In Relational databases, tables are linked through JOIN commands, which can prove costly on computing resources.
ArcadeDB manages relationships natively without computing a JOIN but storing a direct LINK to the target object of the relationship.
This boosts the load speed for the entire graph of connected objects, such as in graph and object database systems.
Example
vars: {
d2-config: {
sketch: false
}
}
direction: right
A: {
shape: circle
label: "Record A\nTYPE Customer\nRID #5:23"
}
B: {
shape: circle
label: "Record B\nTYPE Invoice\nRID #10:2"
}
A -> B
Note that referenced relationships differ from edges: references are properties connecting any record, while edges are types connecting vertices. In particular, graph traversal is only applicable to edges.
When using Embedded relationships, ArcadeDB stores the relationship within the record that embeds it. These relationships are stronger than Reference relationships. You can represent it as a UML Composition relationship.
An embedded record does not have its own RID, so it can’t be referenced by other records. It is only accessible through the container record. Furthermore, an embedded record is stored inside the embedding record, and not in a bucket of the embedded record’s type. Hence, if you delete the container record, the embedded record is also deleted. For example,
vars: {
d2-config: {
sketch: false
}
}
direction: right
A: {
shape: rectangle
label: "Record A\nTYPE Account\nRID #5:23"
}
B: {
shape: rectangle
label: "Record B\nTYPE Address\nNO RID"
}
A o-> B
Here, record A contains the entirety of record B in the property address.
You can reach record B only by traversing the container record.
For example,
SELECT FROM Account WHERE address.city = 'Rome'
ArcadeDB expresses relationships of these kinds using the EMBEDDED type.
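The consequences of embedding can be illustrated with a plain Python model. This is a sketch only: the `store` dictionary, the second account, and the helpers `accounts_in` and `delete` are hypothetical and merely mimic the behaviour described above.

```python
# Records that have their own RID live in a store keyed by that RID.
store = {
    "#5:23": {
        "type": "Account",
        # Embedded record: stored inside its container, with no RID of its own.
        "address": {"type": "Address", "city": "Rome"},
    },
    "#5:24": {
        "type": "Account",
        "address": {"type": "Address", "city": "Turin"},
    },
}


def accounts_in(city: str) -> list:
    """Mimics: SELECT FROM Account WHERE address.city = '<city>'."""
    return [
        rid
        for rid, record in store.items()
        if record["type"] == "Account" and record["address"]["city"] == city
    ]


def delete(rid: str) -> None:
    """Deleting the container removes the embedded record along with it."""
    del store[rid]


print(accounts_in("Rome"))  # ['#5:23']
delete("#5:23")
print(accounts_in("Rome"))  # []
```

Since the address has no entry of its own in `store`, there is no way to reach it except through its account, which matches the composition semantics.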
ArcadeDB expresses relationships of these kinds using a list or a map of links, such as:
- LIST: An ordered list of records.
- MAP: An ordered map of records as the value and a string as the key; it doesn’t accept duplicate keys.
In ArcadeDB, all edges in the graph model are bidirectional. This differs from the document model, where relationships are always unidirectional, requiring the developer to maintain data integrity. In addition, ArcadeDB automatically maintains the consistency of all bidirectional relationships.
ArcadeDB supports edge constraints, which limit the vertex types that can be connected by an edge type.
To this end, the implicit metadata properties @in and @out need to be made explicit by creating them.
For example, an edge type HasParts that is supposed to connect only vertices of type Product to vertices of type Component can be defined by:
CREATE EDGE TYPE HasParts;
CREATE PROPERTY HasParts.`@out` link OF Product;
CREATE PROPERTY HasParts.`@in` link OF Component;
As a native graph database, ArcadeDB supports index-free adjacency. This means that each traversal step has a constant complexity of O(1), independent of the size of the graph (database size).
To traverse a graph structure, one needs to follow references stored by the current record. These references are always stored as RIDs, and are not only pointers to incoming and outgoing edges, but also to connected vertices. Internally, references are managed by a stack (also known as LIFO), which returns the latest insertion first. Because connected vertices are stored alongside edges, neighboring nodes can be reached directly, without going via the connecting edge. This is useful if edges are used purely to connect vertices and do not carry properties themselves.
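A toy model may help to picture index-free adjacency. In this Python sketch, each vertex keeps its own list of (edge RID, neighbor RID) pairs, and a dictionary lookup stands in for direct access by RID. The data layout and the helper `neighbors` are assumptions made for illustration and do not reflect the actual storage format; the RIDs are reused from the diagram earlier in this section.

```python
# Each vertex stores its adjacency entries as (edge RID, neighbor vertex RID).
vertices = {
    "#5:23": {"type": "Customer", "out": [("#16:9", "#10:2")], "in": []},
    "#10:2": {"type": "Invoice", "out": [], "in": [("#16:9", "#5:23")]},
}
# Edge records exist too, but are not needed to hop between vertices.
edges = {"#16:9": {"type": "isBilled", "out": "#5:23", "in": "#10:2"}}


def neighbors(vertex_rid: str, direction: str = "out") -> list:
    """Follow the references stored on the vertex itself.

    No global index is consulted and the edge records are never loaded:
    every hop is one direct lookup by RID.
    """
    return [
        vertices[neighbor_rid]
        for _edge_rid, neighbor_rid in vertices[vertex_rid][direction]
    ]


print([v["type"] for v in neighbors("#5:23")])        # ['Invoice']
print([v["type"] for v in neighbors("#10:2", "in")])  # ['Customer']
```

The cost of a hop depends only on the number of references stored on the current vertex, not on the total number of vertices or edges.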
Each server or Java VM can handle multiple database instances, but the database name must be unique.
ArcadeDB uses its own URL format, composed of the engine and the database name: <engine>:<db-name>.
The embedded engine is the default and can be omitted.
To open a database on the local file system, you can use the path directly as the URL.
You must always close the database once you finish working on it.
Note: ArcadeDB automatically closes all open databases when the process terminates gracefully (not when it is killed by force).
This is assured if the operating system allows a graceful shutdown, for example on Unix/Linux systems by using SIGTERM (Docker exit code 143) instead of SIGKILL (Docker exit code 137).
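The two Docker exit codes follow the common convention of reporting 128 plus the signal number for a process terminated by a signal. The short Python sketch below reproduces the arithmetic on a Unix-like system; `exit_code_for` is a hypothetical helper name.

```python
import signal


def exit_code_for(sig: signal.Signals) -> int:
    """Exit code conventionally reported for a process ended by a signal."""
    return 128 + int(sig)


print(exit_code_for(signal.SIGTERM))  # 143 (graceful shutdown request)
print(exit_code_for(signal.SIGKILL))  # 137 (forced kill, no cleanup possible)
```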
