Introduction to Elasticsearch #18

OrAbramovich · 2017-11-01T21:49:30Z

General Information

Elasticsearch is a document-oriented database. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents, and it includes filtering, aggregate statistics and analysis capabilities. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy and many other languages. Elasticsearch is widely regarded as one of the most popular enterprise search engines. It is also easily scalable, supporting clustering and leader election out of the box.

The core of Elasticsearch's intelligent search engine is largely another software project: Lucene. It is perhaps easiest to understand Elasticsearch as a piece of infrastructure built around Lucene’s Java libraries. Everything in Elasticsearch that pertains to the actual algorithms for matching text and storing optimized indexes of searchable terms is implemented by Lucene. Elasticsearch itself provides a more usable and concise API, scalability, and operational tools on top of Lucene’s search implementation.

Since Elasticsearch is a standalone Java application, getting up and running is a cinch on almost any platform. You’ll want Java 1.7 or newer.

About the set up of Elasticsearch

The smallest individual unit of data in Elasticsearch is a field, which has a defined type and has one or many values of that type. A field contains a single piece of data, like the number 42 or the string "Hello, World!", or a single list of data of the same type, such as the array [5, 6, 7, 8].

Documents are collections of fields and are the base unit of storage in Elasticsearch, something like a row in a traditional RDBMS. A document is considered the base unit of storage because, peculiar to Lucene, every field update fully rewrites the given document to storage (while preserving unmodified fields). So, while from an API perspective the field is the smallest single unit, the document is the smallest unit from a storage perspective.
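
To make the "update rewrites the whole document" point concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not Elasticsearch's or Lucene's actual implementation; the function name apply_partial_update is made up for this example.

```python
import copy

def apply_partial_update(stored_doc, partial_doc):
    """Return a NEW document: stored_doc with partial_doc merged into it.

    Hypothetical helper: it mimics the visible effect of a field update,
    where a complete new document is written and untouched fields survive.
    """
    new_doc = copy.deepcopy(stored_doc)
    for key, value in partial_doc.items():
        if isinstance(value, dict) and isinstance(new_doc.get(key), dict):
            # Nested objects are merged field by field.
            new_doc[key] = apply_partial_update(new_doc[key], value)
        else:
            new_doc[key] = copy.deepcopy(value)
    return new_doc

stored = {"handle": "ron", "age": 28, "computer": {"cpu": "pentium pro", "mhz": 200}}
updated = apply_partial_update(stored, {"age": 29, "computer": {"mhz": 233}})
print(updated)
```

The caller only names the changed fields, but the result is a complete replacement document, and the original is left as it was.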

The primary data-format Elasticsearch uses is JSON. Given that, all documents must be valid JSON values. A simple document might look like:

{
  "_id": 1,
  "handle": "ron",
  "age": 28,
  "hobbies": ["hacking", "the great outdoors"],
  "computer": {"cpu": "pentium pro", "mhz": 200}
}

The hobbies and computer fields specifically are rich types; an array and an object (dictionary) respectively, while the other fields are simple string and numeric types.

Elasticsearch reserves some fields for special use. We’ve specified one of these fields in this example: the _id field. A document’s id is unique, and if unassigned will be created automatically. An Elasticsearch id would be a primary key in RDBMS parlance.

Each document in Elasticsearch must conform to a user-defined type mapping, analogous to a database schema. A type’s mapping defines both the types of its fields (say integer, string, etc.) and the way in which those fields are indexed. Types are defined with the Mapping API, which associates type names with property definitions, e.g.:

{
  "user": {
    "properties": {
      "handle": {"type": "string"},
      "age": {"type": "integer"},
      "hobbies": {"type": "string"},
      "computer": {
        "properties": {
          "cpu": {"type": "string"},
          "mhz": {"type": "integer"}
        }
      }
    }
  }
}

Available types:

Type        Definition
string      Text
integer     32 bit integers
long        64 bit integers
float       IEEE float
double      Double precision floats
boolean     true or false
date        UTC Date/Time (JodaTime)
geo_point   Latitude / Longitude

There is nothing to declare regarding a field’s array-ness in the mapping. An important thing to remember, however, is that Elasticsearch arrays cannot store mixed types. If a field is declared as an integer, it can store one or many integers, but never a mix of types.
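
The rule can be sketched as a small check in plain Python. This is a simplified illustration only: the PYTHON_TYPES table and the conforms function are invented for this example, and any type coercion the server may apply is ignored.

```python
# Hypothetical lookup from mapping type names to Python types (illustration only).
PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "long": int,
    "float": (int, float),
    "double": (int, float),
    "boolean": bool,
}

def _matches(value, mapping_type):
    # bool is a subclass of int in Python, so handle it explicitly.
    if isinstance(value, bool):
        return mapping_type == "boolean"
    return isinstance(value, PYTHON_TYPES[mapping_type])

def conforms(value, mapping_type):
    """True if value is a single item or a list of items of the declared type."""
    values = value if isinstance(value, list) else [value]
    return all(_matches(v, mapping_type) for v in values)

print(conforms(42, "integer"))            # single value
print(conforms([5, 6, 7, 8], "integer"))  # array of the same type
print(conforms([5, "six"], "integer"))    # mixed array
```

A single value and a list are treated the same way, which mirrors the fact that the mapping says nothing about array-ness; only the element type matters.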

The largest single unit of data in Elasticsearch is an index. Indexes are logical and physical partitions of documents within Elasticsearch, most similar to the ‘database’ abstraction in the relational world. An index is a fully partitioned universe within a single running server instance: documents and type mappings are scoped per index, so it is safe to re-use names and ids across indexes. Indexes have no knowledge of data contained in other indexes, although Elasticsearch does support searches that span several indexes.

Full usage example:

// Create an index named 'planet'
PUT /planet

// Create a type called 'hacker'
PUT /planet/hacker/_mapping
{
  "hacker": {
    "properties": {
      "handle": {"type": "string"},
      "age": {"type": "long"}
    }
  }
}

// Create a document
PUT /planet/hacker/1
{"handle": "jean-michel", "age": 18}

// Retrieve the document
GET /planet/hacker/1

// Update the document's age field
POST /planet/hacker/1/_update
{"doc": {"age": 19}}

// Delete the document
DELETE /planet/hacker/1
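
Because the interface is plain HTTP with JSON bodies, the same steps can be prepared with nothing but the Python standard library. The sketch below only builds the requests and prints them; it assumes a node at http://localhost:9200 (the default port) and the helper name build_request is made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)

# The same steps as the example above, in the same order.
steps = [
    build_request("PUT", "/planet"),
    build_request("PUT", "/planet/hacker/1", {"handle": "jean-michel", "age": 18}),
    build_request("GET", "/planet/hacker/1"),
    build_request("POST", "/planet/hacker/1/_update", {"doc": {"age": 19}}),
    build_request("DELETE", "/planet/hacker/1"),
]

for req in steps:
    print(req.get_method(), req.full_url, req.data)
```

To actually run a step against a live node, pass the request to urllib.request.urlopen(req) and read the JSON response.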

Integration with programming languages

As mentioned above, official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy and many other languages, for example:

C# client:

var client = new ElasticsearchClient();
//index a document under /myindex/mytype/1
var indexResponse = client.Index("myindex","mytype","1", new { Hello = "World" });

Java client:

SearchResponse response = client.prepareSearch("index1", "index2")
        .setTypes("type1", "type2")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(QueryBuilders.termQuery(field, text)) // Query
        .setPostFilter(QueryBuilders.rangeQuery("age").from(12).to(18)) // Filter
        .setFrom(0).setSize(60).setExplain(true)
        .get();

Elasticsearch and geolocation data

Elasticsearch offers two ways of representing geolocations: latitude-longitude points using the geo_point field type, and complex shapes defined in GeoJSON, using the geo_shape field type.

A geo-point is a single latitude/longitude point on the Earth’s surface. Geo-points allow you to find points within a certain distance of another point, to calculate distances between two points for sorting or relevance scoring, or to aggregate into a grid to display on a map.
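
As a rough illustration of what "distance between two points" and "points within a certain distance" mean, here is a small haversine sketch in plain Python. It is not how Elasticsearch computes distances internally; the function names and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(points, center, max_km):
    """Keep the (lat, lon) points within max_km of center, nearest first."""
    scored = [(haversine_km(center[0], center[1], lat, lon), (lat, lon)) for lat, lon in points]
    return [point for dist, point in sorted(scored) if dist <= max_km]

candidates = [(0.0, 2.0), (0.0, 0.5), (0.0, 1.0)]
print(within_distance(candidates, (0.0, 0.0), 150))
```

The filter-then-sort shape is the same one a geo-distance filter combined with a geo-distance sort expresses on the server side.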

Example for querying locations within a given distance (Java client):

GeoDistanceFilterBuilder filter = FilterBuilders.geoDistanceFilter("location").point(latitude, longitude).distance(distance,DistanceUnit.KILOMETERS);
SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withFilter(filter)
            .withSort(SortBuilders.geoDistanceSort("site.location").point(latitude, longitude).order(SortOrder.ASC)).build();

searchQuery.addIndices(channelKey);
searchQuery.addTypes("site");

In terms of scaling, an index is divided into one or more shards. The number of shards is specified when the index is created and cannot be changed afterwards, so an index should be sharded in proportion to its anticipated growth. As more nodes are added to an Elasticsearch cluster, it does a good job of reallocating and moving shards around, which makes Elasticsearch very easy to scale out.
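
One way to see why the shard count is fixed: a document is placed on a shard by hashing a routing value (by default the document id) modulo the number of shards. The sketch below illustrates that idea in plain Python. It uses zlib.crc32 as a stand-in hash, which is an assumption for illustration and not the hash function Elasticsearch uses, and shard_for is a made-up name.

```python
import zlib

def shard_for(doc_id, number_of_shards):
    """Pick a shard as hash(routing value) modulo the shard count (stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

doc_ids = [str(i) for i in range(1, 101)]

# Compare the placement with 5 shards against the placement with 6 shards.
moved = [d for d in doc_ids if shard_for(d, 5) != shard_for(d, 6)]
print(len(moved), "of", len(doc_ids), "documents would map to a different shard")
```

If the shard count changed, existing documents would no longer be found where the formula says they should be, which is why the count has to be chosen up front.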

Limitations

Elasticsearch does not have transactions: there is no way to roll back a submitted document, and you cannot submit a group of documents and have either all or none of them indexed.
Elasticsearch (and the components it is made of) does not currently handle OutOfMemory errors very well: it is very important to provide Elasticsearch with enough memory and to be careful before running searches with unknown memory requirements on a production cluster.
Elasticsearch does not have any features for authentication or authorization.
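
In practice, the lack of transactions means that a batch sent through the bulk API can partly succeed, so the caller has to inspect the per-item results. The sketch below shows one way to do that in plain Python. The sample_response is hand-written for illustration (not real server output), failed_items is a made-up helper, and the code assumes each failed item carries an "error" entry.

```python
def failed_items(bulk_response):
    """Return (action, document id, reason) for every item that reported an error."""
    failures = []
    for item in bulk_response.get("items", []):
        for action, result in item.items():
            if "error" in result:
                error = result["error"]
                # The error may be an object with a "reason" or a plain string.
                reason = error.get("reason") if isinstance(error, dict) else str(error)
                failures.append((action, result.get("_id"), reason))
    return failures

# Hand-written sample: the first document was indexed, the second was rejected.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse [age]"}}},
    ],
}
print(failed_items(sample_response))
```

Document 1 stays indexed even though document 2 failed, so any all-or-nothing behaviour (for example deleting the documents that did go through) has to be implemented by the application.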

More detailed information about basic concepts of Elasticsearch can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

(based on https://www.elastic.co/guide/en/elasticsearch/client/index.html)

AdiOmari · 2017-11-06T16:41:45Z

Or @OrAbramovich and Alon @alonttal, you did a great job. This should help us reach a confident conclusion regarding which database to use.

alonttal · 2017-11-18T07:09:42Z

moved this tutorial to "Guides" section in Wiki

OrAbramovich added Database info Research labels Nov 1, 2017

This was referenced Nov 1, 2017

Research: Compare DB alternatives: MongoDB & ElasticSearch #13

Closed

Or Weekly Report #3

Open

OrAbramovich self-assigned this Nov 3, 2017

alonttal removed the info label Nov 13, 2017

alonttal closed this as completed Nov 18, 2017
