Commit

add section on automatic typing and type conversion
relates to #71
costin committed Aug 14, 2013
1 parent 4ab9352 commit 548cef2
Showing 3 changed files with 22 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/src/reference/asciidoc/core.adoc
@@ -10,3 +10,5 @@ include::core/cascading.adoc[]
include::core/hive.adoc[]

include::core/pig.adoc[]

include::core/mapping.adoc[]
1 change: 1 addition & 0 deletions docs/src/reference/asciidoc/core/intro.adoc
@@ -12,3 +12,4 @@ This part of the documentation explains the core functionality of {eh} starting

<<pig>>:: how-to on using {es} in Pig scripts through {eh}.

<<mapping>>:: deep-dive into the strategies employed by {eh} for doing type conversion and mapping to and from {es}.
19 changes: 19 additions & 0 deletions docs/src/reference/asciidoc/core/mapping.adoc
@@ -0,0 +1,19 @@
[[mapping]]
== Mapping and Type conversion

As explained in the previous sections, {eh} integrates closely with the Hadoop ecosystem and performs close introspection of the type information so that the data flow between {es} and Hadoop is as transparent as possible.
This section takes a closer look at how the type conversion takes place and how data is mapped between the two systems.

=== Automatic mapping

By default, {es} provides http://www.elasticsearch.org/guide/reference/api/index_/[automatic index and mapping] creation when data is added under an index that has not been created before. In other words, data can be added into {es} without the index and the mappings being defined a priori. This is quite convenient since {es} automatically adapts to the data being fed to it - moreover, if certain entries have extra fields, the schema-less nature of {es} allows them to be indexed without any issues.

[[number-conversion]]
It is important to remember that automatic mapping uses the payload values to identify the field http://www.elasticsearch.org/guide/reference/mapping/core-types/[type], with the *first document* creating the mapping. {eh} communicates with {es} through JSON, which does not provide any type information, only the field names and their values. One can think of it as 'type erasure' or information loss; for example, JSON does not differentiate integer numeric types - +byte+, +short+, +int+, +long+ are all placed in the same +integer+ 'bucket'. This can have unexpected side-effects since the type information is _guessed_, such as:

numbers mapped only as +long+/+double+:: Whenever {es} encounters a number, it will allocate the largest type for it since it does not know the exact number type of the field. Allocating a small type (such as +byte+, +int+ or +float+) can lead to problems if a later document holds a larger value, so {es} uses a safe default.
incorrect mapping:: This happens when a string field contains only numbers (say +1234+) - {es} has no information that the number is actually a string and thus it maps the field as a number. The same issue tends to occur with dates and strings.
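
The two side-effects above can be illustrated with a minimal Python sketch. It is not how {es} detects types internally: +guess_type+ is a hypothetical, simplified stand-in, and the field names (+count+, +ratio+, +zip+) are made up for the example. It only shows that once a document has gone through JSON, the reader can no longer tell a +byte+ from a +long+, or a numeric-looking identifier from a real number.

```python
import json


def guess_type(value):
    """Hypothetical, simplified stand-in for dynamic type detection."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"    # widest integer type, chosen as a safe default
    if isinstance(value, float):
        return "double"  # widest floating point type
    return "string"


# The writer knows 'count' fits in a byte and 'zip' is meant to be text,
# but neither fact survives serialization to JSON.
doc = {"count": 7, "ratio": 0.5, "zip": 1234}
payload = json.dumps(doc)

# The reader only sees field names and bare values.
fields = json.loads(payload)
guessed = {name: guess_type(value) for name, value in fields.items()}
print(guessed)
```

In this sketch +count+ ends up in the widest integer 'bucket' and +zip+ is treated as a number, mirroring the two cases listed above.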

To avoid such problems and to have maximum control over your mapping, it is recommended to define the http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/[mapping] before ingesting any data.
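
As a rough sketch of what defining a mapping up front might look like, the Python snippet below builds a mapping body and prints it as JSON. The type name +entry+, the field names and the +build_mapping+ helper are all hypothetical, and the body layout (type name, then +properties+, then one +type+ per field) is an assumption based on the put mapping reference linked above; check that reference for the exact format supported by your {es} version.

```python
import json


def build_mapping(type_name, field_types):
    """Build a mapping body from a {field name: type} dictionary.

    Hypothetical helper; the body layout is an assumption, not an official API.
    """
    properties = {name: {"type": t} for name, t in field_types.items()}
    return {type_name: {"properties": properties}}


# Declare the intended types explicitly instead of letting them be guessed.
mapping = build_mapping(
    "entry",
    {"zip": "string", "count": "integer", "created": "date"},
)
body = json.dumps(mapping, indent=2)
print(body)
```

The resulting JSON would then be sent to {es} before any document is indexed, so that +zip+ stays a string and +count+ keeps a narrower numeric type.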

TIP: In most cases, http://www.elasticsearch.org/guide/reference/api/admin-indices-templates/[templates] are quite handy as they are automatically applied to newly created indices whose names match the template pattern; in other words, instead of defining the mapping per index, one can define the template once and have it applied to all indices that match its pattern.
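
The idea of pattern-based application can be sketched in Python as follows. This is only an illustration: the pattern +logs-*+ and the index names are made up, and Python's +fnmatch+ is used as a stand-in for wildcard matching, which may not follow exactly the same rules as {es}.

```python
from fnmatch import fnmatchcase

# Hypothetical template pattern and index names, for illustration only.
template_pattern = "logs-*"
new_indices = ["logs-2013.08.14", "logs-2013.08.15", "metrics-2013.08.14"]

# Indices whose name matches the pattern would pick up the template's
# mapping when they are created; the others would not.
matching = [name for name in new_indices if fnmatchcase(name, template_pattern)]
print(matching)
```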
