add section on automatic typing and type conversion

relates to #71
elastic · Aug 14, 2013 · 548cef2 · 548cef2
1 parent 4ab9352
commit 548cef2
Showing 3 changed files with 22 additions and 0 deletions.
diff --git a/docs/src/reference/asciidoc/core.adoc b/docs/src/reference/asciidoc/core.adoc
@@ -10,3 +10,5 @@ include::core/cascading.adoc[]
 include::core/hive.adoc[]
 
 include::core/pig.adoc[]
+
+include::core/mapping.adoc[]
diff --git a/docs/src/reference/asciidoc/core/intro.adoc b/docs/src/reference/asciidoc/core/intro.adoc
@@ -12,3 +12,4 @@ This part of the documentation explains the core functionality of {eh} starting
 
 <<pig>>:: how-to on using {es} in Pig scripts through {eh}.
 
+<<mapping>>:: deep-dive into the strategies employed by {eh} for doing type conversion and mapping to and from {es}.
diff --git a/docs/src/reference/asciidoc/core/mapping.adoc b/docs/src/reference/asciidoc/core/mapping.adoc
@@ -0,0 +1,19 @@
+[[mapping]]
+== Mapping and Type conversion
+
+As explained in the previous sections, {eh} integrates closely with the Hadoop ecosystem and performs close introspection of the type information so that the data flow between {es} and Hadoop is as transparent as possible.
+This section takes a closer look at how the type conversion takes place and how data is mapped between the two systems.
+
+=== Automatic mapping
+
+By default, {es} provides http://www.elasticsearch.org/guide/reference/api/index_/[automatic index and mapping] when data is added under an index that has not been created before. In other words, data can be added into {es} without the index and the mappings being defined a priori. This is quite convenient since {es} automatically adapts to the data being fed to it. Moreover, if certain entries have extra fields, the schema-less nature of {es} allows them to be indexed without any issues.
+
+[[number-conversion]]
+It is important to remember that automatic mapping uses the payload values to identify each field's http://www.elasticsearch.org/guide/reference/mapping/core-types/[type], with the *first document* creating the mapping. {eh} communicates with {es} through JSON, which does not provide any type information, only the field names and their values. One can think of it as 'type erasure' or information loss; for example, JSON does not differentiate integer numeric types: +byte+, +short+, +int+, +long+ are all placed in the same +integer+ 'bucket'. Because the type information is _guessed_, this can have unexpected side-effects such as:
+
+numbers mapped only as +long+/+double+:: Whenever {es} encounters a number, it will allocate the largest type for it since it does not know the exact number type of the field. Allocating a small type (such as +byte+, +int+ or +float+) can lead to problems if a future document contains a larger value, so {es} uses a safe default.
+incorrect mapping:: This happens when a string field contains only numbers (say +1234+) - {es} has no information that the number is actually a string and thus it maps the field as a number. The same issue tends to occur with dates and strings.
+
+To avoid such problems and to have maximum control over your mapping, it is recommended to define the http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/[mapping] before ingesting any data.
+
+TIP: In most cases, http://www.elasticsearch.org/guide/reference/api/admin-indices-templates/[templates] are quite handy as they are automatically applied to newly created indices that match the pattern; in other words, instead of defining the mapping per index, one can just define the template once and then have it applied to all indices that match its pattern.
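
Note on the new mapping section: the 'type erasure' and first-document behaviour it describes can be illustrated with a toy Python sketch. This is only an illustration under stated assumptions, not how {es} or {eh} implement dynamic mapping: `guess_type` and `infer_mapping` are hypothetical helpers invented for this note, they handle scalar values only, and the field names are made up.

```python
import json


def guess_type(value):
    """Toy stand-in for automatic mapping: pick a field type from a JSON value alone.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    # bool must be checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"    # widest integer type; the original byte/short/int is unknown
    if isinstance(value, float):
        return "double"  # widest floating-point type
    return "string"


def infer_mapping(documents):
    """The first document that mentions a field fixes its type; later ones do not change it."""
    mapping = {}
    for doc in documents:
        for field, value in json.loads(doc).items():
            mapping.setdefault(field, guess_type(value))
    return mapping


# The first document serializes "zip" as a bare number, so the guess is numeric,
# even though the second document shows the field is really a string.
docs = [
    json.dumps({"age": 7, "ratio": 0.5, "zip": 1234}),
    json.dumps({"age": 30000000000, "zip": "01234"}),
]
mapping = infer_mapping(docs)
print(mapping)
```

In this sketch `age` ends up as `long` although the first value would fit in a byte, and `zip` is guessed as a number because of the first document alone, which mirrors the two side-effects listed in the section and is why defining the mapping up front is recommended there.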