Commit
add section on automatic typing and type conversion
relates to #71
Showing 3 changed files with 22 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,3 +10,5 @@ include::core/cascading.adoc[] | |
include::core/hive.adoc[] | ||
|
||
include::core/pig.adoc[] | ||
|
||
include::core/mapping.adoc[] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
[[mapping]] | ||
== Mapping and Type conversion | ||
|
||
As explained in the previous sections, {eh} integrates closely with the Hadoop ecosystem and performs close introspection of the type information so that the data flow between {es} and Hadoop is as transparent as possible. | ||
This section takes a closer look at how the type conversion takes place and how data is mapped between the two systems. | ||
|
||
=== Automatic mapping | ||
|
||
By default, {es} provides http://www.elasticsearch.org/guide/reference/api/index_/[automatic index and mapping] creation when data is added under an index that has not been created before. In other words, data can be added into {es} without the index and the mappings being defined a priori. This is quite convenient since {es} automatically adapts to the data being fed to it; moreover, if certain entries have extra fields, the schema-less nature of {es} allows them to be indexed without any issues. | ||
|
||
[[number-conversion]] | ||
It is important to remember that automatic mapping uses the payload values to identify each field's http://www.elasticsearch.org/guide/reference/mapping/core-types/[type], and that the *first document* indexed creates the mapping. {eh} communicates with {es} through JSON, which does not provide any type information, only the field names and their values. One can think of it as 'type erasure' or information loss; for example, JSON does not differentiate between integer numeric types: +byte+, +short+, +int+ and +long+ are all placed in the same +integer+ 'bucket'. This can have unexpected side effects, since the type information is _guessed_, such as: | ||
|
||
numbers mapped only as +long+/+double+:: Whenever {es} encounters a number, it allocates the largest type for it since it does not know the exact number type of the field. Allocating a small type (such as +byte+, +int+ or +float+) can lead to problems if a future document contains a larger value, so {es} uses a safe default. | ||
incorrect mapping:: This happens when a string field contains only numbers (say +1234+): {es} has no information that the number is actually a string and thus it maps the field as a number. The same issue tends to occur with dates and strings. | ||
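
The two side effects above can be illustrated with a small sketch. It does not call {es}; `guess_field_type` is a hypothetical stand-in for the guessing step, and the field names are made up. It assumes the writer serialized the numeric-looking identifier `zip` as a bare JSON number.

```python
import json

def guess_field_type(value):
    """Hypothetical stand-in for automatic mapping: guess a type from a JSON value alone."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"            # largest integer type, since the original width is unknown
    if isinstance(value, float):
        return "double"          # largest floating point type, for the same reason
    return "string"

# The writer knew 'count' fits in a byte and 'zip' is really an identifier,
# but none of that survives serialization to JSON.
record = {"count": 7, "ratio": 0.5, "zip": 1234}
payload = json.dumps(record)

guessed = {name: guess_field_type(value) for name, value in json.loads(payload).items()}
print(guessed)  # {'count': 'long', 'ratio': 'double', 'zip': 'long'}
```

Once the payload is JSON, the receiving side has only the values to go on, which is why `count` widens to +long+ and `zip` ends up numeric.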
|
||
To avoid such problems and to have maximum control over your mapping, it is recommended to define the http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping/[mapping] before ingesting any data. | ||
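
As a rough sketch of what an explicit mapping might look like, the snippet below builds a mapping body that pins down the field types instead of leaving them to be guessed. The type name `entry` and the field names are hypothetical, and the body layout is an assumption based on the typed-mapping style described in the linked reference; the exact structure and endpoint depend on the {es} version in use.

```python
import json

# Hypothetical type name ("entry") and field names, for illustration only.
# Each field is given the type the writer actually intends.
mapping = {
    "entry": {
        "properties": {
            "count": {"type": "byte"},     # small integer instead of the guessed long
            "zip": {"type": "string"},     # numeric-looking identifier kept as a string
            "created": {"type": "date"},
        }
    }
}

# The serialized body is what would be sent to the put-mapping API before ingestion.
body = json.dumps(mapping, indent=2)
print(body)
```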
|
||
TIP: In most cases, http://www.elasticsearch.org/guide/reference/api/admin-indices-templates/[templates] are quite handy as they are automatically applied to newly created indices whose names match the template pattern; in other words, instead of defining the mapping per index, one can define the template once and have it applied to all indices that match its pattern. |
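
The idea behind templates can be sketched as follows. The template body is an assumption (a `template` pattern key plus a `mappings` section; the key names vary across {es} versions), the index and type names are hypothetical, and Python's `fnmatchcase` only approximates the wildcard matching {es} performs on index names.

```python
from fnmatch import fnmatchcase

# Assumed template layout: a name pattern plus the mappings to apply.
template = {
    "template": "logs-*",
    "mappings": {"entry": {"properties": {"zip": {"type": "string"}}}},
}

def applies_to(template, index_name):
    """Return True if the (hypothetical) template pattern matches the index name."""
    return fnmatchcase(index_name, template["template"])

# Indices created after the template was registered.
new_indices = ["logs-2013.09", "logs-2013.10", "metrics-2013.10"]
matched = [name for name in new_indices if applies_to(template, name)]
print(matched)  # ['logs-2013.09', 'logs-2013.10']
```

Only the indices whose names match the pattern pick up the mapping, so a single definition covers every future `logs-` index.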