
Merge integer field data implementations together #3220

Closed
jpountz opened this issue Jun 24, 2013 · 2 comments

jpountz commented Jun 24, 2013

Elasticsearch has 4 similar field data implementations for its integer types: byte, short, int and long. These implementations could be merged together and even be made a little more memory-efficient by using Lucene's PackedInts API.
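
For context, a minimal sketch (not the actual field data code) of how Lucene's PackedInts API packs small integers using only as many bits per value as the largest value requires; the field values are made up for illustration:

```java
import org.apache.lucene.util.packed.PackedInts;

public class PackedIntsSketch {
    public static void main(String[] args) {
        // Hypothetical per-document values of a numeric field.
        long[] values = {3, 7, 12, 0, 9, 15};

        // Number of bits needed for the largest value (here: 4 instead of 8/16/32/64).
        long max = 0;
        for (long v : values) {
            max = Math.max(max, v);
        }
        int bitsPerValue = PackedInts.bitsRequired(max);

        // Store all values in a packed array that uses bitsPerValue bits per entry.
        PackedInts.Mutable packed =
                PackedInts.getMutable(values.length, bitsPerValue, PackedInts.DEFAULT);
        for (int i = 0; i < values.length; i++) {
            packed.set(i, values[i]);
        }

        // Values are read back by document index.
        System.out.println(bitsPerValue + " bits per value, doc 2 -> " + packed.get(2));
    }
}
```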

ghost assigned jpountz Jun 24, 2013
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 26, 2013
This commit merges field data implementations for byte, short, int and long
data into PackedArrayAtomicFieldData which uses Lucene's PackedInts API to
store data.

Close elastic#3220

jpountz commented Jun 26, 2013

With the help of @martijnvg, I ran a few benchmarks to compare the new implementation against the old ones. Loading times are similar, memory usage is reduced by a factor of 1x to 2x, and faceting runs at similar speeds (there are small differences depending on the dataset due to CPU caching effects). For example, here are the results of HistogramFacetSearchBenchmark on a 20M-document index for fields of type byte (b_value), short (s_value), int (i_value) and long (l_value):

Without this commit:
--> Histogram Facet (b_value) 599ms
--> Histogram Facet (b_value/b_value) 819ms
--> Histogram Facet (s_value) 681ms
--> Histogram Facet (s_value/s_value) 813ms
--> Histogram Facet (i_value) 668ms
--> Histogram Facet (i_value/i_value) 804ms
--> Histogram Facet (l_value) 670ms
--> Histogram Facet (l_value/l_value) 815ms
With this commit:
--> Histogram Facet (b_value) 604ms
--> Histogram Facet (b_value/b_value) 752ms
--> Histogram Facet (s_value) 637ms
--> Histogram Facet (s_value/s_value) 738ms
--> Histogram Facet (i_value) 637ms
--> Histogram Facet (i_value/i_value) 737ms
--> Histogram Facet (l_value) 640ms
--> Histogram Facet (l_value/l_value) 743ms

And here are the results on a 5M-document index:

Without this commit:
--> Histogram Facet (b_value) 150ms
--> Histogram Facet (b_value/b_value) 166ms
--> Histogram Facet (i_value) 141ms
--> Histogram Facet (i_value/i_value) 164ms
--> Histogram Facet (i_value) 140ms
--> Histogram Facet (i_value/i_value) 164ms
--> Histogram Facet (l_value) 140ms
--> Histogram Facet (l_value/l_value) 164ms
With this commit:
--> Histogram Facet (b_value) 152ms
--> Histogram Facet (b_value/b_value) 195ms
--> Histogram Facet (i_value) 147ms
--> Histogram Facet (i_value/i_value) 169ms
--> Histogram Facet (i_value) 146ms
--> Histogram Facet (i_value/i_value) 169ms
--> Histogram Facet (l_value) 145ms
--> Histogram Facet (l_value/l_value) 169ms

jpountz commented Jun 27, 2013

About memory and loading time, here are reports from LongFieldDataBenchmark on 1M documents.

Without this commit:
Data                         Loading time    Implementation    Actual size    Expected size
SINGLE_VALUES_DENSE_ENUM     65              Single            976.6 KB       976.6 KB
SINGLE_VALUED_DENSE_DATE     200             Single            7.6 MB         7.6 MB
MULTI_VALUED_DATE            233             WithOrdinals      15.4 MB        15.4 MB
MULTI_VALUED_ENUM            48              WithOrdinals      7.8 MB         7.8 MB
SINGLE_VALUED_SPARSE_RANDOM  30              WithOrdinals      3.6 MB         3.6 MB
MULTI_VALUED_SPARSE_RANDOM   71              WithOrdinals      8.1 MB         8.1 MB
MULTI_VALUED_DENSE_RANDOM    428             WithOrdinals      27.1 MB        27.1 MB

With this commit:
Data                         Loading time    Implementation    Actual size    Expected size
SINGLE_VALUES_DENSE_ENUM     86              Single            488.4 KB       488.3 KB
SINGLE_VALUED_DENSE_DATE     191             Single            4.3 MB         4.3 MB
MULTI_VALUED_DATE            224             WithOrdinals      10.5 MB        10.5 MB
MULTI_VALUED_ENUM            46              WithOrdinals      7.8 MB         7.8 MB
SINGLE_VALUED_SPARSE_RANDOM  30              WithOrdinals      3.5 MB         3.5 MB
MULTI_VALUED_SPARSE_RANDOM   76              WithOrdinals      7.7 MB         7.7 MB
MULTI_VALUED_DENSE_RANDOM    448             WithOrdinals      23.7 MB        23.7 MB

More information about the data sets (a hypothetical generation sketch follows the list):

  • SINGLE_VALUES_DENSE_ENUM assigns a single long between 0 and 15 to each document
  • SINGLE_VALUED_DENSE_DATE assigns a single date between 2010 and 2012 to each document
  • MULTI_VALUED_DATE assigns 0, 1 or 2 dates between 2010 and 2012 to every document
  • MULTI_VALUED_ENUM assigns 0, 1 or 2 longs between 3 and 10 to every document
  • SINGLE_VALUED_SPARSE_RANDOM assigns 1 random long to 10% of documents
  • MULTI_VALUED_SPARSE_RANDOM assigns between 1 and 5 random longs to 10% of documents
  • MULTI_VALUED_DENSE_RANDOM assigns between 1 and 3 random longs to all documents
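
As a rough illustration (not the actual benchmark code), a hypothetical sketch of how two of these distributions could be generated:

```java
import java.util.Random;

public class DataSetSketch {
    private static final Random RANDOM = new Random(42);

    // SINGLE_VALUES_DENSE_ENUM: one long between 0 and 15 per document.
    static long[] singleValuedDenseEnum(int numDocs) {
        long[] values = new long[numDocs];
        for (int i = 0; i < numDocs; i++) {
            values[i] = RANDOM.nextInt(16);
        }
        return values;
    }

    // MULTI_VALUED_SPARSE_RANDOM: 1 to 5 random longs for ~10% of documents.
    static long[][] multiValuedSparseRandom(int numDocs) {
        long[][] values = new long[numDocs][];
        for (int i = 0; i < numDocs; i++) {
            if (RANDOM.nextInt(10) == 0) {
                long[] docValues = new long[1 + RANDOM.nextInt(5)];
                for (int j = 0; j < docValues.length; j++) {
                    docValues[j] = RANDOM.nextLong();
                }
                values[i] = docValues;
            } else {
                values[i] = new long[0]; // no value for this document
            }
        }
        return values;
    }
}
```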

More information about the columns:

  • Loading time is the time to load field data from the directory into memory.
  • Implementation is the class of the AtomicFieldData instance that was loaded.
  • Actual size is the memory usage reported by RamUsageEstimator.
  • Expected size is the memory usage reported by AtomicFieldData.getMemorySizeInBytes (see the sketch after this list).
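
To illustrate how the two size columns can be measured in principle, a minimal sketch based on Lucene 4.x's RamUsageEstimator; the packed array stands in for loaded field data, and this is a simplification rather than the actual benchmark or AtomicFieldData code:

```java
import org.apache.lucene.util.RamUsageEstimator;
import org.apache.lucene.util.packed.PackedInts;

public class SizeReportSketch {
    public static void main(String[] args) {
        // Stand-in for loaded field data: one million 4-bit packed values.
        PackedInts.Mutable data = PackedInts.getMutable(1000000, 4, PackedInts.COMPACT);

        // "Actual size": measured by walking the object graph.
        long actual = RamUsageEstimator.sizeOf(data);

        // "Expected size": what the implementation itself reports.
        long expected = data.ramBytesUsed();

        System.out.println("actual=" + RamUsageEstimator.humanReadableUnits(actual)
                + " expected=" + RamUsageEstimator.humanReadableUnits(expected));
    }
}
```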

Explanation of the memory reduction:

  • On single-valued fields, the new implementation performs better when the number of bits required per value is not just below 8, 16, 32 or 64, for example for small enums or dates.
  • On multi-valued fields with low cardinality, most of the memory is taken by the ordinals map, so memory usage doesn't change much.
  • On multi-valued fields with high cardinality, using MonotonicAppendingLongBuffer to encode values (this class efficiently compresses sequences of monotonically increasing longs) yields a memory reduction as well: for example 13% on MULTI_VALUED_DENSE_RANDOM, even though the values were chosen completely at random (see the sketch below).
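
A minimal sketch of the MonotonicAppendingLongBuffer class mentioned above (Lucene 4.x); the values here are made up, standing in for a field's distinct values stored in increasing order:

```java
import org.apache.lucene.util.packed.MonotonicAppendingLongBuffer;

public class MonotonicBufferSketch {
    public static void main(String[] args) {
        // Hypothetical distinct values of a numeric field; field data stores them
        // sorted, so the sequence is monotonically increasing and the buffer can
        // compress it by only storing small deviations from a linear approximation.
        long[] sortedValues = {3, 7, 12, 15, 42, 100};

        MonotonicAppendingLongBuffer buffer = new MonotonicAppendingLongBuffer();
        for (long value : sortedValues) {
            buffer.add(value);
        }

        // Values are read back by ordinal.
        for (long ord = 0; ord < buffer.size(); ord++) {
            System.out.println(ord + " -> " + buffer.get(ord));
        }
    }
}
```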

jpountz added a commit that referenced this issue Jun 28, 2013
This commit merges field data implementations for byte, short, int and long
data into PackedArrayAtomicFieldData which uses Lucene's PackedInts API to
store data.

Close #3220
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
This commit merges field data implementations for byte, short, int and long
data into PackedArrayAtomicFieldData which uses Lucene's PackedInts API to
store data.

Close elastic#3220