metacrunch-elasticsearch

This is the official Elasticsearch package for the metacrunch ETL toolkit.

NOTE: metacrunch-elasticsearch 5.x requires Elasticsearch 7.x. For older versions of Elasticsearch, use metacrunch-elasticsearch 4.x.

Installation

Include the gem in your Gemfile

gem "metacrunch-elasticsearch", "~> 5.0.0"

and run $ bundle install to install it.

Or install it manually

$ gem install metacrunch-elasticsearch

Usage

Note: For working examples of how to use this package, check out our demo repository.

Metacrunch::Elasticsearch::Source

This class provides a metacrunch source implementation that can be used to read data from Elasticsearch into a metacrunch job.

# my_job.metacrunch

# Create an Elasticsearch connection
elasticsearch = Elasticsearch::Client.new(...)

# Set the source
source Metacrunch::Elasticsearch::Source.new(elasticsearch, OPTIONS)

Options

  • :search_options: A hash with search options (including your query) as described here. The defaults are size: 100, scroll: 1m, sort: ["_doc"]. Depending on your use case, you may need to adjust :size and :scroll for optimal performance.
  • :total_hits_callback: A Proc that gets called with the total number of hits your query matches. You can use this callback to set up a progress bar, for example. Defaults to nil.
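
As a minimal sketch of the two options above, the hash below overrides :size and :scroll and registers a callback that reports the hit count. The index name, the query and the SOURCE_OPTIONS constant are hypothetical. That :index and :body are passed inside :search_options is an assumption based on the Elasticsearch search API. The final call is commented out because it needs the gem and a running cluster.

```ruby
# Hypothetical option values; adjust index, query, size and scroll to your data.
SOURCE_OPTIONS = {
  search_options: {
    index: "my-index",                   # assumed index name
    body: { query: { match_all: {} } },  # the query travels with the search options
    size: 500,                           # documents per scroll page (default: 100)
    scroll: "5m",                        # scroll context lifetime (default: 1m)
    sort: ["_doc"]                       # same as the documented default
  },
  # Called with the total number of hits the query matches.
  total_hits_callback: ->(total) { puts "Expecting #{total} documents" }
}

# Inside a metacrunch job (requires the gem and a reachable cluster):
# source Metacrunch::Elasticsearch::Source.new(elasticsearch, SOURCE_OPTIONS)
```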

Metacrunch::Elasticsearch::Destination

This class provides a metacrunch destination implementation that can be used to write data from a metacrunch job to Elasticsearch.

The data passed to the destination must be in the format the destination expects. You can use a transformation to convert your data before it reaches the destination.

Because Metacrunch::Elasticsearch::Destination uses the Elasticsearch bulk API, the expected format must match one of the available options for the body parameter described here. Note that the bulk API is not limited to indexing records: you can update or delete records as well.
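
Since the bulk API also accepts update and delete actions, items for those could be shaped roughly as follows. This is a sketch based on the body format accepted by the elasticsearch-ruby bulk helper; the method names, the index name and the field names are hypothetical.

```ruby
# Hypothetical helpers that build bulk items for update and delete actions.

# Partial update: the changed fields go under data => doc.
def bulk_update_item(id, changes)
  { update: { _index: "my-index", _id: id, data: { doc: changes } } }
end

# Delete: only the metadata is needed, there is no document body.
def bulk_delete_item(id)
  { delete: { _index: "my-index", _id: id } }
end

p bulk_update_item(1, { title: "New title" })
p bulk_delete_item(2)
```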

# my_job.metacrunch

# Transform data into a format that the destination can understand.
# In this example `data` is some hash.
transformation ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id),
      data: data
    }
  }
end
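
To see what this transformation emits, the same lambda can be called on a sample hash in plain Ruby. The record and the TO_BULK_INDEX constant are made up for illustration. Note that data.delete(:id) removes the id from the hash before the hash is stored under data:, so the id ends up only in the bulk metadata.

```ruby
# Same lambda as in the job file, assigned to a constant so it can be called directly.
TO_BULK_INDEX = ->(data) do
  {
    index: {
      _index: "my-index",
      _id: data.delete(:id), # removes :id from data and uses it as the document id
      data: data             # the remaining fields become the document source
    }
  }
end

record = { id: 42, title: "Example" } # hypothetical input record
item = TO_BULK_INDEX.call(record)
p item
```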

Calling Elasticsearch for every single record is inefficient. Instead, use a transformation with a buffer to group records into bulks. This example uses a buffer size of 10. In production environments, and depending on your data, larger buffers may be useful.

# my_job.metacrunch

transformation ->(data) { data }, buffer: 10
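
The effect of the buffer can be pictured with plain Ruby: records are collected into arrays of up to 10 items, and each array reaches the destination in one piece. This sketch only mimics the grouping with Enumerable#each_slice. It is not how metacrunch implements buffering internally, and the helper name and sample records are made up.

```ruby
# Hypothetical helper: group items into batches of at most `size` elements.
def in_batches(items, size)
  items.each_slice(size).to_a
end

# 25 hypothetical bulk items, shaped like the output of the previous transformation.
items = (1..25).map do |i|
  { index: { _index: "my-index", _id: i, data: { title: "Record #{i}" } } }
end

batches = in_batches(items, 10)
p batches.map(&:size) # two full batches of 10 and a final batch of 5
```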

With these transformations in place, you can use the Metacrunch::Elasticsearch::Destination class as a destination.

# my_job.metacrunch

# Write data into Elasticsearch
destination Metacrunch::Elasticsearch::Destination.new(elasticsearch [, OPTIONS])

Options

  • :result_callback: You can set a Proc that gets called with the result from the bulk operation. Defaults to nil.
  • :bulk_options: A hash of options for the Elasticsearch bulk API as described here. Setting body here will be ignored. Defaults to {}.
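
A :result_callback is a natural place to check for failed bulk items. The sketch below assumes the callback receives the parsed bulk response as a Hash with string keys, containing the errors flag and the items array that the Elasticsearch bulk API returns. The method name, the RESULT_CALLBACK constant and the sample response are hypothetical.

```ruby
# Collect the items of a bulk response that report an error.
def failed_bulk_items(result)
  return [] unless result["errors"]

  result["items"].select do |item|
    # Each item has a single key (index, update, delete, ...) holding the outcome.
    item.values.first.key?("error")
  end
end

# Hypothetical callback that warns about failures.
RESULT_CALLBACK = ->(result) do
  failed = failed_bulk_items(result)
  warn "#{failed.size} bulk item(s) failed" unless failed.empty?
end

# Hypothetical response with one success and one failure.
sample = {
  "errors" => true,
  "items" => [
    { "index" => { "_id" => "1", "status" => 201 } },
    { "index" => { "_id" => "2", "status" => 400,
                   "error" => { "type" => "mapper_parsing_exception" } } }
  ]
}
RESULT_CALLBACK.call(sample)

# In a job file (requires the gem and a reachable cluster):
# destination Metacrunch::Elasticsearch::Destination.new(elasticsearch, result_callback: RESULT_CALLBACK)
```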

License

metacrunch-elasticsearch is available on GitHub under the MIT license.
