Elasticsearch

Elasticsearch Intro

Elasticsearch is a distributed NoSQL JSON document database built on Lucene. It provides a full-text search service and is used extensively by websites such as Quora, GitHub, and StackExchange. Its RESTful API provides a simple interface to the distributed database, which makes integration with websites straightforward. In this dev-op we will deploy Elasticsearch on an AWS cluster and run a simple query.

## Spin up AWS instances

We recommend using t2.micro instances with Ubuntu Server 14.04 LTS (HVM), SSD Volume Type, to take advantage of Amazon’s Free Tier program. Be sure to terminate the instances when you are finished, since AWS will charge you if you go over the Free Tier limit of 750 instance hours per month. For practice, you can try spinning up 3 nodes for Elasticsearch.

## Setup Elasticsearch

Elasticsearch will be installed on all nodes with the same configuration.

Run the following on all nodes by SSH-ing into each node:

node$ sudo apt-get update

Install the Java Development Kit
node$ sudo apt-get install openjdk-7-jdk

Install Elasticsearch
node$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.5.2.tar.gz -P ~/Downloads
node$ sudo tar -xvf ~/Downloads/elasticsearch-1.5.2.tar.gz -C /usr/local
node$ sudo mv /usr/local/elasticsearch-1.5.2 /usr/local/elasticsearch

Set the ELASTICSEARCH_HOME environment variable and add its bin directory to PATH in .profile
node$ nano ~/.profile

# Add the following
export ELASTICSEARCH_HOME=/usr/local/elasticsearch
export PATH=$PATH:$ELASTICSEARCH_HOME/bin

node$ source ~/.profile

Install AWS Cloud Plugin for Elasticsearch
node$ sudo $ELASTICSEARCH_HOME/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.5.0

Configure Elasticsearch for node discovery
node$ sudo nano $ELASTICSEARCH_HOME/config/elasticsearch.yml

Change the access_key (AWS access key ID), secret_key (AWS secret access key), region (cluster region), and groups (security group name) to match your AWS settings. Also change the name of your cluster to something specific to you; otherwise Elasticsearch may assume that other nodes it discovers on EC2 belong to your cluster. Warning: BE CAREFUL NOT TO COMMIT THIS FILE TO GITHUB, SINCE IT CONTAINS YOUR AWS CREDENTIALS.

cloud.aws.access_key: YOUR_AWS_ACCESS_KEY_ID
cloud.aws.secret_key: YOUR_AWS_SECRET_ACCESS_KEY
cloud.aws.region: us-west-2

discovery.type: ec2
discovery.ec2.groups: your-security-group
################### Elasticsearch Configuration Example ###################
…
…
cluster.name: my-cluster-name
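
Because the settings above include credentials, one way to keep them out of a repository is to generate the lines from environment variables instead of typing them into a file you might commit. The following is a minimal sketch under stated assumptions, not part of the original setup: the function name `render_discovery_settings` and the variables `ES_SECURITY_GROUP` and `ES_CLUSTER_NAME` are hypothetical, and the fallback values are simply the placeholders used in this guide.

```python
import os

# Variables that must be present; these names follow the usual AWS convention.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def render_discovery_settings(env=None):
    """Return elasticsearch.yml lines for EC2 discovery, built from env vars."""
    if env is None:
        env = os.environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    settings = [
        ("cloud.aws.access_key", env["AWS_ACCESS_KEY_ID"]),
        ("cloud.aws.secret_key", env["AWS_SECRET_ACCESS_KEY"]),
        ("cloud.aws.region", env.get("AWS_DEFAULT_REGION", "us-west-2")),
        ("discovery.type", "ec2"),
        # ES_SECURITY_GROUP and ES_CLUSTER_NAME are hypothetical names.
        ("discovery.ec2.groups", env.get("ES_SECURITY_GROUP", "your-security-group")),
        ("cluster.name", env.get("ES_CLUSTER_NAME", "my-cluster-name")),
    ]
    return "\n".join("{0}: {1}".format(key, value) for key, value in settings)
```

The returned text could then be appended to `$ELASTICSEARCH_HOME/config/elasticsearch.yml` on each node, so the credentials live only in the node's environment and in the generated file.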

Start Elasticsearch

node$ sudo $ELASTICSEARCH_HOME/bin/elasticsearch &

## Check status of Elasticsearch cluster

You can check to see if all nodes are up and running by executing the following on any of the nodes.

node$ curl --user elastic:changeme 'localhost:9200/_cat/health?v'

With a 3-node cluster, the output should report 3 nodes in the node.total column.

You can also list the nodes from your cluster with the following

node$ curl 'localhost:9200/_cat/nodes?v'

The output should contain one row for each node in the cluster.

## Example Search with Python Elasticsearch Client

We will be using practice problems and datasets from exploringelasticsearch.com. That tutorial uses the popular JSON HTTP API, but we will go through the exercise using the [Python Elasticsearch client](https://elasticsearch-py.readthedocs.org/en/master/). The following can be executed on any node in the Elasticsearch cluster.

### Install Python Elasticsearch Client

node$ sudo apt-get install python-pip
node$ sudo pip install elasticsearch
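
With the client installed, the cluster health check from the previous section can also be done from Python. The sketch below is illustrative only: `cluster_is_ready` is a hypothetical helper, and it assumes the health response is a dictionary with `status` and `number_of_nodes` keys, as returned by the client's `es.cluster.health()` call.

```python
def cluster_is_ready(health, expected_nodes=3):
    """Return True when the health response reports a usable cluster.

    `health` is the dictionary returned by es.cluster.health().
    A "yellow" status is accepted because the cluster can still serve
    searches when some replica shards are unassigned.
    """
    status_ok = health.get("status") in ("green", "yellow")
    nodes_ok = health.get("number_of_nodes", 0) >= expected_nodes
    return status_ok and nodes_ok


# Usage against a live cluster (requires the elasticsearch package):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(http_auth=('elastic', 'changeme'))
#   print(cluster_is_ready(es.cluster.health()))
```

The function takes a plain dictionary so that it can be tried without a running cluster.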

Install git and clone the repository

node$ sudo apt-get install git
node$ git clone https://github.com/andrewvc/ee-datasets

### Put the movie_db data onto the Elasticsearch cluster

node$ cd ee-datasets
node$ java -jar elastic-loader.jar http://localhost:9200 datasets/movie_db.eloader
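
The loader above is the method this guide uses. As an alternative for your own data, documents can be sent through the Python client's bulk helper. The following is a rough sketch, not what `elastic-loader.jar` does internally: `to_bulk_actions` is a hypothetical helper, and it assumes each document is a dictionary carrying its identifier under an `_id` key.

```python
def to_bulk_actions(docs, index="movie_db", doc_type="movie"):
    """Yield one bulk action per document, in the dictionary form
    accepted by elasticsearch.helpers.bulk."""
    for doc in docs:
        source = dict(doc)          # copy so the caller's dict is untouched
        doc_id = source.pop("_id")  # keep the id as metadata, not in the body
        yield {
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            "_source": source,
        }


# Usage against a live cluster (requires the elasticsearch package):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch(http_auth=('elastic', 'changeme'))
#   helpers.bulk(es, to_bulk_actions(my_docs))
```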

### Perform a simple search in Python

Here we will simply look for a movie that contains the word CIA in its description field. When we loaded the movie_db data in the previous step, we actually created an index called movie_db in Elasticsearch.

node$ python
>>> from elasticsearch import Elasticsearch
>>> import json
>>> es = Elasticsearch(http_auth=('elastic','changeme'))
>>> result = es.search(index="movie_db", body={'query': {'match': {'description': 'CIA'}}})
>>> print(json.dumps(result, indent=2))

{
  "hits": {
    "hits": [
      {
        "_score": 0.067124054,
        "_type": "movie",
        "_id": "3",
        "_source": {
          "description": "A cast of characters challenge society's commonly held view that computer experts are not the beautiful people. Somehow, the CIA is hacked in under 5 minutes.",
          "title": "Swordfish",
          "actors": [
            "John Travolta",
            "Hugh Jackman",
            "Halle Berry"
          ],
          "genre": [
            "Action",
            "Crime"
          ],
          "_id": 3,
          "release_year": 2001
        },
        "_index": "movie_db"
      }
    ],
    "total": 1,
    "max_score": 0.067124054
  },
  "_shards": {
    "successful": 5,
    "failed": 0,
    "total": 5
  },
  "took": 25,
  "timed_out": false
}
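
The full response is verbose, so in practice you will usually pull out just the fields you need. As a small sketch, assuming the response has the shape shown above (matches under `hits.hits`, each with a `_score` and a `_source` containing a `title`), a hypothetical helper `summarize_hits` could reduce it to title and score pairs:

```python
def summarize_hits(result):
    """Return a list of (title, score) tuples from a search response."""
    return [
        (hit["_source"]["title"], hit["_score"])
        for hit in result["hits"]["hits"]
    ]


# Usage with the `result` variable from the session above:
#   for title, score in summarize_hits(result):
#       print(title, score)
```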

### Additional notes

If you are planning to shut down the Elasticsearch nodes, you can try the following:

# Shutdown local node
$ curl --user elastic:changeme -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'

# Shutdown all nodes in the cluster
$ curl --user elastic:changeme -XPOST 'http://localhost:9200/_shutdown'

