
ES hadoop problem finding the correct cluster nodes #636

Closed
pantrif opened this issue Dec 22, 2015 · 7 comments
pantrif commented Dec 22, 2015

Hello,
I cannot read or write data in elasticsearch using the following pig script:

REGISTER elasticsearch-hadoop-pig-2.2.0-beta1.jar
A = LOAD '/logs/log.bz2' USING PigStorage(',');
B = LIMIT A 3;
STORE B INTO 'test/apps' USING org.elasticsearch.hadoop.pig.EsStorage(
'es.index.auto.create=true',
'es.nodes=masterHost'
);

After running the above script I get the following error:

ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Encountered IOException. org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [_nodes/http] failed; server[masterHost/<hostip>:9200] returned [404|Not Found:]

For some reason the ES-Hadoop library makes an invalid request (host/ip:9200 instead of host:9200). I am using the latest stable version of ES-Hadoop (2.2.0) and I have a 3-node Elasticsearch cluster (2.1.0). Hadoop version: 2.4.0.2.1.5.0-695
Pig version: 0.12.1.2.1.5.0-695

Thanks in advance

ebuildy (Contributor) commented Dec 28, 2015

Can the Pig host access all of your ES hosts?

You can see your ES cluster topology by querying "/_cat/nodes"; make sure those IPs are accessible from the host where Pig runs.
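To run that check from the Pig host, one option is to pull the IP column out of the _cat/nodes response and probe each address. This is only a sketch: the node lines below are hypothetical placeholders, not this cluster's actual topology.

```shell
# Hypothetical sample of `curl -s 'http://localhost:9200/_cat/nodes?h=ip,name'`
# (substitute the real response from your cluster)
nodes='10.0.0.1 es-node-1
10.0.0.2 es-node-2
10.0.0.3 es-node-3'

# Extract the IP column and probe each node from the Pig host
printf '%s\n' "$nodes" | awk '{print $1}' | while read -r ip; do
  echo "checking $ip"
  # curl -s -o /dev/null -w '%{http_code}\n' "http://$ip:9200"
done
```

Any node whose probe does not return 200 from this host is one that ES-Hadoop will also fail to reach after discovery.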

pantrif (Author) commented Dec 28, 2015

Hello and thanks for the response.
Pig runs on the master node of the ES cluster. I have tried with es.nodes=localhost as well, and I get this exception:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]] 

It doesn't seem to find the cluster nodes using localhost.

However, when I GET localhost:9200 from an SSH session, I get no error:

# curl -XGET http://localhost:9200
{
  "name" : "<node name>",
  "cluster_name" : "<cluster name>",
  "version" : {
    "number" : "2.1.0",
    "build_hash" : "72cd1f1a3eee09505e036106146dc1949dc5dc87",
    "build_timestamp" : "2015-11-18T22:40:03Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}

ebuildy (Contributor) commented Dec 28, 2015

I am not sure, but I believe Elasticsearch for Hadoop works somewhat like ZooKeeper:

1- The Hadoop client (Pig, Spark...) uses the elasticsearch JAR and queries the node(s) you specify in the configuration

2- An ES node answers with who the master node in the cluster is

3- The Hadoop client then uses that IP

So, use the _cat API to get the real master IP of your ES nodes.
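For reference, _cat/master prints a single line with columns id, host, ip, and node, so the master's IP is the third field. The sample line below is hypothetical; in practice you would pipe the real response through the same awk expression.

```shell
# Hypothetical output of `curl -s http://localhost:9200/_cat/master`
master_line='Ntgn2Dcu 127.0.0.1 127.0.0.1 es-master'

# Column order is: id host ip node, so field 3 is the master's IP
master_ip=$(printf '%s\n' "$master_line" | awk '{print $3}')
echo "$master_ip"
```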

pantrif (Author) commented Dec 28, 2015

With _cat/nodes I can see the 4 nodes of the ES cluster.
The problem is that, for some reason, the Pig script can't access the ES master even though they are on the same host (on different ports).
I have registered this jar in order to run my script:
elasticsearch-hadoop-pig-2.2.0-beta1.jar

I have also tried older versions.

ebuildy (Contributor) commented Dec 28, 2015

So from the Pig host, you can curl the IP given by _cat/nodes without problems?

Also, enable log tracing: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/logging.html; maybe this will give some clues.

pantrif (Author) commented Dec 28, 2015

I ran this command to enable logging:

 pig -4 pig/log4j.properties pig/test.pig

log4j.properties:

log4j.category.org.elasticsearch.hadoop.pig=DEBUG
log4j.category.org.elasticsearch.hadoop.rest=DEBUG

test.pig:

REGISTER /home/hdfs/elasticsearch-hadoop-2.2.0-beta1.jar
A = LOAD '/logs/log.bz2' USING PigStorage(',');
B = LIMIT A 3;
STORE B INTO 'test/logs' USING org.elasticsearch.hadoop.pig.EsStorage(
'es.nodes=my-host-address'
);

The only debug messages I get:

15/12/28 12:25:28 DEBUG pig.EsStorage: Using pre-defined writer serializer [org.elasticsearch.hadoop.pig.PigValueWriter] as default
15/12/28 12:25:28 DEBUG pig.EsStorage: Using pre-defined reader serializer [org.elasticsearch.hadoop.pig.PigValueReader] as default
15/12/28 12:25:28 DEBUG pig.EsStorage: Using pre-defined field extractor [org.elasticsearch.hadoop.pig.PigFieldExtractor] as default

And the error:

15/12/28 12:26:14 ERROR pigstats.SimplePigStats: ERROR: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [_nodes/http] failed; server[my-host-address/my-ip-address:9200] returned [404|Not Found:]
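Since the failing request comes from node discovery resolving the host to an address the job then can't use, one workaround worth trying (a sketch, untested here) is to pin the connector to the declared node with an explicit port and turn discovery off via the es.nodes.discovery setting:

```pig
REGISTER /home/hdfs/elasticsearch-hadoop-2.2.0-beta1.jar
A = LOAD '/logs/log.bz2' USING PigStorage(',');
B = LIMIT A 3;
STORE B INTO 'test/logs' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes=my-host-address:9200',  -- declare the port explicitly
    'es.nodes.discovery=false'        -- only talk to the declared node
);
```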

costin (Member) commented Jan 10, 2016

@pantrif This has been fixed some time ago through #641. Please try out the latest 2.2 (2.2-rc1) and report back.

Cheers,
