
Es-Hadoop ingestion through Pig is missing the mappings #405

Closed
BalachanderGS opened this issue Mar 26, 2015 · 8 comments

@BalachanderGS

ES:

curl -XDELETE 'http://localhost:9200/ztmp_inventory_tool_sample'

curl -XPOST localhost:9200/ztmp_inventory_tool_sample -d '{
  "settings" : {
    "term_index_interval" : 256,
    "term_index_divisor" : 5
  },
  "mappings" : {
    "invData" : {
      "_source" : { "enabled" : true },
      "properties" : {
        "ekv_raw" : { "type" : "byte" },
        "ekv_flight" : { "type" : "byte" },
        "event_id" : { "type" : "long" },
        "cookie_id" : { "type" : "long" },
        "dpId" : { "type" : "short" },
        "vertical" : { "type" : "string" },
        "activity_group" : { "type" : "string" },
        "activity" : { "type" : "string" },
        "eventDateTime" : { "type" : "date", "format" : "YYYY-MM-dd'"'T'"'HH:mm:ss.SSSZ" },
        "departureDate" : { "type" : "date", "format" : "YYYY-MM-dd", "ignore_malformed" : true },
        "returnDate" : { "type" : "string" },
        "origin" : { "type" : "string" },
        "destination" : { "type" : "string" },
        "destination_country_code" : { "type" : "string" },
        "destination_state" : { "type" : "string" },
        "destination_city" : { "type" : "string" },
        "carrier" : { "type" : "string" },
        "cabinClassGroup" : { "type" : "string" },
        "currency" : { "type" : "string" },
        "travelers" : { "type" : "short" },
        "duration" : { "type" : "short" },
        "bookedDate" : { "type" : "date", "format" : "YYYY-MM-dd", "ignore_malformed" : true },
        "airFare" : { "type" : "float" }
      }
    }
  }
}'

REGISTER elasticsearch-hadoop-2.1.0.Beta3/dist/elasticsearch-hadoop-2.1.0.Beta3.jar

DEFINE EsStorage org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout=1m',
'es.resource=ztmp_inventory_tool_sample/invData',
'es.mapping.pig.tuple.use.field.names = true',
'es.http.timeout = 5s',
'es.index.auto.create = false',
'es.nodes = 68.67.141.63',
'es.port = 9200');

testEs = LOAD '/user/bganapathy/flightPKeys' USING PigStorage('|') AS (ekv_raw: int, ekv_flight: int, eventId: long, cookie_id: long, dpId: int, vertical: chararray, activity_group: chararray, activity: chararray, eventDateTime: chararray, departureDate: chararray, returnDate: chararray, origin: chararray, destination: chararray, destination_country_code: chararray, destination_state: chararray, destination_city: chararray, carrier: chararray, cabinClassGroup: chararray, currency: chararray, travelers: chararray, duration: chararray, bookedDate: chararray, airFare: chararray);

testEs = LIMIT testEs 100000;
STORE testEs INTO 'ztmp_inventory_tool_sample/invData' USING EsStorage();

Logs:

2015-03-24 22:10:34,544 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2015-03-24 22:10:34,755 [JobControl] INFO org.elasticsearch.hadoop.mr.EsOutputFormat - Writing to [ztmp_inventory_tool_sample/invData]
2015-03-24 22:10:34,774 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-03-24 22:10:34,774 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-03-24 22:10:34,776 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2015-03-24 22:10:35,971 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201408291723_27582
2015-03-24 22:10:35,971 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases testEs
2015-03-24 22:10:35,971 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C: R: testEs[-1,-1]
2015-03-24 22:10:35,971 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://hnn101-lax1:50030/jobdetails.jsp?jobid=job_201408291723_27582
2015-03-24 22:10:47,051 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 75% complete
2015-03-24 22:10:59,619 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 92% complete
2015-03-24 22:11:15,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-03-24 22:11:15,721 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.4.0 0.11.0-cdh4.4.0 bganapathy 2015-03-24 22:09:46 2015-03-24 22:11:15 LIMIT

Success!

Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias FeatureOutputs
job_201408291723_27581 12 1 8 5 7 7 15 15 15 15 testEs

job_201408291723_27582 1 1 5 5 5 5 17 17 17 17 testEs ztmp_inventory_tool_sample/invData,

Input(s):
Successfully read 1200000 records (19469816 bytes) from: "/user/bganapathy/flightPKeys"

Output(s):
Successfully stored 100000 records in: "ztmp_inventory_tool_sample/invData"

@BalachanderGS (Author)

From Costin:

@BalachanderGS This is an es-hadoop issue and should have been posted under the elastic/elasticsearch-hadoop issue tracker.

The read/write operations succeed, so likely the issue is that the data hits an ES cluster, just not the one you are expecting. In the past we've seen issues with the DEFINE command ignoring the given configuration - in other words, while you point EsStorage at 'es.nodes = 68.67.141.63', Pig might disregard that and fall back to the default, namely localhost.
You can double-check this in several ways (see the sketch after this list):

  1. Make sure there's no cluster on your localhost. If you are running the Pig script against a remote cluster, double-check your Elasticsearch IP.
  2. Turn on logging (see the docs) to double-check which IP/node the connection is made to.
  3. Make sure you are looking at the right cluster when checking the mapping.
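For example, a quick way to see which cluster actually received the data is to ask each candidate node for the index mapping (68.67.141.63 is the node from the script above; adjust as needed):

curl 'http://localhost:9200/ztmp_inventory_tool_sample/_mapping?pretty'
curl 'http://68.67.141.63:9200/ztmp_inventory_tool_sample/_mapping?pretty'

If the index turns up on localhost with auto-generated mappings, the DEFINE parameters were dropped. As a workaround sketch - assuming your es-hadoop version picks up global properties declared through Pig's SET command, as described in its Pig configuration docs - the same settings can be declared once for the whole script instead of inside DEFINE:

SET es.nodes '68.67.141.63';
SET es.port '9200';
SET es.index.auto.create 'false';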

I'm closing the issue for now - if you still have issues, please open a new one in the es-hadoop project.

Thanks,

@BalachanderGS (Author)

I think it is going to the right IP, as I am able to see the indexing through Marvel (on the desired cluster). We are on Pig 0.11 and CDH 4.4. Do you think there is some version issue here?

@costin (Member) commented Mar 30, 2015

The versions should work. Pig is somewhat old but still supported. Again, there are no errors in the logs, and the logs clearly report the number of records written. Consider turning on logging to see the network traffic and which ES nodes are hit.

@BalachanderGS (Author)

Thanks. Should I turn on verbose logging in ES or in Hadoop?

@costin (Member) commented Mar 30, 2015

Neither. Turn it on in the es-hadoop connector.

@BalachanderGS (Author)

Thanks. I use the JAR that has the EsStorage class to write from Pig. Is there a config file I should drop in the JAR location that overrides the default log level?

@costin (Member) commented Mar 30, 2015

When in doubt, see the reference manual:
http://www.elastic.co/guide/en/elasticsearch/hadoop/2.1.Beta/logging.html
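For reference, a minimal log4j.properties sketch for surfacing the connector's network traffic (assuming log4j 1.x, as used by Hadoop/Pig of this vintage; the exact category names are listed on the logging page above). Drop it on the job's classpath:

# log requests made by the es-hadoop REST layer, including which node they target
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
# and the Pig integration itself
log4j.logger.org.elasticsearch.hadoop.pig=DEBUG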


@costin (Member) commented Apr 28, 2015

Closing the issue; if the problem persists please create a new one.
