
ElasticSearch, Hive, and HBase #439

Closed
tugisuwon opened this issue Apr 29, 2015 · 6 comments

Comments

@tugisuwon

Hello,

I have HDP 2.2 installed on my server and I am using Elasticsearch to do indexing as follows:

  1. Load the HBase table.
  2. Create an external table in Hive associated with the HBase table (message is JSON stored as a string):
create external table test (rowkey string, cre_dat timestamp, message string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' =':key,c:cre_dat,c:message') TBLPROPERTIES ('hbase.table.name' = 'hbase');
  3. Create an external table in Hive associated with ElasticSearch:
create external table json (data string) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' tblproperties('es.resource'='1p/2015','es.input.json'='yes','es.nodes'='localhost');
  4. Create the '1p' index on Elasticsearch with a mapping file.
  5. Load Elasticsearch as follows:
insert into table json select regexp_replace(message,'\'','\"') from test;

Now I am getting the following error: after pushing about 1300~1500 records to Elasticsearch, the command from step 5 fails with the output below:

Query ID = hdfs_20150429123737_af5f9ba1-1272-499b-9df7-5651af1fb1e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430276814521_0018, Tracking URL = http://localhost:8088/proxy/application_1430276814521_0018/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job  -kill job_1430276814521_0018
Hadoop job information for Stage-0: number of mappers: 4; number of reducers: 0
2015-04-29 12:37:54,240 Stage-0 map = 0%,  reduce = 0%
2015-04-29 12:38:34,316 Stage-0 map = 100%,  reduce = 0%
Ended Job = job_1430276814521_0018 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1430276814521_0018_m_000000 (and more) from job job_1430276814521_0018
Examining task ID: task_1430276814521_0018_m_000001 (and more) from job job_1430276814521_0018

Task with the most failures(4):
-----
Task ID:
  task_1430276814521_0018_m_000000

URL:
  http://localhost:8088/taskdetails.jsp?jobid=job_1430276814521_0018&tipid=task_1430276814521_0018_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"[some json message"}
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"[some json message]'}"}
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
        at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
        ... 8 more
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [Bad Request(400) - [MapperParsingException[failed to parse]; nested: JsonParseException[Unexpected character ('s' (code 115)): was expecting comma to separate OBJECT entries
 at [Source: [B@2967c4c9; line: 1, column: 4581]]; ]]; Bailing out..
        at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:202)
        at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:166)
        at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:199)
        at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:223)
        at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:161)
        at org.elasticsearch.hadoop.rest.RestRepository.writeProcessedToIndex(RestRepository.java:154)
        at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:63)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:689)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:120)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
        at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
        ... 9 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-0: Map: 4   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

Can anyone help me with this issue? I manually redacted the JSON message in the error for privacy.
I have over 2 million rows in the HBase table that need to be pushed to ES.

Thank you in advance.

@tugisuwon
Author

I think it is due to the regexp_replace part, where I replace ' with ". When I load data into HBase, the JSON is stored as a string with single quotes. When I try to load the JSON message from HBase to Elasticsearch via Hive, I have to convert the single quotes to double quotes. When this is done in MapReduce, it seems that Hive fails to replace the single quotes with double quotes before sending the row to ElasticSearch.

I might be wrong, but that is what I am guessing. Please let me know what I can do.

Thanks again.

@costin
Member

costin commented Apr 29, 2015

@tugisuwon I've been meaning to reply. First off, I've added minor formatting to your post to make it readable - hope you don't mind.
Your setup is sound and, based on the exception, it does look like the input is incorrect. JSON can be tricky and a missing comma or quote can render it useless. Do note that JSON in general doesn't accept ' but rather ".
To help with debugging, you could dump the data to HDFS, do the replace there and then try to validate it - this should quickly spot any problems that you might have.
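
A minimal sketch of that validation step (not part of the original thread; it assumes the replaced rows have already been dumped to a text file with one JSON document per line, and the file path is a hypothetical placeholder):

```python
import json
import sys

def validate(path, max_errors=10):
    """Parse each line as JSON and report the first few failures."""
    errors = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except ValueError as e:
                errors += 1
                print(f"line {lineno}: {e}", file=sys.stderr)
                if errors >= max_errors:
                    break
    return errors

if __name__ == "__main__":
    # Usage: python validate_json.py /path/to/dumped_rows.txt
    bad = validate(sys.argv[1])
    print("invalid documents found:", bad)
```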

By the way, if you need to do this replacing, it means your initial data is not proper JSON, so I would rather correct that; it would simplify things not just in this case but for any other JSON-consuming system you might want to use.

Hope this helps,

@tugisuwon
Author

@costin Thank you for the quick reply. It's my first time using GitHub, so please make any modifications as necessary :)

Let me share a little more details about my setup.

I loaded the data into HBase using Python, so the actual JSON message was converted to a string during this process. You are right that the initial data is not proper JSON anymore, but a string.

This is why I added regexp_replace(message, '\'', '\"') in Hive before sending it to ES as JSON.

example:
column value in HBase that needs to be loaded to ES: {'a':'123', 'b':'abc'}
Hive: regexp_replace => {"a":"123", "b":"abc"}

Is there any way for me to load HBase with proper JSON instead of a string so that I don't have to worry about this? Or do you know of a way to convert the string to JSON correctly in Hive?

Sorry about all the questions... but I really appreciate your help and feedback.

Thanks.

@costin
Member

costin commented Apr 29, 2015

Why not correct the JSON string directly? Instead of replacing quotes afterwards, use a JSON library when loading the data into HBase from Python.

Blindly replacing the quotes will cause problems - for example, consider the string "what's this?". In HBase it's probably stored as 'what's this?', so replacing ' with " turns it into "what"s this?".

Basically what I'm suggesting is to address the problem at the source - when you are generating the string that is about to be loaded into HBase - since, again, that string is not JSON and there's too much information loss for a simple replace to work. Not to mention that it's inefficient: instead of streaming the data, you now have to transform it, which acts as a bottleneck that shouldn't be needed.
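
As a sketch of that suggestion (not from the original thread; the record below is hypothetical and the actual HBase write call is omitted), serialize the value with Python's json module instead of calling str() on the dict:

```python
import json

# Hypothetical record, as a Python dict, before it is written to HBase.
record = {"a": "123", "b": "what's this?"}

# str() produces a Python repr with single quotes -- not valid JSON:
#   {'a': '123', 'b': "what's this?"}
as_str = str(record)

# json.dumps() produces proper JSON with double quotes and correct escaping:
#   {"a": "123", "b": "what's this?"}
as_json = json.dumps(record)

# Store `as_json` in the HBase c:message column (write call omitted here);
# the Hive-to-Elasticsearch insert can then stream it without any regexp_replace.
print(as_str)
print(as_json)
```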

@tugisuwon
Author

@costin Thanks again.

I finally found out that there was a non-English word in the message that caused the problem. I am able to run everything now.

Thanks again.

@costin
Member

costin commented Apr 30, 2015

Glad to hear it; however, my advice still stands. By using a JSON library, cases like this that require quoting or escaping will no longer occur.

Closing the issue.

@costin costin closed this as completed Apr 30, 2015
@costin costin added the invalid label Apr 30, 2015