
incompatible with CDH4 YARN/Hadoop 2 #8

Closed
github4venkat opened this issue Mar 19, 2013 · 18 comments

Comments

@github4venkat

Will you be releasing a version of elasticsearch-hadoop that compiles against Hadoop 2.x,
or could someone explain how to make it work with that version?

Thanks,
Venkat

Pig Stack Trace

ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.elasticsearch.hadoop.mr.ESOutputFormat.checkOutputSpecs(ESOutputFormat.java:104)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:77)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)

@Downchuck

At first glance, I don't see any interface changes between CDH2 and CDH4 - I was able to get CDH2 running with this package. checkOutputSpecs doesn't do much, so you can certainly download this project and just have the method do nothing (a no-op); a sketch of that patch follows. Given how new this project is, I strongly recommend downloading and patching for the next week or two - the source code is quite simple.
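
A minimal sketch of that patch, assuming it is applied inside the project's ESOutputFormat and the project is then recompiled against the CDH4 jars (the comment is illustrative, not from the project):

    @Override
    public void checkOutputSpecs(JobContext context) throws IOException, InterruptedException {
        // no-op: skip the validation that trips over the Hadoop 1.x/2.x
        // JobContext class-vs-interface change
    }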

@costin
Member

costin commented Mar 19, 2013

Hi,

CDH is just one of the distros we will be testing against to make sure we're compatible. Currently we're aiming just for vanilla Apache Hadoop, since we're in the early days and things are still in flux.
I suspect the error comes from the Hadoop 2.x/YARN changes ported back to CDH4 - I know it's not a lot but it's a start.
We'll look into it as soon as possible but it will definitely not be this week (am currently travelling).

Cheers,

@costin
Member

costin commented Mar 20, 2013

@github4venkat Btw, what version of CDH are you using - 4.1.x, 4.5.x? Is upgrading to the latest one an option?

@github4venkat
Author

@ALL Thanks for the replies.

@costin I am using 4.1.x.
To make it work, I had to ignore (comment out or replace) the code that only compiled against the old Hadoop core - for example, TaskAttemptContext used as a class was not accepted by Hadoop 2.x.
But the project works very well: I was able to bulk index documents into my ES cluster.
Thanks,
Venkat

@costin
Member

costin commented Mar 20, 2013

@github4venkat Great to hear it works for you - this is just an initial drop and we have a lot more goodies in store.

The class incompatibilities in Hadoop 2.x/CDH4 are quite unfortunate (they could have used a different package) and we'll probably have to create a separate branch/artifact/package for it, since it's not backwards compatible. Will keep you posted.

@github4venkat
Author

@costin Yes, I understand this community is pretty new - but helpful. Sure, I will look forward to updates in this space. This came just in time, as I was looking to write my own loader to bulk index data from HDFS into Elasticsearch.

@costin
Member

costin commented Mar 26, 2013

@github4venkat Hi, I've tried to replicate the issue, with little success. I assume you're using the YARN/Hadoop 2.x version instead of MR1/Hadoop 1.x?
Since JobContext has been changed from a class to an interface (so much for backwards compatibility - why not add another interface?), one needs to recompile; a small illustration follows. However, you mentioned you also had to comment out TaskAttemptContext, and I don't see why that would be necessary - could you comment on this?
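
For illustration, a hypothetical snippet (JobContextProbe is made up, not part of this project) showing why a recompile is enough:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.JobContext;

    // A call through JobContext, like the one below, is compiled to
    // invokevirtual against Hadoop 1.x (where JobContext is a class) but
    // must be invokeinterface on Hadoop 2.x (where it is an interface).
    // Bytecode built against 1.x therefore fails at runtime on 2.x with
    // IncompatibleClassChangeError; the same source recompiled against
    // the 2.x jars works fine.
    public class JobContextProbe {
        public static String outputDir(JobContext context) {
            Configuration conf = context.getConfiguration();
            return conf.get("mapred.output.dir");
        }
    }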

Note that support for Hadoop 2 (as opposed to CDH4 MR1) is problematic since Hadoop 2 is not yet stable. I would not recommend using it: it's simply work in progress, and many (if not all) of the projects in the Hadoop ecosystem, such as Pig or Hive, do not support it.

@costin
Member

costin commented Mar 26, 2013

@github4venkat A quick update - I've added a new branch, called cdh, that allows the project to be compiled against CDH4 YARN/Hadoop 2.x (you can change the version to MRv1 as well). The project compiles cleanly on both versions - please confirm whether you still experience any issues.
Note that master has been updated as well - the branch, however, contains the CDH4 dependencies to make things easier.
By the way, instead of listing the dependencies by hand, I've used the Cloudera-recommended approach listed here.

P.S. I've updated the issue title as well to better reflect the problem.

@github4venkat
Author

@costin Thanks for creating the branch. Meanwhile I had written custom MR jobs to load data into Elasticsearch, but I will try your branch to compare performance and use it if it's better.
And yes, I did comment out the 'interface but class was expected' parts, because that was the simplest thing to do at the time - all I needed from them was the ES config details. (I did that to quickly test whether the functionality suits my use case.)

I will post here once I have tested your branch on my Hadoop 2.x environment. Thanks.

@costin
Member

costin commented Mar 28, 2013

@github4venkat I'd be interested to know what the difference is between your custom MR jobs and what we provide - what do we lack?

@github4venkat
Author

@costin I just ran a test loading a 54 MB file: with no additional settings in my Pig script, it takes 1.5 hrs. I'm not sure if the code hits REST for each event - if so, is there a setting that lets me use bulk loading into Elasticsearch?

In my custom MapReduce job, I write so that the reducers bulk load into Elasticsearch using a BulkProcessor, roughly as sketched below.
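
A rough sketch of that reducer shape, assuming the Elasticsearch Java client's BulkProcessor is on the classpath (the EsBulkReducer class and index/type names are illustrative, not from this project):

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.Client;

    public class EsBulkReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

        private Client client;      // assumed to be built in setup(), e.g. a TransportClient
        private BulkProcessor bulk;

        @Override
        protected void setup(Context context) {
            bulk = BulkProcessor.builder(client, new BulkProcessor.Listener() {
                public void beforeBulk(long id, BulkRequest request) {}
                public void afterBulk(long id, BulkRequest request, BulkResponse response) {}
                public void afterBulk(long id, BulkRequest request, Throwable failure) {}
            }).setBulkActions(1000).build();   // flush every 1000 documents
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            for (Text json : values) {
                // each value is assumed to already be a JSON document
                bulk.add(new IndexRequest("myindex", "mytype").source(json.toString()));
            }
        }

        @Override
        protected void cleanup(Context context) {
            bulk.close();   // flush whatever is still buffered
        }
    }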

Let me know if any external settings will improve my loading performance.

@costin
Member

costin commented Mar 28, 2013

@github4venkat There are two things to consider here:

  1. The code is currently still in its early days: the REST interface doesn't use the bulk loading endpoint yet, but it will shortly. Moreover, the load will be done in parallel, which should give similar if not better performance. (A sketch of the bulk endpoint follows this list.)
  2. Pig/Hive add significant overhead over a custom MR job. We do support dedicated input/output formats (they are what Hive and Pig use underneath).
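
For reference, a minimal sketch of what a call to the _bulk endpoint looks like, assuming an Elasticsearch node on localhost:9200 (the index/type names and documents are made up):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BulkPost {
        public static void main(String[] args) throws Exception {
            // _bulk takes newline-delimited JSON: an action/metadata line
            // followed by the document source; the payload must end in \n
            String body =
                "{\"index\":{\"_index\":\"radio\",\"_type\":\"artists\"}}\n" +
                "{\"name\":\"doc one\"}\n" +
                "{\"index\":{\"_index\":\"radio\",\"_type\":\"artists\"}}\n" +
                "{\"name\":\"doc two\"}\n";

            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:9200/_bulk").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }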

Note that the main advantage of REST, and why that route was chosen, is having a small, stand-alone, no-dependency jar. That's why the Elasticsearch jar is not pulled in - we don't want any extra dependencies, since these tend to be tedious once deployed across multi-node clusters.

To conclude, stay tuned as the functionality matures over the next couple of weeks. Performance is a key element that I'm aiming for, but currently I'm focused on drafting the integration points (Hive/Pig in place, with Cascading following suit).

@github4venkat
Author

@costin Sure, I will look forward to bulk loading support in this space. Thanks!

@tzolov

tzolov commented Apr 22, 2013

@costin Do you plan to merge the cdh branch into master?
It would be quite handy: one could build against different Hadoop distributions/versions without manually tweaking the Gradle configuration.
Or perhaps there is a better way to do this with Gradle?

@costin
Member

costin commented Apr 24, 2013

At some point, yes. For a local build one could specify a prefix/profile, while in the build system we'll probably produce and upload different artifacts.

@costin
Member

costin commented Oct 23, 2013

Closing this, as we now publish a yarn binary (and have since before 1.3 M1).

@costin costin closed this as completed Oct 23, 2013
@nfx

nfx commented Jan 10, 2014

It's still an issue with CDH 4.5.0:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.elasticsearch.hadoop.mr.ESOutputFormat.checkOutputSpecs(ESOutputFormat.java:163)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:987)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)

@costin
Member

costin commented Jan 10, 2014

Make sure you are using the 'yarn' artifact:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/download.html

