java.io.FileNotFoundException: File does not exist #137

Closed
geofferyzh opened this Issue Oct 5, 2012 · 15 comments


I installed RHadoop (the rmr2 and rhdfs packages) on my Cloudera CDH3 virtual machine yesterday. When I tried to run the second tutorial example, the job seemed to finish correctly, but a "FileNotFoundException" occurred.

I had the same error message when trying to run the kmeans.R example. What did I do wrong here?

Thanks,
Shaohua


library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: itertools
Loading required package: iterators
Loading required package: digest
Loading required package: functional
library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/bin/hadoop-0.20

Be sure to run hdfs.init()

hdfs.init()

groups = rbinom(32, n = 50, prob = 0.4)
groups = to.dfs(groups)

12/10/05 12:08:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/05 12:08:46 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/05 12:08:46 INFO compress.CodecPool: Got brand-new compressor

from.dfs(mapreduce(input = groups, map = function(k,v) keyval(v, 1), reduce = function(k,vv) keyval(k, length(vv))))
packageJobJar: [/tmp/RtmpYyfscn/rmr-local-env163a6ff6d07e, /tmp/RtmpYyfscn/rmr-global-env163a5affd46d, /tmp/RtmpYyfscn/rmr-streaming-map163a474285bd, /tmp/RtmpYyfscn/rmr-streaming-reduce163a63a9bfda, /var/lib/hadoop-0.20/cache/training/hadoop-unjar6175181393484515689/] [] /tmp/streamjob5614144549414339994.jar tmpDir=null
12/10/05 12:09:00 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/05 12:09:00 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/training/mapred/local]
12/10/05 12:09:00 INFO streaming.StreamJob: Running job: job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: To kill this job, run:
12/10/05 12:09:00 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201210051159_0001
12/10/05 12:09:00 INFO streaming.StreamJob: Tracking URL: http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201210051159_0001
12/10/05 12:09:01 INFO streaming.StreamJob: map 0% reduce 0%
12/10/05 12:09:09 INFO streaming.StreamJob: map 100% reduce 0%
12/10/05 12:09:18 INFO streaming.StreamJob: map 100% reduce 100%
12/10/05 12:09:20 INFO streaming.StreamJob: Job complete: job_201210051159_0001
12/10/05 12:09:20 INFO streaming.StreamJob: Output: /tmp/RtmpYyfscn/file163a71ea0500
Exception in thread "main" java.io.FileNotFoundException: File does not exist: 3
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:546)
at org.apache.hadoop.streaming.DumpTypedBytes.run(DumpTypedBytes.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
$key
[1] 0

$val
[1] 50

Collaborator

piccolbo commented Oct 5, 2012

Thanks for your report. From other reports and our own experiments it appears to be an innocuous error, and it's already fixed in the upcoming 2.0.1. You can grab it from GitHub (branch rmr-2.0.1) if you know how, or just wait for the next release.

Antonio


Thanks for your quick reply.

@piccolbo piccolbo closed this Oct 5, 2012

@piccolbo piccolbo reopened this Oct 5, 2012

Collaborator

piccolbo commented Oct 5, 2012

Are you sure the result is correct? Maybe you should run it once more, but print groups before calling to.dfs so that you know what to expect. It appears your sample had 50 zeros in it.
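
For illustration, a minimal sketch of that check, using the same tutorial sample (the variable names are arbitrary):

library(rmr2)

# Draw the sample and inspect it locally before pushing it to HDFS,
# so you know what (key, count) pairs the MapReduce job should return
groups <- rbinom(32, n = 50, prob = 0.4)
print(groups)    # the raw sample
table(groups)    # expected counts per value

groups.dfs <- to.dfs(groups)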

OK, so I ran the following command to print the output, but got something unreadable...

hadoop fs -cat /tmp/RtmpYyfscn/file163a403e27e2/part-00000

output:
Q�/org��2[training@server0 ~]$ .TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritableͤ������-���M����

What could have gone wrong?

Collaborator

piccolbo commented Oct 5, 2012

And where did it say that the default format is a text format? If you really want to delve into the minutiae of formats like I have to, pipe that through hexdump, but it still takes some serious pain tolerance. My suggestion: use from.dfs. If the data is too big, from.dfs(rmr.sample(...)). If you need that data outside R, the other formats are probably a better choice; I would recommend "sequence.typedbytes" to connect with the rest of the Java world, or "csv" in any other case.

Antonio
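
For illustration, a hedged sketch of those suggestions: read the result back with from.dfs instead of cat-ing the binary part files, or ask the job for a "csv" output so the files are readable outside R. The output path here is made up, and the output.format usage reflects a reading of the rmr2 documentation rather than anything verified in this thread:

library(rmr2)
groups <- to.dfs(rbinom(32, n = 50, prob = 0.4))

# Read the job result back into R as native R objects
result <- from.dfs(mapreduce(input = groups,
                             map = function(k, v) keyval(v, 1),
                             reduce = function(k, vv) keyval(k, length(vv))))

# If the output must be readable outside R, request a text format;
# the output path and the output.format argument are assumptions
mapreduce(input = groups,
          output = "/tmp/group-counts-csv",
          map = function(k, v) keyval(v, 1),
          reduce = function(k, vv) keyval(k, length(vv)),
          output.format = "csv")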


Thanks. I'm new to both Hadoop and R and have only worked with text files in Hadoop, so I wrongly assumed that "everything" is in text format...

The result I got was incorrect: I got all 50 zeros. "from.dfs(to.dfs(rbinom(32, n = 50, prob = 0.4)))" gives me all zeros.

$key
NULL

$val
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

Any pointers?

Collaborator

piccolbo commented Oct 9, 2012

Are you using a 32-bit platform? (On Unix and OS X, enter uname -a at the shell prompt; on Windows I don't know.) Yesterday we found a problem with serialization on 32-bit platforms. It's not something Revolution supports, but a user is working on a patch, so this may get fixed in the next release.

Antonio
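
For reference, the same check can be done from inside R (all base R, no rmr2 assumptions):

Sys.info()[["machine"]]   # "x86_64" on a 64-bit OS, "i686"/"i386" on 32-bit
.Machine$sizeof.pointer   # 8 for a 64-bit build of R, 4 for a 32-bit build
R.version$arch            # architecture R itself was built for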


Thanks. I'm using Cloudera's CDH3 training VM on Windows 7.

While waiting for the next release, is there a way that I can install RMR1 so that I can start learning RHadoop?

Collaborator

piccolbo commented Oct 9, 2012

Given that there are some important changes in the API moving to rmr2, and that rmr2 seems in general simpler for people to pick up, that doesn't seem a good investment of your time. What I suggest is that you change your VM, not your rmr version. I am running the Cloudera CDH4 VM in VirtualBox and it is running 64-bit. Could you run uname -a at a terminal inside the VM and paste the output in a message?
This is what I get:

[cloudera@localhost ~]$ uname -a
Linux localhost.localdomain 2.6.18-308.8.2.el5 #1 SMP Tue Jun 12 09:58:12 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[cloudera@localhost ~]$

From the VirtualBox manual it appears that it's a matter of choosing the right configuration at VM creation time. It may be different with the virtualization software you are using.
Thanks

Antonio


Linux server0.training.local 2.6.18-238.9.1.el5 #1 SMP Tue Apr 12 18:10:56 EDT 2011 i686 i686 i386 GNU/Linux

This is the printout I got.

I will try to install CDH4, thanks

I have the same issue geofferyzh had above
#137 (comment)

I'm working on CentOS 5.8 32-bit, so I will wait for the patch.

Collaborator

piccolbo commented Oct 19, 2012

The patch is in the 2.0.1 branch, but nobody has the resources to test it. Your best bet is to take the matter into your own hands and test and fix it. It's a conversion issue, so if you do a from.dfs(to.dfs(1:10)) you'll have the answer we need. Your even better bet is to switch to 64-bit, given the very thin support for 32-bit from most of the industry. Just because I put in a tentative patch, I don't want you to think that 32-bit is making a roaring comeback; 32-bit is still on its way out.


> from.dfs(to.dfs(1:10))
12/10/20 00:13:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:16 INFO compress.CodecPool: Got brand-new compressor
12/10/20 00:13:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:13:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:13:20 INFO compress.CodecPool: Got brand-new decompressor
$key
NULL

$val
 [1]  1  2  3  4  5  6  7  8  9 10

>

I think the patch is working, but there is another error: when calling the mapreduce function (with any example from the tutorial) I get

12/10/20 00:12:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/20 00:12:01 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/20 00:12:01 INFO compress.CodecPool: Got brand-new compressor
Error in do.call(paste.options, backend.parameters) : 
  second argument must be a list
>

Finally, I agree with you about the architecture. I'm using RHadoop on my personal computer and I'm not deploying anything to production, so I need to study MapReduce, Hadoop, and R together.

Thanks for everything

Collaborator

piccolbo commented Oct 19, 2012

Regarding the error

Error in do.call(paste.options, backend.parameters) :
  second argument must be a list

that was an obvious bug in my patch; sorry about that, I checked in before testing. Not this time: I will check in after testing, which is going on right now.

As for studying MapReduce, Hadoop, and R together: all of these run on 64-bit.

Antonio

Collaborator

piccolbo commented Oct 25, 2012

@nophiq I may have missed one of your reports (the backend.parameters problem) in the midst of your message. I believe I fixed it based on a separate report, but please check again for me in the 2.0.1 branch. Also, it would really help me if we keep it to one problem per issue: an issue is either open or closed, and if there are two problems in one issue I can't mark one closed and the other open.

@piccolbo piccolbo closed this Oct 25, 2012
