
hdfs.read() cannot load all data from huge csv file on hdfs #8

Open

strategist922 opened this issue May 7, 2014 · 0 comments

@strategist922
Hi,
I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA cluster,
and I use the following code to read a file from HDFS:


Sys.setenv(HADOOP_CMD = "/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR = "/usr/lib/hadoop/lib/native/")
library(rmr2)
library(rhdfs)
library(lubridate)
hdfs.init()

# Open the file with a 100 MB buffer, read it into a raw vector,
# convert it to character, and parse it as CSV.
f <- hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
m <- hdfs.read(f)
c <- rawToChar(m)
data <- read.table(textConnection(c), sep = ",")


When I use dim(data) to verify the result, it shows the following:

[1] 1523 7

However, the row count should be 134279407, not 1523.
I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49 ...", i.e. only 2^17 bytes were read. There is a thread on the hadoop-hdfs-user mailing list about this ("why can FSDataInputStream.read() only read 2^17 bytes in hadoop2.0?").
Ref.
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E
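
One possible workaround is to collect the file chunk by chunk in a loop. This is only a minimal sketch, assuming that repeated hdfs.read() calls continue from the current file position and return NULL (or an empty raw vector) at end of file:

# Sketch of a chunked read, assuming hdfs.read() advances the file
# position on each call and returns NULL/empty at end of file.
f <- hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
chunks <- list()
repeat {
  chunk <- hdfs.read(f)  # in this setup each call returned only 2^17 bytes
  if (is.null(chunk) || length(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
}
hdfs.close(f)
m <- do.call(c, chunks)  # concatenate all raw chunks into one vector
data <- read.table(textConnection(rawToChar(m)), sep = ",")

If this works as assumed, length(m) should match the full file size on HDFS before read.table() parses it.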

Is this a bug in hdfs.read() in rhdfs 1.0.8?

Best Regards,
James Chang
