First Line Text File Input #72

Open
ekohlwey opened this Issue Apr 5, 2012 · 3 comments


ekohlwey commented Apr 5, 2012

In the world of stats in particular, it is very common to get CSV (or similar) data dumps where the first row is a header carrying some sort of information.

Users may want to guarantee that TextInputFormat does not split these files, so that they can use a global variable to track whether the header has been read yet and do the appropriate thing when the line in question is the header (usually, extract some useful data from it that will be available to the rest of the task).

My current use case is some data where the column headers are the indices of the columns in a sparse matrix.
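To make that use case concrete, here is a minimal sketch in plain Java of what a task might do with such a header. Everything in it is an assumption made for illustration: the class and method names (`SparseHeaderDemo`, `parseHeader`, `toTriples`) are hypothetical, the data is assumed to be comma-separated with integer column indices in the header, and zero cells are assumed to be the ones left out of the sparse representation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of rmr or Hadoop.
final class SparseHeaderDemo {

    // Parse the header row into sparse-matrix column indices, e.g. "3,17,42" -> [3, 17, 42].
    static int[] parseHeader(String headerLine) {
        String[] cells = headerLine.split(",");
        int[] indices = new int[cells.length];
        for (int i = 0; i < cells.length; i++) {
            indices[i] = Integer.parseInt(cells[i].trim());
        }
        return indices;
    }

    // Emit one "row column value" record for every non-zero cell of a data row.
    static List<String> toTriples(long row, String dataLine, int[] columnIndices) {
        String[] cells = dataLine.split(",");
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < cells.length && i < columnIndices.length; i++) {
            double value = Double.parseDouble(cells[i].trim());
            if (value != 0.0) {
                triples.add(row + " " + columnIndices[i] + " " + value);
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        int[] columns = parseHeader("3,17,42");
        System.out.println(toTriples(0, "0,2.5,0", columns)); // prints [0 17 2.5]
        System.out.println(toTriples(1, "1,0,4", columns));   // prints [1 3 1.0, 1 42 4.0]
    }
}
```

The point of the sketch is that no data row can be interpreted without the header, which is why the task that reads a row also needs to have seen the first line of the file.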

I'm thinking of adding a Java package to RHadoop with an unsplittable version of TextInputFormat (it just extends TextInputFormat and always answers "no" when asked whether a file can be split). This would be exposed via the R API, probably as another input format ("unsplittable-text" or something similar).

Thoughts?

Collaborator

piccolbo commented Apr 5, 2012

Off the bat, adding a format is not a huge deal, so the answer is of course let's do it. But I really don't understand what you are trying to do and how. Headers are turned off for CSV, that's correct; the reason is splits, and it's not the most user friendly. Now, correct me if I am wrong, but a consequence of an unsplittable text format is a single map process per input file. That's of course not something we want. What if we had a header file in each split? Would that be a better solution that we can also use on the output side? Another point is that I am not a big fan of global variables. One alternative would be to use the fact that a map function is actually a closure and do something like

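    generate.map = function() {
      header = NULL  # lives in the closure's environment, not in a global
      function(k, v) {
        if (is.null(header)) {  # first line of input
          header <<- parse.headers(v)  # parse.headers is a placeholder
        } else {
          normal.processing(k, v, header)  # placeholder for the real work
        }
      }
    }

and then call mapreduce(..., map = generate.map(), ...), in an object-oriented type of construction.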
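For readers who want to see why splits matter for this closure approach, here is a rough plain-Java analogue. It is only a sketch under stated assumptions: `HeaderAwareMapDemo`, `generateMap`, and `runTask` are hypothetical names, each simulated task gets its own fresh map function, and a split is modeled as a sublist of lines, whereas real Hadoop splits are byte ranges of a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical simulation, not rmr or Hadoop code.
final class HeaderAwareMapDemo {

    // Returns a fresh map function with private "header" state,
    // mirroring the R closure that assigns with <<-.
    static Function<String, String> generateMap() {
        String[] header = new String[1]; // one-element array: mutable state a lambda can capture
        return line -> {
            if (header[0] == null) {     // first line this function instance sees
                header[0] = line;
                return null;             // nothing to emit for the header itself
            }
            return header[0] + " -> " + line;
        };
    }

    // Run one simulated map task over its lines and collect the emitted records.
    static List<String> runTask(List<String> lines) {
        Function<String, String> map = generateMap();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String record = map.apply(line);
            if (record != null) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file = List.of("a,b", "1,2", "3,4", "5,6");
        System.out.println(runTask(file));               // prints [a,b -> 1,2, a,b -> 3,4, a,b -> 5,6]
        System.out.println(runTask(file.subList(0, 2))); // prints [a,b -> 1,2]
        System.out.println(runTask(file.subList(2, 4))); // prints [3,4 -> 5,6]
    }
}
```

In the last call the second simulated split never sees the real header, so it mistakes the data line "3,4" for one. The closure removes the global variable, but it still relies on every task starting at a header line.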

ekohlwey commented Apr 12, 2012

In this particular use case, the header is different in each split.

After thinking about it, I've decided it would actually be better to address the immediate use case and instead create a "FirstLineTextInputFormat". I've done some hacking on this and need to finish testing; then I'll send you a push.

Collaborator

piccolbo commented Apr 12, 2012

Please push to dev; I have to do testing before I take changes into master.

Antonio

On Thu, Apr 12, 2012 at 3:06 AM, ekohlwey wrote:

> In this particular use case, the header is different in each split.
>
> After thinking about it, I've decided it would actually be better to address the immediate use case, instead creating a "FirstLineTextInputFormat". I've done some hacking on this and need to finish testing, then I'll send you a push.


