DistributedDataFrame #26

HarlanH opened this Issue Jul 16, 2012 · 9 comments

5 participants


Possible construction modes include pushing data out from a central process, or having individual nodes load chunks from a CSV file or another source. Rows could be split by row number or by value, depending on the application. The DistributedDataFrame (DDF) would be resident in RAM and would support read/write operations as well as distributed statistical operations.
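To make the two splitting strategies concrete, here is a minimal sketch in Python (used only as a language-neutral illustration, not the Julia API). The function names `split_by_row_number` and `split_by_value` are hypothetical, and rows are assumed to be plain dictionaries.

```python
import zlib

def split_by_row_number(rows, n_parts):
    """Split rows into n_parts contiguous blocks of near-equal size."""
    base, extra = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` blocks get one additional row.
        stop = start + base + (1 if i < extra else 0)
        parts.append(rows[start:stop])
        start = stop
    return parts

def split_by_value(rows, key, n_parts):
    """Route each row by the value of one column, so equal values share a partition."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # crc32 is stable across processes, unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % n_parts
        parts[bucket].append(row)
    return parts
```

Splitting by row number keeps partitions balanced, while splitting by value keeps all rows of a group on one node, which would suit grouped operations at the cost of possibly uneven partition sizes.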


I was thinking of the following model for this:

  • New objects: DistributedDataVector, DistributedDataFrame, DistributedGroupedDataFrame
  • DistributedDataVector: simplified 1D darray
  • DistributedDataFrame: collection of remote refs to DataFrames / DistributedDataVectors
  • DistributedGroupedDataFrame: collection of DistributedDataFrames
  • To start with, we can provide an interface that reads from files, with rows split by row number. We can use mmap to map different portions of a single large file.
  • Implement most operations on DataFrame and DataVector as defined in operators.jl
  • Have convert methods to get all parts of a DistributedDataVector / DistributedDataFrame locally as DataVector / DataFrame.
  • All operations implemented using pmap / pmapreduce underneath
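The model in the list above could be sketched roughly as follows. This is an illustration in Python rather than Julia: the `DistributedDataVector` class here is a hypothetical stand-in, local threads play the role of remote workers, and `map_reduce` only mimics the shape of a pmap / pmapreduce call.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

class DistributedDataVector:
    """Hypothetical stand-in: a vector stored as a list of chunks, one per worker."""

    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]

    def map_reduce(self, map_fn, reduce_fn):
        # Apply map_fn to every chunk in parallel, then fold the partial results.
        # Threads stand in for remote processes; assumes at least one chunk.
        with ThreadPoolExecutor(max_workers=max(1, len(self.chunks))) as pool:
            partials = list(pool.map(map_fn, self.chunks))
        return reduce(reduce_fn, partials)

    def mean(self):
        # Each worker returns (sum, count); only these small tuples are combined.
        # Assumes the vector holds at least one element.
        total, count = self.map_reduce(
            lambda chunk: (sum(chunk), len(chunk)),
            lambda a, b: (a[0] + b[0], a[1] + b[1]),
        )
        return total / count

    def gather(self):
        # Analogue of converting to a local DataVector: concatenate all chunks.
        return [x for chunk in self.chunks for x in chunk]
```

For example, `DistributedDataVector([[1, 2, 3], [4, 5], [6]]).mean()` combines the per-chunk pairs (6, 3), (9, 2) and (6, 1) without moving the underlying data.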

I can have a go at this if it sounds good. Would be glad to have any comments/discussions before I start.


I wonder if a design is possible without using Distributed prefixes for everything. Right now, I can't think of an alternative.

Julia Statistics member

This would be great. I'd suggest just building the data structures up first since I think we'll see what needs to be handled there as we go on.

As you go, please let us know when you find functions that should be defined in terms of AbstractDataFrame so that DataFrame and DistributedDataFrame are handled together.
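As a rough illustration of that idea (in Python, with hypothetical class names rather than the real Julia types), a function written once against an abstract parent works for both the local and the distributed frame, as long as each subtype supplies a few primitives:

```python
from abc import ABC, abstractmethod

class AbstractFrame(ABC):
    """Hypothetical analogue of AbstractDataFrame."""

    @abstractmethod
    def nrow(self): ...

    @abstractmethod
    def colnames(self): ...

    def size(self):
        # Defined once on the abstract type; every subtype inherits it.
        return (self.nrow(), len(self.colnames()))

class LocalFrame(AbstractFrame):
    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def nrow(self):
        return len(next(iter(self.columns.values()), []))

    def colnames(self):
        return list(self.columns)

class ChunkedFrame(AbstractFrame):
    def __init__(self, parts):
        self.parts = parts  # list of LocalFrame, one per (simulated) worker

    def nrow(self):
        return sum(p.nrow() for p in self.parts)

    def colnames(self):
        return self.parts[0].colnames() if self.parts else []
```

Here `size` never needs to know whether the rows are local or spread across parts.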


I thought of having just the DistributedDataVector (with an abstract DataVector), but was not sure whether that would be enough to handle all the functionality.

Julia Statistics member

I think we really want the rows of a DataFrame to be distributed, rather than the columns.


Yes, I agree.

I was wondering if it would be any better to implement the DistributedDataFrame as a collection of DistributedDataVectors rather than a collection of remote DataFrames.

Julia Statistics member

Oh, definitely. So long as you can guarantee that each of the vectors is split in the same way, I think we should follow the existing definition of DataFrame and define things in terms of columns.
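The "split in the same way" condition can be checked when the frame is built. A minimal sketch in Python, assuming each column is held as a list of chunks (the `ColumnChunkedFrame` name and layout are hypothetical):

```python
class ColumnChunkedFrame:
    """Column-oriented frame whose columns must share one row partitioning."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of chunks (one chunk per worker)
        layouts = {name: [len(c) for c in chunks] for name, chunks in columns.items()}
        distinct = {tuple(layout) for layout in layouts.values()}
        if len(distinct) > 1:
            raise ValueError(f"columns are not split the same way: {layouts}")
        self.columns = columns
        # Chunk sizes shared by every column, e.g. [2, 1] for three rows on two workers.
        self.layout = list(distinct.pop()) if distinct else []

    def chunk(self, i):
        # The i-th row block across all columns: what one worker would hold.
        return {name: chunks[i] for name, chunks in self.columns.items()}
```

With the invariant enforced at construction, row-wise operations can be run chunk by chunk, because chunk `i` of every column covers the same rows.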

Julia Statistics member

See the prototypes branch for an early implementation / inspiration.

@garborg garborg closed this Jan 13, 2015