Possible construction modes include pushing data out from a central process, or having individual nodes load chunks from a CSV file or another source. Rows could be split by row number or by value, depending on the application. The DDF would be resident in RAM and would support read/write operations as well as distributed statistical operations.
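For concreteness, the contiguous row-splitting part could be sketched roughly like this (a minimal sketch in plain Julia; `row_chunks` and the other names are hypothetical, not an existing API):

```julia
using Distributed

# Hypothetical helper: split 1:nrows into nchunks roughly equal
# contiguous ranges, one per worker. Splitting by value (e.g. hashing
# a key column) would be an alternative partition strategy.
function row_chunks(nrows::Int, nchunks::Int)
    base, rem = divrem(nrows, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for i in 1:nchunks
        len = base + (i <= rem ? 1 : 0)   # spread the remainder over the first chunks
        push!(ranges, start:start+len-1)
        start += len
    end
    ranges
end

# A DDF could then hold one remote reference per chunk, e.g. (illustrative only):
# chunks = [remotecall(load_chunk, w, r)
#           for (w, r) in zip(workers(), row_chunks(n, nworkers()))]
```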
I was thinking of the following model for this:
I can have a go at this if it sounds good. Would be glad to have any comments/discussions before I start.
I wonder if a design is possible without using Distributed prefixes for everything. Right now, I can't think of an alternative.
This would be great. I'd suggest just building the data structures up first since I think we'll see what needs to be handled there as we go on.
As you go, please let us know when you find functions that should be defined in terms of AbstractDataFrame so that DataFrame and DistributedDataFrame are handled together.
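As one illustration of the dispatch pattern this suggests (stand-in type names, not the real DataFrames definitions): any method that only needs the common interface can be written once against the abstract type, and both concrete types pick it up for free.

```julia
# Stand-ins for AbstractDataFrame / DataFrame / DistributedDataFrame,
# purely to illustrate the shared-method pattern.
abstract type AbstractDF end

struct LocalDF <: AbstractDF
    columns::Vector{Vector}
end

struct DistDF <: AbstractDF
    nchunks::Int
    columns::Vector{Vector}   # placeholder; real chunks would be remote references
end

# Defined once on the abstract type; works for both concrete types.
ncols(df::AbstractDF) = length(df.columns)
```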
I thought of having just the DistributedDataVector (with an abstract DataVector), but was not sure if that would be enough to handle all functionalities.
I think we really want the rows of a DataFrame to be distributed, rather than the columns.
Yes, I agree.
I was wondering if it would be any better to implement the DistributedDataFrame as a collection of DistributedDataVectors rather than a collection of remote DataFrames.
Oh, definitely. So long as you can guarantee that each of the vectors is split in the same way, I think we should follow the existing definition of DataFrame and define things in terms of columns.
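A minimal sketch of what that column-oriented definition might look like, with the shared-partition invariant made explicit (all names hypothetical; chunks are shown as local vectors where a real implementation would hold remote references):

```julia
# Hypothetical sketch: a distributed frame as a collection of column
# vectors that all share one row partition.
struct DistributedDataVector{T}
    chunks::Vector{Vector{T}}   # one chunk per worker in a real system
end

struct DistributedDataFrame
    columns::Vector{DistributedDataVector}
    colnames::Vector{Symbol}
    partition::Vector{UnitRange{Int}}   # the shared row split
end

# Validating constructor: every column must be split exactly as `partition`
# says, which is the guarantee discussed above.
function make_ddf(cols::Vector{<:DistributedDataVector}, names::Vector{Symbol},
                  partition::Vector{UnitRange{Int}})
    for c in cols
        length.(c.chunks) == length.(partition) ||
            error("all columns must be split on the same row partition")
    end
    DistributedDataFrame(cols, names, partition)
end
```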
See the prototypes branch for an early implementation / inspiration.