Table to matrix #58

tlienart · 2019-01-26T06:47:43Z

(apologies if it's a dumb question)

In line with this discourse thread it seems that a good way forward is to have ML algorithms (such as, say, kmeans) accept Tables instead of "just" Matrix or AbstractMatrix.

In some cases, the algorithm may work directly using rows or columns of the Table, but in other cases it may be preferable to work with the matrix of values of the table directly (PCA might be an example).

Would it therefore be possible/relevant for Tables.jl to implement a function like Tables.matrix(table), and possibly have that function take an argument depending on whether the user wants a matrix with rows-as-observations or columns-as-observations?

As a suggested possible API I could see something like

Tables.matrix(table;
  rows=nothing, # to indicate a subset of the rows / nothing for all
  cols=nothing, # same 
  vardim=1) # 1 = row-as-observation, 2 = col-as-observation

Edit: here are what I think may be useful building blocks? (with guidance/hints I'm happy to try to give this a shot by the way, though I'm not very familiar with Julia's table universe...)

- have tables implement a isroworiented / iscolumnoriented
- use something like the conversion code in DataFrames: https://github.com/JuliaData/DataFrames.jl/blob/master/src/abstractdataframe/abstractdataframe.jl#L804_L828
- if it's column oriented, vardim=2, and the promoted type is <:Number, I believe copy(transpose(matrix(..., vardim=1))) will be faster than a row iteration. If the promoted type is not a <:Number, transpose might not work, but permutedims should.
- similar reasoning for row oriented
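To illustrate the transpose / permutedims distinction raised in the last two points, here is a minimal sketch in plain Julia (no Tables.jl involved). The matrices are made up for illustration. The `try` block only records whether materializing a transposed string matrix throws; it is expected to, since `transpose` is recursive and is not defined for `String`, but the sketch does not rely on that.

```julia
# Numeric data: transpose returns a lazy wrapper, permutedims allocates a new Matrix.
A = [1 2; 3 4]
At = transpose(A)      # lazy view over A, no data copied
Ap = permutedims(A)    # freshly allocated Matrix{Int}
@assert At == Ap
@assert Ap isa Matrix{Int}
@assert !(At isa Matrix)

# Non-numeric data: permutedims only moves elements around, so it works for strings.
S = ["a" "b"; "c" "d"]
Sp = permutedims(S)
@assert Sp == ["a" "c"; "b" "d"]

# transpose is recursive (it also transposes each element), so it is not
# meant for arbitrary element types. Record whether materializing it fails.
transpose_failed = try
    copy(transpose(S))
    false
catch
    true
end
```

For plain data movement between layouts, permutedims is the operation that makes no assumption about the element type.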

nalimilan · 2019-01-29T12:44:49Z

That sounds like a useful feature. Though I think only one argument is needed: vardim. Subsetting rows and columns can be achieved using separate functions.

A few remarks:

> have tables implement a isroworiented / iscolumnoriented

I don't think this is needed. Tables always have observations as rows and variables as columns. What can change is how the storage is actually done, but the Tables.jl interface ensures we don't have to care about it.

> use something like the conversion code in DataFrames: https://github.com/JuliaData/DataFrames.jl/blob/master/src/abstractdataframe/abstractdataframe.jl#L804_L828

Yes, this can be an inspiration. But of course this should use Tables.columns and Tables.schema instead of DataFrames-specific functions.

> if it's column oriented and vardim=2 and the promoted type is <:Number, I believe copy(transpose(matrix(..., vardim=1))) will be faster than a row iteration, if it's not a <:Number and therefore transpose might not work, permutedims should.
>
> similar reasoning for row oriented

transpose isn't the right operation for this; permutedims is always what we want. And we should avoid making copies at all costs: it shouldn't be faster, and it can be a problem with large data sets.

In general, iterating over columns will likely be the most efficient approach for most tables (even row-oriented ones) when vardim=2, since Matrix is column-major. But for some sources (like CSV) it would be faster to iterate over rows only once, since the full row needs to be parsed to extract a single value. So I guess the best approach is to work column by column when Tables.columnaccess(x) == true, and row by row otherwise. The converse is true when vardim=1.
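The two traversal orders described above can be sketched in plain Julia. This is only an illustration under stated assumptions: `fill_by_columns` and `fill_by_rows` are hypothetical helpers, not Tables.jl functions, and the sources are a NamedTuple of vectors (column-oriented) and a vector of NamedTuples (row-oriented) with concrete field types. A real implementation would pick between the two based on Tables.columnaccess, as suggested in the comment.

```julia
# Column-oriented source: a NamedTuple of equal-length vectors.
col_table = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Row-oriented source: a vector of NamedTuples with the same fields.
row_table = [(a = 1, b = 4.0), (a = 2, b = 5.0), (a = 3, b = 6.0)]

# Hypothetical helper: visit each column once and write it into one matrix column.
function fill_by_columns(cols::NamedTuple)
    T = reduce(promote_type, map(eltype, values(cols)))
    M = Matrix{T}(undef, length(first(cols)), length(cols))
    for (j, col) in enumerate(cols)
        M[:, j] = col
    end
    return M
end

# Hypothetical helper: visit each row once and scatter its fields across the matrix columns.
function fill_by_rows(rows::AbstractVector{<:NamedTuple})
    T = reduce(promote_type, fieldtypes(eltype(rows)))
    M = Matrix{T}(undef, length(rows), fieldcount(eltype(rows)))
    for (i, row) in enumerate(rows)
        for (j, value) in enumerate(row)
            M[i, j] = value
        end
    end
    return M
end

by_cols = fill_by_columns(col_table)
by_rows = fill_by_rows(row_table)
```

Both helpers produce the same observations-as-rows matrix; they differ only in how many times the source is traversed and in which order memory is written.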

@quinnj Is that right?

tlienart · 2019-01-29T12:52:01Z

> I don't think this is needed. Tables always have observations as rows and variables as columns. What can change is how the storage is actually done, but the Tables.jl interface ensures we don't have to care about it.

The current README specifically says 'for example, if MyTable is a row-oriented format, I might define my "sink" function like:', so I assumed this could exist 🤷‍♂️

nalimilan · 2019-01-29T12:56:54Z

Yes, but this refers to the storage, not to the meaning of rows AFAIK. That's exactly like the difference between [1 2; 3 4] and transpose([1 3; 2 4]): they represent the same thing but with a different storage.
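The comparison above can be checked directly in plain Julia. This is a small sketch of that one example: the two objects compare equal element by element, while the underlying memory is laid out differently.

```julia
a = [1 2; 3 4]
b = transpose([1 3; 2 4])

# Same logical content...
same_content = (a == b)

# ...but different storage: a is a plain Matrix, b is a lazy wrapper
# around a matrix whose memory holds the elements in a different order.
a_storage = vec(a)           # column-major order of a
b_storage = vec(parent(b))   # column-major order of the wrapped matrix
```

Here `a_storage` and `b_storage` list the same four values in different orders, which is the storage difference the comment refers to.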

datnamer · 2019-01-29T17:50:04Z

Isn't this the point of statsmodels.jl?

https://www.youtube.com/watch?v=HORLJrsghs4

@kleinschmidt

nalimilan · 2019-01-29T18:21:06Z

StatsModels is much more complex, it parses formulas and processes them using contrasts. Here we just need to copy values to a matrix layout.

kleinschmidt · 2019-01-29T18:39:01Z

I dunno, tables can (in principle) have different eltypes for different columns. It might make sense to have a fallback StatsModels.model_cols method that just takes a table, extracts a schema, and converts it to a matrix using all the columns. Then you'd get automatic encoding of categorical variables, etc.

nalimilan · 2019-01-29T20:28:13Z

Yes, but here we aren't even talking about transforming categorical variables to dummies. We just want to use promotion to find the best common type for all columns and copy the data to a matrix of that type. Anything more complex should indeed go through StatsModels.
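As a rough illustration of what "use promotion to find the best common type" means, the following plain-Julia sketch reduces a tuple of column element types with `promote_type`. The tuples of types are made up for illustration; this is not the Tables.jl implementation.

```julia
# Element types of three hypothetical numeric columns.
numeric_types = (Int, Float32, Float64)
T_numeric = reduce(promote_type, numeric_types)

# Mixing in a non-numeric column leaves no useful common type.
mixed_types = (Int, String)
T_mixed = reduce(promote_type, mixed_types)
```

With only numeric columns the result is a concrete numeric type, so the matrix stays efficient; with a string column the common type falls back to `Any`, which is where something like StatsModels becomes the better tool.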

quinnj · 2019-02-05T14:40:43Z

The new Tables.matrix function has been added and is included in the most recent release.

nalimilan · 2019-02-05T18:12:31Z

I think we still need to add the vardim argument to fully close this issue.

quinnj · 2019-02-05T18:22:36Z

Why? If you materialize a matrix, you can just do transpose(m), right? Why complicate the tables code when you essentially want a matrix operation?

nalimilan · 2019-02-05T18:33:45Z

The point is precisely that the packages which would like to use vardim=1 use that format because it's much more efficient for their access pattern (like computing distances in Distances.jl). So they don't want a Transpose object, they want to allocate a matrix with rows stored as contiguous memory blocks. vardim=1 will be equivalent to copy(transpose(x)), but without the intermediate copy.
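A minimal sketch of the difference, in plain Julia: `fill_transposed` is a hypothetical helper (not part of Tables.jl) that allocates the variables-as-rows matrix once and writes into it directly, whereas the alternative first builds the observations-as-rows matrix and then copies it into the other layout.

```julia
# Column-oriented source: a NamedTuple of equal-length vectors.
cols = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Hypothetical helper: each input column becomes one matrix ROW, so each
# observation ends up contiguous in memory (Julia matrices are column-major).
function fill_transposed(cols::NamedTuple)
    T = reduce(promote_type, map(eltype, values(cols)))
    M = Matrix{T}(undef, length(cols), length(first(cols)))
    for (j, col) in enumerate(cols)
        for (i, value) in enumerate(col)
            M[j, i] = value
        end
    end
    return M
end

direct = fill_transposed(cols)              # one allocation

intermediate = hcat(values(cols)...)        # first allocation (3×2)
two_step = copy(transpose(intermediate))    # second allocation (2×3)
```

Both routes give the same 2×3 result; the direct one never materializes the 3×2 intermediate, which matters for large data sets.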

quinnj · 2019-02-08T05:29:19Z

Ok, @nalimilan @tlienart, see the PR here: #66

#66)
* Finish #58 by adding the vardim keyword argument to Tables.matrix, allowing the user to specify whether input columns should be materialized as matrix columns or rows
* Rename vardim to dims
* Rename dims argument for Tables.matrix to transpose

quinnj · 2019-03-11T19:33:07Z

Alright, with #66 merged, I think we're good here.
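For reference, a short usage sketch of the resulting API, assuming a Tables.jl release that includes #66 and the final keyword name `transpose` mentioned in the commit message. The example table is made up; a NamedTuple of vectors is used because Tables.jl accepts it as a column table.

```julia
using Tables

table = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# Default: table columns become matrix columns (observations as rows).
m = Tables.matrix(table)

# transpose=true: table columns become matrix rows (observations as columns).
mt = Tables.matrix(table; transpose=true)
```

The element type of both matrices is the promoted type of the column types, and `mt` holds the same values as `permutedims(m)`.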

tlienart mentioned this issue Jan 26, 2019

Suggestion of rewrite for MLJBase.matrix JuliaAI/MLJBase.jl#1

Closed

nalimilan mentioned this issue Jan 29, 2019

Dimension convention and table <--> matrix conversions JuliaAI/MLJ.jl#50

Open

quinnj mentioned this issue Jan 30, 2019

Add MatrixTable type to wrap a Matrix and provide Tables.jl interface. #61

Merged

quinnj closed this as completed Feb 5, 2019

nalimilan reopened this Feb 5, 2019

quinnj added a commit that referenced this issue Feb 8, 2019

Finish #58 by adding the vardim keyword argument to Tables.matrix, allowing the user to specify whether input columns should be materialized as matrix columns or rows

545778b

quinnj added a commit that referenced this issue Mar 11, 2019

Finish #58 by adding the vardim keyword argument to Tables.matrix, allowing the user to specify whether input columns should be materialized as matrix columns or rows

9770a55

quinnj closed this as completed Mar 11, 2019
