Table to matrix #58

Closed
tlienart opened this issue Jan 26, 2019 · 13 comments

@tlienart

tlienart commented Jan 26, 2019

(apologies if it's a dumb question)

In line with this discourse thread it seems that a good way forward is to have ML algorithms (such as, say, kmeans) accept Tables instead of "just" Matrix or AbstractMatrix.

In some cases, the algorithm may work directly using rows or columns of the Table, but in other cases it may be preferable to work with the matrix of values of the table directly (PCA may be an example).

Would it therefore be possible/relevant for Tables.jl to implement a function like Tables.matrix(table) and possibly have that function take an argument depending on whether the user wants a matrix with rows-as-observation or column-as-observation?

As a suggested possible API I could see something like

Tables.matrix(table;
  rows=nothing, # to indicate a subset of the rows / nothing for all
  cols=nothing, # same 
  vardim=1) # 1 = row-as-observation, 2 = col-as-observation

Edit: here are what I think may be useful building blocks? (with guidance/hints I'm happy to try to give this a shot by the way, though I'm not very familiar with Julia's table universe...)

@nalimilan
Member

That sounds like a useful feature. Though I think only one argument is needed: vardim. Subsetting rows and columns can be achieved using separate functions.

A few remarks:

  • have tables implement a isroworiented / iscolumnoriented

I don't think this is needed. Tables always have observations as rows and variables as columns. What can change is how the storage is actually done, but the Tables.jl interface ensures we don't have to care about it.

Yes, this can be an inspiration. But of course this should use Tables.columns and Tables.schema instead of DataFrames-specific functions.

  • if it's column oriented and vardim=2 and the promoted type is <:Number, I believe copy(transpose(matrix(..., vardim=1))) will be faster than a row iteration, if it's not a <:Number and therefore transpose might not work, permutedims should.
  • similar reasoning for row oriented

transpose isn't the right operation for this, permutedims is always what we want. And we should avoid making copies at all costs: it shouldn't be faster and it can be a problem with large data sets.

In general, iterating over columns will likely be the most efficient approach for most tables (even row-oriented ones) when vardim=2, since Matrix is column-major. But for some sources (like CSV) it would be faster to iterate over rows only once, since the full row needs to be parsed to extract a single value. So I guess the best approach is to work column by column when Tables.columnaccess(x) == true, and row by row otherwise. The converse is true when vardim=1.
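To make the two traversal orders concrete, here is a minimal sketch in plain Julia. It does not use the Tables.jl API: `matrix_from_columns` and `matrix_from_rows` are hypothetical helpers, and the "tables" are assumed to be a NamedTuple of vectors (column-oriented) and a Vector of NamedTuples (row-oriented). Both produce the observations-as-rows layout.

```julia
# Hypothetical helpers for illustration only; not part of Tables.jl.

# Column-oriented source: a NamedTuple of equal-length vectors.
function matrix_from_columns(cols::NamedTuple)
    T = promote_type(map(eltype, values(cols))...)   # common element type
    M = Matrix{T}(undef, length(first(cols)), length(cols))
    for (j, col) in enumerate(cols)                  # one pass per column
        M[:, j] .= col                               # contiguous write in column-major storage
    end
    return M
end

# Row-oriented source: a Vector of NamedTuples sharing the same fields.
function matrix_from_rows(rows::AbstractVector{<:NamedTuple})
    T = promote_type(map(typeof, values(first(rows)))...)
    M = Matrix{T}(undef, length(rows), length(first(rows)))
    for (i, row) in enumerate(rows)                  # each row is visited exactly once
        for (j, x) in enumerate(row)
            M[i, j] = x
        end
    end
    return M
end

cols = (a = [1, 3], b = [2.0, 4.0])
rows = [(a = 1, b = 2.0), (a = 3, b = 4.0)]

Mc = matrix_from_columns(cols)
Mr = matrix_from_rows(rows)
```

Both calls yield the same matrix; only the order in which the source is read differs, which is what matters for sources that are expensive to read in one of the two directions.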

@quinnj Is that right?

@tlienart
Author

I don't think this is needed. Tables always have observations as rows and variables as columns. What can change is how the storage is actually done, but the Tables.jl interface ensures we don't have to care about it.

The current readme says specifically "for example, if MyTable is a row-oriented format, I might define my "sink" function like:", so I assumed this could exist 🤷‍♂️

@nalimilan
Member

Yes, but this refers to the storage, not to the meaning of rows AFAIK. That's exactly like the difference between [1 2; 3 4] and transpose([1 3; 2 4]): they represent the same thing but with a different storage.
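The comparison above can be checked directly. This is a small sketch using only Base and the LinearAlgebra standard library; the variable names are illustrative.

```julia
using LinearAlgebra

a = [1 2; 3 4]               # ordinary Matrix, stored column-major
b = transpose([1 3; 2 4])    # lazy Transpose wrapper around a different buffer

same_values = (a == b)       # element-wise comparison of the two matrices

# The underlying memory order differs even though the values agree.
buffer_a = vec(a)            # a's own storage order
buffer_b = vec(parent(b))    # storage order of the matrix wrapped by b
```

Here `same_values` is `true`, while `buffer_a` and `buffer_b` list the same four numbers in a different order.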

@datnamer

Isn't this the point of StatsModels.jl?

https://www.youtube.com/watch?v=HORLJrsghs4

@kleinschmidt

@nalimilan
Member

StatsModels is much more complex, it parses formulas and processes them using contrasts. Here we just need to copy values to a matrix layout.

@kleinschmidt

I dunno, tables can (in principle) have different eltypes for different columns. It might make sense to have a fallback StatsModels.model_cols method that just takes a table, extracts a schema, and converts it to a matrix using all the columns. Then you'd get automatic encoding of categorical variables, etc.

@nalimilan
Member

Yes, but here we aren't even talking about transforming categorical variables to dummies. We just want to use promotion to find the best type common to all columns and copy the data to a matrix of that type. Anything more complex should indeed go through StatsModels.
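As a rough illustration of the promotion step, the sketch below uses Base's `promote_type` on the element types of each column. `common_eltype` is a hypothetical helper, and the columns are assumed to be a NamedTuple of vectors; this is not necessarily how Tables.jl implements it.

```julia
# Hypothetical helper: the common element type of all columns.
common_eltype(cols::NamedTuple) = promote_type(map(eltype, values(cols))...)

numeric = (x = [1, 2, 3], y = [0.5, 1.5, 2.5])
mixed   = (x = [1, 2, 3], label = ["a", "b", "c"])

T_numeric = common_eltype(numeric)   # Int and Float64 promote to Float64
T_mixed   = common_eltype(mixed)     # no common concrete type, so this falls back to Any

# Copy the numeric columns into a matrix of the promoted type.
M = hcat((convert(Vector{T_numeric}, c) for c in values(numeric))...)
```

A fallback to `Any` is a sign that the columns are not directly comparable, which is the case where something like StatsModels would be the better tool.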

@quinnj
Member

quinnj commented Feb 5, 2019

The new Tables.matrix function has been added and is included in the most recent release.

@quinnj quinnj closed this as completed Feb 5, 2019
@nalimilan
Member

I think we still need to add the vardim argument to fully close this issue.

@nalimilan nalimilan reopened this Feb 5, 2019
@quinnj
Member

quinnj commented Feb 5, 2019

Why? If you materialize a matrix, you can just do transpose(m), right? Why complicate the tables code when you essentially want a matrix operation?

@nalimilan
Member

The point is precisely that the packages which would like to use vardim=1 use that format because it's much more efficient for their access pattern (like computing distances in Distances.jl). So they don't want a Transpose object, they want to allocate a matrix with rows stored as contiguous memory blocks. vardim=1 will be equivalent to copy(transpose(x)), but without the intermediate copy.
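The difference between the two routes can be sketched in plain Julia. `variables_as_rows` is a hypothetical helper (not the Tables.jl implementation), and the table is assumed to be a NamedTuple of vectors. The two-step route allocates the observations-as-rows matrix first and then a second matrix for its transposed copy; the direct route allocates only the final matrix.

```julia
cols = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Two-step route: materialize n×p, then copy into a p×n matrix.
X = hcat(values(cols)...)        # 3×2 intermediate
Xt_twostep = permutedims(X)      # 2×3, a second allocation

# Direct route (hypothetical helper): write each column of the table
# straight into a row of the result, with no intermediate matrix.
function variables_as_rows(cols::NamedTuple)
    T = promote_type(map(eltype, values(cols))...)
    M = Matrix{T}(undef, length(cols), length(first(cols)))
    for (i, col) in enumerate(cols)
        M[i, :] .= col
    end
    return M
end

Xt_direct = variables_as_rows(cols)
```

In `Xt_direct` each observation occupies one column, so it sits in contiguous memory, and the result is a plain `Matrix` rather than a lazy wrapper.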

quinnj added a commit that referenced this issue Feb 8, 2019
…lowing the user to specify whether input columns should be materialized as matrix columns or rows
@quinnj
Member

quinnj commented Feb 8, 2019

Ok, @nalimilan @tlienart, see the PR here: #66

quinnj added a commit that referenced this issue Mar 11, 2019
…lowing the user to specify whether input columns should be materialized as matrix columns or rows
quinnj added a commit that referenced this issue Mar 11, 2019
#66)

* Finish #58 by adding the vardim keyword argument to Tables.matrix, allowing the user to specify whether input columns should be materialized as matrix columns or rows

* Rename vardim to dims

* Rename dims argument for Tables.matrix to transpose
@quinnj
Member

quinnj commented Mar 11, 2019

Alright, with #66 merged, I think we're good here.
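For reference, a minimal usage sketch of the resulting function. It assumes Tables.jl is installed in a version that includes the final rename of the keyword to `transpose`, as listed in the commit notes above; the example table is a NamedTuple of vectors, chosen only for illustration.

```julia
using Tables

tbl = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

X  = Tables.matrix(tbl)                    # table columns become matrix columns
Xt = Tables.matrix(tbl; transpose = true)  # table columns become matrix rows
```

With this input, `X` has one row per observation and `Xt` has one column per observation, both with the promoted element type.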

@quinnj quinnj closed this as completed Mar 11, 2019
5 participants