Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow machine construction for large number of features #428

Closed
ablaom opened this issue Sep 28, 2020 · 10 comments
Closed

Slow machine construction for large number of features #428

ablaom opened this issue Sep 28, 2020 · 10 comments
Labels
Projects

Comments

@ablaom
Copy link
Member

ablaom commented Sep 28, 2020

As reported by slack user, this code is taking way too long:

using MLJ
X = MLJ.table(randn(200,10000));
y = randn(200)
model = ConstantRegressor()
mach = machine(model,  X, y);
@ablaom ablaom added the triage label Sep 28, 2020
@ablaom ablaom changed the title Slow machine construction for large number of feartures Slow machine construction for large number of features Sep 29, 2020
@ablaom
Copy link
Member Author

ablaom commented Sep 29, 2020

The slowdown is due to the fact that the constructor checks the number of rows of y and X are the same. However, MLJBase's implementation of rows to get nrows(X) first turns X into a column table. However, materialising a Tables.MatrixTable as a column table is doomed, if there are a large number of rows. The last line below hangs:

using Tables

X = Tables.table(randn(200,10000));
col_table = Tables.columntable(X);

Is this a Tables issue? No, not really. The problem is that julia named tuples with a large number of keys take forever to construct and I guess are a bad idea anyway (I think large tuples are generally a bad idea in julia?). So, for example, this hangs:

names = tuple((Symbol(string("Column", j)) for j in 1:10000)...);
values = rand(10000);
NamedTuple{names}(values);

MLJBase uses column tables not just in nrows but also in the fallbacks for selectrows (essential to resampling) and selectcols. So just fixing the nrows implementation is not enough. We have a wider issue here.

See also: #309

Note that in the slack user's use-case the data is not sparse. In any case, for dimension reduction models (eg, PCA) we would not want to restrict to sparse tables.

cc @OkonSamuel

@OkonSamuel
Copy link
Member

@ablaom. Yes this is a big issue. Maybe for now we could choose a Table type e,g DataFrames or MatrixTable and make it more efficient for high dimensional data. Then give efficient definitions for nrow, selectrows selectcols.
Then state somewhere in the docs that users use the selected Table type for high-dimensional data. ?

@ablaom
Copy link
Member Author

ablaom commented Sep 29, 2020

Sounds like a good workaround for now, thanks!

@nignatiadis
Copy link

FWIW I have been using the following snippet ( based on @OkonSamuel's suggestion) to run things locally as a temporary workaround (selectcols not implemented).

using Tables
import MLJModelInterface
const MMI = MLJModelInterface

MMI.nrows(X::Tables.MatrixTable) = size(MMI.matrix(X), 1)
MMI.selectrows(X::Tables.MatrixTable, ::Colon) = X
MMI.selectrows(X::Tables.MatrixTable, r::Integer) =
    MMI.selectrows(X::Tables.MatrixTable, r:r)
function MMI.selectrows(X::Tables.MatrixTable, r)
    new_matrix = MMI.matrix(X)[r, :]
    _names = getfield(X, :names)
    MMI.table(new_matrix; names = _names)
end

@jbrea
Copy link

jbrea commented Jul 2, 2021

Arghh, I just ran into this. For a not so large DataFrame with less rows than columns (708×4870) julia felt like hanging whenever I ran machine(model, X, y) until I killed the session with Ctrl+C (pretty annoying to debug :)).

Is there anything speaking against the workaround mentioned above?

@ablaom
Copy link
Member Author

ablaom commented Jul 2, 2021

@jbrea Thanks for reporting!

I suspect a different cause here because you are using a DataFrame (for which the aforementioned methods are already overloaded) not a wrapped matrix. Can you report scitype(X) and scitype(y)? If scitype(X) is hanging, and your dataset is large, this is likely because you have columns of some unusual eltype. In that case, what are the eltypes of your columns? What does MLJ.MLJBase.Tables.schema(X) return?

@jbrea
Copy link

jbrea commented Jul 3, 2021

Thanks for your response.

The data is all Float64, scitype(X) == Table{AbstractVector{Continuous}}, scitype(y) == AbstractVector{Continuous}.
What hangs is nrows(X) = nrows(get_interface_mode(), vtrait(X), X) with MLJModelInterface.vtrait(X) == Val{:table}(), because of Tables.columntable(X).

Why isn't just MMI.nrows(X::DataFrame) = DataFrames.nrow(X)?

@OkonSamuel
Copy link
Member

OkonSamuel commented Jul 3, 2021

Arghh, I just ran into this. For a not so large DataFrame with less rows than columns (708×4870) julia felt like hanging whenever I ran machine(model, X, y) until I killed the session with Ctrl+C (pretty annoying to debug :)).

Is there anything speaking against the workaround mentioned above?

Your slow code is because MLJ's machine constructor calls MMI.nrows() which is currently inefficient for tables with large number of columns.
I guess I'll open a PR to fix this soon.

@ablaom
Copy link
Member Author

ablaom commented Jul 3, 2021

Thanks to @OkonSamuel, the following addresses the nrows issue. @jbrea Can you confirm your own issue is resolved?

JuliaRegistries/General#40183

@jbrea
Copy link

jbrea commented Jul 4, 2021

Yes, my issue is resolved. Thanks a lot for the quick fix!

@ablaom ablaom added this to priority high / involved in General Jul 14, 2021
@ablaom ablaom closed this as completed Sep 22, 2021
General automation moved this from priority high / involved to Done Sep 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

4 participants