
Support multivariate kNN regression #327

Closed

Conversation

@mateuszbaran (Contributor)

This should add support for multivariate kNN regression. How should I modify metadata_model to indicate this?

@mateuszbaran (Contributor, Author)

OK, now I think the target of the kNN regressor is right. Thanks @ablaom and @tlienart for help on Slack 🙂 . Should I change something else?

@ablaom (Member) left a review comment:


I guess I was not sufficiently clear. Either:

  • y is an AbstractVector{<:AbstractFloat} (guaranteed by scitype(y) <: AbstractVector{Continuous}) - the status quo

  • y is of the above form, or y is any table satisfying the Tables.jl interface with columns of AbstractFloat eltype (guaranteed by scitype(y) <: Table(Continuous)) - the multi-target case
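
The two target cases above can be distinguished at run time roughly as follows (a sketch using ScientificTypes; the helper name `target_kind` is hypothetical, not part of MLJ):

```julia
using ScientificTypes  # provides scitype, Continuous, Table

# Hypothetical helper: pick the code path based on the scitype of y.
function target_kind(y)
    if scitype(y) <: AbstractVector{Continuous}
        return :single_target   # status quo: y is a vector of floats
    elseif scitype(y) <: Table(Continuous)
        return :multi_target    # y is any Tables.jl table of float columns
    else
        error("unsupported target scitype: $(scitype(y))")
    end
end
```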

Review thread on src/NearestNeighbors.jl (outdated, resolved)
@mateuszbaran (Contributor, Author)

So it is generally discouraged in MLJ to just store arrays in a column of a table, yes? I understand that I should put tables in that column.

@mateuszbaran (Contributor, Author)

For example, if I have a DataFrame with a structure like df = DataFrame(a = randn(10), b = randn(10), c = collect(collect.(zip(randn(10), randn(10))))), then what exactly should each element of column c be converted to?

@mateuszbaran (Contributor, Author)

The kNN regressor as implemented in this PR works for such dataframe.

@mateuszbaran (Contributor, Author)

It looks like I should split c into two columns and then merge them in fit and predict.

@ablaom (Member) commented Oct 20, 2020:

Yes, I wrote that. Sorry, I should have pointed it out to you earlier.

Sorry, I don't really understand your query. Is this a question about the API or about how to implement it? Regarding the API, the main point is that if the user supplies a multi-target y, then y will be any table satisfying the Tables.jl interface (with all column eltypes <: AbstractFloat), and that is all you know. Each row of the table corresponds to a single observation and each column to a different target being predicted. It is your job to ensure predict returns a table of the same type, whatever that is, given new input features Xnew. I've made a suggestion about how to do this, but I don't have a strong opinion on how you do it.
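
One way to satisfy the "table of the same type" requirement is via Tables.materializer, which returns a constructor for the table type of its argument (a sketch; `preds` and `predictions_as_table` are illustrative names, not code from this PR):

```julia
using Tables

# Sketch: rebuild a table of the same type as the training target y.
# `preds` is assumed to be an n × p prediction matrix and `names` the
# column names recorded from y during fit.
function predictions_as_table(y, preds::AbstractMatrix, names)
    cols = NamedTuple{Tuple(names)}(Tuple(eachcol(preds)))
    return Tables.materializer(y)(cols)  # e.g. a DataFrame if y was one
end
```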

@ablaom (Member) commented Oct 20, 2020:

> So it is generally discouraged in MLJ to just store arrays in a column of a table, yes?

Yes, in all models currently implemented, each column of a Tables.jl table is expected to correspond to a vector of data.
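
A vector-valued column like `c` in the earlier example can therefore be expanded into one scalar column per component (a sketch using DataFrames; the column names `c_1`, `c_2` are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = randn(10), b = randn(10),
               c = collect(collect.(zip(randn(10), randn(10)))))

# Expand the vector-valued column :c into scalar columns c_1, c_2, …
for i in 1:length(first(df.c))
    df[!, Symbol("c_", i)] = getindex.(df.c, i)
end
select!(df, Not(:c))  # drop the original vector-valued column
```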

@ablaom (Member) commented Oct 20, 2020:

> The kNN regressor as implemented in this PR works for such dataframe.
?

using MLJBase, MLJModels, DataFrames

X = (x1 = rand(10), x2 = rand(10))
y_columntable = (a = randn(10),
                 b = randn(10),
                 c = collect(collect.(zip(randn(10), randn(10))))) # a table
y = DataFrame(y_columntable)
model = @load KNNRegressor

fitresult, cache, report = MLJBase.fit(model, 1, X, y)
julia> predict(model, fitresult, X)
ERROR: BoundsError: attempt to access 3-element Array{AbstractArray{T,1} where T,1} at index [[2, 3, 1, 7, 4]]
Stacktrace:
 [1] throw_boundserror(::Array{AbstractArray{T,1} where T,1}, ::Tuple{Array{Int64,1}}) at ./abstractarray.jl:541
 [2] checkbounds at ./abstractarray.jl:506 [inlined]
 [3] _getindex at ./multidimensional.jl:742 [inlined]
 [4] getindex at ./abstractarray.jl:1060 [inlined]
 [5] manipulate(::DataFrame, ::Array{Int64,1}; copycols::Bool, keeprows::Bool) at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:542
 [6] #select#296 at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:493 [inlined]
 [7] getindex at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:468 [inlined]
 [8] getindex(::DataFrame, ::Array{Int64,1}) at ./deprecated.jl:72
 [9] predict(::MLJModels.NearestNeighbors_.KNNRegressor, ::Tuple{KDTree{StaticArrays.SArray{Tuple{2},Float64,1,2},Euclidean,Float64},DataFrame,Nothing}, ::NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}) at /Users/anthony/Dropbox/Julia7/MLJ/MLJModels/src/NearestNeighbors.jl:138
 [10] top-level scope at REPL[41]:1

It doesn't work on y_columntable either, which is also a Tables.jl table, because similar(y) throws an error.

@mateuszbaran (Contributor, Author)

OK, I think I see now. I will fix this PR to make it work similarly to MultivariateStats regressors. Sorry for bothering you with this, I should've read the documentation more carefully first. The API isn't intuitive for someone who is used to working with arrays instead of DataFrames and tables. By the way, predict worked for me because I was using it incorrectly (giving an array as y instead of a table).

Thanks for your help 👍, and once again sorry for my confusion here.

@mateuszbaran (Contributor, Author)

I think this should be correct now 🙂 .

@@ -161,7 +175,7 @@ metadata_pkg.((KNNRegressor, KNNClassifier),

 metadata_model(KNNRegressor,
                input = Table(Continuous),
-               target = Union{AbstractVector{Continuous}, AbstractVector{<:AbstractArray{Continuous}}},
+               target = Union{AbstractVector{Continuous}, Table{Continuous}},
                weights = true,
@ablaom (Member) commented on this change:

     if m.weights == :uniform
         preds[i,:] .= sum(values .* w_) / sum(w_)
     else
         preds[i,:] .= sum(values .* w_ .* (1.0 .- dists_ ./ sum(dists_))) / (sum(w_) - 1)
     end
 end
-if typeof(x) <: AbstractArray
+if typeof(y) <: AbstractArray
     return preds
@ablaom (Member) commented:

Correct, but probably less confusing (and more future-proof) to use if typeof(y) <: AbstractVector.
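
The suggested check might look like this inside predict (a sketch; `preds` is an illustrative name for the raw prediction array):

```julia
# Test for a vector explicitly, rather than for any AbstractArray, so a
# future matrix-valued target would not silently take this branch.
if y isa AbstractVector          # i.e. typeof(y) <: AbstractVector
    return preds                 # single-target case: return the vector
else
    # multi-target case: convert preds back to a table of the same type as y
end
```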

@ablaom (Member) left a review:

Looks good to me.

Really appreciate the help!

Just those two items.

@mateuszbaran (Contributor, Author)

Thank you for your help as well 👍 . I've fixed these two issues.

@ablaom (Member) commented Oct 22, 2020:

Closing in favour of #328, where the target branch is dev (instead of master) and the model registry is updated.

@ablaom closed this on Oct 22, 2020