-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multivariate kNN regression #327
Support multivariate kNN regression #327
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was not sufficiently clear. Either:
-
y
is aAbstractVector{<:AbstractFloat}
(guaranteed byscitype(y) <: AbstractVector{Continuous}
) - the status quo -
y
of the above form ory
is any table satisfying the Tables.jl interface with columns ofAbstractFloat
eltype (guaranteed byscitype(y) <: Table(Continuous)
) - the multi-target case
So it is generally discouraged in MLJ to just store arrays in a column of a table, yes? I understand that I should put tables in that column. |
For example if I have a dataframe of the structure like |
The kNN regressor as implemented in this PR works for such dataframe. |
It looks like I should split |
Some relevant documentation I've found: https://alan-turing-institute.github.io/MLJ.jl/stable/adding_models_for_general_use/#Multivariate-targets-1 |
Yes, I wrote that. Sorry, I should have pointed it out to you earlier. Sorry, I don't really understand your query. Is this a question about the API or how to implement it? Re the API, the main point is that if the user is going to supply a multi-target |
Yes, in all models currently implemented, each column of a Tables.jl table is expected to correspond to a vector of data. |
using MLJBase, MLJModels, DataFrames
X = (x1 = rand(10), x2 = rand(10))
y_columntable = (a = randn(10),
b = randn(10),
c = collect(collect.(zip(randn(10), randn(10))))) # a table
y = DataFrame(y_columntable)
model = @load KNNRegressor
fitresult, cache, report = MLJBase.fit(model, 1, X, y)
predict(model, fitresult, X)
julia> predict(model, fitresult, X)
ERROR: BoundsError: attempt to access 3-element Array{AbstractArray{T,1} where T,1} at index [[2, 3, 1, 7, 4]]
Stacktrace:
[1] throw_boundserror(::Array{AbstractArray{T,1} where T,1}, ::Tuple{Array{Int64,1}}) at ./abstractarray.jl:541
[2] checkbounds at ./abstractarray.jl:506 [inlined]
[3] _getindex at ./multidimensional.jl:742 [inlined]
[4] getindex at ./abstractarray.jl:1060 [inlined]
[5] manipulate(::DataFrame, ::Array{Int64,1}; copycols::Bool, keeprows::Bool) at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:542
[6] #select#296 at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/abstractdataframe/selection.jl:493 [inlined]
[7] getindex at /Users/anthony/.julia/packages/DataFrames/GtZ1l/src/dataframe/dataframe.jl:468 [inlined]
[8] getindex(::DataFrame, ::Array{Int64,1}) at ./deprecated.jl:72
[9] predict(::MLJModels.NearestNeighbors_.KNNRegressor, ::Tuple{KDTree{StaticArrays.SArray{Tuple{2},Float64,1,2},Euclidean,Float64},DataFrame,Nothing}, ::NamedTuple{(:x1, :x2),Tuple{Array{Float64,1},Array{Float64,1}}}) at /Users/anthony/Dropbox/Julia7/MLJ/MLJModels/src/NearestNeighbors.jl:138
[10] top-level scope at REPL[41]:1 Doesn't work on
|
OK, I think I see now. I will fix this PR to make it work similarly to MultivariateStats regressors. Sorry for bothering you with this, I should've read the documentation more carefully first. The API isn't intuitive for someone who is used to working with arrays instead of DataFrames and tables. By the way, Thanks for your help 👍, and once again sorry for my confusion here. |
I think this should be correct now 🙂 . |
@@ -161,7 +175,7 @@ metadata_pkg.((KNNRegressor, KNNClassifier), | |||
|
|||
metadata_model(KNNRegressor, | |||
input = Table(Continuous), | |||
target = Union{AbstractVector{Continuous}, AbstractVector{<:AbstractArray{Continuous}}}, | |||
target = Union{AbstractVector{Continuous}, Table{Continuous}}, | |||
weights = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table(Continuous)
instead of Table{Continuous}
https://github.com/alan-turing-institute/ScientificTypes.jl#more-on-the-table-type
if m.weights == :uniform | ||
preds[i,:] .= sum(values .* w_) / sum(w_) | ||
else | ||
preds[i,:] .= sum(values .* w_ .* (1.0 .- dists_ ./ sum(dists_))) / (sum(w_) - 1) | ||
end | ||
end | ||
if typeof(x) <: AbstractArray | ||
if typeof(y) <: AbstractArray | ||
return preds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct but probably less confusing (and future-proof) to use if typeof(y) <: AbstractVector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Really appreciate the help!
Just those two items.
Thank you for your help as well 👍 . I've fixed these two issues. |
Closing in favour of #328 where target branch is dev (instead of master) and model registry is updated. |
This should add support for multivariate kNN regression. How should I modify
metadata_model
to indicate this?