Generalize KNNRegressor to multitarget case #328

Merged: 23 commits, Oct 27, 2020

Changes from 16 commits

Commits (23)
63004f7
Merge pull request #313 from alan-turing-institute/dev
ablaom Oct 12, 2020
67e775a
Merge pull request #316 from alan-turing-institute/dev
ablaom Oct 12, 2020
eb59c99
Merge pull request #320 from alan-turing-institute/dev
ablaom Oct 13, 2020
e07b3d5
Merge pull request #324 from alan-turing-institute/dev
ablaom Oct 16, 2020
b02cede
Merge pull request #326 from alan-turing-institute/dev
ablaom Oct 19, 2020
a4c8610
support multivariate kNN regression
mateuszbaran Oct 19, 2020
0b8c3bd
updated target of kNN regressor
mateuszbaran Oct 20, 2020
865a352
changing target of KNNRegressor
mateuszbaran Oct 20, 2020
3062a05
target of kNN regressor again
mateuszbaran Oct 20, 2020
e52ea0a
trying to make the multi-target kNN regressor work with tables
mateuszbaran Oct 21, 2020
a794ade
fixing kNN regressor
mateuszbaran Oct 21, 2020
6f28040
code review fixes
mateuszbaran Oct 22, 2020
9bc0dfc
update model registry
ablaom Oct 22, 2020
16ac6bf
update registry again
ablaom Oct 22, 2020
e7d5853
fix check_registry issue
ablaom Oct 22, 2020
46fea19
Update NearestNeighbors.jl
OkonSamuel Oct 22, 2020
d36948e
fix wrong call signature
OkonSamuel Oct 22, 2020
5325d5b
Update NearestNeighbors.jl
OkonSamuel Oct 22, 2020
a229988
replace `Tables.schema` with `MMI.schema`
OkonSamuel Oct 22, 2020
a1a9ca2
Update NearestNeighbors.jl
OkonSamuel Oct 22, 2020
ea2d9de
Update NearestNeighbors.jl
OkonSamuel Oct 22, 2020
dedcaee
Merge branch 'dev' of https://github.com/alan-turing-institute/MLJMod…
ablaom Oct 26, 2020
c89f593
Merge branch 'dev' into multiple-regression-knn2
ablaom Oct 27, 2020
36 changes: 29 additions & 7 deletions src/NearestNeighbors.jl
@@ -10,6 +10,7 @@ const MMI = MLJModelInterface
using Distances

import ..NearestNeighbors
import ..Tables

const NN = NearestNeighbors

@@ -128,26 +129,47 @@ end
function MMI.predict(m::KNNRegressor, (tree, y, w), X)
Xmatrix = MMI.matrix(X, transpose=true) # NOTE: copies the data
idxs, dists = NN.knn(tree, Xmatrix, m.K)
preds = zeros(length(idxs))

return _predict(m, y, idxs, dists, dists)
end
function _predict(m::KNNRegressor, y::AbstractVector, idxs, dists)
preds = similar(y, length(idxs), 1)
w_ = ones(m.K)

for i in eachindex(idxs)
idxs_ = idxs[i]
dists_ = dists[i]
values = y[idxs_]
values = [view(y, j, :) for j in idxs_]
if w !== nothing
w_ = w[idxs_]
end
if m.weights == :uniform
preds[i] = sum(values .* w_) / sum(w_)
preds[i,:] .= sum(values .* w_) / sum(w_)
else
preds[i] = sum(values .* w_ .* (1.0 .- dists_ ./ sum(dists_))) / (sum(w_) - 1)
preds[i,:] .= sum(values .* w_ .* (1.0 .- dists_ ./ sum(dists_))) / (sum(w_) - 1)
ablaom (Member, Author) commented:

I'm having trouble understanding the formula on line 147 (which predates this PR). @tlienart, can you recall how this was deduced or where it came from?

Naively, it seems to me that we want to write the prediction as a weighted sum of the target values, where the kth weight is simultaneously proportional to the prescribed sample weight `w_[k]` and inversely proportional to the distance `dist_[k]`. That is, `prediction[i,:] = sum(w_ ./ dist_ .* values) / c`, where `c` is a normalisation constant chosen so that the weights (the coefficients of `values` in the sum) add up to one: `c = sum(w_ ./ dist_)`. However, I can't reconcile this with the given formula. Am I missing something?

cc @mateuszbaran @OkonSamuel
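
For concreteness, a minimal sketch of the weighting described above (my own illustration, not code from this PR; the small `eps` guarding against zero distances is an assumption):

```julia
# Sketch only: the k-th coefficient is proportional to the sample weight w_[k]
# and inversely proportional to the distance dists_[k]; coefficients sum to one.
function inverse_distance_prediction(values, w_, dists_; eps=1e-12)
    coeffs = w_ ./ (dists_ .+ eps)   # guard against an exact match (zero distance)
    return sum(coeffs .* values) / sum(coeffs)
end

# e.g. inverse_distance_prediction([0.0, 2.0, -2.0], ones(3), [0.1, 0.2, 0.4])
```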

OkonSamuel (Member) commented on Oct 23, 2020:

Yeah, I had the same thoughts when I went through.
I would propose avoiding passing weights to `fit`, i.e. setting the `supports_weights` trait to `false`, since the weights needed for kNN models are not per-observation weights but per-neighbour weights.
So we should stick to using weights derived from the `weights` hyperparameter passed at kNN model construction (i.e. `:uniform`, `:distance`).
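
For reference, a sketch of what that proposal would amount to in this file, assuming the `weights` keyword of `metadata_model` is what sets the `supports_weights` trait (as it appears to be in the diff above):

```julia
# Sketch only: declare that KNNRegressor does not accept per-observation weights.
metadata_model(KNNRegressor,
               input   = Table(Continuous),
               target  = Union{AbstractVector{Continuous}, Table(Continuous)},
               weights = false,   # supports_weights(KNNRegressor) would then be false
               descr   = KNNRegressorDescription)
```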

ablaom (Member, Author) commented on Oct 23, 2020:

> Yeah had the same thoughts when i went through.
> I would propose avoid passing weights to fit. i.e setting supports_weights trait to false since the weights

I don't see anything wrong with mixing per-sample weights with an inverse square law for the "neighbour" weights (if it's done in a meaningful way). Also, at present, these two KNN models are among the few models that support sample weights and are therefore used for testing 🙌

mateuszbaran (Contributor) commented:

The numerator in this formula does make sense to me: multiplying by `1 - dist[i]/sum(dist)` should behave better numerically when the distance to one of the neighbors is close to 0. The normalization constant in the denominator looks wrong, though. Here is a comparison of different distance-based weights: https://perun.pmf.uns.ac.rs/radovanovic/publications/2016-kais-knn-weighting.pdf
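
A quick numerical check of that observation (my own sketch, not from the thread): with unit sample weights the coefficients `w_ .* (1 .- dists_ ./ sum(dists_))` sum to `K - 1 == sum(w_) - 1`, so the existing denominator only normalises correctly in that special case:

```julia
dists_ = [0.1, 0.3, 0.6]
K = length(dists_)

w_ = ones(K)                                 # unit sample weights
coeffs = w_ .* (1 .- dists_ ./ sum(dists_))
sum(coeffs) ≈ sum(w_) - 1                    # true: 2.0 ≈ 2.0

w_ = [1.0, 2.0, 3.0]                         # non-trivial sample weights
coeffs = w_ .* (1 .- dists_ ./ sum(dists_))
sum(coeffs) ≈ sum(w_) - 1                    # false: 3.5 vs 5.0
```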

ablaom (Member, Author) commented:

@mateuszbaran Thanks for pointing out the paper. Worth noting that the evaluation there is for classifiers making point predictions (the MLJ classifier is probabilistic and so needs a normalisation not needed there), and that the testing was restricted to time-series applications. And the current PR is about regression. That said, the paper nicely summarises a number of weighting schemes that probably covers the cases in common use for both probabilistic classification and deterministic regression (the MLJ cases).

(Interestingly, I don't see the `1 - dist[i]/sum(dist)` weight in the paper, although maybe it's a special case of Macleod?)

It would be nice to implement them all and cite the paper for the definitions, but I would deem that out of scope for the current PR.

For the record, scikit-learn implements `1/dist[i]` (with no epsilon-smoothing) and uniform weights, but also allows a custom function.

I propose we keep the current `1 - dist[i]/sum(dist)` weight (and the `1/dist[i]` weight currently used for the classifier) and do the normalisation post facto, as we do for classification. (I don't believe there's a more efficient way, if we are mixing in sample weights.) We'll view this as a "bug fix" (patch release) and open a new issue to generalise and get consistency between regression and classification, which will be breaking. We can do that at the same time as migrating the interface to its own package (#244).
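
A hedged sketch of what that post-facto normalisation might look like (illustrative only, not the code being merged here):

```julia
# Sketch only: keep the 1 - dist/sum(dist) neighbour weights, mix in the
# per-sample weights, then normalise so the coefficients sum to one.
function post_facto_prediction(values, w_, dists_)
    coeffs = w_ .* (1.0 .- dists_ ./ sum(dists_))
    return sum(coeffs .* values) / sum(coeffs)
end
```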

@OkonSamuel @mateuszbaran Happy with this?

OkonSamuel (Member) commented on Oct 27, 2020:

@ablaom This can be merged for now. Further changes can be made when migrating to MLJNearestNeighborsInterface.jl (#331).
There we may add more weighting schemes as kernels.

ablaom (Member, Author) commented:

Okay, I'll just merge this as is, despite the mysterious normalization, and we'll fix that when we migrate.

mateuszbaran (Contributor) commented:

Yes, that paper isn't about kNN regression, but it still nicely collects many weighting functions that can be used here as well.

Your plan sounds great 👍.

ablaom (Member, Author) commented:

Thanks for your help with the PR; I realise it was a bit more involved than you probably thought it would be 😄

mateuszbaran (Contributor) commented:

No problem! Also thank you for your quick help in getting this done. I learned quite a bit and I think it was worth it.

end
end
return preds
end

function _predict(m::KNNRegressor, y, idxs, dists)
ymat = MMI.matrix(y)
preds = similar(ymat, length(idxs), size(ymat, 2))
w_ = ones(m.K)
for i in eachindex(idxs)
idxs_ = idxs[i]
dists_ = dists[i]
values = [view(ymat, j, :) for j in idxs_]
if w !== nothing
w_ = w[idxs_]
end
if m.weights == :uniform
preds[i,:] .= sum(values .* w_) / sum(w_)
else
preds[i,:] .= sum(values .* w_ .* (1.0 .- dists_ ./ sum(dists_))) / (sum(w_) - 1)
end
end
return MMI.table(preds, names=Tables.schema(y).names, prototype=y)
end

# ====

metadata_pkg.((KNNRegressor, KNNClassifier),
@@ -161,7 +183,7 @@ metadata_pkg.((KNNRegressor, KNNClassifier),

metadata_model(KNNRegressor,
input = Table(Continuous),
target = AbstractVector{Continuous},
target = Union{AbstractVector{Continuous}, Table(Continuous)},
weights = true,
descr = KNNRegressorDescription
)
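With the widened `target` declaration above, the regressor also accepts a table of `Continuous` columns as the target. A hedged usage sketch, modelled on the new tests further down (data and hyperparameter values are assumed for illustration):

```julia
using MLJBase, Tables

X = MLJBase.table(rand(100, 3))                      # Continuous features
Y = Tables.table(rand(100, 2); header = [:a, :b])    # two Continuous targets

knnr = KNNRegressor(K = 5)
fitresult, _, _ = fit(knnr, 1, X, Y)
preds = predict(knnr, fitresult, X)                  # a table with columns :a and :b
```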
20 changes: 10 additions & 10 deletions src/registry/Metadata.toml
@@ -46,7 +46,7 @@
[NearestNeighbors.KNNClassifier]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s112,1} where _s112<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s33,1} where _s33<:ScientificTypes.Finite`"
":is_pure_julia" = "`true`"
":package_name" = "NearestNeighbors"
":package_license" = "MIT"
@@ -68,7 +68,7 @@
[NearestNeighbors.KNNRegressor]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{ScientificTypes.Continuous,1}`"
":target_scitype" = "`Union{AbstractArray{ScientificTypes.Continuous,1}, ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)}`"
":is_pure_julia" = "`true`"
":package_name" = "NearestNeighbors"
":package_license" = "MIT"
@@ -2356,7 +2356,7 @@
[DecisionTree.AdaBoostStumpClassifier]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:Union{AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous, AbstractArray{_s23,1} where _s23<:ScientificTypes.Count, AbstractArray{_s23,1} where _s23<:ScientificTypes.OrderedFactor}`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s112,1} where _s112<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s77,1} where _s77<:ScientificTypes.Finite`"
":is_pure_julia" = "`true`"
":package_name" = "DecisionTree"
":package_license" = "MIT"
@@ -2400,7 +2400,7 @@
[DecisionTree.DecisionTreeClassifier]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:Union{AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous, AbstractArray{_s23,1} where _s23<:ScientificTypes.Count, AbstractArray{_s23,1} where _s23<:ScientificTypes.OrderedFactor}`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s112,1} where _s112<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s77,1} where _s77<:ScientificTypes.Finite`"
":is_pure_julia" = "`true`"
":package_name" = "DecisionTree"
":package_license" = "MIT"
@@ -2444,7 +2444,7 @@
[DecisionTree.RandomForestClassifier]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:Union{AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous, AbstractArray{_s23,1} where _s23<:ScientificTypes.Count, AbstractArray{_s23,1} where _s23<:ScientificTypes.OrderedFactor}`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s112,1} where _s112<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s77,1} where _s77<:ScientificTypes.Finite`"
":is_pure_julia" = "`true`"
":package_name" = "DecisionTree"
":package_license" = "MIT"
@@ -3060,7 +3060,7 @@
[LIBSVM.LinearSVC]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s111,1} where _s111<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s76,1} where _s76<:ScientificTypes.Finite`"
":is_pure_julia" = "`false`"
":package_name" = "LIBSVM"
":package_license" = "unknown"
@@ -3104,7 +3104,7 @@
[LIBSVM.NuSVC]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s111,1} where _s111<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s76,1} where _s76<:ScientificTypes.Finite`"
":is_pure_julia" = "`false`"
":package_name" = "LIBSVM"
":package_license" = "unknown"
@@ -3126,7 +3126,7 @@
[LIBSVM.SVC]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s111,1} where _s111<:ScientificTypes.Finite`"
":target_scitype" = "`AbstractArray{_s76,1} where _s76<:ScientificTypes.Finite`"
":is_pure_julia" = "`false`"
":package_name" = "LIBSVM"
":package_license" = "unknown"
@@ -3147,7 +3147,7 @@

[LIBSVM.OneClassSVM]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`AbstractArray{_s111,1} where _s111<:ScientificTypes.Finite{2}`"
":output_scitype" = "`AbstractArray{_s76,1} where _s76<:ScientificTypes.Finite{2}`"
":target_scitype" = "`ScientificTypes.Unknown`"
":is_pure_julia" = "`false`"
":package_name" = "LIBSVM"
@@ -3170,7 +3170,7 @@
[GLM.LinearBinaryClassifier]
":input_scitype" = "`ScientificTypes.Table{_s24} where _s24<:(AbstractArray{_s23,1} where _s23<:ScientificTypes.Continuous)`"
":output_scitype" = "`ScientificTypes.Unknown`"
":target_scitype" = "`AbstractArray{_s112,1} where _s112<:ScientificTypes.Finite{2}`"
":target_scitype" = "`AbstractArray{_s77,1} where _s77<:ScientificTypes.Finite{2}`"
":is_pure_julia" = "`true`"
":package_name" = "GLM"
":package_license" = "MIT"
3 changes: 2 additions & 1 deletion src/registry/src/check_registry.jl
@@ -1,5 +1,6 @@
function check_registry()
basedir = joinpath(dirname(pathof(MLJModels)), "registry")
# basedir = joinpath(dirname(pathof(MLJModels)), "registry")
basedir = Registry.environment_path
Pkg.activate(basedir)

# Read Metadata.toml
16 changes: 14 additions & 2 deletions test/NearestNeighbors.jl
@@ -7,6 +7,7 @@ using MLJModels.NearestNeighbors_
using CategoricalArrays
using MLJBase
using Random
using Tables

Random.seed!(5151)

@@ -109,7 +110,19 @@ p2 = predict(knnr, f2, xtest)
@test all(p[ntest+1:2*ntest] .≈ 2.0)
@test all(p[2*ntest+1:end] .≈ -2.0)

ymat = vcat(fill( 0.0, n, 2), fill(2.0, n, 2), fill(-2.0, n, 2))
yv = Tables.table(ymat; header = [:a, :b])

fv,_,_ = fit(knnr, 1, x, yv)
f2v,_,_ = fit(knnr, 1, x, yv, w)

pv = predict(knnr, fv, xtest)

for col in [:a, :b]
@test all(pv[col][1:ntest] .≈ [0.0])
@test all(pv[col][ntest+1:2*ntest] .≈ [2.0])
@test all(pv[col][2*ntest+1:end] .≈ [-2.0])
end



@@ -128,8 +141,7 @@ infos[:docstring]
infos = info_dict(knnr)

@test infos[:input_scitype] == Table(Continuous)
@test infos[:target_scitype] == AbstractVector{Continuous}

@test infos[:target_scitype] == Union{AbstractVector{Continuous}, Table(Continuous)}
infos[:docstring]

end