
MinMaxScaler (and more) #816

Open
egolep opened this issue Jul 12, 2021 · 6 comments

egolep commented Jul 12, 2021

It would be very nice to have more transformers than Standardizer, OneHotEncoder, and BoxCox (and their univariate versions).

I even tried to implement a MinMaxScaler using Standardizer as an example, but I keep getting:

```
[ Info: Training Machine{MinMaxScaler,…} @194.
┌ Error: Problem fitting the machine Machine{MinMaxScaler,…} @194.
└ @ MLJBase ~/.julia/packages/MLJBase/AkJde/src/machines.jl:484
[ Info: Running type checks...
[ Info: Type checks okay.
ERROR: MethodError: no method matching fit(::MinMaxScaler, ::Int64, ::DataFrame)
Closest candidates are:
  fit(::MLJBase.Stack{modelnames, inp_scitype, tg_scitype} where {modelnames, inp_scitype, tg_scitype}, ::Int64, ::Any, ::Any) at /home/egolep/.julia/packages/MLJBase/AkJde/src/composition/models/stacking.jl:277
  fit(::Union{MLJIteration.DeterministicIteratedModel{M}, MLJIteration.ProbabilisticIteratedModel{M}} where M, ::Any, ::Any...) at /home/egolep/.julia/packages/MLJIteration/Twn0E/src/core.jl:51
  fit(::Union{MLJTuning.DeterministicTunedModel{T, M}, MLJTuning.ProbabilisticTunedModel{T, M}}, ::Integer, ::Any...) where {T, M} at /home/egolep/.julia/packages/MLJTuning/QFcuQ/src/tuned_models.jl:592
  ...
Stacktrace:
 [1] fit_only!(mach::Machine{MinMaxScaler, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool)
   @ MLJBase ~/.julia/packages/MLJBase/AkJde/src/machines.jl:482
 [2] #fit!#98
   @ ~/.julia/packages/MLJBase/AkJde/src/machines.jl:549 [inlined]
 [3] top-level scope
   @ REPL[120]:1
```

Here is my implementation (of both the univariate and the multivariate version):

```julia
import MLJModelInterface.inverse_transform

mutable struct UnivariateMinMaxScaler <: Unsupervised end

function fit(transformer::UnivariateMinMaxScaler, verbosity::Int, v::AbstractVector{T}) where T<:Real
    min, max = minimum(v), maximum(v)
    fitresult = (min, max)
    cache = nothing
    report = NamedTuple()
    return fitresult, cache, report
end

function transform(transformer::UnivariateMinMaxScaler, fitresult, x::Real)
    min, max = fitresult
    return (x - min) / (max - min)    # scale onto [0, 1]
end

transform(transformer::UnivariateMinMaxScaler, fitresult, v) =
    [transform(transformer, fitresult, x) for x in v]

function inverse_transform(transformer::UnivariateMinMaxScaler, fitresult, y::Real)
    min, max = fitresult
    return y * (max - min) + min      # map back from [0, 1] to original range
end

inverse_transform(transformer::UnivariateMinMaxScaler, fitresult, w) =
    [inverse_transform(transformer, fitresult, y) for y in w]
```
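As a side note, the underlying min-max formulas can be sanity-checked on their own, independently of the model structs above (made-up numbers; `transform` should return the scaled value `(x - min)/(max - min)` and `inverse_transform` should undo it):

```julia
v = [2.0, 5.0, 11.0]
lo, hi = minimum(v), maximum(v)        # 2.0, 11.0

scaled   = (v .- lo) ./ (hi - lo)      # elementwise map onto [0, 1]
restored = scaled .* (hi - lo) .+ lo   # inverse map

# scaled   == [0.0, 1/3, 1.0]
# restored == v
```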

```julia
mutable struct MinMaxScaler <: Unsupervised
    features::Vector{Symbol}
end

MinMaxScaler(; features=Symbol[]) = MinMaxScaler(features)

function fit(transformer::MinMaxScaler, verbosity::Int, X::Any)

    _schema = schema(X)
    all_features = _schema.names
    types = _schema.scitypes

    # determine indices of all_features to be transformed
    if isempty(transformer.features)
        cols_to_fit = filter!(eachindex(all_features) |> collect) do j
            types[j] <: Continuous
        end
    else
        cols_to_fit = filter!(eachindex(all_features) |> collect) do j
            all_features[j] in transformer.features && types[j] <: Continuous
        end
    end

    fitresult_given_feature = Dict{Symbol,Tuple{Float64,Float64}}()

    # fit each feature
    verbosity < 2 || @info "Features scaled: "
    for j in cols_to_fit
        col_fitresult, cache, report =
            fit(UnivariateMinMaxScaler(), verbosity - 1, selectcols(X, j))
        fitresult_given_feature[all_features[j]] = col_fitresult
        verbosity < 2 ||
            @info "  :$(all_features[j])    min=$(col_fitresult[1])  max=$(col_fitresult[2])"
    end

    fitresult = fitresult_given_feature
    cache = nothing
    report = (features_fit=keys(fitresult_given_feature),)

    return fitresult, cache, report
end
```

```julia
MLJ.fitted_params(::MinMaxScaler, fitresult) = (min_and_max_given_feature=fitresult,)

function transform(transformer::MinMaxScaler, fitresult, X)
    features_to_be_transformed = keys(fitresult)
    all_features = schema(X).names

    issubset(Set(features_to_be_transformed), Set(all_features)) ||
        error("Attempting to transform data with incompatible feature labels.")

    col_transformer = UnivariateMinMaxScaler()

    cols = map(all_features) do ftr
        if ftr in features_to_be_transformed
            transform(col_transformer, fitresult[ftr], selectcols(X, ftr))
        else
            selectcols(X, ftr)
        end
    end

    named_cols = NamedTuple{all_features}(tuple(cols...))

    return MLJBase.table(named_cols, prototype=X)
end
```

I get the same error with both the multivariate and the univariate version.

@egolep changed the title to "MinMaxScaler (and more)" Jul 12, 2021

ablaom commented Jul 13, 2021

@egolep Thanks for this.

Despite your definition

```julia
function fit(transformer::MinMaxScaler, verbosity::Int, X::Any)
```

you are getting

```
ERROR: MethodError: no method matching fit(::MinMaxScaler, ::Int64, ::DataFrame)
```

Maybe this is a dumb question, but did you import MLJModelInterface.fit to extend it (or MLJBase.fit)? Perhaps you have only defined Main.fit, which MLJ will not recognise.
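For the record, a minimal sketch of the distinction (model name taken from the code above):

```julia
# Extending the generic function that MLJ's machinery actually calls:
import MLJModelInterface: fit, transform   # (or: import MLJBase: fit, transform)

# Methods defined after this import attach to MLJModelInterface.fit,
# so fit!(machine(...)) can dispatch to them.

# By contrast, writing `function fit(...) ... end` with NO import creates
# a brand-new function, Main.fit, which merely shadows the generic one.
# MLJ never sees those methods, hence the MethodError above.
```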


ablaom commented Jul 13, 2021

BTW, you may want to focus on just the univariate case, in view of JuliaAI/MLJModels.jl#288 .


egolep commented Jul 13, 2021

Hi @ablaom,
moving from MLJModelInterface to MLJBase did the trick. I'm starting to think that it could be a version-related problem inside the virtual environment I was using for this little experiment.

Now the univariate case does work and I'm going to call it a day, following your suggestion.

Many thanks for your reply!


egolep commented Jul 13, 2021

I lied: MinMaxScaler() now works too, since leaving it unfinished triggered all my OCDs.

Thanks again for your replies. If I implement more models of this kind, would it be worth opening a pull request? Or are these transformers kept few in number for a reason?
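For anyone landing here later, once the methods extend MLJBase.fit (or MLJModelInterface.fit), typical usage follows the standard MLJ machine workflow. A sketch, with made-up column names:

```julia
using MLJBase, DataFrames

X = DataFrame(a = [1.0, 3.0, 5.0], b = [10.0, 20.0, 40.0])

mach = machine(MinMaxScaler(), X)   # bind the model to the data
fit!(mach)                          # learns per-column (min, max)
Xt = transform(mach, X)             # each Continuous column rescaled to [0, 1]
```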


ablaom commented Jul 13, 2021

No, a PR to MLJModels would be most welcome. You'll need to add a test...


ablaom commented Aug 2, 2021

@egolep Are you still interested in making a PR to MLJModels.jl? Let me know if I can help you make that happen.

@ablaom ablaom closed this as completed Aug 2, 2021
@ablaom ablaom reopened this Aug 2, 2021