Problems when only one class manifest in the training target #123

Closed · ablaom opened this issue Jul 7, 2022 · 5 comments · Labels: bug

ablaom (Member) commented Jul 7, 2022

In the training target below we have `length(levels(y)) == 2`, but `y` itself exhibits only one class. This crashes `fit`. Occasionally, especially in smaller data sets, a class may be "hidden" when we restrict to a particular fold, so this is an issue.

```julia
julia> X = (x1=rand(10), )
julia> using MLJBase
julia> y = coerce(vcat(fill("a", 10), ["b", ]), Multiclass)[1:10]
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"

julia> levels(y)
2-element Vector{String}:
 "a"
 "b"

julia> mach = machine(MultinomialClassifier(), X, y) |> fit!
[ Info: Training machine(MultinomialClassifier(lambda = 1.0, ), ).
[ Info: Solver: LBFGS()
┌ Error: Problem fitting the machine machine(MultinomialClassifier(lambda = 1.0, ), ). 
└ @ MLJBase ~/MLJ/MLJBase/src/machines.jl:617
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: ArgumentError: invalid Array dimensions
Stacktrace:
  [1] Array
    @ ./boot.jl:459 [inlined]
  [2] Array
    @ ./boot.jl:467 [inlined]
  [3] zeros
    @ ./array.jl:525 [inlined]
  [4] zeros
    @ ./array.jl:522 [inlined]
  [5] zeros
    @ ./array.jl:520 [inlined]
  [6] scratch(n::Int64, p::Int64, c::Int64; i::Bool)
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:49
  [7] fit(glr::GeneralizedLinearRegression{MultinomialLoss{0}, ScaledPenalty{L2Penalty}}, X::Matrix{Float64}, y::Vector{Int64}; solver::LBFGS)                                         
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:42
  [8] fit(m::MultinomialClassifier, verb::Int64, X::NamedTuple{(:x1,), Tuple{Vector{Float64}}}, y::CategoricalArrays.CategoricalVector{String, UInt32, String, CategoricalArrays.CategoricalValue{String, UInt32}, Union{}})      
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/mlj/interface.jl:78
  [9] fit_only!(mach::Machine{MultinomialClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool)                                                                         
    @ MLJBase ~/MLJ/MLJBase/src/machines.jl:615
 [10] fit_only!
    @ ~/MLJ/MLJBase/src/machines.jl:568 [inlined]
 [11] #fit!#58
    @ ~/MLJ/MLJBase/src/machines.jl:683 [inlined]
 [12] fit!
    @ ~/MLJ/MLJBase/src/machines.jl:681 [inlined]
 [13] |>(x::Machine{MultinomialClassifier, true}, f::typeof(fit!))
    @ Base ./operators.jl:966
 [14] top-level scope
    @ REPL[17]:1
```
ablaom added the `bug` label Jul 7, 2022
tlienart (Collaborator) commented Jul 8, 2022

MLJLM uses base data types (a vector of reals), so it doesn't get the levels information and just sees a vector with a single unique element. It could be made to take an optional levels argument to guarantee that this line:

```julia
c = getc(glr, y)
```

gets `c = 2` in this case, as opposed to `c = 1` (via `maximum(y)` here).

That would mean calling `fit` with an additional argument, but I think that would be fine.
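A minimal sketch of what that could look like; the `nclasses` keyword and the `getc_from_data` helper are hypothetical stand-ins, not the current MLJLinearModels API:

```julia
# Hypothetical sketch: let the caller override the class count that would
# otherwise be inferred from the data. getc_from_data stands in for the
# current maximum(y) heuristic; nclasses = 0 means "infer from y".
getc_from_data(y::Vector{Int}) = maximum(y)

function fit_sketch(X::Matrix{Float64}, y::Vector{Int}; nclasses::Int=0)
    c = nclasses > 0 ? nclasses : getc_from_data(y)
    # ... allocate scratch space sized by c and run the solver ...
    return c
end

y = fill(1, 10)                          # only class 1 manifest in this fold
fit_sketch(rand(10, 2), y)               # infers c = 1, which is wrong here
fit_sketch(rand(10, 2), y; nclasses=2)   # caller guarantees c = 2
```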

tlienart (Collaborator) commented Jul 8, 2022

This here

```julia
function MMI.fit(m::Union{CLF_MODELS...}, verb::Int, X, y)
    Xmatrix = MMI.matrix(X)
    sch = MMI.schema(X)
    features = (sch === nothing) ? nothing : sch.names
    yplain = convert.(Int, MMI.int(y))
    classes = MMI.classes(y[1])
    nclasses = length(classes)
    if nclasses < 2
        throw(DomainError("The target `y` needs to have two or more levels."))
    elseif nclasses == 2
        # recode to ± 1
        yplain[yplain .== 1] .= -1
        yplain[yplain .== 2] .= 1
        # force the binary case
        nclasses = 0
    end
    # NOTE: here the number of classes is either 0 or > 2
    clf = glr(m, nclasses)
    solver = m.solver === nothing ? _solver(clf, size(Xmatrix)) : m.solver
    verb > 0 && @info "Solver: $(solver)"
    # get the parameters
    θ = fit(clf, Xmatrix, yplain, solver=solver)
    # return
    return (θ, features, classes, nclasses), nothing, NamedTuple{}()
end
```

should get the right number of classes via the interface, though. We'd need to figure out why `nclasses` is not passed properly (or not computed properly).

There's definitely a bug here somewhere.
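For reference, a minimal, dependency-free trace of the recoding above on the one-class example (plain `Int` vectors stand in for the output of `MMI.int`; this is a sketch, not the package code):

```julia
# yplain as MMI.int would produce it: integer refs into the 2-level
# pool {"a", "b"}, but only level 1 ("a") occurs in this subsample.
yplain = fill(1, 10)
nclasses = 2                  # length(classes), from the pool

# the nclasses == 2 branch recodes to ±1 and forces the binary case
yplain[yplain .== 1] .= -1
yplain[yplain .== 2] .= 1
nclasses = 0

yplain              # all -1: the "binary" target MLJLinearModels receives
maximum(yplain)     # -1, which is what getc later computes from it
```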

tlienart self-assigned this Jul 8, 2022
ablaom (Member, Author) commented Jul 18, 2022

It has something to do with the encoding, which is weird. In the encoding, the classes are numbered starting with -1, not 0 or 1.

ablaom (Member, Author) commented Aug 2, 2022

@tlienart The problem is that, for some reason I don't understand at all, the encoding of `y` is special-cased if the pool of `y` has only two classes:

```julia
elseif nclasses == 2
```

If we subsample and only see one of the two classes, then the encoded `y` looks like `[-1, -1, ..., -1]`. In view of this, I think the definition of `getc` at this line

```julia
getc(m::MultiClassLoss{0}, y) = maximum(y)
```

is incorrect. The problem for me is that I really can't figure out what this `getc` is computing, as I'm not familiar enough with the code. What does the `scratch` function do, for example? I think you're the only one who can safely make a fix here.
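Whatever `getc` is meant to compute, the crash mechanism itself is visible in the stacktrace: `scratch` calls `zeros`, and a negative dimension throws exactly the reported error. A minimal reproduction, independent of MLJLinearModels:

```julia
y = fill(-1, 10)    # the recoded target when only one class manifests
c = maximum(y)      # what getc(m::MultiClassLoss{0}, y) returns here: -1
zeros(10, c)        # ERROR: ArgumentError: invalid Array dimensions
```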

For what it's worth, a better and safer design would probably be to remove all this binary special-casing altogether, if that makes sense here. But maybe you have your reasons...

tlienart added a commit that referenced this issue Aug 2, 2022
tlienart (Collaborator) commented Aug 2, 2022

The binary/multiclass distinction is in the internal representation of the vector. For binary it's more convenient to have -1/1, as it allows computations with a single column, as opposed to one column per class. I'm not quite ready to say that this is hugely essential and warrants the code style, but when initially working on this I was keen to try this kind of thing to squeeze out a bit more performance. Same with the scratch space, which initialises a bunch of arrays in which computations can be done in place, so that you only need to allocate once.
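For concreteness, a rough sketch of that dimension difference (shapes only; the actual internals differ):

```julia
n, p, c = 100, 5, 3
X = rand(n, p)

# binary with y in {-1, 1}: a single coefficient column suffices
θ_binary = zeros(p)
scores_binary = X * θ_binary    # n-vector of margins

# multinomial with c classes: one coefficient column per class
θ_multi = zeros(p, c)
scores_multi = X * θ_multi      # n × c matrix of class scores
```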

Anyway, here the problem is that I was trying to do a bit too much for the user:

  • the user specifies a Multinomial;
  • the user's data has 2 levels (obtained via `MMI.levels`);
  • the code tries to force a `LogisticClassifier`-style path by passing `nclasses = 0`, which falls back to guessing the class count from the data; that is clearly undesirable here.

Anyway, I've removed this: the target is now recoded to -1/1 only if the user explicitly specifies a binary classifier; otherwise the multinomial path is used, with some redundant computation in the two-class case. A test case with your example is added. Maybe not the way you'd have preferred, but I can't do much more at the moment.
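A rough sketch of the new branching (the `encode_target` helper and `force_binary` flag here are illustrative only; the real change is in the linked commit):

```julia
# Illustrative: recode to ±1 only when the user explicitly asked for the
# binary path; otherwise keep the pool's class count and take the
# multinomial path, even though it does redundant work for two classes.
function encode_target(yplain::Vector{Int}, nclasses::Int; force_binary::Bool=false)
    if force_binary && nclasses == 2
        y = copy(yplain)
        y[y .== 1] .= -1
        y[y .== 2] .= 1
        return y, 0           # 0 signals the binary code path downstream
    end
    return yplain, nclasses   # multinomial path, robust to hidden classes
end

encode_target(fill(1, 10), 2)                     # multinomial: c stays 2
encode_target(fill(1, 10), 2; force_binary=true)  # binary: all -1, c = 0
```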

tlienart reopened this Aug 2, 2022
tlienart closed this as completed Aug 3, 2022
tlienart added a commit that referenced this issue Sep 23, 2022
* closes #129 by fixing the interface after #123

* project toml fix + ready for patch release

* extending the tests a bit