Problems when only one class manifest in the training target #123

Closed · ablaom opened this issue Jul 7, 2022 · 5 comments · Labels: bug

ablaom (Member) commented Jul 7, 2022

In the training target below we have `length(levels(y)) == 2`, but `y` itself exhibits only one class. This crashes `fit`. Occasionally, especially in smaller data sets, a class may be "hidden" when we restrict to a particular fold, so this is an issue.

```julia
julia> X = (x1=rand(10), )
julia> using MLJBase
julia> y = coerce(vcat(fill("a", 10), ["b", ]), Multiclass)[1:10]
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"

julia> levels(y)
2-element Vector{String}:
 "a"
 "b"

julia> mach = machine(MultinomialClassifier(), X, y) |> fit!
[ Info: Training machine(MultinomialClassifier(lambda = 1.0, ), ).
[ Info: Solver: LBFGS()
┌ Error: Problem fitting the machine machine(MultinomialClassifier(lambda = 1.0, ), ). 
└ @ MLJBase ~/MLJ/MLJBase/src/machines.jl:617
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: ArgumentError: invalid Array dimensions
Stacktrace:
  [1] Array
    @ ./boot.jl:459 [inlined]
  [2] Array
    @ ./boot.jl:467 [inlined]
  [3] zeros
    @ ./array.jl:525 [inlined]
  [4] zeros
    @ ./array.jl:522 [inlined]
  [5] zeros
    @ ./array.jl:520 [inlined]
  [6] scratch(n::Int64, p::Int64, c::Int64; i::Bool)
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:49
  [7] fit(glr::GeneralizedLinearRegression{MultinomialLoss{0}, ScaledPenalty{L2Penalty}}, X::Matrix{Float64}, y::Vector{Int64}; solver::LBFGS)                                         
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:42
  [8] fit(m::MultinomialClassifier, verb::Int64, X::NamedTuple{(:x1,), Tuple{Vector{Float64}}}, y::CategoricalArrays.CategoricalVector{String, UInt32, String, CategoricalArrays.CategoricalValue{String, UInt32}, Union{}})      
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/mlj/interface.jl:78
  [9] fit_only!(mach::Machine{MultinomialClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool)                                                                         
    @ MLJBase ~/MLJ/MLJBase/src/machines.jl:615
 [10] fit_only!
    @ ~/MLJ/MLJBase/src/machines.jl:568 [inlined]
 [11] #fit!#58
    @ ~/MLJ/MLJBase/src/machines.jl:683 [inlined]
 [12] fit!
    @ ~/MLJ/MLJBase/src/machines.jl:681 [inlined]
 [13] |>(x::Machine{MultinomialClassifier, true}, f::typeof(fit!))
    @ Base ./operators.jl:966
 [14] top-level scope
    @ REPL[17]:1
```
ablaom added the `bug` label Jul 7, 2022
tlienart (Collaborator) commented Jul 8, 2022

MLJLM uses base data types (a vector of reals), so it doesn't get the levels information and just sees a vector with a single unique element. It could be made to take an optional levels argument to guarantee that this line:

```julia
c = getc(glr, y)
```

gets `c = 2` in this case, as opposed to `c = 1` (via `maximum(y)` here).

That would mean calling `fit` with an additional argument, but I think that would be fine.
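A minimal sketch of what that could look like; the `nclasses` keyword and the `getc_from_data` helper are hypothetical stand-ins, not the current MLJLinearModels API:

```julia
# Hypothetical sketch: let the caller override the class count that would
# otherwise be inferred from the data. getc_from_data stands in for the
# current maximum(y) heuristic; nclasses = 0 means "infer from y".
getc_from_data(y::Vector{Int}) = maximum(y)

function fit_sketch(X::Matrix{Float64}, y::Vector{Int}; nclasses::Int=0)
    c = nclasses > 0 ? nclasses : getc_from_data(y)
    # ... allocate scratch space sized by c and run the solver ...
    return c
end

y = fill(1, 10)                          # only class 1 manifest in this fold
fit_sketch(rand(10, 2), y)               # infers c = 1, which is wrong here
fit_sketch(rand(10, 2), y; nclasses=2)   # caller guarantees c = 2
```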

tlienart (Collaborator) commented Jul 8, 2022

This here

```julia
function MMI.fit(m::Union{CLF_MODELS...}, verb::Int, X, y)
    Xmatrix = MMI.matrix(X)
    sch = MMI.schema(X)
    features = (sch === nothing) ? nothing : sch.names
    yplain = convert.(Int, MMI.int(y))
    classes = MMI.classes(y[1])
    nclasses = length(classes)
    if nclasses < 2
        throw(DomainError("The target `y` needs to have two or more levels."))
    elseif nclasses == 2
        # recode to ± 1
        yplain[yplain .== 1] .= -1
        yplain[yplain .== 2] .= 1
        # force the binary case
        nclasses = 0
    end
    # NOTE: here the number of classes is either 0 or > 2
    clf = glr(m, nclasses)
    solver = m.solver === nothing ? _solver(clf, size(Xmatrix)) : m.solver
    verb > 0 && @info "Solver: $(solver)"
    # get the parameters
    θ = fit(clf, Xmatrix, yplain, solver=solver)
    # return
    return (θ, features, classes, nclasses), nothing, NamedTuple{}()
end
```

should get the right number of classes via the interface, though. We'd need to figure out why `nclasses` is not passed properly (or not computed properly).

There's definitely a bug here somewhere.
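For reference, a minimal, dependency-free trace of the recoding above on the one-class example (plain `Int` vectors stand in for the output of `MMI.int`; this is a sketch, not the package code):

```julia
# yplain as MMI.int would produce it: integer refs into the 2-level
# pool {"a", "b"}, but only level 1 ("a") occurs in this subsample.
yplain = fill(1, 10)
nclasses = 2                  # length(classes), from the pool

# the nclasses == 2 branch recodes to ±1 and forces the binary case
yplain[yplain .== 1] .= -1
yplain[yplain .== 2] .= 1
nclasses = 0

yplain              # all -1: the "binary" target MLJLinearModels receives
maximum(yplain)     # -1, which is what getc later computes from it
```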

tlienart self-assigned this Jul 8, 2022
ablaom (Member, Author) commented Jul 18, 2022

It has something to do with the encoding, which is weird. In the encoding, the classes are numbered starting with -1, not 0 or 1.

ablaom (Member, Author) commented Aug 2, 2022

@tlienart The problem is that, for some reason I don't understand at all, the encoding of `y` is special-cased if the pool of `y` has only two classes:

```julia
elseif nclasses == 2
```

If we subsample and only see one of the two classes, then the encoded `y` looks like `[-1, -1, ..., -1]`. In view of this, I think the definition of `getc` at this line

```julia
getc(m::MultiClassLoss{0}, y) = maximum(y)
```

is incorrect. The problem for me is that I really can't figure out what this `getc` is computing, as I'm not familiar enough with the code. What does the `scratch` function do, for example? I think you're the only one who can safely make a fix here.
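Whatever `getc` is meant to compute, the crash mechanism itself is visible in the stacktrace: `scratch` calls `zeros`, and a negative dimension throws exactly the reported error. A minimal reproduction, independent of MLJLinearModels:

```julia
y = fill(-1, 10)    # the recoded target when only one class manifests
c = maximum(y)      # what getc(m::MultiClassLoss{0}, y) returns here: -1
zeros(10, c)        # ERROR: ArgumentError: invalid Array dimensions
```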

For what it's worth, a better and safer design would probably be to remove all this binary special-casing altogether, if that makes sense here. But maybe you have your reasons...

tlienart added a commit that referenced this issue Aug 2, 2022
tlienart (Collaborator) commented Aug 2, 2022

The binary/multiclass distinction is in the internal representation of the vector. For binary it's more convenient to have -1/1, as it allows computations with a single column, as opposed to one column per class. I'm not quite ready to say that this is hugely essential and warrants the code style, but when initially working on this I was keen to try this kind of thing to squeeze out a bit more performance. Same with the scratch space, which initialises a bunch of arrays in which computations can be done in place, so that you only need to allocate once.
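For concreteness, a rough sketch of that dimension difference (shapes only; the actual internals differ):

```julia
n, p, c = 100, 5, 3
X = rand(n, p)

# binary with y in {-1, 1}: a single coefficient column suffices
θ_binary = zeros(p)
scores_binary = X * θ_binary    # n-vector of margins

# multinomial with c classes: one coefficient column per class
θ_multi = zeros(p, c)
scores_multi = X * θ_multi      # n × c matrix of class scores
```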

Anyway, here the problem is that I was trying to do a bit too much for the user:

  • the user specifies a Multinomial;
  • the user's data has 2 levels (obtained via `MMI.levels`);
  • the code tries to force a `LogisticClassifier`-style path by passing `nclasses = 0`, which falls back to guessing the class count from the data; that is clearly undesirable here.

Anyway, I've removed this: the target is now recoded to -1/1 only if the user explicitly specifies a binary classifier; otherwise the multinomial path is used, with some redundant computation in the two-class case. A test case with your example is added. Maybe not the way you'd have preferred, but I can't do much more at the moment.
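A rough sketch of the new branching (the `encode_target` helper and `force_binary` flag here are illustrative only; the real change is in the linked commit):

```julia
# Illustrative: recode to ±1 only when the user explicitly asked for the
# binary path; otherwise keep the pool's class count and take the
# multinomial path, even though it does redundant work for two classes.
function encode_target(yplain::Vector{Int}, nclasses::Int; force_binary::Bool=false)
    if force_binary && nclasses == 2
        y = copy(yplain)
        y[y .== 1] .= -1
        y[y .== 2] .= 1
        return y, 0           # 0 signals the binary code path downstream
    end
    return yplain, nclasses   # multinomial path, robust to hidden classes
end

encode_target(fill(1, 10), 2)                     # multinomial: c stays 2
encode_target(fill(1, 10), 2; force_binary=true)  # binary: all -1, c = 0
```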

tlienart reopened this Aug 2, 2022
tlienart closed this as completed Aug 3, 2022
tlienart added a commit that referenced this issue Sep 23, 2022
* closes #129 by fixing the interface after #123

* project toml fix + ready for patch release

* extending the tests a bit