Bug: evaluate! crashes being called several times in a row when acceleration is used #788

Closed
irublev opened this issue May 21, 2021 · 8 comments


irublev commented May 21, 2021

Describe the bug

When evaluate! is called several times in a row with acceleration enabled, it sometimes crashes with the following stack trace:

[ Info: Performing evaluations using 12 threads.
Evaluating over 10 folds:  20%[=====>                   ]  ETA: 0:00:00ERROR: TaskFailedException

    nested task error: InexactError: trunc(Int64, NaN)
    Stacktrace:
     [1] trunc
       @ .\float.jl:716 [inlined]
     [2] round
       @ .\float.jl:296 [inlined]
     [3] calc_check_iterations
       @ D:\Users\Ilya\.julia\packages\ProgressMeter\l7LEt\src\ProgressMeter.jl:246 [inlined]
     [4] updateProgress!(p::ProgressMeter.Progress; showvalues::Tuple{}, truncate_lines::Bool, valuecolor::Symbol, offset::Int64, keep::Bool, desc::Nothing, ignore_predictor::Bool)
       @ ProgressMeter D:\Users\Ilya\.julia\packages\ProgressMeter\l7LEt\src\ProgressMeter.jl:293
     [5] updateProgress!
       @ D:\Users\Ilya\.julia\packages\ProgressMeter\l7LEt\src\ProgressMeter.jl:253 [inlined]
     [6] macro expansion
       @ D:\Users\Ilya\.julia\packages\MLJBase\DhBkA\src\resampling.jl:786 [inlined]
     [7] (::MLJBase.var"#270#274"{Channel{Bool}, ProgressMeter.Progress})()
       @ MLJBase .\task.jl:411
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:369
 [2] macro expansion
   @ .\task.jl:388 [inlined]
 [3] _evaluate!(func::MLJBase.var"#fit_and_extract_on_fold#284"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, typeof(predict), Bool, Bool, CategoricalArrays.CategoricalVector{String, UInt32, String, CategoricalArrays.CategoricalValue{String, UInt32}, Union{}}, NamedTuple{(:sepal_length, :sepal_width, :petal_length, :petal_width), NTuple{4, Vector{Float64}}}}, mach::Machine{LogisticClassifier, true}, accel::CPUThreads{Int64}, nfolds::Int64, verbosity::Int64)
   @ MLJBase D:\Users\Ilya\.julia\packages\MLJBase\DhBkA\src\resampling.jl:781
 [4] evaluate!(mach::Machine{LogisticClassifier, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operation::typeof(predict), acceleration::CPUThreads{Int64}, force::Bool)
   @ MLJBase D:\Users\Ilya\.julia\packages\MLJBase\DhBkA\src\resampling.jl:900
 [5] evaluate!(::Machine{LogisticClassifier, true}, ::StratifiedCV, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Function, ::CPUThreads{Int64}, ::Bool)
   @ MLJBase D:\Users\Ilya\.julia\packages\MLJBase\DhBkA\src\resampling.jl:965
 [6] #evaluate!#260
   @ D:\Users\Ilya\.julia\packages\MLJBase\DhBkA\src\resampling.jl:677 [inlined]
 [7] top-level scope
   @ REPL[17]:1

To Reproduce

using MLJ, MLJLinearModels, MLJBase

X, y = @load_iris
classifier = LogisticClassifier(lambda=10.0, gamma=0.0, penalty=:l2, fit_intercept=true, solver=LBFGS())
mach = machine(classifier, X, categorical(y)) |> fit!

cv = StratifiedCV(nfolds=10, shuffle=false)

evaluate!(mach, resampling=cv, measure=[LogLoss()], acceleration=CPUThreads())
evaluate!(mach, resampling=cv, measure=[LogLoss()], acceleration=CPUThreads())
evaluate!(mach, resampling=cv, measure=[LogLoss()], acceleration=CPUThreads())
evaluate!(mach, resampling=cv, measure=[LogLoss()], acceleration=CPUThreads())

Expected behavior

The behaviour should be the same with acceleration and without it, without any crashes.

Additional context

The code was run on Windows 10 with Julia version 1.6.1 (2021-04-23). Julia was launched with 12 threads:

julia -t 12

Versions

MLJ v0.16.4

MLJBase v0.18.6

MLJLinearModels v0.5.4

@irublev irublev changed the title evaluate! crashes being called several times in a row when acceleration is used evaluate! crashes being called several times in a row when acceleration is used May 21, 2021
@irublev irublev changed the title evaluate! crashes being called several times in a row when acceleration is used Bug: evaluate! crashes being called several times in a row when acceleration is used May 22, 2021
Member

ablaom commented May 24, 2021

Thanks for reporting. Interesting issue.

Unfortunately, I cannot reproduce this yet. The issue appears to be related to ProgressMeter, and it looks like we are not using the same version. Can you please post the output of using Pkg; Pkg.status(mode=PKGMODE_MANIFEST)?

Also, if you know how to do this, can you pin your ProgressMeter version to 1.5 and see if you still get an error? (For example, from the REPL you could do ]add ProgressMeter@1.5)
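The same pinning step can be scripted through the Pkg API instead of the REPL's package mode. The following is a minimal sketch, not part of the original exchange: the helper name `pin_progressmeter` is hypothetical, and it assumes the affected project environment is already active. Note that `Pkg.add` makes ProgressMeter a direct dependency of the project, even if it was previously only pulled in by MLJBase.

```julia
using Pkg

# Hypothetical helper (not from MLJ or ProgressMeter): install a specific
# ProgressMeter version in the active environment and pin it there.
function pin_progressmeter(version::String = "1.5")
    # Equivalent to `]add ProgressMeter@1.5` in the REPL's package mode.
    Pkg.add(name = "ProgressMeter", version = version)
    # Pinning keeps later `Pkg.update()` calls from changing the version.
    Pkg.pin("ProgressMeter")
    # Print the resolved version as recorded in the manifest.
    Pkg.status("ProgressMeter"; mode = PKGMODE_MANIFEST)
end
```

Calling `pin_progressmeter()` requires network access to the package registry. The pin can later be undone with `Pkg.free("ProgressMeter")`.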

Author

irublev commented May 24, 2021

Thanks for your answer. The result of Pkg.status(mode=PKGMODE_MANIFEST) is as follows:

julia> Pkg.status(mode=PKGMODE_MANIFEST)
      Status `D:\Users\Ilya\Work\Projects\juliatest\Manifest.toml`
  [e3c3008a] AMLPipelineBase v0.1.9
  [4fba245c] ArrayInterface v3.1.12
  [08437348] AutoMLPipeline v0.3.5
  [fbb218c0] BSON v0.3.3
  [336ed68f] CSV v0.8.4
  [324d7699] CategoricalArrays v0.9.7
  [d360d2e6] ChainRulesCore v0.9.43
  [da1fd8a2] CodeTracking v1.0.5
  [944b1d66] CodecZlib v0.7.0
  [3da002f7] ColorTypes v0.10.12
  [bbf7d656] CommonSubexpressions v0.3.0
  [34da2185] Compat v3.28.0
  [ed09eef8] ComputationalResources v0.3.2
  [8f4d0f93] Conda v1.5.2
  [a8cc5b0e] Crayons v4.0.4
  [9a962f9c] DataAPI v1.6.0
  [a93c6f00] DataFrames v0.22.7
  [864edb3b] DataStructures v0.18.9
  [e2d170a0] DataValueInterfaces v1.0.0
  [7806a523] DecisionTree v0.10.10
  [163ba53b] DiffResults v1.0.3
  [b552c78f] DiffRules v1.0.2
  [b4f34e82] Distances v0.10.3
  [31c24e10] Distributions v0.24.18
  [ffbed154] DocStringExtensions v0.8.4
  [792122b4] EarlyStopping v0.1.8
  [e2ba6199] ExprTools v0.1.3
  [8f5d6c58] EzXML v1.1.0
  [48062228] FilePathsBase v0.9.10
  [1a297f60] FillArrays v0.11.7
  [6a86dc24] FiniteDiff v2.8.0
  [53c48c17] FixedPointNumbers v0.8.4
  [59287772] Formatting v0.4.2
  [f6369f11] ForwardDiff v0.10.18
  [cd3eb016] HTTP v0.8.19
  [eafb193a] Highlights v0.5.1
  [615f187c] IfElse v0.1.0
  [83e8ac13] IniFile v0.5.0
  [41ab1584] InvertedIndices v1.0.0
  [c8e1da08] IterTools v1.3.0
  [b3c1a2ee] IterationControl v0.4.0
  [42fd0dbc] IterativeSolvers v0.9.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [1019f520] JLFzf v0.1.3
  [692b3bcd] JLLWrappers v1.3.0
  [9da8a3cd] JLSO v2.5.0
  [682c06a0] JSON v0.21.1
  [aa1ae85d] JuliaInterpreter v0.8.16
  [a5e1c1ea] LatinHypercubeSampling v1.8.0
  [7f8f8fb0] LearnBase v0.3.0
  [d3d80556] LineSearches v7.1.1
  [7a12625a] LinearMaps v3.3.0
  [2ab3a3ac] LogExpFunctions v0.2.3
  [30fc2ffe] LossFunctions v0.6.0
  [6f1432cf] LoweredCodeUtils v2.1.0
  [f0e99cf1] MLBase v0.8.0
  [add582a8] MLJ v0.16.4
  [a7f614a8] MLJBase v0.18.1
  [614be32b] MLJIteration v0.3.0
  [6ee0df7b] MLJLinearModels v0.5.4
  [e80e1ace] MLJModelInterface v1.0.0
  [d491faf4] MLJModels v0.14.6
  [cbea4545] MLJOpenML v1.0.0
  [2e2323e0] MLJScientificTypes v0.4.5
  [17bed46d] MLJSerialization v1.1.2
  [03970b2e] MLJTuning v0.6.5
  [1914dd2f] MacroTools v0.5.6
  [739be429] MbedTLS v1.0.3
  [f28f55f0] Memento v1.1.2
  [e1d29d7a] Missings v0.4.5
  [78c3b35d] Mocking v0.7.1
  [d41bc354] NLSolversBase v7.8.0
  [77ba4419] NaNMath v0.3.5
  [5fb14364] OhMyREPL v0.5.10
  [429524aa] Optim v1.3.0
  [bac558e1] OrderedCollections v1.4.1
  [90014a1f] PDMats v0.11.0
  [d96e819e] Parameters v0.12.2
  [69de0a69] Parsers v1.1.0
  [b1ad91c1] PersistenceDiagramsBase v0.1.1
  [b98c9c47] Pipe v1.3.0
  [2dfb63ee] PooledArrays v1.2.1
  [85a6dd25] PositiveFactorizations v0.2.4
  [21216c6a] Preferences v1.2.2
  [08abe8d2] PrettyTables v0.11.1
  [92933f4c] ProgressMeter v1.6.2
  [438e738f] PyCall v1.92.3
  [1fd47b50] QuadGK v2.4.1
  [3cdcf5f2] RecipesBase v1.1.1
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.1.3
  [295af30f] Revise v3.1.16
  [79098fc4] Rmath v0.7.0
  [321657f4] ScientificTypes v1.1.2
  [6e75b9c4] ScikitLearnBase v0.5.0
  [91c51154] SentinelArrays v1.2.16
  [a2af1166] SortingAlgorithms v0.3.1
  [276daf66] SpecialFunctions v1.4.0
  [860ef19b] StableRNGs v1.0.0
  [aedffcd0] Static v0.2.4
  [90137ffa] StaticArrays v1.2.0
  [64bff920] StatisticalTraits v1.0.0
  [82ae8749] StatsAPI v1.0.0
  [2913bbd2] StatsBase v0.33.8
  [4c63d2b9] StatsFuns v0.9.8
  [856f2bd8] StructTypes v1.7.2
  [cea106d9] Syslogs v0.3.0
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.4.2
  [f269a46b] TimeZones v1.5.5
  [0796e94c] Tokenize v0.5.16
  [3bb67fe8] TranscodingStreams v0.9.5
  [3a884ed6] UnPack v1.0.2
  [81def892] VersionParsing v1.2.0
  [94ce4f54] Libiconv_jll v1.16.0+7
  [efe28fd5] OpenSpecFun_jll v0.5.4+0
  [f50d1b31] Rmath_jll v0.3.0+0
  [02c8fc9c] XML2_jll v2.9.11+0
  [214eeab7] fzf_jll v0.24.4+0
  [0dad84c5] ArgTools
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [4607b0f0] SuiteSparse
  [fa267f1f] TOML
  [a4e569a6] Tar
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll
  [deac9b47] LibCURL_jll
  [29816b5a] LibSSH2_jll
  [c8ffd9c3] MbedTLS_jll
  [14a3606d] MozillaCACerts_jll
  [83775a58] Zlib_jll
  [8e850ede] nghttp2_jll
  [3f19e933] p7zip_jll

Author

irublev commented May 24, 2021

@ablaom After pinning ProgressMeter version to 1.5 the error disappeared.

Member

ablaom commented May 24, 2021

@irublev Great, thanks. So the workaround is to pin ProgressMeter to version 1.5.

Preliminary investigation: it looks like ProgressMeter introduced a new check that sometimes trips the code. The stack trace at the beginning of this thread points to the relevant code in version 1.6.2 of ProgressMeter. The relevant MLJBase call is here: https://github.com/alan-turing-institute/MLJBase.jl/blob/f04698bc62dd8876b53326aec38758ab7bd373c4/src/resampling.jl#L785 . Something about our input is generating a NaN in the ProgressMeter check.

The fact that the error only occurs sometimes smells of something not being thread safe.
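For readers unfamiliar with the `InexactError: trunc(Int64, NaN)` in the stack trace, here is a small illustration of how a NaN reaching an integer rounding step produces that error, and one defensive way to guard against it. This is only a sketch: both functions are hypothetical stand-ins and are not ProgressMeter's actual code, and the 0/0 scenario is an assumed example of how a NaN might arise, not the confirmed root cause.

```julia
# Hypothetical stand-in for a rate-based throttling check. NOT ProgressMeter's code.
# delta_count: iterations since the last redraw
# elapsed:     seconds since the last redraw
# dt:          desired seconds between redraws
function unguarded_check_iterations(delta_count::Integer, elapsed::Real, dt::Real)
    rate = delta_count / elapsed      # 0 / 0.0 evaluates to NaN
    return round(Int, rate * dt)      # rounding NaN to Int throws InexactError
end

# Defensive variant: fall back to checking on every iteration whenever the
# estimate is NaN or infinite, and keep the result in a sane range.
function guarded_check_iterations(delta_count::Integer, elapsed::Real, dt::Real)
    estimate = delta_count / elapsed * dt
    isfinite(estimate) || return 1
    return round(Int, clamp(estimate, 1.0, 1.0e9))
end
```

With no iterations completed and no measurable time elapsed, `unguarded_check_iterations(0, 0.0, 0.1)` throws an `InexactError`, while `guarded_check_iterations(0, 0.0, 0.1)` returns 1. Such a condition would depend on timing, which is consistent with the error appearing only on some runs.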

@OkonSamuel Be great if you have a chance to look at this.

Member

ablaom commented May 31, 2021

For the record, this is turning up in MLJTuning as well, in a non-multithreading context: https://github.com/alan-turing-institute/MLJTuning.jl/runs/2707686278

Member

ablaom commented Jun 1, 2021

@irublev It seems that the issue was with ProgressMeter and a fix has been released.

Can you update your environment and, ensuring ProgressMeter is at 1.7.1, see if you can still reproduce the fail?
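One possible way to carry out this update and confirm the resolved version from Julia code is sketched below. The helper names `package_info` and `update_progressmeter` are hypothetical, and the sketch assumes ProgressMeter is already present in the active environment (otherwise `Pkg.update` raises an error). If the package was pinned as part of the earlier workaround, the pin has to be released first.

```julia
using Pkg

# Hypothetical helper: look up a package by name in the active environment.
# Returns the Pkg metadata entry, or `nothing` if the package is absent.
function package_info(pkgname::AbstractString)
    for info in values(Pkg.dependencies())
        info.name == pkgname && return info
    end
    return nothing
end

# Hypothetical helper: release an earlier pin (if any), update ProgressMeter,
# and return the version the environment now resolves to.
function update_progressmeter()
    info = package_info("ProgressMeter")
    # A pinned package is skipped by updates, so free it first.
    info !== nothing && info.is_pinned && Pkg.free("ProgressMeter")
    Pkg.update("ProgressMeter")
    info = package_info("ProgressMeter")
    return info === nothing ? nothing : info.version
end
```

After running `update_progressmeter()`, the returned version can be compared against the release mentioned above, for example with `update_progressmeter() >= v"1.7.1"`.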

Thanks for your patience.

cc @OkonSamuel

Author

irublev commented Jun 1, 2021

@ablaom Thanks for your help. Updating ProgressMeter to version 1.7.1 seems to have fixed everything; I can no longer reproduce the problem.

And maybe this is not the right place for this question, but could you please take a look at JuliaAI/MLJLinearModels.jl#98? I'd like to find out why LogisticClassifier is significantly slower than LogisticRegression in Python's scikit-learn (I'd like to note that some other models, such as DecisionTreeClassifier, perform better than in scikit-learn). I do not understand what to do next. To make the comparison fair I would need to configure the solver I used, which is impossible without modifying the code. But first of all, I do not understand where the problem may lie: in MLJLinearModels or in Optim.jl itself?

All I'd like to do for now is ask for your advice: maybe it would be better to create an issue not only in MLJLinearModels but also somewhere else, to attract more attention from the community? Thank you very much in advance.

Member

ablaom commented Jun 1, 2021

And maybe this is not the right place for this question, but could you please take a look at JuliaAI/MLJLinearModels.jl#98?

I'm afraid I would have nothing to add beyond what has been posted at the referenced issue by the author of MLJLinearModels. You could try a different forum, say the Julia Discourse, but I expect persevering with that discussion is your best strategy: that author is the most familiar with the package, as well as with Optim.jl, and has already been quite helpful, it seems to me.

@ablaom ablaom closed this as completed Jun 1, 2021