
[FastRGF] FastRGF doesn't work for small sample and need to fix integration test for FastRGF #92

Open
fukatani opened this issue Dec 3, 2017 · 18 comments

@fukatani
Member

fukatani commented Dec 3, 2017

Currently, the sklearn integration tests for FastRGFClassifier and FastRGFRegressor fail.

FastRGF doesn't work well on small samples, which is why the tests fail.
I suspect the problem lies inside the FastRGF executable itself.
I inspected FastRGF with a debugger and found that the discretization boundaries are invalid.

At the very least, rgf_python should raise an understandable error if discretization fails.
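A minimal sketch of such an error, assuming the wrapper launches the FastRGF executable as a subprocess: a pre-check on the number of samples plus a check of the process exit code. `FastRGFError`, `check_sample_size`, `run_fastrgf` and the `min_samples` threshold are hypothetical names, not part of the rgf_python API, and this issue does not establish a safe threshold.

```python
import subprocess


class FastRGFError(RuntimeError):
    """Hypothetical exception type for FastRGF failures (not an rgf_python API)."""


def check_sample_size(n_samples, min_samples):
    """Fail early with a readable message instead of letting the executable crash.

    min_samples is a hypothetical, caller-chosen threshold; no safe value is
    known from this issue.
    """
    if n_samples < min_samples:
        raise FastRGFError(
            f"FastRGF received only {n_samples} samples (minimum {min_samples}); "
            "discretization may produce invalid boundaries on such small data."
        )


def run_fastrgf(cmd):
    """Run a FastRGF command line and turn a crash into a Python exception."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode < 0:
        # POSIX only: a negative code -N means the child was killed by signal N
        # (for example -11 for SIGSEGV).
        raise FastRGFError(
            f"FastRGF was killed by signal {-result.returncode}.\n{result.stderr}"
        )
    if result.returncode != 0:
        # On Windows an access violation typically shows up as 0xC0000005.
        raise FastRGFError(
            f"FastRGF exited with code {result.returncode}.\n{result.stderr}"
        )
    return result.stdout
```

The sample-size pre-check only helps if the failing sizes are known in advance; the exit-code check is the more general safety net, since it also covers crashes like the SIGSEGV in the logs below.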

@fukatani fukatani added the bug label Dec 3, 2017
@StrikerRUS
Member

I've asked Tong Zhang to look into this issue (and also another one about small values of weights) and he answered the following:

Appreciate the bug reports. I’d try to look into them this weekend.

He also asked you to create a PR if you manage to fix this:

I inspect Fast RGF by debugger, discretization boundaries are invalid.

Also if your friend finds some bugs he fixed, please let me know, and maybe he’d like to contribute directly to the source code.

@fukatani
Member Author

Unfortunately, fixing this issue is difficult for me...

@StrikerRUS
Member

Then maybe you could share your findings through an issue in their repo?

@fukatani
Member Author

Of course I can!
But please give me some time; I'm a little busy and have forgotten the details of my debugging.

@StrikerRUS
Member

@fukatani Any updates?

@StrikerRUS
Member

StrikerRUS commented Jun 18, 2018

Some details from debugging

Steps to reproduce:

  1. git clone https://github.com/baidu/fast_rgf.git
  2. compile forest_train and forest_predict
  3. cd fast_rgf/examples/ex2
  4. keep only the first rows of fast_rgf/examples/ex2/inputs/housing.train, fewer than 28 of them (I used 25)
  5. run the command gdb --args "../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x" trn.target=REAL model.save="outputs/model-rgf"
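Step 4 can be scripted. This is a small sketch, assuming housing.train is a plain-text file with one sample per line (which is what the y.x format suggests); `keep_first_rows` is a hypothetical helper, and it keeps a `.bak` copy of the original file.

```python
import shutil
from pathlib import Path


def keep_first_rows(path, n_rows=25):
    """Keep only the first n_rows lines of a data file, backing up the original.

    Assumes one training sample per line. Returns the number of rows kept.
    """
    src = Path(path)
    backup = src.with_name(src.name + ".bak")
    if not backup.exists():
        # Copy once so repeated runs never overwrite the pristine data.
        shutil.copyfile(src, backup)
    # Always truncate from the backup, so n_rows can be changed between runs.
    with backup.open() as f:
        lines = f.readlines()
    kept = lines[:n_rows]
    src.write_text("".join(kept))
    return len(kept)
```

Run it from fast_rgf/examples/ex2 as `keep_first_rows("inputs/housing.train", 25)`; because it always reads from the backup, trying 24, 26 or 27 rows afterwards needs no fresh clone.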

Operating System: Windows 10 64-bit
Compiler: MinGW-w64 (x86_64-posix-seh-rev0) g++ 5.4.0

Output:

D:\Users\nekit\Downloads\fast_rgf\examples\ex2>gdb --args "../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x"  trn.target=REAL model.save="outputs/model-rgf"
GNU gdb (GDB) 7.10.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../bin/forest_train...done.
(gdb) run
Starting program: D:\Users\nekit\Downloads\fast_rgf\bin\forest_train.exe "-config=inputs/config" "trn.x-file=inputs/housing.train" "trn.x-file_format=y.x" "trn.target=REAL" "model.save=outputs/model-rgf"
[New Thread 5716.0x19d0]
[New Thread 5716.0x1ec8]
[New Thread 5716.0x348]
[New Thread 5716.0x17ec]

reading options from configuration file <inputs/config>

 using up to 8 openmp threads
 the number of threads is set to 8, which is the maximum number of logical hardware threads including hyperthreads
 the optimal number of threads is often the number of physical cores that may be smaller than 8
 for example, to achieve better performance, you may try to set the number of threads to 4

loading training data ...
  trn.target=REAL
  trn.x-file_format=y.x
  trn.x-file=inputs/housing.train
  trn.y-file=
  trn.w-file=
[New Thread 5716.0x213c]
[New Thread 5716.0x2548]
[New Thread 5716.0x94c]
[New Thread 5716.0x2500]
[New Thread 5716.0x1a5c]
[New Thread 5716.0x1bf8]
[New Thread 5716.0x2750]
loading time: wall time=0 seconds; cpu time=0 seconds.
discretizing training data ...
  discretize.dense.min_bucket_weights=5.000000
  discretize.dense.max_buckets=250
  discretize.dense.lamL2=10
  discretize.sparse.min_bucket_weights=5.000000
  discretize.sparse.max_buckets=200
  discretize.sparse.max_features=80000
  discretize.sparse.min_occrrences=5
  discretize.sparse.missing_type=MIN
  discretize.sparse.lamL2=2.000000
discritizer training time: wall time=0 seconds; cpu time=0 seconds.


training decision forest ...
  dtree.loss=LS
  dtree.max_level=6
  dtree.max_nodes=50
  dtree.new_tree_gain_ratio=1.0
  dtree.min_sample=5
  dtree.lamL1=10
  dtree.lamL2=1000
  forest.opt=rgf
  forest.ntrees=1000
  forest.eval_frequency=50
  forest.save_frequency=0


  training data size= 25 with 0 dense features and 1 sparse feature groups


build tree     1/ 1000
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 5716.0x2750]
0x000000000042fa86 in _decisionTreeTrainer::YW_struct::add (this=0xabc0b0310, yp=-0, wp=1) at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
40            y+=yp;
(gdb) backtrace
#0  0x000000000042fa86 in _decisionTreeTrainer::YW_struct::add (this=0xabc0b0310, yp=-0, wp=1) at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
#1  0x0000000000426f99 in _decisionTreeTrainer::TrainTarget::yw_LS_add (this=0x1626100, yw=..., w=1, res=0) at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:118
#2  0x000000000042596a in _decisionTreeTrainer::TrainTarget::compute_yw (this=0x1626100, reverse_index=0x162cfe0, b=28, e=25, yw=0x162d190, num_yw=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:268
#3  0x00000000005300b8 in _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR::map_range(int, int, int) (this=0xcfef70, tid=7, b=28, e=25)
    at D:\Users\nekit\Downloads\fast_rgf\src\forest\dtree_trainer.cpp:550
#4  0x0000000000445e0d in rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool) (this=0xcfefb0, mr=..., begin=0, end=25, tid=7, nthreads=8, run_range=true) at D:/Users/nekit/Downloads/fast_rgf/include/utils.h:211
#5  0x0000000000404ddf in rgf::MapReduceRunner::run_threads<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR> () at D:/Users/nekit/Downloads/fast_rgf/include/utils.h:256
#6  0x0000000063607498 in libgomp-1!gomp_free_thread () from D:\Program Files\MinGW\mingw64\bin\libgomp-1.dll
#7  0x000000006494b98c in pthread_create_wrapper () from D:\Program Files\MinGW\mingw64\opt\bin\libwinpthread-1.dll
#8  0x00007ffef4ecb2ba in msvcrt!_beginthreadex () from C:\Windows\System32\msvcrt.dll
#9  0x00007ffef4ecb38c in msvcrt!_endthreadex () from C:\Windows\System32\msvcrt.dll
#10 0x00007ffef4f38364 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll
#11 0x00007ffef5717091 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#12 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

UPD:

Operating System: Ubuntu 18.04 64-bit
Compiler: g++ 5.5.0/7.3.0

Output:
No errors.

UPD2:

Operating System: Ubuntu 16.04 32-bit
Compiler: g++ 5.4.0

Output:
No errors.

UPD3:

Operating System: macOS 10.12 64-bit
Compiler: g++ 8.1.0

Output:
No errors.

UPD4:

Operating System: Windows 10 64-bit
Compiler: MinGW-w64 (x86_64-posix-seh-rev0) g++ 8.1.0

Output:

D:\Users\nekit\Downloads\fast_rgf\examples\ex2>gdb --args "../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x" trn.target=REAL model.save="outputs/model-rgf"
GNU gdb (GDB) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../bin/forest_train...done.
(gdb) run
Starting program: D:\Users\nekit\Downloads\fast_rgf\bin\forest_train.exe "-config=inputs/config" "trn.x-file=inputs/housing.train" "trn.x-file_format=y.x" "trn.target=REAL" "model.save=outputs/model-rgf"
[New Thread 4548.0x1c98]
[New Thread 4548.0x26cc]
[New Thread 4548.0x23d0]
[New Thread 4548.0x1c04]

reading options from configuration file <inputs/config>

 using up to 8 openmp threads
 the number of threads is set to 8, which is the maximum number of logical hardware threads including hyperthreads
 the optimal number of threads is often the number of physical cores that may be smaller than 8
 for example, to achieve better performance, you may try to set the number of threads to 4

loading training data ...
  trn.target=REAL
  trn.x-file_format=y.x
  trn.x-file=inputs/housing.train
  trn.y-file=
  trn.w-file=
[New Thread 4548.0x270c]
[New Thread 4548.0xb0c]
[New Thread 4548.0x23b8]
[New Thread 4548.0xe30]
[New Thread 4548.0x2468]
[New Thread 4548.0x1b74]
[New Thread 4548.0x2740]
loading time: wall time=0 seconds; cpu time=0 seconds.
discretizing training data ...
  discretize.dense.min_bucket_weights=5.000000
  discretize.dense.max_buckets=250
  discretize.dense.lamL2=10
  discretize.sparse.min_bucket_weights=5.000000
  discretize.sparse.max_buckets=200
  discretize.sparse.max_features=80000
  discretize.sparse.min_occrrences=5
  discretize.sparse.missing_type=MIN
  discretize.sparse.lamL2=2.000000
discritizer training time: wall time=0.0010082 seconds; cpu time=0.001 seconds.


training decision forest ...
  dtree.loss=LS
  dtree.max_level=6
  dtree.max_nodes=50
  dtree.new_tree_gain_ratio=1.0
  dtree.min_sample=5
  dtree.lamL1=10
  dtree.lamL2=1000
  forest.opt=rgf
  forest.ntrees=1000
  forest.eval_frequency=50
  forest.save_frequency=0


  training data size= 25 with 0 dense features and 1 sparse feature groups


build tree     1/ 1000
Thread 11 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 4548.0x2740]
0x000000000042f5f6 in _decisionTreeTrainer::YW_struct::add (this=0xabac29e60, yp=-0, wp=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
40            y+=yp;
(gdb) backtrace
#0  0x000000000042f5f6 in _decisionTreeTrainer::YW_struct::add (this=0xabac29e60, yp=-0, wp=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
#1  0x0000000000426cd0 in _decisionTreeTrainer::TrainTarget::yw_LS_add (this=0x7e550, yw=..., w=1, res=0)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:118
#2  0x000000000042575a in _decisionTreeTrainer::TrainTarget::compute_yw (this=0x7e550, reverse_index=0x7c960, b=28,
    e=25, yw=0x7c8a0, num_yw=1) at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:268
#3  0x00000000005316e6 in _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR::map_range(int, int, int) (this=0xd3ef70, tid=7, b=28, e=25)
    at D:\Users\nekit\Downloads\fast_rgf\src\forest\dtree_trainer.cpp:550
#4  0x0000000000445649 in rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool) (this=0xd3efb0, mr=...,
    begin=0, end=25, tid=7, nthreads=8, run_range=true) at D:/Users/nekit/Downloads/fast_rgf/include/utils.h:211
#5  0x0000000000404d28 in rgf::MapReduceRunner::run_threads<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, bool) (this=0x100525f3c, mr=..., begin=4635445,
    end=62914048, run_range=false) at D:/Users/nekit/Downloads/fast_rgf/include/utils.h:256
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

UPD5:

Operating System: Windows Server 2016 64-bit
Compiler: MinGW-w64 (x86_64-posix-seh-rev0) g++ 6.4.0

Output:

C:\Users\ntitov\Downloads\fast_rgf\examples\ex2>gdb --args "../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x" trn.target=REAL model.save="outputs/model-rgf"
GNU gdb (GDB) 8.0
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../bin/forest_train...done.
(gdb) run
Starting program: C:\Users\ntitov\Downloads\fast_rgf\bin\forest_train.exe "-config=inputs/config" "trn.x-file=inputs/housing.train" "trn.x-file_format=y.x" "trn.target=REAL" "model.save=outputs/model-rgf"
[New Thread 8556.0x1748]
[New Thread 8556.0xec0]
[New Thread 8556.0x1d28]
[New Thread 8556.0x8e4]

reading options from configuration file <inputs/config>

 using up to 12 openmp threads
 the number of threads is set to 12, which is the maximum number of logical hardware threads including hyperthreads
 the optimal number of threads is often the number of physical cores that may be smaller than 12
 for example, to achieve better performance, you may try to set the number of threads to 6

loading training data ...
  trn.target=REAL
  trn.x-file_format=y.x
  trn.x-file=inputs/housing.train
  trn.y-file=
  trn.w-file=
[New Thread 8556.0x5e8]
[New Thread 8556.0x1c2c]
[New Thread 8556.0x19f0]
[New Thread 8556.0x10e0]
[New Thread 8556.0x1740]
[New Thread 8556.0x1120]
[New Thread 8556.0x1820]
[New Thread 8556.0x908]
[New Thread 8556.0xb04]
[New Thread 8556.0x21b4]
[New Thread 8556.0x95c]
loading time: wall time=0.0160004 seconds; cpu time=0.016 seconds.
discretizing training data ...
  discretize.dense.min_bucket_weights=5.000000
  discretize.dense.max_buckets=250
  discretize.dense.lamL2=10
  discretize.sparse.min_bucket_weights=5.000000
  discretize.sparse.max_buckets=200
  discretize.sparse.max_features=80000
  discretize.sparse.min_occrrences=5
  discretize.sparse.missing_type=MIN
  discretize.sparse.lamL2=2.000000
discritizer training time: wall time=0.0029965 seconds; cpu time=0.002 seconds.


training decision forest ...
  dtree.loss=LS
  dtree.max_level=6
  dtree.max_nodes=50
  dtree.new_tree_gain_ratio=1.0
  dtree.min_sample=5
  dtree.lamL1=10
  dtree.lamL2=1000
  forest.opt=rgf
  forest.ntrees=1000
  forest.eval_frequency=50
  forest.save_frequency=0


  training data size= 25 with 0 dense features and 1 sparse feature groups


build tree     1/ 1000
Thread 15 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 8556.0x95c]
0x0000000000430656 in _decisionTreeTrainer::YW_struct::add (this=0xfeef2dc70, yp=-0, wp=1)
    at C:/Users/ntitov/Downloads/fast_rgf/src/forest/training_target.h:40
40            y+=yp;
(gdb) backtrace
#0  0x0000000000430656 in _decisionTreeTrainer::YW_struct::add (this=0xfeef2dc70, yp=-0, wp=1)
    at C:/Users/ntitov/Downloads/fast_rgf/src/forest/training_target.h:40
#1  0x0000000000427b60 in _decisionTreeTrainer::TrainTarget::yw_LS_add (this=0xd14700, yw=..., w=1, res=0)
    at C:/Users/ntitov/Downloads/fast_rgf/src/forest/training_target.h:118
#2  0x000000000042656a in _decisionTreeTrainer::TrainTarget::compute_yw (this=0xd14700, reverse_index=0xd32fa0, b=33,
    e=25, yw=0x26320, num_yw=1) at C:/Users/ntitov/Downloads/fast_rgf/src/forest/training_target.h:268
#3  0x0000000000532f78 in _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR::map_range(int, int, int) (this=0xcfef70, tid=11, b=33, e=25)
    at C:\Users\ntitov\Downloads\fast_rgf\src\forest\dtree_trainer.cpp:550
#4  0x000000000044675d in rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool) (this=0xcfefb0, mr=...,
    begin=0, end=25, tid=11, nthreads=12, run_range=true) at C:/Users/ntitov/Downloads/fast_rgf/include/utils.h:211
#5  0x0000000000404d98 in rgf::MapReduceRunner::run_threads<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR> () at C:/Users/ntitov/Downloads/fast_rgf/include/utils.h:256
#6  0x000000006360d548 in omp_in_final ()
   from C:\Program Files\mingw-w64\x86_64-6.4.0-posix-seh-rt_v5-rev0\mingw64\bin\libgomp-1.dll
#7  0x0000000064944af4 in pthread_create_wrapper ()
   from C:\Program Files\mingw-w64\x86_64-6.4.0-posix-seh-rt_v5-rev0\mingw64\opt\bin\libwinpthread-1.dll
#8  0x00007ffb987db2ba in msvcrt!_beginthreadex () from C:\Windows\System32\msvcrt.dll
#9  0x00007ffb987db38c in msvcrt!_endthreadex () from C:\Windows\System32\msvcrt.dll
#10 0x00007ffb985d8364 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll
#11 0x00007ffb9aede821 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#12 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

@StrikerRUS
Member

@TongZhang-ML Please take a look at logs.

@StrikerRUS
Member

StrikerRUS commented Jun 19, 2018

I tried to reproduce the bug on a continuous integration service, but still with no luck.

Tests (refer to #159) have been run on Ubuntu 14.04 and macOS 10.12 with g++ versions from 5 to 8. No errors occurred.

UPD:
AppVeyor CI with Windows Server 2012 and g++ 6.3.0 doesn't report any problems either.

Judging by the logs, it seems to me that the bug is hidden somewhere around the multithreading code here (a race condition?), and that's why it's so hard to reproduce:
https://github.com/RGF-team/rgf_python/blob/5202dcec94d30a575ea45ec9ec6d5d47d0335b06/include/fast_rgf/include/utils.h#L254-L257
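One detail in the crash logs may help narrow this down: `single_thread_map_reduce` is called with `begin=0, end=25`, yet `map_range` and `compute_yw` receive an inverted range (`b=28, e=25` for tid 7 of 8 threads; `b=33, e=25` for tid 11 of 12). The sketch below assumes, without having checked utils.h, that the range is split with ceiling division (`chunk = ceil(n / nthreads)`). Under that assumption the last thread's start overshoots the end exactly as logged, and clamping `b` to `end` would give that thread an empty range instead. `split_range` and `split_range_clamped` are hypothetical helpers, not FastRGF code.

```python
def split_range(begin, end, tid, nthreads):
    """Hypothetical ceil-division split of [begin, end) for thread `tid`.

    This is an assumption about how FastRGF partitions work, chosen because it
    reproduces the (b, e) values seen in the backtraces.
    """
    chunk = (end - begin + nthreads - 1) // nthreads  # integer ceil(n / nthreads)
    b = begin + tid * chunk
    e = min(b + chunk, end)
    return b, e


def split_range_clamped(begin, end, tid, nthreads):
    """Same split, but trailing threads get an empty range instead of b > e."""
    b, e = split_range(begin, end, tid, nthreads)
    return min(b, end), e


for nthreads in (8, 12):
    tid = nthreads - 1  # the last thread, as in the backtraces
    print(nthreads, split_range(0, 25, tid, nthreads),
          split_range_clamped(0, 25, tid, nthreads))
# 8 (28, 25) (25, 25)
# 12 (33, 25) (25, 25)
```

If this assumption holds, whether the range is inverted depends on the sample count and the number of threads rather than on timing, which could also explain why it shows up only on some machines.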

@StrikerRUS
Member

Tried without OpenMP as well; the crash still occurs.

Set this option to OFF:
https://github.com/RGF-team/rgf_python/blob/3e7858c5b600d699b8d76154f93bc8b409a13370/include/fast_rgf/CMakeLists.txt#L8

D:\Users\nekit\Downloads\fast_rgf\examples\ex2>gdb --args "../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x" trn.target=REAL model.save="outputs/model-rgf"
GNU gdb (GDB) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../bin/forest_train...done.
(gdb) run
Starting program: D:\Users\nekit\Downloads\fast_rgf\bin\forest_train.exe "-config=inputs/config" "trn.x-file=inputs/housing.train" "trn.x-file_format=y.x" "trn.target=REAL" "model.save=outputs/model-rgf"
[New Thread 4352.0x28bc]
[New Thread 4352.0xc90]
[New Thread 4352.0x26a0]
[New Thread 4352.0x2540]

reading options from configuration file <inputs/config>

 using up to 8 threads
 the number of threads is set to 8, which is the maximum number of logical hardware threads including hyperthreads
 the optimal number of threads is often the number of physical cores that may be smaller than 8
 for example, to achieve better performance, you may try to set the number of threads to 4

loading training data ...
  trn.target=REAL
  trn.x-file_format=y.x
  trn.x-file=inputs/housing.train
  trn.y-file=
  trn.w-file=
[New Thread 4352.0x1644]
[New Thread 4352.0x13dc]
[Thread 4352.0x1644 exited with code 0]
[Thread 4352.0x13dc exited with code 0]
[New Thread 4352.0x29c4]
[New Thread 4352.0x1e0c]
[Thread 4352.0x29c4 exited with code 0]
[Thread 4352.0x1e0c exited with code 0]
[New Thread 4352.0x2b9c]
[Thread 4352.0x2b9c exited with code 0]
[New Thread 4352.0xdc4]
[Thread 4352.0xdc4 exited with code 0]
[New Thread 4352.0x2654]
[Thread 4352.0x2654 exited with code 0]
[New Thread 4352.0x1c54]
[Thread 4352.0x1c54 exited with code 0]
loading time: wall time=0.0156195 seconds; cpu time=0.016 seconds.
discretizing training data ...
  discretize.dense.min_bucket_weights=5.000000
  discretize.dense.max_buckets=250
  discretize.dense.lamL2=10
  discretize.sparse.min_bucket_weights=5.000000
  discretize.sparse.max_buckets=200
  discretize.sparse.max_features=80000
  discretize.sparse.min_occrrences=5
  discretize.sparse.missing_type=MIN
  discretize.sparse.lamL2=2.000000
[New Thread 4352.0xe00]
[Thread 4352.0xe00 exited with code 0]
[New Thread 4352.0x1898]
[Thread 4352.0x1898 exited with code 0]
[New Thread 4352.0x10f8]
[New Thread 4352.0x1f28]
[Thread 4352.0x10f8 exited with code 0]
[New Thread 4352.0x1f80]
[Thread 4352.0x1f28 exited with code 0]
[New Thread 4352.0xa7c]
[New Thread 4352.0x17a8]
[Thread 4352.0xa7c exited with code 0]
[Thread 4352.0x1f80 exited with code 0]
[New Thread 4352.0x2afc]
[Thread 4352.0x17a8 exited with code 0]
[Thread 4352.0x2afc exited with code 0]
[New Thread 4352.0x247c]
[New Thread 4352.0x1434]
[New Thread 4352.0x2148]
[Thread 4352.0x247c exited with code 0]
[Thread 4352.0x1434 exited with code 0]
[Thread 4352.0x2148 exited with code 0]
[New Thread 4352.0x2bc]
[New Thread 4352.0x1404]
[New Thread 4352.0x2514]
[Thread 4352.0x2bc exited with code 0]
[Thread 4352.0x1404 exited with code 0]
[Thread 4352.0x2514 exited with code 0]
[New Thread 4352.0x1edc]
[New Thread 4352.0x2470]
[Thread 4352.0x1edc exited with code 0]
[Thread 4352.0x2470 exited with code 0]
discritizer training time: wall time=0.0312523 seconds; cpu time=0.032 seconds.


training decision forest ...
  dtree.loss=LS
  dtree.max_level=6
  dtree.max_nodes=50
  dtree.new_tree_gain_ratio=1.0
  dtree.min_sample=5
  dtree.lamL1=10
  dtree.lamL2=1000
  forest.opt=rgf
  forest.ntrees=1000
  forest.eval_frequency=50
  forest.save_frequency=0


  training data size= 25 with 0 dense features and 1 sparse feature groups


build tree     1/ 1000[New Thread 4352.0x12d4]
[Thread 4352.0x12d4 exited with code 0]
[New Thread 4352.0x2bc0]
[Thread 4352.0x2bc0 exited with code 0]
[New Thread 4352.0x444]
[Thread 4352.0x444 exited with code 0]
[New Thread 4352.0x18dc]
[Thread 4352.0x18dc exited with code 0]
[New Thread 4352.0x29a8]
[Thread 4352.0x29a8 exited with code 0]
[New Thread 4352.0x2708]
[Thread 4352.0x2708 exited with code 0]
[New Thread 4352.0x374]
[Thread 4352.0x374 exited with code 0]
[New Thread 4352.0x2a74]
[Thread 4352.0x2a74 exited with code 0]
[New Thread 4352.0x14b4]
[Thread 4352.0x14b4 exited with code 0]
[New Thread 4352.0x1088]
[Thread 4352.0x1088 exited with code 0]
[New Thread 4352.0x63c]
[Thread 4352.0x63c exited with code 0]
[New Thread 4352.0x28c4]
[Thread 4352.0x28c4 exited with code 0]
[New Thread 4352.0x99c]
[Thread 4352.0x99c exited with code 0]
[New Thread 4352.0x1940]
[Thread 4352.0x1940 exited with code 0]
[New Thread 4352.0x1270]
[Thread 4352.0x1270 exited with code 0]
[New Thread 4352.0x1ef0]
[Thread 4352.0x1ef0 exited with code 0]
[New Thread 4352.0x2658]
[Thread 4352.0x2658 exited with code 0]
[New Thread 4352.0x1b3c]
[Thread 4352.0x1b3c exited with code 0]
[New Thread 4352.0x28b0]
[Thread 4352.0x28b0 exited with code 0]
[New Thread 4352.0x40c]
[Thread 4352.0x40c exited with code 0]
[New Thread 4352.0x6d0]
[Thread 4352.0x6d0 exited with code 0]
[New Thread 4352.0x2464]
[Thread 4352.0x2464 exited with code 0]
[New Thread 4352.0x2330]
[Thread 4352.0x2330 exited with code 0]
[New Thread 4352.0x260]

Thread 52 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 4352.0x260]
0x000000000042d086 in _decisionTreeTrainer::YW_struct::add (this=0xabc21d750, yp=-0, wp=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
40            y+=yp;
(gdb) backtrace
#0  0x000000000042d086 in _decisionTreeTrainer::YW_struct::add (this=0xabc21d750, yp=-0, wp=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:40
#1  0x0000000000424760 in _decisionTreeTrainer::TrainTarget::yw_LS_add (this=0x1757d70, yw=..., w=1, res=0)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:118
#2  0x000000000042311a in _decisionTreeTrainer::TrainTarget::compute_yw (this=0x1757d70, reverse_index=0x1757b30, b=28, e=25, yw=0x17588b0, num_yw=1)
    at D:/Users/nekit/Downloads/fast_rgf/src/forest/training_target.h:268
#3  0x0000000000550c56 in _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR::map_range(int, int, int) (this=0xf6ef70, tid=7, b=28,
    e=25) at D:\Users\nekit\Downloads\fast_rgf\src\forest\dtree_trainer.cpp:550
#4  0x0000000000444249 in rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool) (this=0xf6efb0, mr=..., begin=0, end=25, tid=7, nthreads=8,
    run_range=true) at D:/Users/nekit/Downloads/fast_rgf/include/utils.h:211
#5  0x00000000005317e7 in std::__invoke_impl<void, void (rgf::MapReduceRunner::*)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>, int, int, int, int, bool>(std::__invoke_memfun_deref, void (rgf::MapReduceRunner::*&&)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*&&, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>&&, int&&, int&&, int&&, int&&, bool&&) (__f=
    @0x1757f70: (void (rgf::MapReduceRunner::*)(rgf::MapReduceRunner * const, _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::Tree_YW_MR &, int, int, int, int, bool)) 0x4441c0 <rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool)>, __t=@0x1757f68: 0xf6efb0, __args#0=...,
    __args#1=@0x1757f58: 0, __args#2=@0x1757f54: 25, __args#3=@0x1757f50: 7, __args#4=@0x1757f4c: 8, __args#5=@0x1757f48: true)
    at D:/Program Files/mingw-w64/x86_64-8.1.0-posix-seh-rt_v6-rev0/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/invoke.h:73
#6  0x0000000000549005 in std::__invoke<void (rgf::MapReduceRunner::*)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>, int, int, int, int, bool>(void (rgf::MapReduceRunner::*&&)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*&&, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>&&, int&&, int&&, int&&, int&&, bool&&) (__fn=
    @0x1757f70: (void (rgf::MapReduceRunner::*)(rgf::MapReduceRunner * const, _decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::Tree_YW_MR &, int, int, int, int, bool)) 0x4441c0 <rgf::MapReduceRunner::single_thread_map_reduce<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR>(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR&, int, int, int, int, bool)>, __args#0=@0x1757f68: 0xf6efb0,
    __args#1=..., __args#2=@0x1757f58: 0, __args#3=@0x1757f54: 25, __args#4=@0x1757f50: 7, __args#5=@0x1757f4c: 8, __args#6=@0x1757f48: true)
    at D:/Program Files/mingw-w64/x86_64-8.1.0-posix-seh-rt_v6-rev0/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/bits/invoke.h:95
#7  0x00000000004fb989 in std::thread::_Invoker<std::tuple<void (rgf::MapReduceRunner::*)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>, int, int, int, int, bool> >::_M_invoke<0ull, 1ull, 2ull, 3ull, 4ull, 5ull, 6ull, 7ull>(std::_Index_tuple<0ull, 1ull, 2ull, 3ull, 4ull, 5ull, 6ull, 7ull>) (this=0x1757f48)
    at D:/Program Files/mingw-w64/x86_64-8.1.0-posix-seh-rt_v6-rev0/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/thread:234
#8  0x00000000004fb9e7 in std::thread::_Invoker<std::tuple<void (rgf::MapReduceRunner::*)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>, int, int, int, int, bool> >::operator()() (this=0x1757f48)
    at D:/Program Files/mingw-w64/x86_64-8.1.0-posix-seh-rt_v6-rev0/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/thread:243
#9  0x00000000004f68dc in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (rgf::MapReduceRunner::*)(_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)&), rgf::MapReduceRunner*, std::reference_wrapper<_decisionTreeTrainer::TreeToIndex<unsigned short, int, unsigned char>::update_predictions(_decisionTreeTrainer::TrainTarget&, rgf::DecisionTree<unsigned short, int, unsigned char>::TrainParam&, rgf::Timer&, rgf::Timer&)::Tree_YW_MR(int, int, int, int, bool)>, int, int, int, int, bool> > >::_M_run() (this=0x1757f40)
    at D:/Program Files/mingw-w64/x86_64-8.1.0-posix-seh-rt_v6-rev0/mingw64/lib/gcc/x86_64-w64-mingw32/8.1.0/include/c++/thread:186
#10 0x000000000055329f in execute_native_thread_routine ()
#11 0x0000000064944a94 in pthread_create_wrapper () from D:\Program Files\mingw-w64\x86_64-8.1.0-posix-seh-rt_v6-rev0\mingw64\opt\bin\libwinpthread-1.dll
#12 0x00007ff9228cb2ba in msvcrt!_beginthreadex () from C:\Windows\System32\msvcrt.dll
#13 0x00007ff9228cb38c in msvcrt!_endthreadex () from C:\Windows\System32\msvcrt.dll
#14 0x00007ff922938364 in KERNEL32!BaseThreadInitThunk () from C:\Windows\System32\kernel32.dll
#15 0x00007ff924707091 in ntdll!RtlUserThreadStart () from C:\Windows\SYSTEM32\ntdll.dll
#16 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

@fukatani
Member Author

yw_LS_add(my_yw[j],my_w?my_w[i]:1.0,my_r[i]);

Could any of the variables here be a null pointer?

For example, could you add some code to dump the value, as follows?

printf("%f\n", my_yw[j]);  // dump my_yw[j]
yw_LS_add(my_yw[j],my_w?my_w[i]:1.0,my_r[i]);

I expect it to crash at the printf call.

@fukatani
Member Author

fukatani commented Jun 20, 2018

Or

assert(j < num_yw);
printf("%f\n", my_yw[j]);  // dump my_yw[j]
yw_LS_add(my_yw[j],my_w?my_w[i]:1.0,my_r[i]);

If j >= num_yw, any code that reads my_yw[j] reads past the end of the array, which is undefined behavior and can crash.

@StrikerRUS
Member

@fukatani It seems you're right!

      if (loss==TrainLoss::LS) {
	for (i=0; i<size; i++) {
	  j= my_i[i];
	  assert(j < num_yw);
	  printf("%f\n", my_yw[j]);  // dump my_yw[j]
	  yw_LS_add(my_yw[j],my_w?my_w[i]:1.0,my_r[i]);
	}
      }

results in

D:\Users\nekit\Downloads\fast_rgf\examples\ex2>"../../bin/forest_train" -config="inputs/config" trn.x-file="inputs/housing.train" trn.x-file_format="y.x" trn.target=REAL model.save="outputs/model-rgf"

reading options from configuration file <inputs/config>

 using up to 8 openmp threads
 the number of threads is set to 8, which is the maximum number of logical hardware threads including hyperthreads
 the optimal number of threads is often the number of physical cores that may be smaller than 8
 for example, to achieve better performance, you may try to set the number of threads to 4

loading training data ...
  trn.target=REAL
  trn.x-file_format=y.x
  trn.x-file=inputs/housing.train
  trn.y-file=
  trn.w-file=
loading time: wall time=0.0032088 seconds; cpu time=0.003 seconds.
discretizing training data ...
  discretize.dense.min_bucket_weights=5.000000
  discretize.dense.max_buckets=250
  discretize.dense.lamL2=10
  discretize.sparse.min_bucket_weights=5.000000
  discretize.sparse.max_buckets=200
  discretize.sparse.max_features=80000
  discretize.sparse.min_occrrences=5
  discretize.sparse.missing_type=MIN
  discretize.sparse.lamL2=2.000000
discritizer training time: wall time=0.0011295 seconds; cpu time=0.001 seconds.


training decision forest ...
  dtree.loss=LS
  dtree.max_level=6
  dtree.max_nodes=50
  dtree.new_tree_gain_ratio=1.0
  dtree.min_sample=5
  dtree.lamL1=10
  dtree.lamL2=1000
  forest.opt=rgf
  forest.ntrees=1000
  forest.eval_frequency=50
  forest.save_frequency=0


  training data size= 25 with 0 dense features and 1 sparse feature groups


build tree     1/ 10000.0000Assertion failed!

Program: D:\Users\nekit\Downloads\fast_rgf\bin\forest_train.exe
File: D:\Users\nekit\Downloads\fast_rgf\src\forest\training_target.h, Line 268

Expression: j < num_yw

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
00
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000

@fukatani
Member Author

Does the assert pass on other operating systems?
If so, which variable differs on macOS?

@StrikerRUS
Member

Yes, on macOS all asserts passed.

Caution! Huge log.

macOS log.txt

@TongZhang-ML
Member

TongZhang-ML commented Jun 22, 2018 via email

@StrikerRUS
Member

StrikerRUS commented Jun 26, 2018

Today I made many attempts to reproduce the error on Azure. I tried every instance type available with a free trial subscription (B, E, D, DS, F and so on), unfortunately without success. The only useful thing I learned is that the free trial limits the number of virtual cores to 4, whereas the failures occurred on machines with 8 vcores (Windows 10 64-bit, from my comment #92 (comment)) and 12 vcores (Windows Server 2016 64-bit, from the same comment).

@TongZhang-ML So it seems that, to reproduce the error, you should try an Amazon instance with more than 4 vcores.

@fukatani
Member Author

It may be an index error rather than a race condition, since the index used depends on the number of threads.

@TongZhang-ML
Could you give me an overview of compute_yw, for example:

  • an explanation of each argument (what is b? what is reverse_index, and what are its expected value and size?)
  • an explanation of the local variables (my_w, my_r, my_i, my_L)

@fukatani
Member Author

If it is an index error, I think we have to limit the number of threads depending on the size of the data.
