NCCL & LevelDB #6204

Open
farhan333 opened this issue Jan 29, 2018 · 2 comments

Comments

@farhan333

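Is this a bug?
Training with NCCL on two GPUs (a 1080 and a 1060) with a LevelDB data layer aborts as soon as optimization starts.

This does not happen when training on a single GPU.

My naive reading is that the LevelDB database is being opened twice: the main thread (25110) finishes network initialization, and then a worker thread (25119) fails to take the lock on the same database.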

I0129 13:13:42.833976 25110 net.cpp:213] pool1 needs backward computation.
I0129 13:13:42.833979 25110 net.cpp:213] relu1 needs backward computation.
I0129 13:13:42.833981 25110 net.cpp:213] conv1 needs backward computation.
I0129 13:13:42.833984 25110 net.cpp:215] data does not need backward computation.
I0129 13:13:42.833986 25110 net.cpp:257] This network produces output loss
I0129 13:13:42.833997 25110 net.cpp:270] Network initialization done.
I0129 13:13:42.834034 25110 solver.cpp:56] Solver scaffolding done.
I0129 13:13:42.834445 25110 caffe.cpp:248] Starting Optimization
F0129 13:13:43.096998 25119 db_leveldb.cpp:16] Check failed: status.ok() Failed to open leveldb /home/farhan/intl910-200a/Training_1
IO error: lock /home/farhan/intl910-200a/Training_1/LOCK: already held by process
*** Check failure stack trace: ***
@ 0x7f4419d835cd google::LogMessage::Fail()
@ 0x7f4419d85433 google::LogMessage::SendToLog()
@ 0x7f4419d8315b google::LogMessage::Flush()
@ 0x7f4419d85e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f441a46fd8b caffe::db::LevelDB::Open()
@ 0x7f441a3e27ff caffe::DataLayer<>::DataLayer()
@ 0x7f441a3e29c2 caffe::Creator_DataLayer<>()
@ 0x7f441a48d3e0 caffe::Net<>::Init()
@ 0x7f441a4903fe caffe::Net<>::Net()
@ 0x7f441a2d9405 caffe::Solver<>::InitTrainNet()
@ 0x7f441a2da875 caffe::Solver<>::Init()
@ 0x7f441a2dab8f caffe::Solver<>::Solver()
@ 0x7f441a49c941 caffe::Creator_SGDSolver<>()
@ 0x416e0c caffe::SolverRegistry<>::CreateSolver()
@ 0x7f441a4c5ecb caffe::Worker<>::InternalThreadEntry()
@ 0x7f441a4afba5 caffe::InternalThread::entry()
@ 0x7f441a4b0ace boost::detail::thread_data<>::run()
@ 0x7f4418a545d5 (unknown)
@ 0x7f441882d6ba start_thread
@ 0x7f4418d703dd clone
@ (nil) (unknown)
Aborted (core dumped)
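
The trace above shows `caffe::db::LevelDB::Open()` being reached from `caffe::Worker<>::InternalThreadEntry()`, which builds its own solver and net after the main thread has already initialized one. The following is a minimal sketch of why a second open inside the same process would be refused. It is a simplified model, not Caffe or LevelDB code: `LockTable`, `open_db`, and `simulate_workers` are hypothetical names, and it assumes the database keeps a per-process record of the LOCK files it already holds.

```python
import os


class LockTable:
    """Hypothetical stand-in for a per-process table of held lock files."""

    def __init__(self):
        self._held = set()

    def acquire(self, path):
        # A second acquire of the same path inside one process is refused,
        # no matter which thread asks for it.
        if path in self._held:
            raise IOError(f"lock {path}: already held by process")
        self._held.add(path)

    def release(self, path):
        self._held.discard(path)


def open_db(table, db_dir):
    """Pretend to open a database by taking its LOCK file."""
    lock_path = os.path.join(db_dir, "LOCK")
    table.acquire(lock_path)
    return lock_path


def simulate_workers(db_dir, n_workers):
    """Each worker builds its own net, so each one tries to open the database."""
    table = LockTable()  # shared: all workers live in the same process
    results = []
    for _ in range(n_workers):
        try:
            open_db(table, db_dir)
            results.append("ok")
        except IOError as err:
            results.append(str(err))
    return results


if __name__ == "__main__":
    for line in simulate_workers("Training_1", 2):
        print(line)
```

With one worker the open succeeds, which matches the single-GPU behaviour reported here. With two workers the second open fails with the same kind of message as in the log.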

@lmy418lmy

I have run into the same problem. How did you solve it?

@ShawKai666

Re: "I have encountered the same problem and I would like to ask you how to solve it?"
Convert the data to LMDB.
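
One possible way to do the conversion without regenerating the dataset is to copy the records across byte for byte. This is a sketch under assumptions, not a tested Caffe tool: it assumes the third-party `plyvel` and `lmdb` Python packages are installed, that the stored keys and values can be reused unchanged by the LMDB backend, and that nothing else has the LevelDB open while it runs. The function names, the destination path, and the `map_size` default are hypothetical.

```python
def copy_records(records, put):
    """Copy (key, value) byte pairs into a destination via put(); return the count."""
    count = 0
    for key, value in records:
        put(key, value)
        count += 1
    return count


def leveldb_to_lmdb(src_dir, dst_dir, map_size=1 << 40):
    """Copy every record from a LevelDB directory into a new LMDB directory."""
    # Third-party packages, imported lazily so copy_records stays usable alone.
    import lmdb
    import plyvel

    src = plyvel.DB(src_dir, create_if_missing=False)
    env = lmdb.open(dst_dir, map_size=map_size)
    try:
        # One write transaction for the whole copy; it commits on clean exit.
        with env.begin(write=True) as txn:
            with src.iterator() as it:
                copied = copy_records(it, txn.put)
    finally:
        env.close()
        src.close()
    return copied


if __name__ == "__main__":
    # Hypothetical destination path; the source is the path from the log above.
    n = leveldb_to_lmdb("/home/farhan/intl910-200a/Training_1",
                        "/home/farhan/intl910-200a/Training_1_lmdb")
    print(f"copied {n} records")
```

After converting, the data layer's `data_param` would need its `source` pointed at the new directory and its `backend` set to `LMDB`. If the original images are still available, rerunning Caffe's `convert_imageset` tool with the LMDB backend is the more conventional route.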
