caffe.NCCL.new_uid() - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 17: invalid start byte #5347

Open
GaryKT opened this issue Mar 2, 2017 · 21 comments

Comments

@GaryKT

GaryKT commented Mar 2, 2017

How can we work around this error? If we set uid to a fixed number it doesn't work and with caffe.NCCL.new_uid() we get the following error:

uid = caffe.NCCL.new_uid()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 17: invalid start byte

Any help highly appreciated!

Steps to reproduce

Run train.py

uid = caffe.NCCL.new_uid()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 17: invalid start byte

Your system configuration

Operating system: Windows
Compiler: MS Visual Studio 2015
CUDA version (if applicable): 8
CUDNN version (if applicable): 5.1
BLAS: ?
Python

@willyd
Contributor

willyd commented Mar 2, 2017

Thanks for reporting. I haven't tried the training script yet; I will get back to you on this ASAP. What Python version are you using?

@GaryKT
Author

GaryKT commented Mar 2, 2017

Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)] on win32

Let me know if there is a workaround and/or if you are able to reproduce it. Thanks.

@willyd willyd added the Python 3 label Mar 3, 2017
@willyd
Contributor

willyd commented Mar 3, 2017

I can reproduce the issue. I think the problem is that Boost.Python maps std::string to a standard string (i.e. not unicode) in Python 2 and maps it to a unicode string in Python 3. The uid is raw binary data, so decoding it as UTF-8 fails.

Can you try to see if the issue exists as well with python 2.7?

I think it would not be too hard to write wrappers around the NCCL constructor and the new_uid method that return and accept bp::objects, and that manually convert between Python bytes objects and std::string.

See this boostorg/python#85 (comment) for some ideas on how to implement the wrappers.

@cypof Any comments? Has any @BVLC member tried to use multi-GPU training with python 3?
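
The decoding failure described above can be reproduced without Caffe. The sketch below is only an illustration: `raw_uid` is a made-up stand-in for the uid, assumed here to be 128 opaque bytes with 0xa9 at index 17, and `try_utf8` is a hypothetical helper, not part of pycaffe.

```python
# Hypothetical stand-in for the raw uid: 128 opaque bytes with 0xa9 at index 17.
raw_uid = b"A" * 17 + b"\xa9" + b"B" * 110


def try_utf8(data):
    """Return the decoded text, or the UnicodeDecodeError raised while decoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_utf8(raw_uid)
print(result)

# Handled as a bytes object, the same buffer needs no decoding at all.
print(len(raw_uid), type(raw_uid).__name__)
```

0xa9 is a UTF-8 continuation byte, so it cannot start a character. That is why arbitrary binary data only survives the trip into Python 3 when it is exposed as bytes rather than str.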

@GaryKT
Author

GaryKT commented Mar 3, 2017

How would the wrapper work? Can we do a wrapper in python?

@willyd
Contributor

willyd commented Mar 3, 2017

No. The wrapper needs to be C++.

Try to replace your CAFFE_ROOT/python/caffe/_caffe.cpp with

https://gist.github.com/willyd/0dbd1fabb06eeedc3289e656be03a022

Let me know if that works for you.

@GaryKT
Author

GaryKT commented Mar 3, 2017

@willyd
I did a quick compile; now there is a different error:
Traceback (most recent call last):
  File "D:\PROGRAMMING\caffe_philipp\philipp_caffenet\test\train.py", line 102, in <module>
    train(args.solver, args.snapshot, args.gpus, args.timing)
  File "D:\PROGRAMMING\caffe_philipp\philipp_caffenet\test\train.py", line 18, in train
    uid = (caffe.NCCL.new_uid())
AttributeError: type object 'NCCL' has no attribute 'new_uid'

Let me know if you are able to reproduce and/or create a different fix.
Thanks for your help. At least it seems like this is the right area to look at.

@willyd
Contributor

willyd commented Mar 3, 2017

You need to enable NCCL by setting USE_NCCL=ON. It should download a Windows-compatible (though maybe crippled) NCCL and build it automatically.

Unless you really want to train with multiple GPUs on Windows, you can use caffe.exe or a caffe.Solver from your own Python script.
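
A minimal sketch of the single-GPU route through pycaffe, which does not touch NCCL. The function name `train_single_gpu` and the solver path in the usage note are hypothetical; the calls themselves (`set_device`, `set_mode_gpu`, `get_solver`, `solve`) are regular pycaffe functions.

```python
def train_single_gpu(solver_path, gpu_id=0):
    """Train on one GPU with pycaffe; no NCCL involved."""
    # Imported lazily so this function can be defined without pycaffe installed.
    import caffe

    caffe.set_device(gpu_id)
    caffe.set_mode_gpu()
    # get_solver picks the solver type declared in the solver prototxt.
    solver = caffe.get_solver(solver_path)
    # Runs until max_iter from the solver prototxt is reached.
    solver.solve()
    return solver
```

It would be called as `train_single_gpu("solver.prototxt")`, with the path pointing at your own solver definition.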

@GaryKT
Author

GaryKT commented Mar 3, 2017

@willyd
Where? USE_NCCL=ON in the Visual Studio solution (.sln)? Where exactly?
I am just trying to get the sample code to work at this stage :-)

@GaryKT
Author

GaryKT commented Mar 3, 2017

@willyd
Added USE_NCCL=ON as a preprocessor definition in Visual Studio. Now there's a compile error.
Where do we get those NCCL files? Is there a download link?

Severity Code Description Project File Line Suppression State
Error C1083 Cannot open include file: 'nccl.h': No such file or directory pycaffe D:\PROGRAMMING\caffe\include\caffe\util\nccl.hpp 5

@GaryKT
Author

GaryKT commented Mar 3, 2017

@willyd
Got an NCCL library from Nvidia.
Added nccl\src to the include path for the solution so that nccl.h is found.

Now there are 3 linker errors. Any idea? One of the errors says NCCL::new_uid is unresolved.
Do we need to manually build a library first, or should Caffe compile the NCCL libs? The NCCL distribution from Nvidia has no Windows/VS makefile or Windows build instructions...

Severity Code Description Project File Line Suppression State
Error LNK1120 3 unresolved externals pycaffe D:\PROGRAMMING\caffe\build\lib\Release_caffe.pyd 1
Error LNK2019 unresolved external symbol "public: static class std::basic_string<char,struct std::char_traits,class std::allocator > __cdecl caffe::NCCL::new_uid(void)" (?new_uid@?$NCCL@M@caffe@@sa?AV?$basic_string@DU?$char_traits@D@std@@v?$allocator@D@2@@std@@xz) referenced in function "class boost::python::api::object __cdecl caffe::NCCL_New_Uid(void)" (?NCCL_New_Uid@caffe@@ya?AVobject@api@python@boost@@xz) pycaffe D:\PROGRAMMING\caffe\build\python_caffe.obj 1
Error LNK2019 unresolved external symbol "public: void __cdecl caffe::NCCL::Broadcast(void)" (?Broadcast@?$NCCL@M@caffe@@QEAAXXZ) referenced in function "void __cdecl caffe::init_module__caffe(void)" (?init_module__caffe@caffe@@yaxxz) pycaffe D:\PROGRAMMING\caffe\build\python_caffe.obj 1
Error LNK2019 unresolved external symbol "public: __cdecl caffe::NCCL::NCCL(class boost::shared_ptr<class caffe::Solver >,class std::basic_string<char,struct std::char_traits,class std::allocator > const &)" (??0?$NCCL@M@caffe@@qeaa@V?$shared_ptr@V?$Solver@M@caffe@@@boost@@aebv?$basic_string@DU?$char_traits@D@std@@v?$allocator@D@2@@std@@@z) referenced in function "class boost::shared_ptr<class caffe::NCCL > __cdecl caffe::NCCL_Init(class boost::shared_ptr<class caffe::Solver >,class boost::python::api::object)" (?NCCL_Init@caffe@@ya?AV?$shared_ptr@V?$NCCL@M@caffe@@@boost@@v?$shared_ptr@V?$Solver@M@caffe@@@3@Vobject@api@python@3@@z) pycaffe D:\PROGRAMMING\caffe\build\python_caffe.obj 1

@willyd
Contributor

willyd commented Mar 3, 2017

USE_NCCL is an option of the cmake build:

https://github.com/BVLC/caffe/blob/windows/scripts/build_win.cmd#L79

@GaryKT
Author

GaryKT commented Mar 3, 2017

@willyd
Do we need to compile this NCCL library beforehand, or is that taken care of by the main build files? Thanks.

@willyd
Contributor

willyd commented Mar 3, 2017

Just set the option to USE_NCCL=1 and CMake will take care of the rest. Have a look at: https://github.com/BVLC/caffe/blob/windows/cmake/External/nccl.cmake

@GaryKT
Author

GaryKT commented Mar 3, 2017

Like this? Getting strange errors. Not sure where they are coming from now...

D:\PROGRAMMING\caffe>scripts\build_win.cmd
The system cannot find the drive specified.
INFO: ============================================================
INFO: Summary:
INFO: ============================================================
INFO: MSVC_VERSION = 14
INFO: WITH_NINJA = 0
INFO: CMAKE_GENERATOR = "Visual Studio 14 2015 Win64"
INFO: CPU_ONLY = 0
INFO: CMAKE_CONFIG = Release
INFO: USE_NCCL = 1
INFO: CMAKE_BUILD_SHARED_LIBS = 0
INFO: PYTHON_VERSION = 2
INFO: BUILD_PYTHON = 1
INFO: BUILD_PYTHON_LAYER = 1
INFO: BUILD_MATLAB = 1
INFO: PYTHON_EXE = "python"
INFO: RUN_TESTS = 0
INFO: RUN_LINT = 0
INFO: RUN_INSTALL = 0
INFO: ============================================================
The input line is too long.
The syntax of the command is incorrect.

@GaryKT
Author

GaryKT commented Mar 3, 2017

Turns out I had to start a fresh terminal.
The scripts keep appending to the PATH, which makes it too long after many tries.
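
A quick way to check for this is to measure PATH and count repeated entries. This is a sketch only: `path_report` is a hypothetical helper, entries are compared case-insensitively on the assumption of a Windows shell, and 8191 is the documented cmd.exe command-line length limit, used here just as a reference point.

```python
import os

CMD_LINE_LIMIT = 8191  # documented cmd.exe command-line limit, for reference


def path_report(path_value, sep=os.pathsep):
    """Summarize a PATH-style string: length, entry count, duplicates."""
    entries = [entry for entry in path_value.split(sep) if entry]
    seen = set()
    unique = []
    for entry in entries:
        # Windows paths are case-insensitive; ignore trailing slashes too.
        key = entry.lower().rstrip("\\/")
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return {
        "length": len(path_value),
        "entries": len(entries),
        "duplicates": len(entries) - len(unique),
        "deduplicated": sep.join(unique),
    }


report = path_report(os.environ.get("PATH", ""))
print("PATH length:", report["length"], "of", CMD_LINE_LIMIT)
print("duplicate entries:", report["duplicates"])
```

A large duplicate count after several runs of the build script in the same terminal would be consistent with the behaviour described above.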

@GaryKT
Author

GaryKT commented Mar 3, 2017

There is an error about NCCL in the output of build_win.cmd now:
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARIES)

Is there something else I need to specify in build_win.cmd?

I simply dumped the NCCL zip file from Nvidia in
D:\PROGRAMMING\caffe\python\caffe_old\nccl
It has the folders debian, fortran, src and test, plus license.txt, README.md and a Makefile. There are no libs.

@willyd
Contributor

willyd commented Mar 3, 2017

There is an error about NCCL in the output of build_win.cmd now:
-- Could NOT find NCCL (missing: NCCL_INCLUDE_DIR NCCL_LIBRARIES)

This is only a warning. It did not find NCCL, so it should build it. The same thing happens on AppVeyor.

AFAIK Nvidia does not provide binaries for NCCL; they don't even support Windows officially.

@GaryKT
Author

GaryKT commented Mar 3, 2017

Great. It worked now. The NCCL error is gone.

Now there is a GPU memory error. Is it loading all the data onto the GPU at the same time? I'm running another app right now as well, so maybe that has an impact. This machine has 8 GB of GPU VRAM.
How will this work with very large datasets? Or is there a different setting I should try?
Thanks in advance!

I0303 17:33:40.142139 40544 layer_factory.cpp:58] Creating layer data
I0303 17:33:40.186722 40544 db_lmdb.cpp:40] Opened lmdb philipp_lmbd
I0303 17:33:40.208281 40544 net.cpp:86] Creating Layer data
I0303 17:33:40.208281 40544 net.cpp:382] data -> data
I0303 17:33:40.209285 40544 net.cpp:382] data -> label
I0303 17:33:40.209792 40544 data_transformer.cpp:25] Loading mean file from: philipp_mean_image.binaryproto
I0303 17:33:40.247885 40544 common.cpp:36] System entropy source not available, using fallback algorithm to generate seed instead.
I0303 17:33:40.249892 40544 data_layer.cpp:45] output data size: 5000,3,200,200
F0303 17:33:44.152439 40544 syncedmem.cpp:78] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
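
For scale, the "output data size" line in the log can be turned into a rough byte count. This is back-of-the-envelope arithmetic only, assuming 4-byte float32 elements; `blob_bytes` is a hypothetical helper, and the figure covers a single copy of the data blob, not prefetch buffers or the activations of later layers.

```python
from functools import reduce
from operator import mul


def blob_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense blob of the given shape (float32 assumed)."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape taken from the log line "output data size: 5000,3,200,200".
data_shape = (5000, 3, 200, 200)
size = blob_bytes(data_shape)
print(size, "bytes, about", round(size / 2**30, 2), "GiB")
```

Under that assumption the data blob alone takes 2,400,000,000 bytes, so a batch of 5000 images uses a large share of an 8 GB card before any other layer allocates memory. The blob scales linearly with the batch size set in the data layer.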

@willyd
Contributor

willyd commented Mar 14, 2017

Should be fixed by #5400.

@shelhamer
Member

Closing as presumed fixed until we hear otherwise.

@willyd
Contributor

willyd commented Apr 12, 2017

@shelhamer I will keep the issue open as a reminder until I merge #5400.

@willyd willyd reopened this Apr 12, 2017