MPI error while compiling #269

Closed
jplu opened this issue Mar 22, 2016 · 27 comments

@jplu

jplu commented Mar 22, 2016

Hello,

I'm facing an issue while compiling CNTK with MPI on an Ubuntu 14.04 machine. Here are the steps I followed:

../../configure --1bitsgd=yes
Defaulting to --with-buildtype=release
Found cuda at /usr/local/cuda-7.5
Found gdk at /usr/.
Found CUB at /usr/local/cub-1.4.1
Found cuDNN at /usr/local
Found OpenCV at /usr/local/opencv-3.0.0
Cannot locate libzip files
ImageReader will be built without zip container support.
Generating /home/plu/git/CNTK/build/release/Config.make
Generating /home/plu/git/CNTK/build/release/Makefile
run
>make -j all
to build
make all

The error during the make is the following:

creating /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o for with build type release
mpic++ -c Source/SGDLib/SGD.cpp -o /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o -D_POSIX_SOURCE -D_XOPEN_SOURCE=600 -D__USE_XOPEN2K -std=c++11 -DUSE_CUDNN -DUSE_ACML -DNDEBUG -DNO_SYNC -DQUANTIZED_GRADIENT_AGGREGATION  -msse3 -std=c++0x -fopenmp -fpermissive -fPIC -Werror -fcheck-new -Wno-error=literal-suffix -g -O4 -ISource/Common/Include -ISource/Math -ISource/CNTK -ISource/ActionsLib -ISource/ComputationNetworkLib -ISource/SGDLib -ISource/SequenceTrainingLib -ISource/CNTK/BrainScript -ISource/Readers/ReaderLib -I/usr/./include/nvidia/gdk -I/usr/local/cub-1.4.1 -I/usr/local/cuda-7.5/include -I/usr/local/cuda/include -I/usr/local/acml5.3.1/ifort64_mp/include -I/usr/local/opencv-3.0.0/include -ISource/1BitSGD -MD -MP -MF /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.d
Source/SGDLib/SGD.cpp: In instantiation of 'void Microsoft::MSR::CNTK::SGD<ElemType>::InitDistGradAgg(int, int) [with ElemType = float]':
Source/SGDLib/SGD.cpp:2283:16:   required from here
Source/SGDLib/SGD.cpp:1846:27: error: no matching function for call to 'Microsoft::MSR::CNTK::AllReduceDistGradAggregator<float>::AllReduceDistGradAggregator(std::shared_ptr<Microsoft::MSR::CNTK::MPIWrapper>&, int&, bool&, bool, bool&, int&, int&)'
             m_distGradAgg = new AllReduceDistGradAggregator<ElemType>(m_mpi, m_numGradientBits, m_zeroThresholdFor1Bit, true /*useQuantizationForSelfStripe*/, m_bufferedAsyncGradientAggregation, traceLevel, m_syncStatsTrace);
                           ^
Source/SGDLib/SGD.cpp:1846:27: note: candidate is:
In file included from Source/SGDLib/SGD.cpp:12:0:
Source/1BitSGD/AllReduceDistGradAggregator.h:43:5: note: Microsoft::MSR::CNTK::AllReduceDistGradAggregator<ElemType>::AllReduceDistGradAggregator(Microsoft::MSR::CNTK::MPIWrapper*, int, bool, bool, bool, int, int) [with ElemType = float]
     AllReduceDistGradAggregator(MPIWrapper* mpi, int nBits, bool zeroThresholdFor1Bit, bool useQuantizationForSelfStripe, bool useAsyncAggregation, int traceLevel, int syncStatsTrace)
     ^
Source/1BitSGD/AllReduceDistGradAggregator.h:43:5: note:   no known conversion for argument 1 from 'std::shared_ptr<Microsoft::MSR::CNTK::MPIWrapper>' to 'Microsoft::MSR::CNTK::MPIWrapper*'
Source/SGDLib/SGD.cpp: In instantiation of 'void Microsoft::MSR::CNTK::SGD<ElemType>::InitDistGradAgg(int, int) [with ElemType = double]':
Source/SGDLib/SGD.cpp:2284:16:   required from here
Source/SGDLib/SGD.cpp:1846:27: error: no matching function for call to 'Microsoft::MSR::CNTK::AllReduceDistGradAggregator<double>::AllReduceDistGradAggregator(std::shared_ptr<Microsoft::MSR::CNTK::MPIWrapper>&, int&, bool&, bool, bool&, int&, int&)'
             m_distGradAgg = new AllReduceDistGradAggregator<ElemType>(m_mpi, m_numGradientBits, m_zeroThresholdFor1Bit, true /*useQuantizationForSelfStripe*/, m_bufferedAsyncGradientAggregation, traceLevel, m_syncStatsTrace);
                           ^
Source/SGDLib/SGD.cpp:1846:27: note: candidate is:
In file included from Source/SGDLib/SGD.cpp:12:0:
Source/1BitSGD/AllReduceDistGradAggregator.h:43:5: note: Microsoft::MSR::CNTK::AllReduceDistGradAggregator<ElemType>::AllReduceDistGradAggregator(Microsoft::MSR::CNTK::MPIWrapper*, int, bool, bool, bool, int, int) [with ElemType = double]
     AllReduceDistGradAggregator(MPIWrapper* mpi, int nBits, bool zeroThresholdFor1Bit, bool useQuantizationForSelfStripe, bool useAsyncAggregation, int traceLevel, int syncStatsTrace)
     ^
Source/1BitSGD/AllReduceDistGradAggregator.h:43:5: note:   no known conversion for argument 1 from 'std::shared_ptr<Microsoft::MSR::CNTK::MPIWrapper>' to 'Microsoft::MSR::CNTK::MPIWrapper*'
In file included from Source/SGDLib/SimpleEvaluator.h:16:0,
                 from Source/SGDLib/SGD.h:9,
                 from Source/SGDLib/SGD.cpp:6:
Source/SGDLib/SimpleDistGradAggregator.h: In instantiation of 'void Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradientsImpl(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, bool) [with ElemType = double]':
Source/SGDLib/SimpleDistGradAggregator.h:110:129:   required from 'Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:110:112:   required from 'struct Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:111:57:   required from 'bool Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]'
Source/SGDLib/SGD.cpp:2608:3:   required from here
Source/SGDLib/SimpleDistGradAggregator.h:283:186: error: 'MPI_Iallreduce' was not declared in this scope
             MPI_Iallreduce(MPI_IN_PLACE, reductionBuffer, gradients[i]->GetNumElements(), MPIWrapper::GetDataType(reductionBuffer), MPI_SUM, m_mpi->Communicator(), &allReduceRequests[i]) || MpiFail("MPI_Iallreduce");
                                                                                                                                                                                          ^
Source/SGDLib/SimpleDistGradAggregator.h: In instantiation of 'void Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradientsImpl(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, bool) [with ElemType = float]':
Source/SGDLib/SimpleDistGradAggregator.h:110:129:   required from 'Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:110:112:   required from 'struct Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:111:57:   required from 'bool Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]'
Source/SGDLib/SGD.cpp:2608:3:   required from here
Source/SGDLib/SimpleDistGradAggregator.h:283:186: error: 'MPI_Iallreduce' was not declared in this scope
make[1]: *** [/home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o] Error 1
make[1]: Leaving directory `/home/plu/git/CNTK'
make: *** [all] Error 2

My $LD_LIBRARY_PATH variable looks like this:

/usr/local/cuda/lib64:/usr/local/acml5.3.1/ifort64/lib/:/usr/local/acml5.3.1/ifort64_mp/lib/:/usr/local/mpi/lib/

MPI is installed at the same path as the one proposed here

Thanks for any help you can provide.

@alexeyo26
Member

How did you install MPI? As we suggest (from sources) or from a Debian package? If the latter, that's the reason. What is offered there is old and is missing some libraries.

@alexeyo26
Member

Oh, and I noticed you have CUDA 7.5. I'm afraid this will be the next thing that fails. Today we're on CUDA 7.0 (see setup section).

@jplu
Author

jplu commented Mar 22, 2016

Here are the steps I followed to install MPI from the sources:

wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.1.tar.gz
tar -xzvf ./openmpi-1.10.1.tar.gz
cd openmpi-1.10.1
./configure --prefix=/usr/local/mpi
make -j all
make install
export PATH=/usr/local/mpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/mpi/lib:$LD_LIBRARY_PATH

Exactly like in the setup section.

About CUDA: I have both versions installed. I wanted to try the latest first and, if that does not work, switch the symlink location to 7.0.

@alexeyo26
Member

I see you used make install and not sudo make install. That should have resulted in errors due to insufficient access rights. Try going back to the directory you used to build MPI and run sudo make install.

(Understood about CUDA.)

@jplu
Author

jplu commented Mar 22, 2016

My user has write permissions on /usr/local/ and all its subfolders. But I'll try it anyway, you never know.

@jplu
Author

jplu commented Mar 23, 2016

Unfortunately, this does not change anything :(

@alexeyo26
Member

Then please start configure from scratch, do make (not make -j) and attach the output.

Thank you,
Alexey

@jplu
Author

jplu commented Mar 23, 2016

Sorry, do you want me to attach the make output of MPI or of CNTK?

@jplu
Author

jplu commented Mar 24, 2016

OK, to be safe, I have attached the configure and make output of both builds.
configure_mpi.txt
make_mpi.txt
configure_cntk.txt
make_cntk.txt

@amos0531

Hi jplu!
OpenMPI does not provide the MPI_Iallreduce API; MPI_Iallreduce is only in Microsoft MS-MPI.

@jplu
Author

jplu commented Mar 28, 2016

@amos0531 Nope! The function MPI_Iallreduce is correctly provided by OpenMPI.

@alexeyo26
Member

It looks like you're trying to build CNTK with 1Bit SGD support. Do you have the 1bit SGD code in Source/1BitSGD? If not, please get it as described here or rerun CNTK configure without the --1bitsgd=yes parameter.

@jplu
Author

jplu commented Mar 28, 2016

I already cloned the repo with this command:

git clone --recursive https://github.com/Microsoft/cntk/

And I do have the Source/1BitSGD folder locally.

@alexeyo26
Member

Sorry for a possibly childish question, but do you have the code inside the Source/1BitSGD folder? In particular, do you see AllReduceDistGradAggregator.h (this is exactly where your missing classes are) and MatrixQuantizer.h? Something could have gone wrong during the recursive clone (the folder itself will always be there, but if the recursive clone was not performed, it will be empty).

@jplu
Author

jplu commented Mar 28, 2016

It is a legitimate question :)

And yes those two files are in the folder:

ls /home/plu/git/CNTK/Source/1BitSGD/
AllReduceDistGradAggregator.h  LICENSE.md  MatrixQuantizer.h  README.md

@alexeyo26
Member

Julien, please do the following for me:
cd /home/plu/git/CNTK/Source/1BitSGD
git log
and post the output.
Thank you.

@jplu
Author

jplu commented Mar 28, 2016

Here is the output:

[plu@arbois - Mon Mar 28 - 15:08:24 - ~]$ 
cd /home/plu/git/CNTK/Source/1BitSGD/
[plu@arbois - Mon Mar 28 - 15:09:41 - /home/plu/git/CNTK/Source/1BitSGD]$ 
git log              
commit f785679a6bd5cc089b138b3c6bcb68e4b1f345ae
Author: Mark Hillebrand <Mark.Hillebrand@microsoft.com>
Date:   Fri Jan 22 17:33:16 2016 +0100

    Initial commit

@alexeyo26
Member

Got it! Your code for 1bit SGD is too old. Do:

cd /home/plu/git/CNTK
git submodule update --recursive

You should get commit 41c1f55b9d5115c4dd051391f38eed8e93fb1860 as the result of this. Hope it will work then.
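
To confirm that the update took effect, a check along these lines may help. This is only a sketch: submodule_at_commit is a hypothetical helper name (not part of CNTK or git), and it assumes git 1.8.5 or later for the -C option.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the git checkout in directory $1 has its
# HEAD at commit hash $2.
# Returns 0 on a match, 1 on a mismatch, 2 if $1 is not inside a git checkout.
# Note: if $1 is an empty, uninitialized submodule folder, git reports the
# HEAD of the enclosing repository instead, which shows up as a mismatch.
submodule_at_commit() {
    actual="$(git -C "$1" rev-parse HEAD 2>/dev/null)" || return 2
    [ "$actual" = "$2" ]
}
```

From /home/plu/git/CNTK, for example, `submodule_at_commit Source/1BitSGD 41c1f55b9d5115c4dd051391f38eed8e93fb1860 && echo up-to-date` would print up-to-date only once the submodule sits at the expected commit.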

@jplu
Author

jplu commented Mar 28, 2016

Unfortunately no:

creating /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o for with build type release
mpic++ -c Source/SGDLib/SGD.cpp -o /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o -D_POSIX_SOURCE -D_XOPEN_SOURCE=600 -D__USE_XOPEN2K -std=c++11 -DUSE_CUDNN -DUSE_ACML -DNDEBUG -DNO_SYNC -DQUANTIZED_GRADIENT_AGGREGATION  -msse3 -std=c++0x -fopenmp -fpermissive -fPIC -Werror -fcheck-new -Wno-error=literal-suffix -g -O4 -ISource/Common/Include -ISource/Math -ISource/CNTK -ISource/ActionsLib -ISource/ComputationNetworkLib -ISource/SGDLib -ISource/SequenceTrainingLib -ISource/CNTK/BrainScript -ISource/Readers/ReaderLib -I/usr/./include/nvidia/gdk -I/usr/local/cub-1.4.1 -I/usr/local/cuda-7.5/include -I/usr/local/cuda/include -I/usr/local/acml5.3.1/ifort64_mp/include -I/usr/local/opencv-3.0.0/include -ISource/1BitSGD -MD -MP -MF /home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.d
In file included from Source/SGDLib/SimpleEvaluator.h:16:0,
                 from Source/SGDLib/SGD.h:9,
                 from Source/SGDLib/SGD.cpp:6:
Source/SGDLib/SimpleDistGradAggregator.h: In instantiation of 'void Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradientsImpl(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, bool) [with ElemType = double]':
Source/SGDLib/SimpleDistGradAggregator.h:110:129:   required from 'Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:110:112:   required from 'struct Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:111:57:   required from 'bool Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = double]'
Source/SGDLib/SGD.cpp:2599:3:   required from here
Source/SGDLib/SimpleDistGradAggregator.h:283:186: error: 'MPI_Iallreduce' was not declared in this scope
             MPI_Iallreduce(MPI_IN_PLACE, reductionBuffer, gradients[i]->GetNumElements(), MPIWrapper::GetDataType(reductionBuffer), MPI_SUM, m_mpi->Communicator(), &allReduceRequests[i]) || MpiFail("MPI_Iallreduce");
                                                                                                                                                                                          ^
Source/SGDLib/SimpleDistGradAggregator.h: In instantiation of 'void Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradientsImpl(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, bool) [with ElemType = float]':
Source/SGDLib/SimpleDistGradAggregator.h:110:129:   required from 'Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:110:112:   required from 'struct Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]::__lambda23'
Source/SGDLib/SimpleDistGradAggregator.h:111:57:   required from 'bool Microsoft::MSR::CNTK::SimpleDistGradAggregator<ElemType>::AggregateGradients(const std::vector<Microsoft::MSR::CNTK::Matrix<ElemType>*>&, Microsoft::MSR::CNTK::DistGradHeader*, int) [with ElemType = float]'
Source/SGDLib/SGD.cpp:2599:3:   required from here
Source/SGDLib/SimpleDistGradAggregator.h:283:186: error: 'MPI_Iallreduce' was not declared in this scope
make[1]: *** [/home/plu/git/CNTK/build/release/.build/Source/SGDLib/SGD.o] Error 1
make[1]: Leaving directory `/home/plu/git/CNTK'
make: *** [all] Error 2
[plu@arbois - Mon Mar 28 - 15:49:52 - /home/plu/git/CNTK/Source/1BitSGD]$ 
git log
commit 41c1f55b9d5115c4dd051391f38eed8e93fb1860
Author: Amit Agarwal <amitaga@microsoft.com>
Date:   Fri Mar 11 23:55:40 2016 -0800

    Bugfix: Avoid running any non training or eval related commands in parallel

commit f785679a6bd5cc089b138b3c6bcb68e4b1f345ae
Author: Mark Hillebrand <Mark.Hillebrand@microsoft.com>
Date:   Fri Jan 22 17:33:16 2016 +0100

    Initial commit

@alexeyo26
Member

The error is different now (compare it with the output you posted previously). Are you on the latest master? The latest commit should be e859de561f966c8d4fff8a4caac426f7d710d871

@jplu
Author

jplu commented Mar 28, 2016

Oh, indeed not the same. My current log is:

git log -1 --format="%H"
e859de561f966c8d4fff8a4caac426f7d710d871

I did a pull at the same time as I updated the Source/1BitSGD module.

@alexeyo26
Member

New enough. Let us do some internal brainstorming

@jplu
Author

jplu commented Mar 28, 2016

Perfect, do not hesitate to ask if you want me to run other commands to help with your brainstorming.

@alexeyo26
Member

Here is an idea proposed by one of our team members:

The mpi.h that the build is picking up appears to be an older MPI version that does not have MPI_Iallreduce. What OpenMPI version does the mpic++ used in the build correspond to?
https://www.open-mpi.org/doc/v1.8/man3/MPI_Iallreduce.3.php

In other words, are you sure you don't have some other installation of MPI on your system? Say, from a Debian package pulled in as a dependency of some other package? (You can check with dpkg --list | grep mpi.)
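
A rough way to test this idea is to look at each candidate mpi.h and see whether it declares MPI_Iallreduce, a nonblocking collective introduced in MPI-3.0 that headers from older MPI implementations lack. The sketch below is an assumption-laden illustration: check_mpi_headers is a hypothetical helper name, and the directories to pass depend on the system.

```shell
#!/bin/sh
# Hypothetical helper: for each include directory given as an argument,
# report whether its mpi.h mentions MPI_Iallreduce. Directories without an
# mpi.h are skipped silently.
check_mpi_headers() {
    for dir in "$@"; do
        header="$dir/mpi.h"
        [ -f "$header" ] || continue
        if grep -q 'MPI_Iallreduce' "$header"; then
            echo "$header: declares MPI_Iallreduce"
        else
            echo "$header: no MPI_Iallreduce"
        fi
    done
}
```

With the --prefix used earlier in this thread, /usr/local/mpi/include would be one directory to pass. Where a Debian package puts its header varies, so `dpkg -L libopenmpi-dev | grep 'mpi.h$'` is one way to find it. Open MPI's wrapper compilers also document a `--showme:version` option, so `mpic++ --showme:version` should report which Open MPI the wrapper belongs to.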

@jplu
Author

jplu commented Mar 29, 2016

This is a very good catch, Alexey:

dpkg -l | grep openmpi
ii  libopenmpi-dev                         1.6.5-8                                    amd64        high performance message passing library -- header files
ii  libopenmpi1.6                          1.6.5-8                                    amd64        high performance message passing library -- shared library
ii  openmpi-bin                            1.6.5-8                                    amd64        high performance message passing library -- binaries
ii  openmpi-common                         1.6.5-8                                    all          high performance message passing library -- common files

Before uninstalling it, I'm going to check whether any other packages use it. This is a desktop provided by my lab, so I don't really know what is already installed.

@jplu
Author

jplu commented Mar 29, 2016

Good news! When I changed the default version using symlinks that point to the correct OpenMPI version, it compiled like a charm.

I'm going to try to modify configure so that the user can provide the path where MPI is installed, as for the other libs, and then open a PR.
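
The thread does not record which symlinks were changed. Independently of that, one way to see which mpic++ a build would pick up is to list every match on PATH in lookup order; the first line printed is the one the shell runs. This is a minimal sketch, and list_on_path is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every executable file named $1 found on PATH,
# in lookup order. The body runs in a subshell (note the parentheses), so
# the IFS and globbing changes do not leak into the caller.
list_on_path() (
    name="$1"
    IFS=':'
    set -f
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir="."
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)
```

Running `list_on_path mpic++` on a machine with both a packaged and a source-built Open MPI would be expected to show two entries; if the packaged one comes first, putting /usr/local/mpi/bin earlier in PATH is an alternative to changing symlinks.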

@anijain2305

@jplu Can you please explain which symlinks you changed to get the build working? I think I am stuck on the same issue.
