
compile the caffe #19

Closed
Jcdidi opened this issue Mar 7, 2017 · 15 comments

Jcdidi commented Mar 7, 2017

Hello, when I run make -j8 && make install, it shows the following error:

[ 87%] [ 88%] make[2]: *** No rule to make target `/path/to/cudnn/lib64/libcudnn.so', needed by `lib/libcaffe.so'. Stop.
make[2]: *** Waiting for unfinished jobs....
Building CXX object src/caffe/CMakeFiles/caffe.dir/data_transformer.cpp.o
Building CXX object src/caffe/CMakeFiles/caffe.dir/syncedmem.cpp.o
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
make: *** [all] Error 2

I wonder if the path is wrong?
Another question: can I build without cudnn and openmpi? I only have one server, but it has 4 GPUs, and I wonder if I can use --gpu all instead of openmpi.
Thanks!

Cysu (Collaborator) commented Mar 7, 2017

Sorry, but we modified the official Caffe for our project, so it relies on openmpi if you want to use multiple GPUs (we only use multi-GPU for testing, not for training).

We also highly recommend installing cudnn v5. After downloading and extracting it, replace the /path/to/cudnn in the cmake command with your own directory path. For example, if you copy the cudnn files into /usr/local/cuda, the cmake command should be

cmake .. -DUSE_MPI=ON -DCUDNN_INCLUDE=/usr/local/cuda/include -DCUDNN_LIBRARY=/usr/local/cuda/lib64/libcudnn.so
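These flags just point cmake at files on disk, so a quick sanity check before configuring can save a failed build. A minimal sketch (the temp-dir layout below is faked purely so the snippet runs anywhere; substitute your real cudnn root):

```shell
# Hypothetical pre-cmake check: verify that the two paths you will pass
# via -DCUDNN_INCLUDE and -DCUDNN_LIBRARY actually exist. The layout is
# created in a temp dir only to keep the example self-contained.
root=$(mktemp -d)                      # stand-in for /usr/local/cuda
mkdir -p "$root/include" "$root/lib64"
touch "$root/include/cudnn.h" "$root/lib64/libcudnn.so"

for f in "$root/include/cudnn.h" "$root/lib64/libcudnn.so"; do
  if [ -e "$f" ]; then echo "ok: $f"; else echo "MISSING: $f"; fi
done
```

If either line reports MISSING, cmake will still generate the build files, and make then typically fails with exactly the "No rule to make target ... libcudnn.so" error shown above.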

Jcdidi (Author) commented Mar 7, 2017

Thanks. But in my environment, which has only one server with 4 GPUs, can I still use openmpi?

Cysu (Collaborator) commented Mar 7, 2017

Sure. You can change these two lines to

mpirun -n 4 python2 tools/eval_test.py \
  --gpu 0,1,2,3 \

Jcdidi (Author) commented Mar 7, 2017

Um, thanks.
"boost >= 1.55 (A tip for Ubuntu 14.04: sudo apt-get autoremove libboost1.54* then sudo apt-get install libboost1.55-all-dev)"
Must it be >= 1.55?

Cysu (Collaborator) commented Mar 7, 2017

Yes. It should be >= 1.55.

Jcdidi (Author) commented Mar 7, 2017

jxd@amax-1080:~/person_search-master$ experiments/scripts/eval_test.sh resnet50 50000 resnet50
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)


A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: amax-1080
Framework: crs
Component: none

[amax-1080:00334] *** Process received signal ***
[amax-1080:00334] Signal: Segmentation fault (11)
[amax-1080:00334] Signal code: Address not mapped (1)
[amax-1080:00334] Failing at address: 0x28
[amax-1080:00334] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7f65b2b0a330]
[amax-1080:00334] [ 1] /usr/lib/libmpi.so.1(mca_base_select+0x11e) [0x7f652bf16f1e]
[amax-1080:00334] [ 2] /usr/lib/libmpi.so.1(opal_crs_base_select+0x7e) [0x7f652beff28e]
[amax-1080:00334] [ 3] /usr/lib/libmpi.so.1(opal_cr_init+0x3fc) [0x7f652bf1ff1c]
[amax-1080:00334] [ 4] /usr/lib/libmpi.so.1(opal_init+0x1d0) [0x7f652bf28810]
[amax-1080:00334] [ 5] /usr/lib/libmpi.so.1(orte_init+0x37) [0x7f652beb86e7]
[amax-1080:00334] [ 6] /usr/lib/libmpi.so.1(ompi_mpi_init+0x174) [0x7f652be78024]
[amax-1080:00334] [ 7] /usr/lib/libmpi.so.1(PMPI_Init_thread+0xd4) [0x7f652be8f7f4]
[amax-1080:00334] [ 8] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(initMPI+0x4716) [0x7f652c27d0a6]
[amax-1080:00334] [ 9] python2(_PyImport_LoadDynamicModule+0x9b) [0x427992]
[amax-1080:00334] [10] python2() [0x55642f]
[amax-1080:00334] [11] python2() [0x4e2dec]
[amax-1080:00334] [12] python2() [0x556cf1]
[amax-1080:00334] [13] python2() [0x569c08]
[amax-1080:00334] [14] python2(PyEval_CallObjectWithKeywords+0x6b) [0x4c8c8b]
[amax-1080:00334] [15] python2(PyEval_EvalFrameEx+0x2958) [0x5264a8]
[amax-1080:00334] [16] python2() [0x567d14]
[amax-1080:00334] [17] python2(PyRun_FileExFlags+0x92) [0x465bf4]
[amax-1080:00334] [18] python2(PyRun_SimpleFileExFlags+0x2ee) [0x46612d]
[amax-1080:00334] [19] python2(Py_Main+0xb5e) [0x466d92]
[amax-1080:00334] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f65b2756f45]
[amax-1080:00334] [21] python2() [0x577c2e]
[amax-1080:00334] *** End of error message ***
jxd@amax-1080:~/person_search-master$

I did not do the pretraining; I directly used the trained model.
As you said, I can run the test without MPI, so I followed "use only one GPU, remove the mpirun -n 8 in L14 and change L16 to --gpu 0", but it shows the error above. How can I solve it? Thanks.
In addition, when I use MPI as you advised, it shows the same errors.

Cysu (Collaborator) commented Mar 7, 2017

It seems that you have different versions of openmpi. Let's say you compile openmpi and install it into a local directory like /home/jxd/openmpi. Then add the following lines to your ~/.bashrc:

export PATH=/home/jxd/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/jxd/openmpi/lib:$LD_LIBRARY_PATH

Restart the terminal, rm -rf build, and compile caffe again.
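The reason prepending works can be shown with a small sketch (not project code): the shell resolves a command name against PATH left to right, so the first directory containing an mpirun wins. The directories below are throwaway stand-ins created in a temp dir.

```shell
# Two fake "mpirun" binaries in two directories; whichever directory
# appears first in PATH shadows the other.
demo=$(mktemp -d)
mkdir -p "$demo/local/bin" "$demo/system/bin"
printf '#!/bin/sh\necho local-mpirun\n'  > "$demo/local/bin/mpirun"
printf '#!/bin/sh\necho system-mpirun\n' > "$demo/system/bin/mpirun"
chmod +x "$demo/local/bin/mpirun" "$demo/system/bin/mpirun"

# With the "local" directory first, its mpirun is the one that runs:
PATH="$demo/local/bin:$demo/system/bin:$PATH" sh -c 'mpirun'   # prints "local-mpirun"
```

LD_LIBRARY_PATH works the same way for shared libraries at load time, which is why both exports are needed: one picks the right mpirun, the other picks the right libmpi.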

Jcdidi (Author) commented Mar 8, 2017

Hello, I have successfully installed openmpi and tested that it works. Then I ran cmake for caffe successfully, but the problem above remains. When I try training, it hits the same error.
Thanks!

jxd@amax-1080:~/person_search-master$ experiments/scripts/train.sh 0 --set EXP_DIR resnet50

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ NET=resnet50
+ DATASET=psdb
+ array=($@)
+ len=4
+ EXTRA_ARGS='--set EXP_DIR resnet50'
+ EXTRA_ARGS_SLUG=--set_EXP_DIR_resnet50
+ case $DATASET in
+ TRAIN_IMDB=psdb_train
+ TEST_IMDB=psdb_test
+ PT_DIR=psdb
+ ITERS=50000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/psdb_train_resnet50_--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ exec
++ tee -a experiments/logs/psdb_train_resnet50_--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ echo Logging output to experiments/logs/psdb_train_resnet50_--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
Logging output to experiments/logs/psdb_train_resnet50_--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ python2 tools/train_net.py --gpu 0 --solver models/psdb/resnet50/solver.prototxt --weights data/imagenet_models/resnet50.caffemodel --imdb psdb_train --iters 50000 --cfg experiments/cfgs/resnet50.yml --rand --set EXP_DIR resnet50
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host: amax-1080
Framework: crs
Component: none

[amax-1080:22914] *** Process received signal ***
[amax-1080:22914] Signal: Segmentation fault (11)
[amax-1080:22914] Signal code: Address not mapped (1)
[amax-1080:22914] Failing at address: 0x28
[amax-1080:22914] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7f0a35507330]
[amax-1080:22914] [ 1] /usr/lib/libmpi.so.1(mca_base_select+0x11e) [0x7f09acb8bf1e]
[amax-1080:22914] [ 2] /usr/lib/libmpi.so.1(opal_crs_base_select+0x7e) [0x7f09acb7428e]
[amax-1080:22914] [ 3] /usr/lib/libmpi.so.1(opal_cr_init+0x3fc) [0x7f09acb94f1c]
[amax-1080:22914] [ 4] /usr/lib/libmpi.so.1(opal_init+0x1d0) [0x7f09acb9d810]
[amax-1080:22914] [ 5] /usr/lib/libmpi.so.1(orte_init+0x37) [0x7f09acb2d6e7]
[amax-1080:22914] [ 6] /usr/lib/libmpi.so.1(ompi_mpi_init+0x174) [0x7f09acaed024]
[amax-1080:22914] [ 7] /usr/lib/libmpi.so.1(PMPI_Init_thread+0xd4) [0x7f09acb047f4]
[amax-1080:22914] [ 8] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(initMPI+0x4716) [0x7f09acef20a6]
[amax-1080:22914] [ 9] python2(_PyImport_LoadDynamicModule+0x9b) [0x427992]
[amax-1080:22914] [10] python2() [0x55642f]
[amax-1080:22914] [11] python2() [0x4e2dec]
[amax-1080:22914] [12] python2() [0x556cf1]
[amax-1080:22914] [13] python2() [0x569c08]
[amax-1080:22914] [14] python2(PyEval_CallObjectWithKeywords+0x6b) [0x4c8c8b]
[amax-1080:22914] [15] python2(PyEval_EvalFrameEx+0x2958) [0x5264a8]
[amax-1080:22914] [16] python2() [0x567d14]
[amax-1080:22914] [17] python2(PyRun_FileExFlags+0x92) [0x465bf4]
[amax-1080:22914] [18] python2(PyRun_SimpleFileExFlags+0x2ee) [0x46612d]
[amax-1080:22914] [19] python2(Py_Main+0xb5e) [0x466d92]
[amax-1080:22914] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f0a35153f45]
[amax-1080:22914] [21] python2() [0x577c2e]
[amax-1080:22914] *** End of error message ***
experiments/scripts/train.sh: line 47: 22914 Segmentation fault (core dumped) python2 tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/solver.prototxt --weights data/imagenet_models/${NET}.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/${NET}.yml --rand ${EXTRA_ARGS}

Cysu (Collaborator) commented Mar 8, 2017

Could you please check the output of the following commands:

which mpirun
ldd $(which mpirun) | grep mpi
ldd caffe/build/install/bin/caffe | grep mpi

Jcdidi (Author) commented Mar 8, 2017

Yeah, maybe I did not build caffe with cmake successfully, since the binary does not exist?

ldd: caffe/build/install/bin/caffe: No such file or directory

jxd@amax-1080:~$ which mpirun
/usr/local/openmpi/bin/mpirun
jxd@amax-1080:~$ ldd $(which mpirun) | grep mpi
libopen-rte.so.12 => /usr/local/openmpi/lib/libopen-rte.so.12 (0x00007f75c7edc000)
libopen-pal.so.13 => /usr/local/openmpi/lib/libopen-pal.so.13 (0x00007f75c7bfe000)
jxd@amax-1080:~$ ldd caffe/build/install/bin/caffe | grep mpi
ldd: caffe/build/install/bin/caffe: No such file or directory

Cysu (Collaborator) commented Mar 8, 2017

OK. You have another self-compiled openmpi installed at /usr/local/openmpi. So you need to add these lines to ~/.bashrc:

export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

Restart the terminal, remove the build directory under caffe, and recompile it following the steps in the README file.

Jcdidi (Author) commented Mar 8, 2017

Yes, I added these lines to ~/.bashrc and recompiled it yesterday. Are there two openmpi installations on the system?
Now I will try removing the build directory again and recompiling. Thanks!

Cysu (Collaborator) commented Mar 8, 2017

Right. In your previous log, it complains

mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

So you have a system-installed openmpi under /usr/lib and a self-installed one at /usr/local/openmpi.
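The conflict can be illustrated with a small sketch using fake files in a temp dir (not the real libraries): two libmpi builds are visible to the dynamic loader, and the search-path order decides which one is found first.

```shell
# Stand-ins for the two builds seen in this thread: an apt-installed
# libmpi.so.1 and a self-compiled libmpi.so.12. Names are illustrative.
demo=$(mktemp -d)
mkdir -p "$demo/system/lib" "$demo/local/lib"
touch "$demo/system/lib/libmpi.so.1"
touch "$demo/local/lib/libmpi.so.12"

# The loader scans LD_LIBRARY_PATH directories in order, so listing the
# self-compiled lib directory first mirrors the ~/.bashrc fix above:
search_path="$demo/local/lib:$demo/system/lib"
(IFS=:; for d in $search_path; do ls "$d"; done)   # libmpi.so.12 first, then libmpi.so.1
```

When the order is reversed (or LD_LIBRARY_PATH is unset), the old system build wins, and plugins compiled against the new build fail to load, which is what the "perhaps a missing symbol, or compiled for a different version of Open MPI?" messages indicate.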

Jcdidi (Author) commented Mar 8, 2017

Thanks a lot!
I found the issue. I added this line to ~/.bashrc:

export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so

all detection:
recall = 79.37%
ap = 74.82%
labeled only detection:
recall = 97.76%
search ranking:
mAP = 75.41%
top- 1 = 78.48%
top- 5 = 90.07%
top-10 = 92.34%

Cysu (Collaborator) commented Mar 8, 2017

Good to hear that! Will close the issue for now, and please feel free to reopen it if there are further problems.

Cysu closed this as completed Mar 8, 2017