gdrcopy on CentOS 7.7 #93

Closed
afernandezody opened this issue Sep 27, 2019 · 16 comments

@afernandezody

Hello,
Installing gdrcopy is causing some unexpected trouble. I suspect that the cause is some incompatibility with CentOS 7.7 (this is not the 1st piece of software complaining about the new version). The error message reads:

make CUDA=/usr/local/cuda-10.1 all
GDRAPI_ARCH=X86
cd src/gdrdrv && \
make
make[1]: Entering directory `/home/odyhpc/gdrcopy/src/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-1062.1.1.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[2]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.1.el7.x86_64'
make[1]: *** [module] Error 2
make[1]: Leaving directory `/home/odyhpc/gdrcopy/src/gdrdrv'
make: *** [driver] Error 2

The kernel is updated to 3.10.0-1062.1.1 and I updated the compilers (GNU) to 9.2.
Thanks,
Arturo

@pakmarkthub pakmarkthub self-assigned this Sep 27, 2019
@pakmarkthub
Collaborator

@afernandezody

This issue is quite common when compiling a kernel module (driver) with a non-standard gcc version. I suggest that you use the same gcc version (at least the same major version) as the one your Linux kernel was compiled with. You can run cat /proc/version to find out which gcc version that is.
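
A minimal sketch of that comparison, assuming the older /proc/version format seen in this thread (it contains the literal text "gcc version X.Y.Z"; newer kernels print the compiler differently). The helper names kernel_gcc_version and major_of are made up for illustration.

```shell
#!/bin/sh
# Extract the "X.Y.Z" gcc version from a /proc/version-style string.
# Prints nothing if the string has no "gcc version ..." part.
kernel_gcc_version() {
    printf '%s\n' "$1" | sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p'
}

# Major version = everything before the first dot.
major_of() {
    printf '%s\n' "${1%%.*}"
}

if [ -r /proc/version ] && command -v gcc >/dev/null 2>&1; then
    built_with=$(kernel_gcc_version "$(cat /proc/version)")
    installed=$(gcc -dumpversion)
    echo "kernel built with gcc ${built_with:-unknown}, installed gcc ${installed}"
    if [ "$(major_of "$built_with")" != "$(major_of "$installed")" ]; then
        echo "warning: major versions differ (or the kernel's gcc could not be parsed)"
    fi
fi
```

A matching major version only rules out one cause; as the rest of the thread shows, the build can still fail for other reasons.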

@afernandezody
Author

Hi Pak,
Thanks. I got the error even before upgrading the compilers (when using gcc 4.8.5). Upgrading the compilers to 9.2.0, which takes some time, was only prompted by the error message that mentions some operation not supported by the compiler. The error message was the same for both gcc 4.8.5 and 9.2.0.

@pakmarkthub
Collaborator

Thank you for trying that. I will try to reproduce your bug on our system. In the meantime, can I ask you to post the output of cat /proc/version?

@afernandezody
Author

Linux version 3.10.0-1062.1.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Sep 13 22:55:44 UTC 2019

@pakmarkthub
Collaborator

Hi @afernandezody,

I have a few questions and requests.

  1. How did you install CentOS 7.7? Did you upgrade it from a previous version or was this a fresh install?

  2. Could you check your kernel configuration with cat /boot/config-$(uname -r) | grep CONFIG_RETPOLINE? You should get CONFIG_RETPOLINE=y.

  3. Please check that your gcc supports retpoline. You can try gcc -mindirect-branch=thunk-extern -mindirect-branch-register. If gcc does not support these options, it will print gcc: error: unrecognized command line option ....

  4. Please try this gcc http://mirror.centos.org/centos-7/7/os/x86_64/Packages/gcc-4.8.5-39.el7.x86_64.rpm. It should support retpoline and is the version that is used for compiling your kernel.
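
Checks 2 and 3 above can be scripted roughly as follows. This is only a sketch: the function names config_has_retpoline and cc_supports are hypothetical, and the compile test feeds gcc an empty source from /dev/null because gcc given flags alone stops with "no input files" before it can say anything useful about the flags.

```shell
#!/bin/sh
# Succeeds if the given kernel config file enables CONFIG_RETPOLINE.
config_has_retpoline() {
    grep -q '^CONFIG_RETPOLINE=y$' "$1" 2>/dev/null
}

# Succeeds if compiler $1 accepts the remaining arguments as flags.
# An empty C translation unit is compiled into a throwaway object file.
cc_supports() {
    cc=$1; shift
    out=$(mktemp) || return 1
    if "$cc" -Werror "$@" -c -x c /dev/null -o "$out" >/dev/null 2>&1; then
        status=0
    else
        status=1
    fi
    rm -f "$out"
    return "$status"
}

cfg="/boot/config-$(uname -r)"
if config_has_retpoline "$cfg"; then
    echo "CONFIG_RETPOLINE=y in $cfg"
else
    echo "CONFIG_RETPOLINE not enabled (or $cfg not readable)"
fi

if cc_supports "${CC:-gcc}" -mindirect-branch=thunk-extern -mindirect-branch-register; then
    echo "${CC:-gcc} accepts the retpoline flags"
else
    echo "${CC:-gcc} rejects the retpoline flags (or is not installed)"
fi
```

Setting CC in the environment (for example CC=/usr/bin/gcc) tests a specific compiler instead of whatever gcc is first on PATH.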

@afernandezody
Author

Hi @pakmarkthub ,
I'm traveling and have limited access to the systems until Thursday, when I can provide further information. In the meantime, this is the available data:
1 - I'm using fresh deployments on cloud instances, which automatically install version 7.7 (the switch from 7.6 to 7.7 took place about 10 or 15 days ago).
2 - It returns 'y'
3 - It compiles without any issue
Thanks.

@pakmarkthub
Collaborator

Hi @afernandezody,

Based on your reply to question 3, your gcc should support retpoline. So it is quite likely that make is picking up a different compiler.

  1. Can you try cd src/gdrdrv; make -n? It should print out the steps it takes to compile. For example, on my machine I got:
....
set -e; echo '  CC [M]  /home/pmarkthub/Projects/gdrcopy-v2.0/src/gdrdrv/nv-p2p-dummy.o'; gcc  -Wp,-MD,/home/pmarkthub/Projects/gdrcopy-v2.0/src/gdrdrv/.nv-p2p-dummy.o.d ...
....

So, I know that it picks the default gcc.

  2. Try forcing the compiler you want to use with make CC=<path-to-your-compiler>. I believe the gcc you tested in Question 3 should be fine. Please give the full path so that make does not pick up something else.

@afernandezody
Author

Hi @pakmarkthub ,
I came back and was able to try out your suggestions. However, it's still not compiling for some reason. To sum things up, I'm the only user, there is only one compiler (4.8.5-39) and compiling from the src/gdrdrv subdirectory with the CC flag returns:

[odyhpc@gpucruncher4020 gdrdrv]$ make CC=/usr/bin -n
echo "Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems."
make -C /lib/modules/3.10.0-1062.1.2.el7.x86_64/build  M=/home/odyhpc/gdrcopy/src/gdrdrv modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
arch/x86/Makefile:96: stack-protector enabled but compiler support broken
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
make: *** [module] Error 2

Not sure if you were able to reproduce the error in your system. Thanks.

@pakmarkthub
Collaborator

Hi @afernandezody,

Your command is incorrect. If the full path of your gcc is /usr/bin/gcc, please use make CC=/usr/bin/gcc. With -n, the code is not compiled, but make should show how the compiler is invoked on your system. So, please try make CC=/usr/bin/gcc first. If it does not work, please post the output of make CC=/usr/bin/gcc -n.

To give you more details, our Makefile relies on your system's Makefile. Normally, it should pick your default compiler. For some reason, I think that it does not pick your gcc, which should support retpoline according to your answer to my Question 3.
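
One way to catch a CC value that points at a directory (such as /usr/bin) before handing it to make is sketched below. The function name resolve_cc is made up for this example and is not part of gdrcopy or kbuild.

```shell
#!/bin/sh
# Print the resolved path of compiler $1, or fail if it does not resolve
# to an executable regular file (for example if it is a directory).
resolve_cc() {
    cc_path=$(command -v "$1" 2>/dev/null) || return 1
    [ -f "$cc_path" ] && [ -x "$cc_path" ] || return 1
    printf '%s\n' "$cc_path"
}

if cc=$(resolve_cc "${CC:-gcc}"); then
    echo "compiler resolves to $cc, so you could run: make CC=$cc"
else
    echo "${CC:-gcc} is not an executable file; fix CC before running make"
fi
```

Running it as CC=/usr/bin sh check.sh (the file name is arbitrary) should report the problem instead of letting the kernel Makefile fail with a misleading retpoline message.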

@afernandezody
Author

Hi @pakmarkthub,
My mistake, I was trying different syntaxes to check whether it made any difference. I'm still getting very similar error messages regardless of the flags:

[odyhpc@gpucruncher4020 gdrdrv]$ make CC=/usr/bin/gcc
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[1]: Entering directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
make: *** [module] Error 2
[odyhpc@gpucruncher4020 gdrdrv]$ make CC=/usr/bin/gcc -n
echo "Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems."
make -C /lib/modules/3.10.0-1062.1.2.el7.x86_64/build  M=/home/odyhpc/gdrcopy/src/gdrdrv modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
make: *** [module] Error 2

From your comment, could something wrong with the system Makefile be the cause of these problems? Thanks.

@drossetti
Member

@afernandezody I cannot reproduce your problem on my devel box, which is still at 3.10.0-1062.1.1.el7.x86_64, see below.

Can you please verify the versions of the tools installed on that cloud instance, say by reporting the output of the commands below?

drossetti@geppetto 10:51 (34) gdrdrv>uname -r
3.10.0-1062.1.1.el7.x86_64

drossetti@geppetto 10:51 (35) gdrdrv>rpm -q gcc
gcc-4.8.5-39.el7.x86_64

drossetti@geppetto 10:51 (36) gdrdrv>gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

drossetti@geppetto 10:51 (38) gdrdrv>make clean all
rm -rf *.o .*.o.d *.ko* *.mod.* .*.cmd Module.symvers modules.order .tmp_versions/ *~ core .depend TAGS .cache.mk 
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.87.00/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[1]: Entering directory `/usr/src/kernels/3.10.0-1062.1.1.el7.x86_64'
  CC [M]  /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/nv-p2p-dummy.o
  CC [M]  /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/gdrdrv.o
  Building modules, stage 2.
  MODPOST 2 modules
  CC      /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/gdrdrv.mod.o
  LD [M]  /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/gdrdrv.ko
  CC      /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/nv-p2p-dummy.mod.o
  LD [M]  /ivylogin/home/drossetti/work/gdrcopy/src/gdrdrv/nv-p2p-dummy.ko
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.1.el7.x86_64'
drossetti@geppetto 10:52 (39) gdrdrv>cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

drossetti@geppetto 10:57 (43) gdrdrv>rpm -q binutils
binutils-2.27-41.base.el7.x86_64

drossetti@geppetto 10:57 (44) gdrdrv>rpm -q glibc
glibc-2.17-292.el7.x86_64

@afernandezody
Author

afernandezody commented Oct 7, 2019

Hi @drossetti,
I was just about to post when your input came to my inbox. Earlier, I took a fresh look by trying the installation on a baremetal box rather than on a GCP instance. No problem whatsoever, the compilation went smoothly. This kind of strengthens my suspicion that the root cause might be the hypervisor (or something else Google has installed) causing some type of interference. Thanks.

@afernandezody
Author

afernandezody commented Oct 8, 2019

I went back to the GCP instance and got the following output:

[odyhpc@gpucruncher4020 gdrdrv]$ uname -r
3.10.0-1062.1.2.el7.x86_64
[odyhpc@gpucruncher4020 gdrdrv]$ rpm -q gcc
gcc-4.8.5-39.el7.x86_64
[odyhpc@gpucruncher4020 gdrdrv]$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[odyhpc@gpucruncher4020 gdrdrv]$ make clean all
rm -rf *.o .*.o.d *.ko* *.mod.* .*.cmd Module.symvers modules.order .tmp_versions/ *~ core .depend TAGS .cache.mk
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.67/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[1]: Entering directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1062.1.2.el7.x86_64'
make: *** [module] Error 2
[odyhpc@gpucruncher4020 gdrdrv]$ rpm -q binutils
binutils-2.27-41.base.el7.x86_64
[odyhpc@gpucruncher4020 gdrdrv]$ rpm -q glibc
glibc-2.17-292.el7.x86_64

@xuhao120833

Was this problem solved in the end?

@xuhao120833

Has the problem been solved in the end?

@danielgora

danielgora commented May 31, 2023

The problem is usually that you are building as a non-root user and the kernel makefiles cannot create a temporary directory in /usr/src/kernels/<version>, which they need in order to detect the supported cc flags.

When the kernel tries to determine if a CC flag is available it runs the 'try-run' macro in scripts/Makefile.compiler (or scripts/Kbuild.include depending on the kernel version) which creates a temp directory:

 # output directory for tests below
 TMPOUT = $(if $(KBUILD_EXTMOD),$(firstword $(KBUILD_EXTMOD))/).tmp_$$$$

 # try-run
 # Usage: option = $(call try-run, $(CC)...-o "$$TMP",option-ok,otherwise)
 # Exit code chooses option. "$$TMP" serves as a temporary file and is
 # automatically cleaned up.
 try-run = $(shell set -e;		\
	TMP=$(TMPOUT)/tmp;		\
	trap "rm -rf $(TMPOUT)" EXIT;	\
	mkdir -p $(TMPOUT);		\
	if ($(1)) >/dev/null 2>&1;	\
	then echo "$(2)";		\
	else echo "$(3)";		\
	fi)

 __cc-option = $(call try-run,\
	$(1) -Werror $(2) $(3) -c -x c /dev/null -o "$$TMP",$(3),$(4))

So you can see here that if KBUILD_EXTMOD is not set, then $TMPOUT ends up under /usr/src/kernels/<version>. If you don't have write permission there, mkdir -p $(TMPOUT) fails and the cc-option macro returns nothing.

So the solution is one of the following:
1) Build as root. root has write permission in /usr/src/kernels/<version>.
2) Run make KBUILD_EXTMOD=$(pwd). This way KBUILD_EXTMOD is set and $TMPOUT uses the directory where you are building, where you presumably have write permission.
3) Run sudo chmod o+w /usr/src/kernels/<version>/. This allows other users to write to /usr/src/kernels/<version>/ so that the macros work for non-root users.

Note that if you are building a module with something like make -C /usr/src/kernels/<version> M=$(pwd) modules, then you shouldn't have to do this because the M flag will set KBUILD_EXTMOD.
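
To see the failure mode in isolation, here is a simplified stand-in for try-run written as a plain shell function. It is not the kernel's macro: the name try_run_in is hypothetical, and the early return models the fact that the real macro runs under set -e, so a failed mkdir ends the shell before anything is echoed.

```shell
#!/bin/sh
# Create a scratch directory under $1, run the remaining arguments as a
# command, and print "y" if it succeeds or "n" if it fails.
# If the scratch directory cannot be created, print nothing at all.
try_run_in() {
    base=$1; shift
    tmpout="$base/.tmp_$$"
    mkdir -p "$tmpout" 2>/dev/null || return 0
    if "$@" >/dev/null 2>&1; then
        echo y
    else
        echo n
    fi
    rm -rf "$tmpout"
}

# Report whether the current user could create scratch files in the
# kernel build directory for the running kernel.
kdir="/lib/modules/$(uname -r)/build"
if [ -d "$kdir" ] && [ ! -w "$kdir" ]; then
    echo "$kdir is not writable by $(id -un)"
fi
```

The point of the sketch is the third outcome: with an unwritable base directory the probe prints neither "y" nor "n", and a Makefile that treats an empty result as "flag not supported" would then report a compiler problem that does not exist.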
