
Compilation Error Ubuntu 18.04 and CUDA 10.1 #59

Open
Samos2b opened this issue Apr 26, 2019 · 25 comments


@Samos2b

Samos2b commented Apr 26, 2019

Hello everyone,

I'm facing a compilation issue with RapidCFD on Ubuntu 18.04 using CUDA 10.1; maybe someone can help me.
Some files compile fine, others compile with warnings, and some fail to compile at all due to errors involving an "unsigned int" class (see attached file Error1).

Also, the application libraries are not compiling at all due to many "undefined reference" errors (see attached file Error2; sorry for the French error messages).

This looks a bit tricky to me; maybe I should use another version of CUDA? The problem is that NVIDIA's website only lists CUDA 10.0 and CUDA 10.1 for Ubuntu 18.04.

Thanks for your help.

Files:
Error1.txt
Error2.txt

@TonkomoLLC
Contributor

Hi, I started to debug compilation with CUDA 10.1 but stopped. CUDA 10.0 works, as does CUDA 9.1 installed from the Ubuntu 18.04 repository. I've tested CUDA 10.0 with Ubuntu 16.04, and repository-installed CUDA 9.1 with Ubuntu 18.04.

Hope this helps

@Samos2b
Author

Samos2b commented Apr 26, 2019

Thanks, I will try other versions then. I was expecting CUDA 10.1 to work since RapidCFD works with CUDA 10.

I'll let you know.

@TonkomoLLC
Contributor

TonkomoLLC commented Apr 26, 2019

By the way, for you or anyone else reading this thread in the future while debugging CUDA 10.1 and RapidCFD: the error you point out in Error1.txt can be fixed as follows.

file: $WM_PROJECT_DIR/src/OpenFOAM/containers/Lists/packedList/packedListI.H

line 383:

Old:

list_->resize(index_ + 1);

Change to:

(*list_).resize(index_ + 1);
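
For anyone curious why this rewrite works: the two spellings are equivalent by definition in standard C++ (`p->f()` means `(*p).f()`), so the change only sidesteps an nvcc front-end quirk without altering behavior. A minimal sketch of the pattern, using `std::vector` as a stand-in for the OpenFOAM list type (the real code operates on a different container):

```cpp
#include <vector>

// Stand-in for the pattern at packedListI.H line 383: list_ is a pointer
// to a list-like container, index_ is the element index being accessed.
inline void growList(std::vector<int>* list_, int index_)
{
    // Equivalent to list_->resize(index_ + 1); the explicit dereference
    // is the spelling that the CUDA 10.1 front end accepted.
    (*list_).resize(index_ + 1);
}
```

Both spellings compile with a conforming host compiler; only nvcc 10.1 tripped over the arrow form in that template context.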

Unfortunately, I did not fix the other errors with the CUDA 10.1 compilation.

@Samos2b
Author

Samos2b commented Apr 26, 2019

Thanks for the info, I'm gonna correct this file.

@Samos2b
Author

Samos2b commented Apr 26, 2019

Here's an update:
The change you suggested fixed Error1 but did not fix the compilation problem on CUDA 10.1, as expected.

So I uninstalled CUDA 10.1 and installed CUDA 10.0 from the local runfile provided by the NVIDIA website.
I tested the installation with the provided samples and everything works fine, so the CUDA 10.0 installation was successful.

However, I'm still facing Error2 while compiling with CUDA 10.0. It looks like the compiler cannot find references for some libraries. I checked the nvcc version; it's 10.0.

Could that error come from the NVIDIA driver? I'm running version 430.
Also, I'm running gcc 7.3.0.

Thanks for your time.

@TonkomoLLC
Contributor

Hello,

Thanks for the update. Please make sure that you completely clean the RapidCFD installation before you compile.

e.g.,

cd src
wclean all
cd ../applications
wclean all
cd ..
./Allwmake

While you're at it, see #39 and #54 to fix the thrust device error at the end of executing a solver:

In case anyone wants to look this up in the future...
The above-referenced "device free" error that occurs with CUDA 9 comes from thrust/system/cuda/detail/malloc_and_free.h.

For example, on Ubuntu 18.04 with CUDA-9 installed from the official repository, this file can be found at:
/usr/include/thrust/system/cuda/detail/malloc_and_free.h

The error in question is new to CUDA 9, and I do not know the history of why this error message is not in the CUDA 8 thrust source code. A quick-and-dirty workaround is to just comment out the error: for example, line 87 of the referenced malloc_and_free.h for CUDA 9.1 installed from the Ubuntu 18.04 repository.

// cuda_cub::throw_on_error(status, "device free failed"); 
After recompiling, RapidCFD+CUDA-9 will exit without throwing an error.

To be clear, you'll want to edit malloc_and_free.h before you type ./Allwmake to recompile RapidCFD.
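
For context on what the commented-out call does: thrust's throw_on_error turns a non-zero CUDA status code into an exception, so silencing it means a failing "device free" status at shutdown is ignored rather than aborting the run. A rough sketch of the pattern (simplified names, not thrust's actual implementation):

```cpp
#include <stdexcept>
#include <string>

// Simplified stand-in for a CUDA status code (0 == success).
using status_t = int;

// Sketch of the throw_on_error pattern used in malloc_and_free.h:
// a non-zero status becomes an exception with a descriptive message.
inline void throw_on_error(status_t status, const std::string& msg)
{
    if (status != 0)
    {
        throw std::runtime_error(msg + " (status " + std::to_string(status) + ")");
    }
}
```

Commenting out the call site, as in the workaround above, trades a hard failure at exit for silently ignoring the status.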

I personally found that I had to use the correct sm_# (compute capability) for my particular GPU in $WM_PROJECT_DIR/wmake/rules ... /c and /c++

hope all this helps!

Good luck with the recompile.

@Samos2b
Author

Samos2b commented Apr 26, 2019

I will try it and let you know,
Thanks for the quick reply!

@TonkomoLLC
Contributor

Oh and one more comment... you'll want to edit the wmake rules, too, before you recompile.

Hope this works out for you.

@Samos2b
Author

Samos2b commented Apr 26, 2019

I already edited the wmake rules before I compiled for the first time to change the arch from sm_35 to sm_61 to better fit my GPU.

In my installation, malloc_and_free.h was located in /usr/local/cuda-10.0/include/thrust/system/cuda/detail/. I found the line and commented it out as you suggested.

I'm gonna clean the RCFD installation, compile it again, and let you know.

Thanks again.

@TonkomoLLC
Contributor

I can't wait to hear what happens! Hope you have a good result.

@Samos2b
Author

Samos2b commented Apr 26, 2019

After recompiling, nothing seems to have changed.

I still have errors like the following while compiling the application solvers:

//usr/lib/libmeshTools.so: undefined reference to "Foam::UPstream::allToAll(Foam::UList<int> const&, Foam::UList<int>&, int)"

or

/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib/libmeshTools.so: undefined reference to "Foam::polyMesh::findCell(Foam::Vector<double> const&, Foam::polyMesh::cellDecomposition) const"

I forgot to mention that I did not restart my computer after swapping from CUDA 10.1 to CUDA 10.0. Could this have an effect?

I'm sorry to be having so many difficulties compiling with a configuration that normally works.

@TonkomoLLC
Contributor

TonkomoLLC commented Apr 26, 2019

A reboot is a good idea. Maybe this will fix the problem, maybe not. But I really recommend you reboot before continuing to troubleshoot.

As I understand your error message, libmeshTools.so didn't compile, which is why this library could not be found. To track this down you can try:

cd $WM_PROJECT_DIR/src/meshTools
wmake

Since you don't have the libmeshTools.so library, this build will probably fail. You'll want to note why the compilation of libmeshTools.so failed.

However, if the compilation of libmeshTools.so still fails after the reboot, you'll probably want to wclean src and applications as per the note above and retry the compile (just in case there were problems left over from not rebooting the system before your last compile). Then, with a freshly cleaned RapidCFD $WM_PROJECT_DIR, if the compilation problem with libmeshTools.so still persists, please report the compilation error for that particular library here.

Other small things that may solve the compilation errors if they persist:

  1. Please make sure that your paths are properly set up. For my CUDA 10 setup, I have the following added to my $WM_PROJECT_DIR/etc/bashrc file:
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
  2. I think I struggled getting RapidCFD to compile with CUDA 10, and I tried several changes at once. Attached are my c and c++ files for $WM_PROJECT_DIR/wmake/rules/linux64Nvcc. Note that I added the -g and -G compile flags; I have no idea whether one or both of these helped. I also used the -gencode flag. I refer specifically to:
CC          = nvcc -Xptxas -dlcm=cg -m64 -gencode=arch=compute_52,code=sm_52 -arch=sm_52 -g -G

I never looked back after I got RapidCFD to compile with CUDA 10 to figure out which of these additions to the rules files c and c++ may have helped.

Be aware that if you change c and c++ you need to completely clean RapidCFD and start the compile all over again.

As I mentioned, I use CUDA 10 with Ubuntu 16.04. On my Ubuntu 18.04 system, I installed CUDA from the Ubuntu repository (CUDA 9.1). That is, I think I basically used sudo apt install nvidia-cuda-dev but I didn't keep notes, so I am not 100% certain. I am using the nvidia 418 driver. Note that I had to modify the malloc_and_free.h file as we discussed ... and otherwise everything compiled, no problems. Maybe this can be your backup plan (CUDA 9.1 from repository).

Good luck!

c.txt
c++.txt

@Samos2b
Author

Samos2b commented Apr 27, 2019

Hello Eric,

Here's an update on what I did this morning.

After rebooting and verifying all the PATH settings and the rules files, I completely cleaned up the RapidCFD install and tried to recompile, but I got the same error using CUDA 10.0. Then I tried to compile libmeshTools, and I got the following error:

1 error detected in the compilation of "/tmp/tmpxft_000067eb_00000000-6_cellClassification.cpp1.ii".
cellClassification/cellClassification.dep:433: recipe for target 'Make/linux64NvccDPOpt/cellClassification.o' failed
make: *** [Make/linux64NvccDPOpt/cellClassification.o] Error 1

After that, I downgraded my CUDA version to 9.1 (cleaned the 10.0 install, rebooted, installed 9.1 from the repository, and rebooted again), cleaned up the RapidCFD install, and compiled again. I got the exact same error I got with CUDA 10.0 and CUDA 10.1.
I tried to recompile libmeshTools and got the same error as before, so it's definitely a problem with the compilation of libmeshTools.

Also, I got a lot of warnings during the compilation, like:

/opt/RapidCFD-dev/src/OpenFOAM/lnInclude/boundBoxI.H:207:13: warning: ‘bool Foam::operator!=(const Foam::boundBox&, const Foam::boundBox&)’ has not been declared within Foam
 inline bool Foam::operator!=(const boundBox& a, const boundBox& b)
             ^~~~
/opt/RapidCFD-dev/src/OpenFOAM/lnInclude/boundBox.H:237:20: note: only here as a friend
         inline friend bool operator!=(const boundBox&, const boundBox&);

Is that something I have to be worried about?

The only thing I didn't try is compiling with driver 418 on CUDA 10.0 and 9.1, but I don't really think it's going to change anything, since I used driver 418 with CUDA 10.1 and got the exact same error.

Is there any possibility that my CUDA install is not completely cleaned up, and some files from CUDA 10.1 or CUDA 10.0 are still there and conflict with the other CUDA installation?

Thanks for your help Eric.

@TonkomoLLC
Contributor

Hello,

I apologize but I cannot reproduce your error.

Here is some information about my system

4.18.0-18-generic #19~18.04.1-Ubuntu SMP Fri Apr 5 10:22:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   41C    P8    18W / 250W |    511MiB /  6075MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

All of my NVidia and CUDA software is from the Ubuntu or system76 (my vendor for my computer) repository - nothing "customized" or downloaded from NVidia directly.

I started with a clean version of repository commit f3775ac, although, so as not to mess with my working version of RapidCFD, I edited $WM_PROJECT_DIR/etc/bashrc and changed the version from dev to dev7 (the 7th installation I have on this particular machine, for various debugging purposes like this situation).

Since my GPU is compute capability 3.5, I did not change c or c++ rules files under $WM_PROJECT_DIR/wmake ... etc.

With ./Allwmake I can compile RapidCFD, without problems.

So to your specific questions:

  • Yes, I agree, driver 418 is probably not the issue

  • You can ignore the warnings

  • I do not know the reason for your error:

1 error detected in the compilation of "/tmp/tmpxft_000067eb_00000000-6_cellClassification.cpp1.ii".
cellClassification/cellClassification.dep:433: recipe for target 'Make/linux64NvccDPOpt/cellClassification.o' failed
make: *** [Make/linux64NvccDPOpt/cellClassification.o] Error 1

It sounds like you are using a Pascal type GPU, given your note above about SM_61. I do not know if there are complications from using SM_61 that will affect your compilation of RapidCFD.

I have an Ubuntu 16.04 system that has both a K20 and a Ti1050GX card. To get RapidCFD to compile and run properly on both GPU's installed in this system, I edited the c++ rules in the wmake directory as follows:

CC          = nvcc -Xptxas -dlcm=cg -m64 -gencode=arch=compute_30,code=sm_30 \
              -gencode=arch=compute_35,code=sm_35 \
              -gencode=arch=compute_61,code=sm_61 \
              -gencode=arch=compute_35,code=compute_35

I am really stretching here... I don't quite have the same configuration as you, but I have compiled RapidCFD on Ubuntu 18.04 with SM_35 + CUDA 9.1, and on Ubuntu 16.04 with SM_35 + SM_61 and CUDA 8.0.

The final parting advice I have for you at this moment: always be sure that you reboot after making driver changes, and fully clean out your RapidCFD install and recompile. If in doubt, you can completely delete RapidCFD and reinstall fresh with git clone from the RapidCFD repository. I believe you are doing these steps now, but I repeat them here because it is usually vital to freshly clean RapidCFD and recompile from scratch when there are changes to wmake rules or other major system configuration (e.g., changing CUDA versions).

I really wish you well... there must be some way to get this compiled for your system....

@Samos2b
Author

Samos2b commented Apr 27, 2019

Dear Eric,

I tried to restart from a clean sheet. I cleaned up my Ubuntu 18.04 installation, downloaded a fresh copy of RapidCFD, and installed CUDA 9.1 from the repository:

sudo apt-get install nvidia-cuda-dev nvidia-cuda-toolkit

Also, to match your configuration, I installed driver 418 and set arch=sm_35 in the c and c++ files (modified as you suggested) instead of arch=sm_61, which is the compute capability of my GTX 1070.

I rebooted twice after installing CUDA and the driver.
I modified malloc_and_free.h as you suggested to prevent the runtime error from CUDA 9.1.

Since I did a fresh install of Ubuntu 18.04, I also had to install flex and zlib1g-dev for the compilation to run properly.

However, the compilation failed again and the error is still there.
After some research on the Internet, I found this may be related to the gcc version:

https://stackoverflow.com/questions/17251316/nvcc-unable-to-compile

https://stackoverflow.com/questions/17233656/when-i-compile-my-cuda-code-it-said1-error-detected-in-the-compilation-of-tmp

According to the CUDA documentation, I should have a specific version of gcc and a specific Ubuntu kernel.
I found that for CUDA 10.0 and above I should have kernel 4.15.0 and gcc 7.3.0; I found nothing concerning Ubuntu 18.04 for CUDA 9.1. Do you think this is probably the issue?

I have Kernel version 4.18 and gcc 7.3.0 on my installation.

I'm sorry to keep running into such difficulties.

@TonkomoLLC
Contributor

Hello,

For my Ubuntu 18.04 setup, I am also using kernel 4.18 and gcc 7.3.0.

gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I am unsure what differences remain between your Ubuntu 18.04 and my Ubuntu 18.04 system. Same kernel; same CUDA version; same gcc. Same RapidCFD distribution.

Ok, now I am grasping for ideas.

  1. Is there any line number or file (other than the .dep file) associated with the error? Or is this literally the extent of your information?
1 error detected in the compilation of "/tmp/tmpxft_000067eb_00000000-6_cellClassification.cpp1.ii".
cellClassification/cellClassification.dep:433: recipe for target 'Make/linux64NvccDPOpt/cellClassification.o' failed
make: *** [Make/linux64NvccDPOpt/cellClassification.o] Error 1
  2. If you look at $WM_PROJECT_DIR/src/meshTools/cellClassification/cellClassification.dep, especially at line 433, is there anything unusual?

On my machine, line 433 of cellClassification.dep says:

        @SOURCE_DIR=cellClassification

I am really starting to run out of ideas for you to try... and just to be absolutely sure: you're compiling RapidCFD as root in /opt, and you are starting with a fresh clone of the repository?

Sorry for such basic questions... I cannot understand why your copy of RapidCFD does not compile...

@Samos2b
Author

Samos2b commented Apr 27, 2019

So it's definitely not the kernel or gcc version.
I'm trying to figure out which step I'm doing wrong, because there is no reason it works on your system and not on mine, since the configurations are the same.

I checked cellClassification.dep at line 433 and I see the same as you mentioned.

Usually I follow this flowchart when I want to compile RapidCFD:
1 - Download the zip file from GitHub, unzip it, and copy it as root to /opt (I'm not using git clone, but I should).
2 - Modify the c and c++ rules to match my arch.
3 - Modify the foamInstall variable in etc/bashrc to match the folder location. Usually, and I don't know why, I have to remove $WM_PROJECT from the foamInstall=/opt/$WM_PROJECT variable, because otherwise it does not point to the right location when sourcing the file (I basically get "files not found" if I don't remove $WM_PROJECT).
4 - Then run ./Allwmake

However, since I started from a clean sheet, I tried to compile from ~/RapidCFD-dev, but it didn't make any difference, as I explained.

For now I'm pretty sure the problem is related to my installation flowchart, since I have the same configuration as yours.

I'm gonna review that tomorrow; I'm getting tired right now.
Thanks again for your help, Eric, I really appreciate it.

@TonkomoLLC
Contributor

While I am not sure why this is a problem, the main item that appears most different between our setups is your flowchart item number 3:

3 - Modify the foamInstall variable in etc/bashrc to match the folder location. Usually, and I don't know why, I have to remove $WM_PROJECT from the foamInstall=/opt/$WM_PROJECT variable, because otherwise it does not point to the right location when sourcing the file (I basically get "files not found" if I don't remove $WM_PROJECT).

I point this out since the compilation error showing "recipe for target" might mean something is not found? I am not sure.

So... yes... I agree that the problem is most likely related to your installation flowchart, and within this flowchart I am most suspicious of your step 3, the editing of the foamInstall environment variable. That said, I do not see anything wrong with what you did... nor am I sure why changing foamInstall is causing a problem...

On my machine:
echo $FOAM_INST_DIR yields /opt

echo $WM_PROJECT yields RapidCFD

echo $WM_PROJECT_DIR yields /opt/RapidCFD-dev

Hopefully you are on a path to find the root of the compilation issue. From this dialogue I hope you have eliminated a lot of possible explanations, narrowing down the list of possibilities...

Cheers

@Samos2b
Author

Samos2b commented Apr 28, 2019

Dear Eric,

Looking at the compilation logs, I found warnings like the following when the program tries to compile files related to surfZone, MeshedSurfaceAllocator, surfaceFormat, surfMesh (and others):

lnInclude/MeshedSurface.C:541:1: warning: the compiler can assume that the address of ‘zoneLst’ will always evaluate to ‘true’ [-Waddress]
lnInclude/MeshedSurface.C: In instantiation of ‘void Foam::MeshedSurface<Face>::remapFaces(const labelUList&) [with Face = Foam::face; Foam::labelUList = Foam::UList<int>]’:
/tmp/tmpxft_00001c5d_00000000-5_surfMesh.cudafe1.stub.c:130:27:   required from here
lnInclude/MeshedSurface.C:436:16: warning: the compiler can assume that the address of ‘faceMap’ will always evaluate to ‘true’ [-Waddress]
     if (&faceMap && faceMap.size())

The warning contains /tmp/tmpxft_00001c5d_00000000-5_surfMesh.cudafe1.stub.c:130:27, which looks like the same pattern as the error that has been driving us crazy since yesterday. I have this kind of warning on every file related to this part of the compilation.

It looks like the warnings always point at an if(); is this maybe something we should look at?

Regards.

@TonkomoLLC
Contributor

Hello,

I have these same warnings, and they do not seem to impact operation of the solvers.
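
For what it's worth, the -Waddress warning is benign but technically accurate: a C++ reference must be bound to an object, so the address of a reference parameter can never be null, and a guard like `if (&faceMap && faceMap.size())` reduces to `if (faceMap.size())`. A small illustration of the pattern (hypothetical function, not the OpenFOAM code):

```cpp
#include <vector>

// Mirrors the guard in MeshedSurface.C: testing the address of a
// reference before using it. The compiler warns because &faceMap is
// guaranteed non-null, making the first operand always true.
inline bool hasEntries(const std::vector<int>& faceMap)
{
    return &faceMap && faceMap.size();  // equivalent to faceMap.size() != 0
}
```

So the warning flags dead code rather than a real defect, which is consistent with the solvers still working.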

I am not certain how the nvcc compiler works, but I think there are a bunch of temporary files placed in /tmp (for example, watch the files in /tmp while you compile: all kinds of tmp files come and go during the compilation process).

I am never in a position to state that a complex code like this is bug free, but these warnings are not preventing the compilation of working solvers and associated libraries on my Ubuntu 18.04 system.

I apologize, but I do not have a firm reason for your compilation problem.

@TonkomoLLC
Contributor

Hello,

I am still fixated on your need to change foamInstall in bashrc.

Is this cfd-online post or this other cfd-online post related to the errors you see when sourcing the $WM_PROJECT_DIR/etc/bashrc file?

The edits you made to bashrc are the most significant difference I can identify between my setup and your setup...

Good luck!

@TonkomoLLC
Contributor

TonkomoLLC commented May 15, 2019

I managed to compile RapidCFD with CUDA 10.1 using the following procedure.

  1. Start with a "fresh" copy of RapidCFD-dev.

  2. Optional: Start with a "fresh" copy of ThirdParty-dev.

  3. Install CUDA toolkit 10.1 update 1 following the instructions at NVIDIA's website.

  4. Replace two files (in the attached zip), to prevent system errors at the conclusion of a calculation:

/usr/local/cuda/include/thrust/system/cuda/memory_resource.h
/usr/local/cuda/include/thrust/system/cuda/detail/malloc_and_free.h

  5. Optional: If you do not have paths set up in your .bashrc file, edit /opt/RapidCFD-dev/etc/bashrc to include the source for the CUDA binaries and libraries. See the end of the bashrc file in the attached zip as an example of how to update the environment variables.

  6. Optional: If needed, change the wmake rules for c and c++ (e.g., to change the compute capability for your card; by default sm_30 is selected). These files are located in $WM_PROJECT_DIR/wmake/rules/linux64Nvcc.

  7. Optional: If you require ThirdParty-dev, note that Allwmake needs a slight edit for the script to run. This modified script is also included in the attached zip, and can replace Allwmake in ThirdParty-dev.

  8. Replace /opt/RapidCFD-dev/src/OpenFOAM/containers/Lists/PackedList/PackedListI.H with the file included in the attached zip file.

  9. source /opt/RapidCFD-dev/etc/bashrc and, from $WM_PROJECT_DIR, run ./Allwmake. At the conclusion you should have a working copy of RapidCFD for CUDA 10.1.

CUDA-10.1-RapidCFD-files.zip

@Samos2b
Author

Samos2b commented Jul 8, 2019

Dear Eric,

I'm so sorry for responding so late, but I didn't have time this past month to focus on RCFD.

However, I finally succeeded in compiling RCFD a couple of days ago following your instructions, and it's working pretty well on Ubuntu 18.04 and CUDA 10.1.

I'm now sure the problem came from a conflict in my CUDA installation. I think I did not purge all the files between two installations of different CUDA versions.

Thank you very much for your time and for managing this problem.

Regards.

@ikaruc

ikaruc commented Apr 17, 2021

Hello,

First of all, sorry for my bad English. :(

For step 1, do I have to install this: https://github.com/Atizar/RapidCFD-dev?

I am using Ubuntu 18.04 and I want to improve the speed of my simulations in OpenFOAM.

Thanks in advance for your help

@TonkomoLLC
Contributor

Hello,

Yes, you will need to compile RapidCFD for your computer. The installation instructions on the home page of this repository are generally correct, except you also need to set the -arch=sm_## flag to the correct value for your GPU in the wmake rules file.

I hope the installation goes well.

Best regards,

Eric
