How to use CUDA or OpenCL without AVX2? #317
Comments
CUDA shouldn't need AVX at all. This sounds a bit like this problem: abetlen/llama-cpp-python#284 |
If you're running on Windows, that is technically true, but there are too many build targets to create binaries for every combination of software and hardware. Thus the default set of targets doesn't include AVX1+CUDA, so if you need that, you may have to do a build yourself for now. |
I'm glad I saw this issue. I was wondering why the GPU wasn't being used; it wasn't very obvious what the cause was. Time to look at getting a build made. |
I tried and can't figure out the commands I need to compile this correctly. @LostRuins, what needs to be configured? I followed the instructions here and was able to create the .exe (which didn't have CuBLAS), but then I saw this section after the fact. I then followed that, but the w64devkit build doesn't produce the output I need:

~/koboldcpp $ set LLAMA_CUBLAS=1
~/koboldcpp $ make
I llama.cpp build info:
I UNAME_S: Windows_NT
I UNAME_P: unknown
I UNAME_M: x86_64
I CFLAGS: -I. -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11 -fPIC -DGGML_USE_K_QUANTS -pthread -s
I CXXFLAGS: -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings
I LDFLAGS:
I CC: cc (GCC) 13.1.0
I CXX: g++ (GCC) 13.1.0
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml.o ggml_v2.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o -shared -o koboldcpp.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_failsafe.o ggml_v2_failsafe.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_failsafe.o -shared -o koboldcpp_failsafe.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas.o ggml_v2_openblas.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o lib/libopenblas.lib -shared -o koboldcpp_openblas.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas_noavx2.o ggml_v2_openblas_noavx2.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_noavx2.o lib/libopenblas.lib -shared -o koboldcpp_openblas_noavx2.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_clblast.o ggml_v2_clblast.o ggml_v1.o expose.o common.o gpttype_adapter_clblast.o ggml-opencl.o ggml_v2-opencl.o ggml_v2-opencl-legacy.o k_quants.o lib/OpenCL.lib lib/clblast.lib -shared -o koboldcpp_clblast.dll

I'm not all that familiar with building, but I sort of have to if I want CuBLAS with no AVX2 to work.

Update: I was able to build the .dll by using cmake; it wasn't clear that w64devkit is not used for this. The program still won't offload to GPU though 🤷♂️

Update 2: I'm able to run with Python, but now I hit the same issue as #290, where the workaround defeats the purpose of everything I'm trying to do. |
@TFWol if you're on Windows, perhaps you'd have better results using the CMakeLists file instead. That one should be very straightforward: install the CUDA toolkit and Visual Studio, open the project, and build. |
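For reference, a minimal command-line sketch of that CMake route (an illustration, not the official procedure; it assumes the CMakeLists.txt exposes the usual upstream llama.cpp options LLAMA_CUBLAS and LLAMA_AVX2, so check the option() lines in the file for the real names):

:: run from a Visual Studio developer prompt with the CUDA toolkit installed
cd koboldcpp
mkdir build
cd build
:: ask for the CuBLAS library and turn off AVX2 code paths (option names assumed from upstream llama.cpp)
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=OFF
cmake --build . --config Release

Opening the generated solution in Visual Studio and building the Release configuration there amounts to the same thing.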
Not sure if you saw my updates at the bottom, since notifications only show the initial post, but I was able to run it via Python; I'm hitting the RAM offload glitch. |
Can you copy the terminal output that shows when you try to load the model? |
Python
Automatically uses OpenBLAS if --noavx2 is set:

C:\Users\user\koboldcpp135>python koboldcpp.py --noavx2 --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

This is where it will load into VRAM and RAM, but never lets go of RAM:

C:\Users\user\koboldcpp135>python koboldcpp.py
***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Failed to use new GUI. Reason: No module named 'customtkinter'
Make sure customtkinter is installed!!!
Attempting to use old GUI...
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2145.75 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9750 MB
llama_new_context_with_model: kv self size = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

Exe
Not too important until Python is working. Not seeing the library:

C:\Users\user\koboldcpp135>koboldcpp.exe --noavx2 --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

C:\Users\user\koboldcpp135>koboldcpp.exe --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Warning: CuBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
Traceback (most recent call last):
File "koboldcpp.py", line 1453, in <module>
main(args)
File "koboldcpp.py", line 1378, in main
loadok = load_model(modelname)
File "koboldcpp.py", line 212, in load_model
ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d
[24872] Failed to execute script 'koboldcpp' due to unhandled exception! |
You can try … The best way to compare would be to see the total RAM usage with X layers offloaded vs. with zero layers offloaded. If there is a difference, then you know the offload is working. You may not see a dramatic decrease, because the RAM is supposed to be freed piecemeal during the loading process. |
Did you try --nommap? That seems to be a precondition for freeing RAM. |
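A hypothetical pair of runs for that comparison (flag names are taken from the logs above; the layer count and model file are placeholders):

:: baseline, nothing offloaded
python koboldcpp.py --usecublas normal 1 --gpulayers 0 --model orca-mini-13b.ggmlv3.q4_1.bin
:: offloaded run, with mmap disabled so freed weights actually leave RAM
python koboldcpp.py --usecublas normal 1 --gpulayers 43 --nommap --model orca-mini-13b.ggmlv3.q4_1.bin

If the python process settles noticeably lower in Task Manager on the second run, the offload is working.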
? |
Update: Fixed in latest release

The new Koboldcpp v1.39 and v1.39.1 fail to compile the koboldcpp_cublas.dll with the error:
nvcc fatal : Unsupported gpu architecture 'compute_37'

I can get it to compile if I revert lines 99 to 103 (in b40550c) back to 1.38's version:

if (LLAMA_CUDA_DMMV_F16)
    set(CMAKE_CUDA_ARCHITECTURES "61") # needed for f16 CUDA intrinsics
else()
    set(CMAKE_CUDA_ARCHITECTURES "52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
endif()

I'm unsure of the hidden repercussions. |
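As a quick sanity check on whether this is a toolkit-version issue, newer nvcc builds can list the architectures they still accept (a sketch; the --list-gpu-arch option is assumed to be available, it only exists in more recent CUDA toolkits):

nvcc --version
nvcc --list-gpu-arch

If compute_37 is missing from that list (CUDA 12 toolkits dropped Kepler), any CMakeLists that asks for it will fail with exactly this error.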
Actually compute_37 is super deprecated and should not be used. |
I can get clblast working at least, but I had to modify the Makefile to build it. |
Why is that being used anyways? Is it driver or toolkit version related? BTW, the latest version you released fixes that build error. |
To enable support for older GPUs such as the K80. |
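For anyone deciding whether they actually need compute_37, the card's own compute capability can be checked (the compute_cap query field is an assumption; it is only available on fairly recent drivers):

nvidia-smi -L
nvidia-smi --query-gpu=name,compute_cap --format=csv

A K80 reports 3.7, which is why the extra architecture was added, while the RTX 4090 in the logs above reports 8.9 and doesn't need it.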
I'm still getting that issue where RAM isn't freed when offloading to VRAM with CuBLAS, though, despite every combination of the other settings like mmap, mmq, GPU layers, etc. Upstream problem? |
No, I don't think it's an upstream problem. Maybe a misunderstanding. If you're already using the CMake file to build for CuBLAS (not the Makefile), then it should work, though it ignores whatever you set in the Makefile. Instead, you need to edit the launch settings in CMake. Once you finish building koboldcpp_cublas.dll from CMake in Visual Studio, you need to copy it back into the koboldcpp directory. Then it will be able to be loaded correctly. The normal Makefile does not automatically build for CuBLAS on Windows. |
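In practice that last step looks something like this (the Release output path is an assumption and depends on the Visual Studio configuration; adjust it to wherever the DLL was actually written):

copy build\bin\Release\koboldcpp_cublas.dll .
python koboldcpp.py --usecublas normal --gpulayers 43 --model orca-mini-13b.ggmlv3.q4_1.bin

The "Initializing dynamic library: koboldcpp_cublas.dll" line in the startup output confirms the freshly built DLL is the one being picked up.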
Right, I'm aware CMake uses CMakeLists.txt and w64devkit uses the Makefile. |
I asked about upstream since oobabooga's text-gen has the same offload problem when it uses llamacpp with CUDA |
Could it be a driver problem then? Do you actually see the GPU being listed when the CuBLAS DLL loads? It should list your GPU name if detected. |
Yes, if you look at my previous replies above, you can see my GPU is listed in the output.
I wish it were that simple. |
I can use CUDA with no AVX2 just fine in the oobabooga webui; they had simply mixed up the AVX and AVX2 Python wheels. Anyway, I've seen people say this backend is more lightweight, but it won't work with CUDA if the CPU doesn't have AVX2. |
@AG-w Does your RAM still hold onto the full .GGUF file in oobabooga?
Yeah, the only way around that is to compile it for your system. It's mentioned in the Readme (that second bullet point), but it's very sparse on details. There's more information here (search for …).

Took me forever to get a compiling procedure of some sort working. The RAM offload is still a problem though; the only way to make it reserve way less RAM is checking the Disable MMAP button. |
Yes, I can see VRAM fill up in Task Manager after I changed the Python wheel in oobabooga.
This is just frustrating, and it's the reason why I'll still use the bloated webUI instead of a compiled binary. I just don't want to set up millions of environments to fix some error. I dealt with similar things for Krita, and I just don't bother compiling anything myself to fix an error; with Python I can just edit it in a text editor and the problem is gone. |
Yeah, I have waaaay too many environments as well. Maybe I can leverage AI to do it for me 😆 |
Just updating this old issue: it is now possible to use CLBlast with …
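Presumably that means combining CLBlast with the no-AVX2 path; an illustrative invocation (the platform/device indices and layer count are placeholders) would be something like:

python koboldcpp.py --useclblast 0 0 --noavx2 --gpulayers 40 --model orca-mini-13b.ggmlv3.q4_1.bin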
I have a 1660S and an i5-3570K. The program crashed with CuBLAS; I think it's because the CuBLAS build requires AVX2. If I run without AVX2, then the program uses failsafe mode.
How do I use CUDA on an old CPU with only AVX1?