
How to use CUDA or OpenCL without AVX2? #317

Closed
ZeroYuni opened this issue Jul 13, 2023 · 28 comments
Labels: enhancement (New feature or request)

Comments

@ZeroYuni

I have a 1660S and an i5-3570K. The program crashes when using CuBLAS; I think it's because the CuBLAS build requires AVX2. If I run without AVX2, the program uses Failsafe mode instead.

How can I use CUDA on an older CPU that only has AVX1?

@Fr0d0Beutl1n

CUDA shouldn't need AVX at all. This sounds a bit like this problem: abetlen/llama-cpp-python#284

@LostRuins
Owner

If you're running on Windows, that is technically true, but there are too many build targets to create binaries for every combination of software and hardware. The default set of targets therefore doesn't include AVX1 + CUDA, so if you need that combination you may have to do a build yourself for now.
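
For reference, a rough sketch of what such a self-build might look like on Windows, based on the CMake route discussed further down this thread. The option names LLAMA_CUBLAS, LLAMA_AVX2 and LLAMA_FMA are assumptions taken from the CMakeLists snippets quoted later; verify them against the file in your checkout:

rem from a Visual Studio developer prompt with the CUDA Toolkit installed
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
rem assumed option names: LLAMA_CUBLAS enables the CUDA build, LLAMA_AVX2/LLAMA_FMA disable AVX2/FMA codegen
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF
cmake --build build --config Release
rem copy the resulting koboldcpp_cublas.dll next to koboldcpp.py / the exe before launching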

@TFWol

TFWol commented Jul 21, 2023

CUDA shouldn't need AVX at all. This sounds a bit like this problem: abetlen/llama-cpp-python#284

I'm glad I saw this issue. I was wondering why the GPU wasn't being used; it wasn't obvious what the cause was. Time to look at getting a build made.

@TFWol

TFWol commented Jul 22, 2023

I tried and can't figure out the commands I need to compile this correctly.

@LostRuins What needs to be configured? I followed the instructions here and was able to create the .exe (which didn't have cublas), but then I saw this section after the fact.

I then followed that, but I'm unable to generate the koboldcpp_cublas.dll that's mentioned. I'm also confused by the wording of that paragraph.

w64devkit output:

~/koboldcpp $ set LLAMA_CUBLAS=1
~/koboldcpp $ make
I llama.cpp build info:
I UNAME_S:  Windows_NT
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I.              -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11   -fPIC -DGGML_USE_K_QUANTS -pthread -s
I CXXFLAGS: -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings
I LDFLAGS:
I CC:       cc (GCC) 13.1.0
I CXX:      g++ (GCC) 13.1.0

g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings  ggml.o ggml_v2.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o -shared -o koboldcpp.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_failsafe.o ggml_v2_failsafe.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_failsafe.o -shared -o koboldcpp_failsafe.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas.o ggml_v2_openblas.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o lib/libopenblas.lib -shared -o koboldcpp_openblas.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas_noavx2.o ggml_v2_openblas_noavx2.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_noavx2.o lib/libopenblas.lib -shared -o koboldcpp_openblas_noavx2.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_clblast.o ggml_v2_clblast.o ggml_v1.o expose.o common.o gpttype_adapter_clblast.o ggml-opencl.o ggml_v2-opencl.o ggml_v2-opencl-legacy.o k_quants.o lib/OpenCL.lib lib/clblast.lib -shared -o koboldcpp_clblast.dll

I'm not all that familiar with building, but I sort of have to if I want CUBLAS with no AVX2 to work.
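
One note on the transcript above: set LLAMA_CUBLAS=1 is cmd.exe syntax, and in w64devkit's POSIX-style shell it doesn't export an environment variable, so make never sees it. Assuming the Makefile reads LLAMA_CUBLAS at all (the owner points out below that the Windows Makefile does not build the CuBLAS DLL regardless), the invocation would have to look more like:

~/koboldcpp $ export LLAMA_CUBLAS=1   # export so make can see the variable
~/koboldcpp $ make
# or pass it directly as a make variable:
~/koboldcpp $ make LLAMA_CUBLAS=1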

Update

I was able to build the .dll using CMake; it wasn't clear that w64devkit isn't used for this.

The program still won't offload to GPU though 🤷‍♂️

Update 2

I'm able to run it with Python, but now I hit the same issue as #290, where the workaround defeats the purpose of everything I'm trying to do ☹️

@LostRuins
Owner

@TFWol if you're on Windows, perhaps you'd have better results using the CMakeLists file instead. That one should be very straightforward: install the CUDA Toolkit and Visual Studio, open the project, and build.
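
Roughly, and assuming a recent Visual Studio (the generator name below is just an example), the same workflow from a developer command prompt would be:

rem generate a Visual Studio solution from the CMakeLists
cmake -B build -G "Visual Studio 17 2022"
rem either open the generated solution in Visual Studio, or build from the command line:
cmake --build build --config Release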

@TFWol

TFWol commented Jul 22, 2023

Not sure if you saw my updates at the bottom, since notifications only show the initial post, but I was able to run it via Python; however, I'm hitting the RAM offload glitch.

@LostRuins
Owner

Can you copy the terminal output that shows when you try to load the model?

@TFWol

TFWol commented Jul 22, 2023

Python

It automatically uses OpenBLAS when --noavx2 is passed and ignores --usecublas:

C:\Users\user\koboldcpp135>python koboldcpp.py --noavx2  --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

This is where it loads into both VRAM and RAM, but it never lets go of the RAM:

C:\Users\user\koboldcpp135>python koboldcpp.py
***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Failed to use new GUI. Reason: No module named 'customtkinter'
Make sure customtkinter is installed!!!
Attempting to use old GUI...
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2145.75 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9750 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001


Exe

Not too important until the Python route is working. It's not finding the CuBLAS library:

C:\Users\user\koboldcpp135>koboldcpp.exe --noavx2  --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
C:\Users\user\koboldcpp135>koboldcpp.exe --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Warning: CuBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "koboldcpp.py", line 1453, in <module>
    main(args)
  File "koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d
[24872] Failed to execute script 'koboldcpp' due to unhandled exception!

@LostRuins
Owner

You can try --nommap with CuBLAS too in order to free RAM. Alternatively, you can also try --useclblast instead (with --nommap if it doesn't work).

The best way to compare would be to see the total RAM usage with X layers offloaded vs with zero layers offloaded. If there is a difference, then you know the offload is working. You may not see a dramatic decrease because the RAM is supposed to be freed piecemeal during the loading process.
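
Concretely, with the Python route that loaded the CuBLAS DLL above, the two runs to compare would look something like this (the flags and model name are just the ones used earlier in this thread):

rem baseline: zero layers offloaded
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 0 --model orca-mini-13b.ggmlv3.q4_1.bin
rem full offload: compare steady-state RAM in Task Manager against the baseline run
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin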

@TFWol

TFWol commented Jul 22, 2023

Forgot to mention I had tried:
--nommap --usecublas --gpulayers 100 - RAM not released

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=[], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7759.49 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2145.75 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9750 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

--nommap --usecublas normal 1 --gpulayers 0 - GPU hardly used as expected

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=[], gpulayers=0)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7759.49 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/43 layers to GPU
llama_model_load_internal: total VRAM used: 480 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

--useclblast 1 0 --gpulayers 100 - Forces AVX2, which I don't have

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[1, 0], usecublas=None, gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1453, in <module>
    main(args)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d

--useclblast 1 0 --gpulayers 100 --nommap - Same issue as above; Forces AVX2, which I don't have

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[1, 0], usecublas=None, gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1453, in <module>
    main(args)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d

@LostRuins
Owner

Did you try --nommap? That seems to be a precondition for freeing RAM.

@TFWol

TFWol commented Jul 23, 2023

Did you try --nommap? That seems to be a precondition for freeing RAM.

?
I had posted the commands I used above, which included --nommap

@TFWol

TFWol commented Aug 8, 2023

Update: Fixed in latest release


The new KoboldCpp v1.39 and v1.39.1 fail to compile koboldcpp_cublas.dll with the error

nvcc fatal   : Unsupported gpu architecture 'compute_37'

I can get it to compile if I revert

koboldcpp/CMakeLists.txt, lines 99 to 103 at b40550c:

if (LLAMA_CUDA_DMMV_F16)
    set(CMAKE_CUDA_ARCHITECTURES "60;61") # needed for f16 CUDA intrinsics
else()
    set(CMAKE_CUDA_ARCHITECTURES "37;52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
endif()

back to 1.38's version

        if (LLAMA_CUDA_DMMV_F16)
            set(CMAKE_CUDA_ARCHITECTURES "61") # needed for f16 CUDA intrinsics
        else()
            set(CMAKE_CUDA_ARCHITECTURES "52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
        endif()

I'm unsure of the hidden repercussions.
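
If reverting on every update gets tedious, one possible alternative (an untested sketch; it assumes nothing later in the file overwrites the value) is to guard the set() so that an architecture list supplied at configure time wins:

        if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
            if (LLAMA_CUDA_DMMV_F16)
                set(CMAKE_CUDA_ARCHITECTURES "60;61") # needed for f16 CUDA intrinsics
            else()
                set(CMAKE_CUDA_ARCHITECTURES "37;52;61") # keep the repo default, including compute_37 for K80
            endif()
        endif()

With that guard, a CUDA 12 toolchain that rejects compute_37 could be handled by configuring with -DCMAKE_CUDA_ARCHITECTURES="52;61" instead of editing the file.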

@LostRuins
Owner

Actually compute_37 is super deprecated and should not be used.
@henk717 Did the Kepler guy actually get it working on the K80? Otherwise, I feel I should just drop it if it's causing issues.

@TFWol

TFWol commented Aug 9, 2023

I can get CLBlast working at least, but I had to modify the Makefile to build it.
I made a gist (mostly for myself).

@TFWol

TFWol commented Aug 9, 2023

Actually compute_37 is super deprecated and should not be used.

Why is that being used anyway? Is it driver or toolkit version related?

BTW, the latest version you released fixes that build error.

@henk717

henk717 commented Aug 9, 2023

To enable support for older GPUs such as the K80.

@TFWol

TFWol commented Aug 9, 2023

I'm still getting the issue where RAM isn't released after offloading to VRAM when using CuBLAS, despite trying every combination of the other settings like mmap, mmq, GPU layers, etc.

Upstream problem?

@LostRuins
Owner

No, I don't think it's an upstream problem. Maybe a misunderstanding.

If you're already using the CMake file to build for CuBLAS (not the Makefile), then it should work, though it ignores whatever you set in the Makefile. Instead, you need to edit the launch settings in CMake.


Once you finish building koboldcpp_cublas.dll with CMake in Visual Studio, you need to copy it back into the koboldcpp directory. Then it will be loaded correctly.

The normal Makefile does not automatically build for CuBLAS on Windows.
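
As a concrete sketch of that last step (the build output folder is a guess; it varies with the CMake generator and configuration):

rem from the koboldcpp source directory, after a Release build in Visual Studio
copy build\Release\koboldcpp_cublas.dll .
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin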

@TFWol

TFWol commented Aug 11, 2023

Right, I'm aware CMake uses CMakeLists.txt and w64devkit uses the Makefile.
I've been building the CuBLAS DLL with these CMakeLists.txt changes (AVX2 OFF, FMA OFF):

option(MAKE_MISC_FILES              "MAKE_MISC_FILES"                                       OFF)

# instruction set specific
option(LLAMA_AVX                    "llama: enable AVX"                                     ON)
option(LLAMA_AVX2                   "llama: enable AVX2"                                    OFF)
option(LLAMA_AVX512                 "llama: enable AVX512"                                  OFF)
option(LLAMA_AVX512_VBMI            "llama: enable AVX512-VBMI"                             OFF)
option(LLAMA_AVX512_VNNI            "llama: enable AVX512-VNNI"                             OFF)
option(LLAMA_FMA                    "llama: enable FMA"                                     OFF)
# in MSVC F16C is implied with AVX2/AVX512
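
The same switches can presumably be supplied at configure time instead of editing the defaults, since option() leaves an already-set cache value alone; a hedged sketch:

rem equivalent to flipping the defaults above, without keeping a locally modified CMakeLists.txt
cmake -B build -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF
cmake --build build --config Release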

@TFWol

TFWol commented Aug 11, 2023

I asked about upstream because oobabooga's text-generation-webui has the same offload problem when it uses llama.cpp with CUDA.

@LostRuins
Owner

Could it be a driver problem then? Do you actually see the GPU being listed when the CuBLAS DLL loads? It should list your GPU name if detected.

@TFWol

TFWol commented Aug 14, 2023

Do you actually see the GPU being listed when the CuBLAS DLL loads?

Yes, if you look at my previous replies above, you can see my GPU is listed in the output.

Could it be a driver problem then?

I wish it were that simple.
It looks like it's an upstream issue: oobabooga/text-generation-webui#3475 (comment).

@AG-w

AG-w commented Oct 1, 2023

I asked about upstream because oobabooga's text-generation-webui has the same offload problem when it uses llama.cpp with CUDA.

I can use CUDA without AVX2 just fine in the oobabooga webui; they simply mixed up the AVX and AVX2 Python wheels (oobabooga/text-generation-webui#3803 (comment)).
I edited requirement_noavx2.txt, updated it, and the webui now works fine with CUDA on a CPU without AVX2.

Anyway, I've seen people say this backend is more lightweight, but it won't work with CUDA if the CPU doesn't have AVX2 (it doesn't matter if you use the --noavx2 and --usecublas flags at the same time; the program will just ignore the cublas flag), so I can only give up and go back for now.

@TFWol

TFWol commented Oct 3, 2023

@AG-w Does your RAM still hold onto the full .GGUF file in oobabooga?
My issue is that I can run the model, but it loads the full thing into RAM, copies it to the GPU, and then still keeps the full thing in RAM.

it won't work with CUDA if the CPU doesn't have AVX2
(it doesn't matter if you use the --noavx2 and --usecublas flags at the same time; the program will just ignore the cublas flag)

Yeah, the only way around that is to compile it for your system. It's mentioned in the README (that second bullet point), but it's very sparse on details. There's more information here (search for Windows, Compiling from Source Code).

It took me forever to get a compiling procedure of some sort working. The RAM offload is still a problem, though. The only way to make it reserve far less RAM is checking the Disable MMAP button.

LostRuins added the enhancement (New feature or request) label on Oct 3, 2023
@AG-w

AG-w commented Oct 5, 2023

@AG-w Does your RAM still hold onto the full .GGUF file in oobabooga? My issue is that I can run the model, but it loads the full thing into RAM, copies it to the GPU, and then still keeps the full thing in RAM.

Yes, I can see VRAM fill up in Task Manager after I changed the Python wheel in oobabooga.
I saw GPU usage remain low and suspected something was wrong, but later I realized that if I don't load all the layers into VRAM, the CPU will bottleneck the GPU.

Yeah, the only way around that is to compile it for your system.

This is just frustrating, and it's the reason I'll keep using the bloated webUI instead of a compiled binary. I just don't want to set up a million environments to fix some error; I dealt with similar things for Krita, and I won't bother compiling anything myself just to fix an error.

With Python I can just edit it in a text editor and the problem is gone.

@TFWol

TFWol commented Oct 17, 2023

Yeah, I have waaaay too many environments as well. Maybe I can leverage AI to do it for me 😆

@LostRuins
Owner

Just updating this old issue: it is now possible to use CLBlast with --noavx2.
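
For anyone landing here later, a launch along these lines should exercise that path (the two --useclblast values select the OpenCL platform and device and will differ per system; the model name is just illustrative):

koboldcpp.exe --noavx2 --useclblast 0 0 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin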
