
How to use CUDA or OpenCL without AVX2? #317

Closed
ZeroYuni opened this issue Jul 13, 2023 · 28 comments
Labels: enhancement (New feature or request)

Comments

@ZeroYuni

I have a 1660S and an i5-3570K. The program crashes when using CuBLAS; I think it's because the CuBLAS build requires AVX2. If I run without AVX2, the program uses Failsafe mode instead.

How can I use CUDA on an older CPU that only has AVX1?

@Fr0d0Beutl1n

CUDA shouldn't need AVX at all. This sounds a bit like this problem: abetlen/llama-cpp-python#284

@LostRuins
Owner

If you're running on Windows, that is technically true, but there are too many build targets to create binaries for every combination of software and hardware. The default set of targets therefore doesn't include AVX1 + CUDA, so if you need that combination you may have to do a build yourself for now.
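
For reference, a rough sketch of what such a self-build might look like on Windows, based on the CMake route discussed further down this thread. The option names LLAMA_CUBLAS, LLAMA_AVX2 and LLAMA_FMA are assumptions taken from the CMakeLists snippets quoted later; verify them against the file in your checkout:

rem from a Visual Studio developer prompt with the CUDA Toolkit installed
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
rem assumed option names: LLAMA_CUBLAS enables the CUDA build, LLAMA_AVX2/LLAMA_FMA disable AVX2/FMA codegen
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF
cmake --build build --config Release
rem copy the resulting koboldcpp_cublas.dll next to koboldcpp.py / the exe before launching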

@TFWol

TFWol commented Jul 21, 2023

CUDA shouldn't need AVX at all. This sounds a bit like this problem: abetlen/llama-cpp-python#284

I'm glad I saw this issue. I was wondering why the GPU wasn't being used; it wasn't obvious what the cause was. Time to look at getting a build made.

@TFWol

TFWol commented Jul 22, 2023

I tried and can't figure out the commands I need to compile this correctly.

@LostRuins What needs to be configured? I followed the instructions here and was able to create the .exe (which didn't have cublas), but then I saw this section after the fact.

I then followed that, but I'm unable to generate the koboldcpp_cublas.dll that's mentioned. I'm also confused by the wording of that paragraph.

w64devkit output:

~/koboldcpp $ set LLAMA_CUBLAS=1
~/koboldcpp $ make
I llama.cpp build info:
I UNAME_S:  Windows_NT
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I.              -I./include -I./include/CL -I./otherarch -I./otherarch/tools -Ofast -DNDEBUG -std=c11   -fPIC -DGGML_USE_K_QUANTS -pthread -s
I CXXFLAGS: -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings
I LDFLAGS:
I CC:       cc (GCC) 13.1.0
I CXX:      g++ (GCC) 13.1.0

g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings  ggml.o ggml_v2.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o -shared -o koboldcpp.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_failsafe.o ggml_v2_failsafe.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_failsafe.o -shared -o koboldcpp_failsafe.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas.o ggml_v2_openblas.o ggml_v1.o expose.o common.o gpttype_adapter.o k_quants.o lib/libopenblas.lib -shared -o koboldcpp_openblas.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_openblas_noavx2.o ggml_v2_openblas_noavx2.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_failsafe.o k_quants_noavx2.o lib/libopenblas.lib -shared -o koboldcpp_openblas_noavx2.dll
g++ -I. -I./examples -I./include -I./include/CL -I./otherarch -I./otherarch/tools -O3 -DNDEBUG -std=c++11 -fPIC -DGGML_USE_K_QUANTS -pthread -s -Wno-multichar -Wno-write-strings ggml_clblast.o ggml_v2_clblast.o ggml_v1.o expose.o common.o gpttype_adapter_clblast.o ggml-opencl.o ggml_v2-opencl.o ggml_v2-opencl-legacy.o k_quants.o lib/OpenCL.lib lib/clblast.lib -shared -o koboldcpp_clblast.dll

I'm not all that familiar with building, but I sort of have to if I want CUBLAS with no AVX2 to work.
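
One note on the transcript above: set LLAMA_CUBLAS=1 is cmd.exe syntax, and in w64devkit's POSIX-style shell it doesn't export an environment variable, so make never sees it. Assuming the Makefile reads LLAMA_CUBLAS at all (the owner points out below that the Windows Makefile does not build the CuBLAS DLL regardless), the invocation would have to look more like:

~/koboldcpp $ export LLAMA_CUBLAS=1   # export so make can see the variable
~/koboldcpp $ make
# or pass it directly as a make variable:
~/koboldcpp $ make LLAMA_CUBLAS=1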

Update

I was able to build the .dll using CMake; it wasn't clear that w64devkit isn't used for this.

The program still won't offload to GPU though 🤷‍♂️

Update 2

I'm able to run it with Python, but now I hit the same issue as #290, where the workaround defeats the purpose of everything I'm trying to do ☹️

@LostRuins
Owner

@TFWol if you're on Windows, perhaps you'd have better results using the CMakeLists file instead. That one should be very straightforward: install the CUDA Toolkit and Visual Studio, open the project, and build.
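
Roughly, and assuming a recent Visual Studio (the generator name below is just an example), the same workflow from a developer command prompt would be:

rem generate a Visual Studio solution from the CMakeLists
cmake -B build -G "Visual Studio 17 2022"
rem either open the generated solution in Visual Studio, or build from the command line:
cmake --build build --config Release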

@TFWol

TFWol commented Jul 22, 2023

Not sure if you saw my updates at the bottom, since notifications only show the initial post, but I was able to run it via Python; however, I'm hitting the RAM offload glitch.

@LostRuins
Owner

Can you copy the terminal output that shows when you try to load the model?

@TFWol

TFWol commented Jul 22, 2023

Python

It automatically uses OpenBLAS when --noavx2 is passed and ignores --usecublas:

C:\Users\user\koboldcpp135>python koboldcpp.py --noavx2  --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

This is where it loads into both VRAM and RAM, but it never lets go of the RAM:

C:\Users\user\koboldcpp135>python koboldcpp.py
***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Failed to use new GUI. Reason: No module named 'customtkinter'
Make sure customtkinter is installed!!!
Attempting to use old GUI...
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2145.75 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9750 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001


Exe

Not too important until the Python route is working. It's not finding the CuBLAS library:

C:\Users\user\koboldcpp135>koboldcpp.exe --noavx2  --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Attempting to use non-avx2 compatibility library with OpenBLAS. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas_noavx2.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=True, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
C:\Users\user\koboldcpp135>koboldcpp.exe --usecublas normal 1 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin
***
Welcome to KoboldCpp - Version 1.35
Warning: CuBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.dll
==========
Namespace(model='orca-mini-13b.ggmlv3.q4_1.bin', model_param='orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['normal', '1'], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "koboldcpp.py", line 1453, in <module>
    main(args)
  File "koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d
[24872] Failed to execute script 'koboldcpp' due to unhandled exception!

@LostRuins
Owner

You can try --nommap with CuBLAS too in order to free RAM. Alternatively, you can also try --useclblast instead (with --nommap if it doesn't work).

The best way to compare would be to see the total RAM usage with X layers offloaded vs with zero layers offloaded. If there is a difference, then you know the offload is working. You may not see a dramatic decrease because the RAM is supposed to be freed piecemeal during the loading process.
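
Concretely, with the Python route that loaded the CuBLAS DLL above, the two runs to compare would look something like this (the flags and model name are just the ones used earlier in this thread):

rem baseline: zero layers offloaded
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 0 --model orca-mini-13b.ggmlv3.q4_1.bin
rem full offload: compare steady-state RAM in Task Manager against the baseline run
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin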

@TFWol

TFWol commented Jul 22, 2023

Forgot to mention I had tried:
--nommap --usecublas --gpulayers 100 - RAM not released

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=[], gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7759.49 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2145.75 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 9750 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

--nommap --usecublas normal 1 --gpulayers 0 - GPU hardly used as expected

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=[], gpulayers=0)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7759.49 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 9807.49 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/43 layers to GPU
llama_model_load_internal: total VRAM used: 480 MB
llama_new_context_with_model: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001

--useclblast 1 0 --gpulayers 100 - Forces AVX2, which I don't have

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[1, 0], usecublas=None, gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1453, in <module>
    main(args)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d

--useclblast 1 0 --gpulayers 100 --nommap - Same issue as above; Forces AVX2, which I don't have

***
Welcome to KoboldCpp - Version 1.35
For command line arguments, please refer to --help
***
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Namespace(model=None, model_param='C:/Users/user/koboldcpp135/orca-mini-13b.ggmlv3.q4_1.bin', port=5001, port_param=5001, host='', launch=False, lora=None, threads=5, blasthreads=5, psutil_set_threads=False, highpriority=False, contextsize=2048, blasbatchsize=512, linearrope=False, stream=False, smartcontext=False, unbantokens=False, bantokens=None, usemirostat=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=[1, 0], usecublas=None, gpulayers=100)
==========
Loading model: C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\Users\user\koboldcpp135\orca-mini-13b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
Traceback (most recent call last):
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1453, in <module>
    main(args)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 1378, in main
    loadok = load_model(modelname)
  File "C:\Users\user\koboldcpp135\koboldcpp.py", line 212, in load_model
    ret = handle.load_model(inputs)
OSError: [WinError -1073741795] Windows Error 0xc000001d

@LostRuins
Owner

Did you try --nommap? That seems to be a precondition for freeing RAM.

@TFWol

TFWol commented Jul 23, 2023

Did you try --nommap? That seems to be a precondition for freeing RAM.

?
I had posted the commands I used above, which included --nommap

@TFWol

TFWol commented Aug 8, 2023

Update: Fixed in latest release


The new KoboldCpp v1.39 and v1.39.1 fail to compile koboldcpp_cublas.dll with the error

nvcc fatal   : Unsupported gpu architecture 'compute_37'

I can get it to compile if I revert

koboldcpp/CMakeLists.txt, lines 99 to 103 at b40550c:

if (LLAMA_CUDA_DMMV_F16)
    set(CMAKE_CUDA_ARCHITECTURES "60;61") # needed for f16 CUDA intrinsics
else()
    set(CMAKE_CUDA_ARCHITECTURES "37;52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
endif()

back to 1.38's version

        if (LLAMA_CUDA_DMMV_F16)
            set(CMAKE_CUDA_ARCHITECTURES "61") # needed for f16 CUDA intrinsics
        else()
            set(CMAKE_CUDA_ARCHITECTURES "52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
        endif()

I'm unsure of the hidden repercussions.
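
If reverting on every update gets tedious, one possible alternative (an untested sketch; it assumes nothing later in the file overwrites the value) is to guard the set() so that an architecture list supplied at configure time wins:

        if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
            if (LLAMA_CUDA_DMMV_F16)
                set(CMAKE_CUDA_ARCHITECTURES "60;61") # needed for f16 CUDA intrinsics
            else()
                set(CMAKE_CUDA_ARCHITECTURES "37;52;61") # keep the repo default, including compute_37 for K80
            endif()
        endif()

With that guard, a CUDA 12 toolchain that rejects compute_37 could be handled by configuring with -DCMAKE_CUDA_ARCHITECTURES="52;61" instead of editing the file.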

@LostRuins
Owner

Actually compute_37 is super deprecated and should not be used.
@henk717 Did the Kepler guy actually get it working on the K80? Otherwise, I feel I should just drop it if it's causing issues.

@TFWol

TFWol commented Aug 9, 2023

I can get CLBlast working at least, but I had to modify the Makefile to build it.
I made a gist (mostly for myself).

@TFWol

TFWol commented Aug 9, 2023

Actually compute_37 is super deprecated and should not be used.

Why is that being used anyway? Is it driver or toolkit version related?

BTW, the latest version you released fixes that build error.

@henk717

henk717 commented Aug 9, 2023

To enable support for older GPUs such as the K80.

@TFWol

TFWol commented Aug 9, 2023

I'm still getting the issue where RAM isn't released after offloading to VRAM when using CuBLAS, despite trying every combination of the other settings like mmap, mmq, GPU layers, etc.

Upstream problem?

@LostRuins
Owner

No, I don't think it's an upstream problem. Maybe a misunderstanding.

If you're already using the CMake file to build for CuBLAS (not the Makefile), then it should work, though it ignores whatever you set in the Makefile. Instead, you need to edit the launch settings in CMake.


Once you finish building koboldcpp_cublas.dll with CMake in Visual Studio, you need to copy it back into the koboldcpp directory. Then it will be loaded correctly.

The normal Makefile does not automatically build for CuBLAS on Windows.
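
As a concrete sketch of that last step (the build output folder is a guess; it varies with the CMake generator and configuration):

rem from the koboldcpp source directory, after a Release build in Visual Studio
copy build\Release\koboldcpp_cublas.dll .
python koboldcpp.py --usecublas normal 1 --nommap --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin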

@TFWol

TFWol commented Aug 11, 2023

Right, I'm aware CMake uses CMakeLists.txt and w64devkit uses the Makefile.
I've been building the CuBLAS DLL with these CMakeLists.txt changes (AVX2 OFF, FMA OFF):

option(MAKE_MISC_FILES              "MAKE_MISC_FILES"                                       OFF)

# instruction set specific
option(LLAMA_AVX                    "llama: enable AVX"                                     ON)
option(LLAMA_AVX2                   "llama: enable AVX2"                                    OFF)
option(LLAMA_AVX512                 "llama: enable AVX512"                                  OFF)
option(LLAMA_AVX512_VBMI            "llama: enable AVX512-VBMI"                             OFF)
option(LLAMA_AVX512_VNNI            "llama: enable AVX512-VNNI"                             OFF)
option(LLAMA_FMA                    "llama: enable FMA"                                     OFF)
# in MSVC F16C is implied with AVX2/AVX512
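
The same switches can presumably be supplied at configure time instead of editing the defaults, since option() leaves an already-set cache value alone; a hedged sketch:

rem equivalent to flipping the defaults above, without keeping a locally modified CMakeLists.txt
cmake -B build -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF
cmake --build build --config Release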

@TFWol

TFWol commented Aug 11, 2023

I asked about upstream because oobabooga's text-generation-webui has the same offload problem when it uses llama.cpp with CUDA.

@LostRuins
Owner

Could it be a driver problem then? Do you actually see the GPU being listed when the CuBLAS DLL loads? It should list your GPU name if detected.

@TFWol

TFWol commented Aug 14, 2023

Do you actually see the GPU being listed when the CuBLAS DLL loads?

Yes, if you look at my previous replies above, you can see my GPU is listed in the output.

Could it be a driver problem then?

I wish it were that simple.
It looks like it's an upstream issue: oobabooga/text-generation-webui#3475 (comment).

@AG-w

AG-w commented Oct 1, 2023

I asked about upstream because oobabooga's text-generation-webui has the same offload problem when it uses llama.cpp with CUDA.

I can use CUDA without AVX2 just fine in the oobabooga webui; they simply mixed up the AVX and AVX2 Python wheels (oobabooga/text-generation-webui#3803 (comment)).
I edited requirement_noavx2.txt, updated it, and the webui now works fine with CUDA on a CPU without AVX2.

Anyway, I've seen people say this backend is more lightweight, but it won't work with CUDA if the CPU doesn't have AVX2 (it doesn't matter if you use the --noavx2 and --usecublas flags at the same time; the program will just ignore the cublas flag), so I can only give up and go back for now.

@TFWol

TFWol commented Oct 3, 2023

@AG-w Does your RAM still hold onto the full .GGUF file in oobabooga?
My issue is that I can run the model, but it loads the full thing into RAM, copies it to the GPU, and then still keeps the full thing in RAM.

it won't work with CUDA if the CPU doesn't have AVX2
(it doesn't matter if you use the --noavx2 and --usecublas flags at the same time; the program will just ignore the cublas flag)

Yeah, the only way around that is to compile it for your system. It's mentioned in the README (that second bullet point), but it's very sparse on details. There's more information here (search for Windows, Compiling from Source Code).

It took me forever to get a compiling procedure of some sort working. The RAM offload is still a problem, though. The only way to make it reserve far less RAM is checking the Disable MMAP button.

LostRuins added the enhancement (New feature or request) label on Oct 3, 2023
@AG-w

AG-w commented Oct 5, 2023

@AG-w Does your RAM still hold onto the full .GGUF file in oobabooga? My issue is that I can run the model, but it loads the full thing into RAM, copies it to the GPU, and then still keeps the full thing in RAM.

Yes, I can see VRAM fill up in Task Manager after I changed the Python wheel in oobabooga.
I saw GPU usage remain low and suspected something was wrong, but later I realized that if I don't load all the layers into VRAM, the CPU will bottleneck the GPU.

Yeah, the only way around that is to compile it for your system.

This is just frustrating, and it's the reason I'll keep using the bloated webUI instead of a compiled binary. I just don't want to set up a million environments to fix some error; I dealt with similar things for Krita, and I won't bother compiling anything myself just to fix an error.

With Python I can just edit it in a text editor and the problem is gone.

@TFWol

TFWol commented Oct 17, 2023

Yeah, I have waaaay too many environments as well. Maybe I can leverage AI to do it for me 😆

@LostRuins
Owner

Just updating this old issue: it is now possible to use CLBlast with --noavx2.
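
For anyone landing here later, a launch along these lines should exercise that path (the two --useclblast values select the OpenCL platform and device and will differ per system; the model name is just illustrative):

koboldcpp.exe --noavx2 --useclblast 0 0 --gpulayers 100 --model orca-mini-13b.ggmlv3.q4_1.bin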
