Running LLamaSharp on gpu #189

Closed
jonty-esterhuizen opened this issue Oct 9, 2023 · 36 comments
Labels
bug Something isn't working

Comments

@jonty-esterhuizen

jonty-esterhuizen commented Oct 9, 2023

I am currently using LLamaSharp like this:

    public StatefulChatService(IConfiguration configuration)
    {
        _configuration = configuration;

        modelPath = Path.Combine(_configuration["ModelPath"], _configuration["ModelName"]);
        systemPrompt = File.ReadAllText(_configuration["ChatWithBase"]);
        _context = new LLamaContext(new LLama.Common.ModelParams(modelPath)
        {
            ContextSize = 2048,
            GpuLayerCount = 50,
            MainGpu = 0
        });

        _chatSession = new ChatSession(new InteractiveExecutor(_context));
        _chatSession.AddInputTransform(new MyInputTransform1());
        _chatSession.AddInputTransform(new MyInputTransform2());
        _chatSession.AddInputTransform(new MyInputTransform3());
    }

    public void Dispose()
    {
        _chatSession.SaveSession(_configuration["SavedSessionPath"]);
        _context?.Dispose();
    }

    public async Task<string> ChatAsync(SendMessageInput input)
    {
        var userInput = input.Text;
        if (!_continue)
        {
            userInput = systemPrompt + userInput;
            Console.Write(systemPrompt);
            _continue = true;
        }

        Console.ForegroundColor = ConsoleColor.Green;
        Console.Write(input.Text);

        Console.ForegroundColor = ConsoleColor.White;
        var outputs = _chatSession.ChatAsync(userInput, new LLama.Common.InferenceParams()
        {
            RepeatPenalty = 1.0f,
            AntiPrompts = new string[] { "User:" },
            FrequencyPenalty = 1.0f,
            Temperature = 0.5f
        });
        var result = "";
        await foreach (var output in outputs)
        {
            Console.Write(output);
            result += output;
        }

        return result;
    }

The issue I am encountering is that I can't get this to run on my GPU; it only uses my CPU and RAM.

@AsakusaRinne
Collaborator

Hi, which backend package did you install for your project? There are three kinds of backends: CPU, CUDA and Metal (macOS). Be careful: the CPU and CUDA backends shouldn't be installed at the same time.

@jonty-esterhuizen
Author

jonty-esterhuizen commented Oct 11, 2023

When using LLamaSharp.Backend.cuda12 and LLamaSharp, both version 0.5.1, I encountered this error:
RuntimeError: The native library cannot be found. It could be one of the following reasons:

  1. No LLamaSharp backend was installed. Please search LLamaSharp.Backend and install one of them.
  2. You are using a device with only CPU but installed cuda backend. Please install cpu backend instead.
  3. The backend is not compatible with your system cuda environment. Please check and fix it. If the environment is expected not to be changed, then consider build llama.cpp from source or submit an issue to LLamaSharp.
  4. One of the dependencies of the native library is missing.

[screenshot]
Here are my system specs:
[screenshot]
[screenshot]

@jonty-esterhuizen
Author

I also moved over to importing:
[screenshot]
[screenshot]

and using this as a direct reference in my project.
My debugger is set to GPU:
[screenshot]
This is how I set up my context:

    _context = LLamaWeights.LoadFromFile(new LLama.Common.ModelParams(configPaths.ModelPath)
    {
        ContextSize = 1024,
        GpuLayerCount = 50,
        LowVram = false,
        Seed = random.Next(1, 99999),
        ConvertEosToNewLine = true,
        EmbeddingMode = true,
        Encoding = Encoding.UTF8,
    }).CreateContext(new LLama.Common.ModelParams(configPaths.ModelPath)
    {
        ContextSize = 1024,
        GpuLayerCount = 50,
        LowVram = false,
        Seed = random.Next(1, 99999),
        ConvertEosToNewLine = true,
        EmbeddingMode = true,
        Encoding = Encoding.UTF8,
    });
    _chatSession = new ChatSession(new InteractiveExecutor(_context));

And this is my chat method:

    public async Task<string> ChatAsync(SendMessageInput input)
    {
        var userInput = input.Text;
        if (!_continue)
        {
            userInput = systemPrompt + userInput;
            Console.Write(systemPrompt);
            _continue = true;
        }

        Console.ForegroundColor = ConsoleColor.Green;
        Console.Write(input.Text);

        Console.ForegroundColor = ConsoleColor.White;
        var outputs = _chatSession.ChatAsync(userInput, new LLama.Common.InferenceParams()
        {
            RepeatPenalty = 1.0f,
            AntiPrompts = new string[] { "User:" },
            FrequencyPenalty = 1.0f,
            Temperature = 0.5f
        });
        var result = "";
        await foreach (var output in outputs)
        {
            Console.Write(output);
            result += output;
        }

        return result;
    }

@martindevans
Member

With this new setup do you get the RuntimeError: The native library cannot be found message, or does it run but on CPU only?

@jonty-esterhuizen
Author

jonty-esterhuizen commented Oct 12, 2023 via email

@martindevans
Member

I think if you've got all of the dependencies it prefers to use the CPU ones at the moment, so you may have to delete all the non-CUDA DLLs.

@jonty-esterhuizen
Author

jonty-esterhuizen commented Oct 12, 2023

Which DLLs do I delete, and which do I keep, to get this to work?
[screenshot]
And where can I get the correct DLLs?

@martindevans
Member

If you've cloned this repo you should have the right DLLs, you just need to trim it down to just the ones you want:

[screenshot]

Having all the DLLs is like having all of the backends installed.

I'm guessing that once you're loading the CUDA DLL you'll get the Runtime error again. If so that's probably because some dependencies it requires are missing.
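
As a rough way to see which native builds are actually sitting next to the application, the sketch below lists every file in the output directory whose name starts with "libllama". That prefix is an assumption taken from the file names mentioned in this thread (libllama.dll, libllama-cuda12.dll); NativeLibraryAudit and FindLlamaLibraries are hypothetical helper names, not part of LLamaSharp.

```csharp
using System;
using System.IO;
using System.Linq;

public static class NativeLibraryAudit
{
    // Returns the names of files in `directory` that start with "libllama".
    // The prefix is an assumption about how the native builds are named.
    public static string[] FindLlamaLibraries(string directory)
    {
        return Directory.EnumerateFiles(directory)
            .Select(path => Path.GetFileName(path))
            .Where(name => name.StartsWith("libllama", StringComparison.OrdinalIgnoreCase))
            .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }

    public static void Main(string[] args)
    {
        // Default to the folder the application runs from.
        string dir = args.Length > 0 ? args[0] : AppContext.BaseDirectory;
        string[] found = FindLlamaLibraries(dir);

        Console.WriteLine($"Found {found.Length} candidate native libraries in {dir}");
        foreach (string name in found)
        {
            Console.WriteLine("  " + name);
        }

        if (found.Length > 1)
        {
            // Several builds side by side behaves like having several backends installed.
            Console.WriteLine("More than one build is present; a CPU build may be loaded instead of CUDA.");
        }
    }
}
```

If more than one file is listed, trimming the folder down to the single build you want is the step described above.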

@jonty-esterhuizen
Author

System.TypeInitializationException: The type initializer for 'LLama.Native.NativeApi' threw an exception.
 ---> LLama.Exceptions.RuntimeError: The native library cannot be found. It could be one of the following reasons: 
1. No LLamaSharp backend was installed. Please search LLamaSharp.Backend and install one of them. 
2. You are using a device with only CPU but installed cuda backend. Please install cpu backend instead. 
3. The backend is not compatible with your system cuda environment. Please check and fix it. If the environment is expected not to be changed, then consider build llama.cpp from source or submit an issue to LLamaSharp.
4. One of the dependency of the native library is missed.

Yeah, it seems to throw that error. Is there a way I can see what is missing?

@martindevans
Member

Unfortunately no, as far as I know. I've always wondered why that exception doesn't just list exactly what it's missing :'(
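
The exception does not name the missing file, but one partial workaround is to ask the OS loader about each suspected dependency individually. The sketch below uses the .NET NativeLibrary.TryLoad call for that. The CUDA DLL names in the list are assumptions about what a CUDA 12 build typically needs on Windows, and DependencyProbe and CanLoad are hypothetical names; adjust the list for your own build.

```csharp
using System;
using System.Runtime.InteropServices;

public static class DependencyProbe
{
    // Returns true if the OS loader can resolve and load the given library name.
    public static bool CanLoad(string libraryName)
    {
        if (NativeLibrary.TryLoad(libraryName, out IntPtr handle))
        {
            // We only wanted to know whether it loads, so release it again.
            NativeLibrary.Free(handle);
            return true;
        }
        return false;
    }

    public static void Main()
    {
        // Assumed dependency names for a CUDA 12 build on Windows (not confirmed by this thread).
        string[] candidates =
        {
            "cudart64_12.dll",
            "cublas64_12.dll",
            "cublasLt64_12.dll",
            "libllama.dll",
        };

        foreach (string name in candidates)
        {
            string status = CanLoad(name) ? "loads" : "could not be loaded";
            Console.WriteLine($"{name}: {status}");
        }
    }
}
```

A library that "could not be loaded" here is either absent from the search path or is itself missing a dependency, so this narrows the search rather than giving a definitive answer.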

A couple of things to check:

@jonty-esterhuizen
Author

Okay, here is the output of the command:

    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 536.99                 Driver Version: 536.99       CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA GeForce GTX 1660 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
    | 79%   32C    P8              15W / 130W |   1246MiB /  6144MiB |      1%      Default |
    |                                         |                      |                  N/A |
    +---------------------------------------------------------------------------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0   N/A  N/A      2252    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe    N/A      |
    |    0   N/A  N/A      2836    C+G   ...5n1h2txyewy\ShellExperienceHost.exe    N/A      |
    |    0   N/A  N/A      3064    C+G   ...l\Microsoft\Teams\current\Teams.exe    N/A      |
    |    0   N/A  N/A      3592    C+G   ...m Files\Ultimaker Cura 4.6\Cura.exe    N/A      |
    |    0   N/A  N/A      3624    C+G   ...U\ServiceHub.ThreadedWaitDialog.exe    N/A      |
    |    0   N/A  N/A      3780    C+G   ...b3d8bbwe\Microsoft.Media.Player.exe    N/A      |
    |    0   N/A  N/A      4556    C+G   ...l\Microsoft\Teams\current\Teams.exe    N/A      |
    |    0   N/A  N/A      5224    C+G   ...GeForce Experience\NVIDIA Share.exe    N/A      |
    |    0   N/A  N/A      7232    C+G   C:\Windows\explorer.exe                   N/A      |
    |    0   N/A  N/A      7712    C+G   ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe    N/A      |
    |    0   N/A  N/A      7748    C+G   ...22\Community\Common7\IDE\devenv.exe    N/A      |
    |    0   N/A  N/A      9748    C+G   ....Search_cw5n1h2txyewy\SearchApp.exe    N/A      |
    |    0   N/A  N/A     10628    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe    N/A      |
    |    0   N/A  N/A     11844    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe    N/A      |
    |    0   N/A  N/A     12736    C+G   ...64__8wekyb3d8bbwe\CalculatorApp.exe    N/A      |
    |    0   N/A  N/A     13248    C+G   ...siveControlPanel\SystemSettings.exe    N/A      |
    |    0   N/A  N/A     13604    C+G   ...72.0_x64__8wekyb3d8bbwe\GameBar.exe    N/A      |
    |    0   N/A  N/A     15968    C+G   ...oogle\Chrome\Application\chrome.exe    N/A      |
    |    0   N/A  N/A     17636    C+G   ...on\117.0.2045.60\msedgewebview2.exe    N/A      |
    |    0   N/A  N/A     18204    C+G   C:\Windows\System32\mstsc.exe             N/A      |
    |    0   N/A  N/A     18724    C+G   C:\Windows\System32\mstsc.exe             N/A      |
    +---------------------------------------------------------------------------------------+
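
For reading this output programmatically, here is a small sketch that pulls the "CUDA Version" field out of the nvidia-smi header. Keep in mind that this field reports the highest CUDA version the installed driver supports, not whether the CUDA runtime libraries are installed, so a value of 12.2 only shows the driver is new enough for a CUDA 12 backend. NvidiaSmiHeader and ParseCudaVersion are hypothetical names used for illustration.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

public static class NvidiaSmiHeader
{
    // Extracts the "CUDA Version: X.Y" value from nvidia-smi text.
    // Returns null when the field is not present.
    public static Version? ParseCudaVersion(string smiOutput)
    {
        Match match = Regex.Match(smiOutput, @"CUDA Version:\s*(\d+)\.(\d+)");
        if (!match.Success)
        {
            return null;
        }
        int major = int.Parse(match.Groups[1].Value);
        int minor = int.Parse(match.Groups[2].Value);
        return new Version(major, minor);
    }

    public static void Main()
    {
        // Header line copied from the output posted above.
        string header = "| NVIDIA-SMI 536.99 Driver Version: 536.99 CUDA Version: 12.2 |";
        Version? cuda = ParseCudaVersion(header);

        if (cuda is null)
        {
            Console.WriteLine("CUDA version not found in the output.");
        }
        else if (cuda.Major < 12)
        {
            Console.WriteLine($"Driver supports CUDA up to {cuda}, which is too old for a CUDA 12 backend.");
        }
        else
        {
            Console.WriteLine($"Driver supports CUDA up to {cuda}.");
        }
    }
}
```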

@jonty-esterhuizen
Author

So, after downloading and installing CUDA (https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html), I get:

 Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware[1]
      An unhandled exception has occurred while executing the request.
      System.TypeInitializationException: The type initializer for 'LLama.Native.NativeApi' threw an exception.
       ---> LLama.Exceptions.RuntimeError: The native library cannot be found. It could be one of the following reasons:

@martindevans
Member

I'm at a bit of a loss what could be the problem in that case :/

Does the normal llama.cpp demo application (main.exe) work on GPU?

@saddam213
Collaborator

Did you rename the file libllama-cuda12.dll to libllama.dll?

This has tripped up a few people I have helped with LLamaSharp
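
A minimal sketch of that rename step, written as a copy so the original CUDA file is kept. It assumes both file names exactly as given above and that it runs before anything has loaded libllama.dll, since a loaded DLL is typically locked on Windows. BackendSelector and UseCudaBuild are hypothetical names, not part of LLamaSharp.

```csharp
using System;
using System.IO;

public static class BackendSelector
{
    // Copies libllama-cuda12.dll over libllama.dll inside `directory`.
    // Returns false (and changes nothing) when the CUDA build is not present.
    public static bool UseCudaBuild(string directory)
    {
        string cudaBuild = Path.Combine(directory, "libllama-cuda12.dll");
        string defaultName = Path.Combine(directory, "libllama.dll");

        if (!File.Exists(cudaBuild))
        {
            return false;
        }

        // Overwrites any existing libllama.dll (for example a CPU build).
        File.Copy(cudaBuild, defaultName, overwrite: true);
        return true;
    }

    public static void Main(string[] args)
    {
        // Note: this modifies files in the target folder.
        string dir = args.Length > 0 ? args[0] : AppContext.BaseDirectory;
        bool replaced = UseCudaBuild(dir);
        Console.WriteLine(replaced
            ? "libllama.dll is now a copy of the CUDA 12 build."
            : "libllama-cuda12.dll was not found; nothing changed.");
    }
}
```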

@PaulaScholz

PaulaScholz commented Oct 20, 2023

I have a very similar issue. On one machine, a 10-core i9 with 128 GB of RAM and a 3090 GPU, LLamaSharp runs very fast and is certainly using the GPU. On another machine, a 56-core Xeon with 128 GB of RAM and 4 NVIDIA A6000 GPUs, it runs dog slow and uses up almost all the CPU. It does seem to use one GPU, but not much.

I have CUDA 12 and the latest NVIDIA drivers on each system. The output from both says they are using the GPU, but the 3090 machine is super fast and the other is super slow. I am using the 0.5.1 NuGet package and the 0.5.1 Backend.Cuda12 on both.

Also, how would I get LlamaSharp to use more than 1 GPU?

@martindevans
Member

> Also, how would I get LlamaSharp to use more than 1 GPU?

Currently not possible, but I've just opened this PR to add support for it: #202

I don't personally use CUDA and I don't have multiple GPUs to test with, so if you could test it and confirm that it works for you that would be much appreciated!

@PaulaScholz

PaulaScholz commented Oct 20, 2023

I would be delighted to try it, but I can't get it working at all with any GPU on the multi-GPU system. Also, FYI, the 70b model doesn't seem to work on my single-GPU system (128 GB RAM, 3090 card). Nonetheless, I will try the updated code; maybe it will fix the underlying issue.

@PaulaScholz

So, I cloned the MultiGpu branch and tried it. I got a System.AccessViolationException when trying to load the 13b model in native.cs, SafeLlamaModelHandle, at:

    var model_ptr = NativeApi.llama_load_model_from_file(modelPath, lparams);

Here is the model output:

ggml_init_cublas: found 4 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6
Device 1: NVIDIA RTX A6000, compute capability 8.6
Device 2: NVIDIA RTX A6000, compute capability 8.6
Device 3: NVIDIA RTX A6000, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from C:\PaulaLlamaModels\llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q5_K [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 19: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 20: blk.10.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 21: blk.10.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 22: blk.10.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 23: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 24: blk.10.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 25: blk.10.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 26: blk.10.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 27: blk.10.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 28: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 29: blk.11.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 30: blk.11.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 31: blk.11.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 32: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 33: blk.11.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 34: blk.11.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 35: blk.11.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 36: blk.11.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 37: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 38: blk.12.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 39: blk.12.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 40: blk.12.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 41: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 42: blk.12.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 43: blk.12.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 44: blk.12.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 45: blk.12.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 46: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 47: blk.13.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 48: blk.13.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 49: blk.13.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 50: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 51: blk.13.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 52: blk.13.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 53: blk.13.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 54: blk.13.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 55: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 56: blk.14.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 57: blk.14.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 58: blk.14.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 59: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 60: blk.14.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 61: blk.14.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 62: blk.14.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 63: blk.14.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 64: blk.15.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 65: blk.15.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 66: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 67: blk.2.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 68: blk.2.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 69: blk.2.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 70: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 71: blk.2.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 72: blk.2.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 73: blk.2.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 74: blk.2.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 75: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 76: blk.3.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 77: blk.3.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 78: blk.3.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 79: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 80: blk.3.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 81: blk.3.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 82: blk.3.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 83: blk.3.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 84: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 85: blk.4.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 86: blk.4.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 87: blk.4.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 88: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 89: blk.4.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 90: blk.4.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 91: blk.4.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 92: blk.4.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 93: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 94: blk.5.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 95: blk.5.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 96: blk.5.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 97: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 98: blk.5.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 99: blk.5.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 100: blk.5.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 101: blk.5.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 102: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 103: blk.6.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 104: blk.6.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 105: blk.6.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 106: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 107: blk.6.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 108: blk.6.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 109: blk.6.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 110: blk.6.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 111: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 112: blk.7.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 113: blk.7.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 114: blk.7.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 115: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 116: blk.7.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 117: blk.7.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 118: blk.7.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 119: blk.7.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 120: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 121: blk.8.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 122: blk.8.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 123: blk.8.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 124: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 125: blk.8.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 126: blk.8.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 127: blk.8.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 128: blk.8.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 129: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 130: blk.9.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 131: blk.9.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 132: blk.9.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 133: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 134: blk.9.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 135: blk.9.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 136: blk.9.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 137: blk.9.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 199: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 200: blk.22.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 208: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 209: blk.23.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 217: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 226: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 235: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 244: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 253: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 254: blk.28.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 262: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 263: blk.29.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 277: output.weight q6_K [ 5120, 32000, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 280: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 289: blk.31.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 290: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 291: blk.32.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 292: blk.32.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 293: blk.32.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 294: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 295: blk.32.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 296: blk.32.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 297: blk.32.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 298: blk.32.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 299: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 300: blk.33.ffn_down.weight q5_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 301: blk.33.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 302: blk.33.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 303: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 304: blk.33.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 305: blk.33.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 306: blk.33.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 307: blk.33.attn_v.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 308: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 309: blk.34.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 310: blk.34.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 311: blk.34.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 312: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 313: blk.34.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 314: blk.34.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 315: blk.34.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 316: blk.34.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 317: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 318: blk.35.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 319: blk.35.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 320: blk.35.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 321: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 322: blk.35.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 323: blk.35.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 324: blk.35.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 325: blk.35.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 326: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 327: blk.36.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 328: blk.36.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 329: blk.36.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 330: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 331: blk.36.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 332: blk.36.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 333: blk.36.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 334: blk.36.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 335: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 336: blk.37.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.37.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 338: blk.37.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 339: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 340: blk.37.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.37.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 342: blk.37.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 343: blk.37.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 345: blk.38.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.38.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 347: blk.38.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 348: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 349: blk.38.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.38.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.38.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 352: blk.38.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 354: blk.39.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.39.ffn_gate.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 356: blk.39.ffn_up.weight q5_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 357: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 358: blk.39.attn_k.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_output.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.attn_q.weight q5_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q5_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 1024
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 0.0
llm_load_print_meta: freq_scale = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 '
'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
warning: cannot set main_device=916964780 because there are only 4 devices. Using device 0 instead.
llm_load_tensors: mem required = 8581.45 MB (+ 800.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/43 layers to GPU
llm_load_tensors: VRAM used: 221 MB
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:

at LLama.Native.NativeApi.llama_load_model_from_file(System.String, LLama.Native.LLamaContextParams)

at LLama.Native.SafeLlamaModelHandle.LoadFromFile(System.String, LLama.Native.LLamaContextParams)
at LLama.LLamaWeights.LoadFromFile(LLama.Abstractions.IModelParams)
at Program.<Main>$(System.String[])

@PaulaScholz

PaulaScholz commented Oct 21, 2023

Maybe you need to make changes in the CUDA 12 backend? Maybe you could debug on an Azure VM with multiple GPUs? I might be able to get one for a short while.

@martindevans
Member

How did you set up your test?

The error here:

warning: cannot set main_device=916964780 because there are only 4 devices. Using device 0 instead.

Makes it look like you've set IModelParams.MainGpu = 916964780!

I assume you didn't actually do that, so I guess that there must be some mismatch between the C# and C++ sides. For example, a field in the wrong place in the llama_model_params struct. That would also explain the AccessViolationException if it tried to use some other random data as a pointer.
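
To illustrate the kind of mismatch described above, here is a minimal, self-contained sketch. The two layouts are hypothetical and are not the real llama.cpp parameter struct; the field names, the offsets, the value 48, and the 0x36 filler byte are all assumptions chosen for illustration. It shows how a field that the managed side set correctly can come back as a large, meaningless number when the native side reads it at a different offset.

```csharp
using System;

// Hypothetical layouts, for illustration only (NOT the real llama.cpp struct):
//   managed side writes: [ n_gpu_layers:int32 @0 ][ main_gpu:int32 @4 ]
//   native side reads:   [ seed:uint32 @0 ][ n_gpu_layers:int32 @4 ][ main_gpu:int32 @8 ]

// Simulate the memory the native side will read: 12 bytes of leftover "garbage".
byte[] memory = new byte[12];
Array.Fill(memory, (byte)0x36);

// The managed side fills in the fields where it believes they live.
BitConverter.GetBytes(48).CopyTo(memory, 0); // n_gpu_layers = 48
BitConverter.GetBytes(0).CopyTo(memory, 4);  // main_gpu = 0

// The native side reads the same bytes using its own offsets.
uint nativeSeed = BitConverter.ToUInt32(memory, 0);     // actually n_gpu_layers
int nativeGpuLayers = BitConverter.ToInt32(memory, 4);  // actually main_gpu
int nativeMainGpu = BitConverter.ToInt32(memory, 8);    // untouched garbage bytes

Console.WriteLine($"seed={nativeSeed} n_gpu_layers={nativeGpuLayers} main_gpu={nativeMainGpu}");
```

With these made-up layouts every field is shifted by one slot, and `main_gpu` is read from bytes nobody wrote. That produces the same flavor of value as the `main_device=916964780` warning, although it does not prove this is what happened here.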

@PaulaScholz

PaulaScholz commented Oct 21, 2023

Must be a mismatch because I didn't set that in the ModelParams, just these:

ContextSize = 2048,
GpuLayerCount = 48

@martindevans
Member

I don't have any good ideas for debugging that, I'm afraid.

The relevant bit of llama.cpp is this struct and the equivalent in that PR is this. As far as I can see those two agree with each other.

I can't reproduce this on my PC (with just one GPU), which is odd because if there really was something misaligned in that struct I'd expect everything to fail very quickly (even basic loading of weights).

@martindevans martindevans mentioned this issue Oct 22, 2023
1 task
@jonty-esterhuizen
Author

Is there a way I can test this for you, or could we have a Discord call?

@JeremyBickel

JeremyBickel commented Oct 24, 2023

I'm at a bit of a loss what could be the problem in that case :/

Does the normal llama.cpp demo application (main.exe) work on GPU?

I can get llama.cpp main to work with CPU, but with CUDA it just drops back to the command line after a couple of seconds:

C:\Users\Jeremy\Downloads\llama-b1414-bin-win-cublas-cu12.2.0-x64>main -m C:\Users\Jeremy\LLM\MODELS\TheBloke\CausalLM-14B-GGUF\causallm-14b-q5_1.gguf --file prompt.txt
Log start
main: build = 1414 (96981f3)
main: built with MSVC 19.35.32217.1 for x64
main: seed  = 1698159182
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Quadro K1200, compute capability 5.0

C:\Users\Jeremy\Downloads\llama-b1414-bin-win-cublas-cu12.2.0-x64>

I also tried renaming libllama-cuda12.dll (and the .so file for good measure). When that didn't work, I went a step further and copied libllama-cuda12.dll and .so and pasted them with the names of the other dll and so files:

 Directory of C:\Users\Jeremy\LLM\LLamaSharp\LLama\runtimes

10/24/2023  10:45 AM    <DIR>          .
10/24/2023  10:45 AM    <DIR>          ..
10/24/2023  10:31 AM    <DIR>          baks
10/23/2023  09:59 PM    <DIR>          build
10/23/2023  09:59 PM         5,168,128 ggml-metal.metal
10/23/2023  09:59 PM         5,168,128 libllama-cuda11.dll
10/23/2023  09:59 PM         5,253,824 libllama-cuda11.so
10/23/2023  09:59 PM         5,168,128 libllama-cuda12.dll
10/23/2023  09:59 PM         5,253,824 libllama-cuda12.so
10/23/2023  09:59 PM         5,168,128 libllama.dll
10/23/2023  09:59 PM         5,253,824 libllama.dylib
10/23/2023  09:59 PM         5,253,824 libllama.so

I followed some advice to replace the DLL and run it from the command line in the debug folder:

\LLamaSharp\LLama.Examples\bin\Debug\net6.0>LLama.Examples.exe
and I got this error:

CUDA error 209 at D:\a\LLamaSharp\LLamaSharp\ggml-cuda.cu:6862: no kernel image is available for execution on the device
current device: 0

BTW, I don't have a "D" drive, and the code isn't in an "a" directory. Is this path hardwired somewhere in the code? Maybe it doesn't matter here, but it's curious.

@martindevans
Member

I don't have a "D" drive, and the code isn't in an "a" directory. Is this path hardwired somewhere in the code?

That's just a path that was baked in from the build environment, it's not a problem.

no kernel image is available for execution on the device

This is the error that's causing you problems - it means that the compiled code does not support your specific CUDA version.

It's odd that llama.cpp isn't printing the same error, but I think that's the fundamental problem in both cases.

@JeremyBickel

I thought I saw it over at llama.cpp, but the error message there was "invalid configuration file" coming from the same line of code (ggml-cuda.cu:6862). I don't know if it's related or not.

ggerganov/llama.cpp#3740

@atonalfreerider

atonalfreerider commented Oct 26, 2023

EDIT: I swapped GPUs from one of my PCs to the other. The RTX 2070 shows intermittent GPU usage on both PCs. The GTX 1070 shows none at all.

If it helps, I'm experiencing similar GPU issues to OP. I have tried the following combinations and have seen only one case where one of my GPUs intermittently functions:

PC1: Windows 10, GTX 1070

  • tried CUDA 11.8
  • tried CUDA 12.3
  • tried building libllama-cuda11.dll from latest source (today) and renamed to libllama.dll
    Result: NO GPU USAGE

PC2: Windows 10, RTX 2070, CUDA 11.8
Result: INTERMITTENT GPU USAGE. The output from the model is still incredibly slow. The GPU spikes every time a token is generated, but it takes 6-10 seconds between spikes.

Results displayed below:

[Image: perf copy]

I think this is a great library. I really hope to be able to use it. Thank you.

@AsakusaRinne
Collaborator

@atonalfreerider Hey, I noticed that you compiled the DLL from llama.cpp yourself. Could you please also test whether the GPU is used when running the llama.cpp examples directly? That will help us see whether the issue is in llama.cpp or LLamaSharp. :)

@atonalfreerider

Someone posted a similar issue here to llama.cpp today:
ggerganov/llama.cpp#3806

From llama.cpp I downloaded the latest release, b1429.

I downgraded my CUDA SDK to 12.2 to match the release version and ran on the RTX 2070.

I ran main.exe with several different models. I ran it with the exact same model that I was using in my comment above, with the same prompt, and experienced the same behavior as posted above.

I also ran LLama.Examples, with choice (14) coding assistant, gave it the same prompt, and experienced the same behavior.

Here is the output from llama-bench.exe. I'll post it to the llama.cpp issues as well:

C:\Users\johnb\Desktop\llamacpp-bin\llama-b1429-bin-win-cublas-cu12.2.0-x64>llama-bench.exe
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | pp 512     |  1165.85 ± 60.52 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | tg 128     |     76.31 ± 0.11 |

@atonalfreerider

I may have found a partial explanation for the behavior. I explained on the llama.cpp issue ticket referenced above:
ggerganov/llama.cpp#3806 (comment)

It would appear from testing that the model needs to fit into the GPU memory in order to run efficiently. Please take this result with a grain of salt.
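
As a rough sketch of what "fitting into GPU memory" could mean in practice, the snippet below estimates how many layers fit into a VRAM budget. The 221 MB per layer and the 43 layers are taken from the 13B Q5_K log earlier in this thread (`offloaded 1/43 layers`, `VRAM used: 221 MB`) and may include some fixed overhead; the 8192 MB card size and the 1024 MB reserve are assumptions, and `LayersThatFit` is a hypothetical helper, not part of LLamaSharp.

```csharp
using System;

// Hypothetical helper: estimates how many layers fit into the usable VRAM.
static int LayersThatFit(double vramMb, double reservedMb, double perLayerMb, int totalLayers)
{
    if (perLayerMb <= 0) throw new ArgumentOutOfRangeException(nameof(perLayerMb));

    // Keep some VRAM free for the KV cache, scratch buffers and the desktop.
    double usableMb = Math.Max(0, vramMb - reservedMb);

    // Whole layers only, and never more than the model actually has.
    int fit = (int)Math.Floor(usableMb / perLayerMb);
    return Math.Clamp(fit, 0, totalLayers);
}

// Assumed: an 8 GB card with 1 GB held back; per-layer size and layer count from the log.
int layers = LayersThatFit(vramMb: 8192, reservedMb: 1024, perLayerMb: 221, totalLayers: 43);
Console.WriteLine($"Estimated layers that fit: {layers}");
```

An estimate like this could be used as a starting value for `GpuLayerCount`. If it comes out below the model's total layer count, the remaining layers stay on the CPU, which would be consistent with the slow, spiky behavior described above.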

@AsakusaRinne
Collaborator

Thank you for the feedback!

@atonalfreerider

IMPORTANT UPDATE from the llama.cpp thread: it is now possible to disable VRAM overflow with NVIDIA driver version 546.01 (the linked post's title gives it as 564.01):

https://old.reddit.com/r/LocalLLaMA/comments/17kl8gu/psa_with_nvidia_driver_56401_its_now_possible_to/

Guide:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490

You have to individually add the path to your program executable to the NVIDIA 3D settings and opt out of RAM fallback.

WARNING: This will cause your program to crash, rather than just slow down, if you overflow the memory with a large model. But it should make full use of VRAM.

@AsakusaRinne
Collaborator

It seems to have been resolved by ggerganov/llama.cpp#3906. I think the fix will be included in a future release of LLamaSharp.

@martindevans
Member

That 2 character change has caused a lot of grief 😆 😢

New binaries have been merged into master (through #249), which should speed everything up.

@sania96

sania96 commented Feb 5, 2024

I have a very similar issue. I have one machine, a 10 core I9 with 128gb and 3090 GPU. LlamaSharp runs very fast and is certainly using the GPU. On another machine, a 56 core Xeon with 128GB, 4 NVIDIA A6000 GPUs, it runs dog slow, uses up almost all the CPU. It does seem to use one GPU, but not much.

I have CUDA 12 and the latest NVIDIA drivers on each system. Output from both say they are using the GPU, but 3090 is super fast and the other is super slow. I am using the 0.5.1 Nuget package and the 0.5.1 Backend.Cuda12 on both.

Also, how would I get LlamaSharp to use more than 1 GPU?

Hi, how did you run it on the GPU? I cannot find the .dll files for CUDA. Where can I get them?

@martindevans
Member

At the moment you need to have the CUDA toolkit installed (https://developer.nvidia.com/cuda-toolkit).
