ollama: 0.0.17 -> 0.1.7 #257760

Merged
merged 2 commits into NixOS:master on Nov 3, 2023
Conversation

@elohmeier (Contributor) commented Sep 28, 2023

Description of changes

Updated the package and added llama-cpp as a required dependency. Also added support for Metal, CUDA, and ROCm hardware acceleration.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@IogaMaster (Contributor)

We need to allow this to be built with GPU support before merging.

@zopieux (Contributor) commented Oct 3, 2023

Note 0.1.1 is already available.

@elohmeier (Contributor, Author)

Thanks, I've updated to 0.1.1 and also added CUDA/ROCm support (based on https://github.com/ggerganov/llama.cpp/blob/master/flake.nix) by integrating llama-cpp as a package. I don't have hardware to test CUDA and ROCm locally; maybe someone could try that out?
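For anyone with the hardware who wants to try it, here is a hypothetical helper expression for building the CUDA variant from a checkout of this branch. This is a sketch only: it assumes the new llama-cpp argument on ollama and its cudaSupport flag as described in this thread, and it is untested here.

    # test-cuda.nix (hypothetical): build with `nix-build test-cuda.nix` from the PR checkout
    let
      # CUDA is unfree, so it has to be allowed explicitly for a local build
      pkgs = import ./. { config.allowUnfree = true; };
    in
    pkgs.ollama.override {
      # swap cudaSupport for rocmSupport = true to test ROCm instead
      llama-cpp = pkgs.llama-cpp.override { cudaSupport = true; };
    }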

@benneti (Contributor) commented Oct 12, 2023

Thank you for the effort. Is there a reason that there is no OpenCL option? I think it should be supported and might be easier to set up on some devices than ROCm and CUDA; see
ollama/ollama#259
and https://github.com/ggerganov/llama.cpp
(or is it the default and I misunderstood something in the nixpkg, as I don't see clblast or the like).

Additionally, I tested the compiled binary with and without rocmSupport (ollama.override { llama-cpp = llama-cpp.override { rocmSupport = true; }; }),
and while both seem to build and start the server, when running a model I get "Error: failed to start a llama runner", and in the serve logs I see:

2023/10/12 15:33:49 llama.go:316: starting llama runner
2023/10/12 15:33:49 llama.go:352: waiting for llama runner to start responding
{"timestamp":1697117629,"level":"WARNING","function":"server_params_parse","line":887,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
error: unknown argument: --gqa
usage: /nix/store/srlrbyrfs3kjfiaacnv2nfni7yjf8vcb-llama-cpp-2023-10-11/bin/llama-cpp-server [options]

options:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output (default: disabled)
  -t N, --threads N     number of threads to use during computation (default: 8)
  -c N, --ctx-size N    size of the prompt context (default: 512)
  --rope-freq-base N    RoPE base frequency (default: loaded from model)
  --rope-freq-scale N   RoPE frequency scaling factor (default: loaded from model)
  -b N, --batch-size N  batch size for prompt processing (default: 512)
  --memory-f32          use f32 instead of f16 for memory key+value (default: disabled)
                        not recommended: doubles context memory required and no measurable increase in quality
  --mlock               force system to keep model in RAM rather than swapping or compressing
  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa                attempt optimizations that help on some NUMA systems
  -m FNAME, --model FNAME
                        model path (default: models/7B/ggml-model-f16.gguf)
  -a ALIAS, --alias ALIAS
                        set an alias for the model, will be added as `model` field in completion response
  --lora FNAME          apply LoRA adapter (implies --no-mmap)
  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter
  --host                ip address to listen (default  (default: 127.0.0.1)
  --port PORT           port to listen (default  (default: 8080)
  --path PUBLIC_PATH    path from which to serve static files (default examples/server/public)
  -to N, --timeout N    server read/write timeout in seconds (default: 600)
  --embedding           enable embedding vector output (default: disabled)

2023/10/12 15:33:49 llama.go:326: llama runner exited with error: exit status 1
2023/10/12 15:33:50 llama.go:333: error starting llama runner: llama runner process has terminated
[GIN] 2023/10/12 - 15:33:50 | 500 |  200.926851ms |       127.0.0.1 | POST     "/api/generate"

With the version in nixpkgs/master it works (though without the option for GPU support).

@elohmeier (Contributor, Author)

Rebased & updated ollama to 0.1.3.

@elohmeier (Contributor, Author)

@benneti Thanks for testing. I'll look into that ROCm issue. I've also added OpenCL support; could you test that as well?
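A hedged sketch of what testing the OpenCL variant might look like, assuming the openclSupport flag on llama-cpp that appears later in this thread:

    # sketch only: OpenCL-accelerated llama-cpp (CLBlast backend) under ollama,
    # evaluated from a checkout of this branch
    with import ./. { };
    ollama.override {
      llama-cpp = llama-cpp.override { openclSupport = true; };
    }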

@elohmeier (Contributor, Author)

I've patched the passing of the deprecated --gqa flag, which seems to be needed only for GGML-format models. Could you test that again?

@benneti (Contributor) commented Oct 14, 2023

As I am unable to test ROCm (I run out of space while compiling rocblas and am too lazy to change my tmpfs setup for this), even with the necessary fixes (see code comment), I only tried the standard version and OpenCL support.
Both work and seem to be much more performant than the previous version; thanks for the effort again.
For others who quickly want to check whether it works: only GGUF models are supported in this version, and they are not clearly labeled; a model that worked for me was ollama run zephyr (see ollama/ollama#738 (comment)).

Review comment on pkgs/by-name/ll/llama-cpp/package.nix (outdated, resolved)
@a-kenji (Contributor) commented Oct 30, 2023

I can also confirm that it now works as intended; thanks for keeping at it!

@elohmeier (Contributor, Author)

The issue was fixed upstream, so I updated & removed the patch.

@pbsds (Contributor) commented Nov 3, 2023

I read through the changelogs and it seems there are only additions and no major changes, and I don't consider this a breaking change for the 23.11 release. But the aarch64-darwin ofborg builder isn't doing too well; I think it could use some friends.

@DrRuhe mentioned this pull request Nov 3, 2023
@markuskowa (Member)

Version 0.0.17 is broken with the current models. This update puts it back in a working state.
Is there anyone who can confirm that this PR builds on aarch64-darwin?

@elohmeier (Contributor, Author)

I can confirm that; I'm actively using it on aarch64-darwin with Metal support.

@markuskowa merged commit 47ccb89 into NixOS:master on Nov 3, 2023
21 of 22 checks passed
@elohmeier deleted the ollama branch on November 3, 2023 at 14:58
@geekodour (Contributor)

I installed ollama from unstable and I am getting:

{"timestamp":1699272675,"level":"WARNING","function":"server_params_parse","line":1994,"message":"Not compiled with GPU offload support, --n-
gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}

Related upstream issues:

I am not sure what the issue is. I have the 11.8 toolkit running. Building ollama directly also doesn't seem to pick up the includes. I think this (ollama/ollama#958) should finally fix it?

@geekodour (Contributor) commented Nov 6, 2023

The override looks like this:

ollamagpu = pkgs.unstable.ollama.override { llama-cpp = (pkgs.unstable.llama-cpp.override {cudaSupport = true;}); };

But it seems like compiling llama-cpp fails with the issues mentioned here: ggerganov/llama.cpp#2481

EDIT: the following worked:

ollamagpu = pkgs.unstable.ollama.override { llama-cpp = (pkgs.unstable.llama-cpp.override {cudaSupport = true; openblasSupport = false; }); };

@pbsds (Contributor) commented Nov 6, 2023

Should we provide ollamaWithNvidia and ollamaWithRocm in all-packages like torch does?

@markuskowa (Member)

I would avoid creating *withNvidia|Rocm attributes. A better way to turn on CUDA support is to set cudaSupport = true in the nixpkgs config, which consistently enables CUDA across nixpkgs (a similar thing will probably be established soon for ROCm). Note that Hydra would not build CUDA packages anyway, since CUDA itself is marked as unfree (i.e. you have to build it locally anyway).
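For example, in a NixOS configuration that global switch could look like the sketch below (cudaSupport and allowUnfree are standard nixpkgs config options; treat the snippet as illustrative rather than a tested setup):

    # configuration.nix (sketch): enable CUDA across nixpkgs instead of per-package overrides
    {
      nixpkgs.config = {
        allowUnfree = true;  # CUDA is unfree, so Hydra does not build it and it must be allowed locally
        cudaSupport = true;  # packages that honor this flag are built with CUDA support
      };
    }

Packages that respect the global flag then no longer need per-package overrides like the ones above.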

@teto (Member) commented Dec 3, 2023

Just wanted to thank @geekodour for his fix #257760 (comment). It worked for me as well.

@dltacube

Does this still work for you guys? I'm getting the following error:

error: llama-cpp: exactly one of openclSupport, openblasSupport and rocmSupport should be enabled

And if I remove the option that disables cuBLAS, it then complains about running an unsupported compiler version.

@teto (Member) commented Dec 26, 2023

It stopped working for me, so I removed the check altogether. Not sure what the proper fix is.

@dltacube

So you're just using the CPU, right? I removed the override and it installs just fine, but holy hell is it slow. There must be some new configuration for llama-cpp that I'm going to look up if I have time later.

@dltacube commented Dec 28, 2023

I don't have a proper fix but ran into the following errors:

  • It complained about using gcc 12 rather than version 11.
  • Passing -DALLOW_UNSUPPORTED_COMPILER=ON to gcc had no impact on nvcc.
  • Overriding nativeBuildInputs to use super.gcc11 yielded other errors.
  • It complained about git not being found, so I added that in an overlay.
  • It then proceeds to run the git command, except it's not in any git folder, so it has no effect.
  • Here's what I'm left with, and for some reason it works with GPU usage (checked with nvtop):
    (self: super: {
      # adds git to llama-cpp's build environment
      llama-cpp = super.llama-cpp.overrideAttrs (oldAttrs: {
        nativeBuildInputs = oldAttrs.nativeBuildInputs ++ [ super.git ];
      });
    })
    (self: super: {
      # switches llama-cpp from OpenBLAS to OpenCL
      llama-cpp = super.llama-cpp.override (args: {
        openclSupport = true;
        openblasSupport = false;
      });
    })

I don't think that first overlay is doing much of anything.

Globally, CUDA support is not turned on. Just thought I'd leave this here in case someone else stumbles on this issue. I'm also not using unstable.

@happysalada (Contributor)

To try to address this, I made #277709.
Let me know, of course, if I missed something.

@Green-Sky commented Dec 30, 2023

ggerganov/llama.cpp@68eccbd

There has been a rewrite of the upstream Nix files. It included a stdenv fix for CUDA versions.

(edit: oops, I actually wanted to reply in #277709, but here works too)

@happysalada (Contributor)

Thank you! SomeoneSerge nicely commented on what had to change; I think it's all incorporated now. Let me know if you notice something I missed.

@elohmeier mentioned this pull request Feb 15, 2024