Low level new loading system #64
Conversation
Updated to use the new loading system in llama (llama_state). This new system has split model weights and contexts into two separate things, allowing one set of weights to be shared between many contexts. This change _only_ implements the low level API and makes no effort to update the LLamaSharp higher level abstraction. It is built upon llama `b3f138d`, necessary DLLs are **not** included in this commit.
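To make the split concrete, here is a minimal sketch of the load-once / many-contexts pattern against the two entry points this PR binds (`llama_load_model_from_file`, `llama_new_context_with_model`). The `LLamaContextParams` stand-in and the library name are placeholders, not the real binding:

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in only: the real LLamaContextParams must mirror the native struct
// layout field-for-field. Fill in the real layout before actually running this.
[StructLayout(LayoutKind.Sequential)]
struct LLamaContextParams { }

static class Demo
{
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    static extern LLamaContextParams llama_context_default_params();

    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    static extern IntPtr llama_load_model_from_file(string path, LLamaContextParams p);

    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    static extern IntPtr llama_new_context_with_model(IntPtr model, LLamaContextParams p);

    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    static extern void llama_free(IntPtr ctx);

    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    static extern void llama_free_model(IntPtr model);

    static void Main()
    {
        var p = llama_context_default_params();

        // Weights are loaded once...
        IntPtr model = llama_load_model_from_file("model.bin", p);

        // ...and shared by any number of independent contexts.
        IntPtr ctxA = llama_new_context_with_model(model, p);
        IntPtr ctxB = llama_new_context_with_model(model, p);

        llama_free(ctxA);
        llama_free(ctxB);
        llama_free_model(model); // freed once, after all contexts are gone
    }
}
```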
This is all working perfectly and I am now building on top of this PR. One thing I noticed was that GPU layers are no longer loaded; I suspect it may be to do with this commit ae178ab. Since you have done these native API upgrades I was wondering if you could have a look, as you already have a branch open for this stuff :)
@martindevans I have created a PR for what I mentioned above; it fixes the GPU issues. This now gets us up to date with llama.cpp @ commit edcc7ae. Thanks for the great work!
Thanks for testing that @saddam213, I'm not using GPU acceleration locally so I'd never have noticed that! Would you mind creating a PR into this branch with just your fix for that issue? That way this PR remains usable in isolation. I'll update the comment at the top of this thread to indicate that edcc7ae is supported.
@martindevans made a PR over on your branch for the GPU issue fix :)
LLamaContextParams epsilon and tensor split changes
Thanks, I've merged that so it's part of this PR.
On macOS this PR doesn't work. I will try to review how to marshal this float[] type: Unhandled exception. System.TypeLoadException: Cannot marshal field 'tensor_split' of type 'LLama.Native.LLamaContextParams': Invalid managed/unmanaged type combination (Array fields must be paired with ByValArray).
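(For context: the runtime raises exactly that error when a struct passed to native code has a managed array field with no marshalling attribute. A minimal repro of the failing shape, with the surrounding fields omitted:)

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct LLamaContextParams
{
    // ... other fields elided ...

    // A managed float[] has no fixed native layout; without a paired
    // [MarshalAs(UnmanagedType.ByValArray, SizeConst = ...)] the runtime
    // throws the TypeLoadException quoted above the first time it tries
    // to marshal the struct.
    public float[] tensor_split;
}
```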
@SignalRT did you make sure to update the native deps before testing? The updated binaries are not included in this PR! Check the top message for a link to the appropriate commit in llama.cpp :)
It's tested with today's llama.cpp (well, there are several "today" llama.cpp versions...)
In that case I think the extra TensorSplits fix will need to be updated. Rather than passing in a …
Seems odd that this is affecting a single OS; I really, really need a Mac to test stuff :p EDIT: the calling and result of …
I don't think it uses a fixed-size array any more - looking at the C++ code it's just a pointer to some floats.
Confused by this; tried the latest build, still works fine on Windows and Linux. Need someone with more C++ knowledge to look at this for me please.
According to the docs:
So I think the current code is just incorrect, as the array currently has no marshalling attribute. I'm not sure why that works on Windows! I've added …
I did try a few attributes this morning, however they all ended up setting tensor_split to a non-null value, and that caused all properties after it to take wrong values (RoPE etc). And we can't set SizeConst to 0 on any of the Marshal attributes. I'll leave this one for the Mac guys as I can't reproduce the error on the other OSs. EDIT: Setting a breakpoint after the call of …
After further digging using a …
I made some tests after reviewing the C++ source code. In my opinion it should be something like:
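Presumably a fixed-size array paired with `ByValArray`, along these lines (the constant's value here is an assumption):

```csharp
// SizeConst must be a compile-time constant, and it has to agree with the
// value the native library reports from llama_max_devices(). 1 is assumed here.
private const int LLAMA_MAX_DEVICES = 1;

[MarshalAs(UnmanagedType.ByValArray, SizeConst = LLAMA_MAX_DEVICES)]
public float[] tensor_split;
```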
Here LLAMA_MAX_DEVICES should be calculated using llama_max_devices(), which I also import in the C# native API. It seems to be something that AsakusaRinne added in llama.cpp some time ago. This works in some tests, but once that issue is solved I arrive at #67.
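That import is presumably a one-line P/Invoke (library name assumed):

```csharp
// llama.cpp exposes llama_max_devices() for exactly this purpose.
[DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
public static extern int llama_max_devices();
```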
Yeah, I think it's time I leave this to the experts. It works on Windows and Linux, so at least I can continue development on my branch, and I can merge in any changes made for the Mac guys later.
It shouldn't be marshalled as a

```cpp
typedef struct
{
    float TensorSplits[3];
} DemoStruct;
```

That's how it worked up until 2 weeks ago, when ggerganov made this change. Since then the definition in llama.cpp (here) looks like this:

```cpp
const float * tensor_split; // how to split layers across multiple GPUs (size: LLAMA_MAX_DEVICES)
```

i.e. just a pointer to the first value, which is the default marshalling as @saddam213 pointed out. If the …
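Which suggests the managed side can marshal it as a plain pointer with no attribute at all; a sketch (surrounding fields elided):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct LLamaContextParams
{
    // ... fields before tensor_split elided ...

    // Mirrors `const float * tensor_split`: default pointer marshalling.
    // IntPtr.Zero means no split; otherwise point it at a pinned
    // float[llama_max_devices()] (e.g. GCHandle.Alloc(values, GCHandleType.Pinned)).
    public IntPtr tensor_split;

    // ... fields after tensor_split elided ...
}
```

The pin must outlive the native call, since the native side reads through the pointer.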
I just tested this on Windows using the …
@AsakusaRinne Reminder that this will break master until the correct DLLs are added. They are not included in this PR!
Yes, 468ea24 should be fine. You could probably use a more up to date version, but I haven't tried it.
@martindevans It works with llama.cpp 5f631c2. I've updated the binaries in my fork.
Updated to use the new loading system in llama (llama_state). This new system has split model weights and contexts into two separate things, allowing one set of weights to be shared between many contexts.
This PR does not try to take advantage of this new capability in any way! It's just implementing the smallest change to the low level API. In future PRs I think we'll want to make quite extensive changes to the API (especially with state management).
This is built upon llama.cpp 468ea24; necessary DLLs are NOT included in this PR. I don't know how to build them all and assumed you'd rather add them yourself anyway (as a matter of security).

Edit: I've expanded the `SafeLlamaModelHandle` to include methods for all the things you can do with a `llama_model`. Most of these are not used yet, since we'll need to decide how to expose this to the higher level API.
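As a rough sketch of the shape such a handle takes (the actual method list is in the PR; only the release path is shown here, with the library name assumed):

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Sketch: the real SafeLlamaModelHandle in the PR wraps many more
// llama_model operations than shown here.
sealed class SafeLlamaModelHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    private SafeLlamaModelHandle() : base(ownsHandle: true) { }

    protected override bool ReleaseHandle()
    {
        llama_free_model(handle); // the single place the shared weights are freed
        return true;
    }

    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    private static extern void llama_free_model(IntPtr model);
}
```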