
Low level new loading system #64

Merged

Conversation

martindevans
Member

@martindevans martindevans commented Jul 25, 2023

Updated to use the new loading system in llama (llama_state). This new system has split model weights and contexts into two separate things, allowing one set of weights to be shared between many contexts.

This PR does not try to take advantage of this new capability in any way! It's just implementing the smallest change to the low level API. In future PRs I think we'll want to make quite extensive changes to the API (especially with state management).

This is built upon llama.cpp 468ea24; the necessary DLLs are NOT included in this PR. I don't know how to build them all, and assumed you'd rather add them yourself anyway (as a matter of security).

Edit: I've expanded the SafeLlamaModelHandle to include methods for all the things you can do with a llama_model. Most of these are not used yet, since we'll need to decide how to expose this to the higher level API.
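To illustrate the weights/contexts split in rough terms, here is a minimal conceptual sketch (this is not the actual LLamaSharp API; the class names are made up and the native calls named in the comments are only indicative):

using System;

// Conceptual sketch only: one weights object, many lightweight contexts sharing it.
public sealed class ModelWeights : IDisposable
{
    // Real code would call llama_load_model_from_file here and keep the returned handle.
    public ModelWeights(string path) { }

    // Real code would call llama_new_context_with_model, reusing the already-loaded weights.
    public InferenceContext CreateContext() => new InferenceContext(this);

    // Real code would call llama_free_model.
    public void Dispose() { }
}

public sealed class InferenceContext : IDisposable
{
    public ModelWeights Model { get; }
    internal InferenceContext(ModelWeights model) => Model = model;

    // Real code would call llama_free; only the context is destroyed, the weights stay loaded.
    public void Dispose() { }
}

// Usage: load the weights once, create as many contexts as needed.
// using var weights = new ModelWeights("path/to/model.bin");
// using var ctx1 = weights.CreateContext();
// using var ctx2 = weights.CreateContext();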


This change _only_ implements the low level API and makes no effort to update the LlamaSharp higher level abstraction.

It is built upon llama `b3f138d`; necessary DLLs are **not** included in this commit.
@martindevans martindevans mentioned this pull request Jul 25, 2023
@saddam213
Collaborator

This is all working perfectly and I am now building on top of this PR.

One thing I noticed is that GPU layers are not loaded anymore; I suspect it may be due to this commit ae178ab, which changed tensor_split from an array to a pointer.

Since you have done these native API upgrades, I was wondering if you could have a look, as you already have a branch open for this stuff :)

@saddam213
Collaborator

@martindevans I have created a PR for what I mentioned above; it fixes the GPU issues.

This now gets us up to date with llama.cpp @ commit edcc7ae

Thanks for the great work

@martindevans
Member Author

Thanks for testing that @saddam213, I'm not using GPU acceleration locally so I'd never have noticed that!

Would you mind creating a PR into this branch with just your fix for that issue? That way this PR remains usable in isolation.

I'll update the comment at the top of this thread to indicate that edcc7ae is supported.

@saddam213
Collaborator

@martindevans made a PR over on your branch for the GPU issue fix :)
martindevans#1

LLamaContextParams epsilon and tensor split changes
@martindevans
Member Author

Thanks, I've merged that so it's part of this PR.

@SignalRT
Collaborator

SignalRT commented Aug 2, 2023

On macOS this PR doesn't work. I will try to review how to marshal this float[] type:

Unhandled exception. System.TypeLoadException: Cannot marshal field 'tensor_split' of type 'LLama.Native.LLamaContextParams': Invalid managed/unmanaged type combination (Array fields must be paired with ByValArray).
at LLama.Native.NativeApi.llama_context_default_params()
at LLama.Utils.InitLLamaContextFromModelParams(ModelParams params) in
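(For context: this is the exception the runtime throws when a struct used with P/Invoke declares a bare managed array field. A minimal sketch of the shape that triggers it, with other fields omitted and the struct name changed:)

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct ContextParamsBrokenSketch
{
    // ... other fields omitted ...

    // A bare float[] field cannot be marshalled as part of a struct: the runtime requires
    // [MarshalAs(UnmanagedType.ByValArray, SizeConst = ...)], which is exactly what the
    // TypeLoadException above complains about.
    public float[] tensor_split;
}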

@martindevans
Member Author

@SignalRT did you make sure to update the native deps before testing? The updated binaries are not included in this PR!

Check the top message for a link to the appropriate commit in llama.cpp :)

@SignalRT
Collaborator

SignalRT commented Aug 2, 2023

It's tested with today's llama.cpp (well, there are several "today" llama.cpp versions...)

@martindevans
Member Author

In that case I think the extra TensorSplits fix will need to be updated. Rather than passing in a float[] it'll have to be a float* and the array will have to be fixed in place when created. If you want to make a PR (into this branch) for it I'll be happy to review it and merge it. If not I'll look at it myself within a few days.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

LLAMA_MAX_DEVICES was recently exposed on llama.cpp via LLAMA_API int llama_max_devices();
so we can at least create the right-sized fixed array now.
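A minimal import sketch for that (the library name "libllama" and the calling convention here are assumptions and would need to match the existing bindings):

using System.Runtime.InteropServices;

public static class NativeApiSketch
{
    // Hypothetical P/Invoke for the new llama.cpp export: LLAMA_API int llama_max_devices();
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    public static extern int llama_max_devices();
}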

Seems odd that this is only affecting a single OS; I really, really need a Mac to test stuff :p

EDIT: calling llama_context_default_params and checking the result would probably be a really good first unit test to add.

@martindevans
Member Author

I don't think it uses a fixed size array any more - looking at the cpp code it's just a pointer to some floats.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

Confused by this. I tried the latest build and it still works fine on Windows and Linux; llama_context_default_params returns nullptr for tensor_split, so I don't see why this exception is thrown on Mac. All other property values are correct, so it's not a sequence issue.

Need someone with more C++ knowledge to look at this for me, please.

@martindevans
Member Author

According to the docs:

.NET array: Not allowed without a [MarshalAs] attribute (if marshalling as a field)

So I think the current code is just incorrect, as the array currently has no marshalling attribute. I'm not sure why that works on Windows!

I've added MarshalAs(UnmanagedType.LPArray) which I think should marshal it as a pointer to the first element. @SignalRT would you mind testing it out on Mac again? Thanks.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

I did try a few attributes this morning; however, they all ended up setting tensor_split to a non-null value, and that caused all the properties after it to end up with wrong values (RoPE etc).

And we can't set SizeConst to 0 on any of the Marshal attributes.

I'll leave this one for the Mac guys, as I can't reproduce the error on the other OSes.

EDIT: Setting a breakpoint after the call to llama_context_default_params and checking the defaults is a quick way to see.

@saddam213
Collaborator

After further digging, using a float[] for float* should work fine.

In most cases, you don't need to use the MarshalAs attribute explicitly for a float[] when working with Platform Invoke (P/Invoke) in C#. The reason is that float[] is typically well-supported in the .NET Framework and has a matching unmanaged type in C/C++ (e.g., float[] in C# corresponds to float* in C/C++).

@SignalRT
Collaborator

SignalRT commented Aug 3, 2023

I made some tests after reviewing the C++ source code. In my opinion it should be something like:

   [MarshalAs(UnmanagedType.ByValArray, SizeConst = NativeApi.LLAMA_MAX_DEVICES)]

Where LLAMA_MAX_DEVICES should be calculated using llama_max_devices(), which I also imported in the C# native API. It seems to be something that AsakusaRinne added in llama.cpp some time ago:

ggerganov/llama.cpp#2253

This is working in some tests, but once that issue is solved I run into this: #67, either running in memory or on the GPU (Metal). So it seems I need to review this more deeply and check the llama.cpp changes since your PR.

@saddam213
Collaborator

Yeah, I think it's time I leave this to the experts.

It works on Windows and Linux, so at least I can continue development on my branch and merge in any changes made for the Mac guys later.

@martindevans
Member Author

It shouldn't be marshalled as a ByValArray; that's for fixed-size in-place arrays like this:

typedef struct
{
    float TensorSplits[3];
} DemoStruct;

That's how it worked up until 2 weeks ago, when ggerganov made this change.

Since then the definition in llama.cpp (here) looks like this:

const float * tensor_split; // how to split layers across multiple GPUs (size: LLAMA_MAX_DEVICES)

i.e. just a pointer to the first value, which is the default marshalling as @saddam213 pointed out.

If the MarshalAs(UnmanagedType.LPArray) didn't help on macOS then we can get rid of it; it was just a guess at what might fix the Mac-specific issue.
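To make the distinction concrete, here is a rough C# sketch of the two marshalling shapes being discussed (struct names are illustrative and all other fields are omitted):

using System;
using System.Runtime.InteropServices;

// Matches the old C layout (float tensor_split[3]): the values live inside the struct itself.
[StructLayout(LayoutKind.Sequential)]
public struct OldStyleParams
{
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)]
    public float[] tensor_split;
}

// Matches the new C layout (const float * tensor_split): just a pointer; null means no explicit split.
[StructLayout(LayoutKind.Sequential)]
public struct NewStyleParams
{
    public IntPtr tensor_split;
}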

@martindevans
Member Author

martindevans commented Aug 3, 2023

I just tested this using the bin-win-avx512-x64 binaries from version 468ea24 and it works on Windows.

@AsakusaRinne AsakusaRinne merged commit 1d29b24 into SciSharp:master Aug 5, 2023
@martindevans martindevans deleted the new_llama_state_loading_mechanism branch August 5, 2023 00:57
@martindevans
Member Author

@AsakusaRinne Reminder that this will break master until the correct DLLs are added. They are not included in this PR!

@AsakusaRinne
Collaborator

468ea24

Should the DLLs come from 468ea24? I could compile them today and add them.

@martindevans
Member Author

Yes, 468ea24 should be fine. You could probably use a more up-to-date version, but I haven't tried it.

@martindevans martindevans mentioned this pull request Aug 5, 2023
@SignalRT
Collaborator

SignalRT commented Aug 5, 2023

@martindevans
This is the definition that works for me on macOS, Linux and Windows:
public nint tensor_split;

It works with llama.cpp 5f631c2; I updated the binaries in my fork.
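For anyone wiring that up from managed code, here is a hedged sketch of how a pointer-typed tensor_split could be filled in by pinning a managed array (the struct name is a stand-in and the handle lifetime handling is illustrative only):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct ContextParamsFixedSketch
{
    public nint tensor_split;
}

public static class TensorSplitHelper
{
    // Pins the managed array and stores its address in the pointer-typed field.
    // The returned handle must stay alive while native code may read the pointer,
    // and must be freed afterwards to unpin the array.
    public static GCHandle SetTensorSplit(ref ContextParamsFixedSketch p, float[] split)
    {
        var handle = GCHandle.Alloc(split, GCHandleType.Pinned);
        p.tensor_split = handle.AddrOfPinnedObject();
        return handle;
    }
}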
