
Low level new loading system #64

Merged

Conversation

martindevans
Member

@martindevans martindevans commented Jul 25, 2023

Updated to use the new loading system in llama (llama_state). This new system has split model weights and contexts into two separate things, allowing one set of weights to be shared between many contexts.

This PR does not try to take advantage of this new capability in any way! It's just implementing the smallest change to the low level API. In future PRs I think we'll want to make quite extensive changes to the API (especially with state management).

This is built upon llama.cpp 468ea24; the necessary DLLs are NOT included in this PR. I don't know how to build them all, and assumed you'd rather add them yourself anyway (as a matter of security).

Edit: I've expanded the SafeLlamaModelHandle to include methods for all the things you can do with a llama_model. Most of these are not used yet, since we'll need to decide how to expose this to the higher level API.
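To illustrate the weights/contexts split in rough terms, here is a minimal conceptual sketch (this is not the actual LLamaSharp API; the class names are made up and the native calls named in the comments are only indicative):

using System;

// Conceptual sketch only: one weights object, many lightweight contexts sharing it.
public sealed class ModelWeights : IDisposable
{
    // Real code would call llama_load_model_from_file here and keep the returned handle.
    public ModelWeights(string path) { }

    // Real code would call llama_new_context_with_model, reusing the already-loaded weights.
    public InferenceContext CreateContext() => new InferenceContext(this);

    // Real code would call llama_free_model.
    public void Dispose() { }
}

public sealed class InferenceContext : IDisposable
{
    public ModelWeights Model { get; }
    internal InferenceContext(ModelWeights model) => Model = model;

    // Real code would call llama_free; only the context is destroyed, the weights stay loaded.
    public void Dispose() { }
}

// Usage: load the weights once, create as many contexts as needed.
// using var weights = new ModelWeights("path/to/model.bin");
// using var ctx1 = weights.CreateContext();
// using var ctx2 = weights.CreateContext();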


This change _only_ implements the low level API and makes no effort to update the LlamaSharp higher level abstraction.

It is built upon llama `b3f138d`; necessary DLLs are **not** included in this commit.
@martindevans martindevans mentioned this pull request Jul 25, 2023
@saddam213
Collaborator

This is all working perfectly and I am now building on top of this PR.

One thing I noticed is that GPU layers are not loaded anymore; I suspect it may be due to this commit ae178ab, which changed tensor_split from an array to a pointer.

Since you have done these native API upgrades, I was wondering if you could have a look, as you already have a branch open for this stuff :)

@saddam213
Collaborator

@martindevans I have created a PR for what I mentioned above; it fixes the GPU issues.

This now gets us up to date with llama.cpp @ commit edcc7ae

Thanks for the great work

@martindevans
Member Author

Thanks for testing that @saddam213, I'm not using GPU acceleration locally so I'd never have noticed that!

Would you mind creating a PR into this branch with just your fix for that issue? That way this PR remains usable in isolation.

I'll update the comment at the top of this thread to indicate that edcc7ae is supported.

@saddam213
Collaborator

@martindevans made a PR over on your branch for the GPU issue fix :)
martindevans#1

LLamaContextParams epsilon and tensor split changes
@martindevans
Member Author

Thanks, I've merged that so it's part of this PR.

@SignalRT
Collaborator

SignalRT commented Aug 2, 2023

On macOS this PR doesn't work. I will try to review how to marshal this float[] type:

Unhandled exception. System.TypeLoadException: Cannot marshal field 'tensor_split' of type 'LLama.Native.LLamaContextParams': Invalid managed/unmanaged type combination (Array fields must be paired with ByValArray).
at LLama.Native.NativeApi.llama_context_default_params()
at LLama.Utils.InitLLamaContextFromModelParams(ModelParams params) in
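(For context: this is the exception the runtime throws when a struct used with P/Invoke declares a bare managed array field. A minimal sketch of the shape that triggers it, with other fields omitted and the struct name changed:)

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct ContextParamsBrokenSketch
{
    // ... other fields omitted ...

    // A bare float[] field cannot be marshalled as part of a struct: the runtime requires
    // [MarshalAs(UnmanagedType.ByValArray, SizeConst = ...)], which is exactly what the
    // TypeLoadException above complains about.
    public float[] tensor_split;
}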

@martindevans
Member Author

@SignalRT did you make sure to update the native deps before testing? The updated binaries are not included in this PR!

Check the top message for a link to the appropriate commit in llama.cpp :)

@SignalRT
Collaborator

SignalRT commented Aug 2, 2023

It's tested with today's llama.cpp (well, there are several "today" llama.cpp versions...)

@martindevans
Member Author

In that case I think the extra TensorSplits fix will need to be updated. Rather than passing in a float[] it'll have to be a float* and the array will have to be fixed in place when created. If you want to make a PR (into this branch) for it I'll be happy to review it and merge it. If not I'll look at it myself within a few days.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

LLAMA_MAX_DEVICES was recently exposed on llama.cpp via LLAMA_API int llama_max_devices();
so we can at least create the right-sized fixed array now.
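A minimal import sketch for that (the library name "libllama" and the calling convention here are assumptions and would need to match the existing bindings):

using System.Runtime.InteropServices;

public static class NativeApiSketch
{
    // Hypothetical P/Invoke for the new llama.cpp export: LLAMA_API int llama_max_devices();
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    public static extern int llama_max_devices();
}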

Seems odd that this is only affecting a single OS; I really, really need a Mac to test stuff :p

EDIT: calling llama_context_default_params and checking the result would probably be a really good first unit test to add.

@martindevans
Member Author

I don't think it uses a fixed size array any more - looking at the cpp code it's just a pointer to some floats.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

Confused by this. I tried the latest build and it still works fine on Windows and Linux; llama_context_default_params returns nullptr for tensor_split, so I don't see why this exception is thrown on Mac. All other property values are correct, so it's not a sequence issue.

Need someone with more C++ knowledge to look at this for me, please.

@martindevans
Member Author

According to the docs:

.NET array: Not allowed without a [MarshalAs] attribute (if marshalling as a field)

So I think the current code is just incorrect, as the array currently has no marshalling attribute. I'm not sure why that works on Windows!

I've added MarshalAs(UnmanagedType.LPArray) which I think should marshal it as a pointer to the first element. @SignalRT would you mind testing it out on Mac again? Thanks.

@saddam213
Collaborator

saddam213 commented Aug 2, 2023

I did try a few attributes this morning; however, they all ended up setting tensor_split to a non-null value, and that caused all the properties after it to end up with wrong values (RoPE etc).

And we can't set SizeConst to 0 on any of the Marshal attributes.

I'll leave this one for the Mac guys, as I can't reproduce the error on the other OSes.

EDIT: Setting a breakpoint after the call to llama_context_default_params and checking the defaults is a quick way to see.

@saddam213
Collaborator

After further digging, using a float[] for float* should work fine.

In most cases, you don't need to use the MarshalAs attribute explicitly for a float[] when working with Platform Invoke (P/Invoke) in C#. The reason is that float[] is typically well-supported in the .NET Framework and has a matching unmanaged type in C/C++ (e.g., float[] in C# corresponds to float* in C/C++).

@SignalRT
Collaborator

SignalRT commented Aug 3, 2023

I made some tests after reviewing the C++ source code. In my opinion it should be something like:

   [MarshalAs(UnmanagedType.ByValArray, SizeConst = NativeApi.LLAMA_MAX_DEVICES)]

Where LLAMA_MAX_DEVICES should be calculated using llama_max_devices(), which I also imported in the C# native API. It seems to be something that AsakusaRinne added in llama.cpp some time ago:

ggerganov/llama.cpp#2253

This is working in some tests, but once that issue is solved I run into this: #67, either running in memory or on the GPU (Metal). So it seems I need to review this more deeply and check the llama.cpp changes since your PR.

@saddam213
Collaborator

Yeah, I think it's time I leave this to the experts.

It works on Windows and Linux, so at least I can continue development on my branch and merge in any changes made for the Mac guys later.

@martindevans
Member Author

It shouldn't be marshalled as a ByValArray; that's for fixed-size in-place arrays like this:

typedef struct
{
    float TensorSplits[3];
} DemoStruct;

That's how it worked up until 2 weeks ago, when ggerganov made this change.

Since then the definition in llama.cpp (here) looks like this:

const float * tensor_split; // how to split layers across multiple GPUs (size: LLAMA_MAX_DEVICES)

i.e. just a pointer to the first value, which is the default marshalling as @saddam213 pointed out.

If the MarshalAs(UnmanagedType.LPArray) didn't help on macOS then we can get rid of it; it was just a guess at what might fix the Mac-specific issue.
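To make the distinction concrete, here is a rough C# sketch of the two marshalling shapes being discussed (struct names are illustrative and all other fields are omitted):

using System;
using System.Runtime.InteropServices;

// Matches the old C layout (float tensor_split[3]): the values live inside the struct itself.
[StructLayout(LayoutKind.Sequential)]
public struct OldStyleParams
{
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)]
    public float[] tensor_split;
}

// Matches the new C layout (const float * tensor_split): just a pointer; null means no explicit split.
[StructLayout(LayoutKind.Sequential)]
public struct NewStyleParams
{
    public IntPtr tensor_split;
}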

@martindevans
Member Author

martindevans commented Aug 3, 2023

I just tested this using the bin-win-avx512-x64 binaries from version 468ea24 and it works on Windows.

@AsakusaRinne AsakusaRinne merged commit 1d29b24 into SciSharp:master Aug 5, 2023
@martindevans martindevans deleted the new_llama_state_loading_mechanism branch August 5, 2023 00:57
@martindevans
Member Author

@AsakusaRinne Reminder that this will break master until the correct DLLs are added. They are not included in this PR!

@AsakusaRinne
Collaborator

468ea24

Should the DLLs come from 468ea24? I could compile them today and add them.

@martindevans
Member Author

Yes, 468ea24 should be fine. You could probably use a more up-to-date version, but I haven't tried it.

@martindevans martindevans mentioned this pull request Aug 5, 2023
@SignalRT
Collaborator

SignalRT commented Aug 5, 2023

@martindevans
This is the definition that works for me on macOS, Linux and Windows:
public nint tensor_split;

It works with llama.cpp 5f631c2; I updated the binaries in my fork.
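For anyone wiring that up from managed code, here is a hedged sketch of how a pointer-typed tensor_split could be filled in by pinning a managed array (the struct name is a stand-in and the handle lifetime handling is illustrative only):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct ContextParamsFixedSketch
{
    public nint tensor_split;
}

public static class TensorSplitHelper
{
    // Pins the managed array and stores its address in the pointer-typed field.
    // The returned handle must stay alive while native code may read the pointer,
    // and must be freed afterwards to unpin the array.
    public static GCHandle SetTensorSplit(ref ContextParamsFixedSketch p, float[] split)
    {
        var handle = GCHandle.Alloc(split, GCHandleType.Pinned);
        p.tensor_split = handle.AddrOfPinnedObject();
        return handle;
    }
}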
