
Major llama.cpp API Change #185

Merged: 12 commits into SciSharp:master from wip_major_api_change, Oct 18, 2023

Conversation

@martindevans (Member) commented Sep 29, 2023

ggerganov/llama.cpp#3228 just got merged into llama.cpp with major changes to the API. See #184 for some notes on this.

This PR has basic changes to match the underlying API changes. Things to do:

  • LLamaContextParams and LLamaModelParams are now two separate structs on the native side; we should split IModelParams the same way (see the sketch after this list)
  • Fix context cloning (support for it was removed upstream)
  • Fix OutOfContext test
  • Add binaries for all platforms
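
For reference, a rough sketch of what that split might look like on the C# side. The interface and property names below are illustrative guesses mirroring the native split, not the final LLamaSharp API:

```csharp
// Hypothetical shape of the split, mirroring the native llama_model_params /
// llama_context_params separation. All names here are illustrative only.
public interface IModelParams
{
    string ModelPath { get; }      // path to the weights on disk
    int GpuLayerCount { get; }     // layers to offload to the GPU
    bool UseMemoryLock { get; }    // mlock the weights
    bool UseMemorymap { get; }     // mmap the weights
}

public interface IContextParams
{
    uint ContextSize { get; }      // n_ctx
    uint BatchSize { get; }        // n_batch
    uint Seed { get; }             // RNG seed for sampling
    int ThreadCount { get; }       // n_threads
}

// A context is created from a loaded model, so a "full" parameter set is just both:
public interface ILLamaParams : IModelParams, IContextParams { }
```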

@martindevans (Member Author)

@saddam213 I've made some changes which have broken some of the LLama.Web stuff at the moment; it looks like it's all in places that construct a single context by loading an entirely new set of model weights (probably legacy code?). I can probably throw in some workarounds if you want, or do you want to do something more complete and push it into this PR?

@saddam213 (Collaborator) commented Oct 3, 2023

> @saddam213 I've made some changes which have broken some of the LLama.Web stuff at the moment; it looks like it's all in places that construct a single context by loading an entirely new set of model weights (probably legacy code?). I can probably throw in some workarounds if you want, or do you want to do something more complete and push it into this PR?

No probs man, I'll get it done tonight; won't take long to get it up to date with your changes :)

The web project has fallen behind a bit and I need to get it up to date; I got lost in the world of stable diffusion the last few weeks, lol

@martindevans (Member Author)

@saddam213 I actually just pushed up some workarounds to at least get it building, but I left some todos in place.

@martindevans (Member Author)

> The web project has fallen behind a bit and I need to get it up to date; I got lost in the world of stable diffusion the last few weeks, lol

haha I know how that is. I bounce back and forth between LLMs and stable diffusion all the time 😆

@saddam213 (Collaborator) commented Oct 3, 2023

Wondering if we should merge the Web and WebApi into one project; there is a lot of code that will need to be shared to get them both up to date, and we can just use an appsettings config to decide what it starts up, API or Web (rough sketch below).

Or we could add a shared project, but that seems like overkill for examples.
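
A minimal sketch of that appsettings idea, assuming a hypothetical "HostMode" key; nothing below is existing LLama.Web or LLama.WebAPI code:

```csharp
// Program.cs sketch: choose between the MVC web UI and the bare API at startup,
// driven by a hypothetical "HostMode" value ("Web" or "Api") in appsettings.json.
var builder = WebApplication.CreateBuilder(args);
var hostMode = builder.Configuration["HostMode"] ?? "Web";

if (hostMode == "Api")
    builder.Services.AddControllers();           // API only
else
    builder.Services.AddControllersWithViews();  // full web UI

var app = builder.Build();

if (hostMode == "Api")
    app.MapControllers();
else
    app.MapDefaultControllerRoute();

app.Run();
```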

I can't even port the LLamaStack API or web because I went and made custom executors like an idiot :/

@saddam213 (Collaborator)

OK, so the changes needed to get the web up to date with version 0.5.1 are bigger than I thought. (I've done an untested preliminary change on my master branch; see the commits there.)

I think it's too big to mash into this PR, so I'll give it its own PR; then we can merge that in, and then merge this PR's changes on top.

That work for you?

@martindevans (Member Author)

> That work for you?

That's fine by me; there are workarounds in place in this PR so that it at least builds, and we can get your PR in right after this one is done.

I'll try to get this one finished and merged soon so we can get on with that :)

@martindevans (Member Author)

The only thing blocking this PR is proper testing on macOS.

@martindevans marked this pull request as ready for review on October 14, 2023, 17:59
@SignalRT (Collaborator)

I will be at home tomorrow. I could check this PR this week.

@SignalRT (Collaborator)

@martindevans

At least with today's llama.cpp binaries it fails in two tests, both related to the tokenizer.

This could be related to the latest changes in the tokenizer (ggerganov/llama.cpp#3538), but I cannot test it on Windows today.

These are the errors:

Xunit.Sdk.EqualException
Assert.Equal() Failure: Collections differ
↓ (pos 1)
Expected: [1, 450, 4996, 17354, 1701, ···]
Actual: [1, 1576, 4996, 17354, 1701, ···]
↑ (pos 1)
at LLama.Unittest.LLamaContextTests.Tokenize() in ...LLamaContextTests.cs:line 40

And

Xunit.Sdk.EqualException
Assert.Equal() Failure: Collections differ
↓ (pos 0)
Expected: [450, 4996, 17354, 1701, 29916]
Actual: [1576, 4996, 17354, 1701, 29916]
↑ (pos 0)
at LLama.Unittest.LLamaContextTests.TokenizeWithoutBOS() in ../LLamaContextTests.cs:line 48

@martindevans (Member Author)

Thanks for testing that. I'll re-trigger the GitHub action to recompile all of the binaries and fix the tokenizer tests.

Did you test with the included binaries at all to confirm that everything else works?

@SignalRT (Collaborator)

I haven't tested it with the included binaries yet; I used my daily-build binaries. Tomorrow I will test the binaries generated by the GitHub build and run some tests with additional projects and some Semantic Kernel examples.

@staviq commented Oct 17, 2023

> This could be related to the latest changes in the tokenizer (ggerganov/llama.cpp#3538), but I cannot test it on Windows today.

Hi, that PR only applies if you explicitly pass "special=true" to llama_tokenize; otherwise tokenizer behavior remains unchanged.

Using "special=true" is meant only for "internal" use and should not be exposed to the end user (the application user).

Enabling special token processing causes the following:

  • No space prefix is added to the input string. This is important when the prompt format expects "word" instead of " word", for example when the system prompt expects "system" instead of " system", which are different tokens.
  • Special (control) tokens such as <s>, <|im_start|>, etc. are explicitly matched first and directly converted to the appropriate token id.
  • The remaining string segments, between previously matched special tokens, are tokenized normally.
  • The results are merged into one chain of token ids.

This addition to the API is meant for processing things like prompt templates, and allows including BOS/EOS directly in the input string.

User text should be tokenized normally, with "special=false".


Clarification on "special" tokens:

The model vocabulary naturally has a tree-like structure: every normal token can be represented by two other, shorter tokens, down to single characters.

The tokenizer picks neighboring characters from the input string, finds a token that represents that substring, and repeats merging neighbors until no valid, longer token can be matched.

Special tokens break that rule: they are chosen to never be "reachable" by merging normal tokens, preventing the user from inserting control tokens in the prompt.

Previously, special tokens were simply tokenized as plaintext (<s> would become something like <, s, >), and various prompt formats weren't exactly working with llama.cpp.

This change allows using special (control) tokens directly in the input string, removing the need to fiddle with querying the vocabulary manually to make prompt templates work.
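
To make the two modes concrete, here is a rough C# sketch of calling the tokenizer with and without special-token processing. The extern declaration mirrors the llama.cpp C signature after the "special" flag was added; the actual LLamaSharp binding, its marshalling, and the helper names here are assumptions for illustration:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class TokenizeSketch
{
    // Mirrors llama.cpp's llama_tokenize after ggerganov/llama.cpp#3538 added the
    // trailing "special" flag. The real LLamaSharp declaration may differ in
    // naming and marshalling details.
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    private static extern int llama_tokenize(
        IntPtr model,                                   // const struct llama_model *
        byte[] text, int textLen,                       // UTF-8 input
        int[] tokens, int nMaxTokens,                   // output buffer
        [MarshalAs(UnmanagedType.U1)] bool addBos,
        [MarshalAs(UnmanagedType.U1)] bool special);

    public static int[] Tokenize(IntPtr model, string text, bool addBos, bool special)
    {
        var utf8 = Encoding.UTF8.GetBytes(text);
        var buffer = new int[utf8.Length + 8];          // a token never covers less than one byte
        var count = llama_tokenize(model, utf8, utf8.Length, buffer, buffer.Length, addBos, special);
        if (count < 0)
            throw new InvalidOperationException($"Token buffer too small, need {-count}");
        return buffer.AsSpan(0, count).ToArray();
    }
}

// Prompt template: special=true, so "<s>", "<|im_start|>" etc. are matched directly
// to their single control-token ids and no space prefix is added.
//   var template = TokenizeSketch.Tokenize(model, "<s><|im_start|>system\n...", addBos: false, special: true);
//
// User text: special=false, plain tokenization; a pasted "<s>" stays ordinary text.
//   var userTokens = TokenizeSketch.Tokenize(model, input, addBos: true, special: false);
```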

 - Fixed stateless executor out-of-context handling
 - Fixed token tests
@martindevans (Member Author)

@SignalRT I've added new binaries and fixed a few small things (e.g. I think out-of-context handling should be about 100x faster now).

@staviq thanks very much for the extra info! I think the problem in this case was that SignalRT was using newer binaries than this PR was built for, and I hadn't added that new parameter yet on the C# side! I'm not quite sure what happens in that case (the call was still successful, so I guess whatever junk was on the stack gets used). Once I added the parameter and passed special=false, the results went back to the usual values.

@SignalRT (Collaborator) commented Oct 18, 2023

@martindevans, all my tests are working. I made some changes to the SK packages, but I see that master is now on SK 1.0 beta, so it makes no sense to change it in this branch.

@martindevans (Member Author)

Thanks for testing that!

@martindevans merged commit d8434ea into SciSharp:master on Oct 18, 2023
4 checks passed
@martindevans deleted the wip_major_api_change branch on October 18, 2023, 19:50
@AsakusaRinne (Collaborator)

Great work! Is it a good time to publish a new release?

@AsakusaRinne (Collaborator)

And maybe I should add an auto-release to the CI; I'll do it this weekend.

@martindevans (Member Author)

> Great work! Is it a good time to publish a new release?

I think so, yeah. This has made a few breaking changes to the API, but nothing massive. Good time to push out a new version 👍

@AsakusaRinne (Collaborator)

> I think so, yeah. This has made a few breaking changes to the API, but nothing massive. Good time to push out a new version 👍

I'm making the auto-release CI now and there's something confusing me. There are two supported Mac backends, CPU and Metal, but only one libllama.dylib is included in LLama/runtimes. Is it used for both CPU and Metal?

@martindevans (Member Author)

@AsakusaRinne It was removed in this PR: #163

From the discussion there it looks like one binary supports CPU and Metal inference (but I don't have a Mac so I can't confirm that).

@SignalRT (Collaborator)

@AsakusaRinne, one binary supports CPU and GPU.

@AsakusaRinne (Collaborator)

@martindevans @SignalRT Thank you for your help! I've opened PR #204 for the auto-release CI. 😊
