
Update llama.cpp and move core processing to native code #12

Open · wants to merge 6 commits into main

Conversation


@dsd dsd commented Jul 1, 2023

Thanks for taking the initiative on Sherpa; I was also curious about the combination of low-end devices, Flutter, and open-source AI, and it was nice to see that you had already been working on this.

It wasn't working on my phone (Samsung Galaxy S10) due to a crash in llama.cpp and memory exhaustion, but with the changes made here it works rather well with the new 3B models.

dsd added 3 commits June 30, 2023 22:03
Set the main default prompt to chat-with-bob from llama.cpp.
This seems to produce much more useful conversations with llama-7b and
orca-mini-3b models that I have tested.

Also make the reverse prompt consistently "User:" in both default prompt
options, and set the default reverse prompt detection to the same value.

llama.cpp doesn't build for ARM32 because it calls into 64-bit NEON
intrinsics. Not worth fixing that; let's just not offer this app on
ARM32.

Rather than using prebuilt libraries, build the llama.cpp git submodule
during the regular app build process.

The library will now be installed in a standard location, which simplifies
the logic needed to load it at runtime; there is no need to ship it as an
asset.

This works on Android, and also enables the app to build and run on Linux.
The Windows build is untested.

One unfortunate side effect is that when building the app in Flutter's
debug mode, the llama library is built unoptimized and runs very
slowly, to the point where you might suspect the app is broken.
Release mode, however, seems as fast as before.
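
For illustration, with the library installed in the standard per-platform location, the runtime loading logic can reduce to a plain open-by-name along these lines (library name as used in this PR; the exact loader code in the branch may differ):

import 'dart:ffi';
import 'dart:io';

// Sketch of the simplified loader: with the library built into the
// standard native-lib location, opening it by name is enough; nothing
// has to be unpacked from assets first.
DynamicLibrary openLlamaSherpa() {
  if (Platform.isWindows) {
    return DynamicLibrary.open('llamasherpa.dll');
  }
  // Android and Linux: the dynamic linker searches the app's
  // native-library paths.
  return DynamicLibrary.open('libllamasherpa.so');
}
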
Update llama.cpp to the latest version as part of an effort to make this
app usable on my Samsung Galaxy S10 smartphone.

The newer llama.cpp includes a fix for a double-close bug that was
causing the app to crash immediately upon starting the AI conversation
(llama.cpp commit 47f61aaa5f76d04).

It also adds support for 3B models, which are considerably smaller. The
llama-7B models were causing Android's low-memory killer to terminate
Sherpa after just a few words of conversation, whereas new models such as
orca-mini-3b.ggmlv3.q4_0.bin work on this device without quickly exhausting
all available memory.

llama.cpp's model compatibility has changed with this update, so ggml
files that were working in the previous version are unlikely to work now;
they need to be converted. However, the orca-mini models are already in
the new format and work out of the box.

llama.cpp's API has changed in this update. Rather than rework the Dart
code, I opted to move this logic into C++, using llama.cpp's example code
as a base. It lives in a new "llamasherpa" library which calls into
llama.cpp. Since lots of data is passed around in large arrays, I expect
running this in Dart had quite some overhead, and this native approach
should perform considerably faster.

This eliminates the need for Sherpa's Dart code to call llama.cpp directly,
so there's no need to separately maintain a modified version of llama.cpp
and we can use the official upstream.
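
As a rough illustration of the new call path (only the symbol name llamasherpa_start is taken from this branch; the parameter list below is assumed, not the real generated binding in generated_bindings_llamasherpa.dart):

import 'dart:ffi';
import 'package:ffi/ffi.dart';

// Hypothetical signature, for illustration only.
typedef _StartC = Void Function(Pointer<Utf8> modelPath, Pointer<Utf8> prompt);
typedef _StartDart = void Function(Pointer<Utf8> modelPath, Pointer<Utf8> prompt);

void startConversation(DynamicLibrary lib, String modelPath, String prompt) {
  final start = lib.lookupFunction<_StartC, _StartDart>('llamasherpa_start');
  final mp = modelPath.toNativeUtf8();
  final pp = prompt.toNativeUtf8();
  try {
    // The whole tokenize/evaluate/sample loop runs inside llamasherpa
    // (C++); large token arrays never cross the FFI boundary.
    start(mp, pp);
  } finally {
    malloc.free(mp);
    malloc.free(pp);
  }
}
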
@moh21amed

Can it run a 3B model on mobile with 3 GB of RAM?

dsd added 2 commits July 3, 2023 07:41
On first run on my Android device, the pre-prompt is empty; it does
not get initialized to any value.

This is because SharedPreferences performs asynchronous disk I/O,
and initDefaultPrompts() uses a different SharedPreferences instance from
getPrePrompts(). There's no guarantee that a preferences update on one
instance will become immediately available in another.

Tweak the logic to not depend on synchronization between two
SharedPreferences instances.
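
A minimal sketch of the shape of this fix (the key name and defaults parameter are assumed, not the actual Sherpa code):

import 'package:shared_preferences/shared_preferences.dart';

// Do the defaults-initialization and the read through the same
// SharedPreferences instance, so the write is visible to the read.
Future<List<String>> getPrePrompts(List<String> defaults) async {
  final prefs = await SharedPreferences.getInstance();
  var prompts = prefs.getStringList('prePrompts'); // key name assumed
  if (prompts == null || prompts.isEmpty) {
    prompts = defaults;
    await prefs.setStringList('prePrompts', prompts);
  }
  return prompts;
}
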
The llama.cpp logic is built around the prompt ending with the
reverse-prompt and the actual user input being passed separately.

Adjust Sherpa to do the same, rather than appending the first line of
user input to the prompt.
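
Illustratively (names assumed; not the actual Sherpa code), the request is now assembled like this instead of concatenating the first user line onto the prompt:

// The prompt ends with the reverse prompt; the user's first line is
// passed along separately.
({String prompt, String input}) buildRequest(String prePrompt, String firstUserLine) {
  const reversePrompt = 'User:'; // generation stops when this reappears
  return (prompt: '$prePrompt\n$reversePrompt', input: firstUserLine);
}
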
@dsd dsd commented Jul 3, 2023

Can it run a 3B model on mobile with 3 GB of RAM?

Not sure; if you want to try it, there is an APK here.
I suspect it won't work: the 3B files I have seen are around 2 GB, and the base OS likely uses at least 1 GB of RAM...

@dsd dsd commented Jul 4, 2023

Looks like you have the wrong llama.cpp available under src/.
Did you initialize it from git submodules?
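If the llama.cpp checkout under src/ is empty or stale after cloning, running git submodule update --init from the repository root should fetch the pinned revision.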

@windmaple

Somehow my src folder was messed up. I downloaded your src zip file and it works now.
Great work, btw!

@dsd dsd commented Jul 4, 2023

Also, debug mode is slooow - remember to run with --release; it will be much faster :)

@windmaple

I think there is something missing for Mac:

flutter: llamasherpa loaded
flutter: MessageNewLineFromIsolate : [isolate 09:02:55] llamasherpa loaded
flutter: filePath : /Volumes/Macintosh HD/Users/wind-test/Desktop/orca-mini-3b.ggmlv3.q4_1.bin
[ERROR:flutter/runtime/dart_isolate.cc(1097)] Unhandled exception:
Invalid argument(s): Failed to lookup symbol 'llamasherpa_start': dlsym(RTLD_DEFAULT, llamasherpa_start): symbol not found
#0 DynamicLibrary.lookup (dart:ffi-patch/ffi_dynamic_library_patch.dart:33:70)
#1 NativeLibrary._llamasherpa_startPtr
generated_bindings_llamasherpa.dart:41
#2 NativeLibrary._llamasherpa_startPtr (package:sherpa/generated_bindings_llamasherpa.dart)
generated_bindings_llamasherpa.dart:1
#3 NativeLibrary._llamasherpa_start
generated_bindings_llamasherpa.dart:42
#4 NativeLibrary._llamasherpa_start (package:sherpa/generated_bindings_llamasherpa.dart)
generated_bindings_llamasherpa.dart:1
#5 NativeLibrary.llamasherpa_start
generated_bindings_llamasherpa.dart:27
#6 Lib.binaryIsolate

I did not see a similar issue on Linux.

@dsd dsd commented Jul 5, 2023

Yeah, I don't have any macOS/iOS experience or devices. Do the official Sherpa versions work there?
If you want to try, you could look at the instructions here under "FFI on macOS and iOS".
You will need to build both llamasherpa and llama.cpp as mentioned, via Xcode/Runner.

Then as for this bit:

nativeApiLib = Platform.isMacOS || Platform.isIOS ? DynamicLibrary.process()

The equivalent of this bit is already handled, so you just have to include the C++ code in the build and then it might work.
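
Concretely, that loader split looks something like this (illustrative sketch; library name as used in this PR):

import 'dart:ffi';
import 'dart:io';

// On macOS/iOS the native code is linked into the runner, so symbols are
// resolved from the process image; elsewhere the shared library is
// opened by name.
final DynamicLibrary nativeApiLib = Platform.isMacOS || Platform.isIOS
    ? DynamicLibrary.process()
    : DynamicLibrary.open('libllamasherpa.so');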

@windmaple

Yeah, the original version works on Mac, although it crashes if you run the model twice in a row, which is a separate issue.

@NandhaKishorM NandhaKishorM commented Jul 23, 2023

On Windows it shows an error that the DLL library is not found. The file is "llamasherpa.dll".

@kmn1024 kmn1024 commented Jan 15, 2024

Does anyone have benchmarks (tokens/second) for running a 3B or 7B model on any low-end device?
@dsd mentioned 3B running "rather well" on an S10 in #12 (comment); how well is that =)
@windmaple also seems to have found success on an unknown device in #12 (comment).
