Merge pull request #22 from AsakusaRinne/v0.11.0_update
docs: refactor documentation
AsakusaRinne authored Mar 31, 2024
2 parents 156f369 + b944445 commit ee6bccc
Showing 195 changed files with 11,689 additions and 5,711 deletions.
Binary file added Assets/LLamaSharp-Integrations.png
Binary file added Assets/LLamaSharp-Integrations.vsdx
Binary file not shown.
Binary file added Assets/llava_demo.gif
1 change: 1 addition & 0 deletions LLama.Examples/Examples/ChatChineseGB2312.cs
@@ -3,6 +3,7 @@

namespace LLama.Examples.Examples;

// This example shows how to deal with Chinese input with gb2312 encoding.
public class ChatChineseGB2312
{
private static string ConvertEncoding(string input, Encoding original, Encoding target)
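
For readers who want the gist without expanding the diff, the sketch below shows the general idea only. It assumes the GB2312 code page is registered via the `System.Text.Encoding.CodePages` package; the helper mirrors the `ConvertEncoding` signature shown above, and the rest is illustrative rather than the full example.

```cs
using System.Text;

// Sketch only: make GB2312 available on .NET Core and re-encode console input
// before handing it to the model. Requires the System.Text.Encoding.CodePages package.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.InputEncoding = Encoding.GetEncoding("gb2312");

// Re-encode a string from one encoding to another (same signature as in the example).
static string ConvertEncoding(string input, Encoding original, Encoding target)
    => target.GetString(Encoding.Convert(original, target, original.GetBytes(input)));

string userInput = ConvertEncoding(Console.ReadLine() ?? "", Encoding.GetEncoding("gb2312"), Encoding.UTF8);
```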
2 changes: 2 additions & 0 deletions LLama.Examples/Examples/ChatSessionStripRoleName.cs
@@ -2,6 +2,8 @@

namespace LLama.Examples.Examples;

// When using chatsession, it's a common case that you want to strip the role names
// rather than display them. This example shows how to use transforms to strip them.
public class ChatSessionStripRoleName
{
public static async Task Run()
1 change: 1 addition & 0 deletions LLama.Examples/Examples/InstructModeExecute.cs
@@ -2,6 +2,7 @@

namespace LLama.Examples.Examples
{
// This example shows how to use InstructExecutor to generate the response.
public class InstructModeExecute
{
public static async Task Run()
1 change: 1 addition & 0 deletions LLama.Examples/Examples/InteractiveModeExecute.cs
@@ -2,6 +2,7 @@

namespace LLama.Examples.Examples
{
// This is an example which shows how to chat with LLM with InteractiveExecutor.
public class InteractiveModeExecute
{
public static async Task Run()
2 changes: 2 additions & 0 deletions LLama.Examples/Examples/LlavaInteractiveModeExecute.cs
@@ -5,6 +5,8 @@

namespace LLama.Examples.Examples
{
// This example shows how to chat with LLaVA model with both image and text as input.
// It uses the interactive executor to inference.
public class LlavaInteractiveModeExecute
{
public static async Task Run()
1 change: 1 addition & 0 deletions LLama.Examples/Examples/LoadAndSaveState.cs
@@ -2,6 +2,7 @@

namespace LLama.Examples.Examples
{
// This example shows how to save/load state of the executor.
public class LoadAndSaveState
{
public static async Task Run()
156 changes: 85 additions & 71 deletions README.md
@@ -11,84 +11,109 @@
[![LLamaSharp Badge](https://img.shields.io/nuget/v/LLamaSharp.Backend.OpenCL?label=LLamaSharp.Backend.OpenCL)](https://www.nuget.org/packages/LLamaSharp.Backend.OpenCL)


**The C#/.NET binding of [llama.cpp](https://github.com/ggerganov/llama.cpp). It provides higher-level APIs to run inference with the LLaMA models and deploy them on local devices with C#/.NET. It works on Windows, Linux and Mac without the need to compile llama.cpp yourself. Even without a GPU, or with insufficient GPU memory, you can still use LLaMA models! 🤗**
**LLamaSharp is a cross-platform library to run 🦙LLaMA/LLaVA models (and others) on local devices. Based on [llama.cpp](https://github.com/ggerganov/llama.cpp), inference with LLamaSharp is efficient on both CPU and GPU. With its higher-level APIs and RAG support, it's convenient to deploy LLMs (Large Language Models) in your applications with LLamaSharp.**

**Please star the repo to show your support for this project!🤗**

---

**Furthermore, it provides integrations with other projects such as [semantic-kernel](https://github.com/microsoft/semantic-kernel), [kernel-memory](https://github.com/microsoft/kernel-memory) and [BotSharp](https://github.com/SciSharp/BotSharp) to support higher-level applications.**

**Discussions about the roadmap to v1.0.0: [#287](https://github.com/SciSharp/LLamaSharp/issues/287)**

<details>
<summary>Table of Contents</summary>
<ul>
<li><a href="#Documentation">Documentation</a></li>
<li><a href="#Examples">Examples</a></li>
<li><a href="#Installation">Installation</a></li>
<li>
<a href="#(Quick Start)">Quick Start</a>
<ul>
<li><a href="#Model Inference and Chat Session">Model Inference and Chat Session</a></li>
<li><a href="#Quantization">Quantization</a></li>
<li><a href="#Web API">Web API</a></li>
</ul>
</li>
<li><a href="#Features">Features</a></li>
<li><a href="#Console Demo">Console Demo</a></li>
<li><a href="#Toolkits & Examples">Toolkits & Examples</a></li>
<li><a href="#Get started">Get started</a></li>
<li><a href="#FAQ">FAQ</a></li>
<li><a href="#Contributing">Contributing</a></li>
<li><a href="#Contact us">Contact us</a></li>
<li>
<a href="#Appendix">Appendix</a>
<ul>
<li><a href="#LLamaSharp and llama.cpp versions">LLamaSharp and llama.cpp versions</a></li>
</ul>
</li>
<li><a href="#Join the community">Join the community</a></li>
<li><a href="#Map of LLamaSharp and llama.cpp versions">Map of LLamaSharp and llama.cpp versions</a></li>
</ul>
</details>


## Documentation

- [Quick start](https://scisharp.github.io/LLamaSharp/latest/GetStarted/)
- [Tricks for FAQ](https://scisharp.github.io/LLamaSharp/latest/Tricks/)
- [Full documentation](https://scisharp.github.io/LLamaSharp/latest/)
- [API reference](https://scisharp.github.io/LLamaSharp/latest/xmldocs/)

## Examples

## Console Demo

<table class="center">
<tr style="line-height: 0">
<td width=50% height=30 style="border: none; text-align: center">LLaMA</td>
<td width=50% height=30 style="border: none; text-align: center">LLaVA</td>
</tr>
<tr>
<td width=25% style="border: none"><img src="Assets/console_demo.gif" style="width:100%"></td>
<td width=25% style="border: none"><img src="Assets/llava_demo.gif" style="width:100%"></td>
</tr>
</table>


## Toolkits & Examples

There are integrations with the following libraries, making it easier to develop your app. Integrations for semantic-kernel and kernel-memory are developed in the LLamaSharp repository, while others are developed in their own repositories.

- [semantic-kernel](https://github.com/microsoft/semantic-kernel): an SDK that integrates LLM services like OpenAI, Azure OpenAI, and Hugging Face.
- [kernel-memory](https://github.com/microsoft/kernel-memory): a multi-modal AI Service specialized in the efficient indexing of datasets through custom continuous data hybrid pipelines, with support for RAG ([Retrieval Augmented Generation](https://en.wikipedia.org/wiki/Prompt_engineering#Retrieval-augmented_generation)), synthetic memory, prompt engineering, and custom semantic memory processing.
- [BotSharp](https://github.com/SciSharp/BotSharp): an open-source machine learning framework for building AI bot platforms.
- [Langchain](https://github.com/tryAGI/LangChain): a framework for developing applications powered by language models.


The following examples show how to build apps with LLamaSharp.

- [Official Console Examples](./LLama.Examples/)
- [Unity Demo](https://github.com/eublefar/LLAMASharpUnityDemo)
- [LLamaStack (with WPF and Web support)](https://github.com/saddam213/LLamaStack)
- [LLamaStack (with WPF and Web demo)](https://github.com/saddam213/LLamaStack)
- [Blazor Demo (with Model Explorer)](https://github.com/alexhiggins732/BLlamaSharp.ChatGpt.Blazor)
- [ASP.NET Demo](./LLama.Web/)

![LLamaSharp-Integrations](./Assets/LLamaSharp-Integrations.png)

## Installation

1. Install [`LLamaSharp`](https://www.nuget.org/packages/LLamaSharp) package in NuGet:
## Get started

### Installation

To gain high performance, LLamaSharp interacts with a native library compiled from C++, which we call the `backend`. We provide backend packages for Windows, Linux and macOS with CPU, CUDA, Metal and OpenCL support. You **don't** need to deal with the C++ code yourself; just install the backend packages.

If no published backend matches your device, please open an issue to let us know. If compiling C++ code is not difficult for you, you can also follow [this guide](./docs/ContributingGuide.md) to compile a backend yourself and run LLamaSharp with it.

1. Install the [LLamaSharp](https://www.nuget.org/packages/LLamaSharp) package from NuGet:

```
PM> Install-Package LLamaSharp
```

2. Install **one** of these backends:
2. Install one or more of these backends, or use a self-compiled backend.

- [`LLamaSharp.Backend.Cpu`](https://www.nuget.org/packages/LLamaSharp.Backend.Cpu): Pure CPU for Windows & Linux. Metal for Mac.
- [`LLamaSharp.Backend.Cuda11`](https://www.nuget.org/packages/LLamaSharp.Backend.Cuda11): CUDA11 for Windows and Linux
- [`LLamaSharp.Backend.Cuda12`](https://www.nuget.org/packages/LLamaSharp.Backend.Cuda12): CUDA 12 for Windows and Linux
- [`LLamaSharp.Backend.OpenCL`](https://www.nuget.org/packages/LLamaSharp.Backend.OpenCL): OpenCL for Windows and Linux
- If none of these backends is suitable, you can compile [llama.cpp](https://github.com/ggerganov/llama.cpp) yourself. In this case, please **DO NOT** install the backend packages! Instead, add your DLL to your project and ensure it will be copied to the output directory when compiling your project. If you do this, you must use exactly the correct llama.cpp commit; refer to the version table further down.
- [`LLamaSharp.Backend.Cpu`](https://www.nuget.org/packages/LLamaSharp.Backend.Cpu): Pure CPU for Windows, Linux & macOS. Metal (GPU) support for macOS.
- [`LLamaSharp.Backend.Cuda11`](https://www.nuget.org/packages/LLamaSharp.Backend.Cuda11): CUDA 11 for Windows & Linux.
- [`LLamaSharp.Backend.Cuda12`](https://www.nuget.org/packages/LLamaSharp.Backend.Cuda12): CUDA 12 for Windows & Linux.
- [`LLamaSharp.Backend.OpenCL`](https://www.nuget.org/packages/LLamaSharp.Backend.OpenCL): OpenCL for Windows & Linux.

3. (optional) For [Microsoft semantic-kernel](https://github.com/microsoft/semantic-kernel) integration, install the [LLamaSharp.semantic-kernel](https://www.nuget.org/packages/LLamaSharp.semantic-kernel) package.
4. (optional) For [Microsoft kernel-memory](https://github.com/microsoft/kernel-memory) integration, install the [LLamaSharp.kernel-memory](https://www.nuget.org/packages/LLamaSharp.kernel-memory) package (this package currently only supports `net6.0`).
4. (optional) To enable RAG support, install the [LLamaSharp.kernel-memory](https://www.nuget.org/packages/LLamaSharp.kernel-memory) package (this package currently only supports `net6.0` or higher), which is based on the [Microsoft kernel-memory](https://github.com/microsoft/kernel-memory) integration.

### Tips for choosing a version
### Model preparation

Llama.cpp is a fast-moving project with frequent breaking changes, so breaking changes are also expected frequently in LLamaSharp. LLamaSharp follows [semantic versioning](https://semver.org/) and will not introduce breaking API changes on patch versions.
There are two popular formats for LLM model files: the PyTorch format (.pth) and the Hugging Face format (.bin). LLamaSharp uses `GGUF` format files, which can be converted from these two formats. To get a `GGUF` file, there are two options:

It is suggested to update to the latest patch version as soon as it is released, and to update to new major versions as soon as possible.
1. Search for the model name + 'gguf' on [Hugging Face](https://huggingface.co); you will find many model files that have already been converted to GGUF format. Please pay attention to their publishing time, because some older files may only work with older versions of LLamaSharp.

## Quick Start
2. Convert the PyTorch or Hugging Face weights to GGUF format yourself. Please follow the instructions in [this part of the llama.cpp README](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize) to convert them with the Python scripts.

#### Model Inference and Chat Session
Generally, we recommend downloading quantized models rather than fp16 ones, because quantization significantly reduces the required memory while only slightly impacting generation quality.

LLamaSharp provides two ways to run inference: `LLamaExecutor` and `ChatSession`. The chat session is a higher-level wrapper around the executor and the model. Here's a simple example of using a chat session.

### Example of LLaMA chat session

Here is a simple example of chatting with an LLM-based bot in LLamaSharp. Please replace the model path with your own.

```cs
using LLama.Common;
@@ -140,45 +165,36 @@ while (userInput != "exit")
}
```
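
The full quick-start code is collapsed in this diff view. Below is a rough, minimal sketch of what a chat session looks like with these APIs; the model path, prompts and parameter values are placeholders, and names may differ slightly between LLamaSharp versions.

```cs
using LLama;
using LLama.Common;

// Load the model weights and create a context (the path is a placeholder).
var parameters = new ModelParams(@"<Your Model Path>")
{
    ContextSize = 1024,
    GpuLayerCount = 5 // How many layers to offload to the GPU; adjust to your GPU memory.
};
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// Seed the chat history with a system prompt.
var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, "You are a helpful assistant named Bob.");

var session = new ChatSession(executor, chatHistory);

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,                           // Upper bound on the generated tokens.
    AntiPrompts = new List<string> { "User:" } // Stop generation when the anti-prompt appears.
};

Console.Write("User: ");
var userInput = Console.ReadLine() ?? "";
while (userInput != "exit")
{
    // Stream the response as it is generated.
    await foreach (var text in session.ChatAsync(
        new ChatHistory.Message(AuthorRole.User, userInput), inferenceParams))
    {
        Console.Write(text);
    }
    Console.Write("\nUser: ");
    userInput = Console.ReadLine() ?? "";
}
```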

For more usage, please refer to [Examples](./LLama.Examples).
For more examples, please refer to [LLamaSharp.Examples](./LLama.Examples).

#### Web API

We provide [an integration with ASP.NET core](./LLama.WebAPI) and a [web app demo](./LLama.Web). Since we are short of hands, if you're familiar with ASP.NET Core, we would appreciate your help upgrading the Web API integration.
## FAQ

## Features
#### Why is the GPU not used when I have installed CUDA?

---
1. If you are using backend packages, please make sure you have installed the CUDA backend package which matches the CUDA version of your device. Please note that before LLamaSharp v0.10.0, only one backend package should be installed.
2. Add `NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info)` to the very beginning of your code. The log will show which native library file is loaded. If the CPU library is loaded, please try to compile the native library yourself and open an issue about it. If the CUDA library is loaded, please check whether `GpuLayerCount > 0` when loading the model weights. A minimal sketch of both checks is shown below.
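
The sketch below assumes a console app; the model path and layer count are placeholders, and it only illustrates where the two calls go.

```cs
using LLama;
using LLama.Common;
using LLama.Native;

// Must run before anything else touches the native library, so the loading log
// shows which backend (CPU/CUDA/...) is actually picked up.
NativeLibraryConfig.Instance.WithLogs(LLamaLogLevel.Info);

// Offload layers to the GPU; 0 means CPU-only. The path is a placeholder.
var parameters = new ModelParams(@"<Your Model Path>")
{
    GpuLayerCount = 20
};
using var model = LLamaWeights.LoadFromFile(parameters);
```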

✅: completed. ⚠️: outdated for latest release but will be updated. 🔳: not completed
#### Why is the inference slow?

---
First of all, due to the large size of LLM models, generating output requires more time than with other models, especially when you are using models larger than 30B.

✅ LLaMa model inference<br />
✅ Embeddings generation, tokenization and detokenization<br />
✅ Chat session<br />
✅ Quantization<br />
✅ Grammar<br />
✅ State saving and loading<br />
✅ BotSharp Integration [Online Demo](https://victorious-moss-007e11310.4.azurestaticapps.net/)<br />
✅ ASP.NET core Integration<br />
✅ Semantic-kernel Integration<br />
🔳 Fine-tune<br />
✅ Local document search (enabled by kernel-memory)<br />
🔳 MAUI Integration<br />
To see whether it is a LLamaSharp performance issue, please follow the two tips below.

## Console Demo
1. If you are using CUDA, Metal or OpenCL, please set `GpuLayerCount` as large as possible.
2. If it's still slower than you expect, please try to run the same model with the same settings in the [llama.cpp examples](https://github.com/ggerganov/llama.cpp/tree/master/examples). If llama.cpp significantly outperforms LLamaSharp, it's likely a LLamaSharp bug; please report it to us.

![demo-console](Assets/console_demo.gif)

## FAQ
#### Why does the program crash before any output is generated?

Generally, there are two possible cases for this problem:

1. GPU out of memory: Please try setting `n_gpu_layers` to a smaller number.
2. Unsupported model: `llama.cpp` is under quick development and often has breaking changes. Please check the release date of the model and find a suitable version of LLamaSharp to install, or generate `gguf` format weights from original weights yourself.
3. Cannot load native library:
- Ensure you have installed one of the backend packages.
- Run `NativeLibraryConfig.WithLogs()` at the very beginning of your code to print more information.
4. Models in GGUF format are compatible with LLamaSharp. It's a good idea to search for [`gguf` on huggingface](https://huggingface.co/models?search=gguf) to find a model. Another choice is to generate a GGUF format file yourself; please refer to [convert.py](https://github.com/ggerganov/llama.cpp/blob/master/convert.py) for more information.
1. The native library (backend) you are using is not compatible with the LLamaSharp version. If you compiled the native library yourself, please make sure you have checked out llama.cpp at the commit corresponding to your LLamaSharp version, which can be found at the bottom of the README.
2. The model file you are using is not compatible with the backend. If you are using a GGUF file downloaded from Hugging Face, please check its publishing time.

#### Why is my model generating output infinitely?

Please set an anti-prompt or a max length when executing the inference.
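
For example, a sketch of bounding the generation with the inference parameters used elsewhere in this README (the anti-prompt string is illustrative):

```cs
using LLama.Common;

// Stop generation when the anti-prompt appears, or after MaxTokens tokens at most.
var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "User:" }
};
```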


## Contributing
@@ -193,15 +209,13 @@ You can also do one of the following to help us make LLamaSharp better:
- Help to develop Web API and UI integration.
- Just open an issue about the problem you encountered!

## Contact us
## Join the community

Join our chat on [Discord](https://discord.gg/7wNVU65ZDY) (please contact Rinne to join the dev channel if you want to be a contributor).

Join [QQ group](http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=sN9VVMwbWjs5L0ATpizKKxOcZdEPMrp8&authKey=RLDw41bLTrEyEgZZi%2FzT4pYk%2BwmEFgFcrhs8ZbkiVY7a4JFckzJefaYNW6Lk4yPX&noverify=0&group_code=985366726)

## Appendix

### LLamaSharp and llama.cpp versions
## Map of LLamaSharp and llama.cpp versions
If you want to compile llama.cpp yourself, you **must** use the exact commit ID listed for each version.

| LLamaSharp | Verified Model Resources | llama.cpp commit id |
20 changes: 6 additions & 14 deletions docs/Architecture.md
@@ -2,22 +2,14 @@

## Architecture of main functions

The figure below shows the core framework structure, which is separated into four levels.
The figure below shows the core framework structure of LLamaSharp.

- **LLamaContext**: The holder of a model which directly interacts with the native library and provides some basic APIs such as tokenization and embedding. Currently it includes three classes: `LLamaContext`, `LLamaEmbedder` and `LLamaQuantizer`.
- **LLamaExecutors**: Executors which define the way to run the LLama model. It provides text-to-text APIs to make it easy to use. Currently we provide three kinds of executors: `InteractiveExecutor`, `InstructExecutor` and `StatelessExecutor`.
- **Native APIs**: LLamaSharp calls the exported C APIs to load and run the model. The APIs defined in LLamaSharp specifically for calling the C APIs are named `Native APIs`. We have made all the native APIs public under the namespace `LLama.Native`. However, it's strongly recommended not to use them unless you know what you are doing.
- **LLamaWeights**: The holder of the model weights.
- **LLamaContext**: A context which directly interacts with the native library and provides some basic APIs such as tokenization and embedding. It makes use of `LLamaWeights`.
- **LLamaExecutors**: Executors which define the way to run the LLaMA model. They provide text-to-text and image-to-text APIs to make them easy to use. Currently we provide four kinds of executors: `InteractiveExecutor`, `InstructExecutor`, `StatelessExecutor` and `BatchedExecutor`.
- **ChatSession**: A wrapper for `InteractiveExecutor` and `LLamaContext`, which supports interactive tasks and saving/re-loading sessions. It also provides a flexible way to customize text processing via `IHistoryTransform`, `ITextTransform` and `ITextStreamTransform`.
- **High-level Applications**: Some applications that provide higher-level integration. For example, [BotSharp](https://github.com/SciSharp/BotSharp) provides integration for vector search, Chatbot UI and Web APIs. [semantic-kernel](https://github.com/microsoft/semantic-kernel) provides various APIs for operations related to LLMs. If you've made an integration, please tell us and add it to the doc!
- **Integrations**: Integrations with other libraries to expand the application of LLamaSharp. For example, if you want to do RAG ([Retrieval Augmented Generation](https://en.wikipedia.org/wiki/Prompt_engineering#Retrieval-augmented_generation)), kernel-memory integration is a good option for you.


![structure_image](media/structure.jpg)
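
As a compact sketch of how these layers compose in code (the model path is a placeholder, and this is illustrative rather than part of the documented example):

```cs
using LLama;
using LLama.Common;

var parameters = new ModelParams(@"<Your Model Path>");
using var weights = LLamaWeights.LoadFromFile(parameters);  // LLamaWeights: holds the model weights
using var context = weights.CreateContext(parameters);      // LLamaContext: tokenization, embedding, state
var executor = new InteractiveExecutor(context);            // LLamaExecutor: text-to-text inference
var session = new ChatSession(executor);                    // ChatSession: history and transforms on top
```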

## Recommended Use

Since `LLamaContext` interacts with the native library, it's not recommended to use its methods directly unless you know what you are doing. The same applies to `NativeApi`, which is not included in the architecture figure above.

`ChatSession` is recommended when you want to build an application similar to ChatGPT or a chatbot, because it works best with `InteractiveExecutor`. Though other executors may also be passed as a parameter to initialize a `ChatSession`, this is not encouraged if you are new to LLamaSharp and LLMs.

High-level applications, such as BotSharp, are intended to be used when you want to concentrate on the parts not related to the LLM. For example, if you want to deploy a chatbot to help you remember your schedule, using BotSharp may be a good choice.

Note that the APIs of the high-level applications may not be stable yet. Please take this into account when using them.