<a href="https://colab.research.google.com/github/weedge/doraemon-nb/blob/main/gemma_inference_with_CPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [22]:
!lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd i

# gemma.cpp

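Developer notes

## Motivation: a minimalist C++ language model runtime for research and experimentation

In the past, neural network inference resembled a simple, opaque, stateless function with a single input and a single output. Foundation model runtimes, by contrast, are better viewed as systems with multiple forms of state, subsystems, and heterogeneous inputs and outputs. They are often integrated with other systems that have their own resources (e.g. RAG and tools) and may interact with an external environment. They have become compute engines for embedding proximal tasks and goals within broad, general-purpose world models.

With this in mind, we believe that developing an experimental runtime that is flexible and approachable will let us explore the co-design space between high-level model concerns and low-level runtime computation.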

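## Design priorities

Given these motivations, we propose the following priorities for deciding the direction and design of the codebase.

**Maximize leverage with a narrow scope.** We focus on direct implementations of foundation models like Gemma. This lets us concentrate effort on the bottlenecks of specific models. We are willing to trade off generality to keep the implementation code relatively simple and readable at all layers of the stack, achieve good performance, and maintain the velocity of a small team.

**Data-oriented design.** Follow data-oriented design principles where possible to minimize unnecessary performance pessimization. These optimizations are best applied during the initial design or when refactoring a subcomponent. The first step is to think in terms of batches or tuples of plain old data (POD) types: separate arrays instead of an array of structs. The second step is to de-emphasize control flow (if statements, virtual functions, and class hierarchies). The third step is to know the intrinsic properties of the data and bake them into the layout and the algorithm.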
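
The "separate arrays instead of an array of structs" idea can be sketched in a few lines of Python. This is only an illustration: gemma.cpp is written in C++, and the `Particle` type, the field names, and the values below are hypothetical and do not come from its code.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Particle:
    """Array-of-structs element: each object carries all of its fields."""
    x: float
    mass: float


def total_mass_aos(particles):
    # Visits every object even though only one field is needed.
    return sum(p.mass for p in particles)


class ParticlesSoA:
    """Struct-of-arrays: one contiguous typed array per field."""

    def __init__(self, xs, masses):
        self.x = array("f", xs)
        self.mass = array("f", masses)


def total_mass_soa(batch):
    # Reads a single contiguous array; the unused `x` array is never touched.
    return sum(batch.mass)


aos = [Particle(0.0, 1.5), Particle(1.0, 2.5), Particle(2.0, 4.0)]
soa = ParticlesSoA([p.x for p in aos], [p.mass for p in aos])
print(total_mass_aos(aos), total_mass_soa(soa))
```

In a SIMD setting the struct-of-arrays layout matters because a vector load can fetch several consecutive `mass` values at once, which is harder when they are interleaved with other fields.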

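**Prioritize small-batch latency.** Production serving solutions already exist for large-scale serving, powered by accelerators and optimized for throughput, so this project focuses on the possibilities of local, interactive use of foundation models. Throughput still matters, but other things being equal, low latency and small batch sizes take priority.

**Maintain a portable baseline.** Our starting point is portable CPU SIMD (via [highway](https://github.com/google/highway)). We expect to add accelerator and hybrid CPU/GPU support in the future, but the project should continue to allow builds that use this portable baseline. This ensures that research-oriented and experimental runtimes and hardware platforms can run Gemma even when no specialized, production-ready deployment path is available.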

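## Code organization

The implementation code is roughly split into 4 layers, from high level to low level:

1. Frontends (`run.cc`) - Interactive interfaces or automation orchestration. Frontend code implements a use-case objective in terms of invocations of model inference and generation (2). Projects that use gemma.cpp as a library are considered alternative frontends to `run.cc`. Examples of additional frontends will be added in the future.

2. Models (`gemma.cc`, `gemma.h`, `configs.h`) - Implements the compute graph of the model, including supporting functions such as loading and compressing weights, using the transformer operations provided by layer (3).

3. Operations (`ops.h`) - A minimal set of transformer and supporting mathematical operations, implemented using the compute backends (4). This code should be agnostic to the specifics of the compute graph of the model implementation (2).

4. Backend (`highway`) - Low-level hardware interface (SIMD in the case of highway) supporting the implementations in (3).

Besides these layers, the supporting utilities are:

- `compression/` - Model compression operations. The 8-bit switched floating point (SFP) model conversion lives here.
- `util/` - Command-line argument handling and any other utilities.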

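## Style and formatting

A `.clang-format` configuration file with our defaults is provided. Please run source files through `clang-format` (or a formatter that produces equivalent behavior) before submitting a PR.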

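## Compile-time flags (advanced)

There are several compile-time flags to be aware of (note that these may or may not be exposed to the build system):

- `GEMMA_WEIGHT_T`: Sets the compression level of the weights (surfaced as WEIGHT_TYPE in CMakeLists.txt). Currently this should be set to `SfpStream` (the default when no flag is specified, for 8-bit SFP) or to `hwy::bfloat16_t` to enable higher-fidelity (but slower) bfloat16 support. It is defined in `gemma.h`.
- `GEMMA_MAX_SEQ_LEN`: Sets the maximum sequence length to preallocate for the KV cache. The default is 4096 tokens, but it can be overridden. It is not yet exposed through `CMakeLists.txt`.

In the medium term, both flags will likely be deprecated in favor of handling multiple weight compression schemes in a single build and resizing the KV cache dynamically as needed.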
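
To see why the preallocated sequence length matters, the sketch below applies the generic transformer KV cache size formula (keys plus values, per layer, per position). It is an estimate under stated assumptions, not a description of gemma.cpp's actual cache layout: the function name and every model dimension used here are illustrative placeholders.

```python
def kv_cache_bytes(num_layers, max_seq_len, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to preallocate keys and values for every layer and position."""
    # The factor 2 covers one key vector and one value vector per position.
    return 2 * num_layers * max_seq_len * num_kv_heads * head_dim * bytes_per_value


# All dimensions below are illustrative assumptions, not values read from gemma.cpp.
size_4k = kv_cache_bytes(num_layers=18, max_seq_len=4096,
                         num_kv_heads=1, head_dim=256, bytes_per_value=4)
size_8k = kv_cache_bytes(num_layers=18, max_seq_len=8192,
                         num_kv_heads=1, head_dim=256, bytes_per_value=4)
print(size_4k // 2**20, "MiB at 4096 tokens")
print(size_8k // 2**20, "MiB at 8192 tokens")
```

The cost grows linearly with the maximum sequence length, which is one reason a fixed compile-time value is a trade-off between memory use and the longest conversation the build can hold.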

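## Using gemma.cpp as a library (advanced)

Unless you are doing lower-level implementation work or research, from an application standpoint you can think of `gemma.h` and `gemma.cc` as the "core" of the library.

You can regard `run.cc` as an example application that your own application replaces, so the calls into `gemma.h` and `gemma.cc` that you see in `run.cc` are probably the functions you will be calling. `run.cc` contains examples of calls to the tokenizer methods and to `GenerateGemma`.

Keep in mind that gemma.cpp is oriented toward more experimental / prototype / research applications. If you are targeting production, there are more standard NN deployment paths via JAX / PyTorch / Keras.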

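### The Gemma struct contains all the state of the inference engine - tokenizer, weights, and activations

`Gemma(...)` - The constructor. It creates a Gemma model object, which is a wrapper around the tokenizer object, the weights, the activations, and the KV cache.

In a standard LLM chat application you will probably use a Gemma object directly. In more exotic data-processing or research applications you might work with the weights, KV cache, and activations separately (e.g. you might have multiple KV caches and activations for a single set of weights) rather than only using a Gemma object.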

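### Use the tokenizer in the Gemma object (or interact with the tokenizer object directly)

There are essentially only two things you do with the tokenizer: call `Encode()` to turn a string prompt into a vector of token IDs, or call `Decode()` to turn the vector of token IDs output by the model back into a string.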

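### The main entry point for generation is `GenerateGemma()`

Calling `GenerateGemma` with a tokenized prompt will 1) mutate the activation values in `model` and 2) invoke StreamFunc, a lambda callback, for each generated token.

Your application defines its own StreamFunc as a lambda callback that does something every time a token string is streamed from the engine (e.g. print it to the screen, write data to disk, send the string to a server). In `run.cc`, the StreamFunc lambda prints each token to the screen as it arrives.

Optionally, you can define accept_token as another lambda. This is mostly for constrained-decoding use cases where you want to force the generation to fit a grammar. If you are not doing this, you can pass an empty lambda as a no-op, which is what `run.cc` does.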
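
The interplay of the two callbacks can be sketched with a toy generation loop in Python. This is not the gemma.cpp API: `generate`, `toy_next_token`, and all parameter names are hypothetical, and the real callbacks are C++ lambdas whose exact signatures are found in `run.cc`. The sketch only shows the pattern: one callback filters candidate tokens, the other is invoked once per emitted token.

```python
def generate(prompt_tokens, next_token, stream_token,
             accept_token=lambda tok: True,
             max_generated_tokens=8, eos_token=0):
    """Toy generation loop: propose candidates, filter them, stream the winner."""
    tokens = list(prompt_tokens)
    for _ in range(max_generated_tokens):
        candidates = next_token(tokens)  # ranked candidate token IDs
        # Constrained decoding: take the best candidate the filter allows.
        chosen = next((t for t in candidates if accept_token(t)), None)
        if chosen is None or chosen == eos_token:
            break
        tokens.append(chosen)
        stream_token(chosen)  # fires once per generated token
    return tokens[len(prompt_tokens):]


def toy_next_token(tokens):
    # Hypothetical stand-in for a model: prefer last+1, then last+2.
    last = tokens[-1]
    return [last + 1, last + 2]


streamed = []
out = generate([1], toy_next_token, streamed.append,
               accept_token=lambda t: t % 2 == 0,  # toy "grammar": even IDs only
               max_generated_tokens=3)
print(out, streamed)
```

Passing the default `accept_token` corresponds to the no-op case described above: every candidate is accepted and generation is unconstrained.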

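### To invoke the neural network forward function directly, call the `Transformer()` function

For high-level applications you might only call `GenerateGemma()` and never interact with the neural network directly. If you are doing something more custom, you can call `Transformer()`, which performs a single inference operation on a single token and mutates the activations and the KV cache through the neural network computation.

### For low-level operations or defining new architectures, call `ops.h` functions directly

You use `ops.h` if you are writing other NN architectures or modifying the inference path of the Gemma model.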

## Discord

We are also trying out a Discord server for discussion - https://discord.gg/H5jCBAWxAe

## Download the pre-trained model weights and the tokenizer vocabulary file from Kaggle

https://www.kaggle.com/models/google/gemma/frameworks/gemmaCpp
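
The `wget` cells in this notebook use time-limited signed download links, so they stop working once the link expires and a fresh link has to be copied from the Kaggle page. As a small sketch, the expiry can be read from the `X-Goog-Date` (issue time) and `X-Goog-Expires` (lifetime in seconds) query parameters that appear in those links. The helper name and the shortened URL below are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC expiry time encoded in X-Goog-Date and X-Goog-Expires."""
    query = parse_qs(urlsplit(url).query)
    issued = datetime.strptime(
        query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Goog-Expires"][0]))
    return issued + lifetime


# Shortened, hypothetical URL; real links also carry credential and signature parameters.
example = ("https://storage.googleapis.com/example/tokenizer.spm"
           "?X-Goog-Date=20240302T140024Z&X-Goog-Expires=259200")
expiry = signed_url_expiry(example)
print(expiry.isoformat())
print("expired" if datetime.now(timezone.utc) > expiry else "still valid")
```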


In [12]:
#!wget "https://storage.googleapis.com/kagglesdsdata/models/8358/11366/tokenizer.spm?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T133349Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=173aa155f7c6321a75b36092f87d7a778548cafb0e61b5d95ea446d6bbadd97d67b738b5396c81c10ef80ac9570abb596cdec625bb8dfcde819232fa32f0a292243cac311ac2e1e4ea44065d612ce84537001a111a1a5429a20fc63bccfb1d2f90291aef9a0a657802eddf10d4ab5497465759f5e06689eddb57504175c69267028253776e948970801b2f52a816851fceac2a0f327b6345796a9c93895f1699c0961859dabd45477f1090af2d158531994df6b0fdb72254a315322797dbdc295a809425275ee7b8b7cc7be9bfd767787f0f59ffb96aa8145e8ee0239a12f885b1c0d1e9babf4ebcd0d02a0ff2b19baea6baa025f15382fa0ec46ed2bc5d820c" \
!wget "https://storage.googleapis.com/kagglesdsdata/models/8385/11363/tokenizer.spm?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T140024Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=30b7f62a8319d8583b708e9819f69b9575a1c51d46354cd4ec27b8bc401dc4354f257f9b12bf9c983de02a9ca8fe79cd382d323726cf3d0b7adaaddf254ecdf205e35ff84113e80100d647a19f2484674f3a770e736a1a3831c95c48e9fd62ceedaf18f7537f4d3039e5e0a1109c106d2847e44341c8e84b7a45d2b313457dedf56881d46807cb940f389e850ae5639a16b594f7a7270fe4a99688ff694fe14fc35181ce855e82cdbcdad406a39fbc90d9c40abc60220699db34bf9c296367651f2f280ad97893e89591bad931f5d2759134de7228e8bffde55d592b26df9283ad731bf656100aea337ba5950ce26d8594e6c0b2b0975a983fc80b00fb7af513" \
 -O tokenizer.spm

--2024-03-02 14:01:33--  https://storage.googleapis.com/kagglesdsdata/models/8385/11363/tokenizer.spm?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T140024Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=30b7f62a8319d8583b708e9819f69b9575a1c51d46354cd4ec27b8bc401dc4354f257f9b12bf9c983de02a9ca8fe79cd382d323726cf3d0b7adaaddf254ecdf205e35ff84113e80100d647a19f2484674f3a770e736a1a3831c95c48e9fd62ceedaf18f7537f4d3039e5e0a1109c106d2847e44341c8e84b7a45d2b313457dedf56881d46807cb940f389e850ae5639a16b594f7a7270fe4a99688ff694fe14fc35181ce855e82cdbcdad406a39fbc90d9c40abc60220699db34bf9c296367651f2f280ad97893e89591bad931f5d2759134de7228e8bffde55d592b26df9283ad731bf656100aea337ba5950ce26d8594e6c0b2b0975a983fc80b00fb7af513
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.119.207, 108.177.111.207, 142.250.1.207, ...
Connecting to storage.goog

## Gemma model weight types
2B instruction-tuned (`it`) and pre-trained (`pt`) models:

| Model name  | Description |
| ----------- | ----------- |
| `2b-it`     | 2 billion parameter instruction-tuned model, bfloat16 |
| `2b-it-sfp` | 2 billion parameter instruction-tuned model, 8-bit switched floating point |
| `2b-pt`     | 2 billion parameter pre-trained model, bfloat16 |
| `2b-pt-sfp` | 2 billion parameter pre-trained model, 8-bit switched floating point |

7B instruction-tuned (`it`) and pre-trained (`pt`) models:

| Model name  | Description |
| ----------- | ----------- |
| `7b-it`     | 7 billion parameter instruction-tuned model, bfloat16 |
| `7b-it-sfp` | 7 billion parameter instruction-tuned model, 8-bit switched floating point |
| `7b-pt`     | 7 billion parameter pre-trained model, bfloat16 |
| `7b-pt-sfp` | 7 billion parameter pre-trained model, 8-bit switched floating point |

**Tips**:

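8-bit switched floating point (8-bit SFP) is a compressed format for representing floating-point numbers. It aims to reduce storage and compute requirements while preserving as much precision as possible. In deep learning, especially during model deployment and inference, it can shrink the model and improve memory efficiency, which enables faster inference on resource-constrained devices.

8-bit SFP generally involves the following key characteristics:

1. **Precision**: Compared with standard 32-bit (single-precision) floating point, 8-bit SFP offers lower precision, so representing very large or very small values may incur larger rounding errors.

2. **Range**: With only 8 bits, the representable range is smaller than that of a 32-bit float, which may limit its use in applications that need a wide numeric range.

3. **Compression**: 8-bit SFP significantly reduces the storage required by a model, because each value needs only 8 bits instead of the standard 32.

4. **Compatibility**: On some hardware and software platforms, 8-bit SFP may need specific support to be handled correctly. This can mean setting particular flags at compile time or applying specific configuration at run time.

5. **Performance**: On hardware that supports 8-bit SFP, using the format can improve compute performance, because processing 8-bit data is usually faster than processing 32-bit data.

In practice, developers have to weigh the memory and performance benefits of 8-bit SFP against the precision loss it may introduce. In some cases the trade-off is acceptable, especially on embedded systems or mobile devices with strict limits on model size and inference speed. For applications with high precision requirements, other compression techniques may be needed, or the balance between precision and resources may have to be considered at model design time.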
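
The storage side of this trade-off is simple arithmetic, sketched below. The parameter counts are the nominal "2B" and "7B" figures from the model names, which is an assumption: the exact counts differ, and real weight files also contain metadata and possibly scaling information that this estimate ignores.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage for the weights alone, ignoring metadata and any scaling data."""
    return num_params * bits_per_weight // 8


# Nominal parameter counts taken from the model names; the exact counts differ.
for name, params in [("2b", 2_000_000_000), ("7b", 7_000_000_000)]:
    for fmt, bits in [("float32", 32), ("bfloat16", 16), ("8-bit SFP", 8)]:
        gib = weight_bytes(params, bits) / 2**30
        print(f"{name} {fmt:>9}: {gib:5.1f} GiB")
```

Going from 32 bits to 8 bits per weight cuts the weight storage to a quarter, and going from bfloat16 to 8-bit SFP halves it, which is what makes the `-sfp` variants practical on a small CPU-only machine.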

### gemma-2b-it-sfp

In [11]:
!wget "https://storage.googleapis.com/kagglesdsdata/models/8385/11363/2b-it-sfp.sbs?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T135831Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=ac2a0bc761ddb609201e333c65a0ff98a3b1a0a87dde1af6c397cef6c64498bedda374657483a1b679cf78e777c3b5c3d2295aabc91ea374b3019f67de38c1d6aa323a71001ea91afa857725853e9502e7acc522f524449ca45c895f55c2befca963bc76b5d9bbafb691d930d93d2d1d9f537bb25d2b9531ea3fdf77eebd3fd06f01cf598eb41e33677567531804e74e1a931ab46239876fef630ade4b7b5987e3a57e5b729bead24cda908b23eb69511be09ad8cc60b2bdd38f5fcb95b122ce62e2f054930a65cc10571cea068bbdbf624388e22a697c43431deb0afe65a77d44240ad23641a35c49d391eaa1a2973a6138d3df7fcd24f9f334cb36c1124682" \
  -O 2b-it-sfp.sbs

--2024-03-02 13:59:33--  https://storage.googleapis.com/kagglesdsdata/models/8385/11363/2b-it-sfp.sbs?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T135831Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=ac2a0bc761ddb609201e333c65a0ff98a3b1a0a87dde1af6c397cef6c64498bedda374657483a1b679cf78e777c3b5c3d2295aabc91ea374b3019f67de38c1d6aa323a71001ea91afa857725853e9502e7acc522f524449ca45c895f55c2befca963bc76b5d9bbafb691d930d93d2d1d9f537bb25d2b9531ea3fdf77eebd3fd06f01cf598eb41e33677567531804e74e1a931ab46239876fef630ade4b7b5987e3a57e5b729bead24cda908b23eb69511be09ad8cc60b2bdd38f5fcb95b122ce62e2f054930a65cc10571cea068bbdbf624388e22a697c43431deb0afe65a77d44240ad23641a35c49d391eaa1a2973a6138d3df7fcd24f9f334cb36c1124682
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.126.207, 74.125.132.207, 74.125.201.207, ...
Connecting to storage.googl

### gemma-7b-it-sfp

In [38]:
!wget "https://storage.googleapis.com/kagglesdsdata/models/8445/11370/7b-it-sfp.sbs?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T163205Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=824f075e8de538d45977f02b0cfe6488b926933ad914086ca5416c918d87cfbf3ff48f3936a3bbd852dc3a773ae75e55c39be08a9893b092b1ece4a3acee94b732c2869e29452e7d4624932707f27d9564ace31e1e3fc9d0ef687e482e22b901451356d6508ced70f807cd5544d0675c506e9833c033f0036153d81d8337d8c9b6e7073dd8c5b305a128433e29106e4bdbb2bb0356f5bec75434d67a8549bc7030526496fce0fc7f379649519c1aa579d985cffc01a419f7d0aaf8459fa20651f306026236bee4f8b1e301c64386a6c11750fccdb0f0df504351088f7053e634fdc97bff17324c2f0f087f3a9c3e1fb5bff8ffe0fe0e9e0ff3d40bc83cc9cce2" \
  -O 7b-it-sfp.sbs

--2024-03-02 16:34:06--  https://storage.googleapis.com/kagglesdsdata/models/8445/11370/7b-it-sfp.sbs?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240302%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240302T163205Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=824f075e8de538d45977f02b0cfe6488b926933ad914086ca5416c918d87cfbf3ff48f3936a3bbd852dc3a773ae75e55c39be08a9893b092b1ece4a3acee94b732c2869e29452e7d4624932707f27d9564ace31e1e3fc9d0ef687e482e22b901451356d6508ced70f807cd5544d0675c506e9833c033f0036153d81d8337d8c9b6e7073dd8c5b305a128433e29106e4bdbb2bb0356f5bec75434d67a8549bc7030526496fce0fc7f379649519c1aa579d985cffc01a419f7d0aaf8459fa20651f306026236bee4f8b1e301c64386a6c11750fccdb0f0df504351088f7053e634fdc97bff17324c2f0f087f3a9c3e1fb5bff8ffe0fe0e9e0ff3d40bc83cc9cce2
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.120.207, 142.251.171.207, 142.250.159.207, ...
Connecting to storage.go

## Inference

In [4]:
!git clone https://github.com/google/gemma.cpp

Cloning into 'gemma.cpp'...
remote: Enumerating objects: 261, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 261 (delta 75), reused 59 (delta 43), pack-reused 138
Receiving objects: 100% (261/261), 161.06 KiB | 2.24 MiB/s, done.
Resolving deltas: 100% (129/129), done.


In [8]:
!echo $(nproc)

2


In [17]:
# Default weight type: SfpStream (8-bit switched floating point)
!cd gemma.cpp && rm -rf build/* && cmake -B build -S . && make -C build -j $(nproc) gemma

#!cd gemma.cpp && rm -rf build/* && cmake -B build -S . -DPROFILER_ENABLED=1 && make -C build -j $(nproc) gemma
#!cd gemma.cpp && rm -rf build/* && cmake -B build -S . -DWEIGHT_TYPE=hwy::bfloat16_t && make -C build -j $(nproc) gemma
#!cd gemma.cpp && rm -rf build/* && cmake -B build -S . -DPROFILER_ENABLED=1 -DWEIGHT_TYPE=hwy::bfloat16_t && make -C build -j $(nproc) gemma


-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
  The OLD behavior for policy CMP0111 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.

-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS
-- Performing Test ATOMICS_LOCK_FREE_INSTRUCTIONS - Success
-- Performing Tes

In [18]:
!./gemma.cpp/build/gemma -h

  __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
 / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
| (_| |  __/ | | | | | | | | | | (_| || (__| |_) | |_) |
 \__, |\___|_| |_| |_|_| |_| |_|\__,_(_)___| .__/| .__/
  __/ |                                    | |   | |
 |___/                                     |_|   |_|

gemma.cpp : a lightweight, standalone C++ inference engine

To run gemma.cpp, you need to specify 3 required model loading arguments:
    --tokenizer
    --compressed_weights
    --model.

*Example Usage*

./gemma --tokenizer tokenizer.spm --compressed_weights 2b-it-sfp.sbs --model 2b-it

*Model Loading Arguments*

  --tokenizer : Path name of tokenizer model file.
    Required argument.
  --compressed_weights : Path name of compressed weights file, regenerated from `--weights` file if the compressed weights file does not exist.
    Required argument.
  --model : Model type
    2b-it (2B parameters, instruction-tuned)
    2b-pt (2B parameters, pretrained)
    7

In [19]:
!./gemma.cpp/build/gemma --tokenizer ./tokenizer.spm --compressed_weights ./2b-it-sfp.sbs --model 2b-it

[2J[1;1H  __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
 / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
| (_| |  __/ | | | | | | | | | | (_| || (__| |_) | |_) |
 \__, |\___|_| |_| |_|_| |_| |_|\__,_(_)___| .__/| .__/
  __/ |                                    | |   | |
 |___/                                     |_|   |_|

tokenizer                     : ./tokenizer.spm
compressed_weights            : ./2b-it-sfp.sbs
model                         : 2b-it
weights                       : [no path specified]
max_tokens                    : 3072
max_generated_tokens          : 2048
multiturn                     : 0

*Usage*
  Enter an instruction and press enter (%C resets conversation, %Q quits).
  Since multiturn is set to 0, conversation will automatically reset every turn.

*Examples*
  - Write an email to grandma thanking her for the cookies.
  - What are some historical attractions to visit around Massachusetts?
  - Compute the nth fibonacci number in javascript.
  - 

In [39]:
!./gemma.cpp/build/gemma --tokenizer ./tokenizer.spm --compressed_weights ./7b-it-sfp.sbs --model 7b-it

[2J[1;1H  __ _  ___ _ __ ___  _ __ ___   __ _   ___ _ __  _ __
 / _` |/ _ \ '_ ` _ \| '_ ` _ \ / _` | / __| '_ \| '_ \
| (_| |  __/ | | | | | | | | | | (_| || (__| |_) | |_) |
 \__, |\___|_| |_| |_|_| |_| |_|\__,_(_)___| .__/| .__/
  __/ |                                    | |   | |
 |___/                                     |_|   |_|

tokenizer                     : ./tokenizer.spm
compressed_weights            : ./7b-it-sfp.sbs
model                         : 7b-it
weights                       : [no path specified]
max_tokens                    : 3072
max_generated_tokens          : 2048
multiturn                     : 0

*Usage*
  Enter an instruction and press enter (%C resets conversation, %Q quits).
  Since multiturn is set to 0, conversation will automatically reset every turn.

*Examples*
  - Write an email to grandma thanking her for the cookies.
  - What are some historical attractions to visit around Massachusetts?
  - Compute the nth fibonacci number in javascript.
  - 

# llama.cpp


1. https://github.com/ggerganov/llama.cpp
2. https://github.com/ggerganov/llama.cpp/pull/5631


In [20]:
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'llama.cpp'...
remote: Enumerating objects: 19553, done.
remote: Counting objects: 100% (6867/6867), done.
remote: Compressing objects: 100% (484/484), done.
remote: Total 19553 (delta 6656), reused 6418 (delta 6378), pack-reused 12686
Receiving objects: 100% (19553/19553), 23.24 MiB | 20.99 MiB/s, done.
Resolving deltas: 100% (13825/13825), done.


In [24]:
!cd llama.cpp && make clean && make -j $(nproc)

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.a *.dll benchmark-matmult common/build-info.cpp *.do

In [25]:
!./llama.cpp/main -h


usage: ./llama.cpp/main [options]

options:
  -h, --help            show this help message and exit
  --version             show version and build info
  -i, --interactive     run in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  -cml, --chatml        run in chatml mode (use with ChatML-compatible models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during generation (default: 2)
  -tb N, --threads-batch 

In [26]:
!pip3 install huggingface-hub

In [27]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'stor

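Change `-ngl 32` to the number of layers you want to offload to the GPU. Remove it if you do not have GPU acceleration.

Change `-c 4096` to the desired sequence length. For extended-sequence models (e.g. 8K, 16K, 32K), the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically.

If you want a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.

For other parameters and usage, see the llama.cpp documentation: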

https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
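
The flag choices described above can be collected in a small helper that assembles the argument list for `./llama.cpp/main`. This is a convenience sketch, not part of llama.cpp: the function name and its parameters are hypothetical, and it only uses flags that already appear in this notebook (`-m`, `-n`, `-p`, `-c`, `-ngl`, `-i -ins`, `--repeat-penalty`).

```python
import shlex


def llama_main_command(model_path, prompt=None, n_predict=256, ctx_size=None,
                       gpu_layers=None, repeat_penalty=1.1,
                       binary="./llama.cpp/main"):
    """Assemble an argument list for llama.cpp's `main` from the notebook's flags."""
    args = [binary, "-m", model_path, "-n", str(n_predict),
            "--repeat-penalty", str(repeat_penalty)]
    if ctx_size is not None:
        args += ["-c", str(ctx_size)]      # desired sequence length
    if gpu_layers is not None:
        args += ["-ngl", str(gpu_layers)]  # leave out entirely on CPU-only machines
    if prompt is None:
        args += ["-i", "-ins"]             # chat-style session instead of a one-shot prompt
    else:
        args += ["-p", prompt]
    return args


cpu_cmd = llama_main_command(
    "gemma-2b-it.gguf",
    prompt="Write a Python function to find sum of two numbers.")
chat_cmd = llama_main_command("gemma-2b-it.gguf", ctx_size=4096, gpu_layers=32)
print(shlex.join(cpu_cmd))
print(shlex.join(chat_cmd))
```

The printed strings can be pasted after `!` in a notebook cell, or the lists can be passed to `subprocess.run` unchanged.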

## gemma-2b

In [32]:
# https://huggingface.co/google/gemma-2b/tree/main
#!huggingface-cli download google/gemma-2b-it gemma-2b.gguf --local-dir ./ --local-dir-use-symlinks False

# or use https://huggingface.co/google/gemma-2b-GGUF
!huggingface-cli download google/gemma-2b-GGUF gemma-2b.gguf --local-dir ./ --local-dir-use-symlinks False


Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/google/gemma-2b-GGUF/resolve/main/gemma-2b.gguf to /root/.cache/huggingface/hub/tmpf03rv23_
gemma-2b.gguf: 100% 10.0G/10.0G [03:01<00:00, 55.3MB/s]
./gemma-2b.gguf


In [33]:
!./llama.cpp/main -m gemma-2b.gguf -n 256 -p "It is the best of time" --repeat-penalty 1.1


Log start
main: build = 2314 (6c32d8c7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709393597
llama_model_loader: loaded meta data with 19 key-value pairs and 164 tensors from gemma-2b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 gem

In [None]:
# Running the model on a single / multi GPU
!./llama.cpp/main -m gemma-2b.gguf -n 256 -p "It is the best of time" --repeat-penalty 1.1 -ngl 99

## gemma-2b-it

In [28]:
# https://huggingface.co/google/gemma-2b-it/tree/main
#!huggingface-cli download google/gemma-2b-it gemma-2b-it.gguf --local-dir ./ --local-dir-use-symlinks False

# or use https://huggingface.co/google/gemma-2b-it-GGUF
!huggingface-cli download google/gemma-2b-it-GGUF gemma-2b-it.gguf --local-dir ./ --local-dir-use-symlinks False


Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/google/gemma-2b-it/resolve/main/gemma-2b-it.gguf to /root/.cache/huggingface/hub/tmplklzja8q
gemma-2b-it.gguf: 100% 10.0G/10.0G [01:43<00:00, 96.9MB/s]
./gemma-2b-it.gguf


In [31]:
# Running the model on a CPU
!./llama.cpp/main -m gemma-2b-it.gguf -n 256 -p "Write a Python function to find sum of two numbers." --repeat-penalty 1.1

Log start
main: build = 2314 (6c32d8c7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709392770
llama_model_loader: loaded meta data with 19 key-value pairs and 164 tensors from gemma-2b-it.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-2b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 18
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:              

In [None]:
# Running the model on a single / multi GPU
!./llama.cpp/main -m gemma-2b-it.gguf -n 256 -p "Write a Python function to find sum of two numbers." --repeat-penalty 1.1 -ngl 99


## gemma-7b-GGUF

In [34]:
# https://huggingface.co/google/gemma-7b/tree/main
#!huggingface-cli download google/gemma-7b gemma-7b.gguf --local-dir ./ --local-dir-use-symlinks False

# or use https://huggingface.co/google/gemma-7b-GGUF
!huggingface-cli download google/gemma-7b-GGUF gemma-7b.gguf --local-dir ./ --local-dir-use-symlinks False


Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/google/gemma-7b-GGUF/resolve/main/gemma-7b.gguf to /root/.cache/huggingface/hub/tmp5_pjmnye
gemma-7b.gguf: 100% 34.2G/34.2G [10:16<00:00, 55.4MB/s]
./gemma-7b.gguf


In [35]:
# Running the model on a CPU
!./llama.cpp/main -m gemma-7b.gguf -p "Penguins live in" --repeat-penalty 1.0


Log start
main: build = 2314 (6c32d8c7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709394685
llama_model_loader: loaded meta data with 19 key-value pairs and 254 tensors from gemma-7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 28
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gem

In [None]:
# Running the model on a single / multi GPU
!./llama.cpp/main -m gemma-7b.gguf -p "Penguins live in" --repeat-penalty 1.0 -ngl 99

# The model must first be quantized to 8 bits (Q8_0); then run inference
# see: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize
#!./llama.cpp/main -m gemma-7b_q8_0.gguf -p "Penguins live in" --repeat-penalty 1.0 -ngl 99


## gemma-7b-it-GGUF

In [36]:
# https://huggingface.co/google/gemma-7b-it/tree/main
#!huggingface-cli download google/gemma-7b-it gemma-7b-it.gguf --local-dir ./ --local-dir-use-symlinks False

# or use https://huggingface.co/google/gemma-7b-it-GGUF
!huggingface-cli download google/gemma-7b-it-GGUF gemma-7b-it.gguf --local-dir ./ --local-dir-use-symlinks False


Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/google/gemma-7b-it-GGUF/resolve/main/gemma-7b-it.gguf to /root/.cache/huggingface/hub/tmp9br8dgro
gemma-7b-it.gguf: 100% 34.2G/34.2G [05:33<00:00, 102MB/s]
./gemma-7b-it.gguf


In [37]:
# Running the model on a CPU
!./llama.cpp/main -m gemma-7b-it.gguf -p "write me an ode to LLMs." --repeat-penalty 1.0


Log start
main: build = 2314 (6c32d8c7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1709396611
llama_model_loader: loaded meta data with 19 key-value pairs and 254 tensors from gemma-7b-it.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                          gemma.block_count u32              = 28
llama_model_loader: - kv   4:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:              

In [None]:
# Running the model on a single / multi GPU
!./llama.cpp/main -m gemma-7b-it.gguf -p "write me an ode to LLMs." --repeat-penalty 1.0 -ngl 99

# The model must first be quantized to 8 bits (Q8_0); then run inference
# see: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize
#!./llama.cpp/main -m gemma-7b-it_q8_0.gguf -p "write me an ode to LLMs." --repeat-penalty 1.0 -ngl 99
