# 1. Importing libraries
This section imports the Python libraries needed for data handling, model loading, and inference.

In [11]:
import numpy as np
import os
import time
import gc
from transformers import AutoTokenizer
from threading import Lock
from utils.session import Session
from config import InferenceConfig
import torch
import torch_npu

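numpy: library for working with arrays and numerical operations.

os: operating-system facilities such as file handling.

time and gc: timing utilities and garbage collection.

AutoTokenizer: from Hugging Face transformers; loads a pretrained model's tokenizer, which converts text into tokens the model can consume.

Lock: from the threading module, used for thread synchronization.

Session: a project-specific class (utils.session) that, judging from how it is used below, wraps the model and runs it.

InferenceConfig: a project-specific configuration class holding the model's paths and hyperparameters.

torch and torch_npu: PyTorch, plus the Ascend extension that adds NPU (neural processing unit) support to it.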

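# 2. The Inference class

The Inference class is the core of the script. It handles all interaction with the model: loading it, processing input, generating predictions, and managing session state.

## Initialization (the __init__ method)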

In [2]:
class Inference:
    def __init__(self, config: InferenceConfig) -> None:
        # Initialize the necessary variables
        self.max_input_length = config.max_input_length
        self.max_output_length = config.max_output_length
        self.tokenizer = AutoTokenizer.from_pretrained(
            config.tokenizer_dir, trust_remote_code=True
        )
        self.session = Session.fromConfig(config)
        self.session_type = config.session_type
        self.kv_cache_length = config.kv_cache_length
        self.state: dict = {"code": 200, "isEnd": False, "message": ""}
        self.reset()
        self.lock = Lock()
        self.first = True
        print("[INFO] init success")

        # Set device to NPU
        # self.device = torch.device("npu" if torch.npu.is_available() else "cpu")
        # self.device = torch.device("npu")
        print("[INFO] NPU context set successfully")

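self.max_input_length and self.max_output_length: the maximum input and output lengths read from the config; they bound how much text the model processes.

AutoTokenizer: loads a pretrained tokenizer that converts input text into tokens the model can consume.

Session: the custom class that manages and configures the model session; it is built here with Session.fromConfig(config).

self.device: the lines that would set the compute device to the NPU (falling back to the CPU when no NPU is available) are commented out, so self.device is never assigned in this code. The device is presumably chosen by the Session from config.device_str, and the "NPU context set successfully" message is printed unconditionally.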
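
The fallback described by the commented-out line could be sketched as follows. This is a minimal illustration, not part of the original script: pick_device is a hypothetical helper, and it assumes that importing torch_npu registers the torch.npu namespace, as the commented-out line in __init__ implies.

```python
import importlib.util

def pick_device(preferred: str = "npu") -> str:
    """Return "npu" only if torch and torch_npu are installed and an NPU is visible.

    Hypothetical helper for illustration; any other `preferred` value is returned as-is.
    """
    if preferred != "npu":
        return preferred
    # Check that both packages are installed without importing them yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    if importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    import torch
    import torch_npu  # noqa: F401  (assumed to register torch.npu on import)
    return "npu" if torch.npu.is_available() else "cpu"

print(pick_device())
```

The result could then be passed as device_str when building the configuration, assuming the Session accepts "cpu" there.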

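# 3. Building the cache (the generate_cache method)
This method takes a prompt, converts it to tokens, and runs the model once over the whole prompt.
It then samples the next token and returns it together with the logits (the model's raw output scores). An empty prompt returns None.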

In [3]:
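def generate_cache(self, prompt: str):
    # Nothing to do for an empty prompt.
    if len(prompt) == 0:
        return None
    self.first = False
    # Encode the prompt as an int64 array of shape (1, seq_len).
    input_ids = np.asarray(self.tokenizer.encode(prompt), dtype=np.int64).reshape(1, -1)
    logits = self.session.run(input_ids)[0]
    # Sample the next token from the logits at the last position.
    next_token = self.sample_logits(logits[0][-1:])
    return next_token, logits

# Attach the function to the Inference class as a method.
Inference.generate_cache = generate_cache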
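
To make the shapes in this step concrete, here is a small self-contained sketch with stand-ins for the tokenizer and the session. ToyTokenizer, ToySession, and prefill are hypothetical names used only for illustration. The sketch assumes, as the indexing in generate_cache suggests, that run returns a sequence whose first item holds logits of shape (batch, seq_len, vocab_size).

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical tokenizer: one token per character, id = code point modulo vocab size."""
    vocab_size = 16

    def encode(self, text):
        return [ord(ch) % self.vocab_size for ch in text]

class ToySession:
    """Hypothetical session whose "model" always predicts (token + 1) % vocab_size."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def run(self, input_ids):
        batch, seq_len = input_ids.shape
        logits = np.zeros((batch, seq_len, self.vocab_size), dtype=np.float32)
        next_ids = (input_ids + 1) % self.vocab_size
        for b in range(batch):
            # Give the predicted token the highest score at every position.
            logits[b, np.arange(seq_len), next_ids[b]] = 1.0
        return (logits,)

def prefill(tokenizer, session, prompt):
    if not prompt:
        return None
    input_ids = np.asarray(tokenizer.encode(prompt), dtype=np.int64).reshape(1, -1)
    logits = session.run(input_ids)[0]                 # (1, seq_len, vocab_size)
    next_token = np.argmax(logits[0][-1:], axis=-1)    # greedy pick at the last position
    return next_token.astype(np.int64), logits

tokenizer = ToyTokenizer()
next_token, logits = prefill(tokenizer, ToySession(tokenizer.vocab_size), "ab")
print(next_token, logits.shape)
```

Only the last row of the logits is used to pick the next token; the earlier rows are a by-product of processing the whole prompt in one pass.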

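# 4. Sampling the next token (the sample_logits method)
This method selects the next token using one of several sampling methods: greedy, Top-k, or Top-p.
Its parameters control how diverse and how random the generated text is.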

In [4]:
def sample_logits(self, logits: np.ndarray, sampling_method: str = "greedy", sampling_value: float = None, temperature: float = 1.0) -> np.ndarray:
    # logits is expected to have shape (1, vocab_size).
    if temperature == 0 or sampling_method == "greedy":
        return np.argmax(logits, axis=-1).astype(np.int64)
    elif sampling_method == "top_k" or sampling_method == "top_p":
        assert sampling_value is not None
        logits = logits.astype(np.float32)
        logits /= temperature
        # Subtract the maximum before exponentiating to avoid overflow.
        exp_logits = np.exp(logits - np.max(logits))
        probs = exp_logits / np.sum(exp_logits)
        sorted_probs = np.sort(probs)[:, ::-1]
        sorted_indices = np.argsort(probs)[:, ::-1]

        if sampling_method == "top_k":
            # Index of the last kept token, so the slice below keeps exactly k tokens.
            index_of_interest = max(int(sampling_value), 1) - 1
        elif sampling_method == "top_p":
            p = sampling_value
            cumulative_probs = np.cumsum(sorted_probs, axis=-1)
            # Stop at the first token whose cumulative probability exceeds p.
            for index_of_interest, cumulative_prob in enumerate(cumulative_probs[0]):
                if cumulative_prob > p:
                    break

        probs_of_interest = sorted_probs[:, : index_of_interest + 1]
        indices_of_interest = sorted_indices[:, : index_of_interest + 1]
        probs_of_interest /= np.sum(probs_of_interest)
        return np.array([np.random.choice(indices_of_interest[0], p=probs_of_interest[0])])
    else:
        raise ValueError(f"Unknown sampling method {sampling_method}")

# Attach the function to the Inference class as a method.
Inference.sample_logits = sample_logits
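
The truncation step can be checked in isolation on a four-token vocabulary. The sketch below is a standalone illustration rather than the method above: softmax, truncate, and sample are hypothetical helper names, and the logits are a plain 1-D vector instead of a (1, vocab_size) array.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a 1-D logits vector after temperature scaling."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    exps = np.exp(scaled - np.max(scaled))
    return exps / exps.sum()

def truncate(probs, method, value):
    """Return (token_ids, renormalized_probs) for the candidates kept by top_k or top_p."""
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    sorted_probs = probs[order]
    if method == "top_k":
        keep = max(1, int(value))              # keep exactly k tokens
    elif method == "top_p":
        # Smallest prefix whose cumulative probability exceeds `value`.
        cumulative = np.cumsum(sorted_probs)
        keep = int(np.searchsorted(cumulative, value, side="right")) + 1
    else:
        raise ValueError(f"Unknown sampling method {method}")
    keep = min(keep, len(sorted_probs))
    kept = sorted_probs[:keep]
    return order[:keep], kept / kept.sum()

def sample(logits, method, value, temperature=1.0, rng=None):
    """Draw one token id from the truncated, renormalized distribution."""
    rng = rng or np.random.default_rng()
    token_ids, probs = truncate(softmax(logits, temperature), method, value)
    return int(rng.choice(token_ids, p=probs))

probs = softmax([2.0, 1.0, 0.0, -1.0])
print(truncate(probs, "top_k", 2))
print(truncate(probs, "top_p", 0.8))
print(sample([2.0, 1.0, 0.0, -1.0], "top_p", 0.8))
```

With these logits the two most probable tokens together hold more than 0.8 of the probability mass, so top_k with 2 and top_p with 0.8 keep the same two candidates.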

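# 5. Resetting state (the reset method)
This method restores the session to its initial state so that a new prediction is not affected by earlier ones. In the code shown it is called from __init__; predict does not call it, so call reset() yourself before reusing an Inference object for an unrelated prompt.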

In [5]:
def reset(self):
    self.first = True
    self.session.run_times = 0
    self.session.reset()

# Attach the function to the Inference class as a method.
Inference.reset = reset

# 6. Getting the state (the getState method)
This method returns a copy of the current session state, taken while holding the lock.

In [6]:
def getState(self):
    with self.lock:
        return self.state.copy()

Inference.getState = getState
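
Returning a copy under the lock means a caller can read or modify its snapshot without touching the shared dictionary. A minimal standalone sketch of the pattern follows; StateHolder and its update method are hypothetical, since the code shown never modifies self.state.

```python
from threading import Lock

class StateHolder:
    """Hypothetical holder that mirrors the state dictionary used by Inference."""
    def __init__(self):
        self._lock = Lock()
        self._state = {"code": 200, "isEnd": False, "message": ""}

    def update(self, **changes):
        # Writers take the same lock, so readers never see a half-applied update.
        with self._lock:
            self._state.update(changes)

    def snapshot(self):
        # dict.copy() is shallow, which is enough while all values are immutable.
        with self._lock:
            return self._state.copy()

holder = StateHolder()
snap = holder.snapshot()
snap["message"] = "changed locally"
holder.update(isEnd=True)
print(snap, holder.snapshot())
```

If the state ever held nested lists or dictionaries, copy.deepcopy would be needed to keep the snapshot independent.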

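# 7. Closing the session (the close method)
This method releases resources when you are done: it deletes the cache and the model, runs garbage collection, and frees cached NPU memory.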

In [7]:
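def close(self):
    # Drop references to the KV cache and the model so they can be garbage-collected.
    if hasattr(self, "session"):
        if hasattr(self.session, "kv_cache"):
            del self.session.kv_cache
        if hasattr(self.session, "model"):
            del self.session.model
    gc.collect()
    # Release cached NPU memory and wait for pending NPU work to finish.
    torch_npu.npu.empty_cache()
    torch_npu.npu.synchronize()

Inference.close = close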
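
Because close must run even when a prediction fails, it can help to wrap the object in contextlib.closing from the standard library. The sketch below uses a hypothetical DummyEngine in place of Inference; run_once is also a hypothetical helper.

```python
from contextlib import closing

class DummyEngine:
    """Hypothetical stand-in for Inference; only predict() and close() matter here."""
    def __init__(self):
        self.closed = False

    def predict(self, prompt):
        if not prompt:
            raise ValueError("empty prompt")
        return f"echo: {prompt}"

    def close(self):
        self.closed = True

def run_once(engine, prompt):
    # closing() calls engine.close() on exit, whether predict() returns or raises.
    with closing(engine):
        return engine.predict(prompt)

engine = DummyEngine()
print(run_once(engine, "hello"), engine.closed)
```

The same wrapper should work with a real Inference object, since closing only requires a close() method.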

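sampling_method: determines how the next token is chosen (greedy, Top-k sampling, or Top-p sampling).

temperature: controls the randomness of the generated text. The higher the temperature, the more varied the output.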
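
The effect of temperature can be seen by dividing a fixed set of logits by different temperatures before the softmax. The sketch below is a standalone illustration using only the standard library; softmax_with_temperature is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably by subtracting the maximum."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the top token shrinks and the distribution flattens, which is why sampled text becomes more varied.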

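# 8. Prediction (the predict method)
This is the main inference method. It builds the final text iteratively, one token at a time, feeding each sampled token back into the model.
It supports all the sampling methods above and accepts an optional chat history and system prompt.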

In [8]:

def predict(self, prompt, history=None, system_prompt="You are a helpful assistant.", max_new_tokens=1024, sampling_method="greedy", sampling_value=None, temperature=1.0):
    """
    Generate a reply to prompt.
    sampling_method selects the sampling strategy (greedy, top_k, top_p).
    """
    if history is None:
        history = []

    # Build the chat message list
    messages = [{"role": "system", "content": system_prompt}]
    for (use_msg, bot_msg) in history:
        messages.append({"role": "user", "content": use_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": prompt})

    # Convert the messages to token ids
    text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = self.tokenizer([text], return_tensors="pt")["input_ids"].to(torch.long).reshape(1, -1)

    # Keep only the last max_input_length tokens of the input
    input_ids = input_ids[:, -self.max_input_length:]
    ids_list = []
    text_out = ""  # returned as-is if the output budget is zero or negative
    input_length = input_ids.shape[1]
    max_output_len = self.max_output_length - input_length
    max_output_len = min(max_output_len, max_new_tokens)

    for i in range(max_output_len):
        # Run the model to get the logits
        logits = self.session.run(input_ids)

        # Sample the next token with the requested sampling method
        input_ids = self.sample_logits(logits[0][-1:], sampling_method=sampling_method, sampling_value=sampling_value, temperature=temperature)
        input_ids = input_ids.reshape(1, -1)
        ids_list.append(input_ids[0].item())
        text_out = self.tokenizer.decode(ids_list)

        # Stop early on the EOS token
        if input_ids[0] == self.tokenizer.eos_token_id:
            break

    return text_out

# Attach the function to the Inference class as a method (this line must stay at module level)
Inference.predict = predict


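The method generates text with a loop rather than recursion: each iteration obtains the model output from the session, samples one token, and feeds that token back in. It stops at the EOS token or when the output budget is used up. Because the EOS token is appended before the loop exits, the decoded text ends with the EOS marker, as the outputs below show.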
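
The control flow of such a generation loop can be sketched without a real model. In the illustration below, greedy_decode and counting_model are hypothetical. Unlike predict, the sketch passes the full token list to the model on every step instead of relying on a session-side cache, and it stops before appending the EOS token.

```python
def greedy_decode(step_fn, prompt_ids, eos_id, max_new_tokens):
    """Generate token ids one at a time until EOS or until the budget is spent.

    step_fn(ids) must return a list of scores, one per vocabulary entry,
    for the token that follows ids.
    """
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        scores = step_fn(ids)
        # Greedy choice: the index with the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        generated.append(next_id)
        ids.append(next_id)
    return generated

def counting_model(ids):
    """Hypothetical toy model over 6 tokens that always predicts last token + 1."""
    vocab_size = 6
    scores = [0.0] * vocab_size
    scores[(ids[-1] + 1) % vocab_size] = 1.0
    return scores

print(greedy_decode(counting_model, [1], eos_id=5, max_new_tokens=10))
print(greedy_decode(counting_model, [1], eos_id=5, max_new_tokens=2))
```

The first call ends because the toy model reaches the EOS id, the second because the token budget runs out, which are the same two exits predict has.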

# 9. Configuration

In [9]:

# Hardcoded configuration (you can set values directly here)
config = InferenceConfig(
    hf_model_dir="/data/shaos/data/Qwen2.5-VL-3B-Instruct",
    om_model_path="path_to_om_model",
    onnx_model_path="path_to_onnx_model",
    cpu_thread=4,
    session_type="pytorch",
    max_batch=1,
    max_output_length=2048,
    max_input_length=1024,
    kv_cache_length=2048,
    max_prefill_length=4,
    dtype="float32",
    torch_dtype="float32",
    device_str="npu"
    #tokenizer_dir="path_to_tokenizer"
)

config.session_type = "pytorch"
config.kvcache_method = "fixsize"   # keep consistent with the config used at export time

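# InferenceConfig: this configuration class holds the paths and hyperparameters used above, including model locations, device settings, and the maximum input and output lengths.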

# 10. Running a prediction

Create an Inference object from the configuration, call predict on the prompt "What is the capital of France?", print the generated output, and release resources with close().

In [10]:

# Running the prediction
inference = Inference(config)
prompt = "What is the capital of France?"
output = inference.predict(prompt)
print("Generated Output:", output)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Generated Output: The capital of France is Paris.<|im_end|>


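Key concepts and explanations:

Model inference: using a pretrained model to produce predictions. In this script, inference means passing in a text prompt and repeatedly generating the next token.

Sampling methods: the script supports several ways of choosing the next token (greedy, Top-k, Top-p). Greedy always picks the highest-probability token, while Top-k and Top-p sample from a truncated distribution, which adds randomness and makes the text more varied.

Session management: the Session class handles interaction with the model; it manages inputs, outputs, and state.

NPU device: the configuration sets device_str="npu" and close() calls torch_npu unconditionally, so the script as written requires an NPU. The commented-out line in __init__ shows how a fallback to the CPU could be selected, but no fallback is active.

State reset: reset() clears the session state so that a prediction is not affected by earlier ones. In the code shown it runs when an Inference object is created, and the examples below create a fresh object for each prompt.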

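Suggested exercises:

Change the sampling method: try greedy, Top-k, and Top-p, and observe how each affects the diversity and accuracy of the generated text.

Adjust the input and output lengths: change max_input_length and max_output_length and see how they affect the model's behavior, especially on longer texts. Note that predict subtracts the prompt length from max_output_length, so that value bounds prompt plus generated tokens together.

Improve inference speed: try batching and different devices (for example GPU or CPU), especially for large models.

Use different prompts: change the prompt passed to predict and observe how the model responds to different kinds of queries (factual questions, creative questions, and so on).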
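
For the length exercise, it helps to compute in advance how many tokens predict is allowed to generate. The helper below restates the arithmetic from predict as a standalone function; output_budget is a hypothetical name, and clamping at zero mirrors the fact that a negative range produces no iterations.

```python
def output_budget(prompt_len, max_input_length, max_output_length, max_new_tokens):
    """Number of tokens the generation loop may produce for a prompt of prompt_len tokens."""
    # The prompt is truncated to its last max_input_length tokens first.
    input_len = min(prompt_len, max_input_length)
    # max_output_length bounds prompt plus generated tokens together.
    return max(0, min(max_output_length - input_len, max_new_tokens))

# Values from the configuration cell: max_input_length=1024, max_output_length=2048.
print(output_budget(30, 1024, 2048, 1024))
print(output_budget(30, 1024, 2048, 64))
print(output_budget(1500, 1024, 2048, 1024))
```

With the configuration above, max_new_tokens is the binding limit for short prompts, and the total-length limit only starts to matter if max_output_length is lowered or max_new_tokens is raised.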

In [11]:
prompt = "What is the capital of France?"
inference = Inference(config)
# Use greedy sampling
output_greedy = inference.predict(prompt, sampling_method="greedy")

print("Greedy Sampling Output:")
print(output_greedy)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Greedy Sampling Output:
The capital of France is Paris.<|im_end|>


In [12]:
prompt = "What are the benefits of artificial intelligence?"
inference = Inference(config)
# Use Top-k sampling, drawing from the 5 most probable tokens
output_top_k = inference.predict(prompt, sampling_method="top_k", sampling_value=5)

print("Top-K Sampling Output:")
print(output_top_k)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Top-K Sampling Output:
Artificial intelligence (AI) has the potential to revolutionize many fields including healthcare, finance, education, and more. Here are some of the benefits of AI:

1. Improved accuracy and efficiency: AI systems can analyze large amounts of data and provide accurate insights and predictions that humans might miss. This can lead to increased efficiency and productivity.

2. Increased accessibility: AI can be used to create applications and tools that make complex processes easier to understand and use, making them more accessible to individuals with varying levels of expertise.

3. Improved decision making: AI can help individuals and organizations make data-driven decisions by providing them with the most accurate and relevant information.

4. Enhanced personalization: AI can be used to create personalized experiences for individuals, such as in the case of personalized healthcare or tailored marketing mes

In [13]:
prompt = "Can you explain quantum computing?"
inference = Inference(config)
# Use Top-p sampling, drawing from the smallest set of top tokens whose cumulative probability exceeds 0.9
output_top_p = inference.predict(prompt, sampling_method="top_p", sampling_value=0.9)

print("Top-P Sampling Output:")
print(output_top_p)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Top-P Sampling Output:
Quantum computing is a type of computing that uses the principles of quantum mechanics to perform calculations and solve problems. Unlike classical computing, which uses bits to represent and process information, quantum computing uses quantum bits, or qubits, which can represent multiple states simultaneously.
In classical computing, a bit can be either 0 or 1. In quantum computing, a qubit can be a 0, a 1, or a combination of both. This ability to represent multiple states simultaneously allows quantum computers to process vast amounts of information and solve problems that are infeasible for classical computers to solve.
Quantum computers use a complex system of quantum bits called a quantum computer, which can perform multiple calculations at the same time. This ability to perform multiple calculations at the same time is known as parallelism and allows quantum computers to solve problems much faster tha

In [14]:
prompt = "What is 2 + 2?"
inference = Inference(config)
# Short prompt, with the output capped at 64 new tokens
output_short = inference.predict(prompt, max_new_tokens=64)

print("Short Input/Output Example:")
print(output_short)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Short Input/Output Example:
2 + 2 is equal to 4.<|im_end|>


In [15]:
prompt = "Can you explain how quantum entanglement works and its potential applications in future technologies?"
inference = Inference(config)
# Longer prompt, with the output still capped at 64 new tokens
output_long = inference.predict(prompt, max_new_tokens=64)

print("Long Input/Output Example:")
print(output_long)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Long Input/Output Example:
Quantum entanglement is a phenomenon in which two or more particles become interconnected and their quantum states become correlated, even when they are separated by large distances. This means that the state of one particle can be instantly determined by the state of the other, regardless of the distance between them. This phenomenon was first observed in


In [10]:
prompt = "Can you give me a detailed explanation of machine learning algorithms, including supervised learning, unsupervised learning, and reinforcement learning, and provide examples of each?"
inference = Inference(config)
# Even longer prompt, with the output capped at 128 new tokens
output_extreme = inference.predict(prompt, max_new_tokens=128)

print("Extreme Length Input/Output Example:")
print(output_extreme)

inference.close()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[INFO] PyTorchSession device = npu


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INFO] init success
[INFO] NPU context set successfully
Extreme Length Input/Output Example:
Machine learning algorithms are a type of artificial intelligence that allows computers to learn from data and make predictions or decisions without being explicitly programmed. There are three main types of machine learning algorithms: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset, meaning that the input data is accompanied by the correct output. The algorithm learns to map the input data to the correct output by finding a function that best fits the data. Supervised learning is used for tasks such as classification and regression, where the algorithm is trained to predict a specific output based on the input
