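For model deployment, see [learn-llm-deploy-easily](https://gitee.com/coderwillyan/learn-llm-deploy-easily).

This section focuses on how to call models that have already been deployed.

# Embedding

## Using an embedding API

Using Zhipu AI as an example: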

In [None]:
import os

from langchain_community.embeddings import ZhipuAIEmbeddings

# Read the API key from the environment rather than hard-coding it
my_emb = ZhipuAIEmbeddings(
    model="embedding-2",
    api_key=os.environ["ZHIPUAI_API_KEY"],
)
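
Looking up `os.environ["ZHIPUAI_API_KEY"]` raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier lookup, assuming a hypothetical helper named `require_env` (it is not part of LangChain or the Zhipu SDK):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running this cell"
        )
    return value


# Hypothetical usage: api_key = require_env("ZHIPUAI_API_KEY")
```

The returned string can then be passed as the `api_key` argument in the cell above.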

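## Using a locally deployed embedding model

### Hugging Face Transformers

Features: low-level and flexible. It supports embedding models built on Transformer architectures such as BERT and RoBERTa, but you have to write the vector-extraction logic yourself.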

In [None]:
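# from transformers import AutoModel, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh")
# model = AutoModel.from_pretrained("BAAI/bge-large-zh")
# inputs = tokenizer(["这是一个测试"], padding=True, truncation=True, return_tensors="pt")
# embeddings = model(**inputs).last_hidden_state.mean(dim=1)  # mean-pool token vectors into a sentence vector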
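
A plain `.mean(dim=1)` also averages over padding positions when several sentences of different lengths are batched together. The sketch below illustrates the usual remedy, a mean that skips padded tokens, in pure Python so it runs without a model. The function name `masked_mean_pool` and the toy values are assumptions for illustration only.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                total[i] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / kept for value in total]


# Two real tokens followed by one padding token (hypothetical toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
sentence_vector = masked_mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

With real tensors the same idea is typically expressed by multiplying `last_hidden_state` by the attention mask and dividing by the mask's sum. Which pooling strategy a given model expects (mean, CLS token, last token) should be checked against its model card.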

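### Deploying with Sentence Transformers

Features: a wrapper around Transformers with a built-in pooling layer. It produces sentence-level embeddings in a single call and supports semantic similarity computation.

Project: https://github.com/UKPLab/sentence-transformers

Documentation: https://sbert.net.cn/index.html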

In [1]:
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B", device="cuda:3")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-0.6B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7646, 0.1414],
#         [0.1355, 0.6000]])

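Note: this run failed with `ModuleNotFoundError: No module named 'sentence_transformers'`. Install the package first (`pip install sentence-transformers`) before running the cell.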
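
To make the similarity step concrete, here is a minimal pure-Python sketch of the cosine similarity matrix between query and document vectors. The helper names `cosine_similarity` and `similarity_matrix` and the toy vectors are illustrative assumptions, not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, documents):
    """Row i, column j holds the similarity of query i to document j."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


# Hypothetical 2-dimensional vectors standing in for real embeddings.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
document_vectors = [[2.0, 0.0], [0.0, 3.0]]
matrix = similarity_matrix(query_vectors, document_vectors)
```

Each query matches the document pointing in the same direction, regardless of vector length, which is why cosine similarity is a common default for comparing embeddings.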

In [None]:
from sentence_transformers import SentenceTransformer

# Choose the GPU device (cuda:3 here), falling back to CPU when CUDA is unavailable
# device = "cuda:3" if torch.cuda.is_available() else "cpu"  # requires `import torch`
# Load the model
model = SentenceTransformer("/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B", device="cuda:3")
result = model.encode("这是一个测试")
result.tolist()[:5]

### Deploying with HuggingFaceEmbeddings

In [None]:
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

my_emb = HuggingFaceEmbeddings(
    model_name="/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda:3"},
)
query = "如何调用HuggingFaceEmbeddings？"
query_vector = my_emb.embed_query(query)
print("Query vector:", query_vector[:5], "...")

### Deploying with Ollama

In [None]:
# from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama.embeddings import OllamaEmbeddings
my_emb = OllamaEmbeddings(base_url='http://localhost:11434', model="dengcao/Qwen3-Embedding-0.6B:F16")

In [1]:
from langchain_ollama.embeddings import OllamaEmbeddings

# Initialize the embedding model
my_emb = OllamaEmbeddings(
    base_url='http://localhost:11434',
    model="bge-m3:latest"
)

# Sample text
query = "如何调用OllamaEmbeddings？"

# Generate the embedding vector
query_vector = my_emb.embed_query(query)

# Print the result
print("Query vector:", query_vector[:5], "...")

Query vector: [-0.01782775, -0.010210706, -0.031165373, 0.029789878, -0.024330704] ...


### Deploying with vLLM

In [4]:
# Requires vllm>=0.8.5
import torch
import vllm
from vllm import LLM

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

# model = LLM(model="/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B", task="embed")

model = LLM(
    model="/workspace/models/Qwen/Qwen3-Embedding-0___6B",
    task="embed",
    # trust_remote_code=True,
    enforce_eager=True,  # avoid potential graph-optimization issues
    dtype="float16",
)

outputs = model.embed(input_texts)
embeddings = torch.tensor([o.outputs.embedding for o in outputs])
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7620252966880798, 0.14078938961029053], [0.1358368694782257, 0.6013815999031067]]

Output (vLLM v0.10.1; startup and progress logs omitted):

[[0.7643852829933167, 0.1411387324333191], [0.13565093278884888, 0.5996948480606079]]
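
Each row of the score matrix corresponds to one query and each column to one document, so retrieval reduces to picking the largest entry per row. A minimal sketch using the scores printed above; the helper name `top_document_per_query` is a hypothetical choice for illustration.

```python
def top_document_per_query(scores):
    """Return, for each query row, the index and score of its best-matching document."""
    best = []
    for row in scores:
        index = max(range(len(row)), key=row.__getitem__)
        best.append((index, row[index]))
    return best


# Scores copied from the run above: 2 queries x 2 documents.
scores = [
    [0.7643852829933167, 0.1411387324333191],
    [0.13565093278884888, 0.5996948480606079],
]
best_matches = top_document_per_query(scores)
print(best_matches)  # [(0, 0.7643852829933167), (1, 0.5996948480606079)]
```

Here the first query is matched to the Beijing sentence and the second to the gravity sentence. Note that a raw dot product only behaves like cosine similarity when the embeddings are unit length, which is worth verifying for the model and pooling configuration in use.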





Docker-based deployment

### Deploying with Xinference

In [None]:
from langchain_community.embeddings import XinferenceEmbeddings

# Replace with your Xinference server URL and model UID
xinference = XinferenceEmbeddings(
    server_url="http://localhost:9997",  # Xinference server URL
    model_uid="your_model_uid"  # replace with the actual model UID
)

# Input texts
texts = ["你好，世界。", "LangChain 是一个强大的工具。"]

# Generate the embedding vectors
vectors = xinference.embed_documents(texts)

# Print the results
for idx, vector in enumerate(vectors):
    print(f"Text {idx + 1}: {texts[idx]}")
    print(f"Embedding: {vector[:5]}... (dimension: {len(vector)})")
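
The LangChain wrappers used in this section all expose the same two methods, `embed_query` and `embed_documents`, so retrieval code can be written once and reused across backends. The sketch below shows the idea with a hypothetical stand-in class, `ToyEmbeddings`, whose hash-based vectors carry no semantic meaning and exist only so the example runs without a server. The helper `most_similar` is likewise an assumption for illustration.

```python
import hashlib
import math


class ToyEmbeddings:
    """Hypothetical stand-in exposing the same two methods as the LangChain wrappers."""

    def __init__(self, dim=8):
        self.dim = dim  # must be at most 32, the length of a SHA-256 digest

    def embed_query(self, text):
        # Derive a deterministic, unit-length vector from a hash of the text.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        raw = [byte - 127.5 for byte in digest[: self.dim]]
        norm = math.sqrt(sum(v * v for v in raw))
        return [v / norm for v in raw]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]


def most_similar(embedder, query, texts):
    """Return the text whose embedding has the highest dot product with the query."""
    query_vector = embedder.embed_query(query)
    doc_vectors = embedder.embed_documents(texts)
    scores = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]
    best = max(range(len(texts)), key=scores.__getitem__)
    return texts[best], scores[best]


match, score = most_similar(ToyEmbeddings(), "hello", ["hello", "world"])
```

Passing `my_emb` or `xinference` from the cells above in place of `ToyEmbeddings()` should work the same way, assuming the corresponding service is running.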