For model deployment, see [learn-llm-deploy-easily](https://gitee.com/coderwillyan/learn-llm-deploy-easily).

This document focuses on how to call models that are already deployed.

# embedding

## Using an embedding API

Zhipu AI is used as the example here.

In [None]:
import os

from langchain_community.embeddings import ZhipuAIEmbeddings

# The API key is read from the ZHIPUAI_API_KEY environment variable
my_emb = ZhipuAIEmbeddings(
    model="embedding-2",
    api_key=os.environ["ZHIPUAI_API_KEY"],
)

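## Using a locally deployed embedding model

### Hugging Face Transformers

Features: low-level and flexible. It supports embedding models built on Transformer architectures such as BERT and RoBERTa, but you have to write the vector-extraction (pooling) logic yourself.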

In [None]:
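from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh")
model = AutoModel.from_pretrained("BAAI/bge-large-zh")
# Tokenize a single sentence ("This is a test"); with one input there is no padding
inputs = tokenizer(["这是一个测试"], return_tensors="pt")
embeddings = model(**inputs).last_hidden_state.mean(dim=1)  # mean over tokens gives the sentence vector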
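
When several sentences of different lengths are embedded in one batch, a plain mean over `last_hidden_state` also averages the padding positions. The following is a minimal sketch of mask-aware mean pooling, written with plain Python lists so that it runs without a model. `masked_mean_pool` is a hypothetical helper, not part of Transformers; with real tensors the same idea would be applied using the tokenizer's `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: two real tokens followed by one padding token (mask value 0)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result, which is the point of passing the mask.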

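### Deploying with Sentence Transformers

Features: a wrapper built on Transformers with a built-in pooling layer. It produces sentence-level embeddings in a single call and supports semantic similarity computation.

Project: https://github.com/UKPLab/sentence-transformers

Reference documentation: https://sbert.net.cn/index.html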

In [None]:
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

# Load the model from a local path and place it on GPU 3
model = SentenceTransformer("/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B", device="cuda:3")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-0.6B",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7646, 0.1414],
#         [0.1355, 0.6000]])
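
To make the similarity matrix less of a black box, here is a minimal sketch of the cosine similarity computation in plain Python. The helper names `cosine_similarity` and `similarity_matrix` are illustrative, and the 2-dimensional vectors are toy stand-ins for real embeddings; this is not the library's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, documents):
    """Row i, column j holds the similarity of query i and document j."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


# Toy 2-dimensional vectors standing in for real embeddings
toy_queries = [[1.0, 0.0], [0.0, 2.0]]
toy_documents = [[3.0, 0.0], [0.0, 1.0]]
matrix = similarity_matrix(toy_queries, toy_documents)
print(matrix)  # [[1.0, 0.0], [0.0, 1.0]]
```

As in the output above, each row corresponds to a query and each column to a document, so the largest value in a row marks the best-matching document.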

In [None]:
from sentence_transformers import SentenceTransformer

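# Select the GPU device (GPU 3 here); the commented line falls back to CPU and needs `import torch`
# device = "cuda:3" if torch.cuda.is_available() else "cpu"
# Load the model
model = SentenceTransformer("/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B", device="cuda:3")
result = model.encode("这是一个测试")  # "This is a test"
result.tolist()[:5]  # first five components of the embedding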

### Deploying with HuggingFaceEmbeddings

In [None]:
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

# Load the local model on GPU 3
my_emb = HuggingFaceEmbeddings(
    model_name="/opt/workspace/models/Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda:3"},
)
query = "如何调用HuggingFaceEmbeddings？"  # "How do I call HuggingFaceEmbeddings?"
query_vector = my_emb.embed_query(query)
print("Query vector:", query_vector[:5], "...")
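
LangChain embedding objects such as `my_emb` expose `embed_query` and `embed_documents`, so one helper can rank documents against a query whichever backend is used. The sketch below assumes only those two methods. `rank_documents` and `ToyEmbeddings` are hypothetical names introduced for illustration; `ToyEmbeddings` is a keyword counter that lets the example run without a model.

```python
import math


def rank_documents(embedder, query, documents):
    """Return (score, document) pairs sorted by cosine similarity, best first."""
    query_vec = embedder.embed_query(query)
    doc_vecs = embedder.embed_documents(documents)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


class ToyEmbeddings:
    """Hypothetical stand-in so the sketch runs without a model: counts three keywords."""

    keywords = ("embedding", "gpu", "langchain")

    def embed_query(self, text):
        lowered = text.lower()
        return [float(lowered.count(word)) for word in self.keywords]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


docs = ["LangChain wraps many embedding backends", "GPU memory usage"]
ranked = rank_documents(ToyEmbeddings(), "which embedding does LangChain use", docs)
print(ranked[0][1])  # LangChain wraps many embedding backends
```

In a real notebook, passing `my_emb` in place of `ToyEmbeddings()` should work unchanged, provided the model is loaded.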

### Deploying with Ollama

In [None]:
# Older import path: from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama.embeddings import OllamaEmbeddings

my_emb = OllamaEmbeddings(base_url="http://localhost:11434", model="dengcao/Qwen3-Embedding-0.6B:F16")

In [1]:
from langchain_ollama.embeddings import OllamaEmbeddings

# Initialize the embedding model
my_emb = OllamaEmbeddings(
    base_url='http://localhost:11434',
    model="bge-m3:latest"
)

# Sample text
query = "如何调用OllamaEmbeddings？"  # "How do I call OllamaEmbeddings?"

# Generate the embedding vector
query_vector = my_emb.embed_query(query)

# Print the result
print("Query vector:", query_vector[:5], "...")

Query vector: [-0.01782775, -0.010210706, -0.031165373, 0.029789878, -0.024330704] ...


### Deploying with vLLM

In [2]:
# See the Qwen model release documentation
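
vLLM can serve an OpenAI-compatible HTTP API that includes an embeddings endpoint. Below is a rough sketch of calling it with the standard library only. The base URL (port 8000 is vLLM's default) and the model name are assumptions and must match how the server was actually started. The helper names are hypothetical, and the Qwen documentation remains the authoritative reference.

```python
import json
import urllib.request

# Assumed values: adjust to match how the vLLM server was started
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"


def build_embedding_request(texts, model=MODEL_NAME, base_url=VLLM_BASE_URL):
    """Build a POST request for the OpenAI-style /embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding_response(body):
    """Extract the vectors from an OpenAI-style response, in input order."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_texts(texts):
    """Send the request; requires a running server at VLLM_BASE_URL."""
    with urllib.request.urlopen(build_embedding_request(texts), timeout=30) as resp:
        return parse_embedding_response(json.load(resp))


# Offline check of the request and of the parsing, using a hand-written response
request = build_embedding_request(["hello"])
print(request.full_url)  # http://localhost:8000/v1/embeddings
sample = {"data": [{"index": 1, "embedding": [0.3]}, {"index": 0, "embedding": [0.1]}]}
print(parse_embedding_response(sample))  # [[0.1], [0.3]]
```

Only `embed_texts` touches the network. The other two functions can be checked without a server, which helps when debugging a payload or a response shape.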

### Deploying with Xinference

In [None]:
from langchain_community.embeddings import XinferenceEmbeddings

# Replace with your Xinference server URL and model UID
xinference = XinferenceEmbeddings(
    server_url="http://localhost:9997",  # Xinference server URL
    model_uid="your_model_uid"  # replace with the actual model UID
)

# Input texts: "Hello, world." and "LangChain is a powerful tool."
texts = ["你好，世界。", "LangChain 是一个强大的工具。"]

# Generate the embedding vectors
vectors = xinference.embed_documents(texts)

# Print the results
for idx, vector in enumerate(vectors):
    print(f"Text {idx + 1}: {texts[idx]}")
    print(f"Embedding: {vector[:5]}... (dimension: {len(vector)})")