# Semantic Search

Official documentation: https://sbert.net/examples/sentence_transformer/applications/semantic-search/README.html

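Given a query, find the most relevant documents in a corpus. The core workflow:
1. Encode all documents into vectors ahead of time
2. Encode the query into a vector
3. Compute similarity scores between the query and every document, then rank the documents by score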
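
The three steps above can be sketched without any model, using plain Python lists as stand-in vectors. This is a minimal illustration that assumes cosine similarity as the scoring function; the names `cosine_similarity` and `rank_documents` and the toy 3-dimensional vectors are made up for this example and are not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Step 1: document vectors, computed ahead of time
# (toy 3-d values for illustration, not real embeddings)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
# Step 2: the query vector
query_vec = [1.0, 0.1, 0.0]
# Step 3: score every document and keep the best matches
ranking = rank_documents(query_vec, doc_vecs, top_k=2)
for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

Real embeddings have hundreds of dimensions and are scored with tensor operations rather than Python loops, but the ranking logic is the same.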

In [None]:
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

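# Corpus
corpus = [
    "Machine learning is the field of study that lets computers learn without being explicitly programmed.",
    "Deep learning is a machine learning approach based on artificial neural networks.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "A Mars rover is a robot that drives across the surface of Mars collecting data.",
    "Renewable energy includes solar, wind, and hydro power.",
]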

# Pre-encode the documents (in a real application this only needs to be done once)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
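
One way to make the "encode once" idea concrete is to cache the embeddings on disk and reload them on later runs. The sketch below is only an illustration under stated assumptions: `load_or_compute` is a hypothetical helper, `fake_encode` stands in for the real encoding call, and `pickle` is just one possible storage format.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load cached embeddings from cache_path, or compute and store them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

# Stand-in for the expensive encoding step; counts how often it runs.
calls = {"n": 0}

def fake_encode():
    calls["n"] += 1
    return [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), "corpus_embeddings.pkl")
first = load_or_compute(cache_path, fake_encode)   # computes and writes the cache
second = load_or_compute(cache_path, fake_encode)  # reads the cache instead
print(calls["n"])
```

With the model from this notebook, `compute_fn` could be `lambda: model.encode(corpus, convert_to_tensor=True)`. The cache has to be rebuilt whenever the corpus or the model changes, and pickle files should only be loaded from trusted sources.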

In [None]:
# Queries
queries = ["What is an artificial neural network?", "What clean energy sources are there?"]

top_k = 3
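for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)
    # similarity() returns a (1, len(corpus)) matrix; take its only row
    scores = model.similarity(query_embedding, corpus_embeddings)[0]
    # Cap k at the corpus size, since torch.topk raises an error if k exceeds it
    top_scores, top_indices = torch.topk(scores, k=min(top_k, len(corpus)))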

    print(f"\nQuery: {query}")
    for score, idx in zip(top_scores, top_indices):
        print(f"  [{score:.4f}] {corpus[idx]}")

In [None]:
# Use the util.semantic_search helper, which accepts several queries at once and processes the corpus in chunks
from sentence_transformers import util

query_embedding = model.encode("How are deep learning and machine learning related?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

for hit in hits[0]:
    print(f"  [{hit['score']:.4f}] {corpus[hit['corpus_id']]}")
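
`util.semantic_search` returns one list per query, and each list holds dicts with the keys `corpus_id` and `score`, sorted by decreasing score. The sketch below shows one way to flatten that structure when several queries are searched at once. It uses hand-written hits with made-up scores instead of calling the model, and `hits_to_rows` is a hypothetical helper, not a library function.

```python
def hits_to_rows(queries, hits, corpus):
    """Flatten semantic_search-style results into (query, score, text) rows."""
    rows = []
    for query, query_hits in zip(queries, hits):
        for hit in query_hits:
            rows.append((query, hit["score"], corpus[hit["corpus_id"]]))
    return rows

# Hand-written results in the shape util.semantic_search returns:
# one list per query, each holding dicts with "corpus_id" and "score".
# The scores are invented for illustration.
example_corpus = ["doc A", "doc B", "doc C"]
example_queries = ["first query", "second query"]
example_hits = [
    [{"corpus_id": 1, "score": 0.9}, {"corpus_id": 0, "score": 0.4}],
    [{"corpus_id": 2, "score": 0.7}],
]

rows = hits_to_rows(example_queries, example_hits, example_corpus)
for query, score, text in rows:
    print(f"{query} | {score:.4f} | {text}")
```

To get real hits for several queries, encode the list in one call, for example `model.encode(queries, convert_to_tensor=True)`, and pass the resulting 2D tensor to `util.semantic_search` in place of the single query embedding.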