# C8：语义搜索和 RAG

语义搜索主流技术：
  * 稠密搜索：基于文本嵌入，将搜索问题转化为查询向量与文档向量的最临近匹配过程
  * 重排序：多阶段处理，重排序模型对结果做相关性评分，并优化排序
  * RAG

## 重排序

1. 目标文本预处理和句子分割
2. 生成句子的向量表示
3. 简历搜索索引
4. 执行搜索并分析结果

缺点：
  * 不存在答案时仍然返回结果：通常返回结果由用户自行判断，并根据反馈持续优化模型
  * 无法精准匹配短语：需要结合关键字搜索
  * 在训练数据外性能显著下降
  * 长文本分块

### 长文本分块策略


#### 单文本单向量方案

* 仅嵌入文档的代表性段落，忽略剩余内容：适合可以概括核心观点的文章
* 分割后对各块做嵌入，聚合块嵌入到单个向量：压缩缺陷导致大量细节丢失

#### 单文档多向量方案

几种选择：

* 句子分割：可能粒度过小，无法捕捉足够的上下文
* 段落分割：段落较短时，或 3～8 个句子一个段落
* 上下文增强：
  * 块附加标题
  * 重叠结构：引入一部分上下文内容
* LLM 动态智能分块

### 搜索

* 搜索库：
  * Annoy
  * FAISS
* 向量数据库：增删向量无需重建索引等高级功能

### 微调

核心：拉进相关查询和文档的距离，推离不相关查询
 

In [3]:
#%pip install cohere rank_bm25
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp313-cp313-macosx_14_0_arm64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp313-cp313-macosx_14_0_arm64.whl (3.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.3/3.3 MB[0m [31m484.0 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
import cohere
import numpy as np
import pandas as pd
from tqdm import tqdm
import faiss

api_key = ''

co = cohere.Client(api_key)

text = ''

texts = [sentence.strip(' \n') for sentence in text.split('.')]
response = co.embed(texts, input_type='search_document').embeddings
embeds = np.array(response)
print(embeds.shape)

dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
print(index.is_trained)
index.add(np.float32(embeds))

def search(query, k=3):
    query_embed = co.embed(query, input_type='search_query').embeddings[0]
    distances, indices = index.search(np.float32([query_embed]), k)
    texts_np = np.array(texts)
    results = pd.DataFrame(data={'texts': texts_np[indices[0]], 'distances': distances[0]})

    return results


query = ''
results = search(query)
print(results)

## 重排序

先检索，一般采用关键字+稠密的混合方式

再通过重排序模型计算相关性，并根据评分排序搜索结果

### sentence-transformers

<https://www.sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html>

### 工作机制

将查询与每个候选结果共同输入交叉编码器架构的 LLM：允许模型同事分析查询文本与文档内容后生成相关性评分


In [None]:
query = ''

results = co.rerank(query=query, documents=texts, top_n=3, return_documents=True)

for idx, result in enumerate(results):
    print(idx, result.relevance_score, result.document.text)


## 检索评估指标

搜索系统评估框架三要素：
  * 文档库
  * 查询集合
  * 相关性判断

指标：
  * 基于平均精确率的单查询评分
  * 基于均值平均精确率的多查询评分

## RAG

检索 + 生成

### 基于知识的生成

核心：在搜索末端接入 LLM

具体实现：将用户的问题与检索获得的前若干个相关文档共同输入 LLM，使其基于检索提供的上下文生成答案

### 高级技术

#### 查询改写

使用 LLM 将原始查询改写成更利于 LLM 的简洁模式

#### 多查询 RAG

将复杂问题生成多个关联查询

对多次查询最佳结果输入模型进行事实性回答

或者给改写器自主判断能力：需要执行检索或直接生成可靠的答案

#### 多跳 RAG

分布复杂推离，执行连续检索

#### 查询路由

根据数据源做重定向检索

#### Agent RAG

数据源抽象为 Agent 的 tools

### 效果评估

核心维度：

* 流畅性：生成文本的语言流畅度和实用价值
* 感知效用：回答内容的信息价值和使用价值
* 引用召回率：外部事实陈述中获得完整引证支持的比例
* 引用精确率：引用内容对相关论断的支持的有效性

#### Ragas

开源工具库

评估指标：

* 重视度：答案和上下文一致性程度
* 答案相关性：答案和提问主题的契合度


In [4]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""


texts = [sentence.strip(' \n') for sentence in text.split('.')]

In [None]:
import faiss
from langchain import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
#<https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html#langchain_community.vectorstores.faiss.FAISS>
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_community.docstore.in_memory import InMemoryDocstore


llm = LlamaCpp(
    model_path='Phi-3-mini-4k-instruct-fp16.gguf',
    max_tokens=500,
    n_ctx=2048,
    seed=42,
    verbose=False,
)


embedding_model = HuggingFaceEmbeddings(
    model_name='BAAI/bge-small-en-v1.5'
)

index = faiss.IndexFlatL2(len(embedding_model.embed_query("hello world")))

db = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)

template = '''<|user|>
Relevant information:
{context}

Provide a concise answer the following question using the relevant information provided above:
{question}<|end|>
<|assistant|>
'''
prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'question']
)

rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    chain_type_kwargs={'prompt': prompt},
    verbose=True
)

query = 'Income generated'
result = rag.invoke(query)
print(result)

llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (n



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m
{'query': 'Income generated', 'result': " I'm sorry, but there is no information provided above to answer the question about income generated. If you can provide details or context related to the income in question, I would be happy to help with an analysis or summary."}
