# Embedding Retrieval Queries with LlamaIndex

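Main goals and conclusions:

- Build an embedding retrieval query pipeline step by step by combining several techniques
- Preliminary conclusion: model size affects the quality of embedding-based queries
  - Small models (LLM/Embedding/Reranker) that fit in 4 GB of VRAM are slow and have very low accuracy
  - With a `Qwen2:7B`-class model, accuracy improves considerably
  - With a suitable LLMReranker, results on a small dataset are very good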

## Basic approach

- Embeddings are stored in the [Qdrant](https://github.com/qdrant/qdrant) vector database
- The [LlamaIndex](https://github.com/run-llama/llama_index) framework is used to simplify LLM application development

## Preparation

In [1]:
%%time
%%capture

# Install the required libraries

!pip install llama-index-vector-stores-qdrant
!pip install qdrant_client
!pip install llama-index-llms-openai-like
!pip install llama-index-readers-file
!pip install llama-index-embeddings-ollama

CPU times: user 64.4 ms, sys: 21.1 ms, total: 85.6 ms
Wall time: 11.9 s


In [2]:
%%time

# Import libraries

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from IPython.display import Markdown, display
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client.models import Distance, VectorParams

from llama_index.llms.openai_like import OpenAILike

CPU times: user 4.73 s, sys: 602 ms, total: 5.33 s
Wall time: 4.71 s


In [3]:
%%time

# Set the default LLM

TOKEN="sk-W8fMtMdNWxNPxAf0F869DfB1Aa0c4bDf9263AbDfEa592d59"
HOST_URL="http://oneapi:3000"
MODEL_NAME="qwen2:1.5b"

llm = OpenAILike(model= MODEL_NAME, 
                 api_base= f"{HOST_URL}/v1", 
                 api_key= TOKEN,
                 is_chat_model= True,
                 temperature= 0.1
                )

Settings.llm =llm

CPU times: user 1.33 ms, sys: 180 μs, total: 1.51 ms
Wall time: 1.06 ms


In [4]:
%%time

# Initialize the global embedding model

from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name="quentinz/bge-large-zh-v1.5",
    base_url="http://llms:11434",
    ollama_additional_kwargs={"mirostat": 0}, # mirostat N: use Mirostat sampling (0 = disabled)
)

Settings.embed_model = ollama_embedding

CPU times: user 895 μs, sys: 0 ns, total: 895 μs
Wall time: 638 μs


In [5]:
%%time

# Set the chunk size and overlap

Settings.chunk_size=128
Settings.chunk_overlap=10

Settings

CPU times: user 324 ms, sys: 26.5 ms, total: 350 ms
Wall time: 349 ms


_Settings(_llm=OpenAILike(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd8ff3b7550>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x7fd9b5fb41f0>, completion_to_prompt=<function default_completion_to_prompt at 0x7fd9b5e50670>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model='qwen2:1.5b', temperature=0.1, max_tokens=None, logprobs=None, top_logprobs=0, additional_kwargs={}, max_retries=3, timeout=60.0, default_headers=None, reuse_client=True, api_key='sk-W8fMtMdNWxNPxAf0F869DfB1Aa0c4bDf9263AbDfEa592d59', api_base='http://oneapi:3000/v1', api_version='', context_window=3900, is_chat_model=True, is_function_calling_model=False, tokenizer=None), _embed_model=OllamaEmbedding(model_name='quentinz/bge-large-zh-v1.5', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7fd8ff3b7550>, num_workers=None, base_url='http://ll
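
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is a simplification under stated assumptions: it counts characters, whereas LlamaIndex's default splitter counts tokens and tries to respect sentence boundaries. `chunk_text` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_text(text: str, chunk_size: int = 128, chunk_overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size characters, where each window
    shares chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 300 digits, so the window positions are easy to check by hand.
sample = "".join(str(i % 10) for i in range(300))
chunks = chunk_text(sample, chunk_size=128, chunk_overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

With small chunks like these, a single fact (such as who married whom) is likely to be spread over several chunks, which is worth keeping in mind when reading the `top_k` experiments.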

In [6]:
%%time

# Index storage path
INDEX_PATH = "weicheng-index"

CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 4.77 μs


## Using Qdrant as the vector store

In [7]:
%%time

client = qdrant_client.QdrantClient(
    location=":memory:",
    vectors_config=VectorParams(
        size=1024, 
        distance=Distance.COSINE
    ),
)

vector_store = QdrantVectorStore(client=client, collection_name="demo")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

CPU times: user 574 μs, sys: 76 μs, total: 650 μs
Wall time: 584 μs


## Loading the document

In [8]:
%%time

# load documents
documents = SimpleDirectoryReader(input_files=['围城.txt']).load_data()

CPU times: user 14.7 ms, sys: 358 μs, total: 15 ms
Wall time: 14.3 ms


In [9]:
# First 1000 characters of the document
documents[0].text[:1000]

'『围城/作者:钱钟书』\n『状态:全本』\n『内容简介:\n    \u3000在这本书里，我想写现代中国某一部分社会、某一类人物。写这类人，我没忘记他们是人类，只是人类，具有无毛两足动物的基本根性。角色当然是虚构的，但是有考据癖的人也当然不肯错过索隐的杨会、放弃附会的权利的。\n \n \u3000\u3000这本书整整写了两年。两年里忧世伤生，屡想中止。由于杨绛女士不断的督促，替我挡了许多事，省出时间来，得以锱铢积累地写完。照例这本书该献给她。不过，近来觉得献书也像“致身于国”、“还政于民”等等佳话，只是语言幻成的空花泡影，名说交付出去，其实只仿佛魔术家玩的飞刀，放手而并没有脱手。随你怎样把作品奉献给人，作品总是作者自已的。大不了一本书，还不值得这样精巧地不老实，因此罢了。\n \n \u3000\u3000三十五年【一九四九年】十二月十五日』\n天下电子书Txt版阅读,下载和分享更多电子书请访问:http://www.txdzs.com,手机访问:http://3g.txdzs.com,E-mail:support@txdzs.com\n------章节内容开始-------\n 序\n    序\n    在这本书里，我想写现代中国某一部分社会、某一类人物。写这类人，我没忘记他们是人类，只是人类，具有无毛两足动物的基本根性。角色当然是虚构的，但是有考据癖的人也当然不肯错过索隐的杨会、放弃附会的权利的。\n    这本书整整写了两年。两年里忧世伤生，屡想中止。由于杨绛女士不断的督促，替我挡了许多事，省出时间来，得以锱铢积累地写完。照例这本书该献给她。不过，近来觉得献书也像“致身于国”、“还政于民”等等佳话，只是语言幻成的空花泡影，名说交付出去，其实只仿佛魔术家玩的飞刀，放手而并没有脱手。随你怎样把作品奉献给人，作品总是作者自已的。大不了一本书，还不值得这样津巧地不老实，因此罢了。\n    三十五年【一九四九年】十二月十五日\n 前言\n    前言\n    重印前记\n    《围城》一九四七年在上海初版，一九四八年再版，一九四九年三版，以后国内没有重印过。偶然碰见它的新版，那都是香港的“盗印”本。没有看到台湾的“盗印”，据说在那里它是禁书。美国哥轮比亚大学夏志清教授的英文著作里对它作了过高的评价，导致了一些西方语言的译本。日本京都大学荒井健教授很久以前

In [10]:
# Document length (in characters)
len(documents[0].text)

218562

## Creating the vector index

In [11]:
%%time

import os

if not os.path.exists(INDEX_PATH):
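    # Build the vector index from the documents.
    # Note: storage_context is not passed here, so the index uses LlamaIndex's
    # default in-memory vector store, not the Qdrant store created above.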
    index = VectorStoreIndex.from_documents(documents)

CPU times: user 20 μs, sys: 2 μs, total: 22 μs
Wall time: 26.2 μs


## Persisting the vector index

In [12]:
%%time

if not os.path.exists(INDEX_PATH):
    index.storage_context.persist(INDEX_PATH)

CPU times: user 35 μs, sys: 4 μs, total: 39 μs
Wall time: 43.4 μs


## Loading the vector index

In [13]:
%%time

from llama_index.core import load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir=INDEX_PATH)
index=load_index_from_storage(storage_context)

CPU times: user 25.8 s, sys: 140 ms, total: 25.9 s
Wall time: 25.9 s
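
The three sections above follow a build-once, load-afterwards pattern keyed on whether `INDEX_PATH` exists. A library-free sketch of the same idea is shown here; `build_index` stands in for the expensive embedding step and a JSON file stands in for the persisted storage, so every name in it is hypothetical.

```python
import json
import os
import tempfile


def build_index(docs: list[str]) -> dict:
    """Stand-in for the expensive step (embedding every chunk)."""
    return {"doc_lengths": [len(d) for d in docs]}


def load_or_build(path: str, docs: list[str]) -> tuple[dict, bool]:
    """Return (index, built). Build and persist only if nothing is stored yet."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f), False
    index = build_index(docs)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    first, built_first = load_or_build(path, ["abc", "de"])
    second, built_second = load_or_build(path, ["abc", "de"])
    print(built_first, built_second, first == second)
```

One consequence of this pattern is that the stored index is reused even if the chunk settings or the embedding model change later; deleting the stored directory forces a rebuild.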


## Using the retriever directly

In [14]:
%%time

retriever = index.as_retriever()
nodes = retriever.retrieve("方鸿渐的妻子是谁")

print("\n\n".join(node.text for node in nodes))

以后飞机接连光顾，大有绝世侍人一顾倾城、再顾倾国的风度。周经理拍电报，叫鸿渐快到上海，否则交通断绝，要困守在家里。方老先生也觉得在这种时局里，儿子该快出去找机会，所以让鸿渐走了。

周太太看方鸿渐捧报老遮着脸，笑对丈夫说：“你瞧鸿渐多得意，那条新闻看了几遍不放手。”
    效成顽皮道：“鸿渐哥在仔细认那位苏文纨，想娶她来代替姐姐呢。
CPU times: user 153 ms, sys: 31 μs, total: 153 ms
Wall time: 1.6 s


In [15]:
# The default similarity_top_k is 2
len(nodes)

2
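
Conceptually, the retriever embeds the query and returns the `top_k` stored chunks whose vectors are most similar to it. The following is a minimal sketch using cosine similarity over hand-written 2-dimensional vectors; the vectors, ids, and helper names are illustrative assumptions, and a real vector store would use an index rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy "embeddings": ids a-d are hypothetical chunk ids.
docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
hits = top_k([1.0, 0.1], docs, k=2)
print([doc_id for doc_id, _ in hits])
```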

## Retrieval-based querying

### Synchronous output

In [16]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = index.as_retriever()

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
response = query_engine.query("方鸿渐的妻子是谁")
print(response)

方鸿渐的妻子是周太太。
CPU times: user 221 ms, sys: 10 μs, total: 221 ms
Wall time: 2.57 s
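
The `tree_summarize` response mode combines the retrieved chunks bottom-up: groups of chunks are summarized, and the summaries are summarized again until a single answer remains. The sketch below shows only that recursion. It assumes a fixed group size `fan_in`, whereas LlamaIndex packs as many chunks as fit into the context window, and `fake_summarize` is a hypothetical stand-in for the LLM call.

```python
from typing import Callable


def tree_summarize(chunks: list[str],
                   summarize: Callable[[list[str]], str],
                   fan_in: int = 3) -> str:
    """Merge groups of fan_in texts level by level until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # One summarize call per group at the current level of the tree.
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        if len(level) == 1:
            return level[0]


calls = []  # records the size of each group passed to the stand-in


def fake_summarize(texts: list[str]) -> str:
    """Hypothetical stand-in for an LLM call: just shows the grouping."""
    calls.append(len(texts))
    return "(" + "|".join(texts) + ")"


answer = tree_summarize([f"c{i}" for i in range(7)], fake_summarize, fan_in=3)
print(answer)
print(calls)
```

The number of model calls grows with the number of retrieved chunks, which is one plausible reason the wall time rises with `top_k` in the experiments that follow.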


### Streaming output

In [17]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是周太太。CPU times: user 189 ms, sys: 365 μs, total: 190 ms
Wall time: 596 ms


### Does increasing top_k improve accuracy?

#### top_k=5

In [18]:
%%time

retriever = index.as_retriever(
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是孙小姐。CPU times: user 174 ms, sys: 8.08 ms, total: 182 ms
Wall time: 1.07 s


#### top_k=20

In [19]:
%%time

retriever = index.as_retriever(
    similarity_top_k=20,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐没有妻子。CPU times: user 176 ms, sys: 4.01 ms, total: 180 ms
Wall time: 3.11 s


#### top_k=30

In [20]:
%%time

retriever = index.as_retriever(
    similarity_top_k=30,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是苏小姐。CPU times: user 181 ms, sys: 8.14 ms, total: 189 ms
Wall time: 2.92 s


#### top_k=100

In [21]:
%%time

retriever = index.as_retriever(
    similarity_top_k=100,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是苏小姐。CPU times: user 238 ms, sys: 7.89 ms, total: 246 ms
Wall time: 12.2 s


#### top_k=1000

In [22]:
%%time

retriever = index.as_retriever(
    similarity_top_k=1000,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是苏小姐。CPU times: user 877 ms, sys: 16.2 ms, total: 893 ms
Wall time: 1min 37s
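
To compare these runs at a glance, the recorded answers can be checked against the expected one. The sketch below is a crude substring check and rests on an assumption supplied by hand: in the novel, 方鸿渐 marries 孙柔嘉, who is also referred to as 孙小姐. The `recorded` answers are copied from the outputs of the `top_k` cells above; `is_correct` is a hypothetical helper.

```python
# Names accepted as the right answer (assumption based on the novel's plot).
EXPECTED = ("孙柔嘉", "柔嘉", "孙小姐")


def is_correct(answer: str) -> bool:
    """Crude check: does the answer mention any accepted name?"""
    return any(name in answer for name in EXPECTED)


# Answers recorded above for the small model, keyed by similarity_top_k.
recorded = {
    5: "方鸿渐的妻子是孙小姐。",
    20: "方鸿渐没有妻子。",
    30: "方鸿渐的妻子是苏小姐。",
    100: "方鸿渐的妻子是苏小姐。",
    1000: "方鸿渐的妻子是苏小姐。",
}

results = {k: is_correct(answer) for k, answer in recorded.items()}
accuracy = sum(results.values()) / len(results)
print(results)
print(f"accuracy: {accuracy:.2f}")
```

A substring check like this also accepts hedged answers that merely mention the name, so it is better treated as a quick filter than as a real evaluation.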


## reranker

### bge-reranker-base

In [23]:
%%time
%%capture

!pip install llama-index-embeddings-huggingface

CPU times: user 21.4 ms, sys: 12.1 ms, total: 33.5 ms
Wall time: 2.58 s


In [24]:
%%time

from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="/models/bge-reranker-base", top_n=5, 
    device="cpu"
)

CPU times: user 3.26 s, sys: 855 ms, total: 4.12 s
Wall time: 2.68 s


In [25]:
%%time

retriever = index.as_retriever(
    similarity_top_k=1000,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[rerank]
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是张太太和张小姐。CPU times: user 8min 9s, sys: 16.1 s, total: 8min 25s
Wall time: 1min 56s


In [26]:
%%time

retriever = index.as_retriever(
    similarity_top_k=2000,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[rerank]
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是刘太太。CPU times: user 16min 37s, sys: 28.4 s, total: 17min 6s
Wall time: 3min 53s
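
The two cells above use a two-stage pipeline: a cheap first stage selects `similarity_top_k` candidates, and a more expensive reranker rescores only those and keeps `top_n`. Here is a minimal sketch of that structure. Both scoring functions are toy stand-ins (shared-word count for embedding similarity, shared-word density for the cross-encoder), and all names and documents are hypothetical.

```python
def shared_words(query: str, doc: str) -> int:
    """Toy coarse score: number of distinct query words found in the doc."""
    return len(set(query.split()) & set(doc.split()))


def density(query: str, doc: str) -> float:
    """Toy fine score: shared words relative to the doc length."""
    return shared_words(query, doc) / len(doc.split())


def retrieve_then_rerank(query, candidates, coarse_score, fine_score, top_k, top_n):
    """Stage 1 keeps top_k by the cheap score; stage 2 reorders them and keeps top_n."""
    shortlist = sorted(candidates, key=lambda c: coarse_score(query, c), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda c: fine_score(query, c), reverse=True)[:top_n]


docs = [
    "fang hongjian and his wife sun roujia returned home to shanghai last week",
    "fang wife sun",
    "fang went abroad",
    "rain fell",
]
best = retrieve_then_rerank("fang wife", docs, shared_words, density, top_k=3, top_n=2)
print(best)
```

The reranker can only reorder what the first stage returned, and its cost scales with `top_k`: in the runs above, doubling `similarity_top_k` from 1000 to 2000 roughly doubled the wall time (1min 56s to 3min 53s).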


### LLM rerank

In [27]:
%%time

from llama_index.core.postprocessor import LLMRerank

reranker = LLMRerank(
            choice_batch_size=5,
            top_n=5,
)

CPU times: user 153 μs, sys: 0 ns, total: 153 μs
Wall time: 157 μs
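
`choice_batch_size` and `top_n` control how an LLM-based reranker works through the candidates: they are presented to the model in batches, each batch yields relevance scores, and the best `top_n` overall are kept. The sketch below illustrates only that batching logic. `score_batch` is a hypothetical stand-in for the LLM call, and the real `LLMRerank` prompt and parsing are not reproduced here.

```python
from typing import Callable


def batched_rerank(query: str,
                   nodes: list[str],
                   score_batch: Callable[[str, list[str]], list[float]],
                   choice_batch_size: int = 5,
                   top_n: int = 5) -> list[tuple[str, float]]:
    """Score nodes batch by batch, then keep the top_n highest-scoring ones."""
    scored = []
    for i in range(0, len(nodes), choice_batch_size):
        batch = nodes[i:i + choice_batch_size]
        scored.extend(zip(batch, score_batch(query, batch)))  # one "LLM call" per batch
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


batch_sizes = []  # records how many nodes each stand-in call received


def fake_score_batch(query: str, batch: list[str]) -> list[float]:
    """Hypothetical stand-in for the LLM: counts occurrences of the query."""
    batch_sizes.append(len(batch))
    return [float(text.count(query)) for text in batch]


nodes = [f"node {i} " + "x" * (i % 4) for i in range(12)]
top = batched_rerank("x", nodes, fake_score_batch, choice_batch_size=5, top_n=3)
print([text for text, _ in top], batch_sizes)
```

With 100 candidates and a batch size of 5, this scheme would need 20 model calls per query, so the reranking cost depends directly on the LLM's latency.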


In [28]:
%%time

retriever = index.as_retriever(
    similarity_top_k=100,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
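    # Note: `rerank` is the SentenceTransformerRerank created earlier;
    # the LLMRerank instance from the previous cell is named `reranker`.
    node_postprocessors=[rerank]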
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是周太太。CPU times: user 49.6 s, sys: 687 ms, total: 50.2 s
Wall time: 12.8 s


In [29]:
%%time

retriever = index.as_retriever(
    similarity_top_k=200,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
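    # Note: `rerank` is the SentenceTransformerRerank created earlier;
    # the LLMRerank instance is named `reranker`.
    node_postprocessors=[rerank]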
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是苏小姐。CPU times: user 1min 36s, sys: 1.37 s, total: 1min 38s
Wall time: 23.8 s


## Using a larger local model

### Preparation

In [30]:
%%time

# Set the default LLM

TOKEN="sk-bJP6QSnUfjAYeYeE505d3eBf63A643BeB0B8E350Df9b7750"
HOST_URL="http://ape:3000"
MODEL_NAME="qwen2-7b-6k"

llm = OpenAILike(model= MODEL_NAME, 
                 api_base= f"{HOST_URL}/v1", 
                 api_key= TOKEN,
                 is_chat_model= True,
                 temperature= 0.1
                )

Settings.llm =llm

CPU times: user 2 ms, sys: 59 μs, total: 2.06 ms
Wall time: 1.63 ms


In [31]:
%%time

# Initialize the global embedding model

from llama_index.embeddings.ollama import OllamaEmbedding

ollama_embedding = OllamaEmbedding(
    model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16:latest",
    base_url="http://ape:11435",
    ollama_additional_kwargs={"mirostat": 0}, # mirostat N: use Mirostat sampling (0 = disabled)
)

Settings.embed_model = ollama_embedding

CPU times: user 1.92 ms, sys: 0 ns, total: 1.92 ms
Wall time: 1.45 ms


### Querying with embedding retrieval only

#### top_k=10

In [32]:
%%time

retriever = index.as_retriever(
    similarity_top_k=10,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息，关于方鸿渐妻子的具体名字没有明确提到。但是可以推测，方鸿渐的妻子是与他结婚的女性角色，可能是文中多次提及的“孙小姐”或“孙家的人”，也有可能是“方老太太”所指的女儿或者儿媳。由于信息中没有直接点明，无法确定具体是谁。CPU times: user 351 ms, sys: 16.2 ms, total: 367 ms
Wall time: 4.87 s


#### top_k=20

In [33]:
%%time

retriever = index.as_retriever(
    similarity_top_k=20,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的上下文信息，关于方鸿渐妻子的具体名字并未直接提及。从文本中可以看出，方鸿渐有过与苏小姐的交往，并且在某些情况下被描述为“女朋友”，但并没有明确提到他最终娶了谁或他的妻子是谁。因此，无法确定方鸿渐的妻子是哪一位角色。CPU times: user 274 ms, sys: 23.4 ms, total: 298 ms
Wall time: 2.7 s


#### top_k=30

In [34]:
%%time

retriever = index.as_retriever(
    similarity_top_k=30,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息，方鸿渐的妻子是孙柔嘉。在多个文件中提到了方鸿渐与孙小姐的关系，包括订婚、同路来（可能指的是结婚）以及家庭聚会等场景，这表明孙柔嘉是方鸿渐的配偶。CPU times: user 252 ms, sys: 27.6 ms, total: 280 ms
Wall time: 3.08 s


#### top_k=50

In [35]:
%%time

retriever = index.as_retriever(
    similarity_top_k=50,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息，无法直接确定方鸿渐的妻子名字。但是，有一个来源提到方鸿渐的妻子名叫柔嘉。因此，在这个特定的信息中，方鸿渐的妻子是柔嘉。如果有更多上下文或具体情节提供，可能可以给出更准确的答案。CPU times: user 271 ms, sys: 19.5 ms, total: 290 ms
Wall time: 6.09 s


#### top_k=100

In [36]:
%%time

retriever = index.as_retriever(
    similarity_top_k=100,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息，无法确定地回答方鸿渐的具体妻子名字。不过，有推测指出方鸿渐的妻子可能是孙柔嘉（Jia Jia）。但需要更多具体信息或上下文来精确解答这个问题。CPU times: user 287 ms, sys: 28 ms, total: 315 ms
Wall time: 10.4 s


#### top_k=200

In [37]:
%%time

retriever = index.as_retriever(
    similarity_top_k=200,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息，可以推测方鸿渐的妻子是苏文纨。CPU times: user 289 ms, sys: 16.2 ms, total: 306 ms
Wall time: 16.8 s
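
The cells in this section differ only in `similarity_top_k`, so the experiment can also be written as a loop that records the answer and the elapsed time for each value. The harness below is a sketch: `sweep_top_k` and `fake_answer` are hypothetical names, and `fake_answer` stands in for building a query engine and running the query.

```python
import time
from typing import Callable


def sweep_top_k(make_answer: Callable[[int], str], ks: list[int]) -> list[dict]:
    """Run make_answer once per top_k value and record the answer and wall time."""
    rows = []
    for k in ks:
        start = time.perf_counter()
        answer = make_answer(k)
        rows.append({"top_k": k, "answer": answer, "seconds": time.perf_counter() - start})
    return rows


def fake_answer(k: int) -> str:
    """Hypothetical stand-in for: build retriever with top_k=k, query, return text."""
    return f"answer from {k} nodes"


rows = sweep_top_k(fake_answer, [10, 20, 30, 50, 100, 200])
for row in rows:
    print(row["top_k"], row["answer"], f'{row["seconds"]:.4f}s')
```

In the notebook, `make_answer` could wrap the same `index.as_retriever(similarity_top_k=k)` and `RetrieverQueryEngine(...)` calls used in the cells above and return `str(response)`.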


### Adding a reranker

#### bge-reranker-base

In [38]:
%%time

from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="/models/bge-reranker-base", top_n=5, 
    device='cpu',
)

CPU times: user 2.51 s, sys: 692 ms, total: 3.2 s
Wall time: 1.77 s


##### top_k=100

In [39]:
%%time

retriever = index.as_retriever(
    similarity_top_k=100,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    # Note: node_postprocessors is not set, so `rerank` is not applied in this cell.
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

根据提供的信息片段，方鸿渐的妻子是孙柔嘉。CPU times: user 364 ms, sys: 12.1 ms, total: 376 ms
Wall time: 9.14 s


##### top_k=200

In [40]:
%%time

retriever = index.as_retriever(
    similarity_top_k=200,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    # Note: node_postprocessors is not set, so `rerank` is not applied in this cell.
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是孙柔嘉。CPU times: user 314 ms, sys: 4.65 ms, total: 319 ms
Wall time: 15.8 s


#### LLM rerank

In [41]:
%%time

from llama_index.core.postprocessor import LLMRerank

reranker = LLMRerank(
            choice_batch_size=5,
            top_n=5,
)

CPU times: user 109 μs, sys: 3 μs, total: 112 μs
Wall time: 115 μs


##### top_k=100

In [45]:
%%time

retriever = index.as_retriever(
    similarity_top_k=100,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
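    # Note: `rerank` is the SentenceTransformerRerank created earlier;
    # the LLMRerank instance from the previous cell is named `reranker`.
    node_postprocessors=[rerank]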
)

streaming_response = query_engine.query("方鸿渐的妻子是谁")
streaming_response.print_response_stream()

方鸿渐的妻子是孙家的人。CPU times: user 47.9 s, sys: 462 ms, total: 48.3 s
Wall time: 11.8 s
