# Knowledge bases (KBs) built on 围城 (Fortress Besieged)

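Conclusions:

- A novel like 围城 is not well suited to BM25 + embedding retrieval.
- BM25 is fast: fetching 100 results from a novel of about 200,000 characters takes a little over 1 second.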

## Generate the data

In [1]:
%%time

from llama_index.core import SimpleDirectoryReader

documents=SimpleDirectoryReader(input_files=["books/围城.txt"]).load_data()
len(documents)

CPU times: user 2.69 s, sys: 434 ms, total: 3.12 s
Wall time: 2.76 s


1

## Configure LlamaIndex global settings

In [2]:
%%time

# Load the LLM and embeddings
%run ../../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm(model="qwen2-7b-6k")

# Settings.embed_model = get_embedding(model_name="quentinz/bge-large-zh-v1.5")
# embedding_dimension=1024

Settings.embed_model = get_embedding(model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16")
embedding_dimension=1536

persist_dir="/tmp/weicheng-kbs"
!rm -rf $persist_dir

CPU times: user 673 ms, sys: 44.3 ms, total: 717 ms
Wall time: 827 ms


## Embedding-based retrieval and querying

### Set up the vector store - Qdrant

In [3]:
%%time

from qdrant_client import QdrantClient
from qdrant_client import models

client = QdrantClient(":memory:")
collection_name="weicheng"

if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=embedding_dimension,
            distance=models.Distance.COSINE,
        ),
    )

CPU times: user 778 ms, sys: 55.8 ms, total: 834 ms
Wall time: 835 ms
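
The collection above is configured with `Distance.COSINE`, so nearest neighbours are ranked by cosine similarity. As a rough illustration of what that ranking means, here is a minimal pure-Python sketch. The helper names `cosine_similarity` and `top_k` and the toy 2-dimensional vectors are assumptions made for this example; Qdrant's own implementation is far more optimized.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k stored vectors most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy data (hypothetical): three stored vectors and one query vector.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_hits = top_k([1.0, 0.1], toy_vectors, k=2)
print(toy_hits)
```

Because cosine similarity ignores vector length, only the direction of an embedding affects its rank.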


### Create the vector store index

In [4]:
%%time

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from qdrant_client import models

vector_store = QdrantVectorStore(
    client=client, 
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=embedding_dimension,
        distance=models.Distance.COSINE,
    ),
)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

parser = SentenceSplitter(
    chunk_size=128,
    chunk_overlap=10
)
nodes = parser.get_nodes_from_documents(documents)

len(nodes)

CPU times: user 804 ms, sys: 28 ms, total: 832 ms
Wall time: 832 ms


3006
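
To make the `chunk_size=128, chunk_overlap=10` settings concrete, here is a minimal sketch of a sliding window with overlap. It is a simplification: `SentenceSplitter` measures size in tokens and prefers to break at sentence boundaries, whereas the hypothetical `sliding_chunks` helper below simply slices a token list.

```python
from typing import List, Sequence


def sliding_chunks(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Toy input (hypothetical): ten tokens, windows of 4 with 1 token of overlap.
demo_tokens = [str(i) for i in range(10)]
demo_chunks = sliding_chunks(demo_tokens, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Small chunks with a small overlap give many nodes (3006 here), each carrying little context, which matters when the answer to a question is spread across distant passages.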

In [5]:
%%time

index = VectorStoreIndex(nodes, storage_context=storage_context)

CPU times: user 17.5 s, sys: 389 ms, total: 17.9 s
Wall time: 5min 18s


### Persist the index data

In [6]:
%%time

index.storage_context.persist(persist_dir)

CPU times: user 2.2 ms, sys: 0 ns, total: 2.2 ms
Wall time: 1.82 ms


### Load the index data

In [7]:
%%time

from llama_index.core import load_index_from_storage

storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    persist_dir=persist_dir
)
index = load_index_from_storage(storage_context)

CPU times: user 2.6 ms, sys: 114 µs, total: 2.71 ms
Wall time: 2.37 ms


### Retrieving with embeddings

In [8]:
%%time

retriever = index.as_retriever()
# Query: "Who is 方鸿渐's wife?" (the default similarity_top_k is 2)
nodes = retriever.retrieve("方鸿渐的老婆是谁")

nodes

CPU times: user 25.3 ms, sys: 23 ms, total: 48.2 ms
Wall time: 3.05 s


[NodeWithScore(node=TextNode(id_='81a490e5-8426-4ee4-b401-28b89480985e', embedding=None, metadata={'file_path': 'books/围城.txt', 'file_name': '围城.txt', 'file_type': 'text/plain', 'file_size': 644668, 'creation_date': '2024-07-17', 'last_modified_date': '2024-07-17'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='92de8610-e422-49f3-80e5-a9fd639e63fc', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'books/围城.txt', 'file_name': '围城.txt', 'file_type': 'text/plain', 'file_size': 644668, 'creation_date': '2024-07-17', 'last_modified_date': '2024-07-17'}, hash='46469e28bdc41ad497aaa3bb3630546060a20d24186daac164cb4c84ddf69d36'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='a106345f-c611

### Querying with embedding retrieval

In [9]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = index.as_retriever(
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

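Based on the information provided, it is not possible to determine who 方鸿渐's wife is. The text mentions "苏小姐" (Miss 苏), but does not explicitly state that she is 方鸿渐's spouse.
CPU times: user 218 ms, sys: 124 ms, total: 342 ms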
Wall time: 3.88 s


## BM25-based retrieval and querying

### Retrieving with BM25

In [10]:
%%time

# Download the NLTK stopword list

# Set the HTTP proxy used for the download
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(http_proxy)
nltk.download('stopwords')

CPU times: user 6.75 ms, sys: 4.66 ms, total: 11.4 ms
Wall time: 389 ms


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
%%time

import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))

CPU times: user 57.1 ms, sys: 27.4 ms, total: 84.6 ms
Wall time: 84.1 ms
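
For readers unfamiliar with how BM25 turns tokens into a ranking, the sketch below scores tokenized documents with the common Okapi BM25 formula. It is an illustration under assumptions, not the code that `BM25Retriever` runs: the `bm25_scores` helper, the `k1` and `b` defaults, and the toy English token lists are all made up for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical), already tokenized.
toy_docs = [
    ["cat", "sits", "on", "mat"],
    ["the", "owner", "of", "cat", "writes", "a", "letter"],
    ["a", "ship", "sails", "to", "port"],
]
toy_scores = bm25_scores(["cat", "owner"], toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best, toy_scores)
```

The key property is that a document scores only on tokens it shares with the query, so the choice of tokenizer (here `jieba`) directly shapes what BM25 can match.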


In [12]:
%%time

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.storage.docstore import SimpleDocumentStore

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
docstore = SimpleDocumentStore()
docstore.add_documents(splitter.get_nodes_from_documents(documents))

retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=2,
    tokenizer=chinese_tokenizer,
)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.399 seconds.
Prefix dict has been built successfully.


CPU times: user 2.42 s, sys: 11.4 ms, total: 2.43 s
Wall time: 2.43 s


In [13]:
%%time

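# Note: this reassigns `retriever` to the index's default embedding retriever, so the
# BM25Retriever built in In [12] is not the one queried here or in In [14].
retriever = index.as_retriever()
nodes = retriever.retrieve("方鸿渐的老婆是谁")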

nodes

CPU times: user 27.5 ms, sys: 22.8 ms, total: 50.3 ms
Wall time: 142 ms


[NodeWithScore(node=TextNode(id_='81a490e5-8426-4ee4-b401-28b89480985e', embedding=None, metadata={'file_path': 'books/围城.txt', 'file_name': '围城.txt', 'file_type': 'text/plain', 'file_size': 644668, 'creation_date': '2024-07-17', 'last_modified_date': '2024-07-17'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='92de8610-e422-49f3-80e5-a9fd639e63fc', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'books/围城.txt', 'file_name': '围城.txt', 'file_type': 'text/plain', 'file_size': 644668, 'creation_date': '2024-07-17', 'last_modified_date': '2024-07-17'}, hash='46469e28bdc41ad497aaa3bb3630546060a20d24186daac164cb4c84ddf69d36'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='a106345f-c611

### Querying with BM25 retrieval

#### top_k=2 does not work

In [14]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

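Based on the information provided, it is not possible to determine who 方鸿渐's wife is. The supplied text fragments do not mention the identity or name of 方鸿渐's wife.
CPU times: user 91.3 ms, sys: 163 ms, total: 255 ms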
Wall time: 954 ms


#### top_k=200 answers correctly

In [15]:
%%time

retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=200,
    tokenizer=chinese_tokenizer,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

Based on the information provided, 方鸿渐's wife can be identified as 柔嘉.
CPU times: user 1.11 s, sys: 0 ns, total: 1.11 s
Wall time: 20.8 s


#### Using a reranker

In [16]:
%%time

from llama_index.core.postprocessor import SentenceTransformerRerank

reranker=SentenceTransformerRerank(
    model='/models/bge-reranker-v2-m3',
    top_n=10
)

CPU times: user 1.56 s, sys: 856 ms, total: 2.41 s
Wall time: 1.65 s
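
The pattern used in the next cells is "retrieve wide, then rerank narrow": fetch many candidates cheaply, rescore each one against the query with a stronger model, and keep only the best `top_n`. The sketch below shows that control flow only. The `rerank` helper and the word-overlap `overlap_score` function are hypothetical stand-ins, and the real reranker scores query-passage pairs with a neural model rather than word overlap.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Rescore every candidate against the query and keep the top_n best."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in the text."""
    query_words = set(query.split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.split())) / len(query_words)


# Toy candidates (hypothetical), as if returned by a first-stage retriever.
toy_candidates = ["red apple pie", "green pear", "apple and pear tart"]
toy_top = rerank("apple pear", toy_candidates, overlap_score, top_n=2)
print(toy_top)
```

With `top_n=10`, everything outside the reranker's ten best passages is discarded before the LLM sees it, so the final answer depends heavily on how well the reranker scores the passages that actually contain the answer.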


In [17]:
%%time

retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=500,
    tokenizer=chinese_tokenizer,
)

CPU times: user 1 s, sys: 16.5 ms, total: 1.02 s
Wall time: 1.02 s


In [18]:
%%time

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[reranker]
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

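Based on the information provided, it is not possible to determine directly who 方鸿渐's wife is. The information mentions several female characters connected with 方鸿渐, including 苏小姐, 唐小姐, and 汪太太, but does not state which of them is his wife. Answering this question with certainty would require more specific information about 方鸿渐's marital status.
CPU times: user 3.89 s, sys: 125 ms, total: 4.01 s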
Wall time: 6.01 s


## Hybrid retrieval and querying

### Hybrid retrieval

In [36]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever

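retriever = QueryFusionRetriever(
    [
        index.as_retriever(similarity_top_k=5),
        BM25Retriever.from_defaults(
            docstore=docstore,
            similarity_top_k=5,
            tokenizer=chinese_tokenizer,
        ),
    ],
    num_queries=1,  # use the original query only; do not generate extra queries
    # use_async=True sends queries through the vector store's async path. The
    # QdrantVectorStore above was created with a sync `client` only, which appears
    # to be why retrieval fails with the AttributeError recorded in the next cell;
    # use_async=False would keep everything on the sync client.
    use_async=True,
)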

CPU times: user 959 ms, sys: 0 ns, total: 959 ms
Wall time: 958 ms
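
Hybrid retrieval has to merge two ranked lists whose scores are not directly comparable (cosine similarity versus BM25). One widely used way to do that is reciprocal rank fusion (RRF), sketched below. This is an illustration of the general idea, not necessarily what `QueryFusionRetriever` does with the arguments above; the `reciprocal_rank_fusion` helper, the constant `k=60`, and the node ids are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists; each appearance adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings (hypothetical node ids), best first.
vector_ranking = ["n3", "n1", "n7"]
bm25_ranking = ["n1", "n9", "n3"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Because RRF uses only ranks, a node that both retrievers place near the top rises above one that a single retriever ranks first.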


In [37]:
%%time

# retriever = index.as_retriever(
#     similarity_top_k=5,
# )
nodes = retriever.retrieve("方鸿渐的老婆是谁")

nodes

AttributeError: 'NoneType' object has no attribute 'search'

In [27]:
len(nodes)

5

### Querying with hybrid retrieval

In [28]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

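Based on the information provided, it is not possible to determine who 方鸿渐's wife is. The text mentions "苏文纨小姐" (Miss 苏文纨), but does not explicitly say that she is 方鸿渐's spouse. The wife's identity therefore cannot be inferred from this information alone.
CPU times: user 138 ms, sys: 109 ms, total: 247 ms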
Wall time: 3.94 s


### Querying with hybrid retrieval - with a reranker

In [29]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
# Note: no node_postprocessors are passed here, so this run does not apply the reranker.
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

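Based on the information provided, it is not possible to determine who 方鸿渐's wife is. These passages neither mention nor imply that 方鸿渐 has a wife.
CPU times: user 136 ms, sys: 131 ms, total: 267 ms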
Wall time: 1.21 s


In [1]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [
        index.as_retriever(similarity_top_k=100),
        BM25Retriever.from_defaults( 
            docstore=docstore,
            similarity_top_k=100,
            tokenizer=chinese_tokenizer,
        ),
    ],
    num_queries=1,
    use_async=True,
)

NameError: name 'index' is not defined

In [33]:
%%time

import nest_asyncio
nest_asyncio.apply()

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[reranker]
)

# query
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

AttributeError: 'NoneType' object has no attribute 'search'