# KBs for the 围城 (Fortress Besieged) Data - Simplified Version

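Conclusions:

- qdrant provides two clients: a synchronous one and an asynchronous one
- llamaindex uses both of them
- When using qdrant, does it have to be the network client? (Would local storage also work?) The warning printed when the vector store is created says that in `:memory:` mode data is not synced between the two clients, so this notebook connects both to a server.
- The bm25+embedding combination did not find the correct answer to the test question in this full-length novel
- What worked:
    - Verified that combining bm25 and embedding retrieval is technically feasible
    - qdrant works with the async client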

## Load the data

In [1]:
%%time

from llama_index.core import SimpleDirectoryReader

documents=SimpleDirectoryReader(input_files=["books/围城.txt"]).load_data()
len(documents)

CPU times: user 2.66 s, sys: 417 ms, total: 3.08 s
Wall time: 2.73 s


1

## Configure the LlamaIndex global settings

In [2]:
%%time

# Load the LLM and embedding helpers
%run ../../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm(model="qwen2-7b-6k")

# Settings.embed_model = get_embedding(model_name="quentinz/bge-large-zh-v1.5")
# embedding_dimension=1024

Settings.embed_model = get_embedding(model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16")
embedding_dimension=1536

persist_dir="/tmp/weicheng-kbs-simple"
!rm -rf $persist_dir

CPU times: user 673 ms, sys: 40.2 ms, total: 714 ms
Wall time: 823 ms


## Set up the vector store - Qdrant

In [4]:
%%time

from qdrant_client import QdrantClient, AsyncQdrantClient
from qdrant_client import models

import nest_asyncio
nest_asyncio.apply()

# client = QdrantClient(":memory:")
# aclient = AsyncQdrantClient(":memory:")
client = QdrantClient("http://ape:6333")
aclient = AsyncQdrantClient("http://ape:6333")
collection_name="weicheng-simple"

if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=embedding_dimension,
            distance=models.Distance.COSINE,
        ),
    )

CPU times: user 100 ms, sys: 0 ns, total: 100 ms
Wall time: 210 ms
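
The collection above is created with `Distance.COSINE`. As a reminder of what that metric computes, here is a minimal pure-Python sketch. The function name `cosine_similarity` is illustrative and not part of `qdrant_client`; Qdrant computes this internally on the stored vectors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction gives 1.0, orthogonal vectors give 0.0,
# regardless of vector length.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the score ignores vector length, only the direction of the embedding matters when Qdrant ranks the chunks.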


## Create the vector store index

In [5]:
%%time

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from qdrant_client import models

vector_store = QdrantVectorStore(
    client=client, 
    aclient=aclient,
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=embedding_dimension,
        distance=models.Distance.COSINE,
    ),
)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

parser = SentenceSplitter(
    chunk_size=128,
    chunk_overlap=10
)
document_nodes = parser.get_nodes_from_documents(documents)

len(document_nodes)

Both client and aclient are provided. If using `:memory:` mode, the data between clients is not synced.


CPU times: user 803 ms, sys: 28.1 ms, total: 831 ms
Wall time: 833 ms


3006
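
The cell above splits the novel into 3006 nodes with `chunk_size=128` and `chunk_overlap=10`. To illustrate what an overlap between neighbouring chunks means, here is a deliberately simplified sketch. It cuts fixed-size character windows, which is an assumption made for brevity: `SentenceSplitter` counts tokens and prefers sentence boundaries, so its real output differs. The name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 128, chunk_overlap: int = 10) -> List[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With chunks this small, a fact such as who married whom can easily be spread over several chunks, which may be one reason the queries below struggle.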

In [6]:
%%time

index = VectorStoreIndex(document_nodes, storage_context=storage_context)

CPU times: user 17.7 s, sys: 378 ms, total: 18.1 s
Wall time: 5min 18s


## Query with embedding retrieval

In [7]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

embedding_retriever = index.as_retriever(
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=embedding_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is 方鸿渐's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

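Based on the information provided, it is not possible to determine who 方鸿渐's wife is. The text mentions "Miss 苏文纨" but does not explicitly state that she is 方鸿渐's spouse. Therefore, the identity of 方鸿渐's wife cannot be inferred from this information alone.
CPU times: user 165 ms, sys: 8.83 ms, total: 174 ms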
Wall time: 4.07 s


## Query with BM25 retrieval

In [8]:
%%time

# Download the stopwords

# Set the HTTP proxy
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(http_proxy)
nltk.download('stopwords')

CPU times: user 11.2 ms, sys: 77 µs, total: 11.3 ms
Wall time: 412 ms


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
%%time

import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))

CPU times: user 61.5 ms, sys: 13.5 ms, total: 75.1 ms
Wall time: 75.4 ms
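
The notebook downloads the NLTK stopwords and imports them, but `chinese_tokenizer` returns every token that `jieba.cut` produces. If stopword filtering is wanted, one possible shape is a small wrapper around any segmenter. This is a sketch under assumptions: `make_filtered_tokenizer` is a hypothetical helper, the two-word stopword set is a toy example, and `str.split` stands in for `jieba.cut` so the snippet runs without extra packages.

```python
from typing import Callable, Iterable, List, Set


def make_filtered_tokenizer(
    segment: Callable[[str], Iterable[str]],
    stop_words: Set[str],
) -> Callable[[str], List[str]]:
    """Wrap a segmenter so that blank tokens and stopwords are dropped."""

    def tokenize(text: str) -> List[str]:
        return [
            tok for tok in segment(text)
            if tok.strip() and tok not in stop_words
        ]

    return tokenize


# Toy setup: whitespace splitting stands in for jieba.cut,
# and the stopword set is a made-up two-word example.
tokenizer = make_filtered_tokenizer(str.split, {"的", "是"})
print(tokenizer("方鸿渐 的 老婆 是 谁"))
# ['方鸿渐', '老婆', '谁']
```

In the notebook's setting, `jieba.cut` would be passed as `segment`, and the resulting function could be handed to the retriever in place of `chinese_tokenizer`.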


In [10]:
%%time

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.storage.docstore import SimpleDocumentStore

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
docstore = SimpleDocumentStore()
docstore.add_documents(splitter.get_nodes_from_documents(documents))

bm25_retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=5,
    tokenizer=chinese_tokenizer,
)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.413 seconds.
Prefix dict has been built successfully.


CPU times: user 2.46 s, sys: 39.5 ms, total: 2.5 s
Wall time: 2.49 s


In [11]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=bm25_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is 方鸿渐's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

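Based on the information provided, it is not possible to directly determine which character 方鸿渐's wife is. The given text fragments do not explicitly mention the name of 方鸿渐's wife. More context or detail is needed to answer this question accurately.
CPU times: user 94.9 ms, sys: 8.59 ms, total: 103 ms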
Wall time: 1.44 s


## Query with hybrid retrieval

In [12]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    [
        embedding_retriever,
        bm25_retriever,
    ],
    num_queries=1,
    use_async=True,
)

CPU times: user 0 ns, sys: 3.27 ms, total: 3.27 ms
Wall time: 2.71 ms
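
The fusion retriever above is created with its default fusion mode. To make the idea of merging two ranked lists concrete, here is a sketch of reciprocal rank fusion (RRF), one common merging technique. It is shown as an illustration of the concept and is not claimed to be what `QueryFusionRetriever` does with the settings in this notebook. The function name, the constant `k=60`, and the chunk ids are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
    top_n: int = 5,
) -> List[Hashable]:
    """Merge ranked lists: each item earns 1 / (k + rank) from every list it is in."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda item: scores[item], reverse=True)
    return ordered[:top_n]


# Hypothetical chunk ids returned by the two retrievers.
embedding_hits = ["c1", "c2", "c3"]
bm25_hits = ["c2", "c4", "c1"]
print(reciprocal_rank_fusion([embedding_hits, bm25_hits]))
# ['c2', 'c1', 'c4', 'c3']
```

Items found by both retrievers rise to the top because they collect a contribution from each list, while items found by only one keep a lower score.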


In [13]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# query_engine = RetrieverQueryEngine.from_args(fusion_retriever)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=fusion_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is 方鸿渐's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

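Based on the information provided, 方鸿渐's wife is a White Russian woman in China. She married 方鸿渐 in China, and they lived together for more than twenty years, during which they had a son who has since graduated from university. Unfortunately, this wife died young.
CPU times: user 97.6 ms, sys: 13.9 ms, total: 112 ms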
Wall time: 1.52 s
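
The three answers above are judged through the LLM, which makes it hard to tell whether retrieval or synthesis failed. One way to separate the two is to check directly whether any retrieved chunk contains the expected answer string. The sketch below works on plain strings. The helper name `first_hit_rank` is hypothetical, the expected answer 孙柔嘉 (方鸿渐's wife in the novel) is not stated anywhere in this notebook, and the sample chunks are made-up placeholders rather than text from the book.

```python
from typing import Iterable, Optional


def first_hit_rank(chunks: Iterable[str], expected: str) -> Optional[int]:
    """Return the 1-based rank of the first chunk containing expected, else None."""
    for rank, chunk in enumerate(chunks, start=1):
        if expected in chunk:
            return rank
    return None


# Made-up placeholder chunks, for illustration only.
retrieved = ["……方鸿渐……苏文纨……", "……方鸿渐……孙柔嘉……", "……赵辛楣……"]
print(first_hit_rank(retrieved, "孙柔嘉"))  # 2
print(first_hit_rank(retrieved, "唐晓芙"))  # None
```

With LlamaIndex, the chunk texts would presumably come from the nodes returned by a retriever's `retrieve` call. A result of `None` for all three retrievers would point to retrieval, not the LLM, as the weak link.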
