# KBs for the *Fortress Besieged* (围城) Data - No Vector Database

Conclusions:

- A vector database is not required
- The vectors (embeddings) are stored locally
- Everything runs correctly
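
To make "no vector database" concrete, here is a minimal sketch of retrieval over embeddings held in a plain in-memory dictionary, ranked by brute-force cosine similarity. It only illustrates the idea and is not LlamaIndex's actual implementation; the names `cosine_similarity`, `top_k`, and `toy_store`, as well as the 3-dimensional vectors, are assumptions made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=5):
    """Score every stored vector against the query and keep the k best."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings (real ones here would have 1536 dimensions).
toy_store = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.0, 1.0, 0.0],
    "node-c": [0.7, 0.7, 0.0],
}

results = top_k([1.0, 0.1, 0.0], toy_store, k=2)
print([node_id for node_id, _ in results])
```

A linear scan like this costs one similarity computation per stored node, which is usually acceptable for a few thousand chunks.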

## Load the Data

In [5]:
%%time

from llama_index.core import SimpleDirectoryReader

documents=SimpleDirectoryReader(input_files=["books/围城.txt"]).load_data()
len(documents)

CPU times: user 2.02 ms, sys: 227 µs, total: 2.25 ms
Wall time: 1.87 ms


1

## Set the LlamaIndex Global Settings

In [6]:
%%time

# Load the LLM and embedding helpers
%run ../../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm(model="qwen2-7b-6k")

# Settings.embed_model = get_embedding(model_name="quentinz/bge-large-zh-v1.5")
# embedding_dimension=1024

Settings.embed_model = get_embedding(model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16")
embedding_dimension=1536

persist_dir="/tmp/weicheng-kbs-simple"
!rm -rf $persist_dir

CPU times: user 3.9 ms, sys: 8.2 ms, total: 12.1 ms
Wall time: 119 ms


## Build the Index

In [7]:
%%time

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=128,
    chunk_overlap=10
)
document_nodes = parser.get_nodes_from_documents(documents)

len(document_nodes)

CPU times: user 818 ms, sys: 22.5 ms, total: 841 ms
Wall time: 840 ms


3006
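
The effect of `chunk_size=128` and `chunk_overlap=10` can be sketched with a simple sliding window. This is a deliberate simplification: `SentenceSplitter` also tries to respect sentence boundaries and measures size with a tokenizer, so the function below (a hypothetical `chunk_with_overlap`) shows only the window-and-overlap arithmetic, not the library's algorithm.

```python
def chunk_with_overlap(tokens, chunk_size=128, chunk_overlap=10):
    """Split a token list into fixed-size windows that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Hypothetical input: 300 integer "tokens" stand in for real text tokens.
toy_tokens = list(range(300))
toy_chunks = chunk_with_overlap(toy_tokens)
print([len(chunk) for chunk in toy_chunks])
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks repeat the last 10 tokens of the chunk before them.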

In [9]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(document_nodes)

CPU times: user 25 s, sys: 205 ms, total: 25.2 s
Wall time: 5min 22s


## Persist and Load the Index

In [11]:
%%time

index.storage_context.persist('./weicheng-no-vectordb')

CPU times: user 13.4 s, sys: 117 ms, total: 13.5 s
Wall time: 13.5 s


In [20]:
%%time

from llama_index.core import (
    load_index_from_storage,
    StorageContext,
)

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./weicheng-no-vectordb"))

CPU times: user 24 s, sys: 56.6 ms, total: 24.1 s
Wall time: 24.1 s
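
Without a vector database, persisting an index essentially means serializing the embeddings to local files and reading them back. The sketch below shows that round trip with the standard `json` module. The file name `vectors.json` and the helpers `persist_vectors` and `load_vectors` are hypothetical and do not reflect the exact file layout LlamaIndex writes.

```python
import json
import os
import tempfile

def persist_vectors(store, path):
    """Write a {node_id: embedding} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(store, f)

def load_vectors(path):
    """Read the mapping back from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical toy embeddings.
toy_vectors = {"node-a": [0.1, 0.2, 0.3], "node-b": [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.json")
    persist_vectors(toy_vectors, path)
    size_bytes = os.path.getsize(path)
    restored = load_vectors(path)

print(size_bytes, restored == toy_vectors)
```

With 3006 nodes and 1536-dimensional embeddings, the store holds roughly 4.6 million floats. If they are written as text, that volume may account for a good part of the multi-second persist and load times measured above.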


## Query with Embedding Retrieval

In [21]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

embedding_retriever = index.as_retriever(
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=embedding_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is Fang Hongjian's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

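Based on the information provided, it is not possible to determine who Fang Hongjian's wife is. The text mentions "Miss Su Wenwan" (苏文纨小姐), but does not explicitly state that she is Fang Hongjian's spouse. Therefore, his wife's identity cannot be inferred from this information alone.
CPU times: user 192 ms, sys: 16.7 ms, total: 208 ms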
Wall time: 6.69 s


## Query with BM25 Retrieval

In [13]:
%%time

# Download the stopwords

# Route the NLTK download through an HTTP proxy
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(http_proxy)
nltk.download('stopwords')

CPU times: user 2.34 ms, sys: 8.62 ms, total: 11 ms
Wall time: 435 ms


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
%%time

import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))

CPU times: user 71.1 ms, sys: 8.57 ms, total: 79.7 ms
Wall time: 79.4 ms


In [15]:
%%time

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.storage.docstore import SimpleDocumentStore

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
docstore = SimpleDocumentStore()
docstore.add_documents(splitter.get_nodes_from_documents(documents))

bm25_retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=5,
    tokenizer=chinese_tokenizer,
)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.402 seconds.
Prefix dict has been built successfully.


CPU times: user 2.43 s, sys: 26.1 ms, total: 2.46 s
Wall time: 2.45 s


In [16]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=bm25_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is Fang Hongjian's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

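Based on the information provided, the identity of Fang Hongjian's wife cannot be determined directly. The text contains the description "married for more than twenty years, their one son has graduated from university, and this wife died early" (结婚二十多年，生的一个儿子都在大学毕业，这老婆早死了), but it does not give the wife's name. Therefore, we cannot conclude from this information who Fang Hongjian's wife is.
CPU times: user 118 ms, sys: 0 ns, total: 118 ms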
Wall time: 1.72 s
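
For intuition about what BM25 retrieval ranks on, here is a compact sketch of one common BM25 scoring formulation over pre-tokenized documents. The parameters `k1=1.5` and `b=0.75` are conventional defaults assumed for illustration and may differ from what `BM25Retriever` uses internally. The toy token lists are hand-written stand-ins for `jieba` output, not quotations from the novel.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical, hand-tokenized toy corpus.
toy_corpus = [
    ["方鸿渐", "回到", "上海"],
    ["苏文纨", "小姐", "写", "诗"],
    ["方鸿渐", "和", "孙柔嘉", "结婚"],
]
toy_scores = bm25_scores(["方鸿渐", "结婚"], toy_corpus)
print(toy_scores)
```

Because BM25 only matches exact tokens, the quality of the Chinese word segmentation directly affects which chunks can be found. This is why a custom `tokenizer` is passed to the retriever.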


## Query with Hybrid Retrieval

In [17]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    [
        embedding_retriever,
        bm25_retriever,
    ],
    num_queries=1,
    use_async=True,
)

CPU times: user 2.79 ms, sys: 58 µs, total: 2.85 ms
Wall time: 2.53 ms


In [18]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# query_engine = RetrieverQueryEngine.from_args(fusion_retriever)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=fusion_retriever,
    response_synthesizer=response_synthesizer,
)

# query: "Who is Fang Hongjian's wife?"
streaming_response = query_engine.query("方鸿渐的老婆是谁")
streaming_response.print_response_stream()

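Based on the information provided, Fang Hongjian's wife was a foreign wife of White Russian origin (possibly Russian or Ukrainian). She died in China, and had been married to Fang Hongjian for more than twenty years before her death. They have a son who has already graduated from university.
CPU times: user 233 ms, sys: 11.4 ms, total: 244 ms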
Wall time: 1.59 s
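
Embedding similarities and BM25 scores live on different scales, so merging the two result lists by raw score can favour one retriever. One common alternative is reciprocal rank fusion (RRF), which combines results using only their ranks. The sketch below is a standalone illustration with made-up node ids. It does not describe what the `QueryFusionRetriever` call above does, and the constant `k=60` is a conventional choice assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            fused[node_id] = fused.get(node_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
    return ordered[:top_n]

# Hypothetical ranked results from the two retrievers (best first).
embedding_ranking = ["n3", "n1", "n7"]
bm25_ranking = ["n1", "n9", "n3"]

fused_results = reciprocal_rank_fusion([embedding_ranking, bm25_ranking], top_n=3)
print([node_id for node_id, _ in fused_results])
```

Nodes returned by both retrievers accumulate two contributions and rise to the top, regardless of how large their original scores were. If your LlamaIndex version offers a rank-based fusion mode for `QueryFusionRetriever`, that would be the built-in route; check its documentation before relying on it.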
