# Simple KBs

目标：

- 实现一个混合检索的示例，基于北京景点数据
- 完整的过程，包括自动评估效果


问题：

- bm25在当前数据下效果不好，检索召回率是0
- 可能需要找更合适的场景

## 生成数据

In [1]:
%%time

!rm -rf data && mkdir -p data

items=[
    "颐和园",
    "恭王府",
    "国家博物馆",
    "八达岭长城",
    "故宫",
    "北海公园",
    "景山公园",
    "天坛公园",
]

CPU times: user 9.02 ms, sys: 4.12 ms, total: 13.1 ms
Wall time: 113 ms


In [2]:
%%time
%%capture

!pip install trafilatura

CPU times: user 12.6 ms, sys: 62 µs, total: 12.7 ms
Wall time: 3.39 s


In [3]:
%%time

from llama_index.readers.web import TrafilaturaWebReader

documents = TrafilaturaWebReader().load_data(
    [ f"https://baike.baidu.com/item/{item}" for item in items]
)

len(documents)

CPU times: user 4.04 s, sys: 421 ms, total: 4.46 s
Wall time: 4.65 s


8

In [4]:
%%time

import os

documents_data=[[items[index], documents[index].text] for index,item in enumerate(items)]

for data in documents_data:
    with open(os.path.join('./data', f"{data[0]}.txt"), "w", encoding="utf-8") as file:
        file.write(data[1])
    

CPU times: user 2.01 ms, sys: 67 µs, total: 2.08 ms
Wall time: 1.79 ms


In [5]:
!ls ./data -hl

total 304K
-rw-r--r-- 1 root root 37K Jul 17 18:36 八达岭长城.txt
-rw-r--r-- 1 root root 14K Jul 17 18:36 北海公园.txt
-rw-r--r-- 1 root root 80K Jul 17 18:36 国家博物馆.txt
-rw-r--r-- 1 root root 29K Jul 17 18:36 天坛公园.txt
-rw-r--r-- 1 root root 26K Jul 17 18:36 恭王府.txt
-rw-r--r-- 1 root root 52K Jul 17 18:36 故宫.txt
-rw-r--r-- 1 root root 22K Jul 17 18:36 景山公园.txt
-rw-r--r-- 1 root root 31K Jul 17 18:36 颐和园.txt


## 设置 LlamaIndex 全局变量

In [6]:
%%time

# 加载llm和embeddings
%run ../../utils2.py

from llama_index.core import Settings

Settings.llm=get_llm(model="qwen2-7b-6k")

# Settings.embed_model = get_embedding(model_name="quentinz/bge-large-zh-v1.5")
# embedding_dimension=1024

Settings.embed_model = get_embedding(model_name="rjmalagon/gte-qwen2-1.5b-instruct-embed-f16")
embedding_dimension=1536

persist_dir="/tmp/simple-kbs"
!rm -rf $persist_dir

CPU times: user 668 ms, sys: 79.7 ms, total: 748 ms
Wall time: 857 ms


## 基于嵌入的检索查询

### 设置向量存储 - Qdrant

In [7]:
%%time

from qdrant_client import QdrantClient
from qdrant_client import models

client = QdrantClient(":memory:")
collection_name="attractions"

if not client.collection_exists(collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=embedding_dimension,
            distance=models.Distance.COSINE,
        ),
    )

CPU times: user 831 ms, sys: 35.3 ms, total: 866 ms
Wall time: 866 ms


### 创建向量存储索引

In [8]:
%%time

from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from qdrant_client import models

vector_store = QdrantVectorStore(
    client=client, 
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=embedding_dimension,
        distance=models.Distance.COSINE,
    ),
)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

parser = SentenceSplitter(
    chunk_size=128,
    chunk_overlap=10
)
nodes = parser.get_nodes_from_documents(documents)

len(nodes)

CPU times: user 495 ms, sys: 19.8 ms, total: 515 ms
Wall time: 514 ms


1309

In [9]:
%%time

index = VectorStoreIndex(nodes, storage_context=storage_context)

CPU times: user 7.55 s, sys: 194 ms, total: 7.75 s
Wall time: 2min 22s


### 保存索引数据

In [10]:
%%time

index.storage_context.persist(persist_dir)

CPU times: user 1.71 ms, sys: 0 ns, total: 1.71 ms
Wall time: 1.59 ms


### 加载索引数据

In [11]:
%%time

from llama_index.core import load_index_from_storage

storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    persist_dir=persist_dir
)
index = load_index_from_storage(storage_context)

CPU times: user 2.08 ms, sys: 0 ns, total: 2.08 ms
Wall time: 1.8 ms


### 使用嵌入检索

In [12]:
%%time

retriever = index.as_retriever()
nodes = retriever.retrieve("颐和园门票多少钱")

nodes

CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 30.6 ms


[NodeWithScore(node=TextNode(id_='d70c925e-6d6a-481f-bbfe-d03d1d3b72e4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://baike.baidu.com/item/八达岭长城', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='6094cad509ca3fc0e41e11bb8f15093212609bf61c5ba4c8bd5a99dc37df17d3'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='f9007a2d-cc1f-437c-9251-5de91b27899e', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='f7e68e475bf69ac681f99c22c1906ea43568c9f05d6ac02636d89d6a7b742c57'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='40b45cab-205b-4a5d-ab99-048d33826219', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='194bd5a3abb3dca61735dfe8ffc435a99bd5a3823e8c08d4ee1989f94ae65e14')}, text='如果当停车辆接近这个数量时会在距景区2、3公里以外将车辆分流到附近的野生动物园外和岔道村西的停车场里。为了方便游客，分流后景区将设立摆渡车，免费将游客拉到长城登城口附近。 [27]\n旺季门票执行时间：每年4月1日——10月31日\n长城普通成人票', start_char_idx=12513, end_char_i

### 基于嵌入检索的查询

In [13]:
%%time

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = index.as_retriever(
    similarity_top_k=5,
)

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("颐和园门票多少钱")
streaming_response.print_response_stream()

颐和园的门票价格为：
- 旺季：60元/张
- 淡季：50元/张

同时，颐和园提供半价票。CPU times: user 204 ms, sys: 286 ms, total: 490 ms
Wall time: 4.02 s


## 基于 BM25 的检索查询

### 使用 BM25 检索

In [15]:
%%time

# 下载停用词

# 设置 HTTP 代理环境变量
# https://github.com/nltk/nltk_data/issues/154#issuecomment-2144880495
http_proxy="http://192.168.0.134:7890"

import nltk
nltk.set_proxy(f'{http_proxy}')
nltk.download('stopwords')

CPU times: user 2.02 ms, sys: 8.39 ms, total: 10.4 ms
Wall time: 412 ms


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
%%time

import jieba
from typing import List
from nltk.corpus import stopwords

def chinese_tokenizer(text: str) -> List[str]:
    # Use jieba to segment Chinese text
    return list(jieba.cut(text))

CPU times: user 71.3 ms, sys: 8.25 ms, total: 79.6 ms
Wall time: 79.8 ms


In [27]:
%%time

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.storage.docstore import SimpleDocumentStore

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
docstore = SimpleDocumentStore()
docstore.add_documents(splitter.get_nodes_from_documents(documents))

retriever=BM25Retriever.from_defaults(
    docstore=docstore, 
    similarity_top_k=2,
    tokenizer=chinese_tokenizer,
)

CPU times: user 667 ms, sys: 0 ns, total: 667 ms
Wall time: 666 ms


In [28]:
%%time

retriever = index.as_retriever()
nodes = retriever.retrieve("颐和园门票多少钱")

nodes

CPU times: user 18.3 ms, sys: 0 ns, total: 18.3 ms
Wall time: 121 ms


[NodeWithScore(node=TextNode(id_='d70c925e-6d6a-481f-bbfe-d03d1d3b72e4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://baike.baidu.com/item/八达岭长城', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='6094cad509ca3fc0e41e11bb8f15093212609bf61c5ba4c8bd5a99dc37df17d3'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='f9007a2d-cc1f-437c-9251-5de91b27899e', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='f7e68e475bf69ac681f99c22c1906ea43568c9f05d6ac02636d89d6a7b742c57'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='40b45cab-205b-4a5d-ab99-048d33826219', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='194bd5a3abb3dca61735dfe8ffc435a99bd5a3823e8c08d4ee1989f94ae65e14')}, text='如果当停车辆接近这个数量时会在距景区2、3公里以外将车辆分流到附近的野生动物园外和岔道村西的停车场里。为了方便游客，分流后景区将设立摆渡车，免费将游客拉到长城登城口附近。 [27]\n旺季门票执行时间：每年4月1日——10月31日\n长城普通成人票', start_char_idx=12513, end_char_i

### 使用 BM25 检索的查询

In [29]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("颐和园门票多少钱")
streaming_response.print_response_stream()

The provided context does not mention the price of tickets for the Summer Palace (颐和园). Please provide more specific details or check the current pricing information for the Summer Palace.CPU times: user 70.9 ms, sys: 120 ms, total: 191 ms
Wall time: 1.2 s


## 基于混合检索的查询

### 混合检索

In [36]:
%%time

from llama_index.core.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [
        index.as_retriever(similarity_top_k=5),
        BM25Retriever.from_defaults( 
            docstore=docstore,
            similarity_top_k=5,
            tokenizer=chinese_tokenizer,
        ),
    ],
    num_queries=1,
    use_async=True,
)

CPU times: user 362 ms, sys: 0 ns, total: 362 ms
Wall time: 361 ms


In [37]:
%%time

retriever = index.as_retriever(
    similarity_top_k=5,
)
nodes = retriever.retrieve("颐和园门票多少钱")

nodes

CPU times: user 18.7 ms, sys: 622 µs, total: 19.3 ms
Wall time: 128 ms


[NodeWithScore(node=TextNode(id_='d70c925e-6d6a-481f-bbfe-d03d1d3b72e4', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://baike.baidu.com/item/八达岭长城', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='6094cad509ca3fc0e41e11bb8f15093212609bf61c5ba4c8bd5a99dc37df17d3'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='f9007a2d-cc1f-437c-9251-5de91b27899e', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='f7e68e475bf69ac681f99c22c1906ea43568c9f05d6ac02636d89d6a7b742c57'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='40b45cab-205b-4a5d-ab99-048d33826219', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='194bd5a3abb3dca61735dfe8ffc435a99bd5a3823e8c08d4ee1989f94ae65e14')}, text='如果当停车辆接近这个数量时会在距景区2、3公里以外将车辆分流到附近的野生动物园外和岔道村西的停车场里。为了方便游客，分流后景区将设立摆渡车，免费将游客拉到长城登城口附近。 [27]\n旺季门票执行时间：每年4月1日——10月31日\n长城普通成人票', start_char_idx=12513, end_char_i

In [38]:
len(nodes)

5

### 基于混合检索的查询

In [39]:
%%time

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    streaming=True,
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# query
streaming_response = query_engine.query("颐和园门票多少钱")
streaming_response.print_response_stream()

颐和园的门票价格为：
- 旺季：60元/张
- 淡季：50元/张

同时，颐和园提供半价票。CPU times: user 75.7 ms, sys: 137 ms, total: 213 ms
Wall time: 3.93 s
